SEQUEST® release 2012.01

    release 2012.01 rev 6 (2012.01.6), release date 2012/10/19
  • With no variable mods are specified for a search, SEQUEST would write out randomly write out a terminal modification character which has been corrected.
    release 2012.01 rev 5 (2012.01.5), release date 2012/10/12
  • Fixes errors with internal decoy searches. The modification strings for decoy peptides had errors in a few places resulting in seg faults which were corrected.
    release 2012.01 rev 4 (2012.01.4), release date 2012/08/15
  • Fixes stop codon character '*' support that went missing in the big 2012.01 update. These characters were simply ignored and the underlying sequence was concatenated before this fix. Now these are properly treated as stop characters and when present act as separators between candidate peptides.
    release 2012.01 rev 3 (2012.01.3), release date 2012/07/13
  • The "activation_method" parameter was not working; error traced back to an MSToolkit update. The problem has been fixed for mzXML files; this parameter doesn't look to be supported well for mzML files.
    release 2012.01 rev 2 (2012.01.2), release date 2012/06/20
  • Fix another segfault that showed up in the internal decoy searches. Not enough space was allocated to store the N/C-term modifications for both target and decoy peptides. Segfault manifests itself when a large peptide is analyzed in an internal decoy search.
  • Address user definable amino acids J and U possibly not being assigned masses but appearing in the sequence database. If a database contains either a J or U but the user does not assign a mass to these residues, a segfault could occur.
    release 2012.01 rev 1 (2012.01.1), release date 2012/05/16
  • Fixes E-value issue with the 2012.01.0 release. In that version, the xcorr histogram was erroneously only tracking scores that scored high enough to be stored during the search. This release fixes the problem where the xcorr for all scored peptides contribute to the E-value calculation. Pertinent only when "print_expect_score = 1".
  • The E-value calculation no longer uses the non-linear (quadratic) curve fitting with the lmfit code. E-values are calculated from the cummulative distribution of the xcorr's using a linear regression of their log transform. This is similar to how version 2011.01 was calculating E-values but we're now using the cummulative distribution instead of actual xcorr distribution (for proper definition of E-value as number of expected random hits at or above score cutoff versus number of expected random hits at score cutoff).
  • Fix segfaults with the decoy searches in 2012.01.0. Decoy peptides were not represented correctly internal to the program which would cause the segfault.
  • Issue with xcorr decoys (for minimum # xcorr scores to calculate E-value histograms) where the calculation would fail for specific ion series (such as a-ions) under certain conditions. The xcorr decoy generation program could generate negative fragment ion masses which would cause the program to terminate with an error message.

For those of you not using the E-value scores or performing internal decoy searches, you would not be impacted by the issues that this minor release addresses.

Here are runtimes comparing 2011.01.1 vs. 2012.01.1. The numbers are pretty close to same benchmarks using 2012.01.0 with a few small improvements.

   dbase   enzyme         variable mods    2011.01.1   2012.01.1     foldΔ
   ------------------------------------------------------------------------
                                            HH:MM:SS    HH:MM:SS
   yeast   full tryptic   16M                   2:27        0:10     14.7x
   yeast   full tryptic   16M, 80STY            6:05        0:55      6.6x

   yeast   semi tryptic   16M                   9:27        1:34      6.0x
   yeast   semi tryptic   16M, 80STY         6:26:47       18:35     20.8x

   yeast   no enzyme      16M                  24:07       15:15      1.6x

   human   full tryptic   16M                  26:26        1:07     23.7x (26 min to 1 min)
   human   full tryptic   16M, 80STY           57:37        7:40      7.5x

   human   semi tryptic   16M                1:49:14       17:23      6.3x
   human   semi tryptic   16M, 80STY        57:54:17     2:25:22     23.9x (2+ days to 2.5 hrs)

   human   no enzyme      16M                4:39:59     2:53:39      1.6x

Based on the number of PSMs at 1% FDR using E-value scores, the 2012.01.1 release does a little better than the 2012.01.0 release but still lags the 2011.01.1 performance for the human semi- and no-enzyme searches. Otherwise it is pretty equivalent or better for the other searches listed below. Chart below replicates the TPP analysis chart further down on this page but for just the E-value analysis showing update numbers. Numbers compare UW SEQUEST versions 2012.01.1 versus 2011.01.1.

    release 2012.01 rev 0 (2012.01.0), release date 2012/04/01
  • This is a major release with significant changes under the hood. The program has been rewritten to search all spectra in one pass through the sequence database, allowing reduction of previously redundant calculations, leading to the significant speed improvements as noted below.
  • Retired "decoy_mode" parameter option. Done to simplify decoy searches. Internal decoys will aways freeze N- or C-term residue and reverse everything else (ala what's been coined pseudo-reverse peptide as done by Sage-N's Sorcerer).
  • Retired "match_peak_count", "match_peak_allowed_error", and "match_peak_tolerance" parameters. These parameters controlled forcing candidate peptides to match the N most intense peaks in a spectrum and were rarely used in practice. Removed as these parameters were calculated and applied during the now defunct preliminary score step.
  • Retired "print_duplicate_references" parameter. Due to the way searches are now performed in batch mode, storing a large number of duplicate protein references is not feasible. A single matched protein entry is listed in the output along with a count of additional matched proteins. A secondary tool is required to find all match proteins for any particular peptide sequence which is something the TPP and MSDaPl already perform.
  • Retired "num_description_lines" parameter.
  • Retired "ion_cutoff_percentage" as this was calculated during the preliminary score stage.
  • Retired sqlite3 input format.
  • Changed behavior of "output_format" parameter to allow writing SQT output to a file. Still have option to write SQT output to standard out.
  • Retired 'export intensities' option in "show_fragment_ions" parameter.
  • Maximum peptide sequence length is now 64 (down from 144).
  • Modified "theoretical_fragment_ions" parameter to allow only default peak shape (1/2 height flanking peaks) or simply use mass peak alone. The M + 13C option has been retired.
  • Retired "xcorr_only" parameter. This version performs the cross correlation score as the primary score function. There is no option to do the classical Sp preliminary scoring against all peptides anymore. The Sp score is calculated for N results where N is set by the "num_results" parameter.
  • The "ion_series" parameter is changed. Previously the first 3 digits in the parameter were treated as boolean values controlling the use of neutral loss on a-ions, b-ions and y-ions. The structure of this parameter has not changed to maintain backwards compatibility. However, the first two digits are no longer used and only the third digit controls whether neutral loss ions are applied to b-ions and y-ions.
  • Adds a mass of 9000.0 to 'X' amino acids in the sequest.params file that is written out using the "-p" command line option. Effectively this skips analyzing peptides that contain this "any" amino acid as opposed to treating them as L/I in the past. Any peptide containing an 'X' is ignored by the default TPP analysis anyways. Simply remove this mass addition if you want to revert this behavior.
  • The E-value distributions are much different now compared to previous versions because there are so many more xcorr datapoints. So the E-values are now calculated based on the log transform of the cummulative distribution of xcorr scores. Non-linear regression is used based on Levenberg-Marquardt (LMA) least-squares minimization curve fitting of the log transform of a cummulative distribution of the xcorrs. As noted below, this needs to be revisited as I'm seeing upwards of 10% less IDs at a given FDR compared to previous version when using E-value scoring.

NOTE: you must use a new sequest.params file associated with this version due to retirement of parameters and the change in behavior for the "output_format" parameter. A new parameter file can be generated using the "-p" command line option.

On the UWPR system, this executable is named sequest.2012010 which is invoked using runSequestQ.2012010. The plain runSequestQ is always symlinked to the most current version of the tool.



The previous version loaded the database in memory and searched each spectrum individually in each thread. (Actually all spectra were pre-loaded into memory also in previous version.) This version loads all spectra into memory and sequences are loaded from the database on demand. So the memory requirements are different. You need enough memory to store all MS/MS spectra and N sequences in memory (where N=# threads). Memory requirements are also a function of the bin size defined by the "fragment_ion_tolerance" parameter. Searches may have to be split up to address memory limitations. Here are examples of the memory requirements:

   bin_size  RES    #spectra   RES = code + data (actual physical memory being consumed)
   ------------------------- 
   0.36      9.7g   72,773
   0.36      3.4g   23,759
   0.36      1.7g   11,811
   0.36      724m    4,515

   0.10      *      72,773    *  memory limited on 16GB machine;
   0.10      5.5g   23,759       searches must be divided up to run
   0.10      2.9g   11,811
   0.10      1.4g    4,515

   0.01      *      72,773
   0.01      *      23,759
   0.01      *      11,811
   0.01      9.6g    4,515



Here are examples of runtimes (4,515 spectra, 0.36 bin size, +- 3.0 Da tol, 8 threads, Intel Xeon E5345 2.33GHz cpu, top 50 peptides stored):

   dbase   enzyme         variable mods    2011.01.1   2012.01.0     foldΔ
   ------------------------------------------------------------------------
                                          HH:MM:SS    HH:MM:SS
   yeast   full tryptic   16M                 2:27        0:10       14.7x
   yeast   full tryptic   16M, 80STY          6:05        0:58        6.3x

   yeast   semi tryptic   16M                 9:27        1:32        6.2x
   yeast   semi tryptic   16M, 80STY       6:26:47       19:42       19.6x (6 hrs to 20 min)

   yeast   no enzyme      16M                24:07       14:43        1.6x

   human   full tryptic   16M                26:26        1:07       23.7x (26 min to 1 min)
   human   full tryptic   16M, 80STY         57:37        7:50        7.4x

   human   semi tryptic   16M              1:49:14       16:38        6.6x
   human   semi tryptic   16M, 80STY      57:54:17     2:32:15       22.8x (2+ days to 2.5 hrs)

   human   no enzyme      16M              4:39:59     2:51:19        1.6x



Since the underlying algorithm has changed fairly significantly (replacing Sp with xcorr as the primary score function which was optional previously but mandatory now), here's a comparison of results for this UWPR2012010 version versus the previous UWPR2011010 version on datasets analyzed in various ways. For the first few datasets, the files were processed through the Percolator algorithm using separate target/decoy database searches. The remaining datasets were processed through the TPP using concatenated target/decoy database searches.

Percolator: number of unique peptide IDs at 1% q-value FDR cutoff. All samples are C. elegans searched through the hermie pipeline in the MacCoss lab where the FT and Orbi runs were also processed through Bullseye. Percolator version 2.01.

TPP: reported ID counts at 1% error rate either by PeptideProphet or ProteinProphet (TPP 4.5.2, accurate mass option) and by FDR analysis based on SEQUEST E-values:



Here are some E-value distribution plots to show you how well the LMA algorithm (red line) fits the empirical xcorr data (blue dots). Looking at the analysis above, this new release does not perform as well as the previous release when comparing E-value based FDR analysis. Whether this is inherent in the change in scoring or due to lack of optimizations for this calculation, the answer is unknown and this will have to be revisited.

   


   



SEQUEST 2012.01.0 makes use of the following software/libraries:
  • Mike Hoopmann's mstoolkit for reading input data. mstoolkit is distributed under the BSD license.
  • lmfit by Joachim Wuttke - a C/C++ routine for Levenberg-Marquardt minimization with wrapper for least-squares curve fitting, based on work by B. S. Garbow, K. E. Hillstrom, J. J. MorĂ©, and S. Moshier. Version 3.2. Public domain software.
  • sha1 code by Paul E. Jones and distributed under a Freeware Public License.
  • twiddle code by Matthew Belmonte. Public domain software.


UW SEQUEST version 2012.01.0 parameters.
These parameter entries only apply to this UW academic version and should not be used with any other versions of SEQUEST (such as from Thermo, Scripps, or Sage-N).

# comment lines begin with a '#' in the first position

[SEQUEST]
database_name = /some/path/yeast.fasta
decoy_search = 0                       ; 0=no (default), 1=concatenated search, 2=separate search

num_threads = 0                        ; 0=poll CPU to set num threads; else specify num threads directly (max 32)

#
# masses
#
peptide_mass_tolerance = 3.00
peptide_mass_units = 0                 ; 0=amu, 1=mmu, 2=ppm
mass_type_parent = 1                   ; 0=average masses, 1=monoisotopic masses
mass_type_fragment = 1                 ; 0=average masses, 1=monoisotopic masses
precursor_tolerance_type = 0           ; 0=MH+ (default), 1=precursor m/z
isotope_error = 0                      ; 0=off, 1= on -1/0/1/2/3 (standard C13 error), 2= -8/-4/0/4/8 (for +4/+8 labeling)

#
# enzyme
#
enzyme_number = 1                      ; choose from list at end of this params file
num_enzyme_termini = 2                 ; valid values are 1 (semi-digested), 2 (fully digested, default), 8 N-term, 9 C-term
max_num_internal_cleavage_sites = 2    ; maximum value is 5; for enzyme search

#
# Up to 6 differential modifications are supported
#
diff_search_options = 15.9949 M 0.0 X 0.0 X 0.0 X 0.0 X 0.0 X
diff_search_type = 0 0 0 0 0 0         ; 0=variable mod, 1=binary mod
diff_search_count = 4 4 4 4 4 4        ; max num of modified AA per each variable mod in a peptide
max_num_differential_per_peptide = 10  ; max num of total variable mods in a peptide (not including terminal mods)

#
# fragment ions
#
# ion trap ms/ms:  0.36 tolerance, 0.11 offset (mono masses)
# high res ms/ms:  0.01 tolerance, 0.00 offset (mono masses)
# historical:      1.0005079 tolerance, 0.00 offset (mono masses)
#
# ion_series line:  0 0 nl A B C D V W X Y Z  (nl=use neutral loss); 1st two digits are unused
#
ion_series = 0 0 1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
theoretical_fragment_ions = 0          ; 0=default peak shape, 1=M peak only
fragment_ion_tolerance = 0.36          ; binning to use on fragment ions
fragment_bin_startoffset = 0.11        ; offset position to start the binning

#
# output
#
output_format = 2                      ; 0=sqt stdout (default), 1=sqt file, 2=out files
print_expect_score = 1                 ; 0=no, 1=yes, replace Sp with expect
num_output_lines = 10                  ; num peptide results to show
show_fragment_ions = 0                 ; 0=no, 1=yes

#
# mzXML parameters
#
scan_range = 0 0                       ; start and scan scan range to search; 0 as 1st entry ignores parameter
precursor_charge = 0 0                 ; precursor charge range to analyze; does not override mzXML charge; 0 as 1st entry ignores parameter
ms_level = 2                           ; MS level to analyze, valid are levels 2 (default) or 3
activation_method = ALL                ; activation method; used if activation method set; allowed ALL, CID, ECD, ETD, PQD, HCD, IRMPD

#
# misc parameters
#
digest_mass_range = 600.0 5000.0       ; MH+ peptide mass range to analyze
num_results = 50                       ; num prelim score results to store for xcorr analysis
skip_researching = 1                   ; for '.out' file output only, 0=search everything again (default), 1=don't search if .out exists
max_fragment_charge = 0                ; 0=use default else set maximum fragment charge state to input value
max_precursor_charge = 6               ; 0=use default (5) else set maximum precursor charge state to analyze
nucleotide_reading_frame = 0           ; 0=proteinDB, 1-6, 7=forward three, 8=reverse three, 9=all six
clip_nterm_methionine = 0              ; 0=leave sequences as-is; 1=also consider sequence w/o N-term methionine

#
# spectral processing
#
minimum_peaks = 5                      ; minimum num. of peaks in spectrum to search (default 5)
minimum_intensity = 0                  ; minimum intensity value to read in
remove_precursor_peak = 0              ; 0=no, 1=yes, 2=all charge reduced precursor peaks (for ETD)
remove_precursor_tolerance = 1.5       ; +- Da tolerance for precursor removal

#
# additional modifications
#

variable_C_terminus = 0.0
variable_N_terminus = 0.0
variable_C_terminus_distance = -1      ; -1=all peptides, 0=protein terminus, 1-N = maximum offset from C-terminus
variable_N_terminus_distance = -1      ; -1=all peptides, 0=protein terminus, 1-N = maximum offset from N-terminus

add_Cterm_peptide = 0.0
add_Nterm_peptide = 0.0
add_Cterm_protein = 0.0
add_Nterm_protein = 0.0

add_G_Glycine = 0.0000                 ; added to G - avg.  57.0513, mono.  57.02146
add_A_Alanine = 0.0000                 ; added to A - avg.  71.0779, mono.  71.03711
add_S_Serine = 0.0000                  ; added to S - avg.  87.0773, mono.  87.02303
add_P_Proline = 0.0000                 ; added to P - avg.  97.1152, mono.  97.05276
add_V_Valine = 0.0000                  ; added to V - avg.  99.1311, mono.  99.06841
add_T_Threonine = 0.0000               ; added to T - avg. 101.1038, mono. 101.04768
add_C_Cysteine = 57.021464             ; added to C - avg. 103.1429, mono. 103.00918
add_L_Leucine = 0.0000                 ; added to L - avg. 113.1576, mono. 113.08406
add_I_Isoleucine = 0.0000              ; added to I - avg. 113.1576, mono. 113.08406
add_X_LorI = 9000.0                    ; added to X - avg. 113.1576, mono. 113.08406
add_N_Asparagine = 0.0000              ; added to N - avg. 114.1026, mono. 114.04293
add_B_avg_NandD = 0.0000               ; added to B - avg. 114.5950, mono. 114.53494
add_D_Aspartic_Acid = 0.0000           ; added to D - avg. 115.0874, mono. 115.02694
add_Q_Glutamine = 0.0000               ; added to Q - avg. 128.1292, mono. 128.05858
add_K_Lysine = 0.0000                  ; added to K - avg. 128.1723, mono. 128.09496
add_Z_avg_QandE = 0.0000               ; added to Z - avg. 128.6216, mono. 128.55059
add_E_Glutamic_Acid = 0.0000           ; added to E - avg. 129.1140, mono. 129.04259
add_M_Methionine = 0.0000              ; added to M - avg. 131.1961, mono. 131.04048
add_O_Ornithine = 0.0000               ; added to O - avg. 132.1610, mono  132.08988
add_H_Histidine = 0.0000               ; added to H - avg. 137.1393, mono. 137.05891
add_F_Phenyalanine = 0.0000            ; added to F - avg. 147.1739, mono. 147.06841
add_R_Arginine = 0.0000                ; added to R - avg. 156.1857, mono. 156.10111
add_Y_Tyrosine = 0.0000                ; added to Y - avg. 163.0633, mono. 163.06333
add_W_Tryptophan = 0.0000              ; added to W - avg. 186.0793, mono. 186.07931
add_U_user_amino_acid = 0.0000         ; added to U - avg.   0.0000, mono.   0.00000
add_J_user_amino_acid = 0.0000         ; added to J - avg.   0.0000, mono.   0.00000

#
# SEQUEST_ENZYME_INFO _must_ be at the end of this parameters file
#
[SEQUEST_ENZYME_INFO]
0.  No_Enzyme              0      -           -
1.  Trypsin                1      KR          P
2.  Chymotrypsin           1      FWY         P
3.  Clostripain            1      R           -
4.  Cyanogen_Bromide       1      M           -
5.  IodosoBenzoate         1      W           -
6.  Proline_Endopept       1      P           -
7.  Staph_Protease         1      E           -
8.  Trypsin_K              1      K           P
9.  Trypsin_R              1      R           P
10. AspN                   0      D           -
11. Cymotryp/Modified      1      FWYL        P
12. Elastase               1      ALIV        P
13. Elastase/Tryp/Chymo    1      ALIVKRWFY   P
14. Trypsin                1      KRD         -