Percolator

Percolator is an algorithm that uses semi-supervised machine learning to improve the discrimination between correct and incorrect spectrum identifications. The matches from searching a decoy database provide the negative examples for the classifier, and a subset of the high-scoring matches from the target database provides the positive examples. Percolator trains a machine learning algorithm called a support vector machine (SVM) to discriminate between the positive and negative matches by assigning weights to a number of features. Examples of features include Mascot score, precursor mass error, fragment mass error, and number of variable modifications. The vector of features with their optimal weights is then used to re-rank matches from all queries, often leading to improved sensitivity.
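The final re-ranking step can be illustrated with a minimal sketch. It assumes a weight vector has already been learned. The weights, feature values and peptide strings below are invented for illustration, and the functions are hypothetical helpers, not part of Percolator or Mascot.

```python
# Hypothetical weights for three features; in practice these are learned by the SVM.
WEIGHTS = {"mScore": 0.05, "absDMppm": -0.2, "mc": -0.3}


def svm_score(features, weights=WEIGHTS, bias=0.0):
    """Linear decision value: weighted sum of the feature values plus a bias."""
    return bias + sum(weights[name] * features[name] for name in weights)


def rerank(candidates, weights=WEIGHTS):
    """Return the candidate matches for one query, best decision value first."""
    return sorted(
        candidates,
        key=lambda c: svm_score(c["features"], weights),
        reverse=True,
    )


# Two invented candidate matches for the same query.
candidates = [
    {"peptide": "PEPTIDEA", "features": {"mScore": 42.0, "absDMppm": 9.0, "mc": 2}},
    {"peptide": "PEPTIDEB", "features": {"mScore": 40.0, "absDMppm": 1.0, "mc": 0}},
]
reranked = rerank(candidates)
```

In this invented case the match with the slightly lower Mascot score moves to the top, because its small mass error and lack of missed cleavages outweigh the score difference.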

Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, & Michael J MacCoss at the University of Washington, Department of Genome Sciences. The software is released under an Apache 2.0 licence and included with Mascot by permission.

We would also like to acknowledge the work of Markus Brosch and colleagues at the Sanger Centre, Hinxton, UK, who first applied Percolator to Mascot results and developed a wrapper application called Mascot Percolator.

There are a number of relevant publications:

Percolator returns p values, q values and Posterior Error Probabilities (PEPs) for each match. The q value of a match can be thought of as the lowest false discovery rate at which that match is still accepted. If we accept all matches with q values of 0.01 or less, the estimated false discovery rate will be 1%. The PEP is the probability that an individual match is a chance event.
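The relationship between q values and the false discovery rate can be sketched with a simple target-decoy estimate. This is an illustration only: the function name and the example scores are assumptions, and Percolator's own estimation procedure may differ in detail.

```python
def q_values(matches):
    """Estimate q values for target matches.

    matches: list of (score, is_decoy) tuples.
    Returns a list of (score, q_value) for target matches, best score first.
    """
    ordered = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR if the threshold is set at this score.
            fdrs.append((score, decoys / targets))

    # The q value is the minimum FDR at this score or at any lower threshold.
    result = []
    running_min = float("inf")
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        result.append((score, running_min))
    result.reverse()
    return result


# Invented scores: True marks a decoy match.
example = [(90, False), (80, False), (70, True), (60, False),
           (50, False), (40, True), (30, False)]
example_q = q_values(example)
```

Accepting every target match whose q value is at or below a chosen cut-off gives a set of matches with an estimated false discovery rate no greater than that cut-off.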

The requirements for using Percolator to re-rank the matches from a Mascot search are:

  1. The search must be an MS/MS search.
  2. The search must include the results from an automatic decoy database search.
  3. The search must contain at least 750 queries.
  4. At least 100 database entries must be searched.
  5. The search must not be an error tolerant search.
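The requirements can be expressed as a simple check. The sketch below is illustrative: the SearchSummary class and its field names are hypothetical and do not correspond to any Mascot API.

```python
from dataclasses import dataclass


@dataclass
class SearchSummary:
    # All field names are hypothetical, chosen for this illustration only.
    is_msms: bool
    has_auto_decoy: bool
    num_queries: int
    num_db_entries: int
    is_error_tolerant: bool


def percolator_problems(search):
    """Return the list of unmet requirements; an empty list means the search qualifies."""
    problems = []
    if not search.is_msms:
        problems.append("not an MS/MS search")
    if not search.has_auto_decoy:
        problems.append("no automatic decoy search")
    if search.num_queries < 750:
        problems.append("fewer than 750 queries")
    if search.num_db_entries < 100:
        problems.append("fewer than 100 database entries searched")
    if search.is_error_tolerant:
        problems.append("error tolerant search")
    return problems


ok_search = SearchSummary(True, True, 750, 100, False)
small_search = SearchSummary(True, True, 749, 5000, False)
```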

If these requirements are met, the result report will include a checkbox Show Percolator scores. When this is checked and the report re-loaded, the original Mascot scores will be replaced as follows:

  • Score: -10log10(PEP)
  • Expect value: PEP
  • Identity threshold score for p<0.05: 13
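The fixed identity threshold follows from the score definition. The sketch below assumes the logarithm is base 10, which is consistent with a threshold of 13 for p<0.05. The function name is hypothetical.

```python
import math


def percolator_score(pep):
    """Score shown in the report, assuming score = -10 * log10(PEP)."""
    if pep <= 0:
        raise ValueError("PEP must be greater than zero")
    return -10.0 * math.log10(pep)


# A PEP of 0.05 corresponds to a score of about 13.01, hence the threshold of 13.
threshold = percolator_score(0.05)
```

Under the same assumption, a PEP of 0.01 corresponds to a score of 20.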

Percolator will usually give a worthwhile improvement in sensitivity, but there are occasions when it can fail. For example, if there are very few good matches in the search results, it may not have enough positive examples to work with.

Features

The complete set of features that can be made available to Percolator is defined in code. You can choose a sub-set of these features using a setting in the Options section of the Mascot configuration file, mascot.dat. The default setting, as shipped, is:

PercolatorFeatures dM, mScore, MIT, MHT, peptideLength, z1, z2, z4, z7, isoSysDM, isoSysDMppm, isoSysDMz, 12C, mc0, mc1, mc2, varmods, varmodsCount, totInt, intMatchedTot, relIntMatchedTot, RMS, RMSppm, meanAbsFragDa, meanAbsFragPPM, rawScore
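As an illustration, the setting can be read as a keyword followed by a comma-separated list of feature names. The parser below is a minimal sketch written for this page, not the code Mascot uses to read mascot.dat.

```python
DEFAULT_LINE = (
    "PercolatorFeatures dM, mScore, MIT, MHT, peptideLength, z1, z2, z4, z7, "
    "isoSysDM, isoSysDMppm, isoSysDMz, 12C, mc0, mc1, mc2, varmods, varmodsCount, "
    "totInt, intMatchedTot, relIntMatchedTot, RMS, RMSppm, meanAbsFragDa, "
    "meanAbsFragPPM, rawScore"
)


def parse_percolator_features(option_line):
    """Split a PercolatorFeatures line into a list of feature names."""
    key, _, value = option_line.strip().partition(" ")
    if key != "PercolatorFeatures":
        raise ValueError("not a PercolatorFeatures line: %r" % key)
    return [name.strip() for name in value.split(",") if name.strip()]


features = parse_percolator_features(DEFAULT_LINE)
```

The default setting selects 26 features and does not include retentionTime.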

List of features available to Percolator

Feature name Description
retentionTime Retention time in seconds if available
dM Calculated minus observed peptide mass in Da
mScore Mascot score (always on)
lgDScore Mascot score minus Mascot score of next best non-isobaric peptide hit
mrCalc Calculated Mr
charge Charge
dMppm Calculated minus observed peptide mass in ppm
absDM Absolute value of calculated minus observed peptide mass in Da
absDMppm Absolute value of calculated minus observed peptide mass in ppm
isoDM Absolute value of calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in Da
isoDMppm Absolute value of calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in ppm
isoDmz Absolute value of calculated minus observed peptide m/z
mc Number of missed cleavages (always 0 if no enzyme)
varmods Number of modified sites divided by number of modifiable sites (set to 0 if number of modifiable sites is 0)
varcount Number of distinct varmods present
varmodsCount The number of distinct variable mods used in the peptide. For example, if there are 10 Met and 5 of these are oxidised, this counts as 1. A peptide with Met-OX, phosphoS, deamidation, and acetylation would count as 4.
modifiable Total number of modifiable sites
modified Total number of modified residues and termini
totInt Log total ion intensity. The 20 most intense peaks in each 100 Da bin are used for all features, and totInt reports this value
intMatchedTot Log total matched ion intensity
relIntMatchedTot Total matched ion intensity divided by total ion intensity as a percentage (no logs involved)
fragDeltaMed Median value of all matched fragment errors in Da
fragDeltaIqr Interquartile range value of all matched fragment errors in Da
fragDeltaMedPPM Median value of all matched fragment errors in ppm
fragDeltaIqrPPM Interquartile range value of all matched fragment errors in ppm
fragDeltaPolyFit 2nd order polynomial fit to m/z vs delta. Result is RSquared multiplied by the number of points divided by 100
longest Longest sequence matched ions, reported separately for each ion series (backbone only), as with fracIonsMatched
fracIonsMatched Fraction of calculated ions matched, reported separately for each ion series, with NLs lumped together (e.g. fracIonsMatchedB1, fracIonsMatchedB1deriv, fracIonsMatchedB2, fracIonsMatchedB2deriv)
matchedIntensity Matched ion intensity, reported separately for each ion series, as with fracIonsMatched
qmatch The number of peptide matches for which an ms-ms match was attempted
MIT Mascot identity threshold
MHT Mascot homology threshold
peptideLength Peptide length
z1 1 if charge = 1
z2 1 if charge = 2 or 3
z4 1 if charge = 4, 5, or 6
z7 1 if charge = 7 or more
12C 1 if peptide mass is 12C value (no isotope error)
mc0 1 if missed cleavages = 0 or if no enzyme
mc1 1 if missed cleavages = 0 or 1
mc2 1 if missed cleavages = 2 or more
RMS RMS m/z error for matched fragments
RMSppm RMS ppm error for matched fragments
meanAbsFragDa Mean absolute m/z error for matched fragments
meanAbsFragPPM Mean absolute PPM error for matched fragments
rawscore Simple binomial score using matches to main series sequence ions and p = 2*ITOL*n/100 where n is the number of peaks selected in each 100 Da bin
peptide The peptide string that was matched interpolated with numbers to represent modifications, e.g. X.DAKAAM1AGRLM1IR.X
proteins A tab separated list of accessions of proteins that contain this peptide. Must be last feature in list
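A few of the features in the table can be sketched directly from their descriptions. The functions below are illustrative helpers, not Percolator or Mascot code. In particular, expressing the ppm error relative to the calculated mass is an assumption, as the table does not state the reference mass.

```python
import math


def charge_indicators(charge):
    """Indicator features z1, z2, z4 and z7 as described in the table."""
    return {
        "z1": int(charge == 1),
        "z2": int(charge in (2, 3)),
        "z4": int(4 <= charge <= 6),
        "z7": int(charge >= 7),
    }


def mass_error_features(calc_mr, obs_mr):
    """dM, dMppm, absDM and absDMppm from calculated and observed peptide mass."""
    dm = calc_mr - obs_mr
    dm_ppm = dm / calc_mr * 1e6  # assumption: ppm is relative to the calculated mass
    return {"dM": dm, "dMppm": dm_ppm, "absDM": abs(dm), "absDMppm": abs(dm_ppm)}


def fragment_error_features(errors_da):
    """RMS and meanAbsFragDa from a non-empty list of matched fragment errors in Da."""
    n = len(errors_da)
    rms = math.sqrt(sum(e * e for e in errors_da) / n)
    mean_abs = sum(abs(e) for e in errors_da) / n
    return {"RMS": rms, "meanAbsFragDa": mean_abs}
```

For example, a charge of 3 sets z2 to 1 and the other charge indicators to 0.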

One feature is treated differently from the others: retention time. If retention time is included in the peak list, so that it is available in the Mascot result file, it can be used as a feature by comparing the experimental RT values with values calculated by Percolator. To enable this:

  • The peak list must supply retention time information using the MGF RTINSECONDS parameter. It is not sufficient to have the information embedded in the scan title string
  • In the Options section of mascot.dat, set PercolatorUseRT to 1 to turn this feature on by default. Please note that retention time calculation in Percolator is very time consuming and the sensitivity improvement is only marginal for most data sets. We advise against turning it on as a global default. It is better to try it on specific examples by adding the argument percolate_rt=1 to the report URL.
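Before enabling retention time, it can be useful to confirm that the peak list actually carries RTINSECONDS for every spectrum. The following sketch scans MGF text for spectra that lack the parameter. The function name and the example spectra are invented for illustration.

```python
def spectra_missing_rt(mgf_lines):
    """Return (total spectra, spectra without an RTINSECONDS line)."""
    total = missing = 0
    in_spectrum = has_rt = False
    for raw in mgf_lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_spectrum, has_rt = True, False
        elif line == "END IONS" and in_spectrum:
            total += 1
            if not has_rt:
                missing += 1
            in_spectrum = False
        elif in_spectrum and line.upper().startswith("RTINSECONDS="):
            has_rt = True
    return total, missing


# Invented example: the first spectrum only has RT embedded in the title,
# which is not sufficient; the second uses RTINSECONDS.
EXAMPLE_MGF = """BEGIN IONS
TITLE=scan 1 rt=12.5
PEPMASS=500.25
CHARGE=2+
100.1 200
END IONS
BEGIN IONS
TITLE=scan 2
RTINSECONDS=754.2
PEPMASS=612.3
CHARGE=3+
150.2 80
END IONS
"""
counts = spectra_missing_rt(EXAMPLE_MGF.splitlines())
```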

Two options in mascot.dat control whether target matches other than rank 1 are Percolated:

  • PercolatorTargetRankScoreThreshold: Target matches below rank 1 are not Percolated if their score is less than this value (default 20)
  • PercolatorTargetRankRelativeThreshold: Target matches below rank 1 are not Percolated if the difference between the rank 1 score and their score, divided by the rank 1 score, is greater than this value (default 0.2)
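The effect of the two thresholds can be sketched as a single test. This is an illustration under two assumptions: that the score difference is the rank 1 score minus the score of the lower-ranked match, and that either condition on its own excludes a match. The function name is hypothetical.

```python
def is_percolated(rank, score, rank1_score,
                  score_threshold=20.0, relative_threshold=0.2):
    """Decide whether a target match would be passed to Percolator."""
    if rank == 1:
        return True  # rank 1 matches are not affected by these thresholds
    if score < score_threshold:
        return False  # PercolatorTargetRankScoreThreshold
    if rank1_score > 0 and (rank1_score - score) / rank1_score > relative_threshold:
        return False  # PercolatorTargetRankRelativeThreshold
    return True
```

With the default values, a rank 2 match scoring 45 against a rank 1 score of 50 is kept (relative difference 0.1), while one scoring 30 is dropped (relative difference 0.4).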

Data flow

  1. At the completion of a qualifying search, nph-mascot.exe creates a Percolator input file (*.pip) in the result directory
  2. When a report for Percolated results is loaded, the Percolator executable is called by nph-cache_families.pl to create a pair of output files (*.target.pop, *.decoy.pop) in the result directory. If Percolation is on by default, which is not recommended, this will occur when the report is first loaded. Otherwise, it occurs when the Percolator checkbox is checked and the report reloaded using Format As.
  3. Finally, nph-cache_families.pl uses the *.pop files to create new cache files that allow a report to be displayed using Percolated scores in place of the original Mascot scores.
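The steps above leave a recognisable set of files in the result directory, so the stage a search has reached can be inferred from which files are present. The sketch below relies only on the file extensions named above. The function name and the stage labels are invented for illustration.

```python
from pathlib import Path


def percolator_stage(result_dir):
    """Infer how far the Percolator data flow has progressed in a directory."""
    directory = Path(result_dir)
    has_input = any(directory.glob("*.pip"))
    has_target = any(directory.glob("*.target.pop"))
    has_decoy = any(directory.glob("*.decoy.pop"))
    if has_target and has_decoy:
        return "percolated"  # both output files exist
    if has_input:
        return "input ready"  # input file written, Percolator not yet run
    return "no percolator files"
```

Calling percolator_stage on a result directory returns one of the three labels. It does not check that the files belong to the same search.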