Percolator

Percolator is an algorithm that uses semi-supervised machine learning to improve the discrimination between correct and incorrect spectrum identifications. The matches from searching a decoy database provide the negative examples for the classifier, and a subset of the high-scoring matches from the target database provides the positive examples. Percolator trains a support vector machine (SVM), a type of machine learning classifier, to discriminate between the positive and negative matches by assigning weights to a number of features. Examples of features include Mascot score, precursor mass error, fragment mass error, and number of variable modifications. The vector of features with their optimal weights is then used to re-rank matches from all queries, often leading to improved sensitivity.
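To make the idea of a weighted feature vector concrete, here is a minimal Python sketch of the re-ranking step only. It does not train an SVM. The weights, the bias, and the example matches are hypothetical values chosen for illustration; in practice the weights come out of Percolator's training.

```python
from typing import Dict, List

# Hypothetical weights, standing in for values learned by the SVM.
WEIGHTS: Dict[str, float] = {"mScore": 0.05, "absDMppm": -0.10, "mc": -0.30}
BIAS = -1.0  # hypothetical offset term


def linear_score(features: Dict[str, float]) -> float:
    """Combine the feature values into one score: bias plus weighted sum."""
    return BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def rerank(matches: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return the matches sorted by combined score, best first."""
    return sorted(matches, key=linear_score, reverse=True)


# Two made-up rank 1 matches from different queries.
matches = [
    {"query": 1, "mScore": 40.0, "absDMppm": 12.0, "mc": 2},
    {"query": 2, "mScore": 35.0, "absDMppm": 1.0, "mc": 0},
]
order = [m["query"] for m in rerank(matches)]
```

With these made-up numbers, the match to query 2 overtakes the match to query 1 despite its lower Mascot score, because its mass error is smaller and it has no missed cleavages.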

Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, & Michael J MacCoss at the University of Washington, Department of Genome Sciences. The software is released under an Apache 2.0 licence and included with Mascot by permission.

We would also like to acknowledge the work of Markus Brosch and colleagues at the Sanger Centre, Hinxton, UK, who first applied Percolator to Mascot results and developed a wrapper application called Mascot Percolator.

There are a number of relevant publications:

Percolator returns p values, q values and Posterior Error Probabilities (PEPs) for each match. The q value of a match is the minimum false discovery rate at which that match is accepted. If we accept all matches with q values of 0.01 or less, the estimated false discovery rate will be 1%. The PEP is the probability that an individual match is a chance event.
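The relationship between a score threshold, the false discovery rate, and the q value can be illustrated with a simple target-decoy count. The Python sketch below is not Percolator's own estimator, which is more refined. It assumes the FDR at a threshold is the number of decoy matches divided by the number of target matches at or above that threshold, and it ignores tied scores. The function name and the example data are hypothetical.

```python
from typing import List, Tuple


def q_values(matches: List[Tuple[float, bool]]) -> List[float]:
    """matches holds (score, is_decoy) pairs; returns q values in input order."""
    # Indices of the matches from best score to worst.
    order = sorted(range(len(matches)), key=lambda i: matches[i][0], reverse=True)

    # FDR estimate when the threshold is set at each match in turn.
    fdrs = []
    targets = decoys = 0
    for i in order:
        if matches[i][1]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q value: the lowest FDR at this threshold or any more permissive one,
    # capped at 1.0.
    q = [0.0] * len(matches)
    running_min = 1.0
    for pos in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[pos])
        q[order[pos]] = running_min
    return q


# Made-up scores: False marks a target match, True a decoy match.
example = [(90.0, False), (80.0, False), (70.0, True), (60.0, False), (50.0, True)]
example_q = q_values(example)
```

Accepting every match with a q value at or below a chosen cut-off then gives a list whose estimated FDR is no higher than that cut-off.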

The requirements for using Percolator to re-rank the matches from a Mascot search are:

  1. The search must be an MS/MS search
  2. The search must include the results from an automatic decoy database search
  3. The search must contain at least 100 queries
  4. At least 100 database entries must be searched

If these requirements are met, the result report will include a checkbox Show Percolator scores. When this is checked and the report re-loaded, the original Mascot scores will be replaced as follows:

  • Score: -10log10(PEP)
  • Expect value: PEP
  • Identity threshold score for p<0.05: 13
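The replacement values are related by simple arithmetic. The Python sketch below assumes the logarithm is base 10, which is consistent with a threshold of 13 for p<0.05. The function names are hypothetical.

```python
import math


def pep_to_score(pep: float) -> float:
    """Convert a posterior error probability (0 < pep <= 1) to a score."""
    return -10.0 * math.log10(pep)


def score_to_pep(score: float) -> float:
    """Inverse conversion: recover the PEP from a score."""
    return 10.0 ** (-score / 10.0)


# A PEP of 0.05 gives a score of about 13.01, hence the threshold of 13.
threshold = pep_to_score(0.05)
```

Note that a PEP of exactly 0 has no finite score, so a real implementation would need to cap or floor the value before taking the logarithm.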

Features

The complete set of features that can be made available to Percolator is defined in code. You can choose a sub-set of these features using a setting in the Options section of the Mascot configuration file, mascot.dat. The default setting, as shipped, is:

PercolatorFeatures mScore,lgDScore,mrCalc,charge,dM,dMppm,absDM,absDMppm,isoDM,isoDMppm,mc,varmods,totInt,intMatchedTot,relIntMatchedTot

For complete details of Percolator configuration settings and a description of the data flow, refer to the Mascot Setup & Installation manual.

List of features available to Percolator

Feature name Description
retentionTime Retention time in seconds if available
dM Calculated minus observed peptide mass in Da
mScore Mascot score (always on)
lgDScore Mascot score minus Mascot score of next best non-isobaric peptide hit
mrCalc Calculated Mr
charge Charge
dMppm Calculated minus observed peptide mass in ppm
absDM Absolute value of calculated minus observed peptide mass in Da
absDMppm Absolute value of calculated minus observed peptide mass in ppm
isoDM Calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in Da
isoDMppm Calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in ppm
mc Number of missed cleavages (always 0 if no enzyme)
varmods Number of modified sites divided by number of modifiable sites
varmodsCount The number of distinct variable mods used in the peptide. That is, if there are 10 Met and 5 of these are oxidised, this counts as 1. A peptide with Met-OX, phosphoS, deamidation, and acetylation would count as 4.
modifiable Total number of modifiable sites
modified Total number of modified residues and termini
totInt Log total ion intensity. The 20 most intense peaks in each 100 Da bin are used for all features, and totInt reports this value
intMatchedTot Log total matched ion intensity
relIntMatchedTot Total matched ion intensity divided by total ion intensity as a percentage (no logs involved)
fragDeltaMed Median value of all matched fragment errors in Da
fragDeltaIqr Interquartile range value of all matched fragment errors in Da
fragDeltaMedPPM Median value of all matched fragment errors in ppm
fragDeltaIqrPPM Interquartile range value of all matched fragment errors in ppm
fragDeltaPolyFit 2nd order polynomial fit to m/z vs delta. Result is RSquared multiplied by the number of points divided by 100
longest Longest sequence matched ions, reported separately for each ion series (backbone only), as with fracIonsMatched
fracIonsMatched Fraction of calculated ions matched, reported separately for each ion series, with NLs lumped together (e.g. fracIonsMatchedB1, fracIonsMatchedB1deriv, fracIonsMatchedB2, fracIonsMatchedB2deriv)
matchedIntensity Matched ion intensity, reported separately for each ion series, as with fracIonsMatched
qmatch The number of peptide matches for which an ms-ms match was attempted
peptide The peptide string that was matched
proteins A tab separated list of accessions of proteins that contain this peptide. Must be last feature in list
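As an illustration of how some of the simpler features in the list relate to one another, here is a Python sketch of the precursor mass error features and the variable modification features. It is a reading of the descriptions above, not Mascot's code. In particular, it assumes that ppm values are taken relative to the calculated mass, that an isotope error is a whole number of 13C spacings of about 1.00335 Da in either direction, and that varmods is 0 when no sites are modifiable. All function and parameter names are hypothetical.

```python
from typing import Dict

C13_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da


def mass_error_features(mr_calc: float, mr_obs: float,
                        max_isotope: int = 2) -> Dict[str, float]:
    """Precursor mass error features for one peptide match."""
    d_m = mr_calc - mr_obs
    # Assumption: try every isotope offset up to max_isotope and keep the
    # one that leaves the smallest residual error.
    iso_dm = min(
        (d_m + k * C13_SPACING for k in range(-max_isotope, max_isotope + 1)),
        key=abs,
    )
    to_ppm = 1e6 / mr_calc  # assumption: ppm relative to calculated Mr
    return {
        "dM": d_m,
        "dMppm": d_m * to_ppm,
        "absDM": abs(d_m),
        "absDMppm": abs(d_m) * to_ppm,
        "isoDM": iso_dm,
        "isoDMppm": iso_dm * to_ppm,
    }


def varmod_features(modified_sites: int, modifiable_sites: int,
                    distinct_mods: int) -> Dict[str, float]:
    """Variable modification features for one peptide match."""
    ratio = modified_sites / modifiable_sites if modifiable_sites else 0.0
    return {
        "varmods": ratio,
        "varmodsCount": distinct_mods,
        "modifiable": modifiable_sites,
        "modified": modified_sites,
    }


# A precursor picked one isotope peak too high: dM is about -1.004 Da,
# but isoDM is well under 1 mDa.
errors = mass_error_features(1500.0, 1501.004)

# The example from the table: 10 Met, 5 of them oxidised.
mods = varmod_features(modified_sites=5, modifiable_sites=10, distinct_mods=1)
```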

One feature is treated differently from the others: retention time. If retention time is included in the peak list, so that it is available in the Mascot result file, it can be used as a feature by comparing the experimental RT values with values calculated by Percolator. To enable this:

  • The peak list must supply retention time information using the MGF RTINSECONDS parameter. It is not sufficient to have the information embedded in the scan title string
  • retentionTime must be listed in the PercolatorFeatures line in mascot.dat
  • In the Options section of mascot.dat, set PercolatorUseRT to 1 to turn this feature on by default. Otherwise, add the argument percolate_rt=1 to the report URL
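For the last option, the argument can be appended to the report URL by hand or by a script. The following Python sketch uses only the standard library. The host, script path, and file argument in the example are placeholders, not real Mascot locations, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_percolator_rt(report_url: str) -> str:
    """Return report_url with percolate_rt=1 set exactly once."""
    parts = urlsplit(report_url)
    # Keep the existing arguments, dropping any earlier percolate_rt value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "percolate_rt"
    ]
    query.append(("percolate_rt", "1"))
    # safe="/" leaves path separators in argument values unescaped.
    return urlunsplit(parts._replace(query=urlencode(query, safe="/")))


# Placeholder URL for illustration only.
url = with_percolator_rt("http://mascot.example/cgi/report?file=../data/F001234.dat")
```

Calling the function on a URL that already carries the argument leaves it unchanged, so it is safe to apply more than once.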

Application Notes

Percolator will usually give a worthwhile improvement in sensitivity, but there are occasions when it can fail. For example, if there are very few good matches in the search results, it may not have enough positive examples to work with.

If there are multiple, high scoring matches to a single query, the current approach is to submit only the first rank match to Percolator. The other matches to the same query are then re-scored by pro-rating the new score for the rank 1 match. Thus, if there were matches to multiple peptides which differed only in (say) I and L, all of which had the same Mascot score, they would still have the same score after re-ranking with Percolator. Similarly, if the top 3 matches had Mascot scores of 60, 50, and 40, and Percolator re-scored the rank 1 match to 54, the rank 2 and 3 matches would be re-scored to 45 and 36. This avoids anomalies, but it is not ideal. If we accept that the weighted vector of features is doing a good job of re-ranking matches from different queries, it is only logical to re-rank the alternative matches to a single query. This would allow a rank 2 match to be promoted over the rank 1 match, which cannot happen using the current approach.
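The pro-rating arithmetic described above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes the scores are listed in rank order with a non-zero rank 1 score.

```python
from typing import List


def prorate_scores(mascot_scores: List[float], new_rank1_score: float) -> List[float]:
    """Scale all matches to one query by the change in the rank 1 score.

    mascot_scores must be in rank order, with the rank 1 score first.
    """
    if not mascot_scores:
        return []
    rank1 = mascot_scores[0]
    if rank1 == 0:
        raise ValueError("rank 1 score must be non-zero to pro-rate")
    factor = new_rank1_score / rank1
    return [score * factor for score in mascot_scores]


# The example from the text: 60, 50, 40 with rank 1 re-scored to 54
# gives approximately 54, 45, 36.
rescored = prorate_scores([60.0, 50.0, 40.0], 54.0)
```

Because every score for the query is multiplied by the same factor, the rank order within the query is preserved and equal scores stay equal, which is exactly why a rank 2 match can never be promoted under this scheme.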