Data analysis tools

The two primary analyses performed at the UWPR are shotgun (bottom-up) proteomics and targeted quantitative proteomics.

For targeted proteomics, data is classically acquired on triple quadrupole instruments and typically analyzed using the Skyline software suite. There is extensive documentation, along with tutorials and videos, on the Skyline web site. The nice thing about Skyline is that its development is done here at the UW in the MacCoss lab in Genome Sciences (in the Foege building on South campus).

Shotgun analysis involves peptide identification via MS/MS database searching. For those who prefer to perform their own analysis, the UWPR hosts a Mascot server (release 2.3) that you can request an account on. Other popular commercial tools are Thermo's Proteome Discoverer and Proteome Software's Scaffold. The free tool MaxQuant is widely used for peptide identification and quantification.

To use the same tools as we do here at the UWPR, you can learn about the Trans-Proteomic Pipeline (TPP) originally developed in the Aebersold group at ISB. There's a support forum for your questions and they offer periodic week-long software courses if you want to learn how to use the software.

Data analysis primer

Shotgun data processed by the UWPR for you will typically entail a Comet database search followed by Trans-Proteomic Pipeline (TPP) analysis using PeptideProphet and ProteinProphet. Aimed at someone new to UWPR proteomics analysis, this is a brief tutorial on how to start looking at your data.

Links to your processed data will appear at the bottom of your project page in the section "External Links to Data".

Click on the "View Data" link which will bring up a page that looks like this:

The very first thing I always look at is PeptideProphet's score distribution plots and how well the modeled positive and negative distributions fit the experimental data. To do this, click on the "pep.xml" file link for each analysis. This brings up the TPP's PepXML Viewer below. Think of this as a grid of your raw search results where each row represents an MS/MS spectrum search result. You'll see some scores, a spectrum name with scan number and charge state encoded in it, a link to the spectrum viewer, the best (not necessarily correct) peptide match, a protein name (just one protein name is printed but the peptide could match to many), and the peptide mass. You can add or remove other columns of information.
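As a rough illustration of how the scan number and charge state are encoded in the spectrum name, the sketch below assumes the common TPP naming pattern basename.startscan.endscan.charge. The example name is made up, and your files may follow a different convention, so check a few names in your own viewer first.

```python
def parse_spectrum_name(name):
    """Split a spectrum name into base name, scan range, and charge.

    Assumes the pattern basename.startscan.endscan.charge. The base name
    itself may contain dots, so the split is done from the right.
    """
    base, start_scan, end_scan, charge = name.rsplit(".", 3)
    return {
        "base": base,
        "start_scan": int(start_scan),
        "end_scan": int(end_scan),
        "charge": int(charge),
    }


# Hypothetical spectrum name, for illustration only.
info = parse_spectrum_name("sample_run01.02345.02345.2")
print(info["base"], info["start_scan"], info["charge"])
```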

Then click on any of the probability score values in the leftmost column with the header "PROB". They will all bring up the same score distribution curves so it doesn't matter which one you click on. You should see a set of model charts like below. There's a lot of information here that is beyond this primer. Your best resource for questions on PeptideProphet and these score distributions is the TPP's support forum.

The charts on the far left are the key indicators and what I use to judge whether or not the calculated probability values are worthwhile.

What you see in the plots is a score histogram (black curves) of all the peptide identifications; one plot for each precursor charge state. The red curves are what PeptideProphet fits to the negative/null/wrong distribution and the green curves are what PeptideProphet fits to the positive/correct distribution. What you want to see is that there are two distinct distributions (bimodal) in the black curve and that the red and green curves fit those well.
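To make the idea of two fitted distributions concrete, here is a minimal sketch of how a probability can be derived from a two-component mixture using Bayes' rule. It assumes Gaussian shapes for both components and made-up parameters. PeptideProphet's actual models and fitting procedure differ, so treat this as an illustration of the principle only.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, prior_correct, pos, neg):
    """Probability that a score belongs to the positive (correct) component.

    pos and neg are (mean, sd) tuples for the positive and negative
    distributions; prior_correct is the assumed fraction of correct matches.
    """
    p_pos = prior_correct * gaussian_pdf(score, *pos)
    p_neg = (1.0 - prior_correct) * gaussian_pdf(score, *neg)
    return p_pos / (p_pos + p_neg)


# Hypothetical, well-separated distributions (illustrative values only).
POS = (4.0, 1.0)  # stands in for the green curve: correct matches
NEG = (0.0, 1.0)  # stands in for the red curve: incorrect matches
for score in (0.0, 2.0, 4.0):
    print(score, round(posterior_correct(score, 0.3, POS, NEG), 3))
```

When the two components overlap heavily, the computed probability stays close to the prior for most scores, which is one way to see why poorly separated distributions give uninformative probabilities.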

Here are two examples of very good score distributions. Sensitivity/error curves are near ideal (you want them to hit the top right and bottom left corners). The black line in the 2nd and 3rd plots represents the histogram of search results, and you want to see a bimodal (two peaks) distribution representing the bad hits (modeled by the red curve) and good hits (modeled by the green curve). In these examples, it's clear that there are two peaks in the black score distribution and the positive distributions are big.
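For readers wondering where sensitivity/error curves come from, the sketch below shows one model-based way to estimate both values from a list of probabilities at a given cutoff. It treats each probability as the chance that the match is correct, which is an assumption of this illustration; the probabilities are made up and the function name is hypothetical.

```python
def sensitivity_and_error(probabilities, min_probability):
    """Estimate sensitivity and error rate at a minimum probability cutoff.

    Each probability is treated as the chance that the match is correct,
    so sums of probabilities act as expected counts of correct matches.
    """
    accepted = [p for p in probabilities if p >= min_probability]
    expected_correct_total = sum(probabilities)
    expected_correct_accepted = sum(accepted)
    sensitivity = (
        expected_correct_accepted / expected_correct_total
        if expected_correct_total > 0 else 0.0
    )
    error = (
        (len(accepted) - expected_correct_accepted) / len(accepted)
        if accepted else 0.0
    )
    return sensitivity, error


# Made-up probabilities for a well-separated run (illustration only).
probs = [0.99, 0.98, 0.95, 0.90, 0.10, 0.05, 0.02, 0.01]
for cutoff in (0.05, 0.5, 0.9):
    sens, err = sensitivity_and_error(probs, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.3f}, error={err:.3f}")
```

With well-separated probabilities like these, a middle cutoff keeps sensitivity high while the estimated error stays low, which is what the near-ideal curves reflect.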

   

Here are two examples of other good, maybe more normal distributions. You can see the positive distributions are not nearly as large as in the plots above but they are clear positive distributions. In the plots on the right, the lines are jagged simply because the raw counts are so low. But even with these low counts, there's good separation between positive and negative distributions (and this good separation is encapsulated in the good sensitivity/error plots).

   

Here's an example of very poor score distributions. The sensitivity/error plots don't trend to the top right and bottom left corners. And there's simply no positive distribution. So if there are any good peptide IDs, their counts are very low.

   

When the score distributions are as poor as in the third example above, I tend to ignore the calculated probability values, which means the protein probabilities aren't reliable either (don't bother opening the prot.xml link). For such data, I end up sorting the peptide list in ascending order by the "expect" column (E-value or expectation value) and looking at the best scoring identifications. In contrast to PeptideProphet probabilities, which are calculated by analyzing the entire run, the E-value is calculated on each individual spectrum search; smaller E-values are better. Think of it as related to a p-value, but defined as the expected number of random identifications that would score as well as or better than the current peptide's score. From observation of Comet scores and the spectral annotations, E-values in the range of 10^-8 or smaller are usually very good, while spectra with E-values in the 10^-4 range and higher start to look noisier and more suspect. And there are always exceptions (like a good looking spectrum match with a poorer E-value).
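The manual triage described above can be sketched in a few lines. The example assumes the search results have already been loaded as a list of dictionaries with "expect" and "peptide" keys (hypothetical column names; check the header of your own export), and the peptide entries are made up. The cutoffs are the rough rules of thumb from the text, not hard thresholds.

```python
def triage_by_expect(rows, good_cutoff=1e-8, suspect_cutoff=1e-4):
    """Sort results by ascending E-value and label each one.

    Labels follow the rules of thumb above: E-values at or below
    good_cutoff are usually very good, values at or above suspect_cutoff
    start to look suspect, and anything in between needs a closer look.
    """
    ranked = sorted(rows, key=lambda row: float(row["expect"]))
    labeled = []
    for row in ranked:
        expect = float(row["expect"])
        if expect <= good_cutoff:
            label = "likely good"
        elif expect >= suspect_cutoff:
            label = "suspect"
        else:
            label = "inspect spectrum"
        labeled.append((row["peptide"], expect, label))
    return labeled


# Made-up results for illustration only.
rows = [
    {"peptide": "PEPTIDEK", "expect": "3.2E-05"},
    {"peptide": "SAMPLER", "expect": "1.5E-10"},
    {"peptide": "ANQTHERK", "expect": "2.1E-02"},
]
for peptide, expect, label in triage_by_expect(rows):
    print(f"{peptide}\t{expect:.1e}\t{label}")
```

Whatever the label says, the annotated spectrum is still the final check, since a good looking match can carry a poorer E-value.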

To visualize annotated spectra using the Lorikeet viewer, click on the links in the "IONS" column of the PepXML Viewer. Here are 3 examples of good MS/MS spectra with good peptide matches.

These are examples of very poor MS/MS spectra.

These peptide-spectrum matches got poor scores but might seem plausible if you squint hard enough.

The prot.xml link brings up the ProteinProphet viewer. This is a protein centric view of the data and should only be looked at after you validate that the peptide score distributions aren't horrible. Here's what the ProteinProphet viewer looks like:

Scrolling down the list, you start seeing protein "groups" which are usually isoforms grouped together (but sometimes they're unrelated proteins that simply share sequence homology):

Clicking on the far left column for a particular protein entry will bring up these peptide details:

Clicking on the group entry number will bring up these group details:


How to download the "Excel" files from the pep.xml and prot.xml viewers:

For the pep.xml link, after you choose "Export Spreadsheet", go back to the "Summary" tab. In the header, you should now see an "exported spreadsheet to:" text with a hypertext link to the .xls file (which is really just a tab-delimited text file). See image below. You can just click on the hypertext link to download the .xls file and open it in Excel.
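Because the exported .xls file is really tab-delimited text, it can also be read directly with a script instead of Excel. The sketch below uses Python's standard csv module on an in-memory stand-in for an export; the column names, rows, and the file name in the closing comment are hypothetical.

```python
import csv
import io


def read_export(handle):
    """Read a tab-delimited export and return a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for an exported file; the column names here are hypothetical.
sample = io.StringIO(
    "probability\tspectrum\tpeptide\n"
    "0.9987\tsample_run01.02345.02345.2\tSAMPLER\n"
    "0.1200\tsample_run01.02400.02400.3\tPEPTIDEK\n"
)
rows = read_export(sample)
print(len(rows), rows[0]["peptide"])

# For a real download, open the file in text mode instead, e.g.:
# with open("interact.pep.xls", newline="") as handle:
#     rows = read_export(handle)
```

All values come back as strings, so convert numeric columns with float() before sorting or filtering on them.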

It's a little more convoluted to get the exported file from the prot.xml view. Once you hit the "Export to XLS" button in the protein view, you end up with something like this:

So it says it wrote the exported file to disk (text in red above) but it doesn't give you a way to download it. What you sadly have to do is 'browse' to your directory using the web browser and download the exported file by finding its file name in the long list of files and doing a "save-as" via your browser. To do this, look at the URL in your browser. For UWPR users, go to your data page, the one with all of the pep.xml and prot.xml links. The URL might look like:

https://proteomicsresource.washington.edu/net/pr/vol1/ProteomicsResource/search/test/data.html

Simply delete the text after the very last "/" character. In the example above, delete the text "data.html" and hit enter. Your URL should look something like below:

https://proteomicsresource.washington.edu/net/pr/vol1/ProteomicsResource/search/test/

Your browser should now display a list of files. Browse to the appropriate .xls file and save it locally by right-mouse-clicking on it and doing a save-as. You can grab the peptide export this way too.
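If you do this often, the URL edit described above is easy to script. This is a small sketch using plain string handling on the example data page URL from this page; the function name is made up for illustration.

```python
def directory_url(page_url):
    """Drop everything after the last "/" so the URL points at the directory."""
    return page_url.rsplit("/", 1)[0] + "/"


data_page = (
    "https://proteomicsresource.washington.edu/net/pr/vol1/"
    "ProteomicsResource/search/test/data.html"
)
print(directory_url(data_page))
```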