Sequence database setup: NCBI nr
Overview
The nr database is compiled by the NCBI (National Center for Biotechnology Information) as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF.
The strengths of nr are that it is comprehensive and frequently updated. The downside is its size: as of September 2016, the 54 GB Fasta file contained 94 million entries. A 64-bit version of Mascot on a 64-bit PC is essential. In most cases, there are better choices of database, such as a subset of GenBank for the organism of interest or a UniProt complete proteome.
Download
The current release is available from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
NCBIprot manual configuration
The rest of this page is only relevant for Mascot 2.3 or if you choose to edit mascot.dat rather than use Database Manager. (It isn’t possible to configure taxonomy for nr in Mascot 2.2 and earlier.)
Taxonomy
The following taxonomy files are required:
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz (the downloaded file must be unpacked using tar as well as decompressed.)
prot.av2taxid.gz can be downloaded from http://s3.amazonaws.com/matrixsciencemisc
Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory.
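The unpacking step for taxdump.tar.gz can also be scripted. The following is a minimal Python sketch, not part of Mascot: the function name unpack_taxdump and the choice to extract only the three .dmp files named in the taxonomy definition are assumptions made for illustration, and the destination path is whatever your installation uses as its taxonomy directory.

```python
import shutil
import tarfile
from pathlib import Path

# Files referenced by the SpeciesFiles and NodesFiles lines of the taxonomy block.
WANTED = ("names.dmp", "nodes.dmp", "merged.dmp")


def unpack_taxdump(archive, taxonomy_dir, wanted=WANTED):
    """Decompress and unpack selected files from taxdump.tar.gz.

    Returns the names of the files that were written to taxonomy_dir.
    """
    taxonomy_dir = Path(taxonomy_dir)
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:gz" handles both the gzip layer and the tar layer in one pass.
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            base_name = Path(member.name).name
            if not member.isfile() or base_name not in wanted:
                continue
            source = tar.extractfile(member)
            with source, open(taxonomy_dir / base_name, "wb") as target:
                shutil.copyfileobj(source, target)
            written.append(base_name)
    return written
```

Copying each member explicitly, rather than calling extractall, keeps the output flat in the taxonomy directory regardless of any path prefix stored in the archive.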
Make a backup copy of mascot.dat, then use a text editor to add the following taxonomy definition, changing the taxonomy block number so that it is consecutive with the existing blocks. The file must be saved as plain text; make sure the editor does not change the filename to something like mascot.dat.txt.
#
Taxonomy_17
Identifier NCBIprot (nr post gi numbers)
Enabled 1
FromRefFile 0
ErrorLevel 0
DescriptionLineSep 1
AccFromSpeciesLine "^>*\([^> ,]*\)"
SpeciesFiles ACC2TAXID:prot.av2taxid, NCBI:names.dmp
NodesFiles NCBI:nodes.dmp, NCBI:merged.dmp
DefaultRule ACC2TAXID, CHOP: "^>*\([^> ,]*\)"
end
#
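Because the block number must follow on from the existing taxonomy blocks, it can help to check which number comes next before editing. The sketch below is a hypothetical helper, not a Mascot tool; it assumes that each taxonomy block in mascot.dat starts with a line of the form Taxonomy_N on its own, as in the definition above.

```python
import re

# A taxonomy block header: "Taxonomy_" followed by the block number, alone on a line.
TAXONOMY_HEADER = re.compile(r"^\s*Taxonomy_(\d+)\s*$")


def next_taxonomy_number(mascot_dat_text):
    """Return the block number that follows the highest existing Taxonomy_N."""
    numbers = []
    for line in mascot_dat_text.splitlines():
        match = TAXONOMY_HEADER.match(line)
        if match:
            numbers.append(int(match.group(1)))
    # With no existing blocks, numbering starts at 1.
    return max(numbers, default=0) + 1
```

Lines beginning with # are not matched, so a commented-out header does not affect the result.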
Parse Rules
A typical Fasta title line is:
>WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]
Suitable parse rules are:
Accession from Fasta title: ">\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
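To see what these two rules capture, they can be tried out in Python. This is only an approximation: the Mascot rules write capture groups as \( and \), whereas Python's re module uses plain parentheses, and the function name parse_title is invented for this sketch.

```python
import re

# Python equivalents of the two parse rules; \( \) become ( ).
ACCESSION_RULE = re.compile(r">([^ ]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")


def parse_title(title):
    """Return (accession, description) for a Fasta title line.

    Either element is None if the corresponding rule does not match.
    """
    accession = ACCESSION_RULE.match(title)
    description = DESCRIPTION_RULE.match(title)
    return (
        accession.group(1) if accession else None,
        description.group(1) if description else None,
    )


title = ">WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]"
accession, description = parse_title(title)
print(accession)    # WP_011638038.1
print(description)  # LysR family transcriptional regulator [Shewanella frigidimarina]
```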
If an entry in nr represents multiple source database entries, the Fasta title lines are concatenated together with CTRL+A as the delimiter.
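For scripts that inspect nr outside Mascot, a concatenated title can be split on the CTRL+A character (ASCII 0x01). The sketch below is illustrative only: the function name is hypothetical, and the second accession in the usage example is a made-up placeholder rather than a real nr entry.

```python
CTRL_A = "\x01"  # the delimiter between concatenated title lines


def split_concatenated_title(title_line):
    """Split a Fasta title into (accession, description) pairs, one per source entry."""
    records = []
    for part in title_line.split(CTRL_A):
        # Drop a leading ">" if present; only the first part normally carries one.
        part = part.lstrip(">")
        accession, _, description = part.partition(" ")
        records.append((accession, description))
    return records


# Usage with a placeholder second entry (XX_000000.1 is not a real accession).
line = (
    ">WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]"
    + CTRL_A
    + "XX_000000.1 placeholder description [Example organism]"
)
for accession, description in split_concatenated_title(line):
    print(accession, "|", description)
```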
Miscellaneous
It is essential for NCBIprot (or whatever name you use for the database) to be listed on the IgnoreDupeAccessions line in the Options section of mascot.dat.
Configuration example for Mascot 2.3
For this example, nr.gz was downloaded to a folder named C:\Inetpub\MASCOT\sequence\NCBIprot\current. The file was decompressed using gzip, and renamed to NCBIprot_20160901.fasta.
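Where the gzip command-line tool is not available, the decompress and rename step can be done in one pass with Python's standard library. This is a sketch under that assumption; the function name decompress_gzip and the 1 MB chunk size are arbitrary choices, and the file names in the usage note are the ones from this example.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(source, target, chunk_size=1024 * 1024):
    """Stream-decompress a .gz file to target without loading it into memory."""
    target = Path(target)
    with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
        # Copy in fixed-size chunks; nr is far too large to read in one go.
        shutil.copyfileobj(compressed, plain, chunk_size)
    return target
```

For the example on this page the call would be decompress_gzip("nr.gz", "NCBIprot_20160901.fasta"), run from the current folder of the database.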
Taxonomy files were downloaded to the taxonomy directory, as described above.
There is no downloadable full text file for nr, but full text for individual entries can be retrieved across the web from the NCBI Entrez server. The syntax for the Path field is:
/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#
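The #ACCESSION# token is a placeholder for the accession of the entry being viewed. As a rough illustration of the substitution, the sketch below fills in the template for one accession. The function name is hypothetical, and percent-encoding the accession is an assumption of this sketch, not a statement about how Mascot builds the request.

```python
from urllib.parse import quote

# The Path field from this page, with its placeholder token.
PATH_TEMPLATE = (
    "/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein"
    "&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#"
)


def full_text_path(accession, template=PATH_TEMPLATE):
    """Return the request path with #ACCESSION# replaced by the given accession."""
    # quote() leaves letters, digits, "_", "." and "-" unchanged.
    return template.replace("#ACCESSION#", quote(accession, safe=""))


print(full_text_path("WP_011638038.1"))
```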
If you don’t require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose "— no full text report —" in the drop-down list.
Always test a new definition before applying the changes to mascot.dat.