Sequence database setup: NCBI nr

IMPORTANT – NCBI have dropped gi numbers
In late August 2016, NCBI removed gi numbers from the title lines of the nr Fasta file. This breaks the existing definition, which was called NCBInr, so we have created a new definition for accession.version identifiers, called NCBIprot.
If you are part way through a major project or have a workflow that absolutely requires the continued use of gi numbers as identifiers, you will need to freeze nr at or before the 21 August 2016 release. That is, you must disable any type of automatic updating. If you need to refer to the old configuration information, see the archived help page.
In all other cases, you should use the new NCBIprot configuration. In Mascot 2.4 and later, NCBIprot is a predefined database, meaning up-to-date configuration information is downloaded automatically by Mascot Database Manager. To enable nr with the new configuration, all you need to do is:
  1. In Configuration Editor, go to Configuration Options, add NCBIprot to the list of databases next to IgnoreDupeAccessions, and choose Apply.
  2. In Configuration Editor, go to Database Manager, choose Enable predefined definition, then select NCBIprot.
If you already had NCBInr enabled, either select it from the top level Databases page and choose Delete or, if you wish to keep it so that you can load Protein View reports for old search results, ensure that automatic updates are disabled.

Overview

The nr database is compiled by the NCBI (National Center for Biotechnology Information) as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF.

The strengths of nr are that it is comprehensive and frequently updated. The downside is its size. As of September 2016, the 54 GB Fasta file contained 94 million entries. A 64-bit version of Mascot on a 64-bit PC is essential. In most cases, there are better choices of database, such as a subset of GenBank for the organism of interest or a UniProt complete proteome.

Download

The current release can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

NCBIprot manual configuration

The rest of this page is only relevant for Mascot 2.3 or if you choose to edit mascot.dat rather than use Database Manager. (It isn’t possible to configure taxonomy for nr in Mascot 2.2 and earlier.)

Taxonomy

The following taxonomy files are required:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz (the downloaded file must be unpacked using tar as well as decompressed.)
prot.av2taxid.gz can be downloaded from http://s3.amazonaws.com/matrixsciencemisc

Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory.
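As a rough illustration of the unpacking step, the following Python sketch extracts names.dmp, nodes.dmp, and merged.dmp (the files named in the taxonomy definition) from taxdump.tar.gz into a taxonomy directory. The function name unpack_taxdump and the directory passed to it are assumptions made for this example; running tar by hand achieves the same thing.

```python
import shutil
import tarfile
from pathlib import Path

# The three taxdump files referenced by the SpeciesFiles and NodesFiles
# lines of the taxonomy definition.
WANTED = ("names.dmp", "nodes.dmp", "merged.dmp")


def unpack_taxdump(archive_path, taxonomy_dir):
    """Decompress and unpack the wanted files from taxdump.tar.gz.

    Hypothetical helper: only the files listed in WANTED are written,
    directly into taxonomy_dir, and their paths are returned.
    """
    taxonomy_dir = Path(taxonomy_dir)
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:gz" handles the gzip decompression and the tar unpacking together.
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            name = Path(member.name).name
            if member.isfile() and name in WANTED:
                target = taxonomy_dir / name
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                written.append(target)
    return written


# Example call (paths are placeholders, adjust to your installation):
# unpack_taxdump("taxdump.tar.gz", "path/to/mascot/taxonomy")
```

Writing each file explicitly, rather than extracting the whole archive, keeps unrelated archive members out of the taxonomy directory.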

Make a backup copy of mascot.dat, then use a text editor to add the following taxonomy definition, changing the taxonomy block number so that it is consecutive with the existing blocks. The file must be saved as plain text, and the filename must remain mascot.dat (not, for example, mascot.dat.txt).

#
Taxonomy_17
Identifier NCBIprot (nr post gi numbers)
Enabled 1
FromRefFile 0
ErrorLevel 0
DescriptionLineSep 1
AccFromSpeciesLine "^>*\([^> ,]*\)"
SpeciesFiles ACC2TAXID:prot.av2taxid, NCBI:names.dmp
NodesFiles NCBI:nodes.dmp, NCBI:merged.dmp
DefaultRule ACC2TAXID, CHOP: "^>*\([^> ,]*\)"
end
#

Parse Rules

A typical Fasta title line is:

>WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]

Suitable parse rules are:

Accession from Fasta title: ">\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
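
To see what these rules capture, they can be tried outside Mascot. The sketch below is a Python approximation, not Mascot's own parser: the rules use \( \) to mark the captured group, which corresponds to plain parentheses in Python's re module. The function name parse_title is an assumption made for this example.

```python
import re

# Python equivalents of the two parse rules above.
# Mascot's \( ... \) group becomes ( ... ) in Python syntax.
ACCESSION_RULE = re.compile(r">([^ ]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")


def parse_title(title):
    """Return (accession, description) for one Fasta title line."""
    acc = ACCESSION_RULE.match(title)
    desc = DESCRIPTION_RULE.match(title)
    return (
        acc.group(1) if acc else None,
        desc.group(1) if desc else None,
    )


title = ">WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]"
accession, description = parse_title(title)
print(accession)    # WP_011638038.1
print(description)  # LysR family transcriptional regulator [Shewanella frigidimarina]
```

The accession is everything between the > and the first space, and the description is everything after that space.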

If an entry in nr represents multiple source database entries, the Fasta title lines are concatenated together with CTRL+A as the delimiter.
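
A concatenated title line can be separated back into its individual titles by splitting on the CTRL+A character (ASCII code 1). The following is a minimal Python sketch; the function name split_titles is hypothetical, and the second title in the example is invented purely for illustration.

```python
CTRL_A = "\x01"  # the CTRL+A delimiter, ASCII code 1


def split_titles(title_line):
    """Split a concatenated nr title line into its individual titles.

    Any leading '>' is stripped from each part, so the result is the same
    whether or not the later titles carry their own '>' marker.
    """
    parts = title_line.rstrip("\r\n").split(CTRL_A)
    return [part.lstrip(">") for part in parts if part]


# The second title here is made up for the example.
line = (
    ">WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]"
    + CTRL_A
    + "XX_000000.1 invented protein [Invented organism]"
)
for title in split_titles(line):
    print(title)
```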

Miscellaneous

It is essential for NCBIprot (or whatever name you use for the database) to be listed on the IgnoreDupeAccessions line in the Options section of mascot.dat.

Configuration example for Mascot 2.3

For this example, nr.gz was downloaded to a folder named C:\Inetpub\MASCOT\sequence\NCBIprot\current. The file was decompressed using gzip, and renamed to NCBIprot_20160901.fasta.
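
The decompress-and-rename step can also be scripted. The Python sketch below is one possible way to do it, assuming the date in the filename is the release date you want to record; the function name decompress_nr is hypothetical. The file is streamed in chunks because the decompressed Fasta file is far too large to hold in memory.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def decompress_nr(gz_path, out_dir, release_date):
    """Decompress nr.gz into out_dir as NCBIprot_YYYYMMDD.fasta.

    Hypothetical helper: release_date is a datetime.date used only to
    build the output filename.
    """
    out_path = Path(out_dir) / f"NCBIprot_{release_date:%Y%m%d}.fasta"
    # Copy in 1 MB chunks so memory use stays small regardless of file size.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    return out_path


# Example call (the path is a placeholder for your own directory):
# decompress_nr("current/nr.gz", "current", date(2016, 9, 1))
```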

Taxonomy files were downloaded to the taxonomy directory, as described above.

Mascot database maintenance utility

There is no downloadable full text file for nr, but full text for individual entries can be retrieved across the web from the NCBI Entrez server. The syntax for the Path field is:

/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#
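
The #ACCESSION# token in the Path is a placeholder for the accession of the entry being retrieved. The Python sketch below only illustrates that substitution, which is useful for checking by hand what request a given accession produces; it is not Mascot's own code, and the function name expand_path is an assumption made for this example.

```python
from urllib.parse import quote

PATH_TEMPLATE = (
    "/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein"
    "&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#"
)


def expand_path(template, accession):
    """Replace the #ACCESSION# placeholder with a URL-safe accession."""
    return template.replace("#ACCESSION#", quote(accession, safe=""))


print(expand_path(PATH_TEMPLATE, "WP_011638038.1"))
```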

If you don’t require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose
— no full text report —
in the drop down list.

Always test a new definition before applying the changes to mascot.dat.