Sequence database setup: NCBI nr with gi numbers (OBSOLETE)

IMPORTANT – NCBI have dropped gi numbers
In late August 2016, NCBI removed gi numbers from the title lines of the nr Fasta file. This breaks the existing NCBInr definition, described below, so we have created a new definition for accession.version identifiers, called NCBIprot.

Overview

The configuration for NCBI nr described on this page uses gi numbers as identifiers. You should only use it if you are part way through a major project or have a workflow that absolutely requires the continued use of gi numbers as identifiers. You will need to freeze nr at or before the 21 August 2016 release. That is, you must disable any type of automatic updating.

NCBInr manual configuration

This is historical reference material. If you do not have NCBInr configured, you should enable or configure NCBIprot.

Taxonomy

The following taxonomy files are required:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory.

Parse Rules

A typical Fasta title line is:

>gi|21305377|gb|AAM45611.1|AF384285_1 (AF384285) envelope protein [Human immunodeficiency virus type 1]

The gi number is the most reliable identifier. Suitable parse rules are:

Accession from Fasta title: ">\(gi|[0-9]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
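
As an illustration only, the effect of these two rules on the title line above can be reproduced with Python's re module. This is a sketch, not how Mascot itself evaluates parse rules: Python's regular expression dialect differs, so the patterns below are assumed translations (the pipe is escaped so that it matches literally), and the names TITLE, ACCESSION_RE and DESCRIPTION_RE are invented for the example.

```python
import re

# Example title line from the text above.
TITLE = (">gi|21305377|gb|AAM45611.1|AF384285_1 (AF384285) "
         "envelope protein [Human immunodeficiency virus type 1]")

# Python translations of the two parse rules (assumed equivalents).
# The pipe is escaped so that it matches a literal "|" character.
ACCESSION_RE = re.compile(r">(gi\|[0-9]*)")
DESCRIPTION_RE = re.compile(r">[^ ]* (.*)")

# Each rule captures one group: the text Mascot would use as the
# accession or as the description.
accession = ACCESSION_RE.match(TITLE).group(1)
description = DESCRIPTION_RE.match(TITLE).group(1)

print(accession)    # gi|21305377
print(description)  # (AF384285) envelope protein [Human immunodeficiency virus type 1]
```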

If an entry in nr represents multiple source database entries, the Fasta title lines are concatenated together with CTRL+A as the delimiter.
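
A minimal sketch of what the CTRL+A convention means in practice: CTRL+A is the character 0x01, so a concatenated title line can be split on "\x01". The second title in the sample is an invented placeholder, not a real nr record, and the function name split_titles is hypothetical.

```python
def split_titles(title_line: str) -> list[str]:
    """Split a concatenated nr title line into its individual titles."""
    # Drop the leading ">" and split on CTRL+A (0x01).
    return title_line.lstrip(">").split("\x01")

# The second entry here is an invented placeholder for illustration.
sample = (">gi|21305377|gb|AAM45611.1|AF384285_1 (AF384285) envelope protein "
          "[Human immunodeficiency virus type 1]"
          "\x01gi|0000000|gb|EXAMPLE.1| placeholder protein [Example organism]")

titles = split_titles(sample)
for title in titles:
    print(title)
```

Because only the first title follows the ">" character, an accession rule anchored on ">" would be expected to capture the first gi number only.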

Miscellaneous

Ensure that NCBInr (or whatever name you use for the database) is listed on the IgnoreDupeAccessions line in the Options section of mascot.dat.

Configuration example for Mascot 2.3 and earlier

For this example, nr.gz was downloaded to a folder named C:\Inetpub\MASCOT\sequence\NCBInr\current. The file was decompressed using gzip, and renamed to NCBInr_20160821.fasta.
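
If gzip is not available on the command line, the decompress-and-rename step could be scripted instead. The following is a sketch using only the Python standard library; the function name decompress_nr is hypothetical, and the caller supplies both paths.

```python
import gzip
import shutil
from pathlib import Path

def decompress_nr(gz_path: Path, fasta_path: Path) -> Path:
    """Decompress gz_path and write the result to fasta_path."""
    # Stream the data in chunks so the whole file is never held in
    # memory; nr is a very large file.
    with gzip.open(gz_path, "rb") as src, open(fasta_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return fasta_path
```

For the example above, this would be called with nr.gz as the first argument and NCBInr_20160821.fasta as the second, both in the C:\Inetpub\MASCOT\sequence\NCBInr\current folder.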

Taxonomy files were downloaded to the taxonomy directory, as described above.

Mascot database maintenance utility

There is no downloadable full text file for nr, but full text for individual entries can be retrieved across the web from the NCBI Entrez server. The syntax for the Path field is:

/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#
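
To illustrate how the #ACCESSION# placeholder is meant to work, the sketch below substitutes an accession into the Path value. How Mascot performs the substitution internally, including any escaping, is not described here, so plain string replacement is an assumption, as is the function name build_path.

```python
PATH_TEMPLATE = (
    "/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein"
    "&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#"
)

def build_path(accession: str, template: str = PATH_TEMPLATE) -> str:
    """Return the Path with the placeholder replaced by the accession."""
    # Plain substitution with no URL escaping (an assumption).
    return template.replace("#ACCESSION#", accession)

print(build_path("gi|21305377"))
```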

If you don’t require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose the "no full text report" entry in the drop-down list.

Always test a new definition before applying the changes to mascot.dat.