BLAST 2.0 RELEASE NOTES

(revised December 18, 1998)

Introduction

BLAST is a service of the National Center for Biotechnology Information (NCBI). A nucleotide or protein sequence sent to the BLAST server is compared against databases at the NCBI and a summary of matches is returned to the user.

The WWW BLAST server can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

The BLAST 2.0 release has significant differences from the BLAST 1.4 release.

The BLAST 2.0 programs are described in Altschul et al. (1997), Nucleic Acids Res. 25:3389-3402 (full citation under References). Please cite this article if you publish the results of your BLAST query.

Blast Family of Programs

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

     blastp    compares an amino acid query  sequence  against  a
               protein sequence database.

     blastn    compares a nucleotide  query  sequence  against  a
               nucleotide sequence database.

     blastx    compares  the  six-frame  conceptual   translation
               products  of  a  nucleotide  query  sequence (both
               strands) against a protein sequence database.

     tblastn   compares  a  protein  query  sequence  against   a
               nucleotide     sequence    database    dynamically
               translated  in  all  six  reading   frames   (both
               strands).

     tblastx   compares the six-frame translations of a nucleotide
               query  sequence against the six-frame translations
               of a nucleotide sequence database.

The default matrix for all protein-protein comparisons is BLOSUM62.
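
The program table above can be read as a lookup from the query type and the database type. The Python sketch below is for illustration only: choose_program, its argument names, and the "protein"/"nucleotide" labels are hypothetical and not part of BLAST.

```python
# Illustrative lookup of the BLAST program for a query/database combination.
# choose_program and its argument names are hypothetical, not part of BLAST.

PROGRAMS = {
    # (query type, database type, compared through translation?)
    ("protein", "protein", False): "blastp",
    ("nucleotide", "nucleotide", False): "blastn",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_program(query_type, db_type, translated=False):
    """Return the program name for the combination, or raise ValueError."""
    # A protein/nucleotide mismatch can only be compared through translation.
    if query_type != db_type:
        translated = True
    try:
        return PROGRAMS[(query_type, db_type, translated)]
    except KeyError:
        raise ValueError(
            "no BLAST program for %s query vs. %s database" % (query_type, db_type)
        )


print(choose_program("nucleotide", "protein"))
```

For two nucleotide inputs the caller has to decide between blastn and tblastx, which is why the sketch takes an explicit translated flag.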

Gaps in Blast

Version 2.0 of BLAST allows the introduction of gaps (deletions and insertions) into alignments. With a gapped alignment tool, homologous domains do not have to be broken into several segments. Also, the scoring of gapped results tends to be more biologically meaningful than that of ungapped results.

The programs blastn and blastp offer fully gapped alignments. blastx and tblastn have 'in-frame' gapped alignments and use sum statistics to link alignments from different frames. tblastx provides only ungapped alignments.

Blast Query Format

The sequence sent to the BLAST server should be in FASTA format, described in http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.

A number of databases are also available. They are described in http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.

Blast Report

The BLAST report consists of a number of sections. The descriptions below are for a blastp comparison, but the format for the other programs is analogous.

The BLAST report is not intended to be a parseable document. It is subject to change with little or no notice.

The BLAST report starts with some header information that lists the type of program (here blastp), the version (here 2.0.1), and a release date. Also listed are a reference to the BLAST program, the query definition line, and a summary of the database used.

BLASTP 2.0.1 [Aug-20-1997]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs",
Nucleic Acids Res. 25:3389-3402.

Query= gi|129295|sp|P01013|OVAX_CHICK gene X protein - chicken (fragment)
         (232 letters)

Database: Non-redundant SwissProt sequences
           59,576 sequences; 21,219,450 total letters


One-line descriptions of the database matches found are presented next.  These 
include a database sequence identifier, the corresponding definition line, as 
well as the score (in bits) and the statistical significance ('E value') for this 
match (please see the section on statistics for an explanation of bits and 
significance).  Consider the output below, from a gapped blastp comparison of 
SwissProt accession P01013 against the SwissProt database.

                                                                    High    E
Sequences producing significant alignments:                        Score  Value

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)               442  e-124
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED)               353  9e-98
sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)      278  5e-75
sp|P19104|OVAL_COTJA OVALBUMIN                                        268  5e-72
sp|P48595|BOMA_HUMAN BOMAPIN (PROTEASE INHIBITOR 10)                  199  2e-51
sp|P29508|SCC1_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 1 (SCCA-1) ...   198  5e-51
sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (LEUCOCYTE...   197  1e-50
sp|P48594|SCC2_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 2 (SCCA-2) ...   196  2e-50
sp|P50453|PTI9_HUMAN CYTOPLASMIC ANTIPROTEINASE 3 (CAP3) (PROTEA...   195  6e-50
sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI)               193  2e-49

The first match, in this case, is the query sequence itself. The identifiers shown here are all from SwissProt, so they all have 'sp' in the first field, followed by the accession and then a locus name. The syntax of these identifiers is discussed in more detail in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README. The definition lines are taken from the definition lines in the database; an ellipsis (e.g., for P29508) indicates that the definition line was too long for the space available.
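
As a small illustration of the three-field SwissProt identifier described above, the following Python sketch splits such an identifier into its parts. parse_sp_identifier is a hypothetical helper; it handles only the 'sp|accession|locus' form shown in this report, not the full identifier syntax documented in the README appendices.

```python
# Split a SwissProt-style identifier into database tag, accession and locus.
# parse_sp_identifier is a hypothetical helper that covers only the
# three-field 'sp|accession|locus' form shown in the example report.

def parse_sp_identifier(identifier):
    fields = identifier.split("|")
    if len(fields) != 3 or fields[0] != "sp":
        raise ValueError("not an sp|accession|locus identifier: %r" % identifier)
    return {"database": fields[0], "accession": fields[1], "locus": fields[2]}


print(parse_sp_identifier("sp|P01013|OVAX_CHICK"))
```

Longer identifiers, such as the 'gi|...|sp|...' form on the Query line of the report, are rejected by this sketch rather than guessed at.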

Ungapped alignments and results from blastx and tblastn will have an additional column ('N'), displaying the number of different segment pairs used to produce the alignment, according to the Karlin-Altschul statistics.

Each alignment is preceded by the sequence identifier, the full definition line, and the length of the database sequence. Next come the score (in bits as well as the raw score) and the statistical significance of the match, followed by the number of identities and positive matches according to the scoring system (e.g., BLOSUM62) and, if applicable, the number of gaps in the alignment. Finally the actual alignment is shown, with the query on top and the database match labeled as 'Sbjct'. Between the two sequences, the residue is shown if it is conserved and a '+' is shown if there is a positive match. One or more dashes, '-', indicate insertions or deletions. The example below is the third sequence listed in the one-line descriptions above.

>sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)
          Length = 386

Score =  278 bits (744), Expect = 5e-75
Identities = 149/231 (64%), Positives = 182/231 (78%), Gaps = 2/231 (0%)

Query 2   IKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNS 61
          I+++L  SS D  T +VLVNAI FKG+W+ AF  EDT+ MPF VT+QESKPVQMM
Sbjct 158 IRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGL 217

Query 62  FNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKR 121
          F VA++ +EKMKILELPFASG +SMLVLLPDEVS LE++E  INFEKLTEWT+ N ME+R
Sbjct 218 FRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEER 277

Query 122 RVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSE 181
          ++KVYLP+MK+EEKYNLTSVLMA+G+TD+F  SANL+GISSAESLKISQAVH A  E++E
Sbjct 278 KIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINE 337

Query 182 DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
           G E+ GS      +  +  SE+FRADHPFLF IKH  TN +++FGR  SP
Sbjct 338 AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP 386

The last section lists specifics about the database searched as well as statistical and search parameters used:

  Database: Non-redundant SwissProt sequences
    Posted date:  Aug 14, 1997  9:52 AM
  Number of letters in database: 21,219,450
  Number of sequences in database:  59,576

Lambda     K      H
   0.317    0.132    0.377

Gapped
Lambda     K      H
   0.255   0.0350    0.190


Matrix: BLOSUM62
Gap Penalties: Existence: 10, Extension: 1
Number of Hits to DB: 8938654
Number of Sequences: 59576
Number of extensions: 335248
Number of successful extensions: 1188
Number of sequences better than 10: 116
Number of HSP's better than 10.0 without gapping: 106
Number of HSP's successfully gapped in prelim test: 10
Number of HSP's that attempted gapping in prelim test: 868
Number of HSP's gapped (non-prelim): 120
length of query: 232
length of database: 21219450
effective HSP length: 52
effective length of query: 180
effective length of database: 18121498
effective search space: -1033097656
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 40 (14.7 bits)
X3: 67 (24.6 bits)
S1: 41 (21.7 bits)
S2: 64 (28.4 bits)

Blast Statistics and Scores

One may judge the results of a BLAST search by two numbers, the 'bit' score and the Expect value. The 'bit' score is defined as:

        S' (bits) =  [lambda * S (raw)  -  ln K] / ln 2

where lambda and K are Karlin-Altschul parameters. Expressing the score in bits makes it independent of the scoring system used (i.e., of the matrix). The Expect value estimates the statistical significance of the match: it is the number of matches with a given score that are expected purely by chance in a search of a database of this size. An Expect value of two, for example, indicates that two matches with this score are expected purely by chance. The Expect value changes with the size of the database (in a larger database more chance matches with a given score are expected). It is the most intuitive way to rank results or to compare the results of one query run against two different databases.
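
The bit-score formula can be checked against the sample report above. The Python sketch below plugs in the gapped lambda and K, the raw score of the OVAL_CHICK alignment (744), and the effective lengths from the report footer. The function names are hypothetical. The relation E = (effective search space) * 2^(-S') is the standard Karlin-Altschul form and is an assumption here, since this document does not spell it out.

```python
import math

# Hypothetical helpers; the numbers come from the sample blastp report above.

def bit_score(raw_score, lam, k):
    """S' (bits) = (lambda * S - ln K) / ln 2, as defined in this section."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, search_space):
    """Assumed standard relation: E = search_space * 2 ** (-S')."""
    return search_space * 2.0 ** (-bits)


LAMBDA_GAPPED = 0.255            # "Gapped Lambda" in the report footer
K_GAPPED = 0.0350                # "Gapped K" in the report footer
RAW_SCORE = 744                  # raw score of the OVAL_CHICK alignment
EFFECTIVE_QUERY_LENGTH = 180
EFFECTIVE_DB_LENGTH = 18121498

bits = bit_score(RAW_SCORE, LAMBDA_GAPPED, K_GAPPED)
space = EFFECTIVE_QUERY_LENGTH * EFFECTIVE_DB_LENGTH

print("bit score:    %.1f" % bits)
print("search space: %d" % space)
print("E value:      %.1e" % expect_value(bits, space))
```

With these rounded inputs the sketch gives a bit score of roughly 278.5 and an E value between 4e-75 and 5e-75, in line with the 278 bits and 5e-75 printed in the report. The product of the two effective lengths (3,261,869,640) does not fit in a signed 32-bit integer; wrapped to 32 bits it equals the negative "effective search space" shown in the sample footer, so that line is most likely an overflow artifact rather than a meaningful value.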

Formatdb

Formatdb should be used to format the FASTA databases, both protein and DNA, for BLAST 2.0. This must be done before blastall or blastpgp can be run locally. The format of the databases has changed substantially from the BLAST 1.4 release. A major improvement over the old format is that ambiguity information for DNA sequences is now retrieved from the files produced by formatdb, rather than from the original FASTA file, so the original FASTA file is no longer needed for the BLAST runs.

Formatdb may be obtained with the other BLAST binaries from the executables directory (see above). The input for formatdb may be either ASN.1 or FASTA. Use of ASN.1 is advantageous for those sites that might also wish to format the ASN.1 in different ways, such as a GenBank report. The formatdb usage may be obtained by executing 'formatdb -' (note the dash):

formatdb   arguments:

  -t  Title for database file [String]  Optional
  -i  Input file for formatting (this parameter must be set) [File In]
  -l  Logfile name: [File Out]  Optional
    default = formatdb.log
  -p  Type of file
         T - protein   
         F - nucleotide [T/F]  Optional
    default = T
  -o  Parse options
         T - True: Parse SeqId and create indexes.
         F - False: Do not parse SeqId. Do not create indexes.
 [T/F]  Optional
    default = F
  -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
         T - True, 
         F - False.
 [T/F]  Optional
    default = F
  -b  ASN.1 database in binary mode
         T - binary, 
         F - text mode.
 [T/F]  Optional
    default = F
  -e  Input is a Seq-entry [T/F]  Optional
    default = F

The "-p" option has two different meanings depending on whether the input database is in FASTA or ASN.1 format. In the case of FASTA, "-p" specifies the type of the input database. In the case of ASN.1, the option specifies the type of sequence to be indexed for BLAST.

If the "-o" option is TRUE (and the input database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention described in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README.

It is always advantageous to use the '-o' option if the database identifiers are in the format specified above. If the database identifiers are parseable, formatdb produces additional indices that allow retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). Parseable identifiers are required in the following cases:

1.) If ASN.1 is to be produced from blastall or blastpgp, then "-o" must be TRUE.

2.) Master-slave alignments are desired (i.e., the '-m' option with a non-zero value is used).

3.) The gi's are desired as part of the output (i.e., '-I' is used).

4.) fastacmd is used to fetch sequences from the database by accession or gi.

An input ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored for a FASTA input database.

An input ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
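
As an illustration of how the options above combine, the following Python sketch assembles (but does not run) a formatdb command line. formatdb_command is a hypothetical helper and mydb.fasta is a placeholder file name; only the -i, -p, -o and -t flags from the usage listing above are used.

```python
# Build (without executing) a formatdb argument list from the documented flags.
# formatdb_command is a hypothetical helper; mydb.fasta is a placeholder name.

def formatdb_command(input_path, protein=True, parse_seqids=False, title=None):
    args = [
        "formatdb",
        "-i", input_path,                    # input file (must be set)
        "-p", "T" if protein else "F",       # T = protein, F = nucleotide
        "-o", "T" if parse_seqids else "F",  # T = parse SeqId and create indexes
    ]
    if title is not None:
        args += ["-t", title]                # optional database title
    return args


print(" ".join(formatdb_command("mydb.fasta", protein=False, parse_seqids=True)))
```

On a machine where formatdb is installed, the returned list could be handed to subprocess.run; the sketch stops short of that so it can be read and tested without the BLAST binaries.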

Blastall

Blastall may be used to perform all five flavors of BLAST comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical blastall command to perform a blastn search (nucleotide vs. nucleotide) with a query file called QUERY would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed
against the 'nr' database.  If a protein vs. protein search is desired,
then 'blastn' should be replaced with 'blastp' etc.


Some of the most commonly used blastall options are:

blastall   arguments:

  -p  Program Name [String]

        Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

  -d  Database [String]
    default = nr

        Version 2.0.4 and higher will accept multiple database names (bracketed by quotations).
        An example would be 

                -d "nr est"

        which will search both the nr and est databases, presenting the results as if one
        'virtual' database consisting of all the entries from both were searched.   The
        statistics are based on the 'virtual' database.


  -i  Query File [File In]
    default = stdin

        The query should be in FASTA format.  If multiple FASTA entries are in the input
        file, all queries will be searched.

  -e  Expectation value (E) [Real]
    default = 10.0

  -o  BLAST report Output File [File Out]  Optional
    default = stdout

  -F  Filter query sequence (DUST with blastn, SEG with others) [T/F]
    default = T

        See the "Low-complexity Filters" section below for details.
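
The commonly used options above can be assembled programmatically. The Python sketch below builds (but does not run) a blastall argument list; blastall_command and its parameter names are hypothetical, and only the -p, -d, -i, -e, -F and -o flags described in this section are used.

```python
# Build (without executing) a blastall argument list from the documented flags.
# blastall_command and its parameter names are hypothetical.

VALID_PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_command(program, query_file, databases=("nr",), output_file=None,
                     expect=10.0, filter_query=True):
    if program not in VALID_PROGRAMS:
        raise ValueError("unknown program: %r" % program)
    args = [
        "blastall",
        "-p", program,
        # Several database names go into ONE argument, e.g. "nr est"
        # (accepted by version 2.0.4 and higher according to the notes above).
        "-d", " ".join(databases),
        "-i", query_file,
        "-e", str(expect),
        "-F", "T" if filter_query else "F",
    ]
    if output_file is not None:
        args += ["-o", output_file]
    return args


print(blastall_command("blastn", "QUERY", output_file="out.QUERY"))
```

Because the list is meant to be passed to a process launcher without a shell, the multi-database value needs no extra quoting; the quotation marks in -d "nr est" are only required when typing the command in a shell.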


Blastpgp

Blastpgp performs gapped blastp searches and can be used to perform iterative searches in PSI-BLAST and PHI-BLAST mode. See the PSI-BLAST and PHI-BLAST sections for a description of this binary. The options may be obtained by executing 'blastpgp -'.

Fastacmd

Fastacmd retrieves FASTA-formatted sequences from a BLAST database, provided the
database was formatted using the '-o' option.  An example fastacmd call would be:

fastacmd -d nr -s p38398

The fastacmd options are:

fastacmd   arguments:

  -d  Database [String]
    default = nr
  -s  Search string (GIs, accessions and locuses may be used, delimited
      by comma or space) [String]  Optional
  -i  Input file with GIs/accessions/locuses for batch retrieval [String]  Optional
  -a  Retrieve duplicated accessions [T/F]  Optional
    default = F
  -l  Line length for sequence [Integer]  Optional
    default = 80

Software requirements

BLAST 2.0 uses threads to perform multi-processing searches. Supported operating systems are IRIX 6 on SGI machines (with the relevant threads patches, see below), any Solaris version, or a version of DEC UNIX. IRIX 5 may be used if multi-processing is not enabled.

SGI recommends the following threads patches on IRIX 6 systems:

   For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)
   For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)
   For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if it can read the entire BLAST database into memory and then keep using it there. The resources consumed reading a database into memory can easily outweigh the cost of a BLAST search, so the memory of a machine is normally more important than its CPU speed. One should therefore have sufficient memory for the largest BLAST database one will use, run all the searches against this database in series, and only then run the queries against another database. This guarantees that the database will be read into memory only once.

As of Aug. 1997 the EST FASTA file is about 500 Meg, which translates to about 170-200 Meg of BLAST database. At least another 100-200 Meg should be allowed for memory consumed by the actual BLAST program. All of the FASTA databases together are about 1.5 Gig; the BLAST databases produced from them will probably take about another Gig or so. 4 Gig of disk space, to make room for software and output, is probably a pretty good bet.

Setup

BLAST needs to know where the NCBI data directory and BLAST databases are. This is specified by the main configuration file for the NCBI toolkit (".ncbirc" on UNIX systems, ncbi.ini on Windows, analogous names on other platforms). If BLAST is the ONLY NCBI application that will be used, it is sufficient to have the following simple configuration file:

[NCBI]
   Data=/am/ncbiapdata/data

[BLAST]
BLASTDB=/usr/ncbi/db/disk.blast/blast2

BLAST looks for resource files in the "Data" directory (e.g., "/am/ncbiapdata/data/"). A directory different than "/am/ncbiapdata/data" can be used if this is desired. The resource files can be found in the data directory of the toolbox (i.e., ncbi/data). The .ncbirc should be either in the directory from which BLAST is called, the user's home directory, or in the directory set by the environment variable "NCBI". Alternatively, an environment variable may be set under UNIX. If BLAST is run from the same directory as the database files, the BLASTDB line is unnecessary.
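
For sites that generate the configuration file from a script, the following Python sketch produces the minimal file shown above. ncbirc_text is a hypothetical helper and the paths are the example paths from this section; the sketch writes plain "key=value" lines without spaces around "=", mirroring the example, because it is not verified here whether the toolkit accepts other spacing.

```python
# Produce the text of a minimal .ncbirc like the example above.
# ncbirc_text is a hypothetical helper; the paths are this section's examples.

def ncbirc_text(data_dir, blastdb_dir=None):
    lines = ["[NCBI]", "Data=%s" % data_dir]
    # The [BLAST] section is omitted when BLAST is run from the database
    # directory, since the BLASTDB line is then unnecessary.
    if blastdb_dir is not None:
        lines += ["", "[BLAST]", "BLASTDB=%s" % blastdb_dir]
    return "\n".join(lines) + "\n"


print(ncbirc_text("/am/ncbiapdata/data", "/usr/ncbi/db/disk.blast/blast2"))
```

The returned text can be written to ".ncbirc" in one of the locations listed above (the calling directory, the home directory, or the directory named by the "NCBI" environment variable).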

Database and matrix directories

On UNIX systems, environment variables can be set (e.g., with setenv) to specify the directories of the databases (BLASTDB) and matrices (BLASTMAT).
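
When BLAST is launched from a script rather than an interactive shell, the same two variables can be placed in the child process environment. The Python sketch below is a hypothetical helper; the directory names in the example call are placeholders, not real installation paths.

```python
import os

# Return a copy of an environment with BLASTDB and BLASTMAT set.
# blast_environment is a hypothetical helper; the paths below are placeholders.

def blast_environment(db_dir, matrix_dir, base_env=None):
    env = dict(os.environ if base_env is None else base_env)
    env["BLASTDB"] = db_dir       # directory holding the BLAST databases
    env["BLASTMAT"] = matrix_dir  # directory holding the scoring matrices
    return env


env = blast_environment("/path/to/blast/db", "/path/to/matrices")
print(env["BLASTDB"], env["BLASTMAT"])
```

The resulting dictionary can be passed as the env argument of subprocess.run, which leaves the environment of the calling script untouched.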

Low-complexity Filters

BLAST 2.0 uses the dust low-complexity filter for blastn and seg for the other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit and are accessed automatically.

Access to filtering options.

If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever).  The seg options
can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.  A coiled-coil filter,
based on the work of Lupas et al. (Science, vol. 252, pp. 1162-4 (1991)) and written by
John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked
by specifying:

-F "C"

There are three parameters for this: window, cutoff (probability of a coiled coil),
and linker (distance between two coiled-coil regions that should be linked
together).  These are now set to

window: 22
cutoff: 40.0
linker: 32

One may also change the coiled-coil parameters in a manner analogous to
that of seg:

-F "C 28 40.0 32" will change the window to 28.

One may also run both seg and the coiled-coil filter together by using a ";":

-F "C;S"
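
The -F strings shown above follow a simple pattern that can be generated by a script. The Python sketch below is illustrative; seg_filter, coil_filter and combine_filters are hypothetical helper names, and only the forms that appear in this section are reproduced.

```python
# Compose -F filter strings in the forms shown above.
# All three function names are hypothetical helpers.

def _filter_string(letter, params, template):
    if all(p is None for p in params):
        return letter                      # default parameters, e.g. "S" or "C"
    if any(p is None for p in params):
        raise ValueError("give all three parameters or none")
    return template % params


def seg_filter(window=None, locut=None, hicut=None):
    """'S' or 'S window locut hicut', e.g. 'S 10 1.0 1.5'."""
    return _filter_string("S", (window, locut, hicut), "S %d %.1f %.1f")


def coil_filter(window=None, cutoff=None, linker=None):
    """'C' or 'C window cutoff linker', e.g. 'C 28 40.0 32'."""
    return _filter_string("C", (window, cutoff, linker), "C %d %.1f %d")


def combine_filters(*filters):
    """Join filter specifications with ';', e.g. 'C;S'."""
    return ";".join(filters)


print(seg_filter(10, 1.0, 1.5))
print(coil_filter(28, 40.0, 32))
print(combine_filters(coil_filter(), seg_filter()))
```

This document only shows the combined form with default parameters ("C;S"); whether a combined string that also carries explicit parameters is accepted is not stated here and should be checked before relying on it.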

BLAST databases

The FASTA files used by the NCBI to produce BLAST databases are available on the NCBI FTP site in ftp://ncbi.nlm.nih.gov/blast/db/. Please see the README for details.

References

Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden, David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998), "Protein sequence similarity searches using patterns as seeds", Nucleic Acids Res. 26:3986-3990.

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Karlin, Samuel and Stephen F. Altschul (1990), "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes", Proc. Natl. Acad. Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993), "Applications and statistics for multiple high-scoring segments in molecular sequences", Proc. Natl. Acad. Sci. USA 90:5873-7.

Release History

Notes for 2.0.7 release:

Bug fixes:

1.) BLAST now multi-threads properly under LINUX.

2.) A problem with very redundant databases and psi-blast was fixed.

3.) A problem with the formatting of the number of identities and positives was fixed. This affected results on the minus strand only and did not affect the expect value or scores.

4.) A problem that caused tblastn to core-dump very occasionally was corrected.

5.) A problem with multiple patterns in PHI-BLAST was fixed.

6.) A limit on the number of HSP's that were saved (100) was removed.

Notes for 2.0.6 release:

Enhancements:

1.) PHI-BLAST is included in this release. Please see notes on PHI-BLAST for details.

2.) SEG has become an integral part of the NCBI toolkit and it is no longer necessary to install it separately. It is also now supported under non-UNIX platforms.

3.) Access to filtering options. If one uses "-F T" then normal filtering by seg or dust (for blastn) occurs (likewise "-F F" means no filtering whatsoever). The seg options can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5. One may also specify coiled-coil filtering by specifying:

-F "C"

There are three parameters for this: window, cutoff (probability of a coiled coil), and linker (distance between two coiled-coil regions that should be linked together). These are now set to

window: 22
cutoff: 40.0
linker: 32

One may also change the coiled-coil parameters in a manner analogous to that of seg:

-F "C 28 40.0 32" will change the window to 28.

One may also run both seg and the coiled-coil filter together by using a ";":

-F "C;S"

4.) BLAST has been changed to reduce the number of redundant hits that a user may see. This is achieved by keeping track of the number of hits completely contained in a certain region and eliminating those lower scoring hits that are redundant with others. This behavior may be controlled with the -K and -L options:

  -K  Number of best hits from a region to keep [Integer]
    default = 50
  -L  Length of region used to judge hits [Integer]
    default = 20

Setting -K to zero turns off this feature. This is the default only on blastall.

Bug fixes:

1.) There was a problem with the procedure that called the external utility seg. The need to fix this was obviated by the integration of seg into the toolkit. This showed up under LINUX.

2.) There was a memory problem with formatdb that has been fixed. This showed up mostly under NT and LINUX.

3.) A problem with running in multi-processing mode under IRIX6.5 (as a non-root user) was fixed.

Notes for 2.0.5 release:

Enhancements:

1.) The BLAST version is printed by formatdb in its log file.

2.) Multi-database searches no longer require that the -o option be used when preparing the databases (i.e., with formatdb).

Bugs fixed:

1.) A serious bug with multi-database iterative searches was fixed (thanks to Steve Brenner for providing an example).

2.) 'lcl' is not formatted in the BLAST report when the sequence identifier is a local identifier or does not contain a bar ("|").

3.) A large memory leak in formatdb was fixed.

4.) An unnecessary cast that caused formatdb to fail on Solaris 2.5 machines if the binary was made under 2.6 was fixed.

5.) Better error checking was added to protect against core-dumps.

6.) Some problems with the sum statistics treatment of the blastx and tblastn programs reported by D. Rozenbaum were fixed. The number of alignments involved in a sum group was misrepresented. Also the incorrect length for the database sequence was used, sometimes causing a slight change in the value reported.

7.) A problem with blastpgp was fixed that reported incorrect values for matrices other than BLOSUM62 during iterative searches.

Notes for 2.0.4 release:

Enhancements:

1.) Multiple database searches: Version 2.0.4 will accept multiple database names (bracketed by quotations). An example would be

-d "nr est"

which will search both the nr and est databases, presenting the results as if one 'virtual' database consisting of all the entries from both were searched. The statistics are based on the 'virtual' database.

2.) New options:

  -W  Word size, default if zero [Integer]
    default = 0
  -z  Effective length of the database (use zero for the real size) [Integer]
    default = 0

3.) The number of identities, positives, and gaps are now printed out before the alignments for gapped blastx, tblastn, and tblastx. Additionally this feature is now also enabled for ungapped BLAST.

4.) Formatdb now accepts ASN.1, as well as FASTA, as input.

Bugs fixed:

1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as a start codon in some cases.

2.) The last alignment of the last sequence being presented was incorrectly dropped in some cases. This change could affect the statistical significance of the last database sequence if the dropped alignment had a lower e-value than any other alignments from the same database sequence.