EMBL Outstation - The European Bioinformatics Institute

EMBL Nucleotide Sequence Database
Release Notes
Release 73
December 2002

EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone:       +44-1223-494400
Telefax:         +44-1223-494468
Electronic mail: datalib@ebi.ac.uk
URL:             http://www.ebi.ac.uk

CONTENTS

* 1 RELEASE 73
    o 1.1 27.9 Billion Nucleotides
    o 1.2 Recently Completed Genomes
    o 1.3 New Cross-References to GOA (GO Annotation)
    o 1.4 EMBL Sequence Version Archive (SVA)
    o 1.5 Third-Party Annotation Dataset (TPA)
    o 1.6 CON(struct) Division
    o 1.7 Whole Genome Shotgun Sequences (WGS)
    o 1.8 New Feature Table Definition Document v5
        o 1.8.1 New Qualifier: /segment
        o 1.8.2 New Qualifier: /mol_type
        o 1.8.3 New Qualifier: /locus_tag
    o 1.9 Database Files
        o 1.9.1 Naming Conventions
        o 1.9.2 EST Database Files
        o 1.9.3 GSS Database Files
        o 1.9.4 INV Database Files
        o 1.9.5 HUM Database Files
        o 1.9.6 HTG Database Files
            o 1.9.6.1 Base Quality Values
        o 1.9.7 PAT Database Files
        o 1.9.8 CON Database Files
        o 1.9.9 CRC Values for Distributed Files
    o 1.10 Cross-Reference Information
    o 1.11 Sequence Retrieval System (SRS)
    o 1.12 EMBL Database FAQ
    o 1.13 Disclaimer
* 2 FORTHCOMING CHANGES
    o 2.1 Sequence Length Limit
    o 2.2 Molecule Type Information
    o 2.3 E-mail Submission Form Discontinued
    o 2.4 New reference line-type 'RG' (Reference Group)
* 3 SEQUENCE SUBMISSION SYSTEMS
    o 3.1 Checking Sequence Data For Vector Contamination
    o 3.2 Webin - WWW Sequence Submission System
        o 3.2.1 Webin-Bulk Submissions
        o 3.2.2 Webin-TPA (Third Party Annotation) Submissions
        o 3.2.3 Webin-Align - WWW Alignment Submission System
    o 3.3 SEQUIN - Stand-alone Submission Program
    o 3.4 Further Submission Information
        o 3.4.1 Annotation Guides
* 4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE
* 5 EBI NETWORK SERVICES
    o 5.1 Electronic Mail Server
    o 5.2 Anonymous FTP Server
    o 5.3 World Wide Web (WWW) Server
    o 5.4 Sequence Similarity Search Servers
* 6 DISTRIBUTION FILES
    o 6.1 Release 73 Files
* APPENDIX A DATABASE GROWTH TABLE

1 RELEASE 73

The EMBL Nucleotide Sequence Database was frozen to make Release 73 on 29-NOV-2002. The release contains 20,857,746 sequence entries comprising 27,903,283,528 nucleotides. This represents an increase of about 20% over Release 72. A breakdown of Release 73 by division is shown below:

Division                Entries      Nucleotides
-----------------  ------------  ---------------
Constructed                 146      451,710,729
ESTs                 14,278,034    7,122,871,284
Fungi                    69,690      110,593,809
GSSs                  4,181,290    2,343,305,324
HTC                      41,184       52,445,825
HTG                      63,671   10,664,843,325
Human                   233,004    3,725,671,508
Invertebrates           108,405      590,984,276
Other Mammals            42,976       70,886,183
Mus musculus             63,784      792,364,491
Organelles              173,371      143,776,002
Patents                 899,597      470,683,009
Bacteriophage             2,239        7,141,342
Plants                  134,576      479,339,789
Prokaryotes             166,584      512,495,745
Rodents                  23,789       37,005,931
STSs                    156,147       66,054,417
Synthetic                 7,059       13,434,531
Unclassified              1,500        2,487,640
Viruses                 170,355      151,858,111
Other Vertebrates        40,345       93,330,257
                   ------------  ---------------
Total                20,857,746   27,903,283,528

1.1 27.9 Billion Nucleotides

When the data were frozen for this release, the number of nucleotides in the database had nearly reached the 28 billion mark. EMBL database statistics are available at URL: http://www3.ebi.ac.uk/Services/DBStats/

1.2 Recently Completed Genomes

Plasmodium falciparum

The analysis of the genome sequence of Plasmodium falciparum strain 3D7 has been published in Nature 419, 498-511 (2002). The nuclear genome consists of 14 chromosomes; the genome size is 23 megabases, encoding about 5,300 genes.
Plasmodium falciparum strain 3D7 chromosome 1   AL844501
Plasmodium falciparum strain 3D7 chromosome 2   AE001362
Plasmodium falciparum strain 3D7 chromosome 3   AL844502
Plasmodium falciparum strain 3D7 chromosome 4   AL844503
Plasmodium falciparum strain 3D7 chromosome 5   AL844504
Plasmodium falciparum strain 3D7 chromosome 6   AL844505
Plasmodium falciparum strain 3D7 chromosome 7   AL844506
Plasmodium falciparum strain 3D7 chromosome 8   AL844507
Plasmodium falciparum strain 3D7 chromosome 9   AL844508
Plasmodium falciparum strain 3D7 chromosome 10  AE014185
Plasmodium falciparum strain 3D7 chromosome 11  AE014186
Plasmodium falciparum strain 3D7 chromosome 12  AE014188
Plasmodium falciparum strain 3D7 chromosome 13  AL844509
Plasmodium falciparum strain 3D7 chromosome 14  AE014187

Anopheles gambiae WGS

In a separate project, an international consortium of researchers has sequenced the genome of the Anopheles gambiae mosquito, which transmits the parasite to humans. The International Anopheles Genome project is a collaboration between Celera Genomics, Genoscope, University of Notre Dame, EBI/Sanger Institute, EMBL, Institut Pasteur, IMBB and TIGR. The genome sequence of the malaria mosquito Anopheles gambiae has been published in Science 298, 129 (2002). The Whole Genome Shotgun (WGS) assembly of the mosquito genome is available from the EBI at ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs.

Both genome sequences provide the foundation for future studies of these organisms, and are being exploited in the search for new drugs and vaccines to fight malaria.

Other recently completed genomes include:

Bifidobacterium longum NCC2705                               AE014295
Brucella suis 1330 chromosome I                              AE014291
Brucella suis 1330 chromosome II                             AE014292
Corynebacterium efficiens YS-314                             BA000035
Leptospira interrogans serovar lai str. 56601 chromosome I   AE010300
Leptospira interrogans serovar lai str. 56601 chromosome II  AE010301
Oceanobacillus iheyensis                                     BA000028
Shewanella oneidensis MR-1                                   AE014299
Shigella flexneri 2a str. 301                                AE005674
Streptococcus agalactiae 2603V/R                             AE009948
Streptococcus mutans UA159                                   AE014133
Wigglesworthia brevipalpis                                   BA000021
Trypanosoma brucei DNA chromosome 1                          AL929608
Shewanella oneidensis MR-1 megaplasmid                       AE014300

Direct access to hundreds of completed genome sequences is available via EBI's WWW Genomes server at URL http://www.ebi.ac.uk/genomes/

1.3 New Cross-References to GOA (GO Annotation)

GOA is a project at the European Bioinformatics Institute to provide assignments of gene products to the Gene Ontology (GO) resource. The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms. In the GOA project, this vocabulary will be applied to a non-redundant set of proteins described in the SWISS-PROT, TrEMBL and Ensembl databases that collectively provide complete proteomes for Homo sapiens and other organisms.

EMBL Release 73 includes more than 730,000 new cross-references to GOA.

Example:

ID   HSFOS      standard; DNA; HUM; 6210 BP.
XX
AC   K00650; M16287;
...
DR   EPD; EP11145; HS_FOS.
DR   GDB; 119917; FOS.
DR   GOA; P01100; P01100.
DR   SWISS-PROT; P01100; FOS_HUMAN.
...
FT   CDS             join(889..1029,1783..2034,2466..2573,2688..3329)
FT                   /codon_start=1
FT                   /db_xref="GOA:P01100"
FT                   /db_xref="SWISS-PROT:P01100"
...

Hyperlinks from e.g. /db_xref="GOA:P01100" use the QuickGO browser at www.ebi.ac.uk/ego/QuickGO?, e.g.
http://www.ebi.ac.uk/ego/QuickGO?mode=search&querytype=protein&query=P01100

1.4 EMBL Sequence Version Archive (SVA)

In order to provide access to previous versions of database records, the EMBL database has established a Sequence Version Archive (SVA).

Data in the EMBL nucleotide sequence database change over time for a number of reasons, e.g. due to updates/corrections or extensions based on new findings from more recent experiments. Each time data in an entry are modified, the entry is assigned a new entry version number.
Entries can change their appearance even while the data included remain unchanged, due to general flat-file format changes or changes in the taxonomic classification of the source organism (e.g. when an organism is assigned a new place in the hierarchy). Following these types of changes, the entry will retain its original sequence version number.

Querying the EMBL Sequence Version Archive

The EMBL Sequence Version Archive is available from the EBI web servers at URL http://www.ebi.ac.uk/embl/sva/. Entries can be viewed by accession number, nucleotide sequence identifier or protein identifier. Query result options allow users to:

a) show the complete history for an entry, i.e. all recorded flat files matching the query criterion in chronological order
b) show a snapshot of the entry at a particular date
c) show differences between entry versions

Query results are presented in a table; listed EMBL entries can be 'Viewed on Screen' or 'Saved to File'.

1.5 Third-Party Annotation Dataset (TPA)

Following a decision taken at the 2002 Collaborative Meeting, DDBJ/EMBL/GenBank have been creating a Third Party Annotation (TPA) dataset. The TPA data-collection is a complement to the existing DDBJ/EMBL/GenBank comprehensive database of primary nucleotide sequences, which typically result from direct sequencing of cDNAs, ESTs, genomic DNAs etc.

Primary data are defined to be data for which the submitting group has done the sequencing and annotation, and as 'owner' of these data has privileges to submit updates/corrections etc. In contrast, non-primary sequences are defined as sequences which a) consist exclusively of DNA from one or several already existing entries 'owned' by other groups or b) consist of a mixture of new and already existing sequences.
TPA categories and requirements

Users can submit re-annotations/re-assemblies of sequences already present in DDBJ/EMBL/GenBank and owned by other groups to be included in the Third Party Annotation (TPA) data-collection.

Categories of data submissions accepted for TPA include:

a. re-annotation/analysis of sequence(s) from DDBJ/EMBL/GenBank
b. mixed primary/non-primary TPA sequence including regions of new and existing sequence (e.g. filling the gaps with HTG or EST or newly sequenced data)
c. TPA sequences based on trace archive data
d. TPA sequences based on Whole Genome Shotgun (WGS) sequences

Consensus sequences from multiple organisms are not accepted.

For TPA submissions, the database requires information on the composition of the TPA sequence to show which spans in a TPA sequence originated from which contributing database entries. In order to ensure that the sequence annotation is of high quality, we require that the study be published in a peer-reviewed journal before we release the data to the public.

EBI's submission system WEBIN has been customised to allow submissions of TPA sequences to the EMBL Nucleotide Sequence Database. WEBIN is available at URL http://www.ebi.ac.uk/embl/Submission/webin.html

TPA sequences are exchanged amongst the DDBJ/EMBL/GenBank database collaboration. The TPA data-collection is available via the EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/embl/tpa and also via EBI's Sequence Retrieval System (SRS) at http://srs.ebi.ac.uk.

1.6 CON(struct) Division

Con(structed) or Con(tig) sequences in the CON division represent complete genomes and other long sequences constructed from segment entries.

Nucleotide sequence records in EMBL (DDBJ and GenBank) currently have a size restriction of 350 kb. Sequences >350 kb are split into smaller segment entries before inclusion in the database. Segment entries are assigned individual accession numbers, include sequence data and are distributed in the appropriate taxonomic divisions.
In contrast, CON division entries do not contain sequence data per se, but rather the assembly information on all accession.versions and sequence locations relevant in building the contig genome. CON sequence entries follow the daily data exchange mechanism between DDBJ/EMBL/GenBank.

CON entries are available at ftp://ftp.ebi.ac.uk/pub/databases/embl/release/ in file 'embl.con' and from EBI's WWW Genomes server at URL http://www.ebi.ac.uk/genomes/.

For more detailed information on specific aspects of CON(structed) sequences (e.g. examples, CO line-type, gap representation etc.), please see the previous EMBL release notes (Release 72, Sep 2002).

1.7 Whole Genome Shotgun Sequences (WGS)

Methods using whole genome shotgun data are used to gain a large amount of genome coverage for an organism. WGS data for a growing number of organisms are being submitted to DDBJ/EMBL/GenBank. WGS sequences are exchanged amongst the DDBJ/EMBL/GenBank database collaboration.

EMBL's WGS data-collection is made available via the EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs and also via EBI's Sequence Retrieval System (SRS) at http://srs.ebi.ac.uk

Detailed information on specific aspects of Whole Genome Shotgun Sequences (e.g. accession-number format, examples, lists etc.) is available from the previous EMBL release notes (Release 72, Sep 2002).

1.8 Feature Table Definition Document v5

The new version of the Feature Table Definition Document (FTv5) is implemented on 15-DEC-2002.
The document is available from the EBI servers at:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/

1.8.1 New SOURCE feature qualifier: /segment

/segment is a qualifier to the 'Source' feature key providing information on the name or number of a viral or phage segment.

Qualifier       /segment=
Definition      name of viral or phage segment sequenced
Value format    "text"
Example         /segment="6"

1.8.2 New SOURCE feature qualifier: /mol_type

/mol_type is a qualifier to the 'Source' feature key recording the biological state (in vivo molecule type) of the sequence.

Qualifier       /mol_type=
Definition      in vivo molecule type
Value format    "text"
Example         /mol_type="genomic DNA"
Comment         text limited to "genomic DNA", "genomic RNA", "mRNA" (incl. EST),
                "tRNA", "rRNA", "snoRNA", "snRNA", "scRNA", "pre-mRNA",
                "other RNA" (incl. synthetic), "other DNA" (incl. synthetic),
                "unassigned DNA" (incl. unknown), "unassigned RNA" (incl. unknown);
                /mol_type will become a mandatory qualifier to the SOURCE
                feature key from 01-JUL-2003.

1.8.3 New qualifier: /locus_tag

/locus_tag allows assignment of systematic tags for tracking purposes.

Qualifier:      /locus_tag
Definition:     feature tag assigned for tracking purposes
Value Format:   "text" (single token)
Example:        /locus_tag="RSc0382"
                /locus_tag="YPO0002"
Comment:        /locus_tag can be used with any feature where /gene is valid;

1.9 Database Files

In order to keep the size of the data files within reasonable limits for handling purposes, additional division files will be added in subsequent releases as appropriate.

1.9.1 Naming Conventions

When a division is split into several files, these are named so that they sort sequentially, e.g. est_hum01.dat, est_hum02.dat, ..., est_hum22.dat, est_hum23.dat etc.

1.9.2 EST Database Files

ESTs (single pass cDNA reads) constitute a major source of sequence records - the vast majority originating from human and mouse.
In addition to the EST division files in the EMBL database release, EBI's ESTLIB provides further information about the libraries from which EST sequences were derived. The corresponding EST division entries in EMBL are cross-referenced to ESTLIB with a /db_xref qualifier on the source feature, e.g. /db_xref="ESTLIB:863"

ESTLIB is available from ftp://ftp.ebi.ac.uk/pub/databases/embl/estlib/

EST files are split according to taxonomic subdivisions following the model of the taxonomic split of all other EMBL database divisions.

est_fun01.dat - est_fun02.dat   Fungi ESTs
est_hum01.dat - est_hum49.dat   Human ESTs
est_inv01.dat - est_inv17.dat   Invertebrate ESTs
est_mam01.dat - est_mam04.dat   Other Mammal ESTs
est_mus01.dat - est_mus29.dat   Mouse ESTs
est_pln01.dat - est_pln29.dat   Plant ESTs
est_pro.dat                     Prokaryote ESTs
est_rod01.dat - est_rod04.dat   Other Rodent ESTs
est_unc.dat                     Unclassified ESTs
est_vrt01.dat - est_vrt12.dat   Vertebrate ESTs

1.9.3 GSS Database Files

Genome Survey Sequences (GSS) are of similar nature to EST data, except that sequences are genomic rather than cDNA (mRNA). The GSS division contains e.g. random 'single pass read' genome survey sequences, single pass reads from cosmid/BAC/YAC ends, exon trapped genomic sequences and Alu PCR sequences.

GSS division files are also split according to taxonomic subdivisions. Mouse GSSs are now included in files gss_mus01.dat - gss_mus10.dat.

gss_fun.dat                     Fungi GSSs
gss_hum01.dat - gss_hum09.dat   Human GSSs
gss_inv01.dat - gss_inv06.dat   Invertebrate GSSs
gss_mam01.dat - gss_mam02.dat   Other Mammal GSSs
gss_mus01.dat - gss_mus10.dat   Mouse GSSs
gss_phg.dat                     Phage GSSs
gss_pln01.dat - gss_pln11.dat   Plant GSSs
gss_pro.dat                     Prokaryote GSSs
gss_rod01.dat - gss_rod04.dat   Rodent GSSs
gss_vrl.dat                     Viral GSSs
gss_vrt01.dat - gss_vrt03.dat   Vertebrate GSSs

1.9.4 INV Database Files

The INV division has been split into 3 files (inv01.dat - inv03.dat).
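
As a small illustration of the naming scheme used in sections 1.9.1 to 1.9.4, the following Python sketch generates the file names for a split division. The function name split_file_names is hypothetical, and the two-digit zero-padded counter and the unnumbered name for a division held in a single file are inferred from the file lists in these notes rather than taken from a formal specification.

```python
def split_file_names(prefix, count, extension="dat"):
    """Return the data file names for a division held in `count` files.

    Assumption (inferred from the file lists, not a formal rule):
    split files carry a two-digit, zero-padded counter
    (est_hum01.dat ... est_hum49.dat), while a division kept in a
    single file has no counter at all (est_pro.dat).
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [f"{prefix}.{extension}"]
    return [f"{prefix}{i:02d}.{extension}" for i in range(1, count + 1)]


human_ests = split_file_names("est_hum", 49)
print(human_ests[0], human_ests[-1])

# Zero padding keeps alphabetical order identical to numerical order,
# which is what "named so that they sort sequentially" relies on.
print(sorted(human_ests) == human_ests)
```

With two-digit padding this property holds for up to 99 files per division; a larger split would need a wider counter.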
1.9.5 HUM Database Files

The HUM division has been split into 24 files (hum01.dat - hum24.dat).

1.9.6 HTG Database Files

'Unfinished' DNA sequences generated by the high-throughput sequencing centers are represented in the HTG division - and are rapidly made available to the scientific community for homology searches. Entries in this division all contain keywords to indicate the status of the sequencing (e.g., HTGS_PHASE1). A single accession number is assigned to one clone, and as sequencing progresses and the entry passes from one phase to another, it will retain the same accession number. Once 'finished', HTG sequences are moved into the appropriate primary EMBL taxonomic division.

HTG division files have also been split according to taxonomic subdivisions. Mouse HTGs are included in files htgo_mus.dat and htg_mus01.dat - htg_mus04.dat.

htgo_hum.dat                    Human HTGs phase0
htgo_mus.dat                    Mouse HTGs phase0
htgo_other.dat                  Other HTGs phase0
htg_hum01.dat - htg_hum05.dat   Human HTGs
htg_inv01.dat - htg_inv02.dat   Invertebrate HTGs
htg_mam.dat                     Other Mammal HTGs
htg_mus01.dat - htg_mus04.dat   Mouse HTGs
htg_other.dat                   Other HTGs
htg_pln.dat                     Plant HTGs
htg_rod01.dat - htg_rod07.dat   Rodent HTGs
htg_vrt.dat                     Other Vertebrate HTGs

HTGS_PHASE0 entries typically consist of one-to-few pass reads of a single clone, have not been assembled into contigs and are unoriented, unordered, unannotated and contain gaps of unknown length. Low-pass sequence sampling is useful for identifying clones that may be gene-rich. Phase0 sequences are used to check whether another center is already sequencing this clone. If not, it will be sequenced through phase 1 and phase 2. When records are updated, the accession numbers will be preserved.

1.9.6.1 Base Quality Values

Quality scores from draft HTG data are available on the EBI FTP server. The compressed (gzip) files contain base quality values for unfinished human sequences from Japanese, US and European sequencing centres.
The fasta-type headers contain the EMBL sequence identifier of the corresponding database entries.

Example:

>AL157822.2 Phrap Quality (Length:158745, Min: 1, Max: 99)

In order to keep the size of the files within reasonable limits for handling purposes, files that are larger than 1 Gb in uncompressed form are split into smaller files.

Directory: ftp://ftp.ebi.ac.uk/pub/databases/embl/quality_scores

Quality score files are updated on a daily basis.

1.9.7 PAT Database Files

PAT files include sequence data incorporated from the European patent literature (EPO) and complemented by American and Japanese patent data integrated from NCBI (USA) and DDBJ (Japan). The Patent division has been split into 5 files (pat01.dat - pat05.dat).

1.9.8 CON Database File

CON files include construct information for building contig sequences of chromosomes, genomes and other long DNA sequences. CON entries in file 'embl.con' do not contain sequence data per se. The CON division data is included in 1 file (embl.con).

1.9.9 CRC values for distributed files

To help users verify the integrity of release data files, we supply files containing 32-bit Cyclic Redundancy Check (CRC) checksum values, plus byte counts, for both compressed and uncompressed release files. These CRC values are calculated according to the IEEE Std 1003.2-1992 (POSIX 1003.2) and X/Open CAE specifications. These values are generated by default by the 'cksum' command on Irix, RedHat Linux, SunOS and Solaris. On Tru64 unix, the environment variable CMD_ENV needs to be set to xpg4.

File: crc_gz.txt   for compressed data files
File: crc.txt      for uncompressed data files

Example from crc.txt:

1553759899 195684609 est_fun.dat

This output shows that the checksum of the file est_fun.dat is 1553759899 and that the file contains 195684609 bytes.

1.10 Cross-Reference Information

Links to external databases allow integration with specialised data collections, such as protein databases, species-specific databases, taxonomy databases etc.
The WWW-based sequence retrieval system SRS enables users to easily navigate between cross-referenced database entries. The total number of links in EMBL Release 73 is 19,586,912. More than 1.89 million of these also refer to individual features, e.g. CDS (coding sequences), via the /db_xref feature qualifier in EMBL entries.

Database    Nr of links
----------  -----------
UNILIB         13410818
RZPD            3439996
TrEMBL           864467
GOA              734286
GrainGenes       441708
SWISS-PROT       216658
MaizeDB          210299
RemTrEMBL         87243
IMGT/LIGM         63782
MGD               38275
FLYBASE           25345
MENDEL            21033
SGD               10991
GDB                8430
TRANSFAC           6620
IMGT/HLA           3577
EPD                3384
-----------------------
Total          19586912

1.11 Sequence Retrieval System (SRS)

EBI's SRS server is available at URL http://srs.ebi.ac.uk

All external services are available from the 'Toolbox' button on EBI's Web pages. If you have any comments and/or suggestions please send these to: support@ebi.ac.uk

1.12 EMBL Database FAQ

An EMBL Database FAQ is available from the EBI at URL http://www.ebi.ac.uk/embl/Documentation/FAQ/

This document includes information on:

General questions about EMBL and other databases
Submission procedure
Updating database entries
Webin-specific questions
Navigation guide

1.13 Disclaimer

No guarantee is given and no legal liability or responsibility is assumed for the completeness and accuracy of the database entries, in particular the conformity of sequence data in the database with the journal publication where the sequence is also disclosed.

2 FORTHCOMING CHANGES

2.1 Sequence Length Limit

Currently database records are limited in length to 350 kb. At the recent collaborative meeting, DDBJ/EMBL/GenBank discussed the issue of relaxing the maximum sequence length limit. The plan is to remove the size restriction on database records in 2 years' time. We will announce this to the community, especially developers, and we will review this proposal in 12 months' time.
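
To make the current limit concrete, the Python sketch below works out how a long sequence could be cut into pieces of at most 350 kb. It is an illustration only: the function name plan_segments is hypothetical, 350 kb is assumed to mean 350,000 bases, and the pieces are simply cut back to back, whereas the real segment boundaries (and any overlaps between segment entries) are decided by the databases and are not described in these notes.

```python
MAX_RECORD_LENGTH = 350_000  # assumption: "350 kb" taken as 350,000 bases


def plan_segments(sequence_length, max_length=MAX_RECORD_LENGTH):
    """Return 1-based, inclusive (start, end) spans covering the sequence.

    Spans are contiguous and non-overlapping; this is a simplification
    for illustration, not the databases' actual splitting procedure.
    """
    if sequence_length < 0 or max_length < 1:
        raise ValueError("lengths must be positive")
    spans = []
    start = 1
    while start <= sequence_length:
        end = min(start + max_length - 1, sequence_length)
        spans.append((start, end))
        start = end + 1
    return spans


# A 1,000,000 bp sequence exceeds the limit and needs three pieces.
for start, end in plan_segments(1_000_000):
    print(f"{start}..{end}  ({end - start + 1} bp)")
```

A sequence at or below the limit yields a single span, which corresponds to a record that needs no splitting.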
2.2 Molecule type information

/mol_type qualifier: The new /mol_type qualifier described above in 1.8.2 will be implemented with the new FTv5 document on 16-Dec-2002. Initially, /mol_type will be an optional qualifier to the SOURCE feature key to allow some time for retrofitting of existing data. From 01-JUL-2003, /mol_type will be a mandatory qualifier to the SOURCE feature key and will consistently display the in vivo molecule type of the sequence.

Molecule type in ID lines: At the same time, molecule type information in the flat-file entry ID line will display the corresponding value from the /mol_type value list.

2.3 E-mail submission form discontinued

From 01-Jan-2003, we will discontinue support of the e-mail submission form and will no longer accept e-mail submissions. EBI's preferred submission medium is WEBIN (for details see 3.2).

2.4 New reference line-type 'RG' (Reference Group)

A new reference line-type 'RG' will be introduced in June 2003 to list the consortium name associated with a given citation.

Examples:

RG   The C. elegans Sequencing Consortium;
RG   The Brazilian Network for HIV Isolation and Characterization;

3 SEQUENCE SUBMISSION SYSTEMS

3.1 Checking Sequence Data For Vector Contamination

We urge submitters to remove vector contamination from sequence data before submitting to the database. To assist submitters, the EBI provides a Vector Screening Service using the latest implementation of the BLAST algorithm and a special sequence databank known as EMVEC. EMVEC is an extraction of sequences from the SYNthetic division of EMBL containing more than 2000 sequences commonly used in cloning and sequencing experiments.

EMVEC is by no means a complete vector databank, but EBI believes it is representative of the kind of material used in modern sequencing and should be useful to submitters. The databank will be updated with each release of EMBL and made publicly available on the EBI's ftp server.
The interactive WWW service can be found at:

http://www.ebi.ac.uk/embl/Submission/webin.html
http://www.ebi.ac.uk/blastall/vectors.html

The results will list sequences producing significant alignments and associated information such as vector name, score, alignment etc.

3.2 Webin - WWW Sequence Submission System

Webin is the preferred WWW Sequence Submission System for submitting nucleotide sequence data and associated biological information to the EMBL Nucleotide Sequence Database. To access Webin please use the following URL:

http://www.ebi.ac.uk/embl/Submission/webin.html

Database entries submitted to the EMBL Nucleotide Sequence Database at the EBI will be exchanged and shared among the International Collaboration of Nucleotide Sequence Databases (DDBJ/EMBL/GenBank).

Webin guides the user through a sequence of WWW forms allowing the submission of sequence data and descriptive information in an interactive and easy way. All the information required to create a database entry will be collected during this process. EBI staff will process data submissions within 2 working days and send the database accession number(s) assigned to your data to your e-mail address.

3.2.1 Webin Bulk Submissions

To make bulk sequence submission less time consuming for submitters, a web-based bulk submission system can be accessed from the Webin page.

Authors planning to submit a large number of similar sequences (i.e., >25) are presented with an option for "Bulk Webin Submission". When choosing the bulk path, submitters follow the usual Webin submission procedure to submit a first, single representative sequence, which will be processed by database staff to create templates for the other sequences, thus saving the author the time and effort required to complete numerous submission events individually.

Please contact database staff if you require further information.
e-mail: datasubs@ebi.ac.uk
Tel:    +44-1223-494499
Fax:    +44-1223-494472

3.2.2 Webin-TPA submissions

Users can submit re-annotations/re-assemblies of sequences already present in DDBJ/EMBL/GenBank and owned by other groups to be included in the Third Party Annotation (TPA) data-collection.

3.2.3 Webin-Align Sequence Alignment Submissions

The EBI accepts submissions of sequence alignment data (from phylogenetic and population analysis etc.) via Webin-Align, the EBI's WWW-based submission tool. After approval by EMBL staff, a unique identifier (an accession number with the format ALIGN_000001) is assigned to the alignment. This identifier is then communicated to the submitter, and should be quoted in publications.

Details on how to access Webin-Align, related help documentation and annotation example pages are available at:

http://www.ebi.ac.uk/embl/Submission/align_top.html

New users are advised to read the help documentation and FAQ prior to submitting their data.

Alignment data is available in EMBL-Align and ClustalW file format from the EBI FTP server. New alignments in EMBL-Align format can be retrieved via SRS.

http://srs.ebi.ac.uk/                              - Sequence Retrieval System
http://www3.ebi.ac.uk/Services/align/listali.html  - HTML alignment list

Submitters unable to access Webin-Align should contact the data submissions staff (email: datasubs@ebi.ac.uk) for advice.

3.3 SEQUIN - Stand-alone Submission Program

Sequin is the multi-platform (Mac/PC/Unix) stand-alone software tool developed by the NCBI for submitting entries to the EMBL, GenBank, or DDBJ sequence databases. The Sequin program, along with detailed downloading and installation instructions plus general information, is available from the EBI via WWW and anonymous FTP.
http://www3.ebi.ac.uk/Services/Sequin/
ftp://ftp.ebi.ac.uk/pub/software/sequin/

3.4 Further Submission Information

3.4.1 Annotation Guides

To help and guide submitters in annotating their sequences, two online guides are available via hyperlinks from within Webin: EMBL Annotation Examples (http://www3.ebi.ac.uk/Services/Standards/web/) and EMBL Features and Qualifiers (http://www3.ebi.ac.uk/Services/WebFeat/).

The annotation examples consist of a list of EMBL approved feature table annotations for common biological sequences. The EMBL Features and Qualifiers guide is a complete list of feature table key and qualifier definitions providing detailed descriptions and usage examples.

For further information on submission of sequence data to the EMBL Nucleotide Sequence Database please access:

http://www.ebi.ac.uk/embl/Submission/

or contact database staff at:

EMBL Nucleotide Sequence Submissions
e-mail:    datasubs@ebi.ac.uk
telephone: +44-1223-494499
telefax:   +44-1223-494472

4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

We encourage authors to include a reference to the EMBL Database in publications related to their research.

When citing data in the EMBL Database, we suggest giving the corresponding primary accession number and the publication in which the sequence first appeared. For unpublished data, we suggest contacting the original submitters for recent publication information or revisions of the data.

We also suggest providing a reference for the EMBL Database itself. Our recent publication describing the EMBL database should be cited:

Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C., Kulikova T., Leinonen R., Lin Q., Lombard V., Lopez R., Redaschi N., Stoehr P., Tuli M.A., Tzouvara K. and Vaughan R.
'The EMBL Nucleotide Sequence Database'
Nucleic Acids Res 30:21-26(2002)

Example: The numbers in parentheses refer to the reference citation in the EMBL database entry, and to the EMBL citation above.
"Sequence entry X56734 (1) has been retrieved from the EMBL Database (2) and showed significant sequence similarity to ..." (1) Oxtoby, E., et al., Plant Mol. Biol. 17:209-219(1991). (2) Stoesser G. et al., Nucleic Acids Res 30:21-26(2002). 5 EBI NETWORK SERVICES 5.1 Electronic Mail Server Users with access to electronic mail and internet can obtain copies of database entries, documentation or the data submission form, by sending commands to a file server running at EBI. New and updated EMBL nucleotide sequence entries are made available on the server on a daily basis. To use this facility, send file server commands to the address netserv@ebi.ac.uk. Each line of the mail message should consist of a single file server request. The most important file server request, to get started, is: HELP If the file server receives this command, it will return a helpfile to the sender, explaining in some detail how to use the facility. For example, to request a copy of the nucleotide sequence with accession number X55652, use the command: GET NUC:X55652 The file server offers various other services, (eg., access to nucleotide and protein sequence data, protein structure data, software), details of which are provided in the HELP file. 5.2 Anonymous FTP Server An alternative method of accessing the EBI archives is to use the file transfer protocol (ftp). Researchers with direct access to the Internet can use the FTP program on their local machine to connect to the host FTP.EBI.AC.UK and enter the username "anonymous" and their email address as password. 
The directory pub/help contains detailed information about the data available from the EBI anonymous FTP server, which includes the complete EMBL Nucleotide Sequence Database releases as well as daily and weekly updates and a cumulative update file (gzip compressed format) in the following directories:

EMBL quarterly release:  pub/databases/embl/release
EMBL updates:            pub/databases/embl/new

5.3 World Wide Web (WWW) Server

The EBI operates a WWW server at URL http://www.ebi.ac.uk/ providing information about the EBI and its products and services.

Data Retrieval: Nucleotide sequences can be retrieved by a simple query by accession number, or more complex queries can be constructed using the SRS databank browser at http://srs.ebi.ac.uk

Data Submission: Nucleotide sequences can be submitted to the database using the interactive submission system Webin at http://www.ebi.ac.uk/embl/Submission/webin.html

5.4 Sequence Similarity Search Servers

The EBI offers three network servers for sequence similarity searches via electronic mail or interactive WWW forms:

FASTA   based on W. Pearson's FASTA algorithm. Allows local similarity searches of protein and nucleotide sequence databases. Send "help" to fasta@ebi.ac.uk or use URL http://www.ebi.ac.uk/fasta3/

BLAST   based on the NCBI and WU-Blast software. Send "help" to blast@ebi.ac.uk or use URL http://www.ebi.ac.uk/blast2/

BLITZ   allows very fast searches of protein sequence databases for local similarities. The software used by the blitz service is based on MPsrch and Scanps. Send "help" to blitz@ebi.ac.uk or use URLs http://www.ebi.ac.uk/MPsrch/ and http://www.ebi.ac.uk/scanps/

6 DISTRIBUTION FILES

6.1 Release 73 Files

The release contains the files shown below. File sizes are given as numbers of records.
File    File Name       Description                               Number of
Number                                                            Records
------  --------------  ----------------------------------------  ---------
1       CRC.TXT         Checksum CRC uncompressed files           278
2       CRC_GZ.TXT      Checksum CRC compressed files             278
3       DELETEAC.TXT    Deleted accession numbers                 187325
4       EMBL.CON        Constructed Sequences                     13437
5       FTABLE.TXT      Feature Table Documentation               474
6       RELNOTES.TXT    Release Notes (this document)             1327
7       SUBINFO.TXT     Data Submission Documentation             390
8       UPDATE.TXT      Data Update Form                          107
9       USRMAN.TXT      User Manual                               1596
10      ACNUMBER.NDX    Accession Number Index                    20894252
11      CITATION.NDX    Citation Index                            3015185
12      DIVISION.NDX    Division Index                            25
13      KEYWORD.NDX     Keyword Index                             7539949
14      SHORTDIR.NDX    Short Directory Index                     50666308
15      SPECIES.NDX     Species Index                             7143659
16      EST_FUN01.DAT   EST Sequences                             6375143
17      EST_FUN02.DAT   EST Sequences                             2420352
18      EST_HUM01.DAT   EST Sequences                             7333014
19      EST_HUM02.DAT   EST Sequences                             7505165
20      EST_HUM03.DAT   EST Sequences                             7204627
21      EST_HUM04.DAT   EST Sequences                             7094572
22      EST_HUM05.DAT   EST Sequences                             7217072
23      EST_HUM06.DAT   EST Sequences                             7233645
24      EST_HUM07.DAT   EST Sequences                             7272291
25      EST_HUM08.DAT   EST Sequences                             6955801
26      EST_HUM09.DAT   EST Sequences                             6786550
27      EST_HUM10.DAT   EST Sequences                             7319274
28      EST_HUM11.DAT   EST Sequences                             7206978
29      EST_HUM12.DAT   EST Sequences                             7015107
30      EST_HUM13.DAT   EST Sequences                             7050431
31      EST_HUM14.DAT   EST Sequences                             6608002
32      EST_HUM15.DAT   EST Sequences                             7179386
33      EST_HUM16.DAT   EST Sequences                             7309111
34      EST_HUM17.DAT   EST Sequences                             7297241
35      EST_HUM18.DAT   EST Sequences                             7343197
36      EST_HUM19.DAT   EST Sequences                             7509606
37      EST_HUM20.DAT   EST Sequences                             7391010
38      EST_HUM21.DAT   EST Sequences                             7460626
39      EST_HUM22.DAT   EST Sequences                             7109359
40      EST_HUM23.DAT   EST Sequences                             7201641
41      EST_HUM24.DAT   EST Sequences                             6963596
42      EST_HUM25.DAT   EST Sequences                             7409104
43      EST_HUM26.DAT   EST Sequences                             7649761
44      EST_HUM27.DAT   EST Sequences                             7301340
45      EST_HUM28.DAT   EST Sequences                             7095929
46      EST_HUM29.DAT   EST Sequences                             7293403
47      EST_HUM30.DAT   EST Sequences                             7388960
48      EST_HUM31.DAT   EST Sequences                             7622217
49      EST_HUM32.DAT   EST Sequences                             8189837
50      EST_HUM33.DAT   EST Sequences                             7659621
51      EST_HUM34.DAT   EST Sequences                             7969240
52      EST_HUM35.DAT   EST Sequences                             7314376
53      EST_HUM36.DAT   EST Sequences                             7411549
54      EST_HUM37.DAT   EST Sequences                             7641701
55      EST_HUM38.DAT   EST Sequences                             7575333
56      EST_HUM39.DAT   EST Sequences                             8111579
57      EST_HUM40.DAT   EST Sequences                             7369420
58      EST_HUM41.DAT   EST Sequences                             7179631
59      EST_HUM42.DAT   EST Sequences                             7391983
60      EST_HUM43.DAT   EST Sequences                             7369009
61      EST_HUM44.DAT   EST Sequences                             7459552
62      EST_HUM45.DAT   EST Sequences                             7541066
63      EST_HUM46.DAT   EST Sequences                             7532994
64      EST_HUM47.DAT   EST Sequences                             6755787
65      EST_HUM48.DAT   EST Sequences                             6795182
66      EST_HUM49.DAT   EST Sequences                             6158759
67      EST_INV01.DAT   EST Sequences                             6605844
68      EST_INV02.DAT   EST Sequences                             6176252
69      EST_INV03.DAT   EST Sequences                             5705260
70      EST_INV04.DAT   EST Sequences                             5878111
71      EST_INV05.DAT   EST Sequences                             7174229
72      EST_INV06.DAT   EST Sequences                             7037426
73      EST_INV07.DAT   EST Sequences                             7363144
74      EST_INV08.DAT   EST Sequences                             6276630
75      EST_INV09.DAT   EST Sequences                             5797544
76      EST_INV10.DAT   EST Sequences                             7242011
77      EST_INV11.DAT   EST Sequences                             6765000
78      EST_INV12.DAT   EST Sequences                             7006928
79      EST_INV13.DAT   EST Sequences                             5742327
80      EST_INV14.DAT   EST Sequences                             5466496
81      EST_INV15.DAT   EST Sequences                             5411826
82      EST_INV16.DAT   EST Sequences                             5633017
83      EST_INV17.DAT   EST Sequences                             3111886
84      EST_MAM01.DAT   EST Sequences                             6414515
85      EST_MAM02.DAT   EST Sequences                             6783574
86      EST_MAM03.DAT   EST Sequences                             6957822
87      EST_MAM04.DAT   EST Sequences                             5077330
88      EST_MUS01.DAT   EST Sequences                             7588064
89      EST_MUS02.DAT   EST Sequences                             7833536
90      EST_MUS03.DAT   EST Sequences                             7487534
91      EST_MUS04.DAT   EST Sequences                             7051477
92      EST_MUS05.DAT   EST Sequences                             7659033
93      EST_MUS06.DAT   EST Sequences                             10008794
94      EST_MUS07.DAT   EST Sequences                             9133953
95      EST_MUS08.DAT   EST Sequences                             8597297
96      EST_MUS09.DAT   EST Sequences                             9964469
97      EST_MUS10.DAT   EST Sequences                             9962907
98      EST_MUS11.DAT   EST Sequences                             9933001
99      EST_MUS12.DAT   EST Sequences                             9833960
100     EST_MUS13.DAT   EST Sequences                             9752033
101     EST_MUS14.DAT   EST Sequences                             10006172
102     EST_MUS15.DAT   EST Sequences                             10746131
103     EST_MUS16.DAT   EST Sequences                             10873733
104     EST_MUS17.DAT   EST Sequences                             8640494
105     EST_MUS18.DAT   EST Sequences                             7605320
106     EST_MUS19.DAT   EST Sequences                             7435947
107     EST_MUS20.DAT   EST Sequences                             7572154
108     EST_MUS21.DAT   EST Sequences                             7076778
109     EST_MUS22.DAT   EST Sequences                             7430196
110     EST_MUS23.DAT   EST Sequences                             8246377
111     EST_MUS24.DAT   EST Sequences                             8048316
112     EST_MUS25.DAT   EST Sequences                             7560721
113     EST_MUS26.DAT   EST Sequences                             7274202
114     EST_MUS27.DAT   EST Sequences                             7938601
115     EST_MUS28.DAT   EST Sequences                             7723407
116     EST_MUS29.DAT   EST Sequences                             4204936
117     EST_PLN01.DAT   EST Sequences                             6849381
118     EST_PLN02.DAT   EST Sequences                             6406930
119     EST_PLN03.DAT   EST Sequences                             5983878
120     EST_PLN04.DAT   EST Sequences                             6321347
121     EST_PLN05.DAT   EST Sequences                             6714535
122     EST_PLN06.DAT   EST Sequences                             7390691
123     EST_PLN07.DAT   EST Sequences                             7089983
124     EST_PLN08.DAT   EST Sequences                             7045235
125     EST_PLN09.DAT   EST Sequences                             7227826
126     EST_PLN10.DAT   EST Sequences                             7462989
127     EST_PLN11.DAT   EST Sequences                             7296558
128     EST_PLN12.DAT   EST Sequences                             7182412
129     EST_PLN13.DAT   EST Sequences                             7183016
130     EST_PLN14.DAT   EST Sequences                             6710038
131     EST_PLN15.DAT   EST Sequences                             7211895
132     EST_PLN16.DAT   EST Sequences                             7675607
133     EST_PLN17.DAT   EST Sequences                             5991886
134     EST_PLN18.DAT   EST Sequences                             6690866
135     EST_PLN19.DAT   EST Sequences                             7896631
136     EST_PLN20.DAT   EST Sequences                             7315504
137     EST_PLN21.DAT   EST Sequences                             7186888
138     EST_PLN22.DAT   EST Sequences                             7118885
139     EST_PLN23.DAT   EST Sequences                             7362420
140     EST_PLN24.DAT   EST Sequences                             7211537
141     EST_PLN25.DAT   EST Sequences                             6207235
142     EST_PLN26.DAT   EST Sequences                             6276389
143     EST_PLN27.DAT   EST Sequences                             5788069
144     EST_PLN28.DAT   EST Sequences                             6383395
145     EST_PLN29.DAT   EST Sequences                             1934164
146     EST_PRO.DAT     EST Sequences                             41264
147     EST_ROD01.DAT   EST Sequences                             7213611
148     EST_ROD02.DAT   EST Sequences                             7349462
149     EST_ROD03.DAT   EST Sequences                             7883067
150     EST_ROD04.DAT   EST Sequences                             6220402
151     EST_UNC.DAT     EST Sequences                             723
152     EST_VRT01.DAT   EST Sequences                             6982845
153     EST_VRT02.DAT   EST Sequences                             5992420
154     EST_VRT03.DAT   EST Sequences                             7323311
155     EST_VRT04.DAT   EST Sequences                             7179850
156     EST_VRT05.DAT   EST Sequences                             7314440
157     EST_VRT06.DAT   EST Sequences                             5818079
158     EST_VRT07.DAT   EST Sequences                             6752343
159     EST_VRT08.DAT   EST Sequences                             7436496
160     EST_VRT09.DAT   EST Sequences                             7379526
161     EST_VRT10.DAT   EST Sequences                             7522403
162     EST_VRT11.DAT   EST Sequences                             6654927
163     EST_VRT12.DAT   EST Sequences                             4459845
164     FUN.DAT         Fungi Sequences                           6010827
165     GSS_FUN.DAT     Genome Survey Sequences                   6610633
166     GSS_HUM01.DAT   Genome Survey Sequences                   6016393
167     GSS_HUM02.DAT   Genome Survey Sequences                   6000629
168     GSS_HUM03.DAT   Genome Survey Sequences                   6148434
169     GSS_HUM04.DAT   Genome Survey Sequences                   6434740
170     GSS_HUM05.DAT   Genome Survey Sequences                   6510575
171     GSS_HUM06.DAT   Genome Survey Sequences                   6481559
172     GSS_HUM07.DAT   Genome Survey Sequences                   6580378
173     GSS_HUM08.DAT   Genome Survey Sequences                   6408229
174     GSS_HUM09.DAT   Genome Survey Sequences                   4350709
175     GSS_INV01.DAT   Genome Survey Sequences                   6846270
176     GSS_INV02.DAT   Genome Survey Sequences                   6946885
177     GSS_INV03.DAT   Genome Survey Sequences                   7373386
178     GSS_INV04.DAT   Genome Survey Sequences                   6498946
179     GSS_INV05.DAT   Genome Survey Sequences                   6298586
180     GSS_INV06.DAT   Genome Survey Sequences                   1218861
181     GSS_MAM01.DAT   Genome Survey Sequences                   7074821
182     GSS_MAM02.DAT   Genome Survey Sequences                   4162326
183     GSS_MUS01.DAT   Genome Survey Sequences                   7158644
184     GSS_MUS02.DAT   Genome Survey Sequences                   7182266
185     GSS_MUS03.DAT   Genome Survey Sequences                   7774907
186     GSS_MUS04.DAT   Genome Survey Sequences                   8163492
187     GSS_MUS05.DAT   Genome Survey Sequences                   7951661
188     GSS_MUS06.DAT   Genome Survey Sequences                   7608860
189     GSS_MUS07.DAT   Genome Survey Sequences                   7852442
190     GSS_MUS08.DAT   Genome Survey Sequences                   7739116
191     GSS_MUS09.DAT   Genome Survey Sequences                   7487387
192     GSS_MUS10.DAT   Genome Survey Sequences                   3065646
193     GSS_PHG.DAT     Genome Survey Sequences                   7930
194     GSS_PLN01.DAT   Genome Survey Sequences                   7452483
195     GSS_PLN02.DAT   Genome Survey Sequences                   7063977
196     GSS_PLN03.DAT   Genome Survey Sequences                   6371278
197     GSS_PLN04.DAT   Genome Survey Sequences                   5950592
198     GSS_PLN05.DAT   Genome Survey Sequences                   6141289
199     GSS_PLN06.DAT   Genome Survey Sequences                   5915003
200     GSS_PLN07.DAT   Genome Survey Sequences                   6375699
201     GSS_PLN08.DAT   Genome Survey Sequences                   6338531
202     GSS_PLN09.DAT   Genome Survey Sequences                   6339093
203     GSS_PLN10.DAT   Genome Survey Sequences                   6459290
204     GSS_PLN11.DAT   Genome Survey Sequences                   37654
205     GSS_PRO.DAT     Genome Survey Sequences                   788718
206     GSS_ROD01.DAT   Genome Survey Sequences                   6918296
207     GSS_ROD02.DAT   Genome Survey Sequences                   7113805
208     GSS_ROD03.DAT   Genome Survey Sequences                   7137747
209     GSS_ROD04.DAT   Genome Survey Sequences                   517234
210     GSS_VRL.DAT     Genome Survey Sequences                   126246
211     GSS_VRT01.DAT   Genome Survey Sequences                   7438011
212     GSS_VRT02.DAT   Genome Survey Sequences                   7328881
213     GSS_VRT03.DAT   Genome Survey Sequences                   3847978
214     HTC.DAT         High throughput cDNAs                     4413137
215     HTG_HUM01.DAT   High Throughput Genome Sequences          8690973
216     HTG_HUM02.DAT   High Throughput Genome Sequences          8880540
217     HTG_HUM03.DAT   High Throughput Genome Sequences          8693938
218     HTG_HUM04.DAT   High Throughput Genome Sequences          8317314
219     HTG_HUM05.DAT   High Throughput Genome Sequences          3361981
220     HTG_INV01.DAT   High Throughput Genome Sequences          2007807
221     HTG_INV02.DAT   High Throughput Genome Sequences          2239775
222     HTG_MAM.DAT     High Throughput Genome Sequences          2999564
223     HTG_MUS01.DAT   High Throughput Genome Sequences          9650351
224     HTG_MUS02.DAT   High Throughput Genome Sequences          9979200
225     HTG_MUS03.DAT   High Throughput Genome Sequences          10534863
226     HTG_MUS04.DAT   High Throughput Genome Sequences          2551549
227     HTG_OTHER.DAT   High Throughput Genome Sequences          103342
228     HTG_PLN.DAT     High Throughput Genome Sequences          6323749
229     HTG_ROD01.DAT   High Throughput Genome Sequences          13698142
230     HTG_ROD02.DAT   High Throughput Genome Sequences          13569850
231     HTG_ROD03.DAT   High Throughput Genome Sequences          13228869
232     HTG_ROD04.DAT   High Throughput Genome Sequences          12573760
233     HTG_ROD05.DAT   High Throughput Genome Sequences          12301822
234     HTG_ROD06.DAT   High Throughput Genome Sequences          12357597
235     HTG_ROD07.DAT   High Throughput Genome Sequences          9011240
236     HTG_VRT.DAT     High Throughput Genome Sequences          2854326
237     HTGO_HUM.DAT    High Throughput Genome Sequences phase 0  6633121
238     HTGO_MUS.DAT    High Throughput Genome Sequences phase 0  6707942
239     HTGO_OTHER.DAT  High Throughput Genome Sequences phase 0  42757
240     HUM01.DAT       Human Sequences                           5029177
241     HUM02.DAT       Human Sequences                           28372672
242     HUM03.DAT       Human Sequences                           11106654
243     HUM04.DAT       Human Sequences                           1045837
244     HUM05.DAT       Human Sequences                           1348766
245     HUM06.DAT       Human Sequences                           1094217
246     HUM07.DAT       Human Sequences                           1029938
247     HUM08.DAT       Human Sequences                           1039696
248     HUM09.DAT       Human Sequences                           9859535
249     HUM10.DAT       Human Sequences                           7897860
250     HUM11.DAT       Human Sequences                           1033036
251     HUM12.DAT       Human Sequences                           3064424
252     HUM13.DAT       Human Sequences                           1762084
253     HUM14.DAT       Human Sequences                           2064733
254     HUM15.DAT       Human Sequences                           809376
255     HUM16.DAT       Human Sequences                           564115
256     HUM17.DAT       Human Sequences                           625646
257     HUM18.DAT       Human Sequences                           1485653
258     HUM19.DAT       Human Sequences                           1435561
259     HUM20.DAT       Human Sequences                           1633138
260     HUM21.DAT       Human Sequences                           892976
261     HUM22.DAT       Human Sequences                           943399
262     HUM23.DAT       Human Sequences                           797317
263     HUM24.DAT       Human Sequences                           222725
264     INV01.DAT       Invertebrate Sequences                    10663300
265     INV02.DAT       Invertebrate Sequences                    6379812
266     INV03.DAT       Invertebrate Sequences                    650577
267     MAM.DAT         Other Mammal Sequences                    3628316
268     MUS.DAT         Mus musculus Sequences                    17438873
269     ORG.DAT         Organelle Sequences                       12425338
270     PAT01.DAT       Patent Sequences                          7435654
271     PAT02.DAT       Patent Sequences                          8392712
272     PAT03.DAT       Patent Sequences                          9484539
273     PAT04.DAT       Patent Sequences                          10411529
274     PAT05.DAT       Patent Sequences                          4141311
275     PHG.DAT         Bacteriophage Sequences                   348251
276     PLN.DAT         Plant Sequences                           17404353
277     PRO01.DAT       Prokaryote Sequences                      9125660
278     PRO02.DAT       Prokaryote Sequences                      6540736
279     PRO03.DAT       Prokaryote Sequences                      6329704
280     PRO04.DAT       Prokaryote Sequences                      1425835
281     ROD.DAT         Rodent Sequences                          1981634
282     STS.DAT         STS Sequences                             10660715
283     SYN.DAT         Synthetic Sequences                       647744
284     UNC.DAT         Unclassified Sequences                    145719
285     VRL.DAT         Viral Sequences                           13030513
286     VRT.DAT         Other Vertebrate Sequences                3901972

APPENDIX A DATABASE GROWTH TABLE

The following table shows the growth of the EMBL Nucleotide Sequence Database at each release.

Release  Month    Entries   Nucleotides
-------  -------  --------  -----------
1        06/1982  568       585433
2        04/1983  811       1114447
3        12/1983  1481      1654863
4        08/1984  1698      2147205
5        04/1985  2378      2874493
6        08/1985  4835      4567592
7        12/1985  5789      5622638
8        04/1986  6395      6353040
9        09/1986  7630      7813214
10       12/1986  8817      9766948
11       04/1987  11621     12189783
12       07/1987  12706     13638061
13       10/1987  14397     16023478
14       01/1988  15344     17272160
15       05/1988  17961     20318442
16       08/1988  19592     22625941
17       11/1988  20695     24211054
18       02/1989  22938     27249830
19       05/1989  24365     29066676
20       08/1989  26223     31240948
21       11/1989  28679     34748087
22       02/1990  31508     38165786
23       05/1990  34902     42923803
24       08/1990  37784     47354438
25       11/1990  41580     52900354
26       02/1991  43745     55859549
27       05/1991  46871     59915244
28       09/1991  54558     70448052
29       12/1991  57655     75400487
30       03/1992  63378     83574342
31       06/1992  72481     94390065
32       09/1992  79377     101292310
33       12/1992  89100     111413979
34       03/1993  99591     121420828
35       06/1993  108973    131880111
36       09/1993  127933    145401156
37       12/1993  146576    158171400
38       03/1994  167777    177550115
39       06/1994  182615    192195819
40       09/1994  209352    211017104
41       12/1994  230950    226259607
42       03/1995  303206    262559786
43       06/1995  420111    315840053
44       09/1995  506190    363273777
45       12/1995  622566    427620278
46       03/1996  701246    473691480
47       06/1996  827174    550739395
48       09/1996  928067    608931850
49       12/1996  1047263   696183789
50       03/1997  1187455   789755858
51       06/1997  1432941   931351601
52       10/1997  1787004   1181167498
53       12/1997  1917868   1281391651
54       03/1998  2125225   1427634373
55       06/1998  2330040   1607673907
56       09/1998  2689618   1904091473
57       12/1998  3046471   2164718256
58       03/1999  3272064   2355200790
59       06/1999  3952878   2924568545
60       09/1999  4719266   3543553093
61       12/1999  5303436   4508169737
62       03/2000  5865742   6120908677
63       06/2000  6760113   8255674441
64       09/2000  8344436   9650223037
65       12/2000  9549382   10710321435
66       03/2001  11169673  11916112872
67       06/2001  12044420  12821742622
68       09/2001  12964797  13727100206
69       12/2001  14366182  15383451165
70       03/2002  15851373  17807926047
71       06/2002  17226422  20020556107
72       09/2002  18324246  23090186146
73       12/2002  20857746  27903283528
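
The roughly 20% increase quoted for Release 73 can be checked against the last rows of the growth table. The short Python sketch below is illustrative only and is not part of the EMBL distribution. The names NUCLEOTIDES and percent_growth are hypothetical, and the totals are copied from the rows for Releases 71 to 73.

```python
# Nucleotide totals copied from the final rows of the growth table
# (Releases 71, 72 and 73).
NUCLEOTIDES = {
    71: 20_020_556_107,
    72: 23_090_186_146,
    73: 27_903_283_528,
}


def percent_growth(previous, current):
    """Return the percentage increase from `previous` to `current`."""
    return (current - previous) / previous * 100.0


if __name__ == "__main__":
    for release in (72, 73):
        pct = percent_growth(NUCLEOTIDES[release - 1], NUCLEOTIDES[release])
        print(f"Release {release}: {pct:.1f}% more nucleotides "
              f"than Release {release - 1}")
```

The same function can be applied to the Entries column to compare growth in entry counts with growth in nucleotides.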