Submitting Data to Public Databases

Cold Spring Harbor Laboratory Press:
Selected Chapter

Genome Analysis: A Laboratory Manual Series Volume 1, Analyzing DNA
CHAPTER 7, Computational Analysis of DNA and Protein Sequences

Preparing a new submission
Updates and corrections
Special arrangements for large projects

NOTE: This is an old document. Links may no longer function correctly.
Some tables are available, but figures are not.

Researchers have an important responsibility to make their sequence data available to the scientific community. This is accomplished by submitting sequence data to a public sequence database. Historically, most funding agencies have encouraged this, and journals have supported it by requiring a database accession number as proof of submission as a condition of publication. (An accession number is a unique identifier for a particular sequence and allows for retrieval of the sequence and associated annotation.) Reference should be made to accession numbers when reporting new sequences or when describing experiments and analyses based on existing sequences. This is an essential component of research reproducibility in the electronic age. Accession numbers have used a format of one letter followed by five digits (e.g., U12345) since their inception, but the databases have recently had to adopt a second format of two letters followed by six digits (e.g., AA123456) to allow more numbers as the databases grow.
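
The two accession number formats just described can be checked mechanically. The following Python sketch is an illustration only: the function name classify_accession and the labels "1+5" and "2+6" are hypothetical, and the patterns cover only the two formats named in the text, so any other format is simply reported as unrecognized.

```python
import re
from typing import Optional

# Patterns derived only from the two formats described in the text:
# one letter + five digits (e.g. U12345), two letters + six digits (e.g. AA123456).
ONE_PLUS_FIVE = re.compile(r"[A-Z][0-9]{5}")
TWO_PLUS_SIX = re.compile(r"[A-Z]{2}[0-9]{6}")


def classify_accession(accession: str) -> Optional[str]:
    """Return "1+5", "2+6", or None if neither pattern matches.

    The labels are hypothetical names chosen for this sketch.
    """
    text = accession.strip().upper()
    if ONE_PLUS_FIVE.fullmatch(text):
        return "1+5"
    if TWO_PLUS_SIX.fullmatch(text):
        return "2+6"
    return None


if __name__ == "__main__":
    for candidate in ("U12345", "AA123456", "U1234"):
        print(candidate, classify_accession(candidate))
```

A check of this kind only confirms that a string looks like an accession number; it says nothing about whether a record with that number exists in the databases.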

As described in the Sequence Analysis section, GenBank, EMBL, and DDBJ are the three partners in an International Collaboration of Nucleotide Sequence Databases, and authors may submit their sequences to whichever of these databases is most convenient. The general E-mail addresses and WWW URLs used to submit data to these databases are:

Database	E-mail submissions	WWW submissions
DDBJ	ddbjsub@ddbj.nig.ac.jp	`http://sakura.ddbj.nig.ac.jp/`
EMBL	datasubs@ebi.ac.uk	`http://www.ebi.ac.uk/submission/webin.html`
GenBank	gb-sub@ncbi.nlm.nih.gov	`http://www3.ncbi.nlm.nih.gov/BankIt/`

As submitted data is exchanged between all three partners (as well as other sites) on a daily basis, authors should submit their sequences to only one of these locations. For example, once a sequence is sent to GenBank and is made public, the submission is automatically shared with EMBL and DDBJ during daily updates. Therefore, a newly-submitted sequence will be available within a few days at the other two sites as well. Overall, this arrangement allows for the best management of the data with optimum accessibility at sites located throughout the world. Also, in keeping with this exchange, it is best to refer to such published Accession Numbers in your publication (as a footnote if possible) in the following fashion:

        "These sequence data have been submitted to the
         DDBJ/EMBL/GenBank databases under Accession 
         Number AC123456"

Protein sequences contained in these records are also distributed to a number of protein databases (Swiss-Prot, PIR, and GenPept) in a similar fashion. The protein databases use the nucleotide databases as their primary source of new amino acid sequences, so you need not submit to them separately; they will obtain your sequences from the DNA sequence records.

New submissions

There are many different tools available for the submission of data to the data repositories described above. These three sites offer assistance in the process of data submissions and advice on specific situations. Here, we will present how NCBI interacts with persons submitting sequences, but keep in mind that all three sites have their own experts that can provide advice on submission, annotation, presentation and even data analysis issues.

At NCBI, the WWW-based tool BankIt has taken submitters of DNA sequences by storm. More than 80% of the submitted sequences (not including ESTs and entries from genome centers) are submitted using BankIt. BankIt is a simple, document-based tool that enables easy, step-by-step entry of sequence and associated biological information. This information is, in turn, converted into a format that allows for the rapid biological and computational checks performed on all sequence data submitted to GenBank.

A diminishing number of submissions still come from files prepared by a tool called Authorin. Available for both the Macintosh (albeit only the 16-bit variety) and PC (non-Windows) platforms, Authorin was among the first generation of submission tools, providing a standard interface through which sequences, and information related to those sequences, could be submitted. The ease of using the World Wide Web, however, has steered users to BankIt.

NCBI, in collaboration with the other nucleotide sequence databases, has recently released a beta version of Sequin, a specialized application for the submission of sequence data and associated annotation that can run either over a network or stand-alone (i.e., without a network connection). It is intended as a replacement for Authorin. Sequin differs from BankIt in having more sophisticated capabilities, with specialized editors allowing more accurate annotations, as well as built-in validation; every object that can be viewed in Sequin, either through a flatfile view or via the graphical interface, can be "clicked" on to bring up the appropriate editor. A simple submission can be generated from stored FASTA files in a matter of minutes. Sequin software is available for the Macintosh (all varieties), PC (Windows 16/32 bit), and UNIX platforms.
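
For readers unfamiliar with the FASTA files mentioned above, the following Python sketch shows what such a file contains: a header line beginning with ">" followed by one or more lines of sequence. It is a minimal reader written for illustration, not a description of how Sequin itself reads files; the function name read_fasta and the example records are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a dict of identifier -> sequence.

    The identifier is taken to be the first whitespace-delimited word
    after ">", which is an assumption of this sketch.
    """
    records: Dict[str, str] = {}
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            words = line[1:].split()
            if not words:
                raise ValueError("FASTA header without an identifier")
            name = words[0]
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before any header line")
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [">clone1 hypothetical example", "ACGTAC", "GTTA", ">clone2", "ggcc"]
    print(read_fasta(example))
```

To read a file on disk, an open file object can be passed directly, since it yields one line at a time.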

As mentioned above, sequences submitted to GenBank go through a series of validation checks conducted by the professional database staff. These quality assurance procedures are designed to detect vector and other sequence contaminants (e.g., mitochondrial sequences, which should be absent from nuclear-encoded genes) and mistranslations of coding sequences, and to ensure proper taxonomic classification. As a final check, all records submitted through the direct submission process at GenBank are checked by at least one Ph.D.-level molecular biologist. Accession numbers are issued to the submitters in less than 24 hours, and provisional GenBank records (e.g., Appendix I) are made available for approval prior to public release. Authors may request that their data remain confidential until publication.

The most important component of a submission is, of course, the new sequence itself. Those using GenBank data want to be assured that submitted data are free of sequencing errors, cloning artifacts, and the like; routine checks can be done in this regard. With the advent of "single pass" strategies for both cDNA and genomic data, however, a great deal of lower quality, but nonetheless very useful, data has been included in special divisions of GenBank. In these cases it is important for the submitter to provide an estimate of sequencing accuracy. The databases have shown foresight in creating divisions that sequester subsets of the sequences, so that they can be manipulated, interpreted, and used in the specific ways that give the maximum benefit. This has been the case for ESTs, STSs, and the recent GSS (Genome Survey Sequences) division, which represents low-pass genomic sequences coming from a variety of genome projects. Although the benefit of the EST division is not challenged today, it was not long ago that the usefulness of this class of data was questioned.

What is the basic information one needs to know about a sequence before submitting it to GenBank? There are many features one can attach to a sequence, as detailed in the GenBank Feature Table Document (available on the WWW at http://www.ncbi.nlm.nih.gov/collab/FT/index.html or by anonymous ftp at ncbi.nlm.nih.gov, directory /genbank/docs). The list of features and their qualifiers can seem quite overwhelming, but the list is designed (and updated) to clearly and unambiguously cover all possible needs. The most convenient way to become familiar with the GenBank annotation system is to examine some existing GenBank records by looking them up using either Entrez or the QUERY E-mail server (Table 1). As an example, the GenBank record for the hereditary breast cancer gene, BRCA1, is shown in Appendix I. A larger and more complex example is the record for the breakpoint cluster region (BCR) gene on human chromosome 22. Lastly, the GenBank entries for positionally cloned genes are given in Table 7. Not all of these are sterling examples of well-done records, but they do give a good sense of the variety of types of information that GenBank annotations can accommodate.

One of the most important characteristics of a DNA sequence is its coding potential (the CDS feature in GenBank jargon). All mRNA (cDNA) sequences and exons should have their coding regions precisely specified, and the conceptual translations should be supplied. An equally important and very useful piece of information, when known, is the gene name and product name for the feature of interest. Storing these items in the GenBank record is the best way to ensure that the encoded protein is represented in the protein sequence databases without separate submissions. Proper CDS annotation also ensures that the records will be properly linked in Entrez.
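
To make the idea of a conceptual translation concrete, the following Python sketch translates a coding region given by its start and end positions. It is an illustration under stated assumptions, not the procedure used by GenBank or by any submission tool: it uses the standard genetic code only, treats the coordinates as 1-based and inclusive, handles a single uninterrupted interval on the given strand (no joined exons, no reverse complement), and the function name translate_cds and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, written compactly in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the interval start..end (1-based, inclusive).

    Translation stops at the first stop codon; codons containing
    ambiguous bases are reported as "X".
    """
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up sequence with a CDS at positions 3..14 (ATG GCC AAA TGA).
    print(translate_cds("GGATGGCCAAATGACC", 3, 14))  # prints MAK
```

Comparing a translation of this kind with the intended protein is a quick way to catch a coding region whose coordinates are off by one or two bases before submission.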

Updates and corrections

In addition to new data, authors are strongly encouraged to submit updates and corrections to their GenBank records. A submission remains the author's publication, and although the databases do play an editorial role in maintaining a certain standard, only biologists themselves can help maintain the quality of the biological information in the databases. For instance, if you now know a gene name, or have a better understanding of the enzyme for which a sequence was previously submitted (e.g., you now have an E.C. number for it), it is highly desirable that the databases be informed so that the records can be properly updated.

As the genomes of various organisms are completed, errors, omissions, and misplacements will become apparent to some users, and notifying the databases will be critical in ensuring the quality of the databases. It will become the responsibility of all to participate in the gigantic task of maintaining order in this sea of data. The databases themselves will have to be very diligent in keeping up with the flow of information, but users will also need to direct the focus of the databases, as they are the ones who use this information; unless users voice their opinions, the databases have no way to know what may be needed (although they will probably hazard a guess ;-).

Any user of the databases who notices a problem, error, or omission in a given entry is therefore encouraged to bring the discrepancy to the attention of the GenBank staff. For example, it is not uncommon for submitters to forget to notify GenBank that a confidential sequence has been published and can be made public; this is almost certainly the case when the entry for a published GenBank accession number cannot be retrieved. GenBank will release the entry when the complete journal citation, including the full title of the paper, is sent via E-mail to update@ncbi.nlm.nih.gov or when the first page of the article and the page containing the cited accession number are sent via FAX to NCBI at 301-480-9241. Updates can also be submitted through the World Wide Web using BankIt's Update option.

The updated information need be sent to only one of the databases, and the information will be shared with the other collaborating databases as described earlier. The update E-mail addresses for each of the collaborating databases are:

Database	E-mail update	WWW update
DDBJ	ddbjupdt@ddbj.nig.ac.jp	Use E-mail for DDBJ updates
EMBL	update@ebi.ac.uk	`http://www.ebi.ac.uk/ebi_docs/update.html`
GenBank	update@ncbi.nlm.nih.gov	`http://www3.ncbi.nlm.nih.gov/BankIt/`

Special arrangements for large projects

Over the past several years, large projects whose goal is to sequence entire chromosomes or genomes have become more common, and "single pass" survey sequencing is generating tens of thousands of new sequences every month. The database submission needs for these projects are not conveniently met by the standard methods discussed so far. Large, well-organized laboratories or consortia that are carrying out this work usually have their own sophisticated informatics capabilities and laboratory information management systems. In this milieu, the professional database staffs become close collaborators in getting the information into the public repositories and making it available in a form that the entire scientific community can benefit from.

Several years ago, NCBI devised special streamlined submission procedures for rapidly accumulating EST and STS data and has also worked with major sequencing groups to provide convenient interfaces between local laboratory information management systems (such as ACeDB) and GenBank. EBI and DDBJ have also made special arrangements with "high-throughput" sequencing laboratories. If the data handling requirements of your group exceed the capacity or capabilities of existing submission tools, the GenBank staff can discuss alternatives that will ensure accurate, efficient, and timely submission, annotation, and distribution of such sequence data.

With data throughput that may soon approach hundreds of megabases a year, a major challenge will be to provide up-to-date annotation for this sequence data. A new class of data will soon be present in the database: data from High Throughput Genome Sequencing Centres, which will be placed in the HTG division. These data will consist primarily of large records (more than 100 kb) and will be automatically annotated for repeats, structural RNAs, and similarity hits. The records will initially be updated frequently (once or twice a month) and will be retrievable via the normal channels (e.g., Entrez), with all the typical neighbor information, but they will not be anchored to the various genetic maps present in the genomes division in Entrez.

The public databases will necessarily have an increasingly important role in this endeavor. Keeping homology information and links to relevant literature current, as in the Entrez system, will be a useful and essential approach to this task.
