Primary vs. Secondary Databases

Primary databases are repositories of “raw” data. These are also referred to as archival databases.

This is one of the most important functions of a database: to reliably store the data and make it accessible. Most protein sequences are predicted (i.e. annotated) from nucleotide sequence and are therefore not curated.

Secondary databases are repositories of “curated” data.

Curated databases require human review of some kind, in addition to some experimental verification of the biological meaning of sequence data.

International Nucleotide Sequence Database Collection

• European Molecular Biology Laboratory, EMBL (UK)

• GenBank (NCBI, USA)

• DNA Data Bank of Japan, DDBJ

All three organizations share 100% of their data. See Figure 1.1 in your text.

One of the consequences of data sharing is that file formats must correspond across the three databases.

Flatfiles

• In biological databases, a flat file is a text file that usually contains one (sequence) record.

• Flat files are the indivisible unit of all sequence databases, but the data in them can be displayed in a variety of formats.

• One of the most common formats for sequence records is called FASTA.

A closer look at Flatfiles

Name identifier: a unique identifier for each sequence. This is also known as the primary accession number

Length of mRNA

In this case, the sequence was submitted as an mRNA sequence. The “N” means nucleotide and the “M” means mRNA. See Box 1.2

i.e. not a circular molecule like a plasmid

Taxonomic code, not very useful anymore

Date when last updated.

The first line is called the header
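
The header fields described above (name, length, molecule type, topology, taxonomic code, date) can be pulled apart with a few lines of Python. This is only a minimal sketch: the `parse_locus` function, the `LocusLine` type, and the example line are hypothetical and made up for illustration, and it assumes a header in which all fields are present and separated by whitespace. Real flatfile headers follow a fixed-column layout and may omit fields, so a whitespace split will not handle every record.

```python
from typing import NamedTuple


class LocusLine(NamedTuple):
    """Fields of a flatfile header line, in the order they appear."""
    name: str        # name identifier of the record
    length: int      # sequence length in base pairs
    molecule: str    # type of molecule, e.g. mRNA
    topology: str    # linear or circular
    division: str    # three-letter taxonomic code
    date: str        # date of last update


def parse_locus(line: str) -> LocusLine:
    """Split a header line on whitespace (assumes every field is present)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("not a header line in the assumed 8-field layout")
    _, name, length, _, molecule, topology, division, date = fields
    return LocusLine(name, int(length), molecule, topology, division, date)


# Invented example line, not taken from a real database entry.
example = "LOCUS       EXAMPLE_0001    1500 bp    mRNA    linear   INV 01-JAN-2000"
record = parse_locus(example)
print(record.name, record.length, record.molecule, record.topology)
```

A named tuple keeps each field addressable by name, which mirrors how the slide labels each part of the header.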

Flatfiles continued

The second line is called the Definition Line, the goal of which is to summarize the essential biological information encoded by the entry.

1. Genus species Gene name

Note: Gene ontology can be confusing. In this case, the gene is named after a fruitfly mutant.

Basic description of structure and function

Type of molecule from which the sequence was derived. In reality, this would have been derived from a cDNA corresponding to a mRNA harvested from an embryonic cell

The most important entry.

Primary database reference to the sequence. If you use this sequence in a publication, cite this to refer readers to the database entry you used or created.

The version is very similar to the accession number, but if the sequence is updated, either because it was wrong or because it was incomplete, the number after the decimal point indicates the version.
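
The relationship between accession and version can be shown with a short sketch. The `split_version` helper and the identifier `EXAMPLE_0001.2` are hypothetical, used only to illustrate the "accession, decimal point, version number" pattern described above.

```python
from typing import Tuple


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'ACCESSION.N' into the accession and its integer version."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError("expected an identifier of the form ACCESSION.N")
    return accession, int(version)


# Invented identifier: the accession stays fixed, the suffix counts revisions.
accession, version = split_version("EXAMPLE_0001.2")
print(accession, version)
```

Comparing the integer suffixes of two identifiers with the same accession tells you which revision of the sequence is the more recent one.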

GenBank specific “geneinfo” identifier

Source organism

Pretty self explanatory, except the difference between SOURCE and ORGANISM is that the latter is hyperlinked so one can go and investigate more…

All GenBank entries must be associated with a citation

In essence, this ensures that the means by which the sequences were acquired have been peer reviewed, if not the sequence itself. This is what lends scientific credibility to the quality of these databases.

This is an EMBL accession number, which means that it was not originally submitted through the GenBank portal

The only feature common to all three primary databases is the source feature

All sequences must come from somewhere, so the minimum data (organisms and type of molecule) is entered here, with a link to the Taxonomy Browser.

The list of acceptable database cross references (i.e. db_xref) to external links is strictly controlled. In this case it is still within the “Entrez” webspace, but others are possible.
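
In the flatfile feature table, a cross reference appears as a qualifier of the form `/db_xref="database:identifier"`. A minimal sketch of reading one follows; the `parse_db_xref` helper is hypothetical and the qualifier value is an illustrative one rather than a line copied from the entry discussed here.

```python
from typing import Tuple


def parse_db_xref(qualifier: str) -> Tuple[str, str]:
    """Return (database, identifier) from a /db_xref="db:id" qualifier."""
    prefix = '/db_xref="'
    text = qualifier.strip()
    if not (text.startswith(prefix) and text.endswith('"')):
        raise ValueError("not a db_xref qualifier")
    value = text[len(prefix):-1]
    # Split on the first colon only: the identifier may itself contain colons.
    database, colon, identifier = value.partition(":")
    if not colon or not database or not identifier:
        raise ValueError("expected database:identifier inside the quotes")
    return database, identifier


# Illustrative qualifier pointing at a Taxonomy record.
print(parse_db_xref('/db_xref="taxon:7227"'))
```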

All annotated nucleotide entries contain a “virtual” translation into amino acid sequence

In this case, the translation is derived directly from an mRNA sequence, so there is a good chance it is correct, but if the translation comes from computationally derived genomic sequence, it should be validated against a curated database.
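
To make the idea of a “virtual” translation concrete, here is a minimal sketch that translates a coding sequence codon by codon using the standard genetic code. The `translate` function and the short test sequence are hypothetical. It assumes the sequence is already in the correct reading frame and contains only A, C, G and T. It is not how the databases generate their annotations.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing 1-2 bases are ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented coding sequence: ATG GCC TGG TAA
print(translate("ATGGCCTGGTAA"))
```

Because the translation is purely mechanical, a single wrong base in the nucleotide record changes the predicted protein, which is why validation against a curated database matters.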

And then, finally, the sequence data

So, flatfiles are informative, but what if you want to work with the sequence?

The sequence data in the flatfile can be displayed or downloaded in a variety of different ways. A FASTA file is a very common format.

The simplest possible FASTA file

>sequence
AGTCCGATCGATCGTAGCTACGTACGTACGTAGCTAGCTACGTACGTACGATCGATGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG

This FASTA sequence file has all of the necessary elements for a database entry, but it is not very informative.

For example, we don’t know what database it is from, what organism it has come from, what molecule it encodes (if any), etc.

FASTA format

The chevron symbol “>” is important because it denotes the beginning of a new sequence. This is particularly important if you are using a file that contains multiple sequences for a query search, for example.

>A sequence
CAGCTGACAGATCGTACGATCGATGCGCACGAAGCACTACTAGCTAGGT
>Another sequence
CGCTAGCTCGCGATCGTATCAACGCGCGCGCGCGCGCATACTCACGCGC
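
Because every record starts with “>”, a multi-sequence FASTA file can be split with very little code. The sketch below is a minimal, hypothetical `parse_fasta` function written for illustration. It assumes plain text input and does no validation of the sequence characters. For real work, an established library parser is usually the safer choice.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A chevron closes the previous record and opens a new one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


two_records = """>A sequence
CAGCTGACAGATCGTACGATCGATGCGCACGAAGCACTACTAGCTAGGT
>Another sequence
CGCTAGCTCGCGATCGTATCAACGCGCGCGCGCGCGCATACTCACGCGC
"""

for name, seq in parse_fasta(two_records):
    print(name, len(seq))
```

Joining the wrapped lines of each record means the same function works whether a sequence is stored on one line or broken into fixed-width lines.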

Protein sequence databases

Read Chapter one in the book, from “Protein Sequence Databases” to the end of the chapter.

With the exception of the Protein Data Bank, which is a primary database composed of experimentally determined protein structures, all other protein databases are considered to be either secondary databases or mixed primary and secondary databases, because they rely upon conceptual, or virtual, translation of nucleotide data.

GenPept is a secondary database, searchable through the “Protein” portal in Entrez. Caveat: errors in nucleotide sequence can be propagated.

UniParc is a mixed primary and secondary database, and therefore attempts to be a comprehensive repository of amino acid sequences. Curated,

Protein Data Bank is a primary database of protein structure determinations, using either X-ray crystallography or Nuclear Magnetic Resonance Spectroscopy.

Entrez Webspace

http://www.ncbi.nlm.nih.gov/books/NBK21101/

This book will be your best friend. It is a comprehensive online documentation volume that attempts to fill the gap between a straightforward search in PubMed or BLAST and more advanced tasks.

This webspace uses the concept of neighboring, which describes logical (i.e. natural) relationships between entries in one database and those in another.
