Metadata-driven tools to access data from the ENCODE project
Esther T. Chan1, Cricket A. Sloan1, Eurie L. Hong1, Venkat S. Malladi1, Laurence D. Rowe1, J. Seth Strattan1, Jean M. Davidson1, Marcus Ho1, Nikhil R. Podduturi1, Benjamin C. Hitz1, Forrest Tanaka1, Brian J. Lee2, Katrina Learned2, Matt Simison1, W. James Kent2, J. Michael Cherry1
1) Department of Genetics, School of Medicine, Stanford University, Stanford, CA 94305; 2) Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA, 95064
The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Its current data corpus exceeds 4000 experiments across more than 400 cell lines and tissues, using a wide array of experimental techniques to survey the chromatin structure and the regulatory and transcriptional landscape of both genomes. All ENCODE experimental data, metadata and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. As the volume of data increases, identifying and organizing data sets for presentation to users becomes challenging. Here, we describe the web interface, search tools and underlying database that the DCC has built for simple and intuitive access to ENCODE data and metadata. Extensive, structured metadata collected from ENCODE data producers describe experimental variables such as the biological samples, specific reagents, and protocols necessary to replicate the assays and their subsequent analysis. These metadata drive a powerful faceted browsing interface that allows users to filter and retrieve particular slices of the large data corpus. Elasticsearch-driven real-time indexing also allows users to perform full-text searches to directly access specific data of interest. Upcoming features, planned as part of the revamped ENCODE public portal, include access to data standards, quality metrics for data files, data visualization tools, and uniform processing and analysis pipelines. Data and metadata from the ENCODE project can currently be accessed at https://www.encodeproject.org.
APPROACH
II. Detailed capture of metadata in JSON
BIOSAMPLE
• Type (e.g. tissue, cell line)
• Source
• Product id
• Lot id
• Dates (e.g. growth, harvest, procurement)
• Passage number
• Starting amount
• Lab assigned IDs
• Link to donor
HUMAN DONOR
• Species
• Age
• Sex
• Life stage
• Health status
• Ethnicity
• De-identified ID
LIBRARY
• Lysis method
• Sonication method
• Extraction method
• Nucleic acid type
• Nucleic acid size range
• Strand specificity
• Size selection method
• Protocol document
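As a minimal sketch of how the properties listed above might be expressed as JSON attribute-value pairs, the Python snippet below builds a biosample record that links to a donor and serializes it. All field names, identifiers and values are illustrative assumptions and do not reproduce the DCC's actual schema.

```python
import json

# Hypothetical donor record; keys paraphrase the HUMAN DONOR properties above.
donor = {
    "id": "donor-1",             # placeholder de-identified ID
    "species": "Homo sapiens",
    "sex": "female",
    "life_stage": "adult",
}

# Hypothetical biosample record. "donor" stores a link (the donor's ID)
# rather than a nested copy, so one donor can be shared by many biosamples.
biosample = {
    "type": "tissue",
    "source": "example-vendor",  # made-up source name
    "product_id": "P-0001",      # made-up product id
    "lot_id": "L-0001",          # made-up lot id
    "passage_number": None,      # not applicable to tissue in this sketch
    "donor": donor["id"],
}


def to_json(record):
    """Serialize a metadata record as stable, human-readable JSON."""
    return json.dumps(record, indent=2, sort_keys=True)


def from_json(text):
    """Parse JSON text back into a Python dictionary."""
    return json.loads(text)


if __name__ == "__main__":
    print(to_json(biosample))
```

Storing the donor as a link instead of an embedded object mirrors the "Link to donor" property in the biosample list; whether the portal resolves such links on output is not stated in the poster.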
III. Relationships between metadata objects reflect underlying experimental processes
[Diagram: metadata objects connected by "has" relationships, with cardinalities: EXPERIMENT, REPLICATE [1..n], LIBRARY [1..n], BIOSAMPLE [1..n], DONOR [1..n], FILES, FILE [0..n], ANTIBODY [0..1], CONSTRUCT [0..n], TREATMENT [0..n], RNAi [0..1]]
I. Principles driving metadata definition
• Provide transparency about the experimental process
• Communicate key variables of each experiment
• Capture the data provenance of computational analyses
• Use ontologies and controlled vocabularies to standardize terminology and promote interoperability with other data resources
ENCODE portal: https://www.encodeproject.org
Help documentation: https://www.encodeproject.org/help/getting-started
Code repository: http://www.github.com/ENCODE-DCC
@ENCODE-DCC
encode-[email protected]
Browse data collections by assay type, biosamples, antibodies and annotations.
| Data type | Description | ENCODE accession format |
| --- | --- | --- |
| Experiment | An ENCODE produced experiment with 2 or more biological replicates.* | ENCSR###XXX |
| Dataset | A collection of data files, e.g. associated with an ENCODE analysis or publication. | ENCSR###XXX |
| Biosample | A distinct growth of a cell line, excised tissue, or whole organism used in an assay. | ENCBS###XXX |
| Antibody lot | A distinct lot of an antibody, identified by its product ID, lot number and source, used in an assay. | ENCAB###XXX |
| Donor | A distinct donor or strain from which the biosample was obtained. | ENCDO###XXX |
| File | A data file (raw or processed) with a unique md5sum. | ENCFF###XXX |
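The accession formats above lend themselves to a simple pattern check. The sketch below assumes that "###" stands for three digits and "XXX" for three uppercase letters, which the poster does not state explicitly; the function and constant names are hypothetical.

```python
import re

# Two-letter codes taken from the table above. ENCSR is used for both
# experiments and datasets, so it maps to two possible data types.
ACCESSION_TYPES = {
    "SR": ("Experiment", "Dataset"),
    "BS": ("Biosample",),
    "AB": ("Antibody lot",),
    "DO": ("Donor",),
    "FF": ("File",),
}

# Assumption: "###" means three digits and "XXX" three uppercase letters.
ACCESSION_RE = re.compile(r"ENC(SR|BS|AB|DO|FF)(\d{3})([A-Z]{3})")


def accession_types(accession):
    """Return the possible data types for an accession, or () if malformed."""
    match = ACCESSION_RE.fullmatch(accession)
    if match is None:
        return ()
    return ACCESSION_TYPES[match.group(1)]
```

For example, a placeholder string such as `ENCSR000AAA` would be classified as an experiment or a dataset, while a string with an unknown prefix or the wrong length yields an empty tuple.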
USING THE ENCODE DATA PORTAL
I. Navigate the home page for the ENCODE portal
Find data for over 4000 experiments across more than 1700 different biosamples and 300+ antibodies.
II. Browse data collections, e.g. assay types.
III. Search by term
A faceted browsing interface allows users to filter data collections and search results by relevant experimental metadata properties.
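Faceted browsing can be illustrated with a small in-memory sketch: each selected facet value narrows the result set, and the remaining records are counted per facet value to show what further filters are available. The records and property names below are made up for illustration, and this is not the portal's Elasticsearch-based implementation.

```python
from collections import Counter

# Hypothetical experiment records; the property names are illustrative.
EXPERIMENTS = [
    {"assay": "ChIP-seq", "biosample": "skin", "target": "CTCF"},
    {"assay": "ChIP-seq", "biosample": "liver", "target": "CTCF"},
    {"assay": "RNA-seq", "biosample": "skin", "target": None},
    {"assay": "DNase-seq", "biosample": "liver", "target": None},
]


def apply_filters(records, filters):
    """Keep records whose properties match every selected facet value."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in filters.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record.get(facet) for record in records)
```

Selecting `{"assay": "ChIP-seq"}` and then calling `facet_counts` on the result for `"biosample"` gives the per-biosample counts a user would see next to each remaining filter option.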
Visualize signal and segment files via trackhubs in a genome browser.
IV. Search by accession
…
FEATURES IN DEVELOPMENT
Download raw and processed data files directly from the experiment page. Batch downloads of files coming soon.
https://www.encodeproject.org
See poster 1653S for more details on programmatic data retrieval using a REST API.
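As a rough sketch of what programmatic retrieval over HTTP could look like, the snippet below builds a search URL and requests JSON using only the Python standard library. The `/search/` path and the query parameter names (`type`, `searchTerm`, `limit`, `format`) are assumptions not taken from this poster; poster 1653S and the portal's help documentation remain the authoritative reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PORTAL = "https://www.encodeproject.org"


def search_url(term, object_type="Experiment", limit=5):
    """Build a search URL. The path and parameter names are assumptions."""
    query = urlencode({
        "type": object_type,
        "searchTerm": term,
        "limit": limit,
        "format": "json",
    })
    return f"{PORTAL}/search/?{query}"


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body (requires network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(search_url("CTCF")) to query.
    print(search_url("CTCF"))
```

Keeping URL construction separate from the network call makes the query logic easy to inspect and test without contacting the server.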
Access the underlying ENCODE data referenced in a publication directly using the unique accessions.
Enter a search term for a biosample (e.g. skin), an assay name (e.g. ChIP-seq), or a protein target of an antibody (e.g. CTCF).
II. Search by region of interest
I. ENCODE standard analysis pipelines
Find ENCODE datasets overlapping a region of interest, specified by genomic coordinates, rs ID (SNP), gene name, etc.
Figure 1 from Boyle et al. Genome Res. 2012 Sep;22(9):1790-7
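The core of a region search is an interval-overlap test. The sketch below assumes half-open, 0-based coordinates and a list of named features; both the convention and the feature data are assumptions for illustration. A query by rs ID or gene name would presumably first be resolved to coordinates, a step not shown here.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlapping(query, features):
    """Return names of features whose region overlaps the query region."""
    return [name for name, region in features if overlaps(query, region)]


# Hypothetical features; names and coordinates are made up for illustration.
FEATURES = [
    ("peak_A", Region("chr1", 100, 200)),
    ("peak_B", Region("chr1", 200, 300)),
    ("peak_C", Region("chr2", 150, 250)),
]
```

With half-open intervals, two regions that merely touch (one ends at 200, the next starts at 200) do not overlap, which avoids double-counting adjacent features. A linear scan is enough for a sketch; a production service would more likely use an index.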
Primary Data (fastq)
Mapped Reads (bam)
QA Metrics
Signal detection
Find documentation on standards, software, formats and how-tos.
JSON is an open-standard, human- and machine-readable data exchange format expressed as attribute-value pairs.
The ENCODE Data Analysis Center is defining standard analysis pipelines for ChIP-seq, RNA-seq, DNase-seq and whole genome bisulfite-seq. These ENCODE pipelines are being implemented on the DNAnexus cloud platform and will be available for users to run on their own data.