Software Development by the Genomic Standards Consortium

Presentation held at the M3 SIG meeting at ISMB 2009 in Stockholm. Its purpose is to show the audience the software development activities of the Genomic Standards Consortium. See also http://gensc.org

1

Bringing Standards to Life:

Software Development by the Genomic Standards Consortium

Renzo Kottmann Microbial Genomics Group

Max Planck Institute for Marine Microbiology

M3 SIG Stockholm July 2009

2

Genomic Standards Consortium (GSC)

Goal

• Promote mechanisms that standardize the description of genomes and the exchange and integration of genomic data

Open-membership, international working body

• Established in Sept 2005

• Participants include DDBJ, EMBL, GenBank, Sanger, JCVI, JGI, EBI and a range of US, UK and EU research institutions

• Organized a series of workshops

http://gensc.org and http://gensc.org/gc_wiki/index.php/GSC_Membership

3

Minimum Information about a Genome Sequence (MIGS) Specification

MIGS extends what DDBJ/EMBL/GenBank request upon submission of a genome sequence

• Examples:

Description of geographic location of a sample and habitat

“Minimum Information about a Metagenomic Sequence” (MIMS)

– Temperature

– pH

Description of sequence generation

– Sequencing method

– Assembly method

Field et al. Nat Biotechnol. 2008

MIGS Checklist 2.0

Field et al. Nat Biotechnol. 2008

5

MIGS Checklist 2.0

Field et al. Nat Biotechnol. 2008

M = mandatory
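To make the "M = mandatory" convention concrete, here is a minimal sketch of how a report could be checked for missing mandatory descriptors. The descriptor names and the choice of which ones are mandatory are hypothetical placeholders, not the actual MIGS 2.0 checklist items; the real list is given in Field et al. 2008.

```python
# Hypothetical descriptor names; the real MIGS 2.0 checklist defines its own.
MANDATORY_FIELDS = {
    "geographic_location",
    "habitat",
    "sequencing_method",
    "assembly_method",
}


def missing_mandatory(report: dict) -> list:
    """Return the mandatory descriptors that are absent or empty, sorted by name."""
    return sorted(
        field
        for field in MANDATORY_FIELDS
        if not str(report.get(field) or "").strip()
    )


report = {
    "geographic_location": "Kiel Fjord, Baltic Sea, Germany",
    "habitat": "Aquatic: Marine",
    "sequencing_method": "",  # present but empty, so it still counts as missing
}

print(missing_mandatory(report))  # ['assembly_method', 'sequencing_method']
```

A report is treated as compliant in this sketch only when the returned list is empty.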


6

Software Development for MIGS/MIMS

Mechanisms for achieving compliance are needed:

• Such mechanisms involve an appropriate reporting structure for capturing and exchanging data, software, databases, and controlled vocabularies and/or ontologies for defining the terms used in the annotations.

Field et al. Nat Biotechnol. 2008

10

Software Development for MIGS/MIMS

Mechanisms for achieving compliance are needed:

• Such mechanisms involve an appropriate reporting structure for capturing and exchanging data, software, databases, and controlled vocabularies and/or ontologies for defining the terms used in the annotations.

Supporting Projects:

• Habitat-Lite (Ontology specification)

• Genomic Rosetta Stone (Identifier Mapping)

• GCDML (MIGS/MIMS specification in XML)

• Genome Catalogue (Database and Web Server)

Field et al. Nat Biotechnol. 2008

11

Habitat-Lite (= EnvO-Lite)

Easy-to-use (small) set of terms

• Captures high-level information about habitat

• Derived from the Environment Ontology (EnvO).

Meet the needs of multiple users

• Annotators, database providers, biologists, and bioinformaticians alike who need to search and employ such data in comparative analyses.

Terms: Aquatic, Aquatic: Freshwater, Aquatic: Marine, Terrestrial, Air, Fossil, Food, Organism-Associated, Extreme Habitat, Other

Hirschman et al. OMICS. 2008

12

Habitat-Lite

Level 1

• Aquatic

• Aquatic: Freshwater

• Aquatic: Marine

• Terrestrial

• Air

• Fossil

• Food

• Organism-Associated

• Extreme Habitat

• Other

Level 2

• soil

• sediment

• sludge

• waste water

• hot spring

• hydrothermal vent

• biofilm

• microbial mat

< 20 terms

Hirschman et al. OMICS. 2008
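Because the vocabulary is so small, it can be handled as a simple lookup. The sketch below normalizes free-text labels to the terms listed on this slide; the helper name and the case- and whitespace-insensitive matching are illustrative assumptions, not part of the Habitat-Lite specification.

```python
# Terms as listed on the slide (Hirschman et al. 2008).
LEVEL_1 = [
    "Aquatic", "Aquatic: Freshwater", "Aquatic: Marine", "Terrestrial", "Air",
    "Fossil", "Food", "Organism-Associated", "Extreme Habitat", "Other",
]
LEVEL_2 = [
    "soil", "sediment", "sludge", "waste water", "hot spring",
    "hydrothermal vent", "biofilm", "microbial mat",
]

# Lookup from lower-cased spelling to the canonical term (illustrative helper).
_CANONICAL = {term.lower(): term for term in LEVEL_1 + LEVEL_2}


def normalize_habitat(label: str):
    """Return the canonical term for a label, or None if it is not in the list."""
    key = " ".join(label.split()).lower()  # collapse whitespace, ignore case
    return _CANONICAL.get(key)


print(normalize_habitat("aquatic:  marine"))  # Aquatic: Marine
print(normalize_habitat("Hot Spring"))        # hot spring
print(normalize_habitat("tundra"))            # None
```

Labels that return None would need manual curation or the "Other" term.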

13

Habitat-Lite applied

http://www.megx.net/genomes

14

Genomic Rosetta Stone (GRS)

Create a unified mapping between different genomic resources

Improve navigation across these resources

Enable the integration of this information in the near future.

Van Brabant et al. OMICS. 2008
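The idea of a unified identifier mapping can be sketched as an index over rows that link one genome's identifiers across resources. The resource names, identifier values, and function names below are invented for illustration and do not reflect the actual GRS data model.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]

# Each row links the identifiers of one genome across resources.
# Resource names and identifier values are invented for illustration.
ROWS: List[Row] = [
    {"resource_a": "A-0001", "resource_b": "B-17", "resource_c": "C-900"},
    {"resource_a": "A-0002", "resource_b": "B-42"},
]


def build_index(rows: List[Row]) -> Dict[Tuple[str, str], Row]:
    """Index every (resource, identifier) pair to the row that contains it."""
    index: Dict[Tuple[str, str], Row] = {}
    for row in rows:
        for resource, identifier in row.items():
            index[(resource, identifier)] = row
    return index


def translate(index: Dict[Tuple[str, str], Row],
              resource: str, identifier: str, target: str) -> Optional[str]:
    """Map an identifier in one resource to its counterpart in another, if known."""
    row = index.get((resource, identifier))
    return None if row is None else row.get(target)


index = build_index(ROWS)
print(translate(index, "resource_b", "B-17", "resource_c"))    # C-900
print(translate(index, "resource_a", "A-0002", "resource_c"))  # None
```

Starting from any known identifier, the same index gives access to all others recorded for that genome, which is the navigation use case named above.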

15

Genomic Rosetta Stone (GRS)

Van Brabant et al. OMICS. 2008

16

Genomic Rosetta Stone (GRS)

Enable the integration of this information in the near future

Van Brabant et al. OMICS. 2008

17

Genomic Contextual Data Markup Language (GCDML)

An Extensible Markup Language (XML) schema

Aim

• Implement MIGS/MIMS

• Provide even more descriptors

• Facilitate exchange and integration of genomic data

Kottmann et al. OMICS. 2008

18

GCDML Example (excerpt)

<gcdml:originalSample>
  <gcdml:physicalMaterial>
    <gcdml:samplingTime><gcdml:notGiven>unknown</gcdml:notGiven></gcdml:samplingTime>
    <gcdml:samplePointLocation>
      <gml:LocationKeyWord>Baltic Sea</gml:LocationKeyWord>
      <gml:LocationString>Kiel Fjord, Baltic Sea, Germany</gml:LocationString>
      <gcdml:pos2D>54.329 10.149</gcdml:pos2D>
      <gcdml:determinationMethod>derived from literature</gcdml:determinationMethod>
    </gcdml:samplePointLocation>
    <gcdml:marineHabitat>
      <gcdml:waterBody>
        <gcdml:depth>
          <gcdml:measure min="0.00" max="0.05"><gcdml:values uom="m">0.00 0.05</gcdml:values></gcdml:measure>
        </gcdml:depth>
      </gcdml:waterBody>
    </gcdml:marineHabitat>
    <gcdml:materialType>seawater</gcdml:materialType>
    <gcdml:amount><gcdml:measure><gcdml:values uom="ml">100</gcdml:values></gcdml:measure></gcdml:amount>
  </gcdml:physicalMaterial>
</gcdml:originalSample>

Kottmann et al. OMICS. 2008
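As an illustration of how such a report could be consumed by software, the sketch below reads the sampling position and depth from a shortened copy of the excerpt using Python's standard library. The excerpt on the slide omits the namespace declarations, so the two namespace URIs here are placeholders added only to make the fragment parseable; reading `pos2D` as "latitude longitude" is likewise an assumption based on the Kiel Fjord coordinates.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the slide does not show the real ones.
NS = {
    "gcdml": "http://example.org/gcdml",
    "gml": "http://example.org/gml",
}

FRAGMENT = """
<gcdml:originalSample xmlns:gcdml="http://example.org/gcdml"
                      xmlns:gml="http://example.org/gml">
  <gcdml:physicalMaterial>
    <gcdml:samplePointLocation>
      <gml:LocationString>Kiel Fjord, Baltic Sea, Germany</gml:LocationString>
      <gcdml:pos2D>54.329 10.149</gcdml:pos2D>
    </gcdml:samplePointLocation>
    <gcdml:marineHabitat>
      <gcdml:waterBody>
        <gcdml:depth>
          <gcdml:measure min="0.00" max="0.05"><gcdml:values uom="m">0.00 0.05</gcdml:values></gcdml:measure>
        </gcdml:depth>
      </gcdml:waterBody>
    </gcdml:marineHabitat>
    <gcdml:materialType>seawater</gcdml:materialType>
  </gcdml:physicalMaterial>
</gcdml:originalSample>
"""

root = ET.fromstring(FRAGMENT.strip())

# Free-text location and the coordinate pair (assumed order: latitude, longitude).
location = root.findtext(".//gml:LocationString", namespaces=NS)
lat, lon = (float(v) for v in root.findtext(".//gcdml:pos2D", namespaces=NS).split())

# Depth range and its unit of measure.
measure = root.find(".//gcdml:depth/gcdml:measure", NS)
depth_range = (float(measure.get("min")), float(measure.get("max")))
unit = measure.find("gcdml:values", NS).get("uom")

print(location, lat, lon, depth_range, unit)
```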

21

Genome Catalogue

Online system for capturing MIGS/MIMS compliant reports

Field et al. Nature 2008

22

Genome Catalogue

Requirements

• Rich toolkit, user-friendly

• Designed to give credit to all contributors

• XML-based (GCDML): able to maintain all versions of GCDML schemas

• Web services-based: supporting the automated exchange of content

• Serve as the international GCAT identifier authority

• Comprehensive: containing reports for all taxa and metagenomes

• Ontology-supportive

• Shared by the GSC

23

Current Status

We have specifications:

• MIGS/MIMS

• Habitat-Lite

• Genomic Rosetta Stone

Work on supporting software is ongoing:

• Genome Catalogue is in prototype status

• Funding: this is a long-term endeavour that cannot be done on a voluntary basis

24

Discussion

Software is needed for:

• Creation of MIGS/MIMS data

• Storage

• Analysis

Expand standardization efforts to

• Software specification/development

• Work on a standardized genomic data management architecture / cyberinfrastructure

Data-intensive science is successful if it works towards one community with one vision

• World Wide Genomics project

25

Acknowledgements

All Members of GSC incl. Dawn Field

Peter Sterk

Saul Kravitz

Tanya Gray

Megx.net team

Frank Oliver Glöckner

Ivaylo Kostadinov

Melissa Beth Duhaime

Pier Luigi Buttigieg

Wolfgang Hankeln

Pelin Yilmaz

26

END

Looking forward to the discussion


Join the GSC

http://gensc.org
