Page 1: Text Mining and Environmental Metadata Suggestion

ENA – 1st Dec 2014 – EBI, UK

Evangelos Pafilis

Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC)

Hellenic Centre for Marine Research (HCMR), Heraklio Crete, Greece

[email protected], http://epafilis.info

Text Mining and Environmental Metadata Suggestion

Page 2: Text Mining and Environmental Metadata Suggestion

Species – Environments

Page 3: Text Mining and Environmental Metadata Suggestion

Comparative Analysis
•  Location
•  Environment
•  Time Period

Image from http://theresilientearth.com/

Coral Reefs

?

Page 4: Text Mining and Environmental Metadata Suggestion

Not Trivial

Page 5: Text Mining and Environmental Metadata Suggestion

Slide by Dr. P. Yilmaz, http://www.arb-silva.de/projects/contextual-data/

Page 6: Text Mining and Environmental Metadata Suggestion

Metadata

Meta- = Μετά (“after”)

=> data “after” data

=> data describing data

Essential Context Information

Page 7: Text Mining and Environmental Metadata Suggestion

A clear definition that can be interpreted in many, sometimes conflicting, ways

Page 9: Text Mining and Environmental Metadata Suggestion

Community Standards

•  Standards (such as MIxS, MIMARKS); see http://gensc.org/gc_wiki/index.php/GSC_Publications for a comprehensive list of publications
•  Capture genomic/metagenomic and other types of sequence contextual information
•  Include detailed guidelines on how to annotate a sample (e.g. Yilmaz P et al. (2011) The ISME Journal 5: 1565–1567)

http://gensc.org/

Page 10: Text Mining and Environmental Metadata Suggestion

P. Yilmaz et al., Nat Biotech 29, 415–420 (2011)

Page 11: Text Mining and Environmental Metadata Suggestion

source: http://wiki.gensc.org/index.php?title=MIMARKS

Page 12: Text Mining and Environmental Metadata Suggestion

http://www.tomorrowstarted.com/2013/01/how-a-key-works/.html

Page 13: Text Mining and Environmental Metadata Suggestion

•  Project descriptions

•  Scientific-content web pages

•  Full text scientific articles

•  Literature abstracts

•  In-house documents

Page 14: Text Mining and Environmental Metadata Suggestion

Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.

Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)

Page 15: Text Mining and Environmental Metadata Suggestion

Looking up terms:

Intensive, learning curve

Page 16: Text Mining and Environmental Metadata Suggestion

Literature Mining

Page 17: Text Mining and Environmental Metadata Suggestion

Processing text to extract facts of interest

Page 18: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS

Page 19: Text Mining and Environmental Metadata Suggestion

terrestrial, aquatic, marine, lagoon, coral reef, sediment, freshwater, soil

ENVIRONMENTS: ENVO term identification in text

Page 20: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS: ENVO term identification in text

Page 21: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS: ENVO term identification in text

ID: ENVO:00000150 Name: coral reef

Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.

Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
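
The identification step shown on this slide can be sketched as a dictionary lookup over the text. This is a minimal illustration, not the ENVIRONMENTS tagger itself: the `DICTIONARY` contents, the `tag` function and the hand-added plural entry are assumptions made for the example; only the identifier ENVO:00000150 (coral reef) comes from the slide.

```python
import re

# Toy dictionary: surface form -> ENVO identifier.
# ENVO:00000150 (coral reef) is from the slide; the plural entry is
# added by hand for this illustration.
DICTIONARY = {
    "coral reef": "ENVO:00000150",
    "coral reefs": "ENVO:00000150",
}


def tag(text, dictionary=DICTIONARY):
    """Return (start, end, matched_text, envo_id) tuples; end is inclusive."""
    # Longer names go first so "coral reefs" is preferred over "coral reef".
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,  # case-insensitive matching
    )
    hits = []
    for m in pattern.finditer(text):
        envo_id = dictionary[m.group(0).lower()]
        hits.append((m.start(), m.end() - 1, m.group(0), envo_id))
    return hits


description = "Microbes are key players in both healthy and degraded coral reefs."
for hit in tag(description):
    print(hit)
```

Reporting character offsets alongside the identifier keeps each tag traceable to the exact span of text it came from.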

Page 23: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS
http://environments.hcmr.gr
http://environments-eol.blogspot.gr/

●  Dictionary based
●  Open source
●  Environment Ontology
●  Fast performance
●  4000 PubMed abstracts / second *
●  Based on the SPECIES name recognition tagger (Pafilis et al., PLoS ONE)
●  E600 gold standard: ENVO-based corpus of EOL Species pages
●  Recognition accuracy, mention level:
    - F1: 82.0%
    - 87.1% of the TPs: exact ID among the predicted ones
●  Submitted preprint: http://biorxiv.org/content/early/2014/11/13/011403

Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390

*: based on a single-thread run on an Intel 2.27 GHz machine with 24 GB RAM, processing a set of 536,052 abstracts
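
For readers less familiar with the figures on this slide, the sketch below shows how a mention-level F1 score is computed and what the quoted throughput would imply for the benchmark set. The counts passed to `f1_score` are hypothetical and chosen only to make the arithmetic easy to follow; they are not the E600 evaluation counts. The run-time figure is a back-of-the-envelope estimate derived from the two numbers on the slide, not a reported measurement.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def estimated_seconds(n_documents, documents_per_second):
    """Rough run time implied by a throughput figure."""
    return n_documents / documents_per_second


# Hypothetical counts, for illustration only.
print(round(f1_score(true_positives=8, false_positives=2, false_negatives=2), 3))

# Slide figures: 536,052 abstracts at about 4000 abstracts per second.
print(round(estimated_seconds(536_052, 4000)), "seconds (approximate)")
```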

Page 24: Text Mining and Environmental Metadata Suggestion

biome

environmental feature

environmental material

environmental condition

habitat … … … … …

Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany

http://environmentontology.org ~1600 terms, June 2013

ENVO: source of environment descriptor names and synonyms

Page 25: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS – Improving Accuracy

●  Increasing matches in text
    ●  Orthographic variation supported, e.g. freshwater, fresh water, and fresh-water
    ●  Case-insensitive matching
    ●  Synonym generation to reflect the way environment descriptive terms are mentioned in text (both generic and ENVO specific)
●  Preventing overmatching (i.e. avoiding increased FP)
    ●  “Stopword list” (e.g. spring, well, range)

Action                                                           Example
Add a variant in which non-informative words have been removed   epipelagic zone → epipelagic; estuarine biome → estuarine
Plural form addition                                             sediment → sediments
Adjective form addition                                          lagoon → lagoonal
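
The actions listed on this slide can be sketched in a few lines. This is an illustrative approximation, not the tagger's actual rules: the `NON_INFORMATIVE` and `ADJECTIVES` lookups, the naive "add an s" plural rule, and the choice to skip stopword-listed names entirely are all assumptions made for the example. The example terms themselves come from the slide.

```python
import re

STOPWORDS = {"spring", "well", "range"}   # examples from the slide
NON_INFORMATIVE = {"zone", "biome"}       # assumed list, for illustration
ADJECTIVES = {"lagoon": "lagoonal"}       # hand-written lookup, for illustration


def expand(name):
    """Generate surface variants for one dictionary name."""
    name = name.lower()
    if name in STOPWORDS:
        # Ambiguous everyday words are skipped entirely in this sketch.
        return set()
    forms = {name}
    head, _, last = name.rpartition(" ")
    if head and last in NON_INFORMATIVE:
        forms.add(head)                   # epipelagic zone -> epipelagic
    if not name.endswith("s"):
        forms.add(name + "s")             # sediment -> sediments (naive rule)
    if name in ADJECTIVES:
        forms.add(ADJECTIVES[name])       # lagoon -> lagoonal
    return forms


def to_pattern(form):
    """Case-insensitive pattern that tolerates space, hyphen or no separator."""
    parts = [re.escape(part) for part in re.split(r"[ -]", form)]
    return re.compile(r"\b" + "[ -]?".join(parts) + r"\b", re.IGNORECASE)


print(sorted(expand("epipelagic zone")))
print(sorted(expand("lagoon")))
print(bool(to_pattern("fresh water").search("a Fresh-water lake")))
```

Handling orthographic variation in the pattern rather than in the dictionary keeps the variant list short, at the cost of also accepting unusual spellings such as a joined form of any two-word name.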

Page 26: Text Mining and Environmental Metadata Suggestion

Scope

ENVO parts not included: species, tissues, foods

Limitations – Known Issues

●  Negation not supported
●  Conflicts with anatomy terms (e.g. mouth, blowhole)

Page 27: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS – Sample Output

Update to EOLTAGS 346289845

File Name                        Start coord  End coord  Match text  ENVO ID
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000192
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000043
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000012
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:01000001
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:00010483
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000180
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000191
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000176
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000477

Tags corresponding to “Habitat” text data object: http://eol.org/data_objects/31415353 of EOL Taxon Phoenicopterus ruber (Greater Flamingo): http://eol.org/pages/913221
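
Output of this shape can be grouped by mention with a short script. The sketch below assumes the five columns are tab-separated (the slide does not state the delimiter, and a match text such as "mud flats" contains a space, so plain whitespace splitting would not work); `parse_tags` is a hypothetical helper name. The rows are copied from the sample output.

```python
from collections import defaultdict

# A few rows from the sample output; the tab delimiter is an assumption.
ROWS = [
    "eol_documents_ascii_nonHTML.txt\t346289845\t346289853\tmud flats\tENVO:00000192",
    "eol_documents_ascii_nonHTML.txt\t346289845\t346289853\tmud flats\tENVO:00002297",
    "eol_documents_ascii_nonHTML.txt\t346289871\t346289873\tmud\tENVO:01000001",
    "eol_documents_ascii_nonHTML.txt\t346289871\t346289873\tmud\tENVO:00010483",
]


def parse_tags(lines):
    """Group ENVO IDs by (file name, start, end, match text)."""
    mentions = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        file_name, start, end, text, envo_id = line.split("\t")
        mentions[(file_name, int(start), int(end), text)].append(envo_id)
    return dict(mentions)


for (file_name, start, end, text), ids in parse_tags(ROWS).items():
    # In the sample, end - start + 1 equals the length of the match text,
    # which suggests the end coordinate is inclusive.
    print(text, start, end, ids)
```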

Page 28: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS – Sample Output

The tags above are reported after traversing all IS_A and PART_OF relationships in ENVO.
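
One way to read this slide is that each matched term is reported together with the terms reachable from it through IS_A and PART_OF links, which would explain why a single mention carries several ENVO IDs. The sketch below shows such a traversal on a toy graph. All `TOY:` identifiers and their relationships are hypothetical and do not reflect the actual structure of ENVO; `with_ancestors` is an illustrative helper, not part of the tagger.

```python
# Hypothetical mini-ontology: term -> list of (relation, parent term).
PARENTS = {
    "TOY:reef": [("is_a", "TOY:marine_feature")],
    "TOY:marine_feature": [("is_a", "TOY:feature"), ("part_of", "TOY:ocean")],
    "TOY:ocean": [("is_a", "TOY:feature")],
    "TOY:feature": [],
}


def with_ancestors(term, parents=PARENTS):
    """Return the term plus everything reachable via is_a / part_of links."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a term can be reached along more than one path
        seen.append(current)
        stack.extend(parent for _relation, parent in parents.get(current, []))
    return seen


print(with_ancestors("TOY:reef"))
```

Emitting the ancestors with each tag lets a downstream query for a broad term also retrieve documents that only mention its more specific descendants.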

Page 29: Text Mining and Environmental Metadata Suggestion

Download

ENVIRONMENTS
•  Home Page: http://environments.hcmr.gr/
•  Tagger Software: http://download.jensenlab.org/environments_tagger.tar.gz

Page 30: Text Mining and Environmental Metadata Suggestion

other forms of access

Page 31: Text Mining and Environmental Metadata Suggestion

http://eol.org/info/discover_what

Page 32: Text Mining and Environmental Metadata Suggestion

ACTION ES1103

ENVIRONMENTS

ID: ENVO:00000150 Name: coral reef

Interactive Curation

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Page 37: Text Mining and Environmental Metadata Suggestion

Not only ENVO terms

Page 38: Text Mining and Environmental Metadata Suggestion

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Page 39: Text Mining and Environmental Metadata Suggestion

What else is being identified?

Ready for you to discover!

Page 41: Text Mining and Environmental Metadata Suggestion

Summary

•  Importance of standardized metadata and annotations
•  ENVO: standardized, hierarchically organized descriptions of environment types
•  Literature, project and other scientific-content web pages may describe the environmental context of a metagenomics sample
•  ENVIRONMENTS:
    •  Dictionary-based identification of environment descriptive terms
    •  Ontological community standards, e.g. ENVO, as the name source
    •  Command line application
•  Browser extensions, a user-friendly interface
    •  Highly interactive
    •  Can be used while browsing the web
    •  Extract ENVO terms from a selected part of a web page
    •  Extended for organism, disease, and tissue mention identification

Page 42: Text Mining and Environmental Metadata Suggestion

Digging-out Information

http://hartpurylrc.files.wordpress.com Photo by Dr Chatzinikolaou E

Page 43: Text Mining and Environmental Metadata Suggestion

BioCreative: Metagenomics Track

Critical Assessment of Information Extraction in Biology

•  Preparing a Metagenomics Track as part of the BioCreative 2015 challenge
•  Aim: improve the environmental-context annotation of sequences in major metagenomics repositories
•  Track coordinator: Dr. L. Hirschman, MITRE
•  BioCreative (www.biocreative.org)

Page 44: Text Mining and Environmental Metadata Suggestion

Biodiversity – Genomics

ENVIRONMENTS-EOL, http://environments-eol.blogspot.com/
Encyclopedia of Life (EOL), http://www.eol.org
•  process EOL taxon pages
•  extract environmental context (ENVO terms)
•  EOL Taxon Page: Quick Facts, Data tab
•  integrated in TraitBank
•  large scale biological questions
Rubenstein Fellowship 2013
In collab: Jennifer Hammock, Patrick Leary, Katja Schulz, Cyndy Parr

SEQenv, http://environments.hcmr.gr/seqenv.html
•  annotate microbial sequences with ENVO terms
•  sequence analysis, literature mining, visualization
•  GenBank isolation source, PubMed abstracts
•  sample comparison, temporal/spatial pattern analysis
•  extension: proteins, protein families, 3D visualization

Reused: Analysis of American bird habitats, http://blog.eol.org/ (NoPlaceLikeHome, in collab: Rob Stevenson, Carl Nordman)

Hexanchus griseus EOL page, http://eol.org/pages/212027

Page 45: Text Mining and Environmental Metadata Suggestion

http://jensenlab.org/

Santos A et al. (under review), preprint: http://biorxiv.org/content/early/2014/11/10/010975

Frankild S et al. (under review), preprint: http://biorxiv.org/content/early/2014/08/25/008425

Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390

Page 46: Text Mining and Environmental Metadata Suggestion

Acknowledgements

HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou, Lucia Fanini, Sarah Faulwetter, Anastasis Oulas
NNF CPR: Lars Juhl Jensen, Sune Frankild
U Mass: Rob Stevenson
Uni Glasgow: Christopher Quince, Umer Ijaz
EOL: Cynthia Parr, Jennifer Hammock, Patrick Leary, Katja Schulz
MM-MPI: J. Schnetzer
AWI: Dr. P. Buttigieg
HITS: Dr. S. Berger
and more

Funding: EOL Rubenstein Fellowship, LifeWatch Greece, MARBIGEN, NNF-CPR, EOL-BHL NESCent Research Sprint 2014, “SEQenv” Hackathons (COST ES1103)

Thank You!

Amvrakikos Lagoons, May 2011

Page 47: Text Mining and Environmental Metadata Suggestion

Amvrakikos Lagoons, May 2011

id: ENVO:00000038 name: lagoon

Page 48: Text Mining and Environmental Metadata Suggestion

Tutorial

•  Start Firefox
•  Install the “megx-seqenv-bar.xpi”
    •  Drag and drop
    •  “Install Now” and “Restart”
•  Visit a couple of PubMed abstracts or article web pages of your preference
    •  Annotate the complete abstract
    •  Annotate selected sentences only