2
About me
3
Overview
• Big Data
– Big Data is BIG
– Issues in research
• Semantic Web
– Standards: URIs, RDF, SPARQL, OWL
– Linked data
• Ontologies
– Definition and reasoning
– OBO Foundry
– Example of existing ontologies
– Pharmacovigilance
– Publishing ontologies on the Semantic Web
• IRIDA
– The IRIDA platform
– Adding standards to IRIDA
• Take home message
5
6
Big data
Big data is data that is too large and complex for conventional data tools to process.
7
2005
8
2013
9
What is a Zettabyte?
1,000,000,000,000 gigabytes
1,000,000,000 terabytes
1,000,000 petabytes
1,000 exabytes
1 zettabyte
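A zettabyte is 10^21 bytes under the decimal (SI) convention, where each prefix is a factor of 1,000 larger than the previous one. The arithmetic can be sketched in a few lines of Python; the names `SI_BYTES` and `convert` are illustrative choices, not part of any library.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
SI_BYTES = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def convert(amount: int, from_unit: str, to_unit: str) -> int:
    """Convert between units using integer arithmetic.

    Results smaller than one target unit round down to 0.
    """
    return amount * SI_BYTES[from_unit] // SI_BYTES[to_unit]


if __name__ == "__main__":
    for unit in SI_BYTES:
        print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,} {unit}(s)")
```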
10
How big is big?
• Facebook: 25 Terabytes of logged data per day, Google (2008): 20 Petabytes per day
• Over 90% of all the data in the world was created in the past 2 years [1]
• Today 3.2 zettabytes. 2020: 40 zettabytes.[2]
• Good news: jobs! [3]
1. http://www-01.ibm.com/software/data/bigdata/
2. http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
3. http://www.webopedia.com/quick_ref/important-big-data-facts-for-it-professionals.html
11
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
12
Issues with research data (1): data availability
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
13
Issues with research data (2): data reproducibility
http://www.firstwordpharma.com/node/931605#axzz3IalL2lzU
14
15
A solution: the Semantic Web
"The Semantic Web is an... extension of the current web in which... information is given well-defined meaning,... better enabling computers and people to work in cooperation."
The Semantic Web. Tim Berners-Lee, James Hendler and Ora Lassila. Scientific American, May 2001
http://www.scientificamerican.com/article/the-semantic-web/
16
The Semantic Web in a nutshell
Adds to Web standards and practices (currently only for documents and services) encouraging
• Unambiguous names for things, classes, and relationships
• Well organized and documented in ontologies
• With data expressed using uniform knowledge representation languages (e.g. OWL)
• To enable computationally assisted exploitation of information
• That can be easily integrated from different sources
17
Some Semantic Web successes
• In February 2011, the Watson system by IBM made international headlines for beating the best humans in the quiz show Jeopardy!
• A significant number of very prominent websites are powered by Semantic Web technologies, including the New York Times, Thomson Reuters, BBC, and Google's Freebase.
• The Speech Interpretation and Recognition Interface (Siri), launched by Apple in 2011 as an intelligent personal assistant for the new generation of iPhone smartphones, draws heavily on work on ontologies, knowledge representation, and reasoning.
http://130.108.5.60/faculty/pascal/pub/crc-handbook-13.pdf
18
19
Uniform Resource Identifiers (URIs)
• Two different uses:
– Unambiguous name for something
– Location of a document
• Examples:
– http://example.org/wiki/Main_Page
– ftp://example.org/resource.txt
– mailto:[email protected]
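Every URI starts with a scheme (`http`, `ftp`, `mailto`, ...) that says how the rest of the identifier should be read. As a small illustration, Python's standard `urllib.parse.urlsplit` can take such identifiers apart. The helper name `describe` and the `mailto` address below are made up for this sketch.

```python
from urllib.parse import urlsplit


def describe(uri: str) -> tuple:
    """Return (scheme, authority, path) for a URI."""
    parts = urlsplit(uri)
    return (parts.scheme, parts.netloc, parts.path)


if __name__ == "__main__":
    examples = [
        "http://example.org/wiki/Main_Page",
        "ftp://example.org/resource.txt",
        "mailto:someone@example.org",  # hypothetical address
    ]
    for uri in examples:
        print(uri, "->", describe(uri))
```

Note that `mailto:` URIs have no authority part: the address ends up in the path component.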
20
Resource Description Framework (RDF)
• Resources (= nodes)
– Identified by a Uniform Resource Identifier (URI)
• Properties (= edges)
– Identified by a Uniform Resource Identifier (URI)
– Binary relations between 2 resources
http://elmonline.ca/sw/sparql/social.ttl
21
<http://www.linkedin.com/in/mcourtot> a foaf:Person ;
    foaf:name "Melanie Courtot" ;
    foaf:knows <http://elmonline.ca/luke> ;
    foaf:knows <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> .
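The Turtle above is shorthand for four (subject, property, object) triples, with `a` standing for `rdf:type` and `foaf:` for the FOAF namespace `http://xmlns.com/foaf/0.1/`. A minimal Python sketch of the same graph as plain tuples follows. The helper `objects` is illustrative, and a real application would more likely use an RDF library.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ME = "http://www.linkedin.com/in/mcourtot"

# Each triple is one edge of the graph: (subject, property, object).
TRIPLES = {
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "name", '"Melanie Courtot"'),
    (ME, FOAF + "knows", "http://elmonline.ca/luke"),
    (ME, FOAF + "knows", "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
}


def objects(graph, subject, prop):
    """All objects reachable from `subject` along the edge `prop`, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == prop)


if __name__ == "__main__":
    for person in objects(TRIPLES, ME, FOAF + "knows"):
        print(person)
```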
22
SPARQL
A query language for RDF
SELECT ?person
WHERE {
  <http://www.linkedin.com/in/mcourtot> <http://xmlns.com/foaf/0.1/knows> ?person .
}
----------------------------------------------------------
| person                                                 |
==========================================================
| <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> |
| <http://elmonline.ca/luke>                             |
----------------------------------------------------------
• An excellent tutorial by Luke McCarthy: http://elmonline.ca/sw/sparql/
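At its core, a SPARQL `SELECT` with a single triple pattern matches the pattern against every triple in the graph and binds the variables. The toy matcher below sketches that idea in Python. It is not a SPARQL engine, and the function name `match` and the tuple representation are assumptions made for illustration.

```python
KNOWS = "http://xmlns.com/foaf/0.1/knows"
ME = "http://www.linkedin.com/in/mcourtot"

GRAPH = [
    (ME, KNOWS, "http://elmonline.ca/luke"),
    (ME, KNOWS, "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
    (ME, "http://xmlns.com/foaf/0.1/name", '"Melanie Courtot"'),
]


def match(graph, pattern):
    """Return one {variable: value} binding per triple matching `pattern`.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    results = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


if __name__ == "__main__":
    for row in match(GRAPH, (ME, KNOWS, "?person")):
        print(row["?person"])
```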
23
The Web Ontology Language (OWL)
• Knowledge representation language
• Based on Description Logics: fragments of first-order logic with decidable, well-defined computational properties
• Sound, complete, terminating reasoners available
24
25
Linked open data cloud
26
Biological resources in LOD
27
Examples of issues in linking data incorrectly
• http://dbpedia.org/resource/Welsh owl:sameAs
<http://sw.cyc.com/2006/07/27/cyc/EthnicGroupOfWelsh>
<http://sw.cyc.com/2006/07/27/cyc/Welsh-TheWord>
<http://sw.cyc.com/2006/07/27/cyc/WelshLanguage>
<http://sw.cyc.com/2006/07/27/cyc/Welshing-Cheating>
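Because `owl:sameAs` is symmetric and transitive, these four links do more than relate each Cyc term to the DBpedia page: they make all five identifiers denote the same thing, so an ethnic group, a word, a language and an act of cheating collapse into one individual. A minimal union-find sketch in Python shows the effect. The function name `same_as_groups` is illustrative.

```python
def same_as_groups(links):
    """Group identifiers into equivalence classes under sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


CYC = "http://sw.cyc.com/2006/07/27/cyc/"
WELSH = "http://dbpedia.org/resource/Welsh"
LINKS = [
    (WELSH, CYC + "EthnicGroupOfWelsh"),
    (WELSH, CYC + "Welsh-TheWord"),
    (WELSH, CYC + "WelshLanguage"),
    (WELSH, CYC + "Welshing-Cheating"),
]

if __name__ == "__main__":
    for group in same_as_groups(LINKS):
        print(len(group), "identifiers now denote one and the same thing")
```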
28
29
Ontologies
• Representation of important things in a specific domain
– Describes types of entities (e.g. cells) and relations between them (e.g. prokaryotic cells and eukaryotic cells are cells) and their instances (e.g. the specific cells in my sample)
• An active computational artifact
– A mathematical model based on a subset of first order logic
– Tools can automatically process ontologies
• A communication tool
– Provides a dictionary for collaborators, a shared understanding
– Allows data sharing
30
Reasoning is critical
• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell
=> Unsatisfiability: Spore cannot have any instances
Solution: clarify by splitting into spore (sensu Mycetozoa) AND actinomycete-type spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
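The spore example can be checked mechanically: collect every superclass of a class and see whether two of them were declared disjoint. The Python sketch below is a toy check over named classes only, not a Description Logic reasoner. The dictionary layout and function names are assumptions made for illustration.

```python
# Asserted subclass links, mirroring the bullets above.
SUBCLASS_OF = {
    "eukaryotic cell": {"cell"},
    "prokaryotic cell": {"cell"},
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(cls):
    """All direct and indirect superclasses of `cls`."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable(cls):
    """True if `cls` falls under two classes that are declared disjoint."""
    supers = ancestors(cls) | {cls}
    return any(a in supers and b in supers for a, b in DISJOINT)


if __name__ == "__main__":
    for cls in SUBCLASS_OF:
        print(cls, "-> unsatisfiable" if unsatisfiable(cls) else "-> ok")
```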
31
Logics
• Simple example based on http://arxiv.org/pdf/1201.4089v1.pdf
• Ontology file available from http://www.sfu.ca/~mcourtot/course/20141112BigDataSemWebOntologies/ontology.owl
• Manipulation done using Protégé: http://protege.stanford.edu
32
Family ontology
33
Logics of a grandfather
34
Reasoning
35
Inferred class hierarchy
36
Explanations
37
A wrong assertion
38
Unsatisfiability
39
40
OBO Foundry
A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure
• tight connection to the biomedical basic sciences
• compatibility
• interoperability, common relations
• formal robustness
• support for logic-based reasoning
42
GRANULARITY by RELATION TO TIME (CONTINUANT: INDEPENDENT or DEPENDENT; OCCURRENT)
• ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy?), Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO), Phenotypic Quality (PaTO)
– Occurrent: Organism-Level Process (GO)
• CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL), Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
– Occurrent: Cellular Process (GO)
• MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)
Slide credit: Barry Smith
43
Minimum Information to Reference an External Ontology Term (MIREOT)
• OBO and Semantic Web promote reuse of resources
• Biological resources (e.g., FMA for anatomy), taken together, are too big for current tool support
• MIREOT used across the OBO library
– OBI: 400 mireoted terms (140 GO, 55 ChEBI, 50 PATO)
– PR (Protein Ontology): 23,000 mireoted terms
• http://ontofox.hegroup.org
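The minimum information MIREOT asks for is small: the IRI of the external term, the IRI of the ontology it comes from, and the superclass it should sit under in the importing ontology. A hedged Python sketch of such a record follows. The class name, the `ex:importedFrom` property and all `example.org` IRIs are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MireotImport:
    source_term: str        # IRI of the external term being reused
    source_ontology: str    # IRI of the ontology that defines it
    target_superclass: str  # where the term is placed in the importing ontology

    def to_triples(self):
        """Statements to add to the importing ontology (property names illustrative)."""
        return [
            (self.source_term, "rdfs:subClassOf", self.target_superclass),
            (self.source_term, "ex:importedFrom", self.source_ontology),
        ]


if __name__ == "__main__":
    # All IRIs below are placeholders, not real ontology terms.
    record = MireotImport(
        source_term="http://example.org/external/Term_0001",
        source_ontology="http://example.org/external.owl",
        target_superclass="http://example.org/local/ParentClass",
    )
    for triple in record.to_triples():
        print(triple)
```

Only the term itself is brought in, not the whole external ontology, which is what keeps the importing ontology small enough for tools to handle.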
Examples of OBO ontologies
• OBI, the Ontology for Biomedical Investigations
• VO, the Vaccine Ontology
• AERO, the Adverse Event Reporting Ontology
45
Ontology for Biomedical Investigations (OBI)
• OBI is a multi-community project driven by the practical needs of its members with the goal to build a high quality, interoperable reference ontology
• OBI high level classes are in place - solidified over several years - that cover all aspects of biomedical investigations
• OBI is expanded to enable member applications and based on term requests
46
High level class hierarchy (partial)
Slide credit: OBI Consortium
47
Slide credit: Alan Ruttenberg
48
Slide credit: OBI Consortium
49
Representing vaccine data – the Vaccine Ontology (VO)
Picture credit: Yongqun He
50
51
Representing pharmacovigilance data
• The Adverse Event Reporting Ontology (AERO)
• Encodes existing clinical guidelines (Brighton Collaboration)
52
Background and problem statement
• Surveillance of Adverse Events Following Immunization is important
– Detection of issues with vaccines
– Importance of vaccine-risk communication
• Analysis of AE reports is a subjective, time-consuming and costly process
– Manual review of the textual reports
53
Workflow
• Hypothesis: Use the AERO I developed to annotate and classify a dataset
• VAERS dataset
– Vaccine Adverse Event Reporting System
– 6032 reports: ~5800 negative, ~230 positive
– Post H1N1 immunization 2009/2010
– Manually classified for anaphylaxis
• MedDRA (Medical Dictionary for Regulatory Activities) is used to represent clinical findings
54
Automated Diagnosis workflow
55
Results
At best cut-off point:
Sensitivity 57%
Specificity 97%
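Sensitivity is the fraction of truly positive reports that the classifier flags, and specificity is the fraction of truly negative reports it correctly leaves unflagged. A short Python sketch of both formulas follows. The counts are toy numbers chosen only to reproduce the percentages above, not the study's actual confusion matrix.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual positives that were detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of actual negatives that were correctly rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Toy counts (hypothetical): 100 positive and 100 negative reports.
    print(f"Sensitivity: {sensitivity(57, 43):.0%}")
    print(f"Specificity: {specificity(97, 3):.0%}")
```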
56
AE classification can be improved through the use of ontologies
• Manual analysis: 3 months for 12 medical officers
• Ontology-based analysis: once data is collected (2 months), almost instantaneous (2h on a laptop)
=> Could allow for earlier detection of safety issues and better understanding of adverse events
2h automated vs. 3 months manual
http://dx.doi.org/10.1371/journal.pone.0092632
57
58
IRI dereferencing
59
Ontobee: publishing biomedical resources on the Semantic Web
HTML for humans …
… RDF for machines
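Serving HTML to people and RDF to programs from the same IRI is typically done with HTTP content negotiation: the client states the format it wants in the `Accept` header. The Python sketch below only builds the two requests and sends nothing over the network. The term IRI is an illustrative example, and how a particular server responds to each header is an assumption to verify against that server.

```python
from urllib.request import Request

# Illustrative ontology term IRI; any dereferenceable IRI could be used here.
TERM_IRI = "http://purl.obolibrary.org/obo/OBI_0000070"


def build_request(iri: str, want_rdf: bool) -> Request:
    """Prepare a GET request asking for RDF/XML (machines) or HTML (humans)."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(iri, headers={"Accept": accept})


if __name__ == "__main__":
    for want_rdf in (False, True):
        req = build_request(TERM_IRI, want_rdf)
        print(req.full_url, "->", req.get_header("Accept"))
    # Passing a request to urllib.request.urlopen() would perform the fetch.
```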
Ontobee: publishing biomedical resources on the Semantic Web
61
62
The Integrated Rapid Infectious Disease Analysis (IRIDA) project
• Goal: automate infectious disease outbreak detection and investigation
• Issues:
– Integrate WGS, clinical and lab info
– Provide relevant tools and validate pipeline
• Methods:
– Data standards for information exchange
– Analysis pipeline (Galaxy based)
– User interface
– Additional tools: IslandViewer, GenGIS
63
64
Building the IRIDA data standards
• Interview with key personnel at BCCDC
• Review of existing resources
• Identify "holes", i.e., missing bits
• Collect existing data
• Liaise with implementation team
• Generate cohesive resource
• Validate
65
Relevant data standards
• TypON, the typing ontology
• OBI, the Ontology for Biomedical Investigations
• NGSOnto, Next Generation Sequencing Ontology
• NIAIS-GS-BRC core metadata
• MIxS ontology
• TRANS, Pathogen Transmission ontology
• ExO, Exposure Ontology
• EPO, Epidemiology Ontology
• IDO, Infectious Disease Ontology
• Food: USDA, EFSA?
66
Relevant international efforts
• MIxS standard
• Global Microbial Identifier
• Global Alliance for Genomics and Health
• NCBI BioSample
• European Nucleotide Archive
• …
67
Remaining challenges
• Trust, provenance
– Ability to track origin of data to assess whether it is trustworthy
• Data sharing, reuse, policy
– Social and legal issues in getting access to data
• Confidentiality
– Privacy concerns when linking data
68
69
Take home message
Big data is a big challenge, but we can deal with it if done properly: that will be your responsibility
DO NOT build a black box
DO annotate and describe your data
DO make your data openly available
70
Acknowledgements
• Drs. Fiona Brinkman, Will Hsiao, Ryan Brinkman
• The Brinkman^2 labs
• Alan Ruttenberg, Barry Smith, Chris Mungall & OBO
• Colleagues at Public Health Agency Canada (Ms Lafleche, Dr Law)
• The IRIDA consortium and the IRIDA ontology working group (Emma Griffiths and Damion Dooley)