View
3
Download
0
Category
Preview:
Citation preview
How data sharing leads toHow data sharing leads to knowledgeg
M. Scott Marshall, Ph.D.W3C HCLS IG h iW3C HCLS IG co‐chair
Leiden University Medical CenterUniversity of Amsterdam
http://staff.science.uva.nl/~marshall
Semantic Web Anno 1999Semantic Web Anno 1999
DIY (Do It Yourself)
Reasoning across a Federationg(Semantic Web 2012?)
SPARQLクエリを分解して分散コンピューSPARQLクエリを分解して分散コンピューティングを行なう
The Semantic Web is the New GlobalThe Semantic Web is the New Global Web of Knowledge
It is about standards for publishing, sharing and querying knowledge drawn from diverse sources
It makes possible the answeringsophisticated questions usingsophisticated questions using
background knowledge
Source: Michel Dumontier
Some of the forces at workSome of the forces at work• Pharmaceutical industry changing strategyy g g gy
– David Cox Strategy: Academic / Industry partnershippartnership
– Pistoia Alliance, Vocabulary Services Initiative
• Personalized Medicine and EHRs
• US NIH: NCBCs and NCBOUS NIH: NCBCs and NCBO
• NCI Semantic Infrastructure
• European Innovative Medicine Initiatives (IMI)
Biology in a nutshell: Bigger isn’t bettergy gg
• DNA Dogma– Transcription = DNA ‐> mRNA ‐> Protein
• Molecular pathways allow biologists to ‘connect’ one molecule to another.
• Huntington’s mutation mapped in 1993 yet there is still no understanding of the mechanism that causesstill no understanding of the mechanism that causes the neurodegeneration.
• Semantic models are necessary to create a ‘systems• Semantic models are necessary to create a systems view’ of biology.
Where is biomedical knowledge?Where is biomedical knowledge?
Can be extracted from:
• PeoplePeople
• LiteratureM t f th f• Diagrams
• Clinical reports
Most of these sources of biomedical knowledge are notmachine‐readableClinical reports
• Databases
notmachine readable
• Excel sheets
•• …
ExamplesExamples
1. “find all images with a big lesion in the right frontal lobe » ? g g g
2. Provide the neurosurgeon with landmarks Hmm, but what’s nearby?
Source: Christine Golbreich
Many tasks are still a challenge!With existing Web and Health IT:
• Find and integrate informationFind and integrate information
– “Although a plethora of resources (tools, databases materials) for neuroscientists is nowdatabases, materials) for neuroscientists is now available on the web, finding these resources among the billions of possible web pagesamong the billions of possible web pages continues to be a challenge.” [M. Martone, NCBO Seminar Series, 4 Nov 2009], ]
• Make multiple inferences based on background knowledgeknowledge
– to obtain more complete answers
t di k l d– to discover knowledgeSource: Christine Golbreich
ExamplesExamples– in a medical record system
“find all patients which radiology exhibits a fracture of femur”
d– in genomic data
“find all genes annotated with a molecular function or f it d d t d hi h i i t d ithany of its descendants and which is associated with any
form of a given disease (see genes associated to muscular dystrophy [Sahoo et al. 2007])muscular dystrophy [Sahoo et al. 2007])
– find, share, annotate images
Source: Christine Golbreich
Biological and medical ontologiesBiological and medical ontologies• Medical domain is *very* lucky☺
a large number of terminologies and reference ontologies, E.g., FMA, NCI, GO, SNOMED‐CT, etc.
• Web PortalsBioportal library contains ~200 ontologies in different languages: OBO– Bioportal library contains 200 ontologies in different languages: OBO, Protégé Frames, RDF, OWL http://bioportal.bioontology.org/
– Bioportal now provides SPARQL access to ontologies: http://sparql bioontology orghttp://sparql.bioontology.org
– Open Biomedical Ontologies (OBO) Foundry, http://obofoundry.org/
Source: Christine Golbreich
MotivationMotivation
Science is based on knowledge:Science is based on knowledge:
knowledge capture, knowledge sharing, i.e. i i f fi dicommunication of findings.
Semantic Web provides a basis for knowledge h i th h hi d bl dsharing through machine‐readable and reason‐able annotation of resources.
What is knowledge ?What is knowledge ?
“data”, “information”, “facts”, “knowledge”
Knowledge is a statement
that can be tested for truth.
(by a machine)
Otherwise, computing can’t add much
RDF : a web format for knowledge
RDF is a W3C language to expressRDF is a W3C language to express
statements.RDF Triple:
Subject Predicate Objectj j
Graph of Knowledge:Graph of Knowledge:
Node Edge Node
Round trip knowledge engineering (Vision)
In a linked open data environment:In a linked open data environment:– How will I find the resources?
H ill I i t t th i t li ti ?– How will I integrate them into an application?
– How will I use existing knowledge in my analysis?
– How will I publish the new knowledge resulting from my analysis?
– And link to the evidence (data) from the new knowledge?knowledge?
Programmable Web:A few thousand APIs
http://www.programmableweb.com/
SPARQL Interoperability:A single API
• a query language for RDF that matches graph patternsp
• SPARQL endpoint is comparable to a database connector for triplesconnector for triples
• SPARQL endpoint enables data layer interoperability
Background of the HCLS IGBackground of the HCLS IG
• Originally chartered in 2005– Chairs: Eric Neumann and Tonya Hongsermeier
• Re‐chartered in 2008• Re‐chartered in 2008– Chairs: Scott Marshall and Susie Stephens
– Team contact: Eric Prud’hommeaux
• Broad industry participation– Over 100 members
Mailing list of over 600– Mailing list of over 600
• Background Information– http://www.w3.org/blog/hclsp g g
– http://esw.w3.org/topic/HCLSIG
Mission of HCLS IGMission of HCLS IG
•The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for
Biological science– Biological science
– Translational medicine
– Health care
•These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on h i bili f i f i f d i dthe interoperability of information from many domains and processes for efficient decision support
How does the HCLS IG work?How does the HCLS IG work?
• Task forces
• Regular teleconferences using teleconferenceRegular teleconferences using teleconference bridge (Zakim), IRC, minutes
F 2F (F2F) i• Face2Face (F2F) twice a year
• Procedures for publishing W3C notes p g
Current Task ForcesCurrent Task Forces
Bi RDF f d i ( i ) k l d b• BioRDF – federating (neuroscience) knowledge bases– M. Scott Marshall (Leiden University Medical Center / University of Amsterdam)
• Clinical Observations Interoperability – patient recruitment in trials– Vipul Kashyap (Cigna Healthcare)
• Linking Open Drug Data – aggregation of Web‐based drug data – Susie Stephens (Johnson & Johnson)Susie Stephens (Johnson & Johnson)
• Translational Medicine Ontology – high level patient‐centric ontology– Michel Dumontier (Carleton University)
S i tifi Di b ildi iti th h t ki• Scientific Discourse – building communities through networking– Tim Clark (Harvard University)
• Terminology – Semantic Web representation of existing resources– John Madden (Duke University)
COI: Translating across domainsCOI: Translating across domains
EHR
Microarray AlzForum
MRI PubMed
COI: Use CaseCOI: Use Case
Pharmaceutical companies pay a lot to test drugsg
Pharmaceutical companies express protocol in CDISCCDISC
‐‐ precipitous gap –
H it l h i f ti i HL7/RIMHospitals exchange information in HL7/RIM
Hospitals have relational databases
Source: Eric Prud’hommeaux
Inclusion Criteria
T 2 di b t di t d i th• Type 2 diabetes on diet and exercise therapy or• monotherapy with metformin, insulin
t l h l id i hibit• secretagogue, or alpha-glucosidase inhibitors, or• a low-dose combination of these at 50%
i l d D i i t bl f 8 k i• maximal dose. Dosing is stable for 8 weeks prior• to randomization.• …• ?patient takes metformin .
Source: Holger Stenzhorn
Exclusion CriteriaExclusion Criteria
Use of warfarin (Coumadin), clopidogrel(Plavix) or other anticoagulants.( ) g…?patient doesNotTake anticoagulant .?patient doesNotTake anticoagulant .
Source: Holger Stenzhorn
Criteria in SPARQL?medication1 sdtm:subject ?patient ;
Criteria in SPARQL
spl:activeIngredient ?ingredient1 .?ingredient1 spl:classCode 6809 . #metformin
OPTIONAL {OPTIONAL {
?medication2 sdtm:subject ?patient ;spl:activeIngredient ?ingredient2 spl:activeIngredient ?ingredient2 .
?ingredient2 spl:classCode 11289 .#anticoagulant
} FILTER (!BOUND(?medication2))
Source: Holger Stenzhorn
TMO: Translating across domainsTMO: Translating across domains
EHR
Microarray AlzForum
MRI PubMed
Questions & ProblemsThe Drug Development Pipeline
“A virtual space odyssey”, Cath O'Driscoll (2004)
http://www.nature.com/horizon/chemicalspace/background/odyssey.html
• The road is long, and costly.
H d t i t d d l b tt d ?• How do we contain costs and develop better drugs?
Source: Elgar Pichler
Translational Medicine OntologyMission
• Focuses on the development of a high level patient‐centric ontology for the pharmaceutical industry. The ontology should enable data integration across discovery research, hypothesis management,integration across discovery research, hypothesis management, experimental studies, compounds, formulation, drug development, market size, competitive data, population data, etc. This would enable i i i d i i i ifiscientists to answer new questions, and to answer existing scientific
questions more quickly.
• This will help pharmaceutical companies to model patient‐centricThis will help pharmaceutical companies to model patient centric information, which is essential for the tailoring of drugs, and for early detection of compounds that may have sub‐optimal safety profiles. The ontology should link to existing publicly available domain ontologies.
TMO StructureTMO Structure
Source: Susie Stephens
Translational Medicine KBTranslational Medicine KB
Source: Susie Stephens
TMO QueryTMO Query
How many patients experienced side effects while taking Donepezil?
Source: Susie Stephens
Terminology: Translating across domains
EHR
Microarray AlzForum
MRI PubMed
TerminologyTerminology
• Representation of medical documents (OWL and SKOS))
• SKOS enables ‘fuzzier’ statements, with ‘broader’ and ‘narrower’broader and narrower
• Mammogram: Represent both radiology and pathology report to discover discrepancies
• Use Translational Medicine Ontology RadLex• Use Translational Medicine Ontology, RadLex, SNOMED in the RDF
BioRDF: Translating across domainsBioRDF: Translating across domains
EHR
Microarray AlzForum
MRI PubMed
BioRDF: Looking for Targets for Alzheimer’s
• Signal transduction pathways are considered to be rich in “druggable” targetstargets • CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease• Casting a wide net, can we find candidate genes known to be involvedcandidate genes known to be involved in signal transduction and active in Pyramidal Neurons?
Source: Alan Ruttenberg
BioRDF: Integrating Heterogeneous DataBioRDF: Integrating Heterogeneous Data
NeuronDB
PDSPki
R t NeuronDB
BAMS
Gene Ontology
BrainPharmAntibodies
Reactome
Allen Brain
Literature
Entrez Gene
M li
BrainPharmAntibodies
PubChem
MESHAtlas
Homologene
SWAN
Mammalian Phenotype
AlzGene
Source: Susie Stephens
“find me genes involved in signal transduction that arefind me genes involved in signal transduction that are related to pyramidal neurons”
BioRDF: Results: Genes ProcessesBioRDF: Results: Genes, Processes
•DRD1, 1812 adenylate cyclase activation•ADRB2, 154 adenylate cyclase activation•ADRB2, 154 arrestin mediated desensitization of G‐protein coupled receptor protein signaling pathway•DRD1IP, 50632 dopamine receptor signaling pathway•DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway•DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway•GRM7 2917 G‐protein coupled receptor protein signaling pathwayGRM7, 2917 G protein coupled receptor protein signaling pathway•GNG3, 2785 G‐protein coupled receptor protein signaling pathway•GNG12, 55970 G‐protein coupled receptor protein signaling pathway•DRD2, 1813 G‐protein coupled receptor protein signaling pathway•ADRB2, 154 G‐protein coupled receptor protein signaling pathway•CALM3, 808 G‐protein coupled receptor protein signaling pathway•HTR2A, 3356 G‐protein coupled receptor protein signaling pathwayDRD1 1812 G i i li l d li l id d
Many of the genes are related to AD through gamma •DRD1, 1812 G‐protein signaling, coupled to cyclic nucleotide second messenger
•SSTR5, 6755 G‐protein signaling, coupled to cyclic nucleotide second messenger•MTNR1A, 4543 G‐protein signaling, coupled to cyclic nucleotide second messenger•CNR2, 1269 G‐protein signaling, coupled to cyclic nucleotide second messenger•HTR6, 3362 G‐protein signaling, coupled to cyclic nucleotide second messenger•GRIK2, 2898 glutamate signaling pathway•GRIN1, 2902 glutamate signaling pathway
g gsecretase (presenilin) activity
, g g g p y•GRIN2A, 2903 glutamate signaling pathway•GRIN2B, 2904 glutamate signaling pathway•ADAM10, 102 integrin‐mediated signaling pathway•GRM7, 2917 negative regulation of adenylate cyclase activity•LRP1, 4035 negative regulation of Wnt receptor signaling pathway•ADAM10, 102 Notch receptor processing•ASCL1 429 Notch signaling pathway•ASCL1, 429 Notch signaling pathway•HTR2A, 3356 serotonin receptor signaling pathway•ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)•PTPRG, 5793 ransmembrane receptor protein tyrosine kinase signaling pathway•EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway•NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway•CTNND1, 1500 Wnt receptor signaling pathway
Source: Alan Ruttenberg
Automatic query decompositionAutomatic query decomposition
• Decompose query into components
• Locate sources of informationLocate sources of information
• Distribute queries across information sources
• Aggregate results
A SPARQL query for processes involved in pyramidal neurons
prefix go: <http://purl org/obo/owl/GO#>prefix go: <http://purl.org/obo/owl/GO#>prefix rdfs: <http://www.w3.org/2000/01/rdf‐schema#>prefix owl: <http://www.w3.org/2002/07/owl#>prefix mesh: <http://purl.org/commons/record/mesh/>prefix sc: <http://purl.org/science/owl/sciencecommons/>prefix ro: <http://www.obofoundry.org/ro/ro.owl#> MeSH: Pyramidal Neuronsselect ?genename ?processnamewhere{ graph <http://purl.org/commons/hcls/pubmesh>{ ?paper ?p mesh:D017966 .?article sc:identified_by_pmid ?paper.?gene sc:describes_gene_or_gene_product_mentioned_by ?article.}
Pubmed: Journal Articles}graph <http://purl.org/commons/hcls/goa>{ ?protein rdfs:subClassOf ?res.?res owl:onProperty ro:has_function.?res owl:someValuesFrom ?res2.?res2 owl:onProperty ro:realized_as.?res2 owl:someValuesFrom ?process.h htt // l / /h l /20070416/ l l ti Entrez Gene: Genesgraph <http://purl.org/commons/hcls/20070416/classrelations>
{{?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166} union{?process rdfs:subClassOf go:GO_0007166 }} ?protein rdfs:subClassOf ?parent.?parent owl:equivalentClass ?res3.?res3 owl:hasValue ?gene.
Entrez Gene: Genes
GO: Signal Transductiong}
graph <http://purl.org/commons/hcls/gene>{ ?gene rdfs:label ?genename }graph <http://purl.org/commons/hcls/20070416>{ ?process rdfs:label ?processname}
}
GO: Signal Transduction
Inference required
Recipe for a Semantic WebRecipe for a Semantic Web
• Follow Linked Open Data principles• Follow Linked Open Data principles
• Attempt to use Shared Names (same URI’s)
• Query rewriting to map from: – SPARQL > (query language)– SPARQL ‐> (query language)
– SPARQL (term1) ‐> SPARQL (term2)
• Add provenance information about graphsp g p
Interlinking in LODDInterlinking in LODD
http://esw.w3.org/HCLSIG/LODD/Interlinking
LOD cloud – a few challengesLOD cloud – a few challenges
• It’s a (nice) data warehouse in RDF– http://lod.openlinksw.com/p // p /
• Ad hoc identifiersP f h i i i d h d– Prefer authoritative, persistent, and shared
• Redundancyy
• Lacks provenance – Who created a given RDF rendering? Using what method? Version?rendering? Using what method? Version?
So what’s missing?So, what s missing?
• Data stewardship – serve (annotated) data back to communityy
• Plug‐n‐play SPARQL access to relational stores
B i f d f d i f SPARQL• Best practice for data federation of SPARQL endpoints
• Best practice for important data sources: microarray data variety of image datamicroarray data, variety of image data
• Shared Identifiers
ProvenanceProvenance
• Represent knowledge so that others can discover where a fact (or triple) came from and evaluate how to use it – link facts to data as evidence
• Named graphs – talk about what’s in them
• Example use: “Find the latest RDF version of DrugBank” from SPARQL endpoints accessible to me.ug a o S Q e dpo s access b e o e
Provenance for Federationor “What’s in a graph?”
• Minimum information about a graph (MIAG)
• Is it an ontology or RDF data?Is it an ontology or RDF data?
• Version, License, ..
• Who converted it to RDF?
• Rendering: “I’m looking for SNOMED in SKOSRendering: I m looking for SNOMED in SKOS (not OWL)”
• What is the most populated class?
• Graph statisticsGraph statistics
Translating across domainsTranslating across domains
EHR
Microarray AlzForum
MRI PubMed
Provenance types are perspectives on the data
Source: Helena Deus
A Bottom‐up Approachp pp
Provenance Workflow, Domain ontologiesCommunitymodels
Workflow, experimental design
Domain ontologies(DO, GO…)
Communitymodels
Which genes are markers gfor neurodegenerative
diseases?
Was gene ALG2
Provenance of Microarray experiment Was gene ALG2
differentially expressed in multiple experiments?
experiment
Wh f d
QuestionsWhat software was used to
analyse the data?
Raw Data
Results How can the experiment be replicated?
Raw Data
Source: Helena Deus
BioRDF: finding latest versionBioRDF: finding latest versionFILTER (?date_issued2 > ?date_issued)
}
FILTER (!BOUND(?srvc2))
}
#Get associated diseases from most recently updated Diseasomey pserver.
SERVICE ?srvc2 { # <‐ Federation here
?diseasomeGene rdfs:label ?geneLabel .
?disease diseasome:associatedGene ?diseasomeGene.
?disease rdfs:label ?diseaseName .
}
Myth bustingMyth busting
• Creating a Semantic Web application does not necessarily require converting and storing all y q g gdata in RDF
• “Leave the data where it lives”• Leave the data where it lives
• Create views of the data accessible via RDF export or (BETTER) a SPARQL endpoint
SWObjectsSWObjects
• Creator: Eric Prud’hommeaux
• SWObjects creates a SPARQL endpointSWObjects creates a SPARQL endpoint
• Query rewriting to SQL and SPARQL
• Uses SPARQL Constructs as a mapping language
• JNI interfaceJNI interface
• http://en.wikipedia.org/wiki/SWObjects
• http://precedings.nature.com/documents/5538/
Semantic Web is a frameworkSemantic Web is a framework
that provides common languages for defining vocabularies and querying across data that q y guses them.
However, common identifiers must be established if we are to link data without a million mapping files.million mapping files.
Shared IdentifiersShared Identifiers
• Must use common URI’s in order to link data
• Provenance related identifiers still needed:Provenance related identifiers still needed:– Identifiers for people (researchers)
Id ifi f di– Identifiers for diseases
– Identifiers for terms (Terminology servers)
– Identifiers for programs, processes, workflows
– Identifiers for chemical compoundsIdentifiers for chemical compounds
• Shared Names http://sharednames.org
Emerging Best Practices for RDFEmerging Best Practices for RDF
d d d d f h1. Convert unstructured source data into a structured data format, such as relational tables or XML
2. Identify persistent URLs (PURLS) for concepts in existing ontologies and y p ( ) p g gcreate a map from the structured data into an RDF view
3. Customize the map manually if necessary
i h i i h d4. Map concepts in the new RDF mapping to concepts in other RDF data sources relying as much as possible on ontologies
5. Publish the RDF data through a SPARQL endpoint or as Linked Data g p
6. Alternatively, if data is in a relational format, apply a Semantic Web toolkit such as SWObjects [Prud’hommeaux et al. 2010] that enables SPARQL queries over the relational schemaSPARQL queries over the relational schema
7. Create Semantic Web applications using the published data
Create RDF views that a stranger can useCreate RDF views that a stranger can use
• Use a mapping language to create an RDF view of the data when possible, rather than customized code.
Wh ibl b l i th t l il bl• When possible, use vocabularies that are openly available from an authoritative server. For the life sciences, we have OBO and NCBOOBO and NCBO.
• Use rdfs:label generously to provide information to user interfaces
Publish RDF so that it can be discoveredPublish RDF so that it can be discovered
• Publish open access data whenever possible, as well as any associated software.
• Publish a URL to the software and mappings that you used to create the RDF.
• Register your data at CKAN. If it’s biomedical, register it in BioCatalogue.o a a ogue
• Assign a graph URI to the RDF graph and add provenance and metadata about the graph URI toprovenance and metadata about the graph URI to the graph itself. This practice makes it possible for visitors and crawlers to find out what is in the graphvisitors and crawlers to find out what is in the graph.
AcknowledgementsAcknowledgements• W3C Health Care and Life Science Interest Group, http://www.w3.org/blog/hcls
• National Center for Biomedical Ontologiesg
• Concept Web Alliance
• Authors of all contributed slides• Authors of all contributed slides
The End
“Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house ”stones is a house.
Henri Poincaré– Henri Poincaré,
Science and Hypothesis, 1905
Recommended