How data sharing leads to knowledge

M. Scott Marshall, Ph.D.
W3C HCLS IG co-chair
Leiden University Medical Center
University of Amsterdam
http://staff.science.uva.nl/~marshall

How data sharing leads to knowledge - Keio University, s-web.sfc.keio.ac.jp/conference2011/0102-marshall-update.pdf

Semantic Web Anno 1999

DIY (Do It Yourself)

Reasoning across a Federation (Semantic Web 2012?)

Decompose the SPARQL query and carry out distributed computing

The Semantic Web is the New Global Web of Knowledge

It is about standards for publishing, sharing and querying knowledge drawn from diverse sources

It makes it possible to answer sophisticated questions using background knowledge

Source: Michel Dumontier

Some of the forces at work

• Pharmaceutical industry changing strategy

– David Cox Strategy: Academic / Industry partnership

– Pistoia Alliance, Vocabulary Services Initiative

• Personalized Medicine and EHRs

• US NIH: NCBCs and NCBO

• NCI Semantic Infrastructure

• European Innovative Medicine Initiatives (IMI)

Biology in a nutshell: Bigger isn’t better

• DNA Dogma – Transcription and translation: DNA -> mRNA -> Protein

• Molecular pathways allow biologists to ‘connect’ one molecule to another.

• Huntington’s mutation was mapped in 1993, yet there is still no understanding of the mechanism that causes the neurodegeneration.

• Semantic models are necessary to create a ‘systems view’ of biology.

Where is biomedical knowledge?

Can be extracted from:

• People

• Literature

• Diagrams

• Clinical reports

• Databases

• Excel sheets

• …

Most of these sources of biomedical knowledge are not machine-readable

Examples

1. “Find all images with a big lesion in the right frontal lobe”?

2. Provide the neurosurgeon with landmarks: “Hmm, but what’s nearby?”

Source: Christine Golbreich

Many tasks are still a challenge!

With existing Web and Health IT:

• Find and integrate information

– “Although a plethora of resources (tools, databases, materials) for neuroscientists is now available on the web, finding these resources among the billions of possible web pages continues to be a challenge.” [M. Martone, NCBO Seminar Series, 4 Nov 2009]

• Make multiple inferences based on background knowledge

– to obtain more complete answers

– to discover knowledge

Source: Christine Golbreich

Examples

– in a medical record system

“find all patients whose radiology exhibits a fracture of the femur”

– in genomic data

“find all genes annotated with a molecular function or any of its descendants and which are associated with any form of a given disease” (see genes associated with muscular dystrophy [Sahoo et al. 2007])

– find, share, annotate images

Source: Christine Golbreich
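
The genomic example depends on following a term to "any of its descendants". A minimal Python sketch of that step is given here, assuming a tiny hand-written is-a hierarchy; the terms, gene names and annotations are hypothetical and are not taken from the talk or from the Gene Ontology.

```python
# Hypothetical is-a hierarchy: child term -> list of parent terms
PARENTS = {
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
    "catalytic activity": ["molecular function"],
}

def descendants(term):
    """Return every term that reaches `term` by following is-a links upward."""
    found, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for child, parents in PARENTS.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found

# Hypothetical gene annotations: gene -> annotated term
annotations = {"geneA": "protein kinase activity", "geneB": "catalytic activity"}

# A gene qualifies if it is annotated with the term itself or any descendant.
wanted = descendants("kinase activity") | {"kinase activity"}
matching_genes = sorted(g for g, t in annotations.items() if t in wanted)
print(matching_genes)
```

Without the hierarchy, a plain lookup for "kinase activity" would miss geneA, which is the kind of incomplete answer the slide refers to.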

Biological and medical ontologies

• Medical domain is *very* lucky ☺: a large number of terminologies and reference ontologies, e.g., FMA, NCI, GO, SNOMED-CT, etc.

• Web Portals

– Bioportal library contains ~200 ontologies in different languages: OBO, Protégé Frames, RDF, OWL http://bioportal.bioontology.org/

– Bioportal now provides SPARQL access to ontologies: http://sparql.bioontology.org

– Open Biomedical Ontologies (OBO) Foundry, http://obofoundry.org/

Source: Christine Golbreich

Motivation

Science is based on knowledge: knowledge capture, knowledge sharing, i.e. communication of findings.

Semantic Web provides a basis for knowledge sharing through machine-readable and reason-able annotation of resources.

What is knowledge?

“data”, “information”, “facts”, “knowledge”

Knowledge is a statement 

that can be tested for truth.

(by a machine)

Otherwise, computing can’t add much

RDF: a web format for knowledge

RDF is a W3C language to express statements.

RDF Triple:

Subject     Predicate     Object

Graph of Knowledge:

Node           Edge           Node
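
A minimal Python sketch of the triple-as-graph idea, using only built-in types. The `ex:` identifiers and the facts themselves are hypothetical illustrations, and real RDF uses full URIs and typed literals rather than bare strings.

```python
# Each statement is a (subject, predicate, object) triple; a set of triples is a graph.
triples = {
    ("ex:geneA", "ex:associatedWith", "ex:diseaseX"),   # hypothetical facts
    ("ex:geneA", "rdfs:label", "gene A"),
    ("ex:diseaseX", "rdfs:label", "disease X"),
}

def edges_from(graph, node):
    """Return the outgoing (edge, node) pairs of `node`, sorted for stable output."""
    return sorted((p, o) for s, p, o in graph if s == node)

print(edges_from(triples, "ex:geneA"))
```

Reading a subject as a node and a predicate as a labelled edge is what lets independently published triples merge into one graph when they share identifiers.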

Round trip knowledge engineering (Vision)

In a linked open data environment:

– How will I find the resources?

– How will I integrate them into an application?

– How will I use existing knowledge in my analysis?

– How will I publish the new knowledge resulting from my analysis?

– And link to the evidence (data) from the new knowledge?

Programmable Web: A few thousand APIs

http://www.programmableweb.com/

SPARQL Interoperability: A single API

• a query language for RDF that matches graph patterns

• SPARQL endpoint is comparable to a database connector for triples

• SPARQL endpoint enables data layer interoperability
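
To make "matches graph patterns" concrete, here is a small Python sketch of basic graph pattern matching over an in-memory list of triples. It is an illustration under simplifying assumptions (all terms are strings, variables start with `?`, no OPTIONAL or FILTER), not how any particular SPARQL engine is implemented; the data is hypothetical.

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`; return None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):              # variable: bind it or check consistency
            if new.setdefault(term, value) != value:
                return None
        elif term != value:                   # constant: must match exactly
            return None
    return new

def query(graph, patterns):
    """Return all variable bindings that satisfy every pattern (a basic graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in graph
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings

graph = [
    ("ex:p1", "ex:takes", "ex:metformin"),    # hypothetical data
    ("ex:p2", "ex:takes", "ex:warfarin"),
    ("ex:p1", "rdfs:label", "patient 1"),
]
rows = query(graph, [("?patient", "ex:takes", "ex:metformin"),
                     ("?patient", "rdfs:label", "?name")])
print(rows)
```

The shared variable `?patient` is what joins the two patterns, which is the same mechanism a SPARQL endpoint exposes regardless of how the triples are stored.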

Background of the HCLS IG

• Originally chartered in 2005

– Chairs: Eric Neumann and Tonya Hongsermeier

• Re-chartered in 2008

– Chairs: Scott Marshall and Susie Stephens

– Team contact: Eric Prud’hommeaux

• Broad industry participation

– Over 100 members

– Mailing list of over 600

• Background Information

– http://www.w3.org/blog/hcls

– http://esw.w3.org/topic/HCLSIG

Mission of HCLS IG

• The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for

– Biological science

– Translational medicine

– Health care

• These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on the interoperability of information from many domains and processes for efficient decision support

How does the HCLS IG work?

• Task forces

• Regular teleconferences using teleconference bridge (Zakim), IRC, minutes

• Face2Face (F2F) twice a year

• Procedures for publishing W3C notes

Current Task Forces

• BioRDF – federating (neuroscience) knowledge bases
– M. Scott Marshall (Leiden University Medical Center / University of Amsterdam)

• Clinical Observations Interoperability – patient recruitment in trials
– Vipul Kashyap (Cigna Healthcare)

• Linking Open Drug Data – aggregation of Web-based drug data
– Susie Stephens (Johnson & Johnson)

• Translational Medicine Ontology – high level patient-centric ontology
– Michel Dumontier (Carleton University)

• Scientific Discourse – building communities through networking
– Tim Clark (Harvard University)

• Terminology – Semantic Web representation of existing resources
– John Madden (Duke University)

COI: Translating across domains

[Figure: data sources to translate across: EHR, Microarray, AlzForum, MRI, PubMed]

COI: Use Case

Pharmaceutical companies pay a lot to test drugs

Pharmaceutical companies express protocol in CDISC

-- precipitous gap --

Hospitals exchange information in HL7/RIM

Hospitals have relational databases

Source: Eric Prud’hommeaux

Inclusion Criteria

• Type 2 diabetes on diet and exercise therapy or monotherapy with metformin, insulin secretagogue, or alpha-glucosidase inhibitors, or a low-dose combination of these at 50% maximal dose. Dosing is stable for 8 weeks prior to randomization.

• …

• ?patient takes metformin .

Source: Holger Stenzhorn

Exclusion Criteria

Use of warfarin (Coumadin), clopidogrel (Plavix) or other anticoagulants.

…

?patient doesNotTake anticoagulant .

Source: Holger Stenzhorn

Criteria in SPARQL

?medication1 sdtm:subject ?patient ;
             spl:activeIngredient ?ingredient1 .
?ingredient1 spl:classCode 6809 . #metformin

OPTIONAL {
  ?medication2 sdtm:subject ?patient ;
               spl:activeIngredient ?ingredient2 .
  ?ingredient2 spl:classCode 11289 . #anticoagulant
} FILTER (!BOUND(?medication2))

Source: Holger Stenzhorn
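
The OPTIONAL block followed by FILTER (!BOUND(?medication2)) expresses negation: a patient is kept only if no anticoagulant medication could be matched. The same logic can be sketched in Python as a set difference. The class codes 6809 and 11289 are the ones shown in the query fragment; the patient records are hypothetical.

```python
METFORMIN, ANTICOAGULANT = 6809, 11289   # class codes as shown in the query fragment

# Hypothetical medication records: (patient, class code of the active ingredient)
medications = [
    ("patient1", 6809),
    ("patient2", 6809),
    ("patient2", 11289),
    ("patient3", 11289),
]

def eligible(meds):
    """Patients with a metformin record and no anticoagulant record."""
    on_metformin = {p for p, code in meds if code == METFORMIN}
    on_anticoagulant = {p for p, code in meds if code == ANTICOAGULANT}
    return sorted(on_metformin - on_anticoagulant)

print(eligible(medications))
```

Note that this only shows absence of a record, which is also all the SPARQL idiom can establish; it does not prove the patient takes no anticoagulant.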

TMO: Translating across domains

[Figure: data sources to translate across: EHR, Microarray, AlzForum, MRI, PubMed]

The Drug Development Pipeline

“A virtual space odyssey”, Cath O'Driscoll (2004)

http://www.nature.com/horizon/chemicalspace/background/odyssey.html

Questions & Problems

• The road is long, and costly.

• How do we contain costs and develop better drugs?

Source: Elgar Pichler

Translational Medicine Ontology

Mission

• Focuses on the development of a high level patient-centric ontology for the pharmaceutical industry. The ontology should enable data integration across discovery research, hypothesis management, experimental studies, compounds, formulation, drug development, market size, competitive data, population data, etc. This would enable scientists to answer new questions, and to answer existing scientific questions more quickly.

• This will help pharmaceutical companies to model patient-centric information, which is essential for the tailoring of drugs, and for early detection of compounds that may have sub-optimal safety profiles. The ontology should link to existing publicly available domain ontologies.

TMO Structure

Source: Susie Stephens

Translational Medicine KB

Source: Susie Stephens

TMO Query

How many patients experienced side effects while taking Donepezil?

Source: Susie Stephens

Terminology: Translating across domains

[Figure: data sources to translate across: EHR, Microarray, AlzForum, MRI, PubMed]

Terminology

• Representation of medical documents (OWL and SKOS)

• SKOS enables ‘fuzzier’ statements, with ‘broader’ and ‘narrower’

• Mammogram: Represent both radiology and pathology report to discover discrepancies

• Use Translational Medicine Ontology, RadLex, SNOMED in the RDF

BioRDF: Translating across domains

[Figure: data sources to translate across: EHR, Microarray, AlzForum, MRI, PubMed]

BioRDF: Looking for Targets for Alzheimer’s

• Signal transduction pathways are considered to be rich in “druggable” targets

• CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease

• Casting a wide net, can we find candidate genes known to be involved in signal transduction and active in Pyramidal Neurons?

Source: Alan Ruttenberg

BioRDF: Integrating Heterogeneous Data

[Figure: integrated data sources: NeuronDB, PDSPki, BAMS, Gene Ontology, BrainPharm, Antibodies, Reactome, Allen Brain Atlas, Literature, Entrez Gene, PubChem, MeSH, Homologene, SWAN, Mammalian Phenotype, AlzGene]

Source: Susie Stephens

“find me genes involved in signal transduction that are related to pyramidal neurons”

BioRDF: Results: Genes, Processes

• DRD1, 1812 adenylate cyclase activation
• ADRB2, 154 adenylate cyclase activation
• ADRB2, 154 arrestin mediated desensitization of G-protein coupled receptor protein signaling pathway
• DRD1IP, 50632 dopamine receptor signaling pathway
• DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
• DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
• GRM7, 2917 G-protein coupled receptor protein signaling pathway
• GNG3, 2785 G-protein coupled receptor protein signaling pathway
• GNG12, 55970 G-protein coupled receptor protein signaling pathway
• DRD2, 1813 G-protein coupled receptor protein signaling pathway
• ADRB2, 154 G-protein coupled receptor protein signaling pathway
• CALM3, 808 G-protein coupled receptor protein signaling pathway
• HTR2A, 3356 G-protein coupled receptor protein signaling pathway
• DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second messenger
• SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second messenger
• MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide second messenger
• CNR2, 1269 G-protein signaling, coupled to cyclic nucleotide second messenger
• HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second messenger
• GRIK2, 2898 glutamate signaling pathway
• GRIN1, 2902 glutamate signaling pathway
• GRIN2A, 2903 glutamate signaling pathway
• GRIN2B, 2904 glutamate signaling pathway
• ADAM10, 102 integrin-mediated signaling pathway
• GRM7, 2917 negative regulation of adenylate cyclase activity
• LRP1, 4035 negative regulation of Wnt receptor signaling pathway
• ADAM10, 102 Notch receptor processing
• ASCL1, 429 Notch signaling pathway
• HTR2A, 3356 serotonin receptor signaling pathway
• ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)
• PTPRG, 5793 transmembrane receptor protein tyrosine kinase signaling pathway
• EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway
• NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway
• CTNND1, 1500 Wnt receptor signaling pathway

Many of the genes are related to AD through gamma secretase (presenilin) activity

Source: Alan Ruttenberg

Automatic query decomposition

• Decompose query into components

• Locate sources of information

• Distribute queries across information sources

• Aggregate results
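
The four steps above can be sketched in Python as follows. This is an illustration under strong assumptions: the two "sources" are in-memory lists standing in for remote endpoints, the predicates `mentions` and `hasProcess` are hypothetical, and the decomposition is written by hand rather than derived automatically from a query.

```python
# Hypothetical sources: each holds triples for a few predicates.
SOURCES = {
    "literature": [("paper1", "mentions", "geneA"), ("paper2", "mentions", "geneB")],
    "annotations": [("geneA", "hasProcess", "signal transduction"),
                    ("geneB", "hasProcess", "metabolism")],
}

def locate(predicate):
    """Locate the sources that contain at least one triple with this predicate."""
    return [name for name, triples in SOURCES.items()
            if any(p == predicate for _, p, _ in triples)]

def run(source, predicate):
    """Send one sub-query (all triples with `predicate`) to one source."""
    return [(s, o) for s, p, o in SOURCES[source] if p == predicate]

def genes_in_process(process):
    """Decompose into two sub-queries, distribute them, then join on the gene."""
    mentioned = {o for src in locate("mentions") for _, o in run(src, "mentions")}
    annotated = {s for src in locate("hasProcess")
                 for s, o in run(src, "hasProcess") if o == process}
    return sorted(mentioned & annotated)      # aggregate: join the partial results

print(genes_in_process("signal transduction"))
```

The join only works because both sources use the same identifier for each gene, which is why shared identifiers come up repeatedly in the rest of the talk.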

A SPARQL query for processes involved in pyramidal neurons

prefix go: <http://purl.org/obo/owl/GO#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix mesh: <http://purl.org/commons/record/mesh/>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix ro: <http://www.obofoundry.org/ro/ro.owl#>

select ?genename ?processname
where
{ graph <http://purl.org/commons/hcls/pubmesh>
    { ?paper ?p mesh:D017966 .                        # MeSH: Pyramidal Neurons
      ?article sc:identified_by_pmid ?paper.          # Pubmed: Journal Articles
      ?gene sc:describes_gene_or_gene_product_mentioned_by ?article.   # Entrez Gene: Genes
    }
  graph <http://purl.org/commons/hcls/goa>
    { ?protein rdfs:subClassOf ?res.
      ?res owl:onProperty ro:has_function.
      ?res owl:someValuesFrom ?res2.
      ?res2 owl:onProperty ro:realized_as.
      ?res2 owl:someValuesFrom ?process.
      graph <http://purl.org/commons/hcls/20070416/classrelations>
        { {?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166}   # GO: Signal Transduction
          union
          {?process rdfs:subClassOf go:GO_0007166 } }                      # Inference required
      ?protein rdfs:subClassOf ?parent.
      ?parent owl:equivalentClass ?res3.
      ?res3 owl:hasValue ?gene.                       # Entrez Gene: Genes
    }
  graph <http://purl.org/commons/hcls/gene>
    { ?gene rdfs:label ?genename }
  graph <http://purl.org/commons/hcls/20070416>
    { ?process rdfs:label ?processname}
}

Recipe for a Semantic Web

• Follow Linked Open Data principles

• Attempt to use Shared Names (same URIs)

• Query rewriting to map from:

– SPARQL -> (query language)

– SPARQL (term1) -> SPARQL (term2)

• Add provenance information about graphs
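
The "SPARQL (term1) -> SPARQL (term2)" case can be illustrated with a deliberately naive Python sketch that swaps prefixed names in the query text. The `old:` and `new:` vocabularies and the mapping table are hypothetical; a production rewriter would presumably operate on the parsed query rather than on raw text.

```python
import re

# Hypothetical mapping from one vocabulary's terms to another's.
TERM_MAP = {"old:Gene": "new:Gene", "old:label": "rdfs:label"}

def rewrite(query, term_map):
    """Replace each mapped prefixed name in the query text; leave other tokens alone."""
    token = re.compile(r"[A-Za-z_][\w-]*:[\w-]+")
    return token.sub(lambda m: term_map.get(m.group(0), m.group(0)), query)

q = "SELECT ?g WHERE { ?g a old:Gene ; old:label ?name }"
rewritten = rewrite(q, TERM_MAP)
print(rewritten)
```

Every pair of vocabularies needs its own mapping table, which is the cost that shared names are meant to avoid.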

Interlinking in LODD

http://esw.w3.org/HCLSIG/LODD/Interlinking

LOD cloud – a few challenges

• It’s a (nice) data warehouse in RDF

– http://lod.openlinksw.com/

• Ad hoc identifiers

– Prefer authoritative, persistent, and shared

• Redundancy

• Lacks provenance – Who created a given RDF rendering? Using what method? Version?

So, what’s missing?

• Data stewardship – serve (annotated) data back to community

• Plug-n-play SPARQL access to relational stores

• Best practice for data federation of SPARQL endpoints

• Best practice for important data sources: microarray data, variety of image data

• Shared Identifiers

Provenance

• Represent knowledge so that others can discover where a fact (or triple) came from and evaluate how to use it – link facts to data as evidence

• Named graphs – talk about what’s in them

• Example use: “Find the latest RDF version of DrugBank” from SPARQL endpoints accessible to me.

Provenance for Federation
or “What’s in a graph?”

• Minimum information about a graph (MIAG)

• Is it an ontology or RDF data?

• Version, License, ..

• Who converted it to RDF?

• Rendering: “I’m looking for SNOMED in SKOS (not OWL)”

• What is the most populated class?

• Graph statistics

Translating across domains

[Figure: data sources to translate across: EHR, Microarray, AlzForum, MRI, PubMed]

Provenance types are perspectives on the data 

Source: Helena Deus

A Bottom-up Approach

[Figure: provenance of a microarray experiment, from raw data to results, related to workflow and experimental design, domain ontologies (DO, GO…) and community models]

Questions

• Which genes are markers for neurodegenerative diseases?

• Was gene ALG2 differentially expressed in multiple experiments?

• What software was used to analyse the data?

• How can the experiment be replicated?

Source: Helena Deus

BioRDF: finding latest version

FILTER (?date_issued2 > ?date_issued)

FILTER (!BOUND(?srvc2))

# Get associated diseases from most recently updated Diseasome server.

SERVICE ?srvc2 {                            #  <- Federation here

?diseasomeGene rdfs:label ?geneLabel . 

?disease diseasome:associatedGene ?diseasomeGene. 

?disease rdfs:label ?diseaseName . 
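
The two FILTER lines in this fragment appear to follow the common "no newer version exists" idiom: look for another endpoint with a later issue date and keep the current one only if none is found. Assuming that reading, the selection step can be sketched in Python. The endpoint URLs, dataset names and dates are hypothetical placeholders.

```python
from datetime import date

# Hypothetical endpoint descriptions: (service URL, dataset, date issued)
endpoints = [
    ("http://example.org/sparql/a", "Diseasome", date(2010, 3, 1)),
    ("http://example.org/sparql/b", "Diseasome", date(2011, 1, 15)),
    ("http://example.org/sparql/c", "DrugBank", date(2010, 7, 9)),
]

def latest(dataset, described):
    """Keep the endpoints for which no other endpoint of the dataset is newer."""
    candidates = [e for e in described if e[1] == dataset]
    return [url for url, _, issued in candidates
            if not any(other > issued for _, _, other in candidates)]

print(latest("Diseasome", endpoints))
```

This only works if each endpoint publishes an issue date for its graphs, which is one motivation for attaching provenance metadata to graphs.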

Myth busting

• Creating a Semantic Web application does not necessarily require converting and storing all data in RDF

• “Leave the data where it lives”

• Create views of the data accessible via RDF export or (BETTER) a SPARQL endpoint

SWObjects

• Creator: Eric Prud’hommeaux

• SWObjects creates a SPARQL endpoint

• Query rewriting to SQL and SPARQL

• Uses SPARQL Constructs as a mapping language

• JNI interface

• http://en.wikipedia.org/wiki/SWObjects

• http://precedings.nature.com/documents/5538/

Semantic Web is a framework

that provides common languages for defining vocabularies and querying across data that uses them.

However, common identifiers must be established if we are to link data without a million mapping files.

Shared Identifiers

• Must use common URIs in order to link data

• Provenance related identifiers still needed:

– Identifiers for people (researchers)

– Identifiers for diseases

– Identifiers for terms (Terminology servers)

– Identifiers for programs, processes, workflows

– Identifiers for chemical compounds

• Shared Names http://sharednames.org

Emerging Best Practices for RDF

1. Convert unstructured source data into a structured data format, such as relational tables or XML

2. Identify persistent URLs (PURLs) for concepts in existing ontologies and create a map from the structured data into an RDF view

3. Customize the map manually if necessary

4. Map concepts in the new RDF mapping to concepts in other RDF data sources, relying as much as possible on ontologies

5. Publish the RDF data through a SPARQL endpoint or as Linked Data

6. Alternatively, if data is in a relational format, apply a Semantic Web toolkit such as SWObjects [Prud’hommeaux et al. 2010] that enables SPARQL queries over the relational schema

7. Create Semantic Web applications using the published data
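
Step 2, creating a map from structured data into an RDF view, can be sketched in Python as a column-to-predicate table applied to each row. The base URI and the table layout are hypothetical placeholders (a real view would use PURLs from existing ontologies); the two gene symbols and identifiers are the ones listed in the BioRDF results.

```python
# Hypothetical relational rows and a column-to-predicate map.
rows = [
    {"id": 1812, "symbol": "DRD1"},
    {"id": 1813, "symbol": "DRD2"},
]
BASE = "http://example.org/gene/"            # placeholder; a real view would use PURLs
COLUMN_TO_PREDICATE = {"symbol": "http://www.w3.org/2000/01/rdf-schema#label"}

def rows_to_triples(table, key="id"):
    """Yield one triple per mapped column, using the key column to mint the subject URI."""
    for row in table:
        subject = f"{BASE}{row[key]}"
        for column, predicate in COLUMN_TO_PREDICATE.items():
            yield (subject, predicate, row[column])

triples = list(rows_to_triples(rows))
print(triples)
```

Keeping the mapping in a declarative table rather than in custom code makes it easier to publish alongside the data, so that others can see how the RDF was produced.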

Create RDF views that a stranger can use

• Use a mapping language to create an RDF view of the data when possible, rather than customized code.

• When possible, use vocabularies that are openly available from an authoritative server. For the life sciences, we have OBO and NCBO.

• Use rdfs:label generously to provide information to user interfaces

Publish RDF so that it can be discovered

• Publish open access data whenever possible, as well as any associated software.

• Publish a URL to the software and mappings that you used to create the RDF.

• Register your data at CKAN. If it’s biomedical, register it in BioCatalogue.

• Assign a graph URI to the RDF graph and add provenance and metadata about the graph URI to the graph itself. This practice makes it possible for visitors and crawlers to find out what is in the graph.

Acknowledgements

• W3C Health Care and Life Science Interest Group, http://www.w3.org/blog/hcls

• National Center for Biomedical Ontologies

• Concept Web Alliance

• Authors of all contributed slides

The End

“Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.”

– Henri Poincaré, Science and Hypothesis, 1905