Question Answering on Interlinked Data

Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, Leipzig University
December 5, 2013, IBM Research Center


Page 2: Sina presentation in IBM

Motivation: Retrieving information from Linked Open Data (LOD)

AKSW group - Question Answering on Interlinked Data (published in www2013)


Page 3: Sina presentation in IBM

Motivation

Text queries (either keyword or natural language) are:

•  A simple retrieval approach
•  Popular
•  Implicit and ambiguous in their semantics

SPARQL queries require:

•  Knowledge about the ontology
•  Proficiency in formulating formal queries
•  Explicit and unambiguous semantics


Page 4: Sina presentation in IBM

Comparison of Search Approaches

[Figure: search approaches compared along two dimensions, query type (keyword-based query vs. natural language query) and data-semantic awareness (unaware vs. aware). Labeled regions: Information Retrieval, Question Answering Systems, and our approach, SINA.]

Page 5: Sina presentation in IBM

Example

•  Which television shows were created by Walt Disney?

SELECT * WHERE {
  ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney .
}
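The target query for this example can be assembled mechanically once the three resources are known. The sketch below is only illustrative: `build_query` and the `PREFIXES` table are hypothetical names, not part of SINA, and the prefix IRIs are the usual DBpedia namespaces.

```python
# Hypothetical helper: assemble a conjunctive SPARQL query from triple patterns.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}

def build_query(triples):
    """Render (subject, predicate, object) patterns as a SELECT * query."""
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items())
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
    return f"{header}\nSELECT * WHERE {{\n{body}\n}}"

query = build_query([
    ("?v0", "a", "dbo:TelevisionShow"),
    ("?v0", "dbo:creator", "dbr:Walt_Disney"),
])
print(query)
```

The hard part, which the rest of the talk addresses, is choosing the resources and the graph shape; rendering the final string is the easy step.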

Page 6: Sina presentation in IBM

Aim and Challenges

Aim: Question answering over a set of interlinked data sources.

Challenges:

•  Query segmentation.
•  Resource disambiguation.
•  Constructing a formal query (expressed in SPARQL).


Page 7: Sina presentation in IBM

Further Challenges over Interlinked Data

1.  Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.

2.  Constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels.


Page 8: Sina presentation in IBM

SINA Architecture

[Figure: SINA architecture]

Page 9: Sina presentation in IBM

Test bed datasets

•  One single dataset: DBpedia.
•  Three interlinked datasets from the life sciences:
   ✓  Drugbank: a comprehensive knowledge base containing information about drugs, drug targets (i.e., proteins), interactions and enzymes.
   ✓  Diseasome: contains information about diseases and the genes associated with these diseases.
   ✓  Sider: contains information about drugs and their side effects.

Page 10: Sina presentation in IBM

Main characteristics of federated queries

1.  Queries requiring fused information, e.g., side effects of drugs used for Tuberculosis.

2.  Queries targeting combined information, e.g., side effects and enzymes of drugs used for asthma.

3.  Queries requiring keyword expansion, e.g., side effects of Valdecoxib.

[Figure: query graph for the asthma example spanning Diseasome, Sider and DrugBank, with the instance Asthma, the variables ?v0 to ?v3, the classes Disease, Drug, Side Effect and Enzymes, and the edges side effect, enzyme and sameAs.]

Page 11: Sina presentation in IBM

Challenge 1: Query Segmentation and Resource Disambiguation

•  Sample question: What are the side effects of drugs used for Tuberculosis?
•  Transformed to the 4-tuple (side # effect # drug # Tuberculosis).
•  Different segmentations are possible:
   1.  (side effect # drug # Tuberculosis)
   2.  (side effect drug # Tuberculosis)
•  Each valid segment is mapped to resources in the underlying knowledge bases.
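To make the size of the segmentation space concrete, the following sketch enumerates every way of cutting the keyword tuple into contiguous segments. The function name `segmentations` is an assumption for illustration; the slides do not say how SINA enumerates candidates.

```python
from itertools import combinations

def segmentations(keywords):
    """Yield every split of the keyword list into contiguous segments."""
    n = len(keywords)
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(keywords[a:b]) for a, b in zip(bounds, bounds[1:])]

keywords = ["side", "effect", "drug", "Tuberculosis"]
all_segmentations = list(segmentations(keywords))
for seg in all_segmentations:
    print(" # ".join(seg))
```

With n keywords there are 2^(n-1) segmentations, so the four keywords here give eight candidates, including the two listed on the slide.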

Page 12: Sina presentation in IBM

Segment validation

✓  Original tuple: (side # effect # drug # Tuberculosis).
✓  Using a naive approach for finding all valid segments.

Valid segment | Samples of candidate resources
side effect   | 1. sider:class:sideeffect  2. sider:property:side_effects
drug          | 1. drugbank:drugs  2. class:offer  3. sider:drugs  4. diseases:possibledrug
tuberculosis  | 1. diseases:1154  2. side_effects:C0041296
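A naive validation pass can be sketched as a scan over all contiguous keyword runs, keeping those that match at least one resource label. This is a simplified illustration: `LABEL_INDEX` is a hypothetical in-memory lookup filled with the candidates from the slide, and it uses exact matching, whereas a real system would query the knowledge bases and tolerate label variations.

```python
# Hypothetical label index: segment text -> candidate resources (taken from the slide).
LABEL_INDEX = {
    "side effect": ["sider:class:sideeffect", "sider:property:side_effects"],
    "drug": ["drugbank:drugs", "class:offer", "sider:drugs", "diseases:possibledrug"],
    "tuberculosis": ["diseases:1154", "side_effects:C0041296"],
}

def valid_segments(keywords, index):
    """Return every contiguous keyword run that has at least one candidate resource."""
    found = {}
    n = len(keywords)
    for start in range(n):
        for end in range(start + 1, n + 1):
            segment = " ".join(keywords[start:end]).lower()
            if segment in index:
                found[segment] = index[segment]
    return found

result = valid_segments(["side", "effect", "drug", "Tuberculosis"], LABEL_INDEX)
for segment, candidates in result.items():
    print(segment, "->", candidates)
```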

Page 13: Sina presentation in IBM

Concurrent Segmentation and Disambiguation


Page 14: Sina presentation in IBM

Hidden Markov Model

•  A statistical model containing a set of states.
•  Moving from one state to another generates a sequence of observations.
•  The probability of entering a state depends only on the previous state.
•  The output is the most likely sequence of states generating the observed sequence.


Page 15: Sina presentation in IBM

State Space

•  A state represents a knowledge base resource.
•  The full state space contains all resources in the knowledge base.
•  In practice, we prune the state space by excluding irrelevant states.
•  We add an unknown entity state comprising all resources that are not available (anymore) in the pruned state space.
•  Extension of the state space with reasoning: the state space is extended with resources inferred from lightweight owl:sameAs reasoning.


Page 16: Sina presentation in IBM

Bootstrapping the Model Parameters: Emission Probability

•  The set-similarity level measures the difference between the label and the segment in terms of the number of words using the Jaccard similarity.

•  The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label using the Levenshtein distance.
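The two similarity levels can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the Levenshtein distance is normalized by the longer word to obtain a similarity in [0, 1], and the per-word scores are averaged. The slides do not state how the two levels are combined into the final emission probability, so that step is left out.

```python
def jaccard(segment, label):
    """Set-similarity level: word overlap between segment and label."""
    a, b = set(segment.lower().split()), set(label.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein(s, t):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def string_similarity(segment, label):
    """String-similarity level: each segment word against its most similar label word."""
    label_words = label.lower().split()  # assumed non-empty
    scores = []
    for word in segment.lower().split():
        best = max(1 - levenshtein(word, lw) / max(len(word), len(lw)) for lw in label_words)
        scores.append(best)
    return sum(scores) / len(scores)

print(jaccard("side effect", "side effects"))
print(string_similarity("side effect", "side effects"))
```

For the segment "side effect" and the label "side effects", the word sets share only "side", so the Jaccard score is low, while the string-similarity level still recognizes "effect" and "effects" as near matches. This is why the two levels complement each other.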


Page 17: Sina presentation in IBM

Bootstrapping the Model Parameters: Transition Probability & Initial Probability

•  The transition probability and initial probability are computed based on the semantic relatedness of two resources.

•  Semantic relatedness is based on two values: distance and connectivity degree.

•  We transform these two values to hub and authority values using the HITS algorithm.

•  Initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
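For readers unfamiliar with HITS, here is a minimal, generic version of the algorithm on an unweighted directed graph. It is not SINA's exact formulation: how distance and connectivity degree enter the graph is not given on the slide, and the toy edges between resources are an assumption for illustration only.

```python
def hits(edges, iterations=50):
    """Plain HITS: iterate authority and hub updates with L2 normalization."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of n: sum of hub scores of the nodes pointing to n.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub of n: sum of authority scores of the nodes n points to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Toy graph (hypothetical links between candidate resources).
edges = [
    ("dbo:TelevisionShow", "dbo:creator"),
    ("dbo:creator", "dbr:Walt_Disney"),
    ("dbo:TelevisionShow", "dbr:Walt_Disney"),
]
hub, auth = hits(edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the node with the most incoming links ends up with the highest authority score, and the node that points to the strongest authorities ends up as the best hub.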


Page 18: Sina presentation in IBM

Evaluation of Bootstrapping

•  We compared the accuracy of different distribution functions, i.e., normal, Zipfian and uniform distributions, for the transition probability.

•  We ran the distribution functions with two different inputs, i.e., distance and connectivity degree values as well as hub and authority values.


Page 19: Sina presentation in IBM

Viterbi Algorithm

Aim: Find the most likely path of states generating the sequence of input keywords.
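A compact Viterbi decoder illustrates the idea. The sketch makes simplifying assumptions: observations are already segments (SINA performs segmentation and disambiguation concurrently), and all probabilities below are made-up toy numbers rather than bootstrapped parameters.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            layer[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(layer)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    return prob, path[::-1]

states = ["dbo:TelevisionShow", "dbr:TelevisionShow", "dbo:creator", "dbr:Walt_Disney"]
start_p = {"dbo:TelevisionShow": 0.4, "dbr:TelevisionShow": 0.3,
           "dbo:creator": 0.2, "dbr:Walt_Disney": 0.1}
trans_p = {
    "dbo:TelevisionShow": {"dbo:creator": 0.6, "dbr:Walt_Disney": 0.2},
    "dbr:TelevisionShow": {"dbo:creator": 0.3},
    "dbo:creator": {"dbr:Walt_Disney": 0.7},
    "dbr:Walt_Disney": {},
}
emit_p = {
    "dbo:TelevisionShow": {"television show": 0.9},
    "dbr:TelevisionShow": {"television show": 0.8},
    "dbo:creator": {"created": 0.7},
    "dbr:Walt_Disney": {"Walt Disney": 0.9},
}
prob, path = viterbi(["television show", "created", "Walt Disney"],
                     states, start_p, trans_p, emit_p)
print(prob, path)
```

With these toy parameters the decoder prefers the ontology class dbo:TelevisionShow over the resource dbr:TelevisionShow, because the former has both the higher start probability and the stronger transition to dbo:creator.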


Page 20: Sina presentation in IBM

Output of the HMM for the following query: Which television shows were created by Walt Disney?

Probability | Path of states
0.0023      | dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
0.0014      | dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
5.89E-4     | dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
3.53E-4     | dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
3.76E-5     | dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney


Page 21: Sina presentation in IBM

Query Construction


Page 22: Sina presentation in IBM

Query Construction Method

Input: a set of resources R = {r1, r2, ..., rn}.
Output: a query graph QG = (V, E), which is a directed, connected multi-graph.

Forward chaining:
1.  CT: Comprehensive type.
2.  CD: Comprehensive domain.
3.  CR: Comprehensive range.

Page 23: Sina presentation in IBM

Query Construction Method

Input: a set of resources R = {r1, r2, ..., rn}.
Output: a query graph QG = (V, E), which is a directed, connected multi-graph.

Generating the Incomplete Query Graph (IQG): initializing vertices and primary edges.
•  A vertex is added to the IQG (1) if r is an instance or (2) if r is a class.
•  Properties are added along with zero, one or two vertices.

Page 24: Sina presentation in IBM

Query Construction Method

Example: What are the side effects of drugs used for Tuberculosis?

•  diseasome:1154 (type: instance)
•  diseasome:possibleDrug (type: property)
•  sider:sideEffect (type: property)

[Figure: the resulting IQG consists of two disjoint sub-graphs. Graph 1: 1154 possibleDrug ?v0. Graph 2: ?v1 sideEffect ?v2.]
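The IQG for this example can be reproduced with a small sketch. It is a simplification under assumptions: `TYPES` and `DOMAINS` are hypothetical stand-ins for the comprehensive type and domain sets, the class names in them are invented for illustration, classes and ranges are ignored, and a property reuses an instance as its subject only when the instance's types intersect the property's domain.

```python
from itertools import count

# Hypothetical schema facts, for illustration only.
TYPES = {"diseasome:1154": {"diseasome:diseases"}}
DOMAINS = {
    "diseasome:possibleDrug": {"diseasome:diseases"},
    "sider:sideEffect": {"sider:drugs"},
}

def build_iqg(instances, properties):
    """Create vertices for instances and one edge per property, adding fresh variables as needed."""
    fresh = (f"?v{i}" for i in count())
    vertices, edges = set(instances), []
    for prop in properties:
        domain = DOMAINS.get(prop, set())
        # Attach the property to an instance whose types fit its domain, if any.
        subject = next((i for i in instances if TYPES.get(i, set()) & domain), None)
        if subject is None:
            subject = next(fresh)
        obj = next(fresh)
        vertices |= {subject, obj}
        edges.append((subject, prop, obj))
    return vertices, edges

vertices, edges = build_iqg(
    ["diseasome:1154"],
    ["diseasome:possibleDrug", "sider:sideEffect"],
)
for edge in edges:
    print(edge)
```

The first property adds one new vertex and the second adds two, which yields the two disconnected sub-graphs of the example. They still need to be connected to form a valid query graph.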

Page 25: Sina presentation in IBM

Query Construction Method

Connecting sub-graphs of an IQG:
1.  Minimum spanning tree: a minimum set of edges (i.e., properties) to span a set of disjoint graphs.
2.  Prim's algorithm: incrementally includes edges to connect disjoint sub-graphs.

•  Direct properties: ?v0 ?p ?v1.
•  Properties via owl:sameAs link:
   (1) ?v0 owl:sameAs ?x. ?x ?p ?v1.
   (2) ?v0 ?p ?x. ?x owl:sameAs ?v1.
   (3) ?v0 owl:sameAs ?x. ?x ?p ?y. ?y owl:sameAs ?v1.

[Figure: Template 1 and Template 2, two alternative ways of connecting the sub-graphs 1154 possibleDrug ?v0 and ?v1 sideEffect ?v2.]
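The four connection patterns listed above can be generated for any pair of vertices. The helper names `connection_patterns` and `to_sparql` are hypothetical; the sketch only enumerates the candidate graph patterns and does not check which of them actually return results against the datasets.

```python
def connection_patterns(v_from, v_to, prop="?p"):
    """Candidate patterns linking two vertices: one direct, three via owl:sameAs."""
    same = "owl:sameAs"
    return [
        [(v_from, prop, v_to)],
        [(v_from, same, "?x"), ("?x", prop, v_to)],
        [(v_from, prop, "?x"), ("?x", same, v_to)],
        [(v_from, same, "?x"), ("?x", prop, "?y"), ("?y", same, v_to)],
    ]

def to_sparql(pattern):
    """Render a list of triples as a SPARQL basic graph pattern."""
    return " ".join(f"{s} {p} {o} ." for s, p, o in pattern)

patterns = connection_patterns("?v0", "?v1")
for pattern in patterns:
    print(to_sparql(pattern))
```

Each pattern is a candidate edge between two disjoint sub-graphs. A spanning-tree procedure such as Prim's algorithm would then pick a minimal set of such edges to connect the whole graph.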

Page 26: Sina presentation in IBM

Evaluation

Goal of the experiment: how well do
1.  resource disambiguation and
2.  query construction
perform?

Measurement of performance:
1.  Disambiguation: Mean Reciprocal Rank (MRR).
2.  Query construction: precision and recall.

Benchmark:
1.  Each benchmark entry pairs a natural-language query with the equivalent conjunctive SPARQL query.
2.  25 queries on the 3 interlinked datasets Drugbank, Sider and Diseasome.
3.  QALD1 and QALD3 benchmarks for DBpedia.
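For reference, the three measures can be computed as in the sketch below. It uses the standard definitions; the function names and the toy inputs are illustrative and are not taken from the evaluation itself.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the correct answer per query (0 if it is missing)."""
    total = 0.0
    for ranking, correct in zip(ranked_lists, gold):
        total += 1.0 / (ranking.index(correct) + 1) if correct in ranking else 0.0
    return total / len(gold)

def precision_recall(returned, expected):
    """Precision and recall of a returned answer set against the expected one."""
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

# Toy data: the correct resource is ranked first for query 1 and second for query 2.
mrr = mean_reciprocal_rank([["a", "b"], ["b", "a"]], ["a", "a"])
precision, recall = precision_recall(["x", "y", "z"], ["x", "y", "w", "v"])
print(mrr, precision, recall)
```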


Page 27: Sina presentation in IBM

Evaluation using life-science datasets

Without reasoning: precision = 0.91, recall = 0.88
With reasoning: precision = 0.95, recall = 0.90

Page 28: Sina presentation in IBM

Evaluation using DBpedia

•  QALD3 benchmark:
   ✓  contains 100 questions.
   ✓  32 original questions can be answered correctly.
•  QALD1 benchmark:
   ✓  contains 50 questions.
   ✓  7 complex questions.
   ✓  13 questions requiring information beyond DBpedia, i.e., from YAGO and FOAF.
   ✓  14 questions were slightly modified to remove expansion and cleaning problems.
   ✓  MRR of disambiguation = 96%
   ✓  Query construction accuracy = 83%


Page 29: Sina presentation in IBM

Runtime

Parallelization over three components:
1.  Segment validation
2.  Resource retrieval
3.  Query construction

Page 30: Sina presentation in IBM

Related work

Page 31: Sina presentation in IBM


Thank you

Saeedeh Shekarpour