Sina presentation in IBM

+ Question Answering on Interlinked Data

Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, Leipzig University
December 5, 2013, IBM Research Center

+ Motivation: Retrieving information from LOD

AKSW group - Question Answering on Interlinked Data (published at WWW 2013)

+ Motivation

Text queries (either keyword or natural language) are:

•  Simple retrieval approach
•  Popular
•  Implicit and ambiguous semantics

SPARQL queries require:

•  Knowledge about the ontology
•  Proficiency in formulating formal queries
•  Explicit and unambiguous semantics


+ Comparison of Search Approaches


[Figure: search approaches compared along two axes, data-semantic unaware vs. data-semantic aware and keyword-based query vs. natural language query. Approaches shown: Information Retrieval, Question Answering Systems, and our approach, SINA.]

+ Example

•  Which television shows were created by Walt Disney?

select * where {
  ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney .
}

+ Aim and Challenges

Aim: Question answering over a set of interlinked data sources.

Challenges:

•  Query segmentation.
•  Resource disambiguation.
•  Construction of a formal query (expressed in SPARQL).

+ Further Challenges over Interlinked Data

1.  Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.

2.  Constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels.



+ SINA Architecture



+ Test bed datasets


•  One single dataset: DBpedia.
•  Three interlinked datasets from the life sciences:
   ✓  Drugbank: a comprehensive knowledge base containing information about drugs, drug targets (i.e., proteins), interactions and enzymes.
   ✓  Diseasome: contains information about diseases and the genes associated with these diseases.
   ✓  Sider: contains information about drugs and their side effects.

+ Main characteristics of federated queries

1.  Queries requiring fused information, e.g. side effects of drugs used for Tuberculosis.

2.  Queries targeting combined information, e.g. side effects and enzymes of drugs used for Asthma.

3.  Queries requiring keyword expansion, e.g. side effects of Valdecoxib.


[Figure: query graph for the Asthma example spanning Diseasome, DrugBank and Sider. Nodes: Asthma, ?v0, ?v1, ?v2, ?v3. Classes: Disease, Drug, Side Effect, Enzymes. Edge labels: a (type), side effect, enzyme, sameAs.]

+ Challenge 1: Query Segmentation and Resource Disambiguation

•  Sample question: What are the side effects of drugs used for Tuberculosis?

•  Transformed to the 4-tuple (side # effect # drug # Tuberculosis).

•  Different segmentations are possible:
   1.  (side effect # drug # Tuberculosis)
   2.  (side effect drug # Tuberculosis)
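The number of possible segmentations grows quickly: n keywords admit 2^(n-1) segmentations, because each gap between neighbouring keywords either is a segment boundary or is not. A minimal Python sketch of that enumeration follows; the function name `segmentations` is illustrative and not taken from SINA.

```python
def segmentations(keywords):
    """Enumerate every way to split an ordered keyword list into contiguous segments."""
    if not keywords:
        return []
    n = len(keywords)
    result = []
    # Bit i-1 of the mask decides whether a boundary is placed before keyword i.
    for mask in range(2 ** (n - 1)):
        segments, current = [], [keywords[0]]
        for i in range(1, n):
            if mask & (1 << (i - 1)):
                segments.append(" ".join(current))
                current = [keywords[i]]
            else:
                current.append(keywords[i])
        segments.append(" ".join(current))
        result.append(tuple(segments))
    return result


all_segmentations = segmentations(["side", "effect", "drug", "Tuberculosis"])
print(len(all_segmentations))
```

Both segmentations listed above appear in the result; the remaining ones are ruled out only once segments are checked against the knowledge bases.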

Each valid segment is mapped to resources in the underlying knowledge bases.

Segment validation

✓  Original tuple: (side # effect # drug # Tuberculosis).
✓  A naive approach is used to find all valid segments.

Valid segment    Samples of candidate resources
side effect      1. sider:class:sideeffect   2. sider:property:side_effects
drug             1. drugbank:drugs   2. class:offer   3. sider:drugs   4. diseases:possibledrug
tuberculosis     1. diseases:1154   2. side_effects:C0041296
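One way to picture a naive validation pass is to test every contiguous run of keywords against the labels known to the knowledge bases. The sketch below assumes exact, case-insensitive label matching and uses a hypothetical in-memory label list (`toy_labels`) in place of a real label index; the slides do not show how SINA performs the lookup.

```python
def valid_segments(keywords, known_labels):
    """Return every contiguous keyword run that matches a known resource label."""
    labels = {label.lower() for label in known_labels}
    found = []
    for start in range(len(keywords)):
        for end in range(start + 1, len(keywords) + 1):
            segment = " ".join(keywords[start:end]).lower()
            if segment in labels:
                found.append(segment)
    return found


# Hypothetical labels standing in for the knowledge-base label index.
toy_labels = ["side effect", "drug", "tuberculosis"]
print(valid_segments(["side", "effect", "drug", "Tuberculosis"], toy_labels))
```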



+ Concurrent Segmentation and Disambiguation

Hidden Markov Model

•  A statistical model containing a set of states.
•  Moving from one state to another generates a sequence of observations.
•  The probability of entering a state depends only on the previous state.
•  The output is the most likely sequence of states generating the observed sequence.

State Space

•  A state represents a knowledge base resource.
•  The state space contains all resources in the knowledge base.
•  In practice, we prune the state space by excluding irrelevant states.
•  We add an unknown entity state comprising all resources that are no longer available in the pruned state space.
•  Extension of the state space with reasoning: resources inferred by lightweight owl:sameAs reasoning are added to the state space.
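As a rough illustration of the owl:sameAs extension, the sketch below adds every resource reachable through sameAs links to a pruned set of states. It assumes sameAs is treated as symmetric and transitive, which may be broader than the "lightweight" reasoning meant here, and the resource names are made up.

```python
def extend_with_same_as(states, same_as_links):
    """Add all resources reachable from `states` via owl:sameAs links."""
    neighbours = {}
    for left, right in same_as_links:
        # Assumption: sameAs is symmetric, so store the link in both directions.
        neighbours.setdefault(left, set()).add(right)
        neighbours.setdefault(right, set()).add(left)
    extended = set(states)
    frontier = list(states)
    while frontier:
        resource = frontier.pop()
        for other in neighbours.get(resource, ()):
            if other not in extended:
                extended.add(other)
                frontier.append(other)
    return extended


# Hypothetical identifiers, for illustration only.
links = [("drugbank:drugA", "sider:drugA"), ("sider:drugA", "dbpedia:DrugA")]
print(sorted(extend_with_same_as({"drugbank:drugA"}, links)))
```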



Bootstrapping the Model Parameters: Emission Probability

•  The set-similarity level measures the difference between the label and the segment in terms of the number of words using the Jaccard similarity.

•  The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label using the Levenshtein distance.
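The two similarity levels can be sketched in Python as follows. The slides do not say how the two values are combined into an emission probability, so the sketch only computes them separately; turning the Levenshtein distance into a similarity via 1 - distance / max(length) is an assumption, as are the function names.

```python
def jaccard(segment, label):
    """Set similarity: shared words divided by all distinct words."""
    a, b = set(segment.lower().split()), set(label.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]


def string_similarity(segment, label):
    """Average, over segment words, of the best normalised match in the label."""
    segment_words, label_words = segment.lower().split(), label.lower().split()
    if not segment_words or not label_words:
        return 0.0
    best_scores = [
        max(1 - levenshtein(w, lw) / max(len(w), len(lw)) for lw in label_words)
        for w in segment_words
    ]
    return sum(best_scores) / len(best_scores)


print(jaccard("side effect", "side effects"))
print(string_similarity("side effect", "side effects"))
```

For the segment "side effect" and the label "side effects", the word sets share only "side", so the set similarity is low, while the string similarity stays high because "effect" and "effects" differ by a single character.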



Bootstrapping the Model Parameters: Transition Probability & Initial Probability

•  The transition probability and initial probability are computed from the semantic relatedness of two resources.

•  Semantic relatedness is based on two values: distance and connectivity degree.

•  We transform these two values into hub and authority values using the HITS algorithm.

•  The initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
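For readers unfamiliar with HITS, here is a minimal textbook version in Python: hub and authority scores are updated in turn and normalised after each step. It runs on a plain list of directed edges; how SINA derives its input from distance and connectivity degree is not described on the slide, so the toy graph is purely hypothetical.

```python
def hits(edges, iterations=50):
    """Plain HITS power iteration on a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)

    def normalise(scores):
        norm = sum(v * v for v in scores.values()) ** 0.5
        return {n: (v / norm if norm else 0.0) for n, v in scores.items()}

    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes pointing to it.
        auth = normalise({n: sum(hub[s] for s, t in edges if t == n) for n in nodes})
        # Hub: sum of the authority scores of the nodes it points to.
        hub = normalise({n: sum(auth[t] for s, t in edges if s == n) for n in nodes})
    return hub, auth


# Hypothetical graph: two resources both link to a third one.
hub_scores, authority_scores = hits([("a", "c"), ("b", "c")])
print(hub_scores, authority_scores)
```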



Evaluation of Bootstrapping

•  We evaluated the accuracy of different distribution functions (normal, Zipfian and uniform) for the transition probability.

•  We ran the distribution functions with two different inputs: the distance and connectivity degree values, and the hub and authority values.

+ Viterbi Algorithm


Aim: Find the most likely path of states generating the sequence of input keywords.
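A compact Python sketch of the standard Viterbi recursion is given below. For simplicity it assumes the input has already been split into three observations and uses invented probabilities; in SINA the states are knowledge base resources and the parameters come from the bootstrapping step, and segmentation happens inside the model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # layers[t][s] = (probability of the best path ending in s at step t, predecessor)
    layers = [{s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), None)
               for s in states}]
    for obs in observations[1:]:
        previous = layers[-1]
        layer = {}
        for s in states:
            layer[s] = max(
                ((previous[p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                 for p in states),
                key=lambda candidate: candidate[0],
            )
        layers.append(layer)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: layers[-1][s][0])
    path = [last]
    for t in range(len(layers) - 1, 0, -1):
        path.append(layers[t][path[-1]][1])
    path.reverse()
    return layers[-1][last][0], path


# Toy parameters, invented for illustration.
TS, CR, WD = "dbo:TelevisionShow", "dbo:creator", "dbr:Walt_Disney"
states = [TS, CR, WD]
start_p = {TS: 0.5, CR: 0.25, WD: 0.25}
trans_p = {TS: {CR: 0.6, WD: 0.4}, CR: {WD: 0.7, TS: 0.3}, WD: {CR: 0.5, TS: 0.5}}
emit_p = {TS: {"television shows": 0.8}, CR: {"created": 0.7}, WD: {"Walt Disney": 0.9}}

probability, path = viterbi(["television shows", "created", "Walt Disney"],
                            states, start_p, trans_p, emit_p)
print(probability, path)
```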


+ Output of the HMM for the following query: Which television shows were created by Walt Disney?

Probability   Path of states
0.0023        dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
0.0014        dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
5.89E-4       dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
3.53E-4       dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
3.76E-5       dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney

+ Query Construction

Query Construction Method

Input: a set of resources R = {r1, r2, ..., rn}
Output: a query graph QG = (V, E), which is a directed, connected multi-graph.

Forward chaining:
1.  CT: Comprehensive type.
2.  CD: Comprehensive domain.
3.  CR: Comprehensive range.


Input: a set of resources R = {r1, r2, ..., rn}
Output: a query graph QG = (V, E), which is a directed, connected multi-graph.

Generating the Incomplete Query Graph (IQG): initializing vertices and primary edges.
•  A vertex is added to the IQG (1) if r is an instance or (2) if r is a class.
•  Properties are added along with zero, one or two vertices.


Example: What are the side effects of drugs used for Tuberculosis?

•  diseasome:1154 (type instance)
•  diseasome:possibleDrug (type property)
•  sider:sideEffect (type property)

Graph 1: 1154 possibleDrug ?v0
Graph 2: ?v1 sideEffect ?v2


Connecting sub-graphs of an IQG:
1.  Minimum spanning tree: a minimum set of edges (i.e., properties) to span a set of disjoint graphs.
2.  Prim’s algorithm: incrementally includes edges to connect disjoint sub-graphs.

•  Direct properties: ?v0 ?p ?v1.
•  Properties via owl:sameAs links:
   (1) ?v0 owl:sameAs ?x. ?x ?p ?v1.
   (2) ?v0 ?p ?x. ?x owl:sameAs ?v1.
   (3) ?v0 owl:sameAs ?x. ?x ?p ?y. ?y owl:sameAs ?v1.
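The connection step can be sketched in Python as a Prim-style loop: start from one sub-graph and repeatedly add the cheapest candidate edge that reaches a sub-graph not yet connected. The helper names are illustrative, and using the number of triple patterns in a template as the edge cost is an assumption made for this sketch only.

```python
def connection_templates(v_from, v_to, prop="?p"):
    """The direct pattern plus the three owl:sameAs patterns, as SPARQL fragments."""
    return [
        f"{v_from} {prop} {v_to} .",
        f"{v_from} owl:sameAs ?x . ?x {prop} {v_to} .",
        f"{v_from} {prop} ?x . ?x owl:sameAs {v_to} .",
        f"{v_from} owl:sameAs ?x . ?x {prop} ?y . ?y owl:sameAs {v_to} .",
    ]


def connect_subgraphs(components, candidate_edges):
    """Prim-style growth: pick the cheapest edge that joins a new sub-graph each round.

    components: list of vertex sets; candidate_edges: (cost, source, target, pattern).
    """
    comp_of = {v: i for i, comp in enumerate(components) for v in comp}
    connected, chosen = {0}, []
    while len(connected) < len(components):
        crossing = [e for e in candidate_edges
                    if (comp_of[e[1]] in connected) != (comp_of[e[2]] in connected)]
        if not crossing:
            raise ValueError("sub-graphs cannot be connected with the given edges")
        edge = min(crossing, key=lambda e: e[0])
        chosen.append(edge)
        connected.update((comp_of[edge[1]], comp_of[edge[2]]))
    return chosen


# The two disjoint graphs of the Tuberculosis example.
graphs = [{"diseasome:1154", "?v0"}, {"?v1", "?v2"}]
# Assumed cost: number of triple patterns in the template.
candidates = [(pattern.count(" ."), "?v0", "?v1", pattern)
              for pattern in connection_templates("?v0", "?v1")]
print(connect_subgraphs(graphs, candidates))
```

In a real setting only the templates that actually return results against the datasets would be offered as candidates, so the direct pattern would not always win.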



[Figure: Template 1 and Template 2, two candidate query graphs, each containing the edge ?v1 sideEffect ?v2.]

Evaluation

Goal of the experiment: how well do (1) resource disambiguation and (2) query construction perform?

Measurement of the performance:
1.  Disambiguation: Mean Reciprocal Rank (MRR).
2.  Query construction: precision and recall.

Benchmark:
1.  Each entry pairs a natural-language query with the equivalent conjunctive SPARQL query.
2.  25 queries on the 3 interlinked datasets Drugbank, Sider and Diseasome.
3.  QALD1 and QALD3 benchmarks for DBpedia.
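For reference, the metrics named above can be computed as in the following Python sketch. It uses the standard definitions; the input conventions (a 1-based rank per query with None for "correct answer not found", and answer collections compared as sets) are assumptions of this sketch.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct result per query, or None if absent."""
    if not ranks:
        return 0.0
    return sum(1.0 / rank if rank else 0.0 for rank in ranks) / len(ranks)


def precision_recall(returned, expected):
    """Precision and recall of a returned answer set against the expected answers."""
    returned, expected = set(returned), set(expected)
    correct = len(returned & expected)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


print(mean_reciprocal_rank([1, 2, None]))
print(precision_recall(["a", "b", "c"], ["a", "b", "d", "e"]))
```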



Evaluation using life-science datasets


Without reasoning: precision = 0.91, recall = 0.88
With reasoning: precision = 0.95, recall = 0.90

+ Evaluation using DBpedia

•  QALD3 benchmark:
   ✓  contains 100 questions.
   ✓  32 original questions can be answered correctly.

•  QALD1 benchmark:
   ✓  contains 50 questions.
   ✓  7 complex questions.
   ✓  13 questions requiring information beyond DBpedia, i.e., from YAGO and FOAF.
   ✓  14 questions were slightly modified to remove expansion and cleaning problems.
   ✓  MRR of disambiguation = 96%
   ✓  Query construction accuracy = 83%


Runtime

Parallelization over three components:
1.  Segment validation
2.  Resource retrieval
3.  Query construction

+ Related work




Thank you

Saeedeh Shekarpour [email protected] [email protected]
