Predicting SPARQL query execution time and suggesting SPARQL queries based on query history


Rakebul Hasan

Context

•  Assisting human users and software agents in:
  – Querying Semantic Web data

•  Understanding query behavior: predicting query performance
  – Workload management, query scheduling, query optimization

•  Constructing and refining queries: suggesting alternatives
  – Consuming Semantic Web data

•  Understanding reasoning of Semantic Web software agents: explaining reasoning
  – Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction

Outline

•  Predicting SPARQL query execution time

•  Suggesting similar SPARQL queries from query history

PREDICTING SPARQL QUERY EXECUTION TIME

•  Accurately predicting query performance enables effective
  – workload management
  – query scheduling
  – query optimization

Understanding performance of computer programs

Insight. [Knuth] Use the scientific method to understand performance.

Scientific method applied to analysis of algorithms

•  A framework for predicting performance and comparing algorithms.

•  Scientific method
  – Observe some feature of the natural world.
  – Hypothesize a model that is consistent with the observations.
  – Predict events using the hypothesis.
  – Verify the predictions by making further observations.
  – Validate by repeating until the hypothesis and observations agree.

•  Principles
  – Experiments must be reproducible.
  – Hypotheses must be falsifiable.

•  Feature of the natural world. Computer itself.

Slide credit: Robert Sedgewick

Example: 3-Sum

•  3-SUM. Given N distinct integers, how many triples sum to exactly zero?

•  3-SUM brute-force algorithm. Check all the possible triples.

•  How much time does it take?
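The brute-force approach can be sketched in a few lines. This is a minimal illustration, not code from the deck: the function name `count_three_sum` and the sample input are assumptions chosen for the example.

```python
from itertools import combinations


def count_three_sum(values):
    """Count the triples (by position, i < j < k) whose values sum to zero.

    Checks every possible triple, so the work grows roughly as N**3.
    """
    return sum(1 for a, b, c in combinations(values, 3) if a + b + c == 0)


# Illustrative input only (an assumption, not data from the slides)
sample = [30, -40, -20, -10, 40, 0, 10, 5]
print(count_three_sum(sample))
```

Timing this function for doubling values of N gives the kind of observations the following slides analyze.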


Data analysis

•  Standard plot. Plot running time T(N) vs. input size N.

Data analysis

•  Log-log plot. Plot lg T(N) vs. lg N.

•  Regression. Fit a straight line through the data points: $a N^b$.

•  Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
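The regression step amounts to an ordinary least-squares line through the points (lg N, lg T): the slope is the exponent b and the intercept is lg a. The sketch below assumes nothing beyond the standard library; `fit_power_law` is a hypothetical helper name, and the timings are synthetic values generated from a known power law, not measurements.

```python
import math


def fit_power_law(sizes, times):
    """Fit T = a * N**b by least squares on the log-log scale; returns (a, b)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope of the line = exponent
    a = 2 ** (mean_y - b * mean_x)   # intercept of the line = lg a
    return a, b


# Synthetic, noise-free timings from T = 2e-10 * N**3 (assumed, not measured)
sizes = [1000, 2000, 4000, 8000]
times = [2e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(a, b)
```

With real measurements the fitted exponent would come out close to, but not exactly, 3.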


Prediction and validation

•  Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.

•  Predictions.
  – 51.0 seconds for N = 8000.
  – 408.1 seconds for N = 16000.

•  Observations.

Validates the hypothesis

Slide credit: Robert Sedgewick
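The two predictions follow directly from plugging N into the hypothesized model. A small sketch, where `predicted_seconds` is a hypothetical helper name and the constants are the ones stated in the hypothesis:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Evaluate the hypothesized running-time model T(N) = a * N**b."""
    return a * n ** b


for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
```

Doubling N multiplies the predicted time by 2**2.999, which is just under 8.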

Understanding performance of database queries

•  Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.

•  Gupta et al. use machine learning to predict query execution time ranges.

Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE

Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC

Predicting SPARQL query execution time

•  Key challenge. Feature engineering
  – Representing SPARQL queries as feature vectors
    •  Each dimension of the vector is a feature

Configuration

•  Apache Jena TDB
  – With DBpedia 3.8 dataset

•  Training, validation, and test queries: randomly selected from the DBpedia SPARQL Benchmark (DBPSB) query dataset
  – 3600 training, 1200 validation, 1200 test

Jena ARQ query processing

•  A SPARQL query in ARQ goes through several stages of processing:
  – String to Query (parsing)
  – Translation from Query to a SPARQL algebra expression
  – Optimization of the algebra expression
  – Query plan determination and low-level optimization
  – Evaluation of the query plan

SPARQL algebra features

•  SPARQL Algebra [1]

[1] http://www.w3.org/TR/sparql11-query/#sparqlQuery

SPARQL algebra features

Query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?nick
WHERE {
  ?x foaf:mbox <mailto:person@server.com> .
  ?x foaf:name ?name
  OPTIONAL { ?x foaf:nick ?nick }
}

Algebra expression tree:

distinct
  project (?name ?nick)
    leftjoin
      bgp
        triple ?x foaf:mbox <mailto:person@server.com>
        triple ?x foaf:name ?name
      bgp
        triple ?x foaf:nick ?nick

Feature vector:

triple  bgp  join  leftjoin  ...  project  distinct  ...  depth
3       2    0     1         ...  1        1         ...  (illegible)
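One way to picture the algebra features is as operator counts plus the depth of the algebra expression tree. The sketch below is an assumption-laden illustration, not the deck's implementation: the tree is a hand-built `(operator, children)` structure rather than ARQ output, `OPERATORS` lists only a handful of operators, and depth is taken to mean the number of nodes on the longest root-to-leaf path, which may differ from the convention used in the slides.

```python
from collections import Counter

# Assumed subset of algebra operators; a real feature vector would cover more
OPERATORS = ["triple", "bgp", "join", "leftjoin", "project", "distinct"]


def walk(node):
    """Yield the operator name of every node in the tree."""
    op, children = node
    yield op
    for child in children:
        yield from walk(child)


def depth(node):
    """Number of nodes on the longest root-to-leaf path (assumed convention)."""
    _, children = node
    return 1 + max((depth(child) for child in children), default=0)


def algebra_features(tree):
    """Operator counts in OPERATORS order, followed by the tree depth."""
    counts = Counter(walk(tree))
    return [counts[op] for op in OPERATORS] + [depth(tree)]


# Hand-built tree: DISTINCT over PROJECT over a left join of two BGPs
triple = ("triple", [])
tree = ("distinct", [("project", [("leftjoin", [
    ("bgp", [triple, triple]),
    ("bgp", [triple]),
])])])
print(algebra_features(tree))
```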

Experiment 1

•  Model: Support Vector Machine regression

•  Evaluation measure: R²
  – Measures how well future samples are likely to be predicted by the model.
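For reference, the coefficient of determination is commonly computed as R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. A minimal sketch (the function name `r_squared` is an assumption):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot
```

A perfect model scores 1, and a model that always predicts the mean of the actual values scores 0, so a value near 0 means the model is doing no better than that constant baseline.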

Experiment 1

•  Test dataset R² = 0.004492

Figure: log-scale plot of predicted vs. actual execution times for the test queries.

Experiment 1

Some of the long-running queries share structurally similar basic graph patterns.

{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Challenge. How do we represent basic graph patterns as vectors?

Basic Graph Pattern Features

•  Infinite number of possibilities to write a basic graph pattern (BGP)

•  Even using only the literal values and resources appearing in the RDF graph
  – Exponential number of possibilities
  – A graph with n triples has 2^n subsets of triples

•  Feature vector with an exponential number of dimensions
  – Not feasible

Basic Graph Pattern Features

•  Pattern graph = RDF graph constructed from all the BGPs in a query
  – Replace variables with a fixed symbol ‘?’

•  Cluster the training queries based on pattern graph similarities

•  Create a vector with similarity scores between the pattern graph of the query and the pattern graphs of the queries at the cluster centers.
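Building a pattern graph can be sketched as follows. The representation is an assumption made for illustration: each BGP is a list of `(subject, predicate, object)` string triples, and any term starting with `?` or `$` is treated as a variable.

```python
def pattern_graph(bgps):
    """Merge all BGPs of a query into one set of triples, replacing variables by '?'."""
    def normalize(term):
        return "?" if term.startswith(("?", "$")) else term

    return {tuple(normalize(term) for term in triple)
            for bgp in bgps
            for triple in bgp}


bgps = [[("dbpedia:1549_Mikko", "?p", "?uri"), ("?uri", "rdf:type", "?x")]]
print(pattern_graph(bgps))
```

Two queries that differ only in their variable names map to the same pattern graph, which is what makes the graphs comparable across queries.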

•  Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph into another
  – Compute similarity by inverting the distance
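Turning distances into a BGP feature vector might look like the sketch below. Two things here are assumptions rather than the deck's method: the inversion formula 1 / (1 + d), chosen so that a distance of 0 gives a similarity of 1, and `toy_distance`, a crude placeholder that is NOT graph edit distance and only stands in for it so the example runs.

```python
def similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]; assumed form 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)


def bgp_feature_vector(query_graph, center_graphs, distance_fn):
    """One similarity score per cluster-center pattern graph."""
    return [similarity(distance_fn(query_graph, center)) for center in center_graphs]


def toy_distance(graph_a, graph_b):
    # Placeholder only: number of triples not shared by both graphs
    return len(graph_a ^ graph_b)


centers = [{("?", "rdf:type", "?")}, {("?", "foaf:name", "?")}]
query = {("?", "rdf:type", "?")}
print(bgp_feature_vector(query, centers, toy_distance))
```

The vector has one dimension per cluster center, so its length is fixed by the number of clusters rather than by the size of the query.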

•  Graph Edit Distance
  – Usually computed using A* search
    •  Exponential running time
  – Bipartite matching based approximated graph edit distance with
    •  Previous research shows very accurate results with classification problems

•  Clustering Training Queries
  – K-medoids clustering algorithm with approximated edit distance as the distance function
    •  Selects data points as cluster centers
    •  Works with an arbitrary distance function
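A compact sketch of k-medoids with a pluggable distance function is shown below. It uses the simple alternating (assign, then re-pick medoids) variant, which is an assumption; the slides do not say which k-medoids variant was used. The function name, `seed`, and `max_iter` are illustrative.

```python
import random


def k_medoids(points, k, distance_fn, max_iter=100, seed=0):
    """Alternating k-medoids; returns the k data points chosen as cluster centers."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest medoid
        clusters = {m: [] for m in medoids}
        for i, point in enumerate(points):
            closest = min(medoids, key=lambda m: distance_fn(point, points[m]))
            clusters[closest].append(i)
        # Update step: in each cluster, pick the member with the least total distance
        new_medoids = []
        for medoid, members in clusters.items():
            if not members:  # can only happen with duplicate points
                new_medoids.append(medoid)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(distance_fn(points[c], points[j]) for j in members),
            ))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]


# Toy 1-D data with absolute difference as the distance
print(sorted(k_medoids([0, 1, 2, 10, 11, 12], 2, lambda a, b: abs(a - b))))
```

Because only pairwise distances are needed, the same routine would work with an approximated graph edit distance over pattern graphs in place of the toy distance.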

Experiment 2

•  Model: Support Vector Machine regression


•  K = 10

Figure panels: Algebra features; Algebra + BGP features

Multiple Regressions

•  We train different SVM regressions for different time ranges.

•  The variance along the y-axis is smaller within each time range, so a curve is easier to fit.

•  Different time ranges
  – Clustering the execution time ranges
    •  We use the x-means clustering algorithm, which automatically estimates the number of clusters
      – 5 clusters found in the training dataset
  – Each cluster contains queries with similar execution times

•  Predicting execution time range
  – Predict the corresponding clusters for unseen queries.
  – How?
    •  Train an SVM classifier with the found clusters as labels
    •  Classify unseen queries: accuracy of 96% for the test dataset
    •  This means we can accurately predict time ranges

•  Predicting execution time
  – Different SVM regressions for different time ranges.
  – For an unseen query, use the regression that corresponds to its predicted time range cluster
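The two-stage scheme can be summarized as a small routing function. This is a structural sketch only: `range_classifier` and `range_regressors` are hypothetical callables standing in for the trained SVM classifier and the per-range SVM regressions, and the toy lambdas are made-up stand-ins so the example runs.

```python
def predict_execution_time(features, range_classifier, range_regressors):
    """Predict the time-range cluster first, then use that cluster's regressor."""
    cluster = range_classifier(features)
    return range_regressors[cluster](features)


# Toy stand-ins (assumptions, not trained models)
toy_classifier = lambda f: 0 if f[0] < 5 else 1
toy_regressors = {
    0: lambda f: 0.01 * f[0],  # "fast" range
    1: lambda f: 2.0 * f[0],   # "slow" range
}
print(predict_execution_time([2], toy_classifier, toy_regressors))
print(predict_execution_time([10], toy_classifier, toy_regressors))
```

A misclassified time range sends the query to the wrong regressor, which is why the classifier's accuracy matters for the final estimate.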

Experiment 3


Figure panels: Algebra + BGP features; Multiple regressions

Predicting with nearest neighbors regression

•  The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.

•  We train a k-NN regressor with
  – Euclidean distance as the distance function
  – Distance weighting: neighbors weighted by the inverse of their distance
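Inverse-distance-weighted k-NN regression can be sketched with a brute-force neighbor search. The function name and the handling of exact matches (returning the stored target when the distance is 0, to avoid dividing by zero) are assumptions made for this illustration.

```python
import math


def knn_predict(train_x, train_y, query, k=2):
    """Average the targets of the k nearest points, weighted by 1 / distance."""
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    # An exact match would get infinite weight, so return its target directly
    for distance, target in neighbors:
        if distance == 0.0:
            return target
    weights = [1.0 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Toy data (made up for the example)
train_x = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
train_y = [1.0, 3.0, 100.0]
print(knn_predict(train_x, train_y, (0.5, 0.0)))
```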

•  k-dimensional tree (k-d tree) data structure to search the nearest neighbors
  – a space-partitioning data structure for organizing points in a k-dimensional space

•  Complexity of a search: O(log N) operations on average
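A k-d tree and its nearest-neighbor search fit in a short sketch. The dictionary-based node layout and function names are assumptions for illustration; the tree is built by splitting on the median along each axis in turn.

```python
import math


def build_kdtree(points, depth=0):
    """Recursively split the points on the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))
```

The pruning test is what keeps most searches from visiting every node; in high-dimensional feature spaces it prunes less often, so the speedup over brute force shrinks.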

Experiment 4

•  Test dataset R² = 0.837

•  k = 2 for k-NN (selected by cross-validation)

Figure panels: k-NN; Multiple regressions

•  Future work
  – Training data with broad coverage
    •  DBpedia SPARQL benchmark query templates
      – Berlin: 5 templates
      – DBPSB: 20 templates
  – Fine-tuning with more cross-validation

SUGGESTING SPARQL QUERIES


Suggesting SPARQL queries based on query history

•  Use the same features

•  Construct a k-d tree for nearest neighbor search

•  Top M neighbors for a query are the top M suggestions for that query

Example

SELECT DISTINCT ?uri
WHERE {
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 1

SELECT DISTINCT ?uri
WHERE {
  dbpedia:Radu_Sabo ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 2

SELECT DISTINCT ?uri
WHERE {
  dbpedia:Hafar_Al-Batin ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 3

SELECT DISTINCT ?uri
WHERE {
  dbpedia:Maurice_D._G._Scott ?p ?uri .
  ?uri rdf:type ?x
}
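Retrieving the top M suggestions reduces to a nearest-neighbor lookup over the feature vectors of logged queries. The sketch below swaps in a brute-force scan with `heapq.nsmallest` instead of a k-d tree, purely to keep the example short; the `history` layout (feature vector, query text) and the toy vectors are assumptions.

```python
import heapq
import math


def suggest(history, query_features, m=3):
    """Return the texts of the m logged queries closest to query_features."""
    closest = heapq.nsmallest(
        m, history, key=lambda item: math.dist(item[0], query_features)
    )
    return [text for _, text in closest]


# Toy query log: (feature vector, query text); values are made up
history = [
    ((1.0, 0.0), "q1"),
    ((0.9, 0.1), "q2"),
    ((0.0, 1.0), "q3"),
    ((0.5, 0.5), "q4"),
]
print(suggest(history, (1.0, 0.0), m=2))
```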

•  Future work
  – Query construction and refinement workflow
    •  How to use the query suggestions?
  – Evaluating the suggestions
    •  User study

Thank you

