Predicting SPARQL Query Execution Time and Suggesting SPARQL
Queries Based on Query History
Rakebul Hasan
Context
• Assisting human users and software agents in:
  – Querying Semantic Web data
    • Understanding query behavior: predicting query performance
      – Workload management, query scheduling, query optimization
    • Constructing and refining queries: suggesting alternatives
  – Consuming Semantic Web data
    • Understanding reasoning of Semantic Web software agents: explaining reasoning
      – Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction
Outline
• Predicting SPARQL query execution time
• Suggesting similar SPARQL queries from query history
PREDICTING SPARQL QUERY EXECUTION TIME
• Accurately predicting query performance enables effective
  – workload management
  – query scheduling
  – query optimization
Understanding performance of computer programs
Insight. [Knuth] Use the scientific method to understand performance.
Scientific method applied to analysis of algorithms
• A framework for predicting performance and comparing algorithms.
• Scientific method
  – Observe some feature of the natural world.
  – Hypothesize a model that is consistent with the observations.
  – Predict events using the hypothesis.
  – Verify the predictions by making further observations.
  – Validate by repeating until the hypothesis and observations agree.
• Principles
  – Experiments must be reproducible.
  – Hypotheses must be falsifiable.
• Feature of the natural world. The computer itself.
Slide credit: Robert Sedgewick
Example: 3-Sum
• 3-SUM. Given N distinct integers, how many triples sum to exactly zero?
• 3-SUM brute-force algorithm. Check all the possible triples.
• How much time does it take?
Slide credit: Robert Sedgewick
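The brute-force approach can be sketched in a few lines. This is a minimal illustration, not the code behind the slide; the function name `count_three_sum` and the sample input are assumptions chosen for the example.

```python
from itertools import combinations

def count_three_sum(values):
    """Count the triples (by position) whose values sum to exactly zero."""
    # combinations() enumerates every unordered triple once: about N^3 / 6 checks.
    return sum(1 for a, b, c in combinations(values, 3) if a + b + c == 0)

# Illustrative input of eight distinct integers.
sample = [30, -40, -20, -10, 40, 0, 10, 5]
triple_count = count_three_sum(sample)
print(triple_count)
```

Because every triple is examined, the running time should grow roughly with the cube of N, which is the kind of hypothesis the following slides test empirically.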
Data analysis
• Standard plot. Plot running time T(N) vs. input size N.
Slide credit: Robert Sedgewick
Data analysis
• Log-log plot. Plot running time lg T(N) vs. input size lg N.
• Regression. Fit a straight line through the data points: $\lg T(N) = b \lg N + c$, i.e. $T(N) = a N^{b}$ with $a = 2^{c}$.
• Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
Slide credit: Robert Sedgewick
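A straight-line fit in log-log space can be sketched with ordinary least squares. The function name `fit_power_law` and the synthetic timings below are assumptions for illustration; real measurements would replace the generated `times` list.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of lg T = b * lg N + c; returns (a, b) with a = 2**c."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope of the line = exponent of N
    c = mean_y - b * mean_x    # intercept of the line
    return 2 ** c, b

# Synthetic timings generated from an assumed cubic law (not measured data).
sizes = [1000, 2000, 4000, 8000]
times = [1e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(f"T(N) is about {a:.3e} * N^{b:.3f}")
```

The slope of the fitted line gives the exponent and the intercept gives the constant factor, which is why doubling the input size on each run is a convenient experimental design.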
Prediction and validation
• Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
• Predictions.
  – 51.0 seconds for N = 8000.
  – 408.1 seconds for N = 16000.
• Observations. The observations validate the hypothesis.
Slide credit: Robert Sedgewick
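The two predictions follow directly from evaluating the hypothesized power law. A minimal sketch, where the function name `predicted_seconds` is an assumption and the constants are the ones stated in the hypothesis:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Evaluate the hypothesized running time a * N^b, in seconds."""
    return a * n ** b

for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
```

Doubling N multiplies the prediction by about 2^2.999, close to 8, which matches the ratio between the two predicted times.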
Understanding performance of database queries
• Ganapathi et al. use machine learning to predict performance metrics of database queries prior to query execution.
• Gupta et al. use machine learning to predict query execution time ranges.
Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE
Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC
Predicting SPARQL query execution time
• Key challenge. Feature engineering
  – Representing SPARQL queries as feature vectors
    • Each dimension of the vector is a feature
Configuration
• Apache Jena TDB
  – With DBpedia 3.8 dataset
• Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
  – 3600 training, 1200 validation, 1200 test
Jena ARQ query processing
• A SPARQL query in ARQ goes through several stages of processing:
  – String to Query (parsing)
  – Translation from Query to a SPARQL algebra expression
  – Optimization of the algebra expression
  – Query plan determination and low-level optimization
  – Evaluation of the query plan
SPARQL algebra features
• SPARQL Algebra [1]
[1] http://www.w3.org/TR/sparql11-query/#sparqlQuery
SPARQL algebra features
Example query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?nick
WHERE {
  ?x foaf:mbox <mailto:person@server.com> .
  ?x foaf:name ?name
  OPTIONAL { ?x foaf:nick ?nick }
}
Algebra expression tree:
distinct
  project (?name ?nick)
    leftjoin
      bgp
        triple ?x foaf:mbox <mailto:person@server.com>
        triple ?x foaf:name ?name
      bgp
        triple ?x foaf:nick ?nick
Feature vector:
triple  bgp  join  leftjoin  ...  project  distinct  ...  depth
3       2    0     1         ...  1        1         ...  4
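Turning an algebra expression into a feature vector amounts to counting operators and measuring tree depth. The sketch below is a minimal illustration under several assumptions: the dictionary-based tree, the `node` helper, and the short `FEATURES` list are hypothetical (the real operator set is larger), and depth is counted over operator nodes only. The input is an illustrative DISTINCT projection over a left join of two BGPs.

```python
from collections import Counter

# Hypothetical subset of feature dimensions; the full operator list is longer.
FEATURES = ["triple", "bgp", "join", "leftjoin", "project", "distinct", "depth"]

def node(op, *children, triples=()):
    """Build one algebra node; BGP nodes carry their triple patterns."""
    return {"op": op, "children": list(children), "triples": list(triples)}

def count_ops(tree, counts=None):
    """Count how often each operator (and each triple pattern) occurs."""
    counts = Counter() if counts is None else counts
    counts[tree["op"]] += 1
    counts["triple"] += len(tree["triples"])
    for child in tree["children"]:
        count_ops(child, counts)
    return counts

def depth(tree):
    """Number of operator levels from this node down to the deepest leaf."""
    return 1 + max((depth(c) for c in tree["children"]), default=0)

def feature_vector(tree):
    counts = count_ops(tree)
    counts["depth"] = depth(tree)
    return [counts[name] for name in FEATURES]

example = node("distinct", node("project", node("leftjoin",
    node("bgp", triples=["?x foaf:mbox <mailto:person@server.com>",
                         "?x foaf:name ?name"]),
    node("bgp", triples=["?x foaf:nick ?nick"]))))
vector = feature_vector(example)
print(dict(zip(FEATURES, vector)))
```

Operators that do not occur in a query simply get a count of zero, so every query maps to a vector of the same length.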
Experiment 1
• Model: Support Vector Machine regression
• Evaluation measure: R² (coefficient of determination)
  – Measures how well future samples are likely to be predicted by the model.
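For reference, the coefficient of determination compares the model's squared error with that of always predicting the mean. A minimal sketch using the standard definition; the function name `r2_score` and the toy numbers are assumptions for illustration:

```python
def r2_score(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot (standard coefficient of determination)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(actual, actual)            # every prediction is exact
baseline = r2_score(actual, [2.5] * 4)        # always predict the mean
print(perfect, baseline)
```

A score of 1 means perfect predictions, a score near 0 means the model does no better than predicting the mean execution time, and negative values are possible for models that do worse.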
Experiment 1
• Test dataset R² = 0.004492
Log-scale plot of predicted vs. actual execution times for the test queries.
Experiment 1
Some of the long-running queries share structurally similar basic graph patterns.
{
dbpedia:1549_Mikko ?p ?uri .
?uri rdf:type ?x
}
Challenge. How do we represent basic graph patterns as vectors?
Basic Graph Pattern Features
• Infinite number of possibilities to write a basic graph pattern (BGP)
• Using only the set of literal values and the set of resources appearing in the RDF graph
  – Exponential number of possibilities
  – A graph with n triples has $2^n$ subsets of triples
• Feature vector with exponential number of dimensions
  – Not feasible
Basic Graph Pattern Features
• Pattern graph = RDF graph constructed from all the BGPs in a query
  – Replace variables with a fixed symbol ‘?’
• Cluster the training queries based on pattern graph similarities
• Create a vector with similarity scores between the pattern graph of the query and the pattern graphs of the queries at the cluster centers.
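The construction above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: triples are plain tuples of strings, `toy_distance` is a simple stand-in (the number of triples not shared) rather than graph edit distance, and the similarity formula 1 / (1 + d) is one assumed way of inverting a distance.

```python
def pattern_graph(bgps):
    """Merge all BGPs of a query into one set of triples, variables replaced by '?'."""
    return frozenset(
        tuple("?" if term.startswith("?") else term for term in triple)
        for bgp in bgps
        for triple in bgp
    )

def toy_distance(g1, g2):
    # Stand-in for graph edit distance: count triples that appear in only one graph.
    return len(g1 ^ g2)

def similarity(g1, g2):
    # Assumed inversion of the distance; identical graphs get similarity 1.0.
    return 1.0 / (1.0 + toy_distance(g1, g2))

def bgp_features(query_bgps, center_graphs):
    """One similarity score per cluster center."""
    graph = pattern_graph(query_bgps)
    return [similarity(graph, center) for center in center_graphs]

mikko = [[("dbpedia:1549_Mikko", "?p", "?uri"), ("?uri", "rdf:type", "?x")]]
sabo = [[("dbpedia:Radu_Sabo", "?p", "?uri"), ("?uri", "rdf:type", "?x")]]
centers = [pattern_graph(mikko), pattern_graph(sabo)]
features = bgp_features(mikko, centers)
print(features)
```

The resulting vector has one dimension per cluster center, so its length is fixed by the number of clusters rather than by the number of possible graph patterns.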
• Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph into another
  – Compute similarity by inverting the distance
• Graph Edit Distance
  – Usually computed using A* search
    • Exponential running time
  – Approximate graph edit distance based on bipartite matching
    • Previous research shows very accurate results on classification problems
• Clustering Training Queries
  – k-medoids clustering algorithm with approximated edit distance as distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
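The two properties listed above (centers are actual data points, and any distance function can be plugged in) are visible in a simple alternating k-medoids loop. This is a minimal sketch, not the implementation used in the experiments: the naive initialization with the first k points is an assumption, and the code assumes distinct points with a distance that is zero only for identical points.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Alternate between assigning points to medoids and re-picking medoids."""
    medoids = list(range(k))  # naive initialization: indices of the first k points
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            closest = min(medoids, key=lambda m: distance(p, points[m]))
            clusters[closest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(distance(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

# Toy one-dimensional data with absolute difference as the distance.
centers = k_medoids([1, 2, 3, 10, 11, 12], k=2, distance=lambda a, b: abs(a - b))
print(sorted(centers))
```

Since only pairwise distances are needed, the same loop works when the points are pattern graphs and the distance is an approximated graph edit distance.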
Experiment 2
• Model: Support Vector Machine regression
• Test dataset R² = 0.124204
• K = 10
Plot panels: Algebra features; Algebra + BGP features
Multiple Regressions
• We train different SVM regressions for different time ranges.
• The variance along the y-axis is smaller within each regression, so it is easier to fit a curve.
• Different time ranges
  – Clustering the execution time ranges
    • We use the x-means clustering algorithm, which automatically estimates the number of clusters
  – 5 clusters found in the training dataset
  – Each cluster contains queries with similar execution times
• Predicting execution time range
  – Predict the corresponding clusters for unseen queries.
  – How
    • Train an SVM classifier with the found clusters as labels
    • Classify unseen queries: accuracy of 96% for the test dataset
    • This means we can accurately predict time ranges
• Predicting execution time
  – Different SVM regressions for different time ranges.
  – For an unseen query, use the regression that corresponds to its predicted time range cluster
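The two-stage prediction can be sketched as a small routing function. Everything model-specific here is a placeholder: `classify_range` and the entries of `regressors` are toy stand-ins for the trained SVM classifier and the per-range SVM regressions, and the labels "short" and "long" are hypothetical.

```python
def predict_time(features, classify_range, regressors):
    """Pick the time range first, then apply that range's regression."""
    label = classify_range(features)
    return regressors[label](features)

# Toy stand-ins for trained models (assumptions, not fitted to any data).
classify_range = lambda f: "short" if f[0] < 5 else "long"
regressors = {
    "short": lambda f: 0.01 * f[0],
    "long": lambda f: 2.0 * f[0],
}

fast = predict_time([2], classify_range, regressors)
slow = predict_time([10], classify_range, regressors)
print(fast, slow)
```

The design means each regression only has to model queries with similar execution times, at the cost of depending on the classifier picking the right range.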
Experiment 3
• Test dataset R² = 0.83862
Plot panels: Algebra + BGP features; Multiple regressions
Predicting with nearest neighbors regression
• The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.
• We train a k-NN with
  – Euclidean distance as the distance function
  – Distance weighting: weighted by the inverse of the distance
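Inverse-distance weighting can be sketched as below. This is a minimal illustration with a linear scan over the training set; the function name `knn_predict`, the toy data, and the handling of exact matches (returning the matching target directly) are assumptions.

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """k-NN regression with Euclidean distance and inverse-distance weights."""
    neighbors = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    # An exact match would give an infinite weight, so return its target directly.
    for d, y in neighbors:
        if d == 0:
            return y
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

train_x = [(0, 0), (3, 0), (10, 0)]
train_y = [1.0, 4.0, 100.0]
estimate = knn_predict(train_x, train_y, (1, 0), k=2)
print(estimate)
```

In the toy call, the two nearest neighbors lie at distances 1 and 2, so the closer one gets twice the weight of the other.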
• k-dimensional tree (k-d tree) data structure to search the nearest neighbors
  – a space-partitioning data structure for organizing points in a k-dimensional space
• Complexity of a search: O(log N) operations on average
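A k-d tree and its nearest-neighbor search can be sketched compactly. This is an illustrative implementation under simple assumptions (points are equal-length tuples, the tree is built once by median splits, and a single nearest neighbor is returned); the names `build_kdtree` and `nearest` are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(tree, query, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if tree is None:
        return best
    point, axis = tree["point"], tree["axis"]
    if best is None or math.dist(point, query) < math.dist(best, query):
        best = point
    diff = query[axis] - point[axis]
    near, far = (tree["left"], tree["right"]) if diff < 0 else (tree["right"], tree["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (9, 2))
print(closest)
```

The pruning test on the splitting plane is what lets a search skip whole subtrees instead of comparing the query against every stored point.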
Experiment 4
• Test dataset R² = 0.837
• k = 2 for k-NN (selected by cross validation)
Plot panels: k-NN; Multiple regressions
• Future work
  – Training data with broad coverage
    • DBpedia SPARQL benchmark query templates
      – Berlin: 5 templates
      – DBPSB: 20 templates
  – Fine tuning with more cross validation
SUGGESTING SPARQL QUERIES
Suggesting SPARQL queries based on query history
• Use the same features
• Construct a k-d tree for nearest neighbor search
• Top M neighbors for a query are the top M suggestions for that query
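Ranking the query history by feature-space distance can be sketched as follows. For brevity this sketch uses a linear scan with sorting rather than a k-d tree; the function name `suggest`, the query labels, and the two-dimensional toy vectors are assumptions for illustration.

```python
import math

def suggest(query_vector, history, m=3):
    """Return the texts of the m historical queries closest to query_vector.

    history is a list of (query_text, feature_vector) pairs.
    """
    ranked = sorted(history, key=lambda item: math.dist(item[1], query_vector))
    return [text for text, _ in ranked[:m]]

# Hypothetical history with made-up feature vectors.
history = [
    ("q1", (0.0, 0.0)),
    ("q2", (1.0, 1.0)),
    ("q3", (5.0, 5.0)),
    ("q4", (0.5, 0.0)),
]
top = suggest((0.0, 0.0), history, m=2)
print(top)
```

A k-d tree returns the same neighbors while avoiding a full scan of the history for each lookup.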
Example
SELECT DISTINCT ?uri
WHERE
{
dbpedia:1549_Mikko ?p ?uri .
?uri rdf:type ?x
}
Suggestion 1
SELECT DISTINCT ?uri
WHERE
{
dbpedia:Radu_Sabo ?p ?uri .
?uri rdf:type ?x
}
Suggestion 2
SELECT DISTINCT ?uri
WHERE
{
dbpedia:Hafar_Al-Batin ?p ?uri .
?uri rdf:type ?x
}
Suggestion 3
SELECT DISTINCT ?uri
WHERE
{
dbpedia:Maurice_D._G._Scott ?p ?uri .
?uri rdf:type ?x
}
• Future work
  – Query construction and refinement workflow
    • How to use the query suggestions?
  – Evaluating the suggestions
    • User study
Thank you