Predicting SPARQL query execution time and suggesting SPARQL queries based on query history
Predicting SPARQL Query Execution Time and Suggesting SPARQL
Queries Based on Query History
Rakebul Hasan
Context
• Assisting human users and software agents in:
  – Querying Semantic Web data
    • Understanding query behavior: predicting query performance
      – Workload management, query scheduling, query optimization
    • Constructing and refining queries: suggesting alternatives
  – Consuming Semantic Web data
    • Understanding reasoning of Semantic Web software agents: explaining reasoning
      – Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction
Outline
• Predicting SPARQL query execution time
• Suggesting similar SPARQL queries from query history
PREDICTING SPARQL QUERY EXECUTION TIME
• Accurately predicting query performance enables effective
  – workload management
  – query scheduling
  – query optimization
Understanding performance of computer programs
Insight. [Knuth] Use the scientific method to understand performance.
Scientific method applied to analysis of algorithms
• A framework for predicting performance and comparing algorithms.
• Scientific method
  – Observe some feature of the natural world.
  – Hypothesize a model that is consistent with the observations.
  – Predict events using the hypothesis.
  – Verify the predictions by making further observations.
  – Validate by repeating until the hypothesis and observations agree.
• Principles
  – Experiments must be reproducible.
  – Hypotheses must be falsifiable.
• Feature of the natural world: the computer itself.
Slide credit: Robert Sedgewick
Example: 3-Sum
• 3-SUM. Given N distinct integers, how many triples sum to exactly zero?
• 3-SUM brute-force algorithm. Check all possible triples.
• How much time does it take?
Slide credit: Robert Sedgewick
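The brute-force approach can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the slides; the function name and the sample input are assumptions made for the example.

```python
from itertools import combinations

def count_zero_triples(values):
    """Count the triples (by position) whose values sum to exactly zero."""
    # Checking every combination of three elements takes time proportional to N^3.
    return sum(1 for a, b, c in combinations(values, 3) if a + b + c == 0)

# Illustrative input with four zero-sum triples.
sample = [30, -40, -20, -10, 40, 0, 10, 5]
print(count_zero_triples(sample))  # -> 4
```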
Data analysis
• Standard plot. Plot running time T(N) vs. input size N.
Slide credit: Robert Sedgewick
Data analysis
• Log-log plot. Plot running time lg(T(N)) vs. input size lg N.
• Regression. Fit a straight line through the data points: a N^b.
• Hypothesis. The running time is about 1.006 × 10^-10 × N^2.999 seconds.
Slide credit: Robert Sedgewick
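The log-log regression step can be sketched as an ordinary least-squares line fit on (lg N, lg T). The helper name is hypothetical, and the timings below are synthetic values generated from a known power law, not the measurements behind the slide.

```python
import math

def fit_power_law(sizes, times):
    """Fit lg T = b * lg N + lg a by least squares and return (a, b)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance          # exponent b
    intercept = y_mean - slope * x_mean    # lg a
    return 2 ** intercept, slope

# Synthetic timings generated from T(N) = 2e-10 * N^3 (illustrative only).
sizes = [1000, 2000, 4000, 8000]
times = [2e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(a, b)
```

On real measurements the recovered exponent will not be a round number, which is how a value such as 2.999 arises.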
Prediction and validation
• Hypothesis. The running time is about 1.006 × 10^-10 × N^2.999 seconds.
• Predictions.
  – 51.0 seconds for N = 8000.
  – 408.1 seconds for N = 16000.
• Observations. The measured running times (shown on the original slide) validate the hypothesis.
Slide credit: Robert Sedgewick
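The two predictions follow directly from the hypothesized model. A short check, assuming the constants quoted on the slide and a hypothetical function name:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Evaluate the hypothesized running-time model a * N^b."""
    return a * n ** b

for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
```

Doubling N multiplies the prediction by 2^2.999, which is just under 8.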
Understanding performance of database queries
• Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
• Gupta et al. use machine learning to predict query execution time ranges.
Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE
Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC
Predicting SPARQL query execution time
• Key challenge. Feature engineering
  – Representing SPARQL queries as feature vectors
    • Each dimension of the vector is a feature
Configuration
• Apache Jena TDB
  – With DBpedia 3.8 dataset
• Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
  – 3600 training, 1200 validation, 1200 test
Jena ARQ query processing
• A SPARQL query in ARQ goes through several stages of processing:
  – String to Query (parsing)
  – Translation from Query to a SPARQL algebra expression
  – Optimization of the algebra expression
  – Query plan determination and low-level optimization
  – Evaluation of the query plan
SPARQL algebra features
• SPARQL Algebra [1]
[1] http://www.w3.org/TR/sparql11-query/#sparqlQuery
SPARQL algebra features
Example query:
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT DISTINCT ?name ?nick
  WHERE {
    ?x foaf:mbox <mailto:person@server.com> .
    ?x foaf:name ?name
    OPTIONAL { ?x foaf:nick ?nick }
  }
Algebra expression tree:
  distinct
    project (?name ?nick)
      leftjoin
        bgp
          triple ?x foaf:mbox <mailto:person@server.com>
          triple ?x foaf:name ?name
        bgp
          triple ?x foaf:nick ?nick
Feature vector (operator counts and tree depth):
  triple = 3, bgp = 2, join = 0, leftjoin = 1, ..., project = 1, distinct = 1, ..., depth = [illegible in source]
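A minimal sketch of how operator counts and depth could be read off an algebra tree. The nested-tuple tree encoding, the reduced operator list, and counting depth in edges from the root are all assumptions made for this example; the slides do not specify them.

```python
from collections import Counter

# Hypothetical subset of operators; the real feature vector has more dimensions.
OPERATORS = ["triple", "bgp", "join", "leftjoin", "project", "distinct"]

def algebra_features(tree):
    """Return operator counts in OPERATORS order, followed by the tree depth."""
    counts = Counter()

    def walk(node):
        operator, children = node
        counts[operator] += 1
        # Depth is counted in edges here (an assumption): a leaf has depth 0.
        return max((1 + walk(child) for child in children), default=0)

    depth = walk(tree)
    return [counts[op] for op in OPERATORS] + [depth]

def bgp(n_triples):
    """Build a bgp node holding n_triples triple leaves."""
    return ("bgp", [("triple", [])] * n_triples)

# distinct -> project -> leftjoin -> (bgp with 2 triples, bgp with 1 triple)
example = ("distinct", [("project", [("leftjoin", [bgp(2), bgp(1)])])])
print(algebra_features(example))
```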
Experiment 1
• Model: Support Vector Machine regression
• Evaluation measure: R^2
  – Measures how well future samples are likely to be predicted by the model.
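Assuming the evaluation measure is the usual coefficient of determination, it can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with a hypothetical function name:

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

print(r2_score([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2_score([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

A score of 1 means perfect prediction, while a score near 0 means the model does no better than always predicting the mean execution time.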
Experiment 1
• Test dataset R^2 = 0.004492
Log-scale plot of predicted vs. actual execution times for the test queries.
Experiment 1
Some of the long-running queries share structurally similar basic graph patterns.
{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}
Challenge. How do we represent basic graph patterns as vectors?
Basic Graph Pattern Features
• Infinite number of possibilities to write a basic graph pattern (BGP)
• Using only the literal values and resources that appear in the RDF graph
  – Exponential number of possibilities
  – A graph with n triples has 2^n subsets of triples
• Feature vector with an exponential number of dimensions
  – Not feasible
Basic Graph Pattern Features
• Pattern graph = RDF graph constructed from all the BGPs in a query
  – Replace variables with a fixed symbol ‘?’
• Cluster the training queries based on pattern graph similarities
• Create a vector of similarity scores between the pattern graph of the query and the pattern graphs of the queries at the cluster centers.
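The variable-replacement step can be sketched as follows. Representing a pattern graph as a set of string triples is an assumption made for the example, as is the function name.

```python
def pattern_graph(triples):
    """Replace every variable term (leading '?') with the fixed symbol '?'."""
    def normalize(term):
        return "?" if term.startswith("?") else term
    return {tuple(normalize(term) for term in triple) for triple in triples}

bgp = [("dbpedia:1549_Mikko", "?p", "?uri"), ("?uri", "rdf:type", "?x")]
print(sorted(pattern_graph(bgp)))
```

Two queries that differ only in their variable names map to the same pattern graph, so they end up with identical similarity scores.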
• Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph into another
  – Compute similarity by inverting the distance
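One common way to invert a distance into a similarity is 1 / (1 + d). The slides do not give the exact formula, so this form is an assumption, and both function names are hypothetical.

```python
def similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + distance)

def bgp_feature_vector(distances_to_centers):
    """One similarity score per cluster center, given the edit distances to them."""
    return [similarity(d) for d in distances_to_centers]

print(bgp_feature_vector([0, 1, 3]))  # -> [1.0, 0.5, 0.25]
```

Adding 1 to the denominator avoids division by zero when two pattern graphs are identical.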
• Graph Edit Distance
  – Usually computed using A* search
    • Exponential running time
  – Approximate graph edit distance based on bipartite matching
    • Previous research shows very accurate results on classification problems
• Clustering Training Queries
  – K-medoids clustering algorithm with the approximate edit distance as the distance function
    • Selects data points as cluster centers
    • Works with an arbitrary distance function
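A minimal k-medoids sketch that accepts any distance function. Starting from the first k points is an assumption made to keep the example deterministic, and the toy data uses numbers with absolute difference in place of pattern graphs with edit distance.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Alternate assignment and medoid update; return the medoid points."""
    medoids = list(range(k))  # deterministic start: the first k points (assumption)
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i, point in enumerate(points):
            if i in clusters:  # a medoid always stays in its own cluster
                clusters[i].append(i)
                continue
            nearest = min(medoids, key=lambda m: distance(point, points[m]))
            clusters[nearest].append(i)
        # The new medoid of a cluster is the member with the smallest total distance.
        new_medoids = [
            min(members, key=lambda c: sum(distance(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

# Toy data: two obvious groups on a line.
centers = k_medoids([1, 2, 3, 10, 11, 12], 2, lambda a, b: abs(a - b))
print(centers)  # -> [2, 11]
```

Because the centers are actual data points, only pairwise distances are needed, which suits edit distances where no meaningful "mean graph" exists.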
Experiment 2
• Model: Support Vector Machine regression
• Test dataset R^2 = 0.124204
• K = 10
[Figure panels: Algebra features; Algebra + BGP features]
Multiple Regressions
• We train different SVM regressions for different time ranges.
• The variance along the y-axis is smaller within each range, so a curve is easier to fit.
• Different time ranges
  – Clustering the execution time ranges
    • We use the x-means clustering algorithm, which automatically estimates the number of clusters
  – 5 clusters found in the training dataset
  – Each cluster contains queries with similar execution times
• Predicting execution time range
  – Predict the corresponding clusters for unseen queries.
  – How
    • Train an SVM classifier with the found clusters as labels
    • Classify unseen queries: accuracy of 96% for the test dataset
    • This means we can accurately predict time ranges
• Predicting execution time
  – Different SVM regressions for different time ranges.
  – For an unseen query, use the regression that corresponds to its predicted time range cluster
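The two-stage prediction can be sketched as a dispatch: classify the time range first, then apply the regression trained for that range. The classifier and regressors below are toy stand-ins with made-up coefficients, not the SVM models from the slides.

```python
def predict_time(features, classify_range, regressors):
    """Pick the time-range cluster, then apply that cluster's regression."""
    cluster = classify_range(features)
    return regressors[cluster](features)

# Toy stand-ins (assumptions): a threshold classifier and two linear models.
def classify(features):
    return "fast" if features[0] < 5 else "slow"

def fast_model(features):
    return 0.01 * features[0]

def slow_model(features):
    return 2.0 + 1.5 * features[0]

regressors = {"fast": fast_model, "slow": slow_model}
print(predict_time([2], classify, regressors))
print(predict_time([10], classify, regressors))
```

Each regression only has to fit the narrower spread of times inside its own range, which is the motivation given for splitting the problem.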
Experiment 3
• Test dataset R^2 = 0.83862
[Figure panels: Algebra + BGP features; Multiple regressions]
Predicting with nearest neighbors regression
• The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.
• We train a k-NN with
  – Euclidean distance as the distance function
  – Distance weighting: weighted by the inverse of the distance
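A sketch of distance-weighted k-NN regression with Euclidean distance. It uses a linear scan and hypothetical names and toy data; returning the mean of exact matches is an assumption added to avoid dividing by zero.

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """Predict by averaging the k nearest targets, weighted by 1 / distance."""
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    exact = [y for d, y in neighbors if d == 0]
    if exact:  # an exact match would otherwise get an infinite weight
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Toy training data: feature vectors and execution times (illustrative values).
train_x = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
train_y = [1.0, 3.0, 10.0]
print(knn_predict(train_x, train_y, (0.25, 0.0)))
```

With inverse-distance weights, the closer of the two neighbors dominates the prediction.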
• k-dimensional tree (k-d tree) data structure to search the nearest neighbors
  – a space-partitioning data structure for organizing points in a k-dimensional space
• Complexity of a search: O(log N) operations on average
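A compact k-d tree with single nearest-neighbor search, as a sketch of the idea. The nested-tuple node layout and function names are assumptions; a production system would more likely use a library implementation.

```python
import math

def build(points, depth=0):
    """Build a k-d tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], build(points[:mid], depth + 1), build(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, query) < math.dist(best, query):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only search the far side if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))  # -> (8, 1)
```

Pruning the far subtree is what keeps a typical search close to logarithmic in the number of stored points.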
Experiment 4
• Test dataset R^2 = 0.837
• k = 2 for k-NN (selected by cross validation)
[Figure panels: k-NN; Multiple regressions]
• Future work
  – Training data with broad coverage
    • DBpedia SPARQL benchmark query templates
      – Berlin: 5 templates
      – DBPSB: 20 templates
  – Fine tuning with more cross validation
SUGGESTING SPARQL QUERIES
Suggesting SPARQL queries based on query history
• Use the same features
• Construct a k-d tree for nearest neighbor search
• Top M neighbors for a query are the top M suggestions for that query
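The suggestion step amounts to a top-M nearest-neighbor lookup over the query history. For brevity this sketch uses a linear scan instead of a k-d tree; the history format, feature values, and function name are assumptions.

```python
import heapq
import math

def suggest(history, query_features, m=3):
    """Return the m queries from history whose feature vectors are closest."""
    closest = heapq.nsmallest(
        m, history, key=lambda item: math.dist(item[1], query_features)
    )
    return [text for text, _ in closest]

# Hypothetical history of (query text, feature vector) pairs.
history = [
    ("q1", (0.0, 0.0)),
    ("q2", (1.0, 1.0)),
    ("q3", (5.0, 5.0)),
    ("q4", (0.5, 0.0)),
]
print(suggest(history, (0.0, 0.0), m=2))  # -> ['q1', 'q4']
```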
Example
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}
Suggestion 1
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Radu_Sabo ?p ?uri .
  ?uri rdf:type ?x
}
Suggestion 2
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Hafar_Al-Batin ?p ?uri .
  ?uri rdf:type ?x
}
Suggestion 3
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Maurice_D._G._Scott ?p ?uri .
  ?uri rdf:type ?x
}
• Future work
  – Query construction and refinement workflow
    • How to use the query suggestions?
  – Evaluating the suggestions
    • User study
Thank you