Transcript
Page 1: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History

Rakebul Hasan

Page 2: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Context

• Assisting human users and software agents in:
  – Querying Semantic Web data
    • Understanding query behavior: predicting query performance
      – Workload management, query scheduling, query optimization
    • Constructing and refining queries: suggesting alternatives
  – Consuming Semantic Web data
    • Understanding reasoning of Semantic Web software agents: explaining reasoning
      – Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction

Page 3: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Outline

• Predicting SPARQL query execution time

• Suggesting similar SPARQL queries from query history

Page 4: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

PREDICTING SPARQL QUERY EXECUTION TIME

Page 5: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Accurately predicting query performance enables effective
  – workload management
  – query scheduling
  – query optimization

Page 6: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Understanding performance of computer programs

Insight. [Knuth] Use the scientific method to understand performance.

Page 7: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Scientific method applied to analysis of algorithms

• A framework for predicting performance and comparing algorithms.

• Scientific method
  – Observe some feature of the natural world.
  – Hypothesize a model that is consistent with the observations.
  – Predict events using the hypothesis.
  – Verify the predictions by making further observations.
  – Validate by repeating until the hypothesis and observations agree.

• Principles
  – Experiments must be reproducible.
  – Hypotheses must be falsifiable.

• Feature of the natural world. Computer itself.

Slide credit: Robert Sedgewick

Page 8: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Example: 3-Sum

• 3-SUM. Given N distinct integers, how many triples sum to exactly zero?

• 3-SUM brute-force algorithm. Check all the possible triples.

• How much time does it take?

Slide credit: Robert Sedgewick
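The brute-force approach can be sketched in a few lines. This is a minimal illustration rather than the code behind the slide; the function name and the sample input are assumptions. It examines all N(N-1)(N-2)/6 triples, which is why the running time grows roughly with N^3.

```python
from itertools import combinations

def count_three_sum(values):
    """Count the triples of distinct positions whose values sum to zero."""
    # combinations() yields every unordered triple exactly once.
    return sum(1 for a, b, c in combinations(values, 3) if a + b + c == 0)

# Illustrative input (assumed, not taken from the slides).
sample = [30, -40, -20, -10, 40, 0, 10, 5]
triple_count = count_three_sum(sample)
print(triple_count)
```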

Page 9: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Data analysis

• Standard plot. Plot running time T(N) vs. input size N.

Slide credit: Robert Sedgewick

Page 10: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Data analysis

• Log-log plot. Plot running time lg(T(N)) vs. input size lg N.

• Regression. Fit a straight line through the data points: a N^b.

• Hypothesis. The running time is about 1.006 × 10^-10 × N^2.999

Slide credit: Robert Sedgewick
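A straight-line fit in log-log space can be sketched as follows. The function name and the synthetic timings are assumptions for illustration; the measurements behind the slide are not reproduced here. The slope of the fitted line is the exponent b, and 2 raised to the intercept is the constant a.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares line through (lg N, lg T); returns (a, b) with T ~ a * N**b."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return 2 ** intercept, slope

# Synthetic timings generated from a known cubic law (assumed data).
sizes = [1000, 2000, 4000, 8000]
times = [1.0e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(a, b)
```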

Page 11: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Prediction and validation

• Hypothesis. The running time is about 1.006 × 10^-10 × N^2.999

• Predictions.
  – 51.0 seconds for N = 8000.
  – 408.1 seconds for N = 16000.

• Observations. The observed running times validate the hypothesis.

Slide credit: Robert Sedgewick
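The two predictions follow directly from plugging N into the hypothesized model. A short check, with the function name chosen here for illustration:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Running time predicted by the hypothesis a * N**b."""
    return a * n ** b

for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
```

Doubling N multiplies the prediction by 2^2.999, which is about 8, consistent with cubic growth.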

Page 12: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Understanding performance of database queries

• Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.

• Gupta et al. use machine learning for predicting query execution time ranges.

Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE.

Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC.

Page 13: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Predicting SPARQL query execution time

• Key challenge. Feature engineering
  – Representing SPARQL queries as feature vectors
    • Each dimension of the vector is a feature

Page 14: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Configuration

• Apache Jena TDB
  – With DBpedia 3.8 dataset

• Training, validation, and test queries: randomly selected from the DBpedia SPARQL Benchmark (DBPSB) query dataset
  – 3600 training, 1200 validation, 1200 test

Page 15: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Jena ARQ query processing

• A SPARQL query in ARQ goes through several stages of processing:
  – String to Query (parsing)
  – Translation from Query to a SPARQL algebra expression
  – Optimization of the algebra expression
  – Query plan determination and low-level optimization
  – Evaluation of the query plan

Page 16: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

SPARQL algebra features

• SPARQL Algebra [1]

[1] http://www.w3.org/TR/sparql11-query/#sparqlQuery

Page 17: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

SPARQL algebra features

Example query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?nick
WHERE {
  ?x foaf:mbox <mailto:person@server.com> .
  ?x foaf:name ?name
  OPTIONAL { ?x foaf:nick ?nick }
}

Algebra expression tree:

distinct
  project (?name ?nick)
    leftjoin
      bgp
        triple ?x foaf:mbox <mailto:person@server.com>
        triple ?x foaf:name ?name
      bgp
        triple ?x foaf:nick ?nick

Feature vector:

triple  bgp  join  leftjoin  ...  project  distinct  ...  depth
3       2    0     1         ...  1        1         ...  4
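Extracting such a vector amounts to counting operator occurrences in the algebra expression tree and measuring its depth. The sketch below is an illustration under stated assumptions: the nested-tuple encoding is hypothetical (it is not Jena ARQ's API), only six operators are listed, and depth is assumed to count operator levels without the triple leaves.

```python
from collections import Counter

# Hypothetical encoding: (operator, child, ...); triple patterns are strings.
ALGEBRA = ("distinct",
           ("project",
            ("leftjoin",
             ("bgp",
              "?x foaf:mbox <mailto:person@server.com>",
              "?x foaf:name ?name"),
             ("bgp", "?x foaf:nick ?nick"))))

def count_operators(node, counts):
    """Accumulate how often each operator (and triple) occurs in the tree."""
    if isinstance(node, str):
        counts["triple"] += 1
        return
    counts[node[0]] += 1
    for child in node[1:]:
        count_operators(child, counts)

def depth(node):
    """Number of operator levels; triple leaves are not counted (assumed)."""
    if isinstance(node, str):
        return 0
    return 1 + max((depth(child) for child in node[1:]), default=0)

def algebra_features(expr, operators=("triple", "bgp", "join", "leftjoin",
                                      "project", "distinct")):
    """One count per operator, followed by the tree depth."""
    counts = Counter()
    count_operators(expr, counts)
    return [counts[op] for op in operators] + [depth(expr)]

print(algebra_features(ALGEBRA))
```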

Page 18: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 1

• Model: Support Vector Machine regression

• Evaluation measure: R^2
  – Measures how well future samples are likely to be predicted by the model.
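The slides do not spell out the formula. Assuming the usual coefficient of determination, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares, as in this sketch (function name and sample values are illustrative):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

actual_times = [1.0, 2.0, 3.0]
print(r_squared(actual_times, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(actual_times, [2.0, 2.0, 2.0]))  # always predicting the mean
```

A value near 1 means the predictions track the actual times closely; a value near 0 means the model does no better than always predicting the mean.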

Page 19: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 1

• Test dataset R^2 = 0.004492

Figure: log-scale plot of predicted vs. actual execution times for the test queries.

Page 20: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 1

Some of the long-running queries share structurally similar basic graph patterns.

{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Challenge. How do we represent basic graph patterns as vectors?

Page 21: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Basic Graph Pattern Features

• Infinite number of possibilities to write a basic graph pattern (BGP)

• Using only the set of literal values and the set of resources appearing in the RDF graph
  – Exponential number of possibilities
  – A graph with n triples has 2^n subsets of triples

• Feature vector with an exponential number of dimensions
  – Not feasible

Page 22: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Basic Graph Pattern Features

• Pattern graph = RDF graph constructed from all the BGPs in a query
  – Replace variables with a fixed symbol '?'

• Cluster the training queries based on pattern graph similarities

• Create a vector with similarity scores between the pattern graph of the query and the queries in the cluster centers.

Page 23: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph to another
  – Compute similarity by inverting the distance
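A sketch of how distances to the cluster centers become BGP features. Two assumptions are made loudly here: the inversion is taken to be 1 / (1 + d), since the slides do not give the exact formula, and the toy distance (size of the symmetric difference of two triple sets) is only a stand-in for graph edit distance.

```python
def similarity(distance):
    """Assumed inversion of a distance into a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

def bgp_feature_vector(query_graph, center_graphs, distance_fn):
    """One similarity score per cluster-center pattern graph."""
    return [similarity(distance_fn(query_graph, center))
            for center in center_graphs]

def toy_distance(graph_a, graph_b):
    """Stand-in distance: triples present in one graph but not the other."""
    return len(graph_a ^ graph_b)

# Pattern graphs as sets of triples with variables replaced by '?'.
query = frozenset({("?", "rdf:type", "?"), ("?", "foaf:name", "?")})
centers = [frozenset({("?", "rdf:type", "?")}), query]
print(bgp_feature_vector(query, centers, toy_distance))
```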

Page 24: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Graph Edit Distance
  – Usually computed using A* search
    • Exponential running time
  – Bipartite matching based approximated graph edit distance
    • Previous research shows very accurate results with classification problems

Page 25: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Clustering Training Queries
  – K-medoids clustering algorithm with approximated edit distance as distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
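The two properties listed above (centers are actual data points, and any distance function works) can be seen in a minimal k-medoids sketch. It uses a simple alternating assign/update scheme with a naive initialization, which is an assumption; the slides do not say which k-medoids variant was used, and the one-dimensional data is purely illustrative.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Return k medoids (actual data points) under an arbitrary distance."""
    medoids = list(range(k))  # naive initialization: the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, point in enumerate(points):
            closest = min(medoids, key=lambda m: distance(point, points[m]))
            clusters[closest].append(i)
        # Update step: the new medoid minimizes total distance within a cluster.
        updated = []
        for m, members in clusters.items():
            if not members:
                updated.append(m)  # keep the medoid of an empty cluster
                continue
            updated.append(min(
                members,
                key=lambda c: sum(distance(points[c], points[j])
                                  for j in members)))
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return [points[m] for m in medoids]

values = [0, 1, 2, 10, 11, 12]
centers = k_medoids(values, k=2, distance=lambda a, b: abs(a - b))
print(centers)
```

In the setting of the slides, the points would be pattern graphs and the distance function the approximated graph edit distance.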

Page 26: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 2

• Model: Support Vector Machine regression

• Test dataset R^2 = 0.124204

• K = 10

Figure panels: Algebra features; Algebra + BGP features.

Page 27: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Multiple Regressions

• We train different SVM regressions for different time ranges.

• The variance along the y-axis is smaller for each regression, so it is easier to fit a curve.

Page 28: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Different time ranges
  – Clustering the execution time ranges
    • We use the x-means clustering algorithm, which automatically estimates the number of clusters
      – 5 clusters found in the training dataset
  – Each cluster contains queries with similar execution times

Page 29: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Predicting execution time range
  – Predict the corresponding clusters for unseen queries.
  – How
    • Train an SVM classifier with the found clusters as labels
    • Classify unseen queries: accuracy of 96% for the test dataset
    • This means we can accurately predict time ranges

Page 30: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Predicting execution time
  – Different SVM regressions for different time ranges.
  – For an unseen query, use the regression that corresponds to its predicted time range cluster
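The two-stage scheme (first predict the time range, then apply that range's regression) can be sketched as below. All names are hypothetical, and the models are toy stand-ins: a threshold rule replaces the SVM classifier and a per-range mean replaces each SVM regression, so only the dispatch structure is illustrated.

```python
from collections import defaultdict

def train_range_regressors(features, times, range_labels, fit):
    """Fit one regressor per time-range cluster using the supplied fit()."""
    grouped = defaultdict(lambda: ([], []))
    for x, y, label in zip(features, times, range_labels):
        grouped[label][0].append(x)
        grouped[label][1].append(y)
    return {label: fit(xs, ys) for label, (xs, ys) in grouped.items()}

def predict_time(x, classify_range, regressors):
    """Stage 1: predict the time range. Stage 2: use that range's regressor."""
    return regressors[classify_range(x)](x)

def fit_mean(xs, ys):
    """Stand-in regressor: always predicts the mean time of its range."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

# Toy data (assumed): one feature per query, two time-range clusters.
features = [[1.0], [2.0], [8.0], [9.0]]
times = [0.1, 0.3, 10.0, 14.0]
labels = [0, 0, 1, 1]
regressors = train_range_regressors(features, times, labels, fit_mean)
classify = lambda x: 0 if x[0] < 5 else 1  # stand-in for the SVM classifier
print(predict_time([1.5], classify, regressors))
print(predict_time([8.5], classify, regressors))
```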

Page 31: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 3

• Test dataset R^2 = 0.83862

Figure panels: Algebra + BGP features; Multiple regressions.

Page 32: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Predicting with nearest neighbors regression

• The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.

• We train a k-NN with
  – Euclidean distance as the distance function
  – Distance weighting: weighted by the inverse of the distance
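A minimal sketch of k-NN regression with Euclidean distance and inverse-distance weighting. The function name and the toy data are assumptions, and a brute-force scan is used for clarity.

```python
import math

def knn_predict(query, train_x, train_y, k=2):
    """Inverse-distance-weighted average of the k nearest training targets."""
    nearest = sorted((math.dist(query, x), y)
                     for x, y in zip(train_x, train_y))[:k]
    for d, y in nearest:
        if d == 0.0:
            return y  # exact match: avoid dividing by zero
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

train_x = [[0.0], [1.0], [3.0]]
train_y = [10.0, 20.0, 40.0]
print(knn_predict([0.25], train_x, train_y, k=2))
```

With inverse weighting, the closer of the two neighbors dominates the prediction, so the estimate for 0.25 lands nearer to 10 than to 20.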

Page 33: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• k-dimensional tree (k-d tree) data structure to search the nearest neighbors
  – a space-partitioning data structure for organizing points in a k-dimensional space

• Complexity of a search: O(log N) operations on average
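A compact k-d tree sketch, assuming points are equal-length tuples. The dictionary-based node layout and function names are illustrative choices. The search descends into the half-space containing the target first and visits the other side only when the splitting plane is closer than the best match found so far.

```python
import math

def build_kd_tree(points, depth=0):
    """Recursively split the points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kd_tree(points[:mid], depth + 1),
            "right": build_kd_tree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Return (distance, point) of the stored point closest to target."""
    if node is None:
        return best
    d = math.dist(node["point"], target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    if abs(diff) < best[0]:  # the far side may still hold a closer point
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
print(nearest(tree, (9, 2)))
```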

Page 34: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 4

• Test dataset R^2 = 0.837

• k = 2 for k-NN (selected by cross validation)

Figure panels: k-NN; Multiple regressions.

Page 35: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Future work
  – Training data with broad coverage
    • DBpedia SPARQL benchmark query templates
      – Berlin: 5 templates
      – DBPSB: 20 templates
  – Fine tuning with more cross validation

Page 36: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

SUGGESTING SPARQL QUERIES

Page 37: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Suggesting SPARQL queries based on query history

• Use the same features

• Construct a k-d tree for nearest neighbor search

• Top M neighbors for a query are the top M suggestions for that query
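Selecting the top M suggestions can be sketched as a nearest-neighbor lookup over the feature vectors of past queries. For brevity this sketch scans the whole history instead of using a k-d tree; the function name, the history layout, and the sample vectors are assumptions.

```python
import heapq
import math

def suggest_queries(query_vector, history, m=3):
    """Return the texts of the m history queries closest to query_vector.

    history is a list of (query_text, feature_vector) pairs.
    """
    closest = heapq.nsmallest(
        m, history, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in closest]

# Illustrative history with made-up two-dimensional feature vectors.
history = [("q1", [0.0, 0.0]), ("q2", [1.0, 1.0]),
           ("q3", [5.0, 5.0]), ("q4", [0.5, 0.0])]
print(suggest_queries([0.0, 0.0], history, m=2))
```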

Page 38: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Example

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 1

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Radu_Sabo ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 2

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Hafar_Al-Batin ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 3

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Maurice_D._G._Scott ?p ?uri .
  ?uri rdf:type ?x
}

Page 39: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Future work
  – Query construction and refinement workflow
    • How to use the query suggestions?
  – Evaluating the suggestions
    • User study

Page 40: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Thank you

