Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History
Rakebul Hasan

Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Page 1: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History

Rakebul Hasan

Page 2: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Context

• Assisting human users and software agents in:
  – Querying Semantic Web data
    • Understanding query behavior: predicting query performance
      – Workload management, query scheduling, query optimization
    • Constructing and refining queries: suggesting alternatives
  – Consuming Semantic Web data
    • Understanding reasoning of Semantic Web software agents: explaining reasoning
      – Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction

Page 3: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Outline  

• Predicting SPARQL query execution time

• Suggesting similar SPARQL queries from query history

Page 4: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

PREDICTING SPARQL QUERY EXECUTION TIME

Page 5: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Accurately predicting query performance enables effective
  – workload management
  – query scheduling
  – query optimization

Page 6: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Understanding performance of computer programs

Insight. [Knuth] Use the scientific method to understand performance.

Page 7: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Scientific method applied to analysis of algorithms

• A framework for predicting performance and comparing algorithms.
• Scientific method
  – Observe some feature of the natural world.
  – Hypothesize a model that is consistent with the observations.
  – Predict events using the hypothesis.
  – Verify the predictions by making further observations.
  – Validate by repeating until the hypothesis and observations agree.
• Principles
  – Experiments must be reproducible.
  – Hypotheses must be falsifiable.
• Feature of the natural world. Computer itself.

Slide credit: Robert Sedgewick

Page 8: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Example: 3-Sum

• 3-SUM. Given N distinct integers, how many triples sum to exactly zero?

• 3-SUM brute-force algorithm. Check all the possible triples.

• How much time does it take?

Slide credit: Robert Sedgewick
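A minimal sketch of the brute-force approach, assuming the input is a plain Python list of distinct integers. The function name `count_three_sum` and the sample values are illustrative and not taken from the slides.

```python
from itertools import combinations

def count_three_sum(values):
    """Count the triples (at distinct positions) whose values sum to exactly zero."""
    # combinations() enumerates every unordered triple once: about N^3 / 6 checks.
    return sum(1 for a, b, c in combinations(values, 3) if a + b + c == 0)

sample = [30, -40, -20, -10, 40, 0, 10, 5]
print(count_three_sum(sample))
```

Because every triple is inspected, the running time grows roughly with the cube of N, which is the behavior the following slides measure.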

Page 9: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Data analysis

• Standard plot. Plot running time $T(N)$ vs. input size $N$.

Slide credit: Robert Sedgewick

Page 10: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Data analysis

• Log-log plot. Plot running time $\lg T(N)$ vs. input size $\lg N$.
• Regression. Fit straight line through data points: $a N^b$.
• Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$

Slide credit: Robert Sedgewick
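The log-log regression can be sketched as an ordinary least-squares line fit on $(\lg N, \lg T)$ pairs. The timings below are synthetic values generated from the stated hypothesis, not measurements, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(sizes, times):
    """Fit lg T = b * lg N + lg a by least squares and return (a, b)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                      # slope of the line = exponent
    a = 2 ** (mean_y - b * mean_x)     # intercept of the line = lg a
    return a, b

# Synthetic timings produced from the hypothesis itself (an assumption for illustration)
sizes = [1000, 2000, 4000, 8000]
times = [1.006e-10 * n ** 2.999 for n in sizes]
a, b = fit_power_law(sizes, times)
print(a, b)
```

The slope of the fitted line is the exponent b, and the intercept gives the constant a.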

Page 11: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Prediction and validation

• Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$

• Predictions.
  – 51.0 seconds for N = 8000.
  – 408.1 seconds for N = 16000.

• Observations. The observed running times validate the hypothesis.

Slide credit: Robert Sedgewick
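The two predictions follow directly from evaluating the hypothesis. A short check, where `predicted_seconds` is a hypothetical helper name:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Evaluate the power-law hypothesis a * N^b for input size n."""
    return a * n ** b

for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
```

Doubling N multiplies the predicted time by about 2^2.999, which is close to 8.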

Page 12: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Understanding performance of database queries

• Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.

• Gupta et al. use machine learning for predicting query execution time ranges.

Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE

Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC

Page 13: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Predicting SPARQL query execution time

• Key challenge. Feature engineering
  – Representing SPARQL queries as feature vectors
    • Each dimension of the vector is a feature

Page 14: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Configuration

• Apache Jena TDB
  – With DBpedia 3.8 dataset

• Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
  – 3600 training, 1200 validation, 1200 test

Page 15: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Jena ARQ query processing

• A SPARQL query in ARQ goes through several stages of processing:
  – String to Query (parsing)
  – Translation from Query to a SPARQL algebra expression
  – Optimization of the algebra expression
  – Query plan determination and low-level optimization
  – Evaluation of the query plan

Page 16: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

SPARQL algebra features

• SPARQL Algebra¹

¹ http://www.w3.org/TR/sparql11-query/#sparqlQuery

Page 17: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

SPARQL algebra features

Query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?nick
WHERE {
  ?x foaf:mbox <mailto:person@server.com> .
  ?x foaf:name ?name
  OPTIONAL { ?x foaf:nick ?nick }
}

Algebra tree:

distinct
  project (?name ?nick)
    leftjoin
      bgp
        triple ?x foaf:mbox <mailto:person@server.com>
        triple ?x foaf:name ?name
      bgp
        triple ?x foaf:nick ?nick

Feature vector:

triple  bgp  join  leftjoin  ...  project  distinct  ...  depth
3       2    0     1         ...  1        1         ...  4
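A minimal sketch of turning an algebra tree into such a feature vector, assuming a hypothetical encoding of the tree as nested `(operator_name, children)` tuples. This is not the Jena ARQ API, and counting depth as the number of edges from the root to the deepest triple pattern is an assumption.

```python
from collections import Counter

def algebra_features(tree, operators):
    """Return the operator frequencies (in the given order) followed by the tree depth."""
    counts = Counter()

    def walk(node, depth):
        name, children = node
        counts[name] += 1
        # A leaf reports its own depth; an inner node reports its deepest child.
        return max([walk(child, depth + 1) for child in children], default=depth)

    max_depth = walk(tree, 0)
    return [counts[op] for op in operators] + [max_depth]

# Hypothetical encoding of a DISTINCT / project / OPTIONAL query with two BGPs
tree = ("distinct", [("project", [("leftjoin", [
    ("bgp", [("triple", []), ("triple", [])]),
    ("bgp", [("triple", [])]),
])])])
OPERATORS = ["triple", "bgp", "join", "leftjoin", "project", "distinct"]
features = algebra_features(tree, OPERATORS)
print(features)
```

Operators that do not occur in a query simply contribute a zero, so every query maps to a vector of the same length.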

Page 18: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 1

• Model: Support Vector Machine regression
• Evaluation measure: $R^2$
  – Measures how well future samples are likely to be predicted by the model.
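For reference, the coefficient of determination is commonly computed as $R^2 = 1 - SS_{res} / SS_{tot}$. A small sketch of that standard definition, with a hypothetical helper name `r2_score`:

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
```

A value near 1 means the predictions track the actual execution times closely; a value near 0 means the model does no better than always predicting the mean.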

Page 19: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 1

• Test dataset $R^2$ = 0.004492

Log-scale plot of predicted vs. actual execution times for the test queries.

Page 20: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 1

Some of the long-running queries share structurally similar basic graph patterns.

{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Challenge. How do we represent basic graph patterns as vectors?

Page 21: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Basic Graph Pattern Features

• There is an infinite number of ways to write a basic graph pattern (BGP)

• Using only the set of literal values and the set of resources appearing in the RDF graph
  – Exponential number of possibilities
  – A graph with n triples has $2^n$ subsets of triples

• Feature vector with an exponential number of dimensions
  – Not feasible

Page 22: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Basic Graph Pattern Features

• Pattern graph = RDF graph constructed from all the BGPs in a query
  – Replace variables with a fixed symbol '?'

• Cluster the training queries based on pattern graph similarities

• Create a vector with similarity scores between the pattern graph of the query and the pattern graphs of the queries at the cluster centers.
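A minimal sketch of building a pattern graph, assuming each BGP is given as a list of `(subject, predicate, object)` string triples and that variables are the terms starting with `?`. The set-of-triples representation is an illustrative choice, not the representation used in the original work.

```python
def pattern_graph(bgps):
    """Merge all BGPs of a query into one set of triples, masking every variable as '?'."""
    def mask(term):
        return "?" if term.startswith("?") else term

    return {tuple(mask(term) for term in triple) for bgp in bgps for triple in bgp}

bgps = [[("dbpedia:1549_Mikko", "?p", "?uri"), ("?uri", "rdf:type", "?x")]]
graph = pattern_graph(bgps)
print(graph)
```

Masking the variable names means that two queries differing only in how their variables are named map to the same pattern graph.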

Page 23: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Graph Edit Distance
  – Minimum amount of distortion needed to transform one graph to another
  – Compute similarity by inverting the distance
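A sketch of how distances could be turned into the similarity-based feature vector. The inversion `1 / (1 + d)` is an assumed formula, and `toy_distance` is a deliberately simple stand-in (the number of triples present in exactly one of two pattern graphs), not a graph edit distance.

```python
def similarity(distance):
    # Assumed inversion: distance 0 maps to similarity 1, large distances approach 0.
    return 1.0 / (1.0 + distance)

def bgp_feature_vector(query_graph, center_graphs, distance_fn):
    """One similarity score per cluster-center pattern graph."""
    return [similarity(distance_fn(query_graph, center)) for center in center_graphs]

def toy_distance(g1, g2):
    # Stand-in for graph edit distance: size of the symmetric difference of two triple sets.
    return len(g1 ^ g2)

query = {("?", "rdf:type", "?")}
centers = [
    {("?", "rdf:type", "?")},
    {("?", "rdf:type", "?"), ("?", "foaf:name", "?")},
]
vector = bgp_feature_vector(query, centers, toy_distance)
print(vector)
```

The vector has one dimension per cluster center, so its length is fixed by the number of clusters rather than by the size of the query.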

Page 24: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Graph Edit Distance
  – Usually computed using A* search
    • Exponential running time
  – Bipartite matching based approximated graph edit distance with
    • Previous research shows very accurate results with classification problems

Page 25: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Clustering Training Queries
  – K-medoids clustering algorithm with approximated edit distance as distance function
    • Selects data points as cluster centers
    • Arbitrary distance function
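A compact sketch of k-medoids with a pluggable distance function. It uses the simple alternating scheme (assign points to the nearest medoid, then re-pick each cluster's medoid) and a naive initialization with the first k points; both are assumptions, and the variant used in the original work is not stated.

```python
def k_medoids(points, k, distance_fn, max_iter=100):
    """Return k cluster centers that are themselves data points (medoids)."""
    medoids = list(range(k))  # naive initialization: indices of the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: distance_fn(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes the total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(distance_fn(points[c], points[j]) for j in members))
            for members in clusters.values() if members
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

centers = k_medoids([0, 1, 2, 10, 11, 12], k=2, distance_fn=lambda a, b: abs(a - b))
print(centers)
```

Because only pairwise distances are needed, the same routine works with any distance function, including one defined on graphs.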

Page 26: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 2

• Model: Support Vector Machine regression

• Test dataset $R^2$ = 0.124204

• K = 10

Plot panels: Algebra features | Algebra + BGP features

Page 27: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Multiple Regressions

• We train different SVM regressions for different time ranges.

• The variance along the y-axis is smaller for each regression, which makes it easier to fit a curve.

Page 28: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Different time ranges
  – Clustering the execution time ranges
    • We use the x-means clustering algorithm, which automatically estimates the number of clusters
      – 5 clusters found in the training dataset
  – Each cluster contains queries with similar execution times

Page 29: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Predicting execution time range
  – Predict the corresponding clusters for unseen queries.
  – How
    • Train an SVM classifier with the found clusters as labels
    • Classify unseen queries: accuracy of 96% for the test dataset
    • This means we can accurately predict time ranges

Page 30: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Predicting execution time
  – Different SVM regressions for different time ranges.
  – For an unseen query, use the regression that corresponds to its predicted time range cluster
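The two-stage prediction can be sketched as a simple dispatch: classify the query into a time-range cluster, then apply the regressor trained for that cluster. The classifier and regressors below are toy stand-ins written as plain functions (not trained SVMs), and all names and constants are hypothetical.

```python
def predict_execution_time(features, classify_range, regressors):
    """Route a query to the regressor that belongs to its predicted time-range cluster."""
    cluster = classify_range(features)
    return regressors[cluster](features)

# Toy stand-ins for a trained classifier and per-cluster regressors (illustrative only)
def classify_range(features):
    return "short" if features[0] < 5 else "long"

regressors = {
    "short": lambda f: 0.01 * f[0],
    "long": lambda f: 2.0 + 0.5 * f[0],
}

print(predict_execution_time([2], classify_range, regressors))
print(predict_execution_time([10], classify_range, regressors))
```

Each regressor only has to model queries from one time range, which is what reduces the variance it needs to fit.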

Page 31: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 3

• Test dataset $R^2$ = 0.83862

Plot panels: Algebra + BGP features | Multiple regressions

Page 32: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Predicting with nearest neighbors regression

• The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.

• We train a k-NN with
  – Euclidean distance as the distance function
  – Distance weighting: weighted by the inverse of the distance
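A minimal sketch of k-NN regression with Euclidean distance and inverse-distance weighting. It uses a linear scan over the training data for clarity, and the training pairs are made-up illustrative values.

```python
import math

def knn_predict(train, query, k=2):
    """train: list of (feature_vector, execution_time) pairs; returns the weighted average."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    weighted = []
    for features, target in neighbors:
        d = math.dist(features, query)
        if d == 0:
            return target  # exact match: avoid dividing by zero
        weighted.append((1.0 / d, target))
    total_weight = sum(w for w, _ in weighted)
    return sum(w * t for w, t in weighted) / total_weight

train = [([0.0], 1.0), ([2.0], 3.0), ([10.0], 100.0)]
print(knn_predict(train, [0.5], k=2))
```

Closer neighbors receive larger weights, so the prediction leans toward the execution time of the most similar training query.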

Page 33: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• k-dimensional tree (k-d tree) data structure to search the nearest neighbors
  – a space-partitioning data structure for organizing points in a k-dimensional space

• Complexity of a search: O(log N) operations on average
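A small k-d tree sketch with a single-nearest-neighbor search, assuming points are equal-length tuples of numbers. The dictionary-based node layout is an illustrative choice.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points at the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))
```

The pruning test is what lets a search skip whole subtrees instead of comparing the query against every stored point.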

Page 34: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Experiment 4

• Test dataset $R^2$ = 0.837
• k = 2 for k-NN (selected by cross validation)

Plot panels: k-NN | Multiple regressions

Page 35: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Future work
  – Training data with broad coverage
    • DBpedia SPARQL benchmark query templates
      – Berlin: 5 templates
      – DBPSB: 20 templates
  – Fine tuning with more cross validation

Page 36: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

SUGGESTING SPARQL QUERIES

Page 37: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Suggesting SPARQL queries based on query history

• Use the same features
• Construct a k-d tree for nearest neighbor search

• Top M neighbors for a query are the top M suggestions for that query
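A sketch of the suggestion step, assuming the query history is a list of `(query_text, feature_vector)` pairs. For brevity it uses a linear scan with `heapq.nsmallest` in place of a k-d tree, and the history entries and vectors are made up for illustration.

```python
import heapq
import math

def suggest(query_vector, history, m=3):
    """Return the texts of the M history queries whose feature vectors are closest."""
    closest = heapq.nsmallest(m, history, key=lambda item: math.dist(item[1], query_vector))
    return [text for text, _ in closest]

# Hypothetical history: (query text, feature vector)
history = [
    ("query A", [3.0, 2.0, 1.0]),
    ("query B", [3.0, 2.0, 0.9]),
    ("query C", [9.0, 0.0, 0.1]),
    ("query D", [3.1, 2.0, 1.0]),
]
print(suggest([3.0, 2.0, 1.0], history, m=2))
```

Since the same feature vectors serve both tasks, a history indexed for execution-time prediction can be reused for suggestions.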

Page 38: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Example

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 1

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Radu_Sabo ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 2

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Hafar_Al-Batin ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 3

SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Maurice_D._G._Scott ?p ?uri .
  ?uri rdf:type ?x
}

Page 39: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

• Future work
  – Query construction and refinement workflow
    • How to use the query suggestions?
  – Evaluating the suggestions
    • User study

Page 40: Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Thank you