Download pdf - Effective Hive Queries

Transcript
Page 1: Effective Hive Queries

Effec%ve  Hive  Queries                                Secrets  From  the  Pros  

We  will  be  star,ng  at  11:03  PDT  Use  the  Chat  Pane  in  GoToWebinar  to  Ask  Ques%ons!  

Assess  your  level  and  learn  new  stuff  This  webinar  is  intended  for  intermediate  audiences  (familiar  with  Apache  Hive  and  Hadoop,  but  not  experts)  ?  

Page 2: Effective Hive Queries

AGENDA  This  Webinar  provides  %ps  on  improving  the  performance  and  beJer  u%lizing  resources  using  the  following  best  prac%ces:  

•  Data  Layout  (Par%%ons  and  Buckets)  •  Data  Sampling  (Bucket  and  Block  sampling)  •  Data  Processing  (Bucket  Map  Join  and  Parallel  execu%on)  

Page 3: Effective Hive Queries

Dataset  Used  

#  of  records:    276M  records  Columns:  I%nerary  ID  Year  &Quarter  of  Travel  Trip  Origin  City  &  State  Trip  Des%na%on  City  &  State  Distance  between  Origin  &  Des%na%on  

Airline  Bookings  All  Includes  stops  at  intermediate  ci%es  

#  of  records:    116M  records  Columns:  I%nerary  ID  Year  &Quarter  of  Travel  Trip  Origin  City  &  State  Trip  Des%na%on  City  &  State  Distance  between  Origin  &  Des%na%on  

Airline  Bookings  Origin  Only  Only  first  leg  of  travel  

#  of  records:    50  Columns:  State  code  &  Name  Popula%on  

Census  Human  popula%on  by  US  State  

Page 4: Effective Hive Queries

#1  -­‐  Data  Par%%oning    •  Problem  PaJern  

–  Query  a  subset  of  data  in  a  table  –  Subset  iden%fied  by  “Column_Name  =  X”  filter  

•  Solu%on  paJern  –  Layout  data  in  sub-­‐directories  with  each  directory  associated  with  a  value  of  the  par%%on  column  

–  The  filter  on  par%%on  column  just  picks  a  single  sub  directory  •  Approach  

–  Use  PARTITION  BY  clause  •  Benefit  

–  Par%%on  pruning  –  2.7x  faster  on  a  query  on  Airline  Bookings  Dataset  (29  seconds)  

Page 5: Effective Hive Queries

#1  -­‐  Data  Par%%oning  

Airline  Bookings  All  Table  

Origin  State  (Par%%on  Column  /  Sub-­‐directory)   CA   WY  AL  

File1001.dat  

File1002.dat  

File100n.dat  

File3001.dat  

File3002.dat  

File300n.dat  

Filex001.dat  

Filex002.dat  

Filex00n.dat  

Files  inside  the  par%%on  

SELECT  origin_city,  origin_state  FROM  Airline_Bookings_All  WHERE  origin_state  =  ‘CA’  

CREATE  TABLE  Airline_Bookings_All  ….  PARTITIONED  BY  (origin_state  STRING)  

Page 6: Effective Hive Queries

#2  -­‐  Data  Bucke%ng  

•  Problem  PaJern  –  Join  data  in  two  large  tables  efficiently  –  Sample  data  inside  a  table  efficiently  

•  Solu%on  paJern  – More  efficient  processing  by  storing  data  in  hash  buckets  

•  Approach  •  Use  bucke%ng  using  CLUSTERED  BY  ..  INTO  n  BUCKETS  

•  Benefit  –  Bucket  Map  Join  –  Bucket  Sampling  

Page 7: Effective Hive Queries

#2  –  Data  Bucke%ng  CREATE  TABLE  Airline_Bookings_All  …  CLUSTERED  BY  (i%nid)  INTO  64  BUCKETS  

set  hive.enforce.bucke%ng  =  true;  INSERT  OVERWRITE  TABLE  Airline_Bookings_All  SELECT  …  FROM  ..  

Ailrine_Bookings_All  

File00.dat  

File63.dat  

File01.dat  Each  File  contains  all  

the  rows  that  correspond  to  the  same  hash  of  i%nid  

column  

Page 8: Effective Hive Queries

#2  -­‐  Data  Bucke%ng  

a  

File1001.dat  

File1002.dat  

File100n.dat  

Filex001.dat  

Filex002.dat  

Filex00m.dat  

Files  containing  table  data  bucketed  on  a  

column  

b  

set  hive.op%mize.bucketmapjoin  =  true;    SELECT  /*+  MAPJOIN(a,  b)  */  a.*,  b.*  FROM  Airline_Bookings_All  a  JOIN  Airline_Bookings_Origin_Only  b  ON  a.i%nid  =  b.i%nid  

Note:    1.  Both  the  tables  are  bucketed  on  i%nid  column  2.  The  numbers  of  buckets  in  the  two  tables  are  a  strict  mul%ple  of  each  other  

Page 9: Effective Hive Queries

#3  -­‐  Bucket  Sampling  

•  Problem  PaJern  – Work  on  joinable  samples  of  data  from  different  tables  

•  Solu%on  paJern  –  Use  Bucket  Sampling  

•  Approach  •  TABLESAMPLE  (BUCKET  x  OUT  OF  Y  ON  column)  

•  Benefit  –  Useful  while  working  with  sample  data  and  joins  

Page 10: Effective Hive Queries

#3  -­‐  Bucket  Sampling  

Filex002.dat  

Filex030.dat  

Filex064.dat  

Files  containing  bookings  data  bucketed  on  i%nid  

a  

SELECT  a.*,  b.*  FROM  Airline_Bookings_All  TABLESAMPLE(bucket  30  out  of  64  on  i%nid)  a                      ,  Airline_Bookings_Origin_Only  TABLESAMPLE(bucket  30  out  of  64  on  i%nid)  b  WHERE  a.i%nid  =  b.i%nid  

Filex001.dat  

Filex063.dat  

Filey002.dat  

Filey030.dat  

Filey064.dat  

b  

Filey001.dat  

Filey063.dat  

Page 11: Effective Hive Queries

#4  –  Block  Sampling  •  Problem  PaJern  

–  View  a  sample  of  a  data  with  in  a  table  –  Sample  size  expressed  as  number  of  rows,  %age  of  data,  or  number  of  MBs  

•  Solu%on  paJern  –  Use  Block  sampling  

•  Approach  –  Use  TABLESAMPLE  (n%,  nM,  or  n  ROWS)  

•  Benefit  –  Geyng  a  random  sample  from  the  table  –  More  op%ons  to  specify  how  many  samples  to  generate  

Page 12: Effective Hive Queries

#5  –  Parallel  Execu%on  

SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM (

SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8

)a JOIN (

SELECT itinid, origin, originstate FROM air_travel_origins_8

)B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY

a.year, a.quarter, a.origin, a.originstate;

Stage  1  

Stage  2  

Stage  3  

Stage  1  

Stage  2  

Stage  3  

Stage  1   Stage  2  

Stage  3  

set  hive.exec.parallel  =  false;  

set  hive.exec.parallel  =  true;  

Page 13: Effective Hive Queries

Summary  

•  Iterate  quickly  on  Query  Design  – Use  Bucket  and  Block  Sampling  

•  Run  queries  faster  – Par%%oning  to  invoke  Par%%on  Pruning  – Bucke%ng  to  invoke  Bucket  Map  Joins  – Execute  complex  queries  in  parallel  

Page 14: Effective Hive Queries

THANK  YOU  

Managed  Cluster   Built-­‐In  Connectors   Friendly  User-­‐Interface   Dedicated  Support  

•  100%  Managed  Hadoop  Cluster  in  the  Cloud  •  Auto-­‐Scaling  Cluster.  Full  Life-­‐cycle  Management  •  +12  Connectors  to  Applica%ons  and  Data  Sources  •  14-­‐Day  Free  Trial  (free  account  available)  •  24/7  Customer  Support  

What’s  Included?  

è  www.qubole.com/try  ç