Download pdf - Effective Hive Queries

Effec%ve Hive Queries Secrets From the Pros

We will be star,ng at 11:03 PDT Use the Chat Pane in GoToWebinar to Ask Ques%ons!

Assess your level and learn new stuff This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts) ?

AGENDA This Webinar provides %ps on improving the performance and beJer u%lizing resources using the following best prac%ces:

•  Data Layout (Par%%ons and Buckets) •  Data Sampling (Bucket and Block sampling) •  Data Processing (Bucket Map Join and Parallel execu%on)

Dataset Used

# of records: 276M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on

Airline Bookings All Includes stops at intermediate ci%es

# of records: 116M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on

Airline Bookings Origin Only Only first leg of travel

# of records: 50 Columns: State code & Name Popula%on

Census Human popula%on by US State

#1 -‐ Data Par%%oning •  Problem PaJern

–  Query a subset of data in a table –  Subset iden%fied by “Column_Name = X” filter

•  Solu%on paJern –  Layout data in sub-‐directories with each directory associated with a value of the par%%on column

–  The filter on par%%on column just picks a single sub directory •  Approach

–  Use PARTITION BY clause •  Benefit

–  Par%%on pruning –  2.7x faster on a query on Airline Bookings Dataset (29 seconds)

#1 -‐ Data Par%%oning

Airline Bookings All Table

Origin State (Par%%on Column / Sub-‐directory) CA WY AL

File1001.dat

File1002.dat

File100n.dat

File3001.dat

File3002.dat

File300n.dat

Filex001.dat

Filex002.dat

Filex00n.dat

Files inside the par%%on

SELECT origin_city, origin_state FROM Airline_Bookings_All WHERE origin_state = ‘CA’

CREATE TABLE Airline_Bookings_All …. PARTITIONED BY (origin_state STRING)

#2 -‐ Data Bucke%ng

•  Problem PaJern –  Join data in two large tables efficiently –  Sample data inside a table efficiently

•  Solu%on paJern – More efficient processing by storing data in hash buckets

•  Approach •  Use bucke%ng using CLUSTERED BY .. INTO n BUCKETS

•  Benefit –  Bucket Map Join –  Bucket Sampling

#2 – Data Bucke%ng CREATE TABLE Airline_Bookings_All … CLUSTERED BY (i%nid) INTO 64 BUCKETS

set hive.enforce.bucke%ng = true; INSERT OVERWRITE TABLE Airline_Bookings_All SELECT … FROM ..

Ailrine_Bookings_All

File00.dat

File63.dat

File01.dat Each File contains all

the rows that correspond to the same hash of i%nid

column

#2 -‐ Data Bucke%ng

a

File1001.dat

File1002.dat

File100n.dat

Filex001.dat

Filex002.dat

Filex00m.dat

Files containing table data bucketed on a

column

b

set hive.op%mize.bucketmapjoin = true; SELECT /*+ MAPJOIN(a, b) */ a.*, b.* FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b ON a.i%nid = b.i%nid

Note: 1.  Both the tables are bucketed on i%nid column 2.  The numbers of buckets in the two tables are a strict mul%ple of each other

#3 -‐ Bucket Sampling

•  Problem PaJern – Work on joinable samples of data from different tables

•  Solu%on paJern –  Use Bucket Sampling

•  Approach •  TABLESAMPLE (BUCKET x OUT OF Y ON column)

•  Benefit –  Useful while working with sample data and joins

#3 -‐ Bucket Sampling

Filex002.dat

Filex030.dat

Filex064.dat

Files containing bookings data bucketed on i%nid

a

SELECT a.*, b.* FROM Airline_Bookings_All TABLESAMPLE(bucket 30 out of 64 on i%nid) a , Airline_Bookings_Origin_Only TABLESAMPLE(bucket 30 out of 64 on i%nid) b WHERE a.i%nid = b.i%nid

Filex001.dat

Filex063.dat

Filey002.dat

Filey030.dat

Filey064.dat

b

Filey001.dat

Filey063.dat

#4 – Block Sampling •  Problem PaJern

–  View a sample of a data with in a table –  Sample size expressed as number of rows, %age of data, or number of MBs

•  Solu%on paJern –  Use Block sampling

•  Approach –  Use TABLESAMPLE (n%, nM, or n ROWS)

•  Benefit –  Geyng a random sample from the table –  More op%ons to specify how many samples to generate

#5 – Parallel Execu%on

SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM (

SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8

)a JOIN (

SELECT itinid, origin, originstate FROM air_travel_origins_8

)B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY

a.year, a.quarter, a.origin, a.originstate;

Stage 1

Stage 2

Stage 3

Stage 1

Stage 2

Stage 3

Stage 1 Stage 2

Stage 3

set hive.exec.parallel = false;

set hive.exec.parallel = true;

Summary

•  Iterate quickly on Query Design – Use Bucket and Block Sampling

•  Run queries faster – Par%%oning to invoke Par%%on Pruning – Bucke%ng to invoke Bucket Map Joins – Execute complex queries in parallel

THANK YOU

Managed Cluster Built-‐In Connectors Friendly User-‐Interface Dedicated Support

•  100% Managed Hadoop Cluster in the Cloud •  Auto-‐Scaling Cluster. Full Life-‐cycle Management •  +12 Connectors to Applica%ons and Data Sources •  14-‐Day Free Trial (free account available) •  24/7 Customer Support

What’s Included?

è www.qubole.com/try ç