Programming Hive Reading #3 @just_do_neet

Chapter 10. Tuning

•Using Explain / Explain Extended

•Optimized Join

•Local Mode

•Parallel Execution

•Strict Mode

•Tuning the Number of Map/Reduce

•JVM etc...

Using EXPLAIN

•You only get away with not using EXPLAIN until elementary school (etc.)

•Output contents

•Abstract Syntax Tree(AST)

•Dependencies

•Stage Plans

•AST (abstract syntax tree)

•TOK_FROM: input source (TOK_TABREF = table)

•TOK_INSERT: output destination

•TOK_SELECT: the SELECT clause (the columns and expressions selected)

•Dependencies

•A stage can be a MapReduce job / sampling stage / merge stage / limit stage / etc.

•Stage Plans

•Stage Plans: Operators

•"EXPLAIN EXTENDED" prints more detailed information (e.g. where temporary files are written).

http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/ql/exec/Operator.html

Optimized Join

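•Order the tables in the join expression according to their row counts, e.g. when stocks > dividends.

•The rightmost table in the join is streamed (at the reducer); all other tables are buffered. The largest table should therefore come last.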

•The table to stream can be named explicitly with the hint "STREAMTABLE(tbl_name)".

•Benchmark: a has 1,000,000,000 records, b has 100,000,000 records

$ SELECT a.hoge, b.fuga FROM a JOIN b ON (a.id = b.id)
121.384 s

$ SELECT a.hoge, b.fuga FROM b JOIN a ON (b.id = a.id)
122.339 s

$ SELECT /*+ STREAMTABLE(a) */ a.hoge, b.fuga FROM b JOIN a ON (b.id = a.id)
120.298 s

Map Side Join

•Recap (covered previously)
Local Mode

•When the data is small, Local Mode can be faster because it reduces overhead.

$ set mapred.job.tracker=local;
$ set mapred.tmp.dir=/tmp/masashi/sada;
$ SELECT * FROM hoge WHERE id = 'fuga';
..........
Job running in-process (local Hadoop)
..........

•ex. table with about 30,000 records: normal mode 27 s, local mode 10 s

•ex. table with about 100,000,000 records: normal mode 40 s, local mode 532 s

•To have Hive switch to Local Mode automatically, set "hive.exec.mode.local.auto=true".

•Local Mode is used when all of the following conditions hold:

• The total input size of the job is lower than "hive.exec.mode.local.auto.inputbytes.max" (128MB by default)

• The total number of map-tasks is less than "hive.exec.mode.local.auto.tasks.max" (4 by default)

• The total number of reduce tasks required is 1 or 0.
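The three conditions for automatic Local Mode can be written down as a small predicate. This is a minimal Python sketch for illustration only, not Hive's actual implementation: the function name `qualifies_for_local_mode` and its parameter names are hypothetical, and the defaults are simply the values quoted on this slide (128MB and 4).

```python
# Hypothetical model of the automatic Local Mode check described on the slide.
# The defaults are the values quoted there, not read from a real Hive config.
DEFAULT_INPUTBYTES_MAX = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
DEFAULT_TASKS_MAX = 4                       # hive.exec.mode.local.auto.tasks.max


def qualifies_for_local_mode(total_input_bytes, map_tasks, reduce_tasks,
                             inputbytes_max=DEFAULT_INPUTBYTES_MAX,
                             tasks_max=DEFAULT_TASKS_MAX):
    """Return True only when all three slide conditions hold."""
    return (total_input_bytes < inputbytes_max   # input size below the limit
            and map_tasks < tasks_max            # fewer map tasks than the limit
            and reduce_tasks in (0, 1))          # at most one reduce task


if __name__ == "__main__":
    # A small job (10MB, 1 mapper, 1 reducer) versus a large one.
    print(qualifies_for_local_mode(10 * 1024 * 1024, 1, 1))
    print(qualifies_for_local_mode(10 * 1024 ** 3, 80, 1))
```

Both size and task-count comparisons are strict here because the slide says "lower than" and "less than"; a job sitting exactly on a limit is treated as not qualifying.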

Strict Mode

•Tuning?

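•When enabled with "hive.mapred.mode=strict", query checking becomes stricter (e.g. ORDER BY without LIMIT, unfiltered scans of partitioned tables, and Cartesian products are rejected).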

Tuning M/R Number

•hive.exec.reducers.bytes.per.reducer = <number>

•hive.exec.reducers.max = <number>

•mapred.reduce.tasks = <number>
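How these three settings interact can be illustrated with a simplified model. The Python sketch below assumes the commonly documented behavior: an explicit positive mapred.reduce.tasks is used as-is, and otherwise the reducer count is the input size divided by hive.exec.reducers.bytes.per.reducer, rounded up and capped at hive.exec.reducers.max. The function name `estimate_reducers` is hypothetical, and real Hive versions apply additional rules, so treat it as an approximation.

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer, reducers_max,
                      reduce_tasks=-1):
    """Simplified, assumed model of how the reducer count is chosen."""
    # reduce_tasks models mapred.reduce.tasks; a positive value overrides the estimate.
    if reduce_tasks > 0:
        return reduce_tasks
    # Integer ceiling division: one reducer per bytes_per_reducer of input.
    estimated = -(-total_input_bytes // bytes_per_reducer)
    # Never fewer than 1 reducer, never more than the configured maximum.
    return max(1, min(reducers_max, estimated))


if __name__ == "__main__":
    # 10 GB of input with 1 GB per reducer and a cap of 999 reducers.
    print(estimate_reducers(10 * 10 ** 9, 10 ** 9, 999))
```

Lowering bytes_per_reducer raises the estimate (more parallelism), while reducers_max keeps a very large input from requesting an unreasonable number of reducers.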

JVM Reuse

•The number of Map/Reduce tasks that run in a single JVM can be configured with "mapred.job.reuse.jvm.num.tasks" (in "mapred-site.xml").

•-1 means no limit.

Dynamic Partition Tuning

•Limits on the use of Dynamic Partitions can be configured.

Single MR Multi Group By

•Reference: https://issues.apache.org/jira/browse/HIVE-2056

•For a query like the following, "hive.multigroupby.singlemr=true" is reportedly faster.

FROM T
INSERT OVERWRITE TABLE test1 SELECT col1, count(distinct colx) GROUP BY col1
INSERT OVERWRITE TABLE test2 SELECT col1, col2, count(distinct colx) GROUP BY col1, col2;

Virtual Columns

•Tuning?

•The following information can be retrieved with HiveQL and used in query conditions:

•INPUT__FILE__NAME

•BLOCK__OFFSET__INSIDE__FILE

•ROW__OFFSET__INSIDE__BLOCK (requires "hive.exec.rowoffset=true")

•Example

https://cwiki.apache.org/Hive/languagemanual-virtualcolumns.html

Conclusion

•For real-world performance tuning, I think improving the data structures pays off more than anything covered above.

•I have very high hopes for whoever is presenting Chapter 11 and Chapter 15!!!

Thank you for your attention.