Hadoop ecosystem

Head First: The Hadoop Ecosystem

陳威宇

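Today's elephant steak set menu: the cuts on offer…

Figure source: http://aryannava.com/2014/02/19/apache-hadoop-ecosystem/hadoopecosystem/

Wait, why do we need all of these things?

• Six cases are coming up; let's think through together how to solve each one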

Hadoop EcoSystem

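Problem

• Scenario:
– Hundreds of services run on many different machines. Each one produces a huge volume of logs that need to be analyzed. I know the logs can end up in Hadoop, but…
• Question:
– How do I ship this never-ending stream of data into Hadoop?
• Possible solutions:
– Hand-build a program to do it
– Use Flume

Figure source: http://image.slidesharecdn.com/flume-120314204418-phpapp01/95/apache-flume-4-728.jpg?cb=1338404245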

Apache Flume: log collector

• A real-time log collection system
• Collects logs scattered across different nodes and machines into HDFS
• No programming needed: you only define a config file

Built-in component types:
• Source: netcat, exec, syslog, spooldir, seq, http, avro
• Sink: logger, hdfs, file_roll, hbase, solr, avro
• Channel: memory, jdbc, file

Figure source: https://flume.apache.org/FlumeUserGuide.html
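To make "you only define a config file" concrete, here is a minimal sketch that renders a one-source, one-channel, one-sink agent definition as properties text. The property keys follow the layout in the Flume User Guide; the agent name a1, the component names r1/c1/k1, the port, and the HDFS path are illustrative assumptions, and the Python helper is only a convenient way to show the file, not part of Flume.

```python
def flume_agent_config(agent, source_port, hdfs_path):
    """Render a Flume agent definition as properties-file text.

    The component names r1, c1, and k1 are arbitrary labels chosen for this sketch.
    """
    lines = [
        # Declare the components that make up this agent.
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        # Source: listen for newline-terminated text on a TCP port.
        f"{agent}.sources.r1.type = netcat",
        f"{agent}.sources.r1.bind = 0.0.0.0",
        f"{agent}.sources.r1.port = {source_port}",
        # Channel: buffer events in memory between source and sink.
        f"{agent}.channels.c1.type = memory",
        # Sink: write the collected events into HDFS.
        f"{agent}.sinks.k1.type = hdfs",
        f"{agent}.sinks.k1.hdfs.path = {hdfs_path}",
        # Wiring: connect the source and the sink through the channel.
        f"{agent}.sources.r1.channels = c1",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
config_text = flume_agent_config("a1", 44444, "hdfs://namenode/flume/logs")
```

Swapping netcat for exec or spooldir, or memory for file, changes only the type lines; the source -> channel -> sink wiring stays the same.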

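Problem

• Scenario:
– The boss wants the average working hours of every employee in the organization. I obtained the punch-clock log files for all of Taiwan, plus an employee ID mapping table from HR. The data is both plentiful and large, so I thought of feeding it into Hadoop's HDFS, and then…
• Question:
– To write MapReduce, I started learning Java, object-oriented programming, the Hadoop API, … @@
• Possible solutions:
– Cobble the program together in shell and give up on Hadoop
– Use Pig
– Knuckle down and study around the clock…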

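With Pig, is MapReduce simple now!?

• Apache Pig is a high-level data-flow language for processing large-scale data
• Well suited to working with large semi-structured datasets
• Compared with writing large-scale data processing programs in languages such as Java or C++, it is claimed to be about 16 times less difficult and to need about 20 times less code for the same result.
• Pig components
– Pig Shell (Grunt)
– Pig Language (Pig Latin)
– Libraries (Piggy Bank)
– UDF: user-defined functions

Figure source: http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing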

Programming even a pig can do

Function: Commands
Load: LOAD
Store: STORE
Data processing: REGEX_EXTRACT, FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, …
Aggregation: AVG, COUNT, MAX, MIN, SIZE, …
Math: ABS, RANDOM, ROUND, …
String handling: INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debug: DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
HDFS: cat, ls, cp, mkdir, …

$ pig -x local
grunt> A = LOAD 'file1' AS (x, y, z);
grunt> B = FILTER A BY y > 10000;
grunt> STORE B INTO 'output';

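Before the makeover: MapReduce code

Input table A (nm, dp, id):
劉, North, A1
李, Central, B1
王, Central, B2

Input table B (id, dt, hr):
A1, 7/7, 13
A1, 7/8, 12
A1, 7/9, 4

Java code (MapReduce) turns these into:
A1, 劉, North, 7/7, 13
A1, 劉, North, 7/8, 12
A1, 劉, North, Jul, 12.5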

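After the Pig makeover

Result: North, A1, 劉, 12.5

Logical plan (operator -> schema):
LOAD -> (nm, dp, id)
LOAD -> (id, dt, hr)
FILTER -> (id, dt, hr)
JOIN -> (nm, dp, id, id, dt, hr)
GROUP -> (group, {(nm, dp, id, id, dt, hr)})
FOREACH -> (group, …, AVG(hr))
STORE -> (dp, group, nm, hr)

Pig Latin:
A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';

Input table A (nm, dp, id):
劉, North, A1
李, Central, B1
王, Central, B2

Input table B (id, dt, hr):
A1, 7/7, 13
A1, 7/8, 12
A1, 7/9, 4

Tips: case matters (relation aliases, field names, and function names such as AVG and PigStorage are case-sensitive, even though keywords like LOAD are not); validate first with a small amount of data in pig -x local mode; after each line, use DUMP or ILLUSTRATE to check that the result is correct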
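The FILTER -> JOIN -> GROUP -> FOREACH flow can be traced by hand on the slide's sample rows. The sketch below is plain Python that imitates each step; it is not Pig and does not run on Hadoop. The function name average_hours and the min_hr parameter are made up for this illustration, and the department values are written in English.

```python
from collections import defaultdict

# Sample rows from the two input tables on the slide.
employees = [("劉", "North", "A1"), ("李", "Central", "B1"), ("王", "Central", "B2")]  # (nm, dp, id)
punches = [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)]  # (id, dt, hr)


def average_hours(employees, punches, min_hr=8):
    # FILTER B BY hr > 8: drop punch records at or below the threshold.
    kept = [p for p in punches if p[2] > min_hr]

    # JOIN C BY id, A BY id: look up each punch record's employee by id.
    by_id = {emp_id: (nm, dp) for nm, dp, emp_id in employees}
    joined = [by_id[i] + (i, dt, hr) for i, dt, hr in kept if i in by_id]

    # GROUP D BY id: collect the joined rows per employee id.
    groups = defaultdict(list)
    for nm, dp, emp_id, dt, hr in joined:
        groups[emp_id].append((nm, dp, hr))

    # FOREACH E GENERATE dp, group, nm, AVG(hr)
    result = []
    for emp_id, rows in sorted(groups.items()):
        nm, dp, _ = rows[0]
        avg = sum(hr for _, _, hr in rows) / len(rows)
        result.append((dp, emp_id, nm, avg))
    return result


rows = average_hours(employees, punches)  # [('North', 'A1', '劉', 12.5)]
```

Only the 13-hour and 12-hour records survive the filter, so the single output row carries their average, 12.5, matching the result shown on the slide.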

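Problem

• Scenario:
– The organization keeps attendance records in tables of a uniform format, spread across the databases of departments in every city and county in Taiwan. The boss wants me to gather the data from all over Taiwan and compute every employee's average working hours. The tables in the DBs have all been converted to CSV files and fed into Hadoop's HDFS…
• Question:
– I know Pig lowers the barrier to MapReduce, but I am still more comfortable working in SQL. If only there were a huge, free database…
• Possible solutions:
– Budget for a high-performance server and install a large-capacity SQL server on it
– Use Hive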

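Hadoop has something like an RDB too: Hive

• Hive = an RDB-like layer for Hadoop
– Maps structured data files to database tables
– Provides SQL queries (translates SQL into MapReduce programs)
• Good fit:
– Users with a SQL background, for work that basic SQL can express
• Features:
– Scalable, supports user-defined functions, fault tolerant
• Limitations:
– Longer execution time
– Fixed data structure
– Data cannot be modified in place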

What the Hive architecture provides

• Interfaces
– CLI
– Web UI
– API: JDBC and ODBC
• Thrift Server (hiveserver)
– Lets remote clients run HiveQL through an API
• Metastore
– DB, table, partition…

Figure source: http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive

Now, programming even a bee can do

$ hive
hive> create table A (x int, y int, z int);
hive> load data local inpath 'file1' into table A;
hive> select * from A where y > 10000;
hive> insert overwrite table B select * from A where y > 10000;

Figure source: http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/

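After the Hive makeover

Result: North, A1, 劉, 12.5

HiveQL:
> create table A (nm string, dp string, id string) row format delimited fields terminated by ',';
> create table B (id string, dt date, hr int) row format delimited fields terminated by ',';
> create table final (id string, dp array<string>, nm array<string>, avg double);
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert overwrite table final select a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) from a, b where b.hr > 8 and b.id = a.id group by a.id;

Input table A (nm, dp, id):
劉, North, A1
李, Central, B1
王, Central, B2

Input table B (id, dt, hr):
A1, 7/7, 13
A1, 7/8, 12
A1, 7/9, 4

Tip: for create table and load data, importing the data with a tool is recommended because it is less error-prone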
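The same query shape can be tried locally before touching a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made purely for illustration: SQLite has no collect_set, so the query groups by dp and nm instead, and the column name avg_hr is chosen for this example.

```python
import sqlite3

# An in-memory SQLite database stands in for Hive here (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (nm TEXT, dp TEXT, id TEXT);
    CREATE TABLE b (id TEXT, dt TEXT, hr INTEGER);
    CREATE TABLE final (id TEXT, dp TEXT, nm TEXT, avg_hr REAL);
""")

# Sample rows from the slide's two input tables.
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [("劉", "North", "A1"), ("李", "Central", "B1"), ("王", "Central", "B2")],
)
conn.executemany(
    "INSERT INTO b VALUES (?, ?, ?)",
    [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)],
)

# Join on id, keep records above 8 hours, average per employee.
conn.execute("""
    INSERT INTO final
    SELECT a.id, a.dp, a.nm, AVG(b.hr)
    FROM a JOIN b ON b.id = a.id
    WHERE b.hr > 8
    GROUP BY a.id, a.dp, a.nm
""")

final_rows = conn.execute("SELECT id, dp, nm, avg_hr FROM final").fetchall()
# final_rows == [('A1', 'North', '劉', 12.5)]
```

In Hive the statement would be compiled into MapReduce jobs over files in HDFS; here it runs in memory, so only the SQL logic carries over.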

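Hive compared with SQL databases

Item | Hive | RDBMS
Query language | HQL | SQL
Storage | HDFS | Raw device or local FS
Execution | MapReduce | Executor
Latency | Very high | Low
Data scale | Large | Small
Data modification | No | Yes
Indexing | Index, bitmap index… | Complex, mature indexing mechanisms

Source: http://sishuok.com/forum/blogPost/list/6220.html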

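Pig vs Hive

Feature | Hive | Pig
Language | SQL-like | Pig Latin
Schemas/Types | Yes (explicit) | Yes (implicit)
Partitions | Yes | No
Thrift Server | Yes | No
Web Interface | Yes | No
JDBC/ODBC | Yes (limited) | No
HDFS operations | No | Yes

Hive is better suited to data warehouse tasks: static structures and work that needs frequent analysis.

Pig gives developers more flexibility with big data and allows concise scripts.

Source: http://f.dataguru.cn/thread-33553-1-1.html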

Having both the pig and the bee: HCatalog

• Provides:
– A read/write interface to the "metastore" for MapReduce, Pig, and Hive
– A command-line interface

Figure source: http://wiki.gurubee.net/pages/viewpage.action?pageId=26739793

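Problem

• Scenario:
– Continuing the previous case: management says computing the statistics once a month takes too long and wants them once a day, so the numbers stay current
• Question:
– Every day I have to convert all those tables to CSV one by one, import them into HDFS, load them into Hive, and then run the computation… it is dark out by the time I finish
• Possible solutions:
– Hire a part-time student ………..
– Use Sqoop ………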

Sqoop: the bridge between RDBs and Hadoop

• Apache Sqoop = SQL to Hadoop
• Reads data from
– RDBMS
– Data warehouses
– NoSQL
• Writes data to
– Hive
– HBase
• Integrates with Oozie
– Can be scheduled

Figure source: http://bigdataanalyticsnews.com/data-transfer-mysql-cassandra-using-sqoop/

How to use Sqoop

Figure source: http://hive.3du.me/slide.html
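As a rough idea of what a Sqoop import into Hive looks like, the sketch below assembles the argument list for a sqoop import call. The flags are standard Sqoop 1 import options; the JDBC URL, user name, and table names are hypothetical placeholders, and the helper function is an illustration rather than part of Sqoop.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, hive_table, mappers=1):
    """Build the argv list for importing one RDB table into a Hive table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,           # JDBC URL of the source database
        "--username", username,
        "-P",                            # prompt for the password at run time
        "--table", table,                # source table in the RDB
        "--hive-import",                 # load the imported data into Hive
        "--hive-table", hive_table,      # destination Hive table
        "--num-mappers", str(mappers),   # number of parallel map tasks
    ]


# Hypothetical connection details, for illustration only.
cmd = sqoop_import_command("jdbc:mysql://dbhost/hr", "report", "attendance", "b", mappers=4)
command_line = shlex.join(cmd)  # a single shell-ready string
```

Passing the list to subprocess.run on a machine with Sqoop installed would start the import; building it as a list keeps each argument separate, so values with spaces need no manual quoting.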

Minimally invasive surgery with Hive + Sqoop

Result: North, A1, 劉, 12.5

HiveQL with manual loading:
> create …………
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert overwrite table final select a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) from a, b where b.hr > 8 and b.id = a.id group by a.id;

Input table A (nm, dp, id):
劉, North, A1
李, Central, B1
王, Central, B2

Input table B (id, dt, hr):
A1, 7/7, 13
A1, 7/8, 12
A1, 7/9, 4

HiveQL once Sqoop imports the tables (no load data step):
> create …………
> insert overwrite table final select a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) from a, b where b.hr > 8 and b.id = a.id group by a.id;

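Problem

• Scenario:
– Ever since I found out how handy Hive is, everything my old RDB could not hold or could not store has gone into Hive DBs and tables. With Sqoop the data flows fairly smoothly, but…
• Question:
– Even when there is no complex computation and I just want to pull out a single row, I always wait a very long time for Hive
• Possible solutions:
– Wait patiently while singing a 韋禮安 song
– Use Impala
– Use HBase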

A few things about Impala

• Goal: address the delays of batch processing and slow data access
• A near-real-time SQL query tool
• Roughly 6 to 60 times faster than Hive

Figure source: http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/

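The NoSQL database HBase

• HBase is a NoSQL database modeled after Google's BigTable
• Characteristics:
– Table-like data structure (multi-dimensional map)
– Distributed
– High availability, high performance
– Capacity and performance are easy to scale out
• Why HBase:
– Random read/write access to data inside Hadoop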

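HBase is "not, not, not" a relational database

• HBase is not a relational database management system (RDBMS)
– A table has only one primary index: the row key.
– No SQL syntax (such as join)
• Provides a Java library, plus REST and Thrift interfaces
• Data is accessed with getRow() and Scan()
– getRow() fetches the data of a single row, and can also specify a version (timestamp)
– Scan() fetches a whole table or a row range (set by start key and end key)
• insert, update, and delete are all writes
– HBase's insert is put()
– Repeating put() on the same cell => update
– Delete() = putting a deletion marker on that cell
• Row key design is the most important part of HBase design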

What HBase data looks like

• "Row key", "column family", "column qualifier", "timestamp", "cell"

Figure source: http://www.slideshare.net/hanborq/h-base-introduction
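The row key / column / timestamp / cell layering, and the idea that insert, update, and delete are all writes, can be mimicked with nested dictionaries. The sketch below is a toy in-memory model, not the HBase client API: the class ToyTable, its method names, and the TOMBSTONE marker are assumptions made for this illustration, and real HBase delete markers and version limits are more involved.

```python
import time

TOMBSTONE = object()  # stands in for a delete marker on a cell


class ToyTable:
    """Toy model: row key -> "family:qualifier" -> timestamp -> value."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, ts=None):
        # Insert and update are the same operation: write a new version.
        ts = time.time_ns() if ts is None else ts
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def delete(self, row_key, column, ts=None):
        # Deleting only writes a marker; older versions stay stored.
        self.put(row_key, column, TOMBSTONE, ts)

    def get(self, row_key, column):
        # Return the newest version, or None if missing or deleted.
        versions = self.rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        latest = versions[max(versions)]
        return None if latest is TOMBSTONE else latest

    def scan(self, start_key, end_key):
        # Row keys are visited in sorted order; the end key is exclusive.
        return [key for key in sorted(self.rows) if start_key <= key < end_key]


table = ToyTable()
table.put("A1", "work:hr", 13, ts=1)
table.put("A1", "work:hr", 12, ts=2)   # same cell again => update
table.put("B1", "work:hr", 9, ts=1)
```

Because the only index is the sorted row key, scan can serve just key ranges, which is why row key design gets so much attention.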

Problem

• Question:
– My work needs many statistical analysis methods, machine learning, data mining, and so on; neither Hive nor Pig is a good fit…
• Solutions:
– Machine learning => Mahout
– Statistical analysis => RHadoop

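Mahout = elephant driver

• Mahout = scalable machine learning algorithms
• Implements a number of data mining algorithms on MapReduce
• Algorithm categories (each with implementations of several classic algorithms):
– Recommendation engines (in Mahout, specifically collaborative-filtering recommenders)
– Dimension reduction
– Vector similarity
– Classification algorithms
– Clustering algorithms
– Pattern mining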

[Figure: Mahout algorithm families: Regression, Recommenders, Clustering, Classification, Freq. Pattern Mining, Vector Similarity, Non-MR Algorithms, Examples, Dimension Reduction, Evolution]

Figure source: http://www.slideshare.net/chaoyu0513/hit20130928-apache-mahout
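Vector similarity is the building block behind the collaborative-filtering recommenders mentioned above. As a minimal single-machine sketch, here is cosine similarity in plain Python; it is not Mahout's API, Mahout's contribution is running this kind of computation at scale, and the rating vectors are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(target, candidates):
    """Return the name of the candidate vector closest to the target."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))


# Hypothetical item ratings by three users (one vector per user).
ratings = {"user_b": [5, 4, 0, 1], "user_c": [0, 1, 5, 4]}
nearest = most_similar([4, 5, 1, 0], ratings)  # 'user_b'
```

A user-based recommender would then suggest items that the nearest user rated highly and the target user has not seen yet.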

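Good news for R users handling big data: RHadoop

• R is a very well-known language in the field of statistics
• Mainly used for statistical analysis, plotting, data mining, and matrix computation
• The Comprehensive R Archive Network (CRAN)
– A free library archive, like Perl's
• Revolution RHadoop
– rmr2, rhdfs, rhbase …

Figure source: http://www.r-project.org/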

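Problem

• Scenario:
– Having learned all of Hadoop's martial arts, I have built many applications with different ecosystem pieces. Now the boss wants src txt -> { flume => MR => hive or pig => sqoop } -> dst DB chained end to end and run in the small hours every day: if it lives, show the result; if it dies, show the error message…
• Possible approaches:
– Stitch the whole chain together with a shell script ………..
– Use Oozie ………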

The Hadoop workflow manager: Oozie

• Combines multiple jobs to accomplish a larger task
• Includes
– Control flow (start, end, kill, fork, join)
– Actions (mapreduce/java/pig/hive)
• No code to write; the workflow is defined in XML

Figure source: http://www.slideshare.net/martyhall/hadoop-tutorial-oozie
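The control-flow idea, where every action names the node to go to on success and the node to go to on error, can be modeled in a few lines. The sketch below is a toy Python runner, not Oozie and not its XML format: the node names, the (action, ok, error) tuple layout, and run_workflow are all assumptions made for illustration, and fork/join are left out.

```python
def run_workflow(nodes, start):
    """Follow ok/error transitions until the "end" or "kill" node is reached.

    nodes maps a node name to (action callable, ok target, error target).
    Returns the list of visited nodes, including any error message.
    """
    current = start
    trace = []
    while current not in ("end", "kill"):
        action, ok_to, error_to = nodes[current]
        trace.append(current)
        try:
            action()
            current = ok_to
        except Exception as exc:
            # "If it dies, show the error message": record why we failed.
            trace.append(f"error: {exc}")
            current = error_to
    trace.append(current)
    return trace


def succeed():
    pass  # placeholder for a job that finishes normally


def fail():
    raise RuntimeError("pig script failed")  # placeholder for a failing job


healthy = {
    "flume": (succeed, "pig", "kill"),
    "pig": (succeed, "sqoop", "kill"),
    "sqoop": (succeed, "end", "kill"),
}
broken = dict(healthy, pig=(fail, "sqoop", "kill"))

ok_trace = run_workflow(healthy, "flume")
bad_trace = run_workflow(broken, "flume")
```

With the healthy definition the run passes through every step and stops at end; with the broken one it stops at kill right after the failing step, with the error message recorded.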

http://oozie_server:11000/


Recap

• ETL
– Apache Flume
– Apache Sqoop
• DB
– Apache HBase
– Apache Hive
– Apache Impala
• Calculate
– Apache Pig
– Apache Mahout
– RHadoop
• Workflow
– Apache Oozie

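Advice

• In the big-data field Hadoop is currently the most widely used framework; with the tools built on top of it, you can use it more cleverly
• When the data is not large enough, it is hard to realize the benefit of Hadoop's big-data analysis
• Big-data talent: knowing computer science and statistics is not enough; you also need to tell stories
– A complete team that can take on data science ideally has four roles: programmers who know computer science, data analysts who know statistics, graphic designers who know visual presentation and are good at packaging and communicating, and project drivers with industry knowledge. (《遠見雜誌》, April 2014, issue 334)
• Be careful not to fall into the traps: eight reasons big-data projects fail
– (Yahoo)

http://www.gvm.com.tw/Boardcontent_25148.html

http://www.inside.com.tw/2015/06/10/big-botched-data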

Backup

Pig example result

A = LOAD '/user/waue/pig_input/file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD '/user/waue/pig_input/file2.txt' USING PigStorage(',') AS (id, dt, hr);
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';

Hive example result

INSERT OVERWRITE TABLE final
select a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) from a, b where b.hr > 8 and b.id = a.id group by a.id;
