Is It Wrong to Talk About SQL at a Ruby Conference? (Ruby会議でSQLの話をするのは間違っているだろうか)
Minero Aoki
Oedo RubyKaigi 04 (大江戸Ruby会議04), 2014-04-19
About today's talk
Akira Matsuda asked me for "a technically deep topic."
What counts as a "deep" talk?
Talking about the Ruby implementation is not particularly deep anymore, so I will talk about something else.
Big Data Analytics in 25 Minutes: In Memory of MapReduce
Suppose you have to analyze about 100 TB of text data in total.
A single CPU cannot handle 100 TB.
(Diagram: one computer holding the program and the data)
So let's distribute the processing: you need more computers.
Node 0   Node 1   Node 2   Node 3
Program  Program  Program  Program
Data     Data     Data     Data
But distributed processing is a pain; it is too difficult...
This is where a parallel RDB may help.
Parallel RDB
Node 0     Node 1     Node 2     Node 3
Front End  Front End  Front End  Front End
Back End   Back End   Back End   Back End
Parallel RDBs are great because...
1. They get linearly faster as you add nodes (linear scalability)
2. You can use standard SQL
3. To a client, the cluster looks like a single machine
Parallel RDBs are seriously great.
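One way to picture the "looks like a single machine" property is scatter/gather: a front end hands the same query to every back end, each back end aggregates only its own partition, and the front end merges the partial results. The Python sketch below is only an illustration under that assumption, not a description of any particular product. The names `partition_rows`, `backend_count`, and `frontend_count` are hypothetical, and threads stand in for nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, num_nodes):
    """Spread rows across nodes round-robin (a stand-in for real data distribution)."""
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


def backend_count(partition, predicate):
    """Work done by one back end: count matching rows in its local partition only."""
    return sum(1 for row in partition if predicate(row))


def frontend_count(partitions, predicate):
    """Work done by the front end: send the query to every back end, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = pool.map(lambda p: backend_count(p, predicate), partitions)
        return sum(partial_counts)


# Hypothetical sample data: 12 click rows spread over 4 "nodes".
rows = [{"user_id": i, "page_type": "home" if i % 3 == 0 else "search"} for i in range(12)]
partitions = partition_rows(rows, num_nodes=4)
total = frontend_count(partitions, lambda row: row["page_type"] == "home")
print(total)
```

The caller only ever talks to `frontend_count`; how many partitions exist behind it is invisible, which is the point of the third feature in the list.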
Various commercial parallel RDBs
Database                       Vendor     Since
Teradata                       Teradata   1983
Teradata Aster                 Teradata   2005
PureData System for Analytics  IBM        2000
Exadata                        Oracle     2008
Greenplum                      Pivotal    2003
SQL Server PDW                 Microsoft  around 2010
Redshift                       Amazon     2012
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS: distributed file system
MapReduce: compute framework
(Hive: SQL interface)
Characteristics of Hadoop
1. Data is stored as plain files, not tables
2. Processing uses (or rather, used to use) MapReduce
3. With Hive on top, you can also write queries in a SQL-like language
A few years ago everyone and their dog was doing MapReduce: "Big Data" meant MapReduce.
MapReduce
Input:         (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
After Map:     (k1, v1) (k1, v2) (k1, v3) (k2, v4) (k3, v5) (k3, v6)
After Reduce:  (k1, v1) (k2, v2) (k3, v3)
A framework in which you write only the Map and Reduce functions, and Hadoop takes care of distributing the work.
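The division of labor described here can be sketched in a few lines of single-process Python. This is a toy model, not Hadoop's API: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names standing in for what the framework does, and only `word_count_map` and `word_count_reduce` correspond to what a user would write.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the user's map function to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key; the framework does this between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function once per key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# The only two functions a user would write for a word count:
def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return sum(counts)


lines = ["ruby sql ruby", "sql mapreduce", "ruby"]
counts = reduce_phase(shuffle(map_phase(lines, word_count_map)), word_count_reduce)
print(counts)
```

In a real cluster the map calls run on the nodes that hold each piece of the input, and the grouping step moves data between nodes; the sketch keeps only the data flow.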
Q1. SQL or MapReduce: which is better?
The business answer: SQL
Why SQL?
1. Many people can write SQL but cannot write Java
2. Existing applications that rely on SQL will not run on MapReduce
3. Writing MapReduce functions is costly; it takes more time
How big is the cost difference?

WordCount in MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        // Emits (word, 1) for every token in an input line.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Sums the counts collected for each word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

WordCount in SQL:

    select count(*)
    from (
        select regexp_split_to_table(str, '\s+')
        from text_table
    ) t
And in practice, SQL won: SQL now beats MapReduce.
Q2. Hadoop or a parallel RDB: which is better?
Parallel RDBs win on speed; Hadoop wins on flexibility of data structure.
A typical configuration today
HDFS
MapReduce
Hive
The configuration going forward
HDFS
Impala backend
Impala frontend
Hadoop is coming to resemble a parallel RDB
Parallel RDB     Hadoop
DB filesystem    HDFS
backend          Impala backend
parser, planner  Impala frontend
Hybrid databases are coming in the near future.
Q3. Is MapReduce dead?
No. Not yet... it is not over yet!
MapReduce lets you insert Java or C code into parallel processing, so it is more extensible.
Example: Aster's SQL-MapReduce
You can call MapReduce from SQL:

    select count(distinct user_id)
    from npath(
        on clicks
        partition by user_id
        order by timestamp
        mode(overlapping)
        pattern('H.S.P')
        symbols(
            page_type = 'home' AS H,
            page_type = 'search' AS S,
            page_type = 'product' AS P)
        result(first(user_id of H) as user_id)
    )

npath was recently added to Hive as well (MatchPath).
You can combine MapReduce with SQL.
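To make the npath query less opaque, here is a rough Python sketch of what it appears to compute: the number of distinct users whose clicks, ordered by time, contain a home page, then a search page, then a product page in a row. This illustrates the result only, not how Aster or Hive executes it. The function name `users_matching_path` and the sample rows are hypothetical, and reading the pattern as "three consecutive page types" is an assumption.

```python
from collections import defaultdict


def users_matching_path(clicks, pattern):
    """Return the user_ids whose time-ordered page types contain `pattern` as a consecutive run."""
    by_user = defaultdict(list)
    for click in clicks:  # corresponds to: partition by user_id
        by_user[click["user_id"]].append(click)

    matched = set()
    for user_id, user_clicks in by_user.items():
        ordered = sorted(user_clicks, key=lambda c: c["timestamp"])  # order by timestamp
        pages = [c["page_type"] for c in ordered]
        # Try every start position in the sequence; one match is enough
        # because the query only counts distinct users.
        for start in range(len(pages) - len(pattern) + 1):
            if pages[start:start + len(pattern)] == pattern:
                matched.add(user_id)
                break
    return matched


# Hypothetical click log; user 3's rows arrive out of order on purpose.
clicks = [
    {"user_id": 1, "timestamp": 1, "page_type": "home"},
    {"user_id": 1, "timestamp": 2, "page_type": "search"},
    {"user_id": 1, "timestamp": 3, "page_type": "product"},
    {"user_id": 2, "timestamp": 1, "page_type": "home"},
    {"user_id": 2, "timestamp": 2, "page_type": "product"},
    {"user_id": 3, "timestamp": 3, "page_type": "product"},
    {"user_id": 3, "timestamp": 1, "page_type": "home"},
    {"user_id": 3, "timestamp": 2, "page_type": "search"},
]
matched_users = users_matching_path(clicks, ["home", "search", "product"])
print(len(matched_users))
```

Expressing this per-user sequence scan in plain SQL is awkward, which is why a procedural function callable from SQL is attractive.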
Easy & handy SQL + extensible MapReduce
Finally, a poem
Good things are good: great products exist everywhere,
but knowledge is unevenly distributed.
OSS World Enterprise World
Hadoop
Ruby
Python Excel
Parallel RDB
Windows
markdown VB
Git
Java
Cross the border
end
今日のお話について
Theme of this session
「技術的に濃い 話題がいいです」
Akira Matsuda said ldquoI expect you deep technical talkrdquo
「濃い話」is 何
Whatrsquos deep talk
Rubyの実装の話とか もう別に濃くない
Ruby implementation is not deep already so I speak about another theme
25分でわかる ビッグデータ分析 ~MapReduce追悼~
Big Data Analytics in 25 minutes
トータル100TBくらいのデータを分析するとしよう
Suppose you must analyze 100TB text data
1CPUとか もうマヂムリhellip
コンピュータ
プログラム
データ
Single CPU cannot handle 100TB
そうだ分散処理しようノード0 ノード1 ノード2 ノード3
プログラム プログラム プログラム プログラム
データ データ データ データ
You need more computers (distributed processing)
でも分散処理って めんどいhellipマヂムリhellip
Distributed processing is too difficulthellip
そこで並列RDBですよ
Parallel RDB may help you
Parallel RDBNode 0 Node 1 Node 2 Node 3
Front End
Front End
Front End
Front End
Back End
Back End
Back End
Back End
並列RDBの特長
1 ノードを増やせば線形に速くなる
2 標準SQLが使える
3 クライアントからは1台に見える
has linear scalability
You can use SQL
Looks as the one computer
Parallel RDB is great becausehellip
並列RDB超スゴイ age age マック
Parallel RDB is great
いろいろな商用並列RDBDatabase Vendor Since
Teradata Teradata 1983
Teradata Aster Teradata 2005
PureData System for Analytics IBM 2000
Exadata Oracle 2008
Greenplum Pivotal 2003
SQL Server PDW Microsoft 2010くらい
Redshift Amazon 2012
Various parallel RDBs
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS Distributed File System
MapReduce Compute Framework
(Hive SQL interface)
Hadoopの特徴
1 データはテーブルではなくファイル
2 処理にはMapReduceを使う(使っていた)
3 Hiveを乗せるとSQLっぽい言語でも書ける
Hadoop data is plain file
Processed by MapReduce
Hive allows you to write SQL-like query
猫も194780子もMapReduce
ldquoBig Datardquo meant MapReduce few years ago
MapReducek1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
Map
k1 v1
k1 v2
k1 v3
k2 v4
k3 v5
k3 v6
k1 v1
k2 v2
k3 v3
Reduce
Map関数とReduce関数を 書いたらよしなに
分散してくれるフレームワーク
You just write MapampReduce functions Hadoop serves the rest
Q1 SQLとMapReduce どっちがいいの
Which is good SQL and MapReduce
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t
MapReduceによるWordCount SQLによるWordCount
実際SQLが勝った
Now SQL beats MapReduce
Q2 Hadoopと並列RDBは
どっちがいいんですか
Which is good Hadoop or parallel RDB
速度は並列RDB データ構造はHadoop
Parallel RDB is faster Hadoop is more flexible
現在ありがちな構成
HDFS
MapReduce
Hive
今後の構成
HDFS
impala backend
impala frontend
Hadoopは並列RDBに 似てきている
DB filesystem
backend
parser planner
HDFS
impala be
impala fe
Hadoop resembles to parallel RDB now
Hybrid DB comes in near future
Q3 MapReduceは
お亡くなりですか
MapReduce is dead
まだだっhelliphellip まだ終わらんよ
No
MapReduceは 並列処理にJavaやCを
はさみこめる
MapReduce has better extendability
例Asterの SQL-MapReduce
SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )
最近Hiveにもnpath入りました (MatchPath)
You can combine MapReduce with SQL
Easy amp Handy SQL +
Extendable MapReduce
最後にポエム
よいものはよい
Great product is anywhere
だが知識は偏在している
but knowledge is maldistributed
OSS World Enterprise World
Hadoop
Ruby
Python Excel
並列RDB
Windows
markdown VB
Git
Java
Cross the border
end
「技術的に濃い 話題がいいです」
Akira Matsuda said ldquoI expect you deep technical talkrdquo
「濃い話」is 何
Whatrsquos deep talk
Rubyの実装の話とか もう別に濃くない
Ruby implementation is not deep already so I speak about another theme
25分でわかる ビッグデータ分析 ~MapReduce追悼~
Big Data Analytics in 25 minutes
トータル100TBくらいのデータを分析するとしよう
Suppose you must analyze 100TB text data
1CPUとか もうマヂムリhellip
コンピュータ
プログラム
データ
Single CPU cannot handle 100TB
そうだ分散処理しようノード0 ノード1 ノード2 ノード3
プログラム プログラム プログラム プログラム
データ データ データ データ
You need more computers (distributed processing)
でも分散処理って めんどいhellipマヂムリhellip
Distributed processing is too difficulthellip
そこで並列RDBですよ
Parallel RDB may help you
Parallel RDBNode 0 Node 1 Node 2 Node 3
Front End
Front End
Front End
Front End
Back End
Back End
Back End
Back End
並列RDBの特長
1 ノードを増やせば線形に速くなる
2 標準SQLが使える
3 クライアントからは1台に見える
has linear scalability
You can use SQL
Looks as the one computer
Parallel RDB is great becausehellip
並列RDB超スゴイ age age マック
Parallel RDB is great
いろいろな商用並列RDBDatabase Vendor Since
Teradata Teradata 1983
Teradata Aster Teradata 2005
PureData System for Analytics IBM 2000
Exadata Oracle 2008
Greenplum Pivotal 2003
SQL Server PDW Microsoft 2010くらい
Redshift Amazon 2012
Various parallel RDBs
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS Distributed File System
MapReduce Compute Framework
(Hive SQL interface)
Hadoopの特徴
1 データはテーブルではなくファイル
2 処理にはMapReduceを使う(使っていた)
3 Hiveを乗せるとSQLっぽい言語でも書ける
Hadoop data is plain file
Processed by MapReduce
Hive allows you to write SQL-like query
猫も194780子もMapReduce
ldquoBig Datardquo meant MapReduce few years ago
MapReducek1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
Map
k1 v1
k1 v2
k1 v3
k2 v4
k3 v5
k3 v6
k1 v1
k2 v2
k3 v3
Reduce
Map関数とReduce関数を 書いたらよしなに
分散してくれるフレームワーク
You just write MapampReduce functions Hadoop serves the rest
Q1 SQLとMapReduce どっちがいいの
Which is good SQL and MapReduce
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t
MapReduceによるWordCount SQLによるWordCount
実際SQLが勝った
Now SQL beats MapReduce
Q2 Hadoopと並列RDBは
どっちがいいんですか
Which is good Hadoop or parallel RDB
速度は並列RDB データ構造はHadoop
Parallel RDB is faster Hadoop is more flexible
現在ありがちな構成
HDFS
MapReduce
Hive
今後の構成
HDFS
impala backend
impala frontend
Hadoopは並列RDBに 似てきている
DB filesystem
backend
parser planner
HDFS
impala be
impala fe
Hadoop resembles to parallel RDB now
Hybrid DB comes in near future
Q3 MapReduceは
お亡くなりですか
MapReduce is dead
まだだっhelliphellip まだ終わらんよ
No
MapReduceは 並列処理にJavaやCを
はさみこめる
MapReduce has better extendability
例Asterの SQL-MapReduce
SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )
最近Hiveにもnpath入りました (MatchPath)
You can combine MapReduce with SQL
Easy amp Handy SQL +
Extendable MapReduce
最後にポエム
よいものはよい
Great product is anywhere
だが知識は偏在している
but knowledge is maldistributed
OSS World Enterprise World
Hadoop
Ruby
Python Excel
並列RDB
Windows
markdown VB
Git
Java
Cross the border
end
「濃い話」is 何
Whatrsquos deep talk
Rubyの実装の話とか もう別に濃くない
Ruby implementation is not deep already so I speak about another theme
25分でわかる ビッグデータ分析 ~MapReduce追悼~
Big Data Analytics in 25 minutes
トータル100TBくらいのデータを分析するとしよう
Suppose you must analyze 100TB text data
1CPUとか もうマヂムリhellip
コンピュータ
プログラム
データ
Single CPU cannot handle 100TB
そうだ分散処理しようノード0 ノード1 ノード2 ノード3
プログラム プログラム プログラム プログラム
データ データ データ データ
You need more computers (distributed processing)
でも分散処理って めんどいhellipマヂムリhellip
Distributed processing is too difficulthellip
そこで並列RDBですよ
Parallel RDB may help you
Parallel RDBNode 0 Node 1 Node 2 Node 3
Front End
Front End
Front End
Front End
Back End
Back End
Back End
Back End
並列RDBの特長
1 ノードを増やせば線形に速くなる
2 標準SQLが使える
3 クライアントからは1台に見える
has linear scalability
You can use SQL
Looks as the one computer
Parallel RDB is great becausehellip
並列RDB超スゴイ age age マック
Parallel RDB is great
いろいろな商用並列RDBDatabase Vendor Since
Teradata Teradata 1983
Teradata Aster Teradata 2005
PureData System for Analytics IBM 2000
Exadata Oracle 2008
Greenplum Pivotal 2003
SQL Server PDW Microsoft 2010くらい
Redshift Amazon 2012
Various parallel RDBs
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS Distributed File System
MapReduce Compute Framework
(Hive SQL interface)
Hadoopの特徴
1 データはテーブルではなくファイル
2 処理にはMapReduceを使う(使っていた)
3 Hiveを乗せるとSQLっぽい言語でも書ける
Hadoop data is plain file
Processed by MapReduce
Hive allows you to write SQL-like query
猫も194780子もMapReduce
ldquoBig Datardquo meant MapReduce few years ago
MapReducek1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
Map
k1 v1
k1 v2
k1 v3
k2 v4
k3 v5
k3 v6
k1 v1
k2 v2
k3 v3
Reduce
Map関数とReduce関数を 書いたらよしなに
分散してくれるフレームワーク
You just write MapampReduce functions Hadoop serves the rest
Q1 SQLとMapReduce どっちがいいの
Which is good SQL and MapReduce
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t
MapReduceによるWordCount SQLによるWordCount
実際SQLが勝った
Now SQL beats MapReduce
Q2 Hadoopと並列RDBは
どっちがいいんですか
Which is good Hadoop or parallel RDB
速度は並列RDB データ構造はHadoop
Parallel RDB is faster Hadoop is more flexible
現在ありがちな構成
HDFS
MapReduce
Hive
今後の構成
HDFS
impala backend
impala frontend
Hadoopは並列RDBに 似てきている
DB filesystem
backend
parser planner
HDFS
impala be
impala fe
Hadoop resembles to parallel RDB now
Hybrid DB comes in near future
Q3 MapReduceは
お亡くなりですか
MapReduce is dead
まだだっhelliphellip まだ終わらんよ
No
MapReduceは 並列処理にJavaやCを
はさみこめる
MapReduce has better extendability
例Asterの SQL-MapReduce
SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )
最近Hiveにもnpath入りました (MatchPath)
You can combine MapReduce with SQL
Easy amp Handy SQL +
Extendable MapReduce
最後にポエム
よいものはよい
Great product is anywhere
だが知識は偏在している
but knowledge is maldistributed
OSS World Enterprise World
Hadoop
Ruby
Python Excel
並列RDB
Windows
markdown VB
Git
Java
Cross the border
end
Rubyの実装の話とか もう別に濃くない
Ruby implementation is not deep already so I speak about another theme
25分でわかる ビッグデータ分析 ~MapReduce追悼~
Big Data Analytics in 25 minutes
トータル100TBくらいのデータを分析するとしよう
Suppose you must analyze 100TB text data
1CPUとか もうマヂムリhellip
コンピュータ
プログラム
データ
Single CPU cannot handle 100TB
そうだ分散処理しようノード0 ノード1 ノード2 ノード3
プログラム プログラム プログラム プログラム
データ データ データ データ
You need more computers (distributed processing)
でも分散処理って めんどいhellipマヂムリhellip
Distributed processing is too difficulthellip
そこで並列RDBですよ
Parallel RDB may help you
Parallel RDBNode 0 Node 1 Node 2 Node 3
Front End
Front End
Front End
Front End
Back End
Back End
Back End
Back End
並列RDBの特長
1 ノードを増やせば線形に速くなる
2 標準SQLが使える
3 クライアントからは1台に見える
has linear scalability
You can use SQL
Looks as the one computer
Parallel RDB is great becausehellip
並列RDB超スゴイ age age マック
Parallel RDB is great
いろいろな商用並列RDBDatabase Vendor Since
Teradata Teradata 1983
Teradata Aster Teradata 2005
PureData System for Analytics IBM 2000
Exadata Oracle 2008
Greenplum Pivotal 2003
SQL Server PDW Microsoft 2010くらい
Redshift Amazon 2012
Various parallel RDBs
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS Distributed File System
MapReduce Compute Framework
(Hive SQL interface)
Hadoopの特徴
1 データはテーブルではなくファイル
2 処理にはMapReduceを使う(使っていた)
3 Hiveを乗せるとSQLっぽい言語でも書ける
Hadoop data is plain file
Processed by MapReduce
Hive allows you to write SQL-like query
猫も194780子もMapReduce
ldquoBig Datardquo meant MapReduce few years ago
MapReducek1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
Map
k1 v1
k1 v2
k1 v3
k2 v4
k3 v5
k3 v6
k1 v1
k2 v2
k3 v3
Reduce
Map関数とReduce関数を 書いたらよしなに
分散してくれるフレームワーク
You just write MapampReduce functions Hadoop serves the rest
Q1 SQLとMapReduce どっちがいいの
Which is good SQL and MapReduce
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t
MapReduceによるWordCount SQLによるWordCount
実際SQLが勝った
Now SQL beats MapReduce
Q2 Hadoopと並列RDBは
どっちがいいんですか
Which is good Hadoop or parallel RDB
速度は並列RDB データ構造はHadoop
Parallel RDB is faster Hadoop is more flexible
現在ありがちな構成
HDFS
MapReduce
Hive
今後の構成
HDFS
impala backend
impala frontend
Hadoopは並列RDBに 似てきている
DB filesystem
backend
parser planner
HDFS
impala be
impala fe
Hadoop resembles to parallel RDB now
Hybrid DB comes in near future
Q3 MapReduceは
お亡くなりですか
MapReduce is dead
まだだっhelliphellip まだ終わらんよ
No
MapReduceは 並列処理にJavaやCを
はさみこめる
MapReduce has better extendability
例Asterの SQL-MapReduce
SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )
最近Hiveにもnpath入りました (MatchPath)
You can combine MapReduce with SQL
Easy amp Handy SQL +
Extendable MapReduce
最後にポエム
よいものはよい
Great product is anywhere
だが知識は偏在している
but knowledge is maldistributed
OSS World Enterprise World
Hadoop
Ruby
Python Excel
並列RDB
Windows
markdown VB
Git
Java
Cross the border
end
25分でわかる ビッグデータ分析 ~MapReduce追悼~
Big Data Analytics in 25 minutes
トータル100TBくらいのデータを分析するとしよう
Suppose you must analyze 100TB text data
1CPUとか もうマヂムリhellip
コンピュータ
プログラム
データ
Single CPU cannot handle 100TB
そうだ分散処理しようノード0 ノード1 ノード2 ノード3
プログラム プログラム プログラム プログラム
データ データ データ データ
You need more computers (distributed processing)
でも分散処理って めんどいhellipマヂムリhellip
Distributed processing is too difficulthellip
そこで並列RDBですよ
Parallel RDB may help you
Parallel RDBNode 0 Node 1 Node 2 Node 3
Front End
Front End
Front End
Front End
Back End
Back End
Back End
Back End
並列RDBの特長
1 ノードを増やせば線形に速くなる
2 標準SQLが使える
3 クライアントからは1台に見える
has linear scalability
You can use SQL
Looks as the one computer
Parallel RDB is great becausehellip
並列RDB超スゴイ age age マック
Parallel RDB is great
いろいろな商用並列RDBDatabase Vendor Since
Teradata Teradata 1983
Teradata Aster Teradata 2005
PureData System for Analytics IBM 2000
Exadata Oracle 2008
Greenplum Pivotal 2003
SQL Server PDW Microsoft 2010くらい
Redshift Amazon 2012
Various parallel RDBs
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS Distributed File System
MapReduce Compute Framework
(Hive SQL interface)
Hadoopの特徴
1 データはテーブルではなくファイル
2 処理にはMapReduceを使う(使っていた)
3 Hiveを乗せるとSQLっぽい言語でも書ける
Hadoop data is plain file
Processed by MapReduce
Hive allows you to write SQL-like query
猫も194780子もMapReduce
ldquoBig Datardquo meant MapReduce few years ago
MapReducek1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
Map
k1 v1
k1 v2
k1 v3
k2 v4
k3 v5
k3 v6
k1 v1
k2 v2
k3 v3
Reduce
Map関数とReduce関数を 書いたらよしなに
分散してくれるフレームワーク
You just write MapampReduce functions Hadoop serves the rest
Q1 SQLとMapReduce どっちがいいの
Which is good SQL and MapReduce
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t
MapReduceによるWordCount SQLによるWordCount
実際SQLが勝った
Now SQL beats MapReduce
Q2 Hadoopと並列RDBは
どっちがいいんですか
Which is good Hadoop or parallel RDB
速度は並列RDB データ構造はHadoop
Parallel RDB is faster Hadoop is more flexible
現在ありがちな構成
HDFS
MapReduce
Hive
今後の構成
HDFS
impala backend
impala frontend
Hadoopは並列RDBに 似てきている
DB filesystem
backend
parser planner
HDFS
impala be
impala fe
Hadoop resembles to parallel RDB now
Hybrid DB comes in near future
Q3 MapReduceは
お亡くなりですか
MapReduce is dead
まだだっhelliphellip まだ終わらんよ
No
MapReduceは 並列処理にJavaやCを
はさみこめる
MapReduce has better extendability
例Asterの SQL-MapReduce
SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )
最近Hiveにもnpath入りました (MatchPath)
You can combine MapReduce with SQL
Easy amp Handy SQL +
Extendable MapReduce
最後にポエム
よいものはよい
Great product is anywhere
だが知識は偏在している
but knowledge is maldistributed
OSS World Enterprise World
Hadoop
Ruby
Python Excel
並列RDB
Windows
markdown VB
Git
Java
Cross the border
end
トータル100TBくらいのデータを分析するとしよう
Suppose you must analyze 100TB text data
1CPUとか もうマヂムリhellip
コンピュータ
プログラム
データ
Single CPU cannot handle 100TB
そうだ分散処理しようノード0 ノード1 ノード2 ノード3
プログラム プログラム プログラム プログラム
データ データ データ データ
You need more computers (distributed processing)
でも分散処理って めんどいhellipマヂムリhellip
Distributed processing is too difficulthellip
そこで並列RDBですよ
Parallel RDB may help you
Parallel RDBNode 0 Node 1 Node 2 Node 3
Front End
Front End
Front End
Front End
Back End
Back End
Back End
Back End
並列RDBの特長
1 ノードを増やせば線形に速くなる
2 標準SQLが使える
3 クライアントからは1台に見える
has linear scalability
You can use SQL
Looks as the one computer
Parallel RDB is great becausehellip
並列RDB超スゴイ age age マック
Parallel RDB is great
いろいろな商用並列RDBDatabase Vendor Since
Teradata Teradata 1983
Teradata Aster Teradata 2005
PureData System for Analytics IBM 2000
Exadata Oracle 2008
Greenplum Pivotal 2003
SQL Server PDW Microsoft 2010くらい
Redshift Amazon 2012
Various parallel RDBs
Here Comes a New Challenger
since 2005
Hadoop Architecture
HDFS Distributed File System
MapReduce Compute Framework
(Hive SQL interface)
Hadoopの特徴
1 データはテーブルではなくファイル
2 処理にはMapReduceを使う(使っていた)
3 Hiveを乗せるとSQLっぽい言語でも書ける
Hadoop data is plain file
Processed by MapReduce
Hive allows you to write SQL-like query
猫も194780子もMapReduce
ldquoBig Datardquo meant MapReduce few years ago
MapReducek1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
Map
k1 v1
k1 v2
k1 v3
k2 v4
k3 v5
k3 v6
k1 v1
k2 v2
k3 v3
Reduce
Map関数とReduce関数を 書いたらよしなに
分散してくれるフレームワーク
You just write MapampReduce functions Hadoop serves the rest
Q1 SQLとMapReduce どっちがいいの
Which is good SQL and MapReduce
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
How big is the cost difference?

WordCount with MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Reducer: sums the counts collected for each word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        // Job setup: input path in args[0], output path in args[1].
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

WordCount with SQL:

    select count(*) from (select regexp_split_to_table(str, '\s+') from text_table) t
And in practice, SQL won: SQL now beats MapReduce
Q2: Which is better, Hadoop or a parallel RDB?
A parallel RDB wins on speed; Hadoop wins on flexibility of data structure
Typical configuration today
HDFS / MapReduce / Hive
Configuration going forward
HDFS / Impala backend / Impala frontend
Hadoop is coming to resemble a parallel RDB
Parallel RDB: DB filesystem / backend / parser and planner
Hadoop: HDFS / Impala backend / Impala frontend
Hybrid databases will arrive in the near future
Q3: Is MapReduce dead?
No. Not yet... it's not over yet!
MapReduce lets you insert Java or C code into parallel processing
MapReduce is more extensible
Example: Aster's SQL-MapReduce
You can call MapReduce from SQL:

    select count(distinct user_id)
    from npath(
        on clicks
        partition by user_id
        order by timestamp
        mode(overlapping)
        pattern('H.S.P')
        symbols(
            page_type = 'home' AS H,
            page_type = 'search' AS S,
            page_type = 'product' AS P)
        result(first(user_id of H) as user_id)
    )

npath was recently added to Hive as well (MatchPath)
You can combine MapReduce with SQL
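To make the query above concrete, the following plain-Java sketch computes the same kind of result in a single process: the number of distinct users whose clicks, ordered by timestamp, contain home, search, product in a row. It assumes the pattern means three consecutive rows. The FunnelCount class, the Click type, and the sample data are hypothetical and are not part of Aster or Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of what the npath query computes (hypothetical names).
class FunnelCount {
    // Hypothetical click record holding the three columns the query uses.
    static final class Click {
        final String userId;
        final long timestamp;
        final String pageType;

        Click(String userId, long timestamp, String pageType) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.pageType = pageType;
        }
    }

    // Counts distinct users whose time-ordered clicks contain home, search, product in a row.
    static int countUsersWithFunnel(List<Click> clicks) {
        // "partition by user_id": group the clicks per user.
        Map<String, List<Click>> byUser = new TreeMap<>();
        for (Click c : clicks) {
            byUser.computeIfAbsent(c.userId, k -> new ArrayList<>()).add(c);
        }
        int matchedUsers = 0;
        for (List<Click> userClicks : byUser.values()) {
            // "order by timestamp": sort each partition.
            userClicks.sort(Comparator.comparingLong(c -> c.timestamp));
            if (containsHomeSearchProduct(userClicks)) {
                matchedUsers++; // each user is counted once, like count(distinct user_id)
            }
        }
        return matchedUsers;
    }

    // The pattern: three consecutive rows typed home, search, product.
    static boolean containsHomeSearchProduct(List<Click> ordered) {
        for (int i = 0; i + 2 < ordered.size(); i++) {
            if (ordered.get(i).pageType.equals("home")
                    && ordered.get(i + 1).pageType.equals("search")
                    && ordered.get(i + 2).pageType.equals("product")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Click> clicks = Arrays.asList(
            new Click("u1", 1, "home"), new Click("u1", 2, "search"), new Click("u1", 3, "product"),
            new Click("u2", 1, "home"), new Click("u2", 2, "product"),
            new Click("u3", 5, "product"), new Click("u3", 3, "home"), new Click("u3", 4, "search"));
        // prints 2 (u1 and u3 match, u2 skips the search page)
        System.out.println(countUsersWithFunnel(clicks));
    }
}
```

Each user's partition is processed independently of the others, which is what lets a parallel engine spread this kind of function across nodes while the caller still writes ordinary SQL around it.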
Easy & handy SQL + extensible MapReduce
Finally, a poem
What is good is good: great products can be found anywhere
but knowledge is unevenly distributed
[Diagram: OSS World and Enterprise World, with Hadoop, Ruby, Python, Excel, parallel RDB, Windows, markdown, VB, Git, and Java placed across the two]
Cross the border
end
ビジネス的な答え
SQL
なぜSQLか
1 SQL書けてもJava書けない人は多い
2 既存のSQLを使ったアプリが動かない
3 MapReduce関数を書くのは高コスト
Many people can write SQL but not Java
Many applications rely on SQL
Writing MR function needs more time
How big is the cost difference?

WordCount in MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

WordCount in SQL:

    select count(*)
    from (
      select regexp_split_to_table(str, '\s+')
      from text_table
    ) t
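To make the comparison concrete without a cluster, here is a minimal in-memory sketch of the same map, group and reduce steps in plain Java. It is not Hadoop code: the class name WordCountSketch and its methods are made up for illustration, and the "shuffle" is just a sorted map. It also shows the difference between per-word counts (what the MapReduce job emits) and a single total (what the count(*) query, as transcribed, would return).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the MapReduce word count.
final class WordCountSketch {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" + "Reduce" steps: group pairs by key and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Per-word counts, analogous to the output of the MapReduce job.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    // Total number of words, analogous to a single count(*) over all tokens.
    static int totalWords(List<String> lines) {
        int total = 0;
        for (int count : wordCount(lines).values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do");
        System.out.println(wordCount(lines));  // {be=2, do=1, not=1, or=1, to=3}
        System.out.println(totalWords(lines)); // 8
    }
}
```

Even this toy version is several times longer than the SQL, which is the point of the slide.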
In practice, SQL has beaten MapReduce
Q2: Which is better, Hadoop or a parallel RDB?
Parallel RDBs are faster; Hadoop is more flexible about data structure
A typical configuration today
HDFS
MapReduce
Hive
The configuration going forward
HDFS
Impala backend
Impala frontend
Hadoop now resembles a parallel RDB
Parallel RDB: DB filesystem, backend, parser/planner
Hadoop: HDFS, Impala backend, Impala frontend
Hybrid databases are coming in the near future
Q3: Is MapReduce dead?
No. Not yet... it is not over yet!
MapReduce lets you insert Java or C code into parallel processing
MapReduce is more extensible
Example: Aster's SQL-MapReduce
You can call MapReduce from SQL:

    select count(distinct user_id)
    from npath(
      on clicks
      partition by user_id
      order by timestamp
      mode(overlapping)
      pattern('H.S.P')
      symbols(
        page_type = 'home' AS H,
        page_type = 'search' AS S,
        page_type = 'product' AS P)
      result(first(user_id of H) as user_id)
    )

Hive recently gained npath as well (MatchPath)
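As a rough illustration of what the npath query above asks for, the sketch below does the same steps in memory with plain Java: partition clicks by user, order each partition by timestamp, and count the users whose clicks contain home, search, product in a row. This is not how Aster or Hive implement it. The class PathMatchSketch, the Click type and the method names are hypothetical, and it assumes the pattern means three consecutive clicks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, single-process illustration of the npath-style query.
final class PathMatchSketch {

    // One row of the assumed clicks table.
    static final class Click {
        final String userId;
        final long timestamp;
        final String pageType;

        Click(String userId, long timestamp, String pageType) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.pageType = pageType;
        }
    }

    // Counts distinct users whose time-ordered clicks contain the pattern.
    static long countUsersWithPath(List<Click> clicks, List<String> pattern) {
        // partition by user_id
        Map<String, List<Click>> byUser = new HashMap<>();
        for (Click click : clicks) {
            byUser.computeIfAbsent(click.userId, k -> new ArrayList<>()).add(click);
        }
        long matchedUsers = 0;
        for (List<Click> userClicks : byUser.values()) {
            // order by timestamp
            userClicks.sort(Comparator.comparingLong((Click c) -> c.timestamp));
            if (containsRun(userClicks, pattern)) {
                matchedUsers++; // each user is counted once, like count(distinct)
            }
        }
        return matchedUsers;
    }

    // True if the page types in pattern appear as consecutive clicks.
    static boolean containsRun(List<Click> clicks, List<String> pattern) {
        for (int start = 0; start + pattern.size() <= clicks.size(); start++) {
            boolean matches = true;
            for (int i = 0; i < pattern.size(); i++) {
                if (!clicks.get(start + i).pageType.equals(pattern.get(i))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Click> clicks = List.of(
            new Click("u1", 1L, "home"),
            new Click("u1", 2L, "search"),
            new Click("u1", 3L, "product"),
            new Click("u2", 1L, "home"),
            new Click("u2", 2L, "product"));
        System.out.println(
            countUsersWithPath(clicks, List.of("home", "search", "product"))); // 1
    }
}
```

Writing this by hand for every new pattern is the kind of work the SQL-callable function hides behind one query.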
You can combine MapReduce with SQL
Easy and handy SQL + extensible MapReduce
Finally, a poem
Good things are good
Great products are everywhere
But knowledge is unevenly distributed
OSS World Enterprise World
Hadoop
Ruby
Python Excel
Parallel RDB
Windows
markdown VB
Git
Java
Cross the border
end