Is It Wrong to Talk About SQL at a Ruby Conference? (Ruby会議でSQLの話をするのは間違っているだろうか)

Minero Aoki

Oedo Ruby Conference 04 (大江戸Ruby会議04), 2014-04-19

Theme of this session

Akira Matsuda said: "I'd like a technically deep topic."

What is a "deep" talk?

Talking about Ruby's implementation is no longer particularly deep, so I will speak about another theme.

Big Data Analytics in 25 Minutes: In Memory of MapReduce

Suppose you must analyze about 100 TB of text data in total.

A single CPU cannot possibly handle 100 TB…

(Diagram: one computer holding one program and one set of data.)

So let's use distributed processing: you need more computers.

(Diagram: nodes 0 to 3, each holding a program and data.)

But distributed processing is a pain; it is too difficult…

This is where a parallel RDB may help you.

(Diagram: a parallel RDB running on nodes 0 to 3, each node with a front end and a back end.)

A parallel RDB is great because…

1. It scales linearly: adding nodes makes it proportionally faster.
2. You can use standard SQL.
3. To clients it looks like a single machine.

Parallel RDBs are super awesome!
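The idea behind points 1 and 3 can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not how any real parallel RDB is implemented: the class and method names (`ScatterGatherSketch`, `partialSum`, `distributedSum`) are hypothetical, each inner list stands in for the rows stored on one node, and a Java parallel stream stands in for the nodes working at the same time.

```java
import java.util.List;

class ScatterGatherSketch {
    // Hypothetical "back end": sums only the rows stored on one node.
    static long partialSum(List<Integer> shard) {
        long sum = 0;
        for (int value : shard) {
            sum += value;
        }
        return sum;
    }

    // Hypothetical "front end": hands the same work to every shard,
    // then merges the partial results into one answer for the client.
    static long distributedSum(List<List<Integer>> shards) {
        return shards.parallelStream()
                     .mapToLong(ScatterGatherSketch::partialSum)
                     .sum();
    }

    public static void main(String[] args) {
        // Four shards, standing in for nodes 0 to 3.
        List<List<Integer>> shards = List.of(
            List.of(1, 2, 3), List.of(4, 5), List.of(6), List.of(7, 8, 9, 10));
        System.out.println(distributedSum(shards)); // prints 55
    }
}
```

The caller sees a single method call and a single result, which is the "looks like one machine" property; adding shards adds work that can run in parallel.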

Various commercial parallel RDBs

Database                       Vendor     Since
Teradata                       Teradata   1983
Teradata Aster                 Teradata   2005
PureData System for Analytics  IBM        2000
Exadata                        Oracle     2008
Greenplum                      Pivotal    2003
SQL Server PDW                 Microsoft  around 2010
Redshift                       Amazon     2012

Here comes a new challenger: Hadoop (since 2005)

Hadoop Architecture

HDFS: distributed file system
MapReduce: compute framework
(Hive: SQL interface)

Characteristics of Hadoop

1. Data is stored as plain files, not tables.
2. Processing uses (or rather, used to use) MapReduce.
3. With Hive on top, you can also write queries in an SQL-like language.

A few years ago everyone and their dog was doing MapReduce: "Big Data" meant MapReduce.

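(Diagram: MapReduce. Six input pairs, (k1, v1) to (k6, v6), pass through Map, which emits pairs under three keys: k1 with v1, v2, v3; k2 with v4; k3 with v5, v6. Reduce then produces one pair per key: (k1, v1), (k2, v2), (k3, v3).)

MapReduce is a framework in which you just write the Map and Reduce functions, and Hadoop takes care of distributing the work.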
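To make the division of labour concrete, here is a minimal single-process sketch in Java. It is not Hadoop code: `MiniMapReduce`, `mapLine`, `reduceCounts` and `run` are hypothetical names, and `run` plays the role of the framework by grouping the mapped pairs by key before calling the reducer. Everything runs in memory on one machine, so it shows the programming model only, not the distribution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // Map function (user-written): one input line becomes (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce function (user-written): all values for one key become one value.
    static int reduceCounts(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Stand-in for the framework: run map, group by key, run reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : mapLine(line)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduceCounts(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, ruby=2, sql=2}
        System.out.println(run(List.of("ruby sql ruby", "sql hadoop")));
    }
}
```

Only `mapLine` and `reduceCounts` are specific to word counting; the grouping step in `run` is the part a real framework would spread across many nodes.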

Q1. Which is better, SQL or MapReduce?

The business answer: SQL

Why SQL?

1. Many people can write SQL but not Java.
2. Many existing applications rely on SQL and will not run without it.
3. Writing MapReduce functions is costly: it takes more time.

So how big is the cost difference?

WordCount in MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        // Emits (word, 1) for every token in each input line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Sums the counts collected for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

WordCount in SQL:

    select count(*)
    from (
        select regexp_split_to_table(str, '\s+')
        from text_table
    ) t;

And in practice SQL won: SQL now beats MapReduce.

Q2. Which is better, Hadoop or a parallel RDB?

A parallel RDB wins on speed; Hadoop wins on flexibility of data structure.

A typical setup today

HDFS
MapReduce
Hive

The setup from now on

HDFS
Impala backend
Impala frontend

Hadoop is coming to resemble a parallel RDB

Parallel RDB      Hadoop
DB filesystem     HDFS
backend           Impala backend
parser, planner   Impala frontend

Hybrid DBs will arrive in the near future.

Q3. Is MapReduce dead?

No. Not yet… it is not over yet!

MapReduce lets you slot Java or C code into parallel processing, so it is more extensible.

Example: Aster's SQL-MapReduce

You can call MapReduce from SQL:

    select count(distinct user_id)
    from npath(
        on clicks
        partition by user_id
        order by timestamp
        mode(overlapping)
        pattern('H.S.P')
        symbols(
            page_type = 'home' AS H,
            page_type = 'search' AS S,
            page_type = 'product' AS P
        )
        result(first(user_id of H) as user_id)
    );

npath was recently added to Hive as well (as MatchPath).

You can combine MapReduce with SQL:

Easy and handy SQL + extensible MapReduce
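What a path query of this kind computes can be sketched in plain Java. This is an illustration only, not Aster or Hive code: `PathMatchSketch`, `Click`, `sampleClicks` and `countUsersWithPath` are hypothetical names, and the sketch assumes the pattern means three consecutive clicks (home, then search, then product) within one user's time-ordered click history.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathMatchSketch {
    // Hypothetical click row: who clicked, when, and on what kind of page.
    static final class Click {
        final String userId;
        final long timestamp;
        final String pageType;

        Click(String userId, long timestamp, String pageType) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.pageType = pageType;
        }
    }

    // Counts distinct users whose clicks contain the pattern as consecutive steps.
    static int countUsersWithPath(List<Click> clicks, List<String> pattern) {
        // Equivalent of "partition by user_id".
        Map<String, List<Click>> byUser = new HashMap<>();
        for (Click click : clicks) {
            byUser.computeIfAbsent(click.userId, k -> new ArrayList<>()).add(click);
        }
        int matchingUsers = 0;
        for (List<Click> userClicks : byUser.values()) {
            // Equivalent of "order by timestamp".
            userClicks.sort(Comparator.comparingLong((Click c) -> c.timestamp));
            List<String> pages = new ArrayList<>();
            for (Click click : userClicks) {
                pages.add(click.pageType);
            }
            // Each user counts at most once, like count(distinct user_id).
            if (Collections.indexOfSubList(pages, pattern) >= 0) {
                matchingUsers++;
            }
        }
        return matchingUsers;
    }

    // Made-up sample data: u1 and u3 follow the path, u2 does not.
    static List<Click> sampleClicks() {
        List<Click> clicks = new ArrayList<>();
        clicks.add(new Click("u1", 3, "product"));
        clicks.add(new Click("u1", 1, "home"));
        clicks.add(new Click("u1", 2, "search"));
        clicks.add(new Click("u2", 1, "home"));
        clicks.add(new Click("u2", 2, "product"));
        clicks.add(new Click("u3", 1, "home"));
        clicks.add(new Click("u3", 2, "search"));
        clicks.add(new Click("u3", 3, "home"));
        clicks.add(new Click("u3", 4, "search"));
        clicks.add(new Click("u3", 5, "product"));
        return clicks;
    }

    public static void main(String[] args) {
        // prints 2
        System.out.println(countUsersWithPath(sampleClicks(), List.of("home", "search", "product")));
    }
}
```

Because each user's history is processed independently, the per-user loop is the part that a MapReduce-style engine can spread across nodes, while the SQL around it stays declarative.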

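Finally, a poem

Good things are good: great products can be found anywhere.

But knowledge is unevenly distributed.

(Diagram: the OSS world and the enterprise world side by side, with Hadoop, Ruby, Python, Excel, parallel RDBs, Windows, Markdown, VB, Git and Java placed across the two.)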

Cross the border

end

Page 2: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

今日のお話について

Theme of this session

「技術的に濃い 話題がいいです」

Akira Matsuda said ldquoI expect you deep technical talkrdquo

「濃い話」is 何

Whatrsquos deep talk

Rubyの実装の話とか もう別に濃くない

Ruby implementation is not deep already so I speak about another theme

25分でわかる ビッグデータ分析 ~MapReduce追悼~

Big Data Analytics in 25 minutes

トータル100TBくらいのデータを分析するとしよう

Suppose you must analyze 100TB text data

1CPUとか もうマヂムリhellip

コンピュータ

プログラム

データ

Single CPU cannot handle 100TB

そうだ分散処理しようノード0 ノード1 ノード2 ノード3

プログラム プログラム プログラム プログラム

データ データ データ データ

You need more computers (distributed processing)

でも分散処理って めんどいhellipマヂムリhellip

Distributed processing is too difficulthellip

そこで並列RDBですよ

Parallel RDB may help you

Parallel RDBNode 0 Node 1 Node 2 Node 3

Front End

Front End

Front End

Front End

Back End

Back End

Back End

Back End

並列RDBの特長

1 ノードを増やせば線形に速くなる

2 標準SQLが使える

3 クライアントからは1台に見える

has linear scalability

You can use SQL

Looks as the one computer

Parallel RDB is great becausehellip

並列RDB超スゴイ age age マック

Parallel RDB is great

いろいろな商用並列RDBDatabase Vendor Since

Teradata Teradata 1983

Teradata Aster Teradata 2005

PureData System for Analytics IBM 2000

Exadata Oracle 2008

Greenplum Pivotal 2003

SQL Server PDW Microsoft 2010くらい

Redshift Amazon 2012

Various parallel RDBs

Here Comes a New Challenger

since 2005

Hadoop Architecture

HDFS Distributed File System

MapReduce Compute Framework

(Hive SQL interface)

Hadoopの特徴

1 データはテーブルではなくファイル

2 処理にはMapReduceを使う(使っていた)

3 Hiveを乗せるとSQLっぽい言語でも書ける

Hadoop data is plain file

Processed by MapReduce

Hive allows you to write SQL-like query

猫も194780子もMapReduce

ldquoBig Datardquo meant MapReduce few years ago

MapReducek1 v1

k2 v2

k3 v3

k4 v4

k5 v5

k6 v6

Map

k1 v1

k1 v2

k1 v3

k2 v4

k3 v5

k3 v6

k1 v1

k2 v2

k3 v3

Reduce

Map関数とReduce関数を 書いたらよしなに

分散してくれるフレームワーク

You just write MapampReduce functions Hadoop serves the rest

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)

select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t

MapReduceによるWordCount SQLによるWordCount

実際SQLが勝った

Now SQL beats MapReduce

Q2 Hadoopと並列RDBは

どっちがいいんですか

Which is good Hadoop or parallel RDB

速度は並列RDB データ構造はHadoop

Parallel RDB is faster Hadoop is more flexible

現在ありがちな構成

HDFS

MapReduce

Hive

今後の構成

HDFS

impala backend

impala frontend

Hadoopは並列RDBに 似てきている

DB filesystem

backend

parser planner

HDFS

impala be

impala fe

Hadoop resembles to parallel RDB now

Hybrid DB comes in near future

Q3 MapReduceは

お亡くなりですか

MapReduce is dead

まだだっhelliphellip まだ終わらんよ

No

MapReduceは 並列処理にJavaやCを

はさみこめる

MapReduce has better extendability

例Asterの SQL-MapReduce

SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )

最近Hiveにもnpath入りました (MatchPath)

You can combine MapReduce with SQL

Easy amp Handy SQL +

Extendable MapReduce

最後にポエム

よいものはよい

Great product is anywhere

だが知識は偏在している

but knowledge is maldistributed

OSS World Enterprise World

Hadoop

Ruby

Python Excel

並列RDB

Windows

markdown VB

Git

Java

Cross the border

end

Page 3: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

「技術的に濃い 話題がいいです」

Akira Matsuda said ldquoI expect you deep technical talkrdquo

「濃い話」is 何

Whatrsquos deep talk

Rubyの実装の話とか もう別に濃くない

Ruby implementation is not deep already so I speak about another theme

25分でわかる ビッグデータ分析 ~MapReduce追悼~

Big Data Analytics in 25 minutes

トータル100TBくらいのデータを分析するとしよう

Suppose you must analyze 100TB text data

1CPUとか もうマヂムリhellip

コンピュータ

プログラム

データ

Single CPU cannot handle 100TB

そうだ分散処理しようノード0 ノード1 ノード2 ノード3

プログラム プログラム プログラム プログラム

データ データ データ データ

You need more computers (distributed processing)

でも分散処理って めんどいhellipマヂムリhellip

Distributed processing is too difficulthellip

そこで並列RDBですよ

Parallel RDB may help you

Parallel RDBNode 0 Node 1 Node 2 Node 3

Front End

Front End

Front End

Front End

Back End

Back End

Back End

Back End

並列RDBの特長

1 ノードを増やせば線形に速くなる

2 標準SQLが使える

3 クライアントからは1台に見える

has linear scalability

You can use SQL

Looks as the one computer

Parallel RDB is great becausehellip

並列RDB超スゴイ age age マック

Parallel RDB is great

いろいろな商用並列RDBDatabase Vendor Since

Teradata Teradata 1983

Teradata Aster Teradata 2005

PureData System for Analytics IBM 2000

Exadata Oracle 2008

Greenplum Pivotal 2003

SQL Server PDW Microsoft 2010くらい

Redshift Amazon 2012

Various parallel RDBs

Here Comes a New Challenger

since 2005

Hadoop Architecture

HDFS Distributed File System

MapReduce Compute Framework

(Hive SQL interface)

Hadoopの特徴

1 データはテーブルではなくファイル

2 処理にはMapReduceを使う(使っていた)

3 Hiveを乗せるとSQLっぽい言語でも書ける

Hadoop data is plain file

Processed by MapReduce

Hive allows you to write SQL-like query

猫も194780子もMapReduce

ldquoBig Datardquo meant MapReduce few years ago

MapReducek1 v1

k2 v2

k3 v3

k4 v4

k5 v5

k6 v6

Map

k1 v1

k1 v2

k1 v3

k2 v4

k3 v5

k3 v6

k1 v1

k2 v2

k3 v3

Reduce

Map関数とReduce関数を 書いたらよしなに

分散してくれるフレームワーク

You just write MapampReduce functions Hadoop serves the rest

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)

select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t

MapReduceによるWordCount SQLによるWordCount

実際SQLが勝った

Now SQL beats MapReduce

Q2 Hadoopと並列RDBは

どっちがいいんですか

Which is good Hadoop or parallel RDB

速度は並列RDB データ構造はHadoop

Parallel RDB is faster Hadoop is more flexible

現在ありがちな構成

HDFS

MapReduce

Hive

今後の構成

HDFS

impala backend

impala frontend

Hadoopは並列RDBに 似てきている

DB filesystem

backend

parser planner

HDFS

impala be

impala fe

Hadoop resembles to parallel RDB now

Hybrid DB comes in near future

Q3 MapReduceは

お亡くなりですか

MapReduce is dead

まだだっhelliphellip まだ終わらんよ

No

MapReduceは 並列処理にJavaやCを

はさみこめる

MapReduce has better extendability

例Asterの SQL-MapReduce

SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )

最近Hiveにもnpath入りました (MatchPath)

You can combine MapReduce with SQL

Easy amp Handy SQL +

Extendable MapReduce

最後にポエム

よいものはよい

Great product is anywhere

だが知識は偏在している

but knowledge is maldistributed

OSS World Enterprise World

Hadoop

Ruby

Python Excel

並列RDB

Windows

markdown VB

Git

Java

Cross the border

end

Page 4: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

「濃い話」is 何

Whatrsquos deep talk

Rubyの実装の話とか もう別に濃くない

Ruby implementation is not deep already so I speak about another theme

25分でわかる ビッグデータ分析 ~MapReduce追悼~

Big Data Analytics in 25 minutes

トータル100TBくらいのデータを分析するとしよう

Suppose you must analyze 100TB text data

1CPUとか もうマヂムリhellip

コンピュータ

プログラム

データ

Single CPU cannot handle 100TB

そうだ分散処理しようノード0 ノード1 ノード2 ノード3

プログラム プログラム プログラム プログラム

データ データ データ データ

You need more computers (distributed processing)

でも分散処理って めんどいhellipマヂムリhellip

Distributed processing is too difficulthellip

そこで並列RDBですよ

Parallel RDB may help you

Parallel RDBNode 0 Node 1 Node 2 Node 3

Front End

Front End

Front End

Front End

Back End

Back End

Back End

Back End

並列RDBの特長

1 ノードを増やせば線形に速くなる

2 標準SQLが使える

3 クライアントからは1台に見える

has linear scalability

You can use SQL

Looks as the one computer

Parallel RDB is great becausehellip

並列RDB超スゴイ age age マック

Parallel RDB is great

いろいろな商用並列RDBDatabase Vendor Since

Teradata Teradata 1983

Teradata Aster Teradata 2005

PureData System for Analytics IBM 2000

Exadata Oracle 2008

Greenplum Pivotal 2003

SQL Server PDW Microsoft 2010くらい

Redshift Amazon 2012

Various parallel RDBs

Here Comes a New Challenger

since 2005

Hadoop Architecture

HDFS Distributed File System

MapReduce Compute Framework

(Hive SQL interface)

Hadoopの特徴

1 データはテーブルではなくファイル

2 処理にはMapReduceを使う(使っていた)

3 Hiveを乗せるとSQLっぽい言語でも書ける

Hadoop data is plain file

Processed by MapReduce

Hive allows you to write SQL-like query

猫も194780子もMapReduce

ldquoBig Datardquo meant MapReduce few years ago

MapReducek1 v1

k2 v2

k3 v3

k4 v4

k5 v5

k6 v6

Map

k1 v1

k1 v2

k1 v3

k2 v4

k3 v5

k3 v6

k1 v1

k2 v2

k3 v3

Reduce

Map関数とReduce関数を 書いたらよしなに

分散してくれるフレームワーク

You just write MapampReduce functions Hadoop serves the rest

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)

select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t

MapReduceによるWordCount SQLによるWordCount

実際SQLが勝った

Now SQL beats MapReduce

Q2 Hadoopと並列RDBは

どっちがいいんですか

Which is good Hadoop or parallel RDB

速度は並列RDB データ構造はHadoop

Parallel RDB is faster Hadoop is more flexible

現在ありがちな構成

HDFS

MapReduce

Hive

今後の構成

HDFS

impala backend

impala frontend

Hadoopは並列RDBに 似てきている

DB filesystem

backend

parser planner

HDFS

impala be

impala fe

Hadoop resembles to parallel RDB now

Hybrid DB comes in near future

Q3 MapReduceは

お亡くなりですか

MapReduce is dead

まだだっhelliphellip まだ終わらんよ

No

MapReduceは 並列処理にJavaやCを

はさみこめる

MapReduce has better extendability

例Asterの SQL-MapReduce

SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )

最近Hiveにもnpath入りました (MatchPath)

You can combine MapReduce with SQL

Easy amp Handy SQL +

Extendable MapReduce

最後にポエム

よいものはよい

Great product is anywhere

だが知識は偏在している

but knowledge is maldistributed

OSS World Enterprise World

Hadoop

Ruby

Python Excel

並列RDB

Windows

markdown VB

Git

Java

Cross the border

end

Page 5: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

Rubyの実装の話とか もう別に濃くない

Ruby implementation is not deep already so I speak about another theme

25分でわかる ビッグデータ分析 ~MapReduce追悼~

Big Data Analytics in 25 minutes

トータル100TBくらいのデータを分析するとしよう

Suppose you must analyze 100TB text data

1CPUとか もうマヂムリhellip

コンピュータ

プログラム

データ

Single CPU cannot handle 100TB

そうだ分散処理しようノード0 ノード1 ノード2 ノード3

プログラム プログラム プログラム プログラム

データ データ データ データ

You need more computers (distributed processing)

でも分散処理って めんどいhellipマヂムリhellip

Distributed processing is too difficulthellip

そこで並列RDBですよ

Parallel RDB may help you

Parallel RDBNode 0 Node 1 Node 2 Node 3

Front End

Front End

Front End

Front End

Back End

Back End

Back End

Back End

並列RDBの特長

1 ノードを増やせば線形に速くなる

2 標準SQLが使える

3 クライアントからは1台に見える

has linear scalability

You can use SQL

Looks as the one computer

Parallel RDB is great becausehellip

並列RDB超スゴイ age age マック

Parallel RDB is great

いろいろな商用並列RDBDatabase Vendor Since

Teradata Teradata 1983

Teradata Aster Teradata 2005

PureData System for Analytics IBM 2000

Exadata Oracle 2008

Greenplum Pivotal 2003

SQL Server PDW Microsoft 2010くらい

Redshift Amazon 2012

Various parallel RDBs

Here Comes a New Challenger

since 2005

Hadoop Architecture

HDFS Distributed File System

MapReduce Compute Framework

(Hive SQL interface)

Hadoopの特徴

1 データはテーブルではなくファイル

2 処理にはMapReduceを使う(使っていた)

3 Hiveを乗せるとSQLっぽい言語でも書ける

Hadoop data is plain file

Processed by MapReduce

Hive allows you to write SQL-like query

猫も194780子もMapReduce

ldquoBig Datardquo meant MapReduce few years ago

MapReducek1 v1

k2 v2

k3 v3

k4 v4

k5 v5

k6 v6

Map

k1 v1

k1 v2

k1 v3

k2 v4

k3 v5

k3 v6

k1 v1

k2 v2

k3 v3

Reduce

Map関数とReduce関数を 書いたらよしなに

分散してくれるフレームワーク

You just write MapampReduce functions Hadoop serves the rest

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)

select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t

MapReduceによるWordCount SQLによるWordCount

実際SQLが勝った

Now SQL beats MapReduce

Q2 Hadoopと並列RDBは

どっちがいいんですか

Which is good Hadoop or parallel RDB

速度は並列RDB データ構造はHadoop

Parallel RDB is faster Hadoop is more flexible

現在ありがちな構成

HDFS

MapReduce

Hive

今後の構成

HDFS

impala backend

impala frontend

Hadoopは並列RDBに 似てきている

DB filesystem

backend

parser planner

HDFS

impala be

impala fe

Hadoop resembles to parallel RDB now

Hybrid DB comes in near future

Q3 MapReduceは

お亡くなりですか

MapReduce is dead

まだだっhelliphellip まだ終わらんよ

No

MapReduceは 並列処理にJavaやCを

はさみこめる

MapReduce has better extendability

例Asterの SQL-MapReduce

SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )

最近Hiveにもnpath入りました (MatchPath)

You can combine MapReduce with SQL

Easy amp Handy SQL +

Extendable MapReduce

最後にポエム

よいものはよい

Great product is anywhere

だが知識は偏在している

but knowledge is maldistributed

OSS World Enterprise World

Hadoop

Ruby

Python Excel

並列RDB

Windows

markdown VB

Git

Java

Cross the border

end

Page 6: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

25分でわかる ビッグデータ分析 ~MapReduce追悼~

Big Data Analytics in 25 minutes

トータル100TBくらいのデータを分析するとしよう

Suppose you must analyze 100TB text data

1CPUとか もうマヂムリhellip

コンピュータ

プログラム

データ

Single CPU cannot handle 100TB

そうだ分散処理しようノード0 ノード1 ノード2 ノード3

プログラム プログラム プログラム プログラム

データ データ データ データ

You need more computers (distributed processing)

でも分散処理って めんどいhellipマヂムリhellip

Distributed processing is too difficulthellip

そこで並列RDBですよ

Parallel RDB may help you

Parallel RDBNode 0 Node 1 Node 2 Node 3

Front End

Front End

Front End

Front End

Back End

Back End

Back End

Back End

並列RDBの特長

1 ノードを増やせば線形に速くなる

2 標準SQLが使える

3 クライアントからは1台に見える

has linear scalability

You can use SQL

Looks as the one computer

Parallel RDB is great becausehellip

並列RDB超スゴイ age age マック

Parallel RDB is great

いろいろな商用並列RDBDatabase Vendor Since

Teradata Teradata 1983

Teradata Aster Teradata 2005

PureData System for Analytics IBM 2000

Exadata Oracle 2008

Greenplum Pivotal 2003

SQL Server PDW Microsoft 2010くらい

Redshift Amazon 2012

Various parallel RDBs

Here Comes a New Challenger

since 2005

Hadoop Architecture

HDFS Distributed File System

MapReduce Compute Framework

(Hive SQL interface)

Hadoopの特徴

1 データはテーブルではなくファイル

2 処理にはMapReduceを使う(使っていた)

3 Hiveを乗せるとSQLっぽい言語でも書ける

Hadoop data is plain file

Processed by MapReduce

Hive allows you to write SQL-like query

猫も194780子もMapReduce

ldquoBig Datardquo meant MapReduce few years ago

MapReducek1 v1

k2 v2

k3 v3

k4 v4

k5 v5

k6 v6

Map

k1 v1

k1 v2

k1 v3

k2 v4

k3 v5

k3 v6

k1 v1

k2 v2

k3 v3

Reduce

Map関数とReduce関数を 書いたらよしなに

分散してくれるフレームワーク

You just write MapampReduce functions Hadoop serves the rest

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

コスト差ってどれくらいよpackage orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one) public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)

select count() from ( select regexp_split_to_table(str lsquos+) from text_table ) t

MapReduceによるWordCount SQLによるWordCount

実際SQLが勝った

Now SQL beats MapReduce

Q2 Hadoopと並列RDBは

どっちがいいんですか

Which is good Hadoop or parallel RDB

速度は並列RDB データ構造はHadoop

Parallel RDB is faster Hadoop is more flexible

現在ありがちな構成

HDFS

MapReduce

Hive

今後の構成

HDFS

impala backend

impala frontend

Hadoopは並列RDBに 似てきている

DB filesystem

backend

parser planner

HDFS

impala be

impala fe

Hadoop resembles to parallel RDB now

Hybrid DB comes in near future

Q3 MapReduceは

お亡くなりですか

MapReduce is dead

まだだっhelliphellip まだ終わらんよ

No

MapReduceは 並列処理にJavaやCを

はさみこめる

MapReduce has better extendability

例Asterの SQL-MapReduce

SQLからMapReduce呼べるselect count(distinct user_id) from npath( on clicks partition by user_id order by timestamp mode(overlapping) pattern(lsquoHSPrsquo) symbols( page_type = lsquohomersquo AS H page_type = lsquosearchrsquo AS S page_type = lsquoproductrsquo AS P) result(first(user_id of H) as user_id) )

最近Hiveにもnpath入りました (MatchPath)

You can combine MapReduce with SQL

Easy amp Handy SQL +

Extendable MapReduce

最後にポエム

よいものはよい

Great product is anywhere

だが知識は偏在している

but knowledge is maldistributed

OSS World Enterprise World

Hadoop

Ruby

Python Excel

並列RDB

Windows

markdown VB

Git

Java

Cross the border

end

Page 14: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

並列RDB超スゴイ age age マック

Parallel RDB is great

いろいろな商用並列RDBDatabase Vendor Since

Teradata Teradata 1983

Teradata Aster Teradata 2005

PureData System for Analytics IBM 2000

Exadata Oracle 2008

Greenplum Pivotal 2003

SQL Server PDW Microsoft 2010くらい

Redshift Amazon 2012

Various parallel RDBs

Here Comes a New Challenger

since 2005

Hadoop Architecture

HDFS Distributed File System

MapReduce Compute Framework

(Hive SQL interface)

Hadoopの特徴

1 データはテーブルではなくファイル

2 処理にはMapReduceを使う(使っていた)

3 Hiveを乗せるとSQLっぽい言語でも書ける

Hadoop data is plain file

Processed by MapReduce

Hive allows you to write SQL-like query

猫も194780子もMapReduce

ldquoBig Datardquo meant MapReduce few years ago

MapReducek1 v1

k2 v2

k3 v3

k4 v4

k5 v5

k6 v6

Map

k1 v1

k1 v2

k1 v3

k2 v4

k3 v5

k3 v6

k1 v1

k2 v2

k3 v3

Reduce

Map関数とReduce関数を 書いたらよしなに

分散してくれるフレームワーク

You just write MapampReduce functions Hadoop serves the rest

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

How big is the cost difference?

WordCount in MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

WordCount in SQL:

    select count(*) from (
      select regexp_split_to_table(str, '\s+') from text_table
    ) t

In practice, SQL won: SQL now beats MapReduce.

Q2. Which is better, Hadoop or a parallel RDB?

For speed, the parallel RDB; for data structures, Hadoop.

A parallel RDB is faster; Hadoop is more flexible.

A typical configuration today

HDFS

MapReduce

Hive

The configuration going forward

HDFS

Impala backend

Impala frontend

Hadoop is coming to resemble a parallel RDB

Parallel RDB       Hadoop
DB filesystem      HDFS
backend            Impala backend
parser, planner    Impala frontend
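As a rough illustration of the frontend/backend split that both stacks share, the sketch below has a front end fan one filter-and-count plan out to several back ends, each of which scans only its own partition, and then merge the partial counts. All names (`ScatterGatherSketch`, `backendCount`, `frontendCount`) are hypothetical, and the nodes are simulated with in-memory lists; it does not reflect the actual internals of Impala or of any parallel RDB.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy frontend/backend split: every node counts locally, the front end adds the results. */
final class ScatterGatherSketch {

    /** Back end: scans only the rows stored on its own node and returns a partial count. */
    static long backendCount(List<String> localRows, Predicate<String> filter) {
        long matches = 0;
        for (String row : localRows) {
            if (filter.test(row)) {
                matches++;
            }
        }
        return matches;
    }

    /** Front end: hands the same plan to every node, then merges the partial counts. */
    static long frontendCount(List<List<String>> nodes, Predicate<String> filter) {
        return nodes.parallelStream()
                    .mapToLong(rows -> backendCount(rows, filter))
                    .sum();
    }

    public static void main(String[] args) {
        // Three simulated nodes, each holding its own slice of a page_type column.
        List<List<String>> nodes = Arrays.asList(
            Arrays.asList("home", "search"),
            Arrays.asList("home", "product", "home"),
            Arrays.asList("search"));
        System.out.println(frontendCount(nodes, "home"::equals)); // 3 rows match across all nodes
    }
}
```

The client only ever talks to `frontendCount`, which is why such a system can look like a single machine from the outside.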

Hybrid DBs are coming in the near future

Q3. Is MapReduce dead?

No. Not yet... it's not over yet!

MapReduce lets you insert Java or C code into the parallel processing

MapReduce has better extensibility

Example: Aster's SQL-MapReduce

You can call MapReduce from SQL:

    select count(distinct user_id)
    from npath(
      on clicks
      partition by user_id
      order by timestamp
      mode(overlapping)
      pattern('H.S.P')
      symbols(
        page_type = 'home' AS H,
        page_type = 'search' AS S,
        page_type = 'product' AS P
      )
      result(first(user_id of H) as user_id)
    )

npath was recently added to Hive as well (MatchPath)

You can combine MapReduce with SQL
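To make the intent of the npath query concrete, here is a plain Java sketch of the same funnel analysis: group clicks by user, order them by timestamp, map each page type to a symbol, and count the users whose symbol sequence contains home, search, product in consecutive clicks. The `Click` class and `countFunnelUsers` are hypothetical names, and the handling of other page types (they interrupt the sequence) is an assumption of this sketch rather than documented npath behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Counts users whose clicks contain home -> search -> product back to back. */
final class FunnelSketch {

    /** One row of the (hypothetical) clicks table. */
    static final class Click {
        final String userId;
        final long timestamp;
        final String pageType;

        Click(String userId, long timestamp, String pageType) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.pageType = pageType;
        }
    }

    /** Plays the role of symbols(...): translate a page type into a one-letter symbol. */
    static char symbolOf(String pageType) {
        switch (pageType) {
            case "home":    return 'H';
            case "search":  return 'S';
            case "product": return 'P';
            default:        return '-'; // assumption: any other page breaks the sequence
        }
    }

    static int countFunnelUsers(List<Click> clicks) {
        // partition by user_id
        Map<String, List<Click>> byUser = new HashMap<>();
        for (Click c : clicks) {
            byUser.computeIfAbsent(c.userId, k -> new ArrayList<>()).add(c);
        }
        int matchedUsers = 0;
        for (List<Click> userClicks : byUser.values()) {
            // order by timestamp
            userClicks.sort(Comparator.comparingLong((Click c) -> c.timestamp));
            StringBuilder symbols = new StringBuilder();
            for (Click c : userClicks) {
                symbols.append(symbolOf(c.pageType));
            }
            // H, S, P in consecutive rows; each user is counted once (count distinct)
            if (symbols.indexOf("HSP") >= 0) {
                matchedUsers++;
            }
        }
        return matchedUsers;
    }

    public static void main(String[] args) {
        List<Click> clicks = Arrays.asList(
            new Click("u1", 1, "home"), new Click("u1", 2, "search"), new Click("u1", 3, "product"),
            new Click("u2", 1, "home"), new Click("u2", 2, "product"), new Click("u2", 3, "search"));
        System.out.println(countFunnelUsers(clicks)); // only u1 completes the funnel
    }
}
```

The per-user loop is the piece that is awkward to express in plain SQL and natural to express as a function applied to each partition, which is the combination the slide is pointing at.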

Easy & handy SQL + extensible MapReduce

Finally, a poem

Good things are good; great products are everywhere

but knowledge is unevenly distributed

OSS World Enterprise World

Hadoop

Ruby

Python Excel

Parallel RDB

Windows

markdown VB

Git

Java

Cross the border

end

Page 23: Oedo Ruby Conference 04: Ruby会議でSQLの話をするのは間違っているだろうか

Q1 SQLとMapReduce どっちがいいの

Which is good SQL and MapReduce

ビジネス的な答え

SQL

なぜSQLか

1 SQL書けてもJava書けない人は多い

2 既存のSQLを使ったアプリが動かない

3 MapReduce関数を書くのは高コスト

Many people can write SQL but not Java

Many applications rely on SQL

Writing MR function needs more time

So how big is the cost difference?

WordCount in MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

      // Mapper: emits (word, 1) for every token of every input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer (also used as combiner): sums the counts for one word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      // Job setup: args[0] is the input path, args[1] the output path.
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

WordCount in SQL:

    select count(*)
    from (
        select regexp_split_to_table(str, '\s+')
        from text_table
    ) t
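To make the comparison concrete, the three phases that the Hadoop job above hands to the framework (map, shuffle, reduce) can be imitated in a single process. This is only a sketch in plain Java with no Hadoop dependency; the class `LocalWordCount` and its method names are invented for illustration and are not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of map -> shuffle -> reduce.
class LocalWordCount {

    // Map phase: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("ruby sql ruby", "sql hadoop")));
    }
}
```

Running `main` prints `{hadoop=1, ruby=2, sql=2}`. Even this toy version needs three explicit phases to express what a declarative query states in a few lines, which is the cost gap the slide is pointing at.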

And in practice, SQL won: SQL now beats MapReduce.

Q2: Which is better, Hadoop or a parallel RDB?

For speed, a parallel RDB. For flexible data structures, Hadoop.

A typical configuration today

HDFS

MapReduce

Hive

The coming configuration

HDFS

Impala backend

Impala frontend

Hadoop is coming to resemble a parallel RDB

    Parallel RDB       Hadoop
    DB filesystem      HDFS
    backend            Impala backend
    parser, planner    Impala frontend

Hybrid databases will arrive in the near future
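The layering in the comparison above (a frontend that accepts the query, backends that each scan their own slice of the data) can be sketched in a few lines. This is a toy model in plain Java, not the real interface of Impala or of any parallel RDB; `Backend`, `Frontend`, and `countWhere` are hypothetical names, and each "node" is just an asynchronous task over an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;

// Hypothetical backend: owns one partition of the table and scans only that.
class Backend {
    private final List<String> partition;

    Backend(List<String> partition) {
        this.partition = partition;
    }

    // Partial aggregate: number of local rows matching the predicate.
    long countWhere(Predicate<String> predicate) {
        return partition.stream().filter(predicate).count();
    }
}

// Hypothetical frontend: fans a query out to every backend and merges results.
class Frontend {
    private final List<Backend> backends;

    Frontend(List<Backend> backends) {
        this.backends = backends;
    }

    long countWhere(Predicate<String> predicate) {
        // Scatter: every backend scans its own partition concurrently.
        List<CompletableFuture<Long>> partials = new ArrayList<>();
        for (Backend backend : backends) {
            partials.add(CompletableFuture.supplyAsync(() -> backend.countWhere(predicate)));
        }
        // Gather: the final count is the sum of the partial counts.
        long total = 0;
        for (CompletableFuture<Long> partial : partials) {
            total += partial.join();
        }
        return total;
    }

    public static void main(String[] args) {
        Frontend frontend = new Frontend(List.of(
                new Backend(List.of("home", "search", "product")),
                new Backend(List.of("search", "search")),
                new Backend(List.of("home"))));
        System.out.println(frontend.countWhere(row -> row.equals("search")));
    }
}
```

The client talks only to the frontend, so the cluster looks like one machine, and adding a backend adds scanning capacity without changing the query.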

Q3: Is MapReduce dead?

No. Not yet... it's not over yet!

MapReduce lets you slot Java or C code into parallel processing

MapReduce has better extensibility

Example: Aster's SQL-MapReduce

You can call MapReduce from SQL:

    select count(distinct user_id)
    from npath(
        on clicks
        partition by user_id
        order by timestamp
        mode(overlapping)
        pattern('H.S.P')
        symbols(
            page_type = 'home' AS H,
            page_type = 'search' AS S,
            page_type = 'product' AS P)
        result(first(user_id of H) as user_id)
    )

Hive recently gained npath as well (MatchPath)

You can combine MapReduce with SQL
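To see what the `npath` query above computes, the same question (how many distinct users viewed a home page, then a search page, then a product page) can be answered by hand: partition the clicks by user, order each partition by timestamp, and scan for the H, S, P sequence. The sketch below is plain Java, not Aster or Hive code; the `Click` and `PathCount` classes are hypothetical, and it assumes the pattern means three immediately consecutive rows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row of the clicks table.
class Click {
    final String userId;
    final long timestamp;
    final String pageType;

    Click(String userId, long timestamp, String pageType) {
        this.userId = userId;
        this.timestamp = timestamp;
        this.pageType = pageType;
    }
}

class PathCount {

    // Counts distinct users whose time-ordered clicks contain
    // home, search, product as three consecutive rows.
    static int usersWithHomeSearchProduct(List<Click> clicks) {
        // "partition by user_id"
        Map<String, List<Click>> byUser = new HashMap<>();
        for (Click click : clicks) {
            byUser.computeIfAbsent(click.userId, k -> new ArrayList<>()).add(click);
        }
        int matchedUsers = 0;
        for (List<Click> rows : byUser.values()) {
            // "order by timestamp"
            rows.sort(Comparator.comparingLong((Click c) -> c.timestamp));
            if (containsPath(rows)) {
                matchedUsers++; // each user is counted once: count(distinct user_id)
            }
        }
        return matchedUsers;
    }

    // Slides a three-row window over one user's ordered clicks.
    static boolean containsPath(List<Click> rows) {
        for (int i = 0; i + 2 < rows.size(); i++) {
            if (rows.get(i).pageType.equals("home")
                    && rows.get(i + 1).pageType.equals("search")
                    && rows.get(i + 2).pageType.equals("product")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Click> clicks = List.of(
                new Click("u1", 1, "home"), new Click("u1", 2, "search"), new Click("u1", 3, "product"),
                new Click("u2", 1, "home"), new Click("u2", 2, "product"), new Click("u2", 3, "search"),
                new Click("u3", 4, "product"), new Click("u3", 2, "home"), new Click("u3", 3, "search"));
        System.out.println(usersWithHomeSearchProduct(clicks));
    }
}
```

With the sample data in `main`, users `u1` and `u3` match and `u2` does not, so the program prints `2`. Ordered, per-partition scans like this are awkward in plain SQL, which is why packaging them as a function callable from SQL is attractive.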

Easy & handy SQL + extensible MapReduce

Finally, a poem

Good things are good.

Great products can be found anywhere,

but knowledge is unevenly distributed.

[Figure: technologies spread across an "OSS World" and an "Enterprise World": Hadoop, Ruby, Python, Excel, parallel RDB, Windows, markdown, VB, Git, Java]

Cross the border

end
