Tokyo r sqldf

Takashi Minoda (簑田 高志)

Driving R with SQL: making R approachable even for people who work with RDBs.

03/05/11 2

Contents

1. About me
2. Introducing the sqldf package
3. Slightly more advanced usage...

About me

● Name: Takashi Minoda (簑田 高志)
● Twitter: aad34210
● Blog: http://pracmper.blogspot.com/
● Hometown: Kumamoto Prefecture
● Degree: Faculty of Law
● Work: planning and operating Web services for general users; analyst
● Hobby: tennis
● (I can't dance or sing)

Questions

First, a few questions for everyone.

・ Have you used R?
・ Have you worked with an RDB using SQL?
・ Have you ever found aggregation work in R a pain? (・ д・ ) ﾁｯ

sqldf package

When you aggregate in R...

library(ggplot2)  # provides the diamonds dataset

# Sum price by cut
price_sum <- aggregate(diamonds[, 7], list(cut = diamonds$cut), sum)
# Average the other variables by cut
other_mean <- aggregate(diamonds[, 5:10], list(cut = diamonds$cut), mean)
# Merge the two objects so they can be viewed together
merge(price_sum, other_mean, by = "cut")

(・ д・ ) ﾁｯ > I have to write separate code for every function I want to compute...
I want to aggregate a data frame easily.

> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

sqldf package

● You use R
● You find aggregation a pain (・ д・ ) ﾁｯ
● You have used SQL

If that sounds like you, today I'll introduce the sqldf package.


What is the sqldf package?

sqldf is an R package for running SQL statements on R data frames, optimized for convenience. The user simply specifies an SQL statement in R using data frame names in place of table names and a database with appropriate table layouts/schema is automatically created, the data frames are automatically loaded into the database, the specified SQL statement is performed, the result is read back into R and the database is deleted all automatically behind the scenes making the database's existence transparent to the user who only specifies the SQL statement. Surprisingly this can at times be even faster than the corresponding pure R calculation (although the purpose of the project is convenience and not speed).

Source: http://code.google.com/p/sqldf/

(Very loose summary)
・ sqldf is a package that lets you run SQL against R data frames.
・ It keeps a DB behind the scenes, automatically loads the layout, schema, and data, and sets up an environment for running the SQL. Once the result is back in R, it deletes whatever it was holding on the DB side, all on its own!
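
To make the behind-the-scenes steps concrete, here is a rough sketch of doing by hand what the description says sqldf automates. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database; the name manual_result is made up for this example. It only illustrates the idea and is not sqldf's actual implementation.

```r
library(DBI)

# 1. Create a temporary (in-memory) SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# 2. Load the data frame into the database as a table.
dbWriteTable(con, "iris", iris)

# 3. Run the SQL statement and read the result back into R.
manual_result <- dbGetQuery(con, "SELECT COUNT(*) AS iris_count FROM iris")

# 4. Throw the database away.
dbDisconnect(con)

print(manual_result)
```

With sqldf, all four steps collapse into a single call that only contains the SQL statement.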



Basic usage: sqldf("[SQL statement]")

# Install and load the sqldf package
install.packages("sqldf")
library(sqldf)

# Simple usage
# Count the rows in iris
> sqldf("SELECT COUNT(*) as iris_count FROM iris")
  iris_count
1        150

# Count the iris rows by Species
> sqldf("SELECT Species, COUNT(*) as iris_count FROM iris GROUP BY Species")
     Species iris_count
1     setosa         50
2 versicolor         50
3  virginica         50
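
Because sqldf() returns an ordinary data frame, the result can be stored and processed further in R. A minimal sketch, assuming sqldf is installed; the names species_counts and total_rows are made up for this example:

```r
library(sqldf)

# The value returned by sqldf() is an ordinary data frame.
species_counts <- sqldf(
  "SELECT Species, COUNT(*) AS iris_count FROM iris GROUP BY Species"
)

# It can be used like any other data frame, e.g. summing one of its columns.
total_rows <- sum(species_counts$iris_count)

print(is.data.frame(species_counts))
print(total_rows)
```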


sqldf package

# Let's try a little more.
# Column names containing "." (dot) do not work as-is in the SQL, so rename the columns.
> iris2 <- iris
> colnames(iris2) <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species")
> head(iris2)

# Count the rows and take the averages for each Species.
> sqldf("
+ SELECT
+   Species,
+   COUNT(Species) as Species_num,
+   AVG(Sepal_Length) as average_Length,
+   AVG(Sepal_Width) as average_width
+ FROM
+   iris2
+ GROUP BY
+   Species
+ ")
     Species Species_num average_Length average_width
1     setosa          50          5.006         3.428
2 versicolor          50          5.936         2.770
3  virginica          50          6.588         2.974
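
As an alternative to renaming, recent sqldf versions with the default SQLite backend are understood to accept dotted column names when they are quoted inside the SQL. The sketch below assumes such a version (the slides do not cover this); on older installations, renaming the columns remains the safe choice. The name avg_by_species is made up for this example.

```r
library(sqldf)

# Double quotes inside the SQL mark "Sepal.Length" as a single identifier,
# so the dot is not parsed as the SQL table.column operator.
avg_by_species <- sqldf('
  SELECT Species,
         AVG("Sepal.Length") AS average_length,
         AVG("Sepal.Width")  AS average_width
  FROM iris
  GROUP BY Species
')

print(avg_by_species)
```

The outer R string uses single quotes so that the double quotes can appear inside it without escaping.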


sqldf package

# Slightly more advanced usage.
# Extract the rows whose Petal_Length is at or above the average.
# Subquery (nested inside the WHERE clause)

> sqldf("SELECT * FROM iris2 WHERE Petal_Length >= (select avg(Petal_Length) from iris2)")

  Sepal_Length Sepal_Width Petal_Length Petal_Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.9         3.1          4.9         1.5 versicolor
4          5.5         2.3          4.0         1.3 versicolor
5          6.5         2.8          4.6         1.5 versicolor
6          5.7         2.8          4.5         1.3 versicolor
・・・・・

Only rows with Petal_Length at or above avg(Petal_Length) = 3.758 are extracted.

sqldf package

# Slightly more advanced usage.
# Extract data using a variable.
# Hold the Species value in a variable so that it can be changed.
var <- "setosa"
sql_head <- "SELECT * FROM iris2 WHERE Species = "
sql_str <- paste(sql_head, "'", var, "'", collapse = "", sep = "")
sqldf(sql_str)

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
・・・・・

# Turn the code above into a function
Sepal_search <- function(var) {
  sql_head <- "SELECT * FROM iris2 WHERE Species = "
  sql_str <- paste(sql_head, "'", var, "'", collapse = "", sep = "")
  print(sqldf(sql_str))
}
# Run it!
Sepal_search(var = "versicolor")
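
One thing to watch when pasting values into SQL is that a value containing a single quote would break the statement. The following is a minimal sketch of the same idea using sprintf() and doubling any single quotes first; species_search is a hypothetical helper, not part of sqldf or of the original slides.

```r
library(sqldf)

iris2 <- iris
colnames(iris2) <- c("Sepal_Length", "Sepal_Width", "Petal_Length",
                     "Petal_Width", "Species")

# Hypothetical helper: builds the query with sprintf() and escapes quotes.
species_search <- function(species) {
  # In SQL, a single quote inside a string literal is written as two quotes.
  safe_value <- gsub("'", "''", species, fixed = TRUE)
  sql_str <- sprintf("SELECT * FROM iris2 WHERE Species = '%s'", safe_value)
  sqldf(sql_str)
}

versicolor_rows <- species_search("versicolor")
print(nrow(versicolor_rows))
```

Returning the data frame instead of printing it inside the function lets the caller decide what to do with the result.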


sqldf package

# A quick investigation.
# Which is faster: R's aggregation code or sqldf?

# R code
R_Code <- function() {
  price_sum <- aggregate(diamonds[, 7], list(cut = diamonds$cut), sum)
  other_mean <- aggregate(diamonds[, c(5, 7:10)], list(cut = diamonds$cut), mean)
  merge(price_sum, other_mean, by = "cut")
}

# sqldf
sql_df_code <- function() {
  sqldf("
    SELECT cut, SUM(price), avg(depth), avg(price), avg(x), avg(y), avg(z)
    FROM diamonds
    GROUP BY cut")
}

# Run the comparison
system.time(R_Code())
system.time(sql_df_code())

sqldf package

# Results
# Using R's aggregate and merge
> system.time(R_Code())
   user  system elapsed
  0.468   0.041   0.541

# Using sqldf
> system.time(sql_df_code())
   user  system elapsed
  0.841   0.036   0.895

・ Comparing the R code with the sqldf code, the R code was faster!
・ I suppose it is a trade-off between ease of writing the code and speed...
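
A single system.time() call can vary noticeably from run to run, so repeating each measurement gives a steadier comparison. The sketch below is one possible way to do that. It uses a synthetic data frame named sales as a stand-in (an assumption, not the diamonds data from the slides), and median_elapsed is a hypothetical helper.

```r
library(sqldf)

# Synthetic stand-in data (assumption: not the diamonds data from the slides).
set.seed(1)
sales <- data.frame(
  grp   = sample(letters[1:5], 50000, replace = TRUE),
  value = runif(50000)
)

# The same group-wise mean, once in base R and once via sqldf.
r_version <- function() aggregate(value ~ grp, data = sales, FUN = mean)
sql_version <- function() {
  sqldf("SELECT grp, AVG(value) AS value FROM sales GROUP BY grp")
}

# Hypothetical helper: run f() several times, return the median elapsed time.
median_elapsed <- function(f, times = 5) {
  median(replicate(times, system.time(f())[["elapsed"]]))
}

timings <- c(r = median_elapsed(r_version), sqldf = median_elapsed(sql_version))
print(timings)
```

The actual numbers depend on the machine, the data size, and the package versions, so they may not match the ranking shown on the slide.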


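Summary

1. If you find data frame processing in R difficult, try using the sqldf package and aggregating with SQL.

2. Usage is simple: run sqldf("[SQL]"). The result can also be assigned to an R object.

3. Variables and functions work too. Build the SQL statement with the paste function and pass it to sqldf.

4. Performance may be a little slower than plain R code. Weigh speed against coding productivity.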

m(__)m

Thank you for your attention!
