SparkR + Zeppelin
Seattle Spark Meetup
Sept 9, 2015
Felix Cheung
Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next
R
• A programming language for statistical computing and graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993
• S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
(Slide graphic labels: Fast!, Scalable, Flexible, Statistical!, Interactive, Packages)
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL: sparkR
• or as a standard R package imported in tools like RStudio:
    library(SparkR)
    sc <- sparkR.init()
    sqlContext <- sparkRSQL.init(sc)
History
• Shivaram Venkataraman & Zongheng Yang, AMPLab – UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)
Architecture
(Diagram labels: Native S4 classes & methods, socket, RBackend)
• A set of native S4 classes and methods that live inside a standard R package
• A backend that passes data structures and method calls to Spark Scala/JVM
• A collection of “helper” methods written in Scala
Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution plan optimization, constant-folding, predicate pushdown, and code generation
https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
SparkR DataFrame
• Spark packages
• Data Source API
• Optimizations
SparkR in Zeppelin
Architecture
(Diagram labels: R, R adaptor)
Demo
DIY
• https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
• Vagrant + VirtualBox
• Install prerequisites: JDK, R, R packages
• Automatically download Spark 1.5.0 release
• Need to build Zeppelin from https://github.com/felixcheung/incubator-zeppelin/tree/r
• Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json
(extracted from the demo)
Native R
(extracted from the demo)
Native R and dplyr...
Similarly SparkR DataFrame…
(extracted from the demo)
SparkR DataFrame…
What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs
SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
    model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
• More R-like
    df[df$age %in% c(19, 30), 1:2]
    transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
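For comparison, the same formula can be fitted with base R's own glm on a local data.frame. This is only a sketch that runs without Spark: it assumes the built-in iris dataset stands in for the slide's df, and it renames the columns to the underscore form used in the slide's formula. The names glm_df and glm_model are illustrative and do not come from the talk.

```r
# Base R analogue of the slide's GLM call, run on a local data.frame (no Spark needed).
# Assumption: the built-in iris dataset stands in for the slide's `df`.
glm_df <- iris

# Replace dots with underscores so the column names match the slide's formula.
names(glm_df) <- gsub(".", "_", names(glm_df), fixed = TRUE)

glm_model <- glm(Sepal_Length ~ Sepal_Width + Species,
                 data = glm_df, family = "gaussian")

# The fit has an intercept, a slope for Sepal_Width, and two Species contrasts.
print(names(coef(glm_model)))
```

Keeping the formula identical in both settings is the point of the R formula support: only the object passed as data changes.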
Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release
Questions?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
subset
# Columns can be selected using `[[` and `[`
df[[2]] == df[["age"]]
df[,2] == df[,"age"]
df[,c("name", "age")]
# Or to filter rows
df[df$age > 20,]
# DataFrame can be subset on both rows and Columns
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
subset(df, df$age %in% c(19, 30), 1:2)
subset(df, df$age %in% c(19), select = c(1,2))
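These idioms mirror what base R already does on a local data.frame. The sketch below runs without Spark; the people data.frame and its rows are hypothetical sample data, not taken from the demo.

```r
# Local data.frame with hypothetical sample rows (not from the talk).
people <- data.frame(name = c("Smith", "Jones", "Lee"),
                     age  = c(19, 25, 30),
                     city = c("Seattle", "Tacoma", "Bellevue"),
                     stringsAsFactors = FALSE)

# Selecting a column by position or by name gives the same vector.
same_column <- identical(people[[2]], people[["age"]])

# Filter rows by age, then keep the first two columns, in two equivalent ways.
by_bracket <- people[people$age %in% c(19, 30), 1:2]
by_subset  <- subset(people, age %in% c(19, 30), select = 1:2)

print(by_bracket)
```

Both forms keep the rows for Smith and Lee and the name and age columns.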
Transform/mutate
newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
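The transform call has the same shape as base R's transform on a local data.frame. A minimal sketch that runs without Spark, assuming a hypothetical one-column data.frame (the names local_df and local_new are illustrative):

```r
# Hypothetical local data; the column name col1 follows the slide.
local_df <- data.frame(col1 = c(5, 10, 15))

# Base R transform adds the derived columns and keeps the original one.
local_new <- transform(local_df, newCol = col1 / 5, newCol2 = col1 * 2)

print(local_new)
```

In local R, mutate is not part of base R; it comes from the dplyr package mentioned in the demo, so it is not shown here.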