2015 MapR Technologies 1
Spark Streaming
MapR Technologies, December 9, 2015
What is Apache Spark Streaming?
(@nagix)
What is Spark Streaming?
Time-stamped data
Data for real-time monitoring
What is time series data? Stuff with timestamps:
Sensor data
Log files
Phones
Credit card transactions
Web user behaviour
Social media
Geodata
Why Spark Streaming?
What if you want to analyze data as it arrives?
For example, time series data: sensors, clicks, logs, stats.
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events.
Event Processing (Streaming)
It's 6:05 and 90 degrees
Someone should open a window!
It's becoming important to process events as they arrive.
Spark Streaming
An extension of the core Spark API for scalable, fault-tolerant processing of live data streams.
Stream Processing Architecture
[Figure: stream processing architecture. Labels: Streaming Sources/Apps; Data Ingest (Topics, MapR-FS); Stream Processing; Data Storage (MapR-DB, MapR-FS); Apps; HDFS, HBase]
Input sources: files (HDFS), network (TCP sockets), Twitter, Kafka, Flume, ZeroMQ, Akka Actor
Transformation
How Spark Streaming works
The incoming stream is divided into batches of X seconds. A DStream is a sequence of RDDs, one RDD per batch.
[Figure: a DStream as a sequence of RDDs: RDD @ time 1 (time 0 to 1), RDD @ time 2 (time 1 to 2), RDD @ time 3 (time 2 to 3)]
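The batching idea on this slide can be sketched without a cluster. The snippet below is a plain-Python illustration, not Spark code: `discretize`, the sample `events`, and the one-second interval are assumptions made up for the example. It groups time-stamped events into fixed-length batches, the way a DStream groups incoming records into one RDD per batch interval.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Returns a time-ordered list of (batch_start, [values]) pairs. The list
    stands in for a DStream and each inner list stands in for one RDD.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events in [k * batch_seconds, (k + 1) * batch_seconds) share batch k.
        batch_index = int(timestamp // batch_seconds)
        batches[batch_index].append(value)
    return [(index * batch_seconds, batches[index]) for index in sorted(batches)]


# Hypothetical readings: (seconds since start, temperature).
events = [(0.2, 72), (0.9, 75), (1.1, 77), (1.8, 85), (2.5, 90)]
stream = discretize(events, batch_seconds=1)

for batch_start, values in stream:
    print(batch_start, values)
```

One simplification to keep in mind: this sketch skips intervals with no events, whereas a real streaming job still produces an (empty) batch for every interval.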
Resilient Distributed Datasets (RDD)
The RDD is Spark's core abstraction: a read-only, distributed collection of elements.
RDD
textFile = sc.textFile("SomeFile.txt")
RDD
Transformations
linesRDD = sc.textFile("LogFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
RDD
Transformations, then an Action that returns a Value
linesRDD = sc.textFile("LogFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # Error line
Processing a DStream
Transformations such as map, reduceByKey, and count are applied to each RDD in a DStream, producing a new DStream of transformed RDDs.
[Figure: the input DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3) is transformed batch by batch into an output DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3)]
Transformations on DStreams
Standard RDD operations: map, filter, union, reduce, join, ...
Stateful and window operations: updateStateByKey(function), countByValueAndWindow, ...
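To make the window operations more concrete, here is a rough plain-Python sketch of what a windowed count by value computes. It is not the Spark API: the function name mirrors `countByValueAndWindow` only for readability, the window is measured in number of batches rather than a duration, and the slide interval is assumed to be a single batch.

```python
from collections import Counter, deque


def count_by_value_and_window(batches, window_length):
    """Yield, for each new batch, value counts over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # the oldest batch falls out automatically
    for batch in batches:
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total.update(counts)
        yield dict(total)


# Hypothetical log levels arriving in three consecutive batches.
batches = [["ERROR", "INFO"], ["INFO"], ["ERROR", "ERROR"]]
windowed = list(count_by_value_and_window(batches, window_length=2))

for counts in windowed:
    print(counts)
```

Recomputing the whole window for every batch keeps the sketch short; an incremental version would add the newest batch and subtract the one that left the window.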
Spark Streaming
[Figure: Spark Streaming splits the input data into batches (a DStream of RDDs: RDD @ time 1, RDD @ time 2, RDD @ time 3); Spark processes each batch and emits the results in batches]
After transformations: output operations
saveAsHadoopFiles: save to HDFS
saveAsHadoopDataset: save to HBase
saveAsTextFiles
foreachRDD: run arbitrary processing on the RDD of each batch
[Figure: example use case. Labels: read, Spark, Spark Streaming]
Converting CSV lines to Sensor objects
case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

// Parse one comma-separated line into a Sensor
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
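For readers who want to try the parsing step without a Scala toolchain, the following is a plain-Python analogue of `parseSensor`. It is only a sketch: the snake_case field names, the length check, and the sample line are assumptions added for illustration (only the `hz` and `psi` values echo numbers shown in the deck).

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    resid: str
    date: str
    time: str
    hz: float
    disp: float
    flo: float
    sed_ppm: float
    psi: float
    chl_ppm: float


def parse_sensor(line: str) -> Sensor:
    """Parse one comma-separated line into a Sensor."""
    p = line.split(",")
    if len(p) != 9:
        # Added check (not in the Scala version): fail clearly on malformed rows.
        raise ValueError(f"expected 9 fields, got {len(p)}: {line!r}")
    return Sensor(p[0], p[1], p[2], *(float(x) for x in p[3:9]))


# Made-up sample line in the field order of the case class.
line = "COHUTTA,3/10/14,1:01,10.37,1.93,1.59,0.17,84,1.47"
sensor = parse_sensor(line)
print(sensor)
```

The explicit length check is a design choice for the sketch; the Scala version would instead throw an index error when a line has too few fields.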
Table schema (column families: data, alerts, stats)
Row key               | data:hz | data:psi | alerts:psi | stats:hz_avg | stats:psi_min
COHUTTA_3/10/14_1:01  | 10.37   | 84       | 0          |              |
COHUTTA_3/10/14       |         |          |            | 10           | 0
Basic steps of a Spark Streaming program:
1. Initialize a Spark StreamingContext.
2. Apply transformations and output operations to DStreams.
3. Start receiving and processing data with streamingContext.start().
4. Wait for processing to stop with streamingContext.awaitTermination().
Creating a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")
[Figure: linesDStream is a sequence of RDDs, one per batch (batch time 0-1, batch time 1-2, ...)]
Processing a DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)
[Figure: map is applied to the RDD of each batch of linesDStream (batch time 0-1, batch time 1-2, ...), producing the corresponding RDDs of sensorDStream]
Processing a DStream
// For each RDD (batch) in the DStream
sensorDStream.foreachRDD { rdd =>
  // Keep the sensor readings with low psi as alerts
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}
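The per-batch alert filter can be mimicked in plain Python as follows. This is a hedged sketch rather than Spark code: `Reading`, `foreach_batch`, `collect_alerts`, and the sample pump IDs are hypothetical names introduced for the example, and only the `psi < 5.0` threshold comes from the slide.

```python
from typing import Callable, Iterable, List, NamedTuple


class Reading(NamedTuple):
    resid: str
    psi: float


def foreach_batch(batches: Iterable[List[Reading]],
                  handler: Callable[[List[Reading]], None]) -> None:
    """Call `handler` once per batch, loosely mirroring foreachRDD."""
    for batch in batches:
        handler(batch)


alerts: List[Reading] = []


def collect_alerts(batch: List[Reading]) -> None:
    # Same predicate as the slide: low pressure triggers an alert.
    alerts.extend(r for r in batch if r.psi < 5.0)


# Two hypothetical batches of readings.
batches = [
    [Reading("COHUTTA", 84.0), Reading("PUMP_B", 0.0)],
    [Reading("COHUTTA", 4.5)],
]
foreach_batch(batches, collect_alerts)
print(alerts)
```

In the real job the handler body runs against a distributed RDD, so collecting alerts into a driver-side list like this would only be reasonable for small results.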
DataFrame and SQL operations
// For each RDD (batch) in the DStream
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.toDF().registerTempTable("alert")
  // Join the alert data with the pump and maintenance data
  val alertViewDF = sqlContext.sql(
    "select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid")
  . . .
}
Saving to HBase
// For each RDD (batch) in the DStream
sensorDStream.foreachRDD { rdd =>
  . . .
  // Convert the alerts to Put objects and write them to HBase
  rdd.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}
Saving to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Figure: for each batch (batch time 0-1, batch time 1-2, ...), map converts the sensor RDD to Put objects, which are then saved to HBase]
Starting the computation
sensorDStream.foreachRDD { rdd =>
  . . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
[Figure: Spark reading from and writing to HBase]
HBase Read and Write
// Read: newAPIHadoopRDD scans the table and returns (row key, Result) pairs
val hBaseRDD = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// Write: saveAsHadoopDataset writes (key, Put) pairs to HBase
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
Reading from HBase and computing statistics
// Load an RDD of (row key, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// Keep only the Result from each tuple
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// Convert to an RDD of (RowKey, ColumnValue), keeping the first space-separated field of the row key
val keyValueRDD = resultRDD.map(
  result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))
// Group by row key and compute statistics for each group
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
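The group-and-summarize step above can be approximated in plain Python to see what ends up in `keyStatsRDD`. This is an illustrative sketch only: `stats_by_key` and the sample rows are made up, and the returned dictionary covers just a few of the figures that Spark's `StatCounter` tracks.

```python
from collections import defaultdict
from statistics import mean


def stats_by_key(pairs):
    """Group (key, value) pairs by key and summarize each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: {"count": len(values), "mean": mean(values),
              "min": min(values), "max": max(values)}
        for key, values in grouped.items()
    }


# Hypothetical (row key, psi) rows; as in the Scala code, only the part of
# the row key before the first space is kept as the grouping key.
rows = [("COHUTTA 1:01", 84.0), ("COHUTTA 1:02", 80.0), ("PUMP_B 1:01", 0.0)]
pairs = [(row_key.split(" ")[0], value) for row_key, value in rows]
key_stats = stats_by_key(pairs)

for key, stats in key_stats.items():
    print(key, stats)
```

Materializing every group in memory mirrors `groupByKey`; on large data an aggregation that folds values into running statistics per key would avoid holding whole groups at once.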
Writing to HBase
// Configure the HBase output table
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// Convert the statistics to Put objects and write them to HBase
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
https://www.mapr.com/blog/spark-streaming-hbase
MapR Converged Data Platform
NEW
MapR Streams Kafka API
Q & A @mapr_japan maprjapan
MapR
maprtech
mapr-technologies