Spark Streaming + Amazon Kinesis
- 1. Spark Streaming + Amazon Kinesis @imai_factory
- 2. Agenda: Spark Streaming, Kafka / Kinesis, RDD and DStream, FRP
- 3. Conclusion (preview): Spark Streaming as a Kinesis consumer; running SQL over a Kinesis stream
- 4. A DStream is a sequence of RDDs over time: RDD @t1, RDD @t2, RDD @t3, RDD @t4, RDD @t5, ... (diagram)
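The diagram above can be modeled directly: a DStream is a time-ordered sequence of micro-batches, and a DStream transformation is the same operation applied to every batch. A minimal conceptual sketch (plain Scala collections, not the Spark API; `mapDStream` is an illustrative name, not from the deck):

```scala
// Conceptual model of a DStream: a Vector of micro-batches, one per
// interval (t1, t2, t3, ...). A DStream "map" applies f batch-by-batch.
def mapDStream[A, B](batches: Vector[Seq[A]])(f: A => B): Vector[Seq[B]] =
  batches.map(batch => batch.map(f))
```

For example, `mapDStream(Vector(Seq(1, 2), Seq(3)))(_ * 2)` yields `Vector(Seq(2, 4), Seq(6))` — each interval's RDD is transformed independently, which is exactly why the per-batch programming model on the next slides looks like ordinary RDD code.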
- 5. Programming with DStream:
      val conf = new SparkConf()
      val ssc = new StreamingContext(conf, Seconds(1))

      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))

      val pairs = words.map(word => (word, 1))
      val count = pairs.reduceByKey(_ + _)
      count.print()

      ssc.start()
      ssc.awaitTermination()
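The word-count pipeline on slide 5 runs once per one-second batch. As a hedged sketch of what a single batch computes (plain Scala collections, no Spark; the `countWords` helper is illustrative, not part of the deck):

```scala
// What one micro-batch of the slide-5 pipeline computes, in plain Scala:
// flatMap(_.split(" ")) -> map(word => (word, 1)) -> reduceByKey(_ + _).
def countWords(lines: Seq[String]): Map[String, Int] = {
  val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty) // flatMap step
  val pairs = words.map(word => (word, 1))                   // map step
  pairs.groupBy(_._1).map { case (w, ps) =>                  // reduceByKey step
    (w, ps.map(_._2).sum)
  }
}
```

For instance, `countWords(Seq("to be or", "not to be"))` returns `Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)`; Spark performs the same computation per key, but partitioned across the cluster.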
- 6.–8. Programming with DStream: the same code as slide 5, repeated while stepping through each transformation in turn
- 9. DStream data sources: Flume, Kafka, Kinesis, Twitter, File, Socket
- 10. Amazon Kinesis / Kafka
- 11. Amazon Kinesis: the Amazon Kinesis data stream handles Store, Shuffle & Sort; consumer apps handle Process
- 12. Spark Streaming + Amazon Kinesis: the Kinesis data stream handles Store, Shuffle & Sort; Spark Streaming handles Process
- 13. Spark Streaming + Amazon Kinesis: reading Kinesis from Spark, Kinesis + Spark SQL, and a look inside the Kinesis consumer
- 14. Building an Amazon Kinesis consumer app: against the data stream (Store, Shuffle & Sort) you can Process with the raw API/SDK, the KCL, or AWS Lambda. Spark consumes Kinesis via the KCL, as does Storm's kinesis-spout.
- 15. Amazon Kinesis data stream (Store, Shuffle & Sort) → Process: run SparkSQL on a Kinesis stream, i.e. query the stream with SQL
- 16. Run SparkSQL on Kinesis Stream:
      import org.apache.spark.streaming.kinesis.KinesisUtils

      val kinesisStreams = (0 until numStreams).map { i =>
        KinesisUtils.createStream(
          ssc, streamName, endpointUrl, kinesisCheckpointInterval,
          InitialPositionInStream.LATEST, StorageLevel.MEMORY_ONLY)
      }
      val unionStreams = ssc.union(kinesisStreams)
      val words = unionStreams.flatMap(...)
- 17. Run SparkSQL on Kinesis Stream (same code as slide 16): create one DStream per shard, UNION them into a single DStream, then apply transformations to the unioned DStream
- 18. Run SparkSQL on Kinesis Stream: each batch of the DStream holds JSON records.
      words.foreachRDD((rdd: RDD[String], time: Time) => {
        val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
        sqlContext.read.json(rdd).registerTempTable("words")
        val wordCountsDataFrame = sqlContext.sql(
          "select level, count(*) as total from words group by level")
        println(s"========= $time =========")
        wordCountsDataFrame.show()
      })
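The SQL on slide 18 groups the batch's JSON records by their `level` field and counts each group. A hedged plain-Scala sketch of that aggregation, with no Spark or SQLContext involved (`Record` and `countByLevel` are illustrative names, not from the deck):

```scala
// Equivalent of: select level, count(*) as total from words group by level,
// applied to already-parsed records. Plain Scala; `level` as on the slide.
case class Record(level: String)

def countByLevel(records: Seq[Record]): Map[String, Long] =
  records.groupBy(_.level).map { case (lvl, rs) => (lvl, rs.size.toLong) }
```

So a batch of two INFO records and one WARN record yields `Map("INFO" -> 2, "WARN" -> 1)` — the same rows the `wordCountsDataFrame.show()` call would print for that interval.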
- 19. Conclusion: Spark Streaming works well as a Kinesis consumer
- 20. Under the hood: KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval, InitialPositionInStream.LATEST, StorageLevel.MEMORY_ONLY) sets up a PluggableInputDStream backed by a KinesisReceiver. The receiver runs Kinesis Client Library worker threads that call GetRecords on the Kinesis stream and checkpoint their progress to a DynamoDB table.
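The checkpoint described on slide 20 is, at its core, the last processed sequence number per shard persisted in DynamoDB, so a restarted worker can resume where it left off. A minimal sketch of that bookkeeping (a hypothetical in-memory store standing in for the KCL's DynamoDB table; not the KCL API):

```scala
// Hypothetical in-memory stand-in for the KCL's DynamoDB checkpoint table:
// maps shardId -> last checkpointed sequence number.
class CheckpointStore {
  private var table = Map.empty[String, String]

  // Record progress for a shard (the KCL does this periodically,
  // driven by kinesisCheckpointInterval).
  def checkpoint(shardId: String, seqNum: String): Unit =
    table += (shardId -> seqNum)

  // On worker restart, resume after the stored position, if any.
  def lastCheckpoint(shardId: String): Option[String] = table.get(shardId)
}
```

Keeping this state outside the workers is what lets KCL worker threads fail, restart, or rebalance shards across hosts without reprocessing the whole stream from the beginning.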
- 21. One more thing: Amazon EMR now supports Apache Spark! EMR ships Spark 1.3.1 as of 2015/06/23.
- 22. One more thing: Amazon EMR now supports Apache Spark! Amazon Kinesis + Amazon EMR.