Apache Spark on EMR Yuyang Lan SmartNews Inc.

AWS meetup「Apache Spark on EMR」

Page 1: AWS meetup「Apache Spark on EMR」

Apache Spark on EMR

Yuyang Lan

SmartNews Inc.

Page 2: AWS meetup「Apache Spark on EMR」

Agenda

• Intro

• Recent Spark

• How we use Spark at SmartNews

• Best Practices

Page 3: AWS meetup「Apache Spark on EMR」

Who am I

• @y2_lan

• Engineer at SmartNews Inc. (AD team)

• Hacker, Data Engineer, Beer Lover

Page 4: AWS meetup「Apache Spark on EMR」

If you have any requests or problems, contact @kaiseh :)

Page 5: AWS meetup「Apache Spark on EMR」

About Apache Spark

maybe

just skip?

Page 6: AWS meetup「Apache Spark on EMR」

About Apache Spark: quick catch-up

RDD

action

transformations

Page 7: AWS meetup「Apache Spark on EMR」

Recent Spark at a glance

• Databricks Cloud goes public

• Spark 1.4.x

• Project Tungsten

• AWS adds support for Apache Spark on EMR

• …

Page 8: AWS meetup「Apache Spark on EMR」

Spark 1.4.x

• SparkR

• DataFrame API

• ML Pipeline

• Streaming UI

• …

Page 9: AWS meetup「Apache Spark on EMR」

Spark at SmartNews

• AD CTR Prediction ( Logistic Regression )

Page 10: AWS meetup「Apache Spark on EMR」
Page 11: AWS meetup「Apache Spark on EMR」

Spark at SmartNews

• Scoring articles by Kinesis + Spark Streaming

Page 12: AWS meetup「Apache Spark on EMR」

Spark at SmartNews

• Ad-Hoc Analysis, Faster (& Hive-compatible) SQL

Page 13: AWS meetup「Apache Spark on EMR」

Spark at SmartNews

• Realtime Stats by Kinesis + Spark Streaming

Page 14: AWS meetup「Apache Spark on EMR」

Spark at SmartNews

• ML experiments

• AD targeting

• User Clustering

• Recommendation

• …

Page 15: AWS meetup「Apache Spark on EMR」

Best Practices #1

• Should you use the default Spark on EMR?

• Yes, sure

• EMR 4.0 is great! (Released today?!)

• Hadoop 2.6 + Hive 1.0 + Spark 1.4.1

Page 16: AWS meetup「Apache Spark on EMR」

Best Practices #1

• Should you use the default Spark on EMR?

• Build your own only if you need a custom-built Spark

• Cutting Edge Version

• Native netlib-java ( mvn -Pnetlib-lgpl )

• Custom dependency version

• …

Page 17: AWS meetup「Apache Spark on EMR」

Best Practices #1

• Should you use the default Spark on EMR?

• Build your own only if you need a custom-built Spark

• --bootstrap-actions bootstrap.json
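
The bootstrap.json mentioned above could be generated along these lines. This is a minimal sketch, assuming the AWS CLI's JSON form for bootstrap actions (a list of objects with Name, Path, and Args); the S3 path, the script name, and the script arguments are hypothetical placeholders, not values from the talk.

```python
import json

# Hypothetical S3 location of a script that installs a custom-built Spark.
INSTALL_SCRIPT = "s3://my-bucket/bootstrap/install-custom-spark.sh"


def bootstrap_actions(script_path, args):
    """Return the list structure passed to --bootstrap-actions as JSON."""
    return [
        {
            "Name": "Install custom Spark",  # free-form label, an assumption
            "Path": script_path,
            "Args": list(args),
        }
    ]


def render_bootstrap_json(script_path, args):
    """Serialize the bootstrap actions so they can be saved as bootstrap.json."""
    return json.dumps(bootstrap_actions(script_path, args), indent=2)


if __name__ == "__main__":
    # The arguments are hypothetical; they depend entirely on your own script.
    print(render_bootstrap_json(INSTALL_SCRIPT, ["--spark-version", "1.4.1"]))
```

Saving the output as bootstrap.json and passing it as `--bootstrap-actions file://bootstrap.json` to `aws emr create-cluster` is one way to wire it in.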

Page 18: AWS meetup「Apache Spark on EMR」

Best Practices #1

• Should you use the default Spark on EMR?

• Build your own only if you need a custom-built Spark

• Remember to start SparkHistoryServer

Page 19: AWS meetup「Apache Spark on EMR」

Best Practices #2

• Run Spark on Yarn

• Use yarn-cluster mode to distribute drivers across the cluster

• Specify --jars and --files to distribute the necessary resources
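
A submission in yarn-cluster mode might be assembled as follows. This is only a sketch: the jar names, the class name, and the config file are hypothetical, and it relies on the Spark 1.x convention of passing `--master yarn-cluster` to spark-submit.

```python
def spark_submit_command(app_jar, main_class, jars=(), files=(), app_args=()):
    """Build a spark-submit argument list for yarn-cluster mode."""
    cmd = ["spark-submit", "--master", "yarn-cluster", "--class", main_class]
    if jars:
        # Extra jars are shipped to the driver and executors, comma-separated.
        cmd += ["--jars", ",".join(jars)]
    if files:
        # Plain files are placed in each container's working directory.
        cmd += ["--files", ",".join(files)]
    cmd.append(app_jar)
    cmd += list(app_args)
    return cmd


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    print(" ".join(spark_submit_command(
        "ctr-prediction.jar",
        "com.example.CtrPrediction",
        jars=["lib/a.jar", "lib/b.jar"],
        files=["conf/app.properties"],
        app_args=["2015-07-01"],
    )))
```

Because the driver runs inside a YARN container in this mode, anything it reads from local disk has to be shipped this way rather than assumed to exist on the submitting machine.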

Page 20: AWS meetup「Apache Spark on EMR」

Best Practices #3

• Tuning Memory

• A CPU shortage only slows your program down, but a memory shortage makes it crash

• You can even set --executor-cores higher than the number of CPU cores

• Cache-able heap != JVM’s Xmx

• (normally about 50% of it)

Page 21: AWS meetup「Apache Spark on EMR」

Best Practices #3

• Tuning Memory

• A CPU shortage only slows your program down, but a memory shortage makes it crash

• Cache-able heap != JVM’s Xmx

Image from: http://0x0fff.com/spark-architecture/

Page 22: AWS meetup「Apache Spark on EMR」

Best Practices #3

• Tuning Memory

• A CPU shortage only slows your program down, but a memory shortage makes it crash

• Cache-able heap != JVM’s Xmx

• spark.yarn.executor.memoryOverhead

• spark.executor.memory

• spark.storage.memoryFraction

• …

• Split your executors if HEAP_SIZE > 64 GB (to keep GC pauses manageable)

• -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
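
The gap between the executor heap and the memory actually available for caching can be estimated with a little arithmetic. The sketch below assumes the legacy Spark 1.x memory model; the default values used (storage fraction 0.6, a safety fraction of 0.9, and a YARN overhead of 10% with a 384 MB floor) are assumptions to verify against your Spark version, not figures from the slides.

```python
def executor_memory_budget(executor_memory_mb,
                           storage_fraction=0.6,    # spark.storage.memoryFraction (assumed default)
                           safety_fraction=0.9,     # assumed internal safety margin
                           overhead_fraction=0.10,  # assumed default overhead ratio
                           min_overhead_mb=384):    # assumed minimum overhead
    """Estimate cacheable heap and YARN container size for one executor."""
    cacheable_mb = executor_memory_mb * storage_fraction * safety_fraction
    overhead_mb = max(min_overhead_mb, executor_memory_mb * overhead_fraction)
    return {
        "cacheable_mb": cacheable_mb,
        "overhead_mb": overhead_mb,
        # What YARN must grant: heap (spark.executor.memory) plus off-heap overhead.
        "yarn_container_mb": executor_memory_mb + overhead_mb,
    }


if __name__ == "__main__":
    budget = executor_memory_budget(8192)
    for name, value in sorted(budget.items()):
        print("{}: {:.0f} MB".format(name, value))
```

With these assumed defaults, roughly 54% of the heap is cacheable, which matches the "about 50%" rule of thumb, and the YARN container has to be larger than spark.executor.memory alone.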

Page 23: AWS meetup「Apache Spark on EMR」

Best Practices #4

• If your ML job is really CPU-bound

• Try using OpenBLAS + netlib.NativeSystemBLAS

Page 24: AWS meetup「Apache Spark on EMR」

Best Practices #4

• Try using OpenBLAS + netlib.NativeSystemBLAS

4 to 5 times faster

Page 25: AWS meetup「Apache Spark on EMR」

Best Practices #5

• Minimize data shuffle

• Prefer reduceByKey over groupByKey+map

• RDD.repartition(NUM_OF_CORES) before cache

• Filter as early as possible
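
Why reduceByKey shuffles less than groupByKey followed by a map can be illustrated without a cluster. The following is a pure-Python model of the two strategies, not Spark code: partitions are plain lists, and the "shuffle" is simply the list of records that would cross the network.

```python
from collections import defaultdict


def sum_with_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def sum_with_reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


if __name__ == "__main__":
    parts = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("b", 1), ("b", 1)]]
    print(sum_with_group_by_key(parts))
    print(sum_with_reduce_by_key(parts))
```

Both functions return the same totals, but the second sends at most one record per key per partition, so the saving grows with the number of repeated keys.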

Page 26: AWS meetup「Apache Spark on EMR」

Best Practices #5

• Minimize data shuffle

Page 27: AWS meetup「Apache Spark on EMR」

Best Practices #6

• Prefer DataFrame APIs over low level RDD APIs

• Better DAG Optimization

• Same interface & same performance

Page 28: AWS meetup「Apache Spark on EMR」

Best Practices #7

• Use Kryo serialization if possible

--conf spark.serializer=org.apache.spark.serializer.KryoSerializer

Page 29: AWS meetup「Apache Spark on EMR」

Best Practices #8

• Pick a notebook tool (IPython, Zeppelin, or ?)

• For memo, sharing, visualisation

• Convenient for non-engineer users

Page 30: AWS meetup「Apache Spark on EMR」

Best Practices #9

• Multiple small & task-driven EMR clusters

Page 31: AWS meetup「Apache Spark on EMR」

Best Practices #10

• Use dynamic scaling with Spark Streaming

• spark.dynamicAllocation.enabled = true

• spark.shuffle.service.enabled = true

• Be careful if you rely on cached data
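
The two settings above can be turned into spark-submit flags as sketched below. The executor bounds and the cached-executor idle timeout are additions for illustration: I believe `spark.dynamicAllocation.minExecutors`, `maxExecutors`, and `cachedExecutorIdleTimeout` exist in Spark 1.4, but treat the property names and value formats as assumptions to check against the configuration docs for your version.

```python
def dynamic_allocation_conf(min_executors, max_executors, cached_idle_timeout=None):
    """Collect the properties that enable dynamic allocation on YARN."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files alive
        # after an executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    if cached_idle_timeout is not None:
        # Assumed property: controls when idle executors holding cached
        # blocks may be removed, which drops their cached data.
        conf["spark.dynamicAllocation.cachedExecutorIdleTimeout"] = str(cached_idle_timeout)
    return conf


def to_submit_args(conf):
    """Render a conf dict as repeated --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(dynamic_allocation_conf(2, 20))))
```

The caution about cached data follows from how scaling down works: releasing an executor discards whatever it had cached, so those partitions have to be recomputed on the next access.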

Page 32: AWS meetup「Apache Spark on EMR」

Best Practices #11

• Use Spot Instance

• Be more aggressive with your bid price :p

• BID_PRICE != MONEY_TO_PAY

• Check the Spot Instance pricing history

• Find an instance type with a relatively stable price

• Often a previous-generation instance?

• Prepare for failure; don’t use them for mission-critical jobs
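
The point that BID_PRICE != MONEY_TO_PAY can be made concrete with a toy model. The sketch below is a simplification, assuming you are charged the market price for each hour and lose the instance as soon as the market price exceeds your bid; the price series is invented for illustration and is not real pricing data.

```python
def spot_cost(hourly_market_prices, bid_price):
    """Return (total_paid, hours_run) for one spot instance.

    Simplified model: each hour is billed at the market price, and the
    instance is terminated once the market price rises above the bid.
    """
    total_paid = 0.0
    hours_run = 0
    for price in hourly_market_prices:
        if price > bid_price:
            break  # outbid: the instance is reclaimed
        total_paid += price
        hours_run += 1
    return total_paid, hours_run


if __name__ == "__main__":
    prices = [0.05, 0.06, 0.30, 0.05]  # invented hourly prices in USD
    for bid in (0.10, 0.50):
        paid, hours = spot_cost(prices, bid)
        print("bid={:.2f} ran {} h, paid {:.2f}".format(bid, hours, paid))
```

In this model a higher bid does not raise the price of the calm hours; it only buys survival through the spike, which is the argument for bidding aggressively on instance types whose price history is stable.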

Page 33: AWS meetup「Apache Spark on EMR」

Further Reading

• To use Spark Streaming in Production

• http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx

Page 34: AWS meetup「Apache Spark on EMR」

Further Reading

• If you’re interested in new ML pipelines

• http://www.slideshare.net/SparkSummit/building-debugging-and-tuning-spark-machine-leaning-pipelinesjoseph-bradley

Page 35: AWS meetup「Apache Spark on EMR」

Thanks!

We’re hiring!

http://about.smartnews.com/ja/careers/

iOS Engineer / Android Engineer / Web Application Engineer

/ Productivity Engineer / Machine Learning / Natural Language Processing Engineer

/ Growth Hack Engineer / Server-Side Engineer

/ Ad Engineer…