Juantomás García - Open Sistemas Kappa Architecture 2.0 DataScience Lab, Odessa

DataScience Lab 2017_Kappa Architecture: How to implement a real-time streaming data analytics engine Juantomás García

Доброго ранку Одеса!!

(Dobroho ranku, Odesa: “Good morning, Odessa!”)

first

Who I am

Juantomás García

• Data Solutions Manager @ OpenSistemas

• GDE (Google Developer Expert) for Cloud

Others

• Co-author of the first Spanish free software book, “La Pastilla Roja”

• President of Hispalinux (Spanish Linux User Group)

• Organizer of Machine Learning Spain and GDG Cloud Madrid

What’s Kappa Architecture?

On July 2, 2014, Jay Kreps coined the term Kappa Architecture in an article for O’Reilly Radar:

“Maybe we could call this the Kappa Architecture, though it may be too simple of an idea to merit a Greek letter.”

Who is Jay Kreps?

Jay has been involved in lots of projects:

✓ Author of the essay “The Log: What every software engineer should know about real-time data's unifying abstraction” (12/16/2013)

✓ Author of the book “I Heart Logs”

Who is Jay Kreps?

• Involved with projects such as:

✓ Apache Kafka

✓ Apache Samza

✓ Voldemort

✓ Azkaban

✓ Ex-LinkedIn

✓ Now co-founder and CEO of Confluent

Usual Data Flow

Usual Data Flow

Usual Data Flow

Kappa Architecture Way
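
The core idea in Kreps's article is that a single append-only log feeds one stream-processing job, and that reprocessing means replaying the retained log through a new version of the job into a new output, then switching readers over. The sketch below illustrates that idea only. It uses an in-memory list as a stand-in for a Kafka topic, and every name in it (`Log`, `run_job`, `max_speed_v1`, `max_speed_v2`, the event fields) is hypothetical, not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Log:
    """Append-only event log; a stand-in for a Kafka topic with long retention."""
    events: List[dict] = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> List[dict]:
        return self.events[offset:]


def run_job(log: Log, update: Callable[[Dict[str, float], dict], None]) -> Dict[str, float]:
    """Build a fresh serving view by replaying the whole log from offset 0."""
    view: Dict[str, float] = {}
    for event in log.read_from(0):
        update(view, event)
    return view


def max_speed_v1(view: Dict[str, float], event: dict) -> None:
    # Version 1 of the job: track the maximum speed seen per car.
    car = event["car"]
    view[car] = max(view.get(car, 0.0), event["speed_kmh"])


def max_speed_v2(view: Dict[str, float], event: dict) -> None:
    # Version 2 (hypothetical fix): ignore implausible sensor readings.
    if event["speed_kmh"] <= 300:
        max_speed_v1(view, event)


log = Log()
for e in [
    {"car": "A", "speed_kmh": 90.0},
    {"car": "B", "speed_kmh": 120.0},
    {"car": "A", "speed_kmh": 999.0},  # sensor glitch
]:
    log.append(e)

view_v1 = run_job(log, max_speed_v1)  # the view currently being served
view_v2 = run_job(log, max_speed_v2)  # reprocess: same log, new code, new output
serving_view = view_v2                # switch readers over; view_v1 can be dropped
```

There is no separate batch layer here: the same code path handles live events and historical reprocessing, which is the simplification Kappa offers over a two-path design.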

Tools we use

Tools we use

Tools we use

Some Favorite Spark Features

✓ If you have a schema, Spark SQL is perfect.

✓ Spark Streaming works very well with Spark and with almost every streaming source.

✓ Structured queries will be a huge advance.

✓ We love Scala, the spirit of Spark.

Some Favorite Spark Features

We love code like this:

A Real Use Case

• One of our clients wanted to monitor all of a car's information via OBD II.

• OBD II is an interface to the car's electronics.

• Our client developed an app that reads all the car information through OBD II over Bluetooth.

A Real Use Case

First Problems

• We needed to scale the REST interfaces. There were too many requests.

• MySQL doesn't scale.

• The client wanted to run expensive queries in real time.

Some metrics

Architecture v 2.0

Architecture v 3.0

Now we’re more flexible!!

We can have queries like:

“Which drivers are not clients of gas brand X, are low on fuel, and are near a brand X gas station? If any match, send them a notification with a discount coupon and a link to the route.”
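
A query like the one on this slide can be read as a filter over the stream of car readings. The sketch below is one possible shape for it in plain Python, not the code used in the project. The `Reading` fields, the 15% fuel threshold, the 2 km radius, and the sample coordinates are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

Station = Tuple[float, float]  # (latitude, longitude) of a brand X station


@dataclass
class Reading:
    driver_id: str
    fuel_pct: float   # remaining fuel, 0-100 (hypothetical field)
    lat: float
    lon: float
    is_client: bool   # already a client of brand X?


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def coupon_targets(
    readings: Iterable[Reading],
    stations: List[Station],
    max_fuel_pct: float = 15.0,  # assumed meaning of "low on fuel"
    max_km: float = 2.0,         # assumed meaning of "near"
) -> Iterator[Tuple[str, Station]]:
    """Yield (driver_id, nearest_station) for every reading that matches the query."""
    if not stations:
        return
    for r in readings:
        if r.is_client or r.fuel_pct > max_fuel_pct:
            continue
        nearest = min(stations, key=lambda s: haversine_km(r.lat, r.lon, s[0], s[1]))
        if haversine_km(r.lat, r.lon, nearest[0], nearest[1]) <= max_km:
            yield r.driver_id, nearest


stations = [(40.4168, -3.7038)]
readings = [
    Reading("d1", 10.0, 40.4170, -3.7040, False),  # low fuel, close, not a client
    Reading("d2", 10.0, 40.4170, -3.7040, True),   # already a client
    Reading("d3", 80.0, 40.4170, -3.7040, False),  # plenty of fuel
    Reading("d4", 5.0, 41.3800, 2.1700, False),    # low fuel but far away
]
targets = list(coupon_targets(readings, stations))
```

In a streaming deployment the `readings` iterable would be the live event stream, and each yielded pair would trigger the notification with the coupon and the route link.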

Takeaways

• Kappa architecture is not a silver bullet, but it helps with a lot of solutions.

• Kafka + Spark Streaming are our favorite tools.

• There are a lot of possible improvements:

✓ OLAP like Apache Druid

✓ Graph databases like Neo4j

✓ Kafka Streams and compacted logs

✓ Apache Beam

✓ Scio Scala bindings
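
For the compacted-logs item in the list above: a compacted Kafka topic keeps at least the latest record for each key, and a record with a null value (a tombstone) marks the key as deleted. The sketch below shows that retention rule on an in-memory list. It is a simplification under stated assumptions: the keys and values are made up, and tombstones are dropped immediately, whereas Kafka retains them for a configurable period first.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); a value of None is a tombstone


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, in original log order.

    Keys whose latest record is a tombstone are removed entirely.
    """
    last_offset: Dict[str, int] = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset  # later records overwrite earlier offsets
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset and value is not None
    ]


log: List[Record] = [
    ("car-1", "fuel=40"),
    ("car-2", "fuel=75"),
    ("car-1", "fuel=12"),  # supersedes the first car-1 record
    ("car-2", None),       # tombstone: car-2 is deleted
    ("car-3", "fuel=60"),
]
compacted = compact(log)
```

A compacted log still lets a new consumer rebuild the current state per key by reading from the beginning, without the log growing with every update.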

Takeaways: Apache Beam

Takeaways: Scio Scala Binding

Think Big

Think Big

• Forget legacy architectures

• Forget old tools

• Use light technologies / serverless

• Use pieces of Lego

• Mix different technologies from diverse sources

Spark Use Cases

Not-to-do list

• Avoid installing and configuring a server, or even a VM.

• Avoid installing tools; use containers and/or cloud services instead.

• In general: ask whether there is a simpler way to do it that needs less effort.

Spark Use Cases

Architecture & Tools

• Using cloud services is a no-brainer decision.

• Git + Containers + Kubernetes

• Use the best language* for each module.

• Use notebooks: Jupyter, Zeppelin, DSX

(*) Even Java might be an option, though that is unlikely

Google Cloud Version

Kappa Architecture

Questions?

• email: [email protected]

• twitter: @juantomas

This talk has a free lifetime warranty for questions: if you have any questions or concerns about this talk, feel free to contact me anytime.

Selfie time: if you liked the talk, just smile while I take the selfie ;-)

Kappa Architecture

велике спасибі (Thank you very much)