Lambda Architecture Platform Using SQL Sep 13 2014 HadoopCon 2014 Taiwan TAGOMORI Satoshi (@tagomoris)


Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL


Taipei

Topics

* About Me & LINE
* Data analytics workloads
  * Batch processing
  * Stream processing
* Lambda architecture
* Lambda architecture using SQL

Norikra: Stream processing with SQL
13:30-14:20, 4F

@tagomoris

Satoshi Tagomori (田籠 聡)
LINE Corporation, Analytics Platform Team

Tokyo

LINE Offices

* Tokyo HQ
* Spain
* Thailand
* Taipei
* USA
* Korea

LINE is born! JUNE 23, 2011

Data Analytics Workload

Part 01

Various Data Analytics Workloads

* Reports
  * Monthly/daily reports
  * Hourly (or shorter) news
* Real-time metrics
  * Automatically updated reports/graphs
  * Alerts for abuse of services, overload, ...

Batch Processing

* Hadoop
  * MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...)
  * For reports
* MPP Engines
  * Cloudera Impala, Apache Drill, Facebook Presto, ...
  * For interactive analysis
  * For reports over shorter windows

Stream Processing

* Apache Storm
  * Incubator project
  * “Distributed and fault-tolerant realtime computation”
* Norikra
  * by tagomoris
  * Non-distributed “Stream processing with SQL”

Why Stream Processing?

* Lower latency
  * Realtime metrics
  * Short-term prompt reports
* Less computing power
  * 10Mbps for batch processing: 100GB/day
  * 10Mbps for stream processing: 1 server
* No query schedule management
  * Once a query is registered, it runs forever
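
The "10Mbps: 100GB/day" figure above can be checked with a little arithmetic. The sketch below assumes decimal units (1 Mbps = 1,000,000 bits/s, 1 GB = 1,000,000,000 bytes) and a rate sustained for a full day; the function name is illustrative only.

```python
# Back-of-the-envelope check of the "10Mbps ~ 100GB/day" figure.
# Assumes decimal units and a rate sustained for 24 hours.
BITS_PER_BYTE = 8
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(mbps: float) -> float:
    """Convert a sustained rate in megabits/second to gigabytes/day."""
    bytes_per_second = mbps * 1_000_000 / BITS_PER_BYTE
    return bytes_per_second * SECONDS_PER_DAY / 1_000_000_000


volume = daily_volume_gb(10)
print(f"{volume:.0f} GB/day")  # 108 GB/day, i.e. roughly 100GB/day
```

A batch job has to scan that whole daily volume at once, while a stream processor only ever has to keep up with the 10Mbps arrival rate.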

Disadvantages of Stream Processing

* Queries must be written before the data arrives
  * There should be another way to query past data
* Queries cannot be run twice
  * All results are lost when any error occurs
  * The data is already gone when bugs are found
* Out-of-order events break results
  * Queries based on recorded time? Or on arrival time?
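
The recorded-time versus arrival-time question can be made concrete with a small sketch. The event fields (`recorded`, `arrived`, in seconds) and the helper names are hypothetical; the point is only that one delayed event moves to a different 10-minute window depending on which timestamp is used.

```python
from collections import Counter

WINDOW_SECONDS = 600  # 10-minute tumbling windows


def window_start(ts: int) -> int:
    """Return the start of the window that contains timestamp ts."""
    return ts - ts % WINDOW_SECONDS


def count_per_window(events, time_field):
    """Count events per window, keyed by the chosen timestamp field."""
    return dict(Counter(window_start(e[time_field]) for e in events))


# Hypothetical events: the second one was recorded at 590s but arrived at 605s.
events = [
    {"recorded": 10, "arrived": 12},
    {"recorded": 590, "arrived": 605},
    {"recorded": 610, "arrived": 611},
]

by_recorded = count_per_window(events, "recorded")  # {0: 2, 600: 1}
by_arrival = count_per_window(events, "arrived")    # {0: 1, 600: 2}
print(by_recorded, by_arrival)
```

Both answers are defensible, but they differ, so a platform has to state which one a given report uses.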

Lambda Architecture

Part 02

“The Lambda-architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.”

Lambda Architecture

http://lambda-architecture.net/

Lambda Architecture: Overview

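[Diagram: new data goes to both the batch layer (master dataset) and the speed layer (real-time view); the serving layer holds the view built by the batch layer; a query reads the view and the real-time view.]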
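
The layers in the overview can be sketched in a few lines. This is a toy, in-memory illustration under simplifying assumptions (a single process, counting events per key, a batch run that is instantaneous); the class and method names are made up for the example and do not come from any library.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture: counts events per key."""

    def __init__(self):
        self.master_dataset = []        # batch layer: append-only log of all data
        self.batch_view = Counter()     # serving layer: view built by batch runs
        self.realtime_view = Counter()  # speed layer: covers data since last batch

    def ingest(self, key):
        """New data goes to both the batch layer and the speed layer."""
        self.master_dataset.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the view from the master dataset, then reset the speed layer."""
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        """A query merges the batch view with the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


platform = LambdaCounter()
for path in ["/api/a", "/api/a", "/api/b"]:
    platform.ingest(path)
before_batch = platform.query("/api/a")  # served by the speed layer only
platform.run_batch()
platform.ingest("/api/a")
after_batch = platform.query("/api/a")   # batch view plus real-time view
print(before_batch, after_batch)
```

Because the batch view is always recomputed from the master dataset, a wrong real-time result is corrected by the next batch run.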

Twitter Summingbird

* Lambda architecture library
  * Batch mode: Scalding on Hadoop MapReduce
  * Realtime mode: Storm

Word counting by Summingbird (Scala):

    def wordCount[P <: Platform[P]]
      (source: Producer[P, String], store: P#Store[String, Long]) =
        source.flatMap { sentence =>
          toWords(sentence).map(_ -> 1L)
        }.sumByKey(store)

https://github.com/twitter/summingbird
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird

What Lambda Architecture Provides

* Replayable queries
  * Redo queries anytime if the results of the speed layer are broken
* Accurate results on demand
  * Prompt reports in the speed layer, based on arrival time
  * Fixed reports in the batch layer, based on recorded time

... And many more benefits of stream processing

Why Don’t All of Us Use It?

* Storm doesn’t fit many use cases
  * Storm requires too many computing resources to deploy
  * Summingbird requires many steps to deploy
* Many directors/analysts don’t write Scala/Java
  * The Summingbird DSL is not easy enough for non-professional people

Lambda Architecture Using SQL

Part 03

Existing Hadoop Platform

[Diagram: new data, Fluentd, HDFS; queries run through Hive and Presto.]

Norikra

* Schema-less stream processing with SQL
* “Norikra is a open source server software provides "Stream Processing" with SQL, written in JRuby, runs on JVM, licensed under GPLv2.”

    SELECT
      path,
      COUNT(1, status=200) AS success_count,
      COUNT(1, status=500) AS server_error_count,
      COUNT(*) AS count
    FROM AccessLog.win:time_batch(10 min, 0L)
    WHERE service='myservice' AND path LIKE '/api/%'
    GROUP BY path

http://norikra.github.io/
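
To show what the Norikra query above computes, here is a plain-Python sketch of the same aggregation: 10-minute batches, a filter on service and path, and conditional counts per path. It is not how Norikra works internally. The event dictionaries and the `time` field (seconds) are assumptions for the example, and real stream windows are driven by arrival time rather than by a field in the event.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # mirrors time_batch(10 min, 0L)


def aggregate(events):
    """Per (window, path): success, server error and total counts."""
    result = defaultdict(
        lambda: {"success_count": 0, "server_error_count": 0, "count": 0}
    )
    for e in events:
        # WHERE service='myservice' AND path LIKE '/api/%'
        if e["service"] != "myservice" or not e["path"].startswith("/api/"):
            continue
        window = e["time"] - e["time"] % WINDOW_SECONDS
        row = result[(window, e["path"])]  # GROUP BY path, per window
        row["count"] += 1                  # COUNT(*)
        if e["status"] == 200:             # COUNT(1, status=200)
            row["success_count"] += 1
        elif e["status"] == 500:           # COUNT(1, status=500)
            row["server_error_count"] += 1
    return dict(result)


# Hypothetical access log events.
events = [
    {"time": 5, "service": "myservice", "path": "/api/users", "status": 200},
    {"time": 30, "service": "myservice", "path": "/api/users", "status": 500},
    {"time": 40, "service": "myservice", "path": "/api/users", "status": 404},
    {"time": 50, "service": "other", "path": "/api/users", "status": 200},
    {"time": 700, "service": "myservice", "path": "/api/users", "status": 200},
]
rows = aggregate(events)
print(rows)
```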

Added-on Lambda Architecture Platform

[Diagram: the existing platform (new data, HDFS, Hive and Presto queries) with Norikra queries added.]

“Pseudo Lambda” Architecture Using SQL

Lambda architecture platform with almost the same queries

Batch query (Hive):

    SELECT
      path,
      COUNT(IF(status=200,1,NULL)) AS success_count,
      COUNT(IF(status=500,1,NULL)) AS server_error_count,
      COUNT(*) AS count
    FROM AccessLog
    WHERE service='myservice' AND path LIKE '/api/%'
      AND timestamp >= '2014-09-13 10:40:00'
      AND timestamp < '2014-09-13 10:50:00'
    GROUP BY path

Stream query (Norikra):

    SELECT
      path,
      COUNT(1, status=200) AS success_count,
      COUNT(1, status=500) AS server_error_count,
      COUNT(*) AS count
    FROM AccessLog.win:time_batch(10 min, 0L)
    WHERE service='myservice' AND path LIKE '/api/%'
    GROUP BY path
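
The two queries above differ only in how conditional counts and the time range are written. One possible way to keep them in sync, sketched here as an assumption rather than something the talk describes, is to render both from a single definition. The function and constant names are hypothetical, and the plain string formatting is only suitable for trusted, hand-written inputs.

```python
# Shared definition of the report: counter name -> condition, plus the filter.
COUNTERS = {"success_count": "status=200", "server_error_count": "status=500"}
FILTER = "service='myservice' AND path LIKE '/api/%'"


def batch_query(start: str, end: str) -> str:
    """Hive-style query over a fixed range of recorded time."""
    counts = ", ".join(
        f"COUNT(IF({cond},1,NULL)) AS {name}" for name, cond in COUNTERS.items()
    )
    return (
        f"SELECT path, {counts}, COUNT(*) AS count "
        f"FROM AccessLog "
        f"WHERE {FILTER} AND timestamp >= '{start}' AND timestamp < '{end}' "
        f"GROUP BY path"
    )


def stream_query(window: str = "10 min") -> str:
    """Norikra-style query over a repeating time_batch window."""
    counts = ", ".join(
        f"COUNT(1, {cond}) AS {name}" for name, cond in COUNTERS.items()
    )
    return (
        f"SELECT path, {counts}, COUNT(*) AS count "
        f"FROM AccessLog.win:time_batch({window}, 0L) "
        f"WHERE {FILTER} "
        f"GROUP BY path"
    )


print(batch_query("2014-09-13 10:40:00", "2014-09-13 10:50:00"))
print(stream_query())
```

Writing each query by hand, as the slides do, is just as valid; the sketch only shows how small the difference between the dialects is.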

“Pseudo Lambda” Architecture Using SQL

* SQL dialects are easy to learn!
  * Standard SQL, Hive, Presto, Impala, Drill, ... + Norikra
* For non-professional people too!
* SQL queries are very easy to write twice!

Use Cases in LINE

* Prompt reports for the Ads service
  * Short-term prompt reports by Norikra
  * Daily fixed reports by Hive
* Summary of application server error logs
  * Aggregate error logs for alerting by Norikra
  * Check details with Hive, Presto (or grep!)

See you later for details!

TMTOWTDI
“There’s more than one way to do it.”

- Perl programming language

SHARE
What I want & What I’m doing!

- tagomoris

Q & A