Streaming Platform Apache Kafka
Data Platform Division
Data Infrastructure Team
Byeongsu Kang
Intro
Kafka in the Data Infrastructure Team
Kafka Summit 2017 NY:
Applying What We Learned
Intro
We want real-time statistics on product purchase conversion rates
We want to aggregate advertising costs in real time
We want to run marketing based on the state of currently connected users
We want to know whether a product is currently sold out
Figuring out how to support these business needs technically is a growing concern.
Off to absorb some foreign know-how...
Systems Track:
Data Processing at LinkedIn with Apache Kafka
Kafka in the Enterprise: What if it Fails?
Apache Kafka Core Internals: A Deep Dive
Simplifying Omni-Channel Retail at Scale
How to Lock Down Apache Kafka and Keep Your Streams Safe
Running Hundreds of Kafka Clusters with 5 People
Hardening Kafka for New Use Cases with Venice
Introducing Exactly Once Semantics in Apache Kafka

Streams Track:
Portable Streaming Pipelines with Apache Beam
The Best Thing Since Partitioned Bread: Rethinking Stream Processing with Apache Kafka's New Streams API
Microservices with Kafka: An Introduction to Kafka Streams with a Real-Life Example
Hanging Out with Your Past Self in VR: Time-Shifted Avatar Replication Using Kafka Streams
Building Advanced Streaming Applications Using the Latest from Apache Flink and Kafka
The Data Dichotomy: Rethinking Data & Services with Streams
Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming
Scalable Real-time Complex Event Processing at Uber

Pipelines Track:
Capture the Streams of Database Changes
Billions of Messages a Day - Yelp's Real-time Data Pipeline
California Schemin'! How the Schema Registry Has Ancestry Basking in Data
Achieving Predictability and Compliance with the Data Distribution Hub at Bank of New York Mellon
Every Message Counts: Kafka as Foundation for Highly Reliable Logging at Airbnb
The Source of Truth: Why the New York Times Stores Every Piece of Content Ever Published in Kafka
Single Message Transformations Are Not the Transformations You're Looking For
Cloud Native Data Streaming Microservices with Spring Cloud and Kafka
How to ingest service databases in real time
Microservices built on Kafka
An introduction to Kafka Streams
And beyond those: case studies on operating systems reliably,
and ways to combine Hadoop data with real-time events
The details, and how we are applying them in our team, come later.
First, a quick introduction to Kafka for those who are new to it.
Apache Kafka
publish / subscribe
distributed system
stable store
low latency
high throughput
real-time processing
Apache Kafka
[Diagram sequence: Producers publish messages to the Broker, e.g. (topic: syrup, data: Cats), (topic: 11st, data: e-cigarette), (topic: OCB, data: 500p). The Broker persists every message to Disk, and the Consumers subscribed to each topic independently receive its messages.]
Apache Kafka
Immutable event logs are processed as streams.
Producers and Consumers have no dependency on each other.
Because it writes to disk, it is durable, and you can rewind when something goes wrong.
Low latency on the order of milliseconds is guaranteed.
How can it be fast while using disk? The operating system uses the Page Cache to make disk I/O efficient.
Apache Kafka
Looking at memory usage, 165 GB of the 190 GB is used as Page Cache
Apache Kafka
CPU I/O wait is also nearly nonexistent
Processing: ETL is possible with Kafka Streams.
Store: a durable, disk-backed store.
Pub/Sub: data can be pushed in and pulled out freely.
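The pub/sub-on-a-log model described above can be sketched as a toy in-memory broker. This is an illustration of the idea only, not the real Kafka client API; all class and method names here are made up:

```python
# Toy model of Kafka's pub/sub log. A topic is an append-only log; each
# consumer tracks its own offset, so producers and consumers stay decoupled.

class ToyBroker:
    def __init__(self):
        self.logs = {}  # topic -> append-only list of messages ("disk")

    def produce(self, topic, data):
        self.logs.setdefault(topic, []).append(data)

    def consume(self, topic, offset):
        """Return (messages, new_offset) for everything after `offset`."""
        log = self.logs.get(topic, [])
        return log[offset:], len(log)

broker = ToyBroker()
broker.produce("syrup", "cats")
broker.produce("11st", "e-cigarette")

# Two independent consumers of the same topic, each with its own offset.
msgs_a, off_a = broker.consume("11st", 0)
msgs_b, off_b = broker.consume("11st", 0)

broker.produce("11st", "flowerpot")
more_a, off_a = broker.consume("11st", off_a)  # consumer A catches up alone
```

Because each consumer only advances its own offset, a slow or restarted consumer can always rewind and re-read, which is the "you can recover when something goes wrong" property above.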
Kafka in the Data Infrastructure Team
App log
Web log
Server log
DB data
To make Data Play possible...
App log
Web log
Server log
DB data
The primitive era
FTP
What if an FTP transfer fails midway? Delete the partially transferred data from Hadoop and resend it.
What if a file isn't created on time? The cron job won't pick it up, so rerun it by hand...
What about real-time logs such as App and Web logs? Some kind of solution was needed.
App log
Web log
Server log
DB data
Kafka was introduced in 2011
Failed transfers can be resent from the point of failure
No more dumping to files and reading them back; events are sent the moment they occur
App and Web client logs can also be loaded event by event
Today most logs are ingested through Kafka.
On average, the Main Cluster currently handles:
38,000 msgs/sec, 3.3 billion msgs/day
IN: 92 MB/sec, 8 TB/day
OUT: 310 MB/sec, 25 TB/day
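The per-second and per-day figures are mutually consistent; a quick back-of-the-envelope check:

```python
# Sanity-check the Main Cluster figures: per-second rates x 86,400 seconds/day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

msgs_per_day = 38_000 * SECONDS_PER_DAY              # ~3.3 billion msgs/day
in_tb_per_day = 92 * SECONDS_PER_DAY / 1_000_000     # MB -> TB (decimal units)
out_tb_per_day = 310 * SECONDS_PER_DAY / 1_000_000

print(round(msgs_per_day / 1e9, 1))  # ~3.3
print(round(in_tb_per_day, 1))       # ~7.9 TB/day (the slide rounds to 8)
print(round(out_tb_per_day, 1))      # ~26.8 TB/day (~25 TiB in binary units)
```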
Kafka in the Data Infrastructure Team
[Cluster diagram: six clusters - Recopick (3 brokers), Global 11st (5), 11st (3), Main (10), Dev (5), Mirror (5) - spread across the Commerce AWS, Boramae, and Commerce Ilsan IDCs, with MirrorMaker replicating between them.]
Kafka in the Data Infrastructure Team
[Same cluster diagram with data flows: service logs flow into each IDC's cluster, and consumers include real-time recommendation, Hadoop, Monitoring, search indexing, and CEP.]
Building and operating these clusters alone brought plenty of pain:
ingestion status had to be tracked per IDC and per cluster,
when a problem occurred, we had to trace which side it was on,
during an outage we had to know the recovery point and whether recovery succeeded,
we needed a history of when each Topic was created,
and when an issue was raised, we had to prove what had happened...
Kafka in the Data Infrastructure Team
For operational convenience... we built integrated monitoring across IDCs
Kafka in the Data Infrastructure Team
Kafka status monitor
Kafka in the Data Infrastructure Team
Zookeeper data collection
Kafka in the Data Infrastructure Team
Integrated Transfer
Operations became easier. But...
How do we ingest service DBs in real time?
The real-time streaming environment is ready, yet only a handful of services actually use it...
Isn't using microservices together with Kafka the trend these days?
For real-time statistics, Hadoop has its limits...
Hoping to resolve these questions: Kafka Summit 2017 NY
Kafka Streams
Kafka Streams
The Kafka project's own streaming library
Released in 2016
It has major advantages over other streaming frameworks.
However, it was so new that it was still unproven.
Kafka Streams
Proven to some extent by a shining success story...
LINE Corp: applying Kafka Streams to an internal pipeline https://engineering.linecorp.com/en/blog/detail/80
Peak: 1,000,000 messages/sec
Kafka Consumer
[Diagram: a Producer sends two messages to topic 11st (data: e-cigarette, data: flowerpot). Two Consumers in the same consumer group (group-1) each receive one message, splitting the load.]
Scalability
Load balancing
Kafka Consumer
[Diagram: the Producer sends two more messages (data: bicycle, data: electric kickboard). One Consumer in group-1 dies, and the surviving Consumer takes over and receives both messages.]
High Availability
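The load-balancing and failover behavior shown in these diagrams comes from consumer-group partition assignment. A toy sketch of the idea (a simplification, not Kafka's actual group rebalance protocol):

```python
# Toy consumer-group rebalancing: a topic's partitions are divided among the
# live members of a group; when a member dies, its partitions are reassigned.

def assign(partitions, members):
    """Round-robin the partitions over the group members (simplified assignor)."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Two live consumers: the load is split (scalability, load balancing).
a = assign(partitions, ["consumer-1", "consumer-2"])

# consumer-2 dies: its partitions move to the survivor (high availability).
b = assign(partitions, ["consumer-1"])
```

Adding a consumer to the group scales consumption out; losing one simply shifts its partitions to the survivors, with no application code involved.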
Kafka Streams
[Diagram: built on the Consumer and Producer APIs, with local State, Joins, and Time Windows (1m, 2m, 3m, 4m).]
Kafka Streams
Things that used to play a similar role
Kafka Streams
What are Kafka Streams' advantages?
Kafka Streams
What are Kafka Streams' advantages?
It is a library.
No resource manager such as YARN is needed.
All you need is a server that can run a daemon.
The source is lightweight.
There is no need to implement H/A, scalability, or load balancing yourself; Kafka does it all for you.
Productivity improves.
Debugging is easy; you can focus purely on business logic.
With Kafka Streams you can build good streaming applications easily.
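To give a flavor of the State + Time Window building blocks above, here is a toy tumbling-window count in plain Python. The real thing would be the Java Kafka Streams DSL (roughly `groupByKey().windowedBy(...).count()`); this only illustrates the concept:

```python
from collections import defaultdict

# Toy tumbling 1-minute window count: the kind of stateful operation
# Kafka Streams provides out of the box (local State + Time Window).
WINDOW_MS = 60_000

def windowed_count(events):
    """events: (timestamp_ms, key) pairs -> {(window_start_ms, key): count}."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % WINDOW_MS  # align the event to its window
        counts[(window_start, key)] += 1
    return dict(counts)

# Two clicks land in the 0-60s window, one in the 60-120s window.
events = [(1_000, "click"), (59_000, "click"), (61_000, "click")]
result = windowed_count(events)
```

In Kafka Streams the window state lives in a fault-tolerant local store backed by a changelog topic, so a restarted instance recovers its counts automatically.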
Microservice
One repository, one million lines of code.
The monolithic app: Yelp main, 2011
2013
2017
They moved to microservices, with Kafka as the backbone.
More than 70 production services.
R&D spend was cut by a million dollars compared to 2013.
Code complexity went down, and downtime went down.
Monolithic service vs. Micro service
Services communicate with one another.
Services inevitably become complex.
A failure in one service spills over into others.
Can we reduce the system's complexity and its coupling?
Using Kafka as a backbone can solve this.
Data is sent and received asynchronously
[Diagram: services A-E exchange messages through the backbone, e.g. B - getItem(1), A - swimsuit, C - socks, goggles, swimsuit..., B - top10]
Data is sent and received asynchronously
[Diagram: the same services, with A - swimsuit and C - socks, goggles, swimsuit, ... still flowing]
If one service fails, the other services don't even notice.
And when it recovers, it can replay every request it missed.
A message queue dramatically reduces service complexity.
The Publish/Subscribe structure reduces coupling.
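The failure-isolation claim boils down to this: the log retains messages while a consumer is down, so the consumer replays what it missed when it comes back. A toy sketch (an illustration of the mechanism, not real Kafka code):

```python
# Toy failure isolation: the producer keeps publishing while a consumer is
# down; the retained log lets the consumer replay everything it missed.

log = []             # the broker's retained log for one topic
consumer_offset = 0  # the downstream service's committed position

def publish(msg):
    log.append(msg)  # the producer never blocks on the consumer

def consume():
    """Read everything published since the last committed offset."""
    global consumer_offset
    missed = log[consumer_offset:]
    consumer_offset = len(log)
    return missed

publish("order-1")
consume()                # service up: processes order-1

publish("order-2")       # service is DOWN here; the producer is unaffected
publish("order-3")

recovered = consume()    # service back up: replays order-2 and order-3
```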
How can we build microservices with Kafka?
Source: Netflix microservices guide https://www.youtube.com/watch?feature=youtu.be&v=OczG5FQIcXw&app=desktop
What is the unit of a microservice?
[Diagram: the Kafka Streams building blocks again: Consumer, Producer, State, Join, Time Windows (1m-4m).]
It can be built with Kafka Streams.
Getting data in and out of Kafka is easy,
and scalability,
load balancing,
and H/A come built in.
With Spring Cloud's Kafka binder it is easy to build.
Kafka solves quite a few of the hard problems of microservices.
Change Data Capture
11st Product DB
11st Order DB
Syrup User DB
Service DBs
11st Product DB
11st Order DB
Syrup User DB
Service DBs
Data Application
Data Analysis
Service DBs are not yet being ingested through Kafka.
Result:
PK  Data
A   2
B   2

Real-time transactions:
Update A 1 -> 2
Delete C
Insert B 1
Update B 2

DB
Unlike event logs, a DB stores only the final state after the changes.

/data/test:
A 1
A 2
B 1
B 2

Hadoop
Real-time transactions:
Update A 1 -> 2
Delete C
Insert B 1
Update B 2

Hadoop is not well suited to handling state changes.
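The slide's example can be replayed in code: applying the change stream as upserts and deletes yields the DB's final state, while blind appending (the Hadoop file) keeps every version. The prior state below (A=1, plus a made-up old value for C) is a hypothetical starting point for illustration:

```python
# Replay the slide's transaction log two ways:
#   1) apply each change to key-value state (what the DB ends up with)
#   2) blindly append values (what a Hadoop file of raw rows looks like)

changelog = [
    ("update", "A", 2),    # Update A 1 -> 2
    ("delete", "C", None), # Delete C
    ("insert", "B", 1),    # Insert B 1
    ("update", "B", 2),    # Update B 2
]

state = {"A": 1, "C": 9}   # hypothetical prior state; C's old value is made up
appended = []

for op, pk, value in changelog:
    if op == "delete":
        state.pop(pk, None)       # a delete removes the row from state
    else:
        state[pk] = value         # insert and update both upsert on the PK
    if value is not None:
        appended.append((pk, value))  # Hadoop-style append keeps every version

# state matches the DB "Result" table on the slide: {"A": 2, "B": 2}
# appended keeps every version and never applies the delete
```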
RDB Hadoop
Full dump batch, daily or hourly
So we...
RDB Hadoop
Data Analysis
Data Application
Full dump batch, daily or hourly
RDB Hadoop
I need the latest state...
The full dump alone takes 6 hours...
If the file dump is missing, we get a call and re-ingest
Full dump batch, daily or hourly
Can't we apply database changes in real time?
How do we extract the database's real-time transaction log?
If we can extract it, how do we ship the data while preserving order?
Which repository should we use instead of Hadoop?
The real-time state also needs to be visible from Hadoop...
Mountains to climb
Companies that can help us...
How do we extract the database's real-time transaction log?
Mountains to climb
Oracle: Redo/Undo Log parsers - LogMiner, OGG, XStreams, etc.
MySQL: Binlog parser - Open-Replicator Link: https://github.com/whitesock/open-replicator
If we can extract it, how do we ship the data while preserving order?
Mountains to climb
We can ingest the real-time log,
without loss,
while guaranteeing ordering between messages.
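Kafka guarantees ordering within a partition, so routing every change for a given row to the same partition by key preserves its order end to end. A toy of the hash-partitioner idea (the hash here is illustrative; Kafka's default partitioner actually uses murmur2 on the key):

```python
import hashlib

# Route messages to partitions by key: every change for the same primary key
# lands on the same partition, and order within a partition is preserved.
NUM_PARTITIONS = 3

def partition_for(key):
    """Stable key -> partition mapping (Kafka really uses murmur2 here)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, change in [("A", "insert"), ("B", "insert"),
                    ("A", "update"), ("A", "delete")]:
    partitions[partition_for(key)].append((key, change))

# All of row A's changes sit on one partition, in produce order.
a_changes = [c for plist in partitions.values() for (k, c) in plist if k == "A"]
```

This is why CDC pipelines key their messages by primary key (or send a whole table through one partition): per-key ordering survives even with many partitions.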
Which repository should we use instead of Hadoop?
Mountains to climb
The real-time state also needs to be visible from Hadoop...
Mountains to climb
Snapshot to HDFS,
or integrate with Hive
RDB -> Change Data Capture -> Hadoop
RDB -> Change Data Capture -> Hadoop
Data Application
Data Analysis
In data analysis, and in data applications,
the service DB's state can now be used in real time.
Now let's go back to the DI Cluster
We want real-time statistics on product purchase conversion rates
We want to aggregate advertising costs in real time
We want to run marketing based on the state of currently connected users
We want to know whether a product is currently sold out
Looking at the current state
[Architecture diagram: Sentinel, Rake Server (via Log Agent), and the Rake API produce into Kafka through the Router and MirrorMaker; the Collector and other streaming apps consume into Hadoop/Hive.]
Applying everything above...
[Updated architecture: the same producers, plus RDB sources captured via CDC (OGG), feed Kafka; consumers now also include Druid (real-time OLAP), HBase, and Redis (user state) alongside Hadoop/Hive.]
Working together to meet business needs
Can we turn a user who has clicked a product several times into an immediate purchase?
It would be great to serve the most profitable ads while watching real-time statistics...
What about marketing to users who are connected right now?
Working together to meet business needs
We are thinking and working hard so that these business requirements
can be solved inside the Data Infrastructure Cluster!
Let's raise the value of our data together.