Kinesis vs. Kafka
Kafka Deep Dive
Yifeng Jiang, Solutions Engineer, Hortonworks
Hortonworks Inc. 2011-2015. All Rights Reserved

Yifeng Jiang
  Solutions Engineer, Hortonworks
  HBase book author
  Twitter: @uprush

About Hortonworks
Customer Momentum
  556 customers (as of August 5, 2015)
  119 customers added in Q2 2015
  Publicly traded on NASDAQ: HDP
Hortonworks Data Platform
  Completely open multi-tenant platform for any app and any data
  Consistent enterprise services for security, operations, and governance
Partner for Customer Success
  Leader in open-source community, focused on innovation to meet enterprise needs
  Unrivaled Hadoop support subscriptions
Founded in 2011
Original 24 architects, developers, operators of Hadoop from Yahoo!
740+ employees
1350+ ecosystem partners
Hortonworks Data Platform (HDP)
Deploy on premises and cloud
Kinesis vs. Kafka
Amazon Kinesis -- Introduction
Amazon Kinesis is a fully managed, cloud-based service for real-time data processing over large, distributed data streams.

Apache Kafka -- Introduction
Messaging system
  Real-time
  Scalable to handle large data volumes
  Low latency
  Fault tolerant
Originated at LinkedIn
  Aimed at solving data movement across systems
  Written in Scala and Java
  Open source (Apache 2.0)
  Adopted at many companies

Kinesis vs. Kafka -- Features
Similar features
  Messaging system for large-scale real-time data processing
  High performance, highly scalable, low latency
  Fault tolerant
Differences
  Fully managed cloud service vs. OSS
  Data durability and performance trade-off
  Interface
  AWS service integration vs. OSS or single-platform (e.g., HDP) integration

Kinesis vs. Kafka -- Data Durability
Kinesis
  Synchronously replicates data across three facilities
  High durability for free
Kafka
  Replication across servers in the same DC/AZ; configurable minimum number of in-sync replicas and ACKs
  Asynchronously mirrors data across clusters in different datacenters / AZs
  Performance trade-off
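
The durability and performance trade-off on the Kafka side can be illustrated with a small model. This is a sketch only: it assumes the documented meaning of the producer setting `acks` and the broker/topic setting `min.insync.replicas`, and the function name `write_outcome` is hypothetical, not part of any Kafka API.

```python
def write_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Model how a partition leader answers a produce request.

    acks                -- producer setting: "0", "1" or "all"
    isr_size            -- current number of in-sync replicas, leader included
    min_insync_replicas -- minimum ISR size required for acks="all" writes
    """
    if acks == "0":
        # Fastest, least durable: the producer never waits for the broker.
        return "sent without waiting for an acknowledgement"
    if acks == "1":
        # The leader has the message; followers may not have copied it yet.
        return "acked by leader only"
    if acks == "all":
        if isr_size < min_insync_replicas:
            # Too few in-sync replicas: the write is refused rather than
            # accepted with weaker durability than requested.
            return "rejected: not enough in-sync replicas"
        return f"acked after all {isr_size} in-sync replicas have the message"
    raise ValueError(f"unknown acks value: {acks!r}")
```

Stronger settings wait on more replicas per write, which is where the latency cost comes from.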

Kinesis vs. Kafka -- Interface
Kinesis
  REST only
  Client library wraps REST API
Kafka
  Low-level API
  REST API available (wrapping the low-level API)
Impacts throughput and latency

Kinesis vs. Kafka -- Processing
Kafka
  Custom consumers
    Event monitoring and alerting use case
  Storm
    Fraud detection, simple aggregation
  Spark Streaming / Storm Trident
    Micro-batch, near real-time
  Camus
    Batch Hadoop ingestion
Kinesis
  KCL applications on EC2
  Storm
  Spark Streaming
  EMR for batch ingestion, e.g., write to S3

Kinesis vs. Kafka -- Deployment & Operation
Kafka
  HDP: almost one-click deploy with Ambari
  Basic monitoring with Ambari
  Expand and rebalance: partition assignment and consumer rebalance
  ZooKeeper can also be managed by Ambari
Kinesis
  Fully managed, one-click deploy
  CloudWatch monitoring
  Expand and rebalance: resharding a stream
  Easy operation
Kafka Deep Dive

Kafka -- Concepts
* ZK is used by Broker, Consumer
[Figure: a Producer and a Consumer connected to three brokers hosting a topic with 3 partitions and replication factor 2. Broker-0: P0.R0 (leader), P1.R0. Broker-1: P0.R1, P2.R1 (leader). Broker-2: P1.R2 (leader), P2.R2.]

Kafka -- Concepts
Topics
Partitions
  Offset
  Ordered
Replication
  Prevents data loss
  Follower replicas are never read from or written to by clients
  Does not increase throughput
  Tolerates (replication factor - 1) failures

[ambari-qa@c6401 bin]$ kafka-topics.sh --zookeeper c6401:2181 --describe --topic page_visits
Topic:page_visits  PartitionCount:4  ReplicationFactor:2  Configs:
  Topic: page_visits  Partition: 0  Leader: 1  Replicas: 0,1  Isr: 1,0
  Topic: page_visits  Partition: 1  Leader: 0  Replicas: 1,0  Isr: 0,1
  Topic: page_visits  Partition: 2  Leader: 1  Replicas: 0,1  Isr: 1,0
  Topic: page_visits  Partition: 3  Leader: 0  Replicas: 1,0  Isr: 0,1
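
To make the describe output easier to read, the sketch below parses the per-partition lines and reports partitions whose in-sync replica set (Isr) is smaller than the assigned replica set. The parsing is an assumption based only on the output format shown on this slide; the helper names `parse_partitions` and `under_replicated` are hypothetical.

```python
import re

DESCRIBE_OUTPUT = """\
Topic:page_visits PartitionCount:4 ReplicationFactor:2 Configs:
Topic: page_visits Partition: 0 Leader: 1 Replicas: 0,1 Isr: 1,0
Topic: page_visits Partition: 1 Leader: 0 Replicas: 1,0 Isr: 0,1
Topic: page_visits Partition: 2 Leader: 1 Replicas: 0,1 Isr: 1,0
Topic: page_visits Partition: 3 Leader: 0 Replicas: 1,0 Isr: 0,1
"""

# Matches only the per-partition lines, not the topic summary line.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _broker_ids(text):
    """Turn "0,1" into [0, 1]; an empty string gives an empty list."""
    return [int(part) for part in text.split(",") if part]


def parse_partitions(describe_text):
    """Return one dict per partition line found in the describe output."""
    rows = []
    for line in describe_text.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # summary line or blank line
        rows.append({
            "topic": match.group("topic"),
            "partition": int(match.group("partition")),
            "leader": int(match.group("leader")),
            "replicas": _broker_ids(match.group("replicas")),
            "isr": _broker_ids(match.group("isr")),
        })
    return rows


def under_replicated(rows):
    """Partitions where some assigned replica is missing from the ISR."""
    return [r["partition"] for r in rows if set(r["isr"]) != set(r["replicas"])]
```

For the output on this slide every partition has both replicas in sync, so `under_replicated` returns an empty list.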

Kafka -- Broker
Stores messages (logs) on local disk
  Messages are appended to log files
  Log retention: time and size based
Controller
  Cluster management
  Runs on each broker machine
  One leader, others followers
Leader Partition
  Broker that is the leader for certain partitions
Uses ZK for coordination

Kafka -- Producer
New Producer API in 0.8.2
  kafka-clients.jar
  New Java API
  Asynchronous mode by default
Creates a new message and publishes it to a topic and partition
  Takes topic, value, and optional key and partition id
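
How the optional key and partition id influence where a message lands can be sketched as follows. This is an illustration under assumptions, not the producer's real code: CRC32 is used as a stand-in hash (it is not the hash Kafka uses), and `choose_partition` is a hypothetical name.

```python
import zlib
from itertools import count
from typing import Optional

# Shared counter used to spread messages that carry no key.
_round_robin = count()


def choose_partition(num_partitions: int,
                     key: Optional[bytes] = None,
                     partition: Optional[int] = None) -> int:
    """Pick a partition for one message."""
    if partition is not None:
        # An explicit partition id always wins.
        if not 0 <= partition < num_partitions:
            raise ValueError("partition id out of range")
        return partition
    if key is not None:
        # Same key -> same partition, which keeps per-key ordering.
        # CRC32 is a stand-in here, not Kafka's actual hash function.
        return zlib.crc32(key) % num_partitions
    # No key and no partition id: rotate over the partitions.
    return next(_round_robin) % num_partitions
```

Sending all events of one user with the same key therefore keeps them in one partition, where their order is preserved.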

Kafka -- Producer API (0.8.2) Cont.
Original messages are partitioned and then split into batches
Each split batch is sent to the leader broker (and then replicated to the ISR)
Each send is acknowledged by the leader broker and/or all ISR
[Figure: an App hands messages m1..m5 to the Producer Lib, where the partitioner assigns them to partitions (p1, p2, p3) and splits them into batches; batches go to the leader brokers of a topic with 3 partitions and replication factor 2. Broker-0: P0.R0 (leader), P1.R0. Broker-1: P0.R1, P2.R1 (leader). Broker-2: P1.R2 (leader), P2.R2.]
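
The partition-then-batch flow can be sketched as below. The leader layout is taken from the figure on this slide; everything else is an assumption for illustration: batches are capped by message count (a simplification of size-based batching), CRC32 stands in for the real hash, and `split_into_batches` is a hypothetical name.

```python
import zlib
from collections import defaultdict

# Partition -> leader broker, as drawn in the figure.
LEADERS = {0: "Broker-0", 1: "Broker-2", 2: "Broker-1"}


def split_into_batches(messages, max_batch_size):
    """Group (key, value) messages by partition, then cut per-partition batches.

    Returns a list of (leader_broker, partition, [values]) tuples.
    """
    per_partition = defaultdict(list)
    for key, value in messages:
        # Stand-in partitioner: hash of the key modulo the partition count.
        partition = zlib.crc32(key) % len(LEADERS)
        per_partition[partition].append(value)

    batches = []
    for partition, values in sorted(per_partition.items()):
        for start in range(0, len(values), max_batch_size):
            chunk = values[start:start + max_batch_size]
            # Each batch is addressed to the partition's leader broker.
            batches.append((LEADERS[partition], partition, chunk))
    return batches
```

Messages keep their original order inside each partition, and every batch targets exactly one leader.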

Kafka -- Consumer
Reads data from Kafka brokers
  JVM APIs supported out of the box by the project
  Consumers pull data from brokers
  Consumer apps have to keep track of the topic-partition offsets read
Consumer API
  Simple API
    Greater control over consumption of topics/partitions
    Consumer apps will be complex, as they need to handle things like offset handling
  High-level API
    Uses Simple API internally
    Consumer apps will be simple to implement, as offset tracking is out of the box
    But not flexible in terms of which partitions to read
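
The offset bookkeeping that a Simple API consumer has to do itself can be illustrated with an in-memory model. Nothing here talks to Kafka; `PartitionLog` and `OffsetTrackingConsumer` are hypothetical classes that only mimic the pull-and-track pattern.

```python
class PartitionLog:
    """In-memory stand-in for one partition: an append-only list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def fetch(self, offset, max_messages):
        """Return up to max_messages (offset, message) pairs from offset on."""
        chunk = self._messages[offset:offset + max_messages]
        return list(enumerate(chunk, start=offset))


class OffsetTrackingConsumer:
    """Pulls from a partition and remembers the next offset to read."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.next_offset = start_offset  # owned by the application

    def poll(self, max_messages=2):
        batch = self.log.fetch(self.next_offset, max_messages)
        if batch:
            # Advance past the last message that was handed out.
            self.next_offset = batch[-1][0] + 1
        return [message for _, message in batch]
```

If the application persists `next_offset`, a restarted consumer can resume from that position by passing it as `start_offset`.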

Kafka -- Consumer Cont.
Consumer Groups
  Allow multiple hosts to form a group to access a topic
  Consumer hosts join a group by using the same group.id
  Guarantees a message is read by only one consumer in a group
Partitions are assigned to consumers in a group
  A consumer node may get one or more partitions
  But one partition is assigned to only one consumer host
  Order of messages is guaranteed within a partition
Max parallelism is determined by the number of topic partitions
  With more consumers than partitions, some consumers will be idle
[Figure: partitions P0-P3 hosted on Broker-0 and Broker-1, read by Consumer Group 1 and Consumer Group 2 (consumers C1-C6).]
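
The assignment rules above can be sketched with a simple range-style split. This is a minimal illustration of the rules on this slide, not the code Kafka runs during a rebalance; `assign_partitions` is a hypothetical name.

```python
def assign_partitions(partitions, consumers):
    """Give each consumer in one group a contiguous slice of partitions.

    Every partition goes to exactly one consumer; with more consumers
    than partitions, the surplus consumers get an empty list (idle).
    """
    partitions = sorted(partitions)
    consumers = sorted(consumers)
    if not consumers:
        return {}
    per_consumer, extra = divmod(len(partitions), len(consumers))
    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        # The first `extra` consumers take one additional partition.
        size = per_consumer + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment
```

With four partitions, a group of two consumers gets two partitions each, while a group of four gets one each; a fifth consumer in that group would stay idle.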

Kafka -- Why Kafka is fast
Fast writes
  Writes are appends to the file system
  Partitions improve performance and throughput
  Uses OS buffer cache
  Lots of memory on the machine helps
Fast reads
  Memory-mapped files
  File descriptor to socket descriptor: efficient transfer
  Linux sendfile(), JVM transferTo() implementation
Why performance?
  Disk flushes are delayed
  Durability is guaranteed via replication
  When consumers are reading the latest data, they read from the page cache
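
The file-to-socket transfer mentioned under fast reads has a close analogue in Python, shown here only to illustrate the idea. `socket.sendfile()` uses `os.sendfile()` where the platform supports it and falls back to ordinary `send()` otherwise; the function name `send_log_segment` is hypothetical, and the broker itself uses the JVM `transferTo()` path named on the slide.

```python
import socket


def send_log_segment(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket.

    Returns the number of bytes sent. Where os.sendfile() is available,
    the kernel moves data from the page cache to the socket without
    copying it through a user-space buffer.
    """
    with open(path, "rb") as segment:
        return sock.sendfile(segment)
```

A consumer reading recently written data benefits twice: the bytes are usually still in the page cache, and they reach the socket without an extra copy.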

Kafka -- Cluster Mirroring
Mirror Maker
  Mirrors data across clusters, even in different DCs / AZs
  Stand-alone tool that uses the Consumer and Producer APIs
  Reads from one or more source clusters and writes to a target cluster
  Whitelist/blacklist topics

Kafka -- REST Interface
REST interface
  Wraps Producer and Consumer API
Performance overhead
  Two hops
  Extra REST server to maintain
  Parse JSON payload

Kinesis vs. Kafka -- Terms
Amazon Kinesis                                 Apache Kafka
Streams                                        Topics
Data Records                                   Messages
Producers                                      Producers
Kinesis Producer Library                       Producer API
Consumers                                      Consumers
Kinesis Applications                           Consumer Applications
Kinesis Client Library                         Consumer High-level API
N/A                                            Consumer Simple API
Shards                                         Partitions
N/A (built-in MD5 hash on partition keys)      Custom partitioner
Sequence Numbers                               Offset
Application Name                               Consumer Group ID
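
The built-in MD5 hashing on the Kinesis side can be sketched as follows: the partition key is hashed to a 128-bit integer, and the record goes to the shard whose hash key range contains that value. The evenly split ranges, the big-endian reading of the digest, and the names `hash_key`, `even_shard_ranges` and `shard_for` are assumptions for illustration.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1  # largest value of a 128-bit hash


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def even_shard_ranges(num_shards: int):
    """Split the 128-bit hash space into contiguous, equal-sized ranges."""
    size = 2**128 // num_shards
    ranges = []
    for index in range(num_shards):
        start = index * size
        # The last shard absorbs any remainder of the division.
        end = MAX_HASH_KEY if index == num_shards - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges) -> int:
    """Index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash value not covered by any shard range")
```

In Kafka the equivalent decision is made by the producer's partitioner, which can be replaced with a custom one.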

Tweet: #hadooproadshow
More About Apache Kafka: http://hortonworks.com/hadoop/kafka/