Kinesis vs. Kafka – Kafka Deep Dive Yifeng Jiang Solutions Engineer, Hortonworks © Hortonworks Inc. 2011 – 2015. All Rights Reserved

  • Kinesis vs. Kafka -- Kafka Deep Dive

    Yifeng Jiang, Solutions Engineer, Hortonworks

    © Hortonworks Inc. 2011-2015. All Rights Reserved

  • Yifeng Jiang

    Solutions Engineer, Hortonworks
    HBase book author
    Twitter: @uprush

  • About Hortonworks

    Customer Momentum
      556 customers (as of August 5, 2015)
      119 customers added in Q2 2015
      Publicly traded on NASDAQ: HDP

    Hortonworks Data Platform
      Completely open multi-tenant platform for any app and any data
      Consistent enterprise services for security, operations, and governance

    Partner for Customer Success
      Leader in open-source community, focused on innovation to meet enterprise needs
      Unrivaled Hadoop support subscriptions

    Founded in 2011

    Original 24 architects, developers, operators of Hadoop from Yahoo!

    740+ employees

    1350+ ecosystem partners

  • Hortonworks Data Platform (HDP)

    Deploy on premises and in the cloud

  • Kinesis vs. Kafka

  • Amazon Kinesis -- Introduction

    Amazon Kinesis is a fully managed, cloud-based service for real-time data processing over large, distributed data streams.

  • Apache Kafka -- Introduction

    Messaging system
      Real-time
      Scalable to handle large data volumes
      Low latency
      Fault tolerant

    Originated at LinkedIn
      Aimed at solving data movement across systems
      Written in Scala and Java
      Open source (Apache 2.0)
      Adopted by many companies

  • Kinesis vs. Kafka -- Features

    Similar features
      Messaging system for large-scale, real-time data processing
      High performance, highly scalable, low latency
      Fault tolerant

    Differences
      Fully managed cloud service vs. OSS
      Data durability and performance trade-off
      Interface
      AWS service integration vs. OSS or single-platform (e.g., HDP) integration

  • Kinesis vs. Kafka -- Data Durability

    Kinesis
      Synchronously replicates data across three facilities
      High durability for free

    Kafka
      Replication across servers in the same DC/AZ, with a configurable minimum number of in-sync replicas and ACKs
      Asynchronously mirrors data between clusters across datacenters / AZs
      Performance trade-off
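
The ACK and in-sync-replica settings above can be illustrated with a small model. This is a minimal sketch, not Kafka code: `replicas_required_for_ack` is a hypothetical helper, and the three `acks` levels (0, 1, "all") and the rejection when the ISR is smaller than the configured minimum are modeled on the producer `acks` and `min.insync.replicas` settings as an assumption.

```python
def replicas_required_for_ack(acks, isr_size, min_insync_replicas):
    """Return how many replicas must hold a write before it is acknowledged.

    Hypothetical model of the durability/latency trade-off:
      acks=0     -> the producer does not wait for any replica
      acks=1     -> only the leader must have the write
      acks="all" -> every in-sync replica (ISR) must have the write, and the
                    write is rejected if the ISR is smaller than the minimum
    """
    if acks == 0:
        return 0
    if acks == 1:
        return 1
    if acks == "all":
        if isr_size < min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas to accept the write")
        return isr_size
    raise ValueError(f"unsupported acks value: {acks!r}")


# Waiting for more replicas means higher durability and higher latency.
fast_but_risky = replicas_required_for_ack(1, isr_size=3, min_insync_replicas=2)
slow_but_safe = replicas_required_for_ack("all", isr_size=3, min_insync_replicas=2)
```

The more replicas a send waits for, the more durable and the slower it is, which is the performance trade-off named on the slide.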

  • Kinesis vs. Kafka -- Interface

    Kinesis
      REST only
      Client library wraps the REST API

    Kafka
      Low-level API
      REST API available (wrapping the low-level API)

    The choice of interface impacts throughput and latency

  • Kinesis vs. Kafka -- Processing

    Kafka
      Custom consumers: event monitoring and alerting use case
      Storm: fraud detection, simple aggregation
      Spark Streaming / Storm Trident: micro-batch, near real-time
      Camus: batch Hadoop ingestion

    Kinesis
      KCL applications on EC2
      Storm
      Spark Streaming
      EMR for batch ingestion, e.g., write to S3

  • Kinesis vs. Kafka -- Deployment & Operation

    Kafka
      HDP: almost one-click deploy with Ambari
      Basic monitoring with Ambari
      Expand and rebalance: partition assignment and consumer rebalance
      ZooKeeper can also be managed by Ambari

    Kinesis
      Fully managed, one-click deploy
      CloudWatch monitoring
      Expand and rebalance: resharding a stream
      Easy operation

  • Kafka Deep Dive

  • Kafka -- Concepts

    [Diagram: a topic with 3 partitions and replication factor 2, spread over three brokers; (L) marks the leader replica. A producer writes to the brokers and a consumer reads from them.]

      Broker-0: P0.R0 (L), P1.R0
      Broker-1: P0.R1, P2.R1 (L)
      Broker-2: P1.R2 (L), P2.R2

    * ZooKeeper (ZK) is used by the brokers and the consumers

  • Kafka -- Concepts: Topics

    Partitions
      Each message in a partition is identified by an offset
      Messages are ordered within a partition

    Replication
      Prevents data loss
      Follower replicas are never read from or written to by clients
      Does not increase throughput
      A partition with N replicas tolerates N-1 failures

    [ambari-qa@c6401 bin]$ kafka-topics.sh --zookeeper c6401:2181 --describe --topic page_visits
    Topic:page_visits  PartitionCount:4  ReplicationFactor:2  Configs:
        Topic: page_visits  Partition: 0  Leader: 1  Replicas: 0,1  Isr: 1,0
        Topic: page_visits  Partition: 1  Leader: 0  Replicas: 1,0  Isr: 0,1
        Topic: page_visits  Partition: 2  Leader: 1  Replicas: 0,1  Isr: 1,0
        Topic: page_visits  Partition: 3  Leader: 0  Replicas: 1,0  Isr: 0,1
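
To make the describe output easier to read, the sketch below parses one partition line into its fields and derives how many broker failures that partition can tolerate. The helper names (`parse_partition_line`, `tolerated_failures`) are hypothetical, and the regular expression assumes the line layout shown on this slide.

```python
import re

# Matches one per-partition line of the describe output shown on the slide.
DESCRIBE_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]+)"
)


def parse_partition_line(line):
    """Parse a partition line into topic, partition, leader, replicas and ISR."""
    match = DESCRIBE_LINE.search(line)
    if match is None:
        raise ValueError(f"not a partition line: {line!r}")
    return {
        "topic": match["topic"],
        "partition": int(match["partition"]),
        "leader": int(match["leader"]),
        "replicas": [int(b) for b in match["replicas"].split(",")],
        "isr": [int(b) for b in match["isr"].split(",")],
    }


def tolerated_failures(info):
    """With N replicas, a partition survives the loss of N-1 of them."""
    return len(info["replicas"]) - 1


info = parse_partition_line(
    "Topic: page_visits Partition: 0 Leader: 1 Replicas: 0,1 Isr: 1,0"
)
```

For partition 0 above, the leader is broker 1, both replicas are in sync, and the partition tolerates one broker failure.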

  • Kafka -- Broker

    Stores messages (logs) on local disk
      Messages are appended to a log file
      Log retention is time- and size-based

    Controller
      Cluster management
      Runs on each broker machine
      One leader, the others are followers

    Partition leader
      The broker that is the leader for certain partitions

    Uses ZK for coordination
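
Time- and size-based retention can be sketched as follows. This is a simplified model, not broker code: the `Segment` type and `apply_retention` function are hypothetical, and the assumption is that retention removes whole log segments, oldest first, and never the newest (active) one.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int      # offset of the first message in the segment
    size_bytes: int       # size of the segment file
    last_modified: float  # time of the last append, in seconds


def apply_retention(segments, now, retention_secs, retention_bytes):
    """Return the segments that survive retention (input is oldest first)."""
    kept = list(segments)
    # Time-based: drop the oldest segments that have aged past the limit.
    while len(kept) > 1 and now - kept[0].last_modified > retention_secs:
        kept.pop(0)
    # Size-based: drop the oldest segments while the log is over the limit.
    while len(kept) > 1 and sum(s.size_bytes for s in kept) > retention_bytes:
        kept.pop(0)
    return kept


log = [
    Segment(base_offset=0, size_bytes=100, last_modified=0.0),
    Segment(base_offset=10, size_bytes=100, last_modified=50.0),
    Segment(base_offset=20, size_bytes=100, last_modified=90.0),
    Segment(base_offset=30, size_bytes=100, last_modified=100.0),
]
survivors = apply_retention(log, now=100.0, retention_secs=60, retention_bytes=250)
```

Here the first segment is removed for age and the second for size, so consumers can only read from offset 20 onward.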

  • Kafka -- Producer

    New Producer API in 0.8.2
      Packaged in kafka-clients.jar
      New Java API
      Asynchronous mode by default

    Creates a new message and publishes it to a topic and partition
      Takes a topic, a value, and an optional key and partition ID

  • Kafka -- Producer API (0.8.2) Cont.

    Original messages are partitioned and then split into batches

    Each split batch is sent to the leader broker (and then replicated to the ISR)

    Each send is acknowledged by the leader broker and/or all ISR

    [Diagram: the app hands messages m1..m5 to the producer library; the partitioner labels them p1, p2, p1, p2, p3, and the split batches are sent to the leader replicas of a topic with 3 partitions and replication factor 2.]

      Broker-0: P0.R0 (L), P1.R0
      Broker-1: P0.R1, P2.R1 (L)
      Broker-2: P1.R2 (L), P2.R2
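
The partition-then-batch flow can be sketched in a few lines. This is an illustration only: `choose_partition` and `split_into_batches` are hypothetical names, and CRC32 stands in for whatever hash the real partitioner uses.

```python
import zlib
from collections import defaultdict


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key) % num_partitions


def split_into_batches(messages, num_partitions, max_batch_size):
    """Group (key, value) messages by partition, then cut each group into batches."""
    per_partition = defaultdict(list)
    for key, value in messages:
        per_partition[choose_partition(key, num_partitions)].append((key, value))
    batches = []
    for partition in sorted(per_partition):
        records = per_partition[partition]
        for start in range(0, len(records), max_batch_size):
            # Each batch would be sent to the leader broker of `partition`.
            batches.append((partition, records[start:start + max_batch_size]))
    return batches


messages = [(b"user-%d" % (i % 3), b"m%d" % i) for i in range(1, 8)]
batches = split_into_batches(messages, num_partitions=3, max_batch_size=2)
```

Messages with the same key always land in the same partition, so their relative order is preserved inside that partition's batches.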

  • Kafka -- Consumer

    Reads data from Kafka brokers
      JVM APIs are supported out of the box by the project
      Consumers pull data from brokers
      Consumer apps have to keep track of the topic-partition offsets they have read

    Consumer API: Simple API
      Greater control over consumption of topics/partitions
      Consumer apps are more complex because they must handle things like offset tracking themselves

    Consumer API: High-level API
      Uses the Simple API internally
      Consumer apps are simple to implement because offset tracking comes out of the box
      Not flexible in terms of which partitions to read
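
The pull model and the offset bookkeeping that a Simple API consumer has to do itself can be sketched like this. It is an in-memory model with hypothetical classes (`PartitionLog`, `PullConsumer`), not the Kafka consumer API.

```python
class PartitionLog:
    """In-memory stand-in for one partition: an append-only list of messages."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the appended message

    def fetch(self, offset, max_messages):
        return self.messages[offset:offset + max_messages]


class PullConsumer:
    """Pulls from a partition and tracks the next offset to read."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.next_offset = start_offset

    def poll(self, max_messages=2):
        batch = self.log.fetch(self.next_offset, max_messages)
        self.next_offset += len(batch)  # the app, not the broker, advances the offset
        return batch


log = PartitionLog()
for i in range(5):
    log.append(f"m{i}")

consumer = PullConsumer(log)
first = consumer.poll()
saved_offset = consumer.next_offset  # e.g. persisted by the application

# A restarted consumer resumes from the saved offset.
resumed = PullConsumer(log, start_offset=saved_offset)
rest = resumed.poll(max_messages=10)
```

Because the broker keeps no per-consumer state in this model, a consumer that loses its saved offset has to decide for itself where to start again.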

  • Kafka -- Consumer Cont.: Consumer Groups

    Allow multiple hosts to form a group to access a topic

    Consumer hosts join a group by using the same group.id

    Guarantees that a message is read by only one consumer in a group

    Partitions are assigned to consumers in a group
      A consumer node may get one or more partitions
      But one partition is assigned to only one consumer host
      Message order is guaranteed within a partition

    Max parallelism is determined by the number of topic partitions
      With more consumers than partitions, some consumers will be idle

    [Diagram: partitions P0 and P3 on Broker-0, P1 and P2 on Broker-1; Consumer Group 1 with C1 and C2; Consumer Group 2 with C3, C4, C5, and C6.]
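
The assignment rules above can be sketched as a simple contiguous split of partitions over the consumers of one group. `assign_partitions` is a hypothetical helper; the actual assignment strategy used by a Kafka consumer may differ.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of the group.

    Consumers are sorted, and each receives a contiguous slice of the sorted
    partitions; with more consumers than partitions, some receive nothing.
    """
    consumers = sorted(consumers)
    partitions = sorted(partitions)
    base, extra = divmod(len(partitions), len(consumers))
    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        count = base + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


topic_partitions = [0, 1, 2, 3]
group_1 = assign_partitions(topic_partitions, ["C1", "C2"])
group_2 = assign_partitions(topic_partitions, ["C3", "C4", "C5", "C6"])
oversized = assign_partitions(topic_partitions, ["A", "B", "C", "D", "E", "F"])
```

With four partitions, a group of two consumers gets two partitions each, a group of four gets one each, and in a group of six the last two consumers stay idle.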

  • Kafka -- Why Kafka Is Fast

    Fast writes
      Writes are appends to the file system
      Partitions improve performance and throughput
      Uses the OS buffer cache
      Lots of memory on the machine helps

    Fast reads
      Memory-mapped files
      Efficient transfer from file descriptor to socket descriptor
      Linux sendfile(), used by the JVM transferTo() implementation

    Why performance?
      Disk flushes are delayed
      Durability is guaranteed via replication
      When consumers are reading the latest data, they read from the page cache
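
The file-to-socket transfer can be tried out with Python's `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and otherwise falls back to ordinary sends. This is only an analogy for the broker's read path: the temporary file stands in for a log segment, the socket pair stands in for a broker-to-consumer connection, and `send_log_segment` is a hypothetical helper.

```python
import os
import socket
import tempfile


def send_log_segment(path: str, sock: socket.socket) -> int:
    """Send a file over a socket without copying it through user-space buffers
    (when the platform supports sendfile); returns the number of bytes sent."""
    with open(path, "rb") as segment:
        return sock.sendfile(segment)


payload = b"m1\nm2\nm3\n"
with tempfile.NamedTemporaryFile(delete=False) as segment_file:
    segment_file.write(payload)
    segment_path = segment_file.name

consumer_end, broker_end = socket.socketpair()
try:
    sent = send_log_segment(segment_path, broker_end)
    broker_end.close()  # signal end of stream to the reader
    received = b""
    while True:
        chunk = consumer_end.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    broker_end.close()
    consumer_end.close()
    os.remove(segment_path)
```

The payload is kept small so that a single-threaded send followed by a receive cannot block on a full socket buffer.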

  • Kafka -- Cluster Mirroring

    MirrorMaker
      Mirrors data across clusters, even in different DCs / AZs
      Stand-alone tool that uses the Consumer and Producer APIs
      Reads from one or more source clusters and writes to a target cluster
      Topics are selected with a whitelist/blacklist
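
The consume-filter-produce loop of a mirroring tool can be modeled with plain dictionaries. This is a sketch, not MirrorMaker itself: clusters are represented as `{topic: [messages]}` dicts, `select_topics` and `mirror` are hypothetical names, and treating the whitelist and blacklist as regular expressions over topic names is an assumption.

```python
import re


def select_topics(topics, whitelist=None, blacklist=None):
    """Pick the topics to mirror; patterns are assumed to be regular expressions."""
    if whitelist is not None:
        pattern = re.compile(whitelist)
        return [t for t in topics if pattern.fullmatch(t)]
    if blacklist is not None:
        pattern = re.compile(blacklist)
        return [t for t in topics if not pattern.fullmatch(t)]
    return list(topics)


def mirror(source_clusters, target, whitelist=None, blacklist=None):
    """Read selected topics from every source cluster and append them to the target."""
    for cluster in source_clusters:
        for topic in select_topics(cluster, whitelist, blacklist):
            target.setdefault(topic, []).extend(cluster[topic])
    return target


dc_east = {"page_visits": ["e1", "e2"], "internal_debug": ["x"]}
dc_west = {"page_visits": ["w1"], "orders": ["o1"]}
aggregate = mirror([dc_east, dc_west], target={}, whitelist="page_visits|orders")
```

Because the copy runs through ordinary consumers and producers, mirroring is asynchronous, which matches the durability comparison earlier in the deck.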

  • Kafka -- REST Interface

    REST interface
      Wraps the Producer and Consumer APIs

    Performance overhead
      Two hops
      Extra REST server to maintain
      JSON payloads must be parsed

  • Kinesis vs. Kafka -- Terms

    Amazon Kinesis                                 Apache Kafka
    ---------------------------------------------  -----------------------
    Streams                                        Topics
    Data Records                                   Messages
    Producers                                      Producers
    Kinesis Producer Library                       Producer API
    Consumers                                      Consumers
    Kinesis Applications                           Consumer Applications
    Kinesis Client Library                         Consumer High-level API
    N/A                                            Consumer Simple API
    Shards                                         Partitions
    N/A (built-in MD5 hash on partition keys)      Custom partitioner
    Sequence Numbers                               Offset
    Application Name                               Consumer Group ID
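
The "built-in MD5 hash on partition keys" row can be illustrated as follows. This is a sketch under assumptions: the MD5 digest of the partition key is read as a 128-bit integer, and the shards are assumed to split that hash space into equal contiguous ranges; `shard_ranges` and `shard_for_key` are hypothetical helpers.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits


def hash_key(partition_key: str) -> int:
    """Interpret the MD5 digest of the partition key as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_ranges(num_shards):
    """Split the hash space into equal, contiguous, inclusive ranges (assumption)."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Find the shard whose hash range contains the key's hash."""
    hashed = hash_key(partition_key)
    for shard_id, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return shard_id
    raise ValueError("hash ranges do not cover the hash space")


ranges = shard_ranges(4)
shard = shard_for_key("user-42", ranges)
```

With Kafka, the equivalent decision is made by the producer's partitioner, which can be replaced with a custom one; with Kinesis the hash is fixed and only the partition key is under the application's control.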

  • Tweet: #hadooproadshow

    More About Apache Kafka: http://hortonworks.com/hadoop/kafka/