Something about Kafka - Why Kafka is so fast

Something about Kafka
Frank Yao

@超大杯摩卡星冰乐 2013-06-29

Friday, July 5, 2013

Agenda

• WHAT is Kafka?

• HOW do we use it at Vipshop?

• WHY is Kafka so ‘fast’?

WHAT is Kafka?

• “Kafka is a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn's activity stream and operational data processing pipeline.”

Use cases

• Operational monitoring: real-time, heads-up monitoring

• Reporting and Batch processing: load data into a data warehouse or Hadoop system

Performance (the test below is from the Kafka website)

• Parameters:

• message size = 200 bytes

• batch size = 200 messages

• fetch size = 1MB

• flush interval = 600 messages

Batch Size

Consumer Throughput

Data Size?

Producer threads?

Number of topics?

Traditional Queues

• ActiveMQ, RabbitMQ...

My Test

• Use Flume:

• In/Out ~= 300,000 (30w) messages per second

Kafka in Vipshop

Data ‘in’ Kafka

• Operational monitoring
  • Nginx access log
  • PHP error log, slow log

• Reporting and Batch processing:
  • Nginx access log
  • PHP error log, slow log
  • App log
    • b2c
    • Recommend
    • Pay
    • Passport

How much data?

• Peak Time (10:00~10:30):

• IN : 15k-20k msg per second

• OUT : 30k-40k msg per second

Apps that depend on Kafka

Kibana(Elasticsearch)

real-time pv uv

Load use Kafka

Replace RabbitMQ

             RabbitMQ       Kafka
Servers      >10            <2.5
Language     Erlang         Scala
Deployment   Difficult      Easy
Client       A lot          Not Many
Management   Web-console    JMX

WHY is Kafka ‘fast’?

Basics

• producers

• consumers

• consumer groups

• brokers

Kafka Arch

Kafka Deployment

Major Design Elements

• Persistent messages

• Throughput >>> features

• Consumers hold state

• Everything is distributed

Detail Agenda

• Maximizing Performance
  • Filesystem vs. Memory
  • BTree?
  • Zero-copy
  • End-to-end Batch Compression

• Consumer state
  • Message delivery semantics
  • Consumer state
  • Push vs. Pull

• Message
  • Message format
  • Disk structure

• Zookeeper
  • Directory Structure

Maximize Performance

Maximize Performance: Filesystem vs. Memory

Who is fast?

Memory

Filesystem

hardware                  linear writes   random writes
6*7200rpm SATA RAID-5     300MB/sec       50k/sec

ACM Pieces

Let’s see something REAL

Server Stats

page cache

• The OS uses free memory for disk caching, which makes random writes fast

Drawbacks

• All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

If JVM...

If we use memory(JVM)

• The memory overhead of objects is very high, often doubling the size of the data stored (or worse).

• Java garbage collection becomes increasingly sketchy and expensive as the in-heap data increases.

cache size

• Relying on pagecache will at least double the available cache by giving automatic access to all free memory, and likely double it again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine.

comparison

                  in-disk                         in-memory
                  no GC stop-the-world
Initialization    stays warm even if restarted    rebuilt slowly (10 min for 10GB), or starts with a cold cache
                  handled by the OS               handled by programs

Conclusion

• using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure

Go Extreme!

• Write to filesystem DIRECTLY!

• (In effect this just means that it is transferred into the kernel's pagecache where the OS can flush it later.)

Furthermore

• The flush policy is configurable: flush every N messages or every M seconds. This puts a bound on the amount of data "at risk" in the event of a hard crash.

• Varnish uses a pagecache-centric design as well.

Maximize Performance: BTree

Background

• Messaging system metadata is often stored in a BTree.

• BTree operations are O(logN).

BTree

• O(logN) ~= constant time

BTree is slow on Disk!

BTree for Disk

• Disk seeks come at 10 ms a pop

• each disk can do only one seek at a time

• parallelism is limited

• the observed performance of tree structures is often super-linear

• Page or row locking is needed to avoid locking the whole tree

Two Facts

• no advantage from improvements in drive density, because of the heavy reliance on disk seeks

• need small (< 100GB) high RPM SAS drives to maintain a sane ratio of data to seek capacity

Use Log file Structure!

Feature

• One queue is one log file

• Operations are O(1)

• Reads do not block writes or each other

• Performance is decoupled from data size

• Retain messages after consumption
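As a rough illustration of the log-file idea above, here is a minimal Python sketch of an append-only log. The class name `AppendOnlyLog` and the 4-byte length prefix are assumptions made for this example, not Kafka's actual implementation. Appending always writes at the end of the file, and a read needs only a byte offset, so neither operation depends on how much data the file already holds.

```python
import os


class AppendOnlyLog:
    """Toy single-file log: append at the end, read by byte offset."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(path, "ab").close()

    def append(self, payload: bytes) -> int:
        """Append one message and return the byte offset where it starts."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            # Hypothetical framing: 4-byte big-endian length, then the payload.
            f.write(len(payload).to_bytes(4, "big") + payload)
        return offset

    def read(self, offset: int) -> bytes:
        """Read the message that starts at the given byte offset."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            size = int.from_bytes(f.read(4), "big")
            return f.read(size)
```

Nothing is removed when a message is read, which is why messages can be retained after consumption and read again by another consumer.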

Maximize Performance: zero-copy

The common data path from a file to a socket:

1. The operating system reads data from the disk into pagecache in kernel space

2. The application reads the data from kernel space into a user-space buffer

3. The application writes the data back into kernel space into a socket buffer

4. The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network

zerocopy

• data is copied into pagecache exactly once and reused on each consumption instead of being stored in memory and copied out to kernel space every time it is read
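Kafka itself runs on the JVM, where this is done with `FileChannel.transferTo`. The same idea can be sketched in Python with `socket.sendfile()`, which uses the `sendfile` system call where the platform supports it. The function name `send_file_zero_copy` is made up for this example.

```python
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket.

    socket.sendfile() uses os.sendfile() where the platform supports it, so
    the bytes go from pagecache to the socket without passing through a
    user-space buffer. On other platforms it falls back to plain send().
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

Steps 2 and 3 of the list above (kernel to user space and back) are the copies this avoids.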

zerocopy performance

Maximizing Performance: End-to-end Batch Compression

Consider that

2 * compression + 3 * de-compression

With M = num(P) producers and N = num(C) consumers: M * compression + N * de-compression

Key point

• End-to-end: compress by producers and de-compress by consumers

• Batch: compression aims to compress a ‘message set’

• Kafka supports the GZIP and Snappy compression codecs
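To see why compressing a whole message set pays off, here is a small Python sketch using the standard `gzip` module. The function names and the 4-byte length framing are assumptions for illustration only and do not reproduce Kafka's wire format.

```python
import gzip


def compress_individually(messages):
    """Compress every message on its own (one gzip stream per message)."""
    return [gzip.compress(m) for m in messages]


def compress_batch(messages):
    """Producer side: compress a whole message set at once.

    Each message gets a hypothetical 4-byte length prefix so the consumer
    can split the batch again.
    """
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return gzip.compress(framed)


def decompress_batch(blob):
    """Consumer side: inflate the batch and split it back into messages."""
    data = gzip.decompress(blob)
    messages, pos = [], 0
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        messages.append(data[pos + 4:pos + 4 + size])
        pos += 4 + size
    return messages
```

Log lines tend to repeat the same fields, so the batch shares one dictionary across messages and pays the gzip header once. In this scheme the broker can store and forward the compressed batch unchanged.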

Consumer State

• No ACK

• Consumers maintain the message state

Features

• Each message is in a partition

• Messages are stored and given out in the order they arrive

• The ‘watermark’ is called the ‘offset’ in Kafka

track state

• write message state (the offset) in Zookeeper

• can be written in one transaction with the data the consumer writes

• side benefit: consumers can ‘rewind’ and re-consume messages
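A minimal sketch of consumer-held state, assuming a partition can be modelled as a Python list and the offset as a list index (a simplification chosen for this example). The class `OffsetTrackingConsumer` is hypothetical and is not part of any Kafka client.

```python
class OffsetTrackingConsumer:
    """Toy consumer: the broker keeps the log, the consumer keeps its position."""

    def __init__(self, log):
        self.log = log      # any indexable sequence of messages (stand-in for a partition)
        self.offset = 0     # position of the next message to read

    def poll(self, max_messages=10):
        """Pull up to max_messages starting at the current offset."""
        batch = list(self.log[self.offset:self.offset + max_messages])
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        """Move back to an older offset so messages can be consumed again."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.offset = offset
```

Because the only state is one integer per partition, persisting it (for example in Zookeeper) is cheap, and no per-message acknowledgement is needed.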

Screenshot

Consumer State: push vs. pull

push system

• What if a consumer is <defunct>?

Kafka uses a pull model

Message: Format & Data structure

Msg Format

• N byte message:

• If magic byte is 0

1. 1 byte "magic" identifier to allow format changes

2. 4 byte CRC32 of the payload

3. N - 5 byte payload

• If magic byte is 1

1. 1 byte "magic" identifier to allow format changes

2. 1 byte "attributes" identifier to allow annotations on the message independent of the version (e.g. compression enabled, type of codec used)

3. 4 byte CRC32 of the payload

4. N - 6 byte payload
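The two layouts above can be sketched with Python's `struct` and `zlib` modules. Big-endian byte order and the function names are assumptions for this example. The slide does not state the byte order.

```python
import struct
import zlib


def encode_message(payload: bytes, magic: int = 1, attributes: int = 0) -> bytes:
    """Encode a message following the two layouts listed above."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    if magic == 0:
        # 1 byte magic + 4 byte CRC32 + payload
        return struct.pack(">BI", 0, crc) + payload
    if magic == 1:
        # 1 byte magic + 1 byte attributes + 4 byte CRC32 + payload
        return struct.pack(">BBI", 1, attributes, crc) + payload
    raise ValueError("unknown magic byte")


def decode_message(data: bytes):
    """Return (magic, attributes, payload) after verifying the CRC32."""
    magic = data[0]
    if magic == 0:
        attributes, header = 0, 5
        (crc,) = struct.unpack(">I", data[1:5])
    elif magic == 1:
        attributes, header = data[1], 6
        (crc,) = struct.unpack(">I", data[2:6])
    else:
        raise ValueError("unknown magic byte")
    payload = data[header:]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return magic, attributes, payload
```

The magic byte is what lets a reader choose between the 5-byte and 6-byte header, so old and new messages can coexist in one log.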

Log format on-disk

• On-disk format of a message
  • message length : 4 bytes (value: 1+4+n)
  • ‘magic’ value : 1 byte
  • crc : 4 bytes
  • payload : n bytes

• the offset, together with partition id and node id, uniquely identifies a message
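A short sketch of how length-prefixed entries of this shape could be written and scanned, assuming big-endian integers and magic value 0. The helper names `frame_entry` and `iter_entries` are invented for this example.

```python
import struct
import zlib


def frame_entry(payload: bytes) -> bytes:
    """Build one log entry: length (4) + magic (1) + crc (4) + payload (n)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    body = struct.pack(">BI", 0, crc) + payload
    # The stored length value is 1 + 4 + n, i.e. everything after the length field.
    return struct.pack(">I", len(body)) + body


def iter_entries(buf: bytes, start: int = 0):
    """Yield (byte_offset, payload) for each complete entry in a log buffer."""
    pos = start
    while pos + 4 <= len(buf):
        (length,) = struct.unpack(">I", buf[pos:pos + 4])
        end = pos + 4 + length
        if end > len(buf):
            break  # partial entry at the tail, e.g. the read buffer was too small
        # Skip 4 bytes of length, 1 byte of magic and 4 bytes of crc.
        yield pos, buf[pos + 9:end]
        pos = end
```

The byte position where an entry starts is all a reader needs to find it again, which is why an offset can serve as the message id.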

Kafka Log Implementation

Screenshot

Message: Writes

Writes

• Append-write

• When to flush:

• M : after M messages have been written to the log file

• S : S seconds after the last flush

• Durability guarantee: losing at most M messages or S seconds of data in the event of a system crash
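The M-messages / S-seconds rule above can be sketched as a small policy object. `FlushPolicy` and its method names are hypothetical. The injectable `clock` parameter is there only to make the behaviour easy to test.

```python
import time


class FlushPolicy:
    """Decide when to force a flush: every M messages or every S seconds."""

    def __init__(self, max_messages, max_seconds, clock=time.monotonic):
        self.max_messages = max_messages   # M
        self.max_seconds = max_seconds     # S
        self.clock = clock
        self.unflushed = 0
        self.last_flush = clock()

    def record_append(self) -> bool:
        """Count one appended message; return True if a flush is now due."""
        self.unflushed += 1
        return self.should_flush()

    def should_flush(self) -> bool:
        too_many = self.unflushed >= self.max_messages
        too_old = self.unflushed > 0 and self.clock() - self.last_flush >= self.max_seconds
        return too_many or too_old

    def mark_flushed(self):
        """Call after the data has actually been forced to disk."""
        self.unflushed = 0
        self.last_flush = self.clock()
```

Whatever sits between two flushes lives only in pagecache, so M and S directly set the upper bound on what a hard crash can lose.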

Message: Reads

Buffer Reads

• the buffer size is doubled automatically until the message fits

• you can specify the max buffer size

Offset Search

• Search steps:

1. locating the log segment file in which the data is stored

2. calculating the file-specific offset from the global offset value

3. reading from that file offset

• Simple binary search in memory
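The three search steps can be sketched with Python's `bisect` module, assuming each segment file is identified by the global offset at which it starts. The function name `locate` and the sample offsets are made up for this example.

```python
import bisect


def locate(segment_base_offsets, global_offset):
    """Map a global offset to (segment index, offset inside that segment file).

    segment_base_offsets is a sorted list holding the global offset at which
    each segment file starts.
    """
    if not segment_base_offsets or global_offset < segment_base_offsets[0]:
        raise ValueError("offset is before the first retained segment")
    # Step 1: rightmost segment whose base offset is <= global_offset.
    index = bisect.bisect_right(segment_base_offsets, global_offset) - 1
    # Step 2: the file-specific offset is the distance from the segment's base.
    return index, global_offset - segment_base_offsets[index]
```

For example, with segments starting at 0, 1000 and 2500, global offset 2600 falls in the third file at position 100. Step 3 is then a seek and read at that position. The sketch does not check the upper bound, which would need the size of the last segment.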

Features

• Reset the offset

• OutOfRangeException (a problem we ran into)

Message: Deletes

Deletes

• Policy: delete data older than N days or beyond N GB

• Deleting while reading?

• a copy-on-write style segment list implementation that provides consistent views to allow a binary search to proceed on an immutable static snapshot view of the log segments
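One way to picture a copy-on-write segment list is the following Python sketch. It illustrates the general technique only and is not Kafka's code. `SegmentList` and its methods are hypothetical names.

```python
import threading


class SegmentList:
    """Copy-on-write list: readers get an immutable snapshot, writers swap in a new one."""

    def __init__(self, segments=()):
        self._segments = tuple(segments)
        self._lock = threading.Lock()   # serializes writers only

    def view(self):
        """Return the current snapshot; it never changes after being returned."""
        return self._segments

    def append(self, segment):
        with self._lock:
            self._segments = self._segments + (segment,)

    def delete_oldest(self, count):
        """Drop the first `count` segments and return them for cleanup."""
        with self._lock:
            removed = self._segments[:count]
            self._segments = self._segments[count:]
        return removed
```

A reader that called `view()` before a delete keeps searching its own tuple, so a binary search never sees the list change underneath it.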

Zookeeper

Zookeeper: Directory Structure

Broker Node

• /brokers/ids/[0...N] --> host:port (ephemeral node)

Broker Topic

• /brokers/topics/[topic]/[0...N] --> nPartitions (ephemeral node)

Consumer Id

• /consumers/[group_id]/ids/[consumer_id] --> {"topic1": #streams, ..., "topicN": #streams} (ephemeral node)

Consumer Offset Tracking

• /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_counter_value (persistent node)

Partition Owner

• /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] --> consumer_node_id (ephemeral node)
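The path templates above can be written as small helper functions. These helpers are illustrative only: they build the strings and do not talk to Zookeeper, and the function names are invented for this sketch.

```python
def broker_path(broker_id):
    """Ephemeral node holding host:port for one broker."""
    return f"/brokers/ids/{broker_id}"


def broker_topic_path(topic, broker_id):
    """Ephemeral node holding the partition count of a topic on one broker."""
    return f"/brokers/topics/{topic}/{broker_id}"


def consumer_id_path(group_id, consumer_id):
    """Ephemeral node registering one consumer inside a group."""
    return f"/consumers/{group_id}/ids/{consumer_id}"


def offset_path(group_id, topic, broker_id, partition_id):
    """Persistent node holding the consumed offset of one partition."""
    return f"/consumers/{group_id}/offsets/{topic}/{broker_id}-{partition_id}"


def owner_path(group_id, topic, broker_id, partition_id):
    """Ephemeral node naming the consumer that currently owns one partition."""
    return f"/consumers/{group_id}/owners/{topic}/{broker_id}-{partition_id}"
```

Only the offset node is persistent, so a consumer group can resume from where it stopped after its consumers disappear and come back.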

Why is Kafka fast?

• Maximizing Performance
  • Filesystem vs. Memory
  • BTree?
  • Zero-copy
  • End-to-end Batch Compression

• Consumer state
  • Message delivery semantics
  • Consumer state
  • Push vs. Pull

• Message
  • Message format
  • Disk structure

• Zookeeper
  • Directory Structure

Thank You!
