Apache Kafka: A Next-Generation Distributed Messaging System

2015.09

Apache Kafka

Outline: Part 1; Part 2: Kafka; Part 3: Kafka

Part 1

Top 10 Uses For A Message Queue

1. Decoupling
2. Redundancy
3. Scalability
4. Elasticity & Spikability
5. Resiliency
6. Understanding Data Flow
7. Asynchronous Communication
8. Delivery Guarantees
9. Ordering Guarantees
10. Buffering

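Message Semantics

Durability, security, purging policy, delivery, routing, batching

Message filtering, queuing criteria, receipt notification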

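Figure: message queue system architecture. Labeled components: message receiver and message sender (TCP / UDP / HTTP / SOAP …); inbound router, outbound message transformer, outbound router; message persistence, message state, metadata, failure recovery, purging policy; container components; interceptors, inbound message transformer, service component invocation.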

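Open-Source Message Middleware

RabbitMQ: a heavyweight message queue written in Erlang that supports many protocols (AMQP, XMPP, SMTP, STOMP) and is better suited to enterprise development. It has good support for routing, load balancing, and data persistence.

Redis: a key-value NoSQL database that also supports MQ functionality. In an experiment (1 million operations), Redis outperformed RabbitMQ on enqueue when messages were smaller than 10K, but became unbearably slow once messages exceeded 10K. On dequeue, Redis performed well regardless of message size, and RabbitMQ was far slower than Redis.

ZeroMQ: billed as the fastest message queue, especially for high-throughput scenarios. It can implement advanced or complex queues that RabbitMQ handles poorly, but application development requires combining several technical frameworks. It has a distinctive brokerless (non-middleware) model: linking the ZMQ library is all that is needed. However, ZMQ provides only non-persistent queues.

ActiveMQ: an Apache subproject. Like ZeroMQ, it can implement queues in both broker and peer-to-peer styles. Like RabbitMQ, it can implement advanced scenarios efficiently with a small amount of code.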

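I/O

Disk vs. memory; random reads and writes vs. sequential reads and writes on disk

Mechanical disks are slow, but modern operating systems compensate:

1. Read-ahead and write-behind: data is prefetched in large blocks, and many small logical writes are grouped into one larger physical write
2. Free memory is routinely used as disk cache
3. Linear disk access is in many cases much faster than random memory access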

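JVM: Two Facts

1. Java objects carry a large memory overhead, often doubling the size of the stored data or worse
2. As the amount of data in the heap grows, garbage collection (GC) becomes increasingly difficult

JVM: One Hypothetical

Suppose that on a machine with 64 GB of memory, 50 GB to 56 GB has to be used for cached data.

When the system restarts, that data must be loaded back into memory (at roughly 1 GB per minute). Even a cold refresh, which loads data only when an access finds it missing, makes initial performance very slow.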

Zero Copy Illustrated

Figure 1. Traditional data copying approach. Data moves from the read buffer (kernel context) to the application buffer (application context), then to the socket buffer and the NIC buffer (kernel context), using both DMA copies and CPU copies.

Figure 2. Traditional context switches. The read and write syscalls each switch from user context (U) to kernel context (K) and back before the next cycle.

Figure 3. Data copy with transferTo(). Data moves from the read buffer to the socket buffer to the NIC buffer within the kernel context.

Figure 4. Context switching with transferTo(). A single syscall reads and sends, with one switch from user context (U) to kernel context (K) and one back before the next cycle.

Figure 5. Data copies when transferTo() and gather operations are used. A descriptor is passed instead of the data, which goes from the read buffer to the NIC buffer.
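The figures describe Java's transferTo(). As a rough analogue, here is a minimal Python sketch that uses socket.sendfile(), which relies on the operating system's sendfile where available and otherwise falls back to ordinary send() calls. The function names, the payload, and the local socket pair are assumptions made for illustration only.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send the whole file at `path` over `sock`; return the bytes sent.

    socket.sendfile() uses os.sendfile() where the platform supports it,
    so the payload is not copied through a user-space buffer. Elsewhere it
    falls back to plain send() calls.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo() -> bytes:
    """Round-trip a small payload through a local socket pair."""
    payload = b"kafka log segment bytes"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        send_file_zero_copy(path, sender)
        sender.close()  # signal end-of-stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        os.remove(path)
```

The payload is kept small so it fits in the socket buffer without a concurrent reader; a real sender would stream to a connected peer.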

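Design: Constant-Time Disk Operations

B-tree operations are O(log N), which is usually treated as effectively constant.

For disk operations this does not hold. A disk seek takes about 10 ms, and each disk can perform only one seek at a time, so parallelism is hard to exploit. Observations of tree-structure performance show that it tends to degrade linearly as the data grows: doubling the data roughly halves the speed.

1. Access the disk linearly to reduce seeks
2. Compress data to reduce I/O pressure
3. Use zero-copy techniques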

Part 2: Kafka

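A high-performance publish/subscribe message queue system that is distributed, partitioned, replicated, and coordinated through ZooKeeper

Fast persistence: message persistence with O(1) time complexity

High throughput: on the order of 100,000 messages per second on a single commodity server

Distributed load balancing: brokers, producers, and consumers all support distribution and load balancing

Horizontal scaling: smooth, online horizontal scale-out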

Kafka Terminology

1. Broker & Controller & Producer & Consumer & Consumer Group
2. Topic & Partition & Segment & Offset
3. Replica & Replica Leader & Replica Follower
4. Assigned Replicas & Preferred Replica
5. Message & Message Set

Network Topology

Figure: network topology, showing producers (Producer1, Producer2), brokers (broker1 to broker4) hosting Topic1 and Topic2, consumers (Consumer1 to Consumer3) in a Consumer Group, and zookeeper.

Anatomy of a Topic

Figure: Partition0 holds offsets 0 to 12, Partition1 offsets 0 to 9, and Partition2 offsets 0 to 12. Each partition runs from old to new, and writes are appended at the new end.

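Topic Message Flow

Figure: ProducerA and ProducerB write to a Topic with three partitions (Partition1 with offsets 0 to 1, Partition2 with offsets 0 to 2, Partition3 with offsets 0 to 1); ConsumerA, ConsumerB, and ConsumerC read from it.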

Segment File Implementation

Active segment list (offset range, index file):

34477849968 - 35551592051    topic/34477849968.index
35551592052 - 36625333894    topic/35551592052.index
36625333895 - 37699075830    topic/36625333895.index
37699075831 - 38772817944    topic/37699075831.index
…
79575006771 - 80648748860    topic/79575006771.index
80648748861 - 81722490796    topic/80648748861.index
81722490797 - 82796232651    topic/81722490797.index
82796232652 - 83869974631    topic/82796232652.index

Segment data files (consistent views):

topic/34477849968.kafka: Message 34477849968, Message 34477850175, …, Message 35551591806, Message 35551592051
topic/82796232652.kafka: Message 82796232652, Message 82796232859, …, Message 83869974383, Message 83869974631

Operations shown: Deletes, Appends, Reads.

Index file 34477849968.index: 34477849968, 34477850630, 34477852008, …, 35551590905, 35551592051

Data file 34477849968.kafka: Message 34477849968, Message 34477850175, Message 34477850630, Message 34477851038, Message 34477851603, Message 34477852008, …, Message 35551590905, Message 35551591309, Message 35551591806, Message 35551592051
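Because segment files are named after their first offset and the index holds only some of the offsets, finding a message can be sketched as two binary searches. The snippet below is a minimal illustration using offsets from the slide; the function names are hypothetical, and a real index would also map each entry to a byte position in the data file.

```python
import bisect

# Base offsets of the first four active segments listed on the slide.
SEGMENT_BASE_OFFSETS = [34477849968, 35551592052, 36625333895, 37699075831]

# Leading entries of 34477849968.index as shown on the slide (offsets only).
SPARSE_INDEX = [34477849968, 34477850630, 34477852008]


def find_segment(base_offsets, offset):
    """Return the base offset of the segment that should hold `offset`."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} precedes the oldest segment")
    return base_offsets[i]


def find_index_entry(index_offsets, offset):
    """Return the greatest indexed offset <= `offset`.

    A reader would start scanning the data file from this entry.
    """
    i = bisect.bisect_right(index_offsets, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} precedes the first index entry")
    return index_offsets[i]


def segment_file_names(base_offset, topic_dir="topic"):
    """Build the index and data file names for one segment."""
    return (f"{topic_dir}/{base_offset}.index",
            f"{topic_dir}/{base_offset}.kafka")
```

For example, `find_segment(SEGMENT_BASE_OFFSETS, 35551591806)` selects the segment starting at 34477849968. The sketch does not check the upper bound of the newest segment.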

Broker & Topic Partition

Figure: replicas of topic1-part1, topic1-part2, topic2-part1, and topic2-part2 distributed across broker1, broker2, broker3, and broker4. Each partition appears three times.

Message Delivery

Figure: a Message Set is a sequence of Messages, each with the fields size, magic, crc, and payload. Other labels: Producer, Consumer, offset, Consumer Buffer, ByteMessageSet, FileMessageSet.
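To make the size, magic, crc, and payload fields concrete, here is a minimal framing sketch. The field order, the field widths (4-byte size, 1-byte magic, 4-byte CRC32), the big-endian encoding, and the magic value are assumptions for illustration, not a statement of Kafka's wire format.

```python
import struct
import zlib

MAGIC = 0  # hypothetical format-version byte


def encode_message(payload: bytes, magic: int = MAGIC) -> bytes:
    """Frame one message as size | magic | crc | payload (big-endian)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    body = struct.pack(">BI", magic, crc) + payload
    return struct.pack(">I", len(body)) + body


def encode_message_set(payloads) -> bytes:
    """Concatenate framed messages into one message set."""
    return b"".join(encode_message(p) for p in payloads)


def decode_message_set(data: bytes):
    """Yield each payload in a message set after verifying its CRC."""
    pos = 0
    while pos < len(data):
        (size,) = struct.unpack_from(">I", data, pos)
        _magic, crc = struct.unpack_from(">BI", data, pos + 4)
        payload = data[pos + 9: pos + 4 + size]
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError(f"CRC mismatch at byte {pos}")
        yield payload
        pos += 4 + size
```

Length-prefixed framing lets a reader step through a message set without parsing payloads, which is what makes it practical to ship byte ranges of a file as-is.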

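ZooKeeper

Shared lock service: ZooKeeper guarantees strong data consistency; at any time, every node in the cluster holds the same data. A client creates a node to act as a lock. It first checks whether the node exists: if it does, the lock is already held; otherwise the client creates the node and thereby owns the lock.

Configuration management: configuration is stored in a ZooKeeper directory node. Whenever the configuration changes, every application machine is notified by ZooKeeper, then fetches the new configuration from ZooKeeper and applies it.

Unified naming service: distributed applications usually need a complete naming scheme, and a tree is an ideal choice. A tree is a hierarchical directory structure that is human-friendly, free of duplicates, and easy to recognize and remember.

Cluster management: ZooKeeper can track the service state of the machines in a cluster and can elect a "master" to manage the cluster. This is another ZooKeeper function, Leader Election.

Queue management:

Synchronization queue: create the directory /synchronizing. Each member joins as /synchronizing/member_i and watches /synchronizing/start, then checks whether i equals the number of members. If i is smaller, the member waits for /synchronizing/start to appear; if they are equal, it creates /synchronizing/start.

FIFO queue: create SEQUENTIAL child nodes /x(i) under /queue-fifo, so that every member receives a sequence number when it enqueues. To dequeue, call getChildren() to list all elements currently in the queue, then consume the smallest one.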
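The FIFO recipe can be sketched without a ZooKeeper server. The class below is an in-memory stand-in, not the ZooKeeper client API: it only mimics the two behaviours the recipe relies on, sequentially numbered child names and a getChildren()-style listing. All names are hypothetical.

```python
import itertools


class InMemorySequentialQueue:
    """In-memory imitation of SEQUENTIAL children under /queue-fifo."""

    def __init__(self, root="/queue-fifo"):
        self.root = root
        self._counter = itertools.count()
        self._children = {}  # child name -> data

    def enqueue(self, data):
        """Create the next numbered child and return its full path."""
        # Zero-padded names sort lexicographically in creation order.
        name = f"x{next(self._counter):010d}"
        self._children[name] = data
        return f"{self.root}/{name}"

    def get_children(self):
        """List the names of all elements currently in the queue."""
        return list(self._children)

    def dequeue(self):
        """Remove and return the data of the smallest-numbered child."""
        children = self.get_children()
        if not children:
            raise IndexError("queue is empty")
        return self._children.pop(min(children))
```

With a real ZooKeeper ensemble, two consumers could race for the same smallest child, so the delete step has to tolerate "node already gone" and retry; this single-process sketch does not model that.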

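In-Sync Semantics

1. Every broker must maintain a session with ZooKeeper; ZooKeeper checks each node's connection through a heartbeat mechanism
2. A follower broker must replicate from the leader broker promptly and must not fall too far behind it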

Replication & Commit Log

1. A message counts as "committed" if and only if all replicas in the ISR have written it to their logs
2. Consumers can read only committed messages
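One way to picture the commit rule is as a high watermark: the smallest log end offset among the replicas that must acknowledge a message. The sketch below assumes "all replicas" means the in-sync replicas; the function names and data shapes are hypothetical.

```python
def high_watermark(log_end_offsets, isr):
    """Return the offset below which every message is committed.

    log_end_offsets: dict mapping replica id -> number of messages in its log.
    isr: ids of the replicas currently considered in sync.
    A message at offset o is committed when o < high_watermark.
    """
    if not isr:
        raise ValueError("no in-sync replicas")
    return min(log_end_offsets[r] for r in isr)


def readable_messages(leader_log, log_end_offsets, isr):
    """Return the prefix of the leader's log that consumers may read."""
    return leader_log[:high_watermark(log_end_offsets, isr)]
```

If a slow replica is dropped from the in-sync set, the watermark can advance without it, which is why keeping followers close to the leader matters.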

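Persistence & Efficiency

1. Every follower pulls data only from the leader
2. Every follower sends an ACK to the leader as soon as it receives the data, rather than waiting until the data has been written to its log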

Consumer & Partition

1. Consumers in the same Consumer Group compete for partitions (queue semantics)
2. Consumers in different Consumer Groups share partitions (topic semantics)
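The two rules can be sketched as an assignment function: within a group each partition goes to exactly one consumer, while every group receives the full set of partitions. Round-robin is used here only because it is simple; it is an assumption, not necessarily the strategy Kafka applies.

```python
def assign_partitions(partitions, consumers):
    """Split `partitions` among the consumers of one group.

    Every partition goes to exactly one consumer; consumers beyond the
    number of partitions receive nothing.
    """
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def assign_for_groups(partitions, groups):
    """Assign independently per group, so each group sees every partition."""
    return {g: assign_partitions(partitions, members)
            for g, members in groups.items()}
```

A group with one consumer behaves like a plain subscriber, and a single group with many consumers behaves like a work queue.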

Message Delivery Semantics

1. At most once: messages may be lost but are never redelivered
2. At least once: messages are never lost but may be redelivered
3. Exactly once: every message is delivered once and only once (the ideal)
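On the consumer side, the difference between the first two guarantees comes down to whether the offset is committed before or after the message is processed. The simulation below is a minimal sketch with one injected crash; the function names and the crash model are assumptions for illustration.

```python
def consume(messages, committed, crash_at=None, commit_first=False):
    """Process messages starting at `committed`; return (processed, committed).

    crash_at: offset at which the consumer dies between its two steps,
    or None for no crash.
    commit_first=True  -> commit the offset, then process (at most once).
    commit_first=False -> process, then commit the offset (at least once).
    """
    processed = []
    for offset in range(committed, len(messages)):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                break  # died after committing, before processing
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])
            if offset == crash_at:
                break  # died after processing, before committing
            committed = offset + 1
    return processed, committed


def run_with_one_crash(messages, crash_at, commit_first):
    """Run until the crash, restart from the committed offset, and finish."""
    first, committed = consume(messages, 0, crash_at, commit_first)
    second, _ = consume(messages, committed, None, commit_first)
    return first + second
```

With three messages and a crash at offset 1, committing first drops the second message, while processing first delivers it twice.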

Leader Election Algorithm

1. Leader election
2. ISR approach vs. majority vote
3. What happens when all replicas of a partition are down

Note: ISR stands for In-Sync Replicas.

Broker/Controller: Questions to Consider

1. What is the simplest, most intuitive scheme for electing a broker leader?
2. What problems does that scheme introduce?

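Partition: Questions to Consider

1. The data structure of a partition: its logical and physical storage layout
2. The relationship among broker, topic, and partition
3. The relationship among partition, consumer, and Consumer Group
4. What triggers a partition rebalance, and what problems it causes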

Replication Design: High Level

1. How to assign replicas of a partition to broker servers evenly?
2. For a given partition, how to propagate every message to all replicas?

Producer: Questions to Consider

1. Load balancing
2. Asynchronous send
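As one possible reading of these two topics, the sketch below pairs a partitioner (round-robin for unkeyed messages, a stable hash for keyed ones) with a buffer that hands messages over in batches. Both classes are hypothetical helpers; the batching class shows only the buffering half of asynchronous sending, where a real producer would flush from a background thread.

```python
import itertools
import zlib


class Partitioner:
    """Pick a partition for each message."""

    def __init__(self, num_partitions):
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        """Spread unkeyed messages evenly; pin each key to one partition."""
        if key is None:
            return next(self._round_robin)
        # crc32 is stable across runs, unlike Python's built-in hash().
        return zlib.crc32(key) % self.num_partitions


class BatchingSender:
    """Buffer messages and pass them to `send_batch` in groups."""

    def __init__(self, send_batch, batch_size):
        self._send_batch = send_batch
        self.batch_size = batch_size
        self._buffer = []

    def send(self, message):
        """Queue one message, flushing when the batch is full."""
        self._buffer.append(message)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, if anything."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)
```

Keyed partitioning keeps all messages for one key in a single partition, which preserves their relative order.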

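Consumer: Questions to Consider

1. Push vs. pull
2. Who owns the offset, and how to store it reliably
3. How the consumer can help reduce the complexity of the broker design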

Part 3: Kafka

Broker Internals

Figure: broker internals, with these labeled parts.

Network Layer: client connections, Acceptor Thread, Processor Thread 1 to N, Request Channel
API Layer: API Thread 1 to N, Request Purgatory
Log Subsystem: Log Manager, Partition Log 1 to N, file system
Replication Subsystem: Replication Manager, Replication Controller, Replication Thread 1 to N, Replication Info
External: Kafka Brokers, Zookeeper
Flows: Kafka Log Replication, Leadership Changes

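1. Using mmap and the log index mechanism, the broker persists messages at timed intervals or after a set number of messages
2. Network architecture optimization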

Broker Failover

Sequence diagram. Participants: Controller, Broker State, Affected Brokers, Broker Path.

1. Watch fired
2. Read available brokers
3. Determine set_p
4. Read the ISR for each partition in set_p
5. Determine the new leader and ISR for each partition in set_p
6. Write the new ISR and leader
7. RPC: ISR/Leader assignment
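The "determine new leader and ISR" step can be sketched as follows. The slide does not define set_p, so this assumes it is the set of partitions whose current leader is no longer among the available brokers; picking the first surviving ISR member as leader is likewise an assumption, and all names are hypothetical.

```python
def elect_after_failure(partition_state, live_brokers):
    """Compute the new leader and ISR for each affected partition.

    partition_state: dict partition -> {"leader": id, "isr": [ids]}.
    live_brokers: set of broker ids still available.
    Returns dict partition -> {"leader": id or None, "isr": [ids]} covering
    only set_p (partitions whose leader is not in live_brokers).
    """
    set_p = [p for p, s in partition_state.items()
             if s["leader"] not in live_brokers]
    result = {}
    for p in set_p:
        new_isr = [b for b in partition_state[p]["isr"] if b in live_brokers]
        # None means no in-sync replica survived, so the partition is offline.
        new_leader = new_isr[0] if new_isr else None
        result[p] = {"leader": new_leader, "isr": new_isr}
    return result
```

The controller would then write each result back to ZooKeeper and notify the affected brokers.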

Creating/Deleting a Topic

Sequence diagram. Participants: Controller, Broker Path, Affected Brokers, Topic Path, Broker State.

1. Watch fired
2. Read created/deleted topics
3. Determine set_p
4. Read the available broker list
5. Set AR as ISR, with one replica in AR as leader
6. Write the new ISR and leader
7. RPC: ISR/Leader assignment (LeaderAndISRRequest, LeaderAndISRResponse)

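Handling a LeaderAndIsrRequest

Flowchart, as steps:

1. Receive the LeaderAndIsrRequest
2. Controller epoch old? If yes, send a StaleControllerEpochCode response
3. Otherwise, for each partitionState: leader_epoch old? If yes, answer with StaleLeaderEpochCode
4. Otherwise, does the partitionStateInfo contain the current broker id? If no, log.warn
5. If yes, is the current broker the leader of the partition? If yes, makeLeaders; if no, makeFollowers
6. The flow ends with the highwatermark and IdleFetcher boxes, then Send Response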
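The decision points in the flowchart can be sketched as a single function. This is an interpretation of the diagram, not broker source code: the request shape, the constant names, and the epoch comparisons are assumptions, and the highwatermark and IdleFetcher steps are omitted.

```python
STALE_CONTROLLER_EPOCH = "StaleControllerEpochCode"
STALE_LEADER_EPOCH = "StaleLeaderEpochCode"
NO_ERROR = "NoError"


def handle_leader_and_isr(broker_id, known_controller_epoch,
                          known_leader_epochs, request):
    """Classify each partition named in a leader-and-ISR request.

    request: {"controller_epoch": int,
              "partitions": {name: {"leader": id, "leader_epoch": int,
                                    "replicas": [ids]}}}
    known_leader_epochs: dict partition -> leader epoch last seen here.
    Returns (error_code, per_partition_codes, make_leaders, make_followers).
    """
    if request["controller_epoch"] < known_controller_epoch:
        return STALE_CONTROLLER_EPOCH, {}, [], []
    codes, leaders, followers = {}, [], []
    for name, info in request["partitions"].items():
        if info["leader_epoch"] <= known_leader_epochs.get(name, -1):
            codes[name] = STALE_LEADER_EPOCH
            continue
        codes[name] = NO_ERROR
        if broker_id not in info["replicas"]:
            continue  # the flowchart only logs a warning in this case
        if info["leader"] == broker_id:
            leaders.append(name)
        else:
            followers.append(name)
    return NO_ERROR, codes, leaders, followers
```

Checking epochs before acting lets a broker ignore instructions from a controller or leader generation that has already been superseded.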

Replication Tools

1. Topic Tool
2. Replica Verification Tool
3. Preferred Replica Leader Election Tool
4. Kafka Reassign Partitions Tool
5. State Change Log Merge Tool

To Be Continued…

Thanks
