Geecon.cz 2015, Krzysztof Dębski

Krzysztof Dębski, 22nd October 2015

How to handle a large amount of events

Geecon.cz 2015

Who am I

15 years as an IT professional

6 years as an Architect in Allegro Group

1 year as a Product Owner http://hermes.allegro.tech

@DebskiChris

Allegro Group

500+ people in IT

50+ independent teams

16 years on market

2 years after technical revolution

Agenda

Choose a tool

Distribute load

Be reliable

Focus on throughput or latency

Improve security

Think ahead

What is important when handling events

Choose a tool

Distribute load

Be reliable

Focus on throughput or latency

Improve security

Think ahead

Choose a tool

Kafka

[Diagram: a Service with a Producer, a Service with a Consumer, a Kafka Broker, and Zookeeper.]

Kafka topics

[Diagram: Producer_1…Producer_n publish events to a Topic; old events are removed.]

Kafka partitions

[Diagram: Producer_1…Producer_n publish events to Partition 0, Partition 1, and Partition 2.]

Kafka partitions

[Diagram, build 1: Broker 1 holds P0, Broker 2 holds P1, Broker 3 holds P2. Consumers 1 to 5 are spread over Consumer group 1 and Consumer group 2.]

[Diagram, build 2: Consumer 6 is added.]

[Diagram, build 3: Broker 4 holding P3 is added.]
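One reading of these builds is that every partition is read by exactly one consumer of a group, so a consumer beyond the partition count has nothing to read until a partition is added. The sketch below illustrates that idea with a plain round-robin assignment; the function name and the group membership are hypothetical, and Kafka's own assignment logic differs in detail.

```python
def assign(partitions, consumers):
    """Give every partition to exactly one consumer of a group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical membership of one consumer group with four members.
group = ["Consumer 3", "Consumer 4", "Consumer 5", "Consumer 6"]

three_partitions = assign(["P0", "P1", "P2"], group)
four_partitions = assign(["P0", "P1", "P2", "P3"], group)

# Consumers that own no partition and therefore receive no events.
idle_before = [c for c, owned in three_partitions.items() if not owned]
idle_after = [c for c, owned in four_partitions.items() if not owned]
```

With three partitions the fourth consumer stays idle; adding P3 gives it work.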

Kafka replicas

[Diagram: a Service with a Producer, a Service with a Consumer, Zookeeper, and three Brokers holding the partition pairs P1 P0, P2 P1, and P0 P2.]

Hermes

[Diagram: three Hermes Frontend instances and two Hermes Consumer instances; the interfaces are labelled REST and REST, JMS.]

Distribute load

Kafka partitions

[Diagram: a Producer publishes events to Partition 0, Partition 1, and Partition 2.]

Default Partitioning - Round Robin

In practice: binds to a single partition for 10 mins
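The remark about binding to a single partition can be modelled as a partitioner that picks a partition at random and keeps it until a refresh interval has passed. This is a minimal sketch with hypothetical class and parameter names, not Kafka's own code; the clock is injected so the behaviour can be checked without waiting.

```python
import random

REFRESH_INTERVAL_S = 600  # 10 minutes, as stated on the slide


class StickyPartitioner:
    """Pick a random partition and keep it until the refresh interval passes."""

    def __init__(self, partition_count, clock, rng=None):
        self.partition_count = partition_count
        self.clock = clock  # callable returning the current time in seconds
        self.rng = rng or random.Random()
        self.current = None
        self.chosen_at = None

    def partition(self):
        now = self.clock()
        if self.current is None or now - self.chosen_at >= REFRESH_INTERVAL_S:
            self.current = self.rng.randrange(self.partition_count)
            self.chosen_at = now
        return self.current


fake_time = [0]
partitioner = StickyPartitioner(3, clock=lambda: fake_time[0], rng=random.Random(1))

# Ask for a partition once a minute during the first ten minutes.
seen_partitions = set()
for second in range(0, 600, 60):
    fake_time[0] = second
    seen_partitions.add(partitioner.partition())
```

Within one interval all events go to one partition, so the load only evens out over many producers or long periods.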

Kafka partitions

Default Partitioning - Key based

[Diagram: a Producer sends Key_0, Key_1, and Key_2; each key is routed to one of Partition 0, Partition 1, and Partition 2.]
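Key-based partitioning boils down to hashing the key and taking the result modulo the partition count, so equal keys always land on the same partition. A minimal sketch follows; `partition_for` is a hypothetical name, and `zlib.crc32` is a stand-in hash chosen for this example rather than the function Kafka uses.

```python
import zlib


def partition_for(key, partition_count):
    """Map a key to a partition; the same key always gets the same partition."""
    # zlib.crc32 is a stand-in hash for this sketch, not Kafka's own function.
    return zlib.crc32(key.encode("utf-8")) % partition_count


placements = {key: partition_for(key, 3) for key in ["Key_0", "Key_1", "Key_2"]}
```

The trade-off is that a hot key concentrates its load on a single partition.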

Rebalancing leaders

[Diagram: Broker 1 holds P1 and P0, Broker 2 holds P2 and P1, Broker 3 holds P0 and P2.]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 3  Replicas: 3, 1  ISR: 3, 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3

Replicas: brokers that should have partition copies

ISR: In Sync Replicas

Leader: leader broker ID

Next build (Broker 3 has left the ISR and Broker 1 leads Partition 0):

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 1  Replicas: 3, 1  ISR: 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2

Last build (Broker 3 is back in the ISR, Broker 1 still leads Partition 0):

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 1  Replicas: 3, 1  ISR: 1, 3
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3
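In the last state Broker 1 leads two partitions although Broker 3 is back. In Kafka the first broker in the Replicas list is treated as the preferred leader, so a partition whose leader differs from it is a candidate for leader rebalancing. The sketch below parses one partition line from the describe output and checks that condition; the regular expression and function name are assumptions made for this example.

```python
import re

# Matches one per-partition line of the describe output shown on the slides.
DESCRIBE_LINE = re.compile(
    r"Partition: (?P<partition>\d+)\s+Leader: (?P<leader>\d+)\s+"
    r"Replicas: (?P<replicas>[\d, ]+?)\s+ISR: (?P<isr>[\d, ]+)"
)


def needs_leader_rebalance(line):
    """True when the current leader is not the first (preferred) replica."""
    match = DESCRIBE_LINE.search(line)
    if match is None:
        raise ValueError("not a partition line: %r" % line)
    leader = int(match.group("leader"))
    replicas = [int(part) for part in match.group("replicas").split(",")]
    return leader != replicas[0]


initial = "Topic: test Partition: 0 Leader: 3 Replicas: 3, 1 ISR: 3, 1"
after_recovery = "Topic: test Partition: 0 Leader: 1 Replicas: 3, 1 ISR: 1, 3"

initial_unbalanced = needs_leader_rebalance(initial)
recovered_unbalanced = needs_leader_rebalance(after_recovery)
```

Only the recovered state is flagged: leadership stayed on Broker 1 even though Broker 3 rejoined the ISR.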

Be reliable

ACK levels

0 - don’t wait for the response

1 - only the leader has to acknowledge

-1 - all replicas must be in sync

[Diagram: two scales next to the list, Speed (highest at 0) and Safety (highest at -1).]
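The three ACK levels can be summarised as a rule for when the producer receives its acknowledgement. This is a small model with hypothetical names, meant only to make the speed versus safety trade-off explicit; it is not producer code.

```python
def is_acknowledged(acks, leader_written, in_sync_replicas_written):
    """Decide whether the producer is acknowledged under a given ACK level."""
    if acks == 0:
        return True  # nothing is awaited, so nothing can be confirmed either
    if acks == 1:
        return leader_written  # only the leader has to confirm the write
    if acks == -1:
        # the leader and every in-sync replica must have the event
        return leader_written and in_sync_replicas_written
    raise ValueError("unsupported ACK level: %r" % acks)


# The leader has the event, the followers have not copied it yet.
fast = is_acknowledged(1, leader_written=True, in_sync_replicas_written=False)
safe = is_acknowledged(-1, leader_written=True, in_sync_replicas_written=False)
```

With ACK = 1 the producer already moves on, while with ACK = -1 it is still waiting for the followers.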

Lost events

ERROR [Replica Manager on Broker 2]: Error when processing fetch request for partition [test,1] offset 10000 from consumer with correlation id 0. Possible cause:

Request for offset 10000 but we only have log segments in the range 8000 to 9000. (kafka.server.ReplicaManager)

Lost events

[Diagram: a Producer publishes to Broker 1 and Broker 2 with ACK = 1, Replication factor = 1, and replica.lag.max.messages = 2000. Broker 1 is labelled offset committed = 10000 and Broker 2 offset committed = 9000. In the last build Zookeeper is also labelled offset committed = 9000.]
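One reading of the lost-events slides: with ACK = 1 the leader acknowledges events that the follower has not copied yet, and a follower that is less than replica.lag.max.messages behind still counts as in sync. If the leader then fails, the follower takes over with a shorter log. The sketch below models that reading with the numbers from the slides; the function names are hypothetical.

```python
def lost_offsets(leader_end, follower_end, acks):
    """Offsets acknowledged to the producer but missing after the follower takes over."""
    # ACK = 1: acknowledged up to the leader's end offset.
    # ACK = -1: acknowledged only up to what the follower holds as well.
    acknowledged_end = leader_end if acks == 1 else min(leader_end, follower_end)
    return range(follower_end, max(follower_end, acknowledged_end))


def fetch_is_valid(requested_offset, log_start, log_end):
    """A fetch succeeds only when the offset lies inside the log held by the broker."""
    return log_start <= requested_offset <= log_end


lost_with_ack_1 = lost_offsets(leader_end=10000, follower_end=9000, acks=1)
lost_with_ack_all = lost_offsets(leader_end=10000, follower_end=9000, acks=-1)

# The consumer asks for offset 10000, but the new leader only has 8000 to 9000.
fetch_ok = fetch_is_valid(10000, log_start=8000, log_end=9000)
```

In this model 1000 acknowledged events disappear with ACK = 1, and the consumer's next fetch fails in the way the error message describes.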

Event identification

Hermes Frontend

Kafka Broker

POST {"event": "test"}

{"id": "58d7ff07-dd0e-4103-9b1f-55706f3049e6", "timestamp": 1430443071995, "data": {"event": "test"}}

HTTP 201 Created
Message-id: 58d7ff07-dd0e-4103-9b1f-55706f3049e6
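The slide shows the published payload being wrapped in an envelope with a generated id and a timestamp, and the id being returned to the publisher. A minimal sketch of that wrapping step follows; `wrap_event` is a hypothetical name and the field layout simply mirrors the JSON on the slide.

```python
import json
import time
import uuid


def wrap_event(payload):
    """Wrap a payload in an envelope carrying a unique id and a millisecond timestamp."""
    message_id = str(uuid.uuid4())
    envelope = {
        "id": message_id,
        "timestamp": int(time.time() * 1000),
        "data": payload,
    }
    return message_id, json.dumps(envelope)


# The id would go back to the publisher in the Message-id header.
message_id, stored_message = wrap_event({"event": "test"})
```

Because the id travels with the event, a publisher can later ask what happened to one specific message.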

Lost events

[Diagram: Hermes Frontend (Producer), Hermes Consumer (Consumer), Kafka Broker, Zookeeper, and a Tracker that records publication data and delivery attempts.]

Normal operation

[Diagram: Hermes Frontend (Producer), Hermes Consumer (Consumer), Kafka Broker, and Zookeeper.]

POST

HTTP 201 Created

Abnormal situation

[Diagram: Hermes Frontend (Producer), Hermes Consumer (Consumer), Kafka Broker, and Zookeeper.]

POST

HTTP 202 Accepted
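A plausible reading of the two slides is that the frontend answers 201 when Kafka confirms the event in time, and 202 when it does not and the event is kept locally for a later attempt. The sketch below shows that decision; the function names and the list-based buffer are assumptions made for illustration.

```python
def publish(event, send_to_kafka, local_buffer):
    """Return 201 when Kafka acknowledged the event, 202 when it was only buffered."""
    try:
        acknowledged = send_to_kafka(event)
    except TimeoutError:
        acknowledged = False
    if acknowledged:
        return 201
    local_buffer.append(event)  # kept locally and sent again later
    return 202


def healthy_broker(event):
    return True


def unresponsive_broker(event):
    raise TimeoutError("no acknowledgement in time")


buffer = []
normal_status = publish({"event": "test"}, healthy_broker, buffer)
abnormal_status = publish({"event": "test"}, unresponsive_broker, buffer)
```

The publisher is never blocked by a slow broker, at the cost of a weaker guarantee behind a 202.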

Focus on throughput or latency

Throughput

Does the whole world stop when you stop?

Throughput

Co-ordinated omission
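Coordinated omission happens when a load tester waits for each response before sending the next request: during a stall it silently skips the requests it should have sent, so the stall shows up as one slow sample instead of many. A deterministic sketch follows, assuming a tester that intends to send one request every 10 ms; all names and numbers are illustrative.

```python
def simulate(service_times, interval):
    """Return (naive, corrected) latencies for a tester that waits for each response."""
    clock = 0
    naive, corrected = [], []
    for index, service_time in enumerate(service_times):
        intended_start = index * interval
        actual_start = max(clock, intended_start)  # delayed by the previous response
        end = actual_start + service_time
        naive.append(end - actual_start)        # what the waiting tester reports
        corrected.append(end - intended_start)  # what a real user would have seen
        clock = end
    return naive, corrected


# Five fast requests (1 ms) and one stall of 100 ms, one request planned every 10 ms.
naive, corrected = simulate([1, 1, 100, 1, 1, 1], interval=10)
```

Measured from the intended start time, the three requests after the stall take 91, 82, and 73 ms, while the naive measurement reports 1 ms for each of them.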

Slow responses

[Chart: response time at the 75%, 99%, and 99.9% percentiles.]
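The charts in this part of the talk are read at the 75%, 99%, and 99.9% percentiles. A short sketch of how such values can be computed with the nearest-rank method follows; the sample data is invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[min(rank, len(ordered)) - 1]


# Invented data: 990 fast responses of 10 ms and 10 slow ones of 500 ms.
response_times_ms = [10] * 990 + [500] * 10

p75 = percentile(response_times_ms, 75)
p99 = percentile(response_times_ms, 99)
p999 = percentile(response_times_ms, 99.9)
```

Here the 75% and 99% values are both 10 ms and only the 99.9% value exposes the 500 ms responses, which is why the tail percentiles matter.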

Slow response vs. message size

[Chart: message size at the 75%, 99%, and 99.9% percentiles.]

Slow response and fixed message size

[Chart: response time at the 75%, 99%, and 99.9% percentiles.]

Kafka

[Charts: kernel 3.2.x compared with kernel >= 3.8.x.]

Optimize message size

[Chart: message size at the 99.9% percentile for all topics and for the biggest topic.]

Optimize message size

JSON

human readable

big memory and network footprint

poor support for Hadoop

Optimize message size

JSON

Snappy

ERROR Error when sending message to topic t3 with key: 4 bytes, value: 100 bytes with error: The server experienced an unexpected error when processing the request (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)

java: target/snappy-1.1.1/snappy.cc:423: char* snappy::internal::CompressFragment(const char*, size_t, char*, snappy::uint16*, int): Assertion `0 == memcmp(base, candidate, matched)' failed.

errors when publishing a large number of messages

Optimize message size

JSON

Snappy

Lz4

failed on distributed data

[Chart: compression ratio for a single topic compared with multiple topics.]

Optimize message size

JSON

Snappy

Lz4

Avro

small network footprint

Hadoop friendly

easy schema verification
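The size difference between self-describing JSON and a schema-based binary format can be illustrated with the standard library alone. In this sketch `struct` stands in for a schema-based encoding such as Avro and `zlib` stands in for Snappy or Lz4, since neither is part of the Python standard library; the event fields are invented.

```python
import json
import struct
import zlib

event = {"id": 12345, "timestamp": 1430443071995, "price": 19.99}

# JSON repeats every field name in every message.
as_json = json.dumps(event).encode("utf-8")

# With a schema known to both sides only the values travel:
# two 64-bit integers and one double, little-endian.
SCHEMA = struct.Struct("<qqd")
as_binary = SCHEMA.pack(event["id"], event["timestamp"], event["price"])

# Compression helps mostly on batches, where the field names repeat.
batch = b"\n".join(as_json for _ in range(100))
compressed_batch = zlib.compress(batch)

# The reader needs the same schema to decode the values again.
decoded = dict(zip(("id", "timestamp", "price"), SCHEMA.unpack(as_binary)))
```

The binary form takes 24 bytes per event regardless of field names, but it is unreadable without the schema, which is the price of the smaller footprint.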

Kafka Offset Monitor

Improve security

KAFKA-1682

Kafka <= 0.8.2

No security!

Kafka > 0.8.2

unix-like users, permissions, ACL

Manage your topics

cz.geecon.demo.basic

[Diagram: the name is split into a Group part and a Topic part.]

Improved security

Authentication and authorization interfaces provided

By Default:

You can create any topic in your group

You can publish everywhere (in progress)

Group owner defines subscriptions

Think ahead

Consumer backoff

You can’t have exactly-once delivery

http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

Improved offset management

[Diagram, build 1: a Hermes Producer publishes an event; the Hermes consumer reads the event and tracks Committed offsets and Local unsent events.]

[Diagram, build 2: the Hermes consumer takes a New event from Local unsent events and sends it to a Service instance.]

[Diagram, build 3: the Service instance answers HTTP 503 Unavailable.]

[Diagram, build 4: the Hermes consumer performs Check TTL & Add to queue.]
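The "Check TTL & Add to queue" step can be sketched as follows: after a failed delivery the event goes back to a retry queue unless it has outlived its time-to-live. The function name, the event fields, and the TTL value are hypothetical and only illustrate the idea.

```python
from collections import deque


def handle_failed_delivery(event, now_ms, ttl_ms, retry_queue):
    """After a 503, requeue the event unless its time-to-live has expired."""
    age_ms = now_ms - event["timestamp"]
    if age_ms > ttl_ms:
        return False  # too old: give up instead of retrying forever
    retry_queue.append(event)
    return True


retry_queue = deque()
fresh_event = {"id": "a", "timestamp": 1000}
stale_event = {"id": "b", "timestamp": 0}

fresh_requeued = handle_failed_delivery(fresh_event, now_ms=1500, ttl_ms=1000, retry_queue=retry_queue)
stale_requeued = handle_failed_delivery(stale_event, now_ms=1500, ttl_ms=1000, retry_queue=retry_queue)
```

Bounding retries by age keeps one unavailable service instance from growing the queue without limit.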

Consumer backoff

[Diagram: delivery rate steps labelled 100%, adapt, 1/s, and 1/min.]
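The four labels suggest a delivery rate that drops as the subscriber fails more often. The sketch below is one way to express such a backoff; the thresholds, the halving step, and the function name are assumptions for illustration and are not taken from the talk.

```python
def next_rate(max_rate, current_rate, failure_ratio):
    """Pick a delivery rate in events per second from the recent failure ratio."""
    if failure_ratio == 0.0:
        return max_rate  # 100%: the subscriber is healthy
    if failure_ratio < 0.5:
        return max(current_rate / 2, 1.0)  # adapt: halve, but not below 1/s
    if failure_ratio < 1.0:
        return 1.0  # 1/s: most deliveries fail
    return 1.0 / 60  # 1/min: everything fails, only probe occasionally


healthy = next_rate(100.0, 100.0, failure_ratio=0.0)
adapting = next_rate(100.0, 100.0, failure_ratio=0.2)
slow = next_rate(100.0, 50.0, failure_ratio=0.8)
probing = next_rate(100.0, 1.0, failure_ratio=1.0)
```

A failing subscriber is thus relieved quickly, while the occasional probe lets the rate recover once it is healthy again.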

Turn back the time

PUT /groups/{group}/topics/{topic}/subscriptions/{subscription}/retransmission

-8h
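A request like the one on the slide could be assembled as shown here. The sketch only builds the request and does not send it. The host name, the group, topic, and subscription values, and the assumption that `-8h` is sent as a plain-text body are all hypothetical; only the path template and the `-8h` value come from the slide.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://hermes-management.example"  # hypothetical host


def retransmission_request(group, topic, subscription, offset="-8h"):
    """Build (but do not send) a PUT request that rewinds a subscription."""
    path = "/groups/%s/topics/%s/subscriptions/%s/retransmission" % (
        quote(group, safe=""),
        quote(topic, safe=""),
        quote(subscription, safe=""),
    )
    # Assumption: the relative offset travels as a plain-text request body.
    return urllib.request.Request(
        BASE_URL + path, data=offset.encode("utf-8"), method="PUT"
    )


request = retransmission_request("cz.geecon.demo", "basic", "my-subscription")
```

Passing the result to `urllib.request.urlopen` would send it; that step is left out because the endpoint here is made up.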

Blog: http://allegro.tech

Twitter: @allegrotechblog

Twitter: @debskichris