Krzysztof Dębski, 22nd October 2015 How to handle large amount of events Geecon.cz 2015

Page 1: Geecon.cz 2015 debski krzysztof

Krzysztof Dębski, 22nd October 2015

How to handle large amount of events

Geecon.cz 2015

Page 2: Geecon.cz 2015 debski krzysztof

Who am I

15 years as an IT professional

6 years as an Architect in Allegro Group

1 year as a Product Owner http://hermes.allegro.tech

@DebskiChris

Page 3: Geecon.cz 2015 debski krzysztof

Allegro Group

500+ people in IT

50+ independent teams

16 years on market

2 years after technical revolution

Page 4: Geecon.cz 2015 debski krzysztof

Agenda

Choose a tool

Distribute load

Be reliable

Focus on throughput or latency

Improve security

Think ahead

Page 5: Geecon.cz 2015 debski krzysztof

What is important when handling events

Choose a tool

Distribute load

Be reliable

Focus on throughput or latency

Improve security

Think ahead

Page 6: Geecon.cz 2015 debski krzysztof

Choose a tool

Page 7: Geecon.cz 2015 debski krzysztof

Kafka

[Diagram: Service (Producer), Service (Consumer), Kafka Broker, Zookeeper]

Page 8: Geecon.cz 2015 debski krzysztof

Kafka topics

[Diagram: Producer_1 … Producer_n publish events to a Topic; old events are removed]

Page 9: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Producer_1 … Producer_n publish events to Partition 0, Partition 1, Partition 2]

Page 10: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Broker 1 (P0), Broker 2 (P1), Broker 3 (P2); Consumer group 1 and Consumer group 2; Consumer 1 to Consumer 5]

Page 11: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Broker 1 (P0), Broker 2 (P1), Broker 3 (P2); Consumer group 1 and Consumer group 2; Consumer 1 to Consumer 6]

Page 12: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Broker 1 (P0), Broker 2 (P1), Broker 3 (P2), Broker 4 (P3); Consumer group 1 and Consumer group 2; Consumer 1 to Consumer 6]
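The three diagrams above appear to illustrate that, within one consumer group, each partition is read by a single consumer, so a consumer beyond the partition count has nothing to read until a partition is added. A minimal sketch of such an assignment, assuming a plain round-robin split; `assign_partitions` and the consumer names are hypothetical, and this is not Kafka's actual assignment code:

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of ONE group, one owner per partition."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["c1", "c2", "c3", "c4"]

# Three partitions, four consumers: the last consumer stays idle.
print(assign_partitions(["P0", "P1", "P2"], group))

# A fourth partition gives the idle consumer something to read.
print(assign_partitions(["P0", "P1", "P2", "P3"], group))
```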

Page 13: Geecon.cz 2015 debski krzysztof

Kafka replicas

[Diagram: Service (Producer), Service (Consumer), Zookeeper, and three brokers holding partition replicas: (P1, P0), (P2, P1), (P0, P2)]

Page 15: Geecon.cz 2015 debski krzysztof

Hermes

[Diagram: three Hermes Frontend instances and two Hermes Consumer instances; interface labels: REST and REST, JMS]

Page 16: Geecon.cz 2015 debski krzysztof

Distribute load

Page 17: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Producer publishes events to Partition 0, Partition 1, Partition 2]

Default partitioning: round robin

Page 18: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Producer publishes events to Partition 0, Partition 1, Partition 2]

Default partitioning: round robin

In practice: binds to a single partition for 10 minutes
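A rough model of the behaviour named on this slide, assuming a producer that picks a random partition for keyless events and keeps it until a refresh interval passes (10 minutes on the slide). The class and parameter names are hypothetical; this is a simulation, not Kafka client code:

```python
import random


class StickyPartitioner:
    """Keeps one randomly chosen partition until the refresh interval expires."""

    def __init__(self, num_partitions, refresh_seconds=600, rng=None):
        self.num_partitions = num_partitions
        self.refresh_seconds = refresh_seconds
        self.rng = rng or random.Random()
        self.current = None
        self.chosen_at = None

    def partition(self, now):
        """Return the partition for a keyless event published at time `now` (seconds)."""
        expired = self.chosen_at is None or now - self.chosen_at >= self.refresh_seconds
        if expired:
            self.current = self.rng.randrange(self.num_partitions)
            self.chosen_at = now
        return self.current


partitioner = StickyPartitioner(num_partitions=3, rng=random.Random(1))
# One event every 30 seconds for just under 10 minutes.
first_window = {partitioner.partition(now) for now in range(0, 600, 30)}
print(first_window)  # a single partition received every event in this window
```

With several producers the load still evens out over time, but a single busy producer can overload one partition for the whole interval.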

Page 19: Geecon.cz 2015 debski krzysztof

Kafka partitions

[Diagram: Producer with Key_0, Key_1, Key_2 mapped to Partition 0, Partition 1, Partition 2]

Default partitioning: key based
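Key-based partitioning can be sketched as hashing the key and taking the result modulo the partition count, so that equal keys always reach the same partition. The function name is hypothetical and `zlib.crc32` is only a stand-in for whatever hash the real client uses:

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


for key in ["Key_0", "Key_1", "Key_2", "Key_0"]:
    print(key, "->", partition_for(key, 3))
```

Note that changing the partition count changes the mapping, so per-key ordering only holds while the number of partitions stays fixed.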

Page 20: Geecon.cz 2015 debski krzysztof

Rebalancing leaders

[Diagram: Broker 1 (P1, P0), Broker 2 (P2, P1), Broker 3 (P0, P2)]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 3  Replicas: 3, 1  ISR: 3, 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3

Page 21: Geecon.cz 2015 debski krzysztof

Rebalancing leaders

[Diagram: Broker 1 (P1, P0), Broker 2 (P2, P1), Broker 3 (P0, P2)]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 3  Replicas: 3, 1  ISR: 3, 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3

Replicas: brokers that should have partition copies

Page 22: Geecon.cz 2015 debski krzysztof

Rebalancing leaders

[Diagram: Broker 1 (P1, P0), Broker 2 (P2, P1), Broker 3 (P0, P2)]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 3  Replicas: 3, 1  ISR: 3, 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3

ISR: In Sync Replicas

Page 23: Geecon.cz 2015 debski krzysztof

Rebalancing leaders

[Diagram: Broker 1 (P1, P0), Broker 2 (P2, P1), Broker 3 (P0, P2)]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 3  Replicas: 3, 1  ISR: 3, 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3

Leader: leader broker ID

Page 25: Geecon.cz 2015 debski krzysztof

Rebalancing leaders

[Diagram: Broker 1 (P1, P0), Broker 2 (P2, P1), Broker 3 (P0, P2)]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 1  Replicas: 3, 1  ISR: 1
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2

Page 26: Geecon.cz 2015 debski krzysztof

Rebalancing leaders

[Diagram: Broker 1 (P1, P0), Broker 2 (P2, P1), Broker 3 (P0, P2)]

Topic: test  Partition count: 3  Replication factor: 1  Configs: retention.ms=86400000
Topic: test  Partition: 0  Leader: 1  Replicas: 3, 1  ISR: 1, 3
Topic: test  Partition: 1  Leader: 1  Replicas: 1, 2  ISR: 1, 2
Topic: test  Partition: 2  Leader: 2  Replicas: 2, 3  ISR: 2, 3
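In the last listing broker 3 is back in the ISR of partition 0, yet broker 1 still leads it, so broker 1 now leads two partitions. Assuming the first broker in the Replicas list is the preferred leader (Kafka's convention), the partitions that need a leader rebalance can be spotted with a small parser. The function name and the regular expression are illustrative only:

```python
import re

# Matches one partition line of the topic description shown on the slides.
PARTITION_LINE = re.compile(
    r"Partition: (\d+)\s+Leader: (\d+)\s+Replicas: ([\d, ]+?)\s+ISR: ([\d, ]+)"
)


def partitions_needing_rebalance(description):
    """Return partitions whose current leader is not the first (preferred) replica."""
    result = []
    for partition, leader, replicas, _isr in PARTITION_LINE.findall(description):
        preferred = replicas.split(",")[0].strip()
        if leader != preferred:
            result.append(int(partition))
    return result


description = """Topic: test Partition: 0 Leader: 1 Replicas: 3, 1 ISR: 1, 3
Topic: test Partition: 1 Leader: 1 Replicas: 1, 2 ISR: 1, 2
Topic: test Partition: 2 Leader: 2 Replicas: 2, 3 ISR: 2, 3"""

print(partitions_needing_rebalance(description))  # [0]
```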

Page 27: Geecon.cz 2015 debski krzysztof

Be reliable

Page 28: Geecon.cz 2015 debski krzysztof

ACK levels

0 - don’t wait for the response

1 - only the leader has to acknowledge

-1 - all replicas must be in sync

[Diagram: trade-off axis between Speed and Safety]
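The three levels can be modelled as a predicate that says when the producer treats a publish as successful. This is a simplified sketch with hypothetical names, not broker or client code:

```python
def is_acknowledged(acks, leader_has_event, in_sync_followers_have_event):
    """Model when a producer treats a publish as successful.

    acks: 0, 1 or -1, as listed on the slide.
    in_sync_followers_have_event: one boolean per in-sync follower.
    """
    if acks == 0:
        return True  # no response is awaited at all
    if acks == 1:
        return leader_has_event  # followers may still be behind
    if acks == -1:
        return leader_has_event and all(in_sync_followers_have_event)
    raise ValueError("unsupported acks value: %r" % (acks,))


# The leader has stored the event, one in-sync follower has not copied it yet.
for level in (0, 1, -1):
    print(level, is_acknowledged(level, True, [False]))
```

Only the last level keeps an acknowledged event safe if the leader fails right after replying.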

Page 29: Geecon.cz 2015 debski krzysztof

Lost events

ERROR [Replica Manager on Broker 2]: Error when processing fetch request for partition [test,1] offset 10000 from consumer with correlation id 0. Possible cause: Request for offset 10000 but we only have log segments in the range 8000 to 9000. (kafka.server.ReplicaManager)

Page 30: Geecon.cz 2015 debski krzysztof

Lost events

[Diagram: Producer, Broker 1, Broker 2, Zookeeper]

ACK = 1

Replication factor = 1

Broker 1: offset committed = 10000

Broker 2: offset committed = 9000

Page 31: Geecon.cz 2015 debski krzysztof

Lost events

[Diagram: Producer, Broker 1, Broker 2, Zookeeper]

ACK = 1

Replication factor = 1

replica.lag.max.messages = 2000

Broker 1: offset committed = 10000

Broker 2: offset committed = 9000

Page 32: Geecon.cz 2015 debski krzysztof

Lost events

[Diagram: Producer, Broker 1, Broker 2, Zookeeper]

Broker 1: offset committed = 10000

Broker 2: offset committed = 9000

Page 33: Geecon.cz 2015 debski krzysztof

Lost events

[Diagram: Producer, Broker 1, Broker 2, Zookeeper]

Broker 1: offset committed = 10000

Broker 2: offset committed = 9000

Zookeeper: offset committed = 9000
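One way to read this sequence, offered as an assumption about what the diagrams show: Broker 1 leads the partition and holds events up to offset 10000, while Broker 2 has only copied up to 9000 but still counts as in sync because its lag is below replica.lag.max.messages = 2000. With ACK = 1 the producer was already told that everything up to 10000 is stored, so a failover to Broker 2 silently drops the difference. The function names below are hypothetical:

```python
def is_in_sync(leader_offset, follower_offset, lag_max_messages):
    """A follower counts as in sync while its lag stays within the limit."""
    return leader_offset - follower_offset <= lag_max_messages


def events_lost_on_failover(leader_offset, follower_offset):
    """Events acknowledged by the old leader but never copied to the new one."""
    return max(0, leader_offset - follower_offset)


leader, follower = 10000, 9000
print(is_in_sync(leader, follower, lag_max_messages=2000))  # True
print(events_lost_on_failover(leader, follower))            # 1000
```

A consumer that had already read up to offset 10000 would then ask the new leader for an offset it does not have, which matches the fetch error shown earlier.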

Page 34: Geecon.cz 2015 debski krzysztof

Event identification

[Diagram: Hermes Frontend, Kafka Broker]

POST {"event": "test"}

{"id": "58d7ff07-dd0e-4103-9b1f-55706f3049e6", "timestamp": 1430443071995, "data": {"event": "test"}}

HTTP 201 Created
Message-id: 58d7ff07-dd0e-4103-9b1f-55706f3049e6
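The wrapping shown on this slide can be sketched as follows. The field names come from the slide; the function name is hypothetical and the real frontend may generate ids differently:

```python
import json
import time
import uuid


def wrap_event(payload):
    """Attach a generated id and a millisecond timestamp to the published payload."""
    return {
        "id": str(uuid.uuid4()),
        "timestamp": int(time.time() * 1000),
        "data": payload,
    }


message = wrap_event({"event": "test"})
print(json.dumps(message))
# The same id would be handed back to the publisher in the Message-id header,
# so a single event can be traced from publication to delivery.
```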

Page 35: Geecon.cz 2015 debski krzysztof

Lost events

[Diagram: Producer, Hermes Frontend, Kafka Broker, Zookeeper, Hermes Consumer, Consumer, Tracker; the Tracker collects publication data and delivery attempts]

Page 36: Geecon.cz 2015 debski krzysztof

Normal operation

[Diagram: Producer, Hermes Frontend, Kafka Broker, Zookeeper, Hermes Consumer, Consumer]

POST

HTTP 201 Created

Page 37: Geecon.cz 2015 debski krzysztof

Abnormal situation

[Diagram: Producer, Hermes Frontend, Kafka Broker, Zookeeper, Hermes Consumer, Consumer]

POST

HTTP 202 Accepted
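The two slides suggest that the frontend answers 201 Created when Kafka confirms the event and 202 Accepted when it does not, in which case the event has to be kept and sent later. A sketch of that decision, with a hypothetical `send_to_kafka` callable and an in-memory list standing in for the frontend's buffer:

```python
def publish(event, send_to_kafka, buffer):
    """Return the HTTP status for one published event.

    send_to_kafka(event) is a hypothetical callable that returns True when the
    broker acknowledged the event in time and False otherwise.
    """
    if send_to_kafka(event):
        return 201  # Created: the event is stored in Kafka
    buffer.append(event)  # keep the event so it can be resent later
    return 202  # Accepted: taken over, but not yet confirmed by Kafka


pending = []
print(publish({"event": "test"}, lambda event: True, pending))   # 201
print(publish({"event": "test"}, lambda event: False, pending))  # 202
print(len(pending))                                              # 1
```

The publisher gets a fast answer in both cases and can tell from the status code whether the event is already durable.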

Page 38: Geecon.cz 2015 debski krzysztof

Focus on throughput or latency

Page 39: Geecon.cz 2015 debski krzysztof

Throughput

Does the whole world stop when you stop?

Page 40: Geecon.cz 2015 debski krzysztof

Throughput

Co-ordinated omission
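Coordinated omission happens when a load generator waits for a slow response before sending the next request, so the requests that should have been sent in the meantime are never measured. A minimal sketch of the correction, assuming a fixed planned send interval and measuring each latency from the intended send time; all names and numbers are illustrative:

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


def corrected_latencies(service_times, interval):
    """Latency measured from the time each request SHOULD have been sent."""
    latencies = []
    clock = 0.0  # time at which the generator is free to send again
    for index, service_time in enumerate(service_times):
        intended_start = index * interval
        actual_start = max(clock, intended_start)
        finish = actual_start + service_time
        latencies.append(finish - intended_start)
        clock = finish
    return latencies


# One request every 10 ms is planned; the third response stalls for 100 ms.
service_times = [1, 1, 100, 1, 1, 1]
corrected = corrected_latencies(service_times, interval=10)

print(percentile(service_times, 75))  # naive view: 1
print(percentile(corrected, 75))      # corrected view: 91
```

The naive numbers hide the stall behind a single outlier, while the corrected ones show that every request planned during the stall was slow too.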

Page 41: Geecon.cz 2015 debski krzysztof

Slow responses

[Chart: response time at the 75%, 99% and 99.9% percentiles]

Page 42: Geecon.cz 2015 debski krzysztof

Slow response vs. message size

[Chart: message size at the 75%, 99% and 99.9% percentiles]

Page 43: Geecon.cz 2015 debski krzysztof

Slow response and fixed message size

[Chart: response time at the 75%, 99% and 99.9% percentiles]

Page 44: Geecon.cz 2015 debski krzysztof

Kafka

kernel 3.2.x

Page 46: Geecon.cz 2015 debski krzysztof

Kafka

[Chart: kernel 3.2.x vs. kernel >= 3.8.x]

Page 47: Geecon.cz 2015 debski krzysztof

Optimize message size

[Chart: message size at the 99.9% percentile, all topics vs. the biggest topic]

Page 48: Geecon.cz 2015 debski krzysztof

Optimize message size

JSON

human readable

big memory and network footprint

poor support for Hadoop

Page 49: Geecon.cz 2015 debski krzysztof

Optimize message size

JSON

Snappy

ERROR Error when sending message to topic t3 with key: 4 bytes, value: 100 bytes with error: The server experienced an unexpected error when processing the request (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)

java: target/snappy-1.1.1/snappy.cc:423: char* snappy::internal::CompressFragment(const char*, size_t, char*, snappy::uint16*, int): Assertion `0 == memcmp(base, candidate, matched)' failed.

errors when publishing a large number of messages

Page 50: Geecon.cz 2015 debski krzysztof

Optimize message size

JSON

Snappy

Lz4

failed on distributed data

[Chart: compression ratio, single topic vs. multiple topics]

Page 51: Geecon.cz 2015 debski krzysztof

Optimize message size

JSON

Snappy

Lz4

Avro

small network footprint

Hadoop friendly

easy schema verification
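The size argument can be illustrated without any Avro library. The sketch below uses Python's `struct` module as a stand-in for a schema-based binary format; it is not Avro's wire format, and the event fields are hypothetical. The point is only that once both sides share a schema, field names no longer travel with every message:

```python
import json
import struct

# Hypothetical event schema agreed on by both sides:
# (user_id: uint32, item_id: uint32, price_cents: int64), little-endian.
SCHEMA = struct.Struct("<IIq")


def encode_json(user_id, item_id, price_cents):
    """Self-describing encoding: field names are repeated in every message."""
    record = {"user_id": user_id, "item_id": item_id, "price_cents": price_cents}
    return json.dumps(record).encode("utf-8")


def encode_binary(user_id, item_id, price_cents):
    """Schema-based encoding: only the values are sent."""
    return SCHEMA.pack(user_id, item_id, price_cents)


def decode_binary(payload):
    """Decoding fails with struct.error if the payload does not fit the schema."""
    return SCHEMA.unpack(payload)


as_json = encode_json(42, 1001, 19999)
as_binary = encode_binary(42, 1001, 19999)
print(len(as_json), len(as_binary))
print(decode_binary(as_binary))
```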

Page 52: Geecon.cz 2015 debski krzysztof

Kafka Offset Monitor

Page 53: Geecon.cz 2015 debski krzysztof

Improve security

Page 54: Geecon.cz 2015 debski krzysztof

KAFKA-1682

Kafka <= 0.8.2

No security!

Kafka > 0.8.2

unix-like users, permissions, ACL

Page 55: Geecon.cz 2015 debski krzysztof

Manage your topics

cz.geecon.demo.basic

Group: cz.geecon.demo

Topic: basic

Page 56: Geecon.cz 2015 debski krzysztof

Improved security

Authentication and authorization interfaces provided

By default:

You can create any topic in your group

You can publish everywhere (in progress)

Group owner defines subscriptions

Page 57: Geecon.cz 2015 debski krzysztof

Think ahead

Page 58: Geecon.cz 2015 debski krzysztof

Consumer backoff

You can’t have exactly-once delivery

http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

Page 59: Geecon.cz 2015 debski krzysztof

Improved offset management

[Diagram: Producer, Hermes, Hermes consumer; labels: Publish event, Read event, Committed, Local unsent events]
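One reading of this slide, stated as an assumption: the consumer reads ahead of what it has delivered, so it may only commit up to the oldest event that is still unsent. After a restart the unsent events are then read again instead of being skipped. A minimal sketch with hypothetical names:

```python
def committable_offset(last_read_offset, unsent_offsets):
    """Highest offset that is safe to commit.

    Everything below the oldest unsent event has been delivered. If nothing is
    pending, the last read offset itself is safe.
    """
    if unsent_offsets:
        return min(unsent_offsets) - 1
    return last_read_offset


print(committable_offset(120, {117, 119}))  # 116: events 117 and 119 are pending
print(committable_offset(120, set()))       # 120: everything was delivered
```

Events between the committed offset and the read position that were already delivered may be sent again after a restart, which is the at-least-once trade-off.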

Page 60: Geecon.cz 2015 debski krzysztof

Improved offset management

[Diagram: Hermes consumer with local unsent events sends a new event to a service instance]

Page 61: Geecon.cz 2015 debski krzysztof

Improved offset management

[Diagram: Hermes consumer with local unsent events sends a new event to a service instance]

HTTP 503 Unavailable

Page 62: Geecon.cz 2015 debski krzysztof

Improved offset management

[Diagram: Hermes consumer with local unsent events sends a new event to a service instance]

HTTP 503 Unavailable

Check TTL & Add to queue
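The "Check TTL & Add to queue" step can be sketched as below. The event layout, the function name and the choice to treat every non-2xx status as a failed attempt are assumptions made for illustration:

```python
from collections import deque


def handle_response(event, status, now, retry_queue):
    """Decide what to do with an event after one delivery attempt.

    event: dict with hypothetical keys 'payload' and 'expires_at' (seconds).
    Returns 'delivered', 'requeued' or 'dropped'.
    """
    if 200 <= status < 300:
        return "delivered"
    if now >= event["expires_at"]:
        return "dropped"  # TTL exceeded: stop retrying this event
    retry_queue.append(event)  # still within TTL: try again later
    return "requeued"


queue = deque()
event = {"payload": "test", "expires_at": 100}
print(handle_response(event, 503, now=10, retry_queue=queue))   # requeued
print(handle_response(event, 503, now=150, retry_queue=queue))  # dropped
print(handle_response(event, 200, now=20, retry_queue=queue))   # delivered
```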

Page 63: Geecon.cz 2015 debski krzysztof

Consumer backoff

100% adapt 1/s 1/min

Page 64: Geecon.cz 2015 debski krzysztof

Turn back the time

PUT /groups/{group}/topics/{topic}/subscriptions/{subscription}/retransmission -8h
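A sketch of how such a request could be built with the standard library. The path template and the `-8h` value come from the slide; treating `-8h` as the request body, the host name and the subscription name are assumptions. The request is only constructed here, not sent:

```python
from urllib.parse import quote
from urllib.request import Request


def retransmission_request(base_url, group, topic, subscription, offset="-8h"):
    """Build (but do not send) a PUT request that rewinds a subscription."""
    path = "/groups/{}/topics/{}/subscriptions/{}/retransmission".format(
        quote(group, safe=""), quote(topic, safe=""), quote(subscription, safe="")
    )
    return Request(base_url + path, data=offset.encode("utf-8"), method="PUT")


request = retransmission_request(
    "http://hermes.example.com",  # hypothetical management host
    "cz.geecon.demo",
    "basic",
    "my-subscription",            # hypothetical subscription name
)
print(request.get_method(), request.full_url)
```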

Page 65: Geecon.cz 2015 debski krzysztof

Blog: http://allegro.tech

Twitter: @allegrotechblog

Twitter: @debskichris