The Path of Monitoring 2.0: Everything Has Changed / Vsevolod Polyakov (Grammarly)

MONITORING. AGAIN.

Vsevolod Polyakov

Platform Engineer, Grammarly

ctrlok.com

What are metrics?

Success

Count

Time

Interaction

Internal processes

System metrics

Why do we need metrics?

Alerts

Analytics

Graphite

Default graphite architecture

what?

• RRD-like (http://gram.ly/gfsx)

• so.it.is.my.metric → /so/it/is/my/metric.wsp

• Fixed retention (by name/pattern)

• Fixed size (actually no)

Retention and size

• 1s:1d → 1 036 828 bytes

• 10s:10d → 1 036 828 bytes

• 1s:365d → 378 432 028 bytes (1 TB ~ 3 000 metrics)

• 10s:365d → 37 843 228 bytes (1 TB ~ 30 000 metrics)

whisper calc: https://gist.github.com/ndemengel/7ae93d260a4649e4e99b

• 10s:30d,1m:120d,10m:365d → 4 564 864 bytes

• 240 864 metrics in 1 TB
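
The single-archive sizes above follow from the whisper file layout: a 16-byte header, 12 bytes of archive info per archive, and 12 bytes per stored point. A minimal Python sketch of that arithmetic; the function names are illustrative and not part of whisper's API:

```python
# Whisper on-disk layout: fixed header, one info record per archive,
# then one fixed-size slot for every point in every archive.
METADATA_SIZE = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = 12  # offset, seconds per point, number of points
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double value

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(text):
    """Convert a duration such as '10s' or '365d' to seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def whisper_size(retention):
    """Return the file size in bytes for a retention string like '1s:1d'."""
    size = METADATA_SIZE
    for archive in retention.split(","):
        precision, duration = archive.split(":")
        points = to_seconds(duration) // to_seconds(precision)
        size += ARCHIVE_INFO_SIZE + points * POINT_SIZE
    return size


for retention in ("1s:1d", "10s:10d", "1s:365d", "10s:365d"):
    print(retention, whisper_size(retention))
```

The size depends only on the number of points, which is why 1s:1d and 10s:10d (86 400 points each) produce files of the same size.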



• aggregation: average, sum, min, max, and last

• can be assigned per metric
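
When points roll from a finer archive into a coarser one, each group of points is collapsed with the configured aggregation method. A rough sketch of that step, ignoring gaps in the data; the function name and sample values are made up for illustration:

```python
AGGREGATIONS = {
    "average": lambda values: sum(values) / len(values),
    "sum": sum,
    "min": min,
    "max": max,
    "last": lambda values: values[-1],
}


def downsample(values, factor, method="average"):
    """Collapse every `factor` consecutive values into one aggregated value."""
    aggregate = AGGREGATIONS[method]
    return [
        aggregate(values[i:i + factor])
        for i in range(0, len(values), factor)
    ]


# Six 10-second points rolled up into one 1-minute point.
points = [1, 2, 3, 10, 20, 30]
for method in ("average", "sum", "min", "max", "last"):
    print(method, downsample(points, 6, method))
```

The choice matters per metric: a counter of requests usually wants sum, while a latency gauge usually wants average or max.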

How

• terraform (https://www.terraform.io/)

• docker (https://www.docker.com/)

• ansible (https://www.ansible.com/)

• rocker (https://github.com/grammarly/rocker)

• rocker-compose (https://github.com/grammarly/rocker-compose)




carbon-cache.py (https://github.com/graphite-project/carbon)

• single-core

• many options in config file

• default

carbon-cache.py architecture

Start load testing

• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk)

• retentions = 1s:1d

• MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf

• defaults

• almost 1.5 h to reach the limit :(
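
The slides do not say which tool generated the load. As an illustration only: carbon's plaintext protocol takes one `path value timestamp` line per data point, so a generator could be sketched as follows. The host, port 2003 (carbon's default plaintext listener), metric names, and rates are assumptions:

```python
import socket
import time


def format_line(path, value, timestamp):
    """One data point in carbon's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def build_batch(prefix, count, timestamp):
    """Build `count` distinct metrics for a single second."""
    return "".join(
        format_line(f"{prefix}.metric{i}", i, timestamp) for i in range(count)
    )


def send_load(host, port, metrics_per_second, seconds):
    """Send a fixed number of metrics every second over one TCP connection."""
    with socket.create_connection((host, port)) as conn:
        for _ in range(seconds):
            started = time.time()
            batch = build_batch("loadtest", metrics_per_second, started)
            conn.sendall(batch.encode())
            # Sleep for the rest of the second to hold the target rate.
            time.sleep(max(0.0, 1.0 - (time.time() - started)))


# Example (needs a running carbon daemon):
# send_load("127.0.0.1", 2003, metrics_per_second=1000, seconds=60)
```

Raising `metrics_per_second` step by step until the daemon's cache starts to grow is one way to find the limit the following slides talk about.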

[Graphs: carbon-cache.py cache size as load grows to 75k metrics/s; updates; update time]

results

• 75 000 metrics/s max

• 60 000 metrics/s flagman speed

• I/O :(

Try to tune!

• WHISPER_SPARSE_CREATE = true (don’t allocate space on creation). Non-linear I/O load.

• CACHE_WRITE_STRATEGY = sorted (default)
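
Sparse creation means the file gets its full logical size without the daemon writing every zero byte. A small Python illustration of the two approaches (not carbon's own code; the function names are hypothetical, and whether blocks are really left unallocated depends on the filesystem):

```python
import os
import tempfile


def create_full(path, size, chunk_size=16384):
    """Allocate the file by writing zeros for every byte."""
    zeros = bytes(chunk_size)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            step = min(chunk_size, remaining)
            f.write(zeros[:step])
            remaining -= step


def create_sparse(path, size):
    """Seek to the last byte and write a single zero; the rest is a hole."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")


with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full.wsp")
    sparse = os.path.join(tmp, "sparse.wsp")
    create_full(full, 1036828)
    create_sparse(sparse, 1036828)
    # Both files report the same logical size.
    print(os.path.getsize(full), os.path.getsize(sparse))
```

The trade-off is that the real allocation happens later, when points are first written, which fits the non-linear I/O load noted on the slide.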

[Graph: cache size, load from 1k to 195k metrics/s]

results

• 120 000 metrics/s flagman speed

• cache flush problem :(

Try to tune!

• CACHE_WRITE_STRATEGY = max: gives a strong flush preference to frequently updated metrics and also reduces random file I/O.

[Graph: from 1k to 150k]

results

• 90 000 metrics/s flagman speed

• cache flush problem :(

Try to tune!

• CACHE_WRITE_STRATEGY = naive: just flush, in no particular order. Better suited to storage with fast random I/O.

[Graph: from 45k to 135k]

results

• 120 000 metrics/s flagman speed

• still CPU

[Graphs comparing the three strategies: sorted, max, naive]
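
The three write strategies differ only in which cached metric is flushed next. A simplified sketch of that choice, not carbon's actual implementation; the cache here is assumed to be a plain dict of metric name to a list of (timestamp, value) points:

```python
def sorted_order(cache):
    """sorted: rank metrics once by cached point count, then flush in that order."""
    return sorted(cache, key=lambda metric: len(cache[metric]), reverse=True)


def max_next(cache):
    """max: before every write, pick the metric with the most cached points."""
    return max(cache, key=lambda metric: len(cache[metric]))


def naive_order(cache):
    """naive: flush in whatever order the cache yields the metrics."""
    return list(cache)


cache = {
    "app.requests": [(1, 5), (2, 7), (3, 6)],
    "app.errors": [(1, 0)],
    "app.latency": [(1, 120), (2, 130)],
}
print(sorted_order(cache))
print(max_next(cache))
print(naive_order(cache))
```

Flushing the metric with the most points first turns many small writes into fewer large ones, at the cost of the CPU time spent ranking the cache.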

• Maybe it’s an EBS I/O limitation? → 512 GB disk.

• No.

go-carbon (https://github.com/lomik/go-carbon)

• multi-core single daemon

• written in golang

• not many options to tune :(


Start load testing

• max-size = 0

• max-updates-per-second = 0

• almost 1 h to reach the limit :(

[Graph: 1k → 130k metrics/s, ~3k/min]

results

• 120 000 metrics/s flagman speed

• but it’s without sparse.

• try to implement

try to tune!

    // Sparse create: seek to the last byte and write a single zero
    // instead of filling the whole file with zeros.
    remaining := whisper.Size() - whisper.MetadataSize()
    whisper.file.Seek(int64(remaining-1), 0)
    whisper.file.Write([]byte{0})
    chunkSize := 16384
    zeros := make([]byte, chunkSize)
    for remaining > chunkSize {
        // The original zero-filling write is disabled:
        // if _, err = whisper.file.Write(zeros); err != nil {
        //     return nil, err
        // }
        remaining -= chunkSize
    }
    if _, err = whisper.file.Write(zeros[:remaining]); err != nil {
        return nil, err
    }

Already in go-carbon

180 000 metrics/s!

try to tune!

• max-updates-per-second = 1500

results

• TL;DR: 210 000 - 240 000 metrics/s flagman speed

• 31 000 000 cache size!

try to tune!

• max-updates-per-second = 0

• input-buffer = 400 000

results

• 10-20 million cache size!

try to tune!

• vm.dirty_background_ratio=40

• vm.dirty_ratio=60
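
These two sysctls control how much dirty (not yet written) page cache the kernel tolerates: background writeback starts at `vm.dirty_background_ratio`, and writing processes are throttled at `vm.dirty_ratio`. Both are percentages of available memory, so the thresholds can be estimated as below. The 16 GiB figure assumes the same m4.xlarge instance as in the earlier test and uses total RAM; the memory the kernel actually counts is smaller, so treat the results as upper bounds:

```python
GIB = 1024 ** 3


def dirty_thresholds(available_bytes, background_ratio, dirty_ratio):
    """Return (background, blocking) dirty-page thresholds in bytes."""
    background = available_bytes * background_ratio // 100
    blocking = available_bytes * dirty_ratio // 100
    return background, blocking


background, blocking = dirty_thresholds(16 * GIB, 40, 60)
print(f"background writeback starts at ~{background / GIB:.1f} GiB of dirty pages")
print(f"writers are throttled at ~{blocking / GIB:.1f} GiB of dirty pages")
```

Higher ratios let the kernel batch more writes in memory before touching the disk, which also means more data is at risk if the machine dies.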

300 000 req/s

results

• 180k+ metrics/s with almost no cache

Re:Lays



arch: forward

arch: named/regexp

arch: hash

arch: hash, replication factor: 2
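
The relay architectures above differ in how a destination is chosen for each metric. A simplified sketch of the three routing modes; the hash ring is illustrative and does not reproduce the exact ring used by carbon-relay.py, and the destination names are made up:

```python
import bisect
import hashlib
import re


def forward(metric, destinations):
    """forward: every destination gets every metric."""
    return list(destinations)


def by_rules(metric, rules, default):
    """named/regexp: the first rule whose pattern matches picks the destinations."""
    for pattern, destinations in rules:
        if re.search(pattern, metric):
            return list(destinations)
    return list(default)


class HashRing:
    """Simplified consistent-hash ring with several points per node."""

    def __init__(self, nodes, points_per_node=100):
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self.keys = [position for position, _ in self.ring]
        self.node_count = len(set(nodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_nodes(self, metric, replication_factor=1):
        """Walk the ring from the metric's position and collect distinct nodes."""
        wanted = min(replication_factor, self.node_count)
        start = bisect.bisect(self.keys, self._hash(metric))
        chosen = []
        for offset in range(len(self.ring)):
            node = self.ring[(start + offset) % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == wanted:
                    break
        return chosen


ring = HashRing(["cache-a:2004", "cache-b:2004", "cache-c:2004"])
print(ring.get_nodes("so.it.is.my.metric"))
print(ring.get_nodes("so.it.is.my.metric", replication_factor=2))
```

With a replication factor of 2 every metric lands on two distinct nodes, so losing one node does not lose the data, at the price of doubling the write load.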

carbon-relay.py

• twisted based

• native

Start load testing

• c4.xlarge instance (4 CPU, 7.5 GB RAM)

• ~1 Gbit LAN

• default parameters

• hashing

• 10 connections

WTF!

carbon-relay-ng (https://github.com/graphite-ng/carbon-relay-ng)

• golang-based

• web-panel

• live-updates

• aggregators

• spooling

<150 000 req/s

carbon-c-relay

• written in C

• advanced cluster management

from 100 000 to 1 600 000 req/s

1 400 000 req/s flagman speed. Or not?



So… go-carbon + carbon-c-relay = ♡

Containers

Everything is mixed up

Differences

• Environment

• Role

• Track (modifier)

• IP

• Datacenter

• Anything else

Tags

TSDBs with tags

• InfluxDB

• OpenTSDB (HBase)

• Cyanite (Cassandra)

• NewTS (Cassandra)

• Prometheus
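
The dimensions listed earlier (environment, role, track, IP, datacenter) can either be kept as tags next to the metric name or be flattened into a Graphite-style dotted path. A hypothetical sketch of both forms; the tag names, their order in the path, the `name;key=value` rendering, and the sample values are all assumptions for illustration:

```python
# Assumed fixed order in which tags are folded into the dotted path.
TAG_ORDER = ("datacenter", "environment", "role", "track", "ip")


def to_graphite_path(name, tags):
    """Flatten tags into a dotted path; dots inside values are replaced."""
    parts = [str(tags[key]).replace(".", "_") for key in TAG_ORDER if key in tags]
    return ".".join(parts + [name])


def to_tagged(name, tags):
    """Render a 'name;key=value' form with tags sorted by key."""
    return ";".join([name] + [f"{key}={tags[key]}" for key in sorted(tags)])


tags = {
    "environment": "prod",
    "role": "api",
    "track": "canary",
    "ip": "10.0.0.12",
    "datacenter": "us-east-1",
}
print(to_graphite_path("requests.count", tags))
print(to_tagged("requests.count", tags))
```

The dotted form forces one fixed hierarchy on every query, while the tagged form lets any dimension be filtered independently, which is the appeal of the tag-based TSDBs listed here.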

InfluxDB (cluster): 130k metrics/s

OpenTSDB single instance + HBase cluster = up to 150k metrics/s

Compaction

Graphite

Find what is unique

Works with Grafana

Zipper

• https://github.com/grobian/carbonserver

• https://github.com/dgryski/carbonzipper

• https://github.com/dgryski/carbonapi


ALSO

• https://github.com/jssjr/carbonate

• https://github.com/jjneely/buckytools

• https://github.com/dgryski/carbonmem

• https://github.com/grobian/carbonwriter


Plans

• Patch: statsd → ES

• Patch: carbonserver → carbonlink

feel free to ask

• Vsevolod Polyakov

• [email protected]

• skype: ctrlok1987

• github.com/ctrlok

• twitter.com/ctrlok

• slack: HangOps (http://signup.hangops.com/)

• Gitter: dev_ua/devops (https://gitter.im/dev-ua/devops)

• skype: DevOps from Ukraine (https://join.skype.com/kG4nFsvU5iyG)

• slack.ukrops.club

We’re hiring!

Engineering
