MONITORING. AGAIN.
Vsevolod Polyakov
What are metrics?
Success
Count
Time
Interaction
Internal processes
System metrics
Why do we need metrics?
Alerts
Analytics
Graphite
Default graphite architecture
what?
• RRD-like (gram.ly/gfsx)
• so.it.is.my.metric → /so/it/is/my/metric.wsp
• Fixed retention (by name/pattern)
• Fixed size (actually no)
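The name-to-file mapping above can be sketched in a few lines of Go. This is only an illustration: `MetricPath` and the root directory argument are hypothetical names, not part of Graphite itself.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// MetricPath maps a dotted metric name to the whisper file that stores it
// under root: every dot becomes a directory separator and the last
// component gets a ".wsp" extension. (Hypothetical helper for illustration.)
func MetricPath(root, metric string) string {
	return path.Join(root, strings.ReplaceAll(metric, ".", "/")) + ".wsp"
}

func main() {
	fmt.Println(MetricPath("/", "so.it.is.my.metric")) // /so/it/is/my/metric.wsp
}
```

One consequence of this layout is that every distinct metric name is its own file, which is why file creation and random I/O show up later in the load tests.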
Retention and size
• 1s:1d → 1 036 828 bytes
• 10s:10d → 1 036 828 bytes
• 1s:365d → 378 432 028 bytes (~3 000 metrics per 1 TB)
• 10s:365d → 37 843 228 bytes (~30 000 metrics per 1 TB)
whisper calc
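The single-archive sizes above follow from the whisper file layout: a 16-byte header, 12 bytes of archive info per archive, and 12 bytes per stored point. A minimal sketch of that arithmetic, where `Archive` and `WhisperSize` are illustrative names rather than the whisper API:

```go
package main

import "fmt"

const (
	metadataSize    = 16 // file header
	archiveInfoSize = 12 // per-archive header: offset, seconds per point, point count
	pointSize       = 12 // 4-byte timestamp + 8-byte float64 value
)

// Archive is one retention rule such as 10s:10d (illustrative type).
type Archive struct {
	SecondsPerPoint int64
	RetentionSecs   int64
}

// WhisperSize returns the file size in bytes for the given archives.
func WhisperSize(archives ...Archive) int64 {
	size := int64(metadataSize)
	for _, a := range archives {
		points := a.RetentionSecs / a.SecondsPerPoint
		size += archiveInfoSize + points*pointSize
	}
	return size
}

func main() {
	const day = 86400
	fmt.Println(WhisperSize(Archive{1, day}))        // 1036828
	fmt.Println(WhisperSize(Archive{10, 10 * day}))  // 1036828
	fmt.Println(WhisperSize(Archive{1, 365 * day}))  // 378432028
	fmt.Println(WhisperSize(Archive{10, 365 * day})) // 37843228
}
```

The first two retentions give the same size because both hold 86 400 points; only the number of points matters, not the resolution itself.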
Retention and size
• 10s:30d,1m:120d,10m:365d → 4 564 864 bytes
• 240 864 metrics in 1 TB
• aggregation: average, sum, min, max, and last.
• can be assigned per metric
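What the aggregation method changes can be shown on one rollup step, for example six 10-second points collapsing into a single 1-minute point. The sketch below is a simplified illustration; `Aggregate` is a hypothetical helper, not whisper's own code.

```go
package main

import "fmt"

// Aggregate collapses the points of one fine-grained window into the single
// value stored in the next, coarser archive.
func Aggregate(method string, values []float64) (float64, error) {
	if len(values) == 0 {
		return 0, fmt.Errorf("no values to aggregate")
	}
	result := values[0]
	switch method {
	case "average", "sum":
		for _, v := range values[1:] {
			result += v
		}
		if method == "average" {
			result /= float64(len(values))
		}
	case "min":
		for _, v := range values[1:] {
			if v < result {
				result = v
			}
		}
	case "max":
		for _, v := range values[1:] {
			if v > result {
				result = v
			}
		}
	case "last":
		result = values[len(values)-1]
	default:
		return 0, fmt.Errorf("unknown aggregation method %q", method)
	}
	return result, nil
}

func main() {
	window := []float64{1, 2, 3, 4, 5, 9} // six 10s points -> one 1m point
	for _, m := range []string{"average", "sum", "min", "max", "last"} {
		v, _ := Aggregate(m, window)
		fmt.Println(m, v)
	}
}
```

Choosing per metric matters because the right method depends on the data: sum usually suits counters, while average or max usually suits gauges and timings.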
How
• terraform (https://www.terraform.io/)
• docker (https://www.docker.com/)
• ansible (https://www.ansible.com/)
• rocker (https://github.com/grammarly/rocker)
• rocker-compose (https://github.com/grammarly/rocker-compose)
Default graphite architecture
carbon-cache.py
• single-core
• many options in config file
• default
link
carbon-cache.py architecture
Start load testing
• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk)
• retentions = 1s:1d
• MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf
• all other settings at defaults
• almost 1.5 h to reach the limit :(
carbon-cache.py cache size → 75k metrics/s
results
• 75 000 metrics/s max
• 60 000 metrics/s sustained speed
• I/O :(
Try to tune!
• WHISPER_SPARSE_CREATE = true (don't allocate space on creation); gives a non-linear I/O load.
• CACHE_WRITE_STRATEGY = sorted (default)
cache size 1k → 195k metrics/s
results
• 120 000 metrics/s sustained speed
• cache flush problem :(
Try to tune!
• CACHE_WRITE_STRATEGY = max: gives a strong flush preference to frequently updated metrics and also reduces random file I/O.
from 1k to 150k
results
• 90 000 metrics/s sustained speed
• cache flush problem :(
Try to tune!
• CACHE_WRITE_STRATEGY = naive: just flush. Better with random I/O.
from 45k to 135k
results
• 120 000 metrics/s sustained speed
• still CPU-bound
sorted
max
naive
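The difference between the three write strategies comes down to which cached metric is flushed next. The following is a simplified sketch of that choice, based on how carbon.conf describes the strategies. It is not carbon-cache's implementation, and `Cache`, `SortedOrder`, `NextMax` and `NaiveOrder` are made-up names.

```go
package main

import (
	"fmt"
	"sort"
)

// Cache maps a metric name to the number of datapoints waiting to be written.
type Cache map[string]int

// SortedOrder models "sorted": take a snapshot of the cache, order metrics by
// queued datapoints (most first) and flush them in that order.
func SortedOrder(c Cache) []string {
	names := NaiveOrder(c)
	sort.Slice(names, func(i, j int) bool {
		if c[names[i]] != c[names[j]] {
			return c[names[i]] > c[names[j]]
		}
		return names[i] < names[j] // tie-break only to keep the demo deterministic
	})
	return names
}

// NextMax models "max": every time, pick the metric that has the most queued
// datapoints right now. Rarely updated metrics may wait a long time.
func NextMax(c Cache) (string, bool) {
	best, found := "", false
	for name, n := range c {
		if !found || n > c[best] || (n == c[best] && name < best) {
			best, found = name, true
		}
	}
	return best, found
}

// NaiveOrder models "naive": flush in whatever order the cache yields metrics
// (Go map iteration order is unspecified, which fits the idea).
func NaiveOrder(c Cache) []string {
	names := make([]string, 0, len(c))
	for name := range c {
		names = append(names, name)
	}
	return names
}

func main() {
	c := Cache{"a": 5, "b": 1, "c": 3}
	fmt.Println(SortedOrder(c)) // [a c b]
	next, _ := NextMax(c)
	fmt.Println(next) // a
	fmt.Println(len(NaiveOrder(c))) // 3
}
```

In this model "sorted" pays for a full sort on every pass and "max" rescans the cache for every flush, while "naive" does no extra bookkeeping at all.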
• Maybe it's an EBS I/O limitation? → 512 GB disk.
• No.
go-carbon
• multi-core single daemon
• written in golang
• not many options to tune :(
link
Start load testing
• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk)
• retentions = 1s:1d
• max-size = 0
• max-updates-per-second = 0
• almost 1 h to reach the limit :(
1k → 130k metrics/s, ~3k/min
results
• 120 000 metrics/s sustained speed
• but that's without sparse file creation.
• try to implement it
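try to tune!
    // Sparse-create hack: bring the file to its full size without writing every zero block.
    remaining := whisper.Size() - whisper.MetadataSize()
    // Seek close to the end (0 = relative to the start of the file) and write a single
    // byte, so the file is extended without allocating the blocks in between.
    whisper.file.Seek(int64(remaining-1), 0)
    whisper.file.Write([]byte{0})
    chunkSize := 16384
    zeros := make([]byte, chunkSize)
    for remaining > chunkSize {
        // Original zero-fill, disabled:
        // if _, err = whisper.file.Write(zeros); err != nil {
        //     return nil, err
        // }
        remaining -= chunkSize
    }
    if _, err = whisper.file.Write(zeros[:remaining]); err != nil {
        return nil, err
    }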
Already available in go-carbon
180 000 metrics/s!
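For comparison, sparse creation can also be sketched as a standalone program. This version uses `os.File.Truncate` instead of the seek-and-write trick from the slide, and it skips the whisper header entirely; `CreateSparse` and `sparseDemo` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// CreateSparse creates a file with the given logical size without writing any
// data. On filesystems that support sparse files, blocks are only allocated
// when they are first written.
func CreateSparse(path string, size int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if err := f.Truncate(size); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// sparseDemo creates a sparse file in a temporary directory and returns the
// logical size reported by the filesystem.
func sparseDemo(size int64) (int64, error) {
	dir, err := os.MkdirTemp("", "sparse")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	name := filepath.Join(dir, "metric.wsp")
	if err := CreateSparse(name, size); err != nil {
		return 0, err
	}
	info, err := os.Stat(name)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := sparseDemo(1036828) // size of a 1s:1d whisper file
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(size)
}
```

Creation becomes cheap because no zeros are written up front; the cost moves to the first real write of each block, which is one plausible reason the I/O load stops being linear.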
try to tune!
• max update operation = 1500
results
• TL;DR: 210 000-240 000 metrics/s sustained speed
• 31 000 000 cache size!
try to tune!
• max update operation = 0
• input-buffer = 400 000
results
• 270 000 metrics/s sustained speed
• 10-20 million cache size!
try to tune!
• vm.dirty_background_ratio=40
• vm.dirty_ratio=60
300 000 req/s
results
• 300 000 metrics/s sustained speed
• 180k+ metrics/s with almost no cache
Re:Lays
Default graphite architecture
arch forward
arch named/regexp
arch hash
arch hash, replication factor: 2
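Hash routing with a replication factor can be illustrated with a short sketch. To keep it small, it uses rendezvous (highest-random-weight) hashing with FNV-1a rather than the consistent-hash rings that the carbon relays implement, so it shows the idea and not their exact algorithm. `Route` and the node names are invented for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score gives every (node, metric) pair a pseudo-random but stable weight.
func score(node, metric string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(node))
	h.Write([]byte{0}) // separator so "ab"+"c" and "a"+"bc" hash differently
	h.Write([]byte(metric))
	return h.Sum64()
}

// Route returns the `replicas` distinct nodes that should receive the metric.
// The same metric always maps to the same nodes while the node list is stable.
func Route(metric string, nodes []string, replicas int) []string {
	if replicas < 0 {
		replicas = 0
	}
	if replicas > len(nodes) {
		replicas = len(nodes)
	}
	ranked := append([]string(nil), nodes...) // copy: do not reorder the caller's slice
	sort.Slice(ranked, func(i, j int) bool {
		si, sj := score(ranked[i], metric), score(ranked[j], metric)
		if si != sj {
			return si > sj
		}
		return ranked[i] < ranked[j]
	})
	return ranked[:replicas]
}

func main() {
	nodes := []string{"cache-a", "cache-b", "cache-c"} // hypothetical backends
	// Replication factor 2: each metric is sent to two different backends.
	fmt.Println(Route("so.it.is.my.metric", nodes, 2))
}
```

With a replication factor of 2 every metric lives on two backends, so losing one node loses no data, at the price of doubling the write load behind the relay.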
carbon-relay.py
• twisted based
• native
Start load testing
• c4.xlarge instance (4 CPU, 7.5 GB RAM)
• ~1 Gbit/s LAN
• default parameters
• hashing
• 10 connections
WTF!
carbon-relay-ng
• golang-based
• web-panel
• live-updates
• aggregators
• spooling
link
<150 000 req/s
carbon-c-relay
• written in C
• advanced cluster management
from 100 000 to 1 600 000 req/s
1 400 000 sustained speed. Or not?
So… go-carbon + carbon-c-relay = ♡
Containers
Everything is mixed up
Differences
• Environment
• Role
• Track (Modifier)
• IP
• Datacenter
• Anything else
Tags
TSDBs with tags
• influxDB
• openTSDB (hbase)
• cyanite (cassandra)
• newTS (cassandra)
• Prometheus
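The tagged model these databases offer can be contrasted with a dotted Graphite-style path in a small sketch. Everything here is illustrative: the tag names (`env`, `role`, `dc`), the `name;key=value` encoding and both helper functions are assumptions made for the example, not the format of any particular database listed above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// SeriesKey identifies a series by name plus tags. Keys are sorted so the same
// set of tags always produces the same identifier, whatever order they arrive in.
func SeriesKey(name string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		b.WriteString(";" + k + "=" + tags[k])
	}
	return b.String()
}

// DottedPath squeezes the same dimensions into a hierarchical name. The order
// of the levels has to be agreed on up front and is fixed from then on.
func DottedPath(name string, order []string, tags map[string]string) string {
	parts := make([]string, 0, len(order)+1)
	for _, k := range order {
		parts = append(parts, tags[k])
	}
	parts = append(parts, name)
	return strings.Join(parts, ".")
}

func main() {
	tags := map[string]string{"env": "prod", "role": "web", "dc": "us-east-1"}
	fmt.Println(SeriesKey("requests", tags))
	// requests;dc=us-east-1;env=prod;role=web
	fmt.Println(DottedPath("requests", []string{"env", "dc", "role"}, tags))
	// prod.us-east-1.web.requests
}
```

With tags a new dimension is just one more key, whereas in the dotted form it changes the depth of every path and every query written against it.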
InfluxDB (cluster): 130k metrics/s
OpenTSDB single instance + HBase cluster = up to 150k metrics/s
Compaction
Graphite
Find what is unique
Works with Grafana
Zipper
• https://github.com/grobian/carbonserver
• https://github.com/dgryski/carbonzipper
• https://github.com/dgryski/carbonapi
ALSO
• https://github.com/jssjr/carbonate
• https://github.com/jjneely/buckytools
• https://github.com/dgryski/carbonmem
• https://github.com/grobian/carbonwriter
Plans
• Patch statsd → ES
• Patch carbonserver → carbonlink
feel free to ask
• Vsevolod Polyakov
• skype: ctrlok1987
• github.com/ctrlok
• twitter.com/ctrlok
• slack: HangOps
• Gitter: dev_ua/devops
• skype: DevOps from Ukraine
• slack.ukrops.club
We're hiring!