
Page 1: Elk stack @inbot

The ELK Stack @ Inbot

Jilles van Gurp - Inbot Inc.

Page 2: Elk stack @inbot

Who is Jilles?

www.jillesvangurp.com, and @jillesvangurp on everything I've signed up for

Java, (J)Ruby, Python, Javascript/node.js

Servers, reluctant DevOps guy, software architecture

Universities of Utrecht (NL), Blekinge (SE), and Groningen (NL)

GX (NL), Nokia Research (FI), Nokia/Here (DE), Localstream (DE), Inbot (DE).

Page 3: Elk stack @inbot

Inbot app - available for Android & iOS

Page 4: Elk stack @inbot

ELK Stack?

Elasticsearch

Logstash

Kibana

Page 5: Elk stack @inbot

Recent trends

Clustered/scalable time series DBs

People other than sysadmins are looking at graphs

Databases do some funky stuff these days: aggregations, search

Serverless, Docker, Amazon Lambda, Microservices etc. - where do the logs go?

More moving parts = more logs than ever

Page 6: Elk stack @inbot

Logging

Kind of a boring topic ...

Stuff runs on servers, cloud, whatever

Produces errors, warnings, debug output, telemetry, analytics, KPIs, UX events, ...

Where does all this go and how do you make sense of it?

WHAT IS HAPPENING??!?!

Page 7: Elk stack @inbot

Old school: cat, grep, awk, cut, …

Good luck with that on 200GB of unstructured logs from a gazillion microservices on 40 virtual machines, docker images, etc.

That doesn't really work anymore ...

If you are doing this: you are doing it wrong!

Page 8: Elk stack @inbot

Hadoop ecosystem?

Works great for structured data, if you know what you are looking for.

Requires a lot of infrastructure and hassle.

Not really real-time, tedious to explore data

Some hipster with a Ph.D. will fix it or ...

I’m not a data scientist, are you?

Page 9: Elk stack @inbot

Monitoring/graphing ecosystem

Mostly geared at measuring stuff like CPU load, IO, memory, etc.

Intended for system administrators

What about the higher level stuff?

You probably should do monitoring but it’s not really what we need either ...

Page 10: Elk stack @inbot

So, ELK ….

Page 11: Elk stack @inbot

Logging

Most languages/servers ship with awful logging defaults; you can fix this.

Log enough but not too much or too little.

Log at the right log level ⇒ Turn off DEBUG log. Use ERROR sparingly.

Log metadata so you can pick your logs apart ⇒ metadata == JSON fields (see the sketch below).

Log opportunistically, it's cheap
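For illustration, a minimal sketch of "log at the right level, with metadata", assuming SLF4J; the class, method, and field names (OrderService, process, order_id) are made up for the example:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OrderService {
        private static final Logger LOG = LoggerFactory.getLogger(OrderService.class);

        void handleOrder(String orderId) {
            LOG.debug("starting order processing for {}", orderId); // cheap, turned off in production
            try {
                process(orderId); // hypothetical business logic
                LOG.info("order processed order_id={}", orderId);
            } catch (Exception e) {
                // ERROR sparingly: only for things somebody needs to act on
                LOG.error("order processing failed order_id={}", orderId, e);
            }
        }

        void process(String orderId) { /* ... */ }
    }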

Page 12: Elk stack @inbot

Too much logging

Your Elasticsearch cluster dies/you pay a fortune to keep data around that you don’t need.

Not enough logging

Something happened but you don't know what, because there's nothing in the logs; you can't find the relevant events because metadata is missing.

You will spend what you saved in storage on finding out WTF is going on, and probably more.

Page 13: Elk stack @inbot

Log entries in ELK

{
  "message": "[3017772.750979] device-mapper: thin: 252:0: unable to service pool target messages in READ_ONLY or FAIL mode",
  "@timestamp": "2016-08-16T09:50:01.000Z",
  "type": "syslog",
  "host": "10.1.6.7",
  "priority": 3,
  "timestamp": "Aug 16 09:50:01",
  "logsource": "ip-10-1-6-7",
  "program": "kernel",
  "severity": 3,
  "facility": 0,
  "facility_label": "kernel",
  "severity_label": "Error"
}

Page 14: Elk stack @inbot

Plumbing your logs

Simple problem: given some logs, convert them into JSON and shove them into Elasticsearch.

Lots of components to help you do that: Logstash, Docker Gelf driver, Beats, etc.

If you can, log JSON natively: e.g. the Logback logstash driver, http://jsonlines.org/
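As an illustration of the jsonlines idea, a minimal sketch that emits one JSON object per line using Jackson (an assumption for the example; the deck itself points at a Logback driver for doing this properly):

    import java.time.Instant;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonLinesLog {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void main(String[] args) throws Exception {
            Map<String, Object> event = new LinkedHashMap<>();
            event.put("@timestamp", Instant.now().toString());
            event.put("level", "INFO");
            event.put("message", "user signed up");
            event.put("user_id", "123"); // metadata field you can filter on in Kibana
            // exactly one JSON object per line, easy for Logstash/Beats to ship as-is
            System.out.println(MAPPER.writeValueAsString(event));
        }
    }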

Page 15: Elk stack @inbot

Inbot technical setup

Ca. 40 Amazon EC2 instances, most of which run docker containers

VPC with several subnets and a DMZ.

Testing, production, and dev environments + dev infrastructure.

AWS comes with monitoring & alerts for basic stuff.

Everything logs to http://logs-internal.inbot.io/

Elasticsearch 2.2.0, Logstash 2.2.1, Kibana 4.4.1

1 week data retention, 14M events/day

Page 16: Elk stack @inbot

Demo time

Page 17: Elk stack @inbot

Things to watch out for

Avoid split brains and other nasty ES failure modes -> RTFM & configure ...

Data retention policies are not optional

Use curator https://github.com/elastic/curator

Customise your mappings; changing them sucks on a live Logstash cluster. Dynamic mappings on fields that sometimes look like a number will break shit.

Running out of CPU credits in Amazon can kill your ES cluster

ES rolling restarts take time when you have 6 months of logs

Page 18: Elk stack @inbot

Mapped Diagnostic Context (MDC)

Common in Java logging frameworks - log4j, slf4j, logback, etc.

Great for adding context to your logs

E.g. user_id, request url, host name, environment, headers, user agent, etc.

Makes it easy to slice and dice your logs

MDC.put("user_id", "123");
LOG.info("some message");
MDC.remove("user_id");

Page 19: Elk stack @inbot

MDC for node.js: our log4js fork

https://github.com/joona/log4js-node

Allows for MDC style attributes

Sorry: works for us but not in shape for pull request; maybe later.

But: this was an easy hack.

Page 20: Elk stack @inbot

MdcContext

https://github.com/Inbot/inbot-utils/blob/master/src/main/java/io/inbot/utils/MdcContext.java

try (MdcContext ctx = MdcContext.create()) {
    ctx.put("user_id", "123");
    LOG.info("some message");
}

Page 21: Elk stack @inbot

Application Metrics

http://metrics.dropwizard.io/

Add counters, timers, gauges, etc. to your business logic.

metrics.register("httpclient_leased", new Gauge<Integer>() {
    @Override
    public Integer getValue() {
        return connectionManager.getTotalStats().getLeased();
    }
});

Reporter uses MDC to log once per minute: a giant JSON blob, but it works.
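A minimal sketch of wiring up a timer, a counter, and a once-per-minute reporter with Dropwizard Metrics; the stock Slf4jReporter is used here for illustration (our own reporter pushes the registry into MDC instead), and handleRequest is a made-up placeholder:

    import java.util.concurrent.TimeUnit;
    import org.slf4j.LoggerFactory;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Slf4jReporter;
    import com.codahale.metrics.Timer;

    public class MetricsSetup {
        public static void main(String[] args) {
            MetricRegistry metrics = new MetricRegistry();

            // report everything in the registry once per minute
            Slf4jReporter.forRegistry(metrics)
                .outputTo(LoggerFactory.getLogger("metrics"))
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build()
                .start(1, TimeUnit.MINUTES);

            Timer requests = metrics.timer("api_request");
            try (Timer.Context ignored = requests.time()) {
                handleRequest(); // hypothetical business logic being timed
            }
            metrics.counter("signups").inc(); // simple counter for a business event
        }

        static void handleRequest() { /* ... */ }
    }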

Page 22: Elk stack @inbot

Docker GELF driver

Configure your docker hosts to log the output of any docker containers using the log driver.

Command, container id, etc. become fields in the log entry

Nice as a fallback when you don't control the logging

/usr/bin/docker daemon --log-driver=gelf --log-opt gelf-address=udp://logs-internal.inbot.io:12201

Page 23: Elk stack @inbot

Thanks @jillesvangurp, @inbotapp