William Yeh, DevOps Summit 2016 (2016-07-06)
Monitoring
Monitoring: a Process Perspective
CRT (Current Reality Tree)
[Figure: Current Reality Tree connecting numbered entities #1 through #11 with AND junctions; legible node labels: "best practice", "Murphy exists", "DevOps", "Op"]
http://www.slideshare.net/williamyeh/devops-63711710
[Figure: CRT entities regrouped as #7; #9, #1, #2, …; #11, …; #8, #4]
Risk management
• Threats
  • avoid
  • transfer
  • mitigate
• Opportunities
  • exploit
  • enhance
  • share
👍👎
http://www.slideshare.net/williamyeh/whoscall-realtime-monitoring
Process Monitoring
[Slides: "Process" and "Monitoring" shown with question marks between them]
#5
Part 2
Efrat Goldratt-Ashlag
• What to change
• To what to change
• How to cause the change
CRT (Current Reality Tree)
DevOps
DevOps
leverage
✘
TOC
CCPM
FRT (Future Reality Tree)
Stephen R. Covey
What gets measured gets done.
Peter Drucker
Policy
Policy, Buy-in
• What to change
• To what to change
• How to cause the change
Adrian Cockcroft
CloudFront ELB API servers MongoDB
Cloud Manager
CloudWatch
log in S3
StatsD
BigQuery
CloudFront ELB API servers MongoDB
Cloud Manager
CloudWatch
log in S3
BigQuery
http://school.soft-arch.net/blog/125009/change-viewpoint-on-lord-of-rings
Lean Change Canvas
Commitment
Wins/Benefits
Urgency
Target State
Success Criteria
Vision
Communication
Action
Change Recipients
FYI: http://kojenchieh.pixnet.net/blog/post/442550432-firstthing_of_agile_promotion
FYI: http://leankit.com/blog/2015/02/lean-change-method/
Monitoring Q1 (brainstorming), 2016-Jan-06, Iteration #1
TO DO LIST details
Augmented
Lean Change Canvas
Urgency
Target State
Success Criteria
Vision
Communication
Action
Monitoring Q1 (brainstorming), 2016-Jan-06, Iteration #1
• What to change
• To what to change
• How to cause the change
Flow
Tech
Monitoring
Buy-in, Flow
Buy-in, Policy
Flow
TOC Lean Thinking
CCPM
TOC
Lean Thinking
Value, Value stream, Flow, Pull, Perfection
http://school.soft-arch.net/blog/115652/devops-a-lean-perspective
“The Three Ways”
• Create fast flow of work from Dev into IT Ops.
• Shorten and amplify feedback loops.
• Create a culture that simultaneously fosters two things:
  1. continual experimentation and learning from failure;
  2. the understanding that repetition and practice are the prerequisite to mastery.
Create fast flow of work from Dev into IT Ops.
Shorten and amplify feedback loops.
CCPM
Critical Chain Project Management
Flow
TOC Lean Thinking
CCPM
VPC
CloudFront ELB API servers DB
Simplified version
CloudFront ELB API servers DB
ELB API servers DB
Microservices
Simplified version
Flow
Overview
Incoming requests
API servers
DB servers
Flow
TOC
Flow, TOC
Flow, Buy-in
Policy
Tech, Flow
Tech
Personal Preferences
• Golang
• Microservices
• Composability
• OSS ecosystem
of server technologies
Runtime dependency
william Ansible
Scalability
Overhead
Node/system metrics exporter, AWS CloudWatch exporter, Blackbox exporter, Collectd exporter, Consul exporter, Graphite exporter, HAProxy exporter, InfluxDB exporter, JMX exporter, Memcached exporter, Mesos task exporter, MySQL server exporter, SNMP exporter, StatsD exporter
cAdvisor, Doorman, Etcd, Kubernetes-Mesos, Kubernetes, RobustIRC, SkyDNS, Weave Flux
Aerospike exporter, Apache exporter, BIG-IP exporter, BIND exporter, Ceph exporter, CouchDB exporter, Django exporter, Google's mtail log data extractor, Heka dashboard exporter, Heka exporter, IoT Edison exporter, Jenkins exporter, knxd exporter, Meteor JS web framework exporter, Minecraft exporter module, Mirth Connect exporter, MongoDB exporter, Munin exporter, New Relic exporter, Nginx metric library, NSQ exporter, OpenWeatherMap exporter, Passenger exporter, PgBouncer exporter, PostgreSQL exporter, PowerDNS exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter, Rancher exporter, Redis exporter, RethinkDB exporter, rTorrent exporter, scollector exporter, SMTP/Maildir MDA blackbox prober, Speedtest.net exporter, SQL query result set metrics exporter, Ubiquiti UniFi exporter, Varnish exporter, Zookeeper exporter
CloudFront ELB API servers MongoDB
Cloud Manager
CloudWatch
log in S3
StatsD
BigQuery
ELB API servers MongoDB
Cloud Manager
CloudWatch
Prometheus vs Graphite/StatsD
abs(), absent(), bottomk(), ceil(), changes(), clamp_max(), clamp_min(), count_scalar(), delta(), deriv(), drop_common_labels(), exp(), floor(), histogram_quantile(), holt_winters(), increase(), irate(), label_replace(), ln(), log2(), log10(), predict_linear(), rate(), resets(), round(), scalar(), sort(), sort_desc(), sqrt(), time(), topk(), vector(), <aggregation>_over_time()
node_cpu (a series of numbers over time)
node_cpu{mode="idle"}
node_cpu{mode="irq"}
node_cpu with labels {instance="10.0.37.12"}, {service="web"}, {zone="ap-northeast-1a"}
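The label selectors above can be pictured as equality filters over a set of series. The following is a minimal sketch of that idea, not Prometheus's implementation; the `select` helper and the sample series are hypothetical and only illustrate how `node_cpu{mode="idle"}` narrows a metric down by label.

```python
from typing import Dict, List

# A series is identified by its metric name plus a set of labels.
Series = Dict[str, str]


def select(series: List[Series], name: str, **matchers: str) -> List[Series]:
    """Return the series whose name matches and whose labels equal every matcher."""
    return [
        s for s in series
        if s.get("__name__") == name
        and all(s.get(key) == value for key, value in matchers.items())
    ]


# Hypothetical sample series; the label values are illustrative only.
SERIES = [
    {"__name__": "node_cpu", "mode": "idle", "instance": "10.0.37.12", "service": "web"},
    {"__name__": "node_cpu", "mode": "irq", "instance": "10.0.37.12", "service": "web"},
    {"__name__": "node_cpu", "mode": "idle", "instance": "10.0.37.13", "service": "api"},
]

# Roughly node_cpu{mode="idle"} and node_cpu{mode="irq", service="web"}.
idle = select(SERIES, "node_cpu", mode="idle")
web_irq = select(SERIES, "node_cpu", mode="irq", service="web")
```

Each extra label in the selector only narrows the result, which is why the same metric name can serve per-mode, per-instance, and per-service views.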
TCP Timeout
counter:
  node_netstat_TcpExt_TCPTimeWaitOverflow[1m]
gauge:
  irate(node_netstat_TcpExt_TCPTimeWaitOverflow[1m])
aggregate, grouping:
  sum(
    irate(node_netstat_TcpExt_TCPTimeWaitOverflow[1m])
  ) by (ec2tag_Service)
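The counter-to-gauge-to-aggregate steps of the TCP Timeout query can be sketched in a few lines. This is a simplified model under stated assumptions: `irate` here uses only the last two samples of a window and treats a drop in value as a counter reset, and the sample windows and instance addresses are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples; a drop counts as a counter reset."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    increase = v1 if v1 < v0 else v1 - v0
    return increase / (t1 - t0)


def sum_by(rates: List[Tuple[str, float]]) -> Dict[str, float]:
    """Sum values that share the same grouping label, like sum(...) by (label)."""
    totals: Dict[str, float] = defaultdict(float)
    for group, value in rates:
        totals[group] += value
    return dict(totals)


# Hypothetical one-minute windows keyed by (ec2tag_Service, instance).
windows = {
    ("web", "10.0.37.12"): [(0.0, 100.0), (15.0, 130.0)],
    ("web", "10.0.37.13"): [(0.0, 50.0), (15.0, 65.0)],
    ("api", "10.0.37.14"): [(0.0, 7.0), (15.0, 7.0)],
}

per_service = sum_by([(service, irate(s)) for (service, _), s in windows.items()])
```

The raw counter only ever grows, so it is the rate step that turns it into something worth graphing; the grouping step then collapses per-instance rates into one value per service.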
Memory Used
gauge:
  1 - node_memory_MemFree/node_memory_MemTotal
aggregate, grouping:
  avg(
    1 - node_memory_MemFree/node_memory_MemTotal
  ) by (ec2tag_Service)
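Because memory figures are gauges, no rate step is needed; the query is plain arithmetic followed by a grouped average. A minimal sketch of that arithmetic, with hypothetical node values (in GiB) chosen only for illustration:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def memory_used_ratio(mem_free: float, mem_total: float) -> float:
    """Mirror 1 - node_memory_MemFree/node_memory_MemTotal for a single node."""
    return 1 - mem_free / mem_total


def avg_by(values: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average values that share the same grouping label, like avg(...) by (label)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for group, value in values:
        buckets[group].append(value)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


# Hypothetical nodes: (ec2tag_Service, MemFree, MemTotal).
nodes = [
    ("web", 2.0, 8.0),
    ("web", 4.0, 8.0),
    ("api", 1.0, 4.0),
]

used_by_service = avg_by(
    [(service, memory_used_ratio(free, total)) for service, free, total in nodes]
)
```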
CPU Utilization
counter, gauge, aggregate:
  100 - (
    avg by (ec2tag_Service) (
      irate(node_cpu{job="node", mode="idle"}[1m])
    )
  * 100)
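The CPU query reads more easily from the inside out: the idle-seconds counter becomes an idle fraction per CPU, the fractions are averaged, and the result is flipped into a utilization percentage. The sketch below follows those steps under simplifying assumptions; the two-sample windows are hypothetical and counter resets are ignored.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative idle CPU seconds)


def idle_fraction(window: List[Sample]) -> float:
    """Idle seconds gained per wall-clock second, from the last two samples."""
    (t0, v0), (t1, v1) = window[-2], window[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_utilization_percent(idle_windows: List[List[Sample]]) -> float:
    """Mirror 100 - (avg(irate(idle counter)) * 100) for one service."""
    fractions = [idle_fraction(w) for w in idle_windows]
    return 100 - (sum(fractions) / len(fractions)) * 100


# Hypothetical idle counters for two CPUs belonging to the same service.
web_idle_windows = [
    [(0.0, 1000.0), (15.0, 1011.25)],  # idle 75% of the time
    [(0.0, 2000.0), (15.0, 2003.75)],  # idle 25% of the time
]

web_utilization = cpu_utilization_percent(web_idle_windows)
```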
Latency
summary, aggregate, grouping:
  avg(
    request_time_summary
  ) by (ec2tag_Service, quantile)
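Grouping by two labels works the same way as grouping by one, except that the group key becomes a pair. A small sketch, assuming hypothetical per-instance summary values in seconds; note that averaging per-instance quantiles gives an approximation, not the true quantile of the combined traffic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical summary series: (ec2tag_Service, quantile, observed latency in seconds).
summary_series = [
    ("web", "0.5", 0.125),
    ("web", "0.5", 0.375),
    ("web", "0.99", 1.0),
    ("web", "0.99", 2.0),
    ("api", "0.5", 0.5),
]


def avg_by_service_and_quantile(
    series: List[Tuple[str, str, float]]
) -> Dict[Tuple[str, str], float]:
    """Average values per (service, quantile) pair, like avg(...) by (a, b)."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for service, quantile, value in series:
        buckets[(service, quantile)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


latency = avg_by_service_and_quantile(summary_series)
```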
Customized metrics with Fluentd plugin for Prometheus
Conclusion
Policy
Buy-in
Flow
Tech
???
Issue tracking