Jason Hand – DevOps Evangelist

Tips & Tricks to Reduce TTR for the Next Incident

@jasonhand


Time to Resolution (TTR)

•  The total amount of time taken to resolve an incident

•  MTTR – Mean Time To Resolution* – a summary of TTR over time: a measurement that describes the most "typical" value in a set of values. The lower, the better.

*Resolve = Repair = Recover

•  Incident Lifecycle – Alerting – Triage – Investigation – Identification – Resolution – Documentation
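As a minimal sketch of the MTTR definition above, the mean can be computed from each incident's opened and resolved timestamps. The incident data here is hypothetical and only illustrates the arithmetic.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents: (time opened, time resolved)
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 10, 14, 30), datetime(2024, 1, 10, 16, 30)),
    (datetime(2024, 1, 21, 2, 15), datetime(2024, 1, 21, 2, 30)),
]


def ttr_minutes(opened, resolved):
    """Time to Resolution for one incident, in minutes."""
    return (resolved - opened).total_seconds() / 60


# TTR per incident, then the mean across all incidents (MTTR)
ttrs = [ttr_minutes(opened, resolved) for opened, resolved in incidents]
mttr = mean(ttrs)

print(f"TTR per incident (min): {ttrs}")
print(f"MTTR (min): {mttr}")
```

With only a few incidents, one long outage can dominate the mean, so it may be worth looking at the individual TTR values alongside it.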

Alerting

A “zero time” alerting platform that finds people instantly can only really affect average TTR by a very small percentage

Notify on-call members
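To illustrate why faster alerting alone moves TTR only slightly, the sketch below splits one incident into the lifecycle phases and computes each phase's share of the total. The timestamps are invented for illustration, not measurements from the talk.

```python
from datetime import datetime

PHASES = [
    "Alerting", "Triage", "Investigation",
    "Identification", "Resolution", "Documentation",
]

# Hypothetical timestamps: the start of each phase, plus the end of the last one
marks = [
    datetime(2024, 1, 3, 9, 0),    # alert fires
    datetime(2024, 1, 3, 9, 2),    # triage starts
    datetime(2024, 1, 3, 9, 10),   # investigation starts
    datetime(2024, 1, 3, 9, 40),   # identification starts
    datetime(2024, 1, 3, 9, 50),   # resolution starts
    datetime(2024, 1, 3, 10, 5),   # documentation starts
    datetime(2024, 1, 3, 10, 20),  # documentation ends
]

# Minutes spent in each phase
durations = {
    phase: (marks[i + 1] - marks[i]).total_seconds() / 60
    for i, phase in enumerate(PHASES)
}
total = sum(durations.values())

# Fraction of the whole lifecycle spent in each phase
share = {phase: minutes / total for phase, minutes in durations.items()}

for phase in PHASES:
    print(f"{phase:15s} {durations[phase]:5.1f} min  {share[phase]:6.1%}")
```

In this made-up incident, alerting accounts for 2 of 80 minutes, so even instant alerting would save only a small fraction of the total.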

Victor’s Tips

“Include useful content & context in the alerts.”

“Use custom notifications to distinguish critical alerts.”
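One way these two tips could look in practice is sketched below: an alert that carries its own context, and a notification channel chosen by severity. The `Alert` fields, channel names, service name, and URL are all hypothetical and do not reflect any particular alerting product's API.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    severity: str  # e.g. "critical", "warning", "info"
    summary: str
    context: dict = field(default_factory=dict)  # links, metrics, recent changes


def notification_channel(alert):
    """Pick a distinct channel for critical alerts (hypothetical channel names)."""
    return "phone_call" if alert.severity == "critical" else "push"


def render(alert):
    """Build a message that puts the context in front of the responder."""
    lines = [f"[{alert.severity.upper()}] {alert.service}: {alert.summary}"]
    lines += [f"  {key}: {value}" for key, value in alert.context.items()]
    return "\n".join(lines)


# Hypothetical example alert
alert = Alert(
    service="checkout-api",
    severity="critical",
    summary="5xx error rate above 5% for 10 minutes",
    context={
        "runbook": "https://wiki.example.com/runbooks/checkout-5xx",
        "last_deploy": "2024-01-03 08:52 UTC",
    },
)

print(notification_channel(alert))
print(render(alert))
```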

Triage

Assign degrees of urgency to incidents

Victor’s Tips

“Get the right alerts to the right people through routing.”

“Establish a single source of truth for all activities of an incident.”
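Routing can be sketched as a lookup from alert tags to on-call teams, with a fallback so that no alert goes unowned. The tags and team names below are hypothetical placeholders.

```python
# Hypothetical routing table: alert tag -> on-call team
ROUTING_RULES = {
    "database": "dba-oncall",
    "network": "netops-oncall",
    "payments": "payments-oncall",
}

# Team that receives anything no rule matches (also hypothetical)
FALLBACK_TEAM = "ops-oncall"


def route(tags):
    """Return the teams to notify for an alert's tags, in tag order, without duplicates."""
    teams = []
    for tag in tags:
        team = ROUTING_RULES.get(tag)
        if team and team not in teams:
            teams.append(team)
    return teams or [FALLBACK_TEAM]


print(route(["database", "payments"]))
print(route(["cache"]))
```

Keeping the rules in one table is a small step toward a single source of truth: anyone can see who gets paged for what.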

Investigation

•  Log in

•  Check the logs

•  Analyze metrics

•  Review wikis

•  Discuss with the team

Victor’s Tips

“Collaborate & Share.”

“Connect with the right resources and team members.”

Identification

“Everything will be better if I fix this one thing.”

Victor’s Tips

“Provide quick access to accurate metrics & runbooks.”

Resolution

Self-documenting what teams do to solve the problem

Bidirectional integration with your favorite chat client and the VictorOps timeline

Team members performing system actions to fix the problem(s)

Victor’s Tips

“Be vocal & share what is taking place.”

Documentation

Write down and talk about what we did

Runbook

Victor’s Tips

“Conduct (blameless) post-mortems.”


Summary

“Conduct (blameless) post-mortems.”

“Be vocal & share what is taking place.”

“Provide quick access to accurate metrics & runbooks.”

“Collaborate & Share.”

“Connect with the right resources and team members.”

“Get the right alerts to the right people through routing.”

“Establish a single source of truth for all activities of an incident.”

“Include useful content & context in the alerts.”

“Use custom notifications to distinguish critical alerts.”


Thank You
