Scaling Twitter To Go After the Fail Whale
Jonathan Reichhold - Twitter Engineering
Early Twitter....
2010 World Cup Challenge
•Tweet and user requests growing exponentially (good problem)
Load....
Monolithic Architecture
•Ruby on Rails
•Temporally-sharded MySQL
•Memcached
•~60 engineers
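A minimal sketch of what "temporally-sharded" means in practice, assuming each shard holds the tweets created in one time window. The shard names and boundary dates below are hypothetical and do not come from the talk.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical shard boundaries: each shard holds rows created on or after
# its start date and before the next shard's start date.
SHARD_STARTS = [
    datetime(2009, 1, 1, tzinfo=timezone.utc),
    datetime(2009, 7, 1, tzinfo=timezone.utc),
    datetime(2010, 1, 1, tzinfo=timezone.utc),
    datetime(2010, 6, 1, tzinfo=timezone.utc),
]
SHARD_NAMES = ["tweets_2009h1", "tweets_2009h2", "tweets_2010h1", "tweets_2010h2"]


def shard_for(created_at):
    """Return the name of the shard whose time window contains created_at."""
    idx = bisect_right(SHARD_STARTS, created_at) - 1
    if idx < 0:
        raise ValueError("timestamp predates the oldest shard")
    return SHARD_NAMES[idx]


# A tweet written during the 2010 World Cup lands on the newest shard.
world_cup_tweet = datetime(2010, 6, 11, 14, 0, tzinfo=timezone.utc)
print(shard_for(world_cup_tweet))
```

With this layout every new write goes to the newest shard, so write load concentrates on one database no matter how many older shards exist.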
Stabilize & Understand
•Learn & make improvements
•Don’t just survive
Be Realistic & Ambitious
•Prioritize what can be fixed and timeframes for doing it
•Sometimes need the duct tape
•Find patterns and improvements for the long term
A Bad Approach
•Flip switches/branches/other until fixed
http://www.flickr.com/photos/chrism70/1144424032
Science
Step 1: Trustworthy Data
• https://blog.twitter.com/2013/observability-at-twitter
Step 2: Set Expectations
•Being on-call is a job in itself; sustained high stress will burn people out
•Maintain calm and order
Post Mortems
•Improvement becomes part of process
•Stress makes the system stronger, not weaker
Teamwork
•All of this made possible by amazing team and management
•Culture
Capacity Planning & Forecast
•Just in time but realistic
•Figure out real buffers
Longer Term Changes
•Architecture changes take time and changes in organization
Improve Efficiency
•Rails/Ruby -> Scala & JVM
•200-300 RPS -> 10,000-20,000 RPS
•Single process per request -> Finagle
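To see what the throughput jump means for fleet size, here is a rough back-of-the-envelope calculation. The target load of 100,000 RPS is a made-up figure for illustration, and "unit" stands for whatever the slide's RPS numbers are measured per (host or process; the slide does not say).

```python
import math


def units_needed(target_rps, rps_per_unit):
    """Number of serving units required to carry target_rps, rounded up."""
    return math.ceil(target_rps / rps_per_unit)


TARGET_RPS = 100_000  # hypothetical load, not a figure from the talk

# Per-unit throughput ranges quoted on the slide.
for label, low, high in [("Rails/Ruby", 200, 300), ("Scala/JVM", 10_000, 20_000)]:
    most = units_needed(TARGET_RPS, low)    # slowest case needs the most units
    fewest = units_needed(TARGET_RPS, high)
    print(f"{label}: {fewest} to {most} units")
```

Under these assumptions the same load needs a few hundred units at the old throughput and about ten at the new one.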
Service Orientation
•Make changes at interface boundary, not in single monolith
•Team interactions simplified
•Core nouns and verbs
Move out of public cloud
•Flexibility and latency needs demand it at some point
•Hard problem
•Datacenter as failure domain
•Mesos
Dynamic Configuration
•Update routes and compare live vs dark/new
•Quickly adjust to issues
•Faster and less fragile deploys
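One way to picture "update routes and compare live vs dark" is a route table that can be changed at runtime and that mirrors each request to a dark (new) handler while still answering from the live one. This is only a sketch under that assumption; `RouteTable` and the two timeline handlers are hypothetical names, not Twitter's actual tooling.

```python
import threading


class RouteTable:
    """Route map that can be updated while traffic is being served."""

    def __init__(self):
        self._lock = threading.Lock()
        self._routes = {}

    def set_route(self, path, live, dark=None):
        """Point path at a live handler and, optionally, a dark one to compare."""
        with self._lock:
            self._routes[path] = (live, dark)

    def handle(self, path, request):
        """Answer from the live handler; report any disagreement with dark."""
        with self._lock:
            live, dark = self._routes[path]
        live_result = live(request)
        mismatch = None
        if dark is not None:
            try:
                dark_result = dark(request)
                if dark_result != live_result:
                    mismatch = (live_result, dark_result)
            except Exception as exc:  # a dark failure must never reach the user
                mismatch = (live_result, exc)
        return live_result, mismatch


def old_timeline(user_id):
    return ["t1", "t2"]


def new_timeline(user_id):
    return ["t1", "t2", "t3"]


routes = RouteTable()
routes.set_route("/timeline", live=old_timeline, dark=new_timeline)
result, mismatch = routes.handle("/timeline", 42)
print(result, mismatch)
```

Promoting the new handler is then a single `set_route` call rather than a deploy, which is the kind of quick adjustment the slide describes.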
Improve storage
•Gizzard for MySQL
•Improve Memcached
•Storage as a service
•Snowflake IDs
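For readers unfamiliar with Snowflake IDs, the sketch below shows the general idea: a 64-bit integer built from a millisecond timestamp, a worker identifier, and a per-millisecond sequence number, so IDs sort roughly by time without a central database counter. The bit widths and epoch follow the open-source Snowflake project; the datacenter and worker fields are merged into a single 10-bit `worker_id` here for brevity, and the class name is hypothetical.

```python
import threading
import time

EPOCH_MS = 1288834974657   # custom epoch used by open-source Snowflake
WORKER_BITS = 10           # simplification: datacenter + worker bits combined
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeSketch:
    """Illustrative generator of time-ordered 64-bit IDs."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeSketch(worker_id=1)
first, second = gen.next_id(), gen.next_id()
print(first < second)
```

Because the timestamp occupies the high bits, IDs from any one worker are strictly increasing, and IDs across workers are ordered to within clock skew.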
Development Speed
•Startups live and die by development speed
•Make it easier to ship, but contain the damage
Conclusion
•Fail whale is now an endangered species
•Went from event-driven spikes to pushing continuous reliability improvements where events became trivial
Tweet Spikes Today
• New Tweets per second (TPS) record: 143,199 TPS. Typical day: more than 500 million Tweets sent; average 5,700 TPS. (August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST)
• https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
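The figures on this slide can be sanity-checked with simple arithmetic. The snippet uses only the numbers quoted above; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 500_000_000   # "more than 500 million Tweets" on a typical day
stated_average_tps = 5_700     # average quoted on the slide
record_tps = 143_199           # record quoted on the slide

# Average rate implied by the daily volume.
implied_average_tps = tweets_per_day / SECONDS_PER_DAY

# How far the record spike sits above a typical second.
peak_to_average = record_tps / stated_average_tps

print(f"implied average: {implied_average_tps:.0f} TPS")
print(f"peak-to-average ratio: {peak_to_average:.1f}x")
```

The daily volume works out to roughly 5,800 TPS, consistent with the quoted average, and the record is about 25 times that average.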
Final Thoughts
•Marathon not a sprint. Maintain systems and yourself
•We are hiring to make the system even better
Endangered: Fail Whale
Jonathan Reichhold @jreichhold
Questions?
•https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
•https://blog.twitter.com/2013/observability-at-twitter