Description: Presentation from the GREE Ops team at the AWS re:Invent conference in Las Vegas, 2013.
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Gaming Ops – Running High-Performance Ops for Mobile Gaming
Eduardo Saito – Director, Engineering
Nick Dor – Sr. Director, Engineering
GREE International – Friday, November 15, 2013
Agenda
• Part 1 – Lessons Learned
  – Incident Management
  – Change Management
  – Auto-scale
  – Cloud Optimization Tools and Capacity Planning
• Part 2 – Game Architecture, Analytics & Monetization
  – Game Architecture
  – Moving a live game
  – Analytics & Monetization
  – Cloud Insights
Incident Management

[Diagram: the classic NOC model – monitoring tools feed a NOC, which handles triage, escalation, and communication out to Ops, Dev, and SMEs (network, DBA, …). Critical alerts escalate from the NOC (automated) to Ops and then to Dev; non-critical alerts stay with the NOC.]

The escalation pain in practice: "Application-level issue? Who's the dev of this game? Phone #? I can't find the dev… who's his manager? Oh, the problem is in the backend service – who's the dev for that service?"
Alert Workflow – The DevOps Way

[Diagram: alerts route directly from monitoring to the owning team – Ops, Analytics, or the right Dev team (Game X server, Game Y client/iOS, Service A, Service B).]

Each alert goes directly to the team that can resolve it!
Type | Scope | Checked by | Who to page?
ELB | Load balancer health-check | ELB | No one – email alert only
System-level | CPU / disk / memory / network | Pingdom / Nagios | Ops team
App-level | Application issues / bugs | Pingdom | Dev and Ops teams
Alerts go to the team that can resolve them.
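The routing table above can be sketched as a simple lookup. This is an illustrative Python sketch, not GREE's actual tooling; the type names and the "unknown types default to Ops" rule are assumptions.

```python
# Hypothetical sketch of alert routing: each alert type pages the
# team(s) that can actually resolve it, mirroring the table above.
PAGE_ROUTING = {
    "elb": [],               # ELB health-check: no one paged, email only
    "system": ["ops"],       # cpu/disk/memory/network via Pingdom/Nagios
    "app": ["dev", "ops"],   # application issues/bugs via Pingdom
}

def page_targets(alert_type: str) -> list[str]:
    """Teams to page for an alert; unknown types default to Ops."""
    return PAGE_ROUTING.get(alert_type, ["ops"])

print(page_targets("app"))   # ['dev', 'ops']
```

The point of the sketch: routing is data, not tribal knowledge, so "who's the dev of this game?" never blocks an escalation.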
App-level alerts can be triggered by issues in:
• Server-side
• Client-side
  – iOS
  – Android
Dev and Ops are responsible.
Team | On pager duty
Ops | 8
Dev | 32, from ~20 games (server-side or client-side, Android or iOS developers)
Analytics | 5
Big, Simple Status Dashboard
Big dashboard = quick status
Big dashboard=meta monitoring
The IM bot posts in the game's channel that an alert was triggered.
Use the IM bot for status.
Both Ops and Dev receive the alert and troubleshoot together.
IM bot = collaboration.
The IM bot detects that the issue is resolved and sends the all-clear.
IM bot = transparency.
Review your incidents and alerts
• Monday morning incident review meeting
  – Weekly on-call hand-over
  – Address false positives / fine-tune your monitoring
  – Heads-up for events / major releases
• Problem management
  – Any major or recurrent incident = Problem
  – Problem = requires a post-mortem
  – Remediation items from the post-mortem are also tracked weekly until closure
Incident Management – Lessons Learned
• Use automatic paging/escalation tools
• Make alerts go directly to the right team
• Use a big display dashboard
• Use IM bots to communicate outages
• Do weekly reviews of incidents and alerts
• Do post-mortems; follow up on remediation items
Change Management

Type | Content | Owner | Tool
Configuration management | 3rd-party packages and configuration | Ops | Puppet
Release – code deploy | 1st-party code | Dev | Jenkins + in-house scripts
Release – asset deploy | 1st-party images / new game content / new missions | Dev | Jenkins + in-house scripts
Configuration Management

[Workflow: Ops make changes and test locally → peer review via pull request → syntax validation (if not good, back to Ops) → changes are pulled to the prod Puppet master → Puppet clients (prod servers) pull the changes.]
Configuration Management Benefits
• Automates and speeds up deployment
• Repeatable
• Declarative modules/manifests = documentation
• All prod changes are:
  – peer-reviewed via pull requests in Git
  – validated by puppet-lint
  – locally tested via Vagrant (every component has a Vagrant VM)
  – communicated through email and IM
Release Management – Code Deploy

[Diagram: Jenkins pushes code from the deploy host through dev → QA → beta → prod, with artifacts stored in S3. Prod deploys are announced in the project's Ops IM channel; dev/QA deploys in the project's QA/dev channel.]
Release Management – Asset Deploy

[Workflow: Dev kicks off a new asset deploy job → code review → run validation. If validation warns, Ops approval is required; on an Ops override (or a clean validation) the assets are deployed to prod.]
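The gate above can be sketched as follows. This is a hypothetical sketch; the function name, return values, and override flag are all invented for illustration.

```python
# Illustrative sketch of the asset-deploy gate: validation warnings
# block the deploy unless Ops explicitly overrides.
def asset_deploy(validate, ops_override: bool = False) -> str:
    """Return 'deployed' or 'blocked' following the workflow above."""
    warnings = validate()                  # run validation on the assets
    if warnings and not ops_override:      # warns? -> needs Ops approval
        return "blocked"
    return "deployed"                      # clean run, or Ops overrode

print(asset_deploy(lambda: []))                            # deployed
print(asset_deploy(lambda: ["missing asset"]))             # blocked
print(asset_deploy(lambda: ["missing asset"], True))       # deployed
```

The design point is that Dev self-serves the common case, while a human (Ops) is only in the loop when the automated check is unhappy.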
Change Management – Lessons Learned
• Changes are made directly by the team responsible for that code
  – 3rd-party code is configuration management = owned by Ops
  – 1st-party code is release management = owned by Dev
• Changes are made through tools
  – Configuration management through Puppet
  – Release management through Jenkins + an internal tool
• No change is done manually
• All changes are communicated and tracked
Auto-scale Use Cases
• On-demand – for daily traffic fluctuations and organic growth
• Scheduled – for in-game events
Auto-scale – On-demand and Scheduled

[Chart on the following slides: CPU, # instances in the ELB, and # auto-scale instances over time.]
Scheduled Auto-scale
1 – Scheduled pre-provisioning config enabled
Scheduled action (cron recurrence "00 17 * * 5" = every Friday at 17:00):
as-put-scheduled-update-group-action ccios-app-ScheduledUpFriday \
  --auto-scaling-group ccios-app-asg --recurrence "00 17 * * 5" --min-size 16
Scheduled Auto-scale
2 – Spare capacity in place, ready for the event
Scheduled Auto-scale
3 – Event starts, 4x spike
On-demand Auto-scale
4 – On-demand auto-scale reacts to CPU above 60% and adds more servers
On-demand policy:
as-put-scaling-policy ccios-app-ScaleUpPolicy60 \
  --auto-scaling-group ccios-app-asg --adjustment=8 --type ChangeInCapacity
On-demand Auto-scale
5 – Scheduled pre-provisioning config is removed
Scheduled action:
as-put-scheduled-update-group-action ccios-app-ScheduledDownFriday \
  --auto-scaling-group ccios-app-asg --recurrence "0 21 * * 5" --min-size 2
On-demand Auto-scale
6 – On-demand auto-scale terminates some instances as CPU drops below 40%
On-demand policy:
as-put-scaling-policy ccios-app-ScaleDownPolicy40 \
  --auto-scaling-group ccios-app-asg --adjustment=-2 --type ChangeInCapacity
Auto-scale Bootstrap Workflow

Event | Description | Duration
CloudWatch alarm is triggered | E.g. CPU > 60% for 5 minutes | 5 minutes
Auto-scale policy is executed | Launches n new instances | 2 minutes
User-data script is executed | Defined in the auto-scale launch config; installs base packages, gets instance_id, IP, and hostgroup | 1 minute
Bootstrap script is executed | Loaded from S3; renames the host, runs Puppet, deploys code, starts the web service | 11 minutes
Health-check passes and servers start to get traffic | Health-check must pass before the ELB sends traffic to the new host | 1 minute

Total: approximately 20 minutes from alarm to serving traffic.
Auto-scale External Dependencies

Dependency | How to resolve
Configuration management (Puppet/Chef) | Pre-load all necessary packages in the AMI / HA architecture for config management
External repo | Pre-load all necessary packages in the AMI / set up an internal HA repo
Code deploy | Same as above, or put it in S3
Monitoring registration | Make it asynchronous
Server registration | Make it asynchronous
Auto-scale – Lessons Learned
• Reduce the time to spin up new instances:
  – Pre-install all base packages into the AMI
• Address these risks:
  – On-demand and scheduled auto-scale conflicts
  – Bootstrap validation and graceful termination
  – Health-checks: keep them simple
  – Keep some servers out of the auto-scale pool, just in case
  – Map and resolve/monitor external dependencies of auto-scale
  – Consider using two different thresholds, for quicker ramp-up
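The two-threshold policy from the earlier slides (scale up by 8 above 60% CPU, down by 2 below 40%) can be sketched as pure logic. The thresholds and adjustments match the example as-put-scaling-policy commands; the min/max bounds are assumptions for illustration.

```python
# Sketch of the on-demand policy pair with a hysteresis band between
# the two thresholds (40%-60%) where capacity is left alone.
def desired_capacity(current: int, cpu: float,
                     min_size: int = 2, max_size: int = 64) -> int:
    if cpu > 60:                       # ScaleUpPolicy60: +8 instances
        return min(current + 8, max_size)
    if cpu < 40:                       # ScaleDownPolicy40: -2 instances
        return max(current - 2, min_size)
    return current                     # between thresholds: no change

print(desired_capacity(16, 75))   # event spike -> 24
print(desired_capacity(16, 50))   # steady state -> 16
print(desired_capacity(3, 20))    # scale down, floored at min_size -> 2
```

The asymmetry (add 8, remove 2) is the "quicker ramp-up" lesson: grow aggressively for a traffic spike, shrink conservatively afterwards.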
Cloud Optimization Areas

• Usage: under-utilized hosts, overloaded hosts, EBS/ELB not in use
• Security: exposed DBs, EC2 behind an ELB exposed directly
• Availability: AZ/region distribution, backup audit, unhealthy instances in the ELB, ELB misconfigs
• Cost: optimal # of Reserved Instances, hosts outside RIs, cost break-down using tags, estimated on-demand costs
Cloud Optimization Tools
• AWS Trusted Advisor
• 3rd-party commercial tools
• Open-source tools (e.g. Netflix Ice)
• In-house tools
• Excel!
Cloud Optimization – Lessons Learned
• Try Trusted Advisor
• Pilot 3rd-party solutions
• Evaluate which metrics matter for each component of your architecture
• Build in-house tooling for the optimizations you need that are not covered by Trusted Advisor or 3rd-party solutions
• Tag all assets! Automate tagging!
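Once every asset is tagged, a per-project cost break-down reduces to a group-by. A minimal sketch, assuming a "project" tag; the instance records and costs below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: tagged instances with their monthly cost.
instances = [
    {"id": "i-1", "tags": {"project": "game-x"}, "monthly_cost": 350.0},
    {"id": "i-2", "tags": {"project": "game-x"}, "monthly_cost": 350.0},
    {"id": "i-3", "tags": {"project": "analytics"}, "monthly_cost": 900.0},
    {"id": "i-4", "tags": {}, "monthly_cost": 120.0},  # untagged orphan?
]

def cost_by_tag(instances, key: str = "project") -> dict:
    """Sum monthly cost per tag value; untagged assets stand out."""
    totals = defaultdict(float)
    for inst in instances:
        totals[inst["tags"].get(key, "UNTAGGED")] += inst["monthly_cost"]
    return dict(totals)

print(cost_by_tag(instances))
```

Surfacing an explicit "UNTAGGED" bucket doubles as the orphaned-server audit mentioned later in the deck.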
GREE Games
• All mobile, all free-to-play
  – iOS & Android smartphones
  – Big focus on tablets
• Role-Playing Games (RPG)
  – Multi-million-dollar franchises, top-grossing titles
  – Some of the oldest games on the App Store
• Hardcore
  – Deeper, more intense gameplay mechanics
• Real-Time Strategy (RTS)
  – Fast action, small-unit management
• Casino & casual games
  – Familiar games, wider audience, casual play
Example Game Architecture – RPG
• Application servers
  – PHP; game events → Analytics
• Cache layer
  – Memcached → ElastiCache
• Batch processing servers
  – Node.js (moving to Go)
  – Batches database writes
• Database
  – MySQL → RDS

[Diagram: ELB → App servers → Cache → Batch processors → RDS, with a failover DB.]
Caching Strategy – Current
• Game architecture predates stable NoSQL
  – We wanted similar performance at scale
  – Keep combined average internal response times below 300 ms
• Memcache-authoritative
  – Still use an RDBMS; potential data loss is limited
• Allows for a cheaper/simpler DB layer
  – Always do full row replacements (i.e. no current_row_value + 1)
Data Flow
• Reads
  – ELB → App → Cache
• Writes (synchronous)
  – ELB → App → Cache → DB
  – ELB → App → Cache → Batch → DB
  – Standard write-through
  – No blind writes; always fetch the current version
• Writes (asynchronous)
  – Batch → DB
  – Batch writes to the DB every 30 seconds
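The versioned write-through path can be sketched as follows. A minimal sketch under stated assumptions: plain dicts stand in for memcached and MySQL, and the key/field names are invented.

```python
# Sketch of memcache-authoritative, versioned write-through:
# always fetch the current version first (no blind writes), and
# replace the full row rather than mutating values in place.
cache, db = {}, {}   # stand-ins for memcached and the RDBMS

def write_through(key: str, row: dict) -> int:
    current = cache.get(key)                       # fetch current version
    version = (current["version"] + 1) if current else 1
    new_row = dict(row, version=version)           # full row replacement
    cache[key] = new_row                           # cache is authoritative
    db[key] = new_row                              # synchronous DB write
    return version

write_through("player:42", {"gold": 100})
v = write_through("player:42", {"gold": 150})
print(v, cache["player:42"]["gold"])
```

Full-row replacement is what keeps the DB layer cheap: the database never has to compute `current_row_value + 1`, it just stores whatever the cache says is current.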
Batch Processor
• 80% of game write traffic is async
  – Each write is versioned
• Example: player items (loot) after multiple quests
  – 10 items in 30 seconds; the app server sends 10 writes downstream
  – The batch processor sends the last record, with the final item count, to the DB
• Greatly reduced writes on the DB
  – Shard at the table and DB-server level for larger games
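The coalescing step in the loot example above can be sketched like this: at each 30-second flush, keep only the newest versioned row per key. Field names are assumptions for illustration.

```python
# Sketch of the batch processor's coalescing: ten versioned loot
# updates in one window collapse into a single DB write carrying
# the final item count.
def coalesce(writes):
    """Keep the highest-version row per key."""
    latest = {}
    for w in writes:
        key = w["key"]
        if key not in latest or w["version"] > latest[key]["version"]:
            latest[key] = w
    return list(latest.values())

queue = [{"key": "player:42:items", "version": v, "count": v}
         for v in range(1, 11)]          # 10 loot updates in 30 seconds
flushed = coalesce(queue)
print(len(flushed), flushed[0]["count"])
```

This is why versioning every write matters: without it, out-of-order delivery could make an older item count overwrite the final one.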
Near-future Trends for GREE Ops
• Multi-region games
  – Latency-sensitive games and the shift towards real-time
  – Geographic data replication challenges
• Continuous delivery
• Automation of game-studio tasks
  – Game design, art, data/asset deploys
  – Tighter event pre-provisioning and scale-down
More Performance – Lower Costs
• Facebook HipHop Virtual Machine
  – JIT compilation & execution of PHP
  – 5x faster vs. Zend PHP 5.2
  – Achieved a 3x to 4x reduction in application server count
  – https://github.com/facebook/hhvm
• Google Go
  – Used for high-concurrency applications
  – Achieved a 2x reduction in batch processing servers vs. Node.js
  – http://golang.org
Moving a Game – Why?
• Physical datacenter to AWS
  – West coast → east coast
  – Faster access to EU markets & players
• Reduce the attention infrastructure requires
  – Caching & DB layer; custom high-availability middleware
• Take advantage of cloud provisioning
  – Scripted instance spin-ups, auto-scaling for events/load
• Save money
  – Reduce the stand-by server pool
  – Provision for average load, not peak
Moving a Live Game – Whaaaaat?
• Live game, two platforms (iOS, Android)
  – Several million dollars in combined monthly revenue
  – More than one million unique players/month
• ~30 GB dataset
• Minimal downtime (< 5 minutes)
  – Mostly to allow for the change to the reverse-proxy config
• Debian → CentOS
• Physical machines → AWS
Moving a Live Game – How
• Develop a timeline
• R&D & architecture review
• Data migration & sync
• Game server/client updates
• Load testing
• D-Day steps & checklist
Moving a Live Game – Timeline
• 3 months overall
• DB dataset transfer validation
  – Set up direct MySQL-to-RDS replication
  – Initial DB transfer time: approx. 8 hours
• Functional & performance testing
  – Load & capacity profile for application and DB servers
  – Heavy use of APM metrics – New Relic
Moving a Live Game – Architecture
• Changes required
  – Caching: discrete memcached nodes to ElastiCache
  – Database: physical MySQL DB servers to RDS
  – Load balancers: LVS to ELB
• Decided to drop the internally developed MySQL proxy
  – Bittersweet: great automatic failover, but limited internal knowledge
  – RDS failover mechanics added to possible game downtime
• Processes
  – Code & asset deployment
Moving a Live Game – D-Day
• Put the game into maintenance (shutdown)
• Break DB replication (west → east)
• Set up a reverse proxy in the datacenter
  – Forward traffic from west → east AWS ELB
• Bring the game back online
  – The reverse proxy sends traffic to AWS
• Update DNS to point to the ELB
  – Wait for DNS propagation
  – Slow DNS updates still hit the reverse proxy in the datacenter
Moving a Live Game – Before & After
Analytics & Monetization
• Specialize in "live events"
  – Higher player engagement (fun!) = more revenue
• Single-player events
  – "Epic Boss" – limited-time quests
• Player-organization events
  – Guild vs. guild battles (World Domination, Syndicate Wars)
  – Raid bosses – members help to take down a tough NPC
  – Tap into social "meta-gaming"
[Chart: Modern War World Domination results (August 2013).]
Analytics for Player Engagement
• Player retention
  – 1st week and beyond
  – Tutorial completion rates
• Balancing mechanics
  – Player vs. Environment (PvE), Player vs. Player (PvP)
  – Encourage interaction with other players
• When too much good can be bad
  – Analytics needs to be paired with player feedback
  – Fun for all players, payers AND non-payers
Analytics for Decision-making
• Devices & markets
  – Understand the most popular devices (esp. Android)
  – Focus efforts on the top devices for your market
• Launching a game
  – "Soft launch" – launch only in certain markets, tune the game
  – "Hard launch" – money down (marketing), marquee live events
• When to sunset & decommission
  – Depends on strategic goals, infra/engineering costs, etc.
Analytics – Some Scale
• Over 5,000 transactions/sec sent to Analytics
• Several billion game events per day
  – Attacking, winning, losing, buying, clicking, swiping, etc.
• Anticipating a 10x increase in the next two years
• Building petabyte-scale data warehouse capacity
Analytics Pipeline
• Working towards a "zero-latency" pipeline
  – Latency = ETL, summarization, reporting & dashboards
  – Already reduced from 24 hours to 1 hour in the last year
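The direction described above (cutting summarization latency from a daily batch towards zero) can be illustrated with incremental aggregation: summarize each event as it arrives instead of re-running ETL over the day's data. A toy sketch; the event shape is an assumption.

```python
from collections import Counter

# Illustrative: a running summary updated per event means the
# dashboard number is current the moment the event lands, with no
# batch-ETL delay.
summary = Counter()

def ingest(event: dict) -> None:
    summary[event["type"]] += 1    # incremental count, no ETL window

for e in [{"type": "attack"}, {"type": "attack"}, {"type": "purchase"}]:
    ingest(e)
print(summary["attack"], summary["purchase"])
```

At billions of events per day the real pipeline needs distributed stream processing, but the contract is the same: summaries are maintained incrementally rather than recomputed in batches.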
Cloud Insights
• Agility (time to deliver)
• Elasticity – scale up/down quickly
  – Auto Scaling is critical
• Service simplification (RDS / ElastiCache / ELB)
• Professional development for the Ops team
  – Physical (datacenter/network focus) vs. virtual (DevOps focus)
Cloud Insights – Lessons Learned
• Reliability & performance consistency varies
• Stuff breaks often
  – Develop an "anti-fragile" mindset; build to anticipate failure
• Cost predictability is still elusive
• Orphaned servers
  – Easy to create; must constantly clean up
• Large-scale monitoring is hard
  – No silver bullet yet
Thank You
• Thanks to the GREE OPS & Engineering Teams!
eduardo.saito@gree.net
nick.dor@gree.net
• We’re Hiring DevOps Team Members!!
http://gree-corp.com/jobs
MBL303