AWS re:Invent 2013 - MBL303 Gaming Ops - Running High-performance Ops for Mobile Gaming


DESCRIPTION

Presentation from the GREE Ops team at the AWS re:Invent conference in Las Vegas, 2013


© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Gaming Ops: Running High-Performance Ops for Mobile Gaming

Eduardo Saito – Director, Engineering

Nick Dor – Sr. Director, Engineering

GREE International Friday, November 15, 2013

Agenda

• Part 1 – Lessons Learned
  – Incident Management
  – Change Management
  – Auto-scale
  – Cloud Optimization Tools and Capacity Planning

• Part 2 – Game Architecture, Analytics & Monetization
  – Game Architecture
  – Moving a live game
  – Analytics & Monetization
  – Cloud Insights

Incident Management

[Diagram: monitoring tools feed a central NOC, which handles triage, escalation, and communication out to Ops, Dev, and SMEs (Network, DBA, ...)]

[Diagram: an automated NOC and other monitoring tools route critical alerts to Ops and Dev; non-critical alerts are handled separately]

The escalation pain this causes: "Application-level issue? Who's the dev of this game? Phone number? I can't find the dev... who's their manager? Oh, the problem is in the backend service; who's the dev for that service?"

Alert Workflow – DevOps way

[Diagram: alerts route directly to the owning team: Ops; Dev, Game X, Server; Dev, Game Y, Client/iOS; Dev, Service A; Dev, Service B; Analytics]

Each alert goes directly to the right team that can resolve it!

Type         | Scope                          | Checked by       | Who to page?
ELB          | Load balancer health check     | ELB              | No one – email alert only
System-level | CPU / disk / memory / network  | Pingdom / Nagios | Ops team
App-level    | Application issues / bugs      | Pingdom          | Dev and Ops teams

Alerts go to the people who can resolve them (see the routing sketch below).
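A minimal sketch of this routing rule, mirroring the table above; team names and the paging/email helpers are hypothetical placeholders, not GREE's actual tooling:

    # Route each alert to the team that can resolve it (sketch only).
    ROUTING = {
        "elb":    {"page": [],             "email": ["ops@example.com"]},  # email alert only
        "system": {"page": ["ops"],        "email": []},                   # Pingdom / Nagios checks
        "app":    {"page": ["ops", "dev"], "email": []},                   # application issues / bugs
    }

    def page_team(team, message):
        print(f"PAGE {team}: {message}")       # stand-in for a PagerDuty-style escalation

    def send_email(addr, message):
        print(f"EMAIL {addr}: {message}")      # stand-in for an email notification

    def route_alert(alert_type, message):
        target = ROUTING.get(alert_type, {"page": ["ops"], "email": []})  # unknown types default to Ops
        for team in target["page"]:
            page_team(team, message)
        for addr in target["email"]:
            send_email(addr, message)

    route_alert("app", "game-x: 5xx rate above threshold")  # pages both Ops and Dev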


App-level alerts can be triggered by issues in:

• Server-side
• Client-side
  – iOS
  – Android

Dev and Ops are responsible.

Team      | In pager duty
Ops       | 8
Dev       | 32, from ~20 games (server-side or client-side, Android or iOS developers)
Analytics | 5

Big, Simple Status Dashboard

Big dashboard = quick status

Big dashboard = meta-monitoring

IM Bot posts in the game's channel that an alert was triggered

Use IM Bot for status

Both Ops and Dev receive the alert, troubleshoot

IM Bot = collaboration

IM Bot detects the issue is resolved and sends the all-clear

IM Bot = transparency
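A rough sketch of the IM-bot behavior described above, posting the alert and the all-clear into the game's channel; the webhook URL and payload format are assumptions, adjust for your chat system:

    import json, urllib.request

    WEBHOOK = "https://chat.example.com/hooks/game-x-ops"   # hypothetical channel webhook

    def post_to_channel(text):
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def on_alert(check, status):
        if status == "CRITICAL":
            post_to_channel(f"ALERT: {check} is CRITICAL - Ops and Dev paged")
        elif status == "OK":
            post_to_channel(f"ALL CLEAR: {check} has recovered")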

Review your incidents and alerts

• Monday morning incident review meeting
  – Weekly on-call hand-over
  – Address false positives / fine-tune your monitoring
  – Heads-up for events / major releases

• Problem management
  – Any major or recurrent incident = Problem
  – Problem = requires a post-mortem
  – Remediation items from the post-mortem also tracked weekly till closure

Incident Management – Lessons Learned

• Use automatic paging/escalation tools
• Make the alerts go to the right team directly
• Use a big display dashboard
• Use IM bots to communicate outages
• Do weekly reviews of incidents / alerts
• Do post-mortems and follow up on remediation items


Change Management

Type                     | Content                                             | Owner | Tool
Configuration Management | 3rd-party packages and configuration                | Ops   | Puppet
Release – code deploy    | 1st-party code                                      | Dev   | Jenkins + in-house scripts
Release – asset deploy   | 1st-party images / new game content / new missions | Dev   | Jenkins + in-house scripts

Configuration Management

[Workflow: Ops make changes and test locally → peer review → syntax validation (if not good, back to rework) → changes are pulled to the prod Puppet master → Puppet clients (prod servers) pull the changes]

Configuration Management Benefits

• Automate and speed up deployment
• Repeatable
• Declarative modules/manifests = documentation
• All prod changes are:
  – peer-reviewed via pull requests in Git
  – validated by puppet-lint (see the sketch below)
  – locally tested via Vagrant (every component has a Vagrant VM)
  – communicated through email and IM
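A rough sketch of the pre-merge validation step (puppet-lint plus a syntax check); the file list and CI wiring are illustrative, not the actual GREE pipeline:

    import subprocess, sys

    changed_manifests = ["modules/nginx/manifests/init.pp"]   # e.g. from `git diff --name-only`

    failed = False
    for manifest in changed_manifests:
        for cmd in (["puppet-lint", manifest], ["puppet", "parser", "validate", manifest]):
            result = subprocess.run(cmd)
            if result.returncode != 0:
                print(f"validation failed: {' '.join(cmd)}")
                failed = True

    sys.exit(1 if failed else 0)   # a non-zero exit blocks the merge in CI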


Release Management – Code deploy

[Diagram: a deploy host pushes builds from dev to QA, Beta, and Prod; build artifacts are stored in S3. The IM bot announces deploys: QA/dev deploys in the project's QA/dev channel, Prod deploys in the project's Ops channel.]
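The in-house deploy scripts are not public; this is only a sketch of the flow in the diagram, with hypothetical bucket, key, and helper names: fetch the build artifact from S3, roll it out, and announce the deploy in the right IM channel:

    import boto3

    def unpack_and_restart(artifact, env):
        print(f"unpacking {artifact} and restarting app servers in {env}")  # placeholder

    def post_to_channel(channel, text):
        print(f"[{channel}] {text}")                                        # placeholder IM-bot call

    def deploy(env, build_id):
        s3 = boto3.client("s3")
        artifact = f"/tmp/{build_id}.tar.gz"
        s3.download_file("example-build-artifacts", f"game-x/{build_id}.tar.gz", artifact)
        unpack_and_restart(artifact, env)
        channel = "game-x-ops" if env == "prod" else "game-x-qa"   # prod deploys go to the Ops channel
        post_to_channel(channel, f"Deployed build {build_id} to {env}")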


Release Management – Asset deploy

[Flow: Dev kicks off a new asset deploy job → code review → automated validation runs → if validation warns, Ops approval is required to override → deploy to prod]
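A toy sketch of the approval gate above: the deploy proceeds automatically only when validation is clean, and warnings require an explicit Ops override; the validate/deploy helpers are placeholders:

    def validate_assets(asset_bundle):
        return []                                    # returns a list of warnings (placeholder)

    def deploy_assets(asset_bundle):
        print(f"deploying {asset_bundle} to prod")   # placeholder

    def asset_deploy(asset_bundle, ops_override=False):
        warnings = validate_assets(asset_bundle)
        if warnings and not ops_override:
            print("validation warnings, waiting for Ops approval:", warnings)
            return
        deploy_assets(asset_bundle)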

Change Management Lessons Learned

• Changes are made directly by the team that is responsible for that code
  – 3rd-party code is configuration management = owned by Ops
  – 1st-party code is release management = owned by Dev

• Changes are made through tools
  – Configuration management through Puppet
  – Release management through Jenkins + internal tool

• No change is done manually
• All changes are communicated and tracked


Auto-scale use-cases

• On-demand – for the daily traffic fluctuations and organic growth
• Scheduled – for in-game events

Auto-scale on-demand and scheduled

[Chart: CPU, # instances in ELB, and # auto-scale instances over time]

Scheduled Auto-scale

1- Scheduled pre-provisioning config enabled


Scheduled action:

as-put-scheduled-update-group-action ccios-app-ScheduledUpFriday \
  --auto-scaling-group ccios-app-asg \
  --recurrence "00 17 * * 5" \
  --min-size 16
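The same scheduled pre-provisioning expressed with a current SDK (boto3) rather than the legacy as- CLI shown above; group and action names mirror the example:

    import boto3

    autoscaling = boto3.client("autoscaling")
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="ccios-app-asg",
        ScheduledActionName="ccios-app-ScheduledUpFriday",
        Recurrence="00 17 * * 5",    # every Friday at 17:00, ahead of the event
        MinSize=16,                  # raise the floor so spare capacity is in place early
    )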

Scheduled Auto-scale

2 - Spare capacity in place, ready for event


Scheduled Auto-scale

3 - Event starts, 4x spike



On-demand Auto-scale

4 – On-demand auto-scale reacts to CPU above 60% and adds more servers


On-demand policy:

as-put-scaling-policy ccios-app-ScaleUpPolicy60 \
  --auto-scaling-group ccios-app-asg \
  --adjustment=8 --type ChangeInCapacity
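For completeness, the boto3 equivalent of this scale-up policy plus the CloudWatch alarm that triggers it (CPU above 60% for 5 minutes, per the bootstrap table later in this section); names mirror the example:

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="ccios-app-asg",
        PolicyName="ccios-app-ScaleUpPolicy60",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=8,                        # add 8 instances per trigger
    )

    cloudwatch.put_metric_alarm(
        AlarmName="ccios-app-cpu-high",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "ccios-app-asg"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],         # fire the scale-up policy
    )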

On-demand Auto-scale

5 - Scheduled pre-provisioning config is removed


Scheduled action:

as-put-scheduled-update-group-action ccios-app-ScheduledDownFriday \
  --auto-scaling-group ccios-app-asg \
  --recurrence "0 21 * * 5" \
  --min-size 2

On-demand Auto-scale

6 – On-demand auto-scale terminates some instances as CPU drops below 40%


On-demand policy:

as-put-scaling-policy ccios-app-ScaleDownPolicy40 \
  --auto-scaling-group ccios-app-asg \
  --adjustment=-2 --type ChangeInCapacity

Auto-scale bootstrap workflow

Event                                      | Description                                                                                           | Duration
CloudWatch alarm is triggered              | e.g. CPU > 60% for 5 minutes                                                                          | 5 minutes
Auto-scale policy is executed              | Launches n new instances                                                                              | 2 minutes
User-data script is executed               | Defined in the auto-scale launch config; installs base packages, gets instance_id, IP, and hostgroup | 1 minute
Bootstrap script is executed               | Loaded from S3; renames host, runs Puppet, deploys code, starts the web service                      | 11 minutes
Health check passes, server gets traffic   | Health check must pass before the ELB starts sending traffic to the new host                         | 1 minute
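A sketch of how the user-data step above can be wired into the launch configuration, assuming the real bootstrap script lives in S3; the AMI ID, bucket, and script body are illustrative:

    import boto3

    USER_DATA = """#!/bin/bash
    # runs once at first boot on every auto-scaled instance
    aws s3 cp s3://example-bootstrap/bootstrap.sh /tmp/bootstrap.sh
    bash /tmp/bootstrap.sh   # renames host, runs Puppet, deploys code, starts the web service
    """

    boto3.client("autoscaling").create_launch_configuration(
        LaunchConfigurationName="ccios-app-lc",
        ImageId="ami-12345678",        # AMI pre-loaded with base packages to cut spin-up time
        InstanceType="c1.xlarge",      # example instance type
        UserData=USER_DATA,
    )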

Auto-scale external dependencies

Dependency                             | How to resolve
Configuration Management (Puppet/Chef) | Pre-load all necessary packages in the AMI / HA architecture for config management
External repo                          | Pre-load all necessary packages in the AMI / set up an internal HA repo
Code deploy                            | Same as above, or put it in S3
Monitoring registration                | Make it asynchronous
Server registration                    | Make it asynchronous

Auto-scale Lessons Learned

• Reduce time to spin up new instances:
  – Pre-install all base packages into the AMI

• Address these risks:
  – On-demand and scheduled auto-scale conflicts
  – Bootstrap validation and graceful termination
  – Health checks: keep them simple
  – Keep some servers out of the auto-scale pool, just in case
  – Map and resolve/monitor external dependencies for auto-scale
  – Consider using two different thresholds for quicker ramp-up


Cloud Optimization areas

[Diagram: four optimization areas with example checks]
• Cost – optimal # of RIs, hosts outside RIs, cost break-down using tags, estimated on-demand costs
• Usage – under-utilized hosts, overloaded hosts, EBS/ELB not in use
• Availability – AZ / region distribution, backup audit, unhealthy instances in ELB, ELB misconfigs
• Security – exposed DBs, EC2 instances behind an ELB that are also exposed directly

Cloud Optimization tools

• AWS Trusted Advisor
• 3rd-party commercial tools
• Open source tools (e.g. Netflix Ice)
• In-house tools
• Excel!

Cloud Optimization Lessons Learned

• Try Trusted Advisor

• Pilot 3rd-party solutions

• Evaluate what metrics are important for each component of your architecture

• Do in-house development for any optimizations you need that are not covered by Trusted Advisor or 3rd-party solutions

• Tag all assets! Automate tagging!
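One possible way to automate tagging so that per-tag cost break-downs work; a sketch only, with example tag keys and values:

    import boto3

    ec2 = boto3.client("ec2")

    # Find instances missing a "project" tag and give them a default one.
    untagged = []
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if "project" not in tags:
                untagged.append(instance["InstanceId"])

    if untagged:
        ec2.create_tags(Resources=untagged, Tags=[{"Key": "project", "Value": "unassigned"}])
        print("tagged:", untagged)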


GREE Games

• All Mobile, all Free-to-Play
  – iOS & Android smartphones
  – Big focus on tablets

• Role-Playing Games (RPG)
  – Multi-million dollar franchise, top-grossing titles
  – Some of the oldest games on the App Store

• Hardcore
  – Deeper, more intense gameplay mechanics

• Real-Time Strategy (RTS)
  – Fast action, small unit management

• Casino & Casual Games
  – Familiar games, wider audience, casual play

Example Game Architecture – RPG

• Application Servers
  – PHP
  – Game events sent to Analytics

• Cache Layer
  – Memcached (ElastiCache)

• Batch Processing Servers
  – Node.js (moving to Go)
  – Batches database writes

• Database
  – MySQL (RDS)

[Diagram: ELB in front of App servers; a Memcached/ElastiCache tier; Batch processing servers; and MySQL RDS databases with a failover instance]

Caching Strategy - Current

• Game architecture predates stable NoSQL
  – We wanted similar performance at scale
  – Keep combined average internal response times below 300 ms

• Memcache authoritative
  – Still use an RDBMS; potential data loss is limited

• Allows for a cheaper/simpler DB layer
  – Always do full-row replacements (i.e. no current_row_value + 1)

Data Flow

• Reads
  – ELB → App → Cache

• Writes (synchronous)
  – ELB → App → Cache → DB
  – ELB → App → Cache → Batch → DB
  – Standard write-through (see the sketch below)
  – No blind writes; always fetch the current version

• Writes (asynchronous)
  – Batch → DB
  – Batch writes to the DB every 30 seconds
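A toy sketch of the synchronous write path (cache authoritative, versioned full-row replacement, no blind writes); plain dicts stand in for memcached and RDS:

    CACHE = {}   # stand-in for memcached / ElastiCache
    DB = {}      # stand-in for MySQL RDS

    def read(key):
        return CACHE.get(key) or DB.get(key)      # cache first, fall back to the DB

    def write(key, new_fields):
        current = read(key) or {"version": 0}     # never write blind: fetch the current version
        row = {**current, **new_fields, "version": current["version"] + 1}  # full-row replacement
        CACHE[key] = row                          # cache updated synchronously
        DB[key] = row                             # write-through to the DB (or handed to Batch)
        return row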


Batch Processor

• 80% of game write traffic is async
  – Each write is versioned

• Example: player items (loot) after multiple quests
  – 10 items in 30 sec; the app server sends 10 writes downstream
  – The batch processor sends only the last record, with the final item count, to the DB

• Greatly reduced writes on the DB (coalescing sketch below)
  – Shard at the table and DB-server level for larger games
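A toy version of the coalescing idea: keep only the newest versioned row per key and flush on a timer, so ten loot updates in 30 seconds become one DB write; this is illustrative, not the Node.js/Go implementation:

    pending = {}    # key -> newest versioned row seen so far

    def enqueue(key, row):
        current = pending.get(key)
        if current is None or row["version"] > current["version"]:
            pending[key] = row        # last (highest-version) record wins

    def flush(write_to_db):
        for key, row in pending.items():
            write_to_db(key, row)     # one DB write per key per flush interval
        pending.clear()

    # e.g. called every 30 seconds:
    # flush(lambda k, r: print("DB write", k, r))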

Near Future Trends for GREE OPS

• Multi-region games
  – Latency-sensitive games and the shift towards real-time
  – Geographic data replication challenges

• Continuous Delivery

• Automation of Game Studio tasks
  – Game design, art, data/asset deploy
  – Tighter event pre-provisioning and scale-down

More Performance – Lower Costs

• Facebook HipHop Virtual Machine
  – JIT compilation & execution of PHP
  – 5x faster vs. Zend PHP 5.2
  – Achieved a 3x to 4x reduction in application server count
  – https://github.com/facebook/hhvm

• Google Go
  – Used for high-concurrency applications
  – Achieved a 2x reduction in batch processing servers vs. Node.js
  – http://golang.org


Moving a Game – Why?

• Physical datacenter to AWS
  – West coast → East coast
  – Faster access to EU markets & players

• Reduce necessary attention to infrastructure
  – Caching & DB layer; custom high-availability middleware

• Take advantage of cloud provisioning
  – Scripted instance spin-ups, auto-scaling for events/load

• Save money
  – Reduce the stand-by server pool
  – Provision for average load, not peak

Moving a Live Game – Whaaaaat?

• Live game, two platforms (iOS, Android)
  – Several million $$$ in combined monthly revenue
  – More than one million unique players/month

• ~30 GB dataset

• Minimal downtime (< 5 minutes)
  – Mostly to allow for the change to the reverse proxy config

• Debian → CentOS
• Physical machines → AWS

Moving a Live Game - How

• Develop timeline
• R&D & architecture review
• Data migration & sync
• Game server/client updates
• Load testing
• D-Day steps & checklist

Moving a Live Game - Timeline

• 3 months overall

• DB dataset transfer validation
  – Set up direct MySQL-to-RDS replication
  – Initial DB transfer time: approx. 8 hours

• Functional & performance testing
  – Load & capacity profile for application and DB servers
  – Heavy use of APM metrics (New Relic)

Moving a Live Game - Architecture

• Changes required
  – Caching: discrete memcached hosts to ElastiCache nodes
  – Database: physical MySQL DB servers to RDS

• Decided to drop the internally developed MySQL proxy
  – Bittersweet: great automatic failover; limited internal knowledge

• RDS failover mechanics added to possible game downtime

• Load balancers
  – LVS to ELB

• Processes
  – Code & asset deployment

Moving a Live Game – D-Day

• Put the game into maintenance (shutdown)
• Break DB replication (west → east)
• Set up a reverse proxy in the datacenter
  – Forward traffic from the west-coast datacenter → east-coast AWS ELB

• Bring the game back online
  – Reverse proxy sends traffic to AWS

• Update DNS to point to the ELB
  – Wait for DNS propagation
  – Slow DNS updates still hit the reverse proxy in the datacenter

Moving a Live Game – Before & After


Analytics & Monetization

• Specialize in "Live Events"
  – Higher player engagement (fun!) = more revenue

• Single-player events
  – "Epic Boss"
  – Limited-time quests

• Player organization events
  – Guild vs. Guild battles (World Domination, Syndicate Wars)
  – Raid Bosses: members help to take down a tough NPC
  – Tap into social "meta-gaming"

Modern War World Domination Results (August 2013)

Analytics for Player Engagement

• Player retention
  – 1st week and beyond
  – Tutorial completion rates

• Balancing mechanics
  – Player vs. Environment (PvE), Player vs. Player (PvP)
  – Encourage interaction with other players

• When too much good can be bad
  – Analytics needs to be paired with player feedback
  – Fun for all players, payers AND non-payers

Analytics for Decision-making

• Devices & Markets
  – Understand the most popular devices (esp. Android)
  – Focus efforts on the top devices for your market

• Launching a game
  – "Soft launch": launch only in certain markets, tune the game
  – "Hard launch": money down (marketing), marquee live events

• When to sunset & decommission
  – Depends on strategic goals, infra/engineering costs, etc.

Analytics – Some Scale

• Over 5,000 transactions/sec sent to Analytics

• Several billion game events per day
  – Attacking, winning, losing, buying, clicking, swiping, etc.

• Anticipating a 10x increase in the next two years
• Building petabyte-scale data warehouse capacity

Analytics Pipeline

• Working towards a "zero-latency" pipeline
  – Latency = ETL, summarization, reporting & dashboards
  – Already reduced from 24 hours to 1 hour in the last year


Cloud Insights

• Agility (time to deliver)

• Elasticity: scale up/down quickly
  – Auto Scaling is critical

• Service simplification (RDS / ElastiCache / ELB)

• Professional development for the Ops team
  – Physical (datacenter/network focus) vs. virtual (DevOps focus)

Cloud Insights – Lessons Learned

• Reliability & performance consistency varies

• Stuff breaks often
  – Develop an "anti-fragile" mindset; build to anticipate failure

• Cost predictability still elusive

• Orphaned servers
  – Easy to create; must constantly clean up

• Large-scale monitoring is hard
  – No silver bullet yet

Thank You

• Thanks to the GREE OPS & Engineering Teams!

eduardo.saito@gree.net

nick.dor@gree.net

• We’re Hiring DevOps Team Members!!

http://gree-corp.com/jobs

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

MBL303