AWS re:Invent 2013 - MBL303 Gaming Ops - Running High-performance Ops for Mobile Gaming


DESCRIPTION

Presentation from the GREE Ops team at the AWS re:Invent conference in Las Vegas, 2013


© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Gaming Ops: Running High-Performance Ops for Mobile Gaming

Eduardo Saito – Director, Engineering

Nick Dor – Sr. Director, Engineering

GREE International Friday, November 15, 2013

Agenda

• Part 1 – Lessons Learned
  – Incident Management
  – Change Management
  – Auto-scale
  – Cloud Optimization Tools and Capacity Planning

• Part 2 – Game Architecture, Analytics & Monetization
  – Game Architecture
  – Moving a live game
  – Analytics & Monetization
  – Cloud Insights

Incident Management

[Diagram: monitoring tools feed a central NOC, which handles triage, escalation, and communication out to Ops, Dev, and SMEs (Network, DBA, ...)]

[Diagram: an automated NOC and other monitoring tools route critical alerts to Ops and Dev; non-critical alerts are handled separately]

The escalation pain this causes: "Application-level issue? Who's the dev of this game? Phone number? I can't find the dev... who's their manager? Oh, the problem is in the backend service; who's the dev for that service?"

Alert Workflow – DevOps way

[Diagram: alerts route directly to the owning team: Ops; Dev, Game X, Server; Dev, Game Y, Client/iOS; Dev, Service A; Dev, Service B; Analytics]

Each alert goes directly to the right team that can resolve it!

Type         | Scope                          | Checked by       | Who to page?
ELB          | Load balancer health check     | ELB              | No one – email alert only
System-level | CPU / disk / memory / network  | Pingdom / Nagios | Ops team
App-level    | Application issues / bugs      | Pingdom          | Dev and Ops teams

Alerts go to the people who can resolve them (see the routing sketch below).
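A minimal sketch of this routing rule, mirroring the table above; team names and the paging/email helpers are hypothetical placeholders, not GREE's actual tooling:

    # Route each alert to the team that can resolve it (sketch only).
    ROUTING = {
        "elb":    {"page": [],             "email": ["ops@example.com"]},  # email alert only
        "system": {"page": ["ops"],        "email": []},                   # Pingdom / Nagios checks
        "app":    {"page": ["ops", "dev"], "email": []},                   # application issues / bugs
    }

    def page_team(team, message):
        print(f"PAGE {team}: {message}")       # stand-in for a PagerDuty-style escalation

    def send_email(addr, message):
        print(f"EMAIL {addr}: {message}")      # stand-in for an email notification

    def route_alert(alert_type, message):
        target = ROUTING.get(alert_type, {"page": ["ops"], "email": []})  # unknown types default to Ops
        for team in target["page"]:
            page_team(team, message)
        for addr in target["email"]:
            send_email(addr, message)

    route_alert("app", "game-x: 5xx rate above threshold")  # pages both Ops and Dev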


App-level alerts can be triggered by issues in:

• Server-side
• Client-side
  – iOS
  – Android

Dev and Ops are responsible.

Team      | In pager duty
Ops       | 8
Dev       | 32, from ~20 games (server-side or client-side, Android or iOS developers)
Analytics | 5

Big, Simple Status Dashboard

Big dashboard = quick status

Big dashboard = meta-monitoring

IM Bot posts in the game's channel that an alert was triggered

Use IM Bot for status

Both Ops and Dev receive the alert, troubleshoot

IM Bot = collaboration

IM Bot detects the issue is resolved and sends the all-clear

IM Bot = transparency
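A rough sketch of the IM-bot behavior described above, posting the alert and the all-clear into the game's channel; the webhook URL and payload format are assumptions, adjust for your chat system:

    import json, urllib.request

    WEBHOOK = "https://chat.example.com/hooks/game-x-ops"   # hypothetical channel webhook

    def post_to_channel(text):
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def on_alert(check, status):
        if status == "CRITICAL":
            post_to_channel(f"ALERT: {check} is CRITICAL - Ops and Dev paged")
        elif status == "OK":
            post_to_channel(f"ALL CLEAR: {check} has recovered")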

Review your incidents and alerts

• Monday morning incident review meeting
  – Weekly on-call hand-over
  – Address false positives / fine-tune your monitoring
  – Heads-up for events / major releases

• Problem management
  – Any major or recurrent incident = Problem
  – Problem = requires a post-mortem
  – Remediation items from the post-mortem also tracked weekly till closure

Incident Management – Lessons Learned

• Use automatic paging/escalation tools
• Make the alerts go to the right team directly
• Use a big display dashboard
• Use IM bots to communicate outages
• Do weekly reviews of incidents / alerts
• Do post-mortems and follow up on remediation items


Change Management

Type                     | Content                                             | Owner | Tool
Configuration Management | 3rd-party packages and configuration                | Ops   | Puppet
Release – code deploy    | 1st-party code                                      | Dev   | Jenkins + in-house scripts
Release – asset deploy   | 1st-party images / new game content / new missions | Dev   | Jenkins + in-house scripts

Configuration Management

[Workflow: Ops make changes and test locally → peer review → syntax validation (if not good, back to rework) → changes are pulled to the prod Puppet master → Puppet clients (prod servers) pull the changes]

Configuration Management Benefits

• Automate and speed up deployment
• Repeatable
• Declarative modules/manifests = documentation
• All prod changes are:
  – peer-reviewed via pull requests in Git
  – validated by puppet-lint (see the sketch below)
  – locally tested via Vagrant (every component has a Vagrant VM)
  – communicated through email and IM
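A rough sketch of the pre-merge validation step (puppet-lint plus a syntax check); the file list and CI wiring are illustrative, not the actual GREE pipeline:

    import subprocess, sys

    changed_manifests = ["modules/nginx/manifests/init.pp"]   # e.g. from `git diff --name-only`

    failed = False
    for manifest in changed_manifests:
        for cmd in (["puppet-lint", manifest], ["puppet", "parser", "validate", manifest]):
            result = subprocess.run(cmd)
            if result.returncode != 0:
                print(f"validation failed: {' '.join(cmd)}")
                failed = True

    sys.exit(1 if failed else 0)   # a non-zero exit blocks the merge in CI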


Release Management – Code deploy

[Diagram: a deploy host pushes builds from dev to QA, Beta, and Prod; build artifacts are stored in S3. The IM bot announces deploys: QA/dev deploys in the project's QA/dev channel, Prod deploys in the project's Ops channel.]
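The in-house deploy scripts are not public; this is only a sketch of the flow in the diagram, with hypothetical bucket, key, and helper names: fetch the build artifact from S3, roll it out, and announce the deploy in the right IM channel:

    import boto3

    def unpack_and_restart(artifact, env):
        print(f"unpacking {artifact} and restarting app servers in {env}")  # placeholder

    def post_to_channel(channel, text):
        print(f"[{channel}] {text}")                                        # placeholder IM-bot call

    def deploy(env, build_id):
        s3 = boto3.client("s3")
        artifact = f"/tmp/{build_id}.tar.gz"
        s3.download_file("example-build-artifacts", f"game-x/{build_id}.tar.gz", artifact)
        unpack_and_restart(artifact, env)
        channel = "game-x-ops" if env == "prod" else "game-x-qa"   # prod deploys go to the Ops channel
        post_to_channel(channel, f"Deployed build {build_id} to {env}")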


Release Management – Asset deploy

[Flow: Dev kicks off a new asset deploy job → code review → automated validation runs → if validation warns, Ops approval is required to override → deploy to prod]
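A toy sketch of the approval gate above: the deploy proceeds automatically only when validation is clean, and warnings require an explicit Ops override; the validate/deploy helpers are placeholders:

    def validate_assets(asset_bundle):
        return []                                    # returns a list of warnings (placeholder)

    def deploy_assets(asset_bundle):
        print(f"deploying {asset_bundle} to prod")   # placeholder

    def asset_deploy(asset_bundle, ops_override=False):
        warnings = validate_assets(asset_bundle)
        if warnings and not ops_override:
            print("validation warnings, waiting for Ops approval:", warnings)
            return
        deploy_assets(asset_bundle)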

Change Management Lessons Learned

• Changes are made directly by the team that is responsible for that code
  – 3rd-party code is configuration management = owned by Ops
  – 1st-party code is release management = owned by Dev

• Changes are made through tools
  – Configuration management through Puppet
  – Release management through Jenkins + internal tool

• No change is done manually
• All changes are communicated and tracked


Auto-scale use-cases

• On-demand – for the daily traffic fluctuations and organic growth
• Scheduled – for in-game events

Auto-scale on-demand and scheduled

[Chart: CPU, # instances in ELB, and # auto-scale instances over time]

Scheduled Auto-scale

1- Scheduled pre-provisioning config enabled


Scheduled action:

as-put-scheduled-update-group-action ccios-app-ScheduledUpFriday \
  --auto-scaling-group ccios-app-asg \
  --recurrence "00 17 * * 5" \
  --min-size 16
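The same scheduled pre-provisioning expressed with a current SDK (boto3) rather than the legacy as- CLI shown above; group and action names mirror the example:

    import boto3

    autoscaling = boto3.client("autoscaling")
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="ccios-app-asg",
        ScheduledActionName="ccios-app-ScheduledUpFriday",
        Recurrence="00 17 * * 5",    # every Friday at 17:00, ahead of the event
        MinSize=16,                  # raise the floor so spare capacity is in place early
    )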

Scheduled Auto-scale

2 - Spare capacity in place, ready for event


Scheduled Auto-scale

3 - Event starts, 4x spike



On-demand Auto-scale

4 – On-demand auto-scale reacts to CPU above 60% and adds more servers


On-demand policy:

as-put-scaling-policy ccios-app-ScaleUpPolicy60 \
  --auto-scaling-group ccios-app-asg \
  --adjustment=8 --type ChangeInCapacity
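For completeness, the boto3 equivalent of this scale-up policy plus the CloudWatch alarm that triggers it (CPU above 60% for 5 minutes, per the bootstrap table later in this section); names mirror the example:

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="ccios-app-asg",
        PolicyName="ccios-app-ScaleUpPolicy60",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=8,                        # add 8 instances per trigger
    )

    cloudwatch.put_metric_alarm(
        AlarmName="ccios-app-cpu-high",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "ccios-app-asg"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],         # fire the scale-up policy
    )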

On-demand Auto-scale

5 - Scheduled pre-provisioning config is removed


Scheduled action:

as-put-scheduled-update-group-action ccios-app-ScheduledDownFriday \
  --auto-scaling-group ccios-app-asg \
  --recurrence "0 21 * * 5" \
  --min-size 2

On-demand Auto-scale

6 – On-demand auto-scale terminates some instances as CPU drops below 40%


On-demand policy:

as-put-scaling-policy ccios-app-ScaleDownPolicy40 \
  --auto-scaling-group ccios-app-asg \
  --adjustment=-2 --type ChangeInCapacity

Auto-scale bootstrap workflow

Event                                      | Description                                                                                           | Duration
CloudWatch alarm is triggered              | e.g. CPU > 60% for 5 minutes                                                                          | 5 minutes
Auto-scale policy is executed              | Launches n new instances                                                                              | 2 minutes
User-data script is executed               | Defined in the auto-scale launch config; installs base packages, gets instance_id, IP, and hostgroup | 1 minute
Bootstrap script is executed               | Loaded from S3; renames host, runs Puppet, deploys code, starts the web service                      | 11 minutes
Health check passes, server gets traffic   | Health check must pass before the ELB starts sending traffic to the new host                         | 1 minute
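A sketch of how the user-data step above can be wired into the launch configuration, assuming the real bootstrap script lives in S3; the AMI ID, bucket, and script body are illustrative:

    import boto3

    USER_DATA = """#!/bin/bash
    # runs once at first boot on every auto-scaled instance
    aws s3 cp s3://example-bootstrap/bootstrap.sh /tmp/bootstrap.sh
    bash /tmp/bootstrap.sh   # renames host, runs Puppet, deploys code, starts the web service
    """

    boto3.client("autoscaling").create_launch_configuration(
        LaunchConfigurationName="ccios-app-lc",
        ImageId="ami-12345678",        # AMI pre-loaded with base packages to cut spin-up time
        InstanceType="c1.xlarge",      # example instance type
        UserData=USER_DATA,
    )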

Auto-scale external dependencies

Dependency                             | How to resolve
Configuration Management (Puppet/Chef) | Pre-load all necessary packages in the AMI / HA architecture for config management
External repo                          | Pre-load all necessary packages in the AMI / set up an internal HA repo
Code deploy                            | Same as above, or put it in S3
Monitoring registration                | Make it asynchronous
Server registration                    | Make it asynchronous

Auto-scale Lessons Learned

• Reduce time to spin up new instances:
  – Pre-install all base packages into the AMI

• Address these risks:
  – On-demand and scheduled auto-scale conflicts
  – Bootstrap validation and graceful termination
  – Health checks: keep them simple
  – Keep some servers out of the auto-scale pool, just in case
  – Map and resolve/monitor external dependencies for auto-scale
  – Consider using two different thresholds for quicker ramp-up


Cloud Optimization areas

[Diagram: four optimization areas with example checks]
• Cost – optimal # of RIs, hosts outside RIs, cost break-down using tags, estimated on-demand costs
• Usage – under-utilized hosts, overloaded hosts, EBS/ELB not in use
• Availability – AZ / region distribution, backup audit, unhealthy instances in ELB, ELB misconfigs
• Security – exposed DBs, EC2 instances behind an ELB that are also exposed directly

Cloud Optimization tools

• AWS Trusted Advisor
• 3rd-party commercial tools
• Open source tools (e.g. Netflix Ice)
• In-house tools
• Excel!

Cloud Optimization Lessons Learned

• Try Trusted Advisor

• Pilot 3rd-party solutions

• Evaluate what metrics are important for each component of your architecture

• Do in-house development for any optimizations you need that are not covered by Trusted Advisor or 3rd-party solutions

• Tag all assets! Automate tagging!
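One possible way to automate tagging so that per-tag cost break-downs work; a sketch only, with example tag keys and values:

    import boto3

    ec2 = boto3.client("ec2")

    # Find instances missing a "project" tag and give them a default one.
    untagged = []
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if "project" not in tags:
                untagged.append(instance["InstanceId"])

    if untagged:
        ec2.create_tags(Resources=untagged, Tags=[{"Key": "project", "Value": "unassigned"}])
        print("tagged:", untagged)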


GREE Games

• All Mobile, all Free-to-Play
  – iOS & Android smartphones
  – Big focus on tablets

• Role-Playing Games (RPG)
  – Multi-million dollar franchise, top-grossing titles
  – Some of the oldest games on the App Store

• Hardcore
  – Deeper, more intense gameplay mechanics

• Real-Time Strategy (RTS)
  – Fast action, small unit management

• Casino & Casual Games
  – Familiar games, wider audience, casual play

Example Game Architecture – RPG

• Application Servers
  – PHP
  – Game events sent to Analytics

• Cache Layer
  – Memcached (ElastiCache)

• Batch Processing Servers
  – Node.js (moving to Go)
  – Batches database writes

• Database
  – MySQL (RDS)

[Diagram: ELB in front of App servers; a Memcached/ElastiCache tier; Batch processing servers; and MySQL RDS databases with a failover instance]

Caching Strategy - Current

• Game architecture predates stable NoSQL
  – We wanted similar performance at scale
  – Keep combined average internal response times below 300 ms

• Memcache authoritative
  – Still use an RDBMS; potential data loss is limited

• Allows for a cheaper/simpler DB layer
  – Always do full-row replacements (i.e. no current_row_value + 1)

Data Flow

• Reads
  – ELB → App → Cache

• Writes (synchronous)
  – ELB → App → Cache → DB
  – ELB → App → Cache → Batch → DB
  – Standard write-through (see the sketch below)
  – No blind writes; always fetch the current version

• Writes (asynchronous)
  – Batch → DB
  – Batch writes to the DB every 30 seconds
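A toy sketch of the synchronous write path (cache authoritative, versioned full-row replacement, no blind writes); plain dicts stand in for memcached and RDS:

    CACHE = {}   # stand-in for memcached / ElastiCache
    DB = {}      # stand-in for MySQL RDS

    def read(key):
        return CACHE.get(key) or DB.get(key)      # cache first, fall back to the DB

    def write(key, new_fields):
        current = read(key) or {"version": 0}     # never write blind: fetch the current version
        row = {**current, **new_fields, "version": current["version"] + 1}  # full-row replacement
        CACHE[key] = row                          # cache updated synchronously
        DB[key] = row                             # write-through to the DB (or handed to Batch)
        return row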


Batch Processor

• 80% of game write traffic is async
  – Each write is versioned

• Example: player items (loot) after multiple quests
  – 10 items in 30 sec; the app server sends 10 writes downstream
  – The batch processor sends only the last record, with the final item count, to the DB

• Greatly reduced writes on the DB (coalescing sketch below)
  – Shard at the table and DB-server level for larger games
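A toy version of the coalescing idea: keep only the newest versioned row per key and flush on a timer, so ten loot updates in 30 seconds become one DB write; this is illustrative, not the Node.js/Go implementation:

    pending = {}    # key -> newest versioned row seen so far

    def enqueue(key, row):
        current = pending.get(key)
        if current is None or row["version"] > current["version"]:
            pending[key] = row        # last (highest-version) record wins

    def flush(write_to_db):
        for key, row in pending.items():
            write_to_db(key, row)     # one DB write per key per flush interval
        pending.clear()

    # e.g. called every 30 seconds:
    # flush(lambda k, r: print("DB write", k, r))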

Near Future Trends for GREE OPS

• Multi-region games
  – Latency-sensitive games and the shift towards real-time
  – Geographic data replication challenges

• Continuous Delivery

• Automation of Game Studio tasks
  – Game design, art, data/asset deploy
  – Tighter event pre-provisioning and scale-down

More Performance – Lower Costs

• Facebook HipHop Virtual Machine
  – JIT compilation & execution of PHP
  – 5x faster vs. Zend PHP 5.2
  – Achieved a 3x to 4x reduction in application server count
  – https://github.com/facebook/hhvm

• Google Go
  – Used for high-concurrency applications
  – Achieved a 2x reduction in batch processing servers vs. Node.js
  – http://golang.org


Moving a Game – Why?

• Physical datacenter to AWS
  – West coast → East coast
  – Faster access to EU markets & players

• Reduce necessary attention to infrastructure
  – Caching & DB layer; custom high-availability middleware

• Take advantage of cloud provisioning
  – Scripted instance spin-ups, auto-scaling for events/load

• Save money
  – Reduce the stand-by server pool
  – Provision for average load, not peak

Moving a Live Game – Whaaaaat?

• Live game, two platforms (iOS, Android)
  – Several million $$$ in combined monthly revenue
  – More than one million unique players/month

• ~30 GB dataset

• Minimal downtime (< 5 minutes)
  – Mostly to allow for the change to the reverse proxy config

• Debian → CentOS
• Physical machines → AWS

Moving a Live Game - How

• Develop timeline
• R&D & architecture review
• Data migration & sync
• Game server/client updates
• Load testing
• D-Day steps & checklist

Moving a Live Game - Timeline

• 3 months overall

• DB dataset transfer validation
  – Set up direct MySQL-to-RDS replication
  – Initial DB transfer time: approx. 8 hours

• Functional & performance testing
  – Load & capacity profile for application and DB servers
  – Heavy use of APM metrics (New Relic)

Moving a Live Game - Architecture

• Changes required
  – Caching: discrete memcached hosts to ElastiCache nodes
  – Database: physical MySQL DB servers to RDS

• Decided to drop the internally developed MySQL proxy
  – Bittersweet: great automatic failover; limited internal knowledge

• RDS failover mechanics added to possible game downtime

• Load balancers
  – LVS to ELB

• Processes
  – Code & asset deployment

Moving a Live Game – D-Day

• Put the game into maintenance (shutdown)
• Break DB replication (west → east)
• Set up a reverse proxy in the datacenter
  – Forward traffic from the west-coast datacenter → east-coast AWS ELB

• Bring the game back online
  – Reverse proxy sends traffic to AWS

• Update DNS to point to the ELB
  – Wait for DNS propagation
  – Slow DNS updates still hit the reverse proxy in the datacenter

Moving a Live Game – Before & After


Analytics & Monetization

• Specialize in "Live Events"
  – Higher player engagement (fun!) = more revenue

• Single-player events
  – "Epic Boss"
  – Limited-time quests

• Player organization events
  – Guild vs. Guild battles (World Domination, Syndicate Wars)
  – Raid Bosses: members help to take down a tough NPC
  – Tap into social "meta-gaming"

Modern War World Domination Results (August 2013)

Analytics for Player Engagement

• Player retention
  – 1st week and beyond
  – Tutorial completion rates

• Balancing mechanics
  – Player vs. Environment (PvE), Player vs. Player (PvP)
  – Encourage interaction with other players

• When too much good can be bad
  – Analytics needs to be paired with player feedback
  – Fun for all players, payers AND non-payers

Analytics for Decision-making

• Devices & Markets
  – Understand the most popular devices (esp. Android)
  – Focus efforts on the top devices for your market

• Launching a game
  – "Soft launch": launch only in certain markets, tune the game
  – "Hard launch": money down (marketing), marquee live events

• When to sunset & decommission
  – Depends on strategic goals, infra/engineering costs, etc.

Analytics – Some Scale

• Over 5,000 transactions/sec sent to Analytics

• Several billion game events per day
  – Attacking, winning, losing, buying, clicking, swiping, etc.

• Anticipating a 10x increase in the next two years
• Building petabyte-scale data warehouse capacity

Analytics Pipeline

• Working towards a "zero-latency" pipeline
  – Latency = ETL, summarization, reporting & dashboards
  – Already reduced from 24 hours to 1 hour in the last year


Cloud Insights

• Agility (time to deliver)

• Elasticity: scale up/down quickly
  – Auto Scaling is critical

• Service simplification (RDS / ElastiCache / ELB)

• Professional development for the Ops team
  – Physical (datacenter/network focus) vs. virtual (DevOps focus)

Cloud Insights – Lessons Learned

• Reliability & performance consistency varies

• Stuff breaks often
  – Develop an "anti-fragile" mindset; build to anticipate failure

• Cost predictability still elusive

• Orphaned servers
  – Easy to create; must constantly clean up

• Large-scale monitoring is hard
  – No silver bullet yet

Thank You

• Thanks to the GREE OPS & Engineering Teams!

eduardo.saito@gree.net

nick.dor@gree.net

• We’re Hiring DevOps Team Members!!

http://gree-corp.com/jobs

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

MBL303