Going Active/Active Cory von Wallenstein Chief Technology Officer, Dyn Inc. @cvonwallenstein Eric Rosenberry Principal Infrastructure Architect, iovation Inc. eric.rosenberry@iovation.com @eprosenx

Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

DESCRIPTION

Dyn's Cory von Wallenstein and iovation's Eric Rosenberry recently gave a webinar on active/active failover setup with managed DNS. Here are the official slides.

Page 1: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Going Active/Active

Cory von Wallenstein
Chief Technology Officer, Dyn Inc.
@cvonwallenstein

Eric Rosenberry
Principal Infrastructure Architect, iovation Inc.
eric.rosenberry@iovation.com
@eprosenx

Page 2: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Introductions

Cory von Wallenstein
Chief Technology Officer, Dyn Inc.
[email protected]
@cvonwallenstein

Eric Rosenberry
Principal Infrastructure Architect, iovation Inc.
eric.rosenberry@iovation.com
@eprosenx

Page 3: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

What Do We Mean By Active/Active?

• Active
• Passive
• Active/Passive
• Active/Active

Page 4: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

What Are We Looking to Gain?

• High(er) availability
• Flexibility to change infrastructure without downtime
• Flexibility to expand infrastructure without four-walled limitations
• Disaster resilience

Page 5: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Active/Active FUD

• “It’s impossible!”
  – CAP theorem
  – WAN latency
• “It’s built in to my database!”
  – NoSQL and WAN replication
• Reality is it’s somewhere in the middle, depending on what problem you’re trying to solve

Page 6: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

http://www.flickr.com/photos/notaperfectpilot/8119088205/

“Wired people should know something about wires” - Neal Stephenson, quoted in Andrew Blum’s TED Talk What is the Internet, Really?

Page 7: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

http://www.ted.com/talks/andrew_blum_what_is_the_internet_really.html

Page 8: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Paradigm Shift

• All system maintenance is done during business hours without impact
• All software upgrades are done during business hours
• Software upgrades do not require downtime, so code can be pushed to production more rapidly (more frequent, smaller iterations)
• Enable commodity hardware usage

Page 9: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

The Four Questions You Need To Ask Before Embarking

1. What problem(s) am I attempting to solve?
2. How will I segment?
3. Where will I deploy?
4. How will this affect each part of my app?

Page 10: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step One: Scope the Problem

• What are we replicating and why?
• How close to real time does it need to be?
  – Synchronous vs. Asynchronous
• Think about this for each application tier, and set availability/distribution goals

Page 11: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step One: Scope the Problem

• Example:
  – iovation end-user facing content services must be served using the closest GSLB-selected node, and each node must have N capacity (where N = our full overall global load), so overall we have more than 4N total capacity with all nodes online
  – iovation real-time API services require N+1 redundancy in each of our two Active/Active facilities, i.e. 2 * (N+1). This allows us to lose any server, plus a datacenter, and continue to function
  – Non-real-time API services (e.g. Admin Console) require 2N+ resiliency (i.e. one instance in each of our two Active/Active datacenters, with that instance running on an N+1 virtual cluster)
  – Some internal processes (e.g. Research Analytics) only require placement in one datacenter
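The redundancy levels in this example can be checked with simple arithmetic. Below is a minimal sketch, assuming N is counted in servers and set to a made-up value of 4; the `survives` helper and all numbers are illustrative assumptions, not iovation's tooling or real figures.

```python
def survives(sites: list[int], required: int,
             lose_sites: int = 0, lose_servers: int = 0) -> bool:
    """Return True if the worst-case failure still leaves `required` servers.

    sites        -- server count in each datacenter (hypothetical numbers)
    required     -- servers needed to carry the full load (the slide's N)
    lose_sites   -- whole datacenters lost (the largest are dropped first)
    lose_servers -- additional single servers lost among the survivors
    """
    remaining = sorted(sites)[: len(sites) - lose_sites]  # worst case: lose the biggest
    return sum(remaining) - lose_servers >= required


N = 4  # assumed: four servers carry the full load

# Real-time API: N+1 in each of two facilities, i.e. 2 * (N+1).
api = [N + 1, N + 1]
print(survives(api, N, lose_sites=1, lose_servers=1))        # one DC plus one server lost

# Without the +1, the same failure leaves the service short.
print(survives([N, N], N, lose_sites=1, lose_servers=1))

# Content nodes: four nodes with N capacity each; any single node can carry it all.
print(survives([N, N, N, N], N, lose_sites=3))
```

The point of the exercise is that the "+1" in each facility is what pays for losing a server on top of losing a datacenter.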

Page 12: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step Two: How Will You Segment?

• Global Server Load Balancing with DNS
  – Round robin
  – Advanced load balancing
  – Active failover
  – Geographic
• Other strategies (out of scope for today):
  – Anycast
    – Challenges with TCP
  – HTTP Redirection
    – Challenges with performance
  – BGP Netblock based failover
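Three of the DNS-based policies above (round robin, active failover, geographic) can be sketched as answer-selection functions. This is a toy model under stated assumptions: the `Node` record, node names, coordinates, and health flags are hypothetical, the IPs come from documentation-only ranges, and none of this reflects how Dyn or any managed DNS service implements it. "Advanced load balancing" is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ip: str          # documentation-only addresses, not real servers
    lat: float
    lon: float
    healthy: bool = True


def live(nodes):
    """Keep only nodes that passed their (hypothetical) health check."""
    alive = [n for n in nodes if n.healthy]
    if not alive:
        raise LookupError("no healthy nodes to answer with")
    return alive


class RoundRobin:
    """Rotate through the healthy nodes, one per query."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.queries = 0

    def pick(self):
        alive = live(self.nodes)
        node = alive[self.queries % len(alive)]
        self.queries += 1
        return node


def active_failover(nodes_by_priority):
    """Return the highest-priority node that is still healthy."""
    return live(nodes_by_priority)[0]


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points on Earth."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def geographic(nodes, client_lat, client_lon):
    """Return the healthy node closest to the client's estimated location."""
    return min(live(nodes),
               key=lambda n: distance_km(client_lat, client_lon, n.lat, n.lon))


NODES = [
    Node("us-west", "192.0.2.10", 45.5, -122.7),
    Node("us-east", "198.51.100.10", 40.7, -74.0),
    Node("eu", "203.0.113.10", 52.4, 4.9),
]

print(geographic(NODES, 51.5, -0.1).name)   # client located near London
print(active_failover(NODES).name)          # first healthy node in priority order
```

All three policies share the same health filter, so a failed node drops out of the answers regardless of which policy is in use.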

Page 13: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step Three: Where Will You Deploy?

• Going from 1 to N
• Where are you thinking?
  – What are your current datacenter assets and how can they be leveraged?
• And for what reasons?
  – Disaster resilience
  – Get closer to users
  – Room to grow

Page 14: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Disaster Resilience

http://maps.google.com

Page 15: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Speed of Light

http://www.cogentco.com/files/images/network/network_map/networkmap_global_large.png

Speed of light: 299,792.458 km/second (in a vacuum)

Theoretical RTT: ~40 ms

Real RTT: ~90 ms
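A lower bound on round-trip time follows directly from distance and propagation speed. The sketch below assumes a 4,000 km one-way path and a refractive index of 1.47 for optical fiber; both values are illustrative assumptions, since the slide does not state the route or medium behind its figures.

```python
C_VACUUM_KM_S = 299_792.458      # speed of light in a vacuum, from the slide
FIBER_REFRACTIVE_INDEX = 1.47    # assumed typical value for silica fiber


def rtt_ms(distance_km: float, refractive_index: float = 1.0) -> float:
    """Best-case round-trip time in milliseconds over a straight path.

    Light travels at c / n in a medium with refractive index n, and a round
    trip covers the distance twice. Routing detours, switching, and queuing
    are ignored, so real-world RTT will be higher.
    """
    speed_km_s = C_VACUUM_KM_S / refractive_index
    return 2 * distance_km / speed_km_s * 1000


path_km = 4_000  # assumed one-way distance

print(f"vacuum: {rtt_ms(path_km):.1f} ms")
print(f"fiber:  {rtt_ms(path_km, FIBER_REFRACTIVE_INDEX):.1f} ms")
```

The gap between a figure like this and a measured RTT comes from everything the formula leaves out: fiber rarely follows the straight-line path, and every hop adds processing delay.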

Page 16: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Implications on Selection

• Things don’t work as well at 90ms RTT latency as they do at 9ms RTT latency
• Where can you go to get out of the way of a disaster but not create latency headaches?

http://www.globaldatavault.com/natural-disaster-threat-maps.htm

Page 17: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Where The Fiber Actually Goes

http://soladrive.com/images/level3-map-large.png

Page 18: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Disaster Resilience: Local Failures

http://www.datacenterknowledge.com/archives/2012/07/09/outages-surviving-electric-squirrels-ups-failures/

“A frying squirrel took out half of our Santa Clara data center two years back,” - Mike Christian, Yahoo

Page 19: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Local Failures

http://blog.level3.com/level-3-network/the-10-most-bizarre-and-annoying-causes-of-fiber-cuts/

“Squirrel chews account for a whopping 17% of our damages so far this year! But let me add that it is down from 28% just last year and it continues to decrease since we added cable guards to our plant.” - Fred Lawler, Level(3)

Page 20: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Get closer to users

http://www.akamai.com/html/technology/dataviz1.html

Page 21: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Get closer to users

http://www.akamai.com/html/technology/dataviz1.html

Page 22: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

“Sorry, we’re full”

http://www.theregister.co.uk/2010/10/12/capgemini_merlin_data_center/

Page 23: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step Three: Where Will You Deploy?

• Don’t just assume vastly different geographic areas
• How far do you need to go to get out of the same disaster zone?
  – What kind of disasters happen in your area?
  – What geographic barriers are there?
  – Can you drive it in an emergency?

Page 24: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Portland to Seattle

http://www.zayo.com/sites/default/files/images/Zayo-US-Network-EXTERNAL-11-1-2012.kmz

Page 25: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step Four: Think Through Your Apps

• How will these different pieces of the architecture behave with increased latency between them?
• Can you avoid real-time calls across the WAN?
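One way to reason about the first question is to count how many sequential calls a request makes across the WAN and multiply by the round-trip time. The sketch below is a deliberately simple model: the 20 ms of local work and the 10 cross-site calls are made-up numbers, and `request_time_ms` is a hypothetical helper. The 9 ms and 90 ms RTT values are the ones compared earlier in the deck.

```python
def request_time_ms(local_work_ms: float, wan_calls: int, rtt_ms: float) -> float:
    """Estimate total request time when cross-site calls run one after another.

    Each sequential call across the WAN costs at least one round trip, so the
    WAN share of the request grows linearly with the number of calls.
    """
    return local_work_ms + wan_calls * rtt_ms


LOCAL_WORK_MS = 20   # assumed processing time inside one datacenter
WAN_CALLS = 10       # assumed number of sequential cross-site calls

for rtt in (9, 90):
    total = request_time_ms(LOCAL_WORK_MS, WAN_CALLS, rtt)
    print(f"RTT {rtt:>2} ms -> request takes about {total:.0f} ms")

# Handling the request entirely inside one site removes the WAN term.
print(f"no WAN calls -> {request_time_ms(LOCAL_WORK_MS, 0, 90):.0f} ms")
```

Under this model, an application that is chatty between tiers goes from barely noticing the second site to being dominated by it, which is why avoiding real-time WAN calls matters more than shaving the RTT.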

Page 26: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Step Four: Think Through Your Apps

• Examples from iovation:
  – Web Device Print code is served from four global nodes using GSLB
    • via Dyn Traffic Management
    • Was our first Active/Active application
  – Real-time API responses are served Active/Active between Portland and Seattle
    • 50% of the time our API URL returns the PDX IP, and 50% of the time it returns the SEA IP
    • Real-time queries are handled locally within a single DC
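The 50/50 split between PDX and SEA can be modeled as a weighted choice over healthy sites. This is a minimal sketch, not iovation's or Dyn's configuration: the `answer` function, the health set, and the IPs (taken from documentation-only ranges) are assumptions for illustration.

```python
import random

# Hypothetical site records; addresses are from documentation-only ranges.
SITE_IPS = {"PDX": "192.0.2.10", "SEA": "198.51.100.10"}
WEIGHTS = {"PDX": 50, "SEA": 50}


def answer(weights: dict, healthy: set, rng: random.Random) -> str:
    """Pick one site for a DNS answer, weighted, skipping unhealthy sites."""
    candidates = [site for site in weights if site in healthy and weights[site] > 0]
    if not candidates:
        raise LookupError("no healthy site to return")
    return rng.choices(candidates, weights=[weights[s] for s in candidates], k=1)[0]


rng = random.Random(2012)  # fixed seed so repeated runs give the same draws

# Both sites up: answers split roughly evenly.
draws = [answer(WEIGHTS, {"PDX", "SEA"}, rng) for _ in range(10_000)]
print("PDX share:", draws.count("PDX") / len(draws))

# SEA fails its health check: every answer points at PDX.
site = answer(WEIGHTS, {"PDX"}, rng)
print("SEA down ->", site, SITE_IPS[site])
```

Because each real-time query is then handled inside the datacenter it landed in, the split happens once at DNS resolution and no per-request WAN hop is needed.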

Page 27: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Summary

• Top takeaways
  – Active/Active is a Paradigm Shift
  – It is achievable
  – Choose your locations carefully
    • Network is a primary selection criterion
    • How far do you really need to go?
  – Analyze each application tier’s constraints carefully
  – Start with low-hanging fruit

Page 28: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

What iovation Does

Identify and re-recognize devices connecting to your business sites

Associate groups of devices that would otherwise appear unrelated

Assess real-time risk through business rules including velocity, anomaly, proxy use, etc.

Page 29: Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry

Questions?

Cory von Wallenstein
Chief Technology Officer, Dyn Inc.
[email protected]
@cvonwallenstein

Eric Rosenberry
Principal Infrastructure Architect, iovation Inc.
eric.rosenberry@iovation.com
@eprosenx