Ephemeral Hadoop Clusters in the Cloud Greg Fodor, Etsy [email protected] [1]


about me
[email protected]
@gfodor
Data Wrangler

about etsy

the world’s handmade marketplace

total members: 9,000,000
total active shops: 800,000
items listed: 9.5M
page views per month: >1B
2010 sales: $314.3M

lots of data

about this talk

ephemeral?

[5]

“elastic” to the extreme

how did we get here?

wanted to dip our toes
stop hitting the database
stop grepping log files

2 data sources -> S3

database snapshots
input: nightly diffs (SELECT * FROM <table> WHERE update_date > 1 day ago)
output: full tables as sequence files
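
The nightly-diff step above can be read as a keyed merge: yesterday's full table plus the rows changed since then gives today's full table. A minimal in-memory sketch of that idea follows. The `id` primary key, the row shape, and the name `apply_diff` are assumptions for illustration; the slides only say the real job writes full tables as sequence files. Deleted rows are not handled, since a filter on update_date would not see them.

```python
def apply_diff(snapshot, diff, key="id"):
    """Return a new full table in which rows from `diff` replace or extend `snapshot`."""
    merged = {row[key]: row for row in snapshot}
    for row in diff:
        merged[row[key]] = row  # newer version of the row wins
    return [merged[k] for k in sorted(merged)]


# Hypothetical rows; column names are made up for the example.
yesterday = [
    {"id": 1, "name": "shop-a", "update_date": "2011-05-01"},
    {"id": 2, "name": "shop-b", "update_date": "2011-05-01"},
]
nightly_diff = [
    {"id": 2, "name": "shop-b-renamed", "update_date": "2011-05-02"},
    {"id": 3, "name": "shop-c", "update_date": "2011-05-02"},
]
today = apply_diff(yesterday, nightly_diff)
```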

visit logs
input: akamai access logs (event beacons)
output: [visit_id, [event]]
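
The [visit_id, [event]] output shape above amounts to a group-by on the visit id. A small sketch, assuming each parsed beacon is a (visit_id, timestamp, event) tuple; that tuple layout and the event names are hypothetical, since the slides do not show the log format.

```python
from collections import defaultdict


def group_by_visit(beacons):
    """Turn (visit_id, timestamp, event) tuples into [visit_id, [event, ...]] records."""
    visits = defaultdict(list)
    for visit_id, timestamp, event in beacons:
        visits[visit_id].append((timestamp, event))
    # Order events within a visit by timestamp, and visits by id for stable output.
    return [[vid, [e for _, e in sorted(evs)]] for vid, evs in sorted(visits.items())]


beacons = [
    ("v1", 10, "view_listing"),
    ("v2", 11, "search"),
    ("v1", 12, "add_to_cart"),
]
visit_records = group_by_visit(beacons)
```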

processing the data

[2]

data flow
joins, group bys, etc.

cascading
Chris Wensel
http://www.cascading.org/

great implementation

Java syntax
[10]

cascading.jruby
Grégoire Marabout (Qualtera), Matt Walker (Etsy), Stefan Karpinski (Etsy), Steve Mardenfeld (Etsy)
github: http://bit.ly/o3DNtC
blog: http://etsy.me/cFytuL

“push” job binaries to S3
run on Elastic Map/Reduce
starts cluster, runs, shuts down
access results on S3

next project: shop recommendations

3 steps:
✔ data preparation - Cascading
✖ analysis/training
✖ prediction

sparse implementation of SVD

3 steps:
✔ data preparation - Cascading
✖ analysis/training - MATLAB
✖ prediction - MATLAB

“MATLAB, in my Hadoop cluster?”

hadoop streaming

arbitrary scripts for map & reduce
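
To make "arbitrary scripts for map & reduce" concrete: Hadoop streaming feeds input lines to the mapper on stdin, sorts the mapper's key<TAB>value output by key, and feeds the sorted lines to the reducer. The sketch below counts events and simulates the sort locally with `sorted()`. The input record format (visit_id<TAB>event) is an assumption; on a cluster each function would sit in its own script reading `sys.stdin` and printing to stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'event<TAB>1' for each input record 'visit_id<TAB>event'."""
    for line in lines:
        _visit_id, event = line.rstrip("\n").split("\t")
        yield f"{event}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local stand-in for map -> sort -> reduce.
records = ["v1\tsearch\n", "v1\tview_listing\n", "v2\tsearch\n"]
counts = list(reducer(sorted(mapper(records))))
```

Because the contract is only lines in and lines out, the same slot can hold a Ruby script or a MATLAB wrapper instead of Python.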

Swiss army knife
[3]
Full dataset analysis: MATLAB, Ruby scripts
‘Artifact’ outputs: Tokyo Cabinet, Lucene, SQLite
Side-effects: MySQL, CloudFront

3 steps:
✔ data preparation - Cascading
✔ analysis/training - MATLAB
✔ prediction - MATLAB

[4]
Job 1
Job 2

Barnum

Sinatra web service on EC2

barnum starts job and passes callback URL
when job finishes, hadoop hits callback URL to barnum to proceed
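
The callback handshake above can be sketched as a small state machine: launch a job with a callback URL, wait for the callback, launch the next job. This is not Barnum's code (Barnum is a Sinatra service and its internals are not shown in the slides); the class name `Workflow`, the `launch` function, the step names, and the `barnum.example` URL are all hypothetical, and HTTP is replaced by direct method calls.

```python
class Workflow:
    """Runs named jobs one after another, advancing only when a callback arrives."""

    def __init__(self, steps, launch):
        self.steps = list(steps)
        self.launch = launch  # launch(job_name, callback_url) starts a cluster job
        self.position = -1    # index of the job currently running
        self.done = False

    def start(self):
        self._advance()

    def on_callback(self, job_name):
        """Handle a finished job hitting its callback URL."""
        running = self.position >= 0 and not self.done
        if not running or job_name != self.steps[self.position]:
            raise ValueError(f"unexpected callback from {job_name}")
        self._advance()

    def _advance(self):
        self.position += 1
        if self.position == len(self.steps):
            self.done = True
            return
        job = self.steps[self.position]
        self.launch(job, f"http://barnum.example/callback/{job}")


launched = []
wf = Workflow(
    ["data_preparation", "training", "prediction"],
    launch=lambda job, url: launched.append((job, url)),
)
wf.start()
for finished in ["data_preparation", "training", "prediction"]:
    wf.on_callback(finished)  # in the real system this would be an HTTP request
```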

Barnum constructs

3 steps:
✔ data preparation - Cascading
✔ analysis/training - MATLAB
✔ prediction - MATLAB

suggested_shops.yaml:

getting data back to web stack?

v1
[6]

ad-hoc shell scripts
TSV into unsharded MySQL
not re-usable
[6]

v2

datasets are versioned based upon job execution time
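
One way to version a dataset by job execution time is to put a fixed-width timestamp in the table name, so the newest version is simply the largest name. The naming scheme and both function names below are assumptions for illustration; the slides do not show the actual convention.

```python
from datetime import datetime


def versioned_name(dataset, executed_at):
    """Build a name like 'suggested_shops_20110501030000' (hypothetical scheme)."""
    return f"{dataset}_{executed_at.strftime('%Y%m%d%H%M%S')}"


def latest_version(names, dataset):
    """Return the newest versioned name for `dataset`, or None if there is none."""
    prefix = dataset + "_"
    candidates = [
        n for n in names
        if n.startswith(prefix)
        and len(n) == len(prefix) + 14
        and n[len(prefix):].isdigit()
    ]
    # Fixed-width timestamps sort lexicographically in time order.
    return max(candidates) if candidates else None


tables = [
    versioned_name("suggested_shops", datetime(2011, 5, 1, 3, 0, 0)),
    versioned_name("suggested_shops", datetime(2011, 5, 2, 3, 0, 0)),
    versioned_name("search_quality", datetime(2011, 5, 3, 3, 0, 0)),
]
current = latest_version(tables, "suggested_shops")
```

Readers can keep using the previous version until a new one is fully loaded, which is one plausible reason to version by run time.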

MySQL Tables:

Memcache Cluster:

Output dataset <-> ORM Model

PHP:

Cascading:
PHP:

Cascading:
PHP:
PHP:

Old tables regularly dropped
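
Dropping old tables can be sketched as a retention rule over versioned names. The sketch assumes every name for a dataset ends in a fixed-width timestamp, so sorting the names sorts them by age; the retention count of 2 and the function name are hypothetical, since the slides only say old tables are regularly dropped.

```python
def tables_to_drop(names, keep=2):
    """Given the versioned names of one dataset, return all but the newest `keep`."""
    ordered = sorted(names)  # oldest first, thanks to fixed-width timestamps
    return ordered[:-keep] if keep > 0 else ordered


versions = [
    "suggested_shops_20110503030000",
    "suggested_shops_20110501030000",
    "suggested_shops_20110502030000",
]
stale = tables_to_drop(versions, keep=2)
```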

how we’re using this stack
analytics (internal)
products (external)

analytics

products

search quality
recommendations

May 2011: 4,926 successful job runs

[5]

scale up from zero

isolation

isolation across runs
fresh machine each time

isolation between developers
no toe-stepping

heterogeneous clusters

big RAM when you need it (but not when you don’t)

need one machine? use one machine.

writing jobs

PHENOMENAL COSMIC POWERS
[7]

prototyping
run slow, unoptimized version on 500 machines for < $100

parameter tuning
Try N=1, 2, 5, 10 and see which results in best output
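
The parameter-tuning idea above is a plain sweep: run the job once per candidate N, score each output, keep the best. In the sketch below `run_job` and `score` are stand-ins supplied by the caller (both hypothetical); with ephemeral clusters each candidate could run on its own cluster at the same time, which this sequential sketch does not attempt.

```python
def sweep(run_job, score, candidates=(1, 2, 5, 10)):
    """Run the job for each candidate N and return (best_n, all_scores)."""
    results = {n: score(run_job(n)) for n in candidates}
    best = max(results, key=results.get)
    return best, results


# Toy stand-ins: the "job" returns N itself and the score peaks at N=5.
best_n, scores = sweep(run_job=lambda n: n, score=lambda out: -(out - 5) ** 2)
```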

[9]

questions?

photo credits
[1] by elfike http://www.flickr.com/photos/elfike/157439707/
[2] by Dan4th http://www.flickr.com/photos/43264265@N00/5371557240/
[3] by mandolux http://www.flickr.com/photos/73935252@N00/34418046/
[4] by The Suss-Man http://www.flickr.com/photos/8692813@N06/4580254188/
[5] by Stephen Rees http://www.flickr.com/photos/60142746@N00/214461223/
[6] by Let Ideas Compete http://www.flickr.com/photos/question_everything/3414827746/
[7] by funkandjazz http://www.flickr.com/photos/phunk/2484159004/
[8] by ViaMoi http://www.flickr.com/photos/12187843@N07/3343619603/
[9] by kreg.steppe http://www.flickr.com/photos/spyndle/500305000/
[10] clipart (really)
[11] by Chris Pirillo http://www.flickr.com/photos/49503157467@N01/34588230/