Grid Operations ©2013 LinkedIn Corporation. All Rights Reserved. Hadoop Operations at LinkedIn Allen Wittenauer Grid Computing Architect Wednesday, March 20, 13

Hadoop Operations at LinkedIn

DESCRIPTION

This is the slide deck for the talk I gave at SouthWest Big Data (UK) and Hadoop Summit EMEA 2013. Video is at: http://www.youtube.com/watch?v=Hw-V7-T3GmE

“Hadoop is not a developer problem; it’s an operations problem.”

-- Hadoop vendor ex-employee

§ August 2009
– 20 Nodes in 1 grid
– Apache Hadoop 0.20.0
– No configuration management
– No monitoring
– No security
– Free for all, including random mafia hits on running jobs
– FIFO Scheduling
– ~20 users
– 20 tasks per node
– Solaris
– No operational support

How We Fixed This (In Chronological Order)

Year One

§ Dropped task count
– 10 mappers => 7 mappers
– 10 reducers => 5 reducers

§ Reworked ETL
– hourlies => dailies
– Re-ordered to take advantage of compression
  § 10x storage improvement
– Sample impact on one job (not workflow!):
  § 80,000 map tasks => 2,000 map tasks
  § Run time cut in half

§ Optimize work flows/culture shift
§ More task time, fewer tasks
§ Production review to reinforce good behavio(u)r
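
The figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the slot and task counts come from the slide, while the per-task startup cost (ASSUMED_STARTUP_SECONDS) is a hypothetical number chosen to show why "more task time, fewer tasks" pays off, not a measured value.

```python
# Figures quoted on the slide.
MAPPERS_BEFORE, MAPPERS_AFTER = 10, 7
REDUCERS_BEFORE, REDUCERS_AFTER = 10, 5
MAP_TASKS_BEFORE, MAP_TASKS_AFTER = 80_000, 2_000

# Hypothetical fixed cost to schedule and start one task (not from the slide).
ASSUMED_STARTUP_SECONDS = 5.0


def slots_per_node(mappers: int, reducers: int) -> int:
    """Total concurrent task slots configured on one node."""
    return mappers + reducers


def startup_overhead_hours(task_count: int,
                           startup_seconds: float = ASSUMED_STARTUP_SECONDS) -> float:
    """Cumulative time spent only on starting tasks, in hours."""
    return task_count * startup_seconds / 3600


slots_before = slots_per_node(MAPPERS_BEFORE, REDUCERS_BEFORE)
slots_after = slots_per_node(MAPPERS_AFTER, REDUCERS_AFTER)
task_reduction = MAP_TASKS_BEFORE // MAP_TASKS_AFTER

if __name__ == "__main__":
    print(f"slots per node: {slots_before} -> {slots_after}")
    print(f"map tasks reduced by a factor of {task_reduction}")
    print(f"startup overhead: {startup_overhead_hours(MAP_TASKS_BEFORE):.1f} h "
          f"-> {startup_overhead_hours(MAP_TASKS_AFTER):.1f} h")
```

Under any fixed per-task cost, the overhead shrinks by the same factor as the task count, which is the point of batching hourlies into dailies.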

§ Switched to Capacity Scheduler
– FIFO is terrible
– Fair Share only viable for small tasks
– Enforced SLAs via custom patch

§ Submitted Jar Size Limit
– Encourage distributed cache usage
– Enforced limit via custom patch

15% Fast Queue:
- Task Time < 15 Minutes
- Job Time < 1 Hour
- Slot stealing from "Slow" Queue

80% Slow Queue:
- Job Time < 24 Hours
- Up to 80% of slots

5% ETL Tasks
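
The queue split above can be modelled as a small policy table. This is a sketch of the policy only, not the custom patch the slide mentions and not Capacity Scheduler configuration. The queue names, the treatment of the 5% ETL share as a third queue, and the helper functions are all assumptions made for illustration. Slot stealing is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Queue:
    name: str                          # hypothetical queue name
    capacity_pct: int                  # share of cluster slots, from the slide
    max_task_minutes: Optional[float]  # None means no limit stated on the slide
    max_job_hours: Optional[float]


FAST = Queue("fast", 15, max_task_minutes=15, max_job_hours=1)
SLOW = Queue("slow", 80, max_task_minutes=None, max_job_hours=24)
ETL = Queue("etl", 5, max_task_minutes=None, max_job_hours=None)
QUEUES = (FAST, SLOW, ETL)


def guaranteed_slots(total_slots: int) -> Dict[str, int]:
    """Split a cluster's slots by the percentages on the slide."""
    return {q.name: total_slots * q.capacity_pct // 100 for q in QUEUES}


def pick_queue(task_minutes: float, job_hours: float) -> Optional[str]:
    """Return the first user-facing queue whose SLA the job fits, else None."""
    for q in (FAST, SLOW):
        task_ok = q.max_task_minutes is None or task_minutes < q.max_task_minutes
        job_ok = q.max_job_hours is None or job_hours < q.max_job_hours
        if task_ok and job_ok:
            return q.name
    return None


if __name__ == "__main__":
    print(guaranteed_slots(1000))
    print(pick_queue(task_minutes=5, job_hours=0.5))
    print(pick_queue(task_minutes=30, job_hours=6))
```

A job that fits neither SLA gets None here. How the real patch handled such jobs is not stated on the slide.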

§ Benchmarking
– Use production code, not TeraSort!

§ Cut cost per unit in half
§ 2x nodes per rack
§ Extra RAM
– buffering
– bus speed

Old Node:
- 2 Rack Units
- 2 CPUs
- 16 GB
- 8 x 1 TB SATA
- 1 x 2 Gb NIC

New Node:
- 1 Rack Unit
- 2 CPUs
- 24 or 32 GB
- 6 x 2 TB SATA
- 1 x 1 Gb NIC
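
The density gain behind "2x nodes per rack" can be worked out from the two node specs. The sketch below uses only the numbers on the slide. Where the slide gives "24 or 32 GB", the lower figure is used, and the Node class itself is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    rack_units: int
    ram_gb: int
    disks: int
    disk_tb: int

    @property
    def raw_tb(self) -> int:
        """Raw disk capacity of one node in TB."""
        return self.disks * self.disk_tb

    @property
    def tb_per_rack_unit(self) -> float:
        return self.raw_tb / self.rack_units

    @property
    def ram_gb_per_rack_unit(self) -> float:
        return self.ram_gb / self.rack_units


OLD = Node(rack_units=2, ram_gb=16, disks=8, disk_tb=1)
NEW = Node(rack_units=1, ram_gb=24, disks=6, disk_tb=2)  # lower RAM option

if __name__ == "__main__":
    for label, node in (("old", OLD), ("new", NEW)):
        print(f"{label}: {node.raw_tb} TB raw, "
              f"{node.tb_per_rack_unit:.0f} TB/U, "
              f"{node.ram_gb_per_rack_unit:.0f} GB RAM/U")
```

Per rack unit, the new node carries three times the raw disk and three times the RAM of the old one, even before counting the 32 GB option.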

Year Two

§ DataNode disk partitioning
– Separate file systems for different purposes
– Mount options: noatime, commit=30, data=writeback

§ NN, JT, etc
– No “special hardware” == use SW RAID

[Diagram: DataNode disk layout]
20 GB: /, ...
200 GB: MR, HDFS
5 GB: Swap
200 GB: MR, HDFS
...
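
The mount options listed above could end up in /etc/fstab along the following lines. This is a minimal sketch under stated assumptions: the slide does not name the file system, so ext4 is assumed here (commit= and data=writeback are ext3/ext4 options), and the device names and mount points are hypothetical.

```python
# Mount options quoted on the slide.
MOUNT_OPTIONS = ("noatime", "commit=30", "data=writeback")


def fstab_line(device: str, mount_point: str, fs_type: str = "ext4") -> str:
    """Build one /etc/fstab entry: device, mount point, type, options, dump, pass."""
    options = ",".join(MOUNT_OPTIONS)
    return f"{device} {mount_point} {fs_type} {options} 0 0"


if __name__ == "__main__":
    # Hypothetical devices and mount points, one MR and one HDFS file system per disk.
    layout = [
        ("/dev/sdb1", "/grid/1/mapred"),
        ("/dev/sdb2", "/grid/1/hdfs"),
    ]
    for device, mount_point in layout:
        print(fstab_line(device, mount_point))
```

Keeping the options in one tuple means every data file system gets the same settings, which matters when a node has many disks.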

[Diagram: LDAP/KDC topology]
– LDAP Master + KDC Master
– LDAP Master + KDC
– MultiMaster Replication
– LDAP/KDC Slaves (two sets)
– Client Node running nscd: username, uid; group name, gid; netgroup, sudoers
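
From a client node's point of view, the username/uid and group name/gid lookups in the diagram are ordinary name-service calls. The sketch below uses Python's standard pwd and grp modules, which go through the system's C library lookups. Whether those are answered from nscd and LDAP depends on how the host is configured, which is assumed here rather than shown. Netgroup and sudoers lookups are not covered.

```python
import grp
import pwd
from typing import Optional


def lookup_uid(username: str) -> Optional[int]:
    """Resolve a username to a uid via the system name service, or None if unknown."""
    try:
        return pwd.getpwnam(username).pw_uid
    except KeyError:
        return None


def lookup_gid(group_name: str) -> Optional[int]:
    """Resolve a group name to a gid via the system name service, or None if unknown."""
    try:
        return grp.getgrnam(group_name).gr_gid
    except KeyError:
        return None


if __name__ == "__main__":
    # "root" is only an example; any account known to the host works.
    print(lookup_uid("root"), lookup_gid("root"))
```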

§ Service Bundle
– RPMs, config files, etc
– Conflict resolution

[Diagram: bcfg2 resolution]
Host: Group1, Group2, ...
bcfg2 Server:
  Group1 -> Svc1, Svc2, ...
  Group2 -> Svc1, Svc3, ...
  Group3 -> Svc4, Svc5, ...
Content: Svc1 + Svc2 + Svc3
bcfg2 client
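
The host-to-group-to-service resolution in the diagram can be sketched in plain Python. This is a model of the idea, not bcfg2's API. The group and service names are taken from the slide, the host name is hypothetical, and conflict resolution is left out because the slide does not say how it works.

```python
from typing import Dict, List

# Group -> service bundles, as drawn on the slide.
GROUP_SERVICES: Dict[str, List[str]] = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}

# Hypothetical host that belongs to Group1 and Group2.
HOST_GROUPS: Dict[str, List[str]] = {
    "host-a": ["Group1", "Group2"],
}


def services_for_host(host: str) -> List[str]:
    """Union of the service bundles of every group the host is in, order preserved."""
    resolved: List[str] = []
    for group in HOST_GROUPS.get(host, []):
        for service in GROUP_SERVICES.get(group, []):
            if service not in resolved:
                resolved.append(service)
    return resolved


if __name__ == "__main__":
    print(services_for_host("host-a"))
```

A service shared by two groups (Svc1 here) is delivered once, which matches the "Svc1 + Svc2 + Svc3" content in the diagram.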

§ Different RPM names + different install locations = pre-deploy-ability:

Object                     RPM Name                             File Path
Hadoop 1.0.4-p3 Binaries   hadoop-1043-bin-1.0.4-3              /dir/hadoop-1.0.4-p3
Grid Config for 1.0.4-p3   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.0.4-p3
Hadoop 1.1.2-p1 Binaries   hadoop-1121-bin-1.1.2.1-1            /dir/hadoop-1.1.2-p1
Grid Config for 1.1.2-p1   gridname-1121-hadoopconf-1.1.2.1-1   /dir/grid-conf-1.1.2-p1
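
The naming trick works because RPM treats packages with different names as unrelated, so two releases can sit on disk side by side before a cutover. The sketch below derives the package-name stem and install path for the Hadoop binaries from a version and patch level. The function names are hypothetical, and only the parts of the scheme that are consistent across the slide's rows are reproduced (the version-release suffix is left out).

```python
def version_tag(version: str, patch: int) -> str:
    """Collapse '1.0.4' plus patch 3 into the short tag '1043'."""
    return version.replace(".", "") + str(patch)


def binary_package_name(version: str, patch: int) -> str:
    """Package name stem with the release baked in, e.g. 'hadoop-1043-bin'."""
    return f"hadoop-{version_tag(version, patch)}-bin"


def install_path(version: str, patch: int, root: str = "/dir") -> str:
    """Per-release install directory, e.g. '/dir/hadoop-1.0.4-p3'."""
    return f"{root}/hadoop-{version}-p{patch}"


if __name__ == "__main__":
    for version, patch in (("1.0.4", 3), ("1.1.2", 1)):
        print(binary_package_name(version, patch), install_path(version, patch))
```

Since neither the names nor the paths collide, the new release can be installed ahead of time and activated later.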

Year Three+

[Diagram: cross-realm Kerberos trust]
Corp IT Active Directory: @CORP
Grid Realm: @GRID
krbtgt/GRID@CORP
Hadoop Services
krbtgt/user@CORP, krbtgt/GRID@CORP
krbtgt/host@GRID, krbtgt/service@GRID
Password
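
The principal krbtgt/GRID@CORP in the diagram is the cross-realm ticket-granting principal that lets users from the CORP realm request tickets for services in the GRID realm. The sketch below models that relationship. It assumes the trust is one-way (only krbtgt/GRID@CORP appears on the slide), and the function names are hypothetical. It does not talk to a KDC.

```python
from typing import FrozenSet, Tuple

# (client realm, service realm) pairs for which a cross-realm key exists.
# Assumption: one-way trust, CORP users may reach GRID services.
TRUSTS: FrozenSet[Tuple[str, str]] = frozenset({("CORP", "GRID")})


def cross_realm_principal(client_realm: str, service_realm: str) -> str:
    """Name of the ticket-granting principal that bridges the two realms."""
    return f"krbtgt/{service_realm}@{client_realm}"


def can_reach(client_realm: str, service_realm: str,
              trusts: FrozenSet[Tuple[str, str]] = TRUSTS) -> bool:
    """True if a client in client_realm can obtain tickets for service_realm."""
    return client_realm == service_realm or (client_realm, service_realm) in trusts


if __name__ == "__main__":
    print(cross_realm_principal("CORP", "GRID"))
    print(can_reach("CORP", "GRID"), can_reach("GRID", "CORP"))
```

With this arrangement, user passwords stay in the corporate directory while host and service principals live in the grid's own realm.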

Many months moving to secure Apache Hadoop...

§ March 2013
– 5000 Nodes in ~10 grids
– Apache Hadoop 1.0.4 + custom patches
– Full configuration management
– Full monitoring
– Security
– Capacity scheduler with SLA
– ~700 users
– 12 tasks per node
– Linux
– Five dedicated operations staff members

Future Work

Is ‘pure Hadoop’ the right tool for all of our workloads?

CEPH

HDFS

YARN PBS

§ More on LinkedIn Hadoop Performance:
– http://www.slideshare.net/allenwittenauer/2012-lihadoopperf

§ LinkedIn Data Analytics:
– http://data.linkedin.com/
