Transcript
Page 1: Hadoop Operations at LinkedIn

Grid Operations

©2013 LinkedIn Corporation. All Rights Reserved.

Hadoop Operations at LinkedIn
Allen Wittenauer
Grid Computing Architect

Wednesday, March 20, 13

Page 2: Hadoop Operations at LinkedIn


“Hadoop is not a developer problem; it’s an operations problem.”

-- Hadoop vendor ex-employee

Page 3: Hadoop Operations at LinkedIn

Page 4: Hadoop Operations at LinkedIn


§ August 2009
– 20 Nodes in 1 grid
– Apache Hadoop 0.20.0
– No configuration management
– No monitoring
– No security
– Free for all, including random mafia hits on running jobs
– FIFO Scheduling
– ~20 users
– 20 tasks per node
– Solaris
– No operational support

Page 5: Hadoop Operations at LinkedIn

Page 6: Hadoop Operations at LinkedIn


How We Fixed This (In Chronological Order)

Page 7: Hadoop Operations at LinkedIn

Year One

Page 8: Hadoop Operations at LinkedIn

§ Dropped task count
– 10 mappers => 7 mappers
– 10 reducers => 5 reducers

§ Reworked ETL
– hourlies => dailies
– Re-ordered to take advantage of compression
  § 10x storage improvement
– Sample impact on one job (not workflow!):
  § 80,000 map tasks => 2,000 map tasks
  § Run time cut in half

§ Optimize work flows/culture shift
  § More task time, fewer tasks
  § Production review to reinforce good behavio(u)r
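The task-count figures on this slide can be checked with simple arithmetic. The sketch below is only an illustration: the dictionary and function names are invented here, and only the mapper, reducer, and map-task counts come from the slide.

```python
# Per-node task slots before and after the change (figures from the slide).
before = {"mappers": 10, "reducers": 10}
after = {"mappers": 7, "reducers": 5}


def slots_per_node(cfg):
    """Total concurrent task slots configured on one node."""
    return cfg["mappers"] + cfg["reducers"]


def reduction_factor(old_count, new_count):
    """How many times smaller the new count is than the old one."""
    return old_count / new_count


print(slots_per_node(before))            # 20
print(slots_per_node(after))             # 12
print(reduction_factor(80_000, 2_000))   # 40.0
```

The two per-node totals match the "20 tasks per node" and "12 tasks per node" figures quoted for August 2009 and March 2013 elsewhere in the deck.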

Page 9: Hadoop Operations at LinkedIn

§ Switched to Capacity Scheduler
– FIFO is terrible
– Fair Share only viable for small tasks
– Enforced SLAs via custom patch

§ Submitted Jar Size Limit
– Encourage distributed cache usage
– Enforced limit via custom patch

15% Fast Queue:
- Task Time < 15 Minutes
- Job Time < 1 Hour
- Slot stealing from "Slow" Queue

80% Slow Queue:
- Job Time < 24 Hours
- Up to 80% of slots

5% ETL Tasks
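The queue split above can be modelled in a few lines. This is a rough sketch, not the custom Capacity Scheduler patch the slide mentions: the `Queue` class, `guaranteed_slots`, `pick_queue`, and the 1,000-slot cluster size are all hypothetical, and only the percentages and time limits come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Queue:
    name: str
    share_pct: int                    # share of cluster slots, in percent
    max_task_minutes: Optional[int]   # None = no limit stated on the slide
    max_job_hours: Optional[int]      # None = no limit stated on the slide


QUEUES = [
    Queue("fast", 15, 15, 1),
    Queue("slow", 80, None, 24),
    Queue("etl", 5, None, None),
]


def guaranteed_slots(total_slots):
    """Split a cluster's slots by queue share (integer math, rounds down)."""
    return {q.name: total_slots * q.share_pct // 100 for q in QUEUES}


def pick_queue(task_minutes, job_hours):
    """Return the first user queue (fast, then slow) whose limits the job meets."""
    for q in QUEUES[:2]:  # the ETL queue is reserved, so it is skipped here
        task_ok = q.max_task_minutes is None or task_minutes < q.max_task_minutes
        job_ok = q.max_job_hours is None or job_hours < q.max_job_hours
        if task_ok and job_ok:
            return q.name
    return None  # the job exceeds every stated limit


print(guaranteed_slots(1000))  # hypothetical 1,000-slot cluster
print(pick_queue(5, 0.5))      # short tasks, short job
print(pick_queue(30, 6))       # long tasks, fits the 24-hour limit
print(pick_queue(30, 48))      # exceeds the 24-hour limit
```

Slot stealing from the slow queue is left out of the sketch; it would need a model of idle capacity that the slide does not describe.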

Page 10: Hadoop Operations at LinkedIn

§ Benchmarking
– Use production code not TeraSort!

§ Cut cost per unit in half
§ 2x nodes per rack
§ Extra RAM
– buffering
– bus speed

Old Node:
- 2 Rack Units
- 2 CPUs
- 16 GB
- 8 x 1 TB SATA
- 1 x 2 gb NIC

New Node:
- 1 Rack Unit
- 2 CPUs
- 24 or 32 GB
- 6 x 2 TB SATA
- 1 x 1 gb NIC
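The density gain from the old node to the new one can be worked out from the specs listed above. A small sketch follows; the `Node` class and helper names are made up for illustration, and the new node is modelled with 24 GB, the lower of the two RAM options on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    rack_units: int
    disks: int
    tb_per_disk: int
    ram_gb: int


def raw_tb(node):
    """Raw disk capacity of one node, in TB."""
    return node.disks * node.tb_per_disk


def tb_per_rack_unit(node):
    """Raw disk capacity per rack unit, in TB."""
    return raw_tb(node) / node.rack_units


def ram_gb_per_rack_unit(node):
    """RAM per rack unit, in GB."""
    return node.ram_gb / node.rack_units


old = Node(rack_units=2, disks=8, tb_per_disk=1, ram_gb=16)
new = Node(rack_units=1, disks=6, tb_per_disk=2, ram_gb=24)  # 24 GB variant

print(raw_tb(old), raw_tb(new))                              # 8 12
print(tb_per_rack_unit(old), tb_per_rack_unit(new))          # 4.0 12.0
print(ram_gb_per_rack_unit(old), ram_gb_per_rack_unit(new))  # 8.0 24.0
```

With these figures the new node holds three times the raw disk and three times the RAM per rack unit.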

Page 11: Hadoop Operations at LinkedIn

Page 12: Hadoop Operations at LinkedIn


Year Two

Page 13: Hadoop Operations at LinkedIn

Page 14: Hadoop Operations at LinkedIn


§ DataNode disk partitioning
– Separate file systems for different purposes
– Mount options: noatime, commit=30, data=writeback

§ NN, JT, etc
– No “special hardware” == use SW RAID

20 GB /, ...
200 GB MR HDFS
5 GB Swap
200 GB MR HDFS
...
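The mount options listed for DataNode disks (noatime, commit=30, data=writeback) could be rendered into /etc/fstab entries along these lines. This is only a sketch: the slide does not name the file system, so ext4 is an assumption (these options are valid for ext3 and ext4), and the device names, mount points, and the trailing "0 0" dump/fsck fields are hypothetical.

```python
# Mount options taken from the slide.
MOUNT_OPTIONS = ["noatime", "commit=30", "data=writeback"]


def fstab_line(device, mount_point, fs_type="ext4"):
    """Build one /etc/fstab entry using the slide's mount options.

    fs_type and the trailing "0 0" (no dump, no boot-time fsck) are
    assumptions made for this sketch, not taken from the slide.
    """
    options = ",".join(MOUNT_OPTIONS)
    return f"{device} {mount_point} {fs_type} {options} 0 0"


# Hypothetical devices and mount points for two data disks.
for device, mount_point in [("/dev/sdb1", "/grid/0"), ("/dev/sdc1", "/grid/1")]:
    print(fstab_line(device, mount_point))
```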

Page 15: Hadoop Operations at LinkedIn

[Diagram: LDAP Master + KDC Master pair with multi-master replication; LDAP/KDC slaves; client nodes running nscd. Data served: username, uid; group name, gid; netgroup; sudoers.]

Page 16: Hadoop Operations at LinkedIn

Page 17: Hadoop Operations at LinkedIn


§ Service Bundle
– RPMs, config files, etc
– Conflict resolution

[Diagram: Host -> bcfg2 server -> Group1, Group2, ...
Group1 -> Svc1, Svc2, ...
Group2 -> Svc1, Svc3, ...
Group3 -> Svc4, Svc5, ...
Svc1 + Svc2 + Svc3 content -> bcfg2 client]
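The host-to-groups-to-services mapping in the diagram amounts to a de-duplicated union of service bundles. The sketch below illustrates that idea only; it is not bcfg2's API or file format, and `GROUP_SERVICES` and `services_for` are names invented here. Entries elided with "..." on the slide are left out rather than guessed.

```python
# Group-to-service mapping as named on the slide (elided entries omitted).
GROUP_SERVICES = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}


def services_for(groups):
    """Union of the services of all groups, in first-seen order, no duplicates."""
    resolved = []
    for group in groups:
        for service in GROUP_SERVICES[group]:
            if service not in resolved:
                resolved.append(service)
    return resolved


# A host in Group1 and Group2 receives Svc1 once, plus Svc2 and Svc3.
print(services_for(["Group1", "Group2"]))  # ['Svc1', 'Svc2', 'Svc3']
```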

Page 18: Hadoop Operations at LinkedIn

§ Different RPM names + different install locations = pre-deploy-ability:

Object                      RPM Name                              File Path
Hadoop 1.0.4-p3 Binaries    hadoop-1043-bin-1.0.4-3               /dir/hadoop-1.0.4-p3
Grid Config for 1.0.4-p3    gridname-1043-hadoopconf-1.0.4.3-1    /dir/grid-conf-1.0.4-p3
Hadoop 1.1.2-p1 Binaries    hadoop-1121-bin-1.1.2.1-1             /dir/hadoop-1.1.2-p1
Grid Config for 1.1.2-p1    gridname-1043-hadoopconf-1.0.4.3-1    /dir/grid-conf-1.1.2-p1
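Pre-deployment works because each version installs into its own directory, so two releases never overwrite each other. Here is a minimal sketch of the path scheme shown in the File Path column; `install_paths` is a hypothetical helper, and "/dir" is the slide's own placeholder root.

```python
def install_paths(version, patch, root="/dir"):
    """Return the versioned install locations for one Hadoop release.

    Follows the File Path column on the slide: <root>/hadoop-<ver>-p<patch>
    for binaries and <root>/grid-conf-<ver>-p<patch> for grid config.
    """
    tag = f"{version}-p{patch}"
    return {
        "binaries": f"{root}/hadoop-{tag}",
        "config": f"{root}/grid-conf-{tag}",
    }


current = install_paths("1.0.4", 3)
upcoming = install_paths("1.1.2", 1)

print(current["binaries"])   # /dir/hadoop-1.0.4-p3
print(upcoming["config"])    # /dir/grid-conf-1.1.2-p1

# No path is shared, so the new release can sit on disk next to the old one.
print(set(current.values()).isdisjoint(upcoming.values()))  # True
```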

Page 19: Hadoop Operations at LinkedIn

Year Three+

Page 20: Hadoop Operations at LinkedIn

[Diagram: Corp IT Active Directory realm (@CORP) and Grid realm (@GRID) with Hadoop services. Labels: Password; krbtgt/user@CORP; krbtgt/GRID@CORP; krbtgt/host@GRID; krbtgt/service@GRID.]

Page 21: Hadoop Operations at LinkedIn

Many months moving to secure Apache Hadoop...

Page 22: Hadoop Operations at LinkedIn

Page 23: Hadoop Operations at LinkedIn


Page 24: Hadoop Operations at LinkedIn


§ March 2013
– 5000 Nodes in ~10 grids
– Apache Hadoop 1.0.4 + custom patches
– Full configuration management
– Full monitoring
– Security
– Capacity scheduler with SLA
– ~700 users
– 12 tasks per node
– Linux
– Five dedicated operations staff members
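For a sense of scale, the August 2009 and March 2013 snapshots can be compared directly. The sketch below uses only the node, task, and user counts from the two slides; the user counts are approximate ("~20", "~700"), and the names are illustrative.

```python
# Snapshot figures from the August 2009 and March 2013 slides.
aug_2009 = {"nodes": 20, "tasks_per_node": 20, "users": 20}
mar_2013 = {"nodes": 5000, "tasks_per_node": 12, "users": 700}


def total_slots(grid):
    """Fleet-wide task slots: nodes times tasks per node."""
    return grid["nodes"] * grid["tasks_per_node"]


def growth(key):
    """Growth factor for one metric between the two snapshots."""
    return mar_2013[key] / aug_2009[key]


print(total_slots(aug_2009))   # 400
print(total_slots(mar_2013))   # 60000
print(growth("nodes"))         # 250.0
print(growth("users"))         # 35.0
```

Fewer slots per node did not shrink total capacity: the fleet-wide slot count still grew 150-fold.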

Page 25: Hadoop Operations at LinkedIn

Page 26: Hadoop Operations at LinkedIn


Future Work

Page 27: Hadoop Operations at LinkedIn

Is ‘pure Hadoop’ the right tool for all of our workloads?

Page 28: Hadoop Operations at LinkedIn

CEPH

HDFS

YARN PBS

Page 29: Hadoop Operations at LinkedIn

Page 30: Hadoop Operations at LinkedIn


§ More on LinkedIn Hadoop Performance:
– http://www.slideshare.net/allenwittenauer/2012-lihadoopperf

§ LinkedIn Data Analytics:
– http://data.linkedin.com/
