
PB级Hadoop集群跨机房迁移实战 - Huodongjia.com › ganhuodocs › 2017-12-27 › ...2017/12/27  · MR Spark Voidbox Hive Presto Zookeeper Oozie Big Data Infrastructure Druid


PB-Scale Hadoop Cluster Cross-Data-Center Migration in Practice

董西成 @ Hulu


About Hulu


Agenda

● Background

● Multi-DC HDFS

● Experience

● Summary


Background

Basic background of Hadoop migration


Hulu Big Data Infrastructure

[Architecture diagram. Components shown: Flume / Kafka; HDFS / HBase / Zookeeper; YARN; Voidbox, Spark / MapReduce / Pig; Spark Streaming; Hive / Presto / Impala / Druid; Nesto; AtScale (cube cache layer); in-house tools, MicroStrategy, Tableau. Several components carry a "Hulu Made" or "Hulu Internal" label.]


Why Hadoop Migration?


Challenge

● Big data & complex applications

○ Data volume: tens of PB

○ Applications: 20k ~ 30k per day

○ Mixed types of applications (MapReduce, Hive, Spark, Docker...)

○ "All or nothing"

● Hour-level downtime

○ Decrease business impact

● Economical way

○ Order as few new machines in LAS as possible

● Cross-team efforts

○ DCOps/DevOps/Data Teams in BJ/Seattle/SM

● Customize infrastructure (code-level) to guarantee a smooth migration

○ Simplify application-level migration by enhancing infrastructure


Hulu Big Data Landscape

[Layered diagram. Layers and components shown:

● Base Infrastructure (hardware, network, power supply)

● Big Data Infrastructure: Kafka, Flume, HDFS (Data), HBase, YARN, MR, Spark, Voidbox, Hive, Presto, Druid, Zookeeper, Oozie

● Redis, Codis, MySQL, CouchBase, Cassandra, ...

● Data Pipelines: Reporting, Search, RECO, AD, DS

Owner labels beside the layers: All Data Teams, Big Data Infrastructure, DCOps Team.]


The Key For Hadoop Migration: Data Migration

● Stateless system is easy

○ YARN, Presto, Impala

● Stateful system is harder

○ With small meta data

■ Hive (MySQL), Zookeeper(Disk)

○ With windowed data

■ Kafka

○ With huge data

■ HDFS(metadata & files), HBase(storing data in HDFS)


Multi-DC HDFS

How to extend HDFS to support several data centers


HDFS: Introduction

● HDFS: Hadoop Distributed File System

○ Almost all of Hulu's big data is stored on HDFS

● HDFS namespaces

○ HDFS federation

○ Three namespaces store different kinds of data


HDFS: Introduction

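[Diagram: a NameNode machine (active NameNode, with a standby NameNode) and four DataNodes.

● In NameNode memory, the directory tree (legend: folder, file): / contains tmp, user and data; /user contains 1.txt and x.y; /user/x.y contains a.dat and b.dat

● In NameNode memory, the file-to-blocks mapping: /user/1.txt: b11, b12; /user/x.y/a.dat: b21; /user/x.y/b.dat: b31, b32, b33

● In NameNode memory, the block-to-locations mapping: b11: node1, node2, node3; b12: node4, node9, node3; b21: node1, node2, node3; ...

● On NameNode disk: the fsimage (snapshot)

● On DataNode disks: the blocks themselves (e.g. b11)]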


HDFS: Brief Introduction

● HDFS namenode (metadata)

○ directory tree

○ file-blocks mapping

○ block-locations mapping (reported by every datanode)

● HDFS datanode (real data)

○ blocks
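
As a rough illustration of how these three metadata structures fit together, here is a toy Python model. It is not HDFS code: the dictionaries and the `locate` helper are hypothetical, and the sample paths, blocks and nodes are the ones used in the introduction diagram of this deck.

```python
# Toy model of the namenode metadata (illustrative only, not HDFS code).

# 1. Directory tree: folder -> children
directory_tree = {
    "/": ["tmp", "user", "data"],
    "/user": ["1.txt", "x.y"],
    "/user/x.y": ["a.dat", "b.dat"],
}

# 2. File-blocks mapping: persisted in the fsimage
file_to_blocks = {
    "/user/1.txt": ["b11", "b12"],
    "/user/x.y/a.dat": ["b21"],
    "/user/x.y/b.dat": ["b31", "b32", "b33"],
}

# 3. Block-locations mapping: rebuilt from datanode reports
block_to_locations = {
    "b11": ["node1", "node2", "node3"],
    "b12": ["node4", "node9", "node3"],
    "b21": ["node1", "node2", "node3"],
}


def locate(path):
    """Return [(block, [datanodes])] for a file.

    Blocks whose locations have not been reported yet map to an empty list.
    """
    return [(block, block_to_locations.get(block, []))
            for block in file_to_blocks[path]]


if __name__ == "__main__":
    for block, nodes in locate("/user/1.txt"):
        print(block, nodes)
```

A client read follows the same path: resolve the file to blocks on the namenode, then fetch each block from one of the listed datanodes.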


Extended HDFS: Implement HDFS-level “rsync”

[Diagram: blocks stored in the old DC are copied to the new DC by an HDFS-level "rsync".]


Extended HDFS: Solutions Comparison

Solution 1: Set up a mirror Hadoop cluster in the new DC

[Diagram: one Hadoop cluster in DC1 and a separate Hadoop cluster in DC2, each with its own machines]

➔ Pros

◆ Non-invasive to the HDFS kernel

◆ The two clusters can run different Hadoop versions

➔ Cons

◆ Keeping data consistent is hard

◆ Not transparent to users (the address changes)

◆ Requires ordering the same number of machines

➔ Used in one small Hadoop cluster (HDP → CDH)

Solution 2: Extend HDFS to support multiple DCs

[Diagram: one single Hadoop cluster whose machines span DC1 and DC2]

➔ Pros

◆ Transparent

◆ Economical

➔ Cons

◆ Invasive

◆ Risky

➔ Used in our biggest Hadoop cluster (today's talk covers only this part)


Architecture

[Diagram: one HDFS spanning two data centers, ELS DC and LAS DC. Components shown: Datanodes holding blocks in each DC, the DCNamenode, one DCBalancer for the ELS DC, one DCBalancer for the LAS DC, and the DCTunnel linking the two DCs.]


DCNameNode: Add Data Center Level Topology

● Topology Configuration

○ Format: "/datacenter/rack/node", e.g.

■ /datacenter1/rack1/node1

■ /datacenter2/rack2/node1

○ One of the data centers is "primary", as set by the admin

● Read/Write Strategy

○ Read from the local data center first, then from the other

○ Write to the primary data center only
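
The read/write strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `data_center`, `order_for_read` and `targets_for_write` are hypothetical and do not come from the DCNameNode code; only the "/datacenter/rack/node" path format is taken from the slide.

```python
def data_center(topology_path):
    """Extract the data center from a '/datacenter/rack/node' path."""
    parts = topology_path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError("expected /datacenter/rack/node, got %r" % topology_path)
    return parts[0]


def order_for_read(replicas, client_dc):
    """Put replicas in the client's data center first, then the others.

    sorted() is stable, so the original order is kept inside each group.
    """
    return sorted(replicas, key=lambda path: data_center(path) != client_dc)


def targets_for_write(datanodes, primary_dc):
    """Only datanodes in the primary data center are eligible for writes."""
    return [path for path in datanodes if data_center(path) == primary_dc]


if __name__ == "__main__":
    replicas = [
        "/datacenter2/rack2/node1",
        "/datacenter1/rack1/node1",
        "/datacenter2/rack3/node7",
    ]
    print(order_for_read(replicas, "datacenter1"))
    print(targets_for_write(replicas, "datacenter2"))
```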


DCNameNode: Data Center Replica Control

● Each file has a file-level replication factor (RF)

○ 3 by default

● Control the global replica distribution of a file across data centers

○ e.g. DC1 : DC2 = 3 : 2

● Fine-grained replica control

○ Each DC has a minimum and a maximum replication factor

○ minimum file replica = min {file-level RF, DC minimum RF}

○ maximum file replica = min {file-level RF, DC maximum RF}
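
A small worked example of the two formulas, in Python. The per-DC limits below are hypothetical values chosen to reproduce the DC1 : DC2 = 3 : 2 split; the function name `replica_bounds` is also an assumption.

```python
def replica_bounds(file_rf, dc_min_rf, dc_max_rf):
    """Return (minimum, maximum) replicas of one file inside one data center."""
    return min(file_rf, dc_min_rf), min(file_rf, dc_max_rf)


# Hypothetical per-DC limits: (DC minimum RF, DC maximum RF)
dc_limits = {"DC1": (3, 3), "DC2": (2, 2)}

if __name__ == "__main__":
    for file_rf in (3, 1):
        for dc, (dc_min, dc_max) in dc_limits.items():
            print(file_rf, dc, replica_bounds(file_rf, dc_min, dc_max))
```

With a file-level RF of 3 the file ends up with 3 replicas in DC1 and 2 in DC2. With a file-level RF of 1, both bounds collapse to 1 in each DC, so a data center never holds more replicas than the file asks for.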


DCNamenode: Modified From Namenode

[Diagram of namenode modules: NetworkTopology, Server RPC, BlockPlacement, DatanodeManager, FsImage, BlockManager, FSDirectory, FSNamesystem]


DCBalancer: Balance Data In One Single DC

● Balancer

○ An HDFS component that balances data among different machines

● DCBalancer

○ Modified from the HDFS Balancer

○ Only balances data within one specific data center


DCTunnel: Distributed Block Replication Scheduler

● Transfers data blocks across data centers in a controllable way

● Features

○ Syncs blocks according to a folder-level whitelist & blacklist

○ Bandwidth limitation

○ Priority-based block replication

■ Missing blocks are replicated quickly from the other data center

○ Quota adjustment

■ E.g. DC1 : DC2 = 3 : 2

○ Web portal to display progress
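
Two of these features, the folder-level whitelist/blacklist and priority-based replication, can be sketched as follows. This is a minimal illustration, not DCTunnel's code: `is_selected`, `ReplicationQueue`, the two priority levels, and the rule that the blacklist wins over the whitelist are all assumptions.

```python
import heapq
from itertools import count


def is_selected(path, whitelist, blacklist):
    """True if path lies under a whitelisted folder and under no blacklisted one.

    Assumption: when both lists match, the blacklist wins.
    """
    def under(folder):
        folder = folder.rstrip("/")
        return path == folder or path.startswith(folder + "/")

    return any(under(f) for f in whitelist) and not any(under(f) for f in blacklist)


class ReplicationQueue:
    """Priority queue of blocks; missing blocks are served before normal ones."""

    MISSING, NORMAL = 0, 1

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within a priority

    def add(self, block, missing=False):
        priority = self.MISSING if missing else self.NORMAL
        heapq.heappush(self._heap, (priority, next(self._seq), block))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = ReplicationQueue()
    queue.add("b11")
    queue.add("b21", missing=True)
    queue.add("b12")
    print([queue.pop() for _ in range(len(queue))])
```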


DCTunnel: Architecture


DCTunnel: Optimization

● Minimize impact to online namenode and datanode

● MVCC-based block location management

● Memory-friendly structure to store block-location mapping

● PID-based automatic bandwidth controller

● Optimized for “the long tail”


DCTunnel: Minimize Impact To Online NN & DN

● Fetch the fsimage from the backup namenode

● Fetch block locations from the TreeService running on every datanode host

● Fetch the live node list from the active namenode

● Replication tasks are sent directly to the datanodes without touching the namenode

[Diagram: DCTunnel pulls the fsimage from the backup namenode and queries the live datanode list from the active namenode. Each physical machine runs a Datanode and a TreeService.]


DCTunnel: MVCC-based Block Location Management

● Challenge in block management

○ 0.3 billion blocks * 5 replicas (3 replicas in DC1 and 2 replicas in DC2)

○ Updated every two hours (metadata fetched from the fsimage, block-location mappings from every datanode)

○ Replicas change frequently (moved, deleted or created)

● Solution

○ MVCC (Multiversion Concurrency Control) based block location management

[Diagram: the locations of one block (lashost1, lashost2, lashost3) with one LAS offset per version: old version (=5) and new version (=6). A read with version 5 returns lashost1, lashost2, lashost3; a read with version 6 returns lashost1, lashost2. New hosts (lashost2 in the figure) are written with version 6.]
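
The idea behind MVCC here is that readers pinned to an older version keep seeing a consistent set of locations while the next refresh is being written. The sketch below is a simplified stand-in, not DCTunnel's structure: it keeps a full snapshot per version instead of the per-version offsets shown in the slide, and the class and method names are hypothetical.

```python
import bisect


class VersionedLocations:
    """Block locations readable at any version that has not been pruned."""

    def __init__(self):
        # block -> (ascending versions, host tuple written at each version)
        self._history = {}

    def write(self, block, version, hosts):
        versions, snapshots = self._history.setdefault(block, ([], []))
        if versions and version <= versions[-1]:
            raise ValueError("versions must be written in increasing order")
        versions.append(version)
        snapshots.append(tuple(hosts))

    def read(self, block, version):
        """Return the locations as of `version` (latest write <= version)."""
        versions, snapshots = self._history.get(block, ([], []))
        index = bisect.bisect_right(versions, version)
        return snapshots[index - 1] if index else ()

    def prune(self, oldest_live_version):
        """Drop snapshots that no reader at or after this version can see."""
        for versions, snapshots in self._history.values():
            keep_from = max(bisect.bisect_right(versions, oldest_live_version) - 1, 0)
            del versions[:keep_from]
            del snapshots[:keep_from]


if __name__ == "__main__":
    store = VersionedLocations()
    store.write("b11", 5, ["lashost1", "lashost2", "lashost3"])
    store.write("b11", 6, ["lashost1", "lashost2"])
    print(store.read("b11", 5))
    print(store.read("b11", 6))
```

For scale, 0.3 billion blocks with 5 replicas each means about 1.5 billion location entries, which is why a production structure would avoid copying whole snapshots the way this sketch does.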


DCTunnel: Memory-friendly Structure Storing Block-location

● Memory-friendly map structure

○ Low overhead for each element

○ Mark deleted and obsolete block collections to avoid huge GC

○ Modified from the HDFS internal structure LightWeightGSet (adds a slot-level fine-grained lock)

● Overhead for ~0.2B blocks

[Diagram: an array of slots, each with its own lock and a head pointer to a linked list of nodes]

16M slots for 200M blocks; total consumed memory is 40GB (one machine is enough).

ConcurrentHashMap: 40GB+

ConcurrentSkipList: 20GB+

LightWeightGSet: 4GB+
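
The slot-level locking can be illustrated with a fixed array of slots, one lock per slot, and a singly linked list hanging off each head. The Python sketch below only shows the layout and the locking granularity; it says nothing about memory footprint, it is not the modified LightWeightGSet, and all names in it are hypothetical. With the slide's numbers, 200M blocks over 16M slots gives roughly a dozen nodes per chain.

```python
import threading


class _Node:
    __slots__ = ("key", "value", "next")

    def __init__(self, key, value, next_node):
        self.key = key
        self.value = value
        self.next = next_node


class SlotLockedMap:
    """Fixed-size chained hash map with one lock per slot."""

    def __init__(self, slot_count):
        self._heads = [None] * slot_count
        self._locks = [threading.Lock() for _ in range(slot_count)]

    def _slot(self, key):
        return hash(key) % len(self._heads)

    def put(self, key, value):
        i = self._slot(key)
        with self._locks[i]:  # only this slot is blocked, not the whole map
            node = self._heads[i]
            while node is not None:
                if node.key == key:
                    node.value = value
                    return
                node = node.next
            self._heads[i] = _Node(key, value, self._heads[i])

    def get(self, key):
        i = self._slot(key)
        with self._locks[i]:
            node = self._heads[i]
            while node is not None:
                if node.key == key:
                    return node.value
                node = node.next
            return None

    def remove(self, key):
        i = self._slot(key)
        with self._locks[i]:
            prev, node = None, self._heads[i]
            while node is not None:
                if node.key == key:
                    if prev is None:
                        self._heads[i] = node.next
                    else:
                        prev.next = node.next
                    return True
                prev, node = node, node.next
            return False


if __name__ == "__main__":
    locations = SlotLockedMap(slot_count=4)
    locations.put(1, ("lashost1", "lashost2"))
    locations.put(5, ("lashost3",))  # same slot as key 1 when slot_count is 4
    print(locations.get(1), locations.get(5))
```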


DCTunnel: PID-based Automatic Bandwidth Controller

● PID controller: based on speed error

[Block diagram: SpeedError = Expected Speed − Actual Speed, where the actual speed comes from the SpeedEstimator. The PID controller adds the proportional, integral and derivative terms of SpeedError to produce the concurrent task count for the TaskExecutor.]

TaskExecutor: controls how many migration tasks execute concurrently. Every task migrates a block from DC1 to DC2.

FYI: https://en.wikipedia.org/wiki/PID_controller
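
The control loop can be sketched as a textbook PID controller whose output is the concurrent task count. Everything numeric here is an assumption made for illustration: the gains, the 10 MB/s that each task is assumed to contribute, and the simulated `simulate` loop standing in for the real SpeedEstimator and TaskExecutor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PidController:
    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, expected, actual, dt=1.0):
        """Return the control output for the current speed error."""
        error = expected - actual
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(expected_speed, steps, per_task_speed=10.0):
    """Simulated loop: actual speed is task count times a fixed per-task speed.

    Returns the actual speed (MB/s) observed at every step.
    """
    pid = PidController(kp=0.02, ki=0.03, kd=0.005)  # hypothetical gains
    tasks = 1
    speeds = []
    for _ in range(steps):
        actual = tasks * per_task_speed  # stand-in for the SpeedEstimator
        speeds.append(actual)
        # The controller output is used directly as the concurrent task count.
        tasks = max(1, round(pid.update(expected_speed, actual)))
    return speeds


if __name__ == "__main__":
    speeds = simulate(expected_speed=800.0, steps=60)
    print(speeds[:5], "...", speeds[-1])
```

The integral term is what lets the task count settle at the value that removes the error; the proportional and derivative terms shape how quickly it gets there.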


DCTunnel: PID-based Automatic Bandwidth Controller

● PID controller example

[Chart: start with an expected speed of 800MB/s, then change the expected speed to 1.5GB/s]


DCTunnel: Optimize For “The Long Tail”

● The Long Tail Problem

○ The metadata is updated every 2 hours

○ When most replicas (e.g. 99.9%) have been migrated to the new DC, finding the remaining blocks (e.g. 0.1%) becomes harder and harder

● Optimization: build a directory-blocks cache for each folder that stores only the not-yet-migrated blocks

[Diagram: an immutable directory->block list feeds one mutable cache per directory (cache for directory 1, cache for directory 2, ...). The Task Generator scans the caches, and a block is deleted from its cache once it has been replicated successfully.]
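
A minimal sketch of this cache, assuming an in-memory set per directory: the task generator only ever scans blocks that are still pending, so the scan shrinks as the migration approaches 100%. The class and method names are hypothetical and the real cache layout is not described in the slides.

```python
from itertools import islice


class PendingBlockCache:
    """Per-directory cache holding only the blocks not yet migrated."""

    def __init__(self, directory_blocks):
        # Immutable index built from the metadata: directory -> all blocks
        self._index = {d: tuple(blocks) for d, blocks in directory_blocks.items()}
        # Mutable caches: directory -> blocks still waiting for replication
        self._pending = {d: set(blocks) for d, blocks in self._index.items()}

    def next_tasks(self, directory, limit):
        """Return up to `limit` pending blocks without scanning migrated ones."""
        return list(islice(self._pending[directory], limit))

    def mark_replicated(self, directory, block):
        """Delete the block from the cache once it is replicated successfully."""
        self._pending[directory].discard(block)

    def remaining(self):
        return sum(len(blocks) for blocks in self._pending.values())


if __name__ == "__main__":
    cache = PendingBlockCache({"/user": ["b11", "b12"], "/user/x.y": ["b21"]})
    cache.mark_replicated("/user", "b11")
    print(cache.next_tasks("/user", limit=10), cache.remaining())
```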


Demo


Execution


Experience Gained

Sharing experiences


Experience

● Network

○ Fully control network utilization between the two DCs

○ Avoid machine hotspots

● Validation

○ Validate both metadata and data

○ Ensure data is copied accurately

● Monitoring

○ Measure everything to know your bottleneck (can't fall behind)

● Test fully and rehearse before the actual migration


THANK YOU