Big Data Platforms in the Cloud: Accelerating the Adoption of Big Data Solutions in the Enterprise

Wang Tianqing (王天青), EMC Labs China (@EMC中国研究院)

© Copyright 2014 EMC Corporation. All rights reserved.

Virtual Hadoop Introduction In Chinese

DESCRIPTION

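In the big data space, most frameworks are built for commodity hardware and make many design choices tailored to that underlying architecture. Many non-Internet enterprises, however, have already purchased large numbers of servers and storage arrays and rely heavily on compute virtualization. How to run a traditional big data platform such as Hadoop in an enterprise's existing IT environment, and integrate it well with existing production systems, has therefore become an important question. VMware offers a product called Big Data Extension (BDE) that automatically deploys Hadoop clusters on vSphere and can dynamically scale compute and storage nodes. In the open source community, OpenStack has a similar project, Sahara, which automatically deploys Hadoop clusters on OpenStack. In addition, some enterprise storage products (for example EMC Isilon) have, in recent releases, added an HDFS interface alongside the traditional block, file, and object interfaces, so traditional enterprise storage arrays can serve cloud platforms and the applications running on them well. This talk discusses how to build an enterprise Hadoop solution using BDE/Sahara together with enterprise storage that exposes HDFS.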

Page 2: Virtual Hadoop Introduction In Chinese

Agenda

• Background

• Virtualizing Hadoop

• Big Data Extension (BDE) in VMware

• Sahara in OpenStack

• Business systems and data analytics systems as one

Page 3: Virtual Hadoop Introduction In Chinese

About Me

• @葛里森, Senior Researcher at EMC Labs China (@EMC中国研究院)

• “Full Stack” Engineer: Web & Mobile Web, SOA, Web Service, Virtualization, Storage, Distributed Storage, J2EE, Hadoop, NoSQL, OpenStack, CloudFoundry, Data Mining…

• Innovation: Innovator's Insight: China COE, Grissom Wang

Page 4: Virtual Hadoop Introduction In Chinese

Background

The evolution of virtualization and the challenges enterprises face when deploying and operating Hadoop

Page 5: Virtual Hadoop Introduction In Chinese

Virtualization Performance

1. Faster

2. Less overhead (generally 10%)

*Report

Page 6: Virtual Hadoop Introduction In Chinese

Benefits of Virtualization

• Standardization on a single common software stack

• Higher consistency and reliability due to abstracting the hardware environment

• Operational flexibility and simplicity with vMotion, Storage vMotion, Live Cloning, template deployments, hot memory and CPU add, Distributed Resource Scheduling, private VLANs, Storage and Network I/O control, etc.

Page 7: Virtual Hadoop Introduction In Chinese

Typical Hadoop Deployment

[Figure: network topology tree. The root "/" branches into data centers D1 and D2, then racks R1 to R4, then hosts H1 to H12. Levels: Data Center, Rack, Host.]

Page 8: Virtual Hadoop Introduction In Chinese

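Challenges of Deploying and Operating Hadoop in the Enterprise

Deploying Hadoop requires

• Dedicated hardware

– PC servers

– Network

• A dedicated IT team

– Deployment

– Operations

• Conclusion: it costs a lot of money (¥)

Existing IT infrastructure

• Server virtualization is already widely used

• Most enterprises have already purchased external storage

– SAN

– NAS

• Challenge: how to deploy and operate Hadoop well on the enterprise's existing IT infrastructure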

Page 9: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop

Three approaches, virtualization-aware data replication and read policies, and a virtualization-aware task scheduling policy

Page 10: Virtual Hadoop Introduction In Chinese

Three Ways to Virtualize Hadoop

1) Unmodified Hadoop in a VM (combined storage and compute)

2) Separate compute from storage

3) Separate virtual Hadoop clusters per tenant

[Figure: VM layouts for the three approaches, labeled Combined Storage/Compute, Compute, and Storage, with tenants T1 and T2 in the third.]

Page 11: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 1

[Figure: a virtualization host running one virtual Hadoop node (Datanode plus Task Tracker with two slots) next to other workloads.]

Add/remove slots?

Grow/shrink by tens of GB?

Growing or shrinking a VM (scale up/down) is one approach.

Page 12: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 1

[Figure: a virtualization host running two virtual Hadoop nodes (each a Datanode plus Task Tracker with two slots) next to other workloads.]

Add/remove more virtual nodes? (scale out/in)

Page 13: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 1

[Figure: a virtualization host running one virtual Hadoop node (Datanode plus Task Tracker with two slots) next to other workloads.]

Powering off the Hadoop VM would in effect fail the Datanode.

Page 14: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 1

[Figure: a virtualization host running two virtual Hadoop nodes (each a Datanode plus Task Tracker with two slots) next to other workloads.]

Adding a node would require replicating terabytes of data.

Page 15: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 1 (Conclusion)

• Straightforward

• Cannot take full advantage of virtualization: poor elasticity and scalability

– When the number of tasks drops, nodes cannot be released because their data must be kept

– When the number of tasks grows, data must be redistributed (rebalanced) after nodes are added
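The rebalancing cost behind this conclusion can be estimated with simple arithmetic. The sketch below assumes data is spread evenly across Datanodes before and after the change, which real clusters only approximate; the function name and the sample numbers are illustrative and do not come from the slides.

```python
def data_to_move_tb(total_data_tb: float, old_nodes: int, added_nodes: int) -> float:
    """Estimate how much data must land on newly added nodes.

    Assumes an even spread: after rebalancing, each of the
    (old_nodes + added_nodes) nodes holds the same share.
    """
    if old_nodes <= 0 or added_nodes < 0:
        raise ValueError("old_nodes must be positive and added_nodes non-negative")
    per_node_share = total_data_tb / (old_nodes + added_nodes)
    return per_node_share * added_nodes


# Hypothetical cluster: 100 TB on 10 combined compute/storage nodes, 2 nodes added.
print(f"{data_to_move_tb(100.0, 10, 2):.2f} TB must move to the new nodes")
```

Under these assumptions, even a small scale-out moves many terabytes, which is why combined compute/storage nodes are hard to scale elastically.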

Page 16: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 2

[Figure: a virtualization host running other workloads, four compute-only virtual Hadoop nodes (each a Task Tracker with two slots), and a separate virtual Hadoop node that runs only the Datanode.]

Truly elastic Hadoop: scalable through virtual nodes

Page 17: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 2 (Conclusion)

• Compute and storage can each scale independently on demand

Page 18: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 3

[Figure: a Runtime Layer with a virtual Hadoop queue and virtual Hadoop clusters 1, 2, and 3 on six hosts; a Data Layer with a distributed file system (HDFS, KFS, GPFS, MAPR, Isilon, …) and three namespaces.]

Page 19: Virtual Hadoop Introduction In Chinese

Virtualizing Hadoop: Approach 3 (Conclusion)

• Hadoop as a Service

Page 20: Virtual Hadoop Introduction In Chinese

Challenges of Virtualized Hadoop and Possible Solutions

• Replica placement and reads

• Task scheduling

Page 21: Virtual Hadoop Introduction In Chinese

Virtualization-Aware Network Topology

[Figure: network topology tree with an added Node Group layer. The root "/" branches into data centers D1 and D2, then racks R1 to R4, then node groups (NG1 to NG7 shown), then hosts H1 to H12. Levels: Data Center, Rack, Node Group, Host.]
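The four-level topology on this slide can be written as a path, one component per level. The sketch below is a minimal illustration, assuming each node group stands for the VMs that share one physical virtualization host (the usual meaning in Hadoop's virtualization extensions). The `Location` type and the sample names are hypothetical.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Position of a Hadoop node in a virtualization-aware topology."""
    data_center: str
    rack: str
    node_group: str  # assumed: one node group per physical virtualization host
    host: str        # the Hadoop node itself (a VM)

    def path(self) -> str:
        return f"/{self.data_center}/{self.rack}/{self.node_group}/{self.host}"


def parse_location(path: str) -> Location:
    """Parse a path such as /D1/R1/NG1/H1 into its four levels."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 levels, got {len(parts)}: {path!r}")
    return Location(*parts)


loc = parse_location("/D1/R1/NG1/H1")
print(loc.node_group, loc.path())
```

Keeping the node group as its own level is what lets placement and scheduling policies tell "same physical host" apart from "same rack".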

Page 22: Virtual Hadoop Introduction In Chinese

Virtualization-Aware Replica Placement Policy

• No replicas are placed on the same node or on nodes under the same node group

• The 1st replica is on the local node or on one of the nodes under the same node group as the writer

• The 2nd replica is on a different rack from the 1st replica

• The 3rd replica is on the same rack as the 2nd replica

• Remaining replicas are placed randomly across racks to meet the minimum restriction.

[Figure: Rack 0 and Rack 1 with Hosts 0 to 3; VMs hold block replicas 1, 2, and 3.]
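The placement rules on this slide can be sketched as a small selection routine. This is an illustration of the rules only, not Hadoop's actual block placement code: the `Node` type and the cluster layout are hypothetical, and where the slide says "randomly" the sketch takes the first eligible node so that the result is reproducible.

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    name: str
    rack: str
    node_group: str


def choose_targets(writer: Optional[Node], nodes: List[Node], replication: int = 3) -> List[Node]:
    """Pick replica targets following the node-group-aware rules."""
    chosen: List[Node] = []

    def pick(preferred: List[Node]) -> None:
        # Try preferred candidates first, then fall back to any node.
        for n in preferred + nodes:
            # Never reuse a node or a node group that already holds a replica.
            if all(n.name != c.name and n.node_group != c.node_group for c in chosen):
                chosen.append(n)
                return

    while len(chosen) < replication:
        before = len(chosen)
        if before == 0 and writer is not None:
            # 1st replica: the writer's node, else a node in its node group.
            preferred = [n for n in nodes if n.name == writer.name]
            preferred += [n for n in nodes if n.node_group == writer.node_group]
        elif before == 1:
            # 2nd replica: a different rack from the 1st.
            preferred = [n for n in nodes if n.rack != chosen[0].rack]
        elif before == 2:
            # 3rd replica: the same rack as the 2nd.
            preferred = [n for n in nodes if n.rack == chosen[1].rack]
        else:
            preferred = []
        pick(preferred)
        if len(chosen) == before:
            break  # no eligible node is left
    return chosen


cluster = [
    Node("vm0a", "rack0", "host0"), Node("vm0b", "rack0", "host0"),
    Node("vm1a", "rack0", "host1"), Node("vm2a", "rack1", "host2"),
    Node("vm3a", "rack1", "host3"), Node("vm3b", "rack1", "host3"),
]
print([n.name for n in choose_targets(cluster[0], cluster)])
```

The node-group check is the virtualization-specific part: without it, two replicas could land on VMs that share one physical host and be lost together.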

Page 23: Virtual Hadoop Introduction In Chinese

Virtualization-Aware Replica Selection Policy

Distances for data locality:

• Node local (0)

• Node group local (2)

• Rack local (4)

• Off rack (6)

[Figure: Rack 0 and Rack 1 with Hosts 0 to 3; an HDFS client VM choosing among block replicas 1, 2, and 3.]
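The distances 0, 2, 4, and 6 follow from counting hops up to the closest common ancestor and back down in a rack / node group / node tree. The sketch below is a minimal illustration of that calculation, assuming locations are written as `/rack/node_group/node` paths; the path format and sample names are assumptions, not taken from the slides.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Hops from a up to the closest common ancestor and down to b."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    if len(pa) != len(pb):
        raise ValueError("locations must have the same depth")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return 2 * (len(pa) - common)


def order_replicas(client: str, replicas: List[str]) -> List[str]:
    """Sort replica locations so the closest one is read first."""
    return sorted(replicas, key=lambda r: distance(client, r))


client = "/rack0/host0/vm0"
replicas = ["/rack1/host2/vm5", "/rack0/host1/vm3", "/rack0/host0/vm1"]
for r in order_replicas(client, replicas):
    print(distance(client, r), r)
```

A client that prefers the smallest distance reads from a VM on the same physical host before it goes to another host in the rack or to a remote rack.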

Page 24: Virtual Hadoop Introduction In Chinese

Virtualization-Aware Task Scheduling Policy

Task splits are assigned to a Task Tracker or Node Manager in the following order of preference:

• Node local

• Node group local

• Rack local

• Off rack

[Figure: Rack 0 and Rack 1 with Hosts 0 to 2; Task Tracker VMs get tasks from the Job Tracker; task splits labeled 1, 2, and 3.]
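The preference order above can be sketched as a scheduler that classifies every pending split by its best locality level and hands out the most local one. This is a simplified illustration rather than the Job Tracker's real logic; locations are assumed to be `(rack, node_group, node)` tuples and all names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Loc = Tuple[str, str, str]  # (rack, node_group, node), an assumed representation
LEVELS = ("node local", "node group local", "rack local", "off rack")


def locality_level(tracker: Loc, replica: Loc) -> int:
    """Return the index into LEVELS for a tracker/replica pair."""
    if tracker == replica:
        return 0
    if tracker[:2] == replica[:2]:
        return 1
    if tracker[0] == replica[0]:
        return 2
    return 3


def pick_split(tracker: Loc, pending: Dict[str, List[Loc]]) -> Optional[Tuple[str, str]]:
    """Choose the pending split whose data is closest to the tracker."""
    best: Optional[Tuple[str, int]] = None
    for split_id, replica_locs in pending.items():
        level = min((locality_level(tracker, r) for r in replica_locs), default=3)
        if best is None or level < best[1]:
            best = (split_id, level)
    if best is None:
        return None
    return best[0], LEVELS[best[1]]


tracker = ("rack0", "host0", "vm0")
pending = {
    "split-1": [("rack1", "host2", "vm5")],
    "split-2": [("rack0", "host0", "vm1"), ("rack1", "host3", "vm7")],
    "split-3": [("rack0", "host1", "vm3")],
}
print(pick_split(tracker, pending))
```

The node group level matters most when compute and storage run in separate VMs: a task can still read its data from a Datanode VM on the same physical host.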

Page 25: Virtual Hadoop Introduction In Chinese

Summary: Advantages of Virtualized Hadoop

Functionality

• High Availability and Fault Tolerance

• Multi-tenancy

• Security

• Hadoop as a Service

Resource utilization and operations

• Rapid provisioning (one-click deployment)

• Datacenter efficiency

• Efficient Resource Utilization

• Operation Simplicity

Page 26: Virtual Hadoop Introduction In Chinese

Sahara Hadoop on OpenStack

Page 27: Virtual Hadoop Introduction In Chinese

Key Features

• Open source native OpenStack component

• Bare Hadoop cluster provisioning

• Analytics as a service

Page 28: Virtual Hadoop Introduction In Chinese

Architecture

• Auth component - responsible for client authentication & authorization, communicates with Keystone

• DAL - Data Access Layer, persists internal models in DB

• Provisioning Engine - component responsible for communication with Nova, Heat, Cinder and Glance

• Vendor Plugins - pluggable mechanism responsible for configuring and launching Hadoop on provisioned VMs; existing management solutions like Apache Ambari and Cloudera Management Console could be utilized for that purpose

• EDP - Elastic Data Processing, responsible for scheduling and managing Hadoop jobs on clusters provisioned by Sahara

• REST API - exposes Sahara functionality via REST

• Python Sahara Client - like other OpenStack components, Sahara has its own Python client

• Sahara pages - the GUI for Sahara is located in Horizon

Page 29: Virtual Hadoop Introduction In Chinese

Provisioning Workflow

1. The user chooses a Hadoop cluster template in Horizon. The user can also call the REST API directly.

2. Keystone authenticates the user and provides a security token that is used to work with OpenStack, limiting the user's abilities in Sahara to their OpenStack privileges.

3. Sahara's provisioning engine provisions VMs from a pre-installed Hadoop image through Nova.

4. Hadoop VM images are stored in Glance; each image contains an installed OS and Hadoop.

5. The Sahara vendor plugin configures the Hadoop cluster on the created VMs.

[Figure: workflow diagram with callouts 1 to 4.]
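For the "use the REST API directly" path in step 1, the sketch below assembles a cluster-creation request without sending it. The endpoint layout, the `X-Auth-Token` header, and the body fields (`plugin_name`, `hadoop_version`, `cluster_template_id`, `default_image_id`) follow my recollection of Sahara's v1.1 API and should be checked against the documentation for the release in use. Every value shown is a placeholder.

```python
import json
from typing import Dict


def build_cluster_request(endpoint: str, token: str, name: str, plugin_name: str,
                          hadoop_version: str, cluster_template_id: str,
                          default_image_id: str) -> Dict[str, object]:
    """Assemble (but do not send) a request that asks Sahara for a cluster.

    The field names are assumptions based on the Sahara v1.1 API.
    """
    body = {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "cluster_template_id": cluster_template_id,  # template chosen in step 1
        "default_image_id": default_image_id,        # Glance image from step 4
    }
    return {
        "method": "POST",
        "url": endpoint.rstrip("/") + "/clusters",
        "headers": {
            "X-Auth-Token": token,  # Keystone token from step 2
            "Content-Type": "application/json",
        },
        "body": json.dumps(body, sort_keys=True),
    }


# Placeholder endpoint and IDs; none of these identify a real deployment.
request = build_cluster_request(
    endpoint="http://sahara.example.com:8386/v1.1/TENANT_ID",
    token="KEYSTONE_TOKEN",
    name="demo-cluster",
    plugin_name="vanilla",
    hadoop_version="1.2.1",
    cluster_template_id="TEMPLATE_ID",
    default_image_id="IMAGE_ID",
)
print(request["method"], request["url"])
```

Because the token is issued by Keystone, the request carries only the caller's own OpenStack privileges, as step 2 describes.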

Page 30: Virtual Hadoop Introduction In Chinese

EDP Workflow with Swift

1. The user specifies the job binary and the data source URL. The user can also call the REST API directly.

2. Sahara saves the job binary in Swift if it has not been saved yet, or loads it from Swift if it has.

3. Sahara uploads the job binary and configuration file to the Hadoop VMs.

4. The Hadoop job reads data from Swift (which here acts like HDFS).

5. The Hadoop job writes data to Swift.

[Figure: the user specifies a job through Horizon or the Sahara REST API (1); Sahara saves or loads the job in Swift (2) and uploads the job to the Hadoop VMs (3); the Hadoop VMs read data from Swift (4) and write data to Swift (5).]
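The data source URL in step 1 points at an object in Swift. The sketch below splits such a URL into its parts, assuming the `swift://<container>.<provider>/<object path>` form used by Hadoop's Swift filesystem driver. The provider name `sahara` and the sample container and path are assumptions for illustration.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_swift_url(url: str) -> Tuple[str, str, str]:
    """Split swift://container.provider/path into (container, provider, object path)."""
    parts = urlsplit(url)
    if parts.scheme != "swift":
        raise ValueError(f"not a swift URL: {url!r}")
    container, dot, provider = parts.netloc.rpartition(".")
    if not dot or not container or not provider:
        raise ValueError(f"expected container.provider in {url!r}")
    return container, provider, parts.path.lstrip("/")


print(parse_swift_url("swift://demo.sahara/input/data.txt"))
```

Validating the URL before submitting a job catches a missing provider suffix early, instead of at read time inside the Hadoop job.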

Page 31: Virtual Hadoop Introduction In Chinese

Roadmap

• Version 0.1: Basic cluster provisioning

• Version 0.2: Cluster operation support and integration with tooling

• Version 0.3: "Analytics as a service": job execution framework, support for different scripting languages

Page 32: Virtual Hadoop Introduction In Chinese

Big Data Extension (BDE) By VMware

Page 33: Virtual Hadoop Introduction In Chinese

Architecture

Page 35: Virtual Hadoop Introduction In Chinese

Virtualized Hadoop: Summary

One-click deployment and analytics as a service

Page 36: Virtual Hadoop Introduction In Chinese

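Summary

One-click deployment

• Must support multiple Hadoop versions and distributions

• Must be able to use the IaaS system to create and configure virtual machines

• Must support multiple virtualized Hadoop deployment approaches

Analytics as a service

• Supports multi-tenancy

• Supports compute/data isolation

• Supports analytics workflows (input, scheduled execution, output)

• Integrates well with existing business systems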

Page 37: Virtual Hadoop Introduction In Chinese

Business Systems and Analytics Systems as One

Exploring a solution

Page 38: Virtual Hadoop Introduction In Chinese

Why?

• Analytics workflow

– Input: ETL (Flume or Sqoop)

– Analysis

– Output: ETL

• With big data, ETL becomes very time-consuming, and it is also the biggest drag on the performance of business systems.

Page 39: Virtual Hadoop Introduction In Chinese

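How Can External Storage Be Put to Better Use?

Current state

• External storage offers high performance, reliability, and data replication and protection features

• External storage usually serves as a storage resource pool for the IaaS system

• Three kinds of interfaces: Block, File, and Object

• The compute for the HDFS Name Node and Data Nodes still sits outside the external storage

Challenges

• How can data flow more easily between business systems and analytics systems?

• How can ETL time be reduced, or ETL eliminated altogether?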

Page 40: Virtual Hadoop Introduction In Chinese

HDFS STORAGE ARRAY INTERFACE

STRENGTHS

• No Ingest necessary

• NameNode Fault Tolerance

• Eliminate 3x mirroring

• Multi-protocol access

• Simultaneous Multi-Hadoop distribution support

• Smart-Dedupe for Hadoop

• SEC 17a-4 Compliance

• Kerberos Authentication

• Application Multi-tenancy
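The "Eliminate 3x mirroring" point is easiest to see as capacity arithmetic. The sketch below compares usable capacity under HDFS's default three-way replication with an array that protects data at a lower overhead. The 1.25x protection factor and the 300 TB raw figure are purely illustrative assumptions, not numbers from the slides or from any product.

```python
def usable_tb(raw_tb: float, overhead_factor: float) -> float:
    """Usable capacity when each stored byte costs overhead_factor raw bytes."""
    if overhead_factor < 1.0:
        raise ValueError("overhead_factor must be at least 1.0")
    return raw_tb / overhead_factor


raw = 300.0                           # hypothetical raw capacity in TB
hdfs_default = usable_tb(raw, 3.0)    # three full copies of every block
array_based = usable_tb(raw, 1.25)    # assumed array-level protection overhead
print(f"HDFS 3x replication: {hdfs_default:.0f} TB usable")
print(f"Array protection (assumed 1.25x): {array_based:.0f} TB usable")
```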

Page 41: Virtual Hadoop Introduction In Chinese

The Two-in-One Vision

[Figure: a virtualization host running the production system alongside four virtual Hadoop nodes, each a Task Tracker with two slots. All of them sit on external storage that exposes Block, File, Object, and HDFS interfaces.]

Page 42: Virtual Hadoop Introduction In Chinese

Q&A

Page 43: Virtual Hadoop Introduction In Chinese
Page 44: Virtual Hadoop Introduction In Chinese

Sahara Roadmap

Page 45: Virtual Hadoop Introduction In Chinese

Version 0.1

• Version 0.1 - Basic Cluster Provisioning (April, 10 - released)

• Cluster provisioning

• Deployment Engine implementation for pre-installed images

• Templates for Hadoop cluster configuration

• REST API for cluster startup and operations

• UI integrated into Horizon

Page 46: Virtual Hadoop Introduction In Chinese

Version 0.2

• Version 0.2 - Cluster Operations (July, 15 - released)

• Manual cluster scaling (add/remove nodes)

• Hadoop cluster topology configuration parameters

– Data node placement control

– HDFS location

– Swift integration

• Plugin mechanism for integration with different Hadoop distributions

• Plugin implementations:

– Vanilla Apache Hadoop with pre-built image

– Hortonworks Data Platform using Ambari

Page 47: Virtual Hadoop Introduction In Chinese

Version 0.3

• Version 0.3 - Analytics as a Service (October, 17 - released)

• Havana support

• API to execute Map/Reduce jobs without exposing details of underlying infrastructure (similar to AWS EMR)

• User-friendly UI for ad-hoc analytics queries based on Hive or Pig

• Network configuration support, integration with Neutron (OpenStack Networking, earlier Quantum)

• More implementations of plugins

• Client libraries and tools

• Fedora/RDO integration

Page 48: Virtual Hadoop Introduction In Chinese

Icehouse

• Icehouse - Graduation and Improvements (April, 17 - released)

• Incubation graduation related

• CI/CD - devstack, tempest, devstack-gate

• Heat for resources orchestration

• Hadoop 2.x support

• CLI in python-saharaclient

• EDP improvements

• UI / UX improvements

• Code hardening

Page 49: Virtual Hadoop Introduction In Chinese

Future Versions

• Implement v2 REST API - polished and based on Pecan/WSME

• Guest agents instead of ssh/http calls from controller

• Performance testing

• Integration with Ceilometer

• Distributed architecture

• Enhanced scalability through scale-out architecture

• Ubuntu integration

• GlusterFS integration

• Monitoring support - integration with 3rd-party monitoring tools (Zabbix, Nagios)

Page 50: Virtual Hadoop Introduction In Chinese

BDE
