Engineering practices in big data storage and processing

Engineering Practices in Big Data Storage and Processing

Nov.20, 2013

Schubert (Songbo) Zhang

About me

• 张松波 (Schubert Zhang)

• Background
o Senior Engineer, Tech Lead, and Architect, Infrastructure Data Team, @Baidu
o VP Engineering, Cloud & Big Data R&D, @Hanborq
o Senior Engineering Manager, @UTStarcom

• 10 years of Telecom, 5 years of Cloud Storage & Big Data, 1 year of Internet

2

Categories of (Big) Data

• Rows / Records
o Logs
o User Profiles
o Shopping Orders
o …
• Presentation
o Tables with Schema
o Data Types
o Database, Data Warehouse
• A mess -> organizing, indexing -> fast to retrieve …
• Batch and sequential processing …

• Files / Objects
o Documents
o Photos
o Videos
o …
• Presentation
o Files in a File System
o Objects in an Object Storage System
o With metadata …
• Organizing, indexing -> fast to retrieve …
• Batch and sequential processing …

3

Over the common underlayer storage and IO system: Hardware, Disk, Network …

Products and Engineering Projects

Object Storage, Data Warehouse, Cluster Management, etc.

For enterprise!

4

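Product Lines

Hanborq Products

Big Data Engineering (Big Data)
The HB-CDW product line is a big data warehouse system, built on cloud computing technology, for storing, querying, analyzing, and mining big data at petabyte scale. Its core products include a big data warehouse based on the Hadoop ecosystem and HugeTable, a management system for massive structured data. Built on Hanborq's enhanced and extended versions of Hadoop, HBase, Hive, Pig, and other big data infrastructure software, it implements its own data model, system architecture, and standard SQL/API. It provides fast loading and real-time indexed queries over big data, as well as in-depth statistics, analysis, and mining based on parallel computing technologies such as MapReduce and MPP. The system offers flexible scalability, security, and reliability. It is widely used in big data industries such as telecom, electric power, transportation, and large Internet companies.

Cloud Storage
The HB-CSS product line provides cloud storage solutions and services for enterprises and individuals. Its core is oNest, a scalable, secure, and fast cloud object storage system with a service-layer API and user experience similar to Amazon AWS S3. Built on oNest, it offers enterprises and individuals a Storage Gateway for accessing cloud storage services and a Dropbox-like online cloud storage service (uDrop/eDrop). It is widely used in industries such as large Internet companies, education, telecom, media, and transportation.

Management
HB-ClusterMaster is a software system for large-scale data center cluster planning, automated installation and deployment of operating systems and applications, configuration management, monitoring, and operations and maintenance. It enables efficient deployment and operation of large-scale cloud computing clusters. The largest single system deployed and managed so far exceeds 2,000 physical server nodes.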

5

• Web Service and API
o Amazon AWS S3 RESTful API
o S3 Data Model (User -> Buckets -> Objects)
• Backend Distributed Object Storage System
o Google GFS + Facebook Haystack
o Triple copy of data trunks
o Write-through, strong consistency
o Append-only and compaction
o Highly efficient local index
o …
• Backend Distributed Metadata Layer
o Flexible data model
o NoSQL
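The append-only write path with compaction and a local index can be sketched as follows. This is a minimal in-memory illustration, not oNest's implementation: the class name `AppendOnlyTrunk`, its methods, and the use of a `bytearray` in place of an on-disk trunk file are all assumptions made for the example.

```python
class AppendOnlyTrunk:
    """Toy append-only data file with an in-memory local index (hypothetical names)."""

    def __init__(self):
        self.data = bytearray()   # stands in for one on-disk trunk file
        self.index = {}           # key -> (offset, length) of the live copy

    def put(self, key, value: bytes):
        offset = len(self.data)
        self.data += value                    # never overwrite in place, only append
        self.index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        offset, length = self.index[key]      # one index lookup, one contiguous read
        return bytes(self.data[offset:offset + length])

    def delete(self, key):
        del self.index[key]                   # the bytes stay until compaction

    def garbage_bytes(self) -> int:
        live = sum(length for _, length in self.index.values())
        return len(self.data) - live

    def compact(self):
        """Rewrite only the live objects into a fresh file and rebuild the index."""
        new_data, new_index = bytearray(), {}
        for key, (offset, length) in self.index.items():
            new_index[key] = (len(new_data), length)
            new_data += self.data[offset:offset + length]
        self.data, self.index = new_data, new_index


store = AppendOnlyTrunk()
store.put("photo-1", b"AAAA")
store.put("photo-2", b"BBBBBB")
store.put("photo-1", b"CC")        # a new version is appended; the old bytes become garbage
store.delete("photo-2")
garbage_before = store.garbage_bytes()
store.compact()
```

Overwrites and deletes only touch the index, so writes stay sequential; the cost is garbage that a periodic compaction has to reclaim.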

6

Cloud Object Storage System : oNest

Web Service(RESTful API over HTTP)

Metadata Layer

Object/Trunk Storage Layer

SDK

(C++/Java/Python/PHP/Go…)


7

[Figure: S3-style data model. A User owns Buckets (Bucket1 to Bucket4), and each Bucket holds Objects.]

Rock & Chunks: Data Model and Data Organization

[Figure: logical vs. physical organization. Logical: a Bucket contains Objects; an Object (Pebble) is made of Parts. Physical: a Part is stored as Chunks, and Chunks are packed into Rocks.]

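Cloud Object Storage System: RockStor -> oNest

Fast, simple prototype: leverage open source to become a product and service.

[Figure: architecture of the RockStor distributed cloud object storage system. Labels shown:]
• Interface layer: application systems 1 … N and the resource management platform connect through HTTP interfaces and management interfaces; RESTful API (Cloud Service); SDK (Java) for developers
• Service layer: RockStor Service Load Balancers (request load balancing, multi-point deployment, LVS) in front of RockServer web services; object functions (object access, object attributes); container functions (container access, container attributes); user functions (authentication, authorization); AAA, CAS; RESTful API (Internal); RockMaster
• Storage layer: distributed storage cluster on Hadoop (stores and manages Rock files); distributed database cluster on HBase (stores and manages metadata)
• System management: metering, user control, log management, statistical reports, operations and maintenance; Management Console

8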


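• The oNest object cloud storage platform stores data as objects and provides cloud storage services of up to hundreds of petabytes for Internet services and enterprise users.

• Main features of the object cloud storage service provided by oNest:

(1) Highly reliable, multi-replica data storage, with automatic repair of data replicas in dynamic environments

(2) Large-scale storage (hundreds of PB and above), with linear expansion of both object count and capacity

(3) Data backup within a data center and across data centers

(4) Large-scale concurrent access

(5) Secure data access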

To be a more complete product and service.

[Figure: oNest deployment across data center A and data center B within a Region. Components shown in each availability zone (AZ) include a ClusterMaster, a Discovery Service Cluster, an OAS Cluster, a Healer Cluster, a DataStorage Cluster (DataNodes), a MetaNode Cluster (Master/Slave), a Stats Cluster (StatsMaster/StatsSlave), an AAA Cluster (Master/Slave, with AAAWeb Service), ConsoleWebServers, and Proxies.]

9


10

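[Screenshot: oNest web console. Annotations:]
• Toolbar: username, Create Bucket, New Folder, Upload Object, Refresh List, View Properties, Operation History
• Right-click menu
• Bucket list and object list
• Basic object properties; click to open the detailed properties, including the object download URL
• Click to open ACL permission management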


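BC-oNest Object Cloud Storage Service for the Education Cloud

• oNest is an elastic object cloud storage system, comparable to Amazon AWS S3. It provides storage for the education cloud's video, audio, images, documents, and other data.
• oNest provides a unified, standard cloud storage interface through which education cloud applications can store, read, or manipulate these data objects.
• The education cloud applications are themselves the users of oNest cloud storage.

[Figure: users of education cloud applications use the education cloud application services (Education Cloud App-1, App-2), which call the oNest cloud storage service through the SDK and REST; registration, login, and Console.]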

11

Dropbox-Like NetDisk Service: uDrop / eDrop

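• Hack Dropbox
o Keep-alive mechanism
o Delta update
o Mechanism of shared file blocks
o Dropbox client database: SQLite

• Data/file splitting and fingerprinting

• Incremental (delta) upload algorithm

• So-called "instant upload"

[Figure: Dropbox traffic as observed from the client. The Client talks to the Dropbox Web Server, the Softlayer Datacenter, and Amazon S3 & EC2. Server addresses shown: 67.228.78.114, 67.228.78.116, 67.228.78.117, ...; 208.43.202.5, ...; 75.101.145.128, 75.101.138.84, ... Flows: login (https); download and upload data (https); keep alive (http); list, delete, rename, and sync (https).]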
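Splitting files into fingerprinted blocks is what makes both delta upload and "instant upload" possible: the client only sends blocks whose fingerprints the server has not seen. The sketch below is a simplified illustration under stated assumptions, not the uDrop/eDrop or Dropbox protocol. It uses fixed-size blocks and SHA-256, and the names `BlockStore`, `split_blocks`, and the tiny `BLOCK_SIZE` are hypothetical choices made for the demo.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the demo is easy to follow; real systems use far larger blocks


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class BlockStore:
    """Server side: blocks are stored once, keyed by fingerprint."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block bytes
        self.files = {}    # file name -> ordered list of fingerprints

    def upload(self, name: str, data: bytes) -> int:
        """Store a file and return how many blocks actually had to be transferred."""
        sent, recipe = 0, []
        for block in split_blocks(data):
            fp = fingerprint(block)
            if fp not in self.blocks:      # only unknown blocks cross the network
                self.blocks[fp] = block
                sent += 1
            recipe.append(fp)
        self.files[name] = recipe
        return sent

    def download(self, name: str) -> bytes:
        return b"".join(self.blocks[fp] for fp in self.files[name])


server = BlockStore()
first = server.upload("report-v1", b"aaaabbbbcccc")     # new file: every block is sent
instant = server.upload("copy-of-v1", b"aaaabbbbcccc")  # identical content: nothing is sent
delta = server.upload("report-v2", b"aaaabbbbdddd")     # one block changed: one block is sent
```

Fixed-size blocks are the simplest choice; they handle in-place changes well but not insertions, which shift every later block boundary.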

12

Dropbox-Like NetDisk Service: uDrop / eDrop

[Figure: uDrop/eDrop architecture. Clients: PC Client, Mobile Client, and Browser. Access tier: REST AccessServers and a Web Server, each exposing a MetaAPI and a DataAPI. Backend: MetaServers, oNest for data, ZooKeeper and HBase; Register and Matcher components.]

13

Big Data Platform

[Figure: Big Data Platform architecture. Components shown:]
• Users and applications: SQL / Scripts / Java / Web
• Smart SQL and Execution Engine; Hive, Pig; HCatalog; Oozie; Data Mining; ETL
• HugeTable and HBase (Bigtable); MapReduce / Impala
• HDFS (files) on a shared cluster of servers
• Ingest from big data sources: BulkLoad, Flume / Flive
• Operations: Ganglia, Nagios, Backup, ClusterMaster (Deployment)

14

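Big Data Warehouse: HugeTable -> Horizon

• HDFS as the underlying storage platform, with multiple, extensible storage formats
o HBase/HFile
o Row storage: TextFile, SequenceFile
o Columnar storage: RCFile/ORCFile, Parquet, …
• Multiple data access models
o HBase
o MapReduce
o MPP: Impala
• HugeTable-specific data storage model
o Encoding/Decoding
o Indexing
o Partitioning
o …
• Unified data schema and metadata management
• Smart SQL Engine and Server
o High performance, high concurrency, high stability, distributed
o Chooses among the different data access paths
• Compatible with Hive and Pig
• Standardized JDBC client interface and client tools
• Engineering utilities
o Fast bulk loading (BulkLoad) and export, with a SQL interface
o Fast deployment tools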

[Figure: Horizon stack. Components shown:]
• Clients, each through the JDBC Driver: SQuirreL SQL Client (GUI), SQLLine (CLI), Apps (programming), Web SQL Client; Pig
• Smart SQL Engine; Unified Schema (unified metadata); HugeTable Data Model (data modeling)
• Execution: HBase, MapReduce, Hive, Impala (MPP)
• Storage on HDFS: HFile (SSTables), TextFile (records), SequenceFile (key-value rows), RCFile/ORCFile (columnar), Parquet (ColumnIO), user-defined formats ...

15

Big Data Warehouse: HugeTable -> Horizon

[Figure: Horizon layers, bottom to top:]
• DFS (Hadoop HDFS)
• Bigtable (HBase)
• Data Model (Data Organization, Indexing, Partitioning, Encoding, Compressing, ...)
• Data Warehouse Utilities / Tools (SpeedLoader, SpeedScan, Data LifeCycle, ...)
• SQL Engine (Standard, Familiar, Low Learning Curve, ...)
• JDBC and ODBC, REST API
• Alongside: Management, and Connectors integrating into the Hadoop ecosystem (MapReduce, Hive, Pig, Oozie, HCatalog, ...)

16

NoSQL vs. SQL

• NoSQL systems (Bigtable, Cassandra, etc.) are just the "storage engine layer" of a DBMS.

• Users like SQL and are already familiar with it as the way to work with their data.

[Figure: Horizon (Distributed SQL Engine over a Distributed Storage Engine: NoSQL, HBase) vs. MySQL Server (SQL Engine Layer over a Storage Engine Layer: MyISAM, InnoDB, etc.)]

How about building a distributed DBMS? Megastore, Greenplum/Pivotal/CitusDB, etc.

17

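Business Analysis Big Data Platform

Unified big data storage and analysis platform

[Figure: platform data flow. Stages and labels shown:]
• Big data sources (variety): BOSS billing and detail CDR data; network CDR data (Gn/Gb/IuPS ...); signaling data (Iub/Iucs/mmsc ...); log data (WAP, WLAN ...); CRM user profiles; DPI-collected data; other data
• Data loading and preprocessing: batch loading tools (Files, BulkLoad, etc.); real-time loading tools (Flume, Flive, etc.); database transfer tools (Sqoop, etc.); other engineering tools; ETL processing logic
• Data storage, organization, and processing platform: Hadoop HDFS as the base storage layer; HBase, Impala, MapReduce, Horizon; clients: Hive, Pig, Data Mining; SQL, Scripts, Java
• Data processing and access business functions: statistics, summarization, analysis, and reporting; ad-hoc queries; data mining; other OLAP workloads

Plan & Design
• Data storage model definition (Schema, Types, Indexes, StorageEngine, etc.)
• Data processing operations and workflow definition (SQL, Scripts, Java, WorkFlow, etc.)
• Development and porting are driven by the actual business data
• Offline interfaces generally need no modification

Principle: primarily offline batch analysis, while also supporting data query and management.

18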

Big Data Service Platform

[Figure: components shown:]
• Online generated data (CDR), loaded through Flive, BulkLoad, and ETL from files
• HugeTable: HugeTable Data Model on HBase and Hadoop; LifeCycle (On/Offline, DataDrop)
• SQL Engine Servers behind a Load Balancer (LVS, with HA)
• Access: JDBC for local deployment; Web Services with a RESTful interface for remote deployment
• Analysis: Hive/Pig and MapReduce through a Connector (with SpeedScan)

Principle: primarily real-time, low-latency data query, while also supporting data analysis.

19

Cluster Management: ClusterMaster

20

Cluster Management: ClusterMaster

21

Hadoop and Open Source Ecosystem

• Hive
o Faster SQL engine
o Support for more storage engines
o More UDFs for database functions (such as NVL and DECODE from Oracle)
o More UDFs for OLAP (such as Roll-Up, Cube, efficient aggregations, etc.)
o More algorithms for efficient statistics and estimation (such as a LogLog counter for estimated DISTINCT values)

• Pig
o Support for more data storages
o More UDFs for analysis, statistics, and data mining (such as K-Means, ID3 for decision trees, etc.)

• Tools
o Deployment: Hdeploy, HTCfg, ClusterMaster
o Management: integrate Ganglia, Nagios, Puppet, etc.
o Light and handy command line: Hman, etc.
o Benchmark tools: Hbench, etc.

• MapReduce
o Runtime job/task scheduling and latency
  - Worker pool
  - Transfer of job description information
  - …
o Processing engine improvements
  - Shuffle: sendfile, Netty server, batch fetch
  - Sort avoidance: spilling and partitioning, hash aggregation

• HBase (to be a data warehouse backend)
o Low-level HFile management
o Speed Bulk Load
o Speed Scan for analysis
o Flexible control of flush, compaction, split, and balance
o Coprocessors for parallel processing

• Flume
o Support for more data sources and data storages
o More flexible command-line tool

22

Know the Details of Hadoop …

23

MapReduce Runtime Optimization

• Job/Task Schedule & Latency

• Worker Pool

[Figure: the MapReduce Client submits jobs to the JobTracker; each TaskTracker keeps a Worker Pool of Child Workers; the job description is passed by RPC (JobConf).]

[Chart: Job Latency (in seconds, lower is better); Total Tasks (96 maps, 4 reduces)]

CDH3u2 (Cloudera), reuse.jvm disabled: 43
CDH3u2 (Cloudera), reuse.jvm enabled: 24
HDH3u2 (Hanborq): 1

24

MapReduce Processing Engine Optimization

• Shuffle: use sendfile to reduce data copies and context switches.

• Shuffle: Netty shuffle server (map side) and batch fetch (reduce side).

• Sort avoidance.
o Spilling and partitioning, counting sort, bytes merge, early reduce, etc.

• Hash aggregation in the job implementation.
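The idea behind sort avoidance and hash aggregation is that an aggregation only needs equal keys to meet, not a total order over all records. The sketch below contrasts the two approaches on plain Python lists. It illustrates the technique only, not Hanborq's MapReduce changes, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_aggregate(pairs):
    """Classic MapReduce style: sort all records by key, then sum each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))          # O(n log n) over every record
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


def hash_aggregate(pairs):
    """Sort avoidance: keep one running total per key in a hash table, in a single pass."""
    totals = {}
    for key, value in pairs:                            # O(n), memory grows with distinct keys
        totals[key] = totals.get(key, 0) + value
    return totals


records = [("sms", 1), ("voice", 3), ("sms", 2), ("data", 5), ("voice", 1)]
by_sort = sort_aggregate(records)
by_hash = hash_aggregate(records)
```

Both return the same totals; the hash version skips the sort, but it only works when the per-key state fits in memory or can be spilled by partition.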

Sort Avoidance and Aggregation (time in seconds, lower is better)

                     Case1   Case2   Case3
CDH3u2 (Cloudera)    197     216     2186
HDH (Hanborq)        175     198     615

Real Aggregation Jobs (time in seconds, lower is better)

                     Case1-1   Case2-1   Case1-2   Case2-2
CDH3u2 (Cloudera)    238       603       136       206
HDH (Hanborq)        233       578       96        151

25

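China Mobile BigCloud

Since 2008, we have worked with the China Mobile Research Institute to define, design, and develop the "Big Cloud" 1.0 architecture and product series, and have now completed the R&D work for "Big Cloud" 2.0.

We have supported broad deployment of the Big Cloud system at China Mobile and at users in other industries, providing software and hardware solutions and services. For the cloud storage and data warehouse products and services, a single data center deployment has exceeded 2,000 nodes and manages more than 20 PB of storage capacity. These systems provide storage, analysis, and retrieval services for telecom call detail records, logs, signaling, documents, video, images, and web page data.

• BC-HugeTable (massive structured data management system)
o Big data warehouse (analysis and query)
o Big database (analysis and query)
• BC-Hadoop (massive data storage and analysis platform)
o Research Institute distribution
o Hanborq distribution (HDH)
• BC-oNest (distributed object storage system)
• BC-NAS (distributed file system middleware)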

26

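CDR Billing-Detail Warehouse and Query

[Chart: record volume per month (in units of 100 million records), 200906 to 200912; y-axis 0 to 450]

[Chart: query volume per month (number of queries), 200906 to 200912; y-axis 0 to 8,000,000]

[Figure: system topology. Telecom operator network (mobile core network, network switching equipment, terminals), BSS and OSS servers; collection equipment feeding real-time and batch time-series data into the HB-CDW cluster (storage, indexing, analysis); RDBMS and web servers serving queries and reports to PC browsers and smartphones over the Internet; a cluster monitoring and management server reachable from PC browsers and smartphones over the intranet; analysis reports.]

- CDRs become available in real time, with a delay of < 1 minute
- Query latency < 3 seconds (average < 0.5 seconds)
- Query throughput: 200 million queries per month, 1,000 per second at peak
- Data safety: data is replicated on 3 nodes
- Data analysis: KPI reports generated daily or monthly

User base: about 100 million users

CDR data volume

- Per month: 50 billion records, 20 TB (more than 20,000 records per second)

- Six months retained in total: 300 billion records, 120 TB

- Mobile Internet service records are more than 5 times the volume of ordinary service CDRs

Data storage and processing cluster

- 32 DELL PE C2100 servers

- Each with 12 x 1 TB data disks and 64 GB of memory

Solution designed: 2009-10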
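The sizing figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not the original capacity plan. It assumes a 30-day month, decimal terabytes, and that the 20 TB per month is the logical data size before 3-way replication; none of these are stated on the slide.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600      # assumption: 30-day month

# Figures taken from the slide
records_per_month = 50_000_000_000      # 50 billion CDRs per month
bytes_per_month = 20 * 10**12           # 20 TB per month (decimal TB assumed)
queries_per_month = 200_000_000         # 200 million queries per month
peak_queries_per_s = 1000
retention_months = 6
replicas = 3
servers, disks_per_server, tb_per_disk = 32, 12, 1

avg_ingest_per_s = records_per_month / SECONDS_PER_MONTH
avg_record_bytes = bytes_per_month / records_per_month
avg_queries_per_s = queries_per_month / SECONDS_PER_MONTH
retained_tb = retention_months * bytes_per_month / 10**12
replicated_tb = retained_tb * replicas   # assumption: the 20 TB is pre-replication
raw_disk_tb = servers * disks_per_server * tb_per_disk

print(f"average ingest: {avg_ingest_per_s:,.0f} records/s")
print(f"average record size: {avg_record_bytes:.0f} bytes")
print(f"average query rate: {avg_queries_per_s:.0f}/s vs. peak {peak_queries_per_s}/s")
print(f"six months of data: {retained_tb:.0f} TB, {replicated_tb:.0f} TB with {replicas} replicas")
print(f"raw disk in the cluster: {raw_disk_tb} TB")
```

Under these assumptions the average ingest rate comes out close to the slide's figure of about 20,000 records per second, and three replicas of six months of data fit within the raw disk of the 32 servers with little headroom, which suggests encoding and compression matter for this design.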

27

China Mobile: Business Analysis ETL

28

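[Figure: ETL data flow for WAP logs. Components and annotations:]
• Sources: Huawei WAP log servers #1 and #2 (FTP servers) produce WAP log files: 400 GB per day in about 46,000 small files.
• The big data platform's interface machine (FTP server) is the overall external data interface (input/output) of the high-performance, high-concurrency, large-storage platform. It pulls files every hour.
• Big data platform (Hadoop/Hive/Pig/HugeTable), running on Hadoop nodes.
• Hourly ETL: a Pig script runs periodically on the interface machine and drives MapReduce jobs that read data from the interface machine in parallel, convert formats, encode, compress, and cleanse the data, and write it to HDFS as SequenceFiles. This saves storage space, makes later processing more efficient, and makes it easy to add new ETL functions.
• Dimension tables: number-segment dimension data is updated daily; the user-information dimension table is updated monthly.
• Daily summary jobs (Hive SQL) output intermediate, fine-grained summary data (180 GB per month), which is kept in HDFS for 31 days until the monthly summary.
• Normalization to the first-level business analysis ("一经") specification (Pig/Scripts): 5 GB of normalized data is written to the interface machine every day.
• Monthly summary jobs (Hive SQL) aggregate the 31 days of daily summaries; the normalized result is written to the interface machine every month.
• Downstream: the AsiaInfo-Linkage system, through a firewall, fetches the previous day's summary on a daily schedule and the previous month's summary on a monthly schedule. The data must conform to the first-level business analysis specification.
• WorkFlow/Pipeline controller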

29

Lessons Learned

Many lessons and many feelings.

30

1. Right Design Comes from Basic Knowledge of Computer Systems / Computer Science

• Computer architecture and how computers work
o Representing and manipulating information and programs
o Processor architecture (pipeline, parallel …)
o Storage architecture
o IO system, etc.
• Memory/storage hierarchy
• Modern operating systems
• Networking
• Languages …
• The core issues of databases
• File systems …
• And now, everything has to be distributed.

31

Basic Knowledge of CS

All database and big data processing solutions build on the characteristics of computer architecture, especially disks and networks ...

- Sequential vs. random access …
- Long latency of disk seeks …
- Throughput
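A rough model shows why sequential access dominates storage design on spinning disks. The sketch below is a simplified estimate, not a benchmark: the 10 ms seek time and 100 MB/s sequential throughput are illustrative assumptions of the right order of magnitude for hard disks, and the function name is hypothetical.

```python
SEEK_MS = 10.0          # assumed average seek plus rotational delay per random request
SEQ_MB_PER_S = 100.0    # assumed sustained sequential throughput


def read_time_seconds(total_bytes: int, request_bytes: int) -> float:
    """Estimated time to read total_bytes in requests of request_bytes, paying one seek per request."""
    requests = -(-total_bytes // request_bytes)            # ceiling division
    seek_s = requests * SEEK_MS / 1000.0
    transfer_s = total_bytes / (SEQ_MB_PER_S * 1_000_000)
    return seek_s + transfer_s


ONE_GIB = 1 << 30
random_4k = read_time_seconds(ONE_GIB, 4 * 1024)    # 1 GiB as scattered 4 KiB reads
sequential = read_time_seconds(ONE_GIB, ONE_GIB)    # 1 GiB as one sequential scan

print(f"random 4 KiB reads: {random_4k:8.1f} s")
print(f"one sequential scan: {sequential:8.1f} s")
```

With these assumed numbers the random pattern is more than two orders of magnitude slower, because almost all of its time goes to seeks rather than to transferring data.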

32


by Jeff Dean

33


• What every data engineer needs to know about disks

• Basic Algorithms (Sorting, Searching, Strings, Bitmap, …)

• Linux Virtual Memory, Exceptions, Concurrency, etc.

• …

34

2. Keep Simple and Straightforward

• Master-Slave vs. Decentralized (DHT, Consistent Hash)

• Almost all Google products follow the master-slave pattern (GFS, Bigtable, MapReduce, etc.), as does ZooKeeper.
o MapReduce: Simplified Data Processing on Large Clusters
  - A simple programming model that applies to many large-scale computing problems
  - Hides messy details
o Bigtable provides a simple data model: a distributed B+ tree …

• Shards and Replicas

• Simple and clean API design

35

Keep Simple and Straightforward

• Example: Bigtable vs. Cassandra

[Figure: Bigtable vs. Cassandra. Bigtable side: a Master coordinates Tablet Servers, which serve Tablets stored on GFS.]

36

Keep Simple and Straightforward

Bigtable (++)

• Master – Tablet Servers

• Dynamic Tablet Splits

• WAL + MemTable + SSTable

• Three-Level Distributed B+ Tree

• Replication in GFS

• …

Cassandra (--)

• Identical Data Nodes, Gossip

• Consistent Hash, Virtual Nodes

• WAL + MemTable + SSTable

• Hinted Handoff

• DHT Ring (neighbor nodes)

• Eventual Consistency

• Read Repair

• Merkle Tree

• Vector Clock

• Anti-Entropy Protocol

• …

• Too complex: a flawed architecture makes the system more and more complex …

http://www.slideshare.net/schubertzhang/cassandra-dynamo-paper
http://www.slideshare.net/schubertzhang/dastorcassandra-report-for-cdr-solution

Bigtable's architecture and data model make more sense.
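Both systems share the same local storage engine pattern, WAL + MemTable + SSTable, and differ in how they distribute it. A minimal sketch of that shared write and read path is shown below. It is a toy model held entirely in memory, not Bigtable's or Cassandra's implementation; the class `TinyLsm`, the `memtable_limit` parameter, and the use of Python lists in place of on-disk files are assumptions for illustration.

```python
import bisect


class TinyLsm:
    """Toy WAL + MemTable + SSTable store (hypothetical, in-memory only)."""

    def __init__(self, memtable_limit=2):
        self.wal = []             # stand-in for an on-disk write-ahead log
        self.memtable = {}        # recent writes, mutable, in memory
        self.sstables = []        # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))     # 1. log first, so a crash can be replayed
        self.memtable[key] = value        # 2. then apply in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the MemTable out as one sorted, immutable run."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                     # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):          # the newest run wins
            i = bisect.bisect_left(run, (key,))      # binary search on the sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLsm(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)     # the MemTable is full: flushed as the run [("a", 2), ("b", 1)]
db.put("a", 3)     # a newer value for "a" lives in the MemTable
value_a = db.get("a")
value_b = db.get("b")
missing = db.get("z")
```

Writes are sequential (log append plus a sorted flush) and reads merge the MemTable with the runs, which is why compaction of old runs matters for read latency.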

37


3. There is no “one-size-fits-all” solution

• There are too many contradictory requirements in the structured data world.

• The contradiction of data processing:
o Real-time or near-real-time data availability.
o Batch processing for large volumes of data, such as aggregation.

• The contradiction of data access:
o Low-latency, fast query response, like lookup.
o High-latency ad-hoc analytic queries over historical data.

• But there is no one-size-fits-all answer to the contradictory requirements above.

• Identify common problems, and build systems to address them in a general way.

• “Important not to try to be all things to all people!” – Jeff Dean, Keynote at LADIS’09

38

There is no “one-size-fits-all” solution

• MapReduce

• Dremel (MPP)

• Tez/Stinger

• NoSQL/Bigtable (and with Coprocessor)

• DBMS

• …

Lambda Architecture: new data is sent to both layers, and queries merge views from both layers.
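The Lambda Architecture statement can be made concrete with a small counting example: every event goes to both an append-only master log (batch layer) and an incremental real-time view (speed layer), and a query adds the two. This is a single-process sketch under simplifying assumptions, not a production design; the class `LambdaCounts` and its method names are hypothetical.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda Architecture for per-key event counts (hypothetical names)."""

    def __init__(self):
        self.master_log = []              # batch layer input: immutable, append-only raw events
        self.batch_view = Counter()       # recomputed from scratch by each batch run
        self.realtime_view = Counter()    # speed layer: covers events since the last batch run

    def ingest(self, key):
        self.master_log.append(key)       # new data is sent to the batch layer ...
        self.realtime_view[key] += 1      # ... and to the speed layer

    def run_batch(self):
        """Recompute the batch view over the whole log, then discard the absorbed real-time state."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view = Counter()

    def query(self, key):
        return self.batch_view[key] + self.realtime_view[key]   # merge views from both layers


views = LambdaCounts()
for event in ["login", "order", "login"]:
    views.ingest(event)
before_batch = views.query("login")    # answered by the speed layer alone
views.run_batch()
views.ingest("login")
after_batch = views.query("login")     # batch view plus the newest real-time increment
```

The batch layer can be slow and simple because the speed layer covers the gap; the price is maintaining two code paths that must agree on the result.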

39

There is no “one-size-fits-all” solution

MapReduce Dremel Pregel

Hive Pig Java Impala GoldenOrb

SQL, Scripts, Java, etc.

Different query and analysis requests use different parallel execution engines to operate on the data.

40

4. Monitorable and Metrizable at any time

• Sufficient statistics and monitoring …

• Add sufficient monitoring, status, and debugging hooks

• If your system is slow or misbehaving, can you figure out why?

• Don't rely on logs too much; logging is costly and inefficient.

• Use real-time statistics and metrics.

• Use tools: jmxetric, JMX, Ganglia, Nagios, Noah …

41

Monitorable and Metrizable at any time

Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.

The magic matrix ??!

42

Monitorable and Metrizable at any time

Write/Insert Operation Benchmark

Read/Query Operation Benchmark

43


Read Throughput: average ~140 ops/s

Latency: average ~500ms, 97% < 2s (SLA)

Bottleneck: disk IO (random seek) (CPU load is very low)

[Chart: distribution of read latency. x-axis: latency in 100 ms buckets (1 to 61); y-axis: percentage of read ops (0% to 25%)]

SLA Metrics:

• Throughput

o tThrou :Total Throughput (operation count)

o dThrou : Delta Throughput (operation count)

• Latency

o tAvgLat: Total Average Latency (ms)

o dAvgLat: Delta Average Latency (ms)

o dMaxLat : Delta Maximum Latency (ms)

o dMinLat : Delta Minimum Latency (ms)

• Quantile %

• Total : from benchmark start to present.

• Delta: between each statistical interval (2 minutes here)
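The total and delta metrics above are cheap to maintain in-process, which is the point of preferring real-time metrics over log parsing. A minimal sketch is shown below. It covers only the throughput and latency fields named on this slide (quantiles are omitted), and the class `SlaMetrics` and its methods are hypothetical names, not the benchmark tool's code.

```python
class SlaMetrics:
    """Tracks total (since start) and delta (since the last report) operation metrics."""

    def __init__(self):
        self.total_count = 0
        self.total_latency_ms = 0.0
        self.delta_latencies = []          # latencies recorded in the current interval

    def record(self, latency_ms: float):
        self.total_count += 1
        self.total_latency_ms += latency_ms
        self.delta_latencies.append(latency_ms)

    def report(self) -> dict:
        """Return one snapshot and start a new delta interval (e.g. call every 2 minutes)."""
        d = self.delta_latencies
        snapshot = {
            "tThrou": self.total_count,
            "dThrou": len(d),
            "tAvgLat": self.total_latency_ms / self.total_count if self.total_count else 0.0,
            "dAvgLat": sum(d) / len(d) if d else 0.0,
            "dMaxLat": max(d) if d else 0.0,
            "dMinLat": min(d) if d else 0.0,
        }
        self.delta_latencies = []
        return snapshot


metrics = SlaMetrics()
metrics.record(100.0)
metrics.record(300.0)
first_interval = metrics.report()
metrics.record(500.0)
second_interval = metrics.report()
```

Comparing the delta values with the totals from interval to interval is what makes a sudden slowdown visible while the benchmark is still running.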

44


45

5. Try to make data in-situ

• The ability to access data ‘in place’.

• ProtocolBuffers/Parquet encoding

• Example:
o Horizon over HDFS + HBase

[Figure: a Real-Time Data Service exposes a real-time API on HBase: writes (Puts) and reads (Get/Scan), with Coprocessor, Schema, and Meta. HBase flush and compaction produce HFiles on HDFS. Bulk Load (batch input) writes HFiles, and MapReduce/Impala (batch processing) works on the HFiles in HDFS.]

46

6. Approximated vs. Precise

• For large data sets, it can be prohibitively expensive to find the precise result, but there are efficient estimating methods.

• Example queries:
o How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?
o What are the most frequent elements (the terms "heavy hitters" and "top-k elements" are also used)?
o What are the frequencies of the most frequent elements?
o How many elements belong to the specified range (range query; in SQL it looks like SELECT count(v) WHERE v >= c1 AND v < c2)?
o Does the data set contain a particular element (membership query)?
o …

47

Approximated vs. Precise

• The algorithms are approximate: with high probability, they return a result close to the correct one (e.g. within ±2%).

• select count(distinct userid) from userlogs;

• select top(100) of count(*) from orders group by itemname;

• …

• Statistical and Probabilistic Analysis, Very interesting!
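A query like count(distinct userid) can be estimated with a fixed-size bitmap instead of a set of all values. The sketch below uses Linear Counting: hash each value to one of m bits, then estimate the cardinality as -m * ln(V), where V is the fraction of bits still zero. It is an illustration of the technique only; the function name, the choice of SHA-1 as the hash, and the bitmap size are assumptions.

```python
import hashlib
import math


def linear_count(items, m: int = 1 << 16) -> float:
    """Estimate the number of distinct items using an m-bit bitmap (m must be a multiple of 8)."""
    bitmap = bytearray(m // 8)
    for item in items:
        digest = hashlib.sha1(str(item).encode()).digest()
        pos = int.from_bytes(digest[:8], "big") % m
        bitmap[pos >> 3] |= 1 << (pos & 7)               # set bit number pos
    zeros = m - sum(bin(byte).count("1") for byte in bitmap)
    if zeros == 0:
        raise ValueError("bitmap is full; use a larger m")
    return -m * math.log(zeros / m)


# 50,000 log rows that contain 10,000 distinct user ids
user_ids = [f"user{i % 10_000}" for i in range(50_000)]
estimate = linear_count(user_ids)
print(f"estimated distinct users: {estimate:.0f}")
```

The bitmap here is 8 KB regardless of how many rows are scanned; accuracy degrades as the bitmap fills up, which is where LogLog-style counters take over.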

48

Approximated vs. Precise

• Usually Sample/Hash/Bitmap …

• Cardinality Estimation
o Linear Counting
o LogLog Counting …

• Frequency Estimation / Heavy Hitters
o Count-Min Sketch
o Count-Mean-Min Sketch
o Stream-Summary …

• Range Query
o Array of Count-Min Sketches …

• Membership Query
o Bloom Filter

• …
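Two of the structures listed above fit in a few lines each. The sketch below shows a Count-Min Sketch for frequency estimation and a Bloom filter for membership queries. These are minimal textbook versions with arbitrarily chosen sizes; the helper `_hash`, which derives several hash functions from SHA-256 with a seed prefix, is an assumption made for the example.

```python
import hashlib


def _hash(item, seed: int, modulus: int) -> int:
    """One of several hash functions, selected by seed (hypothetical helper)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class CountMinSketch:
    """Frequency estimates that may overcount on collisions but never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count: int = 1):
        for seed, row in enumerate(self.rows):
            row[_hash(item, seed, self.width)] += count

    def estimate(self, item) -> int:
        # The row with the fewest collisions gives the tightest upper bound.
        return min(row[_hash(item, seed, self.width)]
                   for seed, row in enumerate(self.rows))


class BloomFilter:
    """Membership test with possible false positives but no false negatives."""

    def __init__(self, bits: int = 8192, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)       # one byte per bit, for clarity rather than compactness

    def add(self, item):
        for seed in range(self.hashes):
            self.array[_hash(item, seed, self.bits)] = 1

    def __contains__(self, item) -> bool:
        return all(self.array[_hash(item, seed, self.bits)]
                   for seed in range(self.hashes))


orders = ["tea"] * 100 + ["coffee"] * 30 + ["milk"] * 5
sketch, seen = CountMinSketch(), BloomFilter()
for item in orders:
    sketch.add(item)
    seen.add(item)
tea_estimate = sketch.estimate("tea")
```

Both structures use fixed memory chosen up front; the error they trade for that is one-sided, which makes the results easy to reason about.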

49

http://highlyscalable.files.wordpress.com/2012/04/probabilistic-sizes.png


7. Open Source and Open Spirit

• Choose your building blocks from an engineering point of view
o Know your basic building blocks: not just their interfaces, but also their implementations (at least at a high level)

• Make good use of open source, contribute back to it, and make it better and stronger

50

8. And more …

• Description and Documents

• Avoid inventing new Interface for Users

• From simple to complete, from prototype to product
o Make the architecture robust, try it, and then improve and complete it.

• Product vs. Tech. vs. Trick

• …

51

9. Read Books – Read English Books

52

Thank You!

53

Find me outside

• SlideShare:
o http://www.slideshare.net/schubertzhang
o http://www.slideshare.net/hanborq

• GitHub:
o https://github.com/schubertzhang
o https://github.com/hanborq

• LinkedIn: http://cn.linkedin.com/pub/schubert-zhang/6/b51/b5b/

• Blog: http://cloudepr.blogspot.com

• Facebook: https://www.facebook.com/schubertzhang

• Email & Gtalk: [email protected]

• Weibo: @schubertzh

• WeChat: schubertzh

54

