Engineering practices in big data storage and processing. A summary of that experience.
Engineering Practices in Big Data Storage and Processing
Nov.20, 2013
Schubert (Songbo) Zhang
About me
• 张松波 (Schubert Zhang)
• Background
  • Senior Engineer, Tech Lead and Architect, Infrastructure Data Team, @Baidu
  • VP Engineering, Cloud & Big Data R&D, @Hanborq
  • Senior Engineering Manager, @UTStarcom
  • 10 years of Telecom, 5 years of Cloud Storage & Big Data, 1 year of Internet
2
Categories of (Big) Data
• Rows / Records
  • Logs
  • User Profiles
  • Shopping Orders
  • …
• Presentation
  • Tables with Schema
  • Data Types
  • Database, Data-Warehouse
• A mess -> organizing, indexing -> fast to retrieve …
• Batch and sequential processing …
• Files / Objects
  • Documents
  • Photos
  • Videos
  • …
• Presentation
  • Files in File-System
  • Objects in Object-Storage-System
  • With metadata …
• Organizing, indexing -> fast to retrieve …
• Batch and sequential processing …
3
Over the common underlayer storage and IO system: Hardware, Disk, Network …
Products and Engineering Projects
Object Storage, Data Warehouse, Cluster Management, etc.
For enterprise!
4
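Product Lines
Hanborq Products
Big Data Engineering (Big Data)
The HB-CDW product line is a big data warehouse system, built on cloud computing technology, for storing, querying, analyzing, and mining big data at PB scale. Its core products include a big data warehouse based on the Hadoop ecosystem and HugeTable, a management system for massive structured data. Built on Hadoop, HBase, Hive, Pig, and other big data infrastructure software as enhanced and extended by Hanborq, it implements its own data model and system architecture together with standard SQL/API. It provides fast loading and real-time indexed queries for big data, plus deep statistics, analysis, and mining based on parallel computing technologies such as MapReduce and MPP. The system offers flexible scalability, security, and reliability. It is widely used in big data industries such as telecom, electric power, transportation, and large Internet companies.
Cloud Storage
The HB-CSS product line provides cloud storage solutions and services for enterprises and individuals. It includes oNest, a scalable, secure, and fast cloud object storage system with a service-layer API and user experience similar to Amazon AWS S3. On top of oNest, it offers a Storage Gateway for connecting enterprises and individuals to cloud storage services, and a Dropbox-like online cloud storage service (uDrop/eDrop). It is widely used in large Internet, education, telecom, media, and transportation sectors.
Management
HB-ClusterMaster is a software system for large-scale data center cluster planning, automated installation and deployment of operating systems and applications, configuration management, monitoring, and operations and maintenance. It enables efficient deployment and operation of large cloud computing clusters. The largest single system deployed and managed so far exceeds 2,000 physical server nodes.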
5
• Web Service and API
  • Amazon AWS S3 RESTful API
  • S3 Data Model (User -> Buckets -> Objects)
• Backend Distributed Object Storage System
  • Google GFS + Facebook Haystack
  • Triple copy of data trunks
  • Write-through, strong consistency
  • Append only and compaction
  • Highly efficient local index
  • …
• Backend Distributed Metadata Layer
  • Flexible data model
  • NoSQL
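The "append only and compaction" plus "local index" idea can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions, not oNest code: the class name `AppendOnlyTrunk`, its methods, and the use of a `bytearray` in place of an on-disk trunk file are all hypothetical.

```python
class AppendOnlyTrunk:
    """Toy append-only trunk with a local index (hypothetical, not oNest code)."""

    def __init__(self):
        self.data = bytearray()   # stands in for a large on-disk trunk file
        self.index = {}           # key -> (offset, length) of the live version

    def put(self, key, value: bytes):
        offset = len(self.data)
        self.data += value        # never overwrite in place, only append
        self.index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        return bytes(self.data[offset:offset + length])

    def delete(self, key):
        self.index.pop(key)       # space is reclaimed later by compaction

    def compact(self):
        """Copy only live objects into a fresh trunk and rebuild the index."""
        new_data, new_index = bytearray(), {}
        for key, (offset, length) in self.index.items():
            new_index[key] = (len(new_data), length)
            new_data += self.data[offset:offset + length]
        self.data, self.index = new_data, new_index
```

Writes stay sequential because nothing is updated in place; overwritten and deleted objects leave dead bytes behind until `compact()` rewrites the trunk, and a read needs only one index lookup and one contiguous read.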
6
Cloud Object Storage System : oNest
Web Service(RESTful API over HTTP)
Metadata Layer
Object/Trunk Storage Layer
SDK
(C++/Java/Python/PHP/Go…)
Cloud Object Storage System : oNest
7
[Diagram: Rock & Chunks data model and data organization]
• Logical view: User -> Buckets (Bucket1 … Bucket4) -> Objects; an Object (Pebble) is made of Parts
• Physical view: Rocks, each holding Chunks; a Part is stored as Chunks
Cloud Object Storage System : RockStor-> oNest
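[Diagram: RockStor, a distributed cloud object storage system]
• Layers: interface layer, service layer, storage layer
• Clients: application systems 1 … N and a resource management platform, through HTTP interfaces and management interfaces; SDK (Java) for developers
• RockStor Service Load Balancers (request load balancing, multi-point deployment, LVS)
• Web services: RESTful API (Cloud Service), RESTful API (Internal), AAA, CAS
  • Object functions: object access, object attributes
  • Container functions: container access, container attributes
  • User functions: authentication, authorization
• RockMaster and RockServer nodes
• Distributed storage cluster, Hadoop (stores and manages Rock files)
• Distributed database cluster, HBase (stores and manages metadata)
• System management (Management Console): metering information, user control, log management, statistical reports, operations and maintenance
8
Fast and simple prototype. Leverage open source. Aim to be a product and service.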
Cloud Object Storage System : oNest
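• The oNest cloud object storage platform stores data as objects and provides cloud storage services of up to hundreds of PB for Internet services and enterprise users.
• Main features of the oNest cloud object storage service:
(1) Highly reliable multi-replica data storage, with automatic repair of data replicas in a dynamic environment
(2) Large-scale storage (capacity at the 100 PB level and above), with linear scaling of object count and capacity
(3) Data backup within one data center and across data centers
(4) Large-scale concurrent access
(5) Secure data access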
[Diagram: oNest deployment across two data centers (Data Center A and Data Center B) in one Region]
• Per data center (AZ): OAS Cluster, Healer Cluster, DataStorage Cluster (DataNodes), MetaNode Cluster (Master/Slave)
• Also shown: ClusterMaster, Discovery Service Cluster, Console Web Servers, AAA Clusters (Master/Slave, with AAA Web Services), Stats Cluster (StatsMaster/StatsSlave), Proxies
9
To be a more complete product and service.
Cloud Object Storage System : oNest
10
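[Screenshot: oNest web console]
• Toolbar: user name, create Bucket, new directory, upload object, refresh list, view properties, operation history
• Right-click menu
• Object list and Bucket list
• Basic object attributes; click through to the detailed attributes, including the object download URL
• Click through to ACL permission management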
Cloud Object Storage System : oNest
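[Diagram: oNest in an education cloud]
• Users of education cloud applications
• Education cloud application services (Education Cloud App-1, Education Cloud App-2), which reach oNest through the SDK and REST
• BC-oNest cloud object storage service: registration, login, Console
oNest provides a unified, standard cloud storage interface through which education cloud applications can store, read, or operate on data objects.
oNest is an elastic cloud object storage system, comparable to Amazon AWS S3. It stores video, audio, images, documents, and other data for the education cloud.
Education cloud applications are the users of oNest cloud storage.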
11
Dropbox-Like NetDisk Service: uDrop / eDrop
• Hack Dropbox
  • Keep-alive mechanism
  • Delta update
  • Mechanism of shared file blocks
  • Dropbox client database: SQLite
• Data/file splitting and fingerprinting
• Incremental (delta) upload algorithm
• The so-called "instant upload"
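The splitting, fingerprinting, and "instant upload" ideas can be sketched as follows. This is an illustration under stated assumptions, not the uDrop/eDrop or Dropbox implementation: fixed-size chunks, SHA-256 fingerprints, and the names `fingerprints` and `ChunkStore` are all hypothetical choices.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; a real system would use much larger blocks


def fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and return (sha256 hex digest, chunk) pairs."""
    return [
        (hashlib.sha256(data[i:i + chunk_size]).hexdigest(), data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


class ChunkStore:
    """Hypothetical server-side store keyed by chunk fingerprint."""

    def __init__(self):
        self.chunks = {}

    def upload(self, data: bytes, chunk_size: int = CHUNK_SIZE):
        """Upload a file; return (manifest, number of chunks actually transferred)."""
        manifest, sent = [], 0
        for digest, chunk in fingerprints(data, chunk_size):
            if digest not in self.chunks:   # only unknown chunks cross the network
                self.chunks[digest] = chunk
                sent += 1
            manifest.append(digest)
        return manifest, sent

    def download(self, manifest) -> bytes:
        return b"".join(self.chunks[d] for d in manifest)
```

Uploading a file whose chunks are all already known transfers nothing, which is the "instant upload" effect; changing one chunk of a file transfers only that chunk, which is the delta update.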
67.228.78.114
67.228.78.116
67.228.78.117
...
208.43.202.5
...
75.101.145.128
75.101.138.84
...
Client
Dropbox Web Server
Amazon S3 & EC2
Softlayer Datacenter
login (https)
download and upload data (https)
keep alive (http)
list, delete, rename and sync (https)
12
Dropbox-Like NetDisk Service: uDrop / eDrop
MetaServer
MetaServer
MetaServer
MetaServer
oNest
PC Client
MobileClient
Browser
MetaAPI DataAPI
REST AccessServer
MetaAPI DataAPI
REST AccessServer
MetaAPI DataAPI
Web Server
ZooKeeper HBase
Register
Matcher
13
Big Data Platform
[Diagram: Big Data Platform stack]
• Users, Applications: SQL / Scripts / Java / Web
• Smart SQL and Execution Engine; Hive, Pig; HCatalog; Oozie; Data Mining; ETL
• MapReduce / Impala
• HugeTable over HBase (Bigtable)
• HDFS (files) on a shared cluster of servers
• Ingest from Big Data Sources: BulkLoad, Flume/Flive
• Operations: Ganglia, Nagios, Backup, ClusterMaster (Deployment)
14
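Big Data Warehouse: HugeTable -> Horizon
• Uses HDFS as the base storage platform; supports multiple, extensible storage formats
  • HBase/HFile
  • Row storage: TextFile, SequenceFile
  • Columnar storage: RCFile/ORCFile, Parquet, …
• Multiple data access models
  • HBase
  • MapReduce
  • MPP: Impala
• HugeTable's own data storage model
  • Encoding/Decoding
  • Indexing
  • Partitioning
  • …
• Unified Data Schema Metadata management
• Smart SQL Engine and Server
  • High performance, high concurrency, high stability, distributed
  • Chooses among the different data access model paths
• Compatible with Hive and Pig
• Standardized JDBC client interface and client tools
• Engineering utilities
  • Fast bulk loading (BulkLoad) and export (with a SQL interface)
  • Fast deployment tool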
Smart SQL Engine
Unified Schema (unified metadata)
HBase
HugeTable Data Model (data modeling)
MapReduce
Hive
Impala(MPP)
HDFS
HFile(SSTables)
TextFile(Recorded)
SequenceFile(Key-Value Rows)
RCFile/ORCFile
(Columnar)
Parquet(ColumnIO)
User-Defined Formats ...
SQuirreL SQL Client(GUI)
JDBC Driver
SQLLine(CLI)
JDBC Driver
Apps(Programming)
JDBC Driver
Pig
Web SQL Client
JDBC Driver
15
Big Data Warehouse: HugeTable -> Horizon
DFS (Hadoop HDFS)
Bigtable (HBase)
Data Model(Data Organization, Indexing,
Partitioning, Encoding, Compressing, ...)
Data Warehouse Utilities / Tools(SpeedLoader, SpeedScan, Data
LifeCycle, ...)
SQL Engine(Standard, Familiar, Low Learning Curve, ...)
JDBC and ODBC REST API
MapReduce
Hive
Pig
Oozie
Management
Connectors
Integrating into Hadoop Ecosystem
HCatalog
...
16
NoSQL vs. SQL
• NoSQL, BigTable, Cassandra, etc., are just the “Storage Engine Layer” of DBMS.
• Users like SQL and are familiar with it as the way to access their data.
Horizon
Distributed SQL Engine
Distributed Storage Engine(NoSQL, HBase)
vs.
17
MySQL Server
SQL Engine Layer
Storage Engine Layer(MyISAM, InnoDB, etc.)
How about building a Distributed DBMS? Megastore, Greenplum/Pivotal/CitusDB, etc.
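Business Analysis Big Data Platform
Unified big data storage and analysis platform
[Diagram, left to right]
• Big data sources (varied): BOSS billing detail record (CDR) data; network CDR data (Gn/Gb/IuPS ...); signaling data (Iub/Iucs/mmsc ...); log data (WAP, WLAN ...); CRM user profiles; DPI-collected data; other data
• Data loading and preprocessing: batch loading tools (Files, BulkLoad, etc.); real-time loading tools (Flume, Flive, etc.); database migration tools (Sqoop, etc.); other engineering tools; ETL processing logic
• Data storage, organization and processing platform: Hadoop HDFS base storage layer; HBase; Impala; MapReduce; Horizon
• Data processing and access: Client (SQL, Scripts, Java); Hive; Pig; MapReduce; Data Mining
• Business functions: statistics, summaries, analysis and reports; ad-hoc queries; data mining; other OLAP workloads
• Plan & Design: data storage model definition (Schema, Types, Indexes, StorageEngine, etc.); data processing operations and workflow definition (SQL, Scripts, Java, WorkFlow, etc.)
• Notes: develop and port according to the actual business data; offline interfaces generally need no modification
Principle: primarily offline batch analysis, while also supporting data query and management.
18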
Big Data Service Platform
HugeTable Data Model
HBase, Hadoop
SQL Engine Server
SQL Engine Server
SQL Engine Server
Hive/PigMapReduce
Connector
(with SpeedScan)
Hive/PigMapReduce
Analysis
LifeCycle(On/Offline, DataDrop)
BulkLoadETLfile
file
HugeTable
JDBC for Local Deployment
Flive
Load Balancer(LVS, with HA)
Online Generated Data (CDR)
Web Service Web Service Web Service
RESTful for Remote Deployment
Principle: primarily real-time, low-latency data query, while also supporting data analysis.
19
Cluster Management: ClusterMaster
20
Cluster Management: ClusterMaster
21
Hadoop and Open Source Ecosystem
• Hive
  • Faster SQL Engine
  • Support more Storage Engines
  • More UDFs for database functions (such as NVL and DECODE from Oracle)
  • More UDFs for OLAP (such as Roll-Up, Cube, Efficient Aggregations, etc.)
  • More algorithms for efficient statistics and estimation (such as LogLog-Counter for estimated DISTINCT values)
• Pig
  • Support more Data Storages
  • More UDFs for analysis, statistics and data mining (such as K-Means, ID3 for Decision Tree, etc.)
• Tools
  • Deployment: Hdeploy, HTCfg, ClusterMaster
  • Management: Integrate Ganglia, Nagios, Puppet, etc.
  • Light and handy command line: Hman, etc.
  • Benchmark Tools: Hbench, etc.
• MapReduce
  • Runtime Job/Task Schedule & Latency
    • Worker Pool
    • Transfer Job description information
    • …
  • Processing Engine Improvements
    • Shuffle: sendfile, Netty Server, Batch Fetch
    • Sort Avoidance: Spilling and Partitioning, Hash Aggregation
• HBase (to be a Data Warehouse backend)
  • Low Level HFile management
  • Speed Bulk Load
  • Speed Scan for Analysis
  • Flexible control of Flush, Compaction, Split, Balance
  • Coprocessor for parallel processing
• Flume
  • Support more Data Sources and Data Storages
  • More flexible Command Line tool
22
Know the Details of Hadoop …
23
MapReduce Runtime Optimization
• Job/Task Schedule & Latency
• Worker Pool
[Diagram: MapReduce Client -> JobTracker -> TaskTrackers; each TaskTracker keeps a Worker Pool of Child Workers; the job description is passed by RPC (JobConf)]
[Chart: Job Latency (in seconds, lower is better); Total Tasks (96 maps, 4 reduces)]
CDH3u2 (Cloudera), reuse.jvm disabled: 43
CDH3u2 (Cloudera), reuse.jvm enabled: 24
HDH3u2 (Hanborq): 1
24
MapReduce Processing Engine Optimization
• Shuffle: Use sendfile to reduce data copies and context switches.
• Shuffle: Netty Shuffle Server (map side) and Batch Fetch (reduce side).
• Sort Avoidance.
  • Spilling and Partitioning, Counting Sort, Bytes Merge, Early Reduce, etc.
• Hash Aggregation in job implementation.
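Why sort avoidance helps aggregation jobs can be seen by comparing a sort-then-reduce sum with a hash-based sum over the same records. This is a minimal single-process sketch, not the HDH shuffle code; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(records):
    """Classic MapReduce style: sort by key, then reduce each run of equal keys."""
    out = {}
    by_key = sorted(records, key=itemgetter(0))          # O(n log n) sort
    for key, group in groupby(by_key, key=itemgetter(0)):
        out[key] = sum(value for _, value in group)
    return out


def hash_based_sum(records):
    """Hash aggregation: one pass and no sort; memory grows with distinct keys."""
    acc = defaultdict(int)
    for key, value in records:
        acc[key] += value
    return dict(acc)
```

Both functions return the same totals. The hash version does a single pass and never orders the keys, which is acceptable whenever the job only needs per-key aggregates rather than sorted output.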
Sort Avoidance and Aggregation (time in seconds, lower is better)
                    Case1  Case2  Case3
CDH3u2 (Cloudera)   197    216    2186
HDH (Hanborq)       175    198    615

Real Aggregation Jobs (time in seconds, lower is better)
                    Case1-1  Case2-1  Case1-2  Case2-2
CDH3u2 (Cloudera)   238      603      136      206
HDH (Hanborq)       233      578      96       151
25
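China Mobile BigCloud
Cooperation with the China Mobile Research Institute began in 2008 to define, design, and develop the "Big Cloud" 1.0 architecture and product series; R&D for "Big Cloud" 2.0 is now complete.
"Big Cloud" systems have been widely deployed at China Mobile and for users in other industries, with software and hardware solutions and services. For cloud storage and data warehouse products and services, a single data center deployment has exceeded 2,000 nodes and manages more than 20 PB of storage capacity. Storage, analysis, and retrieval services are provided for telecom detail records, logs, signaling, documents, video, images, and web page data.
BC-HugeTable (massive structured data management system)
  Big data warehouse (analysis and query)
  Big database (analysis and query)
BC-Hadoop (massive data storage and analysis platform)
  Research Institute distribution
  Hanborq distribution HDH
BC-oNest (distributed object storage system)
BC-NAS (distributed file system middleware)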
26
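CDR Billing Detail Record Warehouse and Query
[Chart: record volume (in units of 100 million records) by month, 200906 to 200912]
[Diagram: telecom operator network (mobile core network, network switching equipment, BSS, OSS servers, terminals) -> collection devices (real-time / batch, time-series) -> HB-CDW data storage and analysis server cluster (storage, indexing, analysis) -> RDBMS and web servers -> queries and reports over Internet / Intranet from PC browsers and smartphones; a cluster monitoring and management server, monitored from smartphones and PC browsers; analysis reports]
[Chart: query volume (number of queries) by month, 200906 to 200912]
- CDRs become available for query within 1 minute
- Query response latency < 3 seconds (average < 0.5 seconds)
- Query throughput: 200 million per month, 1,000 per second at busy hours
- Data safety: data is stored redundantly on 3 nodes
- Data analysis: KPI reports generated daily or monthly
User scale: about 100 million users
CDR detail record volume
- Per month: 50 billion records, 20 TB of data (more than 20,000 records per second)
- Total storage for 6 months: 300 billion records, 120 TB of data
- Mobile Internet service detail records are more than 5 times the volume of ordinary service CDRs
Data storage and processing cluster scale
- 32 DELL PE C2100 servers
- Each with 12 x 1 TB data disks and 64 GB of memory
Solution designed: 2009-10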
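The figures on this slide can be cross-checked with back-of-envelope arithmetic. The sketch below is not the author's sizing method. It assumes a 30-day month, decimal units (1 TB = 10^12 bytes), and that the 120 TB figure is counted before 3-way replication; none of these are stated on the slide.

```python
# All inputs below are the slide's figures; the derived values rest on the
# assumptions named in the text (30-day month, decimal TB, 120 TB pre-replication).
records_per_month = 50_000_000_000          # 50 billion records
bytes_per_month = 20 * 10**12               # 20 TB
seconds_per_month = 30 * 24 * 3600          # assumed 30-day month

avg_records_per_second = records_per_month / seconds_per_month
avg_record_bytes = bytes_per_month / records_per_month

retained_tb = 6 * 20                        # 6 months of data
replicated_tb = retained_tb * 3             # 3 redundant copies
raw_capacity_tb = 32 * 12 * 1               # 32 servers x 12 disks x 1 TB

utilization = replicated_tb / raw_capacity_tb

print(f"average ingest rate: {avg_records_per_second:,.0f} records/s")
print(f"average record size: {avg_record_bytes:.0f} bytes")
print(f"replicated data: {replicated_tb} TB of {raw_capacity_tb} TB raw")
```

Under these assumptions the average ingest rate is on the order of the slide's 20,000 records per second, and six months of triple-replicated data would occupy most of the raw disk capacity. If the 120 TB figure already includes replication or compression, the headroom is correspondingly larger.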
27
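China Mobile: Business Analysis ETL
28
[Diagram: WAP log ETL pipeline]
• Big data platform (Hadoop/Hive/Pig/HugeTable): a high-performance, high-concurrency, large-storage platform with a single external data interface (input/output)
• Sources: Huawei WAP log servers (FTP Server) #1 and #2, behind a firewall; WAP log files
• Big data platform interface machine (FTP Server): pulls files every hour; 400 GB per day, about 46,000 small files
• Dimension data from the 亚联 system: number-segment dimension table updated daily; user information dimension table updated monthly
• Downstream: the previous day's summary data is fetched on a daily schedule and the previous month's on a monthly schedule; data must conform to the first-level business analysis ("一经") specification
• Hadoop Nodes: every hour, a Pig script runs on the interface machine and drives MapReduce jobs that read data from the interface machine in parallel, perform format conversion, encoding, compression, and cleansing, and write SequenceFiles to HDFS. This saves storage space, speeds up later processing, and makes it easy to add new ETL functions.
• Daily summary jobs (Hive SQL): output intermediate (fine-grained) summary data, 180 GB per month, kept in HDFS for 31 days until the monthly summary
• Daily summary -> "一经" normalization (Pig/Scripts): 5 GB of normalized data output to the interface machine each day
• Monthly summary jobs (Hive SQL) over the 31 days of daily summaries -> "一经" normalization (Pig/Scripts): normalized data output to the interface machine each month
• WorkFlow/Pipeline controller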
29
Lessons Learned
Many lessons and many feelings.
30
1. Right Design Comes from Basic Knowledge of Computer System / Computer Science
• Computer Architecture and How Computers Work
  • Representing and Manipulating Information and Programs
  • Processor Architecture (Pipeline, Parallel …)
  • Storage Architecture
  • IO System, etc.
• Memory/Storage Hierarchy
• Modern Operating Systems
• Networking
• Languages …
• The core issues of databases.
• File-system …
• To be distributed now.
31
Basic Knowledge of CS
All database and big data processing solutions build on the characteristics of computer architecture, especially disk and network ...
- Sequential vs. Random Access …
- Long latency of Disk Seek …
- Throughput
32
Basic Knowledge of CS
by Jeff Dean
33
Basic Knowledge of CS
• What every data engineer needs to know about disks
• Basic Algorithms (Sorting, Searching, Strings, Bitmap, …)
• Linux Virtual Memory, Exceptions, Concurrency, etc.
• …
34
2. Keep Simple and Straightforward
• Master-Slave vs. Decentralized (DHT, Consistent Hash)
• Almost all Google products follow Master-Slave pattern. GFS/BigTable/MapReduce/ZooKeeper, etc.
• MapReduce: Simplified Data Processing on Large Clusters
  • A simple programming model that applies to many large-scale computing problems
  • Hide messy details
• Bigtable provides the simple data model, distributed B+ tree …
• Shards and Replicas
• Simple and clean API design
35
Keep Simple and Straightforward
• Example: Bigtable vs. Cassandra
MasterMaster
Tablet Server Tablet Server Tablet Server Tablet Server
GFS
Tablet
BigtableCassandra
36
Keep Simple and Straightforward
Bigtable (++)
• Master – Tablet Servers
• Dynamic Tablet Splits
• WAL + MemTable + SSTable
• Three Level Distributed B+Tree
• Replication in GFS
• …
Cassandra (--)
• Identical Data Nodes, Gossip
• Consistent Hash, Virtual Nodes
• WAL + MemTable + SSTable
• Hinted Handoff
• DHT Ring (neighbor nodes)
• Eventual consistency
• Read Repair
• Merkle Tree
• Vector Clock
• Anti-entropy protocol
• …
• Too complex: mistakes in the architecture make the system more and more complex …
http://www.slideshare.net/schubertzhang/cassandra-dynamo-paper
http://www.slideshare.net/schubertzhang/dastorcassandra-report-for-cdr-solution
Bigtable's architecture and data model make more sense.
37
3. There is no “one-size-fits-all” solution
• There are too many contradictory requirements in the structured data world.
• The contradiction of data processing:
  • Real-time or near-real-time data availability.
  • Batch processing for large amounts of data, such as aggregation.
• The contradiction of data access:
  • Low-latency fast query response, like Lookup.
  • High-latency ad-hoc analytic query for historical data.
• But, there is no one-size-fits-all answer for the above contradictory requirements.
• Identify common problems, and build systems to address them in a general way.
• “Important not to try to be all things to all people!” – Jeff Dean, Keynote at LADIS’09
38
There is no “one-size-fits-all” solution
• MapReduce
• Dremel (MPP)
• Tez/Stinger
• NoSQL/Bigtable (and with Coprocessor)
• DBMS
• …
Lambda Architecture: New data is sent to both layers, and queries merge views from both layers.
39
There is no “one-size-fits-all” solution
MapReduce Dremel Pregel
Hive Pig Java Impala GoldenOrb
SQL, Scripts, Java, etc.
Different query and analysis requests use different parallel execution engines to operate on the data.
40
4. Monitorable and Metrizable at any time
• Sufficient Statistic, Monitoring …
• Add Sufficient Monitoring/Status/Debugging Hooks
• If your system is slow or misbehaving, can you figure out why?
• Don’t rely on logs too much; logging is costly and inefficient.
• Use real-time statistics/metrics.
• Use tools, jmxetric, JMX, Ganglia, Nagios, Noah …
41
Monitorable and Metrizable at any time
Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.
The magic matrix ??!
42
Monitorable and Metrizable at any time
Write/Insert Operation Benchmark
Read/Query Operation Benchmark
43
Monitorable and Metrizable at any time
Read Throughput: average ~140 ops/s
Latency: average ~500ms, 97% < 2s (SLA)
Bottleneck: disk IO (random seek) (CPU load is very low)
[Chart: percentage of read ops (y-axis, 0% to 25%) by latency bucket (x-axis, buckets 1 to 61 in units of 100 ms)]
SLA Metrics:
• Throughput
o tThrou: Total Throughput (operation count)
o dThrou: Delta Throughput (operation count)
• Latency
o tAvgLat: Total Average Latency (ms)
o dAvgLat: Delta Average Latency (ms)
o dMaxLat: Delta Maximum Latency (ms)
o dMinLat: Delta Minimum Latency (ms)
• Quantile %
• Total : from benchmark start to present.
• Delta: between each statistical interval (2 minutes here)
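The total and delta metrics above can be kept with a small in-process accumulator. This is a minimal sketch, not the benchmark tool shown in the slides: the class `IntervalStats` and its methods are hypothetical, and only the metric names (`tThrou`, `dThrou`, `tAvgLat`, `dAvgLat`, `dMaxLat`, `dMinLat`) come from the slide.

```python
class IntervalStats:
    """Tracks cumulative (total) and per-interval (delta) throughput and latency."""

    def __init__(self):
        self.t_count = 0          # operations since benchmark start
        self.t_lat_sum = 0.0      # sum of latencies since benchmark start (ms)
        self._reset_delta()

    def _reset_delta(self):
        self.d_count = 0
        self.d_lat_sum = 0.0
        self.d_max = None
        self.d_min = None

    def record(self, latency_ms):
        """Account for one completed operation."""
        self.t_count += 1
        self.t_lat_sum += latency_ms
        self.d_count += 1
        self.d_lat_sum += latency_ms
        self.d_max = latency_ms if self.d_max is None else max(self.d_max, latency_ms)
        self.d_min = latency_ms if self.d_min is None else min(self.d_min, latency_ms)

    def snapshot(self):
        """Report metrics at the end of an interval, then start a new interval."""
        report = {
            "tThrou": self.t_count,
            "dThrou": self.d_count,
            "tAvgLat": self.t_lat_sum / self.t_count if self.t_count else 0.0,
            "dAvgLat": self.d_lat_sum / self.d_count if self.d_count else 0.0,
            "dMaxLat": self.d_max,
            "dMinLat": self.d_min,
        }
        self._reset_delta()
        return report
```

A timer would call `snapshot()` once per statistical interval (2 minutes in the slide). The totals smooth out over the whole run, while the deltas expose a slow interval immediately, without parsing logs.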
44
Monitorable and Metrizable at any time
45
5. Try to make data in-situ
• The ability to access data ‘in place’.
• ProtocolBuffers/Parquet encoding
• Example:
  • Horizon over HDFS + HBase
Real-Time API
HDFS (HFile)
HBase
Bulk Load(Batch Input)
Writes(Puts)
Reads(Get/Scan)
MapReduce/Impala
(Batch Processing)
Flush/Compaction
HFiles
Schema
Coprocessor
HFiles
Meta
Real-Time Data Service
46
6. Approximated vs. Precise
• For large data sets, it can be prohibitively expensive to find the precise result, but there are efficient estimating methods.
• Example Queries:
  • How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?
  • What are the most frequent elements (the terms “heavy hitters” and “top-k elements” are also used)?
  • What are the frequencies of the most frequent elements?
  • How many elements belong to the specified range (range query, in SQL it looks like SELECT count(v) WHERE v >= c1 AND v < c2)?
  • Does the data set contain a particular element (membership query)?
  • …
47
Approximated vs. Precise
• The algorithms are approximate: with high probability they return approximately the correct result (e.g. within ±2%).
• select count(distinct userid) from userlogs;
• select top(100) of count(*) from orders group by itemname;
• …
• Statistical and Probabilistic Analysis, Very interesting!
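A `count(distinct ...)` query like the one above can be estimated with Linear Counting, one of the simplest cardinality estimators. This is a minimal sketch, assuming SHA-256 as the hash function and a bitmap size `m` chosen by the caller; it is not how Hive or HugeTable implement it.

```python
import hashlib
import math


def linear_count(items, m=4096):
    """Estimate the number of distinct items with an m-slot bitmap (Linear Counting)."""
    bitmap = bytearray(m)                     # one byte per bit, for clarity
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        bitmap[int.from_bytes(digest[:8], "big") % m] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap saturated; use a larger m")
    # With n distinct items, the expected fraction of empty slots is about
    # exp(-n / m), so n is estimated as -m * ln(zeros / m).
    return -m * math.log(zeros / m)
```

Duplicates set the same slot, so memory depends on `m` rather than on the number of rows scanned. The estimate degrades as the bitmap fills, which is why estimators such as LogLog counting are preferred for very large cardinalities.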
48
Approximated vs. Precise
• Usually Sample/Hash/Bitmap …
• Cardinality Estimation
  • Linear Counting
  • LogLog Counting …
• Frequency Estimation / Heavy Hitters
  • Count-Min Sketch
  • Count-Mean-Min Sketch
  • Stream-Summary …
• Range Query
  • Array of Count-Min Sketches …
• Membership Query
  • Bloom Filter
• …
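As one example from the list above, a Count-Min Sketch estimates item frequencies in fixed memory. This is a minimal sketch, assuming SHA-256 with a per-row prefix as the hash family; the `width` and `depth` defaults are illustrative, not tuned values.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator; it may overestimate but never underestimates."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hash per row, derived by prefixing the row number to the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows is the
        # tightest available upper bound on the true count.
        return min(self.rows[row][col] for row, col in self._buckets(item))
```

Heavy hitters can be tracked by keeping a small candidate set alongside the sketch and querying `estimate()` for each candidate. A wider table reduces the overestimate, and more rows reduce the chance that every row collides.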
49
7. Open Source and Open Spirit
• Choose your Building Blocks from an Engineering view
  • Know Your Basic Building Blocks: not just their interfaces, but understand their implementations (at least at a high level)
• Make good use of open source, give back to open source, and make open source better and stronger
50
8. And more …
• Description and Documents
• Avoid inventing new Interface for Users
• From simple to complete, From prototype to product
  • Make the architecture robust, try it, and then improve and complete it.
• Product vs. Tech. vs. Trick
• …
51
9. Read Books – Read English Books
52
Thank You!
53
Find me outside
• SlideShare: http://www.slideshare.net/schubertzhang http://www.slideshare.net/hanborq
• Github: https://github.com/schubertzhang https://github.com/hanborq
• LinkedIn: http://cn.linkedin.com/pub/schubert-zhang/6/b51/b5b/
• Blog: http://cloudepr.blogspot.com
• Facebook: https://www.facebook.com/schubertzhang
• Email & Gtalk: [email protected]
• Weibo:@schubertzh
• WeChat:schubertzh
54