BDTC 2013 Beijing China
INSTITUTE OF COMPUTING TECHNOLOGY
BigDataBench: Benchmarking Big Data Systems
Jianfeng Zhan, Computer Systems Research Center, ICT, CAS
CCF Big Data Technology Conference 2013-12-06
http://prof.ict.ac.cn/jfzhan
Why Big Data Benchmarking?
Measuring big data architecture and systems quantitatively
What is BigDataBench?
An open source project on big data benchmarking:
• http://prof.ict.ac.cn/BigDataBench/
• 6 real-world data sets and 19 workloads
  – To be extended in the near future
• 4V characteristics
  – Volume, Variety, Velocity, and Veracity
Comparison of Big Data Benchmarking Efforts
Possible Users
BigDataBench
Architecture: Processor, Memory, Networks, ...
Systems: OS for big data, File systems for big data, ...
Data management: ...
Performance optimization
Co-design
Distributed systems, Scheduling, Programming systems
Research Publications
Characterizing data analysis workloads in data centers. Zhen Jia, Lei Wang, Jianfeng Zhan, Lixing Zhang, and Chunjie Luo. IISWC 2013. Best Paper Award.
BigDataBench: a Big Data Benchmark Suite from Internet Services. Lei Wang, Jianfeng Zhan, et al. HPCA 2014, Industry Session.
Outline
• Benchmarking Methodology and Decision
• Case Study
• How to Use
• Future Work
BigDataBench Methodology
4V of Big Data, BigDataBench
Methodology (Cont’)
Representative Data Sets
• Data Sources: Text data, Graph data, Table data, Extended…
• Data Types: Structured, Semi-structured, Unstructured
• Big Data Sets Preserving 4V
• Data generation tool preserving data characteristics
Diverse Workloads
• Investigate Typical Application Domains
• Application Types: Offline analytics, Realtime analytics, Online services
• Basic & Important Operations and Algorithms, Extended…
• Represent Software Stack, Extended…
• Big Data Workloads
BigDataBench
Methodology (Cont’)
4V of Big Data
System and architecture characteristics
Similarity analysis
BigDataBench
Top Sites on the Web
More details in http://www.alexa.com/topsites/global;0
Search engines, social networks, and electronic commerce account for 80% of the page views of all Internet services.
Workloads Chosen
• Cover workloads in diverse and representative application scenarios
  – Search Engine, E-commerce, Social Network
• Pay equal attention to different application types
  – Online service, real-time analytics, offline analytics
• Include different data sources
  – Text data, Graph data, Table data
• Cover representative software stacks
19 Chosen Workloads
Application Scenarios
• Micro Benchmarks
• Basic Datastore Operations
• Relational Queries
• Search engines
• Social networks
• E-commerce system
Data Generation Tools
Data Sources: Text, Graph and Table
• Six real raw data sets
Synthetic Data Scale
• From GB to PB
Features
• Preserve characteristics of real-world data
Naïve Text generator
• Words following a multinomial distribution
• Select words randomly to form documents
• Example words: big, architecture, system, CPU, mining, data, benchmarking, memory, evaluate, machine, learning
• Only modeling on the word level
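The word-level generator above can be sketched in a few lines of Python. This is a minimal illustration, not the BigDataBench tool itself: the vocabulary, the probabilities, and the function name `generate_document` are hypothetical, and a real generator would estimate the word distribution from one of the raw data sets.

```python
import random

# Hypothetical word probabilities (assumed for illustration only).
VOCAB_PROBS = {"big": 0.3, "data": 0.3, "system": 0.2, "CPU": 0.1, "memory": 0.1}


def generate_document(vocab_probs, length, rng):
    """Draw `length` words independently from a single multinomial distribution."""
    words = list(vocab_probs)
    weights = [vocab_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=length)


rng = random.Random(0)  # fixed seed so the run is repeatable
doc = generate_document(VOCAB_PROBS, 8, rng)
print(" ".join(doc))
```

Because every word is drawn independently, the output matches the word frequencies but carries no topical structure, which is the limitation the slide points out.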
Improved Text generator
• Topics following a multinomial distribution (topic1, topic2, topic3)
• Select topic randomly (e.g., topic2)
• Words following a multinomial distribution under the selected topic (topic2)
• Select word randomly to form the document
• Example words: big, architecture, CPU, benchmarking, mining, data, system, memory, evaluate, machine, learning
• Modeling on both the topic and word level
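A minimal Python sketch of the two-level sampling follows. It assumes the topic is drawn once per word, which the slide does not state; the topic mixture, the per-topic word distributions, and the helper names `draw` and `generate_document` are all hypothetical.

```python
import random

# Hypothetical topic mixture and per-topic word distributions (assumed values).
TOPIC_PROBS = {"topic1": 0.5, "topic2": 0.3, "topic3": 0.2}
WORD_PROBS = {
    "topic1": {"big": 0.5, "data": 0.3, "mining": 0.2},
    "topic2": {"CPU": 0.4, "memory": 0.3, "architecture": 0.3},
    "topic3": {"benchmarking": 0.6, "evaluate": 0.4},
}


def draw(probs, rng):
    """Draw one key from a multinomial distribution given as {key: probability}."""
    keys = list(probs)
    return rng.choices(keys, weights=[probs[k] for k in keys], k=1)[0]


def generate_document(length, rng):
    """For each position, select a topic first, then a word under that topic."""
    pairs = []
    for _ in range(length):
        topic = draw(TOPIC_PROBS, rng)
        word = draw(WORD_PROBS[topic], rng)
        pairs.append((topic, word))
    return pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
pairs = generate_document(10, rng)
print(" ".join(word for _, word in pairs))
```

Keeping the (topic, word) pairs makes it easy to check that each word really came from the distribution of its selected topic.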
BigDataBench Case Study
BigDataBench use cases:
• Evaluating Big Data Hardware Systems
• Performance evaluation and Diagnosis
• Workload Characterization
• Networks for big data
• Energy Efficiency of Big Data Systems
Users: USTC and Florida International University; ICT, CAS; SIAT, CAS; CNCERT; OSU; SJTU and XJTU
http://prof.ict.ac.cn/BigDataBench/#users
Testbed
Workloads Analyzed
http://prof.ict.ac.cn/BigDataBench
Floating point operation intensity
Operation intensity: the total number of (floating point or integer) instructions divided by the total number of memory access bytes in a run of a workload.
Very low floating point operation intensities (0.009), two orders of magnitude lower than the theoretical number of a state-of-practice CPU (1.8)
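The definition above reduces to a single division. The sketch below uses hypothetical counter readings chosen only so that the result matches the 0.009 figure on the slide; the function name `operation_intensity` is likewise an assumption, not part of BigDataBench.

```python
def operation_intensity(instruction_count, memory_access_bytes):
    """Instructions (floating point or integer) per byte of memory access."""
    if memory_access_bytes <= 0:
        raise ValueError("memory_access_bytes must be positive")
    return instruction_count / memory_access_bytes


# Hypothetical counter readings for one run of a workload.
fp_instructions = 9_000_000
memory_bytes = 1_000_000_000

fp_intensity = operation_intensity(fp_instructions, memory_bytes)
theoretical_intensity = 1.8  # figure quoted on the slide
gap = theoretical_intensity / fp_intensity

print(f"floating point operation intensity: {fp_intensity:.3f}")
print(f"gap to the theoretical number: about {gap:.0f}x")
```

The same function applies to the integer operation intensity by passing the integer instruction count instead.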
Instruction Breakdown
• Fewer floating point operations
• More integer operations
Ratio of Integer to Floating Point Operations
• The average ratio of big data workloads is 100
• Those of PARSEC, HPCC, and SPECFP are 1.4, 1.0, and 0.67
Integer operation intensity
• The average integer operation intensity of big data workloads is 0.49
• Those of PARSEC, HPCC, and SPECFP are 1.5, 0.38, and 0.23
Cache Behaviors
• Big data workloads have higher L1I misses than HPC workloads
• Data analysis workloads have better L2 cache behaviors than service workloads, except BFS
• Big data workloads have good L3 behaviors
TLB Behaviors
• ITLB misses of big data workloads are higher than those of HPC workloads.
• DTLB misses of big data workloads are higher than those of HPC workloads.
Evaluating Big Data Hardware Systems
Experimental Platforms
• Xeon (common processor)
• Atom (low power processor)
• Tilera (many core processor)
Basic Information: Brief Comparison
CPU Type        Intel Xeon E5310    Intel Atom D510      Tilera TilePro36
CPU Core        4 cores @ 1.6GHz    2 cores @ 1.66GHz    36 cores @ 500MHz
L1 I/D Cache    32KB                24KB                 16KB/8KB
L2 Cache        4096KB              512KB                64KB
Experimental Platforms: Hadoop Cluster Information
Comparison (the same logical core number):
• Xeon VS Atom: [1 Xeon master + 7 Xeon slaves] VS [1 Atom master + 7 Atom slaves]
• Xeon VS Tilera: [1 Xeon master + 7 Xeon slaves] VS [1 Xeon master + 1 Tilera slave]
Hadoop setting: following the guidance on the Hadoop official website
Benchmark SelectionBigDataBench 1.0
Application Time Complexity Characteristics
Sort O(n*log2n) Integer comparison
WordCount O(n) Integer comparison and calculation
Grep O(n) String comparisonNaïve Bayes O(m*n) Floating-point computation
SVM O(n3) Floating-point computation
32/
Metrics
Performance: Data processed per second (DPS)
Energy Efficiency: Application Performance Power Usage Effectiveness(DPJ)
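Both metrics can be computed from three measurements of a run. The sketch below assumes that DPJ abbreviates "data processed per joule", which the slide does not spell out, and that energy is the average power multiplied by the runtime. The function names and the sample numbers are hypothetical.

```python
def data_processed_per_second(input_bytes, runtime_seconds):
    """DPS: amount of input data processed per second of runtime."""
    return input_bytes / runtime_seconds


def data_processed_per_joule(input_bytes, average_power_watts, runtime_seconds):
    """DPJ (assumed meaning): input data processed per joule of energy consumed."""
    energy_joules = average_power_watts * runtime_seconds  # 1 W = 1 J/s
    return input_bytes / energy_joules


# Hypothetical run: 10 GB of input, 200 s of runtime, 250 W average power.
dps = data_processed_per_second(10e9, 200)
dpj = data_processed_per_joule(10e9, 250, 200)

print(f"DPS = {dps:.3g} bytes/s")
print(f"DPJ = {dpj:.3g} bytes/J")
```

Under these assumptions DPJ equals DPS divided by the average power, so a slower platform can still come out ahead on energy efficiency if its power draw is low enough.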
Xeon VS Atom – DPJ
Xeon VS Tilera – DPJ
Reference
Jing Quan (University of Science and Technology of China), Yingjie Shi (Chinese Academy of Sciences), Ming Zhao (Florida International University), and Wei Yang (University of Science and Technology of China). "The Implications from Benchmarking Three Different Data Center Platforms." The First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2013), in conjunction with the 2013 IEEE International Conference on Big Data (IEEE Big Data 2013).
BigDataBench Class
For Architecture: 19 of 19 workloads
For OS: 19 of 19 workloads
For Runtime environment (Hadoop): 9 of 19 workloads
• Sort, Grep, WordCount, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
For Data management: 6 of 19 workloads
• Read, Write, Scan, Select Query, Aggregate Query, Join Query
BigDataBench Class: Data Sources
Text related: 6 of 19 workloads
• Sort, Grep, WordCount, Index, Collaborative Filtering and Naive Bayes
Graph related: 4 of 19 workloads
• BFS, PageRank, Kmeans, and Connected Components
Table related: 9 of 19 workloads
• Read, Write, Scan, Select Query, Aggregate Query, Join Query, Nutch Server, Olio Server and Rubis Server
BigDataBench Class: Application Types
Online Services: 6 of 19 workloads
• Read, Write, Scan, Nutch Server, Olio Server and Rubis Server
Offline Analytics: 10 of 19 workloads
• Sort, Grep, WordCount, BFS, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
Realtime Analytics: 3 of 19 workloads
• Select Query, Aggregate Query and Join Query
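The data source and application type classes listed above can be encoded as plain lookup tables to pick a workload subset. This is only an illustrative sketch: the workload names come from the slides, while the dictionary layout and the helper `select_workloads` are hypothetical and not part of BigDataBench.

```python
APPLICATION_TYPES = {
    "Online Services": ["Read", "Write", "Scan",
                        "Nutch Server", "Olio Server", "Rubis Server"],
    "Offline Analytics": ["Sort", "Grep", "WordCount", "BFS", "PageRank", "Index",
                          "Kmeans", "Connected Components",
                          "Collaborative Filtering", "Naive Bayes"],
    "Realtime Analytics": ["Select Query", "Aggregate Query", "Join Query"],
}

DATA_SOURCES = {
    "Text": ["Sort", "Grep", "WordCount", "Index",
             "Collaborative Filtering", "Naive Bayes"],
    "Graph": ["BFS", "PageRank", "Kmeans", "Connected Components"],
    "Table": ["Read", "Write", "Scan", "Select Query", "Aggregate Query",
              "Join Query", "Nutch Server", "Olio Server", "Rubis Server"],
}


def select_workloads(application_type, data_source):
    """Return the workloads that belong to both the given type and data source."""
    of_type = set(APPLICATION_TYPES[application_type])
    return [w for w in DATA_SOURCES[data_source] if w in of_type]


# Each classification should partition the same 19 workloads.
by_type = [w for group in APPLICATION_TYPES.values() for w in group]
by_source = [w for group in DATA_SOURCES.values() for w in group]
print(len(by_type), len(by_source), set(by_type) == set(by_source))

print(select_workloads("Offline Analytics", "Graph"))
```

For example, a user interested only in offline graph processing would end up with BFS, PageRank, Kmeans and Connected Components.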
BigDataBench Class: Application Domains
Search engine related: Basic Operations + Search Engine (7 of 19 workloads)
• Sort, Grep, WordCount, BFS, PageRank, Index and Nutch Server
Social network related: Basic Cloud OLTP + Basic Relational Query + Social Network (9 of 19 workloads)
• Read, Write, Scan, Select Query, Aggregate Query, Join Query, Olio Server, Kmeans and Connected Components
E-commerce related: Basic Cloud OLTP + Basic Relational Query + E-commerce (9 of 19 workloads)
• Read, Write, Scan, Select Query, Aggregate Query, Join Query, Rubis Server, Collaborative Filtering and Naive Bayes
Near Future Work
Multi-media data
Deep learning workloads
HPC
Refine BigDataBench
Related Resources
BigDataBench project: http://prof.ict.ac.cn/BigDataBench
BPOE workshop: http://prof.ict.ac.cn/bpoe
• A series of workshops on Big Data Benchmarks, Performance Optimization, and Emerging Hardware
• BPOE-4: interaction among OS, architecture, and data management
• Co-located with ASPLOS 2014
BPOE-4 SC
• Christos Kozyrakis, Stanford
• Xiaofang Zhou, University of Queensland
• Dhabaleswar K Panda, Ohio State University
• Raghunath Nambiar, Cisco
• Lizy K John, University of Texas at Austin
• Xiaoyong Du, Renmin University of China
• H. Peter Hofstee, IBM Austin Research Laboratory
• Ippokratis Pandis, IBM Almaden Research Center
• Alexandros Labrinidis, University of Pittsburgh
• Bill Jia, Facebook
• Jianfeng Zhan, ICT, Chinese Academy of Sciences
THANKS