Icse2013 shang

1

Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds

Hadi Hemmati, Bram Adams, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Patrick Martin

2

What are Big Data Analytics Applications (BDA Apps)?

BDA Apps

3

Many fields today rely on BDA Apps to make decisions

Software engineering research, especially Mining Software Repositories.

And…

4

Under the hood of BDA Apps

Hardware Infrastructure

Software Platform

BDA Apps

5

Discrepancy between scale of development and deployment

Small sample data and pseudo cloud

Big data and real-life cloud

Data sample

6

“Analysts moved back and forth from local machines to cloud-based systems.” (ACM Interactions 2012)

7

Many things can go wrong when scaling

BDA App

Step 1, Step 2, …, Step n

Large-scale intermediate data generated by each step can fill up the disk space!!!

8

How to verify the deployment of BDA Apps?



Data sample

How to verify

9

Traditional approach for verifying BDA apps

Keyword scan
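A keyword scan can be sketched in a few lines of Python. This is only an illustration: the keyword list and the sample log lines are assumptions, since the slides do not say which keywords are scanned for.

```python
import re

# Hypothetical keyword list; the slides do not name the keywords actually used.
KEYWORDS = ("kill", "fail", "error", "exception")
PATTERN = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def keyword_scan(log_lines):
    """Return every log line that contains at least one keyword."""
    return [line for line in log_lines if PATTERN.search(line)]


# Illustrative log lines in the style used elsewhere in this deck.
sample_log = [
    "Assign task t on node A.",
    "Assign task t on node B.",
    "Task t finished on node B.",
    "Kill task t on node A.",
]

flagged = keyword_scan(sample_log)
print(flagged)
```

The scan reports the "Kill" line without any of the surrounding lines, so the person inspecting the result cannot tell from the hit alone whether the kill was a problem.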

10

Limitations of traditional approach

Many false positives!!

Large results, too much effort to manually examine

11

Not all kills are bad: “speculative execution”

Slow task identified

Duplicate the task to other machines

The results of the first finished task are saved; the other tasks are killed!!

12

A smarter approach is needed

13

Execution sequences provide context information of log lines

Kill task t on node A.

Assign task t on node A.

Assign task t on node B.

Task t finished on node B.
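A minimal sketch of how linking log lines by task ID gives each line its context. The regular expression, the chronological order of the lines, and the second task `u` are assumptions made for illustration; they are not taken from the slides.

```python
import re
from collections import defaultdict

# Assumed line format, modelled on the four example lines above.
TASK_ID = re.compile(r"\btask (\w+)\b", re.IGNORECASE)


def link_by_task(log_lines):
    """Group log lines by the task ID they mention, preserving log order."""
    sequences = defaultdict(list)
    for line in log_lines:
        match = TASK_ID.search(line)
        if match:
            sequences[match.group(1)].append(line)
    return dict(sequences)


# The ordering below is an assumed chronological order; task u is hypothetical.
log_lines = [
    "Assign task t on node A.",
    "Assign task u on node C.",
    "Assign task t on node B.",
    "Task t finished on node B.",
    "Kill task t on node A.",
    "Task u finished on node C.",
]

sequences = link_by_task(log_lines)
for line in sequences["t"]:
    print(line)
```

Read as a sequence, the kill of task t on node A comes after the same task finished on node B, which is the speculative-execution pattern rather than a failure.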

14

Log abstraction reduces the amount of data to examine

Kill task t1 on node A.
Kill task t2 on node B.
Kill task t3 on node C.
Kill task t4 on node A.
Kill task t5 on node D.
Kill task t6 on node B.
Kill task t7 on node A.
Kill task t8 on node C.

Large results, too much effort to manually examine

Kill task $t on node $n.
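The abstraction step can be sketched as replacing the dynamic parts of each line with placeholders. This is a simplified illustration, not the technique of Jiang et al.; the two regular-expression rules are assumptions that only cover the "task" and "node" fields of the example lines.

```python
import re
from collections import Counter

# Assumed rules: each replaces a dynamic value with the placeholder used on the slide.
RULES = [
    (re.compile(r"\btask \w+"), "task $t"),
    (re.compile(r"\bnode \w+"), "node $n"),
]


def abstract(line):
    """Map a concrete log line to its abstracted execution event."""
    for pattern, template in RULES:
        line = pattern.sub(template, line)
    return line


log_lines = [
    "Kill task t1 on node A.",
    "Kill task t2 on node B.",
    "Kill task t3 on node C.",
    "Kill task t4 on node A.",
    "Kill task t5 on node D.",
    "Kill task t6 on node B.",
    "Kill task t7 on node A.",
    "Kill task t8 on node C.",
]

events = Counter(abstract(line) for line in log_lines)
print(events)
```

Eight concrete lines collapse into a single event, so one inspection covers all of them.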

15

Overview of our approach



[Figure: the run on the data sample and the run on the full data, each on its underlying platform, go through log abstraction, log linking, and sequence simplification to produce execution sequences; comparing the two sets yields the execution sequence delta.]

16

Step 1: Log abstraction reduces the size of logs

Log abstraction, log linking, simplifying sequences

Example of log lines

Execution events (Jiang et al., JSME 2008)

17

Step 2: Log linking provides context for logs

Example of log lines

Execution events

18

Step 3: Sequence simplification deals with repeated logs

Repeated logs:
task t1 read file A.
task t1 read file A.
task t1 read file A.

Remove repetitions and ignore the order of events
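One possible reading of this step, sketched under the assumption that a simplified sequence can be represented as an unordered set of distinct events. The event names are placeholders.

```python
def simplify(sequence):
    """Drop repeated events and ignore their order."""
    return frozenset(sequence)


# Two hypothetical sequences that differ only in repetition and ordering.
first = ["E1", "E2", "E2", "E2", "E3"]
second = ["E3", "E1", "E2"]

print(sorted(simplify(first)))
print(simplify(first) == simplify(second))
```

With this representation, a task that reads the same file three times and a task that reads it once produce the same simplified sequence, so they are not reported as different behaviour.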

19

Comparing small and large runs

Logs from testing run with small data
Event sequence: E1, E2, E3, E5, E6

Logs from run with large data
Event sequences:
E1, E2, E3, E5, E6
E1, E2, E3, E7, E5, E6

Event sequence delta: E1, E2, E3, E7, E5, E6
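The comparison above can be sketched as a set difference over simplified sequences. The function name and the use of `frozenset` for simplification are assumptions for illustration; the event sequences are the ones shown on the slide.

```python
def sequence_delta(small_run, large_run):
    """Return sequences from the large run whose simplified form
    (unordered set of events) never occurs in the small run."""
    seen = {frozenset(sequence) for sequence in small_run}
    return [sequence for sequence in large_run if frozenset(sequence) not in seen]


small_run = [["E1", "E2", "E3", "E5", "E6"]]
large_run = [
    ["E1", "E2", "E3", "E5", "E6"],
    ["E1", "E2", "E3", "E7", "E5", "E6"],
]

delta = sequence_delta(small_run, large_run)
print(delta)
```

Only the sequence containing E7 remains, which is the one a developer would need to inspect manually.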

20

Case study: subject systems

System      Source                  Domain
WordCount   official example        File processing
PageRank    developed from scratch  Social network
JACK        migrated from Perl      Log analysis

21

Effort reduction

How much effort reduction does our approach provide?

Precision

How precise is our approach?

22

Our approach reduces the logs for manual inspection by over 86%

[Bar chart: # log lines, # unique log events, and # log sequences (axis 0 to 2,000) for WordCount, JACK, and PageRank, comparing keyword search with our approach. Annotated reductions: 86%, 91%, and 95%.]

23




Reduce logs for manual inspection by over 86%

24

We manually inject 3 common failures

Machine Failure

Missing supporting library

Lack of disk space

We measure the number of log lines and log sequences caused by injected failures.

WordCount, PageRank, JACK

Cola et al. Euro-Par 2005

25

Our approach generates fewer false positives than the traditional approach


[Bar chart: false positive ratio between keyword search and our approach (axis ticks 10 to 40). Annotated ratios: 1:29, 1:8, and 1:36.]

26




Reduce logs for manual inspection by over 86%

Fewer false positives and additional context information to assist in manual inspection

28


Physical Infrastructure

Underlying Platform

BDA Apps

29

Our approach can be used in migration of BDA Apps

Hadoop generates more job sequences and task sequences.

PIG automatically optimizes the application by grouping jobs and reducing tasks.

Manually browsing logs to find the differences can be time-consuming.

One of the common migrations

We use our approach to compare the execution sequences of PageRank on both platforms

33

One more thing …

34

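MapReduce: Hadoop’s programming paradigm

Example: counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map
Key   Value
4     good
5     hello
4     fish
3     cat
6     school
5     night
5     happy
3     dog

Grouped by key
Key   Value
3     dog, cat
4     fish, good
5     hello, night, happy
6     school

Reduce
Key   Value
3     2
4     2
5     3
6     1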
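The word-length example can be simulated in plain Python to show the three phases. This is a single-process sketch of the paradigm, not Hadoop's API; the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(words):
    """Emit one (key, value) pair per word: key = word length, value = word."""
    return [(len(word), word) for word in words]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Count how many words share each length."""
    return {key: len(values) for key, values in sorted(grouped.items())}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = reduce_phase(shuffle(map_phase(words)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

On a real cluster the map and reduce functions run as many parallel tasks on different nodes, which is why the logs of a single job are spread across tasks and attempts.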

35

Hadoop’s architecture

Hadoop application: Job 1, Job 2, …, Job n

Each job: Task, Task, …

Each task: Attempt 1, Attempt 2, …, Attempt n

36

Not all failures are bugs: JVM failure

The JVM of an attempt has an error

Bugs and memory issues in the JVM cause attempts to be considered failed.

The attempt on that JVM is marked as FAILED

37

An overview of our approach

[Figure: logs from the testing run with small data and logs from the run with large data each go through execution sequence recovery (log abstraction, log linking, sequence simplification) to produce an execution sequence report; sequence comparing of the two reports yields the execution sequence delta.]

38


Underlying platform

BDA Apps

Physical infrastructure

or

39


Our approach has comparable precision to the traditional approach

[Chart: precision ratio between our approach and keyword search (axis 0.5 to 3) and # lines to manually examine (axis 0 to 2,000). Annotations: almost triple the precision; similar precision; half the precision; repeating same abstracted problem over 1,400 times.]
