1 Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds Hadi Hemmati Bram Adams Weiyi Shang Zhen Ming Jiang Ahmed E. Hassan Patrick Martin


Page 1: Icse2013 shang

1

Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds

Hadi Hemmati, Bram Adams, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Patrick Martin

Page 2: Icse2013 shang

2

What are Big Data Analytics Applications (BDA Apps)?

BDA Apps

Page 3: Icse2013 shang

3

Many fields today rely on BDA Apps to make decisions

Software engineering research, especially Mining Software Repositories.

And…

Page 4: Icse2013 shang

4

Under the hood of BDA Apps

Hardware Infrastructure

Software Platform

BDA Apps

Page 5: Icse2013 shang

5

Discrepancy between scale of development and deployment

Small sample data and pseudo cloud

Big data and real-life cloud

Data sample

Page 6: Icse2013 shang

6

ACM Interactions 2012

“Analysts moved back and forth from local machines to cloud-based systems.”

Page 7: Icse2013 shang

7

Many things can go wrong when scaling

BDA App

Step 1 Step 2 Step n…

Large-scale intermediate data generated by each step can fill up the disk space!!!

Page 8: Icse2013 shang

8

How to verify the deployment of BDA Apps?

Small sample data and pseudo cloud

Big data and real-life cloud

Data sample

How to verify

Page 9: Icse2013 shang

9

Traditional approach for verifying BDA apps

Keyword scan

Page 10: Icse2013 shang

10

Many false positives!!

Large results, too much effort to manually examine

Limitations of traditional approach

Page 11: Icse2013 shang

11

Not all kills are bad: “speculative execution”

Slow task identified

Duplicate the task to other machines

The results of the first finished task are saved; the other tasks are killed!!

Page 12: Icse2013 shang

12

A smarter approach is needed

Page 13: Icse2013 shang

13

Execution sequences provide context information for log lines

Kill task t on node A.

Assign task t on node A.

Assign task t on node B.

Task t finished on node B.
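
Read as a sequence, the four log lines above show why context matters: a kill is less alarming when the same task finishes on another node. A minimal sketch of that check follows, assuming the log format of the slide's example lines. `PATTERNS`, `parse`, and `benign_kills` are hypothetical names for illustration, not part of Hadoop or of the authors' tool.

```python
import re
from collections import defaultdict

# Assumed log formats, modeled on the example lines on this slide.
PATTERNS = {
    "assign": re.compile(r"^Assign task (\w+) on node (\w+)\.$"),
    "kill": re.compile(r"^Kill task (\w+) on node (\w+)\.$"),
    "finish": re.compile(r"^Task (\w+) finished on node (\w+)\.$"),
}


def parse(line):
    """Return (event, task, node), or None for an unrecognized line."""
    for event, pattern in PATTERNS.items():
        match = pattern.match(line.strip())
        if match:
            return event, match.group(1), match.group(2)
    return None


def benign_kills(lines):
    """Return the (task, node) kills whose task finished on a different node."""
    finished = defaultdict(set)  # task -> nodes where it finished
    kills = []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        event, task, node = parsed
        if event == "finish":
            finished[task].add(node)
        elif event == "kill":
            kills.append((task, node))
    return {(task, node) for task, node in kills if finished[task] - {node}}


log = [
    "Kill task t on node A.",
    "Assign task t on node A.",
    "Assign task t on node B.",
    "Task t finished on node B.",
]
print(benign_kills(log))  # {('t', 'A')}
```

A keyword scan for "Kill" would flag the first line regardless. With the rest of the sequence in view, the kill on node A looks like the expected outcome of speculative execution.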

Page 14: Icse2013 shang

14

Log abstraction reduces the amount of data to examine

Kill task t1 on node A.
Kill task t2 on node B.
Kill task t3 on node C.
Kill task t4 on node A.
Kill task t5 on node D.
Kill task t6 on node B.
Kill task t7 on node A.
Kill task t8 on node C.

Large results, too much effort to manually examine

Kill task $t on node $n.
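
The abstraction above can be sketched with two substitution rules that replace the dynamic parts of a log line with placeholders. This is only an illustration: it assumes the dynamic values are the identifiers following "task" and "node", and `RULES` and `abstract` are hypothetical names rather than the technique's actual implementation.

```python
import re

# Assumed rules: the identifier after "task" or "node" is the dynamic part.
RULES = [
    (re.compile(r"\btask \w+"), "task $t"),
    (re.compile(r"\bnode \w+"), "node $n"),
]


def abstract(line):
    """Replace dynamic values in a log line with placeholders."""
    for pattern, template in RULES:
        line = pattern.sub(template, line)
    return line


lines = [
    "Kill task t1 on node A.",
    "Kill task t2 on node B.",
    "Kill task t3 on node C.",
    "Kill task t4 on node A.",
]
events = {abstract(line) for line in lines}
print(events)  # {'Kill task $t on node $n.'}
```

The four distinct lines collapse into a single execution event, which is the size reduction the slide describes.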

Page 15: Icse2013 shang

15

Overview of our approach

Small sample data and pseudo cloud

Big data and real-life cloud

Data sample

Underlying platform Underlying platform

Execution sequences

Execution sequences

Execution sequence delta

Log abstraction

Log linking

Sequence simplification

Page 16: Icse2013 shang

16

Step 1: Log abstraction reduces the size of logs

Log abstraction → Log linking → Simplifying sequences

Example of log lines

Execution events

Jiang et al. JSME 2008

Page 17: Icse2013 shang

17

Step 2: Log linking provides context for logs

Log abstraction → Log linking → Simplifying sequences

Example of log lines

Execution events
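
One way to read log linking is as grouping log lines that share a dynamic value, so that each line is seen alongside the other lines about the same entity. The sketch below assumes the linking key is a task ID and that lines look like the examples in these slides. `TASK_ID` and `link` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed linking key: the identifier that follows the word "task".
TASK_ID = re.compile(r"\btask (\w+)", re.IGNORECASE)


def link(lines):
    """Map each task ID to the ordered list of log lines that mention it."""
    sequences = defaultdict(list)
    for line in lines:
        match = TASK_ID.search(line)
        if match:
            sequences[match.group(1)].append(line)
    return dict(sequences)


log = [
    "Assign task t1 on node A.",
    "Assign task t2 on node B.",
    "Task t2 finished on node B.",
    "Task t1 finished on node A.",
]
for task, sequence in link(log).items():
    print(task, sequence)
```

Interleaved lines from different tasks end up in separate per-task sequences, each preserving the original order.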

Page 18: Icse2013 shang

18

Step 3: Sequence simplification deals with repeated logs

Log abstraction → Log linking → Simplifying sequences

Repeated logs:
task t1 read file A.
task t1 read file A.
task t1 read file A.

Remove repetition and order of events
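
A small sketch of this simplification, assuming a sequence is a list of event strings and that "removing repetition and order" can be modeled by converting the list to a set. `simplify` is a hypothetical name.

```python
def simplify(events):
    """Drop repeated events and ignore their order."""
    return frozenset(events)


repeated = [
    "task t1 read file A.",
    "task t1 read file A.",
    "task t1 read file A.",
]
print(len(simplify(repeated)))  # 1

# Two sequences that differ only in repetition or order compare equal.
print(simplify(["E1", "E2", "E2"]) == simplify(["E2", "E1"]))  # True
```

Using `frozenset` keeps the simplified sequences hashable, so they can be stored in a set or used as dictionary keys when runs are compared.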

Page 19: Icse2013 shang

Comparing small and large runs

19

Logs from testing run with small data

Logs from run with large data

Event sequence

E1, E2, E3, E5, E6

Event sequence

E1, E2, E3, E5, E6

E1, E2, E3, E7, E5, E6

Event sequence delta

E1, E2, E3, E7, E5, E6
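
The comparison on this slide can be sketched as follows: report every sequence from the large run whose set of events never appeared in the small run. This assumes sequences are compared after simplification (repetition and order ignored). `sequence_delta` is a hypothetical name.

```python
def sequence_delta(small_run, large_run):
    """Return sequences seen in the large run but not in the small run."""
    baseline = {frozenset(sequence) for sequence in small_run}
    delta = []
    for sequence in large_run:
        if frozenset(sequence) not in baseline and sequence not in delta:
            delta.append(sequence)
    return delta


small = [["E1", "E2", "E3", "E5", "E6"]]
large = [
    ["E1", "E2", "E3", "E5", "E6"],
    ["E1", "E2", "E3", "E7", "E5", "E6"],
]
print(sequence_delta(small, large))  # [['E1', 'E2', 'E3', 'E7', 'E5', 'E6']]
```

Only the sequence containing the new event E7 is left for manual inspection.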

Page 20: Icse2013 shang

20

Case study: subject systems

System      Source                   Domain
WordCount   official example         File processing
PageRank    developed from scratch   Social network
JACK        migrated from Perl       Log analysis

Page 21: Icse2013 shang

21

How precise is our approach?

Precision

Effort Reduction

How much effort reduction does our approach provide?

Page 22: Icse2013 shang

22

Our approach reduces the logs for manual inspection by over 86%

[Bar chart: # log sequences, # unique log events, and # log lines for WordCount, JACK, and PageRank; our approach vs. keyword search. Annotated reductions: 86%, 91%, 95%.]

Page 23: Icse2013 shang

23

How precise is our approach?

Precision

Effort Reduction

How much effort reduction does our approach provide?

Reduce logs for manual inspection by over 86%

Page 24: Icse2013 shang

24

We manually inject 3 common failures

Machine Failure

Missing supporting library

Lack of disk space

We measure the number of log lines and log sequences caused by injected failures.

WordCount Page Rank JACK

Cola et al. Euro-Par 2005

Page 25: Icse2013 shang

25

Our approach generates fewer false positives than the traditional approach

[Bar chart: false positive ratio between keyword search and our approach for WordCount, JACK, and PageRank. Annotated ratios: 1:29, 1:8, 1:36.]

Page 26: Icse2013 shang

26

How precise is our approach?

Precision

Effort Reduction

How much effort reduction does our approach provide?

Reduce logs for manual inspection by over 86%

Fewer false positives and additional context information to assist in manual inspection

Page 27: Icse2013 shang

27

Page 28: Icse2013 shang

28

Under the hood of BDA Apps

Physical Infrastructure

Underlying Platform

BDA Apps

Page 29: Icse2013 shang

29

Our approach can be used in migration of BDA Apps

Hadoop generates more job sequences and task sequences.

PIG automatically optimizes the application by grouping jobs and reducing tasks.

Manually browsing logs to find the differences can be time-consuming.

One of the common migrations

We use our approach to compare the execution sequences of PageRank on both platforms

Page 30: Icse2013 shang

30

Page 31: Icse2013 shang

31

Page 32: Icse2013 shang

32

Page 33: Icse2013 shang

33

One more thing …

Page 34: Icse2013 shang

34

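MapReduce: Hadoop’s programming paradigm

Counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map (word → key): good 4, hello 5, fish 4, cat 3, school 6, night 5, happy 5, dog 3

Grouped by key (key: values): 3: dog, cat; 4: fish, good; 5: hello, night, happy; 6: school

Reduce (key: value): 3: 2; 4: 2; 5: 3; 6: 1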
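
The word-length example can be mimicked in a few lines of plain Python. This is a sketch of the map, group, and reduce phases only, not Hadoop's API. `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names.

```python
from collections import defaultdict


def map_phase(words):
    """Emit one (key, value) pair per word, keyed by word length."""
    return [(len(word), word) for word in words]


def shuffle(pairs):
    """Group the values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count the values under each key."""
    return {key: len(values) for key, values in sorted(groups.items())}


data = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = reduce_phase(shuffle(map_phase(data)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

In Hadoop, the map and reduce functions run as distributed tasks and the grouping step is handled by the platform between them.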

Page 35: Icse2013 shang

35

Hadoop’s architecture

Hadoop application: Job 1, Job 2, …, Job n

Each job: Task, Task, Task, …

Each task: Attempt 1, Attempt 2, …, Attempt n
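
The hierarchy on this slide (application, jobs, tasks, attempts) can be modeled with a few nested data classes. This is a sketch only: the class and field names are hypothetical, and the rule that a task succeeds once any of its attempts succeeds is stated here as an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attempt:
    attempt_id: str
    status: str  # assumed values: "SUCCEEDED", "FAILED", "KILLED"


@dataclass
class Task:
    task_id: str
    attempts: List[Attempt] = field(default_factory=list)

    def succeeded(self) -> bool:
        # Assumption: one successful attempt is enough for the task.
        return any(a.status == "SUCCEEDED" for a in self.attempts)


@dataclass
class Job:
    job_id: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Application:
    jobs: List[Job] = field(default_factory=list)


task = Task("t1", [Attempt("a1", "FAILED"), Attempt("a2", "SUCCEEDED")])
app = Application([Job("job1", [task])])
print(task.succeeded())  # True
```

Under this model, a FAILED or KILLED attempt in the logs does not by itself mean the task, job, or application failed.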

Page 36: Icse2013 shang

36

Not all failures are bugs: JVM failure

The JVM of an attempt has an error

Bugs and memory issues in the JVM will cause attempts to be considered failed.

The attempt on that JVM will be marked as FAILED

Page 37: Icse2013 shang

37

An overview of our approach

Logs from testing run with small data → Execution sequence recovery → Execution sequence report

Logs from run with large data → Execution sequence recovery → Execution sequence report

Execution sequence recovery: Log abstraction, Log linking, Sequence simplification

Sequence comparing: the two execution sequence reports → Execution sequence delta

Page 38: Icse2013 shang

38

Under the hood of BDA Apps

Underlying platform

BDA Apps

Physical infrastructure

Page 39: Icse2013 shang

39

Our approach has comparable precision to the traditional approach

[Chart: precision ratio between our approach and keyword search (axis 0 to 3) and # lines to manually examine (axis 0 to 2000), for WordCount, JACK, and PageRank. Annotations: almost triple the precision; similar precision; half the precision; repeating same abstracted problem over 1,400 times.]