MapReduce in Action. Team 306, led by Chen Lin. College of Information Science and Technology. 数据挖掘研究组 Data Mining Group @ Xiamen University

MapReduce in Action


Page 1: MapReduce in Action

MapReduce in Action

Team 306, led by Chen Lin

College of Information Science and Technology

数据挖掘研究组 Data Mining Group @ Xiamen University

Page 2: MapReduce in Action

Contents

1. Basic MapReduce Programs

2. Advanced MapReduce

3. Beyond the horizon

4. Discussion

Page 3: MapReduce in Action

Basic MapReduce Programs

[Diagram: Job, Configuration, Master (JobTracker)]

Page 4: MapReduce in Action

Basic MapReduce Programs

Job Configuration?

• Java Class
• Implement Interface
• Environment Configuration

Page 5: MapReduce in Action

Interface

• Mapper
• Reducer
• Partitioner
• Combiner
• InputFormat
• OutputFormat

Page 6: MapReduce in Action

Configure

• JVM: mapred.child.java.opts
• {mapred.local.dir}
• InputPath, OutputPath
• How many Map/Reduce tasks?

Page 7: MapReduce in Action

Basic MapReduce Program

InputFormat → Map → Reduce → OutputFormat

Text → InputSplit → <K1,V1> → List<K2,V2> → <K2,List<V2>> → List<K3,V3>
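The data flow on this slide can be sketched in plain Python. This is not Hadoop code: `run_job`, `map_fn`, and `reduce_fn` are hypothetical names chosen for illustration, and the type names follow the usual convention in which map takes (K1, V1) and emits a list of (K2, V2), and reduce takes (K2, list of V2) and emits a list of (K3, V3). Word count is assumed as the example task.

```python
from collections import defaultdict

def map_fn(offset, line):
    # (K1, V1) = (byte offset, line text) -> list of (K2, V2) = (word, 1)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # (K2, list of V2) -> list of (K3, V3) = (word, total)
    return [(word, sum(counts))]

def run_job(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle and sort: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to each key in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Two input lines with their byte offsets ("a b a\n" is 6 bytes long).
records = [(0, "a b a"), (6, "b c")]
result = run_job(records, map_fn, reduce_fn)
print(result)
```

In a real job the framework performs the grouping step across machines; the user supplies only the two functions and the configuration.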

Page 8: MapReduce in Action

Basic MapReduce

Page 9: MapReduce in Action

PARTITIONERS AND COMBINERS

Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.

Partitioners determine which reducer will be responsible for processing a particular key; the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.
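A minimal sketch of what a partitioner does, in plain Python rather than the Hadoop API. The function names `partition` and `shuffle` are hypothetical, and CRC32 is used here only as an assumed stand-in for a deterministic hash; the point is that every pair with the same key lands at the same reducer.

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash of the key, modulo the number of reducers
    # (an assumption for illustration, not Hadoop's own hash function).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    # Route each intermediate pair to the bucket of its reducer.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
buckets = shuffle(pairs, 3)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

A custom partitioner would replace only the `partition` function, for example to route keys by a prefix instead of by the whole key.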

Page 10: MapReduce in Action

Basic MapReduce Program: InputFormat

• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• NLineInputFormat
• Creating a custom InputFormat

Page 11: MapReduce in Action

InputFormat

• TextInputFormat - Each line in the text files is a record. The key is the byte offset of the line, and the value is the content of the line.
• KeyValueTextInputFormat - Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
• NLineInputFormat - Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
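The record rules described for these input formats can be imitated in a few lines of plain Python. This is a sketch only: the function names are hypothetical, the input is an in-memory byte string rather than an HDFS file, and the handling of a line with no separator (whole line as key, empty value) and of a short final split is an assumption of this sketch.

```python
def text_input_format(data: bytes):
    """Yield (byte offset, line) records, in the spirit of TextInputFormat."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # advance by the full line, newline included

def key_value_text_input_format(data: bytes, separator="\t"):
    """Yield (key, value) records split at the first separator."""
    for _, line in text_input_format(data):
        key, _, value = line.partition(separator)
        yield key, value

def n_line_splits(data: bytes, n=1):
    """Group TextInputFormat-style records into splits of n lines each."""
    records = list(text_input_format(data))
    return [records[i:i + n] for i in range(0, len(records), n)]

data = b"k1\tv1\tmore\nno-separator\n"
text_records = list(text_input_format(data))
kv_records = list(key_value_text_input_format(data))
splits = n_line_splits(data, n=1)
print(text_records)
print(kv_records)
print(len(splits))
```

Note that only the first tab splits a line, so the second tab in the first record stays inside the value.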

Page 12: MapReduce in Action

Basic MapReduce Program

Types for the key/value pairs

Page 13: MapReduce in Action

Summary for Basic Programs

What’s a complete MapReduce job?

• Code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters.
• The execution framework handles everything else.

Page 14: MapReduce in Action

Advanced MapReduce

• Chaining MapReduce jobs
• Local aggregation
• Secondary sorting
• Work on Hadoop files

Page 15: MapReduce in Action

Chaining MapReduce jobs

You’ve been doing data processing tasks that a single MapReduce job can accomplish.

But as you get more comfortable writing MapReduce programs and take on more ambitious data processing tasks, you’ll find that many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.
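As an illustration of chaining, the sketch below runs two jobs in sequence in plain Python, feeding the output of the first into the second. The task (count words, then group words by their count) and all function names are assumptions made for this example; `run_job` is a hypothetical single-process stand-in for the execution framework.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    # Map, group by key, then reduce each key in sorted order.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Job 1: word count.
def count_map(_, line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, ones):
    return [(word, sum(ones))]

# Job 2: invert (word, count) into (count, words with that count).
def invert_map(word, count):
    return [(count, word)]

def invert_reduce(count, words):
    return [(count, sorted(words))]

lines = [(0, "a b a"), (1, "b c")]
job1_output = run_job(lines, count_map, count_reduce)
job2_output = run_job(job1_output, invert_map, invert_reduce)
print(job1_output)
print(job2_output)
```

The only coupling between the two jobs is that the output key-value types of the first match the input types of the second.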

Page 16: MapReduce in Action

LOCAL AGGREGATION

In Hadoop, intermediate results are written to local disk before being sent over the network.

Reductions in the amount of intermediate data translate into increases in algorithmic efficiency.

With combiners, it is possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.
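To make the saving concrete, here is a small plain-Python sketch (not Hadoop code; `map_fn` and `combine` are hypothetical names, and word count is an assumed example) that compares the number of pairs one mapper would send with and without local aggregation.

```python
from collections import Counter

def map_fn(line):
    # Emit one (word, 1) pair per word occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Local aggregation over a single mapper's output:
    # sum the counts per word before anything is shuffled.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper_output = map_fn("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

The reducer receives the same totals either way; only the volume of shuffled data changes.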

Page 17: MapReduce in Action

Pseudo-code for computing the mean of values associated with the same string.
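The slide's figure is not reproduced in this transcript. As a stand-in, a minimal plain-Python sketch of the idea might look like the following: an identity mapper passes each (string, value) pair through, and the reducer averages all values that share a string. The function names and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def mapper(records):
    # Identity mapper: emit each (string, value) pair unchanged.
    for key, value in records:
        yield key, value

def reducer(key, values):
    # Average all values that share the same string.
    values = list(values)
    return key, sum(values) / len(values)

def mean_by_key(records):
    # Group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in mapper(records):
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

means = mean_by_key([("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)])
print(means)
```

Without a combiner, every input value crosses the network to a reducer, which is the cost local aggregation tries to avoid.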

Page 18: MapReduce in Action

LOCAL AGGREGATION: is it right?
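The figure for this slide is missing from the transcript, so the following is an assumption about the kind of mistake in question rather than a reproduction of it. One common pitfall is reusing the mean-computing reducer as a combiner: a mean of partial means is generally not the mean of the original values. The numbers below are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

# Values for one key, as seen by two different mappers.
mapper1_values = [1, 2]
mapper2_values = [3, 4, 5]

# Correct answer: the mean over all values.
true_mean = mean(mapper1_values + mapper2_values)

# Incorrect use: each "combiner" emits a local mean,
# and the reducer then averages those means.
local_means = [mean(mapper1_values), mean(mapper2_values)]
mean_of_means = mean(local_means)

print(true_mean, mean_of_means)
```

The two results differ because the local means carry no information about how many values each one summarizes.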

Page 19: MapReduce in Action

LOCAL AGGREGATION

1. Combiners must have the same input and output key-value types, and these must match the mapper output type and the reducer input type.

2. Combiners are optimizations that cannot change the correctness of the algorithm.

Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times.

Page 20: MapReduce in Action

LOCAL AGGREGATION: right usage
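The slide's figure is not available here. A minimal sketch of one correct approach, assuming the mean example, is to have the mapper emit (sum, count) pairs so that the combiner's input and output types are identical and partial results can be merged safely. All names and sample values below are hypothetical, and the code is plain Python, not Hadoop.

```python
def mapper(values):
    # Emit each value as a (sum, count) pair with count 1.
    return [(value, 1) for value in values]

def combiner(pairs):
    # Input and output have the same type: a list of (sum, count) pairs.
    return [(sum(s for s, _ in pairs), sum(c for _, c in pairs))]

def reducer(pairs):
    # Merge all partial sums and counts, then divide once at the end.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

m1 = mapper([1, 2])
m2 = mapper([3, 4, 5])

# The result is the same whether the combiner runs zero, one, or two times.
no_combiner = reducer(m1 + m2)
one_pass = reducer(combiner(m1) + combiner(m2))
two_passes = reducer(combiner(combiner(m1)) + combiner(m2))
print(no_combiner, one_pass, two_passes)
```

Because the division happens only in the reducer, the number of combiner invocations cannot change the answer.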

Page 21: MapReduce in Action

SECONDARY SORTING

Sometimes we also need to sort by value:

(k1; m1; v8)
(k1; m2; v1)
(k1; m3; v7)
...
(k2; m1; v2)
(k2; m2; v6)
(k2; m3; v9)

k1 → (m1; v8) becomes (k1; m1) → (v8)
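The idea of moving part of the value into the key can be sketched in plain Python. This is a single-process illustration with hypothetical names, not Hadoop code; in a real job one would also need to partition and group on the natural key (k) alone so that all composite keys for the same k reach the same reduce call.

```python
from itertools import groupby

# Input records (k, m, v) in arbitrary arrival order.
records = [
    ("k1", "m3", "v7"), ("k2", "m2", "v6"), ("k1", "m1", "v8"),
    ("k2", "m1", "v2"), ("k1", "m2", "v1"), ("k2", "m3", "v9"),
]

def mapper(record):
    # Emit a composite key (k, m) so the framework's sort orders by m too.
    k, m, v = record
    return (k, m), v

# Framework sort on the composite key.
pairs = sorted(mapper(record) for record in records)

# Group on the natural key only; values arrive already ordered by m.
grouped = {
    natural_key: [(composite[1], value) for composite, value in group]
    for natural_key, group in groupby(pairs, key=lambda pair: pair[0][0])
}
print(grouped)
```

The reducer never has to buffer and sort values in memory, since the ordering is done by the sort on the composite key.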

Page 22: MapReduce in Action

Beyond the horizon

It’s a shame: the remaining topics play an important role in MapReduce, but they are beyond my horizon.

So I need all your help to master them together.

Page 23: MapReduce in Action

Beyond the horizon

• Create a user custom InputFormat
• Manipulate local files
• Create a user custom Partitioner
• Pipes for C++
• Streaming for other languages

Page 24: MapReduce in Action

Beyond the horizon

• Joining data from different sources
• Hive
• Pig
• HBase
• Multiple file output

Page 25: MapReduce in Action

Joining data from different sources

• Orders file: CSV format; record fields: (Customer ID, Order ID, Price, and Purchase Date)
• Customers file: CSV format; record fields: (Customer ID, Name, and Phone Number)

Page 26: MapReduce in Action

Joining data from different sources

Customers:
Joey Leung,555-555-55
Edward,123-456-7890
Jose Madriz,281-330-8004
David Stork,408-555-0000
...

Orders:
A,12.95,02-Jun-2008
B,88.25,20-May-2008
C,32.00,30-Nov-2007
D,25.02,22-Jan-2009

Joined output:
Joey Leung,555-555-5555,B,88.25,20-May-2008
Edward,123-456-7890,C,32.00,30-Nov-2007
Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
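One way such a join could be expressed is a reduce-side join, sketched below in plain Python. The Customer ID column does not appear in this transcript, so the IDs 1 to 4 used here are hypothetical values chosen to reproduce the joined output; the function names and the tagging scheme are likewise assumptions, not the presenter's code.

```python
from collections import defaultdict

# Customer IDs (first field) are hypothetical; the transcript omits them.
customers = [
    "1,Joey Leung,555-555-5555",
    "2,Edward,123-456-7890",
    "3,Jose Madriz,281-330-8004",
    "4,David Stork,408-555-0000",
]
orders = [
    "3,A,12.95,02-Jun-2008",
    "1,B,88.25,20-May-2008",
    "2,C,32.00,30-Nov-2007",
    "3,D,25.02,22-Jan-2009",
]

def map_tagged(line, tag):
    # Key on Customer ID and tag the rest of the record with its source.
    customer_id, rest = line.split(",", 1)
    return customer_id, (tag, rest)

def reduce_join(tagged_values):
    # Pair every customer record with every order for the same ID
    # (an inner join: customers without orders produce no output).
    names = [rest for tag, rest in tagged_values if tag == "customer"]
    purchases = [rest for tag, rest in tagged_values if tag == "order"]
    return [f"{name},{purchase}" for name in names for purchase in purchases]

groups = defaultdict(list)
for line in customers:
    key, value = map_tagged(line, "customer")
    groups[key].append(value)
for line in orders:
    key, value = map_tagged(line, "order")
    groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_join(groups[key])]
for row in joined:
    print(row)
```

David Stork has no orders in this sample, so he does not appear in the joined output.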

Page 27: MapReduce in Action

Thank you!

数据挖掘研究组 Data Mining Group @ Xiamen University