Lecture 12, Hadoop



    Creating Map-Reduce Programs Using Hadoop


    Presentation Overview

    Recall Hadoop

    Overview of the map-reduce paradigm

    Elaboration on the WordCount example

    components of Hadoop that make WordCount possible

    Major new example: N-Gram Generator

    step-by-step assembly of this map-reduce job

    Design questions to ask when creating your own

    Hadoop jobs


    Recall why Hadoop rocks

    Hadoop is:

    Free and open source

    high quality, like all Apache Foundation projects

    cross-platform (pure Java)

    fault-tolerant

    highly scalable

    has bindings for non-Java programming languages

    applicable to many computational problems


    Map-Reduce System Overview

    JobTracker: makes scheduling decisions

    TaskTracker: manages tasks for a given node

    Task process: runs an individual map or reduce fragment for a given job

    Forks from the TaskTracker


    Map-Reduce System Overview

    Processes communicate by custom RPC

    implementation

    Easy to change/extend

    Defined as Java interfaces

    Server objects implement the interface

    Client proxy objects automatically created

    All messages originate at the client (e.g., Task to TaskTracker)

    Prevents cycles and therefore deadlocks


    Process Flow Diagram


    Application Overview

    Launching Program

    Creates a JobConf to define a job.

    Submits JobConf to JobTracker and waits for

    completion.

    Mapper

    Is given a stream of (key1, value1) pairs

    Generates a stream of (key2, value2) pairs

    Reducer

    Is given a key2 and a stream of value2s

    Generates a stream of (key3, value3) pairs


    Job Launch Process: Client

    Client program creates a JobConf

    Identify classes implementing Mapper and Reducer interfaces

    JobConf.setMapperClass(); JobConf.setReducerClass()

    Specify input and output formats

    JobConf.setInputFormat(TextInputFormat.class);

    JobConf.setOutputFormat(TextOutputFormat.class);

    Other options too:

    JobConf.setNumReduceTasks()

    JobConf.setOutputFormat()

    Many, many more (Facade pattern)


    An onslaught of terminology

    We'll explain these terms, each of which plays a

    role in any non-trivial map/reduce job:

    InputFormat, OutputFormat, FileInputFormat, ...

    JobClient and JobConf

    JobTracker and TaskTracker

    TaskRunner, MapTaskRunner, MapRunner,

    InputSplit, RecordReader, LineRecordReader, ...

    Writable, WritableComparable, IntWritable, ...


    InputFormat and OutputFormat

    The application also chooses input and output formats, which define how the persistent
    data is read and written. These are interfaces and can be defined by the application.

    InputFormat

    Splits the input to determine the input to each map

    task.

    Defines a RecordReader that reads key, value pairs that are passed to the map task

    OutputFormat

    Given the key, value pairs and a filename, writes the reduce output to persistent storage.


    Example

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }


    Job Launch Process: JobClient

    Pass JobConf to JobClient.runJob() or

    JobClient.submitJob()

    runJob() blocks (waits) until the job finishes

    submitJob() does not

    Poll for status to make running decisions

    Avoid polling with JobConf.setJobEndNotificationURI()

    JobClient:

    Determines proper division of input into InputSplits

    Sends job data to the master JobTracker server
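
    For concreteness, here is a minimal sketch of the two submission styles (not from the original slides; it assumes the old org.apache.hadoop.mapred API used throughout this lecture, and the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class SubmitStyles {
      // Blocking: runJob() submits the job and waits, printing progress until it finishes.
      static void runBlocking(JobConf conf) throws IOException {
        JobClient.runJob(conf);
      }

      // Non-blocking: submitJob() returns immediately; we poll the RunningJob for status.
      static void runPolling(JobConf conf) throws IOException, InterruptedException {
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        while (!job.isComplete()) {
          Thread.sleep(5000);   // make running/monitoring decisions here
        }
        System.out.println("Succeeded: " + job.isSuccessful());
      }
    }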


    Job Launch Process: JobTracker

    JobTracker:

    Inserts jar and JobConf (serialized to XML) in shared

    location

    Posts a JobInProgress to its run queue


    Job Launch Process: TaskTracker

    TaskTrackers running on slave nodes periodically query the JobTracker for work

    Retrieve job-specific jar and config

    Launch task in separate instance of Java

    main() is provided by Hadoop


    Job Launch Process: Task

    TaskTracker.Child.main():

    Sets up the child TaskInProgress attempt

    Reads XML configuration

    Connects back to necessary MapReduce components via RPC

    Uses TaskRunner to launch the user process


    Job Launch Process: TaskRunner

    TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper

    Task knows ahead of time which InputSplits it should be mapping

    Calls Mapper once for each record retrieved from the InputSplit

    Running the Reducer is much the same


    Creating the Mapper

    You provide the instance of Mapper

    Should extend MapReduceBase

    Implement interface Mapper

    One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress

    Exists in a separate process from all other instances of Mapper: no data sharing!


    Mapper

    Override function map()

    void map(WritableComparable key,

    Writable value,

    OutputCollector output,

    Reporter reporter)

    Emit (k2,v2) with output.collect(k2, v2)


    Example

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Tokenize the line and emit (word, 1) for each token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }


    What is Writable?

    Hadoop defines its own box classes for strings

    (Text), integers (IntWritable), etc.

    All values are instances of Writable

    All keys are instances of WritableComparable
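
    As a concrete illustration (a minimal sketch, not from the lecture), a custom value type only needs the two serialization hooks of Writable; to be usable as a key it must also be comparable. The pair type below is hypothetical, and it assumes a Hadoop version in which WritableComparable is generic:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical key type: a (word, year) pair.
    public class WordYear implements WritableComparable<WordYear> {
      private String word = "";
      private int year;

      public void write(DataOutput out) throws IOException {    // serialize
        out.writeUTF(word);
        out.writeInt(year);
      }

      public void readFields(DataInput in) throws IOException { // deserialize
        word = in.readUTF();
        year = in.readInt();
      }

      public int compareTo(WordYear other) {                    // required for keys
        int c = word.compareTo(other.word);
        return (c != 0) ? c : (year - other.year);
      }
    }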


    Reading data

    Data sets are specified by InputFormats

    Defines input data (e.g., a directory)

    Identifies partitions of the data that form an InputSplit

    Factory for RecordReader objects to extract (k, v) records

    from the input source


    FileInputFormat and friends

    TextInputFormat: treats each \n-terminated line of a file as a value

    KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"

    SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata

    SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())


    Filtering File Inputs

    FileInputFormat will read all files out of a

    specified directory and send them to the mapper

    Delegates filtering this file list to a method

    subclasses may override

    e.g., Create your own xyzFileInputFormat to read *.xyz

    from directory list
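
    To make the idea concrete, here is a minimal sketch of a filter that accepts only *.xyz files. The PathFilter interface is standard Hadoop; how you hook it in (FileInputFormat.setInputPathFilter vs. overriding the protected listing method) depends on your Hadoop version, so treat the commented registration line as an assumption:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Accept only files whose names end in ".xyz".
    public class XyzPathFilter implements PathFilter {
      public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
      }
    }

    // Assumed registration API (check your Hadoop version):
    // FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);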


    Record Readers

    Without a RecordReader, Hadoop would be forced to divide input on byte boundaries.

    Each InputFormat provides its own RecordReader implementation

    Provides capability multiplexing

    LineRecordReader: reads a line from a text file

    KeyValueRecordReader: used by KeyValueTextInputFormat


    Input Split Size

    FileInputFormat will divide large files into chunks

    Exact size controlled by mapred.min.split.size

    RecordReaders receive file, offset, and length of

    chunk

    Custom InputFormat implementations may override the split size, e.g., NeverChunkFile
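
    A minimal sketch of the "NeverChunkFile" idea: subclass an existing input format and refuse to split, so each file becomes exactly one map task. isSplitable is the same protected hook that the ParagraphInputFormat later in this lecture overrides; the class name here is illustrative:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Every input file is handed to a single mapper, regardless of its size.
    public class NeverChunkFileInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }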


    Sending Data To Reducers

    Map function receives an OutputCollector object

    OutputCollector.collect() takes (k, v) elements

    Any (WritableComparable, Writable) can be used


    WritableComparator

    Compares WritableComparable data

    Will call WritableComparable.compareTo()

    Can provide fast path for serialized data

    Explicitly stated in JobConf setup

    JobConf.setOutputValueGroupingComparator()
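
    For illustration, a sketch of a grouping comparator that treats Text keys case-insensitively (the class name is illustrative, and the raw-bytes fast path is omitted for brevity):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class CaseInsensitiveTextComparator extends WritableComparator {
      public CaseInsensitiveTextComparator() {
        super(Text.class, true);   // true: deserialize keys so compare() sees Text objects
      }

      public int compare(WritableComparable a, WritableComparable b) {
        return a.toString().compareToIgnoreCase(b.toString());
      }
    }

    // Stated explicitly in the JobConf setup:
    // conf.setOutputValueGroupingComparator(CaseInsensitiveTextComparator.class);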


    Sending Data To The Client

    Reporter object sent to Mapper allows simple

    asynchronous feedback

    incrCounter(Enum key, long amount)

    setStatus(String msg)

    Allows self-identification of input

    InputSplit getInputSplit()
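
    A minimal sketch of using the Reporter inside map() (illustrative mapper; it assumes a FileInputFormat-based job so the InputSplit can be cast to FileSplit):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReportingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Self-identification: which split of which file are we working on?
        FileSplit split = (FileSplit) reporter.getInputSplit();
        // Asynchronous feedback shown in the job status.
        reporter.setStatus("processing " + split.getPath() + " at offset " + key.get());
        // ... normal map logic would go here ...
      }
    }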


    Partitioner

    int getPartition(key, val, numPartitions)

    Outputs the partition number for a given key

    One partition == values sent to one Reduce task

    HashPartitioner used by default

    Uses key.hashCode() to return partition num

    JobConf sets the Partitioner implementation
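
    For illustration, a sketch of a custom Partitioner (hypothetical class) that sends all n-grams sharing the same first word to the same reduce task:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstWordPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf conf) { }   // Partitioner extends JobConfigurable

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String firstWord = key.toString().split(" ")[0];
        // Mask off the sign bit so the result is non-negative.
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // Registered via the JobConf:
    // conf.setPartitionerClass(FirstWordPartitioner.class);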


    Reducer

    reduce( WritableComparable key,

    Iterator values,

    OutputCollector output,

    Reporter reporter)

    Keys & values sent to one partition all go to thesame reduce task

    Calls are sorted by key earlier keys are reduced and output before later keys


    Example

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        // Emit the total count for this word.
        output.collect(key, new IntWritable(sum));
      }
    }


    OutputFormat

    Analogous to InputFormat

    TextOutputFormat: writes "key val \n" strings to the output file

    SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs

    NullOutputFormat: discards output


    Presentation Overview

    Recall Hadoop

    Overview of the map-reduce paradigm

    Elaboration on the WordCount example

    components of Hadoop that make WordCount

    possible

    Major new example: N-Gram Generator

    step-by-step assembly of this map-reduce job

    Design questions to ask when creating your own

    Hadoop jobs


    Major example: N-Gram Generation

    N-Gram is a common natural language processing technique (used by Google, etc.)

    An N-Gram is a subsequence of N items in a given sequence (e.g., a subsequence of words in a given text)

    Example 3-grams (from Google) with

    corresponding occurrences

    ceramics collectables collectibles (55)

    ceramics collected by (52)

    ceramics collectibles cooking (45)


    Understanding the process

    Someone wise said, "A week of writing code saves an hour of research."

    Before embarking on developing a Hadoop job,

    walk through the process step by step manuallyand understand the flow and manipulation of

    data.

    Once you can comfortably (and deterministically!)do it mentally, begin writing code.


    Requirements

    Input:

    a beginning word/phrase

    n-gram size (bigram, trigram, n-gram)

    the minimum number of occurrences (frequency)

    whether letter case matters

    Output: all possible n-grams that occur sufficiently

    frequently.


    High-level view of data flow

    Given: one or more files containing regular text.

    Look for the desired startword. If seen, take the

    next N-1 words and add the group to the

    database.

    Similarly to word count, find the number of

    occurrences of each N-gram.

    Remove those N-grams that do not occur frequently enough for our liking.


    Follow along

    The N-grams implementation exists and is ready for your perusal.

    Grab it:

    if you use Git revision control:

    git clone git://git.qnan.org/pmw/hadoop-ngram

    to get the files with your browser, go to:

    http://www.qnan.org/~pmw/software/hadoop-ngram

    We used Project Gutenberg ebooks as input.



    Follow along

    Start Hadoop

    bin/start-all.sh

    Grab the NGram code and build it:

    Type ant and all will be built

    Look at the README to see how to run it.

    Load some text files into your HDFS

    good source: http://www.gutenberg.org

    Run it yourself (or see me do it) before we

    proceed.



    Can we just use WordCount?

    We have the WordCount example that does a similar thing. But there are differences:

    We don't want to count the number of times our

    startword appears; we want to capture the

    subsequent words too.

    A more subtle problem is that wordcount maps one

    line at a time. That's a problem if we want 3-grams

    with startword of pillows in the book containing this:

    The guests stayed in the guest bedroom; the pillows were

    delightfully soft and had a faint scent of mint.

    Still, WordCount is a good foundation for our code.



    Steps we must perform

    Read our text in paragraphs rather than in discrete lines:

    RecordReader

    InputFormat

    Develop the mapper and reducer classes:

    first mapper: find startword, get the next N-1 words, and

    return

    first reducer: sum the number of occurrences of each N-

    gram

    second mapper: no action

    second reducer: discard N-grams that are too rare

    Driver program



    A new RecordReader

    Ours must implement RecordReader

    Contain certain functions: createKey(), createValue(),

    getPos(), getProgress(), next()

    Hadoop offers a LineRecordReader but no support for paragraphs

    We'll need a ParagraphRecordReader

    Use the Delegation Pattern instead of extending LineRecordReader. We couldn't extend it
    because it has private elements.

    Create new next() function

    public synchronized boolean next(LongWritable key, Text value) throws IOException {
      Text linevalue = new Text();
      boolean appended, gotsomething;
      boolean retval;
      byte space[] = {' '};

      value.clear();
      gotsomething = false;
      do {
        appended = false;
        // Delegate to the wrapped LineRecordReader (lrr) for the next line.
        retval = lrr.next(key, linevalue);
        if (retval) {
          if (linevalue.toString().length() > 0) {
            // Non-empty line: append it (plus a space) to the current paragraph.
            byte[] rawline = linevalue.getBytes();
            int rawlinelen = linevalue.getLength();
            value.append(rawline, 0, rawlinelen);
            value.append(space, 0, 1);
            appended = true;
          }
          // An empty line ends the paragraph: appended stays false and the loop exits.
          gotsomething = true;
        }
      } while (appended);
      //System.out.println("ParagraphRecordReader::next() returns "+gotsomething+" after setting value to: ["+value.toString()+"]");
      return gotsomething;
    }



    A new InputFormat

    Given to the JobTracker during execution

    getRecordReader method

    This is why we need a new InputFormat

    Must return our ParagraphRecordReader


    public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {

      private CompressionCodecFactory compressionCodecs = null;

      public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
      }

      protected boolean isSplitable(FileSystem fs, Path file) {
        return compressionCodecs.getCodec(file) == null;
      }

      public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
                                                              JobConf job, Reporter reporter)
          throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
      }
    }



    First stage: Find Mapper

    Define the startword at startup

    Each time map is called we parse an entire

    paragraph and output matching N-Grams

    Tell Reporter how far done we are to track

    progress

    Output like WordCount

    output.collect(ngram, new IntWritable(1));

    This last part is important... next slide explains.



    Importance of output.collect()

    Remember Hadoop's data type model:

    map: (K1, V1) → list(K2, V2)

    This means that for every single (K1, V1) tuple, the map stage can output zero, one,
    two, or any other number of tuples, and they don't have to match the input at all.

    Example:

    output.collect(ngram, new IntWritable(1));
    output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));



    Find Mapper

    Our mapper must have a configure() method

    We can pass primitives through JobConf

    public void configure(JobConf conf) {
      desiredPhrase = conf.get("mapper.desired-phrase");
      Nvalue = conf.getInt("mapper.N-value", 3);
      caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
    }
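
    Putting the two Find Mapper slides together, here is a minimal sketch (not the repository's exact code) of what the map() body could look like. It belongs in the same mapper class as the configure() above, uses the desiredPhrase, Nvalue, and caseSensitive fields, and for simplicity assumes a single-word start phrase:

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String target = caseSensitive ? desiredPhrase : desiredPhrase.toLowerCase();
      String paragraph = caseSensitive ? value.toString() : value.toString().toLowerCase();
      String[] words = paragraph.split("\\s+");

      for (int i = 0; i < words.length; i++) {
        if (!words[i].equals(target)) continue;
        if (i + Nvalue > words.length) {
          // Paragraph ended before N-1 further words were available: a partial N-gram
          // (this is the case counted on the Counters slide).
          continue;
        }
        StringBuilder ngram = new StringBuilder(words[i]);
        for (int j = 1; j < Nvalue; j++) {
          ngram.append(' ').append(words[i + j]);
        }
        output.collect(new Text(ngram.toString()), new IntWritable(1));
      }
    }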



    Find Reducer

    Like WordCount example

    Sum all the numbers matching our N-Gram

    Output



    Second stage: Prune Mapper

    Parse line from previous output and divide into Key/Value pairs

    Prune Reducer

    This way we can sort our elements by frequency

    If this N-Gram occurs fewer times than our

    minimum, trim it out
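
    A minimal sketch of how the second stage could look (the class names match the PruneJob_MapClass / PruneJob_ReduceClass names used in the driver slides below, but the bodies here are illustrative and imports are omitted for brevity):

    // Mapper: each input line from the Find job's output looks like "<ngram>\t<count>".
    public static class PruneJob_MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
          output.collect(new Text(parts[0]),
                         new IntWritable(Integer.parseInt(parts[1].trim())));
        }
      }
    }

    // Reducer: drop any N-gram whose total count is below the minimum frequency.
    public static class PruneJob_ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      private int minFreq;

      public void configure(JobConf conf) {
        minFreq = conf.getInt("reducer.min-freq", 1);
      }

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        if (sum >= minFreq) {
          output.collect(key, new IntWritable(sum));
        }
      }
    }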



    Piping data between M/R jobs

    How does the Find map/reduce job pass its results to the Prune map/reduce job?

    I create a temporary file within HDFS. This temporary file is used as the output of
    Find and the input of Prune.

    At the end, I delete the temporary file.
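
    A minimal sketch of that wiring in the driver (the way tempDir is named here is illustrative; the real code may construct it differently):

    // Inside the driver, before configuring the two jobs:
    Path tempDir = new Path("ngram-temp-" + new Random().nextInt(Integer.MAX_VALUE));

    // The later slides point Find's FileOutputFormat.setOutputPath(...) and
    // Prune's FileInputFormat.setInputPaths(...) at this same tempDir.

    // After both jobs have run:
    FileSystem.get(ngram_prune_conf).delete(tempDir, true);   // remove the intermediate data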



    Counters

    The N-Gram generator has one programmer-defined counter: the number of

    partial/incomplete N-grams. These occur when

    a paragraph ends before we can read N-1

    subsequent words.

    We can add as many counters as we want.
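
    A sketch of how such a counter could be declared and bumped (the enum name is illustrative; the identifier in the real code may differ):

    // In the job class: programmer-defined counters are simply members of an enum.
    public static enum NGramCounters { PARTIAL_NGRAMS }

    // In the Find mapper, when a paragraph ends before N-1 further words are available:
    reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);

    // The framework prints all counters when the job finishes, and the driver can read them:
    // RunningJob job = JobClient.runJob(ngram_find_conf);
    // long partial = job.getCounters().getCounter(NGramCounters.PARTIAL_NGRAMS);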



    JobConf

    We need to set everything up

    2 jobs executing in series: Find and Prune

    User inputs parameters

    Starting N-Gram word/phrase

    N-Gram size

    Minimum frequency for pruning

    JobConf ngram_find_conf = new JobConf(getConf(),

    NGram.class),

    ngram_prune_conf = new JobConf(getConf(), NGram.class);



    Find JobConf

    Now we can plug everything in:

    Also pass input parameters

    And point to our input and output files

    ngram_find_conf.setJobName("ngram-find");

    ngram_find_conf.setInputFormat(ParagraphInputFormat.class);

    ngram_find_conf.setOutputKeyClass(Text.class);

    ngram_find_conf.setOutputValueClass(IntWritable.class);

    ngram_find_conf.setMapperClass(FindJob_MapClass.class);
    ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);

    ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));

    ngram_find_conf.setInt("mapper.N-value", new Integer(other_args.get(3)).intValue());

    ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);

    FileInputFormat.setInputPaths(ngram_find_conf,

    other_args.get(0));

    FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);



    Prune JobConf

    Perform set up as before

    We need to point our inputs to the outputs of the

    previous job

    ngram_prune_conf.setJobName("ngram-prune");

    ngram_prune_conf.setInt("reducer.min-freq", min_freq);

    ngram_prune_conf.setOutputKeyClass(Text.class);

    ngram_prune_conf.setOutputValueClass(IntWritable.class);

    ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);

    ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);

    FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);

    FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));



    Execute Jobs

    Run as a blocking process with runJob

    Batch processing is done in series

    JobClient.runJob(ngram_find_conf);

    JobClient.runJob(ngram_prune_conf);



    Design questions to ask

    From where will my input come? InputFormat

    How is my input structured? RecordReader

    (There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)

    Mapper and Reducer classes

    Do Key (WritableComparable) and Value (Writable) classes exist?



    Design questions to ask

    Do I need to count anything while the job is in progress?

    Where is my output going?

    Executor class

    What information do my map/reduce classes need?

    Must I block, waiting for job completion?

    Set FileFormat?