Creating Map-Reduce Programs Using Hadoop
Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
components of Hadoop that make WordCount possible
Major new example: N-Gram Generator
step-by-step assembly of this map-reduce job
Design questions to ask when creating your own
Hadoop jobs
Recall why Hadoop rocks
Hadoop is:
Free and open source
high quality, like all Apache Foundation projects
cross-platform (pure Java)
fault-tolerant
highly scalable
has bindings for non-Java programming languages
applicable to many computational problems
Map-Reduce System Overview
JobTracker: makes scheduling decisions
TaskTracker: manages tasks for a given node
Task process: runs an individual map or reduce fragment for a given job
Forks from the TaskTracker
Map-Reduce System Overview
Processes communicate by custom RPC
implementation
Easy to change/extend
Defined as Java interfaces
Server objects implement the interface
Client proxy objects automatically created
All messages originate at the client (e.g., Task to TaskTracker)
Prevents cycles and therefore deadlocks
Process Flow Diagram
Application Overview
Launching Program
Creates a JobConf to define a job.
Submits JobConf to JobTracker and waits for
completion.
Mapper
Is given a stream of (key1, value1) pairs
Generates a stream of (key2, value2) pairs
Reducer
Is given a key2 and a stream of value2 values
Generates a stream of (key3, value3) pairs
Job Launch Process: Client
Client program creates a JobConf
Identify classes implementing Mapper and Reducer interfaces
JobConf.setMapperClass(); JobConf.setReducerClass()
Specify input and output formats
JobConf.setInputFormat(TextInputFormat.class);
JobConf.setOutputFormat(TextOutputFormat.class);
Other options too:
JobConf.setNumReduceTasks()
JobConf.setOutputFormat()
Many, many more (JobConf is a Facade pattern)
An onslaught of terminology
We'll explain these terms, each of which plays a
role in any non-trivial map/reduce job:
InputFormat, OutputFormat, FileInputFormat, ...
JobClient and JobConf
JobTracker and TaskTracker
TaskRunner, MapTaskRunner, MapRunner,
InputSplit, RecordReader, LineRecordReader, ...
Writable, WritableComparable, IntWritable, ...
InputFormat and OutputFormat
The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application.
InputFormat
Splits the input to determine the input to each map task.
Defines a RecordReader that reads key, value pairs that are passed to the map task
OutputFormat
Given the key, value pairs and a filename, writes the reduce task's output to a persistent store
Example
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Job Launch Process: JobClient
Pass JobConf to JobClient.runJob() or JobClient.submitJob()
runJob() blocks until the job finishes
submitJob() does not
Poll for status to make running decisions
Avoid polling with JobConf.setJobEndNotificationURI()
JobClient:
Determines proper division of input into InputSplits
Sends job data to the master JobTracker server
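For the non-blocking path, a minimal polling loop might look like the sketch below (classic org.apache.hadoop.mapred API; the five-second interval and error handling are illustrative choices):

// Inside a driver method that throws Exception:
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);   // returns immediately

while (!job.isComplete()) {
    // Make "running decisions" here, e.g., report progress to the user.
    System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
                      job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);
}
if (!job.isSuccessful()) {
    System.err.println("Job failed!");
}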
Job Launch Process: JobTracker
JobTracker:
Inserts jar and JobConf (serialized to XML) in shared
location
Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically query the JobTracker for work
Retrieve job-specific jar and config
Launch task in separate instance of Java
main() is provided by Hadoop
Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce components
via RPC
Uses TaskRunnerto launch user process
Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
Task knows ahead of time which InputSplits it should be mapping
Calls Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same
Creating the Mapper
You provide the instance of Mapper
Should extend MapReduceBase
Implement interface Mapper
One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
Exists in a separate process from all other instances of Mapper: no data sharing!
Mapper
Override function map()
void map(WritableComparable key,
Writable value,
OutputCollector output,
Reporter reporter)
Emit (k2,v2) with output.collect(k2, v2)
Example
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
What is Writable?
Hadoop defines its own box classes for strings
(Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
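If no built-in box class fits your data, you can write your own. Below is a minimal sketch of a hypothetical composite key (the WordPair name and its fields are illustrative, not part of Hadoop):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class WordPair implements WritableComparable<WordPair> {
    private Text first = new Text();
    private Text second = new Text();

    public void write(DataOutput out) throws IOException {    // serialize
        first.write(out);
        second.write(out);
    }
    public void readFields(DataInput in) throws IOException { // deserialize
        first.readFields(in);
        second.readFields(in);
    }
    public int compareTo(WordPair other) {                    // sort order for keys
        int cmp = first.compareTo(other.first);
        return (cmp != 0) ? cmp : second.compareTo(other.second);
    }
    public int hashCode() { // HashPartitioner uses this to pick a reduce task
        return first.hashCode() * 163 + second.hashCode();
    }
}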
Reading data
Data sets are specified by InputFormats
Defines input data (e.g., a directory)
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and friends
TextInputFormat: treats each \n-terminated line of a file as a value
KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"
SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata
SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())
Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them to the mapper
Delegates filtering this file list to a method subclasses may override
e.g., create your own xyzFileInputFormat to read *.xyz from a directory list (see the sketch below)
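A sketch of that idea, assuming your Hadoop version exposes the protected listStatus() hook in the old mapred API (newer releases also offer FileInputFormat.setInputPathFilter()):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical xyzFileInputFormat: keep only *.xyz files from the input dirs.
public class XyzFileInputFormat extends TextInputFormat {
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        List<FileStatus> keep = new ArrayList<FileStatus>();
        for (FileStatus f : super.listStatus(job))
            if (f.getPath().getName().endsWith(".xyz"))
                keep.add(f);
        return keep.toArray(new FileStatus[keep.size()]);
    }
}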
Record Readers
Without a RecordReader, Hadoop would be forced to divide input on byte boundaries.
Each InputFormat provides its own RecordReader implementation
Provides capability multiplexing
LineRecordReader: reads a line from a text file
KeyValueRecordReader: used by KeyValueTextInputFormat
Input Split Size
FileInputFormat will divide large files into chunks
Exact size controlled by mapred.min.split.size
RecordReaders receive file, offset, and length of chunk
Custom InputFormat implementations may override split size, e.g., NeverChunkFile (sketched below)
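The hypothetical NeverChunkFile could be as small as this sketch (old mapred API, where isSplitable() is the hook; compare the ParagraphInputFormat later in these slides):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NeverChunkFileInputFormat extends TextInputFormat {
    // One InputSplit per file, no matter how large the file is.
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}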
Sending Data To Reducers
Map function receives an OutputCollector object
OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) can be used
WritableComparator
Compares WritableComparable data
Will call WritableComparable.compareTo()
Can provide fast path for serialized data
Explicitly stated in JobConf setup
JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
Reporter object sent to Mapper allows simple
asynchronous feedback
incrCounter(Enum key, long amount)
setStatus(String msg)
Allows self-identification of input
InputSplit getInputSplit()
Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == values sent to one Reduce task
HashPartitioner used by default
Uses key.hashCode() to return partition num
JobConf sets the Partitioner implementation (sketch below)
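For illustration, a sketch of a custom Partitioner that routes N-grams by their first word only (a hypothetical class; HashPartitioner does the same thing over the whole key):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstWordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String firstWord = key.toString().split(" ")[0];
        // Mask off the sign bit so the result is never negative.
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Register it with conf.setPartitionerClass(FirstWordPartitioner.class).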
Reducer
reduce( WritableComparable key,
Iterator values,
OutputCollector output,
Reporter reporter)
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key: earlier keys are reduced and output before later keys
Example
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
OutputFormat
Analogous to InputFormat
TextOutputFormat: writes "key \t value \n" lines to an output file
SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
NullOutputFormat: discards output
Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
components of Hadoop that make WordCount
possible
Major new example: N-Gram Generator
step-by-step assembly of this map-reduce job
Design questions to ask when creating your own
Hadoop jobs
Major example: N-Gram Generation
N-Gram is a common natural language processing technique (used by Google, etc.)
An N-Gram is a subsequence of N items in a given sequence (e.g., a subsequence of words in a given text)
Example 3-grams (from Google) with
corresponding occurrences
ceramics collectables collectibles (55)
ceramics collected by (52)
ceramics collectibles cooking (45)
Understanding the process
Someone wise said, "A week of writing code saves an hour of research."
Before embarking on developing a Hadoop job, walk through the process step by step manually and understand the flow and manipulation of data.
Once you can comfortably (and deterministically!) do it mentally, begin writing code.
Requirements
Input:
a beginning word/phrase
n-gram size (bigram, trigram, n-gram)
the minimum number of occurrences (frequency)
whether letter case matters
Output: all possible n-grams that occur sufficiently
frequently.
High-level view of data flow
Given: one or more files containing regular text.
Look for the desired startword. If seen, take the
next N-1 words and add the group to the
database.
Similarly to word count, find the number of
occurrences of each N-gram.
Remove those N-grams that do not occur frequently enough for our liking.
Follow along
The N-grams implementation exists and is ready for your perusal.
Grab it:
if you use Git revision control:
git clone git://git.qnan.org/pmw/hadoop-ngram
to get the files with your browser, go to:
http://www.qnan.org/~pmw/software/hadoop-ngram
We used Project Gutenberg ebooks as input.
Follow along
Start Hadoop
bin/start-all.sh
Grab the NGram code and build it:
Type ant and all will be built
Look at the README to see how to run it.
Load some text files into your HDFS
good source: http://www.gutenberg.org
Run it yourself (or see me do it) before we
proceed.
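For example, loading a couple of Project Gutenberg files might look like this (file names are illustrative):

bin/hadoop fs -mkdir books
bin/hadoop fs -put pg2701.txt pg1342.txt books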
Can we just use WordCount?
We have the WordCount example that does a similar thing. But there are differences:
We don't want to count the number of times our
startword appears; we want to capture the
subsequent words too.
A more subtle problem is that WordCount maps one line at a time. That's a problem if we want 3-grams with a startword of "pillows" in the book containing this:
The guests stayed in the guest bedroom; the pillows were
delightfully soft and had a faint scent of mint.
Still, WordCount is a good foundation for our code.
Steps we must perform
Read our text in paragraphs rather than in discrete lines:
RecordReader
InputFormat
Develop the mapper and reducer classes:
first mapper: find startword, get the next N-1 words, and
return
first reducer: sum the number of occurrences of each N-
gram
second mapper: no action
second reducer: discard N-grams that are too rare
Driver program
A new RecordReader
Ours must implement RecordReader
Contain certain functions: createKey(), createValue(), getPos(), getProgress(), next()
Hadoop offers a LineRecordReader, but no support for paragraphs
We'll need a ParagraphRecordReader
Use the Delegation Pattern instead of extending LineRecordReader. (We couldn't extend it because it has private elements.)
Create a new next() function:

public synchronized boolean next(LongWritable key, Text value) throws IOException {
    Text linevalue = new Text();
    boolean appended, gotsomething;
    boolean retval;
    byte space[] = {' '};

    value.clear();
    gotsomething = false;
    do {
        appended = false;
        // Delegate to the wrapped LineRecordReader (lrr) for one line.
        retval = lrr.next(key, linevalue);
        if (retval) {
            if (linevalue.toString().length() > 0) {
                // Non-blank line: append it (plus a space) to the paragraph.
                byte[] rawline = linevalue.getBytes();
                int rawlinelen = linevalue.getLength();
                value.append(rawline, 0, rawlinelen);
                value.append(space, 0, 1);
                appended = true;
            }
            gotsomething = true;
        }
    } while (appended);   // a blank line (or EOF) ends the paragraph
    return gotsomething;
}
A new InputFormat
Given to the JobTracker during execution
getRecordReader method
This is why we need a custom InputFormat
Must return our ParagraphRecordReader
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {
    private CompressionCodecFactory compressionCodecs = null;

    public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
        // Never split compressed files; they can't be read mid-stream.
        return compressionCodecs.getCodec(file) == null;
    }

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
    }
}
First stage: Find Mapper
Define the startword at startup
Each time map is called we parse an entire
paragraph and output matching N-Grams
Tell Reporter how far done we are to track
progress
Output like WordCount
output.collect(ngram, new IntWritable(1));
This last part is important... next slide explains.
Importance of output.collect()
Remember Hadoop's data type model:

map: (K1, V1) → list(K2, V2)

This means that for every single (K1, V1) tuple, the map stage can output zero, one, two, or any other number of tuples, and they don't have to match the input at all.
Example:
output.collect(ngram, new IntWritable(1));
output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));
Find Mapper
Our mapper must have a configure() method
We can pass primitives through JobConf

public void configure(JobConf conf) {
    desiredPhrase = conf.get("mapper.desired-phrase");
    Nvalue = conf.getInt("mapper.N-value", 3);
    caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
}
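To make the flow concrete, here is a simplified sketch of what the Find mapper's map() might do. The repository's actual FindJob_MapClass differs in its details (tokenization, progress reporting), so treat the logic below as illustrative; NGramCounters is a hypothetical enum (see the Counters slide):

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    String paragraph = value.toString();
    if (!caseSensitive)  // assumes desiredPhrase was lower-cased in configure()
        paragraph = paragraph.toLowerCase();

    String[] words = paragraph.split("\\s+");
    for (int i = 0; i < words.length; i++) {
        if (!words[i].equals(desiredPhrase))
            continue;
        if (i + Nvalue > words.length) {
            // Paragraph ended before N-1 further words: partial N-gram.
            reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);
            continue;
        }
        // Join the startword and the next N-1 words into one N-gram key.
        StringBuilder ngram = new StringBuilder(words[i]);
        for (int j = 1; j < Nvalue; j++)
            ngram.append(' ').append(words[i + j]);
        output.collect(new Text(ngram.toString()), new IntWritable(1));
    }
}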
Find Reducer
Like the WordCount example
Sum all the counts matching each N-Gram
Output the (N-gram, total) pairs
Second stage: Prune Mapper
Parse each line from the previous output and divide it into key/value pairs
This way we can sort our elements by frequency
Prune Reducer
If this N-Gram occurs fewer times than our minimum, trim it out
(a sketch of both classes follows)
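A hedged sketch of both classes (the repository names them PruneJob_MapClass and PruneJob_ReduceClass; the bodies below are illustrative). The mapper re-parses the "ngram TAB count" lines that TextOutputFormat wrote in the Find stage:

public static class PruneJob_MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String[] parts = value.toString().split("\t");  // "ngram \t count"
        output.collect(new Text(parts[0]),
                       new IntWritable(Integer.parseInt(parts[1])));
    }
}

public static class PruneJob_ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    private int minFreq;

    public void configure(JobConf conf) {
        minFreq = conf.getInt("reducer.min-freq", 1);
    }

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext())
            sum += values.next().get();
        if (sum >= minFreq)                 // trim out rare N-grams
            output.collect(key, new IntWritable(sum));
    }
}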
Piping data between M/R jobs
How does the Find map/reduce job pass its results to the Prune map/reduce job?
I create a temporary file within HDFS. This temporary file is used as the output of Find and the input of Prune.
At the end, I delete the temporary file.
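In driver code, that wiring might look like this sketch (the tempDir naming and cleanup details are illustrative):

Path tempDir = new Path("ngram-temp-" + new Random().nextInt(Integer.MAX_VALUE));
FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);  // Find writes here
FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);  // Prune reads from here
try {
    JobClient.runJob(ngram_find_conf);
    JobClient.runJob(ngram_prune_conf);
} finally {
    // Remove the intermediate data whether or not the jobs succeeded.
    FileSystem.get(ngram_prune_conf).delete(tempDir, true);
}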
Counters
The N-Gram generator has one programmer-defined counter: the number of
partial/incomplete N-grams. These occur when
a paragraph ends before we can read N-1
subsequent words.
We can add as many counters as we want.
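Declaring and using a counter takes only a few lines (the enum name is hypothetical; any Java enum works):

// Counter declaration: any enum the framework can name.
public enum NGramCounters { PARTIAL_NGRAMS }

// In the mapper, when a paragraph ends before N-1 further words exist:
reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);

// After the job finishes, the driver can read the total back:
RunningJob finished = JobClient.runJob(ngram_find_conf);
long partial = finished.getCounters().getCounter(NGramCounters.PARTIAL_NGRAMS);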
JobConf
We need to set everything up: 2 jobs executing in series, Find and Prune
User inputs parameters
Starting N-Gram word/phrase
N-Gram size
Minimum frequency for pruning

JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
Find JobConf
Now we can plug everything in:
Also pass input parameters
And point to our input and output files
ngram_find_conf.setJobName("ngram-find");
ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
ngram_find_conf.setOutputKeyClass(Text.class);
ngram_find_conf.setOutputValueClass(IntWritable.class);
ngram_find_conf.setMapperClass(FindJob_MapClass.class);
ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);
ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));
ngram_find_conf.setInt("mapper.N-value", Integer.parseInt(other_args.get(3)));
ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);
FileInputFormat.setInputPaths(ngram_find_conf, other_args.get(0));
FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);
Prune JobConf
Perform set up as before
We need to point our inputs to the outputs of the
previous job
ngram_prune_conf.setJobName("ngram-prune");
ngram_prune_conf.setInt("reducer.min-freq", min_freq);
ngram_prune_conf.setOutputKeyClass(Text.class);
ngram_prune_conf.setOutputValueClass(IntWritable.class);
ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);
FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));
Execute Jobs
Run as a blocking process with runJob
Batch processing is done in series
JobClient.runJob(ngram_find_conf);
JobClient.runJob(ngram_prune_conf);
Design questions to ask
From where will my input come? InputFormat
How is my input structured? RecordReader
(There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)
Mapper and Reducer classes
Do Key (WritableComparable) and Value (Writable) classes exist?
Design questions to ask
Do I need to count anything while the job is in progress?
Where is my output going?
Executor class
What information do my map/reduce classes need?
Must I block, waiting for job completion?
Set FileFormat?