Lecture 12, Hadoop



    Creating Map-Reduce Programs Using Hadoop


    Presentation Overview

    Recall Hadoop

    Overview of the map-reduce paradigm

    Elaboration on the WordCount example

    components of Hadoop that make WordCount possible

    Major new example: N-Gram Generator

    step-by-step assembly of this map-reduce job

    Design questions to ask when creating your own

    Hadoop jobs


    Recall why Hadoop rocks

    Hadoop is:

    Free and open source

    high quality, like all Apache Foundation projects

    cross-platform (pure Java)

    fault-tolerant

    highly scalable

    has bindings for non-Java programming languages

    applicable to many computational problems


    Map-Reduce System Overview

    JobTracker: makes scheduling decisions

    TaskTracker: manages tasks for a given node

    Task process: runs an individual map or reduce fragment for a given job

    Forks from the TaskTracker


    Map-Reduce System Overview

    Processes communicate by custom RPC

    implementation

    Easy to change/extend

    Defined as Java interfaces

    Server objects implement the interface

    Client proxy objects automatically created

    All messages originate at the client (e.g., Task to TaskTracker)

    Prevents cycles and therefore deadlocks


    Process Flow Diagram


    Application Overview

    Launching Program

    Creates a JobConf to define a job.

    Submits JobConf to JobTracker and waits for

    completion.

    Mapper

    Is given a stream of (key1, value1) pairs

    Generates a stream of (key2, value2) pairs

    Reducer

    Is given a key2 and a stream of value2s

    Generates a stream of (key3, value3) pairs


    Job Launch Process: Client

    Client program creates a JobConf

    Identify classes implementing Mapper and Reducer interfaces

    JobConf.setMapperClass(); JobConf.setReducerClass()

    Specify input and output formats

    JobConf.setInputFormat(TextInputFormat.class);

    JobConf.setOutputFormat(TextOutputFormat.class);

    Other options too:

    JobConf.setNumReduceTasks()

    JobConf.setOutputFormat()

    Many, many more (Facade pattern)


    An onslaught of terminology

    We'll explain these terms, each of which plays a

    role in any non-trivial map/reduce job:

    InputFormat, OutputFormat, FileInputFormat, ...

    JobClient and JobConf

    JobTracker and TaskTracker

    TaskRunner, MapTaskRunner, MapRunner,

    InputSplit, RecordReader, LineRecordReader, ...

    Writable, WritableComparable, IntWritable, ...


    InputFormat and OutputFormat

    The application also chooses input and output formats, which define how the persistent
    data is read and written. These are interfaces and can be defined by the application.

    InputFormat

    Splits the input to determine the input to each map

    task.

    Defines a RecordReader that reads key, value pairs that are passed to the map task

    OutputFormat

    Given the key, value pairs and a filename, writes the reduce output to persistent storage.


    Example

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }


    Job Launch Process: JobClient

    Pass JobConf to JobClient.runJob() or

    JobClient.submitJob()

    runJob() blocks (waits) until the job finishes

    submitJob() does not

    Poll for status to make running decisions

    Avoid polling with JobConf.setJobEndNotificationURI()

    JobClient:

    Determines proper division of input into InputSplits

    Sends job data to the master JobTracker server
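
    For concreteness, here is a minimal sketch of the two submission styles (not from the original slides; it assumes the old org.apache.hadoop.mapred API used throughout this lecture, and the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class SubmitStyles {
      // Blocking: runJob() submits the job and waits, printing progress until it finishes.
      static void runBlocking(JobConf conf) throws IOException {
        JobClient.runJob(conf);
      }

      // Non-blocking: submitJob() returns immediately; we poll the RunningJob for status.
      static void runPolling(JobConf conf) throws IOException, InterruptedException {
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        while (!job.isComplete()) {
          Thread.sleep(5000);   // make running/monitoring decisions here
        }
        System.out.println("Succeeded: " + job.isSuccessful());
      }
    }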


    Job Launch Process: JobTracker

    JobTracker:

    Inserts jar and JobConf (serialized to XML) in shared

    location

    Posts a JobInProgress to its run queue


    Job Launch Process: TaskTracker

    TaskTrackers running on slave nodes periodically query the JobTracker for work

    Retrieve job-specific jar and config

    Launch task in separate instance of Java

    main() is provided by Hadoop


    Job Launch Process: Task

    TaskTracker.Child.main():

    Sets up the child TaskInProgress attempt

    Reads XML configuration

    Connects back to necessary MapReduce components via RPC

    Uses TaskRunner to launch the user process


    Job Launch Process: TaskRunner

    TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper

    Task knows ahead of time which InputSplits it should be mapping

    Calls Mapper once for each record retrieved from the InputSplit

    Running the Reducer is much the same


    Creating the Mapper

    You provide the instance of Mapper

    Should extend MapReduceBase

    Implement interface Mapper

    One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress

    Exists in a separate process from all other instances of Mapper: no data sharing!


    Mapper

    Override function map()

    void map(WritableComparable key,

    Writable value,

    OutputCollector output,

    Reporter reporter)

    Emit (k2,v2) with output.collect(k2, v2)


    Example

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Tokenize the line and emit (word, 1) for each token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }


    What is Writable?

    Hadoop defines its own box classes for strings

    (Text), integers (IntWritable), etc.

    All values are instances of Writable

    All keys are instances of WritableComparable
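
    As a concrete illustration (a minimal sketch, not from the lecture), a custom value type only needs the two serialization hooks of Writable; to be usable as a key it must also be comparable. The pair type below is hypothetical, and it assumes a Hadoop version in which WritableComparable is generic:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical key type: a (word, year) pair.
    public class WordYear implements WritableComparable<WordYear> {
      private String word = "";
      private int year;

      public void write(DataOutput out) throws IOException {    // serialize
        out.writeUTF(word);
        out.writeInt(year);
      }

      public void readFields(DataInput in) throws IOException { // deserialize
        word = in.readUTF();
        year = in.readInt();
      }

      public int compareTo(WordYear other) {                    // required for keys
        int c = word.compareTo(other.word);
        return (c != 0) ? c : (year - other.year);
      }
    }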


    Reading data

    Data sets are specified by InputFormats

    Defines input data (e.g., a directory)

    Identifies partitions of the data that form an InputSplit

    Factory for RecordReader objects to extract (k, v) records

    from the input source


    FileInputFormat and friends

    TextInputFormat: treats each \n-terminated line of a file as a value

    KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"

    SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata

    SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())


    Filtering File Inputs

    FileInputFormat will read all files out of a

    specified directory and send them to the mapper

    Delegates filtering this file list to a method

    subclasses may override

    e.g., Create your own xyzFileInputFormat to read *.xyz

    from directory list
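
    To make the idea concrete, here is a minimal sketch of a filter that accepts only *.xyz files. The PathFilter interface is standard Hadoop; how you hook it in (FileInputFormat.setInputPathFilter vs. overriding the protected listing method) depends on your Hadoop version, so treat the commented registration line as an assumption:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Accept only files whose names end in ".xyz".
    public class XyzPathFilter implements PathFilter {
      public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
      }
    }

    // Assumed registration API (check your Hadoop version):
    // FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);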


    Record Readers

    Without a RecordReader, Hadoop would be forced to divide input on byte boundaries.

    Each InputFormat provides its own RecordReader implementation

    Provides capability multiplexing

    LineRecordReader: reads a line from a text file

    KeyValueRecordReader: used by KeyValueTextInputFormat


    Input Split Size

    FileInputFormat will divide large files into chunks

    Exact size controlled by mapred.min.split.size

    RecordReaders receive file, offset, and length of

    chunk

    Custom InputFormat implementations may override the split size, e.g., NeverChunkFile
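
    A minimal sketch of the "NeverChunkFile" idea: subclass an existing input format and refuse to split, so each file becomes exactly one map task. isSplitable is the same protected hook that the ParagraphInputFormat later in this lecture overrides; the class name here is illustrative:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Every input file is handed to a single mapper, regardless of its size.
    public class NeverChunkFileInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }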


    Sending Data To Reducers

    Map function receives an OutputCollector object

    OutputCollector.collect() takes (k, v) elements

    Any (WritableComparable, Writable) can be used


    WritableComparator

    Compares WritableComparable data

    Will call WritableComparable.compareTo()

    Can provide fast path for serialized data

    Explicitly stated in JobConf setup

    JobConf.setOutputValueGroupingComparator()
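
    For illustration, a sketch of a grouping comparator that treats Text keys case-insensitively (the class name is illustrative, and the raw-bytes fast path is omitted for brevity):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class CaseInsensitiveTextComparator extends WritableComparator {
      public CaseInsensitiveTextComparator() {
        super(Text.class, true);   // true: deserialize keys so compare() sees Text objects
      }

      public int compare(WritableComparable a, WritableComparable b) {
        return a.toString().compareToIgnoreCase(b.toString());
      }
    }

    // Stated explicitly in the JobConf setup:
    // conf.setOutputValueGroupingComparator(CaseInsensitiveTextComparator.class);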


    Sending Data To The Client

    Reporter object sent to Mapper allows simple

    asynchronous feedback

    incrCounter(Enum key, long amount)

    setStatus(String msg)

    Allows self-identification of input

    InputSplit getInputSplit()
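
    A minimal sketch of using the Reporter inside map() (illustrative mapper; it assumes a FileInputFormat-based job so the InputSplit can be cast to FileSplit):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReportingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Self-identification: which split of which file are we working on?
        FileSplit split = (FileSplit) reporter.getInputSplit();
        // Asynchronous feedback shown in the job status.
        reporter.setStatus("processing " + split.getPath() + " at offset " + key.get());
        // ... normal map logic would go here ...
      }
    }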


    Partitioner

    int getPartition(key, val, numPartitions)

    Outputs the partition number for a given key

    One partition == values sent to one Reduce task

    HashPartitioner used by default

    Uses key.hashCode() to return partition num

    JobConf sets the Partitioner implementation
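
    For illustration, a sketch of a custom Partitioner (hypothetical class) that sends all n-grams sharing the same first word to the same reduce task:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstWordPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf conf) { }   // Partitioner extends JobConfigurable

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String firstWord = key.toString().split(" ")[0];
        // Mask off the sign bit so the result is non-negative.
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // Registered via the JobConf:
    // conf.setPartitionerClass(FirstWordPartitioner.class);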


    Reducer

    reduce( WritableComparable key,

    Iterator values,

    OutputCollector output,

    Reporter reporter)

    Keys & values sent to one partition all go to thesame reduce task

    Calls are sorted by key earlier keys are reduced and output before later keys


    Example

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        // Emit the total count for this word.
        output.collect(key, new IntWritable(sum));
      }
    }


    OutputFormat

    Analogous to InputFormat

    TextOutputFormat: writes "key val \n" strings to the output file

    SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs

    NullOutputFormat: discards output


    Presentation Overview

    Recall Hadoop

    Overview of the map-reduce paradigm

    Elaboration on the WordCount example

    components of Hadoop that make WordCount

    possible

    Major new example: N-Gram Generator

    step-by-step assembly of this map-reduce job

    Design questions to ask when creating your own

    Hadoop jobs


    Major example: N-Gram Generation

    N-Gram is a common natural language processing technique (used by Google, etc.)

    An N-Gram is a subsequence of N items in a given sequence (e.g., a subsequence of words in a given text)

    Example 3-grams (from Google) with

    corresponding occurrences

    ceramics collectables collectibles (55)

    ceramics collected by (52)

    ceramics collectibles cooking (45)


    Understanding the process

    Someone wise said, "A week of writing code saves an hour of research."

    Before embarking on developing a Hadoop job,

    walk through the process step by step manuallyand understand the flow and manipulation of

    data.

    Once you can comfortably (and deterministically!)do it mentally, begin writing code.


    Requirements

    Input:

    a beginning word/phrase

    n-gram size (bigram, trigram, n-gram)

    the minimum number of occurrences (frequency)

    whether letter case matters

    Output: all possible n-grams that occur sufficiently

    frequently.


    High-level view of data flow

    Given: one or more files containing regular text.

    Look for the desired startword. If seen, take the

    next N-1 words and add the group to the

    database.

    Similarly to word count, find the number of

    occurrences of each N-gram.

    Remove those N-grams that do not occur frequently enough for our liking.


    Follow along

    The N-grams implementation exists and is ready for your perusal.

    Grab it:

    if you use Git revision control:

    git clone git://git.qnan.org/pmw/hadoop-ngram

    to get the files with your browser, go to:

    http://www.qnan.org/~pmw/software/hadoop-ngram

    We used Project Gutenberg ebooks as input.



    Follow along

    Start Hadoop

    bin/start-all.sh

    Grab the NGram code and build it:

    Type ant and all will be built

    Look at the README to see how to run it.

    Load some text files into your HDFS

    good source: http://www.gutenberg.org

    Run it yourself (or see me do it) before we

    proceed.



    Can we just use WordCount?

    We have the WordCount example that does a similar thing. But there are differences:

    We don't want to count the number of times our

    startword appears; we want to capture the

    subsequent words too.

    A more subtle problem is that wordcount maps one

    line at a time. That's a problem if we want 3-grams

    with startword of pillows in the book containing this:

    The guests stayed in the guest bedroom; the pillows were

    delightfully soft and had a faint scent of mint.

    Still, WordCount is a good foundation for our code.



    Steps we must perform

    Read our text in paragraphs rather than in discrete lines:

    RecordReader

    InputFormat

    Develop the mapper and reducer classes:

    first mapper: find startword, get the next N-1 words, and

    return

    first reducer: sum the number of occurrences of each N-

    gram

    second mapper: no action

    second reducer: discard N-grams that are too rare

    Driver program



    A new RecordReader

    Ours must implement RecordReader

    Contain certain functions: createKey(), createValue(),

    getPos(), getProgress(), next()

    Hadoop offers a LineRecordReader but no support for paragraphs

    We'll need a ParagraphRecordReader

    Use the Delegation Pattern instead of extending LineRecordReader. We couldn't extend it
    because it has private elements.

    Create new next() function

    public synchronized boolean next(LongWritable key, Text value) throws IOException {
      Text linevalue = new Text();
      boolean appended, gotsomething;
      boolean retval;
      byte space[] = {' '};

      value.clear();
      gotsomething = false;
      do {
        appended = false;
        // Delegate to the wrapped LineRecordReader (lrr) for the next line.
        retval = lrr.next(key, linevalue);
        if (retval) {
          if (linevalue.toString().length() > 0) {
            // Non-empty line: append it (plus a space) to the current paragraph.
            byte[] rawline = linevalue.getBytes();
            int rawlinelen = linevalue.getLength();
            value.append(rawline, 0, rawlinelen);
            value.append(space, 0, 1);
            appended = true;
          }
          // An empty line ends the paragraph: appended stays false and the loop exits.
          gotsomething = true;
        }
      } while (appended);
      //System.out.println("ParagraphRecordReader::next() returns "+gotsomething+" after setting value to: ["+value.toString()+"]");
      return gotsomething;
    }



    A new InputFormat

    Given to the JobTracker during execution

    getRecordReader method

    This is why we need a new InputFormat

    Must return our ParagraphRecordReader


    public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {

      private CompressionCodecFactory compressionCodecs = null;

      public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
      }

      protected boolean isSplitable(FileSystem fs, Path file) {
        return compressionCodecs.getCodec(file) == null;
      }

      public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
                                                              JobConf job, Reporter reporter)
          throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
      }
    }



    First stage: Find Mapper

    Define the startword at startup

    Each time map is called we parse an entire

    paragraph and output matching N-Grams

    Tell Reporter how far done we are to track

    progress

    Output like WordCount

    output.collect(ngram, new IntWritable(1));

    This last part is important... next slide explains.



    Importance of output.collect()

    Remember Hadoop's data type model:

    map: (K1, V1) → list(K2, V2)

    This means that for every single (K1, V1) tuple, the map stage can output zero, one,
    two, or any other number of tuples, and they don't have to match the input at all.

    Example:

    output.collect(ngram, new IntWritable(1));
    output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));



    Find Mapper

    Our mapper must have a configure() method

    We can pass primitives through JobConf

    public void configure(JobConf conf) {
      desiredPhrase = conf.get("mapper.desired-phrase");
      Nvalue = conf.getInt("mapper.N-value", 3);
      caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
    }
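
    Putting the two Find Mapper slides together, here is a minimal sketch (not the repository's exact code) of what the map() body could look like. It belongs in the same mapper class as the configure() above, uses the desiredPhrase, Nvalue, and caseSensitive fields, and for simplicity assumes a single-word start phrase:

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String target = caseSensitive ? desiredPhrase : desiredPhrase.toLowerCase();
      String paragraph = caseSensitive ? value.toString() : value.toString().toLowerCase();
      String[] words = paragraph.split("\\s+");

      for (int i = 0; i < words.length; i++) {
        if (!words[i].equals(target)) continue;
        if (i + Nvalue > words.length) {
          // Paragraph ended before N-1 further words were available: a partial N-gram
          // (this is the case counted on the Counters slide).
          continue;
        }
        StringBuilder ngram = new StringBuilder(words[i]);
        for (int j = 1; j < Nvalue; j++) {
          ngram.append(' ').append(words[i + j]);
        }
        output.collect(new Text(ngram.toString()), new IntWritable(1));
      }
    }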



    Find Reducer

    Like WordCount example

    Sum all the numbers matching our N-Gram

    Output



    Second stage: Prune Mapper

    Parse line from previous output and divide into Key/Value pairs

    Prune Reducer

    This way we can sort our elements by frequency

    If this N-Gram occurs fewer times than our

    minimum, trim it out
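
    A minimal sketch of how the second stage could look (the class names match the PruneJob_MapClass / PruneJob_ReduceClass names used in the driver slides below, but the bodies here are illustrative and imports are omitted for brevity):

    // Mapper: each input line from the Find job's output looks like "<ngram>\t<count>".
    public static class PruneJob_MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
          output.collect(new Text(parts[0]),
                         new IntWritable(Integer.parseInt(parts[1].trim())));
        }
      }
    }

    // Reducer: drop any N-gram whose total count is below the minimum frequency.
    public static class PruneJob_ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      private int minFreq;

      public void configure(JobConf conf) {
        minFreq = conf.getInt("reducer.min-freq", 1);
      }

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        if (sum >= minFreq) {
          output.collect(key, new IntWritable(sum));
        }
      }
    }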



    Piping data between M/R jobs

    How does the Find map/reduce job pass its results to the Prune map/reduce job?

    I create a temporary file within HDFS. This temporary file is used as the output of
    Find and the input of Prune.

    At the end, I delete the temporary file.
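
    A minimal sketch of that wiring in the driver (the way tempDir is named here is illustrative; the real code may construct it differently):

    // Inside the driver, before configuring the two jobs:
    Path tempDir = new Path("ngram-temp-" + new Random().nextInt(Integer.MAX_VALUE));

    // The later slides point Find's FileOutputFormat.setOutputPath(...) and
    // Prune's FileInputFormat.setInputPaths(...) at this same tempDir.

    // After both jobs have run:
    FileSystem.get(ngram_prune_conf).delete(tempDir, true);   // remove the intermediate data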



    Counters

    The N-Gram generator has one programmer-defined counter: the number of

    partial/incomplete N-grams. These occur when

    a paragraph ends before we can read N-1

    subsequent words.

    We can add as many counters as we want.
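
    A sketch of how such a counter could be declared and bumped (the enum name is illustrative; the identifier in the real code may differ):

    // In the job class: programmer-defined counters are simply members of an enum.
    public static enum NGramCounters { PARTIAL_NGRAMS }

    // In the Find mapper, when a paragraph ends before N-1 further words are available:
    reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);

    // The framework prints all counters when the job finishes, and the driver can read them:
    // RunningJob job = JobClient.runJob(ngram_find_conf);
    // long partial = job.getCounters().getCounter(NGramCounters.PARTIAL_NGRAMS);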



    JobConf

    We need to set everything up

    2 jobs executing in series: Find and Prune

    User inputs parameters

    Starting N-Gram word/phrase

    N-Gram size

    Minimum frequency for pruning

    JobConf ngram_find_conf = new JobConf(getConf(),

    NGram.class),

    ngram_prune_conf = new JobConf(getConf(), NGram.class);



    Find JobConf

    Now we can plug everything in:

    Also pass input parameters

    And point to our input and output files

    ngram_find_conf.setJobName("ngram-find");

    ngram_find_conf.setInputFormat(ParagraphInputFormat.class);

    ngram_find_conf.setOutputKeyClass(Text.class);

    ngram_find_conf.setOutputValueClass(IntWritable.class);

    ngram_find_conf.setMapperClass(FindJob_MapClass.class);
    ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);

    ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));

    ngram_find_conf.setInt("mapper.N-value", new Integer(other_args.get(3)).intValue());

    ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);

    FileInputFormat.setInputPaths(ngram_find_conf,

    other_args.get(0));

    FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);



    Prune JobConf

    Perform set up as before

    We need to point our inputs to the outputs of the

    previous job

    ngram_prune_conf.setJobName("ngram-prune");

    ngram_prune_conf.setInt("reducer.min-freq", min_freq);

    ngram_prune_conf.setOutputKeyClass(Text.class);

    ngram_prune_conf.setOutputValueClass(IntWritable.class);

    ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);

    ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);

    FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);

    FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));



    Execute Jobs

    Run as a blocking process with runJob

    Batch processing is done in series

    JobClient.runJob(ngram_find_conf);

    JobClient.runJob(ngram_prune_conf);



    Design questions to ask

    From where will my input come? InputFormat

    How is my input structured? RecordReader

    (There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)

    Mapper and Reducer classes

    Do Key (WritableComparable) and Value (Writable) classes exist?



    Design questions to ask

    Do I need to count anything while the job is in progress?

    Where is my output going?

    Executor class

    What information do my map/reduce classes need?

    Must I block, waiting for job completion?

    Set FileFormat?