
Large-Scale Data Processing / Cloud Computing, Lecture 4 – Word Co-occurrence Matrix

彭波 (Peng Bo), School of Electronics Engineering and Computer Science, Peking University

7/10/2014    http://net.pku.edu.cn/~course/cs402/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland. SEWMGroup.

WordCount Review

Problems & Solutions

• Mac OS – notes on configuration under Mac OS X, by Xin Lv

• Eclipse – notes on connecting Eclipse 3.7 Indigo to Hadoop, by 朱瑜坚

• Linux – notes on manually configuring and running Hadoop under Linux, by Haoyan Huo

• VMPlayer – not available yet :)

Homework Submission

• What to hand in – Pack the ACCEPTED source code from the online evaluation into a single rar/tar.gz file, name it "assign1-YourPinYinName.rar" or "assign1-YourPinYinName.tar.gz", and email the package to our TA (cs402.pku AT gmail.com) with "CS40214-Assign1-YourPinYinName" as the subject.

Changping11 Usage Rules

• hadoop.job.ugi = YourName, cs402

• Input data locations: – /public – data you upload yourself goes in your personal directory

• Output data must go in your personal directory under /cs402 – /cs402/YourName – do not use the default /user/YourName

Streaming for Python Programmers

Hadoop Streaming

• Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

How Does Streaming Work

• Both the mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout.

• By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
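Any language works for a streaming mapper or reducer. As a minimal sketch, kept in Java for consistency with the rest of this deck's code (the class name is illustrative), a streaming mapper is just a program that reads lines from stdin and writes key<TAB>value lines to stdout:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Hypothetical streaming mapper: reads text from stdin, emits "word<TAB>1" per token.
    public class StreamingTokenMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {          // input arrives line by line on stdin
                for (String token : line.trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        System.out.println(token + "\t1");    // everything before the first tab is the key
                    }
                }
            }
        }
    }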

More Features

• Specifying Other Plugins for Jobs:
  – inputformat JavaClassName
  – outputformat JavaClassName
  – partitioner JavaClassName
  – combiner JavaClassName

• Specifying Additional Configuration Variables for Jobs
• Customizing the Way to Split Lines into Key/Value Pairs
• A Useful Partitioner Class
• A Useful Comparator Class
• Working with the Hadoop Aggregate Package

Debug in Hadoop

What Constitutes Progress in MapReduce?

• Hadoop will not fail a task that's making progress:
  – Reading an input record (in a mapper or reducer)
  – Writing an output record (in a mapper or reducer)
  – Setting the status message (using Context's setStatus() method)
  – Incrementing a counter (using Context's getCounter().increment() method)
  – Calling Reporter's progress() method

Counters & Status Message

• Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.

• Status messages describe what a task is currently doing and are visible in the web UI.
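A hedged sketch of both channels inside a mapper (the enum and class names here are hypothetical, not from the lecture):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that counts malformed records and reports status.
    public class QualityCheckMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        enum Quality { MALFORMED }  // hypothetical counter group

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().trim().isEmpty()) {
                context.getCounter(Quality.MALFORMED).increment(1);  // counter channel
                context.setStatus("skipped a malformed record");     // status message channel
                return;
            }
            context.write(value, new IntWritable(1));
        }
    }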

Hadoop Logs

• MapReduce task logs – Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr).
  – All are accessible through the web UI.

'wordcount': How does it work?

[Figure: MapReduce dataflow for word count – map tasks consume (k, v) pairs and emit intermediate pairs; "Shuffle and Sort" aggregates values by key (e.g., a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]); reduce tasks then produce the final (r, s) outputs.]

“Hello World”: Word Count
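The code on this slide did not survive extraction; as a sketch, the canonical Hadoop word count (new API) looks roughly like this:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Mapper: emit (word, 1) for every token on the line.
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the 1s for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum));
            }
        }
    }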


But, in a real system...

• How do we inject user code into a running system?
  – job submission
  – mapper & reducer class instantiation
  – reading/writing data in the mapper & reducer

Implementation in Hadoop

• job submission

• mapper & reducer class instantiation

• reading/writing data in the mapper & reducer

Hadoop Cluster


Job Submission Process

• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.

• Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.

• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID.
  – The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).

• Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
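From the user's side, all of these steps are triggered by a short driver program. A minimal sketch, reusing the WordCount classes sketched earlier (Job.getInstance is the newer API; older versions use new Job(conf) instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver; input/output paths come from the command line.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);            // the JAR shipped in step 3
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);    // submits the job (step 4) and waits
        }
    }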

InputSplits

• An input split is a chunk of the input that is processed by a single map.
  – Each map processes a single split.
  – Each split is divided into records, and the map processes each record (a key-value pair) in turn.
  – Splits and records are logical: nothing requires them to be tied to files.

InputFormat

• An InputFormat is responsible for creating the input splits and dividing them into records.

• Mapper's run() method drives this process (see the sketch below).
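The default run() method in the new-API Mapper looks roughly like this: it walks the records of one split, calling map() once per key-value pair:

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);                       // called once, before any records
        while (context.nextKeyValue()) {      // one iteration per record in this split
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);                     // called once, after the last record
    }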

InputFormat Class Hierarchy

Serialization

• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

• Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

• In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).

The Writable Interface

• public interface Writable {
      void write(DataOutput out) throws IOException;
      void readFields(DataInput in) throws IOException;
  }

• public interface WritableComparable<T> extends Writable, Comparable<T> { }
  – A Writable which is also Comparable: implementors supply public int compareTo(T o).
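For illustration, a minimal WritableComparable sketch in the spirit of the TextPair key used for the pairs approach later (a sketch, not the authoritative implementation):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // Sketch of a two-Text composite key, usable as (a, b) in the pairs approach.
    public class TextPair implements WritableComparable<TextPair> {
        private final Text first = new Text();
        private final Text second = new Text();

        public TextPair() { }                    // required no-arg constructor
        public TextPair(String f, String s) { first.set(f); second.set(s); }

        public Text getFirst() { return first; }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);                    // serialize both fields in order
            second.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);                // deserialize in the same order
            second.readFields(in);
        }

        @Override
        public int compareTo(TextPair other) {   // sort by first word, then second
            int cmp = first.compareTo(other.first);
            return cmp != 0 ? cmp : second.compareTo(other.second);
        }

        @Override
        public int hashCode() { return first.hashCode() * 163 + second.hashCode(); }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TextPair)) return false;
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }

        @Override
        public String toString() { return first + "\t" + second; }
    }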

Word Co-occurrence

Tasks

• Do word co-occurrence analysis on the Shakespeare Collection and the AP Collection, which are under the directories /public/Shakespeare and /public/AP of our sewm cluster (or your own virtual cluster). By default, your map function receives one line of text as input. (80 points)

• Try to optimize your program and find the fastest version. Describe your approaches and evaluation in your report. (20 points)

• Analyze the resulting data matrix and find something interesting. (10 points bonus)

• Write a report describing your approach to each task, the problems you met, etc.

co-occurrence

• Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.[1]

From Wikipedia, the free encyclopedia

Input Data

Pairs vs. Stripes

Stripes idea: group together pairs into an associative array.

• Each mapper takes a sentence:
  – Generate all co-occurring term pairs
  – For each term a, emit a → { b: count_b, c: count_c, d: count_d … }
• Reducers perform an element-wise sum of associative arrays.

Pairs representation:
  (a, b) → 1   (a, c) → 2   (a, d) → 5   (a, e) → 3   (a, f) → 2

Stripes representation:
  a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

Element-wise sum in the reducer:
    a → { b: 1, d: 5, e: 3 }
  + a → { b: 1, c: 2, d: 2, f: 2 }
  = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

Key idea: a cleverly-constructed data structure brings together partial results.


Pairs

• Customized KEY:
  – (a, b) as a TextPair that implements WritableComparable<>

• Customized Partitioner:
  – all (a, b), (a, c), (a, f), i.e., (a, *), go to the same reducer
  – the default partitioner, HashPartitioner, uses the key's hashCode() method
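A hedged sketch of both pieces, building on the TextPair sketch above (the co-occurrence window here is the whole line, and class names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Pairs mapper: emits ((a, b), 1) for every co-occurring pair on a line.
    public class PairsMapper extends Mapper<LongWritable, Text, TextPair, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] terms = value.toString().trim().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                for (int j = 0; j < terms.length; j++) {
                    if (i != j && !terms[i].isEmpty() && !terms[j].isEmpty()) {
                        context.write(new TextPair(terms[i], terms[j]), ONE);
                    }
                }
            }
        }
    }

    // Partitioner: route every (a, *) key by the left word so they meet at one reducer.
    class LeftWordPartitioner extends Partitioner<TextPair, IntWritable> {
        @Override
        public int getPartition(TextPair key, IntWritable value, int numPartitions) {
            return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }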

[Figure: MapReduce dataflow with combiners and partitioners – each map task's output first passes through a local combine step (e.g., c: 3 and c: 6 from one mapper combine into c: 9) and a partition step; "Shuffle and Sort" then aggregates values by key, and the reducers receive a → [1, 5], b → [2, 7], c → [2, 9, 8].]


Partitioner

• public abstract class Partitioner<KEY, VALUE> {
      public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  }

• job setup:
  job.setPartitionerClass(UserPartitioner.class);

Comparators

• Comparable<T>
  – compareTo()

• Comparator
  – RawComparator<>
  – WritableComparator

• job.setSortComparatorClass()
  – controls how map() output keys are sorted in the Mapper

• job.setGroupingComparatorClass()
  – controls how shuffled keys are grouped in the Reducer, i.e., which results are sent to a single reduce() call
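As a sketch (again building on the TextPair above), a grouping comparator that groups all (a, *) keys into one reduce() call might look like:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups TextPair keys by their first word only.
    public class FirstWordGroupingComparator extends WritableComparator {
        protected FirstWordGroupingComparator() {
            super(TextPair.class, true);          // true: instantiate keys for comparison
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            return ((TextPair) a).getFirst().compareTo(((TextPair) b).getFirst());
        }
    }

    // job setup (illustrative):
    // job.setGroupingComparatorClass(FirstWordGroupingComparator.class);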

Stripes

• Associative array
  – map → MapWritable?

• Caution:
  – make sure the JVM memory is big enough
  – set mapred.child.java.opts (200M by default)
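A minimal stripes sketch using MapWritable, with the memory caveat above in mind (class names are illustrative; a combiner could reuse the reducer logic):

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Stripes mapper: one associative array (MapWritable) per term per line.
    public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] terms = value.toString().trim().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                if (terms[i].isEmpty()) continue;
                MapWritable stripe = new MapWritable();
                for (int j = 0; j < terms.length; j++) {
                    if (i == j || terms[j].isEmpty()) continue;
                    Text neighbor = new Text(terms[j]);
                    IntWritable count = (IntWritable) stripe.get(neighbor);
                    if (count == null) stripe.put(neighbor, new IntWritable(1));
                    else count.set(count.get() + 1);
                }
                context.write(new Text(terms[i]), stripe);
            }
        }
    }

    // Stripes reducer: element-wise sum of all stripes for one term.
    class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
                throws IOException, InterruptedException {
            MapWritable sum = new MapWritable();
            for (MapWritable stripe : stripes) {
                for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                    Text neighbor = (Text) e.getKey();
                    int add = ((IntWritable) e.getValue()).get();
                    IntWritable existing = (IntWritable) sum.get(neighbor);
                    if (existing == null) sum.put(new Text(neighbor), new IntWritable(add));
                    else existing.set(existing.get() + add);
                }
            }
            context.write(key, sum);
        }
    }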

Q&A