Large-Scale Data Processing / Cloud Computing Lecture 2 – "Hello World" in Hadoop

Peng Bo, School of Information Science and Technology, Peking University

7/3/2014, http://net.pku.edu.cn/~course/cs402/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland
SEWM Group

CodeLab1

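• Difficulties encountered
– Not familiar with Java!
– Setting up the development and runtime environment? (Eclipse, Hadoop)
– Code from the guide fails to compile?
– Errors at run time?
– ...

• It seems the code given in the PDF doesn't work, while the code behind the "source code here" link does... but the results I get differ from the ones in the PDF...

• The method setInputPath(Path) is undefined for the type JobConf WordCount/srcWordCount.java line 21 1404272734726310. No idea why.

• It won't compile, please help:
– FileInputPath cannot be resolved
– FileOutputPath cannot be resolved
– What is going on here?

• Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2, ????????? This is the error I get when I run it.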

Java Programming for C/C++ Developers

Historical background

• The C programming language
– early 1970s
– UNIX

• The C++ programming language
– early 1980s
– object-oriented
– a wide variety of application programming

• The Java programming language
– early 1990s
– originally for consumer electronic devices
– enterprise application development

Java SDK

• Software Development Kit
– a group of command-line tools and packages that you will need to write and run Java programs
– base classes (library)

Working with the SDK

• Factorial
– input: a value given as a command-line argument
– output: the factorial of that number, or an exception

• Java Specification
– a source file that declares a public class must have exactly the same name as that class (plus the .java extension)
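A minimal sketch of the Factorial exercise, assuming the input is a single integer argument. The class name `Factorial`, the use of `long`, and the overflow check via `Math.multiplyExact` are illustrative choices, not taken from the original lab code.

```java
// Sketch only. The class is package-private so the snippet compiles under any
// file name; declared public, it would have to live in Factorial.java.
class Factorial {

    // Computes n! and throws instead of silently overflowing a long.
    static long factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative, got " + n);
        }
        long result = 1;
        for (int i = 2; i <= n; i++) {
            // Throws ArithmeticException once the product no longer fits in a long.
            result = Math.multiplyExact(result, (long) i);
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java Factorial <non-negative integer>");
            return;
        }
        // A non-numeric argument surfaces as a NumberFormatException.
        int n = Integer.parseInt(args[0]);
        System.out.println(n + "! = " + factorial(n));
    }
}
```

Compiled with `javac` and run as `java Factorial 5`, this prints `5! = 120`; an argument above 20 ends in an ArithmeticException rather than a wrong number.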

Execution Environment

Primitive data types

• char
– 16 bits
– Unicode character set
– escape sequences

Primitive data types

• Integer types
– signed
– exact size

• Floating-point types
– IEEE 754 floating-point values

Primitive data types

• The boolean type
– true, false
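The points on the primitive-type slides can be checked with a short program. This is a sketch; the class and method names (`PrimitiveDemo`, `codeUnit`, `wrapAround`, `tenthsAddUpExactly`) are made up for illustration.

```java
class PrimitiveDemo {

    // A char is a 16-bit unsigned code unit, so it widens to int without a cast.
    static int codeUnit(char c) {
        return c;
    }

    // Integer types are signed with a fixed size: overflow wraps around.
    static int wrapAround() {
        int max = Integer.MAX_VALUE;
        return max + 1;
    }

    // IEEE 754 doubles cannot represent 0.1, 0.2 or 0.3 exactly.
    static boolean tenthsAddUpExactly() {
        return 0.1 + 0.2 == 0.3;
    }

    public static void main(String[] args) {
        // Sizes are fixed by the language, not by the platform as in C/C++.
        System.out.println("byte:  " + Byte.SIZE + " bits");
        System.out.println("short: " + Short.SIZE + " bits");
        System.out.println("char:  " + Character.SIZE + " bits");
        System.out.println("int:   " + Integer.SIZE + " bits");
        System.out.println("long:  " + Long.SIZE + " bits");

        char letter = '\u0041';  // Unicode escape for the letter A
        System.out.println(letter + " has code " + codeUnit(letter));
        System.out.println("quote: \" tab:\t| backslash: \\");

        System.out.println("MAX_VALUE + 1 = " + wrapAround());
        System.out.println("0.1 + 0.2 == 0.3 ? " + tenthsAddUpExactly());

        boolean done = false;    // a boolean is not interchangeable with 0 or 1
        System.out.println("done = " + done);
    }
}
```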

Operators

• + is overloaded

• If you use the + operator with a String and another operand that is not a String, the other operand is converted into a String
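Because + is evaluated left to right, the position of the String operand decides when the conversion happens. A small sketch (class and method names are illustrative):

```java
class PlusDemo {

    // "Total: " + 1 is already a String, so 2 is appended as text.
    static String stringFirst() {
        return "Total: " + 1 + 2;
    }

    // 1 + 2 is integer addition; only the result is converted to a String.
    static String numbersFirst() {
        return 1 + 2 + " items";
    }

    public static void main(String[] args) {
        System.out.println(stringFirst());   // Total: 12
        System.out.println(numbersFirst());  // 3 items
    }
}
```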

C/C++ functions versus Java methods

• In Java terminology, functions are called methods.

• Methods can only be declared as members of a class; you can't define a method outside of a Java class

Arrays

• arrays are objects, so they are created using the new operator

• scores.length

• the bracket characters ([ ]) that are used to indicate arrays are bound to the array type, not the array name

• java.lang.ArrayIndexOutOfBoundsException
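The array points above, put together in one sketch. The `scores` name comes from the slide; everything else (`ArrayDemo`, `average`, `isOutOfBounds`) is hypothetical.

```java
class ArrayDemo {

    // length is a field of the array object, not a separate variable as in C.
    static double average(int[] scores) {
        int sum = 0;
        for (int i = 0; i < scores.length; i++) {
            sum += scores[i];
        }
        return (double) sum / scores.length;
    }

    // Every access is bounds-checked at run time.
    static boolean isOutOfBounds(int[] array, int index) {
        try {
            int ignored = array[index];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int[] scores = new int[3];   // created with new; elements start at 0
        scores[0] = 90;
        scores[1] = 80;
        scores[2] = 70;

        int[] a, b;   // brackets on the type: both a and b are arrays
        int c[], d;   // C-style brackets on the name: only c is an array

        System.out.println("average = " + average(scores));
        System.out.println("index 3 out of bounds? " + isOutOfBounds(scores, 3));
    }
}
```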

Strings

• objects of the String class

• String objects are immutable

• identical string literals share a single String object

• String class has a rich interface
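A short sketch of what immutability and shared literals mean in practice, and why strings should be compared with equals() rather than ==. The names are illustrative.

```java
class StringDemo {

    // Two occurrences of the same literal refer to one shared String object.
    static boolean literalsShareObject() {
        String a = "hadoop";
        String b = "hadoop";
        return a == b;
    }

    // new String(...) always creates a distinct object with equal contents.
    static boolean newStringIsDistinctButEqual() {
        String a = "hadoop";
        String b = new String("hadoop");
        return a != b && a.equals(b);
    }

    // Methods such as toUpperCase() return a new String; the receiver is unchanged.
    static String shout(String s) {
        return s.toUpperCase();
    }

    public static void main(String[] args) {
        String original = "hadoop";
        String upper = shout(original);
        System.out.println(original + " -> " + upper);
        System.out.println("literals share object: " + literalsShareObject());
        System.out.println("new String distinct but equal: " + newStringIsDistinctButEqual());
    }
}
```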

The main() method

• a strict naming convention: public static void main(String[] args)

• the first element in the array is the first argument, not the name of the program
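Unlike argv[0] in C, args[0] is already the first real argument. A minimal sketch (the `EchoArgs` class and `describe` helper are hypothetical):

```java
class EchoArgs {

    // Summarises the argument array the JVM passes to main().
    static String describe(String[] args) {
        if (args.length == 0) {
            return "no arguments";
        }
        return "first argument: " + args[0] + ", count: " + args.length;
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Run as `java EchoArgs in.txt out`, this prints `first argument: in.txt, count: 2`.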

Other differences

• Pointers
– Java references are pointers to Java objects
– cannot be incremented or decremented
– no address-of operator

• Global variables
– no way to declare global variables (or methods)

• no struct, union, or typedef (enum was added in Java 5)

• Freely placed methods: methods may appear in any order within a class, with no forward declarations

• Garbage collection
– no malloc() and free()
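A sketch of how references behave compared with C pointers: they can alias the same object and be passed to methods, but they cannot be advanced or dereferenced by address. All names here are made up for illustration.

```java
class ReferenceDemo {

    static class Counter {
        int value;
    }

    // The method receives a copy of the reference, so it mutates the caller's object.
    static void increment(Counter c) {
        c.value++;
    }

    // Reassigning the parameter only changes the local copy of the reference.
    static void reassign(Counter c) {
        c = new Counter();
        c.value = 100;
    }

    static int demo() {
        Counter first = new Counter();  // allocated with new, never freed by hand
        Counter alias = first;          // two references, one object
        increment(alias);
        reassign(first);                // the temporary Counter becomes garbage
        return first.value;
    }

    public static void main(String[] args) {
        System.out.println("value after demo: " + demo());
    }
}
```

The object created inside reassign() is unreachable once the method returns, and the garbage collector is free to reclaim it.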

Defining a Java class

• Each member carries its own access modifier such as public or private (there are no C++-style "public:" sections)

• You don't use semicolons (;) after the closing braces in class and method definitions.

• The main() method is a member of the class

• You call the constructor using the new keyword
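The four points above in one small class. `Account` and its members are hypothetical; the slide's own example is not part of the transcript.

```java
class Account {

    private long balance;             // each member has its own modifier

    public Account(long openingBalance) {
        balance = openingBalance;
    }

    public void deposit(long amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        balance += amount;
    }

    public long getBalance() {
        return balance;
    }                                 // no semicolon after a method body

    // main() is an ordinary static member of the class.
    public static void main(String[] args) {
        Account account = new Account(100);  // constructor invoked with new
        account.deposit(50);
        System.out.println(account.getBalance());
    }
}                                     // no semicolon after the class body
```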

access modifiers

• public

• private

• protected

• package access

Inheritance

• extends

• super()

Overloading and overriding
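A sketch that combines extends and super() with the difference between overriding (same signature, replaced in the subclass) and overloading (same name, different parameter list). The `Shape` and `Rectangle` classes are invented for illustration.

```java
class Shape {
    private final String name;

    Shape(String name) {
        this.name = name;
    }

    double area() {
        return 0.0;
    }

    // Calls area() polymorphically: the subclass version runs if there is one.
    String describe() {
        return name + " with area " + area();
    }
}

class Rectangle extends Shape {
    private final double width;
    private final double height;

    Rectangle(double width, double height) {
        super("rectangle");           // invokes the Shape constructor
        this.width = width;
        this.height = height;
    }

    // Overriding: same name and parameter list as Shape.area().
    @Override
    double area() {
        return width * height;
    }

    // Overloading: same name, different parameter list.
    double area(double scale) {
        return area() * scale * scale;
    }
}

class InheritanceDemo {
    public static void main(String[] args) {
        Shape shape = new Rectangle(2, 3);
        System.out.println(shape.describe());
        System.out.println(new Rectangle(2, 3).area(2.0));
    }
}
```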

The Object class

• All Java classes are ultimately subclasses of class Object

• a centrally rooted class hierarchy

• usage
– toString()
– define data structures that take objects of class Object, so they can hold any Java object (vs. C++ templates)
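Both usages can be sketched briefly: overriding toString(), and a container typed as Object that accepts any Java object. `Point`, `ObjectDemo` and `joinAll` are hypothetical names.

```java
class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Overrides Object.toString(), which string conversion calls implicitly.
    @Override
    public String toString() {
        return "(" + x + ", " + y + ")";
    }
}

class ObjectDemo {

    // An Object[] can hold any mix of objects; each is printed via toString().
    static String joinAll(Object[] items) {
        StringBuilder sb = new StringBuilder();
        for (Object item : items) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(item);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Object[] box = { new Point(1, 2), "text", 42 };  // 42 is boxed to an Integer
        System.out.println(joinAll(box));

        // Getting a specific type back out requires a cast.
        Point p = (Point) box[0];
        System.out.println(p);
    }
}
```

The cast on the way out is the price of the Object-based approach; a C++ template (or Java generics) moves that check to compile time.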

Interfaces

• All interfaces are implicitly abstract

• All members of an interface are implicitly public

• All fields defined in an interface are implicitly static and final

• A Java class can extend only one class, but it can implement any number of interfaces

• Best practice for polymorphism
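A sketch of the interface rules above. `Greeter`, `Counted` and `PoliteGreeter` are made-up names; the point is that the caller depends only on the interface type.

```java
interface Greeter {
    String PREFIX = "Hello, ";      // implicitly public static final
    String greet(String name);      // implicitly public abstract
}

interface Counted {
    int count();
}

// One class may implement any number of interfaces.
class PoliteGreeter implements Greeter, Counted {
    private int calls;

    @Override
    public String greet(String name) {
        calls++;
        return PREFIX + name;
    }

    @Override
    public int count() {
        return calls;
    }
}

class InterfaceDemo {

    // Works with any Greeter implementation, present or future.
    static String welcome(Greeter greeter) {
        return greeter.greet("Hadoop");
    }

    public static void main(String[] args) {
        PoliteGreeter greeter = new PoliteGreeter();
        System.out.println(welcome(greeter));
        System.out.println("calls so far: " + greeter.count());
    }
}
```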

more on objects

• Inner classes and inner interfaces

• Anonymous classes and objects
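A sketch showing one inner class and one anonymous class. `Playlist` and `Cursor` are invented for illustration; `java.util.Comparator` and `Arrays.sort` are standard library.

```java
import java.util.Arrays;
import java.util.Comparator;

class Playlist {
    private final String[] titles;

    Playlist(String... titles) {
        this.titles = titles.clone();
    }

    // Inner class: every Cursor is tied to the Playlist instance that created it
    // and can read that instance's private fields.
    class Cursor {
        private int position = 0;

        boolean hasNext() {
            return position < titles.length;
        }

        String next() {
            return titles[position++];
        }
    }

    Cursor cursor() {
        return new Cursor();
    }

    // Anonymous class: a one-off Comparator defined and instantiated in place.
    String[] sortedByLength() {
        String[] copy = titles.clone();
        Arrays.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        Playlist playlist = new Playlist("ccc", "a", "bb");
        Cursor cursor = playlist.cursor();
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        System.out.println(Arrays.toString(playlist.sortedByLength()));
    }
}
```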

Using the Library (Java API)

• In the Java API, classes are grouped into packages

• you have already been using a class from a package that is imported by default, java.lang, when calling System.out.println()

• import java.util.ArrayList; or use the fully qualified name: java.util.ArrayList<xx> list = ....

Data Structures

• java.util.*

• Java generics
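A sketch of the java.util collections with generics, using both an import and a fully qualified name. The `CollectionsDemo` class and `longWords` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CollectionsDemo {

    // List<String> is checked at compile time: no casts when reading elements.
    static List<String> longWords(List<String> words, int minLength) {
        List<String> result = new ArrayList<String>();
        for (String word : words) {
            if (word.length() >= minLength) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<String>();
        words.add("map");
        words.add("reduce");
        words.add("hdfs");
        System.out.println(longWords(words, 4));

        // Without an import, the fully qualified name works as well.
        java.util.Map<String, Integer> lengths = new java.util.HashMap<String, Integer>();
        for (String word : words) {
            lengths.put(word, word.length());
        }
        System.out.println(lengths.get("reduce"));
    }
}
```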

Deploying your application

• A Java program is a bunch of classes.

• A JAR file is a Java Archive
– create a manifest.txt that states which class has the main() method
  • Main-Class: MyApp
– use the jar tool to package all class files and manifest.txt
– $ jar -cvmf manifest.txt app.jar *.class
– $ java -jar app.jar

Package

• put your classes in packages
– java.util, java.net, java.text ....

• preface your package with your reverse domain name

• set up a matching directory structure

References

• "Java programming for C/C++ developers"

• "Head First Java"

"Hello World" in Hadoop

What is MapReduce?

• Programming model for expressing distributed computations at a massive scale

• Execution framework for organizing and performing such computations

• Open-source implementation called Hadoop
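To make the programming model concrete, here is word count written as a map step, a grouping ("shuffle") step and a reduce step in plain Java. This is a single-process sketch that uses no Hadoop classes; `LocalWordCount` and its methods are hypothetical, and a real job would hand the grouping and distribution to the framework.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalWordCount {

    // map: one input line -> a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // reduce: one word plus all of its values -> the total count
    static int reduce(String word, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // "Shuffle": group every emitted value under its key.
        Map<String, List<Integer>> groups = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                      .add(pair.getValue());
            }
        }
        // Apply reduce once per distinct key.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

Running main prints `{hadoop=1, hello=2, world=1}`. Only map() and reduce() carry application logic; everything in run() is what an execution framework such as Hadoop provides at scale.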

Brief History of Hadoop

• Hadoop was created by Doug Cutting, the creator of Apache Lucene and Nutch

• 2003, Google published GFS

• 2004, Google published MapReduce

• 2005, Nutch was ported to MapReduce/HDFS

• 2006, Cutting joined Yahoo!

• 2008.1, Hadoop became a top-level project at Apache

• 2008.2, Hadoop ran on a 10,000-core cluster

Hadoop Release

New MapReduce API

• favors abstract classes over interfaces

• new API in org.apache.hadoop.mapreduce, old in org.apache.hadoop.mapred

• new Context class
– unifies the roles of JobConf, OutputCollector, Reporter

• new Job class
– takes over job control from JobClient

• reduce() method passes values
– new: java.lang.Iterable, for (VALUEIN value : values) { ... }
– old: java.util.Iterator, hasNext(), next()
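The difference in how reduce() receives its values can be shown without any Hadoop classes. The sketch below sums the same values both ways; `ReduceStyles` and its two methods are hypothetical stand-ins for the body of a reducer, not the actual Hadoop signatures.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReduceStyles {

    // Old-API style: values arrive as a java.util.Iterator and are pulled by hand.
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: values arrive as a java.lang.Iterable, so for-each works.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (Integer value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1);
        System.out.println(sumOldStyle(ones.iterator()));
        System.out.println(sumNewStyle(ones));
    }
}
```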

Hadoop Streaming & Pipes

• Streaming
– supports any programming language, even shell scripts
– uses standard input and output to communicate with the map and reduce code

• Pipes
– C++ interface to Hadoop MapReduce
– uses sockets as the communication channel
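For Streaming, a map program is just an executable that reads lines from standard input and writes records to standard output. The sketch below is a word-count mapper in that style; it assumes the usual Streaming convention of one record per line with a tab between key and value, and the class name `StreamingWordMapper` is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class StreamingWordMapper {

    // One output record: key, a tab, then the value.
    static String toRecord(String word) {
        return word + "\t1";
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        // Read input lines until end of stream, emitting one record per word.
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(toRecord(word));
                }
            }
        }
    }
}
```

The same contract is why a shell script or any other language can serve as the mapper or reducer.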

Hadoop Command

• docs in distribution
– api
– tutorial

• hadoop
– -conf xxx

Changping Cluster

• 28 nodes, 12 cores / 48 GB RAM / 10 TB disk
– NameNode/JobTracker server: changping11
– IP: 222.29.134.11
– HDFS port: 9000
– MapReduce port: 9001

How to use ChangpingCluster

• 1. Add a host name entry
– Windows: edit the file C:\WINDOWS\system32\drivers\etc\hosts
– Linux: edit /etc/hosts
– add the following line: 222.29.134.11 changping11

• Otherwise, running a job will report a name resolution error

How to use ChangpingCluster

• 2. Identity settings

• 1) Write all output files under the "/cs402/YourName" directory
– in code: FileOutputFormat.setOutputPath(conf, new Path("/cs402/YourName"));

• 2) In the Mapred Location settings, set hadoop.job.ugi = YourName, cs402
– the user name must match the name used in the file path above
– the group name must be cs402
– alternatively, set it directly in the driver program:
– Configuration conf = new Configuration();
– conf.set("hadoop.job.ugi", "YourName,cs402");

References

• Tom White, Hadoop: The Definitive Guide, 3rd ed., O'Reilly, May 2012.

Q&A