
Hadoop 2.x HDFS Cluster Installation (VirtualBox)


DESCRIPTION

This is a straightforward tutorial for those who are going to use HDFS in an academic environment on their notebooks or PCs.


Page 1: Hadoop 2.x  HDFS Cluster Installation (VirtualBox)

Distributed Data Processing Workshop

Shahid Beheshti Campus

Faculty of Computer Science and Engineering

Course: Distributed Databases

Instructor: Dr. Hadi Tabatabaei

Presenter: Abolfazl Sedighi, Azar 1393 (November/December 2014)


Apache Hadoop 2.x Cluster Installation

Amir Sedighi (@amirsedighi)

http://hexican.com

Dec 2014


References

● http://hadoop.apache.org/docs/r2.2.0/

● http://www.vasanthivuppuluri.com/hadoop/installing-hadoop-2-5-1-on-64-bit-ubuntu-14-01/

● https://sites.google.com/site/hadoopandhive/home


Topics

● Assumptions

● First Node

– Installing Java

– Downloading and Extracting Hadoop

– Hadoop and Java Env Variables

– Disabling IPv6

– Configuring Hadoop

● Cloning

● HDFS

– Starting HDFS

● HDFS Health

● FS Commands

● Reclaiming Space

● Reducing Replication Factor


Assumptions

● You already know about Linux.

– http://www.slideshare.net/AmirSedighi/distrinuted-data-processing-workshop-sbu


Installing Java

● $ sudo apt-get install default-jdk
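After the install it can help to confirm that the JDK is on the PATH and to find where it lives, since JAVA_HOME has to point at that directory later. The following is a minimal sketch that assumes a Debian/Ubuntu layout in which /usr/bin/javac is a symlink to the real binary; jdk_home_from is a hypothetical helper name.

```shell
# Derive a JDK home directory from the full path of a javac binary.
# jdk_home_from is a hypothetical helper, not part of any package.
jdk_home_from() {
  javac_path="$1"
  # Strip the trailing /bin/javac to get the JDK root.
  printf '%s\n' "${javac_path%/bin/javac}"
}

# Only query the system when a JDK is actually installed.
if command -v javac >/dev/null 2>&1; then
  java -version   # prints the installed Java version
  # readlink -f resolves the symlink chain to the real file.
  jdk_home_from "$(readlink -f "$(command -v javac)")"
fi
```

If the printed directory differs from the JAVA_HOME value used later in these slides, the printed one is probably the value to use on your machine.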


Downloading and Extracting

● http://hadoop.apache.org/releases.html

● $ tar -zxvf hadoop-2.2.0.tar.gz
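Older releases such as 2.2.0 are usually served from the Apache archive rather than the current mirrors. A small sketch of the download step, assuming wget is installed and that the archive keeps its usual directory layout; both helper names are hypothetical.

```shell
# Build the download URL for a given Hadoop release on the Apache archive.
# hadoop_archive_url is a hypothetical helper; check the resulting URL
# against the releases page before relying on it.
hadoop_archive_url() {
  version="$1"
  printf 'https://archive.apache.org/dist/hadoop/common/hadoop-%s/hadoop-%s.tar.gz\n' \
    "$version" "$version"
}

# Download and unpack into the current directory (not run automatically).
fetch_hadoop() {
  version="$1"
  wget "$(hadoop_archive_url "$version")" &&
    tar -zxvf "hadoop-$version.tar.gz"
}

# Example: fetch_hadoop 2.2.0
```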


Hadoop and Java Env Variables

● Append the following definitions to /etc/profile or ~/.bashrc

export HADOOP_PREFIX="/home/amir/hadoop-2.2.0"

export HADOOP_HOME=$HADOOP_PREFIX

export HADOOP_COMMON_HOME=$HADOOP_PREFIX

export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

export HADOOP_HDFS_HOME=$HADOOP_PREFIX

export HADOOP_MAPRED_HOME=$HADOOP_PREFIX

export HADOOP_YARN_HOME=$HADOOP_PREFIX

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

export JAVA_HOME=/usr/java/jdk1.7.0_55

export PATH=$PATH:$JAVA_HOME/bin:/home/amir/hadoop-2.2.0/bin:/home/amir/hadoop-2.2.0/sbin
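Once the definitions are in place and the shell has been reloaded, a quick check can catch a typo before Hadoop is started. This sketch only tests four of the variables above; missing_hadoop_vars is a hypothetical helper name.

```shell
# Print the name of every listed variable that is unset or empty.
# missing_hadoop_vars is a hypothetical helper name.
missing_hadoop_vars() {
  for name in HADOOP_PREFIX HADOOP_HOME HADOOP_CONF_DIR JAVA_HOME; do
    # eval reads the variable whose name is stored in $name (POSIX sh).
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# After `source ~/.bashrc`, no output means all four are set:
# missing_hadoop_vars
# hadoop version
```

If `hadoop version` prints the release number, the PATH entry is working as well.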


Disabling IPv6

● $ sudo nano /etc/sysctl.conf

# Disable IPv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1
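The settings take effect after a reboot, or after reloading the file with sysctl. A minimal sketch for checking the result, assuming the standard Linux procfs path; ipv6_disabled is a hypothetical helper name.

```shell
# Return success when the given flag file contains 1 (IPv6 disabled).
# ipv6_disabled is a hypothetical helper; the default path is the usual
# procfs location for the "all interfaces" setting.
ipv6_disabled() {
  flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
  [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}

# Reload /etc/sysctl.conf without rebooting, then check the result:
# sudo sysctl -p
# ipv6_disabled && echo "IPv6 is disabled"
```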


Hadoop Configuration

● You need to create or modify the following files inside hadoop/etc/hadoop:

– slaves

– core-site.xml

– yarn-site.xml

– hdfs-site.xml

– hadoop-env.sh


slaves

● List all DataNodes in the slaves file, one hostname per line:

slave1

slave2

slave3


Create the slaves file in the hadoop/etc/hadoop folder:

u01

u02

u03

u04

u05

u06

...


etc/hosts and hadoop/etc/hadoop/slaves
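Every hostname in the slaves file has to resolve on every node, which is what the /etc/hosts entries are for. The sketch below generates matching entries for both files. The u01, u02, ... names come from the slides; the 192.168.56.x addresses and both helper names are placeholders, so substitute the addresses your VirtualBox machines actually have.

```shell
# Print /etc/hosts lines for nodes u01..uNN on a hypothetical subnet.
# The 192.168.56.x addresses are placeholders, not values from the slides.
print_hosts_entries() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    printf '192.168.56.%d u%02d\n' "$((100 + i))" "$i"
    i=$((i + 1))
  done
}

# Print one hostname per line, as expected in the slaves file.
print_slaves() {
  print_hosts_entries "$1" | awk '{ print $2 }'
}

# Example (review the output before appending it anywhere):
# print_hosts_entries 3 | sudo tee -a /etc/hosts
# print_slaves 3 > "$HADOOP_CONF_DIR/slaves"
```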


core-site.xml

● Edit core-site.xml and apply the following:

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://u01/</value>

<description>NameNode URI</description>

</property>

</configuration>


yarn-site.xml

<configuration>

<property>

<name>yarn.resourcemanager.hostname</name>

<value>u01</value>

<description>The hostname of the RM.</description>

</property>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

</configuration>
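The Hadoop cluster setup documentation lists mapreduce.framework.name under mapred-site.xml rather than yarn-site.xml. If MapReduce jobs do not pick up the YARN setting, one option is to put the property in that file as well. A minimal sketch, with write_mapred_site as a hypothetical helper name:

```shell
# Write a minimal mapred-site.xml into the given configuration directory.
# write_mapred_site is a hypothetical helper name.
write_mapred_site() {
  conf_dir="$1"
  cat > "$conf_dir/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
}

# Example: write_mapred_site "$HADOOP_CONF_DIR"
```

This is not needed for the HDFS-only steps that follow.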



hdfs-site.xml

<configuration>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:///home/amir/hadoop-2.2.0/hdfs/datanode</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:///home/amir/hadoop-2.2.0/hdfs/namenode</value>

</property>

</configuration>
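Hadoop can usually create these directories itself, but creating them in advance makes permission problems visible early. A small sketch, with make_hdfs_dirs as a hypothetical helper name:

```shell
# Create the NameNode and DataNode directories under a Hadoop prefix.
# make_hdfs_dirs is a hypothetical helper name.
make_hdfs_dirs() {
  prefix="$1"
  mkdir -p "$prefix/hdfs/namenode" "$prefix/hdfs/datanode"
}

# Example, matching the paths used in hdfs-site.xml:
# make_hdfs_dirs /home/amir/hadoop-2.2.0
```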



hadoop-env.sh

● Add the following:

– export JAVA_HOME=/usr/java/jdk1.7.0_55


Reboot

● $ sudo reboot


Cloning

● Extend the cluster by cloning.

– NOTE: Find the instructions here:

● http://www.slideshare.net/AmirSedighi/distrinuted-data-processing-workshop-sbu


HDFS

● The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

● It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

● HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

● HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

● HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

● HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project.


HDFS Architecture


DataNodes


start-dfs.sh
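On a fresh cluster the NameNode directory normally has to be formatted once before the first start, a step the slide does not show. start-dfs.sh also connects over SSH to each host in the slaves file, so passwordless SSH from the master is assumed. The sketch below formats only when no metadata exists yet, because formatting erases any existing HDFS metadata; both helper names are hypothetical.

```shell
# Return success when the NameNode directory has not been formatted yet.
# A formatted directory contains current/VERSION.
# needs_format is a hypothetical helper name.
needs_format() {
  name_dir="$1"
  [ ! -f "$name_dir/current/VERSION" ]
}

# Format on first use only, then start the HDFS daemons (run on u01).
start_hdfs() {
  name_dir="$1"
  if needs_format "$name_dir"; then
    hdfs namenode -format || return 1
  fi
  start-dfs.sh
}

# Example: start_hdfs /home/amir/hadoop-2.2.0/hdfs/namenode
```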


HDFS Health

● $ jps

– NameNode

– DataNode

● Check log files

● Web UI

– http://u01:50070
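The jps check can be scripted so that a missing daemon stands out. This sketch reads jps-style "PID Name" lines on standard input and prints the expected names it did not find; missing_daemons is a hypothetical helper name.

```shell
# Print each expected daemon name that is absent from jps-style input.
# Reads "PID Name" lines on stdin; missing_daemons is a hypothetical helper.
missing_daemons() {
  listing="$(cat)"
  for daemon in "$@"; do
    # Compare the second column exactly, so SecondaryNameNode
    # is not mistaken for NameNode.
    printf '%s\n' "$listing" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      printf '%s\n' "$daemon"
  done
}

# On the master: jps | missing_daemons NameNode DataNode
# Cluster-wide summary of capacity and live DataNodes:
# hdfs dfsadmin -report
```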


HDFS Health, Live Nodes


Hadoop FS Commands

● cat

● chmod

● chown

● copyFromLocal

● copyToLocal

● cp

● du

● expunge

● get

● ls

● mkdir

● put

● rm

● tail
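Several of these commands can be combined into a quick round trip that confirms the cluster accepts reads and writes. A minimal sketch, assuming HDFS is running; hdfs_round_trip and the /workshop directory are hypothetical names.

```shell
# Round-trip a small file through HDFS using commands from the list.
# hdfs_round_trip and the /workshop path are hypothetical names.
hdfs_round_trip() {
  local_file="$1"
  name="$(basename "$local_file")"
  hadoop fs -mkdir -p /workshop &&
    hadoop fs -put "$local_file" /workshop/ &&
    hadoop fs -ls /workshop &&
    hadoop fs -cat "/workshop/$name" &&
    hadoop fs -get "/workshop/$name" "$local_file.copy" &&
    hadoop fs -rm "/workshop/$name"
}

# Example:
# echo "hello hdfs" > /tmp/hello.txt && hdfs_round_trip /tmp/hello.txt
```

The chain stops at the first failing command, which points at the step to investigate.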



Space Reclamation

● Delete Files

– $ hadoop fs -rm /filename

– $ hadoop fs -expunge

● Decrease Replication Factor
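When the trash feature is enabled, a plain -rm moves the file to the trash instead of freeing its blocks, which is why -expunge follows it. Where the space is needed right away, -skipTrash deletes directly. A small sketch, with reclaim_now as a hypothetical helper name:

```shell
# Delete a path recursively and bypass the trash directory,
# so the blocks can be released without a separate -expunge.
# reclaim_now is a hypothetical helper name.
reclaim_now() {
  hadoop fs -rm -r -skipTrash "$1"
}

# Compare the space HDFS is using before and after:
# hadoop fs -du -s -h /
# reclaim_now /filename
# hadoop fs -du -s -h /
```

A file removed this way cannot be restored from the trash, so it is worth double-checking the path first.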


How to change the replication factor of existing files in HDFS

● To set the replication of an individual file to 4:

– $ hadoop fs -setrep -w 4 /path/to/file

● You can also do this recursively. To change the replication of every file in HDFS to 1:

– $ hadoop fs -setrep -R -w 1 /
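To confirm that a change took effect, the replication factor recorded for a file can be read back with the stat command. A minimal sketch, assuming HDFS is running; replication_of is a hypothetical helper name.

```shell
# Print the replication factor recorded for one HDFS file.
# %r is the replication format specifier of `hadoop fs -stat`.
# replication_of is a hypothetical helper name.
replication_of() {
  hadoop fs -stat %r "$1"
}

# Lower the factor for one file and read the new value back:
# hadoop fs -setrep -w 1 /path/to/file
# replication_of /path/to/file
```

setrep only affects existing files. New files take their default from the dfs.replication property in hdfs-site.xml, so lowering that value as well may be worth considering on a small cluster.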


Questions?