Apache Hadoop
Daniel Lust, Anthony Taliercio


What is Apache Hadoop?

Allows applications to utilize thousands of nodes while exchanging thousands of terabytes of data to complete a task

Supports distributed applications under a free license

Used by many popular companies, such as Facebook, Twitter, eBay, IBM, Apple, Microsoft, Hewlett-Packard, and many others


Continued…

Written in Java

Scales well

Can be used with thousands of nodes

Can be used with just a few nodes and inexpensive hardware

Your average Hadoop cluster consists of two major parts: a single master node and multiple worker nodes

In a small cluster, the master node runs four processes: the JobTracker, TaskTracker, NameNode, and DataNode

A worker node, also known as a slave node, can run both a DataNode and a TaskTracker, or just one of the two


Overview Of Hadoop

- Hadoop uses what is called HDFS, the Hadoop Distributed File System

HDFS takes files, splits them into blocks, and stores those blocks redundantly across the nodes of a cluster

The redundancy protects against data loss if a node fails
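
The splitting and redundancy described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function names, the tiny block size, and the round-robin placement are all assumptions made for the example. Real HDFS uses much larger blocks and lets the NameNode decide where replicas go.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed_node: str) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Toy values: a 10-byte "file", 4-byte blocks, 3 nodes, 2 replicas per block.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
```

With two replicas per block, `readable_after_failure(placement, "node2")` stays true: losing any single node still leaves one copy of every block.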


MapReduce

A programming model introduced by Google for processing massive amounts of unstructured data in parallel across a distributed cluster of processors; Hadoop provides an open-source implementation


MapReduce

Offers a clean abstraction for data analysis tasks: the JobTracker organizes the jobs that run over data stored in HDFS, so no work is unnecessarily repeated

- If a task fails on one node, it can be reassigned to a different node to complete
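
To make the abstraction concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the job. It is plain Python, not the Hadoop API; the function names are chosen for this example, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(text: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle: group all emitted values by key so each word is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict) -> dict:
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents: list) -> dict:
    """Run map over every document, then shuffle and reduce the results."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))
```

The programmer only writes the map and reduce functions; splitting the input, grouping by key, and scheduling are left to the framework.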


Running Hadoop

On the first run of Hadoop on the master computer, various processes are started, including:

TaskTracker

JobTracker

DataNode

SecondaryNameNode

NameNode

It also connects over SSH to the other slave computers to start a DataNode and TaskTracker on each


Running Hadoop

We used Hadoop to do a word count on six different books

The books were copied into HDFS, which spread them across the nodes of the cluster, and a pre-written program then ran the word count on them

Each node returned its data, using the DataNode process to save its results

When a node failed, its job was issued to another node
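
The reissue-on-failure behavior can be imitated with a small sketch. Everything here is hypothetical: the node names, the book snippets, the simulated outage of `node2`, and the simple "try the next node" rule are assumptions made for illustration, not Hadoop's actual scheduling logic.

```python
def run_with_reassignment(tasks, nodes, run_task):
    """Run each (name, payload) task, moving to the next node on failure."""
    results = {}
    for i, (name, payload) in enumerate(tasks):
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results[name] = (node, run_task(node, payload))
                break
            except RuntimeError:
                continue  # this node failed; reissue the task to the next one
        else:
            raise RuntimeError(f"task {name!r} failed on every node")
    return results


def count_words_on(node, text):
    """Hypothetical worker: node2 is simulated as being down."""
    if node == "node2":
        raise RuntimeError("node2 is down")
    return len(text.split())


books = [
    ("book1", "call me ishmael"),
    ("book2", "it was the best of times"),
    ("book3", "to be or not to be"),
]
results = run_with_reassignment(books, ["node1", "node2", "node3"], count_words_on)
```

Here `book2` is first handed to `node2`, fails, and is completed by `node3` instead, so the job as a whole still finishes.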


Example Output of Job Processes


Word count Output


Tested on 1-3 Nodes

1 node: job completion 00:01:45

2 nodes: job completion 00:01:28

3 nodes: job completion 00:01:00
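
One way to read these timings is as speedup relative to the single-node run. The short calculation below uses only the three measured times; the helper names and the choice of speedup and efficiency as metrics are our own for this sketch.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup_table(timings: dict) -> dict:
    """Map node count -> (seconds, speedup vs. 1 node, speedup per node)."""
    baseline = to_seconds(timings[1])
    table = {}
    for nodes, hms in sorted(timings.items()):
        seconds = to_seconds(hms)
        speedup = baseline / seconds
        table[nodes] = (seconds, speedup, speedup / nodes)
    return table


# Job completion times measured on 1, 2, and 3 nodes.
timings = {1: "00:01:45", 2: "00:01:28", 3: "00:01:00"}
table = speedup_table(timings)

for nodes, (seconds, speedup, efficiency) in table.items():
    print(f"{nodes} node(s): {seconds}s, speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

The job gets faster as nodes are added, but not in proportion to the node count, which is expected for a small input where startup and coordination take up a large share of the run.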


Conclusion

Our guide covered what you need to get started with Apache Hadoop

However, you may run into many problems along the way

Troubleshooting was a large part of our project