Apache Hadoop

Daniel Lust, Anthony Taliercio

What is Apache Hadoop?

Allows applications to utilize thousands of nodes while exchanging thousands of terabytes of data to complete a task

Supports distributed applications under a free license (the Apache License 2.0)

Used by many popular companies, such as Facebook, Twitter, eBay, IBM, Apple, Microsoft, Hewlett-Packard, and many others…

Continued…

Written in Java

Scales well

Can be used with thousands of nodes

Can be used with just a few nodes and inexpensive hardware

A typical Hadoop cluster consists of two major parts: a single master node and multiple worker nodes.

The master node runs four processes: the JobTracker, TaskTracker, NameNode, and DataNode.

A worker node, also known as a slave node, can act as both a DataNode and a TaskTracker, or as just one of the two.

Overview Of Hadoop

- Hadoop uses what is called HDFS, the Hadoop Distributed File System

HDFS splits files into blocks and stores them redundantly across the nodes of a cluster

The redundancy protects against data loss if a node fails
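To make the idea of redundant storage concrete, here is a minimal sketch, not HDFS's actual code. The block size, the node names, and the round-robin placement are assumptions chosen for illustration; real HDFS uses far larger blocks (64 MB by default in Hadoop 1.x) and a rack-aware placement policy.

```python
# Illustrative values only: tiny blocks keep the example readable.
BLOCK_SIZE = 8      # bytes per block (assumption, not an HDFS default)
REPLICATION = 3     # copies kept of every block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification of real, rack-aware placement.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + k) % len(nodes)] for k in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on a node that is alive."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    data = b"a small file stored in HDFS"
    nodes = ["node1", "node2", "node3", "node4"]   # hypothetical node names
    blocks = split_into_blocks(data)
    placement = place_replicas(len(blocks), nodes)
    for block_id, replicas in placement.items():
        print(block_id, blocks[block_id], replicas)
    print(readable_after_failure(placement, "node2"))
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block.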

MapReduce

A programming model, originally developed by Google, for processing massive amounts of unstructured data in parallel across a distributed cluster of processors

MapReduce

Offers a clean abstraction for data analysis tasks. The JobTracker organizes the jobs and hands their tasks to the TaskTrackers, so no work is unnecessarily repeated.

- If one of the nodes fails, its task can be handed to a different node to complete
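The MapReduce model can be sketched on a single machine with a word count, the same kind of job run later in this presentation. This is a plain-Python illustration of the map, shuffle, and reduce steps, not Hadoop's Java API; all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key so each word reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all input splits."""
    pairs = []
    for document in documents:
        # On a cluster, each split would be mapped on a different node.
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["the cat", "The dog the"]))
```

The map and reduce functions never share state, which is what lets a framework run them on many nodes at once.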

Running Hadoop

First run of Hadoop on the master computer

Various processes are started, including:

TaskTracker

JobTracker

DataNode

SecondaryNameNode

NameNode

It also connects through SSH to the other slave computers to start a DataNode and a TaskTracker on each

Running Hadoop

We used Hadoop to run a word count on six different books.

HDFS distributed the books across the nodes of the cluster, and a pre-written program ran a word count on them.

Each node returned data, using the DataNode process to save its results.

When a node failed, Hadoop reissued its job to another node
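The reassignment behavior described above can be sketched as a simple retry loop. This is only an illustration of the idea, not how the JobTracker is implemented; `run_task` is a hypothetical callable standing in for the work a node does, and it signals a failed node by raising `RuntimeError`.

```python
def run_with_reassignment(tasks, nodes, run_task):
    """Run every task, moving it to the next node whenever a node fails.

    `run_task(node, task)` is a hypothetical callable that returns a result
    or raises RuntimeError when the node cannot complete the task.
    """
    results = {}
    for i, task in enumerate(tasks):
        # Rotate the node list so tasks start on different nodes.
        start = i % len(nodes)
        candidates = nodes[start:] + nodes[:start]
        for node in candidates:
            try:
                results[task] = (node, run_task(node, task))
                break
            except RuntimeError:
                continue  # node failed: hand the task to the next node
        else:
            raise RuntimeError(f"no live node could run {task!r}")
    return results


if __name__ == "__main__":
    def flaky_run(node, task):
        # Pretend node2 is down; the other nodes "process" the task.
        if node == "node2":
            raise RuntimeError("node2 is unreachable")
        return f"counted words in {task}"

    outcome = run_with_reassignment(
        ["book1", "book2", "book3"], ["node1", "node2", "node3"], flaky_run
    )
    for task, (node, result) in outcome.items():
        print(task, node, result)
```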

Example Output of Job Processes

Word count Output

Tested on 1-3 Nodes

1 NODE: JOB COMPLETION 00:01:45

2 NODES: JOB COMPLETION 00:01:28

3 NODES: JOB COMPLETION 00:01:00
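The three timings above are 105, 88, and 60 seconds, so going from one node to three gives a speedup of 105 / 60 = 1.75, and two nodes give roughly 1.19. The short sketch below reproduces that arithmetic; the helper names are made up for illustration, and the only data used are the three measurements reported here.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup_table(timings):
    """Return {nodes: (seconds, speedup, efficiency)} relative to the 1-node run."""
    baseline = to_seconds(timings[1])
    table = {}
    for node_count, hms in sorted(timings.items()):
        seconds = to_seconds(hms)
        speedup = baseline / seconds
        # Efficiency is the speedup divided by the number of nodes used.
        table[node_count] = (seconds, speedup, speedup / node_count)
    return table


# Job completion times reported in the presentation.
TIMINGS = {1: "00:01:45", 2: "00:01:28", 3: "00:01:00"}

if __name__ == "__main__":
    for node_count, (seconds, speedup, efficiency) in speedup_table(TIMINGS).items():
        print(f"{node_count} node(s): {seconds}s, "
              f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

The speedup is well below the node count, which is common for a job this small because startup and coordination overhead take up a large share of the run time.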

Conclusion

Our guide covered everything you need to get started with Apache Hadoop

However, you may run into many problems along the way

Troubleshooting was a large part of our project
