Hadoop Eco System
Presented by: Sreenu Musham
27th March, 2015
Sreenu Musham, Data Warehouse Architect
Hadoop Introduction
Agenda
- Big Data and Its Challenges (Recap)
- Hadoop and Its Evolution
- Terminology Used
- HDFS
- MapReduce
- Hadoop Eco System
- Hadoop Distributors
- Feel of Hadoop (how it looks)
Big Data and Challenges
1024 KB = 1 MB; 1024 MB = 1 GB; 1024 GB = 1 TB; 1024 TB = 1 Petabyte; and so on: Exabytes, Zettabytes, Yottabytes, Brontobytes, Geopbytes
Big Data and Challenges
- Disk I/O, network, and processing struggle to finish in time at this scale
- Storage: enterprise-class machines are costly, scaling vertically is not always the right answer, and reliability becomes a concern
- Handling unstructured, schema-less data
Disk Capacity vs Transfer Rate

                          '90s        '00s        '10s
Disk Capacity             2.1 GB      200 GB      3000 GB
Price                     $157/GB     $1.05/GB    $0.05/GB
Transfer Rate             16.6 MB/s   56.5 MB/s   210 MB/s
Time to read whole disk   ~126 sec    ~58 min     ~4 hrs

From the '90s to the '10s, capacity grew about 1428x and price per GB fell about 3140x, but transfer rate improved only about 12.65x, so reading an entire disk takes ever longer.
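The "time to read a whole disk" figures above follow from simple arithmetic; a minimal sketch, assuming 1 GB = 1000 MB (which is what the slide's numbers imply):

```python
# Capacity in GB and sequential transfer rate in MB/s, per the table above.
disks = {
    "'90s": (2.1, 16.6),
    "'00s": (200, 56.5),
    "'10s": (3000, 210),
}

def read_time_seconds(capacity_gb, rate_mb_s):
    """Seconds to scan the whole disk sequentially (1 GB taken as 1000 MB)."""
    return capacity_gb * 1000 / rate_mb_s

for decade, (cap, rate) in disks.items():
    print(decade, round(read_time_seconds(cap, rate)), "seconds")
```

Running this reproduces roughly 126 seconds, 59 minutes, and 4 hours respectively, matching the slide.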
What is Hadoop?
- An Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data
- A framework where a job is divided among the nodes, which process it in parallel
- Hides underlying system details and complexities from the user
- Developed in Java
- A set of machines running HDFS and MapReduce is known as a Hadoop cluster
- Core components: HDFS and MapReduce
Hadoop is not for all types of work:
• Processing transactions
• Low-latency data access
• Lots of small files
• Intensive calculations with little data
Hadoop initiated new kinds of analysis:
• Not just old thinking on bigger data
• Iterate over whole data sets, not only sample sets
• Use multiple data sources (beyond structured data)
• Work with schema-less data
Story Behind Hadoop
Searching in the 1990s; Google's victory in 2000
Story Behind Hadoop
- 2003: Google publishes its paper on GFS
- 2004: Google publishes its paper on MapReduce
- 2005: Created by Doug Cutting and Michael Cafarella (Yahoo)
- 2006: Named by Doug Cutting; Yahoo donated the project to Apache
- 2008: Wins the Terabyte Sort benchmark
- 2009: Runs a 4000-node Hadoop cluster
- 2010: Launches Hive
Node-Rack-Cluster
[Diagram: a Hadoop cluster is composed of racks (Rack 1 through Rack N), each rack holding multiple nodes (Node 1 through Node N).]
What is HDFS?
- HDFS runs on top of an existing file system
- Uses blocks to store a file or parts of a file
- Stores data across multiple nodes
- The size of a file can be larger than any single disk in the network
- The default block size is 64 MB
The main objectives:
- Storing very large files
- Streaming data access
- Commodity hardware
- Allow access to data on any node in the cluster
- Able to handle hardware failures
What is HDFS?
- If a chunk of a file is smaller than the HDFS block size, only the needed space is used. Example: a 300 MB file with 64 MB blocks occupies four full blocks plus one 44 MB block
- HDFS blocks are replicated to several nodes for reliability
- Maintains checksums of data for corruption detection and recovery
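The 300 MB example above can be worked out in a few lines; `split_into_blocks` is a hypothetical helper for illustration, not an HDFS API:

```python
BLOCK_SIZE_MB = 64  # HDFS default block size used in this deck

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the HDFS blocks a file of the given size occupies."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)  # the last block uses only the space it needs
    return blocks

# A 300 MB file -> four full 64 MB blocks plus one 44 MB block.
print(split_into_blocks(300))  # [64, 64, 64, 64, 44]
```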
What is MapReduce?
- MapReduce is a programming model for processing the data in the Hadoop cluster
- Consists of two phases: Map, and then Reduce
- Each Map task operates on a discrete portion of the overall dataset, typically one HDFS block of data
- After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase
- The Hadoop framework parallelizes the computation and handles failures, communication, and performance issues
Sample Data Flow
- Take a large problem and divide it into sub-problems: break the data set down into small chunks (MAP)
- Perform the same function on all sub-tasks (MAP)
- Combine the output from all sub-tasks (REDUCE)
It works like a Unix pipeline:
cat input | grep <pattern> | sort | uniq -c > output
Input | Map | Shuffle & Sort | Reduce | Output
Hadoop Architecture
Master/Slave and Shared-Nothing architecture; scalable.
- HDFS layer: a NameNode (master) and many DataNodes (slaves)
- MapReduce layer: a JobTracker (master) and many TaskTrackers (slaves)
Sample Data Flow - Word Count
Input: Air Box Car Box Do Air
Map input <k1, v1>: (0, Air) (1, Box) (2, Car) (3, Box) (4, Do) (5, Air)
Map output list(<k2, v2>): (Air, 1) (Box, 1) (Car, 1) (Box, 1) (Do, 1) (Air, 1)
Shuffle & Sort <k2, list(v2)>: (Air, [1, 1]) (Box, [1, 1]) (Car, [1]) (Do, [1])
Reduce output list(<k3, v3>): (Air, 2) (Box, 2) (Car, 1) (Do, 1)
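The word-count flow above can be simulated in plain Python; a sketch of the Map, Shuffle & Sort, and Reduce stages, not actual Hadoop code:

```python
from itertools import groupby

def map_fn(offset, word):
    # <k1, v1> -> list(<k2, v2>): emit (word, 1) for each input record
    return [(word, 1)]

def reduce_fn(word, counts):
    # <k2, list(v2)> -> list(<k3, v3>): sum the counts per word
    return [(word, sum(counts))]

records = list(enumerate(["Air", "Box", "Car", "Box", "Do", "Air"]))

# Map phase
mapped = [kv for k, v in records for kv in map_fn(k, v)]
# Shuffle & Sort: group the intermediate pairs by key
mapped.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in g]) for k, g in groupby(mapped, key=lambda kv: kv[0])]
# Reduce phase
result = [kv for k, vs in grouped for kv in reduce_fn(k, vs)]
print(result)  # [('Air', 2), ('Box', 2), ('Car', 1), ('Do', 1)]
```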
Sample Data Flow (max temperature by year)
Map output list(<k2, v2>): (1950, 0) (1950, 22) (1950, -11) (1949, 111) (1949, 78)
Shuffle & Sort <k2, list(v2)>: (1949, [111, 78]) (1950, [0, 22, -11])
Reduce output list(<k3, v3>): (1949, 111) (1950, 22)
To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(1, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(2, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(3, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(4, 0043012650999991949032418004...0500001N9+00781+99999999999...)
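The map and reduce steps can be sketched in plain Python. The fixed offset and the regex below are tuned to the truncated slide lines (ellipses included), not to the full record format:

```python
import re

# Truncated weather records as shown on the slide.
lines = [
    "0067011990999991950051507004...9999999N9+00001+99999999999...",
    "0043011990999991950051512004...9999999N9+00221+99999999999...",
    "0043011990999991950051518004...9999999N9-00111+99999999999...",
    "0043012650999991949032412004...0500001N9+01111+99999999999...",
    "0043012650999991949032418004...0500001N9+00781+99999999999...",
]

def map_fn(line):
    # The year sits at characters 15-18; the signed 4-digit temperature follows "N9".
    year = line[15:19]
    temp = int(re.search(r"N9([+-]\d{4})", line).group(1))
    return (year, temp)

mapped = [map_fn(line) for line in lines]
# Shuffle & Sort groups the temperatures by year; Reduce takes the maximum.
by_year = {}
for year, temp in mapped:
    by_year.setdefault(year, []).append(temp)
result = {year: max(temps) for year, temps in by_year.items()}
print(sorted(result.items()))  # [('1949', 111), ('1950', 22)]
```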
Replication of Blocks
[Diagram: an HDFS client writes file.txt; the NameNode records the block list (File.txt: B1, B2, B3), and each block is replicated to nodes spread across Rack 1, Rack 2, and Rack 3 of the Hadoop cluster.]
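Rack-aware replica placement can be illustrated with a small sketch. The cluster layout is a made-up example, and the policy shown (writer's node first, then two distinct nodes on one other rack) is a simplified rendering of HDFS's default 3-replica strategy, not NameNode code:

```python
import random

# Hypothetical cluster layout: rack name -> nodes on that rack.
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
    "rack3": ["node7", "node8", "node9"],
}

def place_replicas(writer_node, rng=random):
    """Pick 3 replica nodes for a block written from writer_node."""
    writer_rack = next(r for r, ns in cluster.items() if writer_node in ns)
    first = writer_node                       # 1st replica: the writer's own node
    remote_rack = rng.choice([r for r in cluster if r != writer_rack])
    second = rng.choice(cluster[remote_rack])  # 2nd replica: a different rack
    third = rng.choice(                        # 3rd replica: same rack as the 2nd
        [n for n in cluster[remote_rack] if n != second])
    return [first, second, third]

replicas = place_replicas("node1")
print(replicas)
```

Losing a whole rack under this policy still leaves at least one surviving replica, which is the point of spreading blocks across racks.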
Accessing Hadoop/HDFS:
hadoop fs -ls <path>
hadoop fs -mkdir testsreenu
hadoop fs -copyFromLocal samplefile.txt  (or)  hadoop fs -put samplefile.txt
Running MapReduce:
hadoop jar newjob.jar samplefile in_dir out_dir
Hive
Pig
Pig code
a. Load customer records
cust = LOAD '/input/custs' USING PigStorage(',') AS (custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
b. Select only 100 records
amt = LIMIT cust 100;
DUMP amt;
c. Group customer records by profession
groupbyprofession = GROUP cust BY profession;
DESCRIBE groupbyprofession;
d. Count the number of customers by profession
countbyprofession = FOREACH groupbyprofession GENERATE group, COUNT(cust);
DUMP countbyprofession;
Feel of Hadoop
Hadoop Vs Other Systems
Aspect              | Distributed Databases                                                        | Hadoop
Computing Model     | Notion of transactions; a transaction is the unit of work; ACID properties, concurrency control | Notion of jobs; a job is the unit of work; no concurrency control
Data Model          | Structured data with known schema; read/write mode                           | Any data in any format, whether unstructured or semi-structured; read-only mode
Cost Model          | Expensive servers                                                            | Cheap commodity machines
Fault Tolerance     | Failures are rare; recovery mechanisms                                       | Failures are common over thousands of machines; simple yet efficient fault tolerance
Key Characteristics | Efficiency, optimizations, fine-tuning                                       | Scalability, flexibility, fault tolerance
• Cloud Computing: a computing model where any computing infrastructure can run on the cloud
• Hardware & software are provided as remote services
• Elastic: grows and shrinks based on the user's demand
• Example: Amazon EC2