Hadoop Overview




2008.01.15

Hadoop: Brief History
  • 2005: Started by Doug Cutting (Lucene & Nutch) as part of the Nutch project
  • 2006: Doug Cutting joins Yahoo
  • 2008: Apache Top-level Project
  • Latest release (as of 2009.1): 0.19.0

Hadoop
  • Written in Java
  • Apache open-source project
  • Subprojects and related projects: HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, etc.

Hadoop Architecture

  • Hadoop Distributed File System (HDFS)
  • Distributed computing

Distributed File System (DFS)
  • index, transaction
  • NHN + KAIST: OwFS (Owner based File System)
  • Sun Microsystems: NFS
  • Microsoft
  • IBM Transarc's DFS
  • Google File System (GFS)
  • Hadoop DFS

Google File System (GFS)

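  • GFS: built from PCs rather than NAS; files of MB ~ GB

Google File System (GFS)
  • Application: uses the file system through the GFS client
  • GFS master: manages file system metadata
  • GFS chunkserver: stores files (MB ~ GB) split into chunks; chunk size is 64 MB by default
  • GFS client: linked into the application; exposes the file system API and communicates over sockets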

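Google File System (GFS)
  1. The application calls the file system API provided by the GFS client.
  2. The GFS client sends the request to the GFS master.
  3. The GFS master returns to the client the chunk, the chunk size, and the chunkservers that hold the chunk.
  4. The GFS client requests the chunk from a chunkserver.
  5. The GFS chunkserver sends the data to the client.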
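The five steps above amount to a metadata lookup followed by a direct data fetch. The following is a minimal Python sketch of that flow, not GFS code: the class names `Master`, `ChunkServer` and `Client`, the lookup table layout, and the choice of the first replica are all assumptions made for illustration, and the read does not span chunk boundaries.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default chunk size named in the slides


@dataclass
class ChunkServer:
    name: str
    chunks: dict = field(default_factory=dict)  # chunk handle -> bytes

    def read(self, handle, offset, length):
        # Step 5: the chunkserver returns the requested bytes.
        return self.chunks[handle][offset:offset + length]


@dataclass
class Master:
    # Hypothetical layout: (filename, chunk index) -> (chunk handle, chunkservers)
    table: dict = field(default_factory=dict)

    def lookup(self, filename, chunk_index):
        # Step 3: the master answers with metadata only, never file data.
        return self.table[(filename, chunk_index)]


class Client:
    def __init__(self, master, chunk_size=CHUNK_SIZE):
        self.master = master
        self.chunk_size = chunk_size

    def read(self, filename, offset, length):
        # Step 2: turn the byte offset into a chunk index and ask the master.
        chunk_index = offset // self.chunk_size
        handle, servers = self.master.lookup(filename, chunk_index)
        # Step 4: fetch directly from a chunkserver (here simply the first one).
        return servers[0].read(handle, offset % self.chunk_size, length)


if __name__ == "__main__":
    cs = ChunkServer("cs1", {"h0": b"hello world"})
    master = Master({("/logs/a", 0): ("h0", [cs])})
    print(Client(master).read("/logs/a", 6, 5))  # b'world'
```

The point of the split is that the master handles only small metadata requests, while the bulk data transfer happens between client and chunkserver.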

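  • The client contacts the master only for metadata; file data moves directly between client and chunkserver, which keeps load off the master.

Google File System (GFS): Replication
  • Each chunk is replicated on several chunkservers. Default: 3 replicas.
  • When a chunkserver goes down, the master has the affected chunks copied to other chunkservers.

Google File System (GFS): Replication
  • Because chunks are replicated, a chunkserver going down does not lose data; the effect is comparable to RAID 1.
  • Google, NAS, Yahoo: chunk, chunkserver, disk head
  • RAID (Redundant Array of Independent Disks): RAID 1 mirrors data across disks (100% redundancy). With a single SCSI controller this is "disk mirroring"; with a separate SCSI controller per disk it is "disk duplexing" (Disk Duplexing).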
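The replication behaviour described above (3 replicas by default, with the master reacting when a chunkserver goes down) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function name `re_replicate`, the dictionary layout, and the alphabetical choice of target servers are hypothetical and do not reflect how GFS actually picks targets.

```python
REPLICATION = 3  # default replica count from the slides


def re_replicate(placement, live_servers, replication=REPLICATION):
    """Drop replicas on dead servers and top each chunk back up.

    placement: dict mapping chunk id -> set of server names holding a replica.
    live_servers: set of server names that are currently alive.
    Returns a new placement dict; the input is not modified.
    """
    new_placement = {}
    for chunk, holders in placement.items():
        survivors = holders & live_servers
        # Candidate targets: live servers that do not hold this chunk yet.
        # Picking them alphabetically is an assumption made for determinism.
        candidates = sorted(live_servers - survivors)
        missing = max(0, replication - len(survivors))
        # A chunk with no surviving replica cannot be copied from anywhere.
        if survivors:
            survivors = survivors | set(candidates[:missing])
        new_placement[chunk] = survivors
    return new_placement


if __name__ == "__main__":
    before = {"c1": {"a", "b", "c"}, "c2": {"b", "c", "d"}}
    after = re_replicate(before, live_servers={"a", "c", "d", "e"})
    for chunk, servers in sorted(after.items()):
        print(chunk, sorted(servers))
```

In this example server `b` has failed, so both chunks lose one replica and each is copied to one additional live server to restore three replicas.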

Google File System (GFS): Replication
  • Google File System: each write goes to 3 replicas.
  • (MIME)
  • With the Google File System, PCs can be used in place of NAS.

Hadoop Distributed File System (HDFS)
  • Highly fault-tolerant
  • Designed to be deployed on low-cost hardware
  • Provides high-throughput access to application data
  • Suitable for applications having large data sets

Assumptions & Goals of HDFS
1. Hardware failure

An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional.
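A quick calculation shows why "some component is always non-functional" is a fair assumption at this scale. The sketch below assumes components fail independently and uses a made-up per-component failure probability of 0.1%; neither figure comes from the slides.

```python
def prob_any_failed(n_components, p_fail):
    """Probability that at least one of n independent components is down."""
    return 1.0 - (1.0 - p_fail) ** n_components


if __name__ == "__main__":
    # Hypothetical numbers: each component is down 0.1% of the time.
    for n in (10, 1000, 10000):
        print(n, round(prob_any_failed(n, 0.001), 4))
```

With these assumed numbers, 1,000 components give roughly a 63% chance that at least one is down at any moment, and the probability approaches 1 as the cluster grows.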

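Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Assumptions & Goals of HDFS
2. Streaming Data Access
  • HDFS is designed for batch processing rather than interactive use.
  • HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
  • HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
  • Emphasis: high throughput of data access rather than low latency.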

*POSIX: Portable Operating System Interface

Assumptions & Goals of HDFS
3. Large Data Sets
  • Applications that run on HDFS have large data sets.
  • A typical file in HDFS is gigabytes to terabytes in size.
  • HDFS should provide high aggregate data bandwidth, scale to many nodes in a cluster, and support a very large number of files in a single instance.

Assumptions & Goals of HDFS
4. Simple Coherency Model
  • HDFS applications need a write-once-read-many access model for files.
  • This assumption simplifies data coherency issues and enables high-throughput data access.
  • A Map/Reduce application or a web crawler application fits perfectly with this model.
  • Appending writes: scheduled to be included in Hadoop 0.19 but not available yet.

Assumptions & Goals of HDFS
5. Moving Computation is Cheaper than Moving Data
  • A computation requested by an application is more efficient when it runs near the data it operates on, especially when the data sets are huge.
  • This minimizes network congestion and increases the overall throughput of the system.
  • The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running.
  • HDFS provides interfaces for applications to move themselves closer to where the data is located.

Assumptions & Goals of HDFS
6. Portability Across Heterogeneous Hardware and Software Platforms
  • HDFS has been designed to be easily portable from one platform to another.
  • This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Features of HDFS
  • A master/slave architecture
  • A single NameNode, a master server for an HDFS cluster:
      • manages the file system namespace
      • executes file system namespace operations like opening, closing, and renaming files and directories
      • regulates access to files by clients
      • determines the mapping of blocks to DataNodes

  • Having a single NameNode simplifies the architecture of the system.
  • NameNode = arbitrator and repository for all HDFS metadata

Features of HDFS
  • A number of DataNodes, usually one per node in the cluster:
      • manage storage attached to the nodes that they run on
      • perform block creation, deletion, and replication upon instruction from the NameNode
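The division of labour described above (the NameNode holds metadata and issues instructions, DataNodes hold blocks and carry the instructions out) can be sketched in Python as follows. This is a toy model, not the HDFS implementation: the class and method names are hypothetical, and building the block map from the blocks each DataNode already holds is an assumption made to keep file data out of the NameNode.

```python
class DataNode:
    """Stores blocks and performs operations when instructed."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def create_block(self, block_id, data):
        self.blocks[block_id] = data

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)

    def replicate_block(self, block_id, target):
        # Data moves DataNode to DataNode, never through the NameNode.
        target.create_block(block_id, self.blocks[block_id])


class NameNode:
    """Holds metadata only: which DataNodes store which block."""

    def __init__(self, datanodes):
        self.datanodes = {d.name: d for d in datanodes}
        self.block_map = {}  # block id -> list of DataNode names
        for d in datanodes:
            for block_id in d.blocks:
                self.block_map.setdefault(block_id, []).append(d.name)

    def replicate(self, block_id, target_name):
        source = self.datanodes[self.block_map[block_id][0]]
        source.replicate_block(block_id, self.datanodes[target_name])
        self.block_map[block_id].append(target_name)

    def delete(self, block_id):
        for name in self.block_map.pop(block_id):
            self.datanodes[name].delete_block(block_id)


if __name__ == "__main__":
    dn1, dn2 = DataNode("dn1"), DataNode("dn2")
    dn1.create_block("blk_1", b"payload")
    nn = NameNode([dn1, dn2])
    nn.replicate("blk_1", "dn2")
    print(nn.block_map)  # {'blk_1': ['dn1', 'dn2']}
```

Keeping all metadata in one place is what makes the NameNode the single arbitrator, at the cost of making it a component the whole cluster depends on.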

Features of HDFS
  • The NameNode and DataNodes are pieces of software designed to run on commodity machines.
  • These machines typically run a GNU/Linux OS.
  • HDFS is built using the Java language; any machine that supports Java can run the NameNode or DataNode software.

Features of HDFS
  • A typical deployment has a dedicated machine that runs only the NameNode software.
  • Each of the other machines in the cluster runs one instance of the DataNode software.
  • The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

Features of HDFS
  • Single namespace for the entire cluster, managed by a single NameNode
  • Files are write-once
  • Optimized for streaming reads of large files
  • Files are broken into large blocks
      • HDFS is a block-structured file system
      • Default block size in HDFS: 64 MB (vs. 4 or 8 KB in other systems)
      • These blocks are stored in a set of DataNodes
  • Client talks to both the NameNode and DataNodes
      • Data is not sent through the NameNode

Features of HDFS
  • Figure: DataNodes holding blocks of multiple files with a replication factor of 2. The NameNode maps the filenames onto the block ids.
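To make the block structure concrete, the following Python sketch splits a file into 64 MB blocks and assigns each block to two DataNodes, matching the replication factor of 2 in the figure. The function names, the block id format, and the round-robin placement are assumptions for illustration; HDFS's real placement policy is not described in these slides.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size from the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(filename, n_blocks, datanodes, replication=2):
    """Map each block id to `replication` DataNodes, round-robin.

    Replicas of a block are distinct only if replication <= len(datanodes).
    """
    placement = {}
    for i in range(n_blocks):
        block_id = f"{filename}#blk{i}"  # hypothetical id format
        placement[block_id] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    mb = 1024 * 1024
    sizes = split_into_blocks(200 * mb)
    print([s // mb for s in sizes])  # [64, 64, 64, 8]
    for block, nodes in place_blocks("/data/a", len(sizes), ["dn1", "dn2", "dn3"]).items():
        print(block, nodes)
```

A 200 MB file needs only 4 blocks at 64 MB, whereas at a 4 KB block size the same file would need 51,200 blocks, which is why large blocks keep the NameNode's filename-to-block mapping small.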

Features of HDFS
  • DataNodes, client, HTTP

Features of HDFS
  • NameNode -> HDFS
  • NameNode, file system
  • 2 NameNode
  • NameNode replay process: 30

Features of HDFS

The File System Namespace
  • File system namespace: hierarchical, similar to most existing file systems; files can be moved from one dir to another.

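  • X X, X

The File System Namespace
  • NameNode: maintains the file system namespace.
  • Any change to the file system namespace is recorded by the NameNode.
  • An application can specify the number of replicas of a file that HDFS should maintain.
  • The number of copies of a file is called the replication factor of that file.
  • This information is stored by the NameNode.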
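As a rough illustration of what "the NameNode stores the namespace and each file's replication factor" means, here is a toy in-memory namespace in Python. The class name `Namespace`, its methods, and the default replication factor of 3 are assumptions for this sketch and are not the HDFS API.

```python
class Namespace:
    """Toy NameNode-side namespace: paths plus a per-file replication factor."""

    def __init__(self, default_replication=3):  # default value is an assumption
        self.default_replication = default_replication
        self.dirs = {"/"}
        self.files = {}  # path -> replication factor

    def mkdir(self, path):
        self.dirs.add(path)

    def create(self, path, replication=None):
        # An application may choose the replica count per file.
        self.files[path] = replication or self.default_replication

    def rename(self, src, dst):
        # A rename changes metadata only; the replication factor is kept.
        self.files[dst] = self.files.pop(src)

    def set_replication(self, path, replication):
        self.files[path] = replication


if __name__ == "__main__":
    ns = Namespace()
    ns.mkdir("/logs")
    ns.create("/logs/a")
    ns.set_replication("/logs/a", 2)
    ns.rename("/logs/a", "/logs/b")
    print(ns.files)  # {'/logs/b': 2}
```

Every operation here touches metadata only, which mirrors the slide's point that namespace changes and replication factors are recorded by the NameNode rather than by the DataNodes.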