Transcript
  • Introduction to Hadoop

    Copyright 2009 - Trend Micro Inc.

    Outline
    - Introduction to the Hadoop project
    - HDFS (Hadoop Distributed File System) overview
    - MapReduce overview
    - References

    What is Hadoop?
    - Hadoop is a cloud computing platform for processing and storing vast amounts of data
    - Apache top-level project
    - Open source
    - Hadoop Core includes:
      - Hadoop Distributed File System (HDFS)
      - MapReduce framework
    - Hadoop subprojects: HBase, ZooKeeper, ...
    - Written in Java; client interfaces come in C++/Java/shell scripting
    - Runs on Linux, Mac OS/X, Windows, and Solaris
    - Runs on commodity hardware

    A Brief History of Hadoop
    - 2003.2: First MapReduce library written at Google
    - 2003.10: Google File System paper published
    - 2004.12: Google MapReduce paper published
    - 2005.7: Doug Cutting reports that Nutch now uses the new MapReduce implementation
    - 2006.2: Hadoop code moves out of Nutch into a new Lucene sub-project
    - 2006.11: Google Bigtable paper published

    A Brief History of Hadoop (continued)
    - 2007.2: First HBase code drop from Mike Cafarella
    - 2007.4: Yahoo! running Hadoop on a 1000-node cluster
    - 2008.1: Hadoop made an Apache top-level project

    Who uses Hadoop?
    - Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop
    - Google: University Initiative to Address Internet-Scale Computing Challenges
    - Amazon: builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools; processes millions of sessions daily for analytics
    - IBM: Blue Cloud computing clusters
    - Trend Micro: threat solution research
    - More: http://wiki.apache.org/hadoop/PoweredBy

    Hadoop Distributed File System Overview

    HDFS Design
    - Single namespace for the entire cluster
    - Very large distributed file system: 10K nodes, 100 million files, 10 PB
    - Data coherency
      - Write-once-read-many access model
      - A file, once created, written, and closed, need not be changed
      - Appending writes to existing files (in the future)
    - Files are broken up into blocks
      - Typically 128 MB block size
      - Each block replicated on multiple DataNodes

    HDFS Design (continued)
    - Moving computation is cheaper than moving data
      - Data locations are exposed so that computations can move to where the data resides
    - File replication
      - Default is 3 copies; configurable by clients
    - Assumes commodity hardware
      - Files are replicated to handle hardware failure
      - Detects failures and recovers from them
    - Streaming data access
      - High throughput of data access rather than low latency
      - Optimized for batch processing

    NameNode
    - Manages the file system namespace
      - Maps a file name to a set of blocks
      - Maps a block to the DataNodes where it resides
    - Cluster configuration management
    - Replication engine for blocks
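The two mappings above can be pictured as a pair of lookup tables. A toy sketch (the data structures, block IDs, and node names here are illustrative only, not HDFS internals):

```python
# The NameNode's two logical mappings: file -> blocks, block -> DataNodes.
file_to_blocks = {"/foodir/myfile.txt": ["blk_1", "blk_2"]}
block_to_datanodes = {"blk_1": ["dn1", "dn2", "dn3"],   # 3x replication
                      "blk_2": ["dn2", "dn4", "dn5"]}

# Resolving a read: file name -> block list -> DataNodes holding each block
locations = [block_to_datanodes[b] for b in file_to_blocks["/foodir/myfile.txt"]]
print(locations)  # [['dn1', 'dn2', 'dn3'], ['dn2', 'dn4', 'dn5']]
```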

    NameNode Metadata
    - Metadata in memory
      - The entire metadata is in main memory
      - No demand paging of metadata
    - Types of metadata
      - List of files
      - List of blocks for each file
      - List of DataNodes for each block
      - File attributes, e.g. creation time, replication factor

    NameNode Metadata (cont.)
    - A transaction log (called the EditLog)
      - Records file creations, file deletions, etc.
    - FsImage
      - The entire namespace, the mapping of blocks to files, and file system properties are stored in a file called the FsImage
      - The NameNode can be configured to maintain multiple copies of the FsImage and EditLog
    - Checkpoint
      - Occurs at NameNode startup
      - Reads the FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage
      - Then flushes the new version out as a new FsImage
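The checkpoint step above is a snapshot-plus-log replay. A minimal sketch of the idea (the record format, operation names, and function are invented for illustration, not HDFS's actual on-disk format):

```python
# Checkpoint idea: replay an edit log onto an in-memory namespace image,
# then the merged result would be written out as the new FsImage.

def load_checkpoint(fsimage, editlog):
    """Apply every logged transaction, in order, to the in-memory image."""
    namespace = dict(fsimage)          # start from the last snapshot
    for op, path in editlog:           # replay transactions in order
        if op == "create":
            namespace[path] = []       # new file, no blocks yet
        elif op == "delete":
            namespace.pop(path, None)
    return namespace                   # flushed to disk as the new FsImage

image = {"/foodir/a.txt": ["blk_1"]}
log = [("create", "/foodir/b.txt"), ("delete", "/foodir/a.txt")]
print(load_checkpoint(image, log))     # {'/foodir/b.txt': []}
```

Keeping the full metadata in memory and only logging mutations is what makes NameNode operations fast; the cost is the replay work at startup.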

    Secondary NameNode
    - Copies the FsImage and EditLog from the NameNode to a temporary directory
    - Merges the FsImage and EditLog into a new FsImage in the temporary directory
    - Uploads the new FsImage to the NameNode
    - The transaction log on the NameNode is then purged

    NameNode Failure
    - A single point of failure
    - The transaction log can be stored in multiple directories
      - A directory on the local file system
      - A directory on a remote file system (NFS/CIFS)
    - A real HA solution still needs to be developed

    DataNode
    - A block server
      - Stores data in the local file system (e.g. ext3)
      - Stores metadata of a block (e.g. CRC)
      - Serves data and metadata to clients
    - Block report
      - Periodically sends a report of all existing blocks to the NameNode
    - Facilitates pipelining of data
      - Forwards data to other specified DataNodes

    HDFS - Replication
    - Default is 3x replication; the block size and replication factor are configurable per file
    - The block placement algorithm is rack-aware
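The rack-aware policy commonly described for HDFS places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that second rack, balancing write bandwidth against rack-failure tolerance. A simplified sketch (node and rack names are made up; real HDFS also weighs load and free space):

```python
import random

def place_replicas(writer, nodes_by_rack):
    """Return three (rack, node) placements for one block."""
    # First replica: the writer's own node (cheap, local write)
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    first = (writer_rack, writer)
    # Second replica: a node on a different rack (survives rack failure)
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = (remote_rack, random.choice(nodes_by_rack[remote_rack]))
    # Third replica: another node on the second replica's rack
    third_candidates = [n for n in nodes_by_rack[remote_rack] if n != second[1]]
    third = (remote_rack, random.choice(third_candidates))
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))
```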

    User Interface
    - API
      - Java API
      - A C language wrapper (libhdfs) for the Java API is also available

    POSIX-like commands
    - hadoop dfs -mkdir /foodir
    - hadoop dfs -cat /foodir/myfile.txt
    - hadoop dfs -rm /foodir/myfile.txt

    HDFS administration
    - bin/hadoop dfsadmin -safemode get
    - bin/hadoop dfsadmin -report
    - bin/hadoop dfsadmin -refreshNodes

    Web interface
    - Ex: http://172.16.203.136:50070

    Web interface (http://172.16.203.136:50070): browse the file system

    POSIX-like commands (example session)

    Latest API
    - Java API: http://hadoop.apache.org/core/docs/current/api/

    MapReduce Overview

    A Brief History of MapReduce
    - A demand for large-scale data processing at Google
    - The folks at Google discovered certain common themes for processing those large input sizes
      - Many machines are needed
      - Two basic operations on the input data: Map and Reduce
    - Google's idea is inspired by the map and reduce functions commonly used in functional programming
      - Map: [1,2,3,4] (*2) -> [2,4,6,8]
      - Reduce: [1,2,3,4] (sum) -> 10
    - A divide-and-conquer paradigm
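The two functional-programming examples above can be run directly with Python's built-in map and functools.reduce:

```python
from functools import reduce

# map: apply (*2) to each element of the list
doubled = list(map(lambda x: x * 2, [1, 2, 3, 4]))

# reduce: fold the list with (sum)
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```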

    MapReduce
    - MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers
    - Users specify:
      - a map function that processes a key/value pair to generate a set of intermediate key/value pairs
      - a reduce function that merges all intermediate values associated with the same intermediate key
    - The MapReduce framework handles the details of execution

    Application Area
    - The MapReduce framework is for computing certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster

    Examples
    - Large data processing such as search, indexing, and sorting
    - Data mining and machine learning on large data sets
    - Analysis of huge access logs in large portals
    - More applications: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/

    MapReduce Programming Model
    - Users define two functions:
      - map (K1, V1) -> list(K2, V2)
        - takes an input pair and produces a set of intermediate key/value pairs
      - reduce (K2, list(V2)) -> list(K3, V3)
        - accepts an intermediate key and the set of values for that key, and merges these values together to form a possibly smaller set of values
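The two signatures can be exercised with a toy, single-process driver that runs the map phase, groups intermediate pairs by key (the shuffle), and runs the reduce phase. This is only a sketch of the logical data flow, with no distribution; the function names are invented:

```python
from itertools import groupby

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each (K1, V1) input pair yields intermediate (K2, V2) pairs
    intermediate = [kv for k1, v1 in inputs for kv in map_fn(k1, v1)]
    # Shuffle phase: group intermediate pairs by key K2
    intermediate.sort(key=lambda kv: kv[0])
    grouped = {k2: [v for _, v in group]
               for k2, group in groupby(intermediate, key=lambda kv: kv[0])}
    # Reduce phase: merge the value list for each distinct K2
    return [out for k2, v2s in grouped.items() for out in reduce_fn(k2, v2s)]
```

For example, `run_mapreduce([(0, "a b a")], lambda k, line: [(w, 1) for w in line.split()], lambda word, counts: [(word, sum(counts))])` yields `[("a", 2), ("b", 1)]`.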

    MapReduce Logical Flow

    MapReduce Execution Flow

    Word Count Example
    - Word count problem: count the number of occurrences of each word in a large collection of documents
    - Input: a collection of (document name, document content) pairs
    - Expected output: a collection of (word, count) pairs
    - User defines:
      - Map: (offset, line) -> [(word1, 1), (word2, 1), ...]
      - Reduce: (word, [1, 1, ...]) -> [(word, count)]
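The map and reduce steps above can be written out as a self-contained sketch, run over a small in-memory "document" instead of a real cluster:

```python
from collections import defaultdict

def map_fn(offset, line):
    """(offset, line) -> [(word, 1), ...]"""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """(word, [1, 1, ...]) -> [(word, total)]"""
    return [(word, sum(counts))]

document = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase over every line, then group by word (the shuffle step)
groups = defaultdict(list)
for offset, line in enumerate(document):
    for word, one in map_fn(offset, line):
        groups[word].append(one)

# Reduce phase: one call per distinct word
counts = dict(pair for word, ones in groups.items()
              for pair in reduce_fn(word, ones))
print(counts["the"], counts["fox"])  # 3 2
```

Because each map call sees one line and each reduce call sees one word's values, both phases parallelize trivially across machines, which is the whole point of the model.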

    Word Count Execution Flow

    Monitoring the execution from the web console: http://172.16.203.132:50030/

    References
    - Hadoop project official site: http://hadoop.apache.org/core/
    - Google File System paper: http://labs.google.com/papers/gfs.html
    - Google MapReduce paper: http://labs.google.com/papers/mapreduce.html
    - Google: MapReduce in a Week: http://code.google.com/edu/submissions/mapreduce/listing.html
    - Google: Cluster Computing and MapReduce: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
