Introduction to Hadoop. Trend Micro R&D Lab. Copyright 2009 - Trend Micro Inc. Outline: Introduction to the Hadoop project; HDFS (Hadoop Distributed File System) overview


  • Introduction to Hadoop


    Outline
      • Introduction to Hadoop project
      • HDFS (Hadoop Distributed File System) overview
      • MapReduce overview
      • Reference


    What is Hadoop?
      • Hadoop is a cloud computing platform for processing and storing vast amounts of data.
      • Apache top-level project
      • Open source
      • Hadoop Core includes
          • Hadoop Distributed File System (HDFS)
          • MapReduce framework
      • Hadoop subprojects
          • HBase, ZooKeeper, ...
      • Written in Java. Client interfaces come in C++/Java/shell scripting
      • Runs on Linux, Mac OS X, Windows, and Solaris
      • Commodity hardware


    A Brief History of Hadoop
      • 2003.2: First MapReduce library written at Google
      • 2003.10: Google File System paper published
      • 2004.12: Google MapReduce paper published
      • 2005.7: Doug Cutting reports that Nutch now uses a new MapReduce implementation
      • 2006.2: Hadoop code moves out of Nutch into a new Lucene sub-project
      • 2006.11: Google Bigtable paper published


    A Brief History of Hadoop (continued)
      • 2007.2: First HBase code drop from Mike Cafarella
      • 2007.4: Yahoo! running Hadoop on a 1000-node cluster
      • 2008.1: Hadoop made an Apache Top Level Project


    Who uses Hadoop?
      • Yahoo!
          • More than 100,000 CPUs in ~20,000 computers running Hadoop
      • Google
          • University Initiative to Address Internet-Scale Computing Challenges
      • Amazon
          • Builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools
          • Processes millions of sessions daily for analytics
      • IBM
          • Blue Cloud Computing Clusters
      • Trend Micro
          • Threat solution research
      • More: http://wiki.apache.org/hadoop/PoweredBy


    Hadoop Distributed File System Overview


    HDFS Design
      • Single namespace for the entire cluster
      • Very large distributed file system
          • 10K nodes, 100 million files, 10 PB
      • Data coherency
          • Write-once-read-many access model
          • A file once created, written and closed need not be changed.
          • Appending writes to existing files (in the future)
      • Files are broken up into blocks
          • Typically 128 MB block size
          • Each block replicated on multiple DataNodes
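As a rough illustration of the block model above, the following sketch computes how a file would be divided into fixed-size blocks and how much raw storage its replicas would occupy. The helper names `split_into_blocks` and `raw_storage` are hypothetical, and the replication factor of 3 is the default mentioned elsewhere in these slides; this is not HDFS code.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes would occupy.

    Hypothetical helper: every block is full except possibly the last one.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication=3):
    """Bytes stored across the cluster when every block is replicated."""
    return file_size * replication


one_gb = 1024 * 1024 * 1024
blocks = split_into_blocks(one_gb + 1)
print(len(blocks), blocks[-1])        # 9 1  (eight full blocks plus one byte)
print(raw_storage(one_gb) // one_gb)  # 3
```

Only the last block can be smaller than the block size, so a file one byte larger than 1 GB needs a ninth block.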


    HDFS Design (cont.)
      • Moving computation is cheaper than moving data
          • Data locations are exposed so that computations can move to where the data resides
      • File replication
          • Default is 3 copies. Configurable by clients
      • Assumes commodity hardware
          • Files are replicated to handle hardware failure
          • Detects failures and recovers from them
      • Streaming data access
          • High throughput of data access rather than low latency of data access
          • Optimized for batch processing


    NameNode
      • Manages the file system namespace
          • Maps a file name to a set of blocks
          • Maps a block to the DataNodes where it resides
      • Cluster configuration management
      • Replication engine for blocks


    NameNode Metadata
      • Metadata in memory
          • The entire metadata is in main memory
          • No demand paging of metadata
      • Types of metadata
          • List of files
          • List of blocks for each file
          • List of DataNodes for each block
          • File attributes, e.g. creation time, replication factor
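The metadata types listed above can be pictured as two in-memory maps: file name to block list, and block to DataNode list. The sketch below is a toy model under that assumption; the class names `FileEntry` and `NamespaceSketch`, the block ids, and the DataNode names are all hypothetical and do not reflect HDFS's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileEntry:
    """Hypothetical per-file record: attributes plus an ordered block list."""
    creation_time: float
    replication: int
    block_ids: List[int] = field(default_factory=list)


@dataclass
class NamespaceSketch:
    """Toy stand-in for the NameNode's in-memory metadata (not HDFS code)."""
    files: Dict[str, FileEntry] = field(default_factory=dict)
    block_locations: Dict[int, List[str]] = field(default_factory=dict)

    def add_file(self, path: str, creation_time: float, replication: int = 3) -> None:
        self.files[path] = FileEntry(creation_time, replication)

    def add_block(self, path: str, block_id: int, datanodes: List[str]) -> None:
        self.files[path].block_ids.append(block_id)
        self.block_locations[block_id] = list(datanodes)

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        """Map a file name to its blocks, and each block to its DataNodes."""
        return [(b, self.block_locations[b]) for b in self.files[path].block_ids]


ns = NamespaceSketch()
ns.add_file("/foodir/myfile.txt", creation_time=0.0)
ns.add_block("/foodir/myfile.txt", 1, ["dn1", "dn2", "dn3"])
ns.add_block("/foodir/myfile.txt", 2, ["dn2", "dn3", "dn4"])
print(ns.locate("/foodir/myfile.txt"))
```

Because both maps live in main memory, a lookup is two dictionary accesses with no disk I/O, which matches the "no demand paging" point on the slide.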


    NameNode Metadata (cont.)
      • A transaction log (called EditLog)
          • Records file creations, file deletions, etc.
      • FsImage
          • The entire namespace, the mapping of blocks to files, and file system properties are stored in a file called FsImage.
          • The NameNode can be configured to maintain multiple copies of FsImage and EditLog.
      • Checkpoint
          • Occurs when the NameNode starts up
          • Reads FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage
          • Then flushes the new version out into a new FsImage
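The checkpoint described above amounts to replaying a log on top of a snapshot. The following sketch shows that idea with plain dictionaries; the record layout (`op`, `path`, `replication`) and the function names are assumptions made for illustration, not the real FsImage or EditLog formats.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace (a dict)."""
    op, path = edit["op"], edit["path"]
    if op == "create":
        namespace[path] = {"replication": edit.get("replication", 3)}
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Replay edit_log on a copy of fsimage.

    Returns the new image and an empty log, assuming the log can be
    truncated once its transactions are folded into the image.
    """
    namespace = copy.deepcopy(fsimage)  # leave the old image untouched
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace, []


old_image = {"/a": {"replication": 3}}
edits = [
    {"op": "create", "path": "/b", "replication": 2},
    {"op": "delete", "path": "/a"},
]
new_image, remaining_log = checkpoint(old_image, edits)
print(new_image)      # {'/b': {'replication': 2}}
print(remaining_log)  # []
```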


    Secondary NameNode
      • Copies FsImage and EditLog from the NameNode to a temporary directory
      • Merges FsImage and EditLog into a new FsImage in the temporary directory
      • Uploads the new FsImage to the NameNode
      • The transaction log on the NameNode is purged
      • (Diagram: FsImage + EditLog -> FsImage (new))


    NameNode Failure
      • A single point of failure
      • Transaction log stored in multiple directories
          • A directory on the local file system
          • A directory on a remote file system (NFS/CIFS)
      • Need to develop a real HA solution


    DataNode
      • A block server
          • Stores data in the local file system (e.g. ext3)
          • Stores metadata of a block (e.g. CRC)
          • Serves data and metadata to clients
      • Block report
          • Periodically sends a report of all existing blocks to the NameNode
      • Facilitates pipelining of data
          • Forwards data to other specified DataNodes


    HDFS - Replication
      • Default is 3x replication.
      • The block size and replication factor are configured per file.
      • Block placement algorithm is rack-aware
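The slide does not spell out what "rack-aware" means. The sketch below assumes the commonly documented default policy for 3x replication (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function `place_replicas`, the topology dictionary, and the node names are hypothetical; this is an illustration, not the HDFS implementation.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three DataNodes for one block under the assumed policy.

    topology maps rack name -> list of DataNode names; writer must be
    one of those DataNodes.
    """
    local_rack = next((r for r, nodes in topology.items() if writer in nodes), None)
    if local_rack is None:
        raise ValueError(f"unknown DataNode: {writer}")
    # Candidate remote racks must be able to host two distinct replicas.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
print(place_replicas("dn1", topology, random.Random(0)))
```

Under this assumed policy a block survives the loss of a whole rack, while only one of the two extra copies has to cross a rack boundary during the write.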


    User Interface
      • API
          • Java API
          • A C language wrapper (libhdfs) for the Java API is also available
      • POSIX-like commands
          • hadoop dfs -mkdir /foodir
          • hadoop dfs -cat /foodir/myfile.txt
          • hadoop dfs -rm /foodir/myfile.txt
      • HDFS Admin
          • bin/hadoop dfsadmin -safemode
          • bin/hadoop dfsadmin -report
          • bin/hadoop dfsadmin -refreshNodes
      • Web Interface
          • Ex: http://172.16.203.136:50070


    Web Interface (http://172.16.203.136:50070): browse the file system

    POSIX-like commands


    Java API
      • Latest API: http://hadoop.apache.org/core/docs/current/api/


    MapReduce Overview


    Brief history of MapReduce
      • A demand for large-scale data processing at Google
      • The folks at Google discovered certain common themes for processing those large input sizes
          • Many machines are needed
          • Two basic operations on the input data: Map and Reduce
      • Google's idea is inspired by the map and reduce functions commonly used in functional programming.
      • Functional programming: MapReduce
          • Map: [1,2,3,4] (*2) -> [2,4,6,8]
          • Reduce: [1,2,3,4] (sum) -> 10
      • Divide and conquer paradigm
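The two functional primitives named above can be tried directly with Python's built-in `map` and `functools.reduce`. This is only a single-machine illustration of the functional idea, not of the distributed framework.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]

# Map: apply the same function to every element independently.
doubled = list(map(lambda x: x * 2, numbers))

# Reduce: fold all elements into one value with a binary operation.
total = reduce(add, numbers)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```

Because each element is mapped independently, the map step can be split across many machines, which is what makes the paradigm attractive for large inputs.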


    MapReduce
      • MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
      • Users specify
          • a map function that processes a key/value pair to generate a set of intermediate key/value pairs
          • a reduce function that merges all intermediate values associated with the same intermediate key
      • The MapReduce framework handles the details of execution.


    Application Area
      • The MapReduce framework is for computing certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster.
      • Examples
          • Large data processing such as search, indexing and sorting
          • Data mining and machine learning on large data sets
          • Analysis of huge access logs at large portals
      • More applications: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/


    MapReduce Programming Model
      • The user defines two functions
          • map (K1, V1) -> list(K2, V2)
              • takes an input pair and produces a set of intermediate key/value pairs
          • reduce (K2, list(V2)) -> list(K3, V3)
              • accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values
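To make the two signatures concrete, here is a minimal single-process sketch of what a framework does around the user's functions: run map over every input pair, group the intermediate values by key, then run reduce once per key. The driver `run_mapreduce`, the sample log records, and the two user functions are hypothetical and exist only for this illustration; a real framework distributes these phases across a cluster.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: map (K1, V1) -> list(K2, V2), reduce (K2, list(V2)) -> list(K3, V3)."""
    # Map phase, with grouping by intermediate key (the "shuffle").
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    # Reduce phase: one call per intermediate key, in sorted key order.
    output = []
    for k2 in sorted(intermediate):
        output.extend(reduce_fn(k2, intermediate[k2]))
    return output


def parse_log_line(line_no, line):
    """map: (line number, "user,bytes") -> [(user, bytes)]"""
    user, size = line.split(",")
    return [(user, int(size))]


def total_bytes(user, sizes):
    """reduce: (user, [bytes, ...]) -> [(user, total)]"""
    return [(user, sum(sizes))]


records = [(0, "alice,100"), (1, "bob,50"), (2, "alice,200")]
result = run_mapreduce(records, parse_log_line, total_bytes)
print(result)  # [('alice', 300), ('bob', 50)]
```

The user code never touches the grouping step, which is the part the framework owns.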

    MapReduce Logical Flow


    MapReduce Execution Flow


    Word Count Example
      • Word count problem: consider the problem of counting the number of occurrences of each word in a large collection of documents.
      • Input: a collection of (document name, document content)
      • Expected output: a collection of (word, count)
      • The user defines
          • Map: (offset, line) -> [(word1, 1), (word2, 1), ... ]
          • Reduce: (word, [1, 1, ...]) -> [(word, count)]
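The word count definitions above can be exercised on one machine with a sort-based grouping step standing in for the framework's shuffle. The function names and the sample lines are hypothetical, and whitespace splitting is a simplifying assumption about what counts as a word.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(offset, line):
    """Map: (offset, line) -> [(word1, 1), (word2, 1), ...]"""
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> [(word, count)]"""
    return [(word, sum(counts))]


def word_count(lines):
    # Map phase: the key is the character offset at which each line starts.
    pairs = []
    offset = 0
    for line in lines:
        pairs.extend(map_fn(offset, line))
        offset += len(line) + 1  # +1 for the newline separator
    # Shuffle phase: sorting brings equal words next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct word.
    result = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        result.extend(reduce_fn(word, [count for _, count in group]))
    return result


print(word_count(["hello world", "hello hadoop"]))
# [('hadoop', 1), ('hello', 2), ('world', 1)]
```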


    Word Count Execution Flow


    Monitoring the execution from the web console: http://172.16.203.132:50030/


    Reference
      • Hadoop project official site: http://hadoop.apache.org/core/
      • Google File System paper: http://labs.google.com/papers/gfs.html
      • Google MapReduce paper: http://labs.google.com/papers/mapreduce.html
      • Google: MapReduce in a Week: http://code.google.com/edu/submissions/mapreduce/listing.html
      • Google: Cluster Computing and MapReduce: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
