Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)

蘇國鈞, Institute for Information Industry (資訊工業策進會), Digital Education Institute (數位教育研究所), Information Technology Training Center (資訊技術訓練中心)


  • Java SE, Java EE, SOAP/RESTful Services, Design Patterns, EJB/JPA, Java EE, Struts/Spring/Hibernate Open Source Frameworks, JBoss AS, GlassFish Application Server

    Java, .NET, Hadoop Platform, NoSQL, Big Data, Google App Engine, Microsoft Azure, CloudBees, Android, Windows Phone, Smart Phone

    PS. Google Search

    Bio

    2/74

    https://www.google.com.tw/imghp?hl=zh-TW&tab=wi

  • Agenda

    0.

    1. Hadoop

    2. Hadoop

    3. Hadoop

    4. Hadoop

    5. Hadoop

    6.

    Hadoop

    3/74

  • 4/74

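  • Lucene and Nutch: both created by Doug Cutting
    2003/2004: Google papers that influenced Nutch
    2006: Hadoop split out of Nutch; the name Hadoop was chosen by Doug
    2008-01: Hadoop became an Apache Top-Level Project
    2009-09: Doug Cutting joined Cloudera as Architect
    2011-06: Yahoo! spun off its Hadoop team as Hortonworks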

    Hadoop

    5/74

    http://www.cloudera.com/
    http://www.hortonworks.com/

  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

    It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

    Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

    Apache Hadoop

    6/74

    http://hadoop.apache.org/

  • ...

    ...

    Hadoop, Big Data

    7/74

  • Hadoop + Big Data

    ()

    8/74

  • Hadoop + Big Data

    ()

    9/74

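  • 1. Submit Job
    2. The JobTracker (JT) assigns Tasks to the TaskTrackers (TT)
    3. Each TT runs its Tasks
    4. Each TT reports back to the JT

    Hadoop 1.x - MapReduce (MRv1)

    JobTracker (Master), TaskTracker (Slave)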

    10/74
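The four steps above can be pictured with a toy, single-process simulation in plain Java. It is only a sketch and does not use the Hadoop API: the class `MiniJobTracker`, the method `runJob`, and the round-robin assignment policy are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the MRv1 control flow: one JobTracker (master), several TaskTrackers (slaves).
// All names are hypothetical; real Hadoop scheduling is far more involved.
class MiniJobTracker {
    private final int trackerCount;

    MiniJobTracker(int trackerCount) {
        if (trackerCount <= 0) {
            throw new IllegalArgumentException("need at least one TaskTracker");
        }
        this.trackerCount = trackerCount;
    }

    // Accepts a submitted job (step 1) and walks through steps 2 to 4.
    List<String> runJob(List<String> tasks) {
        // Step 2: assign each task to a TaskTracker (round-robin is an assumption).
        List<List<String>> assignments = new ArrayList<>();
        for (int i = 0; i < trackerCount; i++) {
            assignments.add(new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            assignments.get(i % trackerCount).add(tasks.get(i));
        }

        // Steps 3 and 4: each tracker "runs" its tasks and reports back.
        List<String> reports = new ArrayList<>();
        for (int tt = 0; tt < trackerCount; tt++) {
            for (String task : assignments.get(tt)) {
                reports.add("TT" + tt + " finished " + task);
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        MiniJobTracker jt = new MiniJobTracker(2);
        // Step 1: the client submits a job made of three tasks.
        jt.runJob(List.of("map-0", "map-1", "reduce-0")).forEach(System.out::println);
    }
}
```

Because one object both assigns work and collects every report, the sketch also hints at why a single JobTracker becomes a bottleneck as the cluster grows.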

  • Hadoop 1.x

    Hadoop HDFS (Storage)
    Hadoop MapReduce (Computing Engine + Resource Management + Job Scheduling/Monitoring + ...)

    Cluster: 4,000-4,500 Nodes
    JobTracker: 40,000 Concurrent Tasks
    HDFS: Namespace (/sales, /accounting, ...)
    MapReduce Job ...

    Cluster, Task

    11/74

  • Hadoop

    Batch Job, Interactive Query, Real-Time Processing, Graph Processing, Iterative Modeling

    Hadoop (Batch Processing)

    Batch Job, Batch Job, Job, Job, I/O Overhead

    Hadoop (HDFS) (MapReduce)

    12/74

  • 13/74

  • Hadoop

    14/74

  • MapReduce

    Hadoop (HDFS) (YARN)

    Hadoop: Batch, Data Operating System

    MapReduce: Batch Processing; Hive, Tez: Interactive SQL Query; ...

    15/74

  • MapReduce, Hadoop, MapReduce Job, MapReduce

    16/74

  • MapReduce Phase 1: Resource Management; MapReduce, YARN, Other YARN Frameworks

    17/74

  • MapReduce Phase 2: MapReduce, YARN, Batch Job Computing Framework; YARN: Tez, Storm, Giraph, Spark, OpenMPI, ...

    18/74

  • MapReduce Phase 3: MapReduce (Hive, Pig), Computing Framework (Tez)

    19/74

  • HDFS

    High Availability, Namespace, Snapshot, I/O 2.5-5, ...

    HDFS -> HDFS2

    20/74

  • Hadoop 2.x

    Hadoop Common (Core Libraries)
    Hadoop HDFS (Storage)
    Hadoop MapReduce (Computing Engine)
    Hadoop YARN (Resource Management + Job Scheduling/Monitoring)

    Hadoop 2.x ... Backward Compatibility; Yahoo!, Hadoop 2.x, 35,000+ Nodes, ...

    21/74

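  • 1. Submit Job
    2. An ApplicationMaster (AM) is started for the job
    3. The AM registers with the ResourceManager (RM)
    4. The AM requests resources from the RM
    5. Containers are launched
    6. The AM and the Containers communicate
    7. The Client and the AM communicate
    8. The AM finishes and unregisters from the RM

    Hadoop 2.x - MapReduce (MRv2)
    ResourceManager, NodeManager - Resource
    ApplicationMaster - Framework-Specific
    ResourceManager, Resource, NodeManager, Container
    Container, Resource, Schedule, Task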

    22/74
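The eight steps above can be traced with a small, self-contained Java sketch. It is a toy model under stated assumptions, not the YARN API: `MiniYarn`, `submit`, and the idea of counting capacity in whole containers are hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.List;

// Toy walk-through of the YARN application lifecycle. No Hadoop classes are used;
// MiniYarn and its methods are hypothetical names for illustration only.
class MiniYarn {
    private int freeContainers; // cluster capacity tracked by the "ResourceManager"

    MiniYarn(int totalContainers) {
        this.freeContainers = totalContainers;
    }

    int freeContainers() {
        return freeContainers;
    }

    // ResourceManager side: grant up to `wanted` containers, never more than are free.
    private int allocate(int wanted) {
        int granted = Math.min(wanted, freeContainers);
        freeContainers -= granted;
        return granted;
    }

    // Returns the sequence of events for one application.
    List<String> submit(String app, int tasks) {
        List<String> log = new ArrayList<>();
        log.add("client submits " + app);                                  // step 1
        if (allocate(1) == 0) {
            log.add("no capacity to start an ApplicationMaster");
            return log;
        }
        log.add("RM starts the ApplicationMaster in a container");         // step 2
        log.add("AM registers with RM");                                   // step 3
        int granted = allocate(tasks);                                     // step 4
        log.add("AM requested " + tasks + " containers, RM granted " + granted);
        for (int i = 0; i < granted; i++) {
            log.add("NodeManager launches container " + i);                // step 5
        }
        log.add("containers report progress to AM");                       // step 6
        log.add("client asks AM for status");                              // step 7
        freeContainers += granted + 1;                                     // release everything
        log.add("AM unregisters from RM");                                 // step 8
        return log;
    }

    public static void main(String[] args) {
        new MiniYarn(3).submit("wordcount", 5).forEach(System.out::println);
    }
}
```

In this sketch the ResourceManager only hands out capacity, while the per-application logic lives in the ApplicationMaster steps, which mirrors the split of responsibilities the slide describes.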

  • MapReduce (MRv2): ResourceManager, Resource Arbitrator; Capacity, Fairness, SLA; Pluggable Interface

    ApplicationMaster: MRv1, MRv2, ResourceManager, NodeManager, Container

    ApplicationMaster: MRv1, ResourceManager; MRv2, ResourceManager, Scalable, 10,000+ Nodes; ApplicationMaster, Per-Application

    ApplicationMaster: Framework-Specific; ResourceManager, Framework

    23/74

  • YARN - Yet Another Resource Negotiator
    A General-Purpose Distributed Application Management Framework
    Data Operating System for Enterprise Hadoop

    24/74

  • Resource vs. Container

    Resource Model

    Rack, Host, Resource, CPU Core

    Container, Resource Model, Resource

    YARN Application: ApplicationMaster, Container, Command-Line, 3rd-Party JAR, Security Token, NodeManager, Container

    Container, OS Process

    25/74
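One way to picture the resource model and the container on this slide is as plain data types. The sketch below is hypothetical and does not use the real YARN classes: the type names, the `grant` helper, and treating memory as a second resource dimension next to CPU cores are all assumptions for illustration.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical data model for illustration; these are not the real YARN classes.
class ContainerSketch {
    // Where (rack, host) and how much (memory, CPU cores) an application asks for.
    static final class ResourceRequest {
        final String rack;
        final String host;
        final int memoryMb; // memory as a resource dimension is an assumption
        final int cpuCores;

        ResourceRequest(String rack, String host, int memoryMb, int cpuCores) {
            this.rack = rack;
            this.host = host;
            this.memoryMb = memoryMb;
            this.cpuCores = cpuCores;
        }
    }

    // A granted right to use a fixed amount of resource on one host.
    static final class Container {
        final String host;
        final int memoryMb;
        final int cpuCores;

        Container(String host, int memoryMb, int cpuCores) {
            this.host = host;
            this.memoryMb = memoryMb;
            this.cpuCores = cpuCores;
        }
    }

    // What the ApplicationMaster hands to the NodeManager to start the container process.
    static final class LaunchContext {
        final String commandLine;
        final List<String> jars;
        final String securityToken;

        LaunchContext(String commandLine, List<String> jars, String securityToken) {
            this.commandLine = commandLine;
            this.jars = jars;
            this.securityToken = securityToken;
        }
    }

    // Grants a container only if the host still has enough free memory and cores.
    static Optional<Container> grant(ResourceRequest req, int freeMemoryMb, int freeCores) {
        if (req.memoryMb > freeMemoryMb || req.cpuCores > freeCores) {
            return Optional.empty();
        }
        return Optional.of(new Container(req.host, req.memoryMb, req.cpuCores));
    }

    public static void main(String[] args) {
        ResourceRequest req = new ResourceRequest("rack1", "host1", 1024, 1);
        LaunchContext ctx = new LaunchContext("java MyTask", List.of("my-task.jar"), "token-placeholder");
        grant(req, 2048, 4).ifPresent(c ->
                System.out.println("launch '" + ctx.commandLine + "' on " + c.host));
    }
}
```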

  • Hadoop, Batch

    26/74

  • Hadoop

    27/74

  • Windows

    28/74

  • Hadoop

    29/74

  • HDFS: Distributed File System
    MapReduce: Distributed Data Analysis Engine
    Avro: Language-Neutral Data Serialization System (2010-05 Top-Level Project)
    Mahout: Scalable Library for Machine Learning
    HBase: Distributed Data Storage (2010-05 Top-Level Project)
    Pig: High Level Language for Data Analysis (2010-09 Top-Level Project)
    Hive: Data Warehousing and SQL-Like Query (2010-09 Top-Level Project)
    Sqoop: Data Migration Tool Between HDFS and RDBMS

    Hadoop Ecosystem

    30/74

  • HCatalog: Hadoop, Naming Service

    31/74

  • Yahoo!: Pig. Pig Latin scripts are turned into MapReduce Jobs

    Facebook: Hive. HiveQL queries are turned into MapReduce Jobs

    Hive, Pig

    Hadoop, Big Data, MapReduce/Java

    32/74

  • Stinger Initiative

    Hortonworks, Hadoop, MapReduce, Data Processing Platform, Hive, Interactive Query, PB-Scale Processing

    Speed: Hive, 10, 100
    Scale: TB, PB
    SQL Compatibility: SQL

    13, 44, 145 Developers, 39, Hive, 3 Releases

    33/74

    http://hortonworks.com/labs/stinger/

  • Stinger Initiative, Hive

    34/74

  • Hive - Speed

    35/74

  • Hive - Scale: ORCFile (Optimized Row Columnar File); ORCFile, HCatalog, Pig, MapReduce

    36/74

    http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

  • Hive - SQL Compatibility: Role, Privilege, Grant, Revoke

    37/74

  • Sqoop: Cloudera, Hadoop, RDBMS, JDBC, MapReduce

    38/74

    http://sqoop.apache.org/

  • Hadoop

    39/74

  • Hadoop Distribution: Doug Cutting, 2011 Hadoop World Keynote

    The similarity between Hadoop and Linux kernel, and the corresponding similarity between the big stack of Hadoop (Hive, Hbase, Pig, Avro, etc.) and the fully operational operating systems with its distributions (RedHat, Ubuntu, Fedora, Debian etc.)

    Hadoop Distribution

    Cloudera: Cloudera Distribution for Hadoop (CDH)
    Oracle: Oracle Big Data Appliance
    Intel: Intel Distribution for Hadoop (IDH), Cloudera
    Hortonworks: Hortonworks Data Platform (HDP)
    Microsoft: Microsoft HDInsight
    MapR: MapR Distribution for Apache Hadoop (M3, M5, M7)
    ...

    Make (): Apache BigTop

    40/74

    http://bigtop.apache.org/

  • Cloudera Distribution for Hadoop

    2014: 900M, 740M, Intel

    41/74

    http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
    http://www.theregister.co.uk/2014/06/02/cloudera_board_announcement/

  • Oracle Big Data Appliance

    Oracle Big Data Platform, Cloudera Distribution for Hadoop (CDH)

    42/74

    http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

  • Hortonworks Data Platform

    Hortonworks, Hadoop, Windows Porting, YARN

    2013: 50M Funding; 2014: 100M

    43/74

    http://hortonworks.com/hdp/

  • Microsoft HDInsight

    HDInsight, Hortonworks Data Platform (HDP)

    44/74

    http://azure.microsoft.com/zh-tw/documentation/services/hdinsight/

  • MapR Distribution for Apache Hadoop

    2013: 35M Funding; 2014: Google, Qualcomm, 110M

    45/74

    http://www.mapr.com/products/hadoop-download

  • Hadoop

    46/74

  • Parallel Processing: Tez, Spark, ...

    User Interface: Hue

    SQL on Hadoop: Impala, Presto, Drill/Dremel/BigQuery, ...

    Data Collector: Flume, Chukwa, Scribe, ...

    Machine Learning: Mahout, ...

    Hadoop, Big Data

    47/74

    http://tez.apache.org/
    https://spark.apache.org/
    http://gethue.com/
    http://impala.io/
    http://prestodb.io/
    http://incubator.apache.org/drill/
    http://research.google.com/pubs/pub36632.html
    https://developers.google.com/bigquery/?hl=zh-tw
    http://flume.apache.org/
    https://chukwa.apache.org/
    https://github.com/facebookarchive/scribe
    https://mahout.apache.org/

  • Tez: Hortonworks
    A framework for near real-time big data processing
    Inspired by Microsoft Dryad
    Stinger Initiative
    Dataflow model on a directed acyclic graph (DAG) of nodes
    Query Plan

    48/74

    http://tez.apache.org/
    http://hortonworks.com/hadoop/tez/
    http://cs.brown.edu/~debrabant/cis570-website/papers/dryad.pdf
    http://hortonworks.com/labs/stinger/
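The "dataflow model on a directed acyclic graph" idea can be illustrated without any Tez classes: a query plan is a set of steps, and a step may run only after every step feeding it has finished. The sketch below orders a small hypothetical plan with Kahn's topological sort. The class `DagSketch`, the method `executionOrder`, and the step names are assumptions for illustration and say nothing about how Tez itself schedules work.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Orders the steps of a DAG so that every step comes after all of its inputs.
class DagSketch {
    // edges maps each step to the steps that consume its output.
    static List<String> executionOrder(Map<String, List<String>> edges) {
        // Count incoming edges per step; TreeMap keeps the result deterministic.
        Map<String, Integer> inDegree = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String to : e.getValue()) {
                inDegree.merge(to, 1, Integer::sum);
            }
        }

        // Steps with no pending inputs are ready to run.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String to : edges.getOrDefault(step, List.of())) {
                // A consumer becomes ready once its last input has finished.
                if (inDegree.merge(to, -1, Integer::sum) == 0) {
                    ready.add(to);
                }
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("graph has a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical query plan: two scans feed a join, which feeds an aggregate.
        Map<String, List<String>> plan = Map.of(
                "scan_orders", List.of("join"),
                "scan_users", List.of("join"),
                "join", List.of("aggregate"));
        System.out.println(executionOrder(plan));
    }
}
```

Expressing the whole plan as one graph, rather than as a chain of separate jobs, is what lets an engine avoid writing intermediate results to storage between every pair of steps.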

  • UC Berkeley AMPLab, 2009; 2010: Open Source; DataBricks
    HDFS; General-Purpose Cluster Computing System
    In-Memory: Hadoop, 100; In-Disk: Hadoop, 10
    YARN
    MLLib; Mahout, Crunch, Cascading, Spark
    Cloudera, DataBricks, IBM, Intel, MapR
    Hive, Pig, Sqoop, Oozie

    Spark - Lightning-Fast Cluster Computing

    49/74

    https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/
    http://databricks.com/
    http://hortonworks.com/press-releases/hortonworks-announces-apache-spark-yarn-ready/
    https://spark.apache.org/mllib/
    https://mahout.apache.org/
    http://spark.apache.org/

  • Hue - Hadoop User Experience
    Cloudera
    Online Demo: http://demo.gethue.com/

    50/74

    http://gethue.com/
    http://demo.gethue.com/

  • Hue - Interactive SQL & Dashboard

    51/74

    http://gethue.com/

  • Impala - Real-Time Queries in Hadoop
    Cloudera, 2012
    HDFS/HBase: Distributed Parallel SQL Query Engine in Real Time
    Google F1 (Fault-Tolerant Distributed RDBMS), Dremel (Ad Hoc Query Tool)
    SQL on Hadoop, MapReduce, In-Memory Process
    Compliant with ANSI-92 SQL Standard
    Cloudera ODBC Driver for Impala, BI/DW

    52/74

    http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
    http://research.google.com/pubs/pub38125.html
    http://research.google.com/pubs/pub36632.html
    http://www.cloudera.com/content/support/en/downloads/connectors/impala/impala-odbc-v2-5-15.html

  • Presto: Facebook, 2012, 2013
    Facebook Data Warehouse, Query Execution Engine, Cache
    ANSI-SQL Compatible Query
    CPU Efficiency: Hive, 4-7, 8-10

    Most of Facebook is pictures of cats, updates about bodily functions, nihilistic ramblings, and the pingings of Zynga games feeding e-stims to folk, it bears noting that none of this really matters for designing massive data systems.

    53/74

    http://prestodb.io/
    http://www.theregister.co.uk/2013/06/07/hey_presto_facebook_reveals_exabytescale_query_engine/

  • Google, 2010: Dremel, Interactive Analysis of Web-Scale Datasets

    Apache Drill: Dremel, Open Source; Scale Out: 10,000+ Nodes, PB, Trillion Records

    Google BigQuery: Dremel, IaaS

    Dremel, Drill, BigQuery: What is BigQuery, Its Features and Some Successful Products Who Got Benefited from It?

    54/74

    http://research.google.com/pubs/pub36632.html
    http://incubator.apache.org/drill/
    https://developers.google.com/bigquery/
    http://www.netsolutionsindia.com/blog/what-is-bigquery-its-features-and-some-successful-products-who-got-benefited-from-it/

  • ()

    Hadoop

    55/74

  • Cloudera

    Intel (), DataBricks (Spark), IBM, Oracle, MapR, ...

    Hortonworks

    Microsoft (), ...

    56/74

    http://www.cloudera.com/content/cloudera/en/partners.html
    http://databricks.com/
    http://hortonworks.com/partners/

  • Hive vs. Impala

    Did Cloudera Just Shoot Their Impala?

    Cloudera, Hive, Hive, Real-Time Distributed SQL Processing, Hive, Hive, Spark

    57/74

    http://hortonworks.com/blog/cloudera-just-shoot-impala/
    https://issues.apache.org/jira/browse/HIVE-7292

  • Hive

    Impala

    Tez/YARN vs. Spark

    Cloudera, MapR, IBM, and Intel bet on Spark as the new heart of Hadoop

    Cloudera: Hive, Spark; Spark, SQL on Hadoop, Hive; Hive on Spark, Shark; Cloudera, Shark, Hive, Hive; Cloudera, Hive on Spark, Hive Query Planner; Hive, MapReduce, Tez; Hortonworks, Hive on Tez, Query Planner; Hive on MapReduce; Hive on Tez on YARN

    58/74

    http://www.theregister.co.uk/2014/06/30/cloudera_and_co_spark/
    https://issues.apache.org/jira/browse/HIVE-7292

  • Hadoop, Machine Learning

    Recommendation Mining, Clustering, Classification, Use Case

    2014-04-25: Goodbye MapReduce; Codebase, Scala DSL, Spark

    Mahout

    59/74

    https://mahout.apache.org/

  • 60/74

  • Python, Ruby, C/C++, C#, Perl, Bash, ...

    Programming: Java

    MapReduce, YARN, Hadoop MapReduce Examples, popcorny, Gradle

    Hadoop Streaming

    61/74

    http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/
    http://www.codedata.com.tw/author/popcorny
    https://github.com/popcornylu/hadoop-wordcount
    http://hadoop.apache.org/docs/r1.2.1/streaming.html
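Hadoop Streaming lets any executable act as a mapper or reducer as long as it reads lines from standard input and writes tab-separated key/value lines to standard output. That is why the languages listed on this slide can all be used. The word-count mapper below is a minimal sketch of that contract, written in Java only to stay consistent with the deck's other code. The class name `StreamingWordMapper` and the lower-casing of words are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Word-count mapper following the streaming contract: stdin lines in, "key<TAB>value" lines out.
class StreamingWordMapper {
    // Turns one input line into one "word<TAB>1" record per word.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                records.add(word.toLowerCase() + "\t1");
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                System.out.println(record);
            }
        }
    }
}
```

Because the program touches only standard input and output, it can be tried locally by piping a text file into it before it is ever handed to a cluster.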

  • Dataflow, MapReduce, Google, Flume, MillWheel, Dataflow SDK, Java, BigQuery, Dataflow, Dataflow, BigQuery, Google, Twitter, Dataflow, Twitter

    Dataflow, Java

    iThome: Google I/O 2014, Dataflow

    62/74

    http://googlecloudplatform.blogspot.tw/2014/06/sneak-peek-google-cloud-dataflow-a-cloud-native-data-processing-service.html
    http://www.ithome.com.tw/news/89181

  • Data

    SQL on Hadoop, NoSQL and Hadoop, ...

    Hue, Hive, SQL, HiveQL, Impala, Hive, ANSI-SQL, Sqoop, JDBC, RDBMS/BI/DW, HBase, NoSQL, ...

    63/74

  • MySQL Hadoop Applier: MySQL Binary Log Event, libhdfs C Library, Hadoop
    Real-Time Integration/Backup Between MySQL and Hadoop

    64/74

    http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html

  • Phoenix CLI - Sqlline
    Phoenix GUI - SQuirrel

    Phoenix - We put the SQL back in NoSQL
    Salesforce; HBase, JDBC Wrapper
    Client, SQL Query, HBase Scan, JDBC ResultSet
    Query: ms, Million

    65/74

    http://squirrel-sql.sourceforge.net/
    http://phoenix.apache.org/

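  • JDBC API, HBase

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HelloPhoenix {
        public static void main(String[] args) throws SQLException {
            // [zookeeper] is a placeholder for the ZooKeeper quorum of the HBase cluster.
            Connection con = DriverManager.getConnection("jdbc:phoenix:[zookeeper]");
            Statement stmt = con.createStatement();

            // Phoenix uses UPSERT instead of INSERT/UPDATE.
            stmt.executeUpdate("create table test (mykey integer not null primary key, mycolumn varchar)");
            stmt.executeUpdate("upsert into test values (1,'Hello')");
            stmt.executeUpdate("upsert into test values (2,'World!')");
            con.commit();

            PreparedStatement statement = con.prepareStatement("select * from test");
            ResultSet rs = statement.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("mycolumn"));
            }

            statement.close();
            con.close();
        }
    }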

    66/74

  • Windows

    Windows Azure HDInsight Emulator

    Linux

    Cloudera QuickStart VMs for CDH, Hortonworks HDP Sandbox, BigSQL QuickStart VM

    Browser

    Cloudera Live

    Hadoop as a Service

    Microsoft Azure HDInsight Service, Amazon Elastic MapReduce (EMR)

    Platform

    67/74

    http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT
    http://www.cloudera.com/content/support/en/downloads.html
    http://hortonworks.com/hdp/downloads/
    http://www.bigsql.org/se/preDownload.jsp
    http://go.cloudera.com/cloudera-live.html
    http://manage.windowsazure.com/
    http://aws.amazon.com/cn/elasticmapreduce/

  • 1, Hadoop

    68/74

  • 5, Hadoop VM

    69/74

  • 10, Hadoop Cluster

    70/74

  • 71/74

  • 72/74

  • 73/74

  • The Possibilities of Hadoop for Big Data

    0:36

    74/74