Hadoop and Data Analysis

DESCRIPTION

Hadoop and Data Analysis. Taobao Data Platform & Product Division, Infrastructure R&D Group, Zhou Min. Date: 2010-05-26. Outline: basic Hadoop concepts; where Hadoop applies; how Hadoop is implemented underneath; Hive and data analysis; Hadoop cluster management; a typical Hadoop offline-analysis system architecture; common problems and their solutions. The philosophy of playing cards. Card playing and MapReduce: deal the cards (input split), each player sorts their own hand, exchange (shuffle), re-sort, done (output). Counting words: a 1, the 1, weather 1, is 1 ...


  • Hadoop and Data Analysis. Taobao Data Platform & Product Division, Infrastructure R&D Group, Zhou Min. 2010-05-26.

  • Outline: basic Hadoop concepts; where Hadoop applies; how Hadoop is implemented underneath; Hive and data analysis; Hadoop cluster management; a typical Hadoop offline-analysis system architecture; common problems and their solutions.

  • Card playing and MapReduce: dealing the cards is the input split; each player sorting their own hand is the map; swapping cards is the shuffle; the final re-sort produces the output.

  • Counting words. Input: "The weather is good", "This guy is a good man", "Today is good", "Good man is good".
    Map output: the 1, weather 1, is 1, good 1 / this 1, guy 1, is 1, a 1, good 1, man 1 / today 1, is 1, good 1 / good 1, man 1, is 1, good 1.
    After shuffle: a 1; good 1, good 1, good 1, good 1, good 1; guy 1; is 1, is 1, is 1, is 1; man 1, man 1; the 1; this 1; today 1; weather 1.
    Reduce output: a 1, good 5, guy 1, is 4, man 2, the 1, this 1, today 1, weather 1.
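  • A minimal sketch of that word count in the old (org.apache.hadoop.mapred) API used throughout these slides; the class name, comments and driver settings are illustrative, not taken from the deck:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Map: emit (word, 1) for every token in the input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());   // lower-case to match the trace above
            output.collect(word, ONE);
          }
        }
      }

      // Reduce: the shuffle has grouped the 1s under each word; sum them.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }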

  • http://www.trendingtopics.org/

  • The Hadoop ecosystem: Hadoop Common, HDFS, MapReduce, Pig, NoSQL storage (HBase), ZooKeeper, Hive (SQL), Chukwa.

  • [Diagram: a large volume of raw data blocks flowing through Hadoop and being reduced to a small set of results.]

  • Hadoop (1): the first-pass mapper.

    // Job 1 map: split a quoted log line, pull out mid, sid and the timestamp,
    // keep the first 10 characters of the timestamp and pad with "0000",
    // then emit a composite key.
    public static class MapClass1 extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String strLine = value.toString();
        String[] strList = strLine.split("\"");
        String mid = strList[3];
        String sid = strList[4];
        String timestr = strList[0];
        try {
          timestr = timestr.substring(0, 10);
        } catch (Exception e) {
          return;                          // malformed line: skip it
        }
        timestr += "0000";
        // Delimiter and value were not legible on the slide; the quote delimiter
        // matches Reducer1's split, and the empty value is a placeholder.
        output.collect(new Text(mid + "\"" + sid + "\"" + timestr), new Text(""));
      }
    }

  • Hadoop (2): the first-pass reducer.

    public static class Reducer1 extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      private Text word = new Text();
      private Text str = new Text();
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String[] t = key.toString().split("\"");
        word.set(t[0]);
        // str.set(t[1]);
        output.collect(word, str);   // uid, kind
      } // reduce
    } // Reducer1

  • Hadoop (3): the second-pass mapper.

    public static class MapClass2 extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private Text word = new Text();
      private Text str = new Text();
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String strLine = value.toString();
        String[] strList = strLine.split("\\s+");
        word.set(strList[0]);
        str.set(strList[1]);
        output.collect(word, str);
      }
    }

  • Hadoop (4): the second-pass reducer.

    public static class Reducer2 extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      private Text word = new Text();
      private Text str = new Text();
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          String t = values.next().toString();
          // ...
        }
        // output.collect(new Text(mid + ... + sid + ...), ...);  // elided on the slide
      }
    }
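  • The four classes above chain into a two-pass job: the output of Reducer1 becomes the input of MapClass2. A hedged sketch of the driver the slides do not show, using the old JobConf API; the job names, paths and Text value types are assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class LogAnalysisDriver {
      public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
        Path output = new Path(args[2]);

        // Job 1: parse raw log lines and emit the composite mid/sid/time keys.
        JobConf job1 = new JobConf(LogAnalysisDriver.class);
        job1.setJobName("log-analysis-pass1");
        job1.setMapperClass(MapClass1.class);
        job1.setReducerClass(Reducer1.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        JobClient.runJob(job1);   // blocks until job 1 finishes

        // Job 2: re-read the intermediate records and aggregate them.
        JobConf job2 = new JobConf(LogAnalysisDriver.class);
        job2.setJobName("log-analysis-pass2");
        job2.setMapperClass(MapClass2.class);
        job2.setReducerClass(Reducer2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
      }
    }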

  • Thinking in MapReduce (1): expressing relational-style operations over keyed records (the A/B/C/D keys in the diagram): Group, Co-group, Function, Aggregate, Filter. A filter sketch follows the next slide.

  • Thinking in MapReduce (2).
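  • As one hedged illustration of that mapping (not code from the deck): a Filter needs only the map side of a job, while Group and Aggregate lean on the shuffle to bring equal keys together. The keyword predicate below is made up:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Map-only "Filter": keep the log lines that contain a keyword; run with
    // conf.setNumReduceTasks(0) so the map output is written directly to HDFS.
    public class FilterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, NullWritable> output, Reporter reporter)
          throws IOException {
        if (value.toString().contains("good")) {   // the filter predicate
          output.collect(value, NullWritable.get());
        }
      }
    }

    A Group-by, in contrast, is the shuffle itself: emitting the grouping column as the map key hands each reducer call one complete group.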

  • The magic of Hive: the kind of analysis hand-coded above becomes a single line of HiveQL (a hedged sketch of the MapReduce equivalent follows).

    SELECT COUNT(DISTINCT mid) FROM log_table
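  • For contrast, a hedged sketch of what that one line replaces in the old MapReduce API: the map emits each mid as a key, the shuffle deduplicates per group, and a single reducer (conf.setNumReduceTasks(1)) counts the distinct keys. The class name and the quote-delimited field position are assumptions carried over from the earlier slides:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class CountDistinctMid {
      // Map: emit every mid once per record; duplicates collapse into one reduce group.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text mid = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
          String[] fields = value.toString().split("\"");
          if (fields.length > 3) {
            mid.set(fields[3]);
            output.collect(mid, NullWritable.get());
          }
        }
      }

      // Reduce: with one reducer, each distinct mid triggers exactly one reduce() call.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, NullWritable, Text, LongWritable> {
        private long distinct = 0;
        private OutputCollector<Text, LongWritable> out;
        public void reduce(Text key, Iterator<NullWritable> values,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
          out = output;
          distinct++;
        }
        public void close() throws IOException {
          if (out != null) {
            out.collect(new Text("distinct_mid"), new LongWritable(distinct));
          }
        }
      }
    }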

  • Why Hadoop? webalizer / awstat / Atpanel: 250GB / 50 / 20; Hadoop: 470GB / 36 / 6~7.

  • Who uses Hadoop? IBM, Facebook, Amazon, Yahoo!

  • From the Web into the warehouse: Web Servers → Log Collection Servers → Filers → Data Warehousing on a Cluster (Oracle RAC, Federated MySQL, or Hadoop).

  • Hive on Hadoop, the main components: Rich Client, CLI/GUI, Client Program, Web Server, JobClient, Thrift Server, Scheduler, MetaStore Server (MySQL), and the Hadoop cluster itself.

  • Cluster management: the built-in Web UIs (ports 50030, 50060, 50070); the NameNode, JobTracker, DataNode and TaskTracker daemons; the Local Runner for single-process debugging; DistributedCache for shipping side files. A configuration sketch follows.
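  • A hedged sketch of those last two points (not code from the deck): the local runner is chosen purely through configuration, and DistributedCache copies a side file to every task node. The property values and the HDFS path are assumptions:

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class DebugSetup {
      public static JobConf configure(JobConf conf) throws Exception {
        // Local Runner: run the whole job inside one local JVM for debugging.
        conf.set("mapred.job.tracker", "local");
        conf.set("fs.default.name", "file:///");

        // DistributedCache: ship a lookup file (hypothetical path) to every task;
        // tasks can then read it via DistributedCache.getLocalCacheFiles(conf).
        DistributedCache.addCacheFile(new URI("/user/analysis/lookup.dat"), conf);
        return conf;
      }
    }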

  • Profiling: standard JVM tools (jmap, jstat, hprof, jconsole, JProfiler, MAT, jstack); profile the JobTracker, the TaskTrackers on the slaves, and the Child task JVMs on the slaves. A sketch of enabling child-task profiling follows.
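  • A hedged sketch of how child-task profiling is typically switched on through JobConf, so that hprof output from a few map and reduce attempts can be collected and inspected with the tools above; the task ranges and agent options are illustrative:

    import org.apache.hadoop.mapred.JobConf;

    public class ProfilingSetup {
      public static void enableChildProfiling(JobConf conf) {
        // Profile only a small sample of task attempts.
        conf.setProfileEnabled(true);
        conf.setProfileTaskRange(true, "0-2");    // map attempts 0..2
        conf.setProfileTaskRange(false, "0-2");   // reduce attempts 0..2
        // Stock hprof agent options; %s is replaced with the per-task output file,
        // which can then be pulled back from the task logs.
        conf.setProfileParams(
            "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
      }
    }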

  • Monitoring cluster I/O and CPU with Ganglia.

  • ?