Amazon Elastic MapReduce with Hive/Presto Hands-on (Lecture) 2015/5/13 Ryosuke Iwanaga Solutions Architect, Amazon Data Services Japan


  • Amazon Elastic MapReduce with Hive/Presto Hands-on (Lecture), 2015/5/13, Ryosuke Iwanaga, Solutions Architect, Amazon Data Services Japan

  • AWS

  • 3

    POS

    KPI

    GB TB PB

    ZB

    EB

  • 4

    AWS

  • 5

    M&A

    Before

  • 6

    After

    M&A

    IT

  • 7

  • 8

  • 9

  • 10

    Amazon EMR

  • Amazon Elastic MapReduce

  • 12

    Amazon Elastic MapReduce

    1Hadoop

    Application: Hive, Hue, Impala

    Bootstrap Action: Presto, Spark

  • 13

    Task Node

    Task Instance Group

    Amazon EMR

    security group

    security group

    Master Node

    Master Instance Group

    Amazon S3, DynamoDB

    Amazon Kinesis

    Core Node

    Core Instance Group

    HDFS HDFS

    HDFS HDFS

    Task Node

    Task Instance Group

    HDFS

    AWS

  • 14

    Amazon EMR Master Instance Group

    Master Node1 Failover

    NameNode, JobTracker

    Core Node, Task Node

    Master Node

    Master Instance Group

    Hadoop1: JobTracker

    Hadoop2: ResourceManager

    HDFS: NameNode

    Hive: HiveServer, MetaStore

    Presto: Coordinator

  • 15

    Amazon EMR Core Instance Group

    1 Core Node TaskTracker DataNode HDFS

    HDFS

    Core Node

    Core Instance Group

    Hadoop1: TaskTracker

    Hadoop2: NodeManager

    HDFS: DataNode

    Presto: Worker

  • 16

    Amazon EMR Core Instance Group

    Core Node HDFS CPU/RAM

    HDFS

    HDFS

    Core Node

    Core Instance Group

    Hadoop1: TaskTracker

    Hadoop2: NodeManager

    HDFS: DataNode

    Presto: Worker

  • 17

    Amazon EMR Task Instance Group

    HDFS Core

    HDFS Core Node

    HDFS

    Task Node

    Task Instance Group

    Hadoop1: TaskTracker

    Hadoop2: NodeManager

    Presto: Worker

  • 18

    Amazon EMR Task Instance Group

    Group Spot bid Instance Type

    RI, Spot

    Task Instance Group 2

    Task Instance Group 1

    c3.xlarge * 2 bid: $0.1

    r3.xlarge * 2 bid: $0.5
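The two Task Instance Groups above can be turned into a quick upper-bound estimate of hourly Spot spend. This is only a sketch: it assumes each bid is the maximum price per instance-hour, and it ignores the separate Amazon EMR charge and any Master or Core nodes. The function name `max_hourly_spend` is made up for illustration.

```python
from decimal import Decimal

# Task Instance Groups as listed on the slide:
# (instance type, instance count, bid in USD per instance-hour).
TASK_GROUPS = [
    ("c3.xlarge", 2, Decimal("0.1")),
    ("r3.xlarge", 2, Decimal("0.5")),
]


def max_hourly_spend(groups):
    """Upper bound on hourly Spot cost: every instance billed at its full bid."""
    return sum((count * bid for _, count, bid in groups), Decimal("0"))


if __name__ == "__main__":
    print(max_hourly_spend(TASK_GROUPS))
```

The actual Spot charge follows the market price, so the real figure would normally be lower than this bound.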

  • 19

    :

    1h 1h

    >

    1

    1

  • 20

    Spot Instance

    Task Instance Group Core Instance Group

    SLA SLA

    On-demand Core Node SLA On-demand

    Spot Instance Task Node On-demand 90%

  • 21

    EMRFS: Amazon S3 HDFS

    EMR

    EMR

    Amazon S3

  • 22

    EMRFS

    s3:// Amazon S3

    : Amazon Glacier Amazon S3 Amazon S3

  • 23

    EMRFS Consistent View

    Amazon S3

    EMRFS Consistent View Amazon DynamoDB

    Amazon S3 Amazon DynamoDB

    EMRFS
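The idea behind Consistent View, checking an Amazon S3 listing against metadata kept in Amazon DynamoDB, can be illustrated with a toy model. This is not the EMRFS implementation: `list_with_consistent_view`, its parameters, and the retry policy are all hypothetical, and a plain Python set stands in for the DynamoDB table.

```python
import time


def list_with_consistent_view(list_objects, tracked_keys, retries=3, delay_s=0.0):
    """Re-list until every key recorded in the metadata store is visible.

    list_objects: callable returning the keys currently visible in the store.
    tracked_keys: set of keys the metadata store says have been written.
    """
    missing = set()
    for attempt in range(retries + 1):
        visible = set(list_objects())
        missing = tracked_keys - visible
        if not missing:
            return sorted(visible)
        if attempt < retries:
            time.sleep(delay_s)  # wait before listing again
    raise RuntimeError(
        f"listing still missing keys after {retries} retries: {sorted(missing)}"
    )


if __name__ == "__main__":
    # Simulated eventually consistent listing: "b" appears on the second call.
    snapshots = iter([["a"], ["a", "b"]])
    print(list_with_consistent_view(lambda: next(snapshots), {"a", "b"}))
```

The point of the sketch is only the comparison step: a listing is trusted once it contains everything the metadata store expects.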

  • 24

    Amazon EMR

    Amazon S3

    EMR

    t

  • 25

    Amazon EMR: Bootstrap Action

    Node OK

    Bash, Ruby, Python, etc. Amazon S3

    AWS

  • 26

    Amazon EMR: Step

    : ETL HiveQL Amazon S3 jar

    Streaming, Hive, Pig EMR S3

    script-runner.jar bash Step (Auto-terminate)

  • 27

    Amazon EMR: Application

    Application: Hive, Pig, Hue, HBase, etc. Bootstrap Action, Step

    MapReduce

    Bootstrap Action, Step, Application Amazon EMR

  • 28

    Amazon EMR

    (Spot) Amazon S3

  • Hive

  • 30

    Hive

    SQL like

    HDFS; Amazon EMR: EMRFS, Amazon S3, Amazon DynamoDB, Amazon Kinesis

    MapReduce, Tez, Spark

  • 31

    Hive

    Metastore

    HDFS

    Cluster

    Amazon S3

    Hiveserver CREATE TABLE

    SELECT FROM

    1: HDFS, S3

    2:

    3:

    4:

    5:

    6:

    Amazon DynamoDB

    Amazon Kinesis

  • 32

    Hive

    MapReduce MapReduce

    Stinger Initiative = Hive 0.13: Tez, SQL, etc.

    Stinger.next = 1 Spark, ACID, SQL:2011

  • 33

    Hive on Amazon EMR

    Application

    Metastore Master MySQL

    hive-site.xml MySQL

    $ aws emr create-cluster \
        --applications Name=Hive \

  • YARN

  • 35

    YARN

    YARN = Yet-Another-Resource-Negotiator Hadoop2

    JobTracker YARN MapReduce

  • 36

    YARN: ResourceManager

    CPU, Memory, etc.

  • 37

    YARN: NodeManager

    RM

    Container Container

  • 38

    YARN: Container

    NM

    DefaultContainerExecutor

    LinuxContainerExecutor cgroups

    DockerContainerExecutor Docker

  • 39

    YARN: ApplicationMaster

    Container JobTracker 1 AM

    Container
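The division of labour among the YARN components above (the ResourceManager tracks CPU and memory, NodeManagers host Containers) can be pictured with a toy first-fit placement. This is an illustrative sketch only: real YARN schedulers are far more elaborate, and the `allocate` function, node names, and resource figures here are hypothetical.

```python
def allocate(nodes, request):
    """Place one container on the first node with enough free resources.

    nodes: dict of node name -> {"memory_mb": free, "vcores": free}.
    request: {"memory_mb": needed, "vcores": needed}.
    Returns the chosen node name, or None if no node can host the container.
    """
    for name, free in nodes.items():
        if (free["memory_mb"] >= request["memory_mb"]
                and free["vcores"] >= request["vcores"]):
            # Reserve the resources on that node.
            free["memory_mb"] -= request["memory_mb"]
            free["vcores"] -= request["vcores"]
            return name
    return None


if __name__ == "__main__":
    cluster = {
        "nm1": {"memory_mb": 4096, "vcores": 2},
        "nm2": {"memory_mb": 8192, "vcores": 4},
    }
    for _ in range(3):
        print(allocate(cluster, {"memory_mb": 3072, "vcores": 1}))
```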

  • 40

    YARN

    Mesos, Amazon EC2 Container Service

    YARN

    YARN

    Tez: DAG YARN

    Twill: YARN

    Slider: YARN

  • Tez

  • 42

    Tez

    YARN DAG

    Directed Acyclic Graph (DAG)

    Vertex: Map, Reduce, Join

    Edge: Producer, Consumer

    SQL DAG

    DAG Tez
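A DAG of vertices (processing stages) and edges (producer to consumer) can be sketched with the Python standard library. The query shape and vertex names below are hypothetical and this is not Tez's API; it only shows that a DAG fixes a valid execution order in which every producer runs before its consumers.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG for a query that joins two tables and then aggregates.
# Each key is a vertex; its value is the set of producer vertices whose
# output it consumes (the incoming edges).
dag = {
    "map_orders": set(),
    "map_users": set(),
    "join": {"map_orders", "map_users"},
    "reduce_aggregate": {"join"},
}


def execution_order(graph):
    """Return one ordering in which every producer precedes its consumers."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(dag))
```

The relative order of the two independent map vertices is not fixed, which is exactly the freedom a DAG engine can use to run them in parallel.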

  • 43

    Tez

    Hive, Pig, Cascading MR

    HDFS

    Container Session

  • ORC File / Parquet

  • 45

    Hive

    TEXTFILE 1(CSV)

    SEQUENCEFILE 1

    1SELECT

  • 46

    ORC File / Parquet

    Optimized Row Columnar (ORC) File: Stinger Project, RCFile

    Parquet: Twitter/Cloudera, Google Dremel

  • 47

    ORC File

    Stripe: Index, Footer; Index: min/max, Row

    Data File

    Footer

    Hive
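The min/max values kept in the index let a reader skip stripes that cannot contain a requested value. The sketch below shows that idea on a single in-memory column; it is not the ORC file format or any ORC library API, and `build_stripes` and `scan_equal` are hypothetical helper names.

```python
def build_stripes(values, stripe_size):
    """Split a column into stripes and record min/max for each stripe."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_equal(stripes, target):
    """Return (matching values, number of stripes actually read)."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        # Skip the stripe when its min/max range rules out the target.
        if target < stripe["min"] or target > stripe["max"]:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe["rows"] if v == target)
    return matches, stripes_read


if __name__ == "__main__":
    column = list(range(100))
    print(scan_equal(build_stripes(column, 10), 42))
```

Skipping works best when the data is sorted or clustered on the filtered column; with randomly ordered values most stripes would still overlap the target.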

  • 48

    Parquet

    ColumnChunk Thrift

    ColumnChunk, Row Group

    ORC

  • 49

    11

    Hive 0.13 ORC/Parquet

  • 50

    Hive ORC File/Parquet

    CREATE TABLE t (
      col1 STRING,
      ...
    ) STORED AS [ORC/PARQUET];
    INSERT INTO t (...);
    SELECT col1 FROM t;

  • 51

    Hadoop/Hive

    SQL

    Hadoop2 YARN

    Hive Amazon EMR

  • Hive SQL

  • 53

    Massively Parallel Processing (MPP) SQL

    SQL Hive

    Hive MPP: Presto, Impala, Spark SQL, Drill

    DWH Amazon Redshift

    MPP Amazon Redshift

    Amazon Redshift

  • 54

    Hive

    Metastore sync/share Hive

    Impala, Spark SQL: HiveQL

    Presto, Drill: ANSI SQL

    Metastore

    HDFS

    Amazon S3

  • Presto

  • 56

    Presto

    MPP SQL1 ANSI SQL SQL

    YARN

    MySQL, PostgreSQL JOIN

  • 57

    Presto

    Coordinator

    Worker Coordinator

    Amazon S3

  • 58

    Presto on Amazon EMR

    2015/4: Application Bootstrap Action

    AWS https://github.com/awslabs/emr-bootstrap-actions/tree/master/presto/latest

    Presto YARN Presto Amazon EMR

  • 60

    Amazon EMR

    AWS Amazon EMR Amazon S3