
AWS re:Invent re:Cap - Data Analytics: Amazon EC2 C4 Instance + Amazon EBS - 김일호


Page 1

December 10, 2014 | Korea

김 일호, Solutions Architect

Page 2

BDT201 - Big Data and HPC State of the Union

BDT202 - HPC Now Means 'High Personal Computing'

BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial

BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data

BDT205 - Your First Big Data Application on AWS

BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise

BDT207 - Use Streaming Analytics to Exploit Perishable Insights

BDT208 - Finding High Performance in the Cloud for HPC

BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research

BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift

BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift

BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS

BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis

BDT307 - Running NoSQL on Amazon EC2

BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse

BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track

BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time

BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track

BDT310 - Big Data Architectural Patterns and Best Practices on AWS

BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads

BDT312 - Using the Cloud to Scale from a Database to a Data Platform

BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools

BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm

BDT403 - Netflix's Next Generation Big Data Platform

Page 3

Collect: AWS Direct Connect, AWS Import/Export, Amazon Kinesis
Store: S3, Glacier, DynamoDB
Process & Analyze: Redshift, EMR, EC2
Automate: AWS Data Pipeline

Page 4

[Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, business intelligence tools, GIS tools, Amazon data pipeline]

Page 5
Page 6
Page 7

[Architecture diagram: Log4J, EMR-Kinesis Connector, Hive with Amazon S3, Amazon Redshift (parallel COPY from Amazon S3), Amazon Kinesis processing state]

Page 8
Page 9

Launch a 3-instance Hadoop 2.4 cluster with Hive installed:

m3.xlarge

YOUR-AWS-REGION

YOUR-AWS-SSH-KEY

Page 10

YOUR-BUCKET-NAME

Page 11

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 2

Page 12

\

CHOOSE-A-REDSHIFT-PASSWORD

Page 13

YOUR-IAM-ACCESS-KEY YOUR-IAM-SECRET-KEY

Page 14
Page 15

Log4J

Page 16

YOUR-AWS-SSH-KEY YOUR-EMR-MASTER-PRIVATE-DNS

YOUR-EMR-MASTER-PRIVATE-DNS YOUR-EMR-HOSTNAME

Start Hive:

hive

Page 17
Page 18

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY;

YOUR-AWS-REGION

hive>

hive>

hive>

hive>

hive>

hive>

Page 19
Page 20

hive>

STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'

TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");

Page 21

-- return the first row in the stream

hive>

-- return a count of all items in the stream

hive>

-- return a count of all rows with a given host

hive>

Page 22

Log4J

EMR-Kinesis Connector

Page 23

http://127.0.0.1:19026/cluster

http://127.0.0.1:19101

Page 24
Page 25

hive>

YOUR-S3-BUCKET/emroutput

Page 26

-- set up Hive's "dynamic partitioning"

-- splits output files when writing to Amazon S3

hive>

hive>

Page 27

-- compress output files on Amazon S3 using Gzip

hive>

hive>

hive>

hive>

Page 28
Page 29

-- convert the Apache log timestamp to a UNIX timestamp

-- split files in Amazon S3 by the hour in the log lines

hive>
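
As a rough illustration of the two comments above, the following Python sketch (not the Hive statement from the slide, which is not captured in this transcript) converts an Apache-style log timestamp to a UNIX timestamp and derives an hourly bucket that output files could be split by. The function names and the `YYYY-MM-DD-HH` bucket format in UTC are assumptions made for this example; the sample timestamp comes from the log record shown on the DataXu Records page.

```python
from datetime import datetime, timezone


def apache_time_to_unix(ts: str) -> int:
    """Convert an Apache access-log timestamp to seconds since the UNIX epoch."""
    parsed = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
    return int(parsed.timestamp())


def hour_partition(unix_ts: int) -> str:
    """Return a UTC hour bucket (hypothetical key format for this sketch)."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y-%m-%d-%H")


ts = apache_time_to_unix("08/Nov/2014:21:59:54 -0500")
print(ts, hour_partition(ts))  # 1415501994 2014-11-09-02
```

Bucketing in UTC is a design choice for the sketch; a pipeline could equally keep the hour in the log's local offset, as long as every writer uses the same rule.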

Page 30

[Architecture diagram: Log4J, EMR-Kinesis Connector, Hive with Amazon S3]

Page 31
Page 32

YOUR-S3-BUCKET

YOUR-S3-BUCKET

Page 33

# using the PostgreSQL CLI

YOUR-REDSHIFT-ENDPOINT

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J

Page 34
Page 35
Page 36

YOUR-S3-BUCKET

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY

Page 37

-- show all requests from a given IP address

-- count all requests on a given day

-- show all requests referred from other sites
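
The SQL for these three queries is not captured in this transcript. As a hedged sketch of what such queries might look like, the example below uses Python's built-in `sqlite3` module as a local stand-in for Amazon Redshift. The table name `accesslogs`, its columns, and the sample rows are all hypothetical, and Redshift's date functions differ from SQLite's `date()`.

```python
import sqlite3

# In-memory SQLite database standing in for a Redshift cluster (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accesslogs (
        host TEXT, request_time TEXT, request TEXT, status INTEGER, referrer TEXT
    )
    """
)
rows = [
    ("69.120.26.172", "2014-11-08 21:59:54", "GET /rs HTTP/1.1", 302, "http://example.com/page"),
    ("10.0.0.7", "2014-11-08 22:10:03", "GET /index.html HTTP/1.1", 200, "-"),
    ("10.0.0.7", "2014-11-09 00:01:15", "GET /missing HTTP/1.1", 404, "-"),
]
conn.executemany("INSERT INTO accesslogs VALUES (?, ?, ?, ?, ?)", rows)

# Show all requests from a given IP address.
from_ip = conn.execute(
    "SELECT * FROM accesslogs WHERE host = ?", ("10.0.0.7",)
).fetchall()

# Count all requests on a given day.
on_day = conn.execute(
    "SELECT COUNT(*) FROM accesslogs WHERE date(request_time) = ?", ("2014-11-08",)
).fetchone()[0]

# Show all requests referred from other sites ('-' means no referrer).
referred = conn.execute(
    "SELECT * FROM accesslogs WHERE referrer <> '-'"
).fetchall()

print(len(from_ip), on_day, len(referred))  # 2 2 1
```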

Page 38
Page 39

[Architecture diagram: Log4J, EMR-Kinesis Connector, Hive with Amazon S3, Amazon Redshift (parallel COPY from Amazon S3)]

Page 40

Bonus:

Page 41
Page 42

hive>

hive>

hive>

hive>

hive>

Page 43

-- Create an external table on Amazon S3

-- to hold query results.

-- Partition (split files on Amazon S3) by iteration

hive>

YOUR-S3-BUCKET

Page 44

-- set up a first iteration

-- create OS-ERROR_COUNT result (404 error codes) under dynamic partition 0

Page 45

-- set up a second iteration over the data in the Kinesis Stream

-- create OS-ERROR_COUNT result under dynamic partition 1.

-- if file is empty, the previous iteration read all remaining stream data
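
The iteration idea in these comments can be sketched with a toy in-memory model. This is not the EMR-Kinesis connector's implementation or API; the `CheckpointedStream` class and its method names are hypothetical, and a plain list stands in for the stream. It only illustrates that each numbered iteration starts where the previous one ended, and that an empty result means the previous iteration already read everything.

```python
class CheckpointedStream:
    """Toy stand-in for a stream that is read in numbered iterations."""

    def __init__(self):
        self.records = []      # append-only list of records
        self.checkpoints = {}  # iteration number -> (start, end) offsets

    def put(self, record):
        self.records.append(record)

    def read_iteration(self, iteration):
        # Re-running a finished iteration replays exactly the same range.
        if iteration in self.checkpoints:
            start, end = self.checkpoints[iteration]
            return self.records[start:end]
        if iteration > 0 and iteration - 1 not in self.checkpoints:
            raise ValueError("previous iteration has not run yet")
        # A new iteration starts where the previous one ended.
        start = self.checkpoints[iteration - 1][1] if iteration > 0 else 0
        end = len(self.records)
        self.checkpoints[iteration] = (start, end)
        return self.records[start:end]


def count_errors(records, status=404):
    """Count records with the given HTTP status code."""
    return sum(1 for r in records if r["status"] == status)


stream = CheckpointedStream()
for status in (200, 404, 200, 404, 404):
    stream.put({"status": status})

first = count_errors(stream.read_iteration(0))   # 404s seen by iteration 0
stream.put({"status": 404})                      # new data arrives later
second = count_errors(stream.read_iteration(1))  # only the new record is read
third = stream.read_iteration(2)                 # empty: nothing left to read
print(first, second, third)  # 3 1 []
```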

Page 46

[Architecture diagram: Log4J, EMR-Kinesis Connector, Hive with Amazon S3, Amazon Redshift (parallel COPY from Amazon S3), Amazon Kinesis processing state]

Page 47

YOUR-S3-BUCKET

YOUR-S3-BUCKET

Page 48

YOUR-S3-BUCKET YOUR-PREFIX.gz .

YOUR-PREFIX.gz

Page 49
Page 50

DataXu

Page 51
Page 52

DataXu Records

tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
….
adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"}
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"

Confirmation Record

[- 69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"

Fraud Record

Page 53

[Architecture diagram. Stages: Producers, Aggregator, Continuous Processing, Storage, Analytics. Components: CDN, Real-time Bidding, Retargeting Platform, Reporting, Qubole, Real Time Apps, KCL Apps, Archiver, Amazon Kinesis, Event Replay, Amazon S3, Redshift]

Page 54
Page 55
Page 56

Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

Page 57
Page 58

https://github.com/awslabs/kinesis-log4j-appender

Page 59

Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

Page 60

Amazon Kinesis storage is replicated across Availability Zones

[Diagram labels: Amazon Web Services, three Availability Zones (AZ), front end with authentication and authorization]

• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Millions of sources producing 100s of terabytes per hour
• Ordered stream of events supports multiple readers
• Real-time dashboards and alarms
• Machine learning algorithms or sliding window analytics
• Aggregate analysis in Hadoop or a data warehouse
• Aggregate and archive to S3
• Inexpensive: $0.028 per million puts

Page 61
Page 62

[Chart: 1 KB messages/sec (0 to 1,200,000) versus number of shards (0 to 1,100)]

TCO for average 1M events/second:

with 50:1 packing and 10:1 compression: $6351/month

raw: $28610/month
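
The packing and compression mentioned above can be sketched as follows. This is an illustrative Python example, not DataXu's implementation: the function names, the newline-delimited JSON layout, and the use of `zlib` are assumptions, and the actual compression ratio depends entirely on the data. The idea is that many small events are grouped into one record payload, which is then compressed, so far fewer and smaller puts are needed.

```python
import json
import zlib


def pack_events(events, events_per_record=50):
    """Group events into batches and compress each batch into one record payload."""
    records = []
    for start in range(0, len(events), events_per_record):
        batch = events[start:start + events_per_record]
        payload = "\n".join(json.dumps(e, sort_keys=True) for e in batch)
        records.append(zlib.compress(payload.encode("utf-8")))
    return records


def unpack_record(record):
    """Reverse pack_events for a single record payload."""
    lines = zlib.decompress(record).decode("utf-8").split("\n")
    return [json.loads(line) for line in lines]


# Hypothetical sample events loosely modeled on the record fields shown earlier.
events = [
    {"tx_id": f"tx-{i}", "exchange": "APPNEXUS", "time_stamp": 1415393474434 + i}
    for i in range(120)
]
records = pack_events(events)
raw_bytes = sum(len(json.dumps(e, sort_keys=True)) for e in events)
packed_bytes = sum(len(r) for r in records)
print(len(events), "events ->", len(records), "records")
print(raw_bytes, "raw bytes ->", packed_bytes, "packed bytes")
```

A consumer would call `unpack_record` on each payload to recover the individual events before processing them.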

Page 63

Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

Page 64
Page 65

[Diagram: Amazon Kinesis Shard-i holding events with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. Labels: Host A, Host B; a table with columns Shard ID, Lock, Seq num; a table with columns Shard ID, Last Archived; sample event {Event 10, …}]

Page 66

Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

Page 67

[Architecture diagram. Components: CDN, Real Time Bidding, Retargeting Platform, Reporting, Qubole, Real Time Apps, KCL Apps, Archiver, Kinesis, Event Replay, S3]

Page 68

[Architecture diagram. Stages: Producers, Aggregator, Continuous Processing, Storage, Analytics. Components: CDN, Real-time Bidding, Retargeting Platform, Reporting, Qubole, Real Time Apps, KCL Apps, Archiver, Amazon Kinesis, Event Replay, Amazon S3, Amazon Redshift]

Page 69

[Architecture diagram. Stages: Producers, Aggregator, Continuous Processing, Storage, Analytics. Components: CDN, Real-time Bidding, Retargeting Platform, Reporting, Qubole, Real Time Apps, KCL Apps, Archiver, Amazon Kinesis, Event Replay, Amazon S3, Redshift]

Page 70

Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

Page 71

• Unordered processing
  – Randomize partition key to distribute events over many shards and use multiple workers
• Exact order processing
  – Control the partition key to ensure events are grouped onto the same shard and read by the same worker.
• Need both? Get global sequence number

[Diagram labels: Producer, Get Global Sequence, Unordered Stream, Get Event Metadata, Campaign Centric Stream, Fraud Inspection Stream]

Id   Event          Stream – partition key
1    confirmation   Campaign-centric stream – UUID
2    fraud          Unordered Stream; Fraud-inspection stream – sessionid
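
The two partition-key strategies can be illustrated with a small Python sketch. Amazon Kinesis maps a partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value; the sketch below assumes the hash space is split evenly across shards, which is a simplification. The function name `shard_for`, the shard count of 4, and the key `session-42` are hypothetical.

```python
import hashlib
import uuid
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


# Unordered processing: random keys spread events over all shards.
unordered = Counter(shard_for(str(uuid.uuid4()), 4) for _ in range(10_000))

# Exact order processing: one controlled key always lands on the same shard.
ordered = {shard_for("session-42", 4) for _ in range(100)}

print(sorted(unordered.items()))
print(ordered)
```

With random keys each of the four shards receives roughly a quarter of the events, while the fixed key yields a single shard index, so one worker sees those events in order.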

Page 72

Sending: HTTP Post, AWS SDK, LOG4J, Flume, Fluentd
Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce

[Diagram labels: Amazon EMR, Playback, Amazon S3, Archiver]

Page 73

Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

Page 74

http://bit.ly/aws-bdt205

Page 75
Page 76

General Purpose: M1, M3 (, T2)

Compute Optimized: C1, CC2, C3, C4

Memory Optimized: M2, CR1, R3

Storage Optimized: HI1, HS1, I2

GPU: CG1, G2

Micro: T1, T2

Page 77

[Timeline chart: Amazon EC2 instance types by year (2006, 2007, 2008, 2009, 2010, 2011, 2012-2013, December 2014), with new and existing types marked in each column]

Instance types shown: m1.small, m1.medium, m1.large, m1.xlarge; m2.xlarge, m2.2xlarge, m2.4xlarge; m3.medium, m3.large, m3.xlarge, m3.2xlarge; c1.medium, c1.xlarge; cc1.4xlarge, cc2.8xlarge; cg1.4xlarge; cr1.8xlarge; hi1.4xlarge; hs1.8xlarge; t1.micro; c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge; g2.2xlarge; i2.large, i2.xlarge, i2.4xlarge, i2.8xlarge; r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge; t2.micro, t2.small, t2.medium

Introducing now: c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge

Page 78

The next generation of Amazon EC2 Compute-optimized instances

• Based on Intel Xeon E5-2666 v3 (Haswell) processors

• 2.9 GHz – peaking at 3.5 GHz with Turbo Boost

Ideal for running tier 1 applications, gaming and web servers, transcoding, and high performance computing workloads.

EBS-optimized by default… and at no additional cost!

Instance Name   vCPU Count   RAM        Network Performance
c4.large        2            3.75 GiB   Moderate
c4.xlarge       4            7.5 GiB    Moderate
c4.2xlarge      8            15 GiB     High
c4.4xlarge      16           30 GiB     High
c4.8xlarge      36           60 GiB     10 Gbps

Preliminary specifications. May change prior to release.

Page 79

Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes.

EBS Name                            Capacity               IOPS                               Throughput
Amazon EBS General Purpose (SSD)    16 TB (up from 1 TB)   10,000 IOPS (up from 3,000 IOPS)   160 MBps *
Amazon EBS Provisioned IOPS (SSD)   16 TB (up from 1 TB)   20,000 IOPS (up from 4,000 IOPS)   320 MBps *

* When attached to EBS-optimized instances