Cloud Computing, BigData, 2011.12

1st Korea Community Day presentation: Bigdata


DESCRIPTION

Bigdata Platform, Hadoop, Hive, ...


Page 1: 1st Korea Community Day presentation: Bigdata

Cloud Computing

BigData

2011.12

Page 2
Page 3

www.gruter.com

SDS, NHN

www.jaso.co.kr
www.cloudata.org
www.cloumon.org
www.twitter.com/babokim
www.facebook.com/babokim

Page 4

BigData Definition(1)

What is BigData?

Big Data (BD)

DB (McKinsey, 2011)

DB (IDC, 2011): Big Data

SNS, M2M

Economist (2010.05)

Gartner (2011.03)

Information silo

McKinsey (2011.05)

Source: Big Data, (KT)

Page 5

BigData Definition(2)

Very large, distributed aggregations of loosely structured data

- Petabytes/exabytes of data
- Millions/billions of people
- Billions/trillions of records
- Loosely-structured and often distributed data
- Flat schemas with few complex interrelationships
- Often involving time-stamped events
- Often made up of incomplete data
- Often including connections between data elements that must be probabilistically inferred

Applications that involve Big-data can be transactional (e.g., Facebook, PhotoBox) or analytic (e.g., ClickFox, Merced Applications).

http://wikibon.org/wiki/v/Enterprise_Big-data

Page 6

Big-data Analytics Complements Data Warehouse

Traditional Data Warehouse

- Complete record from transactional system
- All data centralized
- Analytics designed against stable environment
- Many reports run on a production basis

Big-data Analytic Environment

- Data from many sources inside and outside of the organization (including traditional DW)
- Data often physically distributed
- Need to iterate on the solution to test/improve models
- Large-memory analytics also part of iteration
- Every iteration usually requires a complete reload of information

http://wikibon.org/wiki/v/Enterprise_Big-data

Page 7

Facebook Social plug-in

Feedback

process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.

Analytic

Transactional
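As a quick sanity check on the throughput figure quoted on this slide, the per-day and per-second numbers can be compared with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the "events in flight" estimate assumes a steady arrival rate, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_second(events_per_day: float) -> float:
    """Average arrival rate implied by a daily event total."""
    return events_per_day / SECONDS_PER_DAY


def events_in_flight(rate_per_second: float, lag_seconds: float) -> float:
    """Events that arrive during the processing lag, assuming a steady rate."""
    return rate_per_second * lag_seconds


average_rate = events_per_second(20_000_000_000)
print(f"average rate: {average_rate:,.0f} events/s")  # average rate: 231,481 events/s
print(f"events within a 30 s lag at 200,000/s: {events_in_flight(200_000, 30):,.0f}")
```

Under these assumptions, 20 billion events per day averages out to roughly 231,000 events per second, the same order of magnitude as the 200,000 quoted, and a 30-second lag corresponds to about 6 million events not yet reflected in the results.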

Page 8

BigData

Collecting, Analysis, Reporting/Searching

Repository

Clustering, Classification, Sentiment Analysis, Indexing

SNS

Store (DBMS, NoSQL)

Index

Robot, RSS Reader, OpenAPI

User-Defined Query Script

ETL

Data Aggregator

Page 9

Workers schemify tweets and append to Hadoop

Workers update statistics on URLs by incrementing counters in Cassandra

Distribute tweets randomly on multiple queues

Workers choose queue to enqueue to using hash/mod of URL

All updates for same URL guaranteed to go to same worker

Workers share the load of schemifying tweets

Twitter: backtype
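The hash/mod routing step on this slide can be sketched in a few lines. This is a minimal illustration, not BackType's code: the queue count, the function names, the example URLs, and the use of CRC32 as the hash are all assumptions, and an in-memory `Counter` stands in for the Cassandra counters.

```python
import zlib
from collections import Counter, defaultdict

NUM_QUEUES = 4  # hypothetical number of queues/workers


def queue_for(url: str, num_queues: int = NUM_QUEUES) -> int:
    """Pick a queue by hash/mod of the URL.

    CRC32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings, which is salted per interpreter run.
    """
    return zlib.crc32(url.encode("utf-8")) % num_queues


def route(urls, num_queues: int = NUM_QUEUES):
    """Place each URL on the queue chosen by queue_for()."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_for(url, num_queues)].append(url)
    return queues


def count_per_worker(queues):
    """Each worker counts only the URLs on its own queue."""
    return {queue_id: Counter(urls) for queue_id, urls in queues.items()}


if __name__ == "__main__":
    # Hypothetical URLs for illustration.
    urls = ["http://a.example/x", "http://b.example/y", "http://a.example/x"]
    for queue_id, counts in count_per_worker(route(urls)).items():
        print(queue_id, dict(counts))
```

Because the queue index depends only on the URL, every update for a given URL lands on the same worker, so that worker can increment the URL's counter without coordinating with the others.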

Page 10

BigData

Architectural Requirements

Scalability
- Scale-out
- Elasticity

Reliability

Flexibility
- Easy to add analysis rules
- Support various data formats

Latency
- Real time, near real time, batch

High Throughput
- Global web-scale traffic

Hadoop

IBM, HP, Oracle

BI/DW

Page 11

BigData

Flume, Scribe, Chukwa

Hadoop FileSystem, MogileFS

NoSQL (Cloudata, HBase, Cassandra), Katta, ElasticSearch

count, sum aggregation: S4, Storm

Hadoop MapReduce (Hive, Pig), Giraph, GoldenOrb

Cluster, Classification: Mahout, R

ZooKeeper, HUE, Cloumon

Serialization: Thrift, Avro, ProtoBuf

Page 12

Hadoop Ecosystem

http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/

Page 13

Software Stack

Interface
- Web, Phone, Pad

Data Visualization

Search (ElasticSearch)

Batch Analysis
- Analysis Job
- Mining Lib (Mahout)
- Statistics Lib (R)
- Script Language (Hive, Pig)
- Job Workflow Engine (oozie, cascade)
- Data Analysis Platform (hadoop)

(Near) Real-time Analysis
- Analysis Job
- Rule Management
- CEP Engine (Esper)
- Real-time Analysis Platform

Aggregator

Collector (flume, scribe, chukwa)

Data Store
- File System (HadoopFS)
- NoSQL (Cloudata, HBase, Cassandra)

Management
- Monitoring (cloumon)
- Cluster Management (ZooKeeper)

Page 14

Chukwa (Yahoo)
- Hadoop FileSystem
- HDFS, MapReduce

Scribe (Facebook)
- thrift
- Hadoop JNI

Flume (Cloudera)
- Hadoop, HBase, Search Engine

Application Server (log, Log4j)

Agent (local)

Temp Log

Collector #1

Collector #2

Centralized Storage (HDFS)

Page 15

- Esper Event

- Gruter ClouStream, Yahoo S4, Twitter Storm, Facebook Puma

ClouStream

Puma
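The stream engines named on this slide are used for running aggregations such as counts over recent events. As a rough illustration of that idea only, here is a sliding time-window counter in plain Python. It does not use the API of Esper, S4, Storm, Puma, or ClouStream; the class name and the 30-second window are hypothetical, and events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key), oldest first
        self.counts = {}       # key -> number of events still in the window

    def add(self, timestamp: float, key: str) -> None:
        """Record one event, then drop events that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # An event is expired once it is at least `window` seconds old.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def count(self, key: str) -> int:
        return self.counts.get(key, 0)


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=30)
    counter.add(0, "page_view")
    counter.add(10, "page_view")
    counter.add(35, "click")  # the event at t=0 is now outside the window
    print(counter.count("page_view"), counter.count("click"))
```

Keeping only the events inside the window bounds memory by the arrival rate times the window length, which is why this style of aggregation suits near-real-time counters better than re-scanning stored data.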

Page 16

Hadoop File System

BigData de facto standard

x86

NameNode SPOF (Single Point Of Failure)

Page 17

MapReduce

Page 18

Hadoop MapReduce

MapReduce

Hadoop FileSystem

DB, FTP Server

Schedulers: FIFO, Fair, Capacity

MapReduce (streaming)
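The MapReduce model named on these slides can be illustrated with the usual word-count example. The sketch below runs in a single Python process and does not use any Hadoop API; the function names are hypothetical, and the `shuffle` step stands in for the grouping by key that the framework performs between the map and reduce phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
    # {'big': 2, 'data': 1, 'hadoop': 1}
```

Because each map call looks at one line and each reduce call looks at one key, both phases can be spread across many machines, which is the property a cluster framework exploits.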

Page 19

Script Language

Hive

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE invites;
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Pig

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';

Page 20

Next Generation Hadoop(0.23)

YARN(Next MapReduce Framework)

HDFS Federation

Page 21

NoSQL

Scale-out

Key/value, Document, Simple Column

Schema Free

Big Data, x86

Eventually consistent / BASE (not ACID)

Simple API

- Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
- Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
- Netflix: Amazon SimpleDB, Cassandra
- Digg: Cassandra
- SimpleGeo: Cassandra
- StumbleUpon: HBase, OpenTSDB
- Yahoo!: Hadoop, HBase, PNUTS
- Rackspace: Cassandra
- DAUM: MongoDB
- NCSoft: Cassandra

CAP (Brewer's Conjecture)

Page 22

NoSQL: Cloudata/HBase

Distributed Data Storage
- semi-structured data store (not file system)

Google Bigtable clone
- Data Model, Architecture, Features

Open source
- http://www.cloudata.org

Goal
- 500 nodes
- 300 GB/node, petabytes

Create, drop, modify table schema

Single row operation

Multi row operation: like, between

Scanner, Direct Uploader, MapReduce Adapter

Automatic table split & re-assignment

(Hadoop) Failover
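Since the slide describes these stores as Bigtable clones, a toy version of the Bigtable-style data model may help: a sorted map from (row key, column, timestamp) to value. The sketch below is an in-memory illustration only and is not the Cloudata or HBase API; `TinyTable` and its method names are hypothetical. The `between` and `split` methods mirror the slide's "between" multi-row operation and table split in spirit only.

```python
import bisect


class TinyTable:
    """In-memory sketch: sorted row keys, versioned cells per column."""

    def __init__(self):
        self._rows = {}  # row key -> {column -> [(timestamp, value), ...]}
        self._keys = []  # row keys kept in sorted order

    def put(self, row: str, column: str, value, timestamp: int) -> None:
        """Single-row write; each write adds a new version of the cell."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, []).append((timestamp, value))

    def get(self, row: str, column: str):
        """Single-row read; returns the value with the newest timestamp."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def between(self, start: str, end: str):
        """Multi-row range scan over sorted row keys (both ends inclusive)."""
        low = bisect.bisect_left(self._keys, start)
        high = bisect.bisect_right(self._keys, end)
        return self._keys[low:high]

    def split(self):
        """Split the key range in half, as a stand-in for a table split."""
        middle = len(self._keys) // 2
        return self._keys[:middle], self._keys[middle:]
```

Keeping row keys sorted is what makes range scans and splits cheap: a split is just a cut point in the key order, and each half can be reassigned to a different server.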


Page 23
Page 24

seenal.com

Page 25
Page 26
Page 27

10.29

Page 28
Page 29


Page 30


Facebook: [email protected]

www.jaso.co.kr