Cloud Computing
BigData
2011.12
www.gruter.com
SDS, NHN
www.jaso.co.kr
www.cloudata.org
www.cloumon.org
www.twitter.com/babokim
www.facebook.com/babokim
BigData Definition(1)
Big Data (BD)
What is BigData?
DB (McKinsey, 2011)
SW
DB (IDC, 2011)
Big Data
SNS, M2M
Economist (2010.05)
Gartner (2011.03)
Information silo
McKinsey (2011.05)
Big Data (KT)
BigData Definition(2)
Very large, distributed aggregations of loosely structured data
- Petabytes/exabytes of data
- Millions/billions of people
- Billions/trillions of records
- Loosely-structured and often distributed data
- Flat schemas with few complex interrelationships
- Often involving time-stamped events
- Often made up of incomplete data
- Often including connections between data elements that must be probabilistically inferred
Applications that involve Big-data can be:
- Transactional (e.g., Facebook, PhotoBox), or
- Analytic (e.g., ClickFox, Merced Applications)
http://wikibon.org/wiki/v/Enterprise_Big-data
Big-data Analytics Complements Data Warehouse
Traditional Data Warehouse
- Complete record from transactional system
- All data centralized
- Analytics designed against stable environment
- Many reports run on a production basis
Big-data Analytic Environment
- Data from many sources inside and outside of the organization (including traditional DW)
- Data often physically distributed
- Need to iterate on the solution to test/improve models
- Large-memory analytics also part of iteration
- Every iteration usually requires a complete reload of information
http://wikibon.org/wiki/v/Enterprise_Big-data
Facebook Social plug-in
Feedback
process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.
Analytic
Transactional
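A quick sanity check of the throughput figure quoted above, assuming the daily load is spread evenly over 24 hours (real traffic will peak higher). The variable names are illustrative only.

```python
# Convert the quoted daily event volume into an average per-second rate.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 20_000_000_000  # "over 20 billion events per day"
events_per_second = events_per_day / SECONDS_PER_DAY

# The average lands a little above the rounded 200,000/sec on the slide.
print(f"{events_per_second:,.0f} events/sec on average")
```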
BigData
Collecting
Reporting/Searching
Analysis
Repository
Clustering
Classification
Sentiment Analysis
Indexing
SNS
Store
(DBMS, NoSQL)
Index
Robot
RSS Reader
OpenAPI
User-Defined Query Script
ETL
Data Aggregator
Workers schemify tweets and append to Hadoop
Workers update statistics on URLs by incrementing counters in Cassandra
Distribute tweets randomly on multiple queues
Workers choose queue to enqueue to using hash/mod of URL
All updates for same URL guaranteed to go to same worker
Workers share the load of schemifying tweets
Twitter: BackType
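The hash/mod routing step above can be sketched as follows. This is a minimal in-memory illustration, not BackType's actual code: the queue count, the `queue_for` helper, and the sample URLs are all assumptions, and plain lists and `Counter` objects stand in for the real queues and the Cassandra counters.

```python
import hashlib
from collections import Counter, defaultdict

NUM_QUEUES = 4  # hypothetical number of queues, one worker per queue


def queue_for(url: str, num_queues: int = NUM_QUEUES) -> int:
    """Pick a queue index from a stable hash of the URL (hash/mod)."""
    # hashlib is used instead of built-in hash(), which is randomized
    # per process and would break routing across machines.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


# Hypothetical stream of URLs extracted from tweets.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
    "http://example.com/c",
    "http://example.com/a",
]

# Enqueue each URL on the queue chosen by hash/mod.
queues = defaultdict(list)
for url in urls:
    queues[queue_for(url)].append(url)

# Each worker increments counters only for the URLs on its own queue,
# so no two workers ever update the same URL.
counters = {q: Counter(items) for q, items in queues.items()}
```

Because the queue index depends only on the URL, every update for a given URL reaches the same worker, which is what lets the counter increments proceed without cross-worker coordination.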
BigData
Architectural Requirements
Scalability
- Scale-out
- Elasticity
Reliability
Flexibility
- Easy to add analysis rules
- Support for various data formats
Latency
- Real time, Near real time, Batch
High Throughput
- Global web-scale traffic
- ~ /sec
Hadoop
Component
IBM, HP, Oracle
BI/DW
BigData
Flume, Scribe, Chukwa
Hadoop FileSystem, MogileFS
NoSQL (Cloudata, HBase, Cassandra)
Katta, ElasticSearch
count, sum aggregation: S4, Storm
Hadoop MapReduce (Hive, Pig)
Giraph, GoldenOrb
Cluster, Classification: Mahout, R
ZooKeeper, HUE, Cloumon
Serialization: Thrift, Avro, ProtoBuf
Hadoop Ecosystem
http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
Software Stack
Data Store
File System(HadoopFS)
NoSQL(Cloudata, HBase, Cassandra)
Batch Analysis
Data Analysis Platform(hadoop)
Management
Monitoring (cloumon)
Cluster Management (ZooKeeper)
Interface
Web, Phone, Pad
(Near) Real-time Analysis
Aggregator
Job Workflow Engine(oozie, cascade)
Data Visualization
Collector(flume, scribe, chukwa)
Script Language(Hive, Pig)
CEP Engine(Esper)
Real-time Analysis Platform
Analysis Job
Rule Management
Search(ElasticSearch)
Analysis Job
Mining Lib(Mahout)
Statistics Lib(R)
Chukwa (Yahoo)
Hadoop FileSystem
HDFS, MapReduce
Scribe (Facebook)
(thrift), Hadoop JNI
Flume (Cloudera)
Hadoop, HBase, Search Engine
Centralized Storage (HDFS)
Agent
(local)
Application Server
log
Application Server, Log4j
Temp Log
Collector #1
Collector #2
- Esper Event
- Gruter ClouStream, Yahoo S4, Twitter Storm, Facebook Puma
ClouStream
Puma
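As a rough illustration of the count and sum aggregation that the deck lists for stream engines such as S4 and Storm, here is a minimal single-process sketch using fixed (tumbling) time windows. It does not use any of those engines' APIs; the function name, the event tuple layout, and the 60-second window are assumptions chosen for the example.

```python
from collections import defaultdict


def tumbling_window_aggregate(events, window_seconds=60):
    """Aggregate (timestamp_seconds, key, value) events per window and key.

    Returns a dict mapping (window_start, key) to {"count": n, "sum": s}.
    """
    windows = defaultdict(lambda: {"count": 0, "sum": 0})
    for ts, key, value in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = int(ts // window_seconds) * window_seconds
        bucket = windows[(window_start, key)]
        bucket["count"] += 1
        bucket["sum"] += value
    return dict(windows)


# Hypothetical events: (seconds since start, page key, bytes served).
sample_events = [(0, "a", 1), (30, "a", 2), (61, "a", 5), (10, "b", 7)]
result = tumbling_window_aggregate(sample_events)
```

A real engine would distribute this state across nodes and emit each window as it closes; the sketch only shows the per-event update that makes near-real-time counters cheap.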
Hadoop File System
BigData de facto standard
x86
NameNode: SPOF (Single Point Of Failure)
MapReduce
Hadoop MapReduce
Hadoop FileSystem
DB, FTP Server
FIFO, Fair, Capacity
MapReduce (streaming)
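For readers new to the model, the MapReduce flow can be sketched in plain Python as a word count. This is an in-memory illustration of the map, shuffle, and reduce phases only; it is not the Hadoop API, and all function names are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


counts = word_count(["big data", "Big cluster"])
```

In Hadoop the map and reduce calls run on many nodes and the shuffle moves data between them over the network; the structure of the job is the same.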
Script Language

Hive
-- Table partitioned by date string ds
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- Loading into a partitioned table needs an explicit partition
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
-- The pokes and events tables are assumed to exist already
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Pig
-- Canonicalize is a user-defined function
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url) as url, time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
-- Keep users whose visited pages have a high average pagerank
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
Next Generation Hadoop(0.23)
YARN(Next MapReduce Framework)
HDFS Federation
NoSQL
Scale-out
Key/value, Document, Simple Column
Schema Free
Big Data, x86
Eventually consistent / BASE (not ACID)
Simple API
Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
Netflix: Amazon SimpleDB, Cassandra
Digg: Cassandra
SimpleGeo: Cassandra
StumbleUpon: HBase, OpenTSDB
Yahoo!: Hadoop, HBase, PNUTS
Rackspace: Cassandra
DAUM: MongoDB
NCSoft: Cassandra
CAP (Brewer's Conjecture)
NoSQL: Cloudata/HBase
Distributed Data Storage
semi-structured data store (not file system)
Google Bigtable clone
Data Model, Architecture, Features
Open source
http://www.cloudata.org
Goal
500 nodes
300 GB/node, petabytes
Create, drop, modify table schema
Single row operation
Multi row operation: like, between
Scanner, Direct Uploader, MapReduce Adapter
Automatic table split & re-assignment
(Hadoop)
Failover
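The single-row and multi-row operations listed above can be sketched on a store that keeps row keys sorted, as Bigtable-style systems do. This is not the Cloudata or HBase API: the `SortedRowStore` class and its method names are hypothetical, and treating "like" as a row-key prefix match is an assumption.

```python
import bisect


class SortedRowStore:
    """Toy single-node table with row keys kept in sorted order."""

    def __init__(self):
        self._keys = []  # sorted list of row keys
        self._rows = {}  # row key -> value

    def put(self, row_key, value):
        """Single row operation: insert or overwrite one row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        """Single row operation: read one row, or None if absent."""
        return self._rows.get(row_key)

    def between(self, start_key, end_key):
        """Multi row operation: rows with start_key <= key <= end_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_right(self._keys, end_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def like(self, prefix):
        """Multi row operation: rows whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            matches.append((key, self._rows[key]))
        return matches


store = SortedRowStore()
store.put("user2", "bob")
store.put("user1", "alice")
store.put("video1", "intro.mp4")
```

Keeping keys sorted is what makes range and prefix scans cheap, and it is also why a table can be split into contiguous key ranges and reassigned across servers.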
seenal.com
10.29