Introduction to Big Data
These training materials were produced within the İstanbul Big Data Eğitim ve Araştırma Merkezi (Istanbul Big Data Training and Research Center) Project, no. TR10/16/YNY/0036, carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir Üniversitesi; it does not reflect the views of İSTKA or the Ministry of Development.
Knowledge Doubling Curve
Buckminster Fuller proposed the concept of the "Knowledge Doubling Curve":
• Until 1900, human knowledge doubled approximately every century
• By the end of World War II, knowledge was doubling every 25 years
• Currently, human knowledge is expected to be doubling every 13 months on average
• According to an IBM technical report*, it will be doubling every 12 hours in the future
*http://www-935.ibm.com/services/no/cio/leverage/levinfo_wp_gts_thetoxic.pdf
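The doubling periods above imply very different annual growth rates; a quick sketch of the arithmetic (the doubling periods come from the slide, the formula is plain compound growth):

```python
# If knowledge doubles every T years, it multiplies by 2**(1/T) each year.

def annual_growth_factor(doubling_period_years: float) -> float:
    return 2.0 ** (1.0 / doubling_period_years)

print(round(annual_growth_factor(100.0), 4))        # doubling per century: ~0.7%/year
print(round(annual_growth_factor(25.0), 4))         # doubling per 25 years: ~2.8%/year
print(round(annual_growth_factor(13.0 / 12.0), 4))  # doubling per 13 months: ~90%/year
```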
data → information → knowledge → wisdom
Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
Big Data, Small World
• 19% of total time spent on the internet is spent on social media
• 2.5 billion social media users in total
• Facebook alone has more than 1.5 billion users
• 46% of the 7.4 billion people in the world are active internet users
• 3.4 billion active internet users
• 18.5% growth rate of internet users, 2013–2015
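The user figures above are mutually consistent; a one-line check (numbers taken from the slide):

```python
world_population = 7.4e9        # people in the world (from the slide)
active_internet_users = 3.4e9   # active internet users (from the slide)

share = active_internet_users / world_population
print(f"{share:.0%}")  # 46%
```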
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old model: a few companies generate data; all others consume it
New model: all of us generate data, and all of us consume it
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN's Large Hadron Collider (LHC) generates 15 PB a year
640K ought to be enough for anybody.
(Photo: Maximilien Brice, © CERN)
There's certainly a lot of it!
Data produced each year (logarithmic scale; 1 TB = 1000 GB, 1 PB = 1000 TB):
• (2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
• (2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
• (2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
• (2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
• (2015) 8.0 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
For comparison:
• Human brain's capacity, ~14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
• 100 years of HD video + audio ("life in video"), ~60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (with sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
Data, data everywhere…
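The units on this slide all scale by factors of 1000; a small sketch of the decimal prefixes used here:

```python
# Decimal (SI) data-size units, as used on the slide: 1 TB = 1000 GB, etc.
UNITS = {"GB": 1e9, "TB": 1e12, "PB": 1e15, "EB": 1e18, "ZB": 1e21}

def convert(value: float, src: str, dst: str) -> float:
    """Convert between data-size units by their byte counts."""
    return value * UNITS[src] / UNITS[dst]

print(convert(1, "PB", "TB"))    # 1000.0
print(convert(8.0, "ZB", "EB"))  # 8000.0  (2015's data, in exabytes)
print(convert(14, "PB", "EB"))   # the human brain is ~1.4% of an exabyte
```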
Variety (Complexity)
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
– Social networks, Semantic Web (RDF), …
• Streaming data
– You can only scan the data once
• A single application can be generating/collecting many types of data
• Big public data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.
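Streaming data, as noted above, can only be scanned once. A classic one-pass technique is reservoir sampling, which keeps a uniform random sample without ever storing the stream; a minimal sketch (not from the slides):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length, scanning it exactly once (Vitter's Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

random.seed(0)
print(reservoir_sample(range(1_000_000), 5))  # 5 items, uniform over the stream
```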
A Single View to the Customer
[Figure: the customer at the center, linked to Social Media, Gaming, Entertainment, Banking/Finance, Our Known History, and Purchase data]
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions mean missed opportunities
• Examples
– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
Real-time/Fast Data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Some Make it 4V’s
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
THE EVOLUTION OF BUSINESS INTELLIGENCE
• 1990's: BI Reporting, OLAP & Data Warehouse (scale)
– Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools
• 2000's: Interactive Business Intelligence & In-memory RDBMS (speed)
– QlikView, Tableau, HANA
• 2010's: Big Data: Batch Processing & Distributed Data Store (scale)
– Hadoop/Spark; HBase/Cassandra
• 2010's: Big Data: Real Time & Single View (speed)
– Graph databases
Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
• Traditional DW architectures (e.g., Exadata, Teradata) are not well suited for big data apps
• Shared-nothing, massively parallel processing, scale-out architectures are well suited for big data apps
The companies are expanding as fast as the data!
Parallelization problems?
• Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
(Image source: Ricardo Guimarães Herrmann)
Managing Multiple Workers
• Difficult because
– We don't know the order in which workers run
– We don't know when workers interrupt each other
– We don't know the order in which workers access shared data
• Thus, we need:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
• Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
• Moral of the story: be careful!
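The primitives listed above all exist in Python's standard library; a minimal sketch of a lock preventing a race condition and a barrier synchronizing workers (the counts are arbitrary, illustrative values):

```python
import threading

counter = 0
lock = threading.Lock()          # lock/unlock: protects the shared counter
barrier = threading.Barrier(4)   # all 4 workers must arrive before any proceeds
results = []

def worker(n):
    global counter
    for _ in range(10_000):
        with lock:               # without this, increments would be lost
            counter += 1
    barrier.wait()               # synchronization point
    with lock:
        results.append(n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- deterministic only because of the lock
```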
Concurrency Challenge!
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage everything
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?
What is the common theme of all of these problems?
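Several of these questions (assigning work units, more units than workers, aggregating partial results, knowing when everyone has finished) are handled by an executor pool; a minimal sketch using Python's standard library:

```python
from concurrent.futures import ThreadPoolExecutor

work_units = list(range(100))      # more work units (100) than workers (4)

def process(unit):                 # each worker computes a partial result
    return unit * unit

# The executor assigns units to workers, and leaving the `with` block
# guarantees all workers have finished.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = pool.map(process, work_units)
    total = sum(partial_results)   # aggregate the partial results

print(total)  # 328350 (sum of squares 0..99)
```

Worker death is the one question this does not answer; handling failed machines is exactly what cluster frameworks such as Hadoop add on top.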
What’s the point?
• It’s all about the right level of abstraction – The von Neumann architecture has served us well, but
is no longer appropriate for the multi-core/cluster environment
• Hide system-level details from the developers – No more race conditions, lock contention, etc.
• Separating the what from how – Developer specifies the computation that needs to be
performed – Execution framework (“runtime”) handles actual
execution
The datacenter is the computer!
Current Tools
• Programming models
– Shared memory (pthreads)
– Message passing (MPI)
• Design Patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figures: message passing among processes P1–P5; shared memory, with P1–P5 accessing a common memory; a master coordinating slaves; producer–consumer flows; a shared work queue]
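The message-passing and producer–consumer patterns above can both be sketched with a queue: workers communicate only by sending messages, never by touching shared variables. A minimal single-machine sketch (the message names are illustrative):

```python
import threading
import queue

inbox = queue.Queue()        # messages travel through the queue, not shared state
received = []

def producer():
    for msg in ["m1", "m2", "m3"]:
        inbox.put(msg)       # send
    inbox.put(None)          # sentinel: no more messages

def consumer():
    while True:
        msg = inbox.get()    # receive (blocks until a message arrives)
        if msg is None:
            break
        received.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # ['m1', 'm2', 'm3']
```

Real message-passing systems (MPI, actor frameworks) work across machines, but the discipline is the same: no shared memory, only explicit sends and receives.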
“Big Ideas”
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Clusters have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From the mythical man-month to the tradable machine-hour
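The "process data sequentially" idea above is easy to demonstrate: stream a file line by line in one pass, in constant memory, with no random seeks. A minimal sketch (file path and sizes are illustrative):

```python
import os
import tempfile

# Write 100,000 lines, then aggregate them in one sequential pass.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    for i in range(100_000):
        f.write(f"{i}\n")

total = 0
with open(path) as f:
    for line in f:          # streamed sequentially, never loaded whole
        total += int(line)

print(total)  # 4999950000
```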
Divide and Conquer
[Figure: the "Work" is partitioned into w1, w2, w3; three workers produce partial results r1, r2, r3, which are combined into the "Result"]
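The partition/combine pattern above is exactly what MapReduce-style frameworks automate; a minimal single-machine sketch (function names are illustrative, and in a real system the workers would run in parallel on different machines):

```python
def partition(data, n):
    """Split the "Work" into n chunks w1..wn."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def work(chunk):
    """A "worker": produce a partial result r_i from its chunk."""
    return sum(chunk)

def combine(partials):
    """Merge the partial results r1..rn into the "Result"."""
    return sum(partials)

data = list(range(1, 101))
parts = partition(data, 3)            # w1, w2, w3
partials = [work(w) for w in parts]   # r1, r2, r3 (sequential here)
print(combine(partials))              # 5050
```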
Data Science
Data Science - the big picture
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data Science concerns
A few examples…
… Data Science is at the heart of computer science
Stanford's Autonomous Vehicles project (Thrun et al.): learning to powerslide
A few examples…
… Data Science is at the heart of computer science
Learning ground from obstacles
Visualization
Motivation
Recommender Systems
predicting movie ratings
Netflix Prize
[Figure: predicted ratings for Napoleon Dynamite, Batman Begins, Finding Nemo, and Lord of the Rings; some predictions come out as 1.22 and .75, others as ??]
Some films are difficult to predict…
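Rating prediction of the kind shown above is often done with collaborative filtering; a tiny sketch that predicts a user's rating as the average given by other users (movie names and numbers here are made-up toy data, not the Netflix dataset):

```python
# user -> {movie: stars}, toy ratings for illustration only
ratings = {
    "ann": {"Finding Nemo": 5, "Lord of the Rings": 4},
    "bob": {"Finding Nemo": 4, "Lord of the Rings": 5},
    "cat": {"Finding Nemo": 5},
}

def predict(user, movie):
    """Mean rating of `movie` over all other users who rated it."""
    others = [r[movie] for u, r in ratings.items()
              if u != user and movie in r]
    return sum(others) / len(others) if others else None

print(predict("cat", "Lord of the Rings"))  # 4.5
```

Averages like this fail for polarizing films that split audiences, which is exactly why some films are difficult to predict.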