
Introduction to Big Data · 2018-01-26 · Big Data: Batch Processing & Distributed Data Store Hadoop/Spark; HBase/Cassandra BI Reporting OLAP & Dataware house Business Objects, SAS,


Page 1

Introduction to Big Data

These training presentations were prepared as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.

Page 2

Knowledge Doubling Curve

Buckminster Fuller proposed the concept of the “Knowledge Doubling Curve”:

• Until 1900, human knowledge doubled approximately every century

• By the end of World War II, knowledge was doubling every 25 years

• Currently, human knowledge is estimated to double every 13 months on average

• According to an IBM technical report*, it will eventually double every 12 hours

*http://www-935.ibm.com/services/no/cio/leverage/levinfo_wp_gts_thetoxic.pdf

Page 3

data → information → knowledge → wisdom

Page 4

Big Data Everywhere!

• Lots of data is being collected and warehoused

– Web data, e-commerce

– Purchases at department/grocery stores

– Bank/credit card transactions

– Social networks

Page 5

Big Data, Small World

• 19% of total time spent on the internet is spent on social media

• Total social media users: 2.5 billion

• Facebook alone has more than 1.5 billion users

• 46% of the world's 7.4 billion people are active internet users

• 3.4 billion active internet users

• 18.5% growth in internet users, 2013-2015

Page 6

The Model Has Changed…

• The model of generating/consuming data has changed

Old model: a few companies generate data; all others consume it

New model: all of us generate data, and all of us consume it

Page 7

How much data?

• Google processes 20 PB a day (2008)

• Wayback Machine has 3 PB + 100 TB/month (3/2009)

• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

• eBay has 6.5 PB of user data + 50 TB/day (5/2009)

• CERN’s Large Hadron Collider (LHC) generates 15 PB a year

640K ought to be enough for anybody.
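To put the daily figures above in perspective, the following minimal sketch converts them into average sustained rates. It assumes decimal units (1 PB = 1000 TB, as used elsewhere in this deck); the helper name `per_second_gb` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12        # bytes, decimal units (assumption)
PB = 1000 * TB

def per_second_gb(bytes_per_day: float) -> float:
    """Convert a daily volume in bytes to an average rate in GB/s."""
    return bytes_per_day / SECONDS_PER_DAY / 10**9

# Daily volumes quoted on the slide.
daily_volumes = {
    "Google processing (2008)": 20 * PB,
    "Facebook growth (4/2009)": 15 * TB,
    "eBay growth (5/2009)": 50 * TB,
}

for name, volume in daily_volumes.items():
    print(f"{name}: {per_second_gb(volume):.2f} GB/s on average")
```

Even the oldest of these figures implies hundreds of gigabytes per second, well beyond what a single machine can read from disk.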

Page 8

Maximilien Brice, © CERN

Page 9

Data, data everywhere… There's certainly a lot of it!

[Figure: data volumes on a logarithmic scale, with axis marks at 1 Petabyte, 1 Exabyte and 1 Zettabyte. 1 TB = 1000 GB; 1 Petabyte = 1000 TB. Marked values: 14 PB, 60 PB, 120 PB, 5 EB, 161 EB, 800 EB, 1.8 ZB, 8.0 ZB.]

Data produced each year

• 2002: 5 EB

• 2006: 161 EB

• 2009: 800 EB

• 2011: 1.8 ZB

• 2015: 8.0 ZB

For comparison

• Human brain's capacity: 14 PB

• 100 years of HD video + audio: 60 PB

References

(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store

(life in video) 60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!

(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm

(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf

(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf

(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm

(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
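As a rough worked example, the yearly figures quoted on this slide (5 EB in 2002, 8 ZB in 2015) imply an average growth rate that can be computed directly. This is only a sketch: it assumes 1 ZB = 1000 EB and constant compound growth between the two endpoints, which the intermediate data points do not follow exactly.

```python
import math

# Endpoints taken from the slide, expressed in exabytes (1 ZB = 1000 EB).
start_year, start_eb = 2002, 5.0
end_year, end_eb = 2015, 8000.0

years = end_year - start_year
growth_factor = (end_eb / start_eb) ** (1 / years)        # yearly multiplier
doubling_years = math.log(2) / math.log(growth_factor)    # time to double

print(f"average growth: {100 * (growth_factor - 1):.0f}% per year")
print(f"doubling time: {doubling_years:.2f} years")
```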

Page 10

Variety (Complexity)

• Relational data (tables/transactions/legacy data)

• Text data (Web)

• Semi-structured data (XML)

• Graph data

– Social networks, Semantic Web (RDF), …

• Streaming data

– You can only scan the data once

• A single application can generate/collect many types of data

• Big public data (online, weather, finance, etc.)

To extract knowledge, all these types of data need to be linked together
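The "scan the data once" constraint on streaming data shapes the algorithms that can be used. As one illustration (not something the slides prescribe), the sketch below computes count, mean and variance in a single pass using Welford's update, so no element has to be stored or revisited. The function name `running_stats` and the sample readings are hypothetical.

```python
from typing import Iterable, Tuple

def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after a single scan."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance

# A generator can be consumed only once, much like a real stream.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
print(running_stats(readings))
```

Memory use stays constant regardless of how long the stream runs, which is the property a one-pass setting needs.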

Page 11

A Single View to the Customer

[Figure: the Customer at the center, linked to data sources: Social Media, Gaming, Entertainment, Banking/Finance, Our Known History, Purchases]

Page 12

Velocity (Speed)

• Data is being generated fast and needs to be processed fast

• Online data analytics

• Late decisions mean missed opportunities

• Examples

– E-promotions: based on your current location, your purchase history and what you like, send promotions right now for the store next to you

– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires immediate reaction

Page 13

Real-time/Fast Data

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data

• They are hindered by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Page 14

Some Make it 4V’s

Page 15

Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)


Page 16

THE EVOLUTION OF BUSINESS INTELLIGENCE

[Figure: timeline spanning the 1990’s, 2000’s and 2010’s, annotated with Speed and Scale, showing four stages:]

• BI Reporting, OLAP & Data Warehouse: Business Objects, SAS, Informatica, Cognos, other SQL reporting tools

• Interactive Business Intelligence & In-memory RDBMS: QlikView, Tableau, HANA

• Big Data: Batch Processing & Distributed Data Store: Hadoop/Spark; HBase/Cassandra

• Big Data: Real Time & Single View: Graph Databases

Page 17

Big Data Analytics

• Big data is more real-time in nature than traditional DW applications

• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps

• Shared-nothing, massively parallel processing (MPP), scale-out architectures are well-suited for big data apps

Page 18

The companies are expanding as fast as the data!

Page 19

Parallelization problems?

• Parallelization problems arise from:

– Communication between workers (e.g., to exchange state)

– Access to shared resources (e.g., data)

• Thus, we need a synchronization mechanism

Page 20

Source: Ricardo Guimarães Herrmann

Page 21

Managing Multiple Workers

• Difficult because

– We don’t know the order in which workers run

– We don’t know when workers interrupt each other

– We don’t know the order in which workers access shared data

• Thus, we need:

– Semaphores (lock, unlock)

– Condition variables (wait, notify, broadcast)

– Barriers

• Still, lots of problems:

– Deadlock, livelock, race conditions...

– Dining philosophers, sleeping barbers, cigarette smokers...

• Moral of the story: be careful!
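The lock/unlock primitive listed above can be sketched with Python's standard `threading` module. This is a minimal illustration, not the deck's own code: the class `SafeCounter` and the helper `run_workers` are hypothetical names, and the worker and iteration counts are arbitrary.

```python
import threading

class SafeCounter:
    """A shared counter whose updates are protected by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:       # acquire on entry, release on exit
            self.value += 1

def run_workers(num_workers: int = 4, times: int = 10_000) -> int:
    counter = SafeCounter()

    def work() -> None:
        for _ in range(times):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()               # wait until every worker has finished
    return counter.value

print(run_workers())
```

The order in which the workers run is still unknown, but because every read-modify-write happens under the lock, the final count does not depend on that order.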

Page 22

Concurrency Challenge!

• Concurrency is difficult to reason about

• Concurrency is even more difficult to reason about

– At the scale of datacenters (even across datacenters)

– In the presence of failures

– In terms of multiple interacting services

• Not to mention debugging…

• The reality:

– Lots of one-off solutions, custom code

– Write your own dedicated library, then program with it

– Burden on the programmer to explicitly manage everything

Page 23

Parallelization Challenges

• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?

What is the common theme of all of these problems?

Page 24

What’s the point?

• It’s all about the right level of abstraction

– The von Neumann architecture has served us well, but is no longer appropriate for the multi-core/cluster environment

• Hide system-level details from the developers

– No more race conditions, lock contention, etc.

• Separating the what from the how

– Developer specifies the computation that needs to be performed

– Execution framework (“runtime”) handles actual execution

The datacenter is the computer!
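The separation of "what" from "how" can be illustrated on a single machine. In the sketch below, Python's `concurrent.futures.ThreadPoolExecutor` stands in for the execution framework; it is only an analogy for a cluster runtime, and the function names `word_length` and `run` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def word_length(word: str) -> int:
    """The 'what': a pure function applied to each input independently."""
    return len(word)

def run(words, max_workers: int = 4):
    # The 'how' (thread creation, scheduling, ordering of results) is left
    # to the executor; no locks or explicit synchronization appear here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(word_length, words))

print(run(["the", "datacenter", "is", "the", "computer"]))
# prints [3, 10, 2, 3, 8]
```

The developer's code contains no mention of workers dying, work assignment or completion tracking; those concerns move into the runtime.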

Page 25

Current Tools

• Programming models

– Shared memory (pthreads)

– Message passing (MPI)

• Design Patterns

– Master-slaves

– Producer-consumer flows

– Shared work queues

[Figures: Message Passing between processes P1 to P5; Shared Memory, with P1 to P5 attached to a common memory; a master with its slaves; producers feeding consumers; a shared work queue]
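The producer-consumer pattern named on this slide can be sketched with a thread-safe queue from Python's standard library. This is an illustrative sketch only: the `SENTINEL` end-of-stream convention, the function names and the squaring step are assumptions, not part of the deck.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream (illustrative convention)

def producer(q: queue.Queue, items) -> None:
    for item in items:
        q.put(item)            # blocks while the buffer is full
    q.put(SENTINEL)

def consumer(q: queue.Queue, results: list) -> None:
    while True:
        item = q.get()         # blocks while the buffer is empty
        if item is SENTINEL:
            break
        results.append(item * item)

def run_pipeline(items) -> list:
    q = queue.Queue(maxsize=2)  # small buffer forces the two sides to hand off
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline([1, 2, 3, 4]))
# prints [1, 4, 9, 16]
```

The queue plays the role of the message channel: the two threads never touch each other's state directly, so no explicit lock appears in the user code.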

Page 26

“Big Ideas”

• Scale “out”, not “up”

– Limits of SMP and large shared-memory machines

• Move processing to the data

– Clusters have limited bandwidth

• Process data sequentially, avoid random access

– Seeks are expensive, disk throughput is reasonable

• Seamless scalability

– From the mythical man-month to the tradable machine-hour
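A back-of-the-envelope calculation shows why sequential processing can beat random access. Every hardware figure below (10 ms per seek, 100 MB/s throughput, a 1 TB dataset of 100-byte records with 1% of them updated) is an assumption chosen for illustration, not a number from the slides.

```python
# All hardware and workload figures below are illustrative assumptions.
SEEK_TIME_S = 0.010          # 10 ms per random seek
THROUGHPUT_BPS = 100e6       # 100 MB/s sequential transfer

DATASET_BYTES = 1e12         # 1 TB
RECORD_BYTES = 100
UPDATE_FRACTION = 0.01       # touch 1% of the records

records = DATASET_BYTES / RECORD_BYTES
updates = records * UPDATE_FRACTION

# Random access: one seek to read and one seek to write each updated record.
random_seconds = updates * 2 * SEEK_TIME_S

# Sequential: read the whole dataset once and write it back in full.
sequential_seconds = 2 * DATASET_BYTES / THROUGHPUT_BPS

print(f"random access: {random_seconds / 86400:.1f} days")
print(f"sequential rewrite: {sequential_seconds / 3600:.1f} hours")
```

Under these assumptions, rewriting the entire dataset sequentially takes hours, while updating only 1% of it through seeks takes weeks.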

Page 27

Divide and Conquer

[Figure: “Work” is partitioned into units w1, w2, w3; each unit goes to a “worker”, which produces a partial result r1, r2, r3; the partial results are combined into the “Result”]
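The partition, work, combine flow in the figure can be sketched as a small word count. This is a single-machine illustration with threads standing in for the workers; the function names and the sample lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(items, num_parts):
    """Split `items` into `num_parts` interleaved slices (w1, w2, ...)."""
    return [items[i::num_parts] for i in range(num_parts)]

def count_words(lines):
    """One worker: produce a partial result (r_i) for its own slice."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def word_count(lines, num_workers=3):
    parts = partition(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:   # combine step
        total.update(partial)
    return total

lines = ["big data everywhere", "data data everywhere", "big ideas"]
print(word_count(lines)["data"])
# prints 3
```

Because each worker touches only its own slice and the partial counts are merged afterwards, the workers never need to share state while they run.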

Page 28

Data Science

Page 29

Data Science - the big picture

Data Science Venn Diagram

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 30

Data Science concerns

Page 31

A few examples…

… Data Science is at the heart of computer science

Stanford's Autonomous Vehicles project (Thrun et al.)

Learning to Powerslide

Page 32

A few examples…

… Data Science is at the heart of computer science

Learning ground from obstacles

Page 33

Visualization

Motivation

Page 34

Recommender Systems

predicting movie ratings

Page 35

Napoleon Dynamite = Batman Begins =

Netflix Prize

Finding Nemo = Lord of the Rings =

1.22 .75

?? ??

Some films are difficult to predict…
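Rating predictions of this kind are commonly scored by root-mean-square error (RMSE), the measure the Netflix Prize used. The sketch below shows the computation; the ratings are hypothetical values on a 1 to 5 star scale, not taken from the Netflix data, and `rmse` is an illustrative helper.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ratings, for illustration only.
actual = [4, 1, 5, 3]
predicted = [3.5, 2.0, 4.5, 3.0]
print(round(rmse(predicted, actual), 3))
# prints 0.612
```

A film whose ratings are widely spread will tend to contribute larger squared errors, which is one way to read "difficult to predict".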