Introduction to Big Data
These training materials were produced within the İstanbul Big Data Eğitim ve Araştırma Merkezi (Istanbul Big Data Training and Research Center) Project, no. TR10/16/YNY/0036, carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir Üniversitesi; it does not reflect the views of İSTKA or the Ministry of Development.
Knowledge Doubling Curve
Buckminster Fuller proposed the concept of the "Knowledge Doubling Curve":
• Until 1900, human knowledge doubled approximately every century
• By the end of World War II, knowledge was doubling every 25 years
• Currently, human knowledge is expected to be doubling every 13 months on average
• According to an IBM technical report*, it will be doubling every 12 hours in the future
*http://www-935.ibm.com/services/no/cio/leverage/levinfo_wp_gts_thetoxic.pdf
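The doubling periods above imply very different annual growth rates; a quick sketch of the arithmetic (the doubling periods come from the slide, the formula is plain compound growth):

```python
# If knowledge doubles every T years, it multiplies by 2**(1/T) each year.

def annual_growth_factor(doubling_period_years: float) -> float:
    return 2.0 ** (1.0 / doubling_period_years)

print(round(annual_growth_factor(100.0), 4))        # doubling per century: ~0.7%/year
print(round(annual_growth_factor(25.0), 4))         # doubling per 25 years: ~2.8%/year
print(round(annual_growth_factor(13.0 / 12.0), 4))  # doubling per 13 months: ~90%/year
```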
data → information → knowledge → wisdom
Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
Big Data, Small World
• 19% of total time spent on the internet is spent on social media
• 2.5 billion social media users in total
• Facebook alone has more than 1.5 billion users
• 46% of the 7.4 billion people in the world are active internet users
• 3.4 billion active internet users
• 18.5% growth rate of internet users, 2013–2015
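The user figures above are mutually consistent; a one-line check (numbers taken from the slide):

```python
world_population = 7.4e9        # people in the world (from the slide)
active_internet_users = 3.4e9   # active internet users (from the slide)

share = active_internet_users / world_population
print(f"{share:.0%}")  # 46%
```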
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old model: a few companies generate data; all others consume it
New model: all of us generate data, and all of us consume it
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN's Large Hadron Collider (LHC) generates 15 PB a year
640K ought to be enough for anybody.
(Photo: Maximilien Brice, © CERN)
There's certainly a lot of it!
Data produced each year (logarithmic scale; 1 TB = 1000 GB, 1 PB = 1000 TB):
• (2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
• (2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
• (2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
• (2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
• (2015) 8.0 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
For comparison:
• Human brain's capacity, ~14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
• 100 years of HD video + audio ("life in video"), ~60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (with sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
Data, data everywhere…
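The units on this slide all scale by factors of 1000; a small sketch of the decimal prefixes used here:

```python
# Decimal (SI) data-size units, as used on the slide: 1 TB = 1000 GB, etc.
UNITS = {"GB": 1e9, "TB": 1e12, "PB": 1e15, "EB": 1e18, "ZB": 1e21}

def convert(value: float, src: str, dst: str) -> float:
    """Convert between data-size units by their byte counts."""
    return value * UNITS[src] / UNITS[dst]

print(convert(1, "PB", "TB"))    # 1000.0
print(convert(8.0, "ZB", "EB"))  # 8000.0  (2015's data, in exabytes)
print(convert(14, "PB", "EB"))   # the human brain is ~1.4% of an exabyte
```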
Variety (Complexity)
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
– Social networks, Semantic Web (RDF), …
• Streaming data
– You can only scan the data once
• A single application can be generating/collecting many types of data
• Big public data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.
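Streaming data, as noted above, can only be scanned once. A classic one-pass technique is reservoir sampling, which keeps a uniform random sample without ever storing the stream; a minimal sketch (not from the slides):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length, scanning it exactly once (Vitter's Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

random.seed(0)
print(reservoir_sample(range(1_000_000), 5))  # 5 items, uniform over the stream
```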
A Single View to the Customer
[Figure: the customer at the center, linked to Social Media, Gaming, Entertainment, Banking/Finance, Our Known History, and Purchase data]
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions mean missed opportunities
• Examples
– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
Real-time/Fast Data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Some Make it 4V’s
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
THE EVOLUTION OF BUSINESS INTELLIGENCE
• 1990's: BI Reporting, OLAP & Data Warehouse (scale)
– Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools
• 2000's: Interactive Business Intelligence & In-memory RDBMS (speed)
– QlikView, Tableau, HANA
• 2010's: Big Data: Batch Processing & Distributed Data Store (scale)
– Hadoop/Spark; HBase/Cassandra
• 2010's: Big Data: Real Time & Single View (speed)
– Graph databases
Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
• Traditional DW architectures (e.g., Exadata, Teradata) are not well suited for big data apps
• Shared-nothing, massively parallel processing, scale-out architectures are well suited for big data apps
The companies are expanding as fast as the data!
Parallelization problems?
• Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
(Image source: Ricardo Guimarães Herrmann)
Managing Multiple Workers
• Difficult because
– We don't know the order in which workers run
– We don't know when workers interrupt each other
– We don't know the order in which workers access shared data
• Thus, we need:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
• Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
• Moral of the story: be careful!
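The primitives listed above all exist in Python's standard library; a minimal sketch of a lock preventing a race condition and a barrier synchronizing workers (the counts are arbitrary, illustrative values):

```python
import threading

counter = 0
lock = threading.Lock()          # lock/unlock: protects the shared counter
barrier = threading.Barrier(4)   # all 4 workers must arrive before any proceeds
results = []

def worker(n):
    global counter
    for _ in range(10_000):
        with lock:               # without this, increments would be lost
            counter += 1
    barrier.wait()               # synchronization point
    with lock:
        results.append(n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- deterministic only because of the lock
```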
Concurrency Challenge!
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage everything
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?
What is the common theme of all of these problems?
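Several of these questions (assigning work units, more units than workers, aggregating partial results, knowing when everyone has finished) are handled by an executor pool; a minimal sketch using Python's standard library:

```python
from concurrent.futures import ThreadPoolExecutor

work_units = list(range(100))      # more work units (100) than workers (4)

def process(unit):                 # each worker computes a partial result
    return unit * unit

# The executor assigns units to workers, and leaving the `with` block
# guarantees all workers have finished.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = pool.map(process, work_units)
    total = sum(partial_results)   # aggregate the partial results

print(total)  # 328350 (sum of squares 0..99)
```

Worker death is the one question this does not answer; handling failed machines is exactly what cluster frameworks such as Hadoop add on top.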
What’s the point?
• It’s all about the right level of abstraction – The von Neumann architecture has served us well, but
is no longer appropriate for the multi-core/cluster environment
• Hide system-level details from the developers – No more race conditions, lock contention, etc.
• Separating the what from how – Developer specifies the computation that needs to be
performed – Execution framework (“runtime”) handles actual
execution
The datacenter is the computer!
Current Tools
• Programming models
– Shared memory (pthreads)
– Message passing (MPI)
• Design Patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figures: message passing among processes P1–P5; shared memory, with P1–P5 accessing a common memory; a master coordinating slaves; producer–consumer flows; a shared work queue]
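The message-passing and producer–consumer patterns above can both be sketched with a queue: workers communicate only by sending messages, never by touching shared variables. A minimal single-machine sketch (the message names are illustrative):

```python
import threading
import queue

inbox = queue.Queue()        # messages travel through the queue, not shared state
received = []

def producer():
    for msg in ["m1", "m2", "m3"]:
        inbox.put(msg)       # send
    inbox.put(None)          # sentinel: no more messages

def consumer():
    while True:
        msg = inbox.get()    # receive (blocks until a message arrives)
        if msg is None:
            break
        received.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # ['m1', 'm2', 'm3']
```

Real message-passing systems (MPI, actor frameworks) work across machines, but the discipline is the same: no shared memory, only explicit sends and receives.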
“Big Ideas”
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Clusters have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From the mythical man-month to the tradable machine-hour
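The "process data sequentially" idea above is easy to demonstrate: stream a file line by line in one pass, in constant memory, with no random seeks. A minimal sketch (file path and sizes are illustrative):

```python
import os
import tempfile

# Write 100,000 lines, then aggregate them in one sequential pass.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    for i in range(100_000):
        f.write(f"{i}\n")

total = 0
with open(path) as f:
    for line in f:          # streamed sequentially, never loaded whole
        total += int(line)

print(total)  # 4999950000
```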
Divide and Conquer
[Figure: the "Work" is partitioned into w1, w2, w3; three workers produce partial results r1, r2, r3, which are combined into the "Result"]
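The partition/combine pattern above is exactly what MapReduce-style frameworks automate; a minimal single-machine sketch (function names are illustrative, and in a real system the workers would run in parallel on different machines):

```python
def partition(data, n):
    """Split the "Work" into n chunks w1..wn."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def work(chunk):
    """A "worker": produce a partial result r_i from its chunk."""
    return sum(chunk)

def combine(partials):
    """Merge the partial results r1..rn into the "Result"."""
    return sum(partials)

data = list(range(1, 101))
parts = partition(data, 3)            # w1, w2, w3
partials = [work(w) for w in parts]   # r1, r2, r3 (sequential here)
print(combine(partials))              # 5050
```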
Data Science
Data Science - the big picture
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data Science concerns
A few examples…
… Data Science is at the heart of computer science
Stanford's Autonomous Vehicles project (Thrun et al.): learning to powerslide
A few examples…
… Data Science is at the heart of computer science
Learning ground from obstacles
Visualization
Motivation
Recommender Systems
predicting movie ratings
Netflix Prize
[Figure: predicted ratings for Napoleon Dynamite, Batman Begins, Finding Nemo, and Lord of the Rings; some predictions come out as 1.22 and .75, others as ??]
Some films are difficult to predict…
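Rating prediction of the kind shown above is often done with collaborative filtering; a tiny sketch that predicts a user's rating as the average given by other users (movie names and numbers here are made-up toy data, not the Netflix dataset):

```python
# user -> {movie: stars}, toy ratings for illustration only
ratings = {
    "ann": {"Finding Nemo": 5, "Lord of the Rings": 4},
    "bob": {"Finding Nemo": 4, "Lord of the Rings": 5},
    "cat": {"Finding Nemo": 5},
}

def predict(user, movie):
    """Mean rating of `movie` over all other users who rated it."""
    others = [r[movie] for u, r in ratings.items()
              if u != user and movie in r]
    return sum(others) / len(others) if others else None

print(predict("cat", "Lord of the Rings"))  # 4.5
```

Averages like this fail for polarizing films that split audiences, which is exactly why some films are difficult to predict.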