The Paradox of Big Data
2001 Programming Languages
2004 Natural Language Processing
2006 Social Recommendation
2008 Distributed Computing
2009 Web Mining
2010
2011 Social Gaming
2012 Advertising
2013 Dataiku
Time Spent Coding: 100%, 100%, 80%, 50%, 20%, 0%, 10%, 50%, 20%
Favorite Language: C, Exascript, Exascript, Exascript, Python, PowerPoint, Python, Java, None
Largest Dataset: 100GB, 100GB, 10GB, 10TB, 100TB, 100kB, 500GB, 100TB, 10TB
I’m Florian and I like data
www.dataiku.com
Dataiku in short
Software vendor behind Data Science Studio, the “Photoshop for Data Science”
COMMUNITY EDITION
http://www.dataiku.com/dss/trynow/
Goals For Today
• Big Data, with the bias of what I know of it (Analytics …)
• Big Data: History and Feelings
• What are the key technologies to watch?
• Some practical use cases?
• How to get started?
Dataiku
Motivation
1/8/14
First Hard Drive: 3.75 Megabytes, Access Time: 1 second
IN 2008 MAN INVENTED BIG DATA
Volume Variety Velocity
WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?
Capacity Complexity Celerity
OR SIMPLER
Size Serendipity Speed
OR AFTER A DRINK
Big Blur Blazing
Or Combine
C… B.. S….
Or Combine
Complete Bull Sh..
SOOO WHAT IS BIG DATA?
PARADOX #1 SIMPLEXITY
SUBTLE PATTERNS
"MORE BUSINESS" BUTTONS
PARADOX #2 SELF-AWARE
DATA SCIENTIST AT NIGHT
DATA CLEANER BY DAY
DATA PLUMBER ON THE WEEKEND
WAITING FOR COMPUTATION BETWEEN COFFEES
PARADOX #3 WHERE TO STORE DATA?
MY DATA IS WORTH MILLIONS
I SEND IT TO THE MARKETING CLOUD
AND BACK IT UP TO GOOGLE
PARADOX #4 IS IT BIG OR NOT?
WE ALL LIVE IN A BIG DATA LAKE
ALL MY DATA MAY FIT IN HERE
PARADOX #5 (at last) HUMAN OR NOT ?
TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE US ALL
I JUST WANT MORE REPORTS
BIG DATA TECH TRENDS
ELEPHANTS MAKE BABIES
Dataiku - Pig, Hive and Cascading
WELCOME TO TECHNOSLAVIA
Regions on the map: Machine Learning Mystery Land, Scalability Central, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island, NoSQL Nihiland
Tools on the map: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, Scikit-Learn, GraphLab, prediction.io, jubatus, Mahout, WEKA, MLBase, LibSVM, RapidMiner, Pandas, Kibana, InfiniDB, Drill, Spark SQL, Hive, Impala, …, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend, R, Storm
DRIVER 1: BACK TO THE BASICS
RAM - CPU - DISK
Cost per GB: 2000 vs. 2013
RAM: $1000 / GB → $6 / GB
Disk: $10 / GB → $0.06 / GB
Memory cost divided by 150
Disk cost divided by 250
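As a quick check of these ratios, the per-GB prices quoted on the slide can be divided directly. This is only a sketch: the pairing of each price with RAM or disk is an assumption read off the slide layout, and the names `prices` and `cost_ratio` are illustrative.

```python
# Per-GB prices as quoted on the slide; the RAM/disk pairing is assumed.
prices = {
    "ram": {"2000": 1000.0, "2013": 6.0},
    "disk": {"2000": 10.0, "2013": 0.06},
}

def cost_ratio(kind):
    """Return how many times cheaper one GB became between 2000 and 2013."""
    p = prices[kind]
    return p["2000"] / p["2013"]

for kind in prices:
    print(f"{kind}: cost per GB divided by about {cost_ratio(kind):.0f}")
```

With these rounded prices both ratios land around two orders of magnitude, which is the point the slide's rounded factors make.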
MAP REDUCE times
HACK REDUCE times
A PERSISTENT MEMORY PROBLEM
DATA IS BIGGER
IS USEFUL DATA BIGGER ?
WHOLE DATA
REFINED DATA
GOLD
NEEDLE IN A HAYSTACK?
OIL
REFINE BEFORE USE
HOW BIG IS BIG DATA?
Web Site
– $1 billion revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day
10TB RAW DATA
1TB REFINED DATA
1 TERABYTE FITS IN MEMORY
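A back-of-envelope check of these volumes: the sketch below derives the average number of raw bytes per action implied by 10 TB of raw data. It assumes, and this is not stated on the slide, that the 10 TB covers one year of activity at 100 million actions per day. The 1 TB of RAM in the last line is likewise a hypothetical figure.

```python
ACTIONS_PER_DAY = 100_000_000   # from the slide
RAW_BYTES = 10 * 10**12         # 10 TB of raw data, decimal terabytes
RETENTION_DAYS = 365            # assumption: the raw data covers one year

def implied_bytes_per_action(raw_bytes, actions_per_day, days):
    """Average raw bytes stored per action under the retention assumption."""
    return raw_bytes / (actions_per_day * days)

def fits_in_memory(data_bytes, ram_bytes):
    """True when the dataset is no larger than the available RAM."""
    return data_bytes <= ram_bytes

per_action = implied_bytes_per_action(RAW_BYTES, ACTIONS_PER_DAY, RETENTION_DAYS)
print(f"about {per_action:.0f} bytes per action")
# 1 TB of refined data against a hypothetical 1 TB of total RAM
print(fits_in_memory(1 * 10**12, 1 * 10**12))
```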
DRIVER 2 : ECOSYSTEM GROWS
• 1st Circle: OPEN SOURCE – YAHOO – IBM – LINKEDIN – FACEBOOK
• 2nd Circle: STANFORD – BERKELEY – STARTUPS
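STARTUPS
[Figure: amounts per startup, plotted by year from 2009 to 2013. Company names are not recoverable. Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$]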
$1B per year Invested in Big Data
TECH 223m$
301m$
ALL > SPARK
Real-Time Resilient Distributed Memory Framework
• Abstraction with any DAG operation on data:
  - Filter
  - Map
  - Reduce
  - Cache
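To make the four operations concrete, here is a minimal pure-Python sketch of a lazily evaluated pipeline with filter, map, reduce, and cache. It is not the Spark API: the `Dataset` class and its methods are hypothetical, and everything runs on a single machine with none of Spark's distribution or resilience.

```python
from functools import reduce

class Dataset:
    """Hypothetical lazy dataset: operations chain up, nothing runs until reduce()."""

    def __init__(self, source):
        self._source = source      # zero-argument callable returning an iterable
        self._cached = None        # materialized rows once cache() is in effect
        self._use_cache = False

    def _rows(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._source())  # computed once, then reused
            return iter(self._cached)
        return iter(self._source())

    def filter(self, predicate):
        return Dataset(lambda: (r for r in self._rows() if predicate(r)))

    def map(self, fn):
        return Dataset(lambda: (fn(r) for r in self._rows()))

    def cache(self):
        self._use_cache = True
        return self

    def reduce(self, fn):
        return reduce(fn, self._rows())

# Usage: sum the values of "error" log lines, keeping the parsed lines in memory.
logs = Dataset(lambda: ["ok 12", "error 30", "error 5", "ok 7"])
parsed = logs.map(lambda line: line.split()).cache()
errors = parsed.filter(lambda rec: rec[0] == "error").map(lambda rec: int(rec[1]))
print(errors.reduce(lambda a, b: a + b))
```

The cached intermediate result is what allows several downstream computations to reuse the parsed data without re-reading the source, which is the idea behind keeping working sets in memory.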
SPARK AND ITS ECOSYSTEM
SHARK: Real-Time Queries
STREAMING: Real-Time Updates
MLBASE: In-Memory Learning
SooOOo WHAT IS IT IN PRACTICE?
Turn Device Logs Into Next Year's Business
Parking ticket machine data
OpenStreetMap data
Data Science Studio
Cleaning and enrichment of data
Crossing data
Creation of a predictive algorithm
Availability of the predictions
Each street is segmented into small pieces that are enriched with geospatial information.
The parking ticket history is joined with the points of interest from OpenStreetMap.
The availability of parking spaces is predicted per street segment from the joined data.
The algorithm is finally integrated into the iPhone app “Find me a space”.
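The join-then-predict step can be sketched as follows. This is a toy illustration, not the project's actual algorithm: the record layouts, the segment identifiers, the per-segment capacity, and the occupancy heuristic are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: ticket history per street segment and hour of day,
# and points of interest (as would come from OpenStreetMap) per segment.
tickets = [
    {"segment": "s1", "hour": 9}, {"segment": "s1", "hour": 9},
    {"segment": "s1", "hour": 9}, {"segment": "s2", "hour": 9},
]
points_of_interest = {"s1": ["bakery", "school"], "s2": []}
capacity = {"s1": 4, "s2": 4}   # assumed number of spaces per segment

def join_features(tickets, pois):
    """Join ticket counts with POI counts, keyed by (segment, hour)."""
    counts = defaultdict(int)
    for t in tickets:
        counts[(t["segment"], t["hour"])] += 1
    return {
        key: {"tickets": n, "poi_count": len(pois.get(key[0], []))}
        for key, n in counts.items()
    }

def predict_availability(features, capacity):
    """Estimate the share of free spaces as 1 - tickets / capacity, floored at 0."""
    return {
        key: max(0.0, 1.0 - f["tickets"] / capacity[key[0]])
        for key, f in features.items()
    }

features = join_features(tickets, points_of_interest)
print(predict_availability(features, capacity))
```

The `poi_count` field is carried along unused here; it stands in for the kind of enrichment feature a trained model could exploit.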
Optimizing Last Mile with Data Science Studio
Data Science Studio
Historical delivery and retrieval data
Modeling of a score for each delivery
Cleaning and temporal enrichment of data
Data aggregation by geographic location
Incorporation of new deliveries to the existing model
• Search reformulation
• No answer
• Click on a business listing
• Top search
• Navigation or filter click
HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?
>200M searches
20M
>10 occurrences: 1.4M queries
✗ ✓
0.5M prioritized queries
Analysis & corrections
Automation
SOLUTION
Machine
Management, Exploration
pagesjaunes.fr directory
Hadoop, Pig + Hive
Export, indexing
Interpretation engine
Crawl, other reference data
Scikit-learn
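The frequency filter in this funnel (keeping queries seen more than 10 times) amounts to a count-and-threshold step. A minimal sketch, assuming a hypothetical in-memory list of search strings and a simple lowercase normalization; at the volumes quoted, the same logic would more plausibly run as a Pig or Hive job.

```python
from collections import Counter

def frequent_queries(searches, min_occurrences=10):
    """Return {query: count} for queries seen more than min_occurrences times."""
    counts = Counter(s.strip().lower() for s in searches)
    return {q: n for q, n in counts.items() if n > min_occurrences}

# Hypothetical sample log
searches = ["Plombier Paris"] * 12 + ["plombier paris "] * 3 + ["fleuriste lyon"] * 4
print(frequent_queries(searches))
```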
Analyst
Panels
1970: Birth of Computer Analytics
Computer
Expensive Software
Marketing Studies
Multiple Data Sources
Analyst Team
Many Models
CRM
Logs
2015: BUILD YOUR FACTORY
Server Cluster
Light Software
Personalised Experience Model
Acquisition Cost Opportunity Model
Stock Optimisation Model
Optimize Delivery
Churn
Volume Forecast
Recommender
Segmentation
Lifetime Value
Risk Score
Hot Location
Pricing
Ranking
Fraud
Event Paths
A MODEL: an automated way to make a computer take a decision from raw (historical) data
The model can be used to take immediate (real-time) actions through an API
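As an illustration of this definition, the sketch below learns a trivial churn rule from historical records and exposes it as a function that an API endpoint could call. The field names, the threshold rule, and the `train` and `decide` functions are assumptions made for the example, not a Data Science Studio interface.

```python
def train(history):
    """Learn an inactivity threshold (in days) separating churned from retained users.

    Uses the midpoint between the mean inactivity of the two groups.
    """
    churned = [h["days_inactive"] for h in history if h["churned"]]
    retained = [h["days_inactive"] for h in history if not h["churned"]]
    threshold = (sum(churned) / len(churned) + sum(retained) / len(retained)) / 2
    return {"threshold": threshold}

def decide(model, user):
    """Real-time decision for one user: which action to take right now."""
    at_risk = user["days_inactive"] > model["threshold"]
    return "send_retention_offer" if at_risk else "no_action"

# Hypothetical historical data
history = [
    {"days_inactive": 40, "churned": True},
    {"days_inactive": 60, "churned": True},
    {"days_inactive": 2, "churned": False},
    {"days_inactive": 8, "churned": False},
]
model = train(history)
print(model["threshold"], decide(model, {"days_inactive": 45}))
```

Training happens offline on the historical data; only the cheap `decide` call needs to sit behind the API.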
SooOOo HOW DO I ENTER WONDERLAND?
STEP 1 : LEARN
• PYTHON + PANDAS + SCIKIT
• R
• SCALA
http://scikit-learn.org/
https://www.coursera.org/course/rprog
STEP 2 : PRACTICE
• Try to enter a contest on kaggle.com or datascience.net
• Join a meetup
http://www.dataiku.com/dss/trynow/
Dataiku HQ
2 rue Jean Lantier
75001 Paris France
Dataiku West
2423A Durant Avenue
Berkeley, CA 94704
Florian
You have ideas
“My data is too dirty. I don’t even know where to start.”
“We could probably better understand our users. But how?”
“There’s a trend here, but our full historical data is just too big.”
You have data
You need a tool