The paradox of big data - dataiku / oxalide APEROTECH

The Paradox of Big Data

Year: 2001 Programming Languages, 2004 Natural Language Processing, 2006 Social Recommendation, 2008 Distributed Computing, 2009 Web Mining, 2010, 2011 Social Gaming, 2012 Advertising, 2013 Dataiku

Time Spent Coding: 100%, 100%, 80%, 50%, 20%, 0%, 10%, 50%, 20%

Favorite Language: C, Exascript, Exascript, Exascript, Python, Powerpoint, Python, Java, None

Largest Dataset: 100GB, 100GB, 10GB, 10TB, 100TB, 100kB, 500GB, 100TB, 10TB

I’m Florian and I like data

www.dataiku.com

Dataiku in short

Software vendor behind Data Science Studio, the « Photoshop for Data Science »

COMMUNITY EDITION

http://www.dataiku.com/dss/trynow/


Goals For Today

• Big Data with the bias of what I know of it (Analytics …)

• Big Data: History and Feelings

• What are the key technologies to watch?

• Some practical use cases?

• How to get started?

Dataiku

Motivation

1/8/144

First Hard Drive: 3.75 Megabytes. Access Time: 1 second

IN 2008 man

invented big data

Volume Variety Velocity

WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?

Capacity Complexity Celerity

OR SIMPLER

Size Serendipity Speed

OR AFTER A DRINK

Big Blur Blazing

Or Combine

C… B.. S….

Or Combine

Complete Bull Sh..

SOOO WHAT IS BIG DATA?

PARADOX #1 SIMPLEXITY

SUBTLE PATTERNS

"MORE BUSINESS" BUTTONS

PARADOX #2 SELF-AWARE

DATA SCIENTIST AT NIGHT

DATA CLEANER BY DAY

DATA PLUMBER ON THE WEEKEND

WAITING FOR COMPUTATION BETWEEN COFFEES

PARADOX #3 WHERE TO STORE DATA?

MY DATA IS WORTH MILLIONS

I SEND IT TO THE MARKETING CLOUD

AND BACK IT UP TO GOOGLE

PARADOX #4 IS IT BIG OR NOT?

WE ALL LIVE IN A BIG DATA LAKE

ALL MY DATA MAY FIT IN HERE

PARADOX #5 (at last) HUMAN OR NOT?

TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE US ALL

I JUST WANT MORE REPORTS

BIG DATA TECH TRENDS

ELEPHANTS MAKE BABIES

Dataiku - Pig, Hive and Cascading

WELCOME TO TECHNOSLAVIA

Hadoop Ceph

Sphere Cassandra

Kafka Flume Spark

Scikit-Learn GraphLAB prediction.io jubatus

Mahout WEKA

MLBase LibSVM

RapidMiner Panda

Kibana

InfiniDB Drill Spark SQL

Hive Impala

…

Elastic Search

SOLR MongoDB

Riak Membase

Pig

Cascading

Talend

Machine Learning Mystery Land

Scalability Central

SQL Columnar Republic

Visualization County

Data Clean Wasteland

Statistician Old House

R

Real-time Island

Storm

NoSQL Nihiland

DRIVER 1: BACK TO THE BASICS

RAM - CPU - DISK

Cost per GB, 2000 vs 2013:

RAM: $1000 / GB (2000), $6 / GB (2013): memory cost divided by 150

Disk: $10 / GB (2000), $0.06 / GB (2013): disk cost divided by 250

MAP REDUCE times

HACK REDUCE times

A PERSISTENT MEMORY PROBLEM

DATA IS BIGGER

IS USEFUL DATA BIGGER?

WHOLE DATA

REFINED DATA

GOLD

NEEDLE IN HAYSTACK?

OIL

REFINE BEFORE USE

HOW BIG IS BIG DATA?

Web Site

– $1 billion revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day

10TB RAW DATA

1TB REFINED DATA

1 TERABYTE FITS IN MEMORY

1TB
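The orders of magnitude above can be checked with a quick back-of-the-envelope calculation. The sketch below is only an illustration: the event rate and the 10-to-1 refinement ratio come from the slide, while the record size (300 bytes) and the retention period (one year) are assumptions chosen for the example, not figures from the talk.

```python
# Back-of-the-envelope sizing for the web site described above.
EVENTS_PER_DAY = 100_000_000   # from the slide: 100 million orders/actions per day
BYTES_PER_EVENT = 300          # assumption: average size of one raw log record
RETENTION_DAYS = 365           # assumption: one year of history is kept
REFINED_FRACTION = 0.1         # from the slide: 10TB raw becomes 1TB refined
TB = 10 ** 12                  # decimal terabyte, in bytes


def raw_volume_tb(events_per_day, bytes_per_event, days):
    """Raw log volume in terabytes for the given event rate and retention."""
    return events_per_day * bytes_per_event * days / TB


raw_tb = raw_volume_tb(EVENTS_PER_DAY, BYTES_PER_EVENT, RETENTION_DAYS)
refined_tb = raw_tb * REFINED_FRACTION
print(f"raw: about {raw_tb:.0f} TB, refined: about {refined_tb:.0f} TB")
```

Under these assumptions the raw volume lands in the 10TB range, and the refined data stays around the 1TB that a single large-memory server can hold.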

DRIVER 2: ECOSYSTEM GROWS

• GOOGLE

• 1st Circle OPEN SOURCE – YAHOO – IBM – LINKEDIN – FACEBOOK

• 2nd Circle – STANFORD – BERKELEY – STARTUPS

STARTUPS

[Chart: funding raised by individual big data startups, 2009 to 2013, ranging from 1.7m$ to 100m$]

$1B per year Invested in Big Data

TECH 223m$

301m$

ALL > SPARK

Real-Time Resilient Distributed Memory Framework

• Abstraction with any DAG operation on data:
- Filter
- Map
- Reduce
- Cache
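To make the four operations concrete, here is a toy, single-machine sketch in plain Python. It is not Spark's API: the `LazyDataset` class and its method names are hypothetical stand-ins written for this example. It only illustrates the idea that filter and map are recorded lazily, reduce triggers the computation, and cache keeps an intermediate result in memory for reuse.

```python
import functools


class LazyDataset:
    """Toy, single-machine stand-in for a distributed dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable returning an iterator
        self._cache_enabled = False
        self._cached = None

    @classmethod
    def from_items(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_enabled:
            self._cached = list(self._compute())   # materialize once, reuse afterwards
            return iter(self._cached)
        return self._compute()

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._iterate() if predicate(x)))

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def cache(self):
        self._cache_enabled = True
        return self

    def reduce(self, fn):
        return functools.reduce(fn, self._iterate())


parse_calls = []


def parse(line):
    parse_calls.append(line)   # record each call so the effect of cache() is visible
    return int(line)


logs = LazyDataset.from_items(["3", "x", "5", "8"])
numbers = logs.filter(str.isdigit).map(parse).cache()   # nothing has run yet
total = numbers.reduce(lambda a, b: a + b)              # first action: runs the pipeline
largest = numbers.reduce(max)                           # second action: served from the cache
```

Because of `cache()`, the second reduce does not parse the lines again, which is the property that makes iterative, in-memory workloads cheap compared with rereading from disk at every step.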

SPARK AND ITS ECOSYSTEM

SHARK: Real-Time Queries

STREAMING: Real-Time Updates

MLBASE: In-Memory Learning

SPARK

SooOOo WHAT IS IT IN PRACTICE?


Turn Device Logs Into Next Year's Business

Parking ticket machine data

OpenStreetMap data

Cleaning and enrichment of data

Crossing data

Data Science Studio

Creation of a predictive algorithm

Availability of the predictions

Each street is segmented into small pieces that are enriched with geospatial information.

The parking ticket history is joined with the points of interest from OpenStreetMap.

The availability of parking spaces is predicted for each street segment from the joined data.

The algorithm is finally integrated in the iPhone app « Find me a space ».
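The join-then-predict flow described above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the field names, the sample records, the per-segment capacities, and above all the prediction rule, which is a naive historical-average baseline and not the algorithm used in the actual project.

```python
from collections import defaultdict

# Hypothetical sample records; field names are assumptions for illustration.
tickets = [
    {"segment_id": "s1", "hour": 9, "tickets_sold": 12},
    {"segment_id": "s1", "hour": 9, "tickets_sold": 8},
    {"segment_id": "s2", "hour": 9, "tickets_sold": 2},
]
points_of_interest = [
    {"segment_id": "s1", "kind": "restaurant"},
    {"segment_id": "s1", "kind": "school"},
    {"segment_id": "s2", "kind": "park"},
]
capacity = {"s1": 20, "s2": 10}  # assumed number of spaces per street segment


def build_features(tickets, pois):
    """Join ticket history with point-of-interest counts, keyed by (segment, hour)."""
    poi_count = defaultdict(int)
    for poi in pois:
        poi_count[poi["segment_id"]] += 1
    sold = defaultdict(list)
    for ticket in tickets:
        sold[(ticket["segment_id"], ticket["hour"])].append(ticket["tickets_sold"])
    return {
        key: {"mean_tickets": sum(values) / len(values), "poi_count": poi_count[key[0]]}
        for key, values in sold.items()
    }


def predict_free_spaces(features, capacity):
    """Naive baseline: segment capacity minus the historical mean of tickets sold."""
    return {
        key: max(0.0, capacity[key[0]] - feature["mean_tickets"])
        for key, feature in features.items()
    }


features = build_features(tickets, points_of_interest)
free_spaces = predict_free_spaces(features, capacity)
```

The baseline ignores `poi_count`. A trained model would take it, along with other geospatial and temporal features, as input.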


Optimizing Last Mile with Data Science Studio

Data Science Studio

Historical delivery and retrieval data

Modeling of a score for each delivery

Cleaning and temporal enrichment of data

Data aggregation by geographic location

Incorporation of new deliveries to the existing model


• Search reformulation

• No results

• Click on a business listing

• Top search

• Navigation or filter click

HOW CAN WE IMPROVE THE RELEVANCE OF OUR RESULTS BY ANALYZING USER BEHAVIOR?

20M

Analysis & corrections

Automation

>10 occurrences

1.4M queries

>200M searches

✗ ✓

0.5M prioritized queries

SOLUTION

Machine

Management

Exploration

pagesjaunes.fr

Directory

Hadoop Pig + Hive

Indexing export

Interpretation engine

Crawl

Other reference data

Scikit-learn
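The ">10 occurrences" step of the funnel above is a frequency filter over the search log. A minimal sketch in plain Python follows. The normalization rule and the sample queries are assumptions for illustration, and the production pipeline ran on Hadoop with Pig and Hive rather than in a single Python process.

```python
from collections import Counter


def normalize(query):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def frequent_queries(searches, min_occurrences=10):
    """Return distinct queries seen more than `min_occurrences` times, most frequent first."""
    counts = Counter(normalize(query) for query in searches)
    return [(query, n) for query, n in counts.most_common() if n > min_occurrences]


# Hypothetical search log
searches = ["plombier paris"] * 12 + ["Plombier  Paris"] * 3 + ["pizzeria lyon"] * 4
top = frequent_queries(searches)
```

Only queries above the threshold are kept for analysis and correction, which is how a log of hundreds of millions of searches shrinks to a prioritized list that people can review.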


Analyst

Panels

1970: Birth of Computer Analytics

Computer

Expensive Software

Marketing Studies


Multiple Data Sources

Analyst Team

Many Models

CRM

Logs

2015: BUILD YOUR FACTORY

Server Cluster

Light Software

Personalised Experience Model

Acquisition Cost Opportunity Model

Stock Optimisation Model

Optimize Delivery


Churn

Volume Forecast

Recommender Segmentation Lifetime Value

Risk Score Hot Location

Pricing Ranking Fraud Event Paths

A MODEL: an automated way to make a computer take a decision from raw (historical) data

The model can be used to take immediate (real-time) actions through an API
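As a rough illustration of this definition, the sketch below learns a deliberately trivial churn "model" from historical records and exposes it through a function that takes and returns JSON, standing in for an API endpoint. The function names, the field names, the 0.5 threshold and the sample history are all hypothetical. A real deployment would use a proper learning algorithm and an HTTP service.

```python
import json
from collections import defaultdict


def train(history):
    """Learn the historical churn rate of each plan from (plan, churned) pairs."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for plan, did_churn in history:
        totals[plan] += 1
        churned[plan] += int(did_churn)
    return {plan: churned[plan] / totals[plan] for plan in totals}


def handle_request(model, request_body, threshold=0.5):
    """Stand-in for an API endpoint: JSON request in, JSON decision out."""
    customer = json.loads(request_body)
    risk = model.get(customer["plan"], 0.0)  # assumption: unseen plans get zero risk
    action = "send_retention_offer" if risk >= threshold else "no_action"
    return json.dumps(
        {"customer_id": customer["customer_id"], "churn_risk": risk, "action": action}
    )


# Hypothetical historical data: (subscription plan, did the customer churn?)
history = [
    ("basic", True), ("basic", True), ("basic", False),
    ("premium", False), ("premium", False),
]
model = train(history)
response = handle_request(model, '{"customer_id": 42, "plan": "basic"}')
```

The training step runs offline on historical data. Only `handle_request` sits on the real-time path, so the decision is available the moment an event arrives.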


SooOOo HOW DO I ENTER WONDERLAND?

STEP 1: LEARN

• PYTHON + PANDAS + SCIKIT

• R

• SCALA

http://scikit-learn.org/

https://www.coursera.org/course/rprog

STEP 2: PRACTICE

• Try to enter a contest on kaggle.com or datascience.net

• Join a meetup

http://kaggle.com

http://datascience.net



Dataiku HQ

2 rue Jean Lantier

75001 Paris France

Dataiku West

2423A Durant Avenue

Berkeley, CA 94704

Florian [email protected]

You have ideas

“My data is too dirty. I don’t even know where to start”

“We could probably understand our users better. But how?”

“There’s a trend here, but our full historical data is just too big”

You have data

You need a tool

