The Paradox of Big Data

The paradox of big data - dataiku / oxalide APEROTECH


Page 1: The paradox of big data - dataiku / oxalide APEROTECH

The Paradox of Big Data

Page 2: The paradox of big data - dataiku / oxalide APEROTECH

2001 Programming Languages
2004 Natural Language Processing
2006 Social Recommendation
2008 Distributed Computing
2009 Web Mining
2010
2011 Social Gaming
2012 Advertising
2013 Dataiku

Time Spent Coding: 100%, 100%, 80%, 50%, 20%, 0%, 10%, 50%, 20%

Favorite Language: C, Exascript, Exascript, Exascript, Python, Powerpoint, Python, Java, None

Largest Dataset: 100GB, 100GB, 10GB, 10TB, 100TB, 100kB, 500GB, 100TB, 10TB

I’m Florian and I like data

Page 3: The paradox of big data - dataiku / oxalide APEROTECH

www.dataiku.com

Dataiku in short

Software editor behind Data Science Studio, the « Photoshop for Data Science »

COMMUNITY EDITION

http://www.dataiku.com/dss/trynow/

Page 4: The paradox of big data - dataiku / oxalide APEROTECH

Goals For Today

• Big Data, with the bias of what I know of it (Analytics …)

• Big Data: History and Feelings

• What are the key technologies to watch?

• Some practical use cases?

• How to get started?

Page 5: The paradox of big data - dataiku / oxalide APEROTECH

Motivation

First hard drive: 3.75 megabytes. Access time: 1 second.

Page 6: The paradox of big data - dataiku / oxalide APEROTECH

IN 2008 MAN INVENTED BIG DATA

Volume Variety Velocity

Page 7: The paradox of big data - dataiku / oxalide APEROTECH

WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?

Capacity Complexity Celerity

Page 8: The paradox of big data - dataiku / oxalide APEROTECH

OR SIMPLER

Size Serendipity Speed

Page 9: The paradox of big data - dataiku / oxalide APEROTECH

OR AFTER A DRINK

Big Blur Blazing

Page 10: The paradox of big data - dataiku / oxalide APEROTECH

Or Combine

C… B.. S….

Page 11: The paradox of big data - dataiku / oxalide APEROTECH

Or Combine

Complete Bull Sh..

Page 12: The paradox of big data - dataiku / oxalide APEROTECH

SOOO WHAT IS BIG DATA?

Page 13: The paradox of big data - dataiku / oxalide APEROTECH

PARADOX #1 SIMPLEXITY

Page 14: The paradox of big data - dataiku / oxalide APEROTECH

SUBTLE PATTERNS

Page 15: The paradox of big data - dataiku / oxalide APEROTECH

"MORE BUSINESS" BUTTONS

Page 16: The paradox of big data - dataiku / oxalide APEROTECH

PARADOX #2 SELF-AWARE

Page 17: The paradox of big data - dataiku / oxalide APEROTECH

DATA SCIENTIST BY NIGHT

Page 18: The paradox of big data - dataiku / oxalide APEROTECH

DATA CLEANER BY DAY

Page 19: The paradox of big data - dataiku / oxalide APEROTECH

DATA PLUMBER ON THE WEEKEND

Page 20: The paradox of big data - dataiku / oxalide APEROTECH

WAITING FOR COMPUTATIONS BETWEEN COFFEES

Page 21: The paradox of big data - dataiku / oxalide APEROTECH

PARADOX #3 WHERE TO STORE DATA?

Page 22: The paradox of big data - dataiku / oxalide APEROTECH

MY DATA IS WORTH MILLIONS

Page 23: The paradox of big data - dataiku / oxalide APEROTECH

I SEND IT TO THE MARKETING CLOUD

AND BACK IT UP TO GOOGLE

Page 24: The paradox of big data - dataiku / oxalide APEROTECH

PARADOX #4 IS IT BIG OR NOT ?

Page 25: The paradox of big data - dataiku / oxalide APEROTECH

WE ALL LIVE IN A BIG DATA LAKE

Page 26: The paradox of big data - dataiku / oxalide APEROTECH

ALL MY DATA MAY FIT IN HERE

Page 27: The paradox of big data - dataiku / oxalide APEROTECH

PARADOX #5 (at last) HUMAN OR NOT ?

Page 28: The paradox of big data - dataiku / oxalide APEROTECH

TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE US ALL

Page 29: The paradox of big data - dataiku / oxalide APEROTECH

I JUST WANT MORE REPORTS

Page 30: The paradox of big data - dataiku / oxalide APEROTECH

BIG DATA TECH TRENDS

Page 31: The paradox of big data - dataiku / oxalide APEROTECH

ELEPHANTS MAKE BABIES

Page 32: The paradox of big data - dataiku / oxalide APEROTECH

WELCOME TO TECHNOSLAVIA

[Map of the big data landscape]

Regions: Machine Learning Mystery Land, Scalability Central, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island, NoSQL Nihiland

Tools shown: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, Storm, Scikit-Learn, GraphLab, prediction.io, jubatus, Mahout, WEKA, MLBase, LibSVM, RapidMiner, Pandas, R, Kibana, InfiniDB, Drill, Spark SQL, Hive, Impala, Elasticsearch, SOLR, MongoDB, Riak, Membase, Pig, Cascading, Talend

Page 33: The paradox of big data - dataiku / oxalide APEROTECH

DRIVER 1: BACK TO THE BASICS

RAM - CPU - DISK

Page 34: The paradox of big data - dataiku / oxalide APEROTECH

A PERSISTENT MEMORY PROBLEM

2000 (MAP REDUCE times) vs. 2013 (HACK REDUCE times)

Memory: $1000 / GB in 2000, $6 / GB in 2013 (memory cost divided by 150)

Disk: $10 / GB in 2000, $0.06 / GB in 2013 (disk cost divided by 250)

Page 35: The paradox of big data - dataiku / oxalide APEROTECH

DATA IS BIGGER

Page 36: The paradox of big data - dataiku / oxalide APEROTECH

IS USEFUL DATA BIGGER?

WHOLE DATA

REFINED DATA

Page 37: The paradox of big data - dataiku / oxalide APEROTECH

GOLD

NEEDLE IN A HAYSTACK?

Page 38: The paradox of big data - dataiku / oxalide APEROTECH

OIL

REFINE BEFORE USE

Page 39: The paradox of big data - dataiku / oxalide APEROTECH

HOW BIG IS BIG DATA?

Web Site

– $1 billion revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day

10 TB RAW DATA

1 TB REFINED DATA

Page 40: The paradox of big data - dataiku / oxalide APEROTECH

1 TERABYTE FITS IN MEMORY
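The sizing argument from the previous two slides can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the event rate and the 10:1 raw-to-refined ratio come from the slides, while the bytes per event, the retention period and the 1 TB memory budget are assumptions picked for illustration.

```python
# Back-of-the-envelope sizing. Only EVENTS_PER_DAY and REFINED_FRACTION
# come from the slides; every other constant is an assumption.
EVENTS_PER_DAY = 100_000_000   # slide: 100 million orders / actions per day
REFINED_FRACTION = 0.1         # slide: 10 TB raw becomes 1 TB refined
BYTES_PER_RAW_EVENT = 250      # assumption: average size of one raw log record
RETENTION_DAYS = 365           # assumption: one year of history is kept
MEMORY_BUDGET_TB = 1.0         # assumption: RAM available for analytics
BYTES_PER_TB = 10**12


def raw_size_tb(events_per_day, bytes_per_event, days):
    """Total raw volume in terabytes for the given retention period."""
    return events_per_day * bytes_per_event * days / BYTES_PER_TB


raw_tb = raw_size_tb(EVENTS_PER_DAY, BYTES_PER_RAW_EVENT, RETENTION_DAYS)
refined_tb = raw_tb * REFINED_FRACTION
fits_in_memory = refined_tb <= MEMORY_BUDGET_TB

print(f"raw: {raw_tb:.2f} TB, refined: {refined_tb:.2f} TB, "
      f"fits in memory: {fits_in_memory}")
```

Under these assumptions the raw volume lands in the same order of magnitude as the 10 TB quoted on the slide, and the refined data stays under the memory budget. Changing the record size or retention shifts the result proportionally.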

Page 41: The paradox of big data - dataiku / oxalide APEROTECH

DRIVER 2: ECOSYSTEM GROWS

• GOOGLE

• 1st circle (OPEN SOURCE): YAHOO – IBM – LINKEDIN – FACEBOOK

• 2nd circle: STANFORD – BERKELEY – STARTUPS

Page 42: The paradox of big data - dataiku / oxalide APEROTECH

STARTUPS

$1B per year invested in Big Data tech

[Chart: startup funding rounds by year, 2009 to 2013. Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$. Other amounts shown: 223m$, 301m$.]

Page 43: The paradox of big data - dataiku / oxalide APEROTECH

ALL > SPARK

Real-Time Resilient Distributed Memory Framework

• Abstraction with any DAG operation on data:
  - Filter
  - Map
  - Reduce
  - Cache
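The four operations listed above can be illustrated without a cluster. The following is a toy, single-machine sketch in plain Python, not the Spark API: the `LazyDataset` class and its method names are hypothetical and only mimic the idea of chaining lazy operations and caching an intermediate result so it is computed once.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a distributed dataset: operations are recorded
    lazily and only run when a result is requested."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning a list
        self._cache_enabled = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def _materialize(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    def filter(self, predicate):
        return LazyDataset(lambda: [x for x in self._materialize() if predicate(x)])

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        """Keep the result in memory after the first computation."""
        self._cache_enabled = True
        return self

    def reduce(self, fn):
        return reduce(fn, self._materialize())


# Count how often the (pretend) expensive parsing step really runs.
calls = {"count": 0}


def parse(line):
    calls["count"] += 1
    return int(line)


lines = LazyDataset.from_list(["3", "x", "10", "7"])
numbers = lines.filter(str.isdigit).map(parse).cache()

total = numbers.reduce(lambda a, b: a + b)   # triggers filter + map once
largest = numbers.reduce(max)                # reuses the cached numbers
print(total, largest, calls["count"])
```

Because `numbers` is cached, the second reduction does not re-run the parsing step. That reuse of in-memory intermediate results is the point the slide makes about iterative and interactive workloads.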

Page 44: The paradox of big data - dataiku / oxalide APEROTECH

SPARK AND ITS ECOSYSTEM

SHARK: Real-Time Queries

STREAMING: Real-Time Updates

MLBASE: In-Memory Learning

Page 45: The paradox of big data - dataiku / oxalide APEROTECH

SooOOo WHAT IS IT IN PRACTICE?

Page 46: The paradox of big data - dataiku / oxalide APEROTECH

Turn Device Logs Into Next Year's Business

Inputs: parking ticket machine data, OpenStreetMap data

Steps in Data Science Studio:

• Cleaning and enrichment of data
• Crossing data
• Creation of a predictive algorithm
• Availability of the predictions

Each street is segmented into small pieces that are enriched with geospatial information.

The parking ticket history is joined with the points of interest from OpenStreetMap.

The availability of parking spaces is predicted per street segment from the joined data.

The algorithm is finally integrated in the iPhone app « Find me a space ».
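The join-then-predict flow of this use case can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm used in the project: the data is invented, all names are hypothetical, the model is a naive historical average swapped in for a real predictive algorithm, and it estimates ticket demand (a proxy for occupancy) rather than free spaces.

```python
from collections import defaultdict

# Hypothetical toy data: (street_segment, hour_of_day, tickets_sold).
TICKET_HISTORY = [
    ("seg-1", 9, 12), ("seg-1", 9, 8), ("seg-1", 14, 3),
    ("seg-2", 9, 2), ("seg-2", 14, 1),
]
# Hypothetical points-of-interest count per segment,
# as if extracted from OpenStreetMap.
POI_COUNT = {"seg-1": 5, "seg-2": 0, "seg-3": 2}


def join_with_pois(history, poi_count):
    """Enrich each ticket record with the POI count of its street segment."""
    return [
        {"segment": seg, "hour": hour, "tickets": tickets,
         "pois": poi_count.get(seg, 0)}
        for seg, hour, tickets in history
    ]


def mean(values):
    return sum(values) / len(values)


def fit_baseline(rows):
    """Naive model: average tickets per (segment, hour), plus a fallback
    average for segments with or without nearby points of interest."""
    by_segment_hour = defaultdict(list)
    by_has_poi = defaultdict(list)
    for row in rows:
        by_segment_hour[(row["segment"], row["hour"])].append(row["tickets"])
        by_has_poi[row["pois"] > 0].append(row["tickets"])
    return (
        {key: mean(v) for key, v in by_segment_hour.items()},
        {key: mean(v) for key, v in by_has_poi.items()},
    )


def predict_tickets(model, segment, hour, poi_count):
    """Expected ticket demand; falls back to the POI-based average
    when the segment has no history at that hour."""
    by_segment_hour, by_has_poi = model
    if (segment, hour) in by_segment_hour:
        return by_segment_hour[(segment, hour)]
    return by_has_poi.get(poi_count.get(segment, 0) > 0, 0.0)


rows = join_with_pois(TICKET_HISTORY, POI_COUNT)
model = fit_baseline(rows)
known = predict_tickets(model, "seg-1", 9, POI_COUNT)    # segment with history
unseen = predict_tickets(model, "seg-3", 9, POI_COUNT)   # no history, has POIs
print(known, round(unseen, 2))
```

The fallback is where the joined OpenStreetMap data earns its place: a segment without any ticket history still gets an estimate based on segments with similar surroundings.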

Page 47: The paradox of big data - dataiku / oxalide APEROTECH

Optimizing Last Mile with Data Science Studio

• Historical delivery and retrieval data
• Cleaning and temporal enrichment of data
• Data aggregation by geographic location
• Modeling of a score for each delivery
• Incorporation of new deliveries to the existing model

Page 48: The paradox of big data - dataiku / oxalide APEROTECH

HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?

User signals:

• Search reformulation
• No answer
• Click on a professional
• Top search
• Navigation click or filter

Funnel:

• >200M searches
• 20M
• 1.4M queries with >10 occurrences
• 0.5M prioritized queries (✗ / ✓)

Analysis & corrections, automation
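The funnel on this slide (keep queries seen more than 10 times, then prioritize them) can be sketched as follows. This is a minimal illustration, assuming a log of `(query, signal)` pairs and a priority based on the share of bad outcomes. The signal names and the ranking rule are hypothetical; the slide does not say how queries were actually prioritized.

```python
from collections import Counter

MIN_OCCURRENCES = 10                           # slide: queries with >10 occurrences
BAD_SIGNALS = {"reformulation", "no_answer"}   # assumption: signals of a failed search


def prioritize_queries(search_log, min_occurrences=MIN_OCCURRENCES):
    """Return frequent queries, worst failure rate first.

    search_log is an iterable of (query, signal) pairs.
    """
    totals = Counter()
    failures = Counter()
    for query, signal in search_log:
        totals[query] += 1
        if signal in BAD_SIGNALS:
            failures[query] += 1
    frequent = [q for q, n in totals.items() if n > min_occurrences]
    return sorted(frequent, key=lambda q: failures[q] / totals[q], reverse=True)


# Invented sample log, for illustration only.
log = (
    [("plumber paris", "click_pro")] * 9
    + [("plumber paris", "no_answer")] * 3
    + [("restaurant", "reformulation")] * 8
    + [("restaurant", "click_pro")] * 4
    + [("rare query", "no_answer")] * 5
)
prioritized = prioritize_queries(log)
print(prioritized)
```

The rare query is dropped by the occurrence threshold even though it always fails, which mirrors the slide's choice to spend correction effort on queries that occur often enough to matter.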

Page 49: The paradox of big data - dataiku / oxalide APEROTECH

SOLUTION

[Architecture diagram]

Components shown: pagesjaunes.fr directory, crawl, other reference datasets, Hadoop (Pig + Hive), scikit-learn, interpretation engine, indexing export, machine, management, exploration

Page 50: The paradox of big data - dataiku / oxalide APEROTECH

1970: Birth of Computer Analytics

Panels, Analyst, Computer, Expensive Software, Marketing Studies

Page 51: The paradox of big data - dataiku / oxalide APEROTECH

2015: BUILD YOUR FACTORY

Multiple Data Sources (CRM, Logs), Analyst Team, Server Cluster, Light Software

Many Models:

• Personalised Experience Model
• Acquisition Cost Opportunity Model
• Stock Optimisation Model
• Optimize Delivery

Page 52: The paradox of big data - dataiku / oxalide APEROTECH

Churn, Volume Forecast, Recommender, Segmentation, Lifetime Value, Risk Score, Hot Location, Pricing, Ranking, Fraud, Event Paths

A MODEL: an automated way to make a computer take a decision from raw (historical) data

The model can be used to take immediate (real-time) actions through an API


Page 54: The paradox of big data - dataiku / oxalide APEROTECH

SooOOo HOW DO I ENTER WONDERLAND?

Page 55: The paradox of big data - dataiku / oxalide APEROTECH

STEP 1 : LEARN

• PYTHON + PANDAS + SCIKIT

• R

• SCALA

http://scikit-learn.org/

https://www.coursera.org/course/rprog

Page 56: The paradox of big data - dataiku / oxalide APEROTECH

STEP 2 : PRACTICE

• Try entering a contest on kaggle.com or datascience.net

• Join a meetup

Page 57: The paradox of big data - dataiku / oxalide APEROTECH

www.dataiku.com

http://www.dataiku.com/dss/trynow/

Dataiku HQ

2 rue Jean Lantier

75001 Paris France

Dataiku West

2423A Durant Avenue

Berkeley, CA 94704

Florian [email protected]

You have ideas

“My data is too dirty. I don’t even know where to start.”

“We could probably better understand our users. But how?”

“There’s a trend here, but our full historical data is just too big.”

You have data

You need a tool