Deep Learning and Topological Data Analysis for machine intelligence and predictive analytics

Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Page 1: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Deep Learning and Topological Data Analysis for machine intelligence and predictive analytics

Page 2: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Unlabeled Data

Page 3: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Unlabeled Data

The vast majority of data is unlabeled:

•  Medical and DNA profiling
•  Images
•  Text
•  Stock market transactions
•  Customer activities
•  Sensor signals
•  System logs
•  Sound

Page 4: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Unlabeled Data

Business questions to unlabeled data:

•  How many categories are in my dataset?
•  Which categories are the best for the business?
•  Why are some objects not like the others?
•  How can I contextualize new objects?
•  Is there a simpler way to describe my data?

Page 5: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Topological invariants

A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is:

X ≅ Y ⇒ f(X) = f(Y)

Homology is a machine that converts local data about a space into global algebraic structure.

Reference: Wikipedia, 2010.
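As a small illustration of an invariant, the Euler characteristic (the alternating sum of simplex counts) takes the same value on different triangulations of the same space. The sketch below is not from the talk. The helper names `closure` and `euler_characteristic` and the example complexes are assumptions made for illustration.

```python
from itertools import combinations

def closure(top_simplices):
    """Return every non-empty face (as a frozenset) of the given simplices."""
    faces = set()
    for simplex in top_simplices:
        for k in range(1, len(simplex) + 1):
            for face in combinations(simplex, k):
                faces.add(frozenset(face))
    return faces

def euler_characteristic(top_simplices):
    """Alternating sum: #vertices - #edges + #triangles - ..."""
    return sum((-1) ** (len(face) - 1) for face in closure(top_simplices))

# Two different triangulations of a circle (hypothetical examples).
triangle_boundary = [(0, 1), (1, 2), (0, 2)]
square_boundary = [(0, 1), (1, 2), (2, 3), (0, 3)]

# The surface of a tetrahedron, which is a triangulated sphere.
tetrahedron_surface = list(combinations(range(4), 3))

print(euler_characteristic(triangle_boundary))
print(euler_characteristic(square_boundary))
print(euler_characteristic(tetrahedron_surface))
```

Both circle triangulations give 0 and the sphere gives 2, so the invariant separates the circle from the sphere without depending on how either one was triangulated.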

Page 6: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

The Čech Complex

Combinatorial representations
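For reference, the Čech complex at radius r contains a simplex for every subset of points whose radius-r balls share a common point, which is the same as requiring that the smallest ball enclosing the subset has radius at most r. The following is a minimal sketch for points in the plane, limited to vertices, edges and triangles. The function names and the sample points are assumptions, not material from the slides.

```python
import math
from itertools import combinations

def enclosing_radius(points):
    """Radius of the smallest circle enclosing 1, 2 or 3 points in the plane."""
    if len(points) == 1:
        return 0.0
    if len(points) == 2:
        return math.dist(points[0], points[1]) / 2
    # Side lengths of the triangle, sorted so that c is the longest side.
    a, b, c = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    if a * a + b * b <= c * c:
        # Right, obtuse or degenerate triangle: the longest side is a diameter.
        return c / 2
    # Acute triangle: circumradius R = abc / (4 * area), area by Heron's formula.
    s = (a + b + c) / 2
    area = math.sqrt(s * (s - a) * (s - b) * (s - c))
    return a * b * c / (4 * area)

def cech_complex(points, r):
    """Simplices (tuples of point indices) of dimension <= 2 at radius r."""
    simplices = []
    for k in (1, 2, 3):
        for idx in combinations(range(len(points)), k):
            if enclosing_radius([points[i] for i in idx]) <= r:
                simplices.append(idx)
    return simplices

# Hypothetical sample: an equilateral triangle with side length 1.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
print(cech_complex(pts, 0.51))
print(cech_complex(pts, 0.60))
```

At r = 0.51 the three edges are present but the triangle is not, because its circumradius is about 0.577. At r = 0.60 the triangle is filled in.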

Page 7: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Topological Data Analysis Pipeline

a.  Compute a combinatorial model approximating the structure of the underlying space
b.  Then compute topological invariants of this structure
c.  Represent these topological invariants in 2D space
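Steps a and b can be sketched on a toy point cloud. This is an illustrative simplification, not the pipeline used in the talk: the combinatorial model is only a neighbourhood graph (points closer than a hypothetical threshold `eps` are joined), and the invariants are the graph's Betti numbers b0 (connected components) and b1 (independent cycles, computed as E - V + b0). Step c, the 2D layout, is left out.

```python
import math
from itertools import combinations

def neighbourhood_graph(points, eps):
    """Step a: join every pair of points at distance <= eps."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) <= eps]

def betti_numbers(n_vertices, edges):
    """Step b: (b0, b1) of a graph, using union-find to count components."""
    parent = list(range(n_vertices))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = n_vertices
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    b0 = components
    b1 = len(edges) - n_vertices + b0
    return b0, b1

# Hypothetical data: 12 points sampled evenly from the unit circle.
circle = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
          for k in range(12)]
edges = neighbourhood_graph(circle, eps=0.6)
print(betti_numbers(len(circle), edges))
```

With eps = 0.6 only neighbouring samples are joined, so the graph is a single 12-cycle and the result is one component and one loop, matching the circle the points were drawn from.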

Page 8: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Morse Theory and Reeb Graph

Theorem: Suppose h: X → ℝ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.

Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.
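A Reeb graph tracks how the connected components of the level sets of a function merge and split. One common discrete approximation is a Mapper-style construction: cut the function's range into overlapping intervals, cluster the points that fall in each interval, and link clusters that share points. The sketch below follows that idea. It is an assumption made for illustration and is neither the spectral-clustering method of the cited paper nor necessarily what the talk used. All names and parameter values are hypothetical.

```python
import math

def components(indices, points, eps):
    """Connected components of the eps-neighbourhood graph on a subset."""
    remaining = set(indices)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def reeb_graph_sketch(points, h, n_intervals, overlap, eps):
    """Nodes are clusters inside overlapping slices of h; edges join
    clusters that have at least one point in common."""
    values = [h(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        a = lo + k * width - overlap * width
        b = lo + (k + 1) * width + overlap * width
        inside = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(components(inside, points, eps))
    edges = {(m, n)
             for m in range(len(nodes))
             for n in range(m + 1, len(nodes))
             if nodes[m] & nodes[n]}
    return nodes, edges

# Hypothetical data: 24 points on the unit circle, with height as the function.
ring = [(math.cos(2 * math.pi * k / 24), math.sin(2 * math.pi * k / 24))
        for k in range(24)]
nodes, edges = reeb_graph_sketch(ring, h=lambda p: p[1],
                                 n_intervals=4, overlap=0.2, eps=0.3)
print(len(nodes), len(edges))
```

For this ring the result is a graph with six nodes joined in a single cycle: one cluster at the bottom, one at the top, and a left and right cluster in each of the two middle slices.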

Page 9: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Deep Generative Nets + TDA

1. Learning of deep generative model
2. Fine-tuning using topological loss

Page 10: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

A dataset from the Netflix open competition for the best collaborative filtering algorithm to predict user ratings for films:

•  100,480,507 ratings
•  480,189 users
•  17,770 movies
•  2.1 GB of CSV data

Page 11: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

Standard approach to cluster analysis: PCA
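For readers unfamiliar with PCA, the projection can be sketched without any libraries. The slides do not say how PCA was computed, so everything below is an assumption: the components are found by power iteration with deflation, and the small ratings table is made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_direction(rows, iterations=200):
    """Leading eigenvector of X^T X for centred rows X, by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]            # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(dim)]                         # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v

def pca_project(rows, n_components=2):
    """Return (components, projected rows) for a list of equal-length rows."""
    dim = len(rows[0])
    means = [sum(row[j] for row in rows) / len(rows) for j in range(dim)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    residual = [row[:] for row in centred]
    components = []
    for _ in range(n_components):
        v = leading_direction(residual)
        components.append(v)
        # Deflation: remove the part of every row that lies along v.
        residual = [[x - dot(row, v) * c for x, c in zip(row, v)]
                    for row in residual]
    projected = [[dot(row, v) for v in components] for row in centred]
    return components, projected

# Hypothetical ratings: 4 users (rows) by 3 films (columns).
ratings = [[5, 4, 1], [4, 5, 2], [1, 2, 5], [2, 1, 4]]
components, projected = pca_project(ratings, n_components=2)
for user, coords in enumerate(projected):
    print(user, [round(c, 3) for c in coords])
```

Each user is reduced to two coordinates that can be plotted and clustered. PCA is a linear projection, so it can flatten curved structure in the data.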

Page 12: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

Standard approach to cluster analysis:

•  PCA
•  Hessian LLE
•  Isomap
•  Locally-Linear Embedding (LLE)
•  Local Tangent Space Alignment (LTSA)

Page 13: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

Topological result

Page 14: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

Topological result with labels

Page 15: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

Horror movies

Page 16: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Netflix competition

Science fiction / fantasy series

Page 17: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

•  18,820 documents
•  From 6 to 5,000 words each
•  20 newsgroups (classes)

20 Newsgroups academic dataset (semi-supervised)

Page 18: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

•  alt.atheism
•  comp.graphics
•  comp.os.ms-windows.misc
•  comp.sys.ibm.pc.hardware
•  comp.sys.mac.hardware
•  comp.windows.x
•  misc.forsale
•  rec.autos
•  rec.motorcycles
•  rec.sport.baseball
•  rec.sport.hockey
•  sci.crypt
•  sci.electronics
•  sci.med
•  sci.space
•  soc.religion.christian
•  talk.politics.guns
•  talk.politics.mideast
•  talk.politics.misc
•  talk.religion.misc

Page 19: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

Detailed topology (user group overlay)

Page 20: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

Baseball cluster

“pitch” > 1.2: this must be baseball

speed game margin realist chip ucdavi edu gari

built villanova huckabai basebal game and shade hour that damn long don plai hour game watch game for that long butt fall asleep and watch

channel surf pitch catch color

Page 21: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

Motorcycles cluster

“bike” > 1.114: this must be motorcycles

ride sixteen dai had put test drive honda final saturdai rain fact clear warm and sunni and wind di week ago long cool ride hawk cycl for test ride

had sold and deliv demo fifteen hour arriv and demo vfr bike lock showroom surround bike and

not like move todai even bike us dirt bike us street bike car and big tent full outlandishli fat tour bike trailer squeez park lot sort fat bike convent shelli and dave run msf each time classroom and back lot usual free cookout

distribut severli affect will bike perform such load cling back rest secur shift increas chanc surf

collect wisdom request can afford leather pant boot and jean can make you knee protector

rollerblad us bean and sell

Page 22: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

Result of learning first two groups

Figure labels: Labeled baseball, Unlabeled baseball, Labeled motorcycles, Unlabeled motorcycles, Autos, Pc.hardware, Mac.hardware

Page 23: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

Result of learning five groups

Figure labels: Mac.hardware, Baseball, Pc.hardware, Autos, Motorcycles, Sci.med, Politics.misc, Politics.mideast, Hockey

Page 24: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: 20 Newsgroups

Final result for 2nd layer

Figure labels: Motorcycles, Christian, Atheism, Religion.misc, Politics.guns, Politics.misc, Politics.mideast, Sci.crypt, Sci.med, Hockey, Baseball, Autos, Forsale, Mac.hardware, Electronics, Sci.space, Comp.graphics, Windows.x, Ms-windows.misc, Pc.hardware

Page 25: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.

•  88,567 users
•  867 metrics

Page 26: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Data Transformation

Used aggregated representations of user activities per day:

•  Number of likes
•  Number of dislikes
•  Number of matches
•  Profiles visited
•  Photos uploaded
•  Number of messages sent (no content analysed)
•  Number of message replies
•  Interactions with different app features

Page 27: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Page 28: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Messages sent / received

Page 29: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Users with high retention

Page 30: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Users grouped in retention clusters by using deep generative nets

Page 31: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Users grouped in retention clusters by using deep generative nets

Page 32: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

Users grouped in retention clusters by using deep generative nets

•  “Pretty boys”: users with a high score who received a lot of likes and messages in the first three days
•  “Dedicated”: users who invested much time in their profiles, were active on the site, and received several messages in the first three days
•  “Curious”: users who invested less time in their profiles and sent lots of messages, sometimes being blocked by other users

Page 33: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Badoo

On-line learning and prediction of user clusters

1. Configure integration
2. Perform segmentation
3. System performs classification
4. Report classification results

•  CSV API
•  JSON API
•  Database connector

Page 34: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Financial Articles

Understand the main topics of news and scientific articles on economics

•  17,020 documents

Page 35: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Financial Articles

Demo

Page 36: Edward Kibardin presentation at The Chief Data Scientist Europe 2016
Page 37: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Financial Articles

Page 38: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Case study: Financial Articles

Page 39: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

Links

•  Topology and Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf
•  Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf
•  Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf
•  Extracting and Composing Robust Features with Denoising Autoencoders (Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol): http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf

Page 40: Edward Kibardin presentation at The Chief Data Scientist Europe 2016

www.datarefiner.com