The History (and Future!) of Machine Learning at Reddit Chris Slowe, CTO, Reddit

Chief Analytics Officer Fall USA 2017 - Chris Slowe


Page 1: Chief Analytics Officer Fall USA 2017 - Chris Slowe

The History (and Future!) of Machine Learning at Reddit

Chris Slowe, CTO, Reddit

Page 2: Chief Analytics Officer Fall USA 2017 - Chris Slowe

A Search for Intelligence on the Internet

Chris Slowe, CTO, Reddit

Page 3: Chief Analytics Officer Fall USA 2017 - Chris Slowe

A Little About Me

● PhD in Experimental Physics

○ Emphasis on the “science is messy” part

● Y Combinator Summer ’05 Alum

● Founding Engineer at Reddit (starting ‘05)

● Chief Scientist at Hipmunk ‘10-’15

● Back at Reddit ‘16

● CTO ‘17

Page 4: Chief Analytics Officer Fall USA 2017 - Chris Slowe

What is Reddit?

Page 5: Chief Analytics Officer Fall USA 2017 - Chris Slowe

What is Reddit?

Reddit is the front page of the internet

A social network where there are tens of thousands of communities around whatever passions or interests you might have

It’s where people converse about the things that are most important to them

Page 6: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Reddit by the numbers

Alexa Rank (US/World): 4th/7th

MAU: 320M

Communities: 1.1M

Posts per day: 1M

Comments per day: 5M

Votes per day: 75M

Searches per day: 70M

Page 7: Chief Analytics Officer Fall USA 2017 - Chris Slowe

SCALE

ENDLESS CONTENT

Page 8: Chief Analytics Officer Fall USA 2017 - Chris Slowe

So, what are we doing with all that power?

Page 9: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Cat Walking a Human

Cat Fist Bumping

Page 10: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Wait, it’s not just cat pictures!

Page 11: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Community > Content > Individual

● Authenticity

● Creative freedom

● Empathy @ scale

Page 12: Chief Analytics Officer Fall USA 2017 - Chris Slowe

r/confession

Secrets that if revealed would change your life forever?

Page 13: Chief Analytics Officer Fall USA 2017 - Chris Slowe
Page 14: Chief Analytics Officer Fall USA 2017 - Chris Slowe

r/assistance

Empathy and support at scale

Page 15: Chief Analytics Officer Fall USA 2017 - Chris Slowe

With so much content, how do we get users to what they’ll like most?

Page 16: Chief Analytics Officer Fall USA 2017 - Chris Slowe

The History

Page 17: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Some Reddit Historical Context

● Launched in 2005 (cf: Facebook 2004, Twitter 2006)

● Originally built as a better version of del.icio.us/popular

● Voting has always been a core function, but comments didn’t come until about 6 months in.

● Very early on wanted to break into personalization and recommendations

● Comparative ML dark ages: the Netflix Prize was kicked off in late 2006

● Communities/subreddits didn’t come until 2007-8.

Page 18: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Reddit in 2006!

Page 19: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Reddit in 2006! (look closely!)

Page 20: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Initial Attempts: Simple Collaborative Filter!

● Nice feature of reddit: we have upvotes and we also have downvotes!

● Start simple (aka “complicated for 2005”) -- multidimensional vector space where each dimension corresponds to a post

● Represent users by a vector of their votes on each item

○ +1 = upvote

○ -1 = downvote

● User distance by cosine similarity

● Recenter by subtracting global mean

● Apply singular value decomposition
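
The steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the original Reddit code: the toy `votes` matrix, the choice to treat a missing vote as 0, and the number of retained dimensions `k` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: one row per user, one column per post.
# +1 = upvote, -1 = downvote, 0 = no vote (an assumption for missing values).
votes = np.array([
    [ 1,  1, -1,  0],
    [ 1,  1, -1,  0],
    [-1,  0,  1,  1],
    [ 0, -1,  1,  1],
], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between two vote vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0


# Recenter by subtracting the global mean vote from every entry.
centered = votes - votes.mean()

# Singular value decomposition; keep the top-k directions as a
# low-dimensional representation of each user.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
k = 2  # hypothetical number of dimensions to keep
user_factors = u[:, :k] * s[:k]

sim_same = cosine_similarity(votes[0], votes[1])      # identical voters
sim_opposite = cosine_similarity(votes[0], votes[2])  # opposed voters
```

Users 0 and 1 vote identically, so their similarity is close to 1, while users 0 and 2 disagree on every post they both voted on, so their similarity is negative.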

Page 21: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Initial Problems: Data Sparseness

● No categories/communities then (just one pile of content)

● The median number of votes per user was generally something very close to “1”

● Solution: focus on the 10% core and assign the rest a “buddy”

● Still swamped by the “mean response” behavior of the site

○ A lazy version of recommendations: show what’s on the front page!

○ Lots of people like it by definition!

Page 22: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Initial Problems 2: No feedback loop!

● Well other than the “users complain” one

● Long before we had a first-class notion of A/B tests (let alone before it became common practice on the web!)

● Also, no notion of relevance here outside of the votes

○ No capacity to detect “Politics” as there’s no semantic consideration

Page 23: Chief Analytics Officer Fall USA 2017 - Chris Slowe

“When Life gives you lemons,

make lemon hand grenades” -- Cave Johnson, Portal 2

Page 24: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Applicability to Anti-Cheating!

● Recommendations are all about “filling in the blanks”

○ The data is sparse because those are the values you don’t know but want to know

○ The intention is to correlate users to fill in these gaps

● A slightly different problem: vote cheating!

○ Create a bunch of accounts

○ Have them all vote for your content

○ ???

○ Profit!

● In the above example, these accounts will also have anomalously good recommendations!
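
One way to picture this idea is as a graph: connect any two accounts whose votes are almost perfectly correlated, then look for connected groups. The sketch below is only an illustration under assumptions. The account names, the `threshold` of 0.9, and the `min_size` of 3 are hypothetical, and nothing here is taken from Reddit's actual anti-cheating system.

```python
from itertools import combinations
from math import sqrt

# Hypothetical vote logs: account -> {post_id: +1 or -1}.
votes = {
    "sock1": {"p1": 1, "p2": 1, "p3": 1},
    "sock2": {"p1": 1, "p2": 1, "p3": 1},
    "sock3": {"p1": 1, "p2": 1, "p3": 1},
    "alice": {"p1": -1, "p4": 1, "p5": 1},
    "bob":   {"p2": 1, "p5": -1, "p6": 1},
}


def cosine(a, b):
    """Cosine similarity of two sparse vote vectors."""
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    # Every vote is +1 or -1, so a vector's length is sqrt(number of votes).
    norm = sqrt(len(a)) * sqrt(len(b))
    return dot / norm if norm else 0.0


def correlated_islands(votes, threshold=0.9, min_size=3):
    """Return groups of accounts linked by near-identical voting."""
    neighbors = {user: set() for user in votes}
    for u, v in combinations(votes, 2):
        if cosine(votes[u], votes[v]) >= threshold:
            neighbors[u].add(v)
            neighbors[v].add(u)

    # Connected components via depth-first search.
    seen, islands = set(), []
    for start in votes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) >= min_size:
            islands.append(component)
    return islands


islands = correlated_islands(votes)
```

With this toy data the three "sock" accounts form one island, while alice and bob are only weakly correlated with anyone and are left out.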

Page 25: Chief Analytics Officer Fall USA 2017 - Chris Slowe

There’s a Graph for That

Page 26: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Well what do we have here...

[Graph of vote correlations between users, with two annotations:]

● Islands of highly correlated users (colloquially: cheaters!)

● Lots of mean-response voting (aka on the front page)

Page 27: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Let the humans do the hard part

Page 28: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Enter “subreddits”

● By 2007, Reddit was still on a 4-month doubling curve!

○ “Eternal September” on steroids

○ “Everything was better 6 months ago” every 6 months.

● Increasingly broad audience: not just programming and science anymore:

○ Put our efforts into scaling up human-centered “categorization”

○ Branded these areas as “subreddits”

○ At its core, scaling by parallelization (each subreddit is a freestanding reddit!)

● Front page becomes combination of content across subreddits

● Users (later “mods”) are in charge of keeping subreddits on topic

○ Subreddits morph into “communities” with their own mores...

Page 29: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Reddit in 2009!

Page 30: Chief Analytics Officer Fall USA 2017 - Chris Slowe

The Modern Era

Page 31: Chief Analytics Officer Fall USA 2017 - Chris Slowe

We’ve grown up a little! They like us. They really like us!

Page 32: Chief Analytics Officer Fall USA 2017 - Chris Slowe

And the number of communities has grown!

Communities: Gaming, News & Politics, Q&A, Learning, Sports, Fitness & Health, TV & Movies, Images & GIFs

Page 33: Chief Analytics Officer Fall USA 2017 - Chris Slowe

...but so has the technology!

40k events/sec

21 GB per month

~13k active subreddits

50M accounts

Page 34: Chief Analytics Officer Fall USA 2017 - Chris Slowe

We <3 Kafka

[Architecture diagram. Components shown: App (×3), Event Collectors (×4), Midas (Enrich), Minsky (ML), Spamurai (Rules)]

Page 35: Chief Analytics Officer Fall USA 2017 - Chris Slowe

“Let the humans do the hard part”

...but ML has gotten easier!

Page 36: Chief Analytics Officer Fall USA 2017 - Chris Slowe

subreddit recommendations

• Similarity of subreddits defined by the Jaccard distance with respect to the set of subscribers.

• Problem 1: we have a lot of subscriptions

• Problem 2: we used to have these “default” subreddits

– Everyone subscribed to them

– Not so much “mean behavior” as “very high noise floor”

– Humans did the hard part of cleaning this up.
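
As a rough illustration of the approach on this slide, the sketch below computes Jaccard similarity between subscriber sets and shows how a "default" subreddit can swamp the ranking. The subscriber data, the `most_similar` helper, and its `exclude` parameter are hypothetical and exist only for this example.

```python
# Hypothetical subscriber sets: subreddit -> set of user ids.
# "pics" plays the role of a default that everyone is subscribed to.
subscribers = {
    "Fitness":      {"u1", "u2", "u3", "u4"},
    "bodybuilding": {"u2", "u3", "u4", "u5"},
    "Denver":       {"u1", "u6"},
    "pics":         {"u1", "u2", "u3", "u4", "u5", "u6"},
}


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_distance(a, b):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


def most_similar(target, subscribers, exclude=frozenset()):
    """Rank the other subreddits by similarity to `target`, most similar first."""
    scores = [
        (name, jaccard_similarity(subscribers[target], members))
        for name, members in subscribers.items()
        if name != target and name not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


with_defaults = most_similar("Fitness", subscribers)
without_defaults = most_similar("Fitness", subscribers, exclude={"pics"})
```

In this toy data the default subreddit ranks first for "Fitness" simply because everyone subscribes to it; once it is excluded, "bodybuilding" comes out on top. That is the "very high noise floor" problem in miniature.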

Page 37: Chief Analytics Officer Fall USA 2017 - Chris Slowe

A second approach: doc2vec

• Take the corpus of all comments on a given subreddit

• Smash all the comments into one document

• Use the subreddit as the label

• Apply:

– vanilla doc2vec (via gensim)

– k-means clustering (via sklearn)

Page 38: Chief Analytics Officer Fall USA 2017 - Chris Slowe

We get beautiful results

"gaming": ["emulation", "Games", "3DS", "PS4", "Steam", "pcgaming", "PS3", "nintendo", "xboxone", "Vive", ...]

"health": ["Fitness", "bodybuilding", "weightlifting", "powerlifting", "xxfitness", "swoleacceptance", "ketogains", "progresspics", "bodyweightfitness", "Rowing", ...]

"cryptocurrency": ["ethtrader", "btc", "Bitcoin", "Lisk", "EthereumClassic", "Monero", "Buttcoin", "ethereum", "CryptoCurrency", "dogecoin", ...]

"BigCities": ["Denver", "Austin", "Winnipeg", "sanfrancisco", "Atlanta", "nashville", "LosAngeles", "Cleveland", "chicago", "Calgary", ...]

Page 39: Chief Analytics Officer Fall USA 2017 - Chris Slowe

And we can measure a lift!

Page 40: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Where we’re headed!

● We have made steps to try Deep Learning (it is, after all, the done thing)

○ Models for Harassment via PMs

○ Looking to apply more broadly

● Logistic regressions for home feed

● Increase number of signals:

○ Geography

○ Platform

○ Time of Day

● Do it Live!

Page 41: Chief Analytics Officer Fall USA 2017 - Chris Slowe

All Possible by our Relevance Team

● Luis Bitencourt-Emilio - Director

● Dr. Eric Xu - Engineering Manager

● Dr. Jenny Jin - Data Science

● Katie Bauer - Data Science

● Artem Yankov

● Keith Blaha

● Niraj Sheth

● Dan Ellis

● William Seltzer

Page 42: Chief Analytics Officer Fall USA 2017 - Chris Slowe

Thanks! Chris Slowe

[email protected]

u/KeyserSosa

@KeyserSosa

PS: We’re hiring!

http://reddit.com/jobs