Chief Analytics Officer Fall USA 2017 - Chris Slowe


The History (and Future!) of Machine Learning at Reddit
Chris Slowe, CTO, Reddit

A Search for Intelligence on the Internet
Chris Slowe, CTO, Reddit

A Little About Me

● PhD in Experimental Physics

○ Emphasis on the “science is messy” part

● YCombinator Summer ‘05 Alum

● Founding Engineer at Reddit (starting ‘05)

● Chief Scientist at Hipmunk ‘10-’15

● Back at Reddit ‘16

● CTO ‘17

What is Reddit?

Reddit is the front page of the internet.

A social network where there are tens of thousands of communities around whatever passions or interests you might have.

It's where people converse about the things that are most important to them.

Reddit by the numbers

● Alexa Rank (US/World): 4th/7th
● MAU: 320M
● Communities: 1.1M
● Posts per day: 1M
● Comments per day: 5M
● Votes per day: 75M
● Searches per day: 70M

SCALE

[Slide filled edge to edge with the words "ENDLESS CONTENT"]

So, what are we doing with all that power?

[Images: a cat walking a human; a cat fist bumping]

Wait, it’s not just cat pictures!

Community > Content > Individual

● Authenticity

● Creative freedom

● Empathy @ scale

r/confession
Secrets that, if revealed, would change your life forever?

r/assistance
Empathy and support at scale

With so much content, how do we get users to what they'll like most?

The History

Some Reddit Historical Context

● Launched in 2005 (cf: Facebook 2004, Twitter 2006)

● Originally built targeting a better version of del.icio.us/popular

● Voting has always been a core function, but comments didn’t come until

about 6 months in.

● Very early on wanted to break into personalization and recommendations

● Comparative ML dark ages: the Netflix Prize was kicked off in late 2006

● Communities/subreddits didn’t come until 2007-8.

Reddit in 2006!

Reddit in 2006! (look closely!)

Initial Attempts: Simple Collaborative Filter!

● Nice feature of reddit: we have upvotes and we also have downvotes!

● Start simple (aka "complicated for 2005") -- a multidimensional vector space where each dimension corresponds to a post

● Represent users by a vector of their votes on each item

○ +1 = upvote

○ -1 = downvote

● User distance by cosine similarity

● Recenter by subtracting global mean

● Apply singular value decomposition
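
As a rough illustration of the recipe above (not Reddit's actual code), here is a minimal NumPy sketch: a toy user-by-post vote matrix, recentered by the global mean, compared with cosine similarity, and reduced with SVD. The matrix values and the rank k are made up.

import numpy as np

# Illustrative vote matrix: rows are users, columns are posts.
# +1 = upvote, -1 = downvote, 0 = no vote (the sparse "blanks" we want to fill).
votes = np.array([
    [ 1,  1,  0, -1],
    [ 1,  0,  1, -1],
    [-1,  1,  0,  1],
], dtype=float)

# Recenter by subtracting the global mean of the observed votes.
global_mean = votes[votes != 0].mean()
centered = np.where(votes != 0, votes - global_mean, 0.0)

def cosine_similarity(u, v):
    """User-to-user distance via cosine similarity of vote vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine_similarity(centered[0], centered[1]))

# Low-rank reconstruction via singular value decomposition:
# keep the top-k singular values and treat the result as predicted scores.
k = 2
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(predicted.round(2))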

Initial Problems: Data Sparseness

● No categories/communities then (just one pile of content)

● The median number of votes per user was generally something very close to

“1”

● Solution: focus on the 10% core and assign the rest a “buddy”

● Still swamped by the “mean response” behavior of the site

○ A lazy version of recommendations: show what’s on the front page!

○ Lots of people like it by definition!

Initial Problems 2: No feedback loop!

● Well other than the “users complain” one

● Long before we had a first class notion of A/B tests (let alone before it

became common practice on the web!)

● Also, no notion of relevance here outside of the votes

○ No capacity to detect “Politics” as there’s no semantic consideration

"When life gives you lemons, make lemon hand grenades" -- Cave Johnson, Portal 2

Applicability to Anti-Cheating!

● Recommendations are all about “filling in the blanks”

○ The data is sparse because those are the values you don't know but want to know

○ The intention is to correlate users to fill in these gaps

● A slightly different problem: vote cheating!

○ Create a bunch of accounts

○ Have them all vote for your content

○ ???

○ Profit!

● In the above example, these accounts will also have anomalously good

recommendations!
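
The same machinery points at the cheating problem: accounts that vote in lockstep have nearly identical vote vectors. A minimal sketch of one way to surface them, assuming a toy vote matrix and an illustrative similarity threshold (networkx is used here just to pull out the connected "islands"):

import numpy as np
import networkx as nx

# Illustrative vote matrix (users x posts); a ring of accounts that always
# vote together shows up as a tight island in the similarity graph.
votes = np.array([
    [ 1,  1,  1,  0],   # suspicious trio: identical voting
    [ 1,  1,  1,  0],
    [ 1,  1,  1,  0],
    [ 1, -1,  0,  1],   # ordinary users
    [-1,  1,  1, -1],
])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

THRESHOLD = 0.95  # illustrative cutoff for "anomalously correlated"

graph = nx.Graph()
graph.add_nodes_from(range(len(votes)))
for i in range(len(votes)):
    for j in range(i + 1, len(votes)):
        if cosine(votes[i], votes[j]) >= THRESHOLD:
            graph.add_edge(i, j)

# Islands of highly correlated users (size > 1) are candidates for review.
suspicious = [c for c in nx.connected_components(graph) if len(c) > 1]
print(suspicious)   # e.g. [{0, 1, 2}]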

There’s a Graph for That

[Graph of user vote correlations: islands of highly correlated users (colloquially: cheaters!) stand out against lots of mean-response voting (aka the front page)]

Well what do we have here...

Let the humans do the hard part

Enter “subreddits”

● By 2007, Reddit was still on a 4-month doubling curve!
○ "Eternal September" on steroids

○ “Everything was better 6 months ago” every 6 months.

● Increasingly broad audience: not just programming and science anymore
○ Put our efforts into scaling up human-centered "categorization"

○ Branded these areas as “subreddits”

○ At its core, scaling by parallelization (each subreddit is a freestanding reddit!)

● Front page becomes combination of content across subreddits

● Users (later "mods") are in charge of keeping subreddits on topic
○ Subreddits morph into "communities" with their own mores...

Reddit in 2009!

The Modern Era

We’ve grown up a little! They like us. They really like us!

Gaming

News & Politics

Communities

Q&A

Learning

Sports

Fitness & Health

TV & Movies

Images & GIFs

And the number of communities has grown!

...but so has the technology!

40k events/sec

21 GB per month

~13k active subreddits

50M accounts

We <3 Kafka

[Event pipeline diagram: Apps → Event Collectors → Kafka → Midas (Enrich) / Minsky (ML) / Spamurai (Rules)]

“Let the humans do the hard part”

...but ML has gotten easier!

subreddit recommendations

• Similarity of subreddits defined by the Jaccard distance with respect to the set of subscribers (see the sketch after this list)

• Problem 1: we have a lot of subscriptions

• Problem 2: we used to have these “default” subreddits

– Everyone subscribed to them

– Not so much “mean behavior” as “very high noise floor”

– Humans did the hard part of cleaning this up.
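
A minimal sketch of the subscriber-set approach, using Jaccard similarity (the distance is just one minus this); the subreddit names and subscriber IDs below are made up:

# Jaccard similarity over subscriber sets; toy data for illustration only.
subscribers = {
    "Fitness":      {"u1", "u2", "u3", "u4"},
    "bodybuilding": {"u2", "u3", "u4", "u5"},
    "ethtrader":    {"u6", "u7"},
}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| -- 1.0 means identical subscriber sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(name, k=5):
    """Rank other subreddits by Jaccard similarity to `name`."""
    scores = [(other, jaccard(subscribers[name], subs))
              for other, subs in subscribers.items() if other != name]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]

print(most_similar("Fitness"))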

A second approach: doc2vec

• Take the corpus of all comments on a given subreddit

• Smash all the comments into one document

• Use the subreddit as the label

• Apply:

– vanilla doc2vec (via gensim)

– k-means clustering (via sklearn)
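
A minimal sketch of those two steps with gensim 4.x and scikit-learn; the toy corpus, vector size, epoch count, and cluster count are all illustrative stand-ins for the real comment data:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from sklearn.cluster import KMeans

# One "document" per subreddit: all of its comments smashed together,
# tagged with the subreddit name. The corpus here is a toy stand-in.
corpus = {
    "gaming":  "played the new release on my PS4 last night ...",
    "Fitness": "hit a new squat PR this week, form check please ...",
    "Bitcoin": "the price action on the exchanges today was wild ...",
}
docs = [TaggedDocument(words=simple_preprocess(text), tags=[name])
        for name, text in corpus.items()]

# Vanilla doc2vec via gensim (hyperparameters are illustrative).
model = Doc2Vec(documents=docs, vector_size=100, epochs=20, min_count=1)

# k-means clustering over the learned subreddit vectors via sklearn.
names = list(corpus)
vectors = [model.dv[name] for name in names]   # gensim 4.x; .docvecs in 3.x
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for name, label in zip(names, kmeans.labels_):
    print(label, name)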

We get beautiful results

"gaming": [

"emulation",

"Games",

"3DS",

"PS4",

"Steam",

"pcgaming",

"PS3",

"nintendo",

"xboxone",

"Vive",

...

]

"health": [

"Fitness",

"bodybuilding",

"weightlifting",

"powerlifting",

"xxfitness",

"swoleacceptance",

"ketogains",

"progresspics",

"bodyweightfitness",

"Rowing",

...

]

"cryptocurrency": [

"ethtrader",

"btc",

"Bitcoin",

"Lisk",

"EthereumClassic",

"Monero",

"Buttcoin",

"ethereum",

"CryptoCurrency",

"dogecoin",

...

]

"BigCities": [

"Denver",

"Austin",

"Winnipeg",

"sanfrancisco",

"Atlanta",

"nashville",

"LosAngeles",

"Cleveland",

"chicago",

"Calgary",

...

]

And we can measure a lift!

Where we’re headed!

● We have made steps to try Deep Learning (it is, after all, the done thing)

○ Models for Harassment via PMs

○ Looking to apply more broadly

● Logistic regressions for the home feed (see the sketch after this list)

● Increase number of signals:

○ Geography

○ Platform

○ Time of Day

● Do it Live!
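
For the home-feed bullet above, a minimal scikit-learn sketch of a logistic regression over signals like those listed (geography, platform, time of day); the feature names, toy impressions, and labels are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data: each row is a (user, post) impression with the extra
# signals named on the slide; `y` is a made-up engagement label.
X = pd.DataFrame({
    "geography":   ["US", "US", "DE", "CA", "US", "DE"],
    "platform":    ["ios", "web", "android", "web", "ios", "web"],
    "hour_of_day": [8, 23, 14, 9, 20, 2],
})
y = np.array([1, 0, 1, 0, 1, 0])

model = Pipeline([
    ("features", ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"),
         ["geography", "platform"]),
    ], remainder="passthrough")),   # keep hour_of_day as a numeric feature
    ("clf", LogisticRegression()),
])
model.fit(X, y)

# Score a fresh impression; the probability can be used to rank the feed.
candidate = pd.DataFrame({"geography": ["US"], "platform": ["ios"], "hour_of_day": [19]})
print(model.predict_proba(candidate)[0, 1])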

All Possible by our Relevance Team

● Luis Bitencourt-Emilio - Director

● Dr. Eric Xu - Engineering Manager

● Dr. Jenny Jin - Data Science

● Katie Bauer - Data Science

● Artem Yankov

● Keith Blaha

● Niraj Sheth

● Dan Ellis

● William Seltzer

Thanks! Chris Slowe

chris@reddit.com

u/KeyserSosa

@KeyserSosa

PS: We’re hiring!

http://reddit.com/jobs
