Chief Analytics Officer Fall USA 2017 - Chris Slowe


The History (and Future!) of Machine Learning at Reddit
Chris Slowe, CTO, Reddit

A Search for Intelligence on the Internet
Chris Slowe, CTO, Reddit

A Little About Me

● PhD in Experimental Physics

○ Emphasis on the “science is messy” part

● YCombinator Summer ‘05 Alum

● Founding Engineer at Reddit (starting ‘05)

● Chief Scientist at Hipmunk ‘10-’15

● Back at Reddit ‘16

● CTO ‘17

What is Reddit?

Reddit is the front page of the internet.

A social network where there are tens of thousands of communities around whatever passions or interests you might have.

It's where people converse about the things that are most important to them.

Reddit by the numbers

● Alexa Rank (US/World): 4th/7th
● MAU: 320M
● Communities: 1.1M
● Posts per day: 1M
● Comments per day: 5M
● Votes per day: 75M
● Searches per day: 70M

SCALE

[Slide filled edge to edge with the words "ENDLESS CONTENT"]

So, what are we doing with all that power?

[Images: a cat walking a human; a cat fist bumping]

Wait, it’s not just cat pictures!

Community > Content > Individual

● Authenticity

● Creative freedom

● Empathy @ scale

r/confession
Secrets that, if revealed, would change your life forever?

r/assistance
Empathy and support at scale

With so much content, how do we get users to what they'll like most?

The History

Some Reddit Historical Context

● Launched in 2005 (cf: Facebook 2004, Twitter 2006)

● Originally built targeting a better version of del.icio.us/popular

● Voting has always been a core function, but comments didn’t come until

about 6 months in.

● Very early on wanted to break into personalization and recommendations

● Comparative ML dark ages: the Netflix Prize was kicked off in late 2006

● Communities/subreddits didn’t come until 2007-8.

Reddit in 2006!

Reddit in 2006! (look closely!)

Initial Attempts: Simple Collaborative Filter!

● Nice feature of reddit: we have upvotes and we also have downvotes!

● Start simple (aka "complicated for 2005") -- a multidimensional vector space where each dimension corresponds to a post

● Represent users by a vector of their votes on each item

○ +1 = upvote

○ -1 = downvote

● User distance by cosine similarity

● Recenter by subtracting global mean

● Apply singular value decomposition
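
As a rough illustration of the recipe above (not Reddit's actual code), here is a minimal NumPy sketch: a toy user-by-post vote matrix, recentered by the global mean, compared with cosine similarity, and reduced with SVD. The matrix values and the rank k are made up.

import numpy as np

# Illustrative vote matrix: rows are users, columns are posts.
# +1 = upvote, -1 = downvote, 0 = no vote (the sparse "blanks" we want to fill).
votes = np.array([
    [ 1,  1,  0, -1],
    [ 1,  0,  1, -1],
    [-1,  1,  0,  1],
], dtype=float)

# Recenter by subtracting the global mean of the observed votes.
global_mean = votes[votes != 0].mean()
centered = np.where(votes != 0, votes - global_mean, 0.0)

def cosine_similarity(u, v):
    """User-to-user distance via cosine similarity of vote vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine_similarity(centered[0], centered[1]))

# Low-rank reconstruction via singular value decomposition:
# keep the top-k singular values and treat the result as predicted scores.
k = 2
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(predicted.round(2))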

Initial Problems: Data Sparseness

● No categories/communities then (just one pile of content)

● The median number of votes per user was generally something very close to

“1”

● Solution: focus on the 10% core and assign the rest a “buddy”

● Still swamped by the “mean response” behavior of the site

○ A lazy version of recommendations: show what’s on the front page!

○ Lots of people like it by definition!

Initial Problems 2: No feedback loop!

● Well other than the “users complain” one

● Long before we had a first class notion of A/B tests (let alone before it

became common practice on the web!)

● Also, no notion of relevance here outside of the votes

○ No capacity to detect “Politics” as there’s no semantic consideration

"When life gives you lemons, make lemon hand grenades" -- Cave Johnson, Portal 2

Applicability to Anti-Cheating!

● Recommendations are all about “filling in the blanks”

○ The data is sparse because those are the values you don't know but want to know

○ The intention is to correlate users to fill in these gaps

● A slightly different problem: vote cheating!

○ Create a bunch of accounts

○ Have them all vote for your content

○ ???

○ Profit!

● In the above example, these accounts will also have anomalously good

recommendations!
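
The same machinery points at the cheating problem: accounts that vote in lockstep have nearly identical vote vectors. A minimal sketch of one way to surface them, assuming a toy vote matrix and an illustrative similarity threshold (networkx is used here just to pull out the connected "islands"):

import numpy as np
import networkx as nx

# Illustrative vote matrix (users x posts); a ring of accounts that always
# vote together shows up as a tight island in the similarity graph.
votes = np.array([
    [ 1,  1,  1,  0],   # suspicious trio: identical voting
    [ 1,  1,  1,  0],
    [ 1,  1,  1,  0],
    [ 1, -1,  0,  1],   # ordinary users
    [-1,  1,  1, -1],
])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

THRESHOLD = 0.95  # illustrative cutoff for "anomalously correlated"

graph = nx.Graph()
graph.add_nodes_from(range(len(votes)))
for i in range(len(votes)):
    for j in range(i + 1, len(votes)):
        if cosine(votes[i], votes[j]) >= THRESHOLD:
            graph.add_edge(i, j)

# Islands of highly correlated users (size > 1) are candidates for review.
suspicious = [c for c in nx.connected_components(graph) if len(c) > 1]
print(suspicious)   # e.g. [{0, 1, 2}]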

There’s a Graph for That

[Graph of user vote correlations: islands of highly correlated users (colloquially: cheaters!) stand out against lots of mean-response voting (aka the front page)]

Well what do we have here...

Let the humans do the hard part

Enter “subreddits”

● By 2007, Reddit was still on a 4-month doubling curve!
○ "Eternal September" on steroids

○ “Everything was better 6 months ago” every 6 months.

● Increasingly broad audience: not just programming and science anymore
○ Put our efforts into scaling up human-centered "categorization"

○ Branded these areas as “subreddits”

○ At its core, scaling by parallelization (each subreddit is a freestanding reddit!)

● Front page becomes combination of content across subreddits

● Users (later "mods") are in charge of keeping subreddits on topic
○ Subreddits morph into "communities" with their own mores...

Reddit in 2009!

The Modern Era

We’ve grown up a little! They like us. They really like us!

Gaming

News & Politics

Communities

Q&A

Learning

Sports

Fitness & Health

TV & Movies

Images & GIFs

And the number of communities has grown!

...but so has the technology!

40k events/sec

21 GB per month

~13k active subreddits

50M accounts

We <3 Kafka

[Event pipeline diagram: Apps → Event Collectors → Kafka → Midas (Enrich) / Minsky (ML) / Spamurai (Rules)]

“Let the humans do the hard part”

...but ML has gotten easier!

subreddit recommendations

• Similarity of subreddits defined by the Jaccard distance with respect to the set of subscribers (see the sketch after this list)

• Problem 1: we have a lot of subscriptions

• Problem 2: we used to have these “default” subreddits

– Everyone subscribed to them

– Not so much “mean behavior” as “very high noise floor”

– Humans did the hard part of cleaning this up.
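
A minimal sketch of the subscriber-set approach, using Jaccard similarity (the distance is just one minus this); the subreddit names and subscriber IDs below are made up:

# Jaccard similarity over subscriber sets; toy data for illustration only.
subscribers = {
    "Fitness":      {"u1", "u2", "u3", "u4"},
    "bodybuilding": {"u2", "u3", "u4", "u5"},
    "ethtrader":    {"u6", "u7"},
}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| -- 1.0 means identical subscriber sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(name, k=5):
    """Rank other subreddits by Jaccard similarity to `name`."""
    scores = [(other, jaccard(subscribers[name], subs))
              for other, subs in subscribers.items() if other != name]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]

print(most_similar("Fitness"))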

A second approach: doc2vec

• Take the corpus of all comments on a given subreddit

• Smash all the comments into one document

• Use the subreddit as the label

• Apply:

– vanilla doc2vec (via gensim)

– k-means clustering (via sklearn)
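
A minimal sketch of those two steps with gensim 4.x and scikit-learn; the toy corpus, vector size, epoch count, and cluster count are all illustrative stand-ins for the real comment data:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from sklearn.cluster import KMeans

# One "document" per subreddit: all of its comments smashed together,
# tagged with the subreddit name. The corpus here is a toy stand-in.
corpus = {
    "gaming":  "played the new release on my PS4 last night ...",
    "Fitness": "hit a new squat PR this week, form check please ...",
    "Bitcoin": "the price action on the exchanges today was wild ...",
}
docs = [TaggedDocument(words=simple_preprocess(text), tags=[name])
        for name, text in corpus.items()]

# Vanilla doc2vec via gensim (hyperparameters are illustrative).
model = Doc2Vec(documents=docs, vector_size=100, epochs=20, min_count=1)

# k-means clustering over the learned subreddit vectors via sklearn.
names = list(corpus)
vectors = [model.dv[name] for name in names]   # gensim 4.x; .docvecs in 3.x
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for name, label in zip(names, kmeans.labels_):
    print(label, name)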

We get beautiful results

"gaming": [

"emulation",

"Games",

"3DS",

"PS4",

"Steam",

"pcgaming",

"PS3",

"nintendo",

"xboxone",

"Vive",

...

]

"health": [

"Fitness",

"bodybuilding",

"weightlifting",

"powerlifting",

"xxfitness",

"swoleacceptance",

"ketogains",

"progresspics",

"bodyweightfitness",

"Rowing",

...

]

"cryptocurrency": [

"ethtrader",

"btc",

"Bitcoin",

"Lisk",

"EthereumClassic",

"Monero",

"Buttcoin",

"ethereum",

"CryptoCurrency",

"dogecoin",

...

]

"BigCities": [

"Denver",

"Austin",

"Winnipeg",

"sanfrancisco",

"Atlanta",

"nashville",

"LosAngeles",

"Cleveland",

"chicago",

"Calgary",

...

]

And we can measure a lift!

Where we’re headed!

● We have made steps to try Deep Learning (it is, after all, the done thing)

○ Models for Harassment via PMs

○ Looking to apply more broadly

● Logistic regressions for the home feed (see the sketch after this list)

● Increase number of signals:

○ Geography

○ Platform

○ Time of Day

● Do it Live!
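
For the home-feed bullet above, a minimal scikit-learn sketch of a logistic regression over signals like those listed (geography, platform, time of day); the feature names, toy impressions, and labels are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data: each row is a (user, post) impression with the extra
# signals named on the slide; `y` is a made-up engagement label.
X = pd.DataFrame({
    "geography":   ["US", "US", "DE", "CA", "US", "DE"],
    "platform":    ["ios", "web", "android", "web", "ios", "web"],
    "hour_of_day": [8, 23, 14, 9, 20, 2],
})
y = np.array([1, 0, 1, 0, 1, 0])

model = Pipeline([
    ("features", ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"),
         ["geography", "platform"]),
    ], remainder="passthrough")),   # keep hour_of_day as a numeric feature
    ("clf", LogisticRegression()),
])
model.fit(X, y)

# Score a fresh impression; the probability can be used to rank the feed.
candidate = pd.DataFrame({"geography": ["US"], "platform": ["ios"], "hour_of_day": [19]})
print(model.predict_proba(candidate)[0, 1])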

All Possible by our Relevance Team

● Luis Bitencourt-Emilio - Director

● Dr. Eric Xu - Engineering Manager

● Dr. Jenny Jin - Data Science

● Katie Bauer - Data Science

● Artem Yankov

● Keith Blaha

● Niraj Sheth

● Dan Ellis

● William Seltzer

Thanks! Chris Slowe

chris@reddit.com

u/KeyserSosa

@KeyserSosa

PS: We’re hiring!

http://reddit.com/jobs
