The History (and Future!) of
Machine Learning at Reddit Chris Slowe, CTO, Reddit
A Search for Intelligence on
the Internet Chris Slowe, CTO, Reddit
A Little About Me
● PhD in Experimental Physics
○ Emphasis on the “science is messy” part
● YCombinator Summer ‘05 Alum
● Founding Engineer at Reddit (starting ‘05)
● Chief Scientist at Hipmunk ‘10-’15
● Back at Reddit ‘16
● CTO ‘17
What is Reddit?
Reddit is the frontpage of the internet
A social network where there are tens of thousands of communities
around whatever passions or interests you might have
It’s where people converse about the things that
are most important to them
Reddit by the numbers
● Alexa Rank (US/World): 4th / 7th
● MAU: 320M
● Communities: 1.1M
● Posts per day: 1M
● Comments per day: 5M
● Votes per day: 75M
● Searches per day: 70M
SCALE
ENDLESS CONTENT
So, what are we doing with all that power?
Cat Walking a Human / Cat Fist Bumping
Wait, it’s not just cat pictures!
Community > Content > Individual
● Authenticity
● Creative freedom
● Empathy @ scale
r/confession
Secrets that if revealed
would change your life
forever?
r/assistance
Empathy and support at
scale
With so much content, how do we get users to what they’ll like most?
The History
Some Reddit Historical Context
● Launched in 2005 (cf. Facebook 2004, Twitter 2006)
● Originally built as a better version of del.icio.us popular
● Voting has always been a core function, but comments didn’t arrive until
about 6 months in
● Very early on, we wanted to break into personalization and recommendations
● These were the comparative ML dark ages: the Netflix Prize didn’t kick off until late 2006
● Communities/subreddits didn’t come until 2007-8
Reddit in 2006!
Reddit in 2006! (look closely!)
Initial Attempts: Simple Collaborative Filter!
● A nice feature of Reddit: we have upvotes and we also have downvotes!
● Start simple (aka “complicated for 2005”) -- a multidimensional vector space
where each dimension corresponds to a post
● Represent users by a vector of their votes on each item
○ +1 = upvote
○ -1 = downvote
● Measure user distance by cosine similarity
● Recenter by subtracting the global mean
● Apply singular value decomposition (SVD)
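The steps above can be sketched in a few lines of numpy. The 3-user × 4-post vote matrix and the rank choice are purely illustrative, not Reddit data:

```python
import numpy as np

# Toy user x post vote matrix: +1 = upvote, -1 = downvote, 0 = no vote.
votes = np.array([
    [ 1,  1, -1,  0],   # user A
    [ 1,  1, -1,  1],   # user B (votes a lot like A)
    [-1, -1,  1,  0],   # user C (opposite tastes)
], dtype=float)

def cosine_similarity(u, v):
    """Cosine of the angle between two users' vote vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sim_ab = cosine_similarity(votes[0], votes[1])   # high: similar tastes
sim_ac = cosine_similarity(votes[0], votes[2])   # -1: opposite tastes

# Recenter by subtracting the global mean of observed votes,
# then use a low-rank SVD reconstruction to "fill in" unvoted
# cells with predicted scores based on correlated users.
centered = votes - votes[votes != 0].mean()
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 1  # keep only the strongest latent dimension
predicted = (U[:, :k] * s[:k]) @ Vt[:k, :]
```

The zero entries of `predicted` are exactly the "blanks" the recommender wants to fill: user A's predicted score for post 4 is borrowed from look-alike user B.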
Initial Problems: Data Sparseness
● No categories/communities then (just one pile of content)
● The median number of votes per user was generally very close to 1
● Solution: focus on the 10% core and assign the rest a “buddy”
● Still swamped by the “mean response” behavior of the site
○ A lazy version of recommendations: show what’s on the front page!
○ Lots of people like it by definition!
Initial Problems 2: No feedback loop!
● Well, other than the “users complain” one
● This was long before we had a first-class notion of A/B tests (let alone before
they became common practice on the web!)
● Also, no notion of relevance here outside of the votes
○ No capacity to detect “Politics” as there’s no semantic consideration
“When Life gives you lemons,
make lemon hand grenades” -- Cave Johnson, Portal 2
Applicability to Anti-Cheating!
● Recommendations are all about “filling in the blanks”
○ The data is sparse because those are the values you don’t know but want to
○ The intention is to correlate users to fill in these gaps
● A slightly different problem: vote cheating!
○ Create a bunch of accounts
○ Have them all vote for your content
○ ???
○ Profit!
● In the above example, these accounts will also have anomalously good
recommendations!
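One way to see why the same machinery applies: accounts in a vote ring end up with near-identical vote histories, so they stand out as anomalously correlated. A toy sketch, with hypothetical accounts and an illustrative overlap measure and threshold (not Reddit’s actual anti-cheating system):

```python
from itertools import combinations

# Hypothetical vote logs: account -> set of (post, direction) votes.
vote_log = {
    "alice":  {("p1", +1), ("p2", -1), ("p3", +1)},
    "bob":    {("p4", +1), ("p5", +1)},
    "sock_1": {("p9", +1), ("p10", +1), ("p11", +1)},
    "sock_2": {("p9", +1), ("p10", +1), ("p11", +1)},
    "sock_3": {("p9", +1), ("p10", +1), ("p11", +1)},
}

def overlap(a, b):
    """Fraction of the smaller account's votes shared with the other."""
    return len(a & b) / min(len(a), len(b))

# Flag account pairs whose voting is suspiciously correlated;
# clusters of flagged pairs form the "islands" in the vote graph.
THRESHOLD = 0.9
suspicious = [
    (u, v) for (u, a), (v, b) in combinations(vote_log.items(), 2)
    if overlap(a, b) >= THRESHOLD
]
```

Here the three sockpuppets pairwise exceed the threshold, while organic voters like `alice` and `bob` do not.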
There’s a Graph for That
[Vote-correlation graph: islands of highly correlated users (colloquially: cheaters!) stand out against lots of mean-response voting (aka voting on the front page)]
Well what do we have here...
Let the humans do the hard part
Enter “subreddits”
● By 2007, Reddit was still on a 4-month doubling curve!
○ “Eternal September” on steroids
○ “Everything was better 6 months ago,” every 6 months
● Increasingly broad audience: not just programming and science anymore
○ Put our efforts into scaling up human-centered “categorization”
○ Branded these areas as “subreddits”
○ At its core, scaling by parallelization (each subreddit is a freestanding Reddit!)
● The front page becomes a combination of content across subreddits
● Users (later “mods”) are in charge of keeping subreddits on topic
○ Subreddits morph into “communities” with their own mores...
Reddit in 2009!
The Modern Era
We’ve grown up a little! They like us. They really like us!
Gaming
News & Politics
Communities
Q&A
Learning
Sports
Fitness & Health
TV & Movies
Images & GIFs
And the number of communities has grown!
...but so has the technology!
40k events/sec
21 GB per month
~13k active subreddits
50M accounts
We <3 Kafka
[Event pipeline diagram: Apps → Event Collectors → Kafka → Midas (Enrich) → Minsky (ML) / Spamurai (Rules)]
“Let the humans do the hard part”
...but ML has gotten easier!
subreddit recommendations
• Similarity of subreddits defined by the Jaccard distance with respect to the
set of subscribers.
• Problem 1: we have a lot of subscriptions
• Problem 2: we used to have “default” subreddits
– Everyone was subscribed to them
– Not so much “mean behavior” as a very high noise floor
– Humans did the hard part of cleaning this up
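The Jaccard measure above takes only a few lines; the subscriber sets here are hypothetical stand-ins:

```python
# Hypothetical subscriber sets per subreddit.
subscribers = {
    "Fitness":      {"u1", "u2", "u3", "u4"},
    "bodybuilding": {"u2", "u3", "u4", "u5"},
    "pcgaming":     {"u6", "u7"},
}

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint sets."""
    return 1.0 - len(a & b) / len(a | b)

# Subreddits with heavily overlapping subscriber bases are "close".
d_fit_bb = jaccard_distance(subscribers["Fitness"], subscribers["bodybuilding"])
d_fit_pc = jaccard_distance(subscribers["Fitness"], subscribers["pcgaming"])
```

With a lot of subscriptions (Problem 1), the pairwise set intersections get expensive, which is part of why the doc2vec approach on the next slide is attractive.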
A second approach: doc2vec
• Take the corpus of all comments on a given subreddit
• Smash all the comments into one document
• Use the subreddit as the label
• Apply:
– vanilla doc2vec (via gensim)
– k-means clustering (via sklearn)
We get beautiful results
"gaming": [
"emulation",
"Games",
"3DS",
"PS4",
"Steam",
"pcgaming",
"PS3",
"nintendo",
"xboxone",
"Vive",
...
]
"health": [
"Fitness",
"bodybuilding",
"weightlifting",
"powerlifting",
"xxfitness",
"swoleacceptance",
"ketogains",
"progresspics",
"bodyweightfitness",
"Rowing",
...
]
"cryptocurrency": [
"ethtrader",
"btc",
"Bitcoin",
"Lisk",
"EthereumClassic",
"Monero",
"Buttcoin",
"ethereum",
"CryptoCurrency",
"dogecoin",
...
]
"BigCities": [
"Denver",
"Austin",
"Winnipeg",
"sanfrancisco",
"Atlanta",
"nashville",
"LosAngeles",
"Cleveland",
"chicago",
"Calgary",
...
]
And we can measure a lift!
Where we’re headed!
● We have taken steps toward Deep Learning (it is,
after all, the done thing)
○ Models for detecting harassment via PMs
○ Looking to apply it more broadly
● Logistic regressions for home feed
● Increase number of signals:
○ Geography
○ Platform
○ Time of Day
● Do it Live!
All Made Possible by our Relevance Team
● Luis Bitencourt-Emilio - Director
● Dr. Eric Xu - Engineering Manager
● Dr. Jenny Jin - Data Science
● Katie Bauer - Data Science
● Artem Yankov
● Keith Blaha
● Niraj Sheth
● Dan Ellis
● William Seltzer
Thanks! Chris Slowe
u/KeyserSosa
@KeyserSosa
PS: We’re hiring!
http://reddit.com/jobs