DESCRIPTION
The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation, not even for widely used measures such as precision and root-mean-square error. This creates a setting where the comparison of recommendation results using the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use case in mind, i.e. the recommendation of one or several items to an end user. The protocol attempts to closely mimic the scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small and large scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed model can better capture the quality of a recommender system than traditional evaluation does, and is not affected by characteristics of the data (e.g. size, sparsity, etc.).
A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
Alan Said, Alejandro Bellogín, Arjen De Vries CWI
@alansaid, @abellogin, @arjenpdevries
Outline
• Evaluation
  – Real world
  – Offline
• Protocol
• Experiments & Results
• Conclusions
EVALUATION
• Not an algorithmic comparison!
• A comparison of evaluation approaches
Evaluation
• Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
  – Even if it does:
    • Is the underlying data the same?
    • Was cross-validation performed similarly?
    • etc.
Evaluation
• What metrics should we use?
• How should we evaluate?
  – Relevance criteria for test items
  – Cross-validation (n-fold, random)
• Should all users and items be treated the same way?
  – Do certain users and items reflect different evaluation qualities?
Offline Evaluation
• Recommender system accuracy evaluation is currently based on methods from IR/ML:
  – One training set
  – One test set
  – (One validation set)
  – Algorithms are trained on the training set
  – Evaluate using metric@N (e.g. p@N, where N is a page size)
    • Even when N is larger than the number of test items
    • p@N = 1.0 is then (almost) impossible
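To make this concrete (an illustrative example, not from the slides): if a user has only 3 relevant items in the test set, even a perfect ranking cannot push p@10 above 0.3.

```python
def precision_at_n(recommended, relevant, n):
    """p@N: fraction of the top-N recommended items that are relevant."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n

# A user with only 3 relevant test items: even a perfect ranking that
# places all 3 at the top caps p@10 at 3/10 = 0.3.
relevant = {"a", "b", "c"}
perfect_ranking = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
print(precision_at_n(perfect_ranking, relevant, 10))  # 0.3
```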
Evaluation in production
• One dynamic training set
  – All of the available data at a certain point in time
  – Continuously updated
• No test set
  – Only live user interactions
    • Clicked/purchased items are good recommendations

Can we simulate this offline?
Evaluation Protocol
• Based on "real world" concepts
• Uses as much available data as possible
• Trains algorithms once per user and evaluation setting (e.g. N)
• Evaluates p@N when there are exactly N correct items in the test set
  – p@N = 1 is possible (gold standard)
Evaluation Protocol
Three concepts:
1. Personalized training & test sets
   – Use all available information about the system for the candidate user
   – Different test/training sets for different levels of N
2. Candidate item selection (items in test sets)
   – Only "good" items go into test sets (no random 80%-20% splits)
   – How "good" an item is depends on each user's personal preference
3. Candidate user selection (users in test sets)
   – Candidate users must have items in the training set
   – When evaluating p@N, each user in the test set should have N items in the test set
     • Effectively, precision becomes R-precision

Train each algorithm once for each user in the test set and once for each N, as sketched below.
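A minimal sketch of the per-user, per-N loop this implies. The `algorithm.train`/`model.recommend` interface and the `relevance_threshold` callable are illustrative assumptions, not the authors' implementation:

```python
def evaluate_protocol(ratings, algorithm, n_values, relevance_threshold):
    """Per-user, per-N evaluation: personalized train/test sets, candidate
    item/user selection, and p@N with exactly N relevant test items per
    user (i.e. R-precision). `ratings` maps user -> {item: rating}."""
    results = {n: [] for n in n_values}
    for n in n_values:
        for user, user_ratings in ratings.items():
            threshold = relevance_threshold(user_ratings)
            # Candidate item selection: only items rated above the user's
            # personal relevance threshold may enter the test set.
            relevant = [item for item, r in user_ratings.items() if r >= threshold]
            # Candidate user selection: the user must contribute exactly N
            # relevant test items and still have items left for training.
            if len(relevant) < n or len(user_ratings) <= n:
                continue
            test_items = set(relevant[:n])
            # Personalized training set: all data in the system except this
            # user's held-out test items.
            train = {u: {i: r for i, r in items.items()
                         if u != user or i not in test_items}
                     for u, items in ratings.items()}
            model = algorithm.train(train)          # one training run per (user, N)
            top_n = model.recommend(user, n)
            hits = len(set(top_n) & test_items)
            results[n].append(hits / n)             # precision equals R-precision here
    return {n: sum(scores) / len(scores) for n, scores in results.items() if scores}
```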
EXPERIMENTS
Experiments
• Datasets:
  – Movielens 100k
    • Minimum 20 ratings per user
    • 943 users
    • 6.43% density
    • Not realistic
  – Movielens 1M sample
    • 100k ratings
    • 1,000 users
    • 3.0% density
• Algorithms:
  – SVD
  – User-based CF (kNN)
  – Item-based CF
[Figure: rating distributions of the two datasets; number of users (y-axis) vs. number of ratings (x-axis), log scale, one panel per dataset]
Experimental Settings
According to the proposed protocol:
• Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
• Users evaluated at N must have at least N items rated above the relevance threshold (RT)
• RT depends on the user's mean rating and standard deviation
• Number of runs: |N| * |users|
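The slides do not give the exact formula for RT; a minimal sketch, assuming RT is the user's mean rating plus one standard deviation (the exact combination used in the paper may differ). A function like this could serve as the `relevance_threshold` argument of the evaluation loop sketched earlier:

```python
from statistics import mean, stdev

def relevance_threshold(user_ratings):
    """Per-user relevance threshold derived from the user's own rating behaviour.
    Assumption for illustration: mean rating plus one standard deviation; the
    slides only state that RT depends on the mean and standard deviation."""
    values = list(user_ratings.values())
    if len(values) < 2:
        return values[0] if values else 0.0
    return mean(values) + stdev(values)

# Example: a generous rater needs a higher rating for an item to count as relevant.
print(relevance_threshold({"i1": 4, "i2": 5, "i3": 4, "i4": 5}))  # ~5.08
```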
Baseline
• Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
• 80%-20% training-test split
  – Items in the test set rated at least 3
• Number of runs: 1
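For contrast, a minimal sketch of the traditional baseline setup (a single random 80/20 split with a fixed relevance threshold of 3; helper and parameter names are illustrative):

```python
import random

def baseline_split(ratings, test_fraction=0.2, min_rating=3, seed=42):
    """Single random 80/20 split over all (user, item, rating) triples.
    Only test items rated at least `min_rating` count as relevant."""
    rng = random.Random(seed)
    triples = [(u, i, r) for u, items in ratings.items() for i, r in items.items()]
    rng.shuffle(triples)
    cut = int(len(triples) * (1 - test_fraction))
    train = triples[:cut]
    test = [(u, i, r) for u, i, r in triples[cut:] if r >= min_rating]
    return train, test
```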
Results
[Figure: user-based CF, ML1M sample]

Results
[Figure panels: user-based CF (ML1M sample), user-based CF (ML100k), SVD (ML1M sample)]
Results
• What about time?
  – |N| * |users| vs. 1?
  – Trade-off between a realistic evaluation and complexity?
Conclusions
• We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
• We can see how well a recommender performs at different levels of recall (page size)
• We can compare against a gold standard
• We can reduce evaluation time
Questions?
• Thanks!
• Also, check out:
  – ACM TIST Special Issue on RecSys Benchmarking – bit.ly/RecSysBe
  – The ACM RecSys Wiki – www.recsyswiki.com