DESCRIPTION
The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation, not even for widely used measures such as precision and root-mean-square error. This creates a setting where the comparison of recommendation results using the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use case in mind, i.e. the recommendation of one or several items to an end user. The protocol attempts to closely mimic the scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small and large scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed model can better capture the quality of a recommender system than traditional evaluation does, and is not affected by characteristics of the data (e.g. size, sparsity, etc.).
A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
Alan Said, Alejandro Bellogín, Arjen De Vries CWI
@alansaid, @abellogin, @arjenpdevries
Outline
• Evaluation
  – Real world
  – Offline
• Protocol
• Experiments & Results
• Conclusions
EVALUATION
• Not an algorithmic comparison!
• A comparison of evaluation approaches
Evaluation
• Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
  – Even if it does:
    • Is the underlying data the same?
    • Was cross-validation performed similarly?
    • etc.
Evaluation
• What metrics should we use?
• How should we evaluate?
  – Relevance criteria for test items
  – Cross-validation (n-fold, random)
• Should all users and items be treated the same way?
  – Do certain users and items reflect different evaluation qualities?
Offline Evaluation
• Recommender system accuracy evaluation is currently based on methods from IR/ML:
  – One training set
  – One test set
  – (One validation set)
  – Algorithms are trained on the training set
  – Evaluate using metric@N (e.g. p@N, where N is a page size)
    • Even when N is larger than the number of test items
    • p@N = 1.0 is then (almost) impossible
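To make this concrete (an illustrative example, not from the slides): if a user has only 3 relevant items in the test set, even a perfect ranking cannot push p@10 above 0.3.

```python
def precision_at_n(recommended, relevant, n):
    """p@N: fraction of the top-N recommended items that are relevant."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n

# A user with only 3 relevant test items: even a perfect ranking that
# places all 3 at the top caps p@10 at 3/10 = 0.3.
relevant = {"a", "b", "c"}
perfect_ranking = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
print(precision_at_n(perfect_ranking, relevant, 10))  # 0.3
```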
Evaluation in production
• One dynamic training set
  – All of the available data at a certain point in time
  – Continuously updated
• No test set
  – Only live user interactions
    • Clicked/purchased items are good recommendations

Can we simulate this offline?
Evaluation Protocol
• Based on "real world" concepts
• Uses as much available data as possible
• Trains algorithms once per user and evaluation setting (e.g. N)
• Evaluates p@N when there are exactly N correct items in the test set
  – p@N = 1 is possible (gold standard)
Evaluation Protocol
Three concepts:
1. Personalized training & test sets
   – Use all available information about the system for the candidate user
   – Different test/training sets for different levels of N
2. Candidate item selection (items in test sets)
   – Only "good" items go into test sets (no random 80%-20% splits)
   – How "good" an item is depends on each user's personal preference
3. Candidate user selection (users in test sets)
   – Candidate users must have items in the training set
   – When evaluating p@N, each user in the test set should have N items in the test set
     • Effectively, precision becomes R-precision

Train each algorithm once for each user in the test set and once for each N, as sketched below.
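A minimal sketch of the per-user, per-N loop this implies. The `algorithm.train`/`model.recommend` interface and the `relevance_threshold` callable are illustrative assumptions, not the authors' implementation:

```python
def evaluate_protocol(ratings, algorithm, n_values, relevance_threshold):
    """Per-user, per-N evaluation: personalized train/test sets, candidate
    item/user selection, and p@N with exactly N relevant test items per
    user (i.e. R-precision). `ratings` maps user -> {item: rating}."""
    results = {n: [] for n in n_values}
    for n in n_values:
        for user, user_ratings in ratings.items():
            threshold = relevance_threshold(user_ratings)
            # Candidate item selection: only items rated above the user's
            # personal relevance threshold may enter the test set.
            relevant = [item for item, r in user_ratings.items() if r >= threshold]
            # Candidate user selection: the user must contribute exactly N
            # relevant test items and still have items left for training.
            if len(relevant) < n or len(user_ratings) <= n:
                continue
            test_items = set(relevant[:n])
            # Personalized training set: all data in the system except this
            # user's held-out test items.
            train = {u: {i: r for i, r in items.items()
                         if u != user or i not in test_items}
                     for u, items in ratings.items()}
            model = algorithm.train(train)          # one training run per (user, N)
            top_n = model.recommend(user, n)
            hits = len(set(top_n) & test_items)
            results[n].append(hits / n)             # precision equals R-precision here
    return {n: sum(scores) / len(scores) for n, scores in results.items() if scores}
```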
EXPERIMENTS
Experiments
• Datasets:
  – Movielens 100k
    • Minimum 20 ratings per user
    • 943 users
    • 6.43% density
    • Not realistic
  – Movielens 1M sample
    • 100k ratings
    • 1,000 users
    • 3.0% density
• Algorithms:
  – SVD
  – User-based CF (kNN)
  – Item-based CF
[Figure: rating distributions of the two datasets; number of users (y-axis) vs. number of ratings (x-axis), log scale, one panel per dataset]
Experimental Settings
According to the proposed protocol:
• Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
• Users evaluated at N must have at least N items rated above the relevance threshold (RT)
• RT depends on the user's mean rating and standard deviation
• Number of runs: |N| * |users|
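The slides do not give the exact formula for RT; a minimal sketch, assuming RT is the user's mean rating plus one standard deviation (the exact combination used in the paper may differ). A function like this could serve as the `relevance_threshold` argument of the evaluation loop sketched earlier:

```python
from statistics import mean, stdev

def relevance_threshold(user_ratings):
    """Per-user relevance threshold derived from the user's own rating behaviour.
    Assumption for illustration: mean rating plus one standard deviation; the
    slides only state that RT depends on the mean and standard deviation."""
    values = list(user_ratings.values())
    if len(values) < 2:
        return values[0] if values else 0.0
    return mean(values) + stdev(values)

# Example: a generous rater needs a higher rating for an item to count as relevant.
print(relevance_threshold({"i1": 4, "i2": 5, "i3": 4, "i4": 5}))  # ~5.08
```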
Baseline
• Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
• 80%-20% training-test split
  – Items in the test set rated at least 3
• Number of runs: 1
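For contrast, a minimal sketch of the traditional baseline setup (a single random 80/20 split with a fixed relevance threshold of 3; helper and parameter names are illustrative):

```python
import random

def baseline_split(ratings, test_fraction=0.2, min_rating=3, seed=42):
    """Single random 80/20 split over all (user, item, rating) triples.
    Only test items rated at least `min_rating` count as relevant."""
    rng = random.Random(seed)
    triples = [(u, i, r) for u, items in ratings.items() for i, r in items.items()]
    rng.shuffle(triples)
    cut = int(len(triples) * (1 - test_fraction))
    train = triples[:cut]
    test = [(u, i, r) for u, i, r in triples[cut:] if r >= min_rating]
    return train, test
```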
Results
[Figure: user-based CF, ML1M sample]

Results
[Figure panels: user-based CF (ML1M sample), user-based CF (ML100k), SVD (ML1M sample)]
Results
• What about time?
  – |N| * |users| vs. 1?
  – Trade-off between a realistic evaluation and complexity?
Conclusions
• We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
• We can see how well a recommender performs at different levels of recall (page size)
• We can compare against a gold standard
• We can reduce evaluation time
Questions?
• Thanks!
• Also, check out:
  – ACM TIST Special Issue on RecSys Benchmarking – bit.ly/RecSysBe
  – The ACM RecSys Wiki – www.recsyswiki.com