
A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems


DESCRIPTION

The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation, not even for widely used measures such as precision and root-mean-squared error. This creates a setting where comparing recommendation results on the same datasets becomes problematic. In this paper, we propose an evaluation protocol developed specifically with the recommendation use case in mind, i.e., the recommendation of one or several items to an end user. The protocol attempts to closely mimic the scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small- and large-scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed model can better capture the quality of a recommender system than traditional evaluation does, and is not affected by characteristics of the data (e.g., size, sparsity, etc.).




Alan Said, Alejandro Bellogín, Arjen De Vries (CWI)

@alansaid, @abellogin, @arjenpdevries


Outline

• Evaluation
  – Real world
  – Offline
• Protocol
• Experiments & Results
• Conclusions

Note: this is a comparison of evaluation approaches, not a comparison of algorithms.


EVALUATION  



Evaluation

• Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
  – Even if it does:
    • is the underlying data the same?
    • was cross-validation performed similarly?
    • etc.



Evaluation

• What metrics should we use?
• How should we evaluate?
  – Relevance criteria for test items
  – Cross-validation (n-fold, random)
• Should all users and items be treated the same way?
  – Do certain users and items reflect different evaluation qualities?



Offline Evaluation

Recommender system accuracy evaluation is currently based on methods from IR/ML:
– One training set
– One test set
– (One validation set)
– Algorithms are trained on the training set
– Evaluation uses metric@N (e.g., p@N, where N is a page size)
  • even when N is larger than the number of test items
  • so p@N = 1.0 is (almost) impossible (see the sketch below)
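To make this ceiling concrete, here is a minimal sketch of the metric@N computation described above; the function name and toy data are ours, not from the paper.

```python
# Minimal sketch of p@N over a fixed train/test split
# (names and toy data are illustrative, not from the paper).

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommended items that are relevant test items."""
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n

# A user with only 3 relevant items in the test set can never reach
# p@10 = 1.0: the ceiling is 3/10, set by the split, not by the recommender.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
print(precision_at_n(ranked, {"a", "b", "c"}, 10))  # 0.3 at best
```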



Evaluation in Production

• One dynamic training set
  – All of the available data at a certain point in time
  – Continuously updated
• No test set
  – Only live user interactions
    • Clicked/purchased items are good recommendations

Can we simulate this offline?



Evaluation Protocol

• Based on "real world" concepts
• Uses as much available data as possible
• Trains algorithms once per user and evaluation setting (e.g., N)
• Evaluates p@N when there are exactly N correct items in the test set
  – p@N = 1 becomes possible (a gold standard)



Evaluation Protocol

Three concepts:
1. Personalized training & test sets
   – Use all available information about the system for the candidate user
   – Different training/test sets for different levels of N
2. Candidate item selection (items in test sets)
   – Only "good" items go into test sets (no random 80%-20% splits)
   – How "good" an item is depends on each user's personal preference
3. Candidate user selection (users in test sets)
   – Candidate users must have items in the training set
   – When evaluating p@N, each user in the test set should have N items in the test set
     • Effectively, precision becomes R-precision

Train each algorithm once for each user in the test set and once for each N, as sketched below.
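A minimal sketch of the three concepts, under our own assumptions: the relevance threshold below (user mean plus one standard deviation) is our guess at an RT that "depends on the user's mean rating and standard deviation", and all helper names are illustrative, not the paper's.

```python
# Sketch of the protocol's per-user, per-N split. RT_u = mean + stdev is
# an assumption (the slides only say RT depends on the user's mean rating
# and standard deviation); names are illustrative.
import random
import statistics

def relevance_threshold(user_ratings):
    values = list(user_ratings.values())
    return statistics.mean(values) + statistics.pstdev(values)

def personalized_split(ratings, user, n, seed=0):
    """Concepts 1-3: exactly n 'good' items of `user` form the test set;
    all remaining data (other users' ratings included) is training data."""
    rt = relevance_threshold(ratings[user])
    good = [i for i, r in ratings[user].items() if r > rt]   # concept 2
    if len(good) < n:                  # concept 3: skip this user at this N
        return None
    test = set(random.Random(seed).sample(good, n))
    train = {u: {i: r for i, r in items.items()
                 if u != user or i not in test}
             for u, items in ratings.items()}                # concept 1
    return train, test

# With exactly n relevant items in the test set, p@n equals R-precision,
# and a perfect recommender can reach p@n = 1 (the gold standard).
```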



Evaluation Protocol



EXPERIMENTS  



Experiments

• Datasets:
  – MovieLens 100k
    • Minimum 20 ratings per user
    • 943 users
    • 6.43% density
    • Not realistic
  – MovieLens 1M sample
    • 100k ratings
    • 1000 users
    • 3.0% density
• Algorithms:
  – SVD
  – User-based CF (kNN)
  – Item-based CF

[Figures: rating distributions for both datasets; number of users (y axis, log scale, 1-100) vs. number of ratings (x axis, log scale, 10-1000)]


Experimental Settings

According to the proposed protocol:
• Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
• Users evaluated at N must have at least N items rated above the relevance threshold (RT)
• RT depends on the user's mean rating and standard deviation
• Number of runs: |N| * |users| (see the loop sketch below)

Baseline:
• Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
• 80%-20% training-test split
  – Items in the test set must be rated at least 3
• Number of runs: 1
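To illustrate the |N| * |users| run count, here is a sketch of the protocol's evaluation loop, reusing `personalized_split` from the earlier sketch; `train_and_recommend` is an assumed stand-in for any of the three algorithms, not an API from the paper.

```python
# Sketch of the proposed protocol's evaluation loop: one training run per
# (N, candidate user) pair, i.e. up to |N| * |users| runs, versus a single
# run for the 80%-20% baseline. `train_and_recommend(train, user, n)` is an
# assumed callable returning a ranked top-n item list.
N_VALUES = [1, 5, 10, 20, 50, 100]

def evaluate_protocol(ratings, train_and_recommend):
    scores = {n: [] for n in N_VALUES}
    for n in N_VALUES:
        for user in ratings:                        # candidate user selection
            split = personalized_split(ratings, user, n)
            if split is None:                       # fewer than n relevant items
                continue
            train, test = split
            top_n = train_and_recommend(train, user, n)
            scores[n].append(len(set(top_n) & test) / n)   # R-precision
    return {n: sum(s) / len(s) for n, s in scores.items() if s}
```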

 


Results

[Figure: User-based CF, ML1M sample]


Results

[Figure panels: User-based CF (ML1M sample), User-based CF (ML100k), SVD (ML1M sample)]


Results

What about time?
– |N| * |users| runs vs. 1 (e.g., 6 * 1000 = 6000 training runs on the ML1M sample instead of one)
– Trade-off between a realistic evaluation and complexity?



Conclusions

• We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
• We can see how well a recommender performs at different levels of recall (page size)
• We can compare against a gold standard
• We can reduce evaluation time



Questions?

• Thanks!
• Also, check out:
  – ACM TIST Special Issue on RecSys Benchmarking: bit.ly/RecSysBe
  – The ACM RecSys Wiki: www.recsyswiki.com
