Transcript
Page 1: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Alan Said, Alejandro Bellogín, Arjen de Vries (CWI)

@alansaid, @abellogin, @arjenpdevries

LSRS'13, 2013-10-13

Page 2: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Outline
• Evaluation
  – Real world
  – Offline
• Protocol
• Experiments & Results
• Conclusions

• Not an algorithmic comparison!
• A comparison of evaluation approaches

Page 3: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

EVALUATION  


Page 4: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Evaluation
• Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
  – Even if it does:
    • Is the underlying data the same?
    • Was cross-validation performed similarly?
    • etc.

Page 5: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Evaluation
• What metrics should we use?
• How should we evaluate?
  – Relevance criteria for test items
  – Cross-validation (n-fold, random)
• Should all users and items be treated the same way?
  – Do certain users and items reflect different evaluation qualities?

Page 6: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Offline Evaluation
Recommender system accuracy evaluation is currently based on methods from IR/ML:
– One training set
– One test set
– (One validation set)
– Algorithms are trained on the training set
– Evaluate using metric@N (e.g. p@N, where N is a page size)
  • Even when N is larger than the number of test items
  • p@N = 1.0 is (almost) impossible (see the sketch below)
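A minimal sketch (not from the slides; the item IDs and toy data are made up) of why p@N cannot reach 1.0 when a user has fewer than N relevant test items:

```python
# Hypothetical toy example: precision@N against a test set with only 3 relevant items.

def precision_at_n(ranked_items, relevant_items, n):
    """Fraction of the top-n recommended items that are in the relevant set."""
    hits = sum(1 for item in ranked_items[:n] if item in relevant_items)
    return hits / n

ranked = ["i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9", "i10"]
relevant = {"i1", "i2", "i3"}                # the user only has 3 relevant test items
print(precision_at_n(ranked, relevant, 10))  # 0.3 even with a perfect ranking
```

Under the protocol proposed later in the deck, this user's test set at N = 10 would instead contain exactly 10 relevant items, so a perfect ranking scores 1.0.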

Page 7: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Evaluation in Production
• One dynamic training set
  – All of the available data at a certain point in time
  – Continuously updated
• No test set
  – Only live user interactions
  – Clicked/purchased items are good recommendations

Can we simulate this offline?

Page 8: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Evaluation Protocol
• Based on "real world" concepts
• Uses as much available data as possible
• Trains algorithms once per user and evaluation setting (e.g. N)
• Evaluates p@N when there are exactly N correct items in the test set
  – p@N = 1 is possible (gold standard)

Page 9: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Evaluation Protocol
Three concepts:
1. Personalized training & test sets
   – Use all available information about the system for the candidate user
   – Different test/training sets for different levels of N
2. Candidate item selection (items in test sets)
   – Only "good" items go in test sets (no random 80%-20% splits)
   – How "good" an item is depends on each user's personal preference
3. Candidate user selection (users in test sets)
   – Candidate users must have items in the training set
   – When evaluating p@N, each user in the test set should have N items in the test set
     • Effectively, precision becomes R-precision

Train each algorithm once for each user in the test set and once for each N (a sketch of these steps follows below).

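A minimal sketch of the three steps for a single user and a single N. The function name, data layout, and qualification rules are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical helper illustrating the protocol's per-user, per-N split.

def personalized_split(ratings, user, n, relevance_threshold):
    """Build a personalized training/test split for one user and one level of N.

    ratings: dict mapping (user_id, item_id) -> rating
    Returns (train, test) or None if the user does not qualify at this N.
    """
    user_items = [(item, r) for (u, item), r in ratings.items() if u == user]

    # Candidate item selection: only "good" items (above the user's own
    # relevance threshold) are eligible for the test set.
    relevant = [item for item, r in user_items if r >= relevance_threshold]

    # Candidate user selection: the user needs at least N relevant items
    # (so p@N = 1 is attainable) and something left over for training.
    if len(relevant) < n or len(user_items) <= n:
        return None

    test = set(relevant[:n])  # exactly N test items -> precision@N is R-precision
    # Personalized training set: all remaining data in the system.
    train = {key: r for key, r in ratings.items()
             if not (key[0] == user and key[1] in test)}
    return train, test
```

The recommender is then trained once on `train`, asked for a top-N list for this user, and scored against `test`; this is repeated for every candidate user and every N.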

Page 10: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Evaluation Protocol


Page 11: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

EXPERIMENTS  


Page 12: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Experiments
• Datasets:
  – MovieLens 100k
    • Minimum 20 ratings per user
    • 943 users
    • 6.43% density
    • Not realistic
  – MovieLens 1M sample
    • 100k ratings
    • 1000 users
    • 3.0% density
• Algorithms:
  – SVD
  – User-based CF (kNN)
  – Item-based CF

[Figures: rating distributions for the two datasets, plotting number of users against number of ratings on log-scaled axes]

Page 13: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Experimental Settings

According to the proposed protocol:
• Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
• Users evaluated at N must have at least N items rated above the relevance threshold (RT)
• RT depends on the user's mean rating and standard deviation (one possible definition is sketched below)
• Number of runs: |N| * |users|

Baseline:
• Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
• 80%-20% training-test split
  – Items in the test set rated at least 3
• Number of runs: 1
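A minimal sketch of one possible per-user relevance threshold. The slides only say RT depends on the user's mean rating and standard deviation; the exact combination and the 0.5 weight below are assumptions:

```python
# Hypothetical RT: the user's mean rating plus a fraction of their standard deviation.
from statistics import mean, stdev

def relevance_threshold(user_ratings):
    """Items rated >= RT count as relevant test candidates for this user."""
    if len(user_ratings) < 2:
        return mean(user_ratings)                           # not enough data for a deviation
    return mean(user_ratings) + 0.5 * stdev(user_ratings)   # 0.5 is an assumed weight

print(relevance_threshold([3, 4, 4, 5, 5]))  # ~4.62: a generous rater must rate clearly above their norm
```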

 

Page 14: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Results

[Figure: User-based CF, ML1M sample]

Page 15: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Results

[Figures: User-based CF (ML1M sample), User-based CF (ML100k), SVD (ML1M sample), SVD (ML1M sample)]

Page 16: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Results
What about time?
– |N| * |users| runs vs. 1 run?
– Trade-off between a realistic evaluation and complexity? (a worked count follows below)
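A worked count using the deck's own numbers (6 levels of N, 943 MovieLens 100k users); this is an upper bound, since at each N only users with at least N relevant items qualify:

```python
# Upper bound on training runs for the proposed protocol vs. the baseline.
n_values = [1, 5, 10, 20, 50, 100]     # levels of N from the experimental settings
users = 943                            # MovieLens 100k users
print(len(n_values) * users, "vs. 1")  # 5658 runs at most, vs. a single baseline run
```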

Page 17: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Conclusions
• We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
• We can see how well a recommender performs at different levels of recall (page size)
• We can compare against a gold standard
• We can reduce evaluation time

Page 18: A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

Questions?
• Thanks!
• Also, check out:
  – ACM TIST Special Issue on RecSys Benchmarking: bit.ly/RecSysBe
  – The ACM RecSys Wiki: www.recsyswiki.com