Distributed Networks & Systems Lab
A Survey of Collaborative Filtering
Outline
Introduction
Collaborative filtering
Characteristics and challenges
Memory-based CF
Model-based CF
Hybrid CF
Recent advances in CF
Conclusion
Introduction
Recommendation System
Helps users discover new items that would otherwise be hard to find
A subclass of information filtering systems that seeks to predict the 'rating' or 'preference' that a user would give to an item
Recommender systems identify recommendations autonomously for individual users based on past purchases and searches, and on other users' behavior
Recommender System
• Content-based: based on a description of the item and a profile of the user's preferences
• Collaborative Filtering: based on collecting and analyzing a large amount of information on users' behaviors, activities, or preferences, and predicting what users will like based on their similarity to other users
• Hybrid: a combination of the collaborative filtering and content-based approaches
Recommender System
• Content-based
• Collaborative Filtering
  • Memory-based
  • Model-based
  • Hybrid
• Hybrid
Collaborative filtering
Characteristics & Challenges
Collaborative filtering faces performance challenges that stem from its distinguishing characteristics
Data sparsity
Scalability
Synonymy
Gray sheep
Shilling attacks
Data Sparsity
In online marketplaces, the sheer variety of products makes the user-item matrix sparse: each user rates only a tiny fraction of the items.
How can sparse data be processed and matched?
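To make "sparse" concrete, the following minimal sketch computes the fraction of empty cells in a user-item matrix. The toy ratings, the dictionary layout, and the function name `sparsity` are assumptions chosen for illustration only.

```python
def sparsity(ratings, n_users, n_items):
    """Fraction of user-item cells that hold no rating."""
    return 1.0 - len(ratings) / (n_users * n_items)

# Hypothetical toy data: (user, item) -> rating
ratings = {("u1", "i1"): 4, ("u1", "i3"): 5, ("u2", "i2"): 2}

# 3 ratings in a 3 x 4 matrix leaves 9 of 12 cells empty
print(sparsity(ratings, n_users=3, n_items=4))  # 0.75
```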
Data Sparsity
Cold start problem
A new user or item has just entered the system.
Hard to find similar ones since there is not enough information
Too few user ratings compared to the large number of items in the system
Causes reduced coverage
Data Sparsity
Users with the same tastes may not be identified as such if they have no co-rated items
Data Sparsity
Dimensionality reduction techniques
Singular Value Decomposition (SVD)
Remove unrepresentative or insignificant users or items to reduce the dimensionality of the user-item matrix directly
Reduced sparsity, but with some drawbacks
Meaningful data is also discarded
Can cause a decrease in recommendation quality
Scalability
Large data volumes lead to longer compute times under limited resources
Dimensionality reduction can help with this problem, but it requires an extra matrix factorization step, which is expensive
Incremental SVD algorithms have been suggested to reduce the cost of this step
Synonymy
Same kind of product, different names
"Children movie", "children film"
Memory-based CF systems are vulnerable to this problem
Attempts were made to solve this
Intellectual or automatic term expansion could be a partial solution, but has some drawbacks
Gray Sheep
Users whose tastes are not ordinary
Hard to make predictions for them
No full solution for this
Per-user approaches have been proposed to reduce this problem
Shilling Attacks
Deliberate injection of positive ratings (for a seller's own products) and negative ratings (for competitors' products) by product sellers
Item-based CF algorithms were much less affected by these attacks than user-based CF algorithms
Other Challenges
Observing personal habits of users
Privacy invasion
Noise increase
From increase in diversity
Explainability
Let users know the reason why the system recommends a specific item
Memory-based CF
Memorizes the rating matrix and issues recommendations based on the relationship between the queried user and item and the rest of the matrix
Uses the entire user-item database, or a sample of it, to make predictions
Every user is part of a group of people with similar interests
Neighborhood-based CF methods
Most popular memory-based CF method
Predicts ratings by referring to users whose ratings are similar to those of the queried user, or to items that are similar to the queried item.
Calculate the similarity or weight, then
Aggregate the neighbors to get the top-N most frequent items as the recommendation
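The aggregation step can be sketched as a frequency count over the neighbors' items. This is a minimal illustration, assuming the neighbors have already been selected by similarity; the function name `top_n_from_neighbors` and the toy item sets are hypothetical.

```python
from collections import Counter

def top_n_from_neighbors(active_items, neighbor_items, n):
    """Return the n items that occur most often among the neighbors,
    skipping items the active user already has."""
    counts = Counter()
    for items in neighbor_items:
        counts.update(set(items) - set(active_items))
    return [item for item, _ in counts.most_common(n)]

# Hypothetical data: items of the active user and of three chosen neighbors
active = {"a"}
neighbors = [{"a", "b", "c"}, {"b", "c"}, {"b", "d"}]
print(top_n_from_neighbors(active, neighbors, 2))  # ['b', 'c']
```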
Similarity Computation
Critical step
For item-based CF
Compute similarity between items
For user-based CF
Compute similarity between users u and v who have both rated the same items
Correlation-based Similarity
To get the similarity $w_{u,v}$ between two users u and v, or $w_{i,j}$ between two items i and j
Pearson correlation is used to measure similarity
Measures the linear dependence between two variables (or users) as a function of their attributes
Pearson Correlation
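User-based algorithm
$$w_{u,v} = \frac{\sum_{i \in I} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{v,i} - \bar{r}_v)^2}}$$
The $i \in I$ summations are over the items that both users u and v have rated, and $\bar{r}_u$ is the average rating of the co-rated items of the u-th user.
Item-based algorithm
$$w_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}$$
The $u \in U$ summations are over the users who have rated both items i and j, $r_{u,i}$ is the rating of user u on item i, and $\bar{r}_i$ is the average rating of the i-th item by those users.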
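A minimal sketch of the user-based Pearson correlation over co-rated items follows. The dictionary layout (item id to rating), the function name, and the choice to return 0.0 when the correlation is undefined are assumptions made for illustration.

```python
from math import sqrt

def pearson_user_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    Returns 0.0 when the correlation is undefined (fewer than two
    co-rated items, or zero variance for either user).
    """
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    # Averages are taken over the co-rated items only
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)

# Hypothetical users: v rates every co-rated item exactly twice as high as u
u = {"i1": 1, "i2": 2, "i3": 3}
v = {"i1": 2, "i2": 4, "i3": 6, "i4": 5}
print(pearson_user_similarity(u, v))  # close to 1.0
```

The item-based variant is the same computation with the roles swapped: each item becomes a dictionary from user id to rating.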
Vector Cosine-based Similarity
Used to find similarity between two documents
Treat each document as a vector of word frequencies
Compute the cosine of the angle formed by the frequency vectors
For collaborative filtering,
Treat users or items as vectors of ratings and compute the cosine of the angle formed by the rating vectors
Vector Cosine-based Similarity
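Similarity between two items i and j
$$w_{i,j} = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\lVert \vec{i} \rVert \, \lVert \vec{j} \rVert}$$
Example:
For vector $A = \{x_1, y_1\}$, vector $B = \{x_2, y_2\}$
$$\cos(A, B) = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$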
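The cosine of the angle between two rating vectors can be sketched as follows. The function name and the choice to return 0.0 for an all-zero vector are assumptions; the vectors are assumed to have equal length.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_a * norm_b)

# Two-dimensional case from the example: A = {x1, y1}, B = {x2, y2}
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
print(cosine_similarity([1, 2], [2, 4]))  # close to 1.0 (same direction)
```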
Prediction and Recommendation Computation
In neighborhood-based CF, a subset of the active user's nearest neighbors is chosen based on their similarity to him or her, and a weighted aggregate of their ratings is used to generate predictions for the active user
Weighted sum of Others’ ratings
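To make a prediction for the active user a on a certain item i,
we can take a weighted average of all the ratings on that item:
$$P_{a,i} = \bar{r}_a + \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u) \cdot w_{a,u}}{\sum_{u \in U} \lvert w_{a,u} \rvert}$$
$\bar{r}_a$: average rating of user a on all other rated items
$\bar{r}_u$: average rating of user u on all other rated items
$w_{a,u}$: weight between user a and user u
The $u \in U$ summations are over all users who have rated item i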
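A minimal sketch of the weighted-sum prediction follows. It assumes the weights between the active user and each neighbor are already available (for example from a Pearson correlation), and it takes each user's average over the items other than the target item. The function names, the data layout, and the toy numbers are hypothetical.

```python
def mean_excluding(user_ratings, item):
    """Average of a user's ratings on all items other than `item`."""
    others = [r for i, r in user_ratings.items() if i != item]
    return sum(others) / len(others) if others else None

def predict_weighted_sum(active, item, ratings, weights):
    """Predict the active user's rating of `item`.

    ratings: user -> {item: rating}
    weights: neighbor user -> similarity weight with the active user
    """
    mean_a = mean_excluding(ratings[active], item)
    if mean_a is None:
        return None
    num = den = 0.0
    for u, w in weights.items():
        if u == active or item not in ratings.get(u, {}):
            continue  # only neighbors who rated the item contribute
        mean_u = mean_excluding(ratings[u], item)
        if mean_u is None:
            continue
        num += (ratings[u][item] - mean_u) * w
        den += abs(w)
    return mean_a if den == 0 else mean_a + num / den

# Hypothetical toy data
ratings = {
    "a":  {"i1": 4, "i3": 5},
    "u1": {"i1": 4, "i2": 2, "i3": 4},
    "u2": {"i1": 2, "i2": 5, "i3": 2},
}
weights = {"u1": 1.0, "u2": -1.0}
print(predict_weighted_sum("a", "i2", ratings, weights))  # 2.0
```

In the toy data, u1 agrees with the active user and rated i2 two points below their own average, while u2 disagrees and rated it three points above; both push the prediction below the active user's average of 4.5.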
Example
To predict the rating for U1 on I2,
Simple Weighted Average
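For item-based prediction,
we can use a simple weighted average to predict the rating $P_{u,i}$ for user u on item i:
$$P_{u,i} = \frac{\sum_{n \in N} r_{u,n} \, w_{i,n}}{\sum_{n \in N} \lvert w_{i,n} \rvert}$$
The $n \in N$ summations are over all other items rated by user u, $w_{i,n}$ is the weight between items i and n, and $r_{u,n}$ is the rating of user u on item n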
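The item-based simple weighted average can be sketched as below, assuming the item-to-item weights for the target item are precomputed. The function name and the toy values are hypothetical.

```python
def predict_item_based(user_ratings, item_weights):
    """Simple weighted average of a user's ratings on other items.

    user_ratings: item n -> rating of the user on n
    item_weights: item n -> weight between the target item and n
    """
    num = den = 0.0
    for n, r in user_ratings.items():
        w = item_weights.get(n)
        if w is None:
            continue  # no weight known between the target item and n
        num += r * w
        den += abs(w)
    return None if den == 0 else num / den

# Hypothetical data: the user rated i1 and i3; both are equally similar to the target
print(predict_item_based({"i1": 4, "i3": 2}, {"i1": 0.5, "i3": 0.5}))  # 3.0
```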
Top-N recommendations
Recommends a set of N top-ranked items that will be of interest to a certain user
A returning customer may be shown a list of recommendations
Top-N recommendation techniques analyze the user-item matrix to discover relations between different users or items and use them to compute recommendations
Association rule mining can be used to make top-N recommendations
Model-based CF
The design and development of models (machine learning, data mining algorithms) allows the system to learn to recognize complex patterns from training data and to make predictions from the learned models
Classification algorithms can be used as CF models if the user ratings are categorical
Regression models and SVD methods can be used for numerical ratings
Simple Bayesian CF Algorithm
Uses a naïve Bayes (NB) strategy to make predictions
Assumes the features are independent given the class
The probability of a certain class given all of the features can then be computed
The class with the highest probability is chosen as the predicted class
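The naïve Bayes strategy can be sketched for categorical ratings as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the class is the rating of the target item, the features are the active user's ratings on other items, the training rows are the other users who rated the target item, and Laplace smoothing with parameter `alpha` is added to avoid zero probabilities. It also assumes features take the same categorical values as the classes. All names and data are hypothetical.

```python
def nb_predict(active, target, ratings, classes, alpha=1.0):
    """Return the most probable rating class of `target` for `active`.

    ratings: user -> {item: class label}
    classes: list of possible class labels
    alpha:   Laplace smoothing constant (an assumption of this sketch)
    """
    train = [r for u, r in ratings.items() if u != active and target in r]
    best, best_p = None, -1.0
    for c in classes:
        rows = [r for r in train if r[target] == c]
        # Smoothed prior P(class)
        p = (len(rows) + alpha) / (len(train) + alpha * len(classes))
        for item, value in ratings[active].items():
            if item == target:
                continue
            rated = [r for r in rows if item in r]
            match = sum(1 for r in rated if r[item] == value)
            # Smoothed likelihood P(feature value | class)
            p *= (match + alpha) / (len(rated) + alpha * len(classes))
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical toy data with two rating classes
ratings = {
    "a":  {"i1": "like", "i2": "like"},
    "u1": {"i1": "like", "i2": "like", "t": "like"},
    "u2": {"i1": "like", "i2": "like", "t": "like"},
    "u3": {"i1": "dislike", "i2": "dislike", "t": "dislike"},
}
print(nb_predict("a", "t", ratings, ["like", "dislike"]))  # like
```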
Clustering CF Algorithms
Show better scalability
Make predictions within much smaller clusters rather than over the entire customer base
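The scalability gain comes from restricting the search to the active user's cluster. The sketch below assumes the cluster assignment has already been computed by some clustering algorithm (not shown) and uses a plain average of the cluster peers' ratings as the prediction; both the `cluster_of` mapping and the averaging rule are assumptions for illustration.

```python
def predict_in_cluster(active, item, ratings, cluster_of):
    """Average rating of `item` among users in the active user's cluster.

    ratings:    user -> {item: rating}
    cluster_of: user -> cluster id (assumed precomputed)
    """
    peers = [
        u for u in ratings
        if u != active
        and cluster_of[u] == cluster_of[active]
        and item in ratings[u]
    ]
    if not peers:
        return None
    return sum(ratings[u][item] for u in peers) / len(peers)

# Hypothetical toy data: u3 is in a different cluster and is ignored
ratings = {
    "a":  {"i1": 5},
    "u1": {"i1": 5, "i2": 4},
    "u2": {"i1": 4, "i2": 2},
    "u3": {"i1": 1, "i2": 1},
}
cluster_of = {"a": 0, "u1": 0, "u2": 0, "u3": 1}
print(predict_in_cluster("a", "i2", ratings, cluster_of))  # 3.0
```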
Hybrid Collaborative Filtering
Memory-based and model-based CF approaches are combined to form hybrid CF approaches
Show some improvement over either approach alone
Probabilistic memory-based CF
Personality diagnosis
Probabilistic memory-based CF
Combines memory-based and model-based CF
To address the new user problem, an active learning extension to the PMCF system can be used to actively query a user for additional information.
To reduce computation time, PMCF selects a small subset, the 'profile space', from the entire database of user ratings and makes predictions from this small profile space rather than from the whole database
Better accuracy than
Pearson correlation-based CF
Model-based CF using naïve Bayes
Personality diagnosis
Combines memory-based and model-based CF and keeps the advantages of both
Given the active user's known ratings, we can calculate the probability that he or she is of the same "personality type" as other users, and predict whether he or she will like new items
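The idea can be sketched as follows, under assumptions that go beyond the slide: each other user is treated as a candidate personality type, observed ratings are assumed to differ from the true type by Gaussian noise with spread `sigma`, and every candidate type gets an equal prior. The function name, the parameter `sigma`, and the toy data are hypothetical.

```python
from math import exp

def pd_predict(active_ratings, others, target, values, sigma=1.0):
    """Return the most probable rating value of `target` for the active user.

    active_ratings: item -> rating of the active user
    others:         user -> {item: rating}, candidate personality types
    values:         possible rating values
    sigma:          spread of the assumed Gaussian rating noise
    """
    def noise(x, y):
        # Unnormalized Gaussian likelihood of observing x when the true rating is y
        return exp(-((x - y) ** 2) / (2 * sigma ** 2))

    scores = {x: 0.0 for x in values}
    for r in others.values():
        if target not in r:
            continue
        # Likelihood that the active user shares this user's personality type
        same_type = 1.0
        for item, x in active_ratings.items():
            if item in r:
                same_type *= noise(x, r[item])
        for x in values:
            scores[x] += same_type * noise(x, r[target])
    return max(values, key=lambda x: scores[x])

# Hypothetical toy data: the active user matches u1 and is the opposite of u2
active = {"i1": 5, "i2": 1}
others = {
    "u1": {"i1": 5, "i2": 1, "t": 4},
    "u2": {"i1": 1, "i2": 5, "t": 1},
}
print(pd_predict(active, others, "t", [1, 2, 3, 4, 5]))  # 4
```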
Conclusion