Distributed Networks & Systems Lab A Survey of Collaborative Filtering


Outline

Introduction

Collaborative filtering

Characteristics and challenges

Memory-based CF

Model-based CF

Hybrid CF

Recent advances in CF

Conclusion


Introduction

Recommendation System

Helps users discover new items that may be hard for them to find

A subclass of information filtering systems that seeks to predict the ‘rating’ or ‘preference’ that a user would give to an item

Recommender systems identify recommendations autonomously for individual users based on past purchases and searches, and on other users' behavior


Recommender System

Recommendation System

Content-based: based on a description of the item and a profile of the user’s preferences

Collaborative Filtering: based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences and predicting what users will like based on their similarity to other users

Hybrid: combination of collaborative filtering and the content-based approach


Recommender System

• Recommendation System
  • Content-based
  • Collaborative Filtering
    • Memory-based
    • Model-based
    • Hybrid
  • Hybrid


Collaborative filtering


Characteristics & Challenges

Collaborative filtering faces performance challenges that stem from its distinguishing characteristics

Data sparsity

Scalability

Synonymy

Gray sheep

Shilling attacks


Data Sparsity

In online markets, the wide variety of products makes the user-item matrix sparse.

How can sparse data be processed and matched?
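To make "sparse" concrete, the following minimal sketch computes the fraction of empty cells in a user-item matrix. The `ratings` dictionary and the `sparsity` helper are hypothetical names and toy data chosen for illustration; they do not come from the slides.

```python
# Hypothetical toy data: user -> {item: rating}. A missing key means "not rated".
ratings = {
    "u1": {"i1": 5, "i3": 3},
    "u2": {"i2": 4},
    "u3": {"i1": 2, "i4": 5},
}


def sparsity(ratings):
    """Return the fraction of user-item cells that hold no rating."""
    items = {item for user_ratings in ratings.values() for item in user_ratings}
    total_cells = len(ratings) * len(items)
    observed = sum(len(user_ratings) for user_ratings in ratings.values())
    return 1.0 - observed / total_cells


# 3 users x 4 items = 12 cells, 5 of them rated
print(f"sparsity = {sparsity(ratings):.3f}")  # sparsity = 0.583
```

Real catalogs have far more items than any one user rates, so the value is typically much closer to 1 than in this toy case.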


Data Sparsity

Cold start problem

A new user or item has just entered the system.

Hard to find similar ones since there is not enough information

Too few user ratings compared to the large number of items in the system

Causes reduced coverage


Data Sparsity

Users with the same tastes may not be identified as such if they have no co-rated items


Data Sparsity

Dimensionality reduction techniques

Singular Value Decomposition

Removes unrepresentative or insignificant users or items to reduce the dimensionality of the user-item matrix directly

Reduces sparsity, but has some drawbacks

Meaningful data is also discarded

Causes a decrease in recommendation quality


Scalability

Large data sizes cause longer compute times under limited resources

Dimensionality reduction can help with this problem, but it requires an extra step (matrix factorization) that is computationally expensive

An incremental SVD algorithm has been suggested to reduce the cost of this step


Synonymy

Same kind of products, different names

“Children movie”, “children film”

Memory-based CF systems are vulnerable to this problem

Attempts were made to solve this

Intellectual or automatic term expansion could be a partial solution, but has some drawbacks


Gray Sheep

Users whose tastes are out of the ordinary

Hard to make predictions for them

No full solution for this

Per-user approaches were proposed to reduce this problem


Shilling Attacks

Deliberate inflation of positive ratings for a company's own products, and of negative ratings for competitors' products, by product sales companies

Item-based CF algorithms were much less affected by the attacks than user-based CF algorithms


Other Challenges

Observing personal habits of users

Privacy invasion

Noise increase

From increase in diversity

Explainability

Let users know the reason why the system recommends the specific item


Memory-based CF

Memorizes the rating matrix and issues recommendations based on the relationship between the queried user and item and the rest of the matrix

Uses the entire user-item database, or a sample of it, to make predictions

Every user is part of a group of people with similar interests


Neighborhood-based CF methods

Most popular memory-based CF method

Predict ratings by referring to users whose ratings are similar to the queried user's, or to items that are similar to the queried item.

Calculate the similarity or weight, then

Aggregate the neighbors to get the top-N most frequent items as the recommendation
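The aggregation step can be sketched as a frequency count over the neighbors' items. This is a minimal illustration, assuming the neighbors have already been selected by similarity; `top_n_frequent` and its arguments are hypothetical names, not part of the slides.

```python
from collections import Counter


def top_n_frequent(active_items, neighbor_item_sets, n):
    """Return the n items that occur most often among the neighbors.

    active_items: items the active user has already rated (excluded).
    neighbor_item_sets: one collection of rated items per chosen neighbor.
    """
    counts = Counter()
    for items in neighbor_item_sets:
        # Count each item at most once per neighbor, skipping known items.
        counts.update(set(items) - set(active_items))
    return [item for item, _ in counts.most_common(n)]


neighbors = [{"i1", "i2", "i3"}, {"i2", "i3"}, {"i2", "i4"}]
print(top_n_frequent({"i1"}, neighbors, 2))  # ['i2', 'i3']
```

A weighted variant could add each neighbor's similarity instead of 1 per occurrence; the plain count is used here because the slide speaks of the most frequent items.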


Similarity Computation

Critical step

For item-based CF

Compute similarity between items

For user-based CF

Compute similarity between users u and v who have both rated the same items


Correlation-based Similarity

To get the similarity $w_{u,v}$ between two users u and v, or $w_{i,j}$ between two items i and j

Pearson correlation is used to measure similarity

Measures the linear dependence between two variables (or users) as a function of their attributes


Pearson Correlation

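User-based algorithm

$$w_{u,v} = \frac{\sum_{i \in I} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{v,i} - \bar{r}_v)^2}}$$

The $i \in I$ summations are over the items that both users u and v have rated, and $\bar{r}_u$ is the average rating of the co-rated items of the u-th user.

Item-based algorithm

$$w_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}$$

The $u \in U$ summations are over the users who have rated both items i and j, $r_{u,i}$ is the rating of user u on item i, and $\bar{r}_i$ is the average rating of the i-th item by those users.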
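A minimal sketch of the user-based Pearson weight, assuming each user's ratings are stored as an item-to-rating dictionary. The function name, the data layout, and the choice to return 0.0 when the correlation is undefined are assumptions of this example, not something the slides specify.

```python
from math import sqrt


def pearson_user_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # assumption: too few co-rated items to correlate
    # Means are taken over the co-rated items only.
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # assumption: a user with constant ratings has no defined weight
    return num / (den_u * den_v)


u = {"i1": 1, "i2": 2, "i3": 3}
v = {"i1": 2, "i2": 4, "i3": 6, "i4": 1}  # i4 is ignored: u has not rated it
print(round(pearson_user_similarity(u, v), 6))  # 1.0
```

The item-based weight follows the same pattern with the roles swapped: iterate over the users who rated both items and center on each item's mean.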


Vector Cosine-based Similarity

Used to find the similarity between two documents

Treat each document as a vector of word frequencies

Compute the cosine of the angle formed by the frequency vectors

For collaborative filtering,

Treat users or items as vectors of ratings and compute the cosine of the angle formed by the rating vectors


Vector Cosine-based Similarity

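Similarity between two items i and j

$$w_{i,j} = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\lVert \vec{i} \rVert \, \lVert \vec{j} \rVert}$$

Example: for vector A = {x1, y1} and vector B = {x2, y2}

$$\cos(A, B) = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$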
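The two-dimensional example can be checked numerically with a short sketch. The function name and the concrete values of A and B are illustrative assumptions; returning 0.0 for an all-zero vector is also a choice made here, since the cosine is undefined in that case.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_a * norm_b)


A = [3, 4]  # {x1, y1}
B = [4, 3]  # {x2, y2}
# (3*4 + 4*3) / (5 * 5) = 24 / 25
print(cosine_similarity(A, B))  # 0.96
```

In a rating matrix, unrated cells have to be given some value (often 0) before the vectors can be compared, which is a modeling decision this sketch leaves to the caller.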


Prediction and Recommendation Computation

In neighborhood-based CF, a subset of the active user's nearest neighbors is chosen based on their similarity to him or her, and a weighted aggregate of their ratings is used to generate predictions for the active user


Weighted sum of Others’ ratings

To make prediction for active user a, on a certain item i,

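We can take a weighted average of all the ratings on that item by using this formula:

$$P_{a,i} = \bar{r}_a + \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u) \cdot w_{a,u}}{\sum_{u \in U} |w_{a,u}|}$$

$\bar{r}_a$: average rating of user a on all other rated items

$\bar{r}_u$: average rating of user u on all other rated items

$w_{a,u}$: weight between user a and user u

The $u \in U$ summations are over all the users who have rated item i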
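A minimal sketch of this weighted-sum prediction, assuming the weights between the active user and each neighbor have already been computed (for example by Pearson correlation). The toy ratings, the weights, and the function names are hypothetical, and the fallback to the active user's own mean when no neighbor rated the item is an assumption of this example.

```python
def mean_excluding(user_ratings, item):
    """Average of a user's ratings on all items other than `item`.

    Assumes the user has rated at least one other item.
    """
    others = [r for i, r in user_ratings.items() if i != item]
    return sum(others) / len(others)


def predict_weighted_sum(active, item, ratings, weights):
    """Predict the active user's rating on `item` from neighbors' deviations."""
    mean_a = mean_excluding(ratings[active], item)
    num = 0.0
    den = 0.0
    for user, w in weights.items():
        if item not in ratings[user]:
            continue  # only neighbors who rated the item contribute
        num += (ratings[user][item] - mean_excluding(ratings[user], item)) * w
        den += abs(w)
    if den == 0:
        return mean_a  # assumption: no usable neighbor, fall back to own mean
    return mean_a + num / den


ratings = {
    "a": {"i1": 3, "i3": 4},
    "u1": {"i1": 4, "i2": 2, "i3": 1},
    "u2": {"i1": 2, "i2": 5, "i3": 4},
}
weights = {"u1": -1.0, "u2": 0.5}  # hypothetical similarities to user "a"
# 3.5 + ((2 - 2.5) * -1.0 + (5 - 3.0) * 0.5) / (1.0 + 0.5) = 3.5 + 1.0
print(predict_weighted_sum("a", "i2", ratings, weights))  # 4.5
```

Note how the negatively correlated neighbor u1 still pushes the prediction up: it rated the item below its own mean, and the negative weight flips that deviation.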


Example

To predict the rating for U1 on I2,


Simple Weighted Average

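For item-based prediction, we can use a simple weighted average to predict the rating $P_{u,i}$ for user u on item i

$$P_{u,i} = \frac{\sum_{n \in N} r_{u,n} \, w_{i,n}}{\sum_{n \in N} |w_{i,n}|}$$

The $n \in N$ summations are over all other items rated by user u, $w_{i,n}$ is the weight between items i and n, and $r_{u,n}$ is the rating of user u on item n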
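A minimal sketch of the item-based simple weighted average, assuming the item-to-item weights for the target item are already known. The function name, argument layout, and toy values are hypothetical; returning `None` when no rated item has a weight is a choice made for this example.

```python
def predict_item_based(user_ratings, item_weights):
    """Weighted average of a user's ratings, weighted by item similarity.

    user_ratings: item n -> rating of the user on n.
    item_weights: item n -> weight between the target item and n.
    """
    num = 0.0
    den = 0.0
    for n, rating in user_ratings.items():
        w = item_weights.get(n)
        if w is None:
            continue  # no similarity known for this item
        num += rating * w
        den += abs(w)
    return num / den if den else None


user_ratings = {"i1": 4, "i3": 2}
item_weights = {"i1": 0.75, "i3": 0.25}  # hypothetical weights to the target item
# (4 * 0.75 + 2 * 0.25) / (0.75 + 0.25)
print(predict_item_based(user_ratings, item_weights))  # 3.5
```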


Top-N recommendations

To recommend a set of N top-ranked items that will be of interest to a certain user

A returning customer may get a list of recommendations

Top-N recommendation techniques analyze the user-item matrix to discover relations between different users or items and use them to compute recommendations

Association rule mining can be used to make Top-N recommendations


Model-based CF

The design and development of models (machine learning, data mining algorithms) can allow the system to learn to recognize complex patterns based on training data and make predictions from the learned models

Classification algorithms can be used as CF models if the user ratings are categorical

Regression models and SVD methods can be used for numerical ratings


Simple Bayesian CF Algorithm

Uses a naïve Bayes (NB) strategy to make predictions

Assumes the features are independent given the class

The probability of a certain class given all of the features can be computed

The class with the highest probability is then taken as the predicted class
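The naïve Bayes steps above can be sketched in a few lines. This is only an illustration under stated assumptions: each training row is one item, the features are other users' ratings on that item, and the class is the active user's rating. The add-one (Laplace) smoothing, the function names, and the toy data are choices made for this example rather than details given in the slides.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for idx, value in enumerate(row):
            feature_counts[(idx, label)][value] += 1
            values[idx].add(value)
    return class_counts, feature_counts, values


def predict_naive_bayes(model, row):
    """Return the class with the highest prior-times-likelihood score."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for idx, value in enumerate(row):
            seen = feature_counts[(idx, label)][value]
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score *= (seen + 1) / (count + len(values[idx]))
        if score > best_score:
            best_class, best_score = label, score
    return best_class


# Hypothetical data: two other users' opinions per item, and the active user's.
rows = [("like", "like"), ("like", "dislike"),
        ("dislike", "dislike"), ("dislike", "like")]
labels = ["like", "like", "dislike", "dislike"]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("like", "like")))  # like
```

In this toy data the active user always agrees with the first other user, so that feature dominates the score, while the second user's rating is uninformative.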


Clustering CF Algorithms

Show better scalability

Make predictions within much smaller clusters rather than over the entire customer base


Hybrid Collaborative Filtering

Memory-based and model-based CF approaches are combined to form hybrid CF approaches

Shows some improvement in prediction performance

Probabilistic memory-based CF

Personality diagnosis


Probabilistic memory-based CF

Combines memory-based and model-based CF

To address the new user problem, an active learning extension to the PMCF system can be used to actively query a user for additional information.

To reduce computation time, PMCF selects a small subset, the ‘profile space’, from the entire database of user ratings and makes predictions from the small profile space, not the whole database

Better accuracy than

Pearson correlation-based CF

Model-based CF using naïve Bayes


Personality diagnosis

Combines memory-based and model-based CF and keeps the advantages of both

Given the active user’s known ratings, we can calculate the probability that he or she is of the same “personality type” as other users, and predict whether he or she will like new items


Conclusion