Upload
blanche-louise-young
View
215
Download
0
Embed Size (px)
Citation preview
WindMine: Fast and Effective Mining of Web-click
Sequences
SDM 2011 Y. Sakurai et al. 1
Yasushi Sakurai (NTT)Lei Li (Carnegie Mellon Univ.)Yasuko Matsubara (Kyoto Univ.)Christos Faloutsos (Carnegie Mellon Univ.)
Introduction
Web-click sequence applicationsWeb masters and web-site owners- Capacity planning- Intrusion detection- Advertisement design
Goal- Find meaningful patterns for web-click data
(e.g., the lunch-break trend, huge spike, anomalies)
- Find periodicity (daily and/or weekly, etc)- Determine suitable window sizes automatically
SDM 2011 Y. Sakurai et al. 2
Introduction
Examples access count from a business news site
SDM 2011 Y. Sakurai et al. 3
Original web-click sequence
Problem definition
Web-click sequences of m URLs:
(X1 , … , Xm) Web-click sequence X of duration n :
X = (x1 ,…, xt ,…, xn)
Local Component Analysis: Given m sequences of duration n, (X1 , … , Xm) - Find patterns, main components of the sequences- Find the ‘best window’ size w for the analysis
Final challenge: scalable algorithm for the local component analysis
SDM 2011 Y. Sakurai et al. 4
Background
Independent component analysis (ICA)
- PCA vs. ICA
SDM 2011 Y. Sakurai et al. 5
1PC
2PC
1IC
2IC
Why not ‘PCA’?
Example of component analysis
SDM 2011 Y. Sakurai et al. 7
ICA recognizes the components successfully and separately
ICA recognizes the components successfully and separately
PCAPCA
ICAICA
Main idea (1)
Multi-scale local component analysis
SDM 2011 Y. Sakurai et al. 8
Divide a sequence into subsequences of length
w
Divide a sequence into subsequences of length
w
Compute the local components from the
window matrix
Compute the local components from the
window matrix
aa bb
cc dd
ee ff
gg hh
aa bb cc dd ee ff gg hh
time
SDM 2011 9
Main idea (2)
Best window size selection
Proposed criterion: CEM (Component Entropy Maximization)- Estimate the optimal number of w for the sequence set- Compute the entropy of the weight values of the
mixing matrix A- ‘popular’ (widely-used) components show high CEM
scores
Y. Sakurai et al.
Q : How to estimate a ‘good window size’ automatically when we have multiple sequences?
Q : How to estimate a ‘good window size’ automatically when we have multiple sequences?
Main idea (2)
CEM criterion: - CEM score of the j-th component for the window size w
- Probability for the j-th component (size of the j-th component’s contribution to each subsequence)
- Normalized weight values for each subsequences
- Mixing matrix
SDM 2011 Y. Sakurai et al. 10
k: # of componentsM: # of
subsequences
ijijijw pp
wC ,,, log
1
i
jijiji aap ,,,
j
jijiji aaa 2,,,
][ , jiw aA
),,1;,,1( kjMi
WindMine-part
Efficient solution
Hierarchical partitioning approach: WindMine-part- Partition the original window matrix into sub-matrices- Extract local components each from the sub-matrices- Reuse the local components for the component
analysis on the higher level
SDM 2011 Y. Sakurai et al. 11
Q : How do we efficiently extract the best local component from large sequence sets?
Q : How do we efficiently extract the best local component from large sequence sets?
WindMine-part
SDM 2011 Y. Sakurai et al. 12
local componentswindow matrix sub-matrices
Level 1
Level 2
Experimental Results
Experiments with real and datasetsOndemand TV, WebClick,
Automobile, Temperature, Sunspots
EvaluationAccuracy for pattern discoveryAccuracy for the best window sizeComputation time
SDM 2011 Y. Sakurai et al. 13
Pattern discovery
Ondemand TVaccess count of users
SDM 2011 Y. Sakurai et al. 14
Original sequence
PCA: failed PCA: failed
Anomaly spikes
Anomaly spikes
Weekly pattern Weekly pattern Daily pattern Daily pattern
Pattern discovery
WebClickQ & A site
SDM 2011 Y. Sakurai et al. 15
Weekly patternWeekly pattern
Low activity during sleeping
time
Low activity during sleeping
time
Dip at dinner time
Dip at dinner time
Increase from morning to night and reach a peak
Increase from morning to night and reach a peak
Pattern discovery
WebClickjob-seeking site
SDM 2011 Y. Sakurai et al. 16
High activity on week days
(daily access decreases as the weekend approaches)
High activity on week days
(daily access decreases as the weekend approaches)
Workers arrive at their office
Workers arrive at their office
Job seeking during a short
break
Job seeking during a short
break
Large spike during the lunch break
Large spike during the lunch break
Pattern discovery
WebClickother websites
SDM 2011 Y. Sakurai et al. 17
Educational site for kids
(they visit here after school, 3pm)
Educational site for kids
(they visit here after school, 3pm)
Website for baby nursery (the main users will be
their parents, rather than babies!)
Website for baby nursery (the main users will be
their parents, rather than babies!)
High activity 8am-11pm,
weekday(business purposes)
High activity 8am-11pm,
weekday(business purposes)
Pattern discovery
WebClickother websites
SDM 2011 Y. Sakurai et al. 18
The users visit three times a day
(early morning, noon, early evening)
The users visit three times a day
(early morning, noon, early evening)
The users rarely visit here late in the
evening (which is indeed good
for their health!)
The users rarely visit here late in the
evening (which is indeed good
for their health!)
Access count is still high in the night, 0am-1am
(healthy diet should include an earlier bed
time!)
Access count is still high in the night, 0am-1am
(healthy diet should include an earlier bed
time!)
Access count increases after meal
times
Access count increases after meal
times
Computation time
Wall clock time vs. # of subsequences- Up to 70 times faster
SDM 2011 Y. Sakurai et al. 21