Discovering RFM Sequential Patterns From Customers’ Purchasing Data
Prof. Yen-Liang Chen (陳彥良), Department of Information Management, National Central University
Date: 112/04/21
Agenda
• Introduction
• Related Work
• Problem Definition
• Algorithm
• Performance Evaluation
• Conclusion
Sequential Pattern Mining (1)
• Sequential pattern mining
– To find the relationships between occurrences of sequential events
– To find whether there exists any specific order of the occurrences
• Example
– Every time Microsoft stock drops 5%, IBM stock will also drop at least 4% within three days.
Sequential Pattern Mining (2)
• Applications of sequential pattern mining
– Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
– Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
– Telephone calling patterns, Weblog click streams
– DNA sequences and gene structures
Sequential Patterns vs. Association Rules
• Association rules: relationships within a transaction (intra-transaction)
– Which items are bought together? ( , )
• Sequential patterns: correlation between transactions
– Which items are bought in a certain order? < , >
[Figure: table of purchased items per customer ID (CID 1 and 2); the item images did not survive extraction]
What Is Sequential Pattern Mining?
• Given a set of sequences, find the complete set of frequent subsequences
• A sequence: <(ef)(ab)(df)cb>
– An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

A sequence database
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
An SPM Example and the Problems
• Traditional SPM methods discover only the frequencies of the maximal sequential patterns. This is a problem because:
– In a real-life situation the environment may change constantly, and users’ behavior may also change over time
– A lot of patterns are of little value
RFM Definition in Marketing by Bult and Wansbeek
• R (Recency): period from the last purchase to now
– R↓: higher possibility that the customer makes a repeated purchase
• F (Frequency): number of purchases made in a certain period
– F↑: the customer has higher loyalty
• M (Monetary): the amount of money spent during a certain period
– M↑: the customer is more important
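As a rough illustration of the three measures, the sketch below computes R, F and M for one customer from a list of (purchase date, amount) pairs. The function name `rfm`, the data layout and the sample purchases are hypothetical and are not taken from the slides.

```python
from datetime import date

def rfm(purchases, period_start, now):
    """Return (recency_days, frequency, monetary) for one customer.

    purchases: list of (purchase_date, amount) pairs (hypothetical layout).
    Only purchases inside [period_start, now] are counted.
    """
    in_period = [(d, amt) for d, amt in purchases if period_start <= d <= now]
    if not in_period:
        return None, 0, 0
    last_purchase = max(d for d, _ in in_period)
    recency_days = (now - last_purchase).days     # R: smaller means more recent
    frequency = len(in_period)                    # F: number of purchases
    monetary = sum(amt for _, amt in in_period)   # M: total amount spent
    return recency_days, frequency, monetary

# Made-up purchases, for illustration only.
purchases = [(date(2002, 1, 10), 300), (date(2002, 6, 1), 120), (date(2002, 12, 1), 80)]
r, f, m = rfm(purchases, period_start=date(2002, 1, 1), now=date(2002, 12, 31))
print(r, f, m)
```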
The Proposed Algorithm: RFM-SPM
• Frequency constraint (traditional SPM) → Frequency, Recency and Monetary constraints (RFM-SPM)
• Each constraint has two thresholds
– Upper threshold and lower threshold
– Ensure that each factor can be restricted to a specified range
• By setting these three factors to different intervals, we can discover the patterns that interest us
Recency Constraint
• Specified by giving a range from Rtime_min to Rtime_max, which are numbers of days counted from the starting date of the sequence database.
• Ensures that the last transaction of the pattern occurred in this interval
• Example: the sequence DB starts on 2001/12/27 and ends on 2002/12/31. With Rtime_min = 200 and Rtime_max = 270, the interval runs from 2001/12/27 + 200 days to 2001/12/27 + 270 days.
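The recency window from the slide can be turned into calendar dates with a short sketch. The starting date and the two thresholds come from the slide; the helper name `satisfies_recency` and the choice of an inclusive lower bound with an exclusive upper bound are assumptions (the latter follows the "≥ Rtime_min and < Rtime_max" wording of Example 3.2).

```python
from datetime import date, timedelta

DB_START = date(2001, 12, 27)     # starting date of the sequence database (from the slide)
RTIME_MIN, RTIME_MAX = 200, 270   # thresholds in days (from the slide)

window_start = DB_START + timedelta(days=RTIME_MIN)
window_end = DB_START + timedelta(days=RTIME_MAX)

def satisfies_recency(last_transaction_date):
    """True if the last transaction of a pattern falls inside the window.

    Assumption: the lower bound is inclusive and the upper bound exclusive.
    """
    return window_start <= last_transaction_date < window_end

print(window_start, window_end)
print(satisfies_recency(date(2002, 8, 1)))
```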
Monetary Constraint
• Given by a range from M_min to M_max. It ensures that the value of the discovered pattern is between M_min and M_max.
• Suppose the pattern is <(a), (bc)>. A data sequence satisfies this pattern with respect to the monetary constraint if it contains an occurrence of <(a), (bc)> whose value is within this range.
Frequency Constraint
• The frequency of a pattern is the percentage of sequences in the database that contain the pattern while satisfying the recency constraint and the monetary constraint.
• A pattern is output as an RFM-pattern if its frequency falls within the interval from minsup_min to minsup_max.
An Example of an RFM-Pattern
• 30% of customers who bought a computer would recently come back to buy a scanner and a microphone, and the total amount of these products is greater than NT$55,000.
Related Work
• Cluster
– Similar needs and/or characteristics that are likely to exhibit similar purchasing behaviors
• Classification
– Classifying customers into different categories of customer value; the classifiers are also used to classify unseen cases
• Association rule
– Extracting Share Frequent Itemsets with Infrequent Subsets
• SPM
– Constraint-Based Sequential Pattern Mining: the Consideration of Recency and Compactness
– Discovering RFM sequential patterns from customers’ purchasing data
[Slide labels marking which of the R, F and M factors each approach covers; as extracted: "R F M", "R F M", "M", "R F"]
Data-Sequence in RFM-SPM

Traditional sequence DB
Sid  Sequence
10   <(a) (c) (ab) (a) (c)>
20   <(b) (c) (ab) (c)>
30   <(ab) (b) (c)>
40   <(b) (bc)>
50   <(c) (b) (ab) (bc)>

Transferred sequence DB (each entry is an item with its time and its money value)
Sid  Sequence
10   <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (c, 10, 70)>
20   <(b, 3, 30), (c, 5, 50), (a, 7, 20), (b, 7, 70), (c, 14, 20)>
30   <(a, 8, 40), (b, 8, 50), (b, 16, 20), (c, 20, 100)>
40   <(b, 15, 30), (b, 22, 20), (c, 22, 120)>
50   <(c, 5, 30), (b, 6, 40), (a, 10, 30), (b, 10, 60), (b, 19, 90), (c, 19, 70)>
An Overview of Problem Definition
• Containment of itemset
• Subsequence
• Recent Subsequence
• Recent Monetary Subsequence
Example 3.1. (subsequence)
• Data-sequence A = <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (e, 6, 90), (c, 10, 70)>
• Is itemset (ab) contained in A? Yes
• Is sequence B = <(ab)(ae)> a subsequence of A? Yes
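The two checks in Example 3.1 can be sketched as follows. This is a minimal sketch, assuming each triple is (item, time, money), that an itemset is contained when all its items share one transaction time, and that the itemsets of a subsequence must occur at strictly increasing times. The function names are hypothetical, and items are single characters so that an itemset can be written as a string such as "ab".

```python
from collections import defaultdict

# Data-sequence A from Example 3.1: (item, time, money) triples.
A = [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70),
     ("a", 6, 50), ("e", 6, 90), ("c", 10, 70)]

def items_by_time(seq):
    """Group the items of a data-sequence by transaction time."""
    groups = defaultdict(set)
    for item, time, _money in seq:
        groups[time].add(item)
    return groups

def contains_itemset(seq, itemset):
    """True if all items of `itemset` occur together at some single time."""
    return any(set(itemset) <= items for items in items_by_time(seq).values())

def is_subsequence(seq, pattern):
    """True if the itemsets of `pattern` occur at strictly increasing times."""
    groups = items_by_time(seq)
    last_time = None
    for itemset in pattern:
        # Greedy matching: take the earliest usable time after the previous match.
        times = [t for t in sorted(groups)
                 if set(itemset) <= groups[t] and (last_time is None or t > last_time)]
        if not times:
            return False
        last_time = times[0]
    return True

print(contains_itemset(A, "ab"))        # (ab) occurs at time 4
print(is_subsequence(A, ["ab", "ae"]))  # (ab) at time 4, then (ae) at time 6
```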
Example 3.2. (recent subsequence)
• Data-sequence A = <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (e, 6, 90), (c, 10, 70)>
• Rtime_min = 5 and Rtime_max = 8.
• Is sequence B = <(ab)(ae)> a recent subsequence of A? Yes
– Sequence B <(ab)(ae)> is a subsequence of A
– The occurring time of itemset (ae) = 6 ≥ Rtime_min and 6 < Rtime_max
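A hedged sketch of the recent-subsequence test in Example 3.2: it looks for an occurrence of the pattern whose last itemset falls at a time t with Rtime_min ≤ t < Rtime_max. The function names and the single-character item encoding are assumptions made for illustration.

```python
from collections import defaultdict

# Data-sequence A from Example 3.2: (item, time, money) triples.
A = [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70),
     ("a", 6, 50), ("e", 6, 90), ("c", 10, 70)]

def items_by_time(seq):
    """Group the items of a data-sequence by transaction time."""
    groups = defaultdict(set)
    for item, time, _money in seq:
        groups[time].add(item)
    return groups

def is_recent_subsequence(seq, pattern, rtime_min, rtime_max):
    """True if `pattern` occurs in `seq` with its last itemset inside the window."""
    groups = items_by_time(seq)
    times = sorted(groups)
    for end in times:
        # The last itemset must be bought at a time inside [rtime_min, rtime_max).
        if not (rtime_min <= end < rtime_max and set(pattern[-1]) <= groups[end]):
            continue
        # Greedily match the earlier itemsets at increasing times before `end`.
        prev, matched = None, True
        for itemset in pattern[:-1]:
            hit = next((t for t in times
                        if t < end and (prev is None or t > prev)
                        and set(itemset) <= groups[t]), None)
            if hit is None:
                matched = False
                break
            prev = hit
        if matched:
            return True
    return False

print(is_recent_subsequence(A, ["ab", "ae"], 5, 8))   # (ae) occurs at time 6
```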
Example 3.3. (recent monetary subsequence)
• Data-sequence A = <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (e, 6, 90), (c, 10, 70)>
• Rtime_min = 5, Rtime_max = 8, M_min = 200, M_max = 250.
• Is sequence B = <(ab)(ae)> a recent monetary subsequence of A? Yes
– Sequence B <(ab)(ae)> is a recent subsequence of A
– The total money of this subsequence = 240 ≥ M_min and 240 < M_max
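The total of 240 in Example 3.3 is 30 + 70 for (ab) at time 4 plus 50 + 90 for (ae) at time 6. The sketch below enumerates every occurrence of a pattern together with its last time and total money, then applies both ranges. The function names are hypothetical, and the half-open ranges follow the "≥ min and < max" wording of the example.

```python
from collections import defaultdict

# Data-sequence A from Example 3.3: (item, time, money) triples.
A = [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70),
     ("a", 6, 50), ("e", 6, 90), ("c", 10, 70)]

def money_by_time(seq):
    """Map each time to {item: money} for the items bought at that time."""
    groups = defaultdict(dict)
    for item, time, money in seq:
        groups[time][item] = money
    return groups

def occurrences(seq, pattern):
    """Yield (last_time, total_money) for every occurrence of `pattern` in `seq`."""
    groups = money_by_time(seq)
    times = sorted(groups)

    def extend(index, after, total):
        if index == len(pattern):
            yield after, total
            return
        for t in times:
            if (after is None or t > after) and all(i in groups[t] for i in pattern[index]):
                spent = sum(groups[t][i] for i in pattern[index])
                yield from extend(index + 1, t, total + spent)

    yield from extend(0, None, 0)

def is_recent_monetary_subsequence(seq, pattern, rtime_min, rtime_max, m_min, m_max):
    """True if some occurrence satisfies both the recency and the monetary range."""
    return any(rtime_min <= last < rtime_max and m_min <= total < m_max
               for last, total in occurrences(seq, pattern))

print(list(occurrences(A, ["ab", "ae"])))
print(is_recent_monetary_subsequence(A, ["ab", "ae"], 5, 8, 200, 250))
```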
Definition 3.1. (f-pattern, rf-pattern, rfm-pattern)
• Let B = <I1I2...Is> be a sequence of itemsets.

Call B an     Contained as a                 Support denoted            Threshold
f-pattern     subsequence                    f-support or B.supf        no less than minsup_min
rf-pattern    recent subsequence             rf-support or B.suprf      no less than minsup_min
rfm-pattern   recent monetary subsequence    rfm-support or B.suprfm    between minsup_min and minsup_max
Example 3.4. (RFM pattern)
• Given a data-sequence DB and six thresholds
• R: Rtime_min = 10 ≤ R < Rtime_max = 21
• M: M_min = 150 ≤ M < M_max = 250
• F: Minsup_min = 2 ≤ F < Minsup_max = 4
• The RFM-patterns are listed as follows:
– Containing 1 itemset = { }
– Containing 2 itemsets = {<(ab)(c)>}
– Containing 3 itemsets = {<(c)(b)(c)>, <(c)(ab)(c)>}
– Containing 4 itemsets = {<(c)(b)(a)(c)>}

Sid  Sequence
10   <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (c, 10, 70)>
20   <(b, 3, 30), (c, 5, 50), (a, 7, 20), (b, 7, 70), (c, 14, 20)>
30   <(a, 8, 40), (b, 8, 50), (b, 16, 20), (c, 20, 100)>
40   <(b, 15, 30), (b, 22, 20), (c, 22, 120)>
50   <(c, 5, 30), (b, 6, 40), (a, 10, 30), (b, 10, 60), (b, 19, 90), (c, 19, 70)>
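The supports behind Example 3.4 can be checked by brute force. The sketch below counts, for each listed pattern, how many data-sequences contain an occurrence that ends in the recency window and whose total money is in the monetary range. It is only a verification aid under the half-open ranges used in the examples, not the RFM-Apriori algorithm; the function names are hypothetical.

```python
from collections import defaultdict

# Transferred sequence DB from Example 3.4: (item, time, money) triples.
DB = {
    10: [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70), ("a", 6, 50), ("c", 10, 70)],
    20: [("b", 3, 30), ("c", 5, 50), ("a", 7, 20), ("b", 7, 70), ("c", 14, 20)],
    30: [("a", 8, 40), ("b", 8, 50), ("b", 16, 20), ("c", 20, 100)],
    40: [("b", 15, 30), ("b", 22, 20), ("c", 22, 120)],
    50: [("c", 5, 30), ("b", 6, 40), ("a", 10, 30), ("b", 10, 60), ("b", 19, 90), ("c", 19, 70)],
}
R_MIN, R_MAX = 10, 21      # recency thresholds
M_MIN, M_MAX = 150, 250    # monetary thresholds
SUP_MIN, SUP_MAX = 2, 4    # support thresholds (number of sequences)

def satisfies(seq, pattern):
    """True if `seq` holds an occurrence of `pattern` meeting the R and M ranges."""
    groups = defaultdict(dict)
    for item, time, money in seq:
        groups[time][item] = money
    times = sorted(groups)

    def search(index, after, total):
        if index == len(pattern):
            return R_MIN <= after < R_MAX and M_MIN <= total < M_MAX
        for t in times:
            if (after is None or t > after) and all(i in groups[t] for i in pattern[index]):
                spent = sum(groups[t][i] for i in pattern[index])
                if search(index + 1, t, total + spent):
                    return True
        return False

    return search(0, None, 0)

def rfm_support(pattern):
    """Number of data-sequences in DB that satisfy the pattern."""
    return sum(satisfies(seq, pattern) for seq in DB.values())

listed = [["ab", "c"], ["c", "b", "c"], ["c", "ab", "c"], ["c", "b", "a", "c"]]
for pattern in listed:
    support = rfm_support(pattern)
    print(pattern, support, SUP_MIN <= support < SUP_MAX)
```

Under these assumptions each of the four listed patterns has a support inside the interval from Minsup_min to Minsup_max, which is consistent with the slide.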
RFM-Apriori Algorithm
• The RFM-Apriori algorithm is developed by modifying the well-known Apriori (GSP) algorithm
• GSP
– Puts all items into C1, the set of candidate f-patterns with length 1, and then scans the database to find the frequent 1-patterns (L1)
– Assume we already have the set of frequent (k-1)-patterns Lk-1. GSP then generates the set of candidate f-patterns Ck by joining Lk-1 with Lk-1
– Afterwards, it scans the database to determine the supports of the patterns in Ck, and then finds Lk
RFM-Apriori Algorithm
[Figure: candidate generation and support counting in Apriori (GSP) versus RFM-Apriori]
• Apriori (GSP): C1 (all items) → L1 → C2 = L1 x L1 → L2 → … → Lk-1 → Ck = Lk-1 x Lk-1 → Lk
• RFM-Apriori: CI1 → LI1 (LI1f, LI1rf, LI1rfm) → CI2 = LI1f x LI1rf → LI2 (LI2rf, LI2rfm) → … → LIk-1 (LIk-1rf, LIk-1rfm) → CIk = LIk-1rf x LIk-1rf → LIk (LIkrf, LIkrfm)
• Support counting: count B.supf (Apriori); count B.suprf and B.suprfm (RFM-Apriori), using an inverse candidate tree
• Let CIk denote the set of candidate rf-patterns with length k in RFM-Apriori
Example 4.1. (Candidate generation - CI2)
• Suppose LI1f = {<a>, <b>, <c>, <(ab)>, <(bc)>} and LI1rf = {<b>, <c>}; then CI2 is as follows:
– CI2 = {<(a)(b)>, <(a)(c)>, <(b)(b)>, <(b)(c)>, <(c)(b)>, <(c)(c)>, <(ab)(b)>, <(ab)(c)>, <(bc)(b)>, <(bc)(c)>}
[Illustration: each pattern in LI1f (a, b, c, ab, bc) is extended with each pattern in LI1rf (b, c)]
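The CI2 set in Example 4.1 is the cross product LI1f x LI1rf: every frequent 1-pattern is followed by every recent-frequent 1-pattern. A minimal sketch, with hypothetical function names and itemsets written as strings:

```python
from itertools import product

LI1_f = ["a", "b", "c", "ab", "bc"]   # frequent 1-patterns from Example 4.1
LI1_rf = ["b", "c"]                   # recent-frequent 1-patterns from Example 4.1

def generate_ci2(li1_f, li1_rf):
    """Append each recent-frequent itemset after each frequent itemset."""
    return [(first, last) for first, last in product(li1_f, li1_rf)]

def show(pattern):
    """Render a pattern in the slide notation, e.g. <(ab)(c)>."""
    return "<" + "".join(f"({itemset})" for itemset in pattern) + ">"

CI2 = generate_ci2(LI1_f, LI1_rf)
print(len(CI2))
print(", ".join(show(p) for p in CI2))
```

Only the last itemset is drawn from LI1rf, presumably because the recency constraint applies to the last transaction of a pattern; the slides do not spell out this reasoning.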
Example 4.2. (Candidate generation - CIk, k>2)
• Suppose LI3rf = {<(b)(a)(c)>, <(c)(a)(c)>, <(b)(b)(c)>, <(c)(b)(c)>, <(b)(ab)(c)>, <(c)(ab)(c)>}; then CI4 is as follows:
– CI4 = {<(b)(c)(a)(c)>, <(c)(b)(a)(c)>, <(b)(b)(a)(c)>, <(c)(c)(a)(c)>, <(b)(b)(b)(c)>, <(b)(c)(b)(c)>, <(c)(b)(b)(c)>, <(c)(c)(b)(c)>, <(b)(b)(ab)(c)>, <(b)(c)(ab)(c)>, <(c)(b)(ab)(c)>, <(c)(c)(ab)(c)>}
• Illustration: joining <(b)(ab)(c)> and <(c)(ab)(c)> from LI3rf contributes {<(b)(c)(ab)(c)>, <(c)(b)(ab)(c)>} to CI4
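The twelve candidates in Example 4.2 are consistent with a join in which two patterns that agree on everything after their first itemset are combined by placing the first itemset of one in front of the other. The sketch below implements only that reading, which is an assumption inferred from the listed output; candidates that would merge items into one itemset, such as <(bc)(a)(c)>, are not listed on the slide and are not generated here.

```python
# LI3rf from Example 4.2; each pattern is a tuple of itemsets written as strings.
LI3_rf = [("b", "a", "c"), ("c", "a", "c"), ("b", "b", "c"),
          ("c", "b", "c"), ("b", "ab", "c"), ("c", "ab", "c")]

def join_on_suffix(patterns):
    """Join patterns sharing the same tail (everything after the first itemset).

    For each ordered pair (p, q) with equal tails, the candidate is the first
    itemset of p followed by all of q. Pairs with p == q are included.
    """
    candidates = []
    for p in patterns:
        for q in patterns:
            if p[1:] == q[1:]:
                candidates.append((p[0],) + q)
    return candidates

CI4 = join_on_suffix(LI3_rf)
print(len(CI4))
for candidate in CI4:
    print("<" + "".join(f"({itemset})" for itemset in candidate) + ">")
```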
RFM-Apriori Algorithm – Example
• Given a data-sequence DB and six thresholds Rtime_min=10, Rtime_max=21, M_min=150, M_max=250, Minsup_min=2 and Minsup_max=4, try to find the patterns that satisfy the RFM constraints
[Figure: worked example showing CI1 and LI1; contents not recoverable]
Synthetic data parameters
Synthetic data parameters settings
|S| = 4, |I| = 1.25, NS = 5000, NI = 25,000, N = 10000, TI = 10, H_price = 1000, M_price = 500, L_price = 100, H_quantity = 1, M_quantity = 3 and L_quantity = 1.
Real-life dataset – SC-POS
• The sales data of a chain supermarket in Taiwan.
• The SC-POS dataset recorded all transactions from twenty branches between 2001/12/27 and 2002/12/31.
• Each record in the SC-POS dataset is the list of one customer’s transactions; each transaction records the purchase date and time and the purchased items.
• After a series of data preprocessing and cleaning tasks, the final dataset contained 17685 items and 33500 customers’ data-sequences.
Test 4.1. Comparing the runtimes and number of patterns of the two algorithms
• Varying minsup_min from 1.25% to 0.5% in synthetic datasets
• Varying minsup_min from 3.5% to 2.5% in real-life dataset.
[Figure: SYN-DS1, runtime (sec) and number of patterns versus minsup_min (0.005 to 0.013), GSP vs. RFM]
[Figure: SC-POS, runtime (sec) and number of patterns versus minsup_min (0.025 to 0.035), GSP vs. RFM]
Annotations on the comparison of RFM with GSP:
• More complicated procedure to generate candidate patterns and compute supports
• Generates fewer candidate and frequent patterns
Test 4.2. Scalability test
• During this test, we vary the value of a selected parameter and keep all the other parameters constant.
• In each test, a parameter is increased to determine how the algorithms scale up as the parameter increases.
– The first test varies the number of customers, |D|, from 250,000 to 750,000
– The second varies the average number of transactions per customer, |C|, from 10 to 20
– The final one varies the average number of items bought per transaction, |T|, from 2.5 to 4.5
[Figure: runtime (sec) and number of patterns versus |D| (250K, 500K, 750K), GSP vs. RFM]
[Figure: runtime (sec) and number of patterns versus |T| (2.5, 3.5, 4.5), GSP vs. RFM]
[Figure: runtime (sec) and number of patterns versus |C| (10, 15, 20), GSP vs. RFM]
• Longer sequences would result in more patterns
Test 4.3. Testing how runtime and number of patterns react when the following parameters vary
• Varying Rtime_min from 75 to 115
• Varying M_min from 1000 to 5000
[Figure: runtime (sec) and number of patterns versus recency time (75 to 115), RFM]
[Figure: runtime (sec) and number of patterns versus Monetary (1000 to 5000), RFM]
• CIk = LIk-1rf x LIk-1rf
Test 4.4. Comparing the number of three kinds of interesting patterns
• (*F*): patterns under the frequency constraint only
• (RF*): patterns under the recency and frequency constraints
• (RFM): patterns under the recency, frequency and monetary constraints
Dataset: C10-T2.5-S4-I1.25

Name                         RF* #   RF* %   RFM #   RFM %   *F* #   *F* %
D=25, minsup_min=0.075       186     22      48      6       835     100%
D=50, minsup_min=0.075       189     24      23      3       796     100%
D=75, minsup_min=0.075       187     24      23      3       793     100%
C=10, minsup_min=0.015       4       4       0       0       99      100%
C=15, minsup_min=0.015       78      22      5       1       360     100%
C=20, minsup_min=0.015       293     29      36      4       1001    100%
T=2.5, minsup_min=0.008      152     22      43      6       700     100%
T=3.5, minsup_min=0.008      639     28      204     9       2273    100%
T=4.5, minsup_min=0.001      1122    25      455     10      4554    100%
SC-POS, minsup_min=0.015     168     10      14      1       1704    100%
Test 4.5. Segment the discovered patterns by RFM constraints as follows

Division  R         F             M
1         0-75      0.007-0.008   0-100
2         75-150    0.008-0.009   100-200
3         150-225   0.009-0.01    200-300
4         225-300   0.01-0.02     300-400
5         300-360   0.02-1        400-

RFM-segmentation (R-F-M)   # of patterns
1-1-1                      50
3-3-3                      0
5-5-5                      3
1-5-1                      40
5-1-1                      97
1-1-5                      0
5-1-5                      17
1-5-5                      0
5-5-1                      22
5-3-5                      4
3-5-5                      0
5-5-3                      3
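Mapping a pattern's R, F and M values to a segment label such as 5-1-1 can be sketched with a binary search over the division boundaries from the table above. The function names are hypothetical, and treating each division as inclusive at its lower edge and exclusive at its upper edge is an assumption, since the slide does not say which division a boundary value belongs to.

```python
from bisect import bisect_right

# Inner division boundaries taken from the Test 4.5 table.
R_EDGES = [75, 150, 225, 300]
F_EDGES = [0.008, 0.009, 0.01, 0.02]
M_EDGES = [100, 200, 300, 400]

def division(value, edges):
    """Return the division number 1..5 for `value`.

    Assumption: a value equal to a boundary goes to the higher division.
    Values outside the outermost ranges of the table are not rejected here.
    """
    return bisect_right(edges, value) + 1

def segment(r, f, m):
    """Return the R-F-M segment label, e.g. '5-1-1'."""
    return f"{division(r, R_EDGES)}-{division(f, F_EDGES)}-{division(m, M_EDGES)}"

# Made-up values, for illustration only.
print(segment(r=320, f=0.0075, m=50))
```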
Managerial Applications
• Growing patterns (RFM)
– A(BC) in segments 122, 233, 334, 445, 555
• Weakening patterns
– A(BC) in segments 134, 233, 322, 421, 511
• Dead patterns
– A(BC) in segments 123, 211
• Emerging patterns
– A(BC) in segments 412, 523
Managerial Applications (continued)
• Stable patterns
– A(BC) in segments 132, 232, 332, 432, 532
• Sort all patterns with R=3 according to M
• Sort all patterns with R=3 according to F
Conclusion
• We have developed an efficient algorithm for mining frequent patterns with consideration of Recency and Monetary.
• These two factors can help users identify those patterns which are active recently and have high monetary value
• Besides, the experiments showed our approach is more efficient than the traditional GSP algorithm.
Thanks for your attention!!!!!