Discovering RFM Sequential Patterns From Customers’ Purchasing Data, Prof. Yen-Liang Chen (陳彥良), Department of Information Management, National Central University, Date: 2015/10/14

Discovering RFM Sequential Patterns From Customers’ Purchasing Data

Prof. Yen-Liang Chen (陳彥良), Department of Information Management, National Central University

Date: 112/04/21


Agenda

• Introduction

• Related Work

• Problem Definition

• Algorithm

• Performance Evaluation

• Conclusion

Sequential Pattern Mining (1)

• Sequential pattern mining
– To find the relationships between occurrences of sequential events
– To find whether there exists any specific order of the occurrences

• Example
– Every time Microsoft stock drops 5%, IBM stock will also drop at least 4% within three days.

Sequential Pattern Mining (2)

• Applications of sequential pattern mining
– Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
– Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
– Telephone calling patterns, Weblog click streams
– DNA sequences and gene structures

Sequential Patterns vs. Association Rules

• Association rules: relationships within a transaction
– Which items are bought together? ( , )

• Sequential patterns: correlation between transactions
– Which items are bought in a certain order? < , >

[Table: CID and Purchased Items for customers 1 and 2; the item pictures are not recoverable.]

What Is Sequential Pattern Mining?

• Given a set of sequences, find the complete set of frequent subsequences

• A sequence database

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

• A sequence: <(ef)(ab)(df)cb>
– An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.

• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
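The subsequence test and the support count on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the names `contains`, `support` and `database` are made up here, and each element such as (abc) is written as the string "abc".

```python
def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Both arguments are lists of itemsets written as strings, e.g. "abc" for (abc).
    """
    pos = 0
    for itemset in pattern:
        # Advance to the next element that includes every item of `itemset`.
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # later itemsets must match strictly later elements
    return True


# The sequence database from the slide, keyed by SID.
database = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def support(pattern):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database.values())


print(contains(database[10], ["a", "bc", "d", "c"]))  # True
print(support(["ab", "c"]))                           # 2
```

Matching each itemset at the earliest possible element is enough for a yes/no containment test; only sequences 10 and 30 contain <(ab)c>, which gives the support of 2 quoted above.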

An SPM Example and the Problems

• Traditional SPM methods consider only the frequencies of the maximal sequential patterns, which causes two problems:
– In real life the environment changes constantly, and users’ behavior also changes over time
– Many of the discovered patterns are of little value

RFM Definition in Marketing by Bult and Wansbeek

• R (Recency): period from the last purchase to now
– R↓: higher possibility that the customer makes a repeat purchase

• F (Frequency): number of purchases made in a certain period
– F↑: the customer has higher loyalty

• M (Monetary): the amount of money spent during a certain period
– M↑: the customer is more important

The Proposed Algorithm: RFM-SPM

• Traditional SPM uses only a frequency constraint; RFM-SPM uses frequency, recency and monetary constraints

• Each constraint has two thresholds
– An upper threshold and a lower threshold
– They ensure that the considered factor is restricted to a specified range

• By setting these three factors to different intervals, we can discover the patterns we are interested in

Recency Constraint

• Specified by giving a range from Rtime_min to Rtime_max, which are the number of days away from the starting date of the sequence database.

[Figure: timeline of the sequence DB from the starting date (2001/12/27) to the ending date (2002/12/31). With Rtime_min = 200 and Rtime_max = 270, the recency interval runs from 2001/12/27+200 to 2001/12/27+270.]

• This ensures that the last transaction of the pattern occurred in this interval
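As a small worked example, the day offsets on the timeline can be turned into calendar dates. The sketch below is illustrative only: the helper names `recency_window` and `in_window` are hypothetical, and the half-open interval (Rtime_min inclusive, Rtime_max exclusive) is an assumption borrowed from the inequalities used in the later examples.

```python
from datetime import date, timedelta

START = date(2001, 12, 27)  # starting date of the sequence database


def recency_window(rtime_min, rtime_max, start=START):
    """Calendar dates that correspond to the two recency thresholds."""
    return start + timedelta(days=rtime_min), start + timedelta(days=rtime_max)


def in_window(day, rtime_min, rtime_max, start=START):
    """True if `day` lies rtime_min <= offset < rtime_max days after `start`."""
    offset = (day - start).days
    return rtime_min <= offset < rtime_max


first_day, end_day = recency_window(200, 270)
print(first_day, end_day)                        # 2002-07-15 2002-09-23
print(in_window(date(2002, 8, 1), 200, 270))     # True
print(in_window(date(2002, 12, 31), 200, 270))   # False
```

With these thresholds, a pattern only counts for a customer if its last transaction falls between mid-July and late September 2002.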

Monetary Constraint

• Given by a range from M_min to M_max. It ensures that the value of a discovered pattern lies between M_min and M_max.

• Suppose the pattern is <(a), (bc)>. A data sequence satisfies this pattern with respect to the monetary constraint if it contains an occurrence of <(a), (bc)> whose value is within this range.

Frequency Constraint

• The frequency of a pattern is the percentage of sequences in the database that satisfy the pattern under the recency and monetary constraints.

• A pattern is output as an RFM-pattern if its frequency falls within the interval from minsup_min to minsup_max.

An Example of an RFM-Pattern

• 30% of customers who bought a computer recently came back to buy a scanner and a microphone, and the total amount of these products is greater than NT$55,000.

Related Work

• Cluster (R, F, M)
– Groups customers with similar needs and/or characteristics that are likely to exhibit similar purchasing behaviors

• Classification (R, F, M)
– Classifies customers into different categories of customer value; the classifiers are also used to classify unseen cases

• Association rule (M)
– Extracting Share Frequent Itemsets with Infrequent Subsets

• SPM
– Constraint-Based Sequential Pattern Mining: the Consideration of Recency and Compactness (R, F)
– Discovering RFM sequential patterns from customers’ purchasing data

Data-Sequence in RFM-SPM

Traditional sequence DB

Sid   Sequence
10    <(a) (c) (ab) (a) (c)>
20    <(b) (c) (ab) (c)>
30    <(ab) (b) (c)>
40    <(b) (bc)>
50    <(c) (b) (ab) (bc)>

Transferred sequence DB, where each triple is (item, time, money)

Sid   Sequence
10    <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (c, 10, 70)>
20    <(b, 3, 30), (c, 5, 50), (a, 7, 20), (b, 7, 70), (c, 14, 20)>
30    <(a, 8, 40), (b, 8, 50), (b, 16, 20), (c, 20, 100)>
40    <(b, 15, 30), (b, 22, 20), (c, 22, 120)>
50    <(c, 5, 30), (b, 6, 40), (a, 10, 30), (b, 10, 60), (b, 19, 90), (c, 19, 70)>
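The relation between the two representations can be sketched as follows: items that share a time stamp form one itemset of the traditional sequence. The function name `to_traditional` and the tuple layout (item, time, money) are assumptions made for this illustration.

```python
from itertools import groupby

# Transferred sequence DB from the slide: lists of (item, time, money) triples.
transferred_db = {
    10: [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70), ("a", 6, 50), ("c", 10, 70)],
    20: [("b", 3, 30), ("c", 5, 50), ("a", 7, 20), ("b", 7, 70), ("c", 14, 20)],
    30: [("a", 8, 40), ("b", 8, 50), ("b", 16, 20), ("c", 20, 100)],
    40: [("b", 15, 30), ("b", 22, 20), ("c", 22, 120)],
    50: [("c", 5, 30), ("b", 6, 40), ("a", 10, 30), ("b", 10, 60), ("b", 19, 90), ("c", 19, 70)],
}


def to_traditional(data_sequence):
    """Group the items by time stamp; each group becomes one itemset string."""
    ordered = sorted(data_sequence, key=lambda triple: (triple[1], triple[0]))
    return [
        "".join(item for item, _time, _money in group)
        for _time, group in groupby(ordered, key=lambda triple: triple[1])
    ]


for sid, sequence in transferred_db.items():
    print(sid, to_traditional(sequence))
```

For Sid 20 this yields the itemsets b, c, ab, c: the items a and b both carry time stamp 7, so they belong to the same itemset.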

An Overview of Problem Definition

• Containment of itemset
• Subsequence
• Recent subsequence
• Recent monetary subsequence

Example 3.1. (subsequence)

• Data-sequence A = <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (e, 6, 90), (c, 10, 70)>

• Is itemset (ab) contained in A? Yes

• Is sequence B = <(ab)(ae)> a subsequence of A? Yes
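Both answers in Example 3.1 can be checked with a short sketch. It assumes that an itemset is contained in A when all of its items share one time stamp, and that the itemsets of B must be matched at strictly increasing time stamps; the helper names are hypothetical.

```python
from collections import defaultdict


def to_transactions(data_sequence):
    """Group (item, time, money) triples into one item set per time stamp."""
    by_time = defaultdict(set)
    for item, time, _money in data_sequence:
        by_time[time].add(item)
    return [by_time[time] for time in sorted(by_time)]


def is_subsequence(pattern, data_sequence):
    """True if the itemsets of `pattern` occur in order at increasing times."""
    transactions = to_transactions(data_sequence)
    pos = 0
    for itemset in pattern:
        while pos < len(transactions) and not set(itemset) <= transactions[pos]:
            pos += 1
        if pos == len(transactions):
            return False
        pos += 1  # the next itemset must occur at a later time stamp
    return True


A = [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70),
     ("a", 6, 50), ("e", 6, 90), ("c", 10, 70)]

print(is_subsequence(["ab"], A))        # True: (ab) occurs at time 4
print(is_subsequence(["ab", "ae"], A))  # True: (ab) at time 4, (ae) at time 6
print(is_subsequence(["ae", "ab"], A))  # False: no (ab) after time 6
```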

Example 3.2. (recent subsequence)

• Data-sequence A = <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (e, 6, 90), (c, 10, 70)>

• Rtime_min = 5 and Rtime_max = 8.

• Is sequence B = <(ab)(ae)> a recent subsequence of A? Yes
– Sequence B <(ab)(ae)> is a subsequence of A
– The occurring time of the last itemset (ae) is 6, and 6 ≥ Rtime_min and 6 < Rtime_max
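A minimal sketch of the recency check in Example 3.2, under the assumption that the last itemset of B must occur at a time t with Rtime_min ≤ t < Rtime_max while the remaining itemsets occur, in order, at earlier time stamps. The function names are hypothetical.

```python
from collections import defaultdict


def to_transactions(data_sequence):
    """Return a time-ordered list of (time, set of items)."""
    by_time = defaultdict(set)
    for item, time, _money in data_sequence:
        by_time[time].add(item)
    return sorted(by_time.items())


def contains(transactions, pattern):
    """Greedy in-order match of `pattern` against (time, items) pairs."""
    pos = 0
    for itemset in pattern:
        while pos < len(transactions) and not set(itemset) <= transactions[pos][1]:
            pos += 1
        if pos == len(transactions):
            return False
        pos += 1
    return True


def is_recent_subsequence(pattern, data_sequence, rtime_min, rtime_max):
    """True if some occurrence of the non-empty `pattern` ends inside the window."""
    transactions = to_transactions(data_sequence)
    *prefix, last = pattern
    for i, (time, items) in enumerate(transactions):
        if rtime_min <= time < rtime_max and set(last) <= items:
            # The earlier itemsets must fit before this time stamp.
            if contains(transactions[:i], prefix):
                return True
    return False


A = [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70),
     ("a", 6, 50), ("e", 6, 90), ("c", 10, 70)]

print(is_recent_subsequence(["ab", "ae"], A, 5, 8))   # True: (ae) occurs at time 6
print(is_recent_subsequence(["ab", "ae"], A, 7, 11))  # False: no (ae) in [7, 11)
```

Fixing the position of the last itemset first and then matching the prefix before it avoids the pitfall of a purely greedy match, which could end at a time outside the window even though a later occurrence ends inside it.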

Example 3.3. (recent monetary subsequence)

• Data-sequence A = <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (e, 6, 90), (c, 10, 70)>

• Rtime_min = 5, Rtime_max = 8, M_min = 200, M_max = 250.

• Is sequence B = <(ab)(ae)> a recent monetary subsequence of A? Yes
– Sequence B <(ab)(ae)> is a recent subsequence of A
– The total money of this subsequence is 240, and 240 ≥ M_min and 240 < M_max
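The monetary check can be sketched by enumerating every occurrence of B in A together with its ending time and total money. This is a sketch under stated assumptions: the money of an occurrence is taken to be the sum over the matched items only, one and the same occurrence has to satisfy both the recency and the monetary range, and the function names are hypothetical.

```python
from collections import defaultdict


def to_transactions(data_sequence):
    """Return a time-ordered list of (time, {item: money})."""
    by_time = defaultdict(dict)
    for item, time, money in data_sequence:
        by_time[time][item] = money
    return sorted(by_time.items())


def occurrences(transactions, pattern, start=0):
    """Yield (last_time, total_money) for every occurrence of a non-empty pattern."""
    first, rest = pattern[0], pattern[1:]
    for i in range(start, len(transactions)):
        time, prices = transactions[i]
        if all(item in prices for item in first):
            money = sum(prices[item] for item in first)
            if not rest:
                yield time, money
            else:
                for last_time, rest_money in occurrences(transactions, rest, i + 1):
                    yield last_time, money + rest_money


def is_recent_monetary_subsequence(pattern, data_sequence,
                                   rtime_min, rtime_max, m_min, m_max):
    """True if one occurrence satisfies both half-open ranges."""
    return any(
        rtime_min <= last_time < rtime_max and m_min <= money < m_max
        for last_time, money in occurrences(to_transactions(data_sequence), pattern)
    )


A = [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70),
     ("a", 6, 50), ("e", 6, 90), ("c", 10, 70)]
B = ["ab", "ae"]

print(list(occurrences(to_transactions(A), B)))               # [(6, 240)]
print(is_recent_monetary_subsequence(B, A, 5, 8, 200, 250))   # True
print(is_recent_monetary_subsequence(B, A, 5, 8, 250, 300))   # False
```

The single occurrence of B ends at time 6 and is worth 30 + 70 + 50 + 90 = 240, which matches the figures in the example.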

Definition 3.1. (f-pattern, rf-pattern, rfm-pattern)

• Let B = <I1 I2 ... Is> be a sequence of itemsets.

Call B an     Contain B as a                 Denote                     Threshold
f-pattern     subsequence                    f-support or B.supf        no less than minsup_min
rf-pattern    recent subsequence             rf-support or B.suprf      no less than minsup_min
rfm-pattern   recent monetary subsequence    rfm-support or B.suprfm    between minsup_min and minsup_max

Example 3.4. (RFM pattern)

• Given a data-sequence DB and six thresholds
– R: from Rtime_min = 10 (inclusive) to Rtime_max = 21 (exclusive)
– M: from M_min = 150 (inclusive) to M_max = 250 (exclusive)
– F: from Minsup_min = 2 (inclusive) to Minsup_max = 4 (exclusive)

• The RFM-patterns are listed as follows:
– Containing 1 itemset = { }
– Containing 2 itemsets = {<(ab)(c)>}
– Containing 3 itemsets = {<(c)(b)(c)>, <(c)(ab)(c)>}
– Containing 4 itemsets = {<(c)(b)(a)(c)>}

Sid   Sequence
10    <(a, 1, 10), (c, 3, 40), (a, 4, 30), (b, 4, 70), (a, 6, 50), (c, 10, 70)>
20    <(b, 3, 30), (c, 5, 50), (a, 7, 20), (b, 7, 70), (c, 14, 20)>
30    <(a, 8, 40), (b, 8, 50), (b, 16, 20), (c, 20, 100)>
40    <(b, 15, 30), (b, 22, 20), (c, 22, 120)>
50    <(c, 5, 30), (b, 6, 40), (a, 10, 30), (b, 10, 60), (b, 19, 90), (c, 19, 70)>
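The supports of the four listed patterns can be recomputed with a brute-force sketch. It is not the RFM-Apriori algorithm: it simply enumerates occurrences, assumes that a single occurrence must satisfy both the recency and the monetary range, treats the support thresholds as absolute counts as in this example, and uses hypothetical function names. It checks only the listed patterns and does not attempt to enumerate all RFM-patterns.

```python
from collections import defaultdict

DB = {
    10: [("a", 1, 10), ("c", 3, 40), ("a", 4, 30), ("b", 4, 70), ("a", 6, 50), ("c", 10, 70)],
    20: [("b", 3, 30), ("c", 5, 50), ("a", 7, 20), ("b", 7, 70), ("c", 14, 20)],
    30: [("a", 8, 40), ("b", 8, 50), ("b", 16, 20), ("c", 20, 100)],
    40: [("b", 15, 30), ("b", 22, 20), ("c", 22, 120)],
    50: [("c", 5, 30), ("b", 6, 40), ("a", 10, 30), ("b", 10, 60), ("b", 19, 90), ("c", 19, 70)],
}
RTIME_MIN, RTIME_MAX = 10, 21
M_MIN, M_MAX = 150, 250
MINSUP_MIN, MINSUP_MAX = 2, 4


def to_transactions(data_sequence):
    """Return a time-ordered list of (time, {item: money})."""
    by_time = defaultdict(dict)
    for item, time, money in data_sequence:
        by_time[time][item] = money
    return sorted(by_time.items())


def occurrences(transactions, pattern, start=0):
    """Yield (last_time, total_money) for every occurrence of a non-empty pattern."""
    first, rest = pattern[0], pattern[1:]
    for i in range(start, len(transactions)):
        time, prices = transactions[i]
        if all(item in prices for item in first):
            money = sum(prices[item] for item in first)
            if not rest:
                yield time, money
            else:
                for last_time, rest_money in occurrences(transactions, rest, i + 1):
                    yield last_time, money + rest_money


def supports(pattern):
    """Return (f-support, rf-support, rfm-support) of `pattern` as counts."""
    f = rf = rfm = 0
    for data_sequence in DB.values():
        found = list(occurrences(to_transactions(data_sequence), pattern))
        recent = [(t, m) for t, m in found if RTIME_MIN <= t < RTIME_MAX]
        f += bool(found)
        rf += bool(recent)
        rfm += any(M_MIN <= m < M_MAX for _t, m in recent)
    return f, rf, rfm


listed = [["ab", "c"], ["c", "b", "c"], ["c", "ab", "c"], ["c", "b", "a", "c"]]
for pattern in listed:
    f, rf, rfm = supports(pattern)
    print(pattern, "rfm-support:", rfm, "in range:", MINSUP_MIN <= rfm < MINSUP_MAX)
```

Under these assumptions the four patterns have rfm-supports of 3, 2, 3 and 2, all within the range from Minsup_min to Minsup_max.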

RFM-Apriori Algorithm

• The RFM-Apriori algorithm is developed by modifying the well-known Apriori (GSP) algorithm

• GSP
– Puts all items into C1, the set of candidate f-patterns with length 1, and then scans the database to find the frequent 1-patterns (L1)
– Assume we already have the set of frequent (k-1)-patterns Lk-1. GSP then generates the set of candidate f-patterns Ck by joining Lk-1 with Lk-1
– Afterwards, it scans the database to determine the supports of the patterns in Ck, and then finds Lk

RFM-Apriori Algorithm (2)

[Figure: candidate generation and support counting in Apriori (GSP) compared with RFM-Apriori.]

• Apriori: All items → C1 → L1 → C2 → L2 → … → Lk-1 → Ck → Lk
– Candidate generation: C2 = L1 x L1, Ck = Lk-1 x Lk-1

• RFM-Apriori: CI1 → LI1 (LI1^f, LI1^rf, LI1^rfm) → CI2 → LI2 (LI2^rf, LI2^rfm) → … → LIk-1 (LIk-1^rf, LIk-1^rfm) → CIk → LIk (LIk^rf, LIk^rfm)
– Candidate generation: CI2 = LI1^f x LI1^rf, CIk = LIk-1^rf x LIk-1^rf
– Support counting: count B.supf, B.suprf and B.suprfm (inverse candidate tree)

• Let CIk denote the set of candidate rf-patterns with length k in RFM-Apriori

Example 4.1. (Candidate generation: CI2)

• Suppose LI1^f = {<a>, <b>, <c>, <(ab)>, <(bc)>} and LI1^rf = {<b>, <c>}. Then CI2 is as follows:
– CI2 = {<(a)(b)>, <(a)(c)>, <(b)(b)>, <(b)(c)>, <(c)(b)>, <(c)(c)>, <(ab)(b)>, <(ab)(c)>, <(bc)(b)>, <(bc)(c)>}

[Illustration: each itemset of LI1^f (a, b, c, ab, bc) is paired with each itemset of LI1^rf (b, c).]
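The pairing in Example 4.1 is a Cartesian product, which can be sketched as below. The names `generate_ci2` and `show` are made up for this illustration; itemsets are written as strings.

```python
from itertools import product

LI1_f = ["a", "b", "c", "ab", "bc"]  # frequent itemsets (f-patterns of length 1)
LI1_rf = ["b", "c"]                  # those that are also rf-patterns


def generate_ci2(f_patterns, rf_patterns):
    """Pair every f-pattern (first itemset) with every rf-pattern (last itemset)."""
    return [(first, last) for first, last in product(f_patterns, rf_patterns)]


def show(pattern):
    """Format a tuple of itemsets the way the slides write a sequence."""
    return "<" + "".join(f"({itemset})" for itemset in pattern) + ">"


CI2 = generate_ci2(LI1_f, LI1_rf)
print(len(CI2))                        # 10
print(", ".join(show(p) for p in CI2))
```

Only the last itemset has to come from LI1^rf, because the recency constraint applies to the last transaction of a pattern; the first itemset only needs to be frequent.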

Example 4.2. (Candidate generation: CIk, k > 2)

• Suppose LI3^rf = {<(b)(a)(c)>, <(c)(a)(c)>, <(b)(b)(c)>, <(c)(b)(c)>, <(b)(ab)(c)>, <(c)(ab)(c)>}. Then CI4 is as follows:
– CI4 = {<(b)(c)(a)(c)>, <(c)(b)(a)(c)>, <(b)(b)(a)(c)>, <(c)(c)(a)(c)>, <(b)(b)(b)(c)>, <(b)(c)(b)(c)>, <(c)(b)(b)(c)>, <(c)(c)(b)(c)>, <(b)(b)(ab)(c)>, <(b)(c)(ab)(c)>, <(c)(b)(ab)(c)>, <(c)(c)(ab)(c)>}

• Illustration: <(b)(ab)(c)> and <(c)(ab)(c)> in LI3^rf generate {<(b)(c)(ab)(c)>, <(c)(b)(ab)(c)>} in CI4
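The slide does not spell out the join rule, but the listed CI4 is reproduced by the following guess: two rf-patterns that agree on everything after their first itemset are combined by placing one first itemset in front of the other. The sketch below implements that inferred rule; `generate_candidates` is a hypothetical name, and the rule itself is an assumption read off this example.

```python
from collections import defaultdict

LI3_rf = [
    ("b", "a", "c"), ("c", "a", "c"),
    ("b", "b", "c"), ("c", "b", "c"),
    ("b", "ab", "c"), ("c", "ab", "c"),
]


def generate_candidates(rf_patterns):
    """Join patterns that share the same suffix (everything after the first itemset)."""
    heads_by_suffix = defaultdict(list)
    for pattern in rf_patterns:
        heads_by_suffix[pattern[1:]].append(pattern[0])
    candidates = []
    for suffix, heads in heads_by_suffix.items():
        for x in heads:
            for y in heads:  # x == y is allowed, e.g. <(b)(b)(a)(c)>
                candidates.append((x, y) + suffix)
    return candidates


CI4 = generate_candidates(LI3_rf)
print(len(CI4))  # 12
for candidate in CI4:
    print("<" + "".join(f"({itemset})" for itemset in candidate) + ">")
```

Growing patterns at the front keeps the last itemset fixed, so every candidate still ends with an itemset that already met the recency constraint.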

RFM-Apriori Algorithm – Example

• Given a data-sequence DB and six thresholds Rtime_min = 10, Rtime_max = 21, M_min = 150, M_max = 250, Minsup_min = 2 and Minsup_max = 4, try to find the patterns that satisfy the RFM constraints

[Figure: worked example showing the tables CI1 and LI1.]

Synthetic data parameters


Synthetic data parameters settings

|S| = 4, |I| = 1.25, NS = 5000, NI = 25,000, N = 10000, TI = 10, H_price = 1000, M_price = 500, L_price = 100, H_quantity = 1, M_quantity = 3 and L_quantity = 1.



Real-life dataset – SC-POS

• The sales data of a supermarket chain in Taiwan.

• The SC-POS dataset recorded all transactions from twenty branches between 2001/12/27 and 2002/12/31.

• Each data-sequence in the SC-POS dataset is the list of one customer’s transactions, each of which records the purchase date and time and the purchased items.

• After a series of data preprocessing and cleaning tasks, the final dataset contained 17,685 items and 33,500 customers’ data-sequences.

Test 4.1. Comparing the runtimes and number of patterns of the two algorithms

• Varying minsup_min from 1.25% to 0.5% in synthetic datasets

• Varying minsup_min from 3.5% to 2.5% in real-life dataset.


[Figure: run time (sec) and number of patterns versus minsup_min for GSP and RFM, on SYN-DS1 (minsup_min from 0.005 to 0.013) and on SC-POS (minsup_min from 0.025 to 0.035).]

• RFM-Apriori uses a more complicated procedure to generate candidate patterns and compute supports

• RFM-Apriori generates fewer candidate and frequent patterns

Test 4.2. Scalability test

• During this test, we vary the value of a selected parameter and keep all the other parameters constant.

• In each test, a parameter is increased to determine how the algorithms scale up as the parameter increases.
– The first test varies the number of customers, |D|, from 250,000 to 750,000
– The second varies the average number of transactions per customer, |C|, from 10 to 20
– The final one varies the average number of items bought per transaction, |T|, from 2.5 to 4.5


[Figure: run time (sec) and number of patterns for GSP and RFM when varying |D| (250K, 500K, 750K), |T| (2.5, 3.5, 4.5) and |C| (10, 15, 20).]

• Longer sequences would result in more patterns

Test 4.3. Testing how runtime and the number of patterns react to varying the following parameters

• Varying Rtime_min from 75 to 115

• Varying M_min from 1000 to 5000

[Figure: run time (sec) and number of patterns for RFM versus recency time (75, 85, 95, 105, 115) and versus Monetary (1000 to 5000).]

• CIk = LIk-1^rf x LIk-1^rf

Test 4.4. Comparing the number of three kinds of interesting patterns

• (*F*)

• (RF*)

• (RFM)


Dataset: C10-T2.5-S4-I1.25

Name                        RF* # of patterns   RF* %   RFM # of patterns   RFM %   *F* # of patterns   *F* %
D=25, minsup_min=0.075      186                 22      48                  6       835                 100
D=50, minsup_min=0.075      189                 24      23                  3       796                 100
D=75, minsup_min=0.075      187                 24      23                  3       793                 100
C=10, minsup_min=0.015      4                   4       0                   0       99                  100
C=15, minsup_min=0.015      78                  22      5                   1       360                 100
C=20, minsup_min=0.015      293                 29      36                  4       1001                100
T=2.5, minsup_min=0.008     152                 22      43                  6       700                 100
T=3.5, minsup_min=0.008     639                 28      204                 9       2273                100
T=4.5, minsup_min=0.001     1122                25      455                 10      4554                100
SC-POS, minsup_min=0.015    168                 10      14                  1       1704                100

Test 4.5. Segment the discovered patterns by RFM constraints as follows

Divisions   R         F             M
1           0-75      0.007-0.008   0-100
2           75-150    0.008-0.009   100-200
3           150-225   0.009-0.01    200-300
4           225-300   0.01-0.02     300-400
5           300-360   0.02-1        400-

RFM-segmentation (R-F-M)   # of patterns
1-1-1                      50
3-3-3                      0
5-5-5                      3
1-5-1                      40
5-1-1                      97
1-1-5                      0
5-1-5                      17
1-5-5                      0
5-5-1                      22
5-3-5                      4
3-5-5                      0
5-5-3                      3

Managerial Applications

• Growing patterns (RFM)
– A(BC) in segments 122, 233, 334, 445, 555

• Weakening patterns
– A(BC) in segments 134, 233, 322, 421, 511

• Dead patterns
– A(BC) in segments 123, 211

• Emerging patterns
– A(BC) in segments 412, 523

• Stable patterns
– A(BC) in segments 132, 232, 332, 432, 532

• Sort all patterns with R=3 according to M

• Sort all patterns with R=3 according to F

Conclusion

• We have developed an efficient algorithm for mining frequent patterns with consideration of Recency and Monetary.

• These two factors can help users identify patterns that have been active recently and have high monetary value.

• In addition, the experiments showed that our approach is more efficient than the traditional GSP algorithm.

Thanks for your attention!!!!!
