
Customer Segmentation using RFM analysis

This is a transactional data set containing all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

We will create customer segments using Recency, Frequency and Monetary (RFM) analysis to better understand our customer base. This knowledge can then be used to target customers: to retain them, pitch offers, and so on.

Source

UCI ML Repo - Online Retail Data Set

Attribute Information:

  • InvoiceNo: Invoice number. Nominal; a 6-digit integer uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
  • StockCode: Product (item) code. Nominal; a 5-digit integer uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantity of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice date and time. Numeric; the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric; product price per unit in sterling.
  • CustomerID: Customer number. Nominal; a 5-digit integer uniquely assigned to each customer.
  • Country: Country name. Nominal; the name of the country where each customer resides.

Importing Libraries

In [1]:
import numpy as np
import pandas as pd


import time, warnings
import datetime as dt

#visualizations
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
import seaborn as sns

warnings.filterwarnings("ignore")

Read the data

In [2]:
data = pd.read_csv('../data/commercial_data.csv')
data.head()
Out[2]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 545220 21955 DOORMAT UNION JACK GUNS AND ROSES 2 3/1/2011 8:30 7.95 14620.0 United Kingdom
1 545220 48194 DOORMAT HEARTS 2 3/1/2011 8:30 7.95 14620.0 United Kingdom
2 545220 22556 PLASTERS IN TIN CIRCUS PARADE 12 3/1/2011 8:30 1.65 14620.0 United Kingdom
3 545220 22139 RETROSPOT TEA SET CERAMIC 11 PC 3 3/1/2011 8:30 4.95 14620.0 United Kingdom
4 545220 84029G KNITTED UNION FLAG HOT WATER BOTTLE 4 3/1/2011 8:30 3.75 14620.0 United Kingdom

Checking for cancelled orders

In [3]:
data[data.InvoiceNo.str.contains(r'\D', na=False)]
Out[3]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
173363 A563185 B Adjust bad debt 1 8/12/2011 14:50 11062.06 NaN United Kingdom
173364 A563186 B Adjust bad debt 1 8/12/2011 14:51 -11062.06 NaN United Kingdom
173365 A563187 B Adjust bad debt 1 8/12/2011 14:52 -11062.06 NaN United Kingdom

No cancelled orders, but there are some bad-debt corrections; note, however, that these rows are missing the customer ID.

In [4]:
data[data.CustomerID.isna()].head(10)
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
73 545230 20960 WATERMELON BATH SPONGE 1 3/1/2011 9:55 2.46 NaN United Kingdom
74 545230 21082 SET/20 FRUIT SALAD PAPER NAPKINS 1 3/1/2011 9:55 1.63 NaN United Kingdom
75 545230 21488 RED WHITE SCARF HOT WATER BOTTLE 1 3/1/2011 9:55 8.29 NaN United Kingdom
76 545230 35970 ZINC FOLKART SLEIGH BELLS 1 3/1/2011 9:55 4.13 NaN United Kingdom
77 545230 82583 HOT BATHS METAL SIGN 1 3/1/2011 9:55 4.13 NaN United Kingdom
78 545230 82583 HOT BATHS METAL SIGN 7 3/1/2011 9:55 4.96 NaN United Kingdom
338 545299 21730 GLASS STAR FROSTED T-LIGHT HOLDER 1 3/1/2011 12:19 4.95 NaN United Kingdom
540 545315 82482 WOODEN PICTURE FRAME WHITE FINISH 2 3/1/2011 14:14 4.96 NaN United Kingdom
541 545315 82600 NO SINGING METAL SIGN 1 3/1/2011 14:14 4.13 NaN United Kingdom
542 545315 84969 BOX OF 6 ASSORTED COLOUR TEASPOONS 1 3/1/2011 14:14 8.29 NaN United Kingdom
In [5]:
data.CustomerID.isna().sum()
Out[5]:
59942

Why are there missing CustomerIDs?

  • possibly guest checkout feature on the website

What could be possible features that you would collect if you want to segment "guest" customers?

  • browser, IP, location, cookie

Remove rows where CustomerID is NA

In [6]:
data.dropna(subset=['CustomerID'], inplace=True)
data.shape
Out[6]:
(176137, 8)

RFM Analysis

RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. It helps identify the customers who are most likely to respond to promotions, and the segments can also drive future personalization.

RECENCY (R): Days since last purchase

FREQUENCY (F): Total number of purchases

MONETARY VALUE (M): Total money this customer spent.

We will compute these three attributes for each customer.

Recency

To calculate recency, we need to choose a reference date from which to count how many days ago the customer's last purchase occurred.

Find the latest date in the data to use as the reference

In [7]:
data.InvoiceDate.max()
Out[7]:
'9/9/2011 9:52'
In [8]:
now = dt.date(2011, 12, 9)
print(now)
2011-12-09
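Worth noting: `InvoiceDate` is read in as strings, so `.max()` above compares text, not time ('9/...' sorts after '12/...'). A minimal sketch with made-up sample dates shows the pitfall and the fix via `pd.to_datetime`:

```python
import pandas as pd

# InvoiceDate arrives as strings like "3/1/2011 8:30" (month/day/year),
# so .max() on the raw column picks the lexicographic "winner"
dates = pd.Series(["3/1/2011 8:30", "12/9/2011 9:52", "9/9/2011 9:52"])
print(dates.max())  # '9/9/2011 9:52' -- not the latest timestamp

# parsing first gives the true maximum
parsed = pd.to_datetime(dates, format="%m/%d/%Y %H:%M")
print(parsed.max())  # 2011-12-09 09:52:00
```

This is why the reference date is set by hand above rather than taken from the raw string column.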

Create a new column called date which contains the date of invoice only

In [9]:
data['date'] = pd.DatetimeIndex(data.InvoiceDate).date

data.head()
Out[9]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country date
0 545220 21955 DOORMAT UNION JACK GUNS AND ROSES 2 3/1/2011 8:30 7.95 14620.0 United Kingdom 2011-03-01
1 545220 48194 DOORMAT HEARTS 2 3/1/2011 8:30 7.95 14620.0 United Kingdom 2011-03-01
2 545220 22556 PLASTERS IN TIN CIRCUS PARADE 12 3/1/2011 8:30 1.65 14620.0 United Kingdom 2011-03-01
3 545220 22139 RETROSPOT TEA SET CERAMIC 11 PC 3 3/1/2011 8:30 4.95 14620.0 United Kingdom 2011-03-01
4 545220 84029G KNITTED UNION FLAG HOT WATER BOTTLE 4 3/1/2011 8:30 3.75 14620.0 United Kingdom 2011-03-01

Find each customer's last purchase date and calculate the RECENCY

In [10]:
recency_df = data.groupby('CustomerID', as_index=False).date.max()
recency_df.columns = ['CustomerID', 'LastPurchaseDate']
recency_df.head()
Out[10]:
CustomerID LastPurchaseDate
0 12747.0 2011-08-22
1 12748.0 2011-09-30
2 12749.0 2011-08-01
3 12820.0 2011-09-26
4 12821.0 2011-05-09
In [11]:
recency_df['Recency'] = recency_df.LastPurchaseDate.apply(lambda x: (now-x).days)
recency_df.head()
Out[11]:
CustomerID LastPurchaseDate Recency
0 12747.0 2011-08-22 109
1 12748.0 2011-09-30 70
2 12749.0 2011-08-01 130
3 12820.0 2011-09-26 74
4 12821.0 2011-05-09 214
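The same recency can be computed without the row-wise `apply`, using vectorized timedelta arithmetic. A sketch on a tiny made-up stand-in for the cleaned data (the customer IDs and dates below are illustrative):

```python
import datetime as dt
import pandas as pd

# toy stand-in: one row per line item, with the invoice date already extracted
toy = pd.DataFrame({
    "CustomerID": [12747.0, 12747.0, 12748.0],
    "date": [dt.date(2011, 3, 1), dt.date(2011, 8, 22), dt.date(2011, 9, 30)],
})
now = dt.date(2011, 12, 9)

# last purchase per customer, then a single vectorized subtraction
recency = toy.groupby("CustomerID", as_index=False)["date"].max()
recency["Recency"] = (pd.Timestamp(now) - pd.to_datetime(recency["date"])).dt.days
print(recency)  # Recency: 109 days and 70 days
```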

Frequency

Frequency tells us how many times a customer has purchased from us. To compute it, we count how many invoices are registered by each customer.

Drop duplicate rows so each invoice is counted once per customer

In [12]:
invoice_level_data = data.copy()
invoice_level_data.drop_duplicates(subset=['InvoiceNo', 'CustomerID'], keep='first', inplace=True)
invoice_level_data.head()
Out[12]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country date
0 545220 21955 DOORMAT UNION JACK GUNS AND ROSES 2 3/1/2011 8:30 7.95 14620.0 United Kingdom 2011-03-01
15 545221 22021 BLUE FELT EASTER EGG BASKET 6 3/1/2011 8:35 1.65 14740.0 United Kingdom 2011-03-01
45 545222 22957 SET 3 PAPER VINTAGE CHICK PAPER EGG 6 3/1/2011 8:49 2.95 13880.0 United Kingdom 2011-03-01
54 545223 22487 WHITE WOOD GARDEN PLANT LADDER 4 3/1/2011 8:58 8.50 16462.0 United Kingdom 2011-03-01
55 545224 22664 TOY TIDY DOLLY GIRL DESIGN 5 3/1/2011 9:03 2.10 17068.0 United Kingdom 2011-03-01

Calculate the frequency of purchases

In [13]:
frequency_df = invoice_level_data.groupby('CustomerID', as_index=False).InvoiceNo.count()
frequency_df.columns = ['CustomerID', 'Frequency']
frequency_df.head()
Out[13]:
CustomerID Frequency
0 12747.0 5
1 12748.0 96
2 12749.0 3
3 12820.0 1
4 12821.0 1
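The `drop_duplicates` + `count` pair above can also be collapsed into a single `nunique` on `InvoiceNo`. A sketch on toy data (the IDs below are made up):

```python
import pandas as pd

# two invoices for customer A (one with two line items), one for customer B
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "InvoiceNo":  ["545220", "545220", "545221", "545222"],
})

# counting distinct invoices per customer replaces drop_duplicates + count
freq = toy.groupby("CustomerID", as_index=False).InvoiceNo.nunique()
freq.columns = ["CustomerID", "Frequency"]
print(freq)  # A -> 2, B -> 1
```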

Monetary

The Monetary attribute answers the question: how much money did the customer spend over time?

To do that, first, we will create a new column, TotalCost, holding the total price per line item (quantity × unit price).

In [14]:
data['TotalCost'] = data.Quantity * data.UnitPrice
In [15]:
data.head()
Out[15]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country date TotalCost
0 545220 21955 DOORMAT UNION JACK GUNS AND ROSES 2 3/1/2011 8:30 7.95 14620.0 United Kingdom 2011-03-01 15.90
1 545220 48194 DOORMAT HEARTS 2 3/1/2011 8:30 7.95 14620.0 United Kingdom 2011-03-01 15.90
2 545220 22556 PLASTERS IN TIN CIRCUS PARADE 12 3/1/2011 8:30 1.65 14620.0 United Kingdom 2011-03-01 19.80
3 545220 22139 RETROSPOT TEA SET CERAMIC 11 PC 3 3/1/2011 8:30 4.95 14620.0 United Kingdom 2011-03-01 14.85
4 545220 84029G KNITTED UNION FLAG HOT WATER BOTTLE 4 3/1/2011 8:30 3.75 14620.0 United Kingdom 2011-03-01 15.00
In [16]:
monetary_df = data.groupby('CustomerID', as_index=False).TotalCost.sum()
monetary_df.columns = ['CustomerID', 'Monetary']
monetary_df.head()
Out[16]:
CustomerID Monetary
0 12747.0 1760.09
1 12748.0 14680.85
2 12749.0 2755.23
3 12820.0 217.77
4 12821.0 92.72

Create RFM Table

In [17]:
rfm_df = (recency_df
 .merge(frequency_df, on='CustomerID')
 .merge(monetary_df, on='CustomerID')
)

rfm_df.set_index('CustomerID', inplace=True)
rfm_df.head()
Out[17]:
LastPurchaseDate Recency Frequency Monetary
CustomerID
12747.0 2011-08-22 109 5 1760.09
12748.0 2011-09-30 70 96 14680.85
12749.0 2011-08-01 130 3 2755.23
12820.0 2011-09-26 74 1 217.77
12821.0 2011-05-09 214 1 92.72
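The three separate group-bys and merges can also be expressed as one `groupby().agg()` with named aggregations, assuming the `date` and `TotalCost` columns built above (frequency here counts distinct invoices directly). A sketch on made-up data:

```python
import datetime as dt
import pandas as pd

now = dt.date(2011, 12, 9)
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "InvoiceNo":  ["1", "2", "3"],
    "date": [dt.date(2011, 8, 22), dt.date(2011, 9, 30), dt.date(2011, 5, 9)],
    "TotalCost": [100.0, 50.0, 92.72],
})

# one pass over the data builds all three RFM attributes
rfm = toy.groupby("CustomerID").agg(
    Recency=("date", lambda d: (now - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalCost", "sum"),
)
print(rfm)
```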

Customer segments with RFM Model

The simplest way to create customer segments from the RFM model is to use quartiles. We assign a score from 1 to 4 to Recency, Frequency and Monetary. Four is the best/highest value, and one is the lowest/worst value. A final RFM score is calculated simply by combining the individual RFM score numbers.

Note: Quintiles (scores from 1-5) offer better granularity, in case the business needs that, but they make it harder to create segments since there are 5 × 5 × 5 = 125 possible combinations (versus 4 × 4 × 4 = 64 with quartiles). So, we will use quartiles.

Find RFM quartiles

In [18]:
quantiles = rfm_df.quantile(q=[0,0.25,0.5,0.75,1])
quantiles
Out[18]:
Recency Frequency Monetary
0.00 70.0 1.0 2.9000
0.25 85.0 1.0 258.0775
0.50 119.0 2.0 518.3500
0.75 183.0 3.0 1182.9725
1.00 283.0 96.0 141789.3200

Since the Frequency column has duplicate bin edges at the standard quartiles, we can define custom quantile points for it

In [19]:
rfm_df.Frequency.quantile(q=[0,0.5,0.7,0.8,1])
Out[19]:
0.0     1.0
0.5     2.0
0.7     3.0
0.8     4.0
1.0    96.0
Name: Frequency, dtype: float64
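To see why the custom quantile points are needed: `pd.qcut` with `q=4` raises an error when a column is as heavily tied as Frequency, because two quartile edges coincide. Besides hand-picking distinct quantile points as above, `duplicates='drop'` is another option (at the cost of fewer bins). A sketch on a made-up, skewed series:

```python
import pandas as pd

freq = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 96])

# q=4 fails here: the 0% and 25% quantile edges both land on 1
try:
    pd.qcut(freq, q=4)
except ValueError as e:
    print("qcut failed:", e)

# option 1: let pandas drop the duplicate edge (fewer, uneven bins)
bins = pd.qcut(freq, q=4, duplicates="drop")
print(len(bins.cat.categories))  # 3 bins instead of 4

# option 2 (the notebook's choice): quantile points whose edges are distinct
custom = pd.qcut(freq, q=[0, 0.5, 0.7, 0.8, 1], labels=range(1, 5))
print(custom.cat.categories)  # four labelled bins
```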

Creation of RFM Segments

We will use two labeling schemes: reversed labels (4 for the lowest values) for Recency, since high recency is bad, and ascending labels for Frequency and Monetary, since high values there are good.

Apply pd.qcut with the appropriate quartile values to create the segments

In [20]:
r_labels = range(4,0,-1)
r_groups = pd.qcut(rfm_df.Recency, q=4, labels=r_labels)
r_groups.head()
Out[20]:
CustomerID
12747.0    3
12748.0    4
12749.0    2
12820.0    4
12821.0    1
Name: Recency, dtype: category
Categories (4, int64): [4 < 3 < 2 < 1]
In [21]:
m_labels = range(1,5)
m_groups = pd.qcut(rfm_df.Monetary, q=4, labels=m_labels)
m_groups.head()
Out[21]:
CustomerID
12747.0    4
12748.0    4
12749.0    4
12820.0    1
12821.0    1
Name: Monetary, dtype: category
Categories (4, int64): [1 < 2 < 3 < 4]
In [22]:
f_labels = range(1,5)
f_groups = pd.qcut(rfm_df.Frequency, q=[0,0.5,0.7,0.8,1], labels=f_labels)

f_groups.head()
Out[22]:
CustomerID
12747.0    4
12748.0    4
12749.0    2
12820.0    1
12821.0    1
Name: Frequency, dtype: category
Categories (4, int64): [1 < 2 < 3 < 4]

Now that we have a score for each customer, we can build the segmentation by combining the R, F and M scores.

In [23]:
rfm_df = rfm_df.assign(R=r_groups,F=f_groups,M=m_groups)
In [24]:
rfm_df
Out[24]:
LastPurchaseDate Recency Frequency Monetary R F M
CustomerID
12747.0 2011-08-22 109 5 1760.09 3 4 4
12748.0 2011-09-30 70 96 14680.85 4 4 4
12749.0 2011-08-01 130 3 2755.23 2 2 4
12820.0 2011-09-26 74 1 217.77 4 1 1
12821.0 2011-05-09 214 1 92.72 1 1 1
... ... ... ... ... ... ... ...
18280.0 2011-03-07 277 1 180.60 1 1 1
18281.0 2011-06-12 180 1 80.82 2 1 1
18282.0 2011-08-05 126 1 100.21 2 1 1
18283.0 2011-09-05 95 8 802.77 3 4 3
18287.0 2011-05-22 201 1 765.28 1 1 3

2864 rows × 7 columns

In [25]:
rfm_df['RFM_segment'] = rfm_df.apply(lambda x : '{}{}{}'.format(x.R , x.F, x.M), axis=1)
rfm_df['RFM_score'] = rfm_df.loc[:,['R','F','M']].sum(axis=1)
rfm_df.head()
Out[25]:
LastPurchaseDate Recency Frequency Monetary R F M RFM_segment RFM_score
CustomerID
12747.0 2011-08-22 109 5 1760.09 3 4 4 344 11.0
12748.0 2011-09-30 70 96 14680.85 4 4 4 444 12.0
12749.0 2011-08-01 130 3 2755.23 2 2 4 224 8.0
12820.0 2011-09-26 74 1 217.77 4 1 1 411 6.0
12821.0 2011-05-09 214 1 92.72 1 1 1 111 3.0
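The combined RFM_score (3 = worst, 12 = best with quartile scoring) can also be bucketed into a handful of named tiers for reporting. The tier names and boundaries below are hypothetical, not from the source; a sketch using `pd.cut`:

```python
import pandas as pd

# toy RFM scores as computed above (3 = worst possible, 12 = best)
scores = pd.Series([11.0, 12.0, 8.0, 6.0, 3.0])

# hypothetical tier boundaries -- tune these to the business's needs
tiers = pd.cut(scores, bins=[2, 5, 8, 12],
               labels=["Bronze", "Silver", "Gold"])
print(tiers.tolist())  # ['Gold', 'Gold', 'Silver', 'Silver', 'Bronze']
```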

Find the best customers

In [26]:
rfm_df.loc[rfm_df.RFM_segment == '444'].index.values
Out[26]:
array([12748., 12839., 12901., 12910., 12921., 12957., 12971., 13004.,
       13014., 13018., 13021., 13078., 13089., 13094., 13097., 13102.,
       13178., 13263., 13266., 13268., 13384., 13394., 13408., 13418.,
       13468., 13488., 13576., 13599., 13630., 13694., 13695., 13704.,
       13767., 13777., 13798., 13842., 13862., 13881., 13985., 14004.,
       14031., 14056., 14060., 14062., 14096., 14159., 14191., 14194.,
       14221., 14227., 14235., 14239., 14282., 14298., 14367., 14395.,
       14401., 14456., 14462., 14524., 14527., 14543., 14562., 14606.,
       14659., 14667., 14676., 14680., 14688., 14709., 14735., 14755.,
       14769., 14800., 14808., 14854., 14868., 14944., 14952., 14961.,
       15005., 15039., 15044., 15061., 15078., 15114., 15140., 15144.,
       15150., 15152., 15187., 15194., 15218., 15241., 15290., 15301.,
       15311., 15356., 15358., 15410., 15465., 15498., 15547., 15555.,
       15640., 15674., 15796., 15804., 15827., 15838., 15867., 15955.,
       15981., 15984., 16011., 16013., 16029., 16033., 16076., 16103.,
       16133., 16145., 16156., 16161., 16168., 16187., 16326., 16407.,
       16422., 16458., 16523., 16525., 16558., 16607., 16626., 16656.,
       16672., 16681., 16700., 16705., 16709., 16710., 16713., 16729.,
       16746., 16779., 16813., 16818., 16839., 16928., 16931., 16945.,
       17017., 17049., 17061., 17068., 17069., 17220., 17238., 17243.,
       17306., 17315., 17340., 17377., 17389., 17402., 17416., 17428.,
       17450., 17491., 17511., 17576., 17581., 17611., 17613., 17644.,
       17651., 17652., 17656., 17669., 17675., 17677., 17686., 17716.,
       17719., 17725., 17730., 17750., 17757., 17758., 17811., 17841.,
       17857., 17858., 17865., 17920., 17949., 17997., 18008., 18041.,
       18094., 18102., 18109., 18118., 18144., 18172., 18198., 18225.,
       18226., 18229., 18241.])

Learner Activity

1. Find the following:

  a. Best Customers
  • See the filter above: RFM_segment = 444
  b. Loyal Customers
  • We treat our most frequent customers as loyal, hence filter RFM_segment = x4x
In [27]:
rfm_df.loc[rfm_df.F == 4,'RFM_segment'].value_counts()
Out[27]:
444    211
344    155
244     32
343     30
443     26
243     12
143      5
144      1
242      1
442      1
Name: RFM_segment, dtype: int64
  c. Big Spenders
  • Since the Monetary distribution is skewed, we might consider big spenders to be those with RFM_segment = xx3 or higher
In [28]:
rfm_df.loc[rfm_df.M >= 3,'RFM_segment'].value_counts()
Out[28]:
444    211
344    155
213    112
313    108
413    104
113     86
334     55
223     54
323     50
423     45
434     38
214     34
244     32
424     30
343     30
333     29
314     29
414     28
224     26
443     26
233     25
234     25
114     24
324     23
433     16
123     13
243     12
143      5
134      3
124      2
133      1
144      1
Name: RFM_segment, dtype: int64
  d. Almost-lost customers
  • RFM_segment with R <= 2, F <= 2 and M <= 2
  • Or, if you have periodic data (e.g. FY Q1, FY Q2, FY Q3), check each customer's trend across quarters
In [29]:
rfm_df.query('R<=2 and F<=2 and M<=2').RFM_segment.value_counts()
Out[29]:
111    333
112    234
211    192
212    163
222     22
122      8
121      5
221      3
Name: RFM_segment, dtype: int64
  e. Lost customers
  • Customers with the lowest score: RFM_segment = 111
In [30]:
rfm_df.loc[rfm_df.RFM_segment=='111','RFM_segment'].value_counts()
Out[30]:
111    333
Name: RFM_segment, dtype: int64

2. Now that we know our customer segments, how will you target them?

  • Best: make them feel valued; send vouchers on birthdays and anniversaries
  • Loyal: upsell and cross-sell
  • Almost lost / lost: discounts and win-back offers