This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
We will create cutomer segments as per the Recency,Frequency and Monetary analysis by analyzing the data to know our customer base. This knowlwdge can then be used to target customers to retain customers, pitch offers etc
UCI ML Repo - Online Retail Data Set
Column | Description | Type |
---|---|---|
InvoiceNo | Invoice number. | Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. |
StockCode | Product (item) code. | Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
Description | Product (item) name. | Nominal. |
Quantity | The quantities of each product (item) per transaction. | Numeric. |
InvoiceDate | Invice Date and time. | Numeric, the day and time when each transaction was generated. |
UnitPrice | Unit price. | Numeric, Product price per unit in sterling. |
CustomerID | Customer number. | Nominal, a 5-digit integral number uniquely assigned to each customer. |
Country | Country name. | Nominal, the name of the country where each customer resides. |
import numpy as np
import pandas as pd
import time, warnings
import datetime as dt
#visualizations
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
import seaborn as sns
warnings.filterwarnings("ignore")
data = pd.read_csv('../data/commercial_data.csv')
data.head()
data[data.InvoiceNo.str.contains('\D').replace(pd.NA,False)]
no cancel orders, but some bad debt corrections, however there seem to be missing customer ID..
data[data.CustomerID.isna()].head(10)
data.CustomerID.isna().sum()
Why missing CustID?
What could be possible features that you would collect if you want to segment "guest" customers?
data.dropna(subset=['CustomerID'], inplace=True)
data.shape
RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions and also for future personalization services.
RECENCY (R): Days since last purchase
FREQUENCY (F): Total number of purchases
MONETARY VALUE (M): Total money this customer spent.
We will create those 3 customer attributes for each customer.
To calculate recency, we need to choose a date point from which we evaluate how many days ago was the customer's last purchase.
data.InvoiceDate.max()
now = dt.date(2011, 12, 9)
print(now)
data['date'] = pd.DatetimeIndex(data.InvoiceDate).date
data.head()
recency_df = data.groupby('CustomerID', as_index=False).date.max()
recency_df.columns = ['CustomerID', 'LastPurchaseDate']
recency_df.head()
recency_df['Recency'] = recency_df.LastPurchaseDate.apply(lambda x: (now-x).days)
recency_df.head()
Frequency helps us to know how many times a customer purchased from us. To do that we need to check how many invoices are registered by the same customer.
invoice_level_data = data.copy()
invoice_level_data.drop_duplicates(subset=['InvoiceNo', 'CustomerID'], keep='first', inplace=True)
invoice_level_data.head()
frequency_df = invoice_level_data.groupby('CustomerID', as_index=False).InvoiceNo.count()
frequency_df.columns = ['CustomerID', 'Frequency']
frequency_df.head()
data['TotalCost'] = data.Quantity * data.UnitPrice
data.head()
monetary_df = data.groupby('CustomerID', as_index=False).TotalCost.sum()
monetary_df.columns = ['CustomerID', 'Monetary']
monetary_df.head()
rfm_df = (recency_df
.merge(frequency_df, on='CustomerID')
.merge(monetary_df, on='CustomerID')
)
rfm_df.set_index('CustomerID', inplace=True)
rfm_df.head()
The simplest way to create customers segments from RFM Model is to use Quartiles. We assign a score from 1 to 4 to Recency, Frequency and Monetary. Four is the best/highest value, and one is the lowest/worst value. A final RFM score is calculated simply by combining individual RFM score numbers.
Note: Quintiles (score from 1-5) offer better granularity, in case the business needs that but it will be more challenging to create segments since we will have 555 possible combinations. So, we will use quartiles.
quantiles = rfm_df.quantile(q=[0,0.25,0.5,0.75,1])
quantiles
Since we have duplicate bin edges for Frequency
column, we can custom define the range for it
rfm_df.Frequency.quantile(q=[0,0.5,0.7,0.8,1])
We will create two segmentation classes since, high recency is bad, while high frequency and monetary value is good.
r_labels = range(4,0,-1)
r_groups = pd.qcut(rfm_df.Recency, q=4, labels=r_labels)
r_groups.head()
m_labels = range(1,5)
m_groups = pd.qcut(rfm_df.Monetary, q=4, labels=m_labels)
m_groups.head()
f_labels = range(1,5)
f_groups = pd.qcut(rfm_df.Frequency, q=[0,0.5,0.7,0.8,1], labels=f_labels)
f_groups.head()
rfm_df = rfm_df.assign(R=r_groups,F=f_groups,M=m_groups)
rfm_df
rfm_df['RFM_segment'] = rfm_df.apply(lambda x : '{}{}{}'.format(x.R , x.F, x.M), axis=1)
rfm_df['RFM_score'] = rfm_df.loc[:,['R','F','M']].sum(axis=1)
rfm_df.head()
rfm_df.loc[rfm_df.RFM_segment == '444'].index.values
1. Find the following:
rfm_df.loc[rfm_df.F == 4,'RFM_segment'].value_counts()
rfm_df.loc[rfm_df.M >= 3,'RFM_segment'].value_counts()
rfm_df.query('R<=2 and F<=2 and M<=2').RFM_segment.value_counts()
rfm_df.loc[rfm_df.RFM_segment=='111','RFM_segment'].value_counts()
2. Now that we know our customers segments, how will you target them?