AFTER THE BIG THREE modeling variables (recency, frequency and monetary value), some analysts rank product purchase data as the next most important.
We’re not sure it’s No. 4 on the modeling hit parade, but it’s certainly in the top 10, and for some businesses even ranks in the top five. In any event, product purchases are a key source of customer information, and so there’s the question of how to deal with them.
There are several choices:
- Create a variable for each product. On each customer’s record, code this variable a 1 if the customer has bought that product or a 0 if the customer has not. This is called the dummy variable approach. So if you have, say, 40 products from which your customers can choose, you will set up 40 dummy variables.
- The second approach is similar but more informative when customers can buy from each product line multiple times. Still set up 40 variables, but instead of coding each variable a 1 or a 0, count the number of times each customer bought each product and enter that count into the customer’s record.
- A slight variation on the preceding approach is to record the dollars spent on each product rather than just the number of purchases. This makes more sense if the products differ significantly in price.
- Another option is to use a technique called principal components analysis, sometimes casually referred to as factor analysis. To keep the purists happy we’ll just call it PCA.
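To make the first three options concrete, here is a minimal sketch in Python. The transaction list, the customer IDs, the product names and the `build_product_variables` helper are all made up for illustration; a real file would have one variable per product line (40 in our example) rather than three.

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, product, dollars spent).
transactions = [
    ("C1", "books", 20.0),
    ("C1", "books", 15.0),
    ("C1", "tools", 80.0),
    ("C2", "garden", 45.0),
    ("C3", "books", 10.0),
    ("C3", "garden", 30.0),
    ("C3", "garden", 25.0),
]
products = sorted({product for _, product, _ in transactions})


def build_product_variables(transactions, products):
    """Return per-customer dummy, count and dollar variables for each product."""
    counts = defaultdict(lambda: dict.fromkeys(products, 0))
    dollars = defaultdict(lambda: dict.fromkeys(products, 0.0))
    for customer, product, amount in transactions:
        counts[customer][product] += 1       # option 2: number of purchases
        dollars[customer][product] += amount  # option 3: dollars spent
    # Option 1: the dummy is 1 if the customer ever bought the product, else 0.
    dummies = {
        customer: {p: int(n > 0) for p, n in row.items()}
        for customer, row in counts.items()
    }
    return dummies, dict(counts), dict(dollars)


dummies, counts, dollars = build_product_variables(transactions, products)
for customer in sorted(counts):
    print(customer, dummies[customer], counts[customer], dollars[customer])
```

All three sets of variables come from the same pass over the transactions, so the choice among them is a modeling decision rather than a data-processing one.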
In a PCA of product data the idea is to capture the product purchase behavior of a customer across the range of products offered. What we’re eventually hoping to discover is whether the purchase, or lack of purchase, of different product combinations will give us a clue as to the future behavior of an individual customer, or of groups of customers if we are doing the analysis at the source key or some geographic (ZIP code) level.
Without getting too technical, the PCA program creates a new set of principal component variables, each a weighted combination of the original product variables, along with each customer’s score on every component. Those scores can be used later on in an ordinary or logistic regression predictive model.
Again, let’s assume we’re working with 40 product lines and we know how many times each customer has bought each product. The program initially will generate 40 principal components, but each one will account for a different share of the information (variance) in the data. In general, maybe four to eight of the 40 components will capture 70% or more of the information in the entire set. And these four to eight components can be used in regression modeling just like any other continuous variable.
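The mechanics can be sketched with NumPy. Everything here is an assumption made for illustration: the purchase counts are synthetic, the `pca` helper is our own, the counts are standardized before the decomposition (one common choice, not the only one), and the response variable is invented. How many components it takes to reach 70% depends entirely on the data, so the count this prints says nothing about what a real customer file would show.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: 500 customers x 40 product-line counts,
# driven by 5 hidden "tastes" so that the product lines are correlated.
n_customers, n_products, n_tastes = 500, 40, 5
tastes = rng.gamma(shape=2.0, scale=1.0, size=(n_customers, n_tastes))
loadings = rng.uniform(0.0, 1.0, size=(n_tastes, n_products))
counts = rng.poisson(tastes @ loadings).astype(float)


def pca(X):
    """PCA via SVD of standardized data; returns scores and variance shares."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against products nobody bought
    Z = (X - X.mean(axis=0)) / std
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                      # each customer's score on each component
    explained = s**2 / np.sum(s**2)     # share of total variance per component
    return scores, explained


scores, explained = pca(counts)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 70%.
k = int(np.searchsorted(cumulative, 0.70) + 1)
retained = scores[:, :k]
print(f"{k} of {n_products} components cover {cumulative[k - 1]:.0%} of the variance")

# Hypothetical response (say, next-season spend), invented for this sketch.
response = counts.sum(axis=1) + rng.normal(0.0, 1.0, n_customers)

# Ordinary least squares on the retained scores plus an intercept.
design = np.column_stack([np.ones(n_customers), retained])
coef, *_ = np.linalg.lstsq(design, response, rcond=None)
print("intercept and component coefficients:", np.round(coef, 3))
```

Because the component scores are uncorrelated with one another, they sidestep the multicollinearity that 40 raw product variables would bring into a regression. A logistic model would take the same `retained` columns as inputs.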