Working to Build Better Regression Models

By Chief Marketer Staff

In my last column (“It’s Not Too Late to Catch Up,” DIRECT, May 1), I noted I was surprised that fewer than 30% of the companies surveyed in a recent PricewaterhouseCoopers study of customer relationship management practices used regression models, while close to 50% were using RFM models. If regression is really the better tool, if for no other reason than the obvious fact that regression models can call on variables other than RFM, why the disparity?

I don’t know. But part of the answer may have to do with modeling attempts that didn’t work – or didn’t work better than RFM.

For starters, it should be clear that a regression model can outperform an RFM model only if it incorporates factors beyond the RFM variables that aid in predicting the dependent variable.

Response Models

To keep things simple, let’s concentrate on response models, because most RFM models are used to predict response. Let’s further stipulate that for the purpose of this discussion, to “work better” means to improve the lift, or the ratio of responders to names promoted at some agreed-upon file depth.

For example, for a regression model to work better than an RFM model at a depth of, say, 30% of the file, the regression model would have to identify significantly more responders than the RFM model would have at the same depth. Also, the argument that it’s easier to score a file with a single regression equation than to manage an RFM process won’t matter in this discussion – even though it’s true.
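To make that lift comparison concrete, here is a minimal sketch in Python. The column names and the simulated scores are illustrative assumptions, not anything from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated mailing file, for illustration only: one row per name,
# a model score, and a 0/1 response flag.
file_df = pd.DataFrame({
    "score": rng.random(10_000),
    "responded": rng.binomial(1, 0.02, 10_000),
})

def lift_at_depth(df: pd.DataFrame, score_col: str, depth: float) -> float:
    """Response rate among the top `depth` fraction of names (ranked by
    score), divided by the overall response rate on the file."""
    top = df.nlargest(int(len(df) * depth), score_col)
    return top["responded"].mean() / df["responded"].mean()

# Score the file with both models, then compare lift at a 30% depth:
# the regression model "works better" only if its lift is clearly higher.
print(lift_at_depth(file_df, "score", 0.30))
```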

So we come back to the question of choosing additional variables, beyond those that are RFM-related (recency and frequency of purchase and some measure of monetary value).

One way to do this is to create new variables out of frequency and monetary factors, such as the total number of purchases or total sales divided by the number of months the name has been on file or divided by the number of times it’s been promoted.
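As a sketch of what that looks like in practice (in Python, with hypothetical column names), the new variables are simple ratios built from fields most customer files already carry:

```python
import pandas as pd

# Hypothetical customer file; the column names are assumptions for illustration.
customers = pd.DataFrame({
    "total_purchases": [12, 3, 7],
    "total_sales": [480.0, 95.0, 310.0],
    "months_on_file": [24, 6, 18],
    "times_promoted": [10, 4, 9],
})

# Rate-style variables derived from the raw frequency and monetary factors.
customers["purchases_per_month"] = customers["total_purchases"] / customers["months_on_file"]
customers["sales_per_month"] = customers["total_sales"] / customers["months_on_file"]
customers["purchases_per_promotion"] = customers["total_purchases"] / customers["times_promoted"]
```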

Another key variable that frequently appears is tenure, or the length of time a customer has been on the database. This is such an important item that it’s frequently the basis for creating separate models, one for relatively new customers, and one or more models for customers who have been on the file longer.
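A sketch of that split, assuming a `months_on_file` column and a 0/1 `responded` flag, with scikit-learn’s logistic regression standing in for whatever regression tool your shop uses (the 12-month cutoff is an arbitrary illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_by_tenure(customers: pd.DataFrame, predictors: list[str],
                  cutoff_months: int = 12) -> dict[str, LogisticRegression]:
    """Fit one model for newer customers and one for longer-tenured ones."""
    is_new = customers["months_on_file"] <= cutoff_months
    models = {}
    for name, segment in [("new", customers[is_new]), ("tenured", customers[~is_new])]:
        models[name] = LogisticRegression().fit(segment[predictors], segment["responded"])
    return models
```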

Then there’s purchase data – which particular products the customer has bought, or the product categories he or she favors. This data can be captured with dummy (0/1 coded) variables, and the best way to handle it is principal components analysis, a method that can identify buying patterns across the entire set of purchase possibilities.
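A minimal sketch of both steps, assuming a purchase-history table with hypothetical category names (scikit-learn’s PCA stands in for whatever principal components routine you use):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical purchase history: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "category": ["apparel", "home", "apparel", "garden", "home"],
})

# Dummy (0/1) flags: one column per category, 1 if the customer ever bought there.
flags = pd.crosstab(purchases["customer_id"], purchases["category"]).clip(upper=1)

# Principal components condense the category flags into a few uncorrelated
# "buying pattern" scores that can go into the model in place of raw dummies.
pattern_scores = PCA(n_components=2).fit_transform(flags)
```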

Of course, what’s needed most is a repeatable process for identifying key factors from the host of variables that appear on our databases. Here, statistical techniques like correlation tables and simple cross tabs, which show the relationship between potential variables and response, can help. And, of course, marketing people should always tell the modeler which variables they either know or think to be significant predictors.
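A sketch of that screening pass, assuming a modeling table with a 0/1 `responded` column (the variable names are illustrative):

```python
import pandas as pd

def screen_by_correlation(df: pd.DataFrame, response_col: str = "responded") -> pd.Series:
    """Rank candidate variables by the strength of their correlation with response."""
    corr = df.corr(numeric_only=True)[response_col].drop(response_col)
    return corr.abs().sort_values(ascending=False)

def response_rate_crosstab(df: pd.DataFrame, variable: str,
                           response_col: str = "responded") -> pd.Series:
    """Simple cross tab: response rate at each level of one candidate variable."""
    return df.groupby(variable)[response_col].mean()
```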

The Best Way to Go

However, we think the best technique for identifying potential variables is CHAID (chi-square automatic interaction detection).

CHAID can be used to display pictorially the differences in response rates for each potential variable, one at a time. When used in this manner, the marketing person is on an equal footing with the analyst or statistician, because the results, with just a little explanation, are so easy to understand. (Whether CHAID should be used beyond this point as a replacement for a regression model is a subject we won’t get into here.)
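We won’t reproduce a full CHAID tree here, but the single-variable screening use just described boils down to a chi-square test on a response cross tab, which is easy to sketch. This is an approximation of CHAID’s splitting test, not the full algorithm, and the column names are illustrative:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chaid_style_screen(df: pd.DataFrame, variable: str,
                       response_col: str = "responded") -> None:
    """Show response rate by level of one variable, with the chi-square
    significance of the differences (the test CHAID uses for its splits)."""
    table = pd.crosstab(df[variable], df[response_col])
    chi2, p_value, _, _ = chi2_contingency(table)
    print(f"{variable}: chi-square = {chi2:.1f}, p = {p_value:.4f}")
    print(df.groupby(variable)[response_col].mean().to_string())
```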

Needless to say, a CHAID analysis can’t be done for every conceivable variable, so some combination of judgment and reliance on the correlation table will be required in this initial variable-selection process.

Now, let’s assume for the purpose of this discussion that we identify 20 to 30 or even 50 variables, other than the basic RFM factors, that are each individually related to response. The last thing we’d want to do is use all of them in a model at the same time. The model would so over-fit the data that while a decile analysis of the calibration sample (the sample upon which the model was built) or even the validation sample (the hold-out sample intended to prove the validity of the model) would look wonderful, the results of the model would never be replicated upon rollout.
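The decile analysis itself is straightforward to sketch. Assuming hypothetical `score` and `responded` columns, comparing calibration output to validation output is what exposes the over-fit:

```python
import pandas as pd

def decile_analysis(scores: pd.Series, responded: pd.Series) -> pd.Series:
    """Response rate by score decile, best-scoring decile first."""
    deciles = pd.qcut(scores.rank(method="first"), 10, labels=False)
    return responded.groupby(deciles).mean().sort_index(ascending=False)

# An over-fit model shows a steep, clean spread on the calibration sample
# that flattens out, or turns bumpy, on the validation sample:
# decile_analysis(calibration["score"], calibration["responded"])
# decile_analysis(validation["score"], validation["responded"])
```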

To at least some degree, this is a danger you don’t have to worry about, because the programs that produce regression models, if used correctly, will prevent it from happening. But these very same stepwise regression programs frequently will produce models that contain too many variables, despite the fact that the statistics describing those variables will suggest they are significant.

When this happens, even though the decile analysis done on the validation sample will look good, the model will have less than an optimum chance of holding up during rollout promotions. To prevent this, or at least reduce the chances of it, we suggest pruning away the least significant of the key variables and observing the effect on the decile analysis. If the decile analysis is not made worse, then drop the variable; as often as not, you’ll find that eliminating the unnecessary factors actually improves the decile analysis – increasing the spread and removing “bumps” in the model.
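One way to mechanize that pruning, sketched here with statsmodels’ logistic regression. The decile-spread yardstick is a simplification of “not made worse,” and the data and column names are assumptions:

```python
import pandas as pd
import statsmodels.api as sm

def decile_spread(scores: pd.Series, responded: pd.Series) -> float:
    """Top-decile response rate minus bottom-decile rate; bigger is better."""
    deciles = pd.qcut(scores.rank(method="first"), 10, labels=False)
    rates = responded.groupby(deciles).mean()
    return rates.iloc[-1] - rates.iloc[0]

def prune(calib: pd.DataFrame, valid: pd.DataFrame, predictors: list[str]) -> list[str]:
    """Drop the least significant variable while the validation deciles hold up."""
    keep = list(predictors)
    while len(keep) > 1:
        fit = sm.Logit(calib["responded"], sm.add_constant(calib[keep])).fit(disp=0)
        weakest = fit.pvalues.drop("const").idxmax()
        trial = [v for v in keep if v != weakest]
        trial_fit = sm.Logit(calib["responded"], sm.add_constant(calib[trial])).fit(disp=0)
        before = decile_spread(fit.predict(sm.add_constant(valid[keep])), valid["responded"])
        after = decile_spread(trial_fit.predict(sm.add_constant(valid[trial])), valid["responded"])
        if after < before:   # dropping the variable hurt the spread: stop here
            break
        keep = trial         # as often as not, the smaller model is no worse
    return keep
```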

If all these steps are followed, you should have a pretty good chance of replacing your RFM models.
