Real-World Modeling: It’s Not About Tools

OCCASIONALLY THERE ARE articles in the trade and academic press about the relative merits of neural networks, logistic regression and what I’ll refer to as relatively simple but structured RFM (recency, frequency and monetary) analysis. The usual conclusion is that, for a given set of potential predictor variables, neural nets won’t always beat regression, or vice versa. The other conclusion is that both methods allow for consideration of more variables than RFM does. By definition, that’s true. However, if all you want to look at is RFM variables, then a simple RFM analysis guided by a CHAID analysis is probably the best way to go. (The 125-cell technique suggested by Arthur Hughes and others also works for many people.)
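
For readers who haven’t seen the 125-cell approach, here is a minimal Python sketch of one common way to build the cells: sort the file by recency and cut it into five equal groups, cut each of those into five by frequency, then each of those into five by monetary value, giving 5 x 5 x 5 = 125 cells. The field names (`days_since_last`, `orders`, `spend`) and the nested order of the cuts are assumptions made for illustration, not a statement of exactly how Hughes or anyone else implements it.

```python
from typing import Callable, Dict, List, Tuple

Customer = Dict[str, float]


def quintile_split(
    rows: List[Customer], key: Callable[[Customer], float], reverse: bool
) -> List[Tuple[int, List[Customer]]]:
    """Sort rows by key and cut them into 5 near-equal groups.

    The first group after sorting gets score 5 (best), the last gets 1.
    """
    ordered = sorted(rows, key=key, reverse=reverse)
    n = len(ordered)
    groups = []
    for i in range(5):
        start = i * n // 5
        end = (i + 1) * n // 5
        groups.append((5 - i, ordered[start:end]))
    return groups


def rfm_cells(customers: List[Customer]) -> Dict[float, str]:
    """Return a three-digit RFM cell code (e.g. '543') for each customer id."""
    cells = {}
    # Recency: fewer days since the last purchase is better, so sort ascending.
    for r, r_group in quintile_split(
        customers, key=lambda c: c["days_since_last"], reverse=False
    ):
        # Frequency: more orders is better, so sort descending.
        for f, f_group in quintile_split(
            r_group, key=lambda c: c["orders"], reverse=True
        ):
            # Monetary: more spend is better, so sort descending.
            for m, m_group in quintile_split(
                f_group, key=lambda c: c["spend"], reverse=True
            ):
                for c in m_group:
                    cells[c["id"]] = f"{r}{f}{m}"
    return cells
```

Each customer ends up with a three-digit code, where 555 means most recent, most frequent and highest spend. Response rates by cell from a past mailing can then guide which cells to mail next time.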

In reality, the decision to use one method over another should be strongly influenced by other considerations, such as ease of implementation and explanation for the user’s skill level, the probability of detecting errors, and the expected robustness of each alternative.

While measuring the performance of alternative techniques is fun to do and read about, it really doesn’t get to the heart of the modeling problem. The core modeling problem is not about spreading an average response rate or an average order, given a data set based on one or more past promotions. That’s easy. It’s about developing a modeling methodology that can be adapted to a changing environment.

Following are some key questions modelers have to answer before choosing a tool to produce an equation or series of equations to predict an outcome. As you go through them you’ll see many of the questions have to do with the integrity of the data, and others with your business processes and marketing strategy. Questions about modeling tools pale in comparison with these concerns.

* Is the data you’re working with accurate? Not in the sense of being right to the last decimal point, but in the sense that the definitions are correct: Are returns really returns, are claims really claims, are customer start dates really start dates, and so on?

* Is the mail file complete, or just what was available? Are all the responses or orders accounted for?

* Do the values of the predictive variables you are working with reflect the customer’s values at the time of name selection, or were they gathered at some later date?

* Which independent predictor variables have an effect on the dependent variable, the event we are trying to predict? Is the effect the same across all subsegments of the file? If not, and if the differences are significant, you may require multiple models, one for each segment, or you may have to build “interaction variables” into the model to capture the effect of the different subpopulations.

* Is the relationship between a suspected independent predictor variable and the dependent variable linear, and best represented by a single straight line? Or, is it better represented by a curved line or a series of broken lines? If you want an accurate prediction, this question needs to be answered.

* Is the population that will be scored by the model the same as the population from which the model was built? If it was built using customer names that had been on the file for some time, it probably won’t apply to new customers.

* How will the model be used? If customers with high scores are mailed to or called as frequently as a model might suggest, have you correctly taken fatigue and cannibalization into account?

* Has the offer changed? A model built on a soft offer may not work on a hard offer, and probably won’t if the difference is significant.

* How will categorical variables with lots of possible values, such as product line purchases, be handled? If you use simple dummy variables to indicate purchase or non-purchase, the results become unstable when the number of categories exceeds five or six. In that case principal components analysis, where patterns of purchase can be identified and quantified, is probably a better solution.

* If you are going to use household-level demographic data instead of geographic data, how will you handle non-matches and missing values on matched records?

* If your business requires both a response and a conversion, should you model both separately, or try to model conversions directly? Strategically it makes a difference.

* If you have lots and lots of data from many promotions, how do you use all of it? Do you average it, weight it and build seasonal models, or do you just use the last promotion and “forgetaboutit”? There are lots and lots of choices and no definitive solutions.
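
Two of the questions above, the one about differences across subsegments and the one about straight versus broken lines, are settled in how the predictor columns are built, before any modeling tool is chosen. The Python sketch below shows one way to do that: hinge terms let a fitted line change slope at chosen break points, and an interaction variable lets one segment have its own slope. The field names, the break points at 6 and 12 months, and the new-customer flag are hypothetical choices for illustration only.

```python
from typing import Dict, Sequence


def hinge_terms(x: float, knots: Sequence[float]) -> Dict[str, float]:
    """One column per break point: 0 below the knot, (x - knot) above it."""
    return {f"over_{k:g}": max(0.0, x - k) for k in knots}


def build_features(
    months_since_last_order: float,
    is_new_customer: bool,
    knots: Sequence[float] = (6.0, 12.0),  # assumed break points
) -> Dict[str, float]:
    """Turn one raw predictor into columns a linear model can bend with."""
    features = {"recency": float(months_since_last_order)}

    # "Series of broken lines": the slope may change at each knot.
    for name, value in hinge_terms(months_since_last_order, knots).items():
        features[f"recency_{name}"] = float(value)

    # "Interaction variable": recency gets a separate slope for new customers.
    new_flag = 1.0 if is_new_customer else 0.0
    features["new_customer"] = new_flag
    features["new_x_recency"] = new_flag * months_since_last_order
    return features
```

Fed to a regression, the coefficient on each hinge column is the change in slope at that break point, and the coefficient on the interaction column is the extra slope that applies only to new customers.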

These are the real-world issues you and your modelers need to think about. Once you answer these questions you can use any tool you like to arrive at a final equation or set of equations, provided you know how to use the tool you selected. Remember, there’s nothing more dangerous than the wrong person using the right tool.