Off the Straight and Narrow Path

If you’re a marketer who uses regression models, you need to understand nonlinearity: what it is, why it is important, how correcting for it could improve your models and why that correction doesn’t happen automatically.

If you’ve built or used regression models to predict response or sales, you know that a regression equation looks something like this:

Y = a + b1*X1 + b2*X2 + b3*X3 + … + bn*Xn

In this equation Y is the thing you’re trying to predict (the dependent variable) and the Xs represent the things (independent variables) you know about your customers or prospects. Typical independent variables include such performance indicators as recency, frequency and dollar sales; such demographics as age and income; and such promotion history as the number of times called or mailed.

The regression coefficients are represented by the “b’s,” and you can think of them as weights generated by a regression program and assigned to each variable in the model. The “a” is a constant that we can skip over for now.

The job of the statistician, working with a data set (such as the results of a past promotion), is to discover which independent variables have a significant effect on the dependent variable and then feed this information to the regression program, which will produce the regression equation.

One of the keys to a good, long-lasting model is to find the right set of predictive variables among the hundreds – if not thousands – of potential predictors. But in addition to finding the right variables, it’s important to determine whether the relationship between a predictor variable such as age and a dependent variable such as sales is best described by a simple straight line, or whether some other, nonlinear structure makes for a more accurate prediction.

When a nonlinear relationship exists, it’s the modeler’s job to try different transformations of the data to determine the best fit. You as a user can tell if this has been done in one of your models if you see something like this:

Sales = a + b1*log of recency + b2*square root of prior sales

This equation tells you the modeler determined that the relationship between sales and recency is best described by replacing recency (number of days since last purchase) by the log (logarithm) of recency, and that the relationship between sales and prior sales is best characterized by replacing prior sales by the square root of prior sales.

Straightening the Relationship

The charts on the left show how the log transformation works to straighten the relationship between sales and recency. The top chart is a plot of sales against months since last purchase; the second chart is a plot of sales against the log of the months since last purchase. The log transformation straightens the data and results in a better fit, as indicated by an R-squared value of 1 vs. .86 for the original, untransformed data.
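For readers who want to see the straightening effect in numbers, here is a minimal Python sketch. The data are made up for illustration and are not the data behind the charts: sales are simply set to fall with the log of months since last purchase. The natural log is assumed, and the helper name r_squared is hypothetical.

```python
import math

def r_squared(x, y):
    """R-squared of a one-variable least-squares line fitted to (x, y)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Made-up illustrative data: sales fall with the log of months since last purchase.
months = list(range(1, 25))
sales = [100 - 30 * math.log(m) for m in months]

r2_raw = r_squared(months, sales)                         # straight line on raw months
r2_log = r_squared([math.log(m) for m in months], sales)  # straight line on log months

print(f"R-squared, raw months: {r2_raw:.2f}")
print(f"R-squared, log months: {r2_log:.2f}")
```

On data built this way, the fit against log months is essentially perfect, while the fit against raw months is weaker because a straight line cannot follow the curve.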

The previous equation with transformations certainly looks more impressive than the one below, without transformations:

Sales = a + b1*recency + b2*prior sales

But the question is: Does finding the right shape of a relationship, correcting for nonlinearity or straightening – three different ways to say the same thing – really make a difference?

To answer this question we created two data sets. Each data set has 400 observations representing 400 customers, each of whom responded to a mailing and made a purchase. As is customary, the first data set will be used to build the model; the second, to test or validate the model.

But to make sure we could prove our point, we cheated. Instead of searching for variables that had a nonlinear relationship with sales, and developing an equation, we started with the correct model!

In the correct model each customer’s sales is determined by this formula:

Sales = 75 – 30 times the log of the number of days since last purchase + 5 times the square root of prior orders + 0.5 times the exponential value of prior sales/million + 6 if age is greater than 45 + a random error that ranges between -50 and +50.

To determine the effect of correcting for nonlinearity, we simply ran the data through an Excel spreadsheet and had the program calculate a regression model using the four variables (recency, orders, prior sales and age), with no attempt to incorporate their known nonlinear relationships.

The program produced the following equation:

Sales = 64 – 0.58*recency + 0.20*orders + 2.61*prior sales + 0.085*age

The model had an R-squared value of 33%. (In other words, the simple model explained 33% of the variation in sales.)

Then we ran the data through the program again, this time substituting the correct form of each variable for the original, untransformed data.

The same program produced this equation:

Sales = 84 – 29.29*log of the number of days since last purchase + 4.39*square root of prior orders + 0.48*the exponential value of prior sales/million + 4.05 if age is greater than 45

The model’s R-squared value was 79%. (Even though we knew the correct form of the four variables affecting the model, the model wasn’t perfect because of the random error.)
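The experiment can be approximated with a short simulation. This is a sketch under stated assumptions, not the spreadsheet described above: the ranges for recency, orders, prior sales and age are invented, "log" is taken to be the natural log, prior sales is assumed to be already expressed in millions, and the random error is drawn uniformly between -50 and +50. Because the data are simulated afresh, the coefficients and R-squared values will not match the figures quoted above. All function names are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (constant first) via the normal equations."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(coefs, rows, y):
    """Share of the variation in y explained by the fitted equation."""
    pred = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    sse = sum((t - p) ** 2 for t, p in zip(y, pred))
    sst = sum((t - mean_y) ** 2 for t in y)
    return 1 - sse / sst

rng = random.Random(0)  # fixed seed so the run is repeatable
raw, transformed, sales = [], [], []
for _ in range(400):
    recency = math.exp(rng.uniform(0, math.log(720)))  # assumed: 1 to 720 days, skewed toward recent
    orders = rng.randint(1, 20)                        # assumed range
    prior_sales_m = rng.uniform(0, 3)                  # assumed: prior sales in millions
    age = rng.randint(20, 75)                          # assumed range
    over_45 = 1 if age > 45 else 0
    sales.append(75 - 30 * math.log(recency) + 5 * math.sqrt(orders)
                 + 0.5 * math.exp(prior_sales_m) + 6 * over_45
                 + rng.uniform(-50, 50))
    raw.append((recency, orders, prior_sales_m, age))
    transformed.append((math.log(recency), math.sqrt(orders),
                        math.exp(prior_sales_m), over_45))

coefs_raw = fit_ols(raw, sales)
coefs_transformed = fit_ols(transformed, sales)
r2_raw = r_squared(coefs_raw, raw, sales)
r2_transformed = r_squared(coefs_transformed, transformed, sales)

print(f"R-squared, untransformed variables: {r2_raw:.2f}")
print(f"R-squared, transformed variables:   {r2_transformed:.2f}")
```

How wide the gap between the two R-squared values turns out depends heavily on the assumed ranges; the skewed recency distribution in particular makes the log term matter more.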

So it would appear that knowing the correct shape of the relationship between independent variables and the dependent variable makes a huge difference – at least to a statistician. But what about the difference it makes to a direct marketer?

To answer this question we applied both models to our second data set of 400 different customers and produced the two decile analyses shown in Tables 1 and 2 on page 144.
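For readers unfamiliar with how a decile analysis is put together, the following Python sketch shows one common way to build it: rank customers by predicted sales, cut the ranked list into ten equal groups and average actual sales within each group. The scores and sales here are randomly generated stand-ins, not the validation data behind the tables, and the function name and column labels are hypothetical.

```python
import random

def decile_analysis(predicted, actual):
    """Rank customers by predicted sales and average actual sales per decile."""
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    table = []
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        table.append({
            "decile": d + 1,  # decile 1 holds the highest predictions
            "customers": len(group),
            "avg_predicted": sum(p for p, _ in group) / len(group),
            "avg_actual": sum(a for _, a in group) / len(group),
        })
    return table

# Made-up validation file: 400 customers with a model score and noisy actual sales.
rng = random.Random(1)
predicted = [rng.uniform(0, 100) for _ in range(400)]
actual = [p + rng.uniform(-50, 50) for p in predicted]

table = decile_analysis(predicted, actual)
for row in table:
    print(f"{row['decile']:>2}  {row['customers']:>3}  "
          f"{row['avg_predicted']:7.1f}  {row['avg_actual']:7.1f}")
```

A model with more spread shows a bigger difference in average actual sales between the top and bottom deciles, and a model with a closer fit shows average predicted and average actual sales tracking each other from decile to decile.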

A Closer Fit

As you can see by comparing the tables, the correct model results in a greater spread and a closer fit and is, therefore, the better model. But don’t draw the wrong conclusions from this example. In the real world, the search for the correct relationship is not done just to get a better fit. In fact, that is a relatively weak reason for going through all the work it takes to find and correct for nonlinearity. The truth is, many relationships are so nonlinear that the variables involved, important as they are, will not appear in a regression model at all – unless their nonlinearity is first identified and then corrected for.

Why is that? Because regression programs expect linear relationships, and a relationship that is in fact very strong, but very nonlinear, may be missed entirely by an analyst who just runs data through a regression program. (Remember, regression programs don’t correct for nonlinearity automatically; this has to be done by an analyst working with the data.)
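A made-up example shows how complete the miss can be. In the Python sketch below, sales are high for the youngest and oldest customers and low in the middle, so age determines sales exactly, yet a straight line through the raw data explains essentially nothing. The data, the turning point of 45 and the helper name r_squared are all assumptions for illustration.

```python
def r_squared(x, y):
    """R-squared of a one-variable least-squares line fitted to (x, y)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

ages = list(range(20, 71))                 # hypothetical customers aged 20 to 70
sales = [(age - 45) ** 2 for age in ages]  # made-up U-shaped relationship

r2_raw = r_squared(ages, sales)                                 # age as it comes
r2_fixed = r_squared([(age - 45) ** 2 for age in ages], sales)  # age after transformation

print(f"R-squared, raw age:         {r2_raw:.2f}")
print(f"R-squared, transformed age: {r2_fixed:.2f}")
```

An analyst screening variables by their straight-line fit would discard age here, even though it is the only thing driving sales.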

So how does one discover these nonlinear relationships? By using a number of graphical techniques and/or CHAID. The lesson for the direct marketer is that these nonlinear relationships exist. If you don’t see them in your models, that doesn’t mean they aren’t there – they may simply have been overlooked, in which case your models can be significantly improved.
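One simple numeric stand-in for a chart (it is not CHAID, and it is only one of many possible checks) is to rank customers by a predictor, cut them into equal-sized bands and compare average sales from band to band. If the relationship were a straight line and the predictor evenly spread, the change from one band to the next would be roughly constant; uneven steps hint at a curve. The Python sketch below uses made-up data in which sales fall with the natural log of recency, and the function name band_means is hypothetical.

```python
import math

def band_means(x, y, bands=6):
    """Average y within equal-sized bands of customers ranked by x."""
    ranked = sorted(zip(x, y))
    n = len(ranked)
    means = []
    for b in range(bands):
        group = ranked[b * n // bands:(b + 1) * n // bands]
        means.append(sum(v for _, v in group) / len(group))
    return means

# Made-up data: one customer for each recency value from 1 to 720 days.
recency = list(range(1, 721))
sales = [75 - 30 * math.log(days) for days in recency]

means = band_means(recency, sales)
steps = [later - earlier for earlier, later in zip(means, means[1:])]

for band, mean in enumerate(means, start=1):
    print(f"band {band}: average sales {mean:7.1f}")
print("change between bands:", [round(s, 1) for s in steps])
```

In this example the drop in average sales from the first band to the second is several times larger than the drop between the last two bands, which is the signature of a relationship that a log transformation would straighten.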

Exploratory Analysis

One last note: Correcting for nonlinearity is a central part of what statisticians call exploratory data analysis. This practice is recommended even when the modeling technique does not assume the relationships it’s being asked to analyze are linear. For example, artificial neural network solutions do not assume linear relationships. Nevertheless, straightening complicated nonlinear relationships prior to submission of data to the neural net is a commonly recommended procedure. It makes it easier for the net to arrive at a reliable solution.