AT A RECENT DATABASE marketing conference, several exhibits featured automatic or semi-automatic data mining systems. Each vendor claimed its approach gave users a multitude of algorithms to improve the analysis outcome. And of course, each had a catchy phrase for the various modeling tools its method supported. Apparently, this is the conclusion one was supposed to reach: “The more algorithms, the better the final result.” And yes, this logic has been adopted by many marketers and analysts alike.
Well, it's time to place the “multitude of algorithms” into perspective. It's clear that a particular technique plays only a tertiary role in formulating a model. What actually distinguishes mediocre from good models is the “other” work surrounding the production of the modeling “rules.” These tasks occur before and after the rules are created. In fact, most serious data miners will concur that algorithm formulation occupies some 10% of the overall work needed to complete the assignment. What then is the other 90%?
The first step toward determining this is to organize a clear research plan that will address a marketing problem. I'm continually amazed that many managers believe modeling can attack and successfully treat almost any direct marketing problem.
One of the classic stories involves a marketer's request to double the response rate of a program by using the same 250,000 names emerging from an external list. But doubling the response rate from the same list requires more than a data mining exercise.
Perhaps a monetary incentive exceeding the benefits of the entire campaign would do it! Finding a segment, say 10% to 20% of the file, that would respond at this higher rate can be done. Regrettably, this small portion of the file, even if it is tracked down, would not double the entire list's response rate.
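The arithmetic behind that last point can be sketched in a few lines. The 250,000 names come from the story above; the 2% overall response rate and the 20% segment responding at 4% are hypothetical figures chosen only for illustration, and the function name is invented for this sketch.

```python
def split_response(list_size, overall_rate, segment_share, segment_rate):
    """Split a list's responders between a high-rate segment and the rest.

    All inputs are illustrative assumptions, not figures from a real campaign.
    """
    segment_size = list_size * segment_share
    rest_size = list_size - segment_size
    total_responders = list_size * overall_rate
    segment_responders = segment_size * segment_rate
    # Whatever the segment does not account for must come from the rest.
    rest_responders = total_responders - segment_responders
    return {
        "segment_size": segment_size,
        "segment_responders": segment_responders,
        "rest_rate": rest_responders / rest_size,
        "overall_rate": (segment_responders + rest_responders) / list_size,
    }


# Hypothetical: 2% overall, with the best 20% of names responding at 4%.
result = split_response(250_000, 0.02, 0.20, 0.04)
for name, value in result.items():
    print(f"{name}: {value:,.4f}")
```

Under these assumptions the model only reorders the names. The best fifth of the file responds at twice the average, the remainder responds below average, and mailing all 250,000 still produces the same overall rate.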
What about the retailer that's focused on the bottom line, and seeks to develop a response model? Is this the correct goal? Isn't it possible to realize a respectable response rate, but nevertheless fall short in total customer spending? Or how about the firm that offers its audience an application for insurance, intending to secure significant new business? There sure are a lot of responders. That's good. Still, only a tiny fraction eventually get approved. In both cases the marketing problems can be solved, but one has to articulate the precise objective to be addressed before modeling can get under way.
MAKE TIME FOR ANALYSIS
Once the objective is clear, one needs to locate a “good” sample from which to conduct analysis. How many marketers take the time to actually analyze the sample to ensure it represents the parent population? There are simple tests to evaluate the “closeness” of the sample to the universe. Why not use them? Another timeless story involves a credit card grantor that used a certain digit or combination of digits extracted from the account number to represent the sample. And you guessed it: Without the systems analysts' knowledge, these digits were reserved for so-called VIPs. This was no representative sample.
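One such simple test is a chi-square goodness-of-fit check, which compares the mix of a categorical field in the sample against its known mix in the universe. The sketch below is a minimal illustration, not a prescribed method: the account tiers, their population shares and the sample counts are all hypothetical, and the critical values are the standard 5% chi-square cutoffs for one to four degrees of freedom.

```python
# Standard chi-square critical values at the 5% level, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_gof(sample_counts, population_shares):
    """Chi-square statistic of sample counts against expected population shares."""
    n = sum(sample_counts.get(cat, 0) for cat in population_shares)
    stat = 0.0
    for cat, share in population_shares.items():
        expected = n * share
        observed = sample_counts.get(cat, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def looks_representative(sample_counts, population_shares):
    """True when the statistic stays below the 5% critical value."""
    df = len(population_shares) - 1
    return chi_square_gof(sample_counts, population_shares) < CRITICAL_05[df]


# Hypothetical universe: share of accounts in each tier.
universe = {"standard": 0.90, "premium": 0.08, "vip": 0.02}

random_sample = {"standard": 905, "premium": 78, "vip": 17}
digit_sample = {"standard": 100, "premium": 100, "vip": 800}  # VIP-reserved digits

print(looks_representative(random_sample, universe))
print(looks_representative(digit_sample, universe))
```

A sample drawn from account-number digits reserved for VIPs fails this check immediately, well before any modeling begins.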
Other factors to weigh in sample selection include the impact of creative, offer and seasonality. Be advised: The research design is no trivial matter and requires some serious thinking up front.
But that's not all. Developing a file for analysis demands real consideration. I can't say how many times I've witnessed a marketer extract all of its mailing data from the previous four or five months, and then flag each record according to whether the individual responded. After the files are prepared, and the response rate tabulated from these samples, the analyst discovers that the response rate from these files doesn't match the response rates disseminated earlier to management.
How could this happen? There could be many reasons. One may be that some suppression or selection was done, perhaps at a lettershop, which resulted in different reported numbers. Perhaps a number of records on the master file were source-coded incorrectly. Or maybe nothing was wrong with the sample, and the hard copy reports were flawed.
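A quick way to narrow down where the numbers diverge is to tally the analysis file by source code and set it beside the figures reported to management. The following is a rough sketch under assumed field names (`source_code`, `responded`) and invented counts; a real file layout will differ.

```python
from collections import defaultdict


def reconcile_by_source(records, reported):
    """Return source codes whose (mailed, responded) counts differ from the report.

    `reported` maps source code -> (mailed, responded) as shown on the hard copy.
    """
    tallies = defaultdict(lambda: [0, 0])
    for rec in records:
        tally = tallies[rec["source_code"]]
        tally[0] += 1
        tally[1] += 1 if rec["responded"] else 0

    mismatches = {}
    for code in sorted(set(tallies) | set(reported)):
        on_file = tuple(tallies[code]) if code in tallies else (0, 0)
        on_report = tuple(reported.get(code, (0, 0)))
        if on_file != on_report:
            mismatches[code] = {"file": on_file, "report": on_report}
    return mismatches


# Hypothetical analysis file and management report.
records = (
    [{"source_code": "A01", "responded": i < 30} for i in range(1000)]
    + [{"source_code": "B02", "responded": i < 10} for i in range(400)]
)
reported = {"A01": (1000, 30), "B02": (500, 12)}

print(reconcile_by_source(records, reported))
```

In this made-up case source code B02 is short 100 records on the file, the kind of gap a lettershop suppression or a source-coding error would leave behind.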
Interrogating the data once it's received can turn up additional problems. One “model killer” is a file whose data was compiled differently than it was at the time of the mailing.
So, for example, a retailer may classify its merchandise into 20 different categories. Say category 01 at the time of the mailing included five merchandise classes. After the mailing, but before sample extraction for model development, the merchandiser modifies the category definitions. Now category 01 includes seven classes, only three of which were in the pre-mailing classification. Employing these new designations would be useless: the data needs to be compiled exactly as it was at the time of the original mailing.
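One way to guard against this is to keep each version of the category definitions with its effective date, and to compile the analysis file with the version that was in force on the mail date. The sketch below assumes such a history exists; the class codes, dates and function name are hypothetical.

```python
from datetime import date

# Hypothetical history of merchandise-class -> category assignments.
CATEGORY_VERSIONS = [
    # Category 01 holds five classes at the time of the mailing.
    (date(2023, 1, 1), {
        "C101": "01", "C102": "01", "C103": "01", "C104": "01", "C105": "01",
    }),
    # Redefinition: category 01 now holds seven classes, three of them unchanged.
    (date(2023, 9, 1), {
        "C101": "01", "C102": "01", "C103": "01",
        "C201": "01", "C202": "01", "C203": "01", "C204": "01",
        "C104": "02", "C105": "02",
    }),
]


def mapping_in_effect(versions, as_of):
    """Return the class-to-category mapping that was in force on `as_of`."""
    eligible = [v for v in versions if v[0] <= as_of]
    if not eligible:
        raise ValueError(f"no category definitions on file for {as_of}")
    return max(eligible, key=lambda v: v[0])[1]


mail_date = date(2023, 6, 15)
at_mailing = mapping_in_effect(CATEGORY_VERSIONS, mail_date)
current = mapping_in_effect(CATEGORY_VERSIONS, date(2023, 12, 1))

print(at_mailing["C104"], current["C104"])
```

Purchases in class C104 count toward category 01 when compiled as of the mailing, but toward category 02 under the current definitions, so a model built on one and scored on the other would be reading two different variables.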
Another issue that must be dealt with is newer customers. If one is prepared to use two years of data as a predictor, it's clear that newer members of the file would not have any associated data for, say, 13 to 24 months back. Including these records without thinking through the implications can result in a poor sample selection.
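A simple precaution is to measure each customer's time on file as of the mail date and to separate those with a full history window from those without. This is only a sketch: the field names, the customers and the 24-month window are assumptions for illustration, and what to do with the short-history group (a separate model, or predictors limited to recent months) remains a judgment call.

```python
from datetime import date


def months_on_file(first_date, as_of):
    """Whole months elapsed between a customer's first order and `as_of`."""
    months = (as_of.year - first_date.year) * 12 + (as_of.month - first_date.month)
    if as_of.day < first_date.day:
        months -= 1  # the final month is not yet complete
    return months


def split_by_history(customers, as_of, window_months=24):
    """Separate customers with a full history window from newer ones."""
    full, short = [], []
    for cust in customers:
        tenure = months_on_file(cust["first_order"], as_of)
        (full if tenure >= window_months else short).append(cust["id"])
    return full, short


# Hypothetical customers and mail date.
mail_date = date(2023, 3, 15)
customers = [
    {"id": "C1", "first_order": date(2019, 6, 1)},
    {"id": "C2", "first_order": date(2021, 3, 15)},
    {"id": "C3", "first_order": date(2022, 11, 20)},
]

full, short = split_by_history(customers, mail_date)
print("full window:", full)
print("short history:", short)
```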
A final thought here about data quality: Regrettably, some marketers still are not that concerned about it. Fortunately, over the last several years I've seen increased focus in this area. Enterprisewide databases, customer relationship management programs, multichannel marketing and other business intelligence platforms build the case quickly for breaking down data silos and applying a data quality regimen. But some firms still aren't there yet, and a lack of data quality further complicates model development.
Understanding data is a key step in the data mining process. It's the one procedure that should never be assigned to automated systems; professional inspection is vital. Yet it is glossed over frequently, resulting in sometimes ridiculous conclusions.
Take the analyst who arrives at this result: “Those who respond spend more than those who don't.” You don't need an advanced degree to reach this insight. Or take one of my favorites: “Those owning homes are more likely to respond to a home equity line of credit offer.” A modeler must sift through these sorts of “statements” and determine which should be discarded immediately and which have value.
One more example is worth noting. An attrition model for a home equity line of credit offer showed that the defectors were more likely to have additional products. Could this be true? It's contrary to many of the basic tenets of customer loyalty. Upon further investigation, it was learned that many of the defectors opened up a new account at the same bank in order to leverage more favorable interest rates. When the “actual” defector population was analyzed, the marketer found the relationship between products owned and loyalty to be in line with conventional wisdom.
Automated systems can't think through which patterns make sense and which are garbage. It's not easy to recognize which trends and relationships hold up over time and which are spurious. Understanding the data takes time. And it is this time that often separates ordinary from extraordinary models.
Once the model is completed, it's necessary to translate the model code onto a production database. Model development can occur on two types of files. In the first, an extract is produced in the same format as the database, and the analyst simply works with this extract. In the second, probably more common, the analyst works with an extract that contains only the fields deemed important for creating the model.
In the former method, model code can be easier to complete; the data miner is working with a mirror image of the file. Applying the code is more straightforward. All fields are in the same position. There simply is less chance to make an error.
However, the more-used latter method frequently causes problems. This is especially true when dealing with relational files where data has to be aggregated. In this scenario, extract fields no longer are in the same position as they were on the database.
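Where the file format allows it, one hedge against position errors is to have the scoring code look up fields by name rather than by column position. The sketch below assumes delimited files with a header row; the field names and weights are invented for illustration and do not represent any real model.

```python
import csv
import io

# Hypothetical model: field names and weights are illustrative only.
WEIGHTS = {"recency_months": -0.05, "orders_24m": 0.30, "spend_24m": 0.002}


def score_file(text):
    """Score every row of a delimited file, reading fields by name."""
    reader = csv.DictReader(io.StringIO(text))
    required = set(WEIGHTS) | {"customer_id"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing model fields: {sorted(missing)}")
    return {
        row["customer_id"]: sum(w * float(row[f]) for f, w in WEIGHTS.items())
        for row in reader
    }


# The development extract and the production layout order their columns differently.
extract = "customer_id,recency_months,orders_24m,spend_24m\n1001,3,4,250.00\n"
production = (
    "customer_id,region,spend_24m,orders_24m,last_channel,recency_months\n"
    "1001,NE,250.00,4,web,3\n"
)

print(score_file(extract) == score_file(production))
```

Name-based lookup does not remove the need to aggregate relational data the same way in both places, but it does take column position out of the list of things that can go wrong.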
Many of the more robust database marketers that employ a host of models take extra care to deploy quality control procedures. Indeed, several I know employ one or more analysts who continually monitor model scoring. They produce key statistics by score range for the scored file, and match these with the identical distributions that emerged from the original analysis file. If something doesn't look right, then further investigation must be done. While there are several reasons the original and current distributions may look different, some are program killers. This is another vital component of the modeling process, and it's often overlooked.
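One statistic often used for this kind of comparison is the population stability index (PSI), which sums the divergence between the share of records in each score range on the original analysis file and on the newly scored file. It is offered here as one common choice, not as the check the marketers above actually use, and the decile counts are hypothetical.

```python
import math


def shares(counts):
    """Convert counts per score range into proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def psi(expected_counts, actual_counts, floor=1e-6):
    """Population stability index between two score-range distributions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both files must use the same score ranges")
    total = 0.0
    for e, a in zip(shares(expected_counts), shares(actual_counts)):
        e = max(e, floor)  # avoid log(0) on empty ranges
        a = max(a, floor)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical record counts by score decile.
analysis_file = [100] * 10
stable_scoring = [100] * 10
drifted_scoring = [300, 200, 100, 100, 100, 50, 50, 50, 25, 25]

print(round(psi(analysis_file, stable_scoring), 3))
print(round(psi(analysis_file, drifted_scoring), 3))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a material shift worth investigating, though that cutoff is a convention rather than a hard rule.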
EXPECT THE UNEXPECTED
On numerous modeling assignments I've found missing data on potentially critical variables. One incident concerned a file that emerged from a new data warehouse.
Not surprisingly, the marketer was pleased with the investment made in developing this huge repository. Extracting files was relatively simple. Among the results I witnessed was that the response rate among those owning nine vehicles was double that of everyone else.
How many were privileged to own this many cars? More than a third of those who received the mailing! The marketer was quite impressed that his customer base was so upscale.
However, after further investigation, we determined the value “9” was used by the database builders as a value assigned to a missing data field, and wasn't an accurate vehicle count. Talk about a traffic accident.
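A frequency count on each field before modeling would have flagged the problem. The sketch below looks for commonly used placeholder codes that account for a suspiciously large share of a field; the list of codes, the 5% threshold and the sample values are assumptions for illustration, and a flagged value still needs a person to confirm what it means.

```python
from collections import Counter

# Placeholder codes often used for missing data (an assumption; confirm with the database builders).
SUSPECT_CODES = (9, 99, 999, 9999, -1)


def suspect_missing_codes(values, min_share=0.05):
    """Return suspect codes and the share of records carrying each one."""
    if not values:
        return {}
    counts = Counter(values)
    n = len(values)
    return {
        code: counts[code] / n
        for code in SUSPECT_CODES
        if counts[code] / n >= min_share
    }


def recode_missing(values, codes):
    """Replace confirmed placeholder codes with None."""
    return [None if v in codes else v for v in values]


# Hypothetical vehicle counts: more than a third of records carry the value 9.
vehicles = [0] * 20 + [1] * 30 + [2] * 14 + [9] * 36

flagged = suspect_missing_codes(vehicles)
print(flagged)
cleaned = recode_missing(vehicles, set(flagged))
print(cleaned.count(None))
```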
Editor’s Note
Direct acknowledges that the findings and writings of Jim Wheaton of Wheaton Group about the predictive modeling process were inadvertently cited in the above contributed article (“Get Some Perspective on Analytic Software,” Sept. 8) but not attributed because the original source wasn’t known.
Jim’s 1994 article, “Myths and Realities of Building Models,” can be found on Wheaton Group’s Web site (www.wheatongroup.com) in the reference library for the direct and database marketing community. A shorter version originally appeared in DM News. The library consists of approximately 60 articles published by Wheaton Group’s four principals through Jan. 1 of this year. The articles are divided into seven categories:
1. Database marketing, direct marketing and/or CRM.
2. Data mining, predictive modeling, lifetime value analysis and/or data quality.
3. Data warehousing, desktop access, systems and/or technology.
4. Testing, statistical formulas and/or results analysis.
5. Overlay data, compiled lists and/or cooperative databases.
6. Merge/purge, data hygiene and/or service bureaus.
7. Technical.
The reference library allows keyword searches of the articles. All contain an Adobe PDF print option and should download in less than five seconds.
For copies of the eight pieces published since Jan. 1, contact Jim Wheaton at 919-969-8859. He can also be contacted at jim.wheaton@wheatongroup.com.




