It’s time to place the “multitude of algorithms” into perspective. It is clear that a particular technique plays only a tertiary role in formulating a model. Rather, what distinguishes mediocre models from good ones is the “other” work surrounding the production of the modeling “rules.” These tasks occur before and after the rules are created. In fact, most serious data miners will concur that algorithm formulation occupies only about 10% of the overall work needed to complete the assignment. What, then, is the other 90%?
The first step is organizing a clear research plan that addresses the marketing problem. I am continually amazed that many managers believe modeling can attack, and successfully treat, almost any direct marketing problem. One of the classic stories involves a marketer’s request to double the response rate of a program by using the same 250,000 names emerging from an external list. But doubling the response rate from the same list requires more than a data mining exercise. Perhaps a monetary incentive exceeding the benefits of the entire campaign would do it! Finding a segment of the file, say 10% to 20% of it, that would respond at this higher rate is probably doable. Regrettably, this small portion of the file, even if found successfully, would not double the entire list’s response rate.
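A quick back-of-the-envelope calculation shows why. The numbers below are purely illustrative, not drawn from any actual campaign:

```python
# Illustrative arithmetic: doubling response within a small segment
# barely moves the response rate of the whole list.
list_size = 250_000
base_rate = 0.02            # assume a 2% baseline response rate

segment_share = 0.15        # suppose 15% of the file responds at double the rate
segment_rate = 2 * base_rate
rest_rate = base_rate       # the remaining 85% responds at the baseline

blended = segment_share * segment_rate + (1 - segment_share) * rest_rate
print(f"Blended response rate: {blended:.2%}")  # 2.30%, nowhere near the 4% target
```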
What about the retailer that is focused on the bottom line, and seeks to develop a response model? Is this the correct goal? Isn’t it possible to realize a respectable response rate, but nevertheless fall short in total customer spending? Or how about the firm that offers its audience an application for insurance, intending to secure significant new business. There sure are a lot of responders. That’s good. Still, only a tiny fraction is eventually approved. In both these cases, the marketing problems are solvable, but one needs to articulate the precise objective that needs to be addressed before modeling can be put to work.
So now that the objective is well defined, it’s time to locate a “good” sample from which to conduct the analysis. How many marketers take the necessary time to actually analyze the sample and assure that it represents the parent population? There are simple tests to evaluate the “closeness” of the sample to the universe. Why not use them? Another timeless story involves a credit card grantor that used a certain digit, or combination of digits, extracted from the account number to define the sample. You guessed it. Unbeknownst to the systems analysts, those digits were reserved for what were referred to as VIPs. This was no representative sample.
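One such simple test is a chi-square comparison of a categorical field’s distribution in the sample against its distribution in the full file. A minimal sketch, using hypothetical counts for an invented field such as region:

```python
from scipy.stats import chisquare

# Hypothetical counts of a categorical field (e.g., region) in the
# full file (the "universe") and in the extracted sample.
universe_counts = [120_000, 80_000, 50_000]
sample_counts = [4_900, 3_150, 1_950]   # a sample of 10,000

# Expected sample counts if the sample mirrored the universe.
total_universe = sum(universe_counts)
total_sample = sum(sample_counts)
expected = [c / total_universe * total_sample for c in universe_counts]

stat, p_value = chisquare(f_obs=sample_counts, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
# A very small p-value suggests the sample does not represent the universe.
```

The same check can be repeated across several key fields; a sample that passes on one field and fails badly on another still warrants investigation.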
Other factors to consider in sample selection include the impact of creative, offer and seasonality. I hope you are getting the feeling that the research design is no trivial matter, and requires some serious upfront thinking.
But we’re not done. Developing a file for analysis demands serious consideration. How many times have I witnessed a marketer extract all the mailings from the previous four or five months, and then flag each record as to whether the individual responded. After the files are prepared, and the response rate tabulated from these samples, the analyst discovers that the response rate from these files does not match the response rates that were disseminated earlier to management.
How could this happen? There may be many reasons. One potential cause is some suppression or selection, perhaps done at a letter shop, that resulted in different reported numbers. Perhaps a number of records on the master file were incorrectly source coded. Or maybe nothing is wrong with the sample at all; rather, the hard-copy reports may be incorrect.
Interrogating the data once it is received can surface additional problems. On numerous modeling assignments, I have found missing data on potentially critical variables. One incident concerns a file that emerged from a new data warehouse. Not surprisingly, the marketer was pleased with the investment he had made in developing this huge repository; extracting files was a relatively simple matter. Among the results I witnessed: the response rate among those owning nine vehicles was double that of everyone else. How many were privileged to own this many cars? More than a third of those who received the mailing! The marketer was quite impressed that his customer base was so upscale.
However, after further investigation, we determined the value “9” was used by the database builders as a value assigned to a missing data field, and wasn’t an accurate vehicle count. Talk about a traffic accident.
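This kind of sentinel coding is easy to catch with a quick inspection before modeling. A minimal sketch, assuming a pandas DataFrame with a hypothetical vehicle_count column and the value 9 as the undocumented missing-data code:

```python
import numpy as np
import pandas as pd

# Hypothetical extract: vehicle_count uses 9 as an undocumented
# "missing" sentinel, as in the anecdote above.
df = pd.DataFrame({"vehicle_count": [1, 2, 9, 9, 3, 9, 2, 9, 1, 9]})

# A single value carrying a suspiciously large share of records is a
# classic sentinel signature; inspect it before modeling.
shares = df["vehicle_count"].value_counts(normalize=True)
print(shares)  # here, 9 accounts for half the file

# Once confirmed with the database builders, recode the sentinel to NaN
# so it is treated as missing rather than as nine actual vehicles.
df["vehicle_count"] = df["vehicle_count"].replace(9, np.nan)
```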
One “model killer” is producing a file whose data is compiled differently than it was at the time of the mailing. For example, a retailer may classify its merchandise into 20 categories. Say category 01 at the time of the mailing included five classes of merchandise. Subsequent to the mailing, but before sample extraction for model development, the merchandiser modifies his category definitions, so that category 01 now includes seven classes, only three of which match the pre-mailing classification. Clearly, using the new designations would be useless; the data must be compiled exactly as it was at the time of the original mailing.
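In practice, this means mapping the raw merchandise classes back to the taxonomy that was in force when the mailing dropped. A small sketch, with invented class codes and category labels:

```python
# Hypothetical mapping from individual merchandise classes to the
# category scheme that was in force at the time of the mailing.
classes_at_mailing = {
    "A": "01", "B": "01", "C": "01", "D": "01", "E": "01",  # old category 01
    "F": "02", "G": "02",
}

def category_at_mailing(merchandise_class: str) -> str:
    """Compile the category exactly as it was defined pre-mailing,
    ignoring any later redefinition of the category scheme."""
    return classes_at_mailing.get(merchandise_class, "unknown")

print(category_at_mailing("C"))  # "01", regardless of how 01 is defined today
```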
Another issue that must be dealt with is newer customers. If one is prepared to use two years of data as predictors, it is clear that newer members of the file would not have any associated data for, say, 13 to 24 months back. Including these records without thinking through the implications can result in a flawed sample selection.
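One simple guard is to flag records whose tenure cannot support the full predictor window. A sketch, assuming hypothetical first-on-file and as-of dates:

```python
from datetime import date

def has_full_history(first_on_file: date, as_of: date,
                     months_required: int = 24) -> bool:
    """Return True only if the customer has been on file long enough
    to populate the full predictor window (e.g., months 13-24 back)."""
    months_on_file = ((as_of.year - first_on_file.year) * 12
                      + (as_of.month - first_on_file.month))
    return months_on_file >= months_required

# A customer added 8 months ago cannot have months 13-24 of history;
# including the record silently zero-fills what is really "not applicable."
print(has_full_history(date(2003, 1, 15), date(2003, 9, 8)))  # False
```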
A final thought here revolves around the quality of the data. Regrettably, some marketers still are not that concerned about data integrity or quality. Fortunately, over the last several years, I have seen increased focus in this area. Enterprise-wide databases, customer relationship management programs, multichannel marketing and other business intelligence platforms build the case quickly for breaking down data silos, and applying a data quality regimen. However, some firms still aren’t there yet. And a lack of data quality further complicates model development.
Understanding the data is a key step in the data mining process, and it is the one that should never be assigned to automated systems. Professional inspection is vital. Yet frequently it is glossed over, resulting in sometimes ridiculous conclusions. Take the analyst who arrives at the following result: “Those who respond spend more than those who don’t respond.” You don’t need an advanced degree to reach this insight. Or take one of my favorites: “Those owning homes are more likely to respond to a home equity line of credit offer.” It is these sorts of “statements” that a modeler must sift through to determine which should be immediately discarded and which have value.
One more example is worth noting. An attrition model for a home equity line of credit offer showed that the attritors were more likely to have additional products. Could this be true? It is contrary to many of the basic tenets of customer loyalty. Upon further investigation, it was learned that many of the attritors immediately opened up a new account at the same bank in order to leverage more favorable interest rates. When the “actual” attritor population was analyzed, the marketer found the relationship between products owned and loyalty to be in line with conventional wisdom.
Automated systems cannot think through which patterns make sense and which are garbage. It is not easy to recognize which trends and relationships hold up over time and which are spurious in nature. Understanding the data takes time, and it is this time that frequently separates ordinary from extraordinary models.
Once the model is completed, it is necessary to translate the model code onto a production database. Model development can occur on two types of files. In the first, an extract in the same format as the database is produced, and the analyst simply works with this extract to do his or her work. In the second, probably the more common method, the analyst works with an extract containing only those fields deemed important for creating the model. With the former, completing and applying the model code is more straightforward: the data miner is working with a mirror image of the file, all the fields are in the same position, and there is simply less chance of committing an error. The more commonly used latter method, however, is frequently associated with problems. This is especially true when dealing with relational files where data has to be aggregated. In that scenario, the extract fields are no longer in the same position as they were on the database.
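One defensive habit is to apply the scoring code by field name rather than by column position, so a reshuffled extract fails loudly instead of scoring silently wrong. A sketch, with invented field names and coefficients:

```python
# Hypothetical scoring equation; field names and coefficients are invented.
COEFFICIENTS = {"recency_months": -0.08, "orders_12m": 0.35, "avg_order_amt": 0.002}
INTERCEPT = -1.5

def score(record: dict) -> float:
    """Score by field NAME, never by column position. A missing or
    renamed field raises KeyError instead of silently using the wrong column."""
    return INTERCEPT + sum(coef * record[name] for name, coef in COEFFICIENTS.items())

print(score({"recency_months": 3, "orders_12m": 4, "avg_order_amt": 85.0}))
```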
Many of the more robust database marketers who employ a host of models take extra care to deploy quality control procedures. Indeed, several that I know employ one or more analysts who continually monitor model scoring. They produce key statistics by score range for the scored file and match these against the identical distributions that emerged from the original analysis file. If something doesn’t look right, further investigation must be done. While there are several reasons the original and current distributions may differ, some of them are program killers. This is another vital component of the modeling process, and it is oftentimes overlooked.
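A common way to formalize this comparison is a population stability index (PSI) computed over the score ranges. A minimal sketch, with illustrative shares of records per range:

```python
import math

def psi(expected_shares, actual_shares):
    """Population stability index between the development-file score
    distribution (expected) and the newly scored file (actual)."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_shares, actual_shares))

# Illustrative shares of records falling into five score ranges.
dev_shares =    [0.20, 0.20, 0.20, 0.20, 0.20]
scored_shares = [0.24, 0.21, 0.20, 0.19, 0.16]

value = psi(dev_shares, scored_shares)
print(f"PSI = {value:.4f}")
# Common rules of thumb: below 0.10 is stable; above 0.25 warrants investigation.
```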
Sam Koslowsky ([email protected]) is vice president of modeling solutions for Harte-Hanks Inc., New York.
Editor’s Note
The CRM Loop acknowledges that the findings and writings of Jim Wheaton of Wheaton Group about the predictive modeling process were inadvertently cited in the above contributed article (“Get Some Perspective on Analytic Software,” Sept. 8) but not attributed because the original source wasn’t known.
Jim’s 1994 article, “Myths and Realities of Building Models,” can be found on Wheaton Group’s Web site (www.wheatongroup.com) in the reference library for the direct and database marketing community. A shorter version originally appeared in DM News. The library consists of approximately 60 articles published by Wheaton Group’s four principals through Jan. 1 of this year. The articles are divided into seven categories:
1. Database marketing, direct marketing and/or CRM.
2. Data mining, predictive modeling, lifetime value analysis and/or data quality.
3. Data warehousing, desktop access, systems and/or technology.
4. Testing, statistical formulas and/or results analysis.
5. Overlay data, compiled lists and/or cooperative databases.
6. Merge/purge, data hygiene and/or service bureaus.
7. Technical.
The reference library allows keyword searches of the articles. All contain an Adobe PDF print option and should download in less than five seconds.
For copies of the eight pieces published since Jan. 1, contact Jim Wheaton at 919-969-8859. He can also be contacted at [email protected].