(Magilla Marketing)
Project Aims to Clean Spam Data, Help Filters
A volunteer project is underway that has the potential to help spam filters avoid false positives, where anti-spam software mistakenly tags wanted e-mail as spam.
Dubbed “Spam or Ham,” the project asks volunteers to look at as many of a series of random e-mails as they want, and label each as either spam—unsolicited e-mail—or ham—wanted e-mail.
The project aims to get a consensus on 92,189 e-mails by getting 10 people to view and label each one. As of last week, the project was reportedly more than a third done.
The database—which has already been automatically split and labeled by existing spam filter technology—is the result of an effort by a group called the Text Retrieval Conference, or TREC, which is affiliated with the U.S. National Institute of Standards and Technology.
The idea is that once thousands of human volunteers have cataloged the e-mails, the resulting enormous database of messages, accurately labeled at least in theory, can be used to help developers make anti-spam filters less prone to mistakes.
The project’s developer, John Graham-Cumming, decided to call for volunteers because it would be impossible for a handful of people to sift through and categorize so many messages.
“I thought, ‘why don’t we do a Web site where random people come in and look at mails and give us their opinion on them, and then once we’ve got say, maybe 10 people per message, we’ll get a consensus on what they are?’” he said. “The idea is can we clean up the TREC dataset to make sure it’s perfectly split into spam and regular mail.”
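The consensus step Graham-Cumming describes can be pictured as a simple vote count. The Python sketch below is illustrative only and is not the project's actual code: the function name `consensus_label` and the 80 percent agreement threshold are assumptions, since the article says only that about 10 people would label each message.

```python
from collections import Counter


def consensus_label(votes, min_votes=10, min_agreement=0.8):
    """Return the agreed label ('spam' or 'ham'), or None if there is no consensus.

    min_votes mirrors the "10 people per message" figure in the article.
    min_agreement is a hypothetical threshold chosen for illustration.
    """
    if len(votes) < min_votes:
        return None  # not enough volunteers have looked at this message yet
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None  # volunteers disagree too much to call it either way


# Nine volunteers say spam, one says ham: treated as spam under these assumptions.
print(consensus_label(["spam"] * 9 + ["ham"]))
```

Messages that come back as None under a scheme like this would be the ambiguous ones, such as the marketing and phishing e-mails that volunteers may misjudge.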
One possible pitfall is that people may inaccurately identify permission-based marketing e-mail as spam, and phishing e-mail aimed at getting bank account information as legitimate, said Graham-Cumming.
Oddly enough, the Enron fiasco helped make this project possible. Dozens of the defunct energy company’s employees’ e-mails were made public during its bankruptcy trial, and those e-mails are part of the Spam or Ham project.
Graham-Cumming said that once the effort is completed, the cleaned-up data will be available free to anyone who can use it.
“I just wanted to make sure that the underlying data is accurate,” he said. “The question is: Can we get this set of data that is really nice and clean and use it to test spam filters?”
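To give a sense of how a cleanly labeled set could be used to test a filter, the sketch below computes a false-positive rate: the share of wanted e-mail that a filter tags as spam. It is a minimal illustration under stated assumptions, not a description of any real filter's test harness. The function name `false_positive_rate` and the sample labels are made up for the example.

```python
def false_positive_rate(true_labels, predicted_labels):
    """Fraction of ham (wanted) messages that the filter tagged as spam."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be the same length")
    ham_total = 0
    false_positives = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == "ham":
            ham_total += 1
            if guess == "spam":
                false_positives += 1
    if ham_total == 0:
        raise ValueError("no ham messages to evaluate")
    return false_positives / ham_total


# Hypothetical data: human consensus labels versus one filter's verdicts.
truth = ["ham", "ham", "spam", "ham", "spam"]
filter_says = ["ham", "spam", "spam", "ham", "ham"]
print(false_positive_rate(truth, filter_says))  # one of three ham messages was mis-tagged
```

The measurement is only as good as the labels it is checked against, which is why the accuracy of the underlying data matters.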