Tired of the limitations and annoying false positives with commercial spam filters? Classifier4J is an open source Java library that will let you build custom applications that read e-mails and other types of text documents, separating the wheat from the chaff exactly the way you intend.
As more and more information fills our lives and clutters our inboxes, our ability to read, filter, and process it manually declines. There is only so much time that we can spend on it. The trend shows no signs of abating, despite the best efforts of many individuals and companies in the industry. By all accounts, things are going to get worse.
Enter intelligent filters, ones that not only look for certain keywords that don't need to be reprinted here, but that also attempt to determine the sentiment of text. In other words, filters that can read an e-mail and statistically figure out what it is about and whether it interests you, based on a set of parameters that you define. Many modern spam filters do this, training themselves on the mail that you mark as spam or not spam. These tools are getting better by the day, but they aren't foolproof. For example, false positives are a frequent problem.
Classifier4J is an open source Java library designed for exactly this purpose: classifying text. (It is available from Sourceforge at http://classifier4j..) It includes an implementation of a Bayesian classifier, a statistical method for calculating the probability that a given hypothesis is true (based on Bayes' theorem; see http://www./better.html for a good implementation outline). A Bayesian classifier is typically used to evaluate whether a piece of text belongs to a given subject matter. The classic example is determining whether an e-mail is spam.
In this article I will build a simple POP3 client using the JavaMail API, which has lots of very cool features that allow you to build your own mail applications that use IMAP, POP3, and SMTP. Check the Sun documentation for in-depth details. This client will pull e-mails from your POP3 box and pass them through the Classifier4J libraries to classify their contents, determine their spam relevance, and even produce an automatic summary of their contents!
To get started, you first need to download and install the JavaMail API. This is available from Sun. (The source code in this article uses version 1.3.1.) You will also need the JavaBeans Activation Framework (JAF), which is a dependency of JavaMail.
Once you have downloaded and installed these packages, you are ready to build your first e-mail client. You will need a POP3 e-mail account, along with the username, password, and server name associated with that account.
Building Your First E-mail Client
This application will be a very simple console application; you can expand it later into something more complex. One cool idea is to build it into a servlet-based application that gives you a hosted e-mail client, like Hotmail or Yahoo!, but with a built-in spam filter and classifier.
Here is the application:
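package com.devx.jmail;

import javax.mail.*;
import javax.mail.internet.*;
import java.util.*;
import java.io.*;
import net.sf.classifier4J.*;
// The wildcard above does not cover subpackages; the later snippets also need:
import net.sf.classifier4J.bayesian.*;    // BayesianClassifier, IWordsDataSource, SimpleWordsDataSource
import net.sf.classifier4J.summariser.*;  // ISummariser, SimpleSummariser

public class MailReader
{
    public static void main(String[] args)
    {
        try
        {
            // Replace these placeholders with your own POP3 account details.
            String popServer = "yourpopserveraddress";
            String popUser = "yourpopusername";
            String popPassword = "yourpoppassword";
            GetMail(popServer, popUser, popPassword);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        System.exit(0);
    }

    // GetMail and its helper methods follow in the full listing (see the download).
}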
The values for popServer, popUser, and popPassword must be the correct values for your POP3 account or the application won't work. As you can see, this is a very simple console application that doesn't do much (yet!); the GetMail function is the workhorse.
Getting Mail
The JavaMail API is huge, and has way too much depth to go into detail here, so for the purposes of this example you'll be doing the simplest thing possible: logging in, scanning the inbox for contents, and downloading a copy of those contents. You can view the full function GetMail in the download, but the snippets that handle the heavy lifting are shown here:
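// Log in to the POP3 server.
store.connect(popServer, popUser, popPassword);
folder = store.getDefaultFolder();
if (folder == null) throw new Exception("No default folder");
// Incoming POP3 mail lives in the folder named INBOX.
folder = folder.getFolder("INBOX");
if (folder == null) throw new Exception("No POP3 INBOX");
// READ_ONLY leaves the messages on the server untouched.
folder.open(Folder.READ_ONLY);
Message[] msgs = folder.getMessages();
for (int nMsg = 0; nMsg < msgs.length; nMsg++)
{
    // Flatten each message into a single string for classification.
    strEmail = buildMessage(msgs[nMsg]);
}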
After setting up a JavaMail store object, you connect to it using the popServer, popUser, and popPassword values. If this works without throwing an exception (the code in Listing 2 should be wrapped in a try...catch clause), you will be able to get the default folder associated with the store. If there is no default folder, an exception is thrown.
Every POP3 account has an 'INBOX' folder containing incoming mail. If that folder exists, it is opened, and an array of Message objects is read from it. This is the list of all mail that is currently in your inbox. Don't worry, reading the mail won't delete it from your inbox, because the folder is opened as 'READ_ONLY.'
You then loop through all of these Message objects and build a string out of each Message using the 'buildMessage' function. This function is available in its entirety in the download, but the key parts of it are shown here:
InputStream is = messagePart.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String thisLine = reader.readLine();
while (thisLine != null)
{
    strReturn += thisLine;
    thisLine = reader.readLine();
}
A POP3 e-mail message is made up of a number of entities, including the sender name, sender address, subject, and body. The body can be made up of a number of parts and may include attachments. The buildMessage function gets all these entities and simply appends them all to a string that it returns to the caller.
The important part of the message for our example is the e-mail body. This can be numerous lines of text, so the messagePart object (which is built from the e-mail body, see the full function) exposes an InputStream that you can use to read it line by line. This is used to create a BufferedReader, which then reads in the e-mail body.
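One detail worth noting about the loop shown above: readLine() strips the line terminator, so appending each line directly joins the last word of one line to the first word of the next. The following is a minimal, self-contained sketch of the same read loop that keeps a newline between lines. The class name BodyReader and the method readAll are hypothetical and not part of the download, and decoding the stream as UTF-8 is an assumption (a real message declares its own character set in its headers).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the article's download. */
final class BodyReader {

    /** Reads the whole stream line by line, keeping one '\n' after each line. */
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        // Assumption: the body is UTF-8 encoded text.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() drops the terminator, so put a separator back.
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

A StringBuilder is used here instead of repeated string concatenation, which avoids copying the accumulated text on every line of a long message.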
You now have a simple e-mail client that logs in to your POP3 box, downloads the messages in your inbox one by one, and converts each into a string that can be used for classification and summarization with Classifier4J.
Simple Text Classification
Classifier4J includes a number of classifiers for text. The first that we will look at is the SimpleClassifier, which performs straightforward keyword matching. The code below shows how to use it to establish a probability score that the e-mail is spam. The e-mail is determined to be spam (or not) simply based on the presence of the word 'Belgium.' (If you are familiar with The Hitchhiker's Guide to the Galaxy you will know why this word is appropriate. For a full explanation, you can visit this rude words guide.)
public static double checkSpam(String strEmailBody)
{
    double dClassification = 0.0;
    try
    {
        SimpleClassifier classifier = new SimpleClassifier();
        classifier.setSearchWord("Belgium");
        dClassification = classifier.classify(strEmailBody);
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
    return dClassification;
}
In the download, this function is called by the GetMail function, so when an e-mail is downloaded and bundled into a string, it is passed to this function, and the spam score is determined. As this is a very simple case, the score will be either 0.0 or 1.0, with 0.0 being legitimate e-mail and 1.0 being the spam end of the continuum. In the program, anything with a spam score greater than 0.7 is considered spam. Note that the match is case-sensitive. An e-mail with the word 'belgium' will score 0.0, and one with the word 'Belgium' will score 1.0. Should you want to make it case-insensitive, you would have to check against a converted version of strEmailBody, i.e. check the lower-case version of that string for 'belgium,' or the upper-case version for 'BELGIUM.'
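The case-insensitive variant and the 0.7 cutoff can be sketched in plain Java. This is a stand-in that mimics the 0.0/1.0 behaviour described above with a simple substring test; it is not Classifier4J's code, and the names KeywordSpamCheck, score, and isSpam are hypothetical.

```java
import java.util.Locale;

/** Hypothetical stand-in for the keyword check described above; not part of Classifier4J. */
final class KeywordSpamCheck {

    /** Cutoff taken from the article: scores above 0.7 are treated as spam. */
    static final double SPAM_CUTOFF = 0.7;

    /** Returns 1.0 if body contains word, ignoring case; otherwise 0.0. */
    static double score(String body, String word) {
        if (body == null || word == null || word.isEmpty()) {
            return 0.0;
        }
        // Convert both sides to lower case so 'Belgium' and 'belgium' match.
        String normalizedBody = body.toLowerCase(Locale.ROOT);
        String normalizedWord = word.toLowerCase(Locale.ROOT);
        return normalizedBody.contains(normalizedWord) ? 1.0 : 0.0;
    }

    /** Applies the cutoff to a score. */
    static boolean isSpam(double score) {
        return score > SPAM_CUTOFF;
    }

    public static void main(String[] args) {
        System.out.println(score("Greetings from belgium", "Belgium"));  // 1.0
        System.out.println(score("Greetings from Brussels", "Belgium")); // 0.0
    }
}
```

The same normalization could be applied to strEmailBody and the search word before handing them to a classifier, rather than replacing the classifier as this sketch does.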
Bayesian Classification
The simple classifier above is great for getting started, but once you want more detailed classification, you will need to use the Bayesian one. Thankfully, this is very simple to use, with all the complex statistical analysis done for you under the hood.
This is a very simple case of how a Bayesian filter can be used:
IWordsDataSource wds = new SimpleWordsDataSource();
wds.addMatch("Belgium");
wds.addMatch("Vogon");
wds.addMatch("Devx");
IClassifier classifier = new BayesianClassifier(wds);
dReturn = classifier.classify(strEmailBody);
The filter is initialized with a words data source, which in turn is set up with three words as matches. This sample, while simple, is ultimately useless, as the Bayesian filter has very little on which to base its judgments. To use a Bayesian filter properly, it has to be trained with a large data set of words that match the context as well as words that don't match the context. In the real world, a lot of words match both contexts.
To make this a little clearer, consider the word 'the.' It appears in just about every e-mail, spam and non-spam alike. However, 'millionaire' is more likely to appear in spam. A full spam-filtering application is constantly trained on what is spam and what isn't (valid and invalid, respectively), gaining intelligence as it goes. Thus, when it receives an incoming mail, it uses its experience with previous ones to determine whether the mail is spam or not.
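The intuition about 'the' and 'millionaire' can be made concrete with a toy filter. The following is a teaching sketch only and is not how Classifier4J's BayesianClassifier is implemented; the class TinyBayesFilter and its methods are hypothetical. It assumes spam and non-spam are equally likely beforehand, counts each word once per message, and uses add-one smoothing so that no word gets a probability of exactly 0 or 1.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical teaching example; this is not Classifier4J's BayesianClassifier. */
final class TinyBayesFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamMessages = 0;
    private int hamMessages = 0;

    void teachSpam(String text) { count(text, spamCounts); spamMessages++; }

    void teachHam(String text) { count(text, hamCounts); hamMessages++; }

    /** Probability that a message containing this word is spam (equal priors assumed). */
    double wordSpamProbability(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        // Add-one smoothing keeps both fractions strictly between 0 and 1.
        double inSpam = (spamCounts.getOrDefault(key, 0) + 1.0) / (spamMessages + 2.0);
        double inHam = (hamCounts.getOrDefault(key, 0) + 1.0) / (hamMessages + 2.0);
        return inSpam / (inSpam + inHam);
    }

    /** Combines the per-word probabilities into one score between 0 and 1. */
    double classify(String text) {
        double logOdds = 0.0; // summing log-odds avoids underflow on long messages
        for (String word : words(text)) {
            double p = wordSpamProbability(word);
            logOdds += Math.log(p) - Math.log(1.0 - p);
        }
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    private static void count(String text, Map<String, Integer> counts) {
        for (String word : words(text)) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    /** Lower-cases the text and returns its distinct alphanumeric words. */
    private static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

After teaching it two spam messages that both contain 'millionaire' and two legitimate ones that don't, with 'the' appearing in all four, 'the' carries no evidence either way (a probability of 0.5), while 'millionaire' pushes the score toward spam.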
You can train a Bayesian filter in Classifier4J using the ITrainableClassifier interface. A full example demonstrating this is available in the Classifier4J optional distribution download, which is available in the src/java/net/sf/classifier4J/demo path. This demonstration takes as input text files that have already been deemed valid or invalid as a method of training the filter. The example then trains the Bayesian filter to use these input files as stimuli in determining the relevance of another file. It should be relatively straightforward to adapt the mail application used here to constantly retrain the filter on incoming mail and to use that to increase your chances of filtering out all your spam.
Auto-Summarizing with Classifier4J
In addition to classifying your incoming e-mail, this application can also summarize the contents. For example, you could expand the application to fish through your inbox, summarize the contents of the mail, and send the summary as a new e-mail somewhere else, perhaps to your cell phone or other mobile device.
Summarizing with Classifier4J couldn't be easier: You simply create an instance of a class that implements ISummariser, then pass it the string to be summarized and the number of sentences you want in the summary. It does the rest, returning you a string.
The code below, which is available in the download, shows the getSummary method in action.
public static String getSummary(String strEmailBody, int nSentences)
{
    ISummariser summ = new SimpleSummariser();
    String strSumm = summ.summarise(strEmailBody, nSentences);
    return strSumm;
}
In the application, getSummary is called with the e-mail body and a request for a summary of the e-mail in three sentences. Here is the code:
String strSumm = getSummary(strEmail,3);
To test and demonstrate this application, I looked up an old DevX article of mine, cut and pasted the entire first page into the body of an e-mail, and sent it to myself. When the Java application downloaded that e-mail it summarized it nicely. Here is the summary result:
The invention of database driver methodologies such as JDBC
and ODBC led to applications being loosely coupled with their back end databases,
allowing best-of-breed databases to be chosen—and swapped
out when necessary—without any ill-effect on the user interface.
Similarly, the decoupling of data and presentation in HTML—by using XML for the data and
XSLT for the presentation of data—has led to much innovation and flexibility,
not least of which is the ability to deliver a document as data in XML and deliver
custom styling for that document with different XSLTs.
A runtime engine would be present on the desktop, and servers would
be able to deliver the GUI to the browser with an XML document.
This condenses exactly 1001 words of text into three sentences containing 115 words, while still keeping a pretty good handle on what the article was about. Very impressive indeed!
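To give a feel for how an extractive summary like this can be produced, here is a small frequency-based sketch. It illustrates one common approach and makes no claim about how SimpleSummariser works internally; the class FrequencySummarizer and its method are hypothetical. It assumes sentences end in '.', '!' or '?' followed by whitespace, and it ignores words shorter than four letters as a crude substitute for a stop-word list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical frequency-based summarizer; not Classifier4J's SimpleSummariser. */
final class FrequencySummarizer {

    /** Returns up to maxSentences of the highest-scoring sentences, in document order. */
    static String summarize(String text, int maxSentences) {
        // Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Count how often each word occurs in the whole text.
        Map<String, Integer> frequency = new HashMap<>();
        for (String word : words(text)) {
            frequency.merge(word, 1, Integer::sum);
        }

        // Score each sentence by the summed frequency of its words.
        int[] scores = new int[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            for (String word : words(sentences[i])) {
                scores[i] += frequency.get(word);
            }
            order.add(i);
        }

        // Highest score first; the sort is stable, so ties keep document order.
        order.sort((a, b) -> Integer.compare(scores[b], scores[a]));
        int keep = Math.max(0, Math.min(maxSentences, order.size()));
        List<Integer> chosen = new ArrayList<>(order.subList(0, keep));
        Collections.sort(chosen); // restore the original sentence order

        StringBuilder summary = new StringBuilder();
        for (int index : chosen) {
            if (summary.length() > 0) {
                summary.append(' ');
            }
            summary.append(sentences[index]);
        }
        return summary.toString();
    }

    /** Lower-cased words of four or more letters or digits, duplicates included. */
    private static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= 4) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Because scores are summed rather than averaged, this sketch favors longer sentences; dividing each score by the sentence's word count is one way to reduce that bias.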
With more and more information bombarding your inbox, your instant messenger, your telephone, your television, and every other media device every day, technology that can understand the context of such information is a massive area of potential growth. The obvious application is in spam filtering, but there are many other useful ways of leveraging this ability. How about an intelligent agent that monitors incoming news stories from a news feed, finds the ones that are most likely to interest you, summarizes them, and sends them to your mobile device? Or one that reads movie reviews, scanning them for characteristics that interest you, and e-mails you relevant plot summaries?
The options are endless, and the Classifier4J open source library is the toolkit that will allow you to start writing these applications. This article has merely scratched the surface of what can be done using Classifier4J with an e-mail interface. The rest is up to you!
Laurence Moroney is a senior architect in a major financial services house in New York City. He has written software in many fields, from casino management to enterprise chat systems. He is the co-author of a forthcoming book on Web Services security.