Spam Filtering: Understanding SEP and CEP

To help folks further understand the differences between complex event processing (CEP) and simple event processing (SEP), and prompted by Marc’s reply in the blogosphere, More Cloudy Thoughts, here is the scoop.

In the early days of spam filtering (let’s go back around 10 years), spam was detected with rule-based systems. In fact, one of the first papers to document rule-based approaches to spam filtering was E-Mail Bombs and Countermeasures: Cyber Attacks on Availability and Brand Integrity, published in IEEE Network Magazine, Volume 12, Issue 2, pp. 10-17 (1998). At the time, rule-based approaches were the state of the art in antispam filtering.

Over time, however, the spammers got more clever and found many ways to poke holes in rule-based detection approaches. They learned to write with spaces between the letters in the words, they changed the subject and message text frequently, they randomized their originating IP addresses, they used the IP addresses of your best friends, they changed the timing and frequency of the spam, etc., ad infinitum.
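
To make the cat-and-mouse game concrete, here is a minimal sketch in Python. The blocked phrases, function names, and sample messages are all invented for illustration and do not come from any real filter. It shows a verbatim phrase rule being defeated by spaced-out letters, and a patched rule being defeated in turn by digit substitution.

```python
import re

# Hypothetical rule set: flag any message containing a blocked phrase.
BLOCKED_PHRASES = ["free money", "act now"]

def rule_based_is_spam(message):
    """First-generation rule: a blocked phrase must appear verbatim."""
    text = message.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)

def normalized_rule_is_spam(message):
    """Patched rule: drop everything except letters, then match."""
    text = re.sub(r"[^a-z]", "", message.lower())
    return any(phrase.replace(" ", "") in text for phrase in BLOCKED_PHRASES)

plain = "Get FREE MONEY today"
spaced = "Get f r e e  m o n e y today"   # spaces between the letters
swapped = "Get fr33 m0ney today"          # digits swapped in for letters

for name, msg in [("plain", plain), ("spaced", spaced), ("swapped", swapped)]:
    print(name, rule_based_is_spam(msg), normalized_rule_is_spam(msg))
```

The first rule catches only the plain message. The patched rule also catches the spaced-out variant but still misses the digit-swapped one, so every new trick needs yet another hand-written rule.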

Not to sound like an elitist for speaking the truth, but the more operational experience you have with detection-oriented solutions, the more you will understand that rule-based approaches (alone) are neither scalable nor efficient. If you followed a rules-based approach (only) against heavy, complex spam (the type of spam we see in cyberspace today), you would spend much of your time writing rules and still not stop very much of the spam!

The same is true for Marc’s security situation-detection example.

As with Google’s Gmail spam filter and Microsoft’s old Mr Clippy (the goofy help algorithm of the past), you need detection techniques that use advanced statistical methods to detect complex situations as they emerge. With rules, you can only detect simple situations unless you have a tremendous amount of resources to build and maintain very complex rule bases (and even then, rules have limitations for real-time analytics).

We did not make this up at Techrotech, BTW.   Neither did our favorite search engine and leading free email provider, Google!   

This is precisely why Gmail has a great spam filter. Google detects spam with a Bayesian classifier, not a rule-based system. If they used (only) a rule-based approach, your Gmail inbox would be full of spam!
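
For readers who have not seen one, here is a minimal sketch of a word-based naive Bayes spam classifier in Python. It is a toy under stated assumptions and not Gmail's actual implementation, which is not public: the class name, the whitespace tokenizer, the add-one smoothing, and the four training messages are all invented for illustration.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Toy naive Bayes text classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labeled message; label is 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        """Return P(spam | words), assuming words are independent given the class."""
        if min(self.message_counts.values()) == 0:
            raise ValueError("train on at least one spam and one ham message first")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                smoothed = self.word_counts[label][word] + 1
                score += math.log(smoothed / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert the two log scores back into a normalized probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])

spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting notes attached", "ham")
spam_filter.train("lunch at noon tomorrow", "ham")

print(spam_filter.spam_probability("free money"))        # leans toward spam
print(spam_filter.spam_probability("meeting tomorrow"))  # leans toward ham
```

Nobody writes a rule for any particular word here. The classifier learns its weights from labeled examples, so retraining on newly marked messages is how it keeps up when the spammers change tactics.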

The same is true for search and retrieval algorithms, but that is a topic for another day. However, you can bet your annual paycheck that Google uses a Bayesian type of classifier in its highly confidential search and retrieval (and, hint, classification) algorithms.

In closing, don’t let the folks selling software and analysts promoting three-letter-acronyms (TLAs) cloud your thinking. 

What we are seeing in the marketplace, the so-called CEP marketplace, are simple event processing (SEP) engines. CEP is already happening in the operations of Google, a company that needs real-time CEP for spam filtering and also for search-and-retrieval. We also see real-time CEP in top quality security products that use advanced neural networks and Bayesian networks to detect problems (fraud, abuse, denial-of-service attacks, phishing, identity theft) in cyberspace.

4 Responses to Spam Filtering: Understanding SEP and CEP

  1. Marc says:

    Thanks Greg.

    Two questions (which may be fodder for a new blog entry):

    1) How would you add complexity to my simple example so that we can see how Bayesian Classifiers would work?

    2) Could one implement a Bayesian Classifier in a SQL-like language?

  2. Greg Reemler says:

    Hi Marc,

    1) Most realistic problems, beyond the trivial, require more advanced analytics. It is easy to build trivial examples.

    BTW, TIBCO did a series of briefings on CEP and identity theft detection (in the public domain) that illustrated a simple Bayes Classifier for this problem.

    2) I suppose you can implement just about anything in an SQL-like language, but if you search the web (Google), I don’t think you will find many people implementing these classes of analytics in SQL. There are some in Java (and C of course):

    http://www.google.co.uk/search?hl=en&safe=off&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=Bayesian+Classifier+java&spell=1

    Regards,

    Greg

  3. Hans says:

    Marc, it’s easy to envision where you might use a Bayesian classifier. Let’s say you want to detect intruders based on something other than their retina scan. So you think of several other criteria to distinguish these people: perhaps the color and cut of their suit and the university that they attended. To do the simplest Bayesian analysis, you gather this data for everyone entering your facility (be they intruder or not). You “train” your Bayesian classifier with lots of data, where each row contains the suit color, suit cut, university and status as an intruder. The classifier develops a model that, under certain assumptions, does the best possible job of determining whether someone is likely to be an intruder based only on their suit cut, suit color and university. Now someone arrives for whom you cannot get a retina scan (they are wearing sunglasses). You feed in the color and cut of their suit and ask them for the university that they attended. From this, the Bayesian inference engine uses the previously developed model to give you a probability that they are an intruder.

    Clearly, there are several challenges here. First, you need to properly define the predictor variables. Second, you have to train the classifier properly. Basically, you need a huge and continuously updating sample where you know all the predictor variables and also the person’s status as an intruder. For spam, this information comes from people marking spam as such in their inbox; that message marked as spam goes back in to train the model based on various qualities of the message.
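
The intruder example above can be sketched in a few lines of Python. This is an illustrative toy, not code from any commenter: the function names, feature names, add-one smoothing, and the five training rows are all made up, and the "certain assumptions" are taken to be that the predictors are independent given the person's status.

```python
from collections import Counter, defaultdict

def train(rows):
    """Count classes and per-class feature values from (features, is_intruder) rows."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> value counts
    feature_values = defaultdict(set)     # feature name -> distinct values seen
    for features, label in rows:
        class_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return class_counts, value_counts, feature_values

def intruder_probability(model, features):
    """Return P(intruder | features) under the naive independence assumption."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior for this class
        for name, value in features.items():
            k = len(feature_values[name])
            # Add-one smoothing so an unseen value never zeroes out the product.
            score *= (value_counts[(label, name)][value] + 1) / (n + k)
        scores[label] = score
    return scores.get(True, 0.0) / sum(scores.values())

# Invented training rows: (predictor variables, known intruder status).
rows = [
    ({"suit_color": "black", "suit_cut": "slim", "university": "State U"}, True),
    ({"suit_color": "black", "suit_cut": "slim", "university": "Tech U"}, True),
    ({"suit_color": "gray", "suit_cut": "classic", "university": "State U"}, False),
    ({"suit_color": "navy", "suit_cut": "classic", "university": "Tech U"}, False),
    ({"suit_color": "gray", "suit_cut": "slim", "university": "Tech U"}, False),
]
model = train(rows)

visitor = {"suit_color": "black", "suit_cut": "slim", "university": "State U"}
print(intruder_probability(model, visitor))
```

With only five rows the number it prints means very little, which is exactly the training-data challenge described above: the model is only as good as the labeled sample behind it.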

  4. Hans says:

    And BTW, I started to implement a simple Bayesian classifier and inference engine in SQL. I ran into one huge problem: matrix manipulation in SQL boggles my mind. So I never got very far, but I think you’ll be better off doing this in a procedural language or in R or Matlab.
