Analysis and Evaluation of Statistical Spam Filtering.

Introduction

A statistical filter is an automated system that evaluates documents in machine-readable form. Statistical filtering as a concept was introduced by M. Sahami et al. in 1998 (Dwork & Naor, 2002). Although Paul Graham did not invent it, he drew researchers' attention to it in his famous essay ‘A Plan for Spam’, which advocated building Bayesian probability models of spam and non-spam words. The essay emphasized software that estimates, for each word, the probability that it indicates spam. It presented such software as automatic, implementable in a small amount of code, adjustable to specific needs, and effective at its intended function (Zdziarski, 2005).

Body

Spam filtering originated in text classification research. Its ultimate goal is to recognize spam accurately. The methodologies vary with the classification algorithm used. Statistical filters share this origin and, although they have evolved, they still share the same basic operating principles. They are all multinomial or multivariate models (Okin, 2003). A statistical filter distills a document into a set of preset features that depends on the filter type.

A filter can be programmed to sift messages based on specific words, numbers, or even whole phrases. These features are then encoded as a vector of Boolean values (multivariate) or real values (multinomial), on which filtering is based. The filter can then be tuned with rule-based methods, whose rules are either generated automatically or designed by hand. The resulting machine learning algorithms are driven mainly by the overall frequencies, or statistics, of the features being extracted (Bergin, 1996).
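The two encodings can be illustrated with a minimal sketch. The vocabulary, the function names and the sample message below are assumptions made for illustration; they do not come from any particular filter.

```python
import re

# Hypothetical feature vocabulary, chosen for illustration only.
VOCABULARY = ["free", "offer", "meeting", "report"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def boolean_vector(text, vocabulary=VOCABULARY):
    """Multivariate encoding: True if the feature occurs at all."""
    present = set(tokenize(text))
    return [word in present for word in vocabulary]

def count_vector(text, vocabulary=VOCABULARY):
    """Multinomial encoding: how many times each feature occurs."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

message = "Free offer! Claim your free report today."
print(boolean_vector(message))  # [True, True, False, True]
print(count_vector(message))    # [2, 1, 0, 1]
```

The Boolean vector records only whether a feature is present, while the count vector also keeps how often it occurs.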

Content filters vary according to the features they are designed to detect. Spam is generally repetitive and in most cases contains certain characteristic terms. Word-based spam filters are the simplest form: they look for certain words in an email and block it if any is found. These filters can be evaded by replacing or misspelling the words most common in spam.
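A word-based filter, and the way a misspelling evades it, can be sketched as follows. The blocklist and the function name are hypothetical and serve only as an example.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_WORDS = {"viagra", "lottery"}

def is_blocked(text, blocked=BLOCKED_WORDS):
    """Return True if any blocked word appears as a whole token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return any(token in blocked for token in tokens)

print(is_blocked("You won the lottery"))  # True
print(is_blocked("You won the l0ttery"))  # False: the misspelling slips through
```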

Heuristic filters, also called rule-based filters, are more complex than word filters. They look for multiple terms in an email: they scan the contents of incoming messages and assign points to words or phrases. Keywords normally found in spam receive higher points, and a total score is computed for the entire email. The owner determines the cutoff score used to classify an email as spam or legitimate.

The filter blocks or deletes messages that reach the cutoff score and preserves those that score lower. Heuristic filters are relatively easy to operate and quite fast. A major disadvantage is that they can also filter out legitimate mail that happens to use certain words or phrases heavily. Conversely, spammers can evade them by avoiding these words in their messages.
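The scoring scheme described above can be sketched in a few lines. The rule weights, the threshold and the function names are assumptions chosen for illustration, and the simple substring matching is a simplification.

```python
# Hypothetical rule weights, for illustration only.
RULE_POINTS = {"free": 2.0, "winner": 3.0, "click here": 2.5, "unsubscribe": 1.0}

def heuristic_score(text, rules=RULE_POINTS):
    """Sum the points of every rule phrase found in the message.

    Matching is by substring, so "free" would also match "freedom".
    """
    lowered = text.lower()
    return sum(points for phrase, points in rules.items() if phrase in lowered)

def classify(text, threshold=5.0):
    """Label a message as spam when its score reaches the owner's cutoff."""
    return "spam" if heuristic_score(text) >= threshold else "legitimate"

print(classify("Winner! Click here for a free prize"))       # spam (score 7.5)
print(classify("The free seminar starts at noon"))           # legitimate (score 2.0)
print(classify("Congratulations to the contest winner; entry is free"))  # spam (score 5.0)
```

The last message is plausibly legitimate, yet it reaches the cutoff, which shows how a heuristic filter can produce a false positive.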

Bayesian filters are by far the most effective form of content-based filter. They compute the mathematical probability that a message is spam, based on features observed in other spam mail. The first system to use Bayesian classification was ifile, written by Jason Rennie and released in 1996. Although a lot of research was taking place and many variants of this software were developed, it was Paul Graham’s publication on spam that popularized the algorithm to a wider audience. Bayesian filters are effective because they adapt, which makes them hard for spammers to evade.

Every site that installs a Bayesian filter builds its own unique set of words, each assigned a specific statistical value in the database (Domingos & Pazzani, 2002). These words are also referred to as tokens. This gives the filter an advantage over the spammer, because it is difficult for a spammer to write a message that will deliberately evade a filter whose token values he cannot see. Bayesian filters are also updated frequently, so they can track changes in spam mail and continue to block it.
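How per-token values turn into a single spam probability can be sketched as below. This is a minimal naive Bayes style combination that assumes equal priors for spam and legitimate mail; the token table, the neutral value of 0.5 for unknown tokens and the function name are hypothetical.

```python
import math

# Hypothetical per-token spam probabilities; values must lie strictly between 0 and 1.
TOKEN_SPAM_PROB = {"free": 0.90, "winner": 0.95, "meeting": 0.10, "report": 0.20}

def combined_spam_probability(tokens, table=TOKEN_SPAM_PROB, unknown=0.5):
    """Combine token probabilities as prod(p) / (prod(p) + prod(1 - p)).

    The products are computed in log space so that long messages do not underflow.
    """
    log_spam = sum(math.log(table.get(t, unknown)) for t in tokens)
    log_ham = sum(math.log(1.0 - table.get(t, unknown)) for t in tokens)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the message is overwhelmingly legitimate
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

print(round(combined_spam_probability(["free", "winner"]), 3))     # 0.994
print(round(combined_spam_probability(["meeting", "report"]), 3))  # 0.027
```

Because each installation learns its own table, the same message can receive different probabilities at different sites.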

Over the last few years Bayesian spam filtering has become a popular mechanism that the majority of mail clients apply to sort mail into spam and legitimate folders. Individual users can also install dedicated email filtering software. These filters are highly effective but require considerable patience, because the user initially has to mark spam mail manually.

The filter collects the words in legitimate and spam mail and compiles them into a list, which it uses to block spam. The Bayesian algorithm, however, loses efficiency over time as the word list grows. A major advantage of a Bayesian filter is that it is sensitive to the needs of the client (Drucker, 2006): it can be tailored so that it does not filter out legitimate mail as spam. Certain tokens, when they occur in large numbers in an email, lead to that email being regarded as spam. The majority of servers have mail filters that use Bayesian fundamentals, while some server-level spam filters are heuristic based. Widely applied examples include versions of SpamAssassin, ASSP and SpamBayes.
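The training step, in which the filter learns from mail the user has marked, might look like the following sketch. The add-one smoothing, the equal-prior assumption, the function names and the sample messages are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(spam_messages, ham_messages):
    """Count in how many messages of each class every token appears."""
    spam_counts, ham_counts = Counter(), Counter()
    for message in spam_messages:
        spam_counts.update(set(tokenize(message)))
    for message in ham_messages:
        ham_counts.update(set(tokenize(message)))
    return spam_counts, ham_counts

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) with add-one smoothing and equal class priors."""
    p_token_given_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

# Messages the user has marked by hand (invented examples).
spam = ["win free money", "free prize inside"]
ham = ["meeting notes attached", "free lunch at the meeting"]
spam_counts, ham_counts = train(spam, ham)

for word in ("free", "meeting", "unseen"):
    p = token_spam_probability(word, spam_counts, ham_counts, len(spam), len(ham))
    print(word, round(p, 2))  # free 0.6, meeting 0.25, unseen 0.5
```

Smoothing keeps a token seen in only one class from receiving a probability of exactly 0 or 1, and leaves unseen tokens neutral.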

The functionality of these programs can be installed within the main server software, which handles a greater volume of mail. Once installed and configured, spam filtering software requires no further maintenance. The user only needs to mark messages as spam or non-spam, and the software operates on guidelines drawn from the user’s judgments about the content of his messages (Mulligan, 1999).

A statistical filter adjusts to changes in the content of spam emails as soon as it detects them (Wyman, 1998). Such filters are also programmed to monitor distinctive differences in how a message was transported by looking at message headers. Headers can be a better basis than content for discriminating against spam mail (Tipton & Krause, 2006).

There is an ongoing contest between spammers and anti-spam services and experts. Spammers attempt to evade statistical filtering software by inserting random data into their messages while concealing it from the reader, making it more probable that the message will be classified as neutral. This randomly placed data is valid text and is mainly concealed by setting it in a smaller font or in the same color as the background (Goodman, 2004).

Software programs that use Bayesian classification include Bogofilter, Mozilla, Mozilla Thunderbird and Mailwasher. CRM114 is a more recent innovation that detects spam by Bayesian classification of the phrases within messages. POPFile is a freely available email filter that individual clients can use to sort mail into easily manageable folders using the principles of Bayesian filtering. Later versions of SpamAssassin also sort mail and detect spam by Bayesian principles, whereas older versions use rule ranking. The operating speed of these tools is determined by their data structures, and the word lexing technique used determines their ability to discard spurious random strings (Gregory & Simon, 2005).

Bayesian filters built to detect words work more efficiently than the others. A major disadvantage of these algorithms is that the number of distinct words an email can contain is unlimited. Even after most English words have been included, the number of possible character sequences is boundless, and new vocabulary widens the scope further. The same applies to messages that incorporate random strings in Message-IDs or in UU and base64 encodings.
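One way a lexer might keep such random strings out of the word list is sketched below. The two heuristics (a length limit and a ban on tokens that mix letters and digits), the function name and the sample text are assumptions for illustration and are not taken from any of the tools named in this essay.

```python
import re

def lex(text, max_length=20):
    """Split text into lowercase tokens and drop ones that look like random strings.

    Assumed heuristics: a token is discarded when it is longer than
    max_length characters or when it mixes letters with digits.
    """
    kept = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_letter = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if len(token) > max_length or (has_letter and has_digit):
            continue
        kept.append(token.lower())
    return kept

sample = "Invoice attached Message-ID 4f9a2c7e1b TWFueSBoYW5kcyBtYWtlIGxpZ2h0IHdvcms"
print(lex(sample))  # ['invoice', 'attached', 'message', 'id']
```

A rule this blunt also throws away deliberate misspellings that mix letters and digits, so a real lexer would have to weigh that loss against the saving in word-list size.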

Infrequent terms can be removed from the spam filter to increase operational speed (Goodman, 2004). To classify spam, researchers mainly use naive Bayes, which is, however, sensitive to how its small feature set is selected and does not perform optimally where errors carry heavy penalties. Other algorithms, such as AdaBoost and the maximum entropy model, can be used instead. These are preferred because they are not sensitive to the selection strategy. They can also be tuned to high performance across different datasets and extremely high feature dimensions.
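Removing infrequent terms can be as simple as the following sketch. The minimum count of 2, the function name and the sample counts are assumptions chosen for illustration.

```python
from collections import Counter

def prune_infrequent(token_counts, min_count=2):
    """Return a new Counter holding only tokens seen at least min_count times."""
    return Counter({t: c for t, c in token_counts.items() if c >= min_count})

# Invented token counts: two common words and two one-off random strings.
counts = Counter({"free": 12, "meeting": 7, "xqzt": 1, "zzkj": 1})
pruned = prune_infrequent(counts)
print(sorted(pruned))  # ['free', 'meeting']
```

A smaller table means fewer lookups per message, at the cost of ignoring tokens that have only just begun to appear in spam.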

Other algorithms are also currently applied in the fight against spam. Support Vector Machine algorithms show the best accuracy and speed when set to use binary features (Dumais & Horvitz, 2005). Boosted decision trees are also a practical approach to spam filtering, with good speed and accuracy.

Spam mail is a major problem for all internet users (Goodman, 2004). It wastes time and may lead email account holders to ignore or delete legitimate mail as they try to avoid spam. Spam can also carry viruses and damage important information. The war between spammers and the experts who build spam filters has intensified considerably over time (Graham, 2007).

It is impossible to prevent spammers from sending spam, but spam filters can prevent it from reaching us. Spam filters have evolved from simple non-statistical procedures such as blacklisting and whitelisting to more advanced statistical software that blocks up to 96% of spam from the inbox. Most of these methods conduct several checks on each message. The best spam filter algorithms are content-based ones that evaluate messages for specific words or phrases and calculate the probability that a message is spam. They are very efficient and widely applied (Pogue, 2005).

Other filtering methods include the challenge-response system, which presents the sender with a challenge before the message is delivered. Because spammers send messages in large numbers and usually automatically, they cannot respond to these challenges. Collaborative filters have also been applied. These take a community-based approach: once individuals in a community tag a sender's messages as spam, that sender can no longer reach the community. Domain Name System lookup systems use several anti-spam techniques to identify and block spammers (Schwartz, 1999).

Conclusion

None of these methods ensures the total eradication of spam, but using several approaches together gives a better chance of separating spam from legitimate mail (OECD, 2006).

References

Jonathan A. Zdziarski, (2005), Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, San Francisco, No Starch Press, Inc.

Thomas J. Bergin, Richard G. Gibson, (1996), History of Programming Languages II, Boston, Addison-Wesley.

J. R. Okin, (2003), The Internet Revolution: The Not-for-dummies Guide to the History, Technology, And Use of the Internet, New Jersey, Ironbound Press.

Organization for Economic Co-operation and Development (2006), OECD Anti-spam Toolkit of Recommended Policies and Measures, Boston, OECD Publishing.

Carolyn Wyman, (1998), Spam: A Biography, New York, Harcourt Brace.

Alan Schwartz, S. Garfinkel, (1999), Stopping Spam, Florida, O’Reilly.

Steve H. Graham, (2007), The Good the Spam and the Ugly: Shooting It Out with Internet Bad Guys, New York, Citadel Press Inc.

Geoff Mulligan, (1999), Removing the Spam: Email Processing and Filtering, Michigan, Addison-Wesley.

David Pogue, (2005), Mac OS X: The Missing Manual, Tiger Edition, Florida, O’Reilly.

Marcia S. Smith, B. G. Kutais, (2007), Spam and Internet Privacy, Michigan, Nova Publishers.

Peter H. Gregory, Mike Simon, Michael A. Simon, (2005), Blocking Spam & Spyware for Dummies, New York, For Dummies.

Harold F. Tipton, Micki Krause, (2006), Information Security Management Handbook, Boston, CRC Press.

D. Fallows (2003) Spam: How it is hurting email and degrading life on the Internet, Florida, Pew Internet and American Life Project.

Dumais M. S. & Horvitz E. (2005), A Bayesian Approach to Filtering Junk E-mail, Texas, Citadel Inc.

Dwork C. & Naor M., (2002), Pricing through Processing Junk Mail, New York, Prentice Hall.

Goodman J. (2004) “Spam Technologies and Policies”. Web.

Domingos P. & Pazzani M, (2002), Bayesian classifier, California, University Press.

Drucker H. D, (2006) Machines for spam categorization, New York, Addison-Wesley.
