Writing our own spam detector

This article was written for educational purposes only. It is provided purely for informational purposes and is not a call to action. The author is not responsible for your actions.

Working in Data Science, as you may have heard, is prestigious and well paid. In this article I will introduce you to the basics of the R language, which is used in data science. We will write a so-called naive Bayesian classifier that will detect spam in mail. By building this program, you will see how dry, boring mathematics can become quite visual and useful. You will be able to write and understand this program even if you have never encountered anything like it.

First, however, I must warn you that for a complete understanding it will be good if you are already familiar with the basics of statistical methods. Of course, I will try to present the material in such a way that a thoughtful reader unfamiliar with statistics also has a chance to delve into the topic, but the theoretical foundations of statistics will remain outside the scope of this article.

Also, I assume that, on the one hand, you are well versed in programming theory and practice and are familiar with algorithms and data structures, but, on the other hand, have not yet encountered the R language. I will introduce you to it to the extent that is enough for you to comfortably write, run and understand the first program.

Getting to know the R language and preparing the workplace
First, let's see how to install R, how to run and debug programs in it, how to find and install the necessary libraries (in the world of R they are called packages), and how to conveniently search the documentation for various functions.

The R language is extremely powerful when it comes to processing and analyzing data. De facto, it is the standard in Data Science. The language was created by mathematician-statisticians for mathematician-statisticians. In it, as in other tools designed for high-tech computing, the most common data type is the vector. The most popular data structure in R is the data frame: a matrix with attributes that looks very much like a relational database table.

Download the R IDE from the official site. Versions are available for three operating systems: Linux, Mac and Windows. The installation should be easy, but if something goes wrong, check the FAQ first.

Now we launch the IDE. It will look something like this.

[Screenshot r-ide.jpg: the R IDE window]


Syntax highlighting here is, of course, so-so - or rather, it does not exist at all - so you can safely write the code in some other editor and use the IDE only to run the finished code. Personally, I write R in the online editor at rextester.com.

To install a required package (library), use the install.packages function. It has a useful option, suggests. By default it is set to FALSE, but if you switch it to TRUE, install.packages will download and install all the secondary packages that the package you are installing relies on. I recommend that you always set this option to TRUE, especially when you are starting from a clean slate and have just (re)installed R.

Here's a script for your convenience that checks whether the packages you need are installed; if some of them are not, the script downloads and installs them. Retype it and save it to a file update_packages.r. For now it installs only the two packages we will need for the spam detector. You can add other packages as needed.

[Screenshot update-packages.jpg: the update_packages.r script]
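Since the script is only shown as a screenshot, here is a minimal sketch of what such a script might look like (my reconstruction, not the author's exact code; I assume the two packages are tm and ggplot2, the ones we use later, and I use the standard dependencies argument of install.packages to pull in secondary packages):
Code:
# update_packages.r (assumed reconstruction): install any of the
# packages we need that are not installed yet.
needed <- c("tm", "ggplot2")
missing <- needed[!(needed %in% installed.packages()[, "Package"])]
if (length(missing) > 0) {
  # dependencies = TRUE also pulls in the suggested secondary packages
  install.packages(missing, dependencies = TRUE)
}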


To execute a script (this one and any others you write), first switch the working directory to the one where you saved it. To do this, enter the setwd function (for example, setwd("e:/work/r")) into the R console. Then execute the source command like so: source("update_packages.r"). It will run your script, and you will see the packages you have not yet installed being downloaded.

To connect a package to the program, use the library or require function, e.g. library('tm').

To find the documentation for a function, just type ?xxx in the console, where xxx is the name of the function you are interested in. The IDE will open a page in the browser with information on this function.

Preparing data

First, let's prepare datasets for training and testing the future detector. I suggest taking them from the Apache SpamAssassin archive. Following the link, you will find a selection of letters, packaged in three categories: spam (actually, spam), easy_ham (legitimate letters that are easy to distinguish from spam), hard_ham (legitimate letters that are hard to distinguish from spam).

Create a folder data in your working directory. Go into it and create five more folders:
  • easy_nonspam_learn, easy_nonspam_verify;
  • spam_learn, spam_verify;
  • hard_nonspam_verify.
Distribute the letters from the spam folder between spam_learn and spam_verify. Distribute the letters from the easy_ham folder between easy_nonspam_learn and easy_nonspam_verify. Copy all the letters from the hard_ham folder into hard_nonspam_verify.

As you probably already guessed, we will use the letters from the _learn folders to train our detector to distinguish spam from non-spam, and the letters from the _verify folders to check how well it has learned to do this.

But why didn't we create a hard_nonspam_learn folder then? For the purity of the experiment! We will train the detector only on messages that are easy to distinguish from spam, and in the end we will see whether it can recognize legitimate mail in letters from the hard_nonspam category without prior training.

How features are constructed
Now that we have the initial data for training and verification, we need to "construct the features" that our detector will look for in the raw text files with letters. The ability to construct features is one of the basic skills in Data Science. The key to success here is professional intuition, which comes with years of practice. Computers cannot yet do this work for us automatically and, most likely, never will.

On the other hand, computers can make the work of constructing features easier. In particular, R has a package for text analysis called tm (from the words Text Mining). With its help, we will calculate which words are found most often in spam and in non-spam, and we will use their frequencies as features.

Modern spam detectors do a lot more than count word frequencies, but, as you will soon see, even our basic spam detector will do a pretty good job of separating spam from non-spam.

We will base our detector on a naive Bayesian classifier. The logic of its work is as follows: if we see a word that occurs more often in spam than in non-spam, we put it into the collection of spam features. By the same principle, we build a collection of features for non-spam.

How can these features help us distinguish spam from non-spam? We look for both types of features in the analyzed letter. If in the end it turns out that there are more spam features than non-spam features, the letter is spam; otherwise it is legitimate.

When calculating the probability that an email is spam, we do not take into account that some words may be interdependent: we evaluate every word in isolation from all the others. In statistical jargon, this assumption is called "statistical independence." When statisticians start from such an assumption without being completely sure that it is valid, they say: "Our model is naive." Hence the name: a naive Bayesian classifier, not just a Bayesian classifier.
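To make the assumption concrete, here is the standard naive Bayes formulation (a textbook formula, not something taken from this article's code): for a letter containing the terms w_1, ..., w_n,

Code:
P(spam | w_1, ..., w_n)  is proportional to  P(spam) * P(w_1 | spam) * ... * P(w_n | spam)

The independence assumption is exactly what lets us replace the joint probability of all the words with this simple product of per-word probabilities.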

Writing a function for reading letters from files
First, we load the libraries that we need, and write the paths to the folders in which the files with letters are stored.

[Screenshot src01.jpg: loading libraries and setting paths]
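Since the code is only shown as a screenshot, here is a minimal sketch of what it likely contains (a reconstruction; the variable names and the data/ layout are my assumptions based on the folders we created above):
Code:
# Assumed reconstruction of the setup code: load the packages
# and write down the paths to the letter folders.
library(tm)
library(ggplot2)

spam_learn.path          <- "data/spam_learn"
spam_verify.path         <- "data/spam_verify"
easy_nonspam_learn.path  <- "data/easy_nonspam_learn"
easy_nonspam_verify.path <- "data/easy_nonspam_verify"
hard_nonspam_verify.path <- "data/hard_nonspam_verify"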


Each separate file with a letter consists of two blocks: a header with metadata and the content of the letter. The first block is separated from the second by an empty line (this is a feature of the email protocol described in RFC822). We don't need metadata. We are only interested in the contents of the letter. Therefore, we will write a function that reads it from a file with a letter.

[Screenshot src02.jpg: the getMessage function]
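A sketch of what getMessage might look like, following the description below (a reconstruction, not the author's exact code):
Code:
# Read a letter file and return only its body: everything after
# the first empty line (the RFC 822 header/body separator).
getMessage <- function(path) {
  con <- file(path, open = "rt", encoding = "latin1")
  text <- readLines(con)
  close(con)
  # The body starts right after the first empty line
  msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
  return(paste(msg, collapse = "\n"))
}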


What are we doing here? In the R language, file I/O is done in much the same way as in most other programming languages. The getMessage function receives the path to a file as input and opens it in rt (read as text) mode.

Note that we are using Latin-1 encoding here. What for? Because many letters contain characters that are not in ASCII encoding.

The function readLines reads a text file line by line. Each line becomes a separate element in the vector text.

After we have read all the lines from the file, we look for the first empty line, and then extract all the lines after it. We put the result into the vector msg. As you probably understood, msg is the content of the letter, without the header metadata.

Finally, we collapse the entire msg vector into a single block of text (see the part of the code with the paste function), using the \n symbol as the line separator. What for? This makes the text more convenient to handle. And faster.

Now let's create a vector that will contain text messages from all spam emails. Each separate element of the vector is a separate letter. Why do we need such a vector? We will use it to train our detector.

[Screenshot src03.jpg: building the vector of spam letters]
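A sketch of this step (a reconstruction; spam_learn.path comes from the setup sketch above):
Code:
# List the spam files, drop the service file cmds, and read
# the body of every letter into one character vector.
spam_learn.docs <- dir(spam_learn.path)
spam_learn.docs <- spam_learn.docs[which(spam_learn.docs != "cmds")]
all.spam <- sapply(spam_learn.docs,
                   function(p) getMessage(file.path(spam_learn.path, p)))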


First, we get a list of all files from the spam folder. But there, in addition to letters, there is also a file cmds (a service file with a long list of Unix commands for moving files), which we do not need. Therefore, the second line from the previous code snippet removes the name of this file from the final list.

To create the vector we want, we'll use the sapply function, which will apply getMessage to all the file names we just got with dir.

Note that here we are passing sapply an anonymous function, which combines the file name with the path to the directory where it lies. Get used to it: this is a very common construction in the R language.

Preparing a text corpus for spam emails
Now we need to create a text corpus. With its help, we will be able to manipulate terms in letters (in corpus linguistics, the constituent parts of the text, including words, are called terms). Why do we need this? To construct the spam features for our detector.

Technically, this means that we need to create a term-document matrix (TDM) with N rows and M columns (N is the number of unique terms found across all documents; M is the number of documents in the text corpus). The cell [iTerm, jDoc] indicates how many times the term numbered iTerm occurs in the letter numbered jDoc.

[Screenshot src04.jpg: the getTDM function]
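A sketch of getTDM with the cleanup options discussed below (a reconstruction; note that minDocFreq is honored by the older tm versions this article assumes - newer tm releases express the same idea via bounds = list(global = c(2, Inf))):
Code:
# Build a term-document matrix from a vector of letter bodies.
getTDM <- function(doc.vec) {
  doc.corpus <- Corpus(VectorSource(doc.vec))
  control <- list(stopwords = TRUE,
                  removePunctuation = TRUE,
                  removeNumbers = TRUE,
                  minDocFreq = 2)
  return(TermDocumentMatrix(doc.corpus, control))
}

spam_learn.tdm <- getTDM(all.spam)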


The getTDM function receives a vector with all the text messages from all the spam emails as input and outputs a TDM.

The package tm allows constructing a corpus of texts in several ways, including from a vector of letters (see function VectorSource). If you are interested in alternative sources, type in the R-console ?getSources.

But before constructing the corpus, we need to tell the tm package how to clean up and normalize the text. We pass our wishes through the control parameter, which is a list of options.

As you can see, we are using four options here.
  1. stopwords=TRUE - Ignore 488 stop words (common words in the English language). To see what words are in this list, type in the console stopwords().
  2. removePunctuation=TRUE and removeNumbers=TRUE speak for themselves. We use them to reduce noise from the corresponding symbols, especially since many of our letters are stuffed with HTML tags.
  3. minDocFreq=2 - rows in our TDM need to be created only for those terms that appear in the text corpus more than once.

Preparing the ground for constructing features
Let me remind you that we want to train the detector so that it can estimate the probability that the message being analyzed is spam. How are we going to do this? By searching the analyzed email for terms that we consider features of spam. Here's what needs to be done in the code for this...

We will use the TDM we just created to build a training dataset from the spam emails. To do this, we will make a data frame that includes the total frequency of all terms; these frequencies can subsequently serve as features indicating that the analyzed letter is spam.

[Screenshot src05.jpg: building the term-frequency data frame]
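A sketch of this transformation (a reconstruction matching the description below):
Code:
# Convert the TDM to a plain matrix, total up each term's frequency,
# and pack the terms and frequencies into a data frame.
spam_learn.matrix <- as.matrix(spam_learn.tdm)
spam_learn.counts <- rowSums(spam_learn.matrix)
spam_learn.df <- data.frame(cbind(names(spam_learn.counts),
                                  as.numeric(spam_learn.counts)),
                            stringsAsFactors = FALSE)
names(spam_learn.df) <- c("term", "frequency")
spam_learn.df$frequency <- as.numeric(spam_learn.df$frequency)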


What are we doing here? To build the data frame, we first convert our TDM to a standard R matrix (the as.matrix function). Then, using rowSums, we create a vector in which, for each term, the total frequency across all spam messages is calculated.

Next, we combine string (with terms) and numeric (with frequency) vectors using the function data.frame.

Finally, we refine our data a little: we set the labels for the columns and convert the frequencies to a numerical representation.

Generating training data
[Screenshot src06.jpg: calculating occurrence and density]
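A sketch of this step (a reconstruction matching the description below):
Code:
# occurrence: in what share of the spam letters does each term appear?
spam_learn.occurrence <- sapply(1:nrow(spam_learn.matrix), function(i) {
  length(which(spam_learn.matrix[i, ] > 0)) / ncol(spam_learn.matrix)
})
# density: each term's share of all term counts in the corpus
spam_learn.density <- spam_learn.df$frequency / sum(spam_learn.df$frequency)

# Add both vectors to the data frame with transform
spam_learn.df <- transform(spam_learn.df,
                           density = spam_learn.density,
                           occurrence = spam_learn.occurrence)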


In this small piece of code, we first calculate, for each term, the percentage of documents in which it occurs (occurrence). How do we do this? We run each row (that is, each term) through an anonymous function (passed as an argument to sapply), which first counts the number of cells with a value greater than zero, and then divides the resulting sum by the number of columns in the TDM (that is, by the number of documents in the spam text corpus).

Next, we calculate the density of the term (density) over the entire text corpus.

Finally, using the transform function, we add to our data frame the two vectors just calculated: spam_learn.occurrence (the percentage of documents in which the term occurs) and spam_learn.density (the density of the term over the entire text corpus).

That's it! The training data set for the spam half of our detector is ready. Let's check it: let's see which terms from the spam text corpus turned out to be the brightest indicators. To do this, sort spam_learn.df by the occurrence column and see what happens. Enter in the R console:
Code:
head(spam_learn.df[with(spam_learn.df, order(-occurrence)), ])

The result should look something like this.

[Screenshot indicatorsspam.jpg: the brightest spam indicators]


As you can see, HTML tags are the most prominent indicators of spam. They account for over 30% of spam training emails.

Processing emails easy_nonspam
With the letters from the easy_nonspam_learn folder, we do the same as with spam. The code is pretty much the same.

[Screenshot src07.jpg: processing easy_nonspam_learn]
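The code is indeed almost identical to the spam case; a sketch of the one notable difference (a reconstruction):
Code:
# Same pipeline as for spam, but capped at as many letters as the
# spam training set contains (see the note on line 80 below).
easy_nonspam_learn.docs <- dir(easy_nonspam_learn.path)
easy_nonspam_learn.docs <- easy_nonspam_learn.docs[which(easy_nonspam_learn.docs != "cmds")]
easy_nonspam_learn.docs <- easy_nonspam_learn.docs[1:length(spam_learn.docs)]
all.nonspam <- sapply(easy_nonspam_learn.docs,
                      function(p) getMessage(file.path(easy_nonspam_learn.path, p)))
# ... then getTDM, rowSums, data.frame and transform, exactly as
# before, producing easy_nonspam_learn.df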


The only difference (see line 80): we read from easy_nonspam_learn not all the letters (there are more than 2000 of them) but only as many as we read from spam_learn (about 500). Why such egalitarianism? So that the detector works impartially. Otherwise, if it is overtrained on one category of letters, it will see spam where there is none, or vice versa.

Now the set of training data for the non-spam part of our detector is also ready. Let's see what happened:
Code:
head(easy_nonspam_learn.df[with(easy_nonspam_learn.df, order(-occurrence)), ])

The result should look something like this.

[Screenshot indicatorseasynonspam.jpg: the brightest non-spam indicators]


Writing a classifier
So, we have two sets of training data, that is, two collections of features: one for spam and one for non-spam. How will they help our detector separate the wheat from the chaff? For each collection, the detector will calculate the "naive Bayesian probability" that the analyzed letter belongs to that collection's category. Here is a function that embodies this idea.

[Screenshot src08.jpg: the classifyEmail function]
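A sketch of classifyEmail (a reconstruction from the parameter list and the explanation below; the default cNone = 0.001 corresponds to the 0.1% mentioned later, and prior = 0.5 is an assumed default):
Code:
classifyEmail <- function(path, trainingDF, prior = 0.5, cNone = 0.001) {
  # The first three steps mirror the processing of the training data
  msg <- getMessage(path)
  msg.tdm <- getTDM(msg)
  msg.freq <- rowSums(as.matrix(msg.tdm))
  # Terms shared between this letter and the training data
  msg.match <- intersect(names(msg.freq), trainingDF$term)
  if (length(msg.match) < 1) {
    # No known terms: every term gets the tiny constant probability cNone
    return(prior * cNone^(length(msg.freq)))
  } else {
    # Known terms contribute their occurrence rate; unknown ones, cNone
    match.probs <- trainingDF$occurrence[match(msg.match, trainingDF$term)]
    return(prior * prod(match.probs) *
           cNone^(length(msg.freq) - length(msg.match)))
  }
}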


We pass four parameters to it:
  1. path - the path to the letter to be analyzed;
  2. trainingDF - the data frame of the training set against which we want to compare the analyzed letter;
  3. prior - our "naive guess" about what percentage of emails usually turns out to be spam;
  4. cNone - the probability constant that we assign to new terms, those that do not occur in the training letters.
Pay attention to how I deal here with terms from the analyzed letter that were not in the training letters and that our detector therefore does not know; or rather, what I do with their zero frequency. If their frequency is left as is, the classifyEmail function will not work correctly. After all, its calculations are based on multiplying frequencies. If even one of the factors is zero - and this will happen very often - the result will invariably be zero.

Therefore, I assign terms with zero frequency a small constant frequency: 0.1%, that is, a frequency that is obviously much lower than the frequencies recorded in our training data. By the way, this way of handling unknown terms is a very common approach in Data Science.

How classifyEmail works
Please note that we do the first three steps here in the same way as when processing the training data:
  1. getMessage extracts the text from the email;
  2. getTDM converts this text to a TDM;
  3. rowSums calculates the frequencies.
Next, we need to determine how the terms from the analyzed letter intersect with the terms from our training data. To do this, we use the command intersect. We pass it the terms found in the analyzed letter and the terms from the training data.

The final step of the classification is to determine whether the analyzed letter contains any terms that are also in our training data. If there are, we use them to compute the probability that the analyzed letter belongs to the category of the training data frame passed via the trainingDF parameter.

How does it all look live? Let's say we are trying to assess whether the letter in question is spam. The function first collects into msg.match all the terms from the analyzed message that are also in the spam data frame (spam_learn.df). If there are no such terms at all (length(msg.match) < 1), the function replaces the zero frequencies with the value cNone.

Checking the detector
We will test our detector, trained on letters from spam and easy_nonspam, on the complex category of letters: hard_nonspam.

[Screenshot src09.jpg: checking the detector on hard_nonspam]
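A sketch of this check (a reconstruction; the vector names are taken from the description below):
Code:
# Run every hard_nonspam letter against both training data frames.
hard_nonspam_verify.docs <- dir(hard_nonspam_verify.path)
hard_nonspam_verify.docs <- hard_nonspam_verify.docs[which(hard_nonspam_verify.docs != "cmds")]

hard_nonspam_verify.spam_test <- sapply(hard_nonspam_verify.docs, function(p) {
  classifyEmail(file.path(hard_nonspam_verify.path, p), spam_learn.df)
})
hard_nonspam_verify.nonspam_test <- sapply(hard_nonspam_verify.docs, function(p) {
  classifyEmail(file.path(hard_nonspam_verify.path, p), easy_nonspam_learn.df)
})

# How many letters look more like spam than like non-spam?
sum(hard_nonspam_verify.spam_test > hard_nonspam_verify.nonspam_test)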


What are we doing here? First, as always, we take a list of files with letters. Then we make two runs through the classifier. The first run is to check for spam, the second is for non-spam. As a result, we get two vectors: hard_nonspam_verify.spam_test and hard_nonspam_verify.nonspam_test. We compare the numerical values from these vectors in pairs and calculate the result: how many emails look more like spam, and how many more like non-spam.

You probably can't wait to see the final score and understand how well the detector did its job. However, before showing the result, I want to remind you of two things. 1) We know beforehand that there is no spam in the sample of emails we have just tested; ideally, the spam counter should therefore remain zero. 2) We know that the emails from this sample are difficult to classify, because they contain terms that are clear indicators of spam. Why am I saying this? Because the result is likely to be not very inspiring. So what is it?

[Screenshot resulthard.jpg: results on hard_nonspam]


Our detector worked with an error of 8%. For a commercial program that is, of course, too much, but for a simple detector that you and I wrote for educational purposes, the result is quite pleasant. Especially considering that our detector saw letters from the hard_nonspam category here for the first time, because they were not included in the training sample.

Testing the detector
First, we will wrap the three actions that we performed in the previous code section (classifying against spam, classifying against non-spam, and comparing the two) into a simple function, spamClassifier. What does it do? If the message looks more like spam, it returns one, otherwise zero.

[Screenshot src10.jpg: the spamClassifier function]
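A sketch of spamClassifier (a reconstruction; besides the 1/0 verdict, I assume it also returns both probabilities, since the final table and the plot below need them):
Code:
spamClassifier <- function(path) {
  pr.spam    <- classifyEmail(path, spam_learn.df)
  pr.nonspam <- classifyEmail(path, easy_nonspam_learn.df)
  # Return both probabilities plus the verdict: 1 = spam, 0 = non-spam
  return(c(pr.spam, pr.nonspam, ifelse(pr.spam > pr.nonspam, 1, 0)))
}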


Now, to test the detector's operation, load letters from folders with the suffix _verify. Do you remember that initially we divided the letters into two parts: for training the detector (suffix _learn) and for testing it (suffix _verify)?

[Screenshot src11.jpg: loading the _verify letters]
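A sketch of loading the verification file lists (a reconstruction; hard_nonspam_verify.docs was already built in the previous section):
Code:
spam_verify.docs <- dir(spam_verify.path)
spam_verify.docs <- spam_verify.docs[which(spam_verify.docs != "cmds")]
easy_nonspam_verify.docs <- dir(easy_nonspam_verify.path)
easy_nonspam_verify.docs <- easy_nonspam_verify.docs[which(easy_nonspam_verify.docs != "cmds")]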


Then we run the detector on the letters of all three categories, in the same way as we did before. The only difference is that for simplicity we use the spamClassifier function rather than classifyEmail.

[Screenshot src12.jpg: running the classifier on all three categories]
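A sketch of these runs (a reconstruction):
Code:
# Classify every letter in each of the three verification folders.
spam_verify.class <- suppressWarnings(lapply(spam_verify.docs,
  function(p) spamClassifier(file.path(spam_verify.path, p))))
easy_nonspam_verify.class <- suppressWarnings(lapply(easy_nonspam_verify.docs,
  function(p) spamClassifier(file.path(easy_nonspam_verify.path, p))))
hard_nonspam_verify.class <- suppressWarnings(lapply(hard_nonspam_verify.docs,
  function(p) spamClassifier(file.path(hard_nonspam_verify.path, p))))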


Then we prepare the final data frame, which will show how the detector performed. We have already done something similar a little earlier.

[Screenshot src13.jpg: assembling the final data frame]
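A sketch of assembling classDF (a reconstruction; the column names and type labels are my assumptions):
Code:
# Stack the per-letter results and label each block with its true type.
spam_verify.final <- cbind(do.call(rbind, spam_verify.class), "SPAM")
easy_nonspam_verify.final <- cbind(do.call(rbind, easy_nonspam_verify.class), "EASY_NONSPAM")
hard_nonspam_verify.final <- cbind(do.call(rbind, hard_nonspam_verify.class), "HARD_NONSPAM")

classDF <- data.frame(rbind(easy_nonspam_verify.final,
                            hard_nonspam_verify.final,
                            spam_verify.final),
                      stringsAsFactors = FALSE)
names(classDF) <- c("Pr.SPAM", "Pr.NONSPAM", "Class", "Type")
classDF$Pr.SPAM    <- as.numeric(classDF$Pr.SPAM)
classDF$Pr.NONSPAM <- as.numeric(classDF$Pr.NONSPAM)
classDF$Class      <- as.logical(as.numeric(classDF$Class))
classDF$Type       <- as.factor(classDF$Type)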


Let's look at what we got in the end: head(classDF).

[Screenshot resultall.jpg: head(classDF)]


Judging by the first six rows of the table, the detector is doing its job well, but for a more objective picture let's count the false positives and false negatives, in total for all letters.

[Screenshot src14.jpg: counting false positives and false negatives]
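A sketch of this tally (a reconstruction; the row and column layout matches the description below):
Code:
# Share of "not spam" and "spam" verdicts within each true category.
getResults <- function(bool.vector) {
  c(length(bool.vector[bool.vector == FALSE]) / length(bool.vector),
    length(bool.vector[bool.vector == TRUE])  / length(bool.vector))
}
class.res <- rbind(getResults(subset(classDF, Type == "EASY_NONSPAM")$Class),
                   getResults(subset(classDF, Type == "HARD_NONSPAM")$Class),
                   getResults(subset(classDF, Type == "SPAM")$Class))
rownames(class.res) <- c("easy_nonspam", "hard_nonspam", "spam")
colnames(class.res) <- c("NOT SPAM", "SPAM")
print(class.res)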


What are we doing here? We create a matrix of three rows and two columns. Each row is the actual category of the letters (easy_nonspam, hard_nonspam, spam), and each column is the category assigned by our detector (not spam, spam). Ideally, we should get [1, 1, 0] in the first column and [0, 0, 1] in the second. And here is how we actually did.

[Screenshot resultfinal.jpg: the final matrix]

The table shows that our classifier works, if not perfectly, then quite well.

Drawing the result on a graph
[Screenshot src15.jpg: plotting the result]
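A sketch of the plotting code (a reconstruction; the output file name images/classification.pdf is my assumption, and the axes follow the description below):
Code:
library(ggplot2)

# Log-log scatter of both probabilities; the diagonal is the decision line.
class.plot <- ggplot(classDF, aes(x = log(Pr.NONSPAM), y = log(Pr.SPAM))) +
  geom_point(aes(shape = Type)) +
  geom_abline(intercept = 0, slope = 1) +
  xlab("log[Pr(NONSPAM)]") +
  ylab("log[Pr(SPAM)]") +
  theme_bw()

dir.create("images", showWarnings = FALSE)
ggsave(plot = class.plot,
       filename = file.path("images", "classification.pdf"),
       height = 10, width = 10)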


What are we doing here? We show on the graph how much the message looks like non-spam (the OX axis) and like spam (the OY axis). Moreover, we display the data on a logarithmic scale. Why such a trick? Because the scatter of the values we plot is very large; without the logarithmic transformation, the result would be much less clear.

You will find the graph saved in PDF format in your working folder, in the images subdirectory.

[Screenshot plotresult.jpg: the classification plot]


Ideally, all spam should go above the diagonal line and non-spam below. As you can see from the picture, although our result looks pretty good, it's still not perfect.

By analyzing the graph, we can get some intuition about the weak points of our detector. We see that some letters are pressed against the OX axis: their Prob(SPAM) is zero. Others are pressed against OY: their Prob(NONSPAM) is zero. And some letters are pinned to the zero point altogether.

[Screenshot plotartefacts.jpg: artifacts on the plot]


All these observations indicate that we have prepared a not very good training set for non-spam: judging by the results of processing the verification emails, there are many more terms that would make good indicators of non-spam than the set we actually use.

This is understandable. After all, we have designed a simplified spam detector. Of course, it can be developed and improved. Which is what I suggest you do as your homework. If you dare, be sure to share your successes in this field in the comments to the article.
 