I get a lot of junk comments on this site, posted by bots that attempt phishing attacks. They never show up on the production blog, because I never authorise them. Right now I have over 35000 comments in the database that I need to sort into spam and not spam.
I was recently reading an introductory book on data science that has a chapter illustrating a method for spam filtering. So I set out to implement it and see whether it works on my data.
First I had to copy all the messages from the main comments table in the database to another table that serves as the training set. Then, for each word in each message of that set, I calculated the probability of a message being spam or not spam when that word appears in it. Note that I treated every message in the training set as spam, since I don’t get any legitimate comments, because basically no-one reads this site.
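Here is a minimal sketch of what this training step could look like in Python. It is not the code running on this site. It assumes a naive Bayes style model that stores, for each word, the smoothed probability of seeing it in a spam message and in a non-spam message. The names `tokenize` and `word_probabilities`, the regex used to split words, and the smoothing constant `k` are all made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a message and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def word_probabilities(messages, k=0.5):
    """Estimate per-word probabilities from labelled messages.

    messages: iterable of (text, is_spam) pairs.
    Returns {word: (P(word | spam), P(word | not spam))}.
    The add-k smoothing (k is an assumed value) keeps every
    probability strictly between 0 and 1, even for a class
    that has no training messages at all.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam_counts.update(tokenize(text))
        else:
            n_ham += 1
            ham_counts.update(tokenize(text))

    vocabulary = set(spam_counts) | set(ham_counts)
    return {
        word: (
            (spam_counts[word] + k) / (n_spam + 2 * k),
            (ham_counts[word] + k) / (n_ham + 2 * k),
        )
        for word in vocabulary
    }

# Every training message is labelled as spam, as described above.
training = [("cheap pills now", True), ("cheap watches", True)]
probs = word_probabilities(training)
```

With a training set that contains only spam, the non-spam side is driven entirely by the smoothing term, so every word gets the same non-spam probability. That is a limitation of the data, not of the sketch.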
As the last step, when a new message arrives I calculate the probability of it being spam from the word probabilities calculated before. Right now I have a job that runs periodically and works through the backlog of messages already in the database. I am curious to see what will be left once they have all been processed.
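The scoring step could be sketched like this, again as an illustration and not as the actual implementation. It assumes the word probabilities are kept as a dictionary mapping each word to a pair (probability of the word in spam, probability of the word in non-spam), with every value strictly between 0 and 1. The function name `spam_probability`, the word-splitting regex, and the 0.5 prior are assumptions.

```python
import math
import re

def spam_probability(text, word_probs, prior_spam=0.5):
    """Naive Bayes estimate of P(spam | words in text).

    word_probs: {word: (P(word | spam), P(word | not spam))},
    each value strictly between 0 and 1.
    prior_spam: assumed prior probability that a message is spam.
    """
    words = set(re.findall(r"[a-z0-9']+", text.lower()))

    # Work in log space so that multiplying many small
    # probabilities does not underflow to zero.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word, (p_spam, p_ham) in word_probs.items():
        if word in words:
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        else:
            # A known word that is absent also carries evidence.
            log_spam += math.log(1.0 - p_spam)
            log_ham += math.log(1.0 - p_ham)

    # Normalise the two scores into a probability, subtracting
    # the maximum first for numerical stability.
    top = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - top)
    e_ham = math.exp(log_ham - top)
    return e_spam / (e_spam + e_ham)

# Hypothetical word probabilities, only to show the call.
example_probs = {"cheap": (0.8, 0.1), "hello": (0.1, 0.6)}
score_spammy = spam_probability("Cheap pills", example_probs)
score_friendly = spam_probability("hello there", example_probs)
```

A periodic job could call a function like this on each unprocessed comment and flag those whose score exceeds some threshold. The threshold is a choice to tune against the data.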