Lab 5: Spam filter - Part II
In this part of the assignment, you will train your classifier on real-world e-mails, which you can download from here. They have the same convention of file-names as in part I.
The program should be written in Java, and the name of the main file should be
filter.javaand all files should compile using command:
But now, the program should take only one argument:
java filter testfile
testfileis the name of a file containg a single e-mail, which is to be classified. The program should return the same output as in part I. Since you do not specify the directory with the training files, the program needs to store its knowledge gained during training somewhere, e.g. in a separate file (which you can submit together with your code).
The program will be tested using automatic marking on a testing set with the same number of e-mails to that in the training set, and with the same class distribution to that in the training set. The testing set will contain e-mails from the same sources as the training set. The tests will be performed on the departmental Linux machines (MVB 2.11), and your program needs to classify all testing e-mails within 30 minutes. Sample automarking script (with a single testing e-mail) can be downloaded from here.
You also need to test your program yourself, using 10-fold cross validation.
In this part of the assignment, you will discover that in many practical machine learning problems implementing the learning algorithm is often only a small part of the overall system. Thus to get a high mark from the assignment you need to implement any of the more advanced classifying techniqes or clever pre-processing methods. You will find plenty of them on the internet. If you do not know where to look for them, ask Google ;-).
Problems you might have
- Arithmetic underflow: Because the number of words is very large, the probabilities dealt with are very small. When multiplied together, the values may become so small they cannot be represented. The solution is to use logarithms. The formula for classification becomes:
ci = argmax (log pi + sum (log P(wj|ci))).
The multiplications become additions because log(a*b)=log(a)+log(b).
- Memory: You need to be slightly careful about how you store some information while preprocessing.
Submit your code and any files it uses via the online submission system. Also submit 2 slides that you will be using during your presentation. The presentations will take place in January in MVB 3.43, and the detailed schedule is available here. Please submit your slides in a single pdf file (NOT in Power Point native format - so if you use Power Point to prepare the slides, please print them to pdf). Please note that you will be ONLY allowed to use 2 slides during your presentation (so if you submit more, you will NOT be able to show them). Also, please use maximum 10 lines of text per slide (more lines of text will result in reduced mark), but you are welcome to include charts and tables.
Each presentation will last up to 5 minutes (if you do not finish within 5 minutes, the presentation will be interupted). The presentation should describe your work, including the methods used to preprocess the data, any simplifying assumptions made and the results of a 10-fold cross validation carried out on the data provided using your program. Do not describe the Naive Bayes algorithm. If you implement extra preprocessing options, you should report the accuracy of 10-fold cross validation with and without these options (e.g. on a slide).
Below the provisional marking criteria are listed, but I reserve the right to change them.
Your program classifies the testing set with the accuracy significantly higher than chance within 30 minutes -- 30% Your presentation cointains the results of 10-fold cross validation -- 10% The remaining 60% of the mark will be based on the extra work you do. I attach below the criteria I used last year - but they are likely to change this year (as last year, different weighting was given to this assignment), so just treat them as very rough guideline. Preprocessing I divided preprocessing techniques used in two groups: Simple techniques - for each simple technique I gave: - 1% for implementation, and - 2% for comparison of accuracy with and without the technique. Advanced techniques - for each of advanced technique, I gave - 2% for implementation, - 2% for comparison of accuracy with and without, and - 2% for analysis how the accuracy depended on parameters of a technique. Result analysis - analyses of different types of errors (false positive, etc) - 1% - listing top words predicting spam/non-spam - 2% - statistical comparison of different classifiers 1-4% Other Marked depending on the technique And of course there will be 2 prizes: - Highest accuracy on the testing set - Most interesting approach The winners will receive original 8'' diskettes, and some bonus points.
My result for this assignment was the second best. I got 99,6% accuracy. In the report you will be able to see the techniques which I used to achieve this, but it is rally subjective - it depends of the training emails.
Source Code: http://velchev.co.uk/blogger/SpamFilter1.1.zip