TREC 2005
The TREC Spam Track attempted to build a “gold standard” corpus of spam and ham and then rigorously test a wide variety of spam filters (all from universities or open source projects) against it. In fact there were four corpora, although only one will be made public; it consists of over 92,000 categorized messages drawn from the Enron corpus and seeded with recent spam messages.
The full set of filters tested was:
Beijing University of Posts and Telecommunications
bogofilter 0.92.2
Chinese Academy of Sciences (ICT)
Dalhousie University
DSPAM 3.4.9
IBM Research (Richard Segal)
Indiana University
Jozef Stefan Institute
Laird Breyer
Massey University (SpamBayes)
Mitsubishi Electric Research Labs (CRM-114)
Pontificia Universidade Catolica Do Rio Grande Do Sul
POPFile 0.22.2
SpamAssassin 3.0.2 (various configurations)
SpamProbe 1.0a
Universite Paris-Sud
York University
Perhaps the most interesting news of all is that the best performance in this testing came not from SpamAssassin or a Bayesian filter, but from a compression-based filter from the Jozef Stefan Institute in Slovenia.
The papers associated with the TREC Spam Track contain extensive information about the test runs, with full results including the percentage of spam misclassified as ham (which they call sm%), the percentage of ham misclassified as spam (which they call hm%), and (1-ROCA)%, the area above the filter's ROC curve expressed as a percentage. That last figure measures how far a spam filter falls short of perfection (perfection being a Spam Hit Rate of 1 and a Ham Strike Rate of 0), so lower is better. You can find an explanation of ROC curves here: http://www.sussex.ac.uk/Users/danw/pdf/roc2.pdf
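To make the three measures concrete, here is a minimal Python sketch of how they could be computed from a filter's output. It is not the TREC evaluation toolkit: the function names, the convention that a higher score means "more spam-like", the 0.5 threshold, and the six sample messages are all assumptions made up for illustration. The ROC area is computed using its ranking interpretation (the chance that a randomly chosen spam scores higher than a randomly chosen ham, with ties counting half).

```python
def split_scores(results):
    """Split (score, is_spam) pairs into ham scores and spam scores."""
    ham = [score for score, is_spam in results if not is_spam]
    spam = [score for score, is_spam in results if is_spam]
    return ham, spam


def misclassification_rates(results, threshold=0.5):
    """Return (hm%, sm%) for a given decision threshold.

    A message is treated as spam when its score is >= threshold
    (an assumed convention for this sketch).
    """
    ham, spam = split_scores(results)
    hm = 100.0 * sum(score >= threshold for score in ham) / len(ham)
    sm = 100.0 * sum(score < threshold for score in spam) / len(spam)
    return hm, sm


def one_minus_roca_percent(results):
    """Return (1-ROCA)%: the area above the ROC curve, as a percentage."""
    ham, spam = split_scores(results)
    wins = 0.0
    for s in spam:
        for h in ham:
            if s > h:
                wins += 1.0   # spam correctly ranked above ham
            elif s == h:
                wins += 0.5   # ties count as half
    area = wins / (len(spam) * len(ham))
    return 100.0 * (1.0 - area)


# Hypothetical filter output: (score, is_spam)
sample = [
    (0.9, True), (0.8, True), (0.4, True),
    (0.6, False), (0.2, False), (0.1, False),
]

hm, sm = misclassification_rates(sample)
imperfection = one_minus_roca_percent(sample)
print(f"hm% = {hm:.2f}, sm% = {sm:.2f}, (1-ROCA)% = {imperfection:.2f}")
```

Note that hm% and sm% depend on where the threshold is set, while (1-ROCA)% summarizes the filter's ranking quality across all possible thresholds, which is why it is convenient for comparing filters in a single column.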
Looking at the results of the testing against the public corpus, the top ten filters were:
Filter                  | (1-ROCA)%
------------------------+----------
Jozef Stefan Institute  | 0.02
Laird Breyer            | 0.04
CRM-114                 | 0.04
IBM Research            | 0.04
bogofilter              | 0.05
SpamAssassin (Bayesian) | 0.06
SpamProbe               | 0.06
Massey University       | 0.16
POPFile                 | 0.33
York University         | 0.46
(Note that the Jozef Stefan Institute came out on top in the runs against all four corpora.)

March 30th, 2010 at 12:24 pm
Thanks for the information!