This directory contains the Ling-Spam corpus, as described in the 
paper:

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, 
and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam 
Filtering". In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), 
Proceedings of the Workshop on Machine Learning in the New Information 
Age, 11th European Conference on Machine Learning (ECML 2000), 
Barcelona, Spain, pp. 9-17, 2000.

There are four subdirectories, corresponding to four versions of 
the corpus:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.

Each one of these 4 directories contains 10 subdirectories (part1, 
..., part10). These correspond to the 10 partitions of the corpus 
that were used in the 10-fold experiments. In each repetition, one 
part was reserved for testing and the other 9 were used for training. 
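
The 10-fold layout described above can be sketched in Python as follows. 
This is only an illustration, not the code used in the paper: the 
function name fold_splits and the choice of "lemm_stop" in the usage 
line are assumptions, and only the directory names part1, ..., part10 
come from this file.

```python
from pathlib import Path


def fold_splits(version_dir, n_parts=10):
    """Yield (test_dir, train_dirs) once per repetition.

    version_dir is one of the four corpus versions (e.g. "lemm_stop").
    In each repetition one part is held out for testing and the
    remaining parts are returned for training.
    """
    root = Path(version_dir)
    parts = [root / f"part{i}" for i in range(1, n_parts + 1)]
    for held_out in parts:
        train = [p for p in parts if p != held_out]
        yield held_out, train


# Usage: iterate over the 10 repetitions for one corpus version.
for test_dir, train_dirs in fold_splits("lemm_stop"):
    pass  # train on train_dirs, evaluate on test_dir
```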

Each one of the 10 subdirectories contains both spam and legitimate 
messages, one message in each file. Files whose names have the form
spmsg*.txt are spam messages. All other files are legitimate messages.
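
A minimal sketch of how the naming rule above could be used to label 
the messages of one part. The helper names is_spam and load_part are 
hypothetical, and reading the files as latin-1 is an assumption (this 
file does not state the encoding of the corpus).

```python
from fnmatch import fnmatchcase
from pathlib import Path


def is_spam(filename):
    """True if the file name has the form spmsg*.txt."""
    return fnmatchcase(Path(filename).name, "spmsg*.txt")


def load_part(part_dir):
    """Return a list of (text, is_spam) pairs, one per file in part_dir."""
    messages = []
    for path in sorted(Path(part_dir).iterdir()):
        if not path.is_file():
            continue
        # Assumption: latin-1 decodes any byte sequence without error.
        text = path.read_text(encoding="latin-1")
        messages.append((text, is_spam(path)))
    return messages
```

Every file that does not match the spam pattern is treated as a 
legitimate message, as stated above.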

By obtaining a copy of this corpus you agree to acknowledge the use 
and origin of the corpus in any published work of yours that makes 
use of the corpus, and to notify the person below about this work.

Ion Androutsopoulos 
http://www.aueb.gr/users/ion/
Ling-Spam corpus last updated: July 17, 2000
This file (readme.txt) last updated: July 30, 2003.