Co-training with a Single Natural Feature
Set Applied to Email Classification
Jason Chan
School of Information Technologies
The University of Sydney, Australia
jchan3@it.usyd.edu.au
Irena Koprinska
School of Information Technologies
The University of Sydney, Australia
irena@it.usyd.edu.au
Josiah Poon
School of Information Technologies
The University of Sydney, Australia
josiah@it.usyd.edu.au
Abstract
When dealing with information overload from the
Internet, such as the classification of Web pages and the
filtering of email spam, a new technique called co-
training has been shown to be a promising approach to
help build more accurate classifiers. Co-training allows
classifiers to learn with fewer labelled documents by
taking advantage of the more abundant unclassified
documents. However, conventional co-training requires
the dataset to be described by two disjoint and natural
feature sets that are sufficiently redundant. In many
practical situations, it is not intuitively obvious how to
obtain two natural feature sets. This paper shows that
co-training can still be beneficial for email classification
when only a single natural feature set is used.
1. Introduction
One of the greatest problems facing users of the
Internet is dealing with information overload. Usually, the
great majority of information available consists of
unwanted or unhelpful instances. Even after
narrowing searches or applying email
filters, there still exists a significant quantity of undesired
documents.
Research in Text Categorization [6] has shown that it
is possible to build effective classifiers to filter unwanted
documents given a sufficiently large set of training
examples. However, obtaining labelled Web pages or
emails is very costly, because it usually requires a great
deal of human effort to classify unlabelled documents.
A new technique to overcome this problem, called
co-training [1], was shown to be capable of converting
unlabelled Web documents into labelled Web documents
by initially starting off with only a small pool of classified
examples. One of the main requirements that were stated
for co-training to be successful was that the dataset must
be described by two disjoint sets of natural features that
were redundantly sufficient. That is, using only either one
of the natural sets of attributes, a classifier can be built
with reasonably high accuracy. For example, in their
experiment that dealt with the problem of classifying Web
pages, the two sets of features used to describe a page
were the words in the body of the page and the words in
hyperlinks of other documents referring to that particular
page.
In the great majority of practical situations, there do
not exist two natural sets of features that can describe the
dataset. This paper investigates the applicability of co-
training to such datasets. We compare co-training with a
single natural feature set against co-training with two natural
feature sets. By analysing the results, we address the
question of when co-training with a random split of
features is likely to be useful. The experiments are based
on the application of email classification, which attempts
to determine whether a given message is a genuine email
or spam.
2. The co-training algorithm
In a given application with redundantly sufficient
features, a classifier with reasonable performance can be
built with each of the two sets of features separately. Co-
training employs these two classifiers in a loop to label all
the unlabelled examples. Each classifier takes turns to
select the most confidently predicted examples and add
these into the training set. Both classifiers then re-learn on
the enlarged training set so that they take into account the
newly added (and previously unlabelled) data. The loop is
then repeated for a number of iterations chosen to maximize
performance on a separate validation set.
The idea behind the co-training algorithm is that one
classifier, with its set of features, can confidently predict
the class of an unlabelled example because it is similar to
the training instances. However, it may only be similar to
the training instances for this classifier's set of features.
Because of the confidence with which this classifier
predicts this example's class, it will be labelled
accordingly and placed into the training set. Hence, the
other classifier will be able to learn from this instance and
adjust better in the future.
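The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names TinyNB, project and co_train are hypothetical, the learner is a small hand-written multinomial naive Bayes standing in for any classifier that reports a confidence, and the number of rounds is fixed by the caller rather than tuned on a validation set.

```python
import math


class TinyNB:
    """Multinomial naive Bayes with add-one smoothing (a stand-in learner)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            totals = [sum(col) + 1 for col in zip(*rows)]
            self.log_like[c] = [math.log(t / sum(totals)) for t in totals]
        return self

    def predict_proba(self, x):
        scores = {c: self.log_prior[c]
                  + sum(v * w for v, w in zip(x, self.log_like[c]))
                  for c in self.classes}
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exp.values())
        return {c: e / norm for c, e in exp.items()}


def project(x, view):
    """Keep only the feature indices that belong to one view."""
    return [x[i] for i in view]


def co_train(labelled, unlabelled, view_a, view_b, rounds=5, per_turn=1):
    """Grow the labelled set by letting the two views take turns labelling."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        for view in (view_a, view_b):
            if not pool:
                return labelled
            # Re-learn this view's classifier on the current training set.
            clf = TinyNB().fit([project(x, view) for x, _ in labelled],
                               [y for _, y in labelled])
            scored = []
            for i, x in enumerate(pool):
                proba = clf.predict_proba(project(x, view))
                label = max(proba, key=proba.get)
                scored.append((proba[label], label, i))
            # Move the most confidently predicted examples into the training set.
            chosen = sorted(scored, reverse=True)[:per_turn]
            for _, label, i in sorted(chosen, key=lambda t: t[2], reverse=True):
                labelled.append((pool.pop(i), label))
    return labelled


# Toy data (made up for illustration): features 0-1 form one view, 2-3 the other.
seed = [([3, 0, 2, 0], "spam"), ([0, 3, 0, 2], "ham")]
pool = [[4, 0, 3, 0], [0, 2, 0, 4], [2, 0, 1, 0], [0, 1, 0, 3]]
result = co_train(seed, pool, view_a=[0, 1], view_b=[2, 3], rounds=2)
```

Each turn trains only the classifier whose view is active, on the full current training set, so an example labelled through one view is seen by the other view's classifier on its next turn.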
3. Previous work on co-training
Blum and Mitchell [1] first introduced the technique of
co-training. In their application of identifying academic
course home pages from a set of Web documents, co-
training was shown to be able to reduce the error rate of a
classifier. Theoretical insights were also given, among
them the requirement of redundantly sufficient
feature sets.
Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'04)
0-7695-2100-2/04 $ 20.00 IEEE
Kiritchenko and Matwin [2] applied co-training to the
task of predicting which folder a user would place an
email into. They found that the performance of co-
training is sensitive to the learning algorithm used. In
particular, co-training with naive Bayes (NB) worsens
performance, while co-training with Support Vector Machines (SVM)
improves it. The authors attributed this to the inability
of NB to deal with large sparse datasets. This explanation
was confirmed by significantly better results after feature
selection.
Nigam and Ghani [4] investigated the sensitivity of co-
training to the assumption of redundant sufficiency. One
of their experiments involved performing co-training on a
dataset whereby a natural split of feature sets is not used.
The two feature sets were chosen by randomly assigning
all the features of the dataset into two different groups.
This was tried for two datasets: one with a clear
redundancy of features, and one with an unknown level of
redundancy and non-evident natural split in features. The
results indicated that the presence of redundancy in the
feature sets gave the co-training algorithm a bigger
advantage over expectation maximization. Together with
theoretical insights, this result led the researchers to
conclude that co-training has a considerable dependence
on the assumption of redundant sufficiency. However,
even when this assumption is violated, as in many
practical settings, co-training can still
be quite useful in improving a classifier's performance.
4. Experimental setup
4.1 Objective
In the large majority of cases, datasets consist of only a
single set of features with no obvious or natural way to
divide them into two separate sets. Hence, the question of
whether co-training can be useful with only a single
natural feature set is of great practical importance.
This paper investigates the performance of co-training
with only one natural feature set in comparison to the use
of two natural feature sets. The main question that we
address is:
how useful is co-training with a single natural
feature set?
4.2 Dataset, preprocessing and classifiers used
We ran experiments on email classification using the
LingSpam corpus [footnote 1]. This dataset consists of 2883 emails,
of which 479 are spam and 2404 are genuine emails. Each
email is broken up into two sections: the text found in the
subject header of the email and the words found in the
main body of the message. After applying a stop list [footnote 2], a
word count of each word type was kept, with a distinction
made between the words that appeared in the subject
header and those that appeared in the body.
Footnote 1: http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=dataset-details.html&Id=963839410Ling-Spam
Footnote 2: http://alt-usage-english.org/excerpts/fxcommon.html
The standard bag-of-words representation was used
and feature selection was performed with Information
Gain [8]. Upon inspection of the word lists, it was decided
that the top 100 words was a suitable cut-off, resulting in
a dimensionality reduction of about 98%. This value is
similar to thresholds used in other experiments, such as
[2]. Each of the email documents was then represented
using the term frequencies of the selected 100 features.
This term weighting was motivated by its successful use
in the domain of Web page classification [5].
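As an illustration of this preprocessing step, the following Python sketch ranks terms by Information Gain and then represents a document by the term frequencies of the selected terms. It is a sketch under assumptions rather than the procedure actually used: Information Gain is computed here on the presence or absence of a term in a document, documents are taken to be lists of tokens, and the function names are hypothetical. In the experiments the cut-off k would be 100.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Return the k terms with the highest Information Gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


def term_frequencies(doc, terms):
    """Represent one document by the counts of the selected terms."""
    counts = Counter(doc)
    return [counts[t] for t in terms]


# Toy corpus (made up for illustration).
docs = [["free", "money", "now"], ["free", "offer"],
        ["meeting", "notes"], ["conference", "notes", "now"]]
labels = ["spam", "spam", "ham", "ham"]
top = select_top_terms(docs, labels, 2)
vector = term_frequencies(["free", "free", "notes", "agenda"], top)
```

To keep subject and body words distinct, as described above, tokens could be prefixed with their origin before selection (for example a hypothetical "subject:" prefix), so that the same word in the two sections yields two separate features.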
Three types of classifiers were tested: Decision Tree
(DT), NB and SVM. In previous work on co-training [2],
NB has often been used as a benchmark. The SVM was
used in text categorization and email classification [2]
with great success. Implementations of these classifiers
were obtained from WEKA [7].
4.3 The feature sets
Below is a summary of the feature sets used in the
experiments.
Body: all words that appear in the body of an email
Subject: all words that appear in the subject of an email
Half1: a random selection of half of the feature set
consisting of the combination of Subject and Body
Half2: the other half of the features, not found in Half1
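A random split of this kind can be produced as in the short Python sketch below. The function name random_split, the fixed seed and the placeholder feature names are assumptions made for illustration; the source does not state how the random assignment was generated.

```python
import random


def random_split(feature_names, seed=0):
    """Randomly divide the features into two disjoint halves."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(feature_names)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Placeholder names standing in for the combined Subject and Body features.
features = [f"word_{i}" for i in range(100)]
half1, half2 = random_split(features)
```

The two halves share no feature and together cover the whole set, so each can serve as one view in co-training.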
The two feature sets Half1 and Half2 are created to test
the hypothesis that it is possible to randomly split a
natural feature set into two smaller feature sets to obtain
useful results in co-training. The Body and Subject feature
sets will hereafter be referred to as the natural feature sets,
while the oth