The Deployment of an e-Commerce Platform and Related Projects in a ...
ze=-1 color=black>
Below is a cache of http://www.ijcir.org/volume1-number1/article3.pdf. It's a snapshot of the page taken as our search engine crawled the Web.
The web site itself may have changed. You can check the current page or check for previous versions at the Internet Archive.
Yahoo! is not affiliated with the authors of this page or responsible for its content.
The Deployment of an e-Commerce Platform and Related Projects in a Rural Area in South Africa
18
A Methodology for Feature Selection in Named Entity Recognition
Fredrick Edward Kitoogo
,
Department of Computer Science,
Faculty of Computing and IT,
Makerere University,
Venansius Baryamureeba,
Department of Computer Science,
Faculty of Computing and IT,
MakerereUniversity.
In this paper a method for feature selection in named entity recognition is proposed. Unlike
traditional named entity recognition approaches which mainly consider accuracy improvement as
the sole objective, the innovation here is manifested in the use of a multi-objective genetic
algorithm which is employed for feature selection basing on various aspects including error rate
reduction and time taken for evaluation, and also demonstrating the use of Pareto optimization. The
proposed method is evaluated in the context of named entity recognition, using three different data sets
and a K-nearest Neighbour machine learning algorithm. Comprehensive experiments demonstrate the
feasibility of the methodology.
Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design MethodologyFeature
evaluation and selection; I.2.7 [Artificial Intelligence]: Natural Language processinglanguage
parsing and understanding
General Terms: Computer Science, Language Processing
Additional Key Words and Phrases: feature selection, multi-objective genetic algorithm, named entity recognition.
IJCIR Reference Format:
Kitoogo, F.E. and Baryamureeba, V. 2007.A Methodology for Feature Selection in Named Entity
Recognition. International Journal of Computing and ICT Research, Vol.1, No. 1, pp. 18-26.
http://www.ijcir.org/volume1-number1/article3.pdf
1. INTRODUCTION
The Machine Learning approaches to the named entity recognition (NER) problem follow three major
steps namely; (i) feature engineering, where identification of lexical and phrasal characteristics in text
which expresses references to named entities (NEs) is done, (ii) algorithm selection, when the decision of
Authors Address: Fredrick Edward Kitoogo
, Department of Computer Science, Faculty of Computing and IT,
Makerere University, P.O. Box 7062, Kampala, Uganda, fkitoogo@cit.mak.ac.ug, www.cit.ac.ug
Venansius Baryamureeba, Department of Computer Science, Faculty of Computing and IT, Makerere University, P.O. Box
7062, Kampala, Uganda, barya@cit.mak.ac.ug, www.cit.ac.ug
"Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than IJCIR must be honored. Abstracting with credit is permitted. To
copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee."
© International Journal of Computing and ICT Research 2006.
International Journal of Computing and ICT Research, ISSN 1818-1139, Vol.1, No.1, pp. 18 - 26
,
June 2007
International Journal of Computing and ICT Research, Vol. 1, No. 1, June 2007
19
which machine learning algorithm/algorithms to use for learning is made and (iii) classification, when
the actual learning of the feature list to detect and classify the named entity phrases is done.
NEs are theoretically identified and classified by using features (various abstract entities that combine to
specify underlying phonological, morphological, semantic, and syntactic properties of linguistic forms and
that act as the targets of linguistic rules and operations). Two kinds of features that have been defined by
McDonald [1996] are internal and external features; internal features are the ones provided from within
the sequence of words that constitute the entity, in contrast, external features are those that can be obtained
by the context in which entities appear.
The choice of the best discriminative features to represent NEs affects many aspects of the NER
problem such as accuracy, learning time, and the optimal training data set size. In many NER applications, it
is not unusual to find problems involving hundreds of features. However, it has been observed that beyond a
certain point, the inclusion of additional features leads to a worse rather than better performance [Oliveira et
al. 2003].
Feature engineering refers to the task of identifying and selecting an effective subset of features
to represent entities from a larger set of often mutually redundant or even irrelevant features. It
encompasses feature design, feature selection, feature induction, and feature impact optimization [Rinnger
2005]. Feature selection; a sub-task of feature engineering is not a trivial problem since there may be (i)
redundancy, where certain features are correlated so that it is not necessary to include all of them in
modeling and (ii) interdependence, where two or more features between them convey important
information that is obscure if any of them is included on its own.
Many real-world problems like feature selection for named entity recognition involve the optimization
of multiple objectives, such as number of features and accuracy. The tendency is that the different
objectives to be optimized represent conflicting goals (such as improving the quality of a product and
reducing its cost),in multi-objective optimization the optimization of each objective corresponds to an
optimal solution. Therefore, in multi-objective optimization one usually wants
to discover several optimal solutions, taking all objectives into account, without assigning greater
priority to one objective or the other. Most named entity recognition systems as demonstrated in Tjong
Kim Sang and De Meulder [2003] tend to consider only one objective of improving accuracy.
There is need for systems which put into consideration other objectives like the cost of the
solution on top of improvement of accuracy. The systems should be further able to provide users with
different sets of optimal solutions thus giving the end-user the option of being able to choose the solution
representing the best trade-off between conflicting objectives a posteriori, after examining a set of high-
quality solutions returned by the named entity recognition system. Intuitively, this is better than forcing
the user to choose a trade-off between conflicting goals a priori.
This paper proposes the use of a multi-objective genetic algorithm (MOGA) as a means to search for
subsets of features (feature selection). The MOGA will generate a feature set of alternative solutions (from
a fixed entire feature population) and
use a cross-validation method to indicate the best accuracy/complexity (number of
Features)/cost of using the feature sub-set (in this case only time for classification was used as a cost) trade-
off. The classification accuracy will be supplied by a machine learning algorithm.
The remainder of the paper is structured as follows: in Section 2, previous approaches are
reviewed, in Section 3, we outline the proposed method and specify the feature population from which
feature selection will be done. We describe the search (optimization) procedure of the method, in Section 4
the experiments and results are presented. Finally, Section 5 closes with a conclusion and an outlook for
future work.
2. PREVIOUS APPROACHES
Many researchers have tackled feature selection in various ways and likewise the performance of the
different approaches varies substantially. The more closely related approaches are presented in this section.
2.1 Complete Search
Some researchers manually designed features and used all of them without selecting optimal subsets [Carreras
et al. 2003; Zhou et al. 2004; Shen et al. 2004]
International Journal of Computing and ICT Research, Vol. 1, No. 1, June 2007
20
, while others leave the task of ignoring the useless features to the learning algorithm [Mayfield et al.
2003]. The problem with these approaches, is that they are computationally not feasible in practice.
2.2 Randomized Search
Randomized algorithms make use of randomized or probabilisti