The Study on Text Categorization
Based on Features of Conceptual Language Space
ZHANG Yun-liang (Signal and Information Processing)
Directed by ZHANG Quan
Abstract
Categorization is an important and basic task in text processing. Text
categorization by subject benefits the processing of electronic books and
periodicals. Text categorization by authorship can be used in forgery
detection, in identifying the authors of lost books, and in forensic
examination. Text categorization is also helpful in information retrieval and
other applications. Conceptual language space is common to all human languages
and is the basis of human communication. The features of conceptual language
space go beneath the surface of language phenomena and reveal the mapping
network of concepts.
This dissertation aims to improve the performance of text categorization
through theoretical and experimental study of conceptual language space
features and improved categorization algorithms.
This dissertation combines theoretical exploration with experiments and mainly
covers the following aspects: analysis of the characteristics of features in
conceptual language space, mainly the sentence category space and the concept
space; the effect of different features under the KNN algorithm and the reasons
for it; different text categorization algorithms for different application
backgrounds; and the theory, experiments and empirical parameters of the
different improved algorithms.
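As a minimal sketch of the KNN setting mentioned above, the following Python code classifies a text represented as a sparse feature vector by cosine similarity to labeled training vectors. The feature names (`f1`, `f2`, ...), the similarity-weighted voting rule and the function names are illustrative assumptions; the abstract does not specify the exact variant used in the dissertation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, training, k=3):
    """Return the label with the largest summed similarity among the k
    training vectors most similar to `query`.

    `training` is a non-empty list of (vector, label) pairs.
    The weighting scheme is an assumption made for illustration.
    """
    scored = [(cosine(query, vec), label) for vec, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return max(votes, key=votes.get)


# Hypothetical feature vectors; real ones would come from
# conceptual language space features of each text.
training = [
    ({"f1": 1.0, "f2": 1.0}, "science"),
    ({"f1": 1.0}, "science"),
    ({"f3": 1.0}, "art"),
    ({"f3": 1.0, "f4": 1.0}, "art"),
]
predicted = knn_classify({"f1": 2.0, "f2": 1.0}, training, k=3)
```

Storing vectors as dictionaries keeps the sketch independent of any particular feature set, so the same code applies whether the dimensions are words, sentence categories or concepts.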
The main results of this dissertation are as follows:
(1) Propose a text categorization processing strategy that combines conceptual
language space features with the vector space model, which achieves good
categorization performance. The MAFM (maximum micro-average F-measure) is 0.812
for subject categorization and 0.9 for authorship categorization.
(2) Propose and implement a strategy for transforming compound sentence
categories into primitive sentence categories, which effectively reduces the
dimension of the sentence category vector space with limited loss of compound
sentence category information.
(3) Based on the non-uniform distribution of features in a text, propose and
implement a resolution judgment algorithm, which improves categorization
performance to some extent.
(4) Propose and implement a multi-feature integration judgment algorithm with
three schemes, which improve categorization performance to varying degrees.
Propose a strategy for choosing features for integration and give an ordered
list of the 13 conceptual language space features for subject and for
authorship categorization.
(5) Propose and implement a flexible KNN algorithm, which improves text
categorization performance. The constraints on applying this algorithm are
also given.
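The MAFM values reported in result (1) are built on the micro-average F-measure. The sketch below shows how that measure is commonly computed for single-label categorization by pooling true positives, false positives and false negatives over all categories. The function name and the sample labels are hypothetical, and the abstract does not say over which settings the "maximum" in MAFM is taken, so only the underlying measure is shown.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged F-measure: pool the counts over all categories,
    then compute precision, recall and their harmonic mean."""
    labels = set(gold) | set(predicted)
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1  # correctly assigned to this category
            elif p == label:
                fp += 1  # assigned to this category by mistake
            elif g == label:
                fn += 1  # belongs to this category but was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five texts, one of them misclassified.
gold = ["a", "a", "b", "b", "c"]
predicted = ["a", "b", "b", "b", "c"]
score = micro_f_measure(gold, predicted)
```

When every text receives exactly one category, each error counts once as a false positive and once as a false negative, so the micro-averaged precision, recall and F-measure all coincide with plain accuracy.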
Using conceptual language space features together with improved algorithms
yields good text categorization results. Performance can be improved further
by strengthening analysis capability in conceptual language space and by
developing better algorithms.
Keywords: Conceptual language space; Hierarchical Network of Concepts (HNC) theory; text categorization; subject; authorship; effect