The Study on Text Categorization 

Based on Features of Conceptual Language Space

ZHANG Yun-liang (Signal and Information Processing)

Directed by ZHANG Quan


Abstract


Categorization is a basic and important task in text processing. Categorizing texts by subject benefits the processing of electronic books and periodicals. Categorizing texts by authorship can be used in forgery detection, authorship attribution for lost books, and judicial appraisal. Text categorization is also helpful in information retrieval and other applications. Conceptual language space is common to all human languages and is the basis of human communication. The features of conceptual language space go beneath the surface of language phenomena and reveal the mapping network of concepts.

This dissertation aims to improve text categorization performance through theoretical and experimental study of conceptual language space features and improved categorization algorithms.

This dissertation combines theoretical exploration with experiment and mainly covers the following aspects: an analysis of the characteristics of features in conceptual language space, mainly sentence category space and concept space; the effect of different features under the KNN algorithm and the causes of those effects; different text categorization algorithms in different application settings; and the theory, experiments, and empirical parameters of the different improved algorithms.

The main results of this dissertation are as follows:

(1) Proposed a text categorization strategy that combines conceptual language space features with the vector space model and yields good categorization performance. The MAFM (maximum micro-average F-measure) is 0.812 for subject categorization and 0.9 for authorship categorization.

(2) Proposed and implemented a strategy for transforming compound sentence categories into primitive sentence categories, which effectively reduces the dimension of the sentence category vector space with limited loss of compound sentence category information.

(3) Based on the non-uniform distribution of features in text, proposed and implemented a resolution judgment algorithm, which improves categorization performance to some extent.

(4) Proposed and implemented a multi-feature integration judgment algorithm with 3 schemes, which improve categorization performance to varying degrees. Proposed a strategy for choosing features for integration and gave a ranked list of 13 conceptual language space features for subject and authorship categorization.

(5) Proposed and implemented a flexible KNN algorithm, which improves text categorization performance. The constraints on applying this algorithm are also given.
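
The MAFM figure reported in result (1) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes single-label categorization, the balanced F-measure (F1), and that "maximum" means the best score over a set of experimental settings such as different values of k in KNN. The function names `micro_f1` and `max_micro_f1` are hypothetical and are not taken from the dissertation.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1: pool true/false positives and false negatives
    over all categories before computing precision and recall."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p:
            tp += 1      # correct label: true positive for category g
        else:
            fp += 1      # false positive for the predicted category p
            fn += 1      # false negative for the gold category g
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def max_micro_f1(gold, predictions_by_setting):
    """Return (best_setting, best_score) over several runs.
    Assumption: each key is one experimental setting, e.g. a value of k."""
    scores = {s: micro_f1(gold, preds)
              for s, preds in predictions_by_setting.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Under the single-label assumption, every misclassified text counts once as a false positive and once as a false negative, so micro-averaged precision, recall, and F1 all coincide with accuracy. With multi-label categorization the three values would differ and the counts would have to be kept per category.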

The use of conceptual language space features together with the algorithm improvements achieves good results in text categorization. Performance can be improved further by strengthening the analysis capability in conceptual language space and by developing better algorithms.

  

 

Keywords: Conceptual language space; Hierarchical Network of Concepts (HNC) theory; text categorization; subject; authorship; effect