The Research on Text Categorization 

Based on Language Concept Space

Ding Zeya (Signal and Information Processing)

Directed by ZHANG Quan


Abstract


Text categorization is an important information processing technology in the era of information explosion and is widely used in applications such as information retrieval and information filtering. As information processing develops, it demands better text classification performance. Because many text classification models cannot understand text content, however, their performance gains are limited. This thesis therefore introduces semantic knowledge from the HNC language concept space into the text classification model, using concept information, concept association knowledge, and sentence category information to understand category texts and improve classification performance. The main difficulty of this research is combining semantic knowledge with the classification model effectively. The main contributions and innovations of this thesis are as follows:

1.      A concept-based dimensionality reduction method. This method obtains the core concepts of each category through concept discrimination and uses these category concepts to reduce term dimensionality. It then computes the degree of association between a text and each category and classifies texts based on the category concepts. Experimental results show that this method reduces term dimensionality effectively while maintaining text categorization performance. When the number of terms is small, it outperforms SVM, KNN, and Bayes.

2.      A text categorization method based on concept association rules (NR). Beyond words and concepts, this method further explores the characteristics of category concept association and presents an algorithm for mining concept association rules from text concept trees. It then uses a concept association rule tree to classify texts. Experiments on a popular category corpus and a particular topic corpus show that NR achieves F1 values of 0.9123 and 0.9602, respectively, considerably better than the other methods compared.

3.      A categorization method based on semantic chunk association (SSR). This method introduces semantic chunk association knowledge into the classifier and aggregates the sentences of each text according to semantic chunk association. It then aggregates the semantic chunk associations of the category texts and classifies a text by computing the degree of semantic association between the text and each category. Limited by sentence category information, SSR outperforms only Bayes and is inferior to SVM and NR.

4.      An unsupervised word sense disambiguation method based on context concepts, designed for concept disambiguation in text categorization. It uses correlation measures of words and concepts between word senses and their context to achieve unsupervised word sense disambiguation, reaching a precision of 85.61%.

5.      A near-replica detection algorithm for web pages based on edit distance. To find the large numbers of near-replicas in a web corpus, this algorithm computes the similarity between web pages by edit distance, comparing both the text content and the structure of the pages. Experiments show that precision and recall reach 98.39% and 89.71%, respectively, which indicates that the algorithm is effective.
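
As an illustration of the edit-distance comparison described in item 5, the following is a minimal sketch, not the thesis's actual algorithm. It assumes that page text is compared as a string, that page structure is represented as a list of tag names, that similarity is the edit distance normalized by the longer length, and that both similarities must reach a threshold. The function names, the normalization, and the 0.9 threshold are all hypothetical choices made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance.

    The normalization is an assumption made for this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def is_near_replica(text_a, tags_a, text_b, tags_b, threshold=0.9):
    """Hypothetical decision rule: content AND structure must both be similar."""
    return (similarity(text_a, text_b) >= threshold
            and similarity(tags_a, tags_b) >= threshold)
```

For example, `is_near_replica("hello world", ["html", "body", "p"], "hello world!", ["html", "body", "p"])` compares the two text strings and the two tag sequences separately. Applying the same distance function to both strings and tag lists is what lets one routine cover the text content and the page structure.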

 

Keywords: Text categorization; HNC Theory; Language concept space; Concept; Concept association rule; Semantic chunk association; Word sense disambiguation; Near-replicas detection