Content-oriented Information Retrieval Model

Chen Wu (Signal and Information Processing)

Directed by ZHANG Quan


Information Retrieval (IR) technology has been studied extensively over the past few decades, and several important models have emerged, such as the vector space model and the statistical language model. Most of these models rely on purely Statistical Language Processing (SLP) methods, which focus mainly on term frequency within a document and do not attempt to analyze or differentiate the meanings of terms; as a result, they fall short of the ideal of an intelligent retrieval system. To address this issue, and building on current research achievements in both Natural Language Understanding (NLU) and SLP, this dissertation proposes a new schema intended to speed progress toward the future ideal IR model by combining NLU methods with SLP methods. In the schema, semantic extraction and semantic expression are embedded into term weighting and similarity measurement. Two IR models based on the schema, which differ in the proportion of SLP methods they use, are also proposed: the “Concept-based Glossary Model (CGM)” and the “Concept-based Sentence group Model (CSM)”. Both models introduce concepts, namely formalized expressions of meaning.
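The statistical language model mentioned above can be sketched as a query-likelihood scorer driven only by term counts. The following is a minimal illustration, not the dissertation's implementation; the Jelinek-Mercer smoothing, the `lam` parameter, and the toy documents are assumptions chosen for the example.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | doc) under a unigram model with Jelinek-Mercer smoothing.

    lam is the weight given to the collection model (an assumed parameter);
    lam=0.0 falls back to the unsmoothed maximum likelihood estimate.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)            # maximum likelihood estimate
        p_coll = coll_tf[term] / len(collection)   # background estimate
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")                   # one unseen term rules the document out
        score += math.log(p)
    return score

# Toy documents, made up for illustration.
docs = [
    ["concept", "based", "retrieval", "model"],
    ["statistical", "language", "model", "retrieval"],
]
collection = [t for d in docs for t in d]
query = ["concept", "language"]

unsmoothed = [query_log_likelihood(query, d, collection, lam=0.0) for d in docs]
smoothed = [query_log_likelihood(query, d, collection, lam=0.5) for d in docs]
print(unsmoothed, smoothed)
```

Without smoothing, a single query term missing from a document drives its score to negative infinity. In both cases terms are matched purely as strings, with no notion of meaning, which is the limitation the proposed schema targets.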

       CGM is a general-purpose IR model that is not tied to a specific domain. The model adopts the “HNC Extended Sentence Category Analysis System”, which helps the IR model extract the concept of a word as well as that of a sentence. Based on these concepts, several SLP methods are studied in depth in order to find the exact point at which SLP methods and NLU methods can be joined.

       Compared with the CGM, the CSM raises the level of language understanding. This model takes the meaning of sentence groups as its processing object. Each sentence group in a text is first mapped into a special Content Unite Framework defined by HNC and is then processed as a whole. To extract the meaning of a sentence group, a method for delimiting Chinese sentence groups is needed first. This topic is also addressed.
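A sentence-group delimiting method is usually evaluated by comparing predicted group boundaries with annotated ones. The sketch below shows one common way to compute precision and recall for such boundaries; the representation of a boundary as a sentence index and the sample data are assumptions for illustration, and the dissertation's own evaluation procedure may differ.

```python
def boundary_precision_recall(predicted, gold):
    """Precision and recall of predicted sentence-group boundaries.

    A boundary is represented (by assumption) as the index of the
    sentence that ends a sentence group.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical annotations for a short text.
gold_boundaries = [3, 7, 12, 15]
predicted_boundaries = [3, 7, 10]

precision, recall = boundary_precision_recall(predicted_boundaries, gold_boundaries)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three predicted boundaries are correct and two of the four annotated boundaries are found.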

Based on this research, the main contributions and innovations of this dissertation are as follows:

       1) A new schema is proposed that aims to speed research progress toward the future ideal IR model. Its distinguishing feature is that SLP methods and NLU methods are integrated with each other. Experiments on the schema indicate that IR models with semantics outperform those without it, with improvements of 2% to 8%.

       2) Several shortcomings of current statistical IR models were identified while studying the models of this dissertation. In the traditional statistical language model, the measure of the likelihood that the query was generated from the estimated model is too coarse, and placing full confidence in the maximum likelihood estimate of a term in a document is risky. By taking these two points into account, the models proposed in the dissertation perform well. On the CIRB030 test collection, the average precision of the proposed DGMSys exceeds that of the traditional VSM model by 6.8% under the Relax evaluation standard and by 7.5% under the Rigid evaluation standard. Furthermore, DGMSys can support a retrieval task with more than 50,000 terms and more than 4,000,000,000 documents. It also offers high retrieval speed: on a test collection containing 381375 documents, the average retrieval time over 42 topics (each topic contains about 10 query keys) is only 800 ms, and the average number of documents returned per topic is about 151,707. The experiment was run on an ordinary PC. Moreover, the system integrates a semantic processing stage, a component that other IR systems do not have.

       3) A method is proposed for delimiting Chinese sentence groups based on the semantic relationships between sentences. Taking advantage of the symbolic system of the language concept space defined by HNC, formalized rules for detecting Chinese sentence groups are also presented. Experiments on 1203 sentence groups and 4186 sentences show that the average precision of the rules is above 82.9% and the recall is above 73%, which satisfies the requirements of an IR system.

       4) The proposed concept-based model provides a preliminary solution to the data sparseness problem in IR models. According to the author’s statistics, 29139 common Chinese words have about 38987 word senses in total. The number of distinct word senses among them is 29139, which is about 74.7% of that total. We therefore hypothesized that using concepts instead of words can alleviate data sparseness, and the experiments support this hypothesis. The concept-based IR system cuts the token dimension roughly in half. On the test collection containing 381375 documents, the total number of tokens in the concept-based IR system is about 251206, while the counterpart in the word-based system is about 120821, nearly 1/2 of the former. As a result, the concept-based system requires less retrieval time than the word-based one.
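The dimension-reduction idea in the last item can be illustrated with a small sketch in which several surface words map to one concept token. The mapping table and the concept labels such as `C_VEHICLE` are hypothetical placeholders and are not real HNC concept symbols; the actual system derives concepts through HNC analysis rather than a lookup table.

```python
# Hypothetical word-to-concept table; labels are made up for illustration.
word_to_concept = {
    "car": "C_VEHICLE",
    "automobile": "C_VEHICLE",
    "buy": "C_ACQUIRE",
    "purchase": "C_ACQUIRE",
    "price": "C_COST",
}

def to_concepts(tokens, mapping):
    """Replace each word with its concept token; unknown words are kept as-is."""
    return [mapping.get(t, t) for t in tokens]

docs = [
    ["buy", "car", "price"],
    ["purchase", "automobile", "price"],
]

word_vocab = {t for d in docs for t in d}
concept_vocab = {c for d in docs for c in to_concepts(d, word_to_concept)}

print(len(word_vocab), len(concept_vocab))
```

In this toy example the two documents share only one word but map to the same three concepts, so the index shrinks and synonymous documents become comparable. As a check on the figures quoted in the text, 29139 / 38987 ≈ 0.747 and 120821 / 251206 ≈ 0.481.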

       In summary, this dissertation proposes a new schema that takes advantage of both NLU methods and SLP methods. Experiments on the IR systems built on the schema demonstrate the feasibility and effectiveness of the schema and of the models.


Keywords: Information Retrieval; HNC theory; Statistical Language Processing; Semantics; Language Model