Content-oriented Information Retrieval Model
Chen Wu (Signal and Information Processing)
Directed by ZHANG Quan
Abstract
In the past few decades, Information Retrieval (IR) technology has been
extensively studied, and several important models have emerged, such as the
vector space model and the statistical language model. Most of these models
rely purely on Statistical Language Processing (SLP) methods, which focus mainly
on term frequency within a document and do not try to analyze or differentiate
the meanings of the terms. They therefore have difficulty realizing the ideal of
an intelligent retrieval system. To address this issue, and building on current
research achievements in both Natural Language Understanding (NLU) and SLP, this
dissertation proposes a new schema intended to speed progress toward the future
ideal IR model by combining NLU methods with SLP methods. In the schema, the
approaches of semantic extraction and semantic expression are embedded into the
approaches of term weighting and similarity measurement. Two IR models based on
the schema are also proposed, differing in the proportion of SLP methods in the
overall model: the “Concept-based Glossary Model (CGM)” and the “Concept-based
Sentence group Model (CSM)”. Both models introduce concepts, namely formalized
expressions of meaning.
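As a rough illustration of how semantic extraction can be embedded into term
weighting and similarity measurement, the following minimal sketch replaces
words with concept labels before computing TF-IDF weights and cosine
similarity. The lexicon `CONCEPTS`, its labels, and all function names are
hypothetical placeholders; they are not HNC concept symbols and not the
weighting scheme used by CGM or CSM.

```python
import math
from collections import Counter

# Hypothetical word-to-concept lexicon. The labels are illustrative
# placeholders, not real HNC concept symbols.
CONCEPTS = {
    "car": "C_VEHICLE",
    "automobile": "C_VEHICLE",
    "buy": "C_ACQUIRE",
    "purchase": "C_ACQUIRE",
    "river": "C_WATERWAY",
}

def to_concepts(tokens):
    """Semantic extraction: map each word to its concept label.
    Words missing from the lexicon are kept as surface forms."""
    return [CONCEPTS.get(t, t) for t in tokens]

def idf_table(docs):
    """Inverse document frequency over the units (words or concepts)."""
    n = len(docs)
    df = Counter(u for d in docs for u in set(d))
    return {u: math.log(1 + n / df[u]) for u in df}

def weight(units, idf):
    """Term weighting: TF-IDF; units unseen in the collection get weight 0."""
    tf = Counter(units)
    return {u: tf[u] * idf.get(u, 0.0) for u in tf}

def cosine(a, b):
    """Similarity measurement: cosine of two sparse weight vectors."""
    dot = sum(w * b.get(u, 0.0) for u, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs, use_concepts):
    """Score every document against the query, with or without concepts."""
    prep = to_concepts if use_concepts else list
    units = [prep(d) for d in docs]
    idf = idf_table(units)
    q = weight(prep(query), idf)
    return [cosine(q, weight(d, idf)) for d in units]

docs = [["purchase", "automobile"], ["river", "bank"]]
query = ["buy", "car"]
word_scores = rank(query, docs, use_concepts=False)
concept_scores = rank(query, docs, use_concepts=True)
```

In this toy setting the word-based ranking finds no overlap between the query
and either document, while the concept-based ranking matches the first
document, because the query and the document share concepts even though they
share no words.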
CGM is an IR model that is not tied to a specific domain. It adopts the “HNC
Extended Sentence Category Analysis System”, which enables the IR model to
extract the concepts of words as well as of sentences. Based on these concepts,
several SLP methods are studied extensively in order to find the exact point at
which the SLP methods and the NLU methods can be joined.
Compared with CGM, CSM raises the level of language understanding. This model
takes the meanings of sentence groups as its processing objects. Each sentence
group in a text is first mapped into a special Content Unit Framework defined by
HNC and then processed as a whole. Extracting the meaning of a sentence group
first requires a method for segmenting Chinese text into sentence groups. This
topic is also addressed.
Based on this research, the main contributions and innovations of this
dissertation are as follows:
1) A new schema is proposed that aims to speed progress toward the future ideal
IR model. Its distinguishing feature is that the SLP methods and the NLU methods
are integrated with each other. Experiments on the schema indicate that IR
models with semantics outperform those without it (by 2% to 8%).
2) Some shortcomings of current statistical IR models were identified while
studying the models in this dissertation: ① in the traditional statistical
language model, the measurement of the likelihood that the query was generated
from the estimated model is too coarse; ② placing confidence in the maximum
likelihood estimate of a term in a document is risky. By taking these two points
into account, the models proposed in the dissertation perform well. On the
CIRB030 test collection, the average precision of the proposed DGMSys exceeds
that of the traditional VSM by 6.8% under the Relax evaluation standard and by
7.5% under the Rigid evaluation standard. Furthermore, DGMSys can support a
retrieval task with more than 50,000 terms and over 4,000,000,000 documents. It
also retrieves quickly: on a test collection of 381375 documents, the average
retrieval time for 42 topics (each topic contains about 10 query keys) is only
800 ms, and the average number of documents returned per topic is about 151,707.
The experiment was run on an ordinary PC. Moreover, the system integrates a
semantic processing stage, a component that other IR systems do not have.
3) A method is proposed for segmenting Chinese text into sentence groups based
on the semantic relationships between sentences. Drawing on the symbolic system
of the Language concept space defined by HNC, formalized rules for detecting
Chinese sentence groups are also presented. Experiments on 1203 sentence groups
and 4186 sentences show that the average precision of the rules is above 82.9%
and the recall is above 73%, which satisfies the requirements of an IR system.
4) The proposed concept-based model makes initial progress on the problem of
data sparseness in IR models. According to the author's statistics, 29139 common
Chinese words carry about 38987 word senses in total. The number of distinct
word senses among them is 29139, which accounts for nearly 74.7% of the total.
We therefore hypothesized that using concepts instead of words could alleviate
data sparseness, and the experiments support this hypothesis. The concept-based
IR system cuts the token dimension roughly in half: on the test collection of
381375 documents, the word-based system has about 251206 tokens in total, while
the concept-based system has about 120821, nearly 1/2 of the former. As a
result, the concept-based system takes less retrieval time than the word-based
one.
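The risk noted in contribution 2), that the maximum likelihood estimate of a
term in a document cannot be fully trusted, can be illustrated with a minimal
query-likelihood sketch. It uses Jelinek-Mercer interpolation with the
collection model, a standard smoothing technique chosen here only for
illustration; the abstract does not spell out the dissertation's own remedy,
and the function name and the parameter `lam` are assumptions.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(query | doc) under a unigram language model.

    lam = 0 uses the unsmoothed maximum likelihood estimate of the document
    model; 0 < lam <= 1 interpolates it with the collection model
    (Jelinek-Mercer smoothing, an illustrative choice).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(t for d in collection for t in d)
    doc_len = len(doc)
    coll_len = sum(coll_tf.values())
    score = 0.0
    for t in query:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = coll_tf[t] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # A single unseen term rules the document out entirely.
            return float("-inf")
        score += math.log(p)
    return score

collection = [
    ["concept", "retrieval", "model"],
    ["language", "model", "retrieval"],
]
query = ["concept", "language"]
unsmoothed = query_log_likelihood(query, collection[0], collection, lam=0.0)
smoothed = query_log_likelihood(query, collection[0], collection, lam=0.5)
```

With `lam=0.0` the first document receives a likelihood of zero because it
never contains "language", even though it matches the other query term. With
smoothing, the same document keeps a finite score.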
In summary, this dissertation proposes a new schema that takes advantage of both
the NLU methods and the SLP methods. Experiments on IR systems built according
to the schema indicate the feasibility and effectiveness of the schema as well
as of the models.
Keywords: Information Retrieval; HNC theory; Statistical Language Processing; Semantics; Language Model