The Design and Implementation for HNC Corpus Software System
XIE Fa-kui (Signal and Information Processing)
Directed by Quan Zhang
A corpus is a collection of linguistic materials in electronic form, and an
important tool for linguistic studies, NLP, and related fields. The HNC
theory, as a new NLP theory, needs a corresponding corpus. Our goal is to design
and implement a corpus software system that embodies the characteristics of HNC and supports HNC research.
The main contributions of this dissertation are as follows:
(1) An integrated HNC corpus system is established, which contains raw and tagged
corpora. It provides functions such as management, processing, tagging, searching, and
statistical processing. The corpus system is designed
with a three-layer structure: the application, interface, and implementation
layers. The interface layer consists of a set of universal interfaces for corpus
access, so it effectively isolates the top and bottom layers and
simplifies the development process.
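The three-layer separation described in (1) can be illustrated with a minimal Python sketch. The class and function names (`CorpusStore`, `InMemoryStore`, `count_hits`) are hypothetical and are not taken from the dissertation; the sketch only shows how an interface layer lets application code stay independent of the storage backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class CorpusStore(ABC):
    """Interface layer: the only API the application layer may call."""

    @abstractmethod
    def add(self, doc_id: str, text: str) -> None:
        """Store a document under the given identifier."""

    @abstractmethod
    def get(self, doc_id: str) -> str:
        """Return the text of one document."""

    @abstractmethod
    def search(self, keyword: str) -> List[str]:
        """Return the identifiers of documents containing the keyword."""


class InMemoryStore(CorpusStore):
    """Implementation layer: one possible backend (a plain dict)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def get(self, doc_id: str) -> str:
        return self._docs[doc_id]

    def search(self, keyword: str) -> List[str]:
        return sorted(d for d, t in self._docs.items() if keyword in t)


def count_hits(store: CorpusStore, keyword: str) -> int:
    """Application layer: depends only on the interface, not the backend."""
    return len(store.search(keyword))
```

Replacing `InMemoryStore` with a file-based or server-based class would leave `count_hits` unchanged, which is the isolation the interface layer is meant to provide.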
(2) A multi-user corpus management platform is constructed. All users’ corpora and
the public corpus are managed on the server. The platform adopts the client/server (C/S) model, which
allows many users to access the server simultaneously.
(3) The functions of the corpus system are improved. ① For tagging, a
novel XML-based tagging mode is proposed,
which greatly simplifies the tagging process. The information of the
linguistic space and the linguistic concept space is transformed into XML. ②
For searching, we have implemented full-text search based on Lucene, and HNC feature
search, including basic search, advanced search, and XQuery search. ③ For
statistical processing, apart from conventional statistical processing, we
have designed and implemented
four basic modes of HNC feature statistics: amount, ratio, attribute distribution, and
user-defined distribution. Users can freely define the content for statistical processing.
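As a rough illustration of XML-based tagging, feature search, and the four statistical modes, the following Python sketch uses the standard library only. Everything specific in it is an assumption: the element and attribute names (`sentence`, `chunk`, `category`, `type`) and the labels `S1`, `S2`, `T1`, `T2` are placeholders rather than real HNC codes, the limited XPath support in `xml.etree.ElementTree` stands in for the XQuery search of the actual system, and the four functions are one plausible reading of "amount", "ratio", "attribute distribution", and "user-defined distribution".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical tagged corpus; the schema and labels are placeholders.
TAGGED = """
<corpus>
  <sentence id="1" category="S1">
    <chunk type="T1">the platform</chunk>
    <chunk type="T2">manages</chunk>
    <chunk type="T1">the corpus</chunk>
  </sentence>
  <sentence id="2" category="S2">
    <chunk type="T1">users</chunk>
    <chunk type="T2">search</chunk>
  </sentence>
  <sentence id="3" category="S1">
    <chunk type="T1">the server</chunk>
    <chunk type="T2">stores</chunk>
    <chunk type="T1">texts</chunk>
  </sentence>
</corpus>
"""

root = ET.fromstring(TAGGED.strip())


def find_sentences(root, category):
    """Feature search: ids of sentences with the given category attribute."""
    path = f"./sentence[@category='{category}']"
    return [s.get("id") for s in root.findall(path)]


def amount(root, chunk_type):
    """Amount: number of chunks of one type."""
    return len(root.findall(f".//chunk[@type='{chunk_type}']"))


def ratio(root, chunk_type):
    """Ratio: share of one chunk type among all chunks."""
    total = len(root.findall(".//chunk"))
    return amount(root, chunk_type) / total if total else 0.0


def attribute_distribution(root, tag, attribute):
    """Attribute distribution: counts of each value of one attribute."""
    return Counter(e.get(attribute) for e in root.iter(tag))


def user_distribution(root, tag, key_fn):
    """User-defined distribution: counts over any user-supplied key."""
    return Counter(key_fn(e) for e in root.iter(tag))


# Example of a user-defined statistic: number of chunks per sentence.
chunks_per_sentence = user_distribution(
    root, "sentence", lambda s: len(s.findall("chunk"))
)
```

Because the tagged data is plain XML, the same queries could also be expressed in XQuery against an XML database; the sketch only mirrors the idea on a small scale.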
(4) Computer-aided tagging models are explored. Specifically, a maximum entropy model
is adopted to deal with the problem of semantic chunk segmentation, and an
example-based model is adopted to deal with the problem of sentence category
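To make the maximum entropy idea concrete, the sketch below treats chunk segmentation as a binary decision at each gap between adjacent tokens: boundary or no boundary. With two classes, a maximum entropy model reduces to logistic regression, which is what the code trains by stochastic gradient ascent on the log-likelihood. The feature templates, the tag set (`D`, `A`, `N`, `V`), and the toy training data are all assumptions made for illustration; the dissertation's actual features and training procedure are not described in this abstract.

```python
import math
from collections import defaultdict


def gap_features(tags, i):
    """Indicator features for the gap between token i-1 and token i."""
    return ["bias", "prev=" + tags[i - 1], "next=" + tags[i]]


def prob_boundary(weights, feats):
    """P(boundary | features) under a two-class maximum entropy model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def make_examples(tagged_sentences):
    """Turn (tags, boundary gap indices) pairs into (features, label) pairs."""
    examples = []
    for tags, boundaries in tagged_sentences:
        for i in range(1, len(tags)):
            examples.append((gap_features(tags, i), 1 if i in boundaries else 0))
    return examples


def train(examples, epochs=200, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - prob_boundary(weights, feats)
            for f in feats:
                weights[f] += lr * error  # each active feature has value 1
    return weights


def segment(weights, words, tags, threshold=0.5):
    """Split words into chunks wherever a boundary is predicted."""
    chunks, current = [], [words[0]]
    for i in range(1, len(words)):
        if prob_boundary(weights, gap_features(tags, i)) > threshold:
            chunks.append(current)
            current = []
        current.append(words[i])
    chunks.append(current)
    return chunks


# Toy training data (hypothetical): tag sequences with boundary gap indices.
TRAIN = [
    (["D", "N", "V", "D", "N"], {2, 3}),
    (["A", "N", "V", "N"], {2, 3}),
]
weights = train(make_examples(TRAIN))
chunks = segment(weights, ["the", "user", "tags", "sentences"], ["D", "N", "V", "N"])
```

In a computer-aided setting, such predictions would only be suggestions that a human tagger confirms or corrects.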
(5) A corpus reorganized by sentence category is constructed. Building on the basic corpus, the tagged corpus is reorganized by sentence category. Basic functions, such as reporting tagging mistakes and marking parsing difficulties, are also provided.
Keywords: HNC; Corpus; Linguistic Space; Linguistic Concept Space; Tagging; Searching; Statistical Processing; XML; XQuery; Maximum Entropy Model