The Design and Implementation of an HNC Corpus Software System

XIE Fa-kui (Signal and Information Processing)

Directed by Quan Zhang



A corpus is a collection of linguistic materials in electronic form, and an important tool for linguistic studies, natural language processing (NLP), and related fields. HNC theory, as a new NLP theory, needs a corresponding corpus. Our goal is to design and implement a corpus software system that embodies the characteristics of HNC and supports HNC research.

The main contributions of this dissertation are as follows:

(1)   An integrated HNC corpus system is established, containing both raw and tagged corpora. It provides functions for corpus management, processing, tagging, searching, and statistical processing. The system is designed with a three-layer structure: the application, interface, and implementation layers. The interface layer consists of a set of universal interfaces for corpus access; it effectively isolates the top and bottom layers and thereby simplifies the development process.

(2)   A multi-user corpus management platform is constructed. All user corpora and the public corpus are managed on the server. The platform adopts a client/server (C/S) model, which allows many users to access the server simultaneously.

(3)   Some functions of the corpus system are improved. ① For tagging, a novel XML-based tagging mode is implemented, which greatly simplifies the tagging process. The information of the linguistic space and the linguistic concept space is represented in XML. ② For searching, we implement full-text search based on Lucene, as well as HNC feature search, which includes basic search, advanced search, and XQuery search. ③ For statistical processing, in addition to conventional statistics, we design and implement four basic modes of HNC feature statistics: amount, ratio, attribute distribution, and user-defined distribution. Users can freely define the content to be counted.

(4)   Some computer-aided tagging models are explored. Specifically, a maximum entropy model is adopted for semantic chunk segmentation, and an example-based model is adopted for sentence category parsing.

(5)   A sentence category reorganization corpus is constructed. Building on the basic corpus, the tagged corpus is reorganized by sentence category. Some basic functions, such as reporting tagging mistakes and marking parsing difficulties, are also provided.
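To make the four statistical modes named in contribution (3) concrete, the following is a minimal sketch in Python, not the system's actual implementation. It assumes a hypothetical XML layout in which each tagged sentence is a `sentence` element with a `category` attribute and contains `chunk` elements with a `type` attribute. The element names, attribute names, category labels, and function names are all illustrative assumptions and are not taken from the dissertation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical tagged corpus; the schema and labels are illustrative only.
SAMPLE = """
<corpus>
  <sentence id="s1" category="cat1">
    <chunk type="main">the system</chunk>
    <chunk type="main">stores</chunk>
    <chunk type="main">tagged text</chunk>
  </sentence>
  <sentence id="s2" category="cat2">
    <chunk type="main">users</chunk>
    <chunk type="main">search</chunk>
    <chunk type="auxiliary">on the server</chunk>
  </sentence>
  <sentence id="s3" category="cat1">
    <chunk type="main">it</chunk>
    <chunk type="main">runs</chunk>
  </sentence>
</corpus>
"""

ROOT = ET.fromstring(SAMPLE.strip())


def amount(root, tag):
    """Amount mode: number of elements with the given tag."""
    return sum(1 for _ in root.iter(tag))


def ratio(root, tag, attr, value):
    """Ratio mode: share of `tag` elements whose `attr` equals `value`."""
    total = amount(root, tag)
    if total == 0:
        return 0.0
    hits = sum(1 for e in root.iter(tag) if e.get(attr) == value)
    return hits / total


def attribute_distribution(root, tag, attr):
    """Attribute distribution mode: counts per value of `attr`."""
    return Counter(e.get(attr) for e in root.iter(tag))


def user_defined_distribution(root, tag, key_fn):
    """User-defined mode: counts per key computed by a caller-supplied function."""
    return Counter(key_fn(e) for e in root.iter(tag))


if __name__ == "__main__":
    print(amount(ROOT, "sentence"))
    print(ratio(ROOT, "sentence", "category", "cat1"))
    print(attribute_distribution(ROOT, "chunk", "type"))
    # Example of a user-defined statistic: chunks per sentence.
    print(user_defined_distribution(
        ROOT, "sentence", lambda s: len(s.findall("chunk"))))
```

In this sketch the user-defined mode simply accepts a function from an element to a key, which is one plausible way to let users define freely what is counted.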

Key words: HNC; corpus; linguistic space; linguistic concept space; tagging; searching; statistical processing; XML; XQuery; maximum entropy model