The Design and Implementation of HNC Corpus

Lu Tan (Signal and Information Processing)

Directed by Quan Zhang


 ABSTRACT

 

    A corpus is an important resource for linguistic study and Natural Language Processing (NLP). With the rapid development of computer technology, the storage capacity of computers and their ability to process language resources are becoming more and more powerful, and corpora are playing an increasingly important role in linguistic study, NLP, and related fields. HNC theory is a new theory of NLP, and it requires the corresponding development of a corpus based on HNC. Corpus research is a very active field that has developed rapidly and produced many achievements. However, the existing corpora cannot be used directly by HNC, because they were developed on the basis of part-of-speech (POS) tagging for syntactic analysis.

    It is therefore both necessary and urgent to build our own HNC corpus. The main contributions of this dissertation are as follows:

(1)   We have constructed a unified HNC corpus framework with mature functionality, comprising the HNC corpus itself and the corpus application platform.

(2)    We have built an HNC Chinese raw corpus and a Chinese HNC-tagged corpus.

(3)    We have designed and implemented the HNC corpus application software platform, which supports corpus tagging, management, searching, and statistics. The platform consists of two subsystems: the HNC tagging and management subsystem, and the HNC searching and statistics subsystem.

(4)    We have designed and implemented the HNC corpus tagging tool. The tool not only makes corpus tagging easier through a convenient toolbar, but also supports error checking and instant tagging help.

(5)    We have designed and implemented the HNC Instant Assistant. As a corpus assistant tool, it offers a friendly interface, powerful functions, and good usability. The user can obtain the required HNC information in two ways: by searching with keyboard input, or by capturing words from the screen.

(6)    We have used many advanced computer technologies in developing the HNC corpus software, including interface programming, instant tagging help, and regular expressions.

Key words:  Hierarchical Networks of Concepts (HNC); Corpus; Corpus Building; Corpus Annotation; Corpus Searching and Statistics