Research on Key Techniques of Event Text Analysis

Using Language Concept Knowledge 

Chi Zhejie (Signal and Information Processing)

Directed by ZHANG Quan


Abstract


The rapid development of information technology has led to explosive growth of information, most of which is presented as text. There is an urgent need to acquire useful information from massive text data, and knowledge associated with specific events is of particular interest. Event extraction, a fine-grained form of information extraction, is an important way to acquire event knowledge and plays an important role in automatic text summarization, question answering, information retrieval and topic tracking. Event extraction focuses on event type recognition and event element extraction. In addition, extracted event texts need to be expanded, aggregated and tracked, which calls for similarity calculation techniques.

This dissertation relies on the knowledge of the language concept space. Drawing on the features of sentence semantic structure in the sentence category space and the concept knowledge in the concept primitive space, it analyzes events in text, covering event type recognition and event role extraction. In addition, word and sentence similarity calculation methods are proposed to expand event texts. The main contributions and innovations of this dissertation are as follows:

1.        An event extraction method based on the language concept space is presented, which includes event type recognition and event element extraction. From the perspective of sentence category analysis, the method treats event-triggering concepts as indicators and determines event types based on a frequency index. A weighting mechanism is introduced to account for the different contributions of different semantic chunks when calculating frequencies. After the event type is recognized, event role extraction is conducted by constructing the correspondence between semantic chunks and event elements. The proposed method is knowledge-based and rule-driven. Compared with traditional machine-learning-based methods, it is more efficient and more widely applicable. Experiments on the Chinese Emergency Corpus show that the macro F1 scores of event type recognition and event element extraction reach 0.871 and 0.768 respectively, which are 4.8 and 6.4 percentage points higher than the baseline.

2.        A similarity computation method based on collocative concepts is discussed. Guided by large-scale corpus statistics, the method extracts the collocative concepts of each concept and treats the resulting collocative concept vector as that concept's context. Concept similarity is then calculated by measuring the similarity of the contexts. Based on concept similarity, word similarity can be computed using a word-to-concept-primitive mapping table. To evaluate the order consistency of similarity results, a measure called ordinal pair conformity is proposed. Experiments show that the results of the proposed method are highly consistent with human judgments, achieving 0.704, 0.768 and 0.757 in correlation coefficient, compatibility degree and ordinal pair conformity respectively. Compared with the word-collocation-based method, the HowNet-based method and the former HNC-based method, the correlation coefficient improves by 0.160, 0.070 and 0.046 respectively.

3.        A multi-dimensional computation method for concept similarity based on the concept primitive symbol system is proposed. The method considers the hierarchy, network structure, comparability and duality, attached features and quintuple information of the concept primitive symbol system, and constructs a comprehensive formula to evaluate concept similarity. When measuring the depth and distance of a node, weight fitting is introduced to make the results more consistent with reality. Experiments on a manually constructed test set show that the computed similarities are highly consistent with human judgments. The proposed method achieves 0.810, 0.827 and 0.794 in correlation coefficient, compatibility degree and ordinal pair conformity respectively. Its correlation coefficient is 0.266, 0.176, 0.152 and 0.126 higher than those of the word-collocation-based method, the HowNet-based method, the former HNC-based method and the collocative-concept-based method respectively.

4.        A computation method for sentence similarity based on sentence category analysis is studied. To understand the sentence as a whole, the method takes semantic chunks as processing units. After sentence category analysis, grammatical similarity is measured by the types of the sentence categories and semantic chunks, while semantic similarity is calculated from the word similarities between chunks of the two sentences. The two factors are then combined by a weight-fitted formula to give the sentence similarity. The proposed method, applied with each of the two aforementioned word similarity methods, achieves good performance in the experiments. The variant that uses similarity based on the concept primitive symbol system performs better, with a correlation coefficient of 0.813, which is 0.039 higher than that of the dependency-analysis-based method. Moreover, an event text expansion method using word similarity and sentence similarity is studied. The proposed method shows good scalability on a small-scale event text set.
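
As a rough illustration of the weighted combination described in item 4, the following Python sketch mixes a grammatical score with a chunk-level semantic score. It is not the dissertation's formula. The class names, the chunk labels ("agent", "object"), the equal split inside the grammatical score, the weight alpha and the toy word similarity function are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Chunk:
    kind: str            # semantic chunk type (hypothetical label)
    words: List[str]

@dataclass
class Sentence:
    category: str        # sentence category (hypothetical label)
    chunks: List[Chunk]  # assumed: at most one chunk per kind

WordSim = Callable[[str, str], float]

def grammatical_similarity(a: Sentence, b: Sentence) -> float:
    """Agreement of sentence category plus overlap of chunk types (assumed 50/50 split)."""
    same_category = 1.0 if a.category == b.category else 0.0
    kinds_a = {c.kind for c in a.chunks}
    kinds_b = {c.kind for c in b.chunks}
    union = kinds_a | kinds_b
    overlap = len(kinds_a & kinds_b) / len(union) if union else 0.0
    return 0.5 * same_category + 0.5 * overlap

def _directed(x: Chunk, y: Chunk, word_sim: WordSim) -> float:
    """Average, over words of x, of the best-matching word similarity in y."""
    if not x.words or not y.words:
        return 0.0
    return sum(max(word_sim(w, v) for v in y.words) for w in x.words) / len(x.words)

def chunk_similarity(x: Chunk, y: Chunk, word_sim: WordSim) -> float:
    """Symmetric chunk similarity: mean of the two directed scores."""
    return 0.5 * (_directed(x, y, word_sim) + _directed(y, x, word_sim))

def semantic_similarity(a: Sentence, b: Sentence, word_sim: WordSim) -> float:
    """Compare chunks of the same type and average over the shared types."""
    by_kind = {c.kind: c for c in b.chunks}
    scores = [chunk_similarity(c, by_kind[c.kind], word_sim)
              for c in a.chunks if c.kind in by_kind]
    return sum(scores) / len(scores) if scores else 0.0

def sentence_similarity(a: Sentence, b: Sentence, word_sim: WordSim,
                        alpha: float = 0.3) -> float:
    """Weighted mix of grammatical and semantic similarity; alpha is an assumed weight."""
    return (alpha * grammatical_similarity(a, b)
            + (1.0 - alpha) * semantic_similarity(a, b, word_sim))

# Toy word similarity: exact match only. A real system would plug in a
# concept-based word similarity here.
def exact_match(w: str, v: str) -> float:
    return 1.0 if w == v else 0.0

s1 = Sentence("action", [Chunk("agent", ["fire"]), Chunk("object", ["warehouse"])])
s2 = Sentence("action", [Chunk("agent", ["fire"]), Chunk("object", ["factory"])])
score = sentence_similarity(s1, s2, exact_match)
```

In this sketch the word similarity function is passed in as a parameter, so either of the two word similarity methods from items 2 and 3 could in principle be substituted for the toy `exact_match` function.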

 

Keywords: HNC Theory; Language concept space; Concept primitive; Sentence category; Semantic chunk; Event extraction; Word similarity; Sentence similarity