Research on Key Techniques of Event Content Mining for Open-source Text

Wu Chongbin (Signal and Information Processing)

Directed by ZHANG Quan


Abstract


Event content mining for open-source text, the basis of advanced event information processing, is a kind of intelligent information processing that transforms event information from free text into formatted data and identifies the domain of each event. Applying the knowledge of HNC Theory, this paper designs a system for event content mining that uses conceptual knowledge, and discusses solutions and approaches for several key techniques: text clustering, unlisted word recognition, automatic word sense learning, and word sense disambiguation. The main contributions and novel ideas are as follows:

1.        A conceptual-primitive-based text representation and a combined-features-based text representation for text clustering. The first is proposed to solve a problem common to word-based text representations: the excessive dimensionality of the vector space model (VSM). The second, which combines words of specified types with conceptual primitives, is proposed to solve the loss of information caused by unlisted words in the first representation. Compared with the word-based text representation, each of the two achieves a sharp decrease in VSM dimension and better text clustering performance. In VSM dimension, the conceptual-primitive-based representation achieves a 92.5% reduction and the combined-features-based representation a 91.0% reduction. In the F-value evaluated against manual classification, an indicator of text clustering performance, the conceptual-primitive-based representation achieves an increase of 9.6% and the combined-features-based representation an increase of 25.8%.

2.        An approach to unlisted word recognition based on the Apriori property and a web search engine, and an approach that learns word senses automatically from online encyclopedias. Both approaches take advantage of information from the Internet, and neither needs training. They are therefore appropriate for processing open-source text, and to a certain extent they help semantics-based systems overcome the adverse effects of unlisted words. According to the experimental results, the precision and recall of the proposed approach to unlisted word recognition are 93.9% and 97.9% respectively; on a 10-point scoring scale, the word sense learning system achieves macro-average scores of 7.2329 for word type and 6.3542 for word domain.

3.        A tailored-context-window (TCW) strategy for a Bayes-based word sense disambiguation (WSD) system. Unlike the unified-context-window (UCW) strategy, the TCW strategy determines a context window for each polysemant rather than a single unified context window for all polysemants. Two rules for determining the tailored context window are introduced: a precision-based rule and a fitting-function-based rule. Both theoretical and experimental comparisons between the TCW and UCW strategies are carried out. In the theoretical comparison, the TCW strategy achieves better ideal results, with average increases of 6 percentage points in Macro-P and 5 percentage points in Micro-P. In tests of automatic systems, however, the TCW strategy performs no better than the UCW strategy.

4.        A design for an event content mining system that takes advantage of conceptual knowledge. A two-dimensional criterion for event categorization, based on event type and domain, is introduced to refine event classification, and the event framework is designed in accordance with the correspondence between event factors and semantic chunks. By incorporating HNC knowledge and applying a sentence category analysis system, the event content mining system extracts and reorganizes event content and identifies the domain of each event, which can provide data support for relation mining over events within the same domain or across different domains. The design presented in this paper is intended as a reference solution for the future development of event content mining systems.
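
The difference between the UCW and TCW strategies in item 3 can be illustrated with a minimal sketch. It assumes that the precision-based rule simply picks, for each polysemant, the window size with the highest measured precision; the abstract does not spell the rule out, so this reading, the function names, and the precision figures below are all hypothetical and serve only as illustration.

```python
from typing import Dict

# Hypothetical precision of a Bayes-based WSD system, per polysemant and
# per context-window size. Illustrative numbers, not results from the paper.
Precision = Dict[str, Dict[int, float]]

example: Precision = {
    "bank":  {1: 0.70, 3: 0.80, 5: 0.75},
    "plant": {1: 0.85, 3: 0.80, 5: 0.70},
    "bass":  {1: 0.60, 3: 0.70, 5: 0.90},
}


def unified_window(precision: Precision) -> int:
    """UCW: one window size for all polysemants (best mean precision)."""
    sizes = sorted(next(iter(precision.values())))
    return max(
        sizes,
        key=lambda w: sum(p[w] for p in precision.values()) / len(precision),
    )


def tailored_windows(precision: Precision) -> Dict[str, int]:
    """TCW under the assumed precision-based rule: best window per polysemant."""
    return {word: max(sorted(p), key=p.get) for word, p in precision.items()}


def macro_precision(precision: Precision, windows: Dict[str, int]) -> float:
    """Macro-averaged precision when each polysemant uses its assigned window."""
    return sum(precision[w][windows[w]] for w in precision) / len(precision)


ucw = unified_window(example)
tcw = tailored_windows(example)
ucw_macro = macro_precision(example, {word: ucw for word in example})
tcw_macro = macro_precision(example, tcw)
```

Because the windows here are chosen on the same figures they are scored on, the tailored result corresponds to the ideal (theoretical) comparison, where TCW can never score below UCW. It says nothing about the automatic systems, for which the abstract reports no improvement.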

 

Keywords: Event content mining; HNC Theory; Text clustering; Unlisted word recognition; Word sense learning; Word sense disambiguation; Event category; Event framework
