Abstract
Topic tracking, a relatively new area of natural language processing, has attracted growing attention from researchers in China and abroad. It is defined as the task of monitoring a stream of news stories to find those that discuss a topic already known to the system. This paper studies three key problems in query-based topic tracking: feature extraction, feature weight computation, and similarity measurement. First, based on an analysis of the characteristics of English news stories, a feature extraction algorithm that combines word differentiation with a location property is proposed. Word differentiation divides words into capitalized words, whose initial letter is uppercase, and common words, whose initial letter is not. The location property determines the importance of a word from the inverted-pyramid structure of news stories. Second, a method for computing feature weights by fusing several feature extraction algorithms is proposed; it treats a feature as more important the more extraction algorithms select it. Finally, a double filtration algorithm based on the majority vote rule is proposed; it judges twice whether a story is relevant to a topic, which greatly reduces the system's false alarm rate. Experiments indicate that the three proposed algorithms not only perform well but also have good scalability.
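The abstract names the three ideas but gives no formulas, so the following Python is only a minimal sketch of how they might fit together, not the authors' implementation. All function names, the linear position decay, the "fraction of extractors" weight, and the rule that a story must pass both filtering rounds are assumptions made for illustration. Sentence-initial capitalization, which the abstract does not discuss, is not handled.

```python
from collections import Counter


def split_by_capitalization(tokens):
    """Word differentiation: separate tokens whose first letter is
    uppercase from all other tokens."""
    capitalized = [t for t in tokens if t[:1].isupper()]
    common = [t for t in tokens if not t[:1].isupper()]
    return capitalized, common


def position_score(index, total):
    """Assumed location property: words nearer the start of a story
    score higher, reflecting the inverted-pyramid structure.
    The linear decay is a placeholder, not taken from the paper."""
    return 1.0 - index / total


def vote_weights(selections):
    """Assumed fusion rule: a feature's weight is the fraction of
    extraction algorithms that selected it.
    `selections` holds one set of features per extractor."""
    counts = Counter(f for chosen in selections for f in set(chosen))
    return {f: c / len(selections) for f, c in counts.items()}


def majority_vote(votes):
    """True when strictly more than half of the boolean votes are True."""
    return sum(votes) > len(votes) / 2


def double_filter(first_round, second_round):
    """Assumed double filtration: a story is accepted as on-topic only
    if a majority votes 'relevant' in both rounds."""
    return majority_vote(first_round) and majority_vote(second_round)


if __name__ == "__main__":
    caps, common = split_by_capitalization(["Beijing", "hosts", "Olympic", "games"])
    print(caps, common)
    print(vote_weights([{"Beijing", "games"}, {"Beijing"}, {"Beijing", "Olympic"}]))
    print(double_filter([True, True, False], [True, False, False]))
```

Requiring both rounds to pass is one way a second judgment could lower false alarms: a story that narrowly passes the first vote is still rejected when the second vote disagrees.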
Source
Journal of Computer Research and Development (《计算机研究与发展》)
Indexed in EI, CSCD, and the Peking University Core Journals list
2007, No. 8, pp. 1412-1417 (6 pages)
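Funding
Key Program of the National Natural Science Foundation of China (60435020)
National High Technology Research and Development Program of China (863 Program) (2004AA117010-08)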
Keywords
topic tracking
word differentiation
majority vote rule
double filtration
normalized detection cost