English Topic Tracking Research Based on Query Vector
Abstract Topic tracking, a relatively new area of natural language processing, has attracted growing attention from researchers in China and abroad. The task is to monitor a stream of news stories and find those that discuss a topic already known to the system. This paper addresses three key problems in query-based topic tracking: feature extraction, feature weight computation, and similarity measurement. First, based on an analysis of the characteristics of English news stories, a feature extraction algorithm that combines word differentiation with a location property is proposed. Word differentiation divides words into capitalized words, whose initial letter is uppercase, and common words, whose initial letter is not. The location property determines the importance of a word from the inverted-pyramid structure of news stories. Second, a method is proposed for computing feature weights by fusing several feature extraction algorithms: the more extraction algorithms select a feature, the larger its weight. Finally, a double filtration algorithm based on the majority vote rule is proposed; it judges twice whether a story is relevant to a topic, which greatly reduces the system's false alarm rate. Experiments indicate that the three proposed methods not only perform well but also have good scalability.
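The three ideas in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration and not the authors' method: the function names, the linear position decay, the use of cosine similarity, the thresholds, and the choice to make the first filtering pass a coarse check against a fused query vector and the second pass a majority vote over per-extractor query vectors.

```python
import math
import re
from collections import Counter

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def extract_by_position(story, top_k=3):
    """Hypothetical extractor: words in earlier sentences score higher,
    following the inverted-pyramid structure (assumed linear decay)."""
    sentences = [s for s in SENTENCE_SPLIT.split(story.strip()) if s]
    scores = Counter()
    for i, sentence in enumerate(sentences):
        weight = 1.0 - i / len(sentences)
        for tok in re.findall(r"[A-Za-z]+", sentence):
            scores[tok.lower()] += weight
    return {term for term, _ in scores.most_common(top_k)}

def extract_capitalized(story, top_k=3):
    """Hypothetical extractor: frequent words with an uppercase initial.
    Sentence-initial words are skipped, since they are always capitalized."""
    scores = Counter()
    for sentence in SENTENCE_SPLIT.split(story.strip()):
        for tok in re.findall(r"[A-Za-z]+", sentence)[1:]:
            if tok[0].isupper():
                scores[tok.lower()] += 1
    return {term for term, _ in scores.most_common(top_k)}

def fused_weights(selections):
    """Weight of a term = fraction of extractors that selected it."""
    votes = Counter(term for selected in selections for term in set(selected))
    return {term: n / len(selections) for term, n in votes.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_on_topic(story_vec, fused_query, extractor_queries,
                coarse=0.1, strict=0.3):
    """Assumed two-pass filter; both thresholds are illustrative."""
    # Pass 1: coarse check against the fused query vector.
    if cosine(story_vec, fused_query) < coarse:
        return False
    # Pass 2: a strict majority of per-extractor query vectors must agree.
    votes = sum(cosine(story_vec, q) >= strict for q in extractor_queries)
    return votes * 2 > len(extractor_queries)
```

In this sketch, a term picked by every extractor gets weight 1.0, while a term picked by one of three extractors gets weight 1/3, which mirrors the abstract's statement that features chosen by more extraction algorithms matter more. Requiring a story to pass both checks trades some recall for fewer false alarms.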
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2007, No. 8, pp. 1412-1417 (6 pages)
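Funding Key Program of the National Natural Science Foundation of China (60435020); National High Technology Research and Development Program of China ("863" Program) (2004AA117010-08)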
Keywords topic tracking; word differentiation; majority vote rule; double filtration; normalized detection cost

