An Approach Based on Domain Knowledge to Text Categorization

Cited by: 12
Abstract: A knowledge-based text categorization method is proposed. It introduces domain knowledge, uses domain features as text features to strengthen text representation, and treats text categorization as an aggregation computation. Text indexing uses an improved feature selection and weighting technique (feature re-selection and reweighting). A learning algorithm based on mutual information automatically learns the aggregation functions for domain features from a labeled training corpus. Comparative experiments show that the method performs better overall than the conventional naive Bayes classifier based on the bag-of-words model, and that applying domain knowledge effectively improves the classification of similar or antithetical topics.
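The abstract names mutual information and aggregation over domain features but does not give the formulas. The sketch below is therefore only an illustration under stated assumptions, not the authors' method: it estimates pointwise mutual information (PMI) between a domain feature and a category from document-level counts, then classifies a text by summing the PMI scores of its features. The function names, the toy feature labels, and the choice to score unseen feature-category pairs as 0 are all hypothetical.

```python
import math
from collections import Counter


def pointwise_mutual_information(labeled_docs):
    """Estimate PMI(feature, category) from (features, category) pairs.

    Assumption: a feature counts at most once per document, and
    probabilities are plain relative frequencies over documents.
    """
    n_docs = len(labeled_docs)
    feature_counts = Counter()
    category_counts = Counter()
    joint_counts = Counter()
    for features, category in labeled_docs:
        category_counts[category] += 1
        for feature in set(features):
            feature_counts[feature] += 1
            joint_counts[(feature, category)] += 1

    pmi = {}
    for (feature, category), joint in joint_counts.items():
        p_joint = joint / n_docs
        p_feature = feature_counts[feature] / n_docs
        p_category = category_counts[category] / n_docs
        pmi[(feature, category)] = math.log2(p_joint / (p_feature * p_category))
    return pmi


def classify(features, pmi, categories):
    """Sum per-feature PMI scores per category and return the best category.

    Assumption: feature-category pairs never seen in training score 0.0.
    """
    scores = {
        category: sum(pmi.get((f, category), 0.0) for f in set(features))
        for category in categories
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy corpus: each text is already mapped to domain features.
TRAINING_DOCS = [
    (["finance", "stock_market"], "economy"),
    (["finance", "banking"], "economy"),
    (["football", "athletics"], "sports"),
    (["football", "finance"], "sports"),
]
CATEGORIES = ["economy", "sports"]

pmi_table = pointwise_mutual_information(TRAINING_DOCS)
label, label_scores = classify(["finance", "banking"], pmi_table, CATEGORIES)
print(label, label_scores)
```

In this toy setup the feature "finance" appears under both categories, so its PMI with "economy" is only mildly positive while "banking" contributes the stronger evidence. A summed score is one simple form an aggregation function could take; the learned functions in the paper may differ.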
Source: Journal of Northeastern University (Natural Science), 2005, No. 8, pp. 733-735 (3 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
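Funding: National Natural Science Foundation of China (60203019); Microsoft Research Asia joint funding project (60473140); Key Science and Technology Research Project of the Ministry of Education of China (104065).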
Keywords: domain knowledge; text categorization; aggregation computation; machine learning; naive Bayes model
