Abstract
Machine learning text classification methods based on bag of words (BOW) suffer from high dimensionality, high sparsity, an inability to recognize synonyms, and a lack of semantic information, while rule-based methods achieve high precision but are not robust. To address these problems, this paper proposes a method that uses lexical-semantic patterns to extract event semantic annotations from financial news text and then uses these annotations as classification features in machine learning text classification. Experiments show that, with the same feature selection algorithm and the same classification algorithm, the method improves the F1 score by 8.6%, precision by 7.7%, and recall by 8.8% over the BOW-based method. The method combines the strengths of knowledge-driven and data-driven text classification while avoiding their main drawbacks, and has clear practical and research reference value.
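As a rough illustration of the idea in the abstract, the sketch below tags event annotations with a few lexical-semantic patterns and turns them into a small, dense feature vector. The pattern set, the event labels, and the function names `annotate` and `to_vector` are all hypothetical; the record does not list the paper's actual patterns, and plain regular expressions stand in here for whatever pattern engine the authors used.

```python
import re

# Hypothetical lexical-semantic patterns (not taken from the paper).
# Each event label groups several surface forms, so synonyms such as
# "acquired" and "takeover" map to the same feature.
PATTERNS = {
    "EVENT_ACQUISITION": re.compile(
        r"\b(?:acquir(?:e|es|ed|ing)|takeover|merger)\b", re.IGNORECASE
    ),
    "EVENT_PROFIT_UP": re.compile(
        r"\b(?:profit|earnings|revenue)\s+(?:rose|jumped|surged|increased)\b",
        re.IGNORECASE,
    ),
    "EVENT_PROFIT_DOWN": re.compile(
        r"\b(?:profit|earnings|revenue)\s+(?:fell|dropped|declined|decreased)\b",
        re.IGNORECASE,
    ),
}

# Fixed feature order, so every document yields a vector of the same length.
LABELS = sorted(PATTERNS)


def annotate(text):
    """Return a dict mapping each matched event label to its match count."""
    annotations = {}
    for label, pattern in PATTERNS.items():
        count = sum(1 for _ in pattern.finditer(text))
        if count:
            annotations[label] = count
    return annotations


def to_vector(annotations):
    """Convert event annotations into a count vector ordered by LABELS."""
    return [annotations.get(label, 0) for label in LABELS]


if __name__ == "__main__":
    doc = "Company A acquired Company B after its profit rose sharply."
    print(to_vector(annotate(doc)))
```

The resulting vector has one dimension per event label rather than one per vocabulary word, which is the sense in which annotation features can be lower-dimensional and less sparse than BOW features. Such vectors could then be passed to any standard feature selection step and classifier.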
Authors
Luo Ming (罗明)
Huang Hailiang (黄海量)
College of Information Management & Engineering; Shanghai Key Laboratory of Financial Information Technology, Shanghai University of Finance & Economics, Shanghai 200433, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2018, Issue 8, pp. 2281-2284, 2288 (5 pages)
Funding
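Shanghai Science and Technology Talent Program (14XD1421000)
Shanghai Science and Technology Innovation Action Plan (16511102900)
Shanghai University of Finance and Economics 2014 Graduate Innovation Fund (CXJJ-2014-438)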
Keywords
text classification
financial text
semantic annotation
lexical-semantic pattern
finite state machine