Abstract
This paper takes as its training set texts covering daily conversations, conference speeches, novels, argumentative essays, government white papers, and news reports, and computes 22 indicators for them, such as the proportion of nouns, the proportion of numerals, and the proportion of sentences containing conjunctions. Using these indicators as text features in a linear analysis, it selects 14 indicators with significant discriminative power, determines their weights, and builds a text classification model based on the Bayes classification function. Among the 14 selected indicators, not only vocabulary-proportion measures such as the noun ratio but also indicators such as sentence length prove to be effective bases for classification. Tests on a test data set show that, in the vast majority of cases, the model's precision exceeds 85% and its recall exceeds 81%, so the model achieves relatively high classification accuracy at a relatively low computational cost.
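The approach summarized in the abstract can be illustrated with a minimal sketch of linear Bayes classification functions. It is not the author's implementation. It assumes Gaussian features that share one pooled diagonal covariance, which is a simplification of full discriminant analysis, and it uses two hypothetical features (noun ratio and mean sentence length) with made-up toy values rather than data from the paper. The names `fit_linear_bayes`, `classify`, `TRAIN_X`, `TRAIN_Y`, and `MODEL` are introduced here for illustration only.

```python
import math
from collections import defaultdict


def fit_linear_bayes(samples, labels):
    """Fit one linear classification function per class.

    Assumption: Gaussian features with a pooled *diagonal* covariance.
    Requires more samples than classes and non-zero variance per feature.
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n, dim = len(samples), len(samples[0])

    # Per-class mean of every feature.
    means = {c: [sum(col) / len(rows) for col in zip(*rows)]
             for c, rows in by_class.items()}

    # Pooled within-class variance of every feature.
    pooled = [0.0] * dim
    for c, rows in by_class.items():
        for x in rows:
            for j in range(dim):
                pooled[j] += (x[j] - means[c][j]) ** 2
    dof = n - len(by_class)
    pooled = [v / dof for v in pooled]

    # score_c(x) = ln(prior_c) + sum_j w_cj * x_j - 0.5 * sum_j w_cj * mean_cj
    model = {}
    for c, rows in by_class.items():
        weights = [means[c][j] / pooled[j] for j in range(dim)]
        bias = (math.log(len(rows) / n)
                - 0.5 * sum(w * m for w, m in zip(weights, means[c])))
        model[c] = (weights, bias)
    return model


def classify(model, x):
    """Return the class whose classification function scores highest."""
    def score(c):
        weights, bias = model[c]
        return bias + sum(w * v for w, v in zip(weights, x))
    return max(model, key=score)


# Toy feature vectors: [noun ratio, mean sentence length]. Values are invented.
TRAIN_X = [[0.20, 8.0], [0.22, 10.0], [0.18, 9.0],
           [0.40, 40.0], [0.42, 45.0], [0.38, 38.0]]
TRAIN_Y = ["conversation"] * 3 + ["white_paper"] * 3

MODEL = fit_linear_bayes(TRAIN_X, TRAIN_Y)
```

Calling `classify(MODEL, [0.21, 9.5])` assigns a new feature vector to the class with the highest score. A fuller treatment would use the complete pooled covariance matrix and a feature-selection step to narrow the candidate indicators, as the abstract describes.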
Author
MAO Wen-wei (毛文伟), Office of Research Affairs, Shanghai International Studies University, Shanghai 200083, China
Source
Technology Enhanced Foreign Language Education (《外语电化教学》)
CSSCI
Peking University Core Journal (北大核心)
2019, Issue 6, pp. 97-102, 112 (7 pages in total)
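Funding
An interim result of the 2019 National Social Science Fund of China project "A Study of the Cognitive Mechanisms of Chinese Learners of Japanese Based on Data Mining Techniques" (Project No. 19BYY201)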
Keywords
Text Classification
Linear Analysis
Japanese
Text Features
Bayes