Abstract
[Objective] To show that, within a machine-learning framework, feature weighting combined with shallow hierarchical classification can effectively assign Chinese Library Classification (CLC) class numbers to journal articles. [Context] Manual classification cannot keep pace with the volume of material in the big-data environment, and as journals move to electronic publication, automatic classification can substantially relieve the manual workload. [Methods] The paper applies machine learning to the automatic classification of journal articles. It compares the performance of the Support Vector Machine (SVM) and the BP neural network (BPNN) on this task, uses the idea of hierarchical classification to convert the CLC into a three-level classification system so that obtaining a class number reduces to three successive classification steps, and sets feature weights according to the source of each feature. [Results] The experiments show that SVM is better suited than BPNN to large-scale sparse data. The classification accuracy of the three levels, from top to bottom, reaches 95.05%, 92.89% and 89.02%, and the overall accuracy is close to 80%. Feature weights that depend on the source of the feature give better classification results than a single uniform weight. [Conclusions] The study indicates that machine learning is a feasible, reasonable and effective approach to the automatic classification of journal articles, and it offers a new way to implement such classification.
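The source-based feature weighting described under [Methods] can be illustrated with a minimal sketch. The abstract does not state which sources were used or what weights they received, so the field names (`title`, `keywords`, `abstract`), the weight values, and the function name `weighted_features` below are all hypothetical and serve only to show the idea: a term counts for more when it appears in a more indicative part of the article.

```python
from collections import Counter

# Hypothetical source weights. The abstract does not report the actual
# sources or values, so these numbers are illustrative assumptions only.
SOURCE_WEIGHTS = {"title": 3.0, "keywords": 2.0, "abstract": 1.0}


def weighted_features(fields, source_weights=SOURCE_WEIGHTS):
    """Map each term to a weighted count.

    Every occurrence of a term is scaled by the weight of the field
    (source) it came from; unknown fields default to a weight of 1.0.
    """
    features = Counter()
    for source, tokens in fields.items():
        weight = source_weights.get(source, 1.0)
        for token in tokens:
            features[token] += weight
    return dict(features)


# A toy, already tokenized article (hypothetical data).
article = {
    "title": ["svm", "classification"],
    "keywords": ["svm"],
    "abstract": ["classification", "journal", "svm"],
}
vector = weighted_features(article)
```

A vector built this way could then be passed to any classifier, for example one SVM per level of the three-level hierarchy. With a single uniform weight, the same function reduces to plain term counts, which is the baseline the abstract compares against.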
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal (北大核心)
2014, No. 3, pp. 80-87 (8 pages)
New Technology of Library and Information Service
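Funding
Natural Science Foundation of Jiangsu Province project "Research on Chinese Ontology Learning for Patent Early Warning" (Grant No. BK20130587)
One of the research outputs of a sub-project of the National Social Science Fund of China key project "Research on Semantics-Based Deep Aggregation and Visual Display of Library Collection Resources" (Grant No. 11AZD090)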
Keywords
Machine learning
Journal articles
Automatic text categorization
Feature weighting
Hierarchical classification