Abstract
[Purpose/significance] To advance the identification of potential "high-quality" literature and its application to the identification, dissemination and use of scientific and technical literature. [Method/process] This paper takes the articles published in the journals Science and Nature, together with their citation distribution data, as the sample. For every article, features such as first-citation time, abstract length, total citation count, funding support and article length were calculated to construct a feature matrix of "high-quality" articles. A model for identifying potential "high-quality" articles was then trained and applied, based on this feature matrix and the random forest algorithm. [Result/conclusion] The results show that combining the "high-quality" article feature matrix with the random forest model identifies potential "high-quality" articles in Science and Nature fairly well: the mean classification accuracy of the model exceeds 80%, and the accuracy for Nature is about 2% higher than that for Science. The model performs slightly better when it uses information gain than when it uses Gini impurity. In addition, "high-quality" articles in Science and Nature receive their first citation extremely quickly, being cited within the year of publication. [Innovation/limitation] The combination of a "high-quality" literature feature matrix and a machine learning model can be applied well to the identification and recommendation of potential "high-quality" articles in high-impact journals. In the future, however, the model needs to be extended to, and tested on, the identification of "high-quality" articles in massive literature collections.
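The abstract compares two split criteria for the random forest, information gain and Gini impurity. As a minimal sketch of how these two measures score a candidate split in a decision tree, the snippet below computes both on toy labels. It is not the authors' implementation: the function names (`gini`, `entropy`, `split_gain`) and the toy labels are assumptions introduced only for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in this value."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy labels (hypothetical): 1 = "high-quality" article, 0 = other article.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
# A hypothetical split on one feature, e.g. "first cited in the publication year".
left = [1, 1, 1, 0]
right = [0, 0, 0, 0]

gini_gain = split_gain(parent, left, right, gini)     # Gini-based score
info_gain = split_gain(parent, left, right, entropy)  # information gain

print(f"Gini decrease:    {gini_gain:.4f}")
print(f"Information gain: {info_gain:.4f}")
```

Both criteria reward splits that produce purer child nodes; they differ only in how impurity is measured, which is why the two variants of a model usually give similar but not identical accuracy.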
Authors
胡泽文
任萍
周西姬
HU Ze-wen; REN Ping; ZHOU Xi-ji (School of Management Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, China)
Source
《情报科学》
CSSCI
Peking University Core Journals
2022, Issue 4, pp. 90-95, 106 (7 pages in total)
Information Science
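Funding
National Social Science Fund of China project "Research on Methods and Applications for Identifying Potential 'High-Quality' Items in Massive Scientific and Technical Literature" (20CTQ031).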
Keywords
random forest model
identification model
potential "high-quality" articles
highly cited
first citation
scientometrics