Abstract
Automatic classification of software bug reports can save a great deal of time and labor. However, bug reports submitted by users are highly subjective and loosely worded, which makes automatic classification inefficient. To address this, two improved feature dimension reduction algorithms are proposed. Both build on the traditional Term Frequency-Inverse Document Frequency (TF-IDF) algorithm and combine the frequency of a term within a document with the distribution of that term across documents of the same category and of different categories. After dimension reduction, the terms are weighted to further improve the result. Experimental results show that bug report classification using the proposed algorithms clearly improves on existing algorithms in precision, recall, F1 score, and accuracy.
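For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of classic TF-IDF followed by a simple top-k term selection, not the two improved algorithms the paper proposes, since the abstract does not give their formulas. The function names `tf_idf` and `select_features` and the toy reports are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (classic TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def select_features(docs, k):
    """Keep the k terms with the highest TF-IDF weight in any document.

    This ranking rule is an assumption made for illustration only.
    """
    best = {}
    for doc_weights in tf_idf(docs):
        for term, w in doc_weights.items():
            best[term] = max(best.get(term, 0.0), w)
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:k]


# Hypothetical toy bug reports, already tokenized.
reports = [
    ["app", "crash", "on", "startup"],
    ["app", "crash", "when", "saving", "file"],
    ["please", "add", "dark", "theme", "to", "app"],
]
```

A term such as "app" that appears in every report receives weight zero under this baseline, so it is never selected. Plain TF-IDF ignores category labels entirely, which is the gap the abstract says the proposed algorithms address by considering how a term is distributed within and across categories.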
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journal (北大核心)
2015, No. 6, pp. 183-187 (5 pages)
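Funding
National Natural Science Foundation of China, Major International (Regional) Joint Research Project (81320108019)
Natural Science Foundation of Fujian Province (2014J01220)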
Keywords
feature dimension reduction
bug report
automatic text classification
Term Frequency-Inverse Document Frequency (TF-IDF)
feature weight
frequency