In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedule and limited human resources, developers may not have enough time to inspect all bugs. Thus, they oft...In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedule and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected time or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification outperform the Fl-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.展开更多
In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)...In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)feature extraction technique.First,dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible.Second,a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space.Third,optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples.Exhaustive experiments have been conducted to evaluate the feasibility,rationality,and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets.Experimental results show that(1)the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data;(2)the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased;and(3)statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms.This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.展开更多
Since traditional machine learning methods are sensitive to skewed distribution and do not consider the characteristics in multiclass imbalance problems,the skewed distribution of multiclass data poses a major challen...Since traditional machine learning methods are sensitive to skewed distribution and do not consider the characteristics in multiclass imbalance problems,the skewed distribution of multiclass data poses a major challenge to machine learning algorithms.To tackle such issues,we propose a new splitting criterion of the decision tree based on the one-against-all-based Hellinger distance(OAHD).Two crucial elements are included in OAHD.First,the one-against-all scheme is integrated into the process of computing the Hellinger distance in OAHD,thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem.Second,for the multiclass imbalance problem,the distribution and the number of distinct classes are taken into account,and a modified Gini index is designed.Moreover,we give theoretical proofs for the properties of OAHD,including skew insensitivity and the ability to seek a purer node in the decision tree.Finally,we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning(KEEL)repository and the University of California,Irvine(UCI)repository.Experimental and statistical results show that OAHD significantly improves the performance compared with the five other well-known decision trees in terms of Precision,F-measure,and multiclass area under the receiver operating characteristic curve(MAUC).Moreover,through statistical analysis,the Friedman and Nemenyi tests are used to prove the advantage of OAHD over the five other decision trees.展开更多
These days,imbalanced datasets,denoted throughout the paper by ID,(a dataset that contains some(usually two)classes where one contains considerably smaller number of samples than the other(s))emerge in many real world...These days,imbalanced datasets,denoted throughout the paper by ID,(a dataset that contains some(usually two)classes where one contains considerably smaller number of samples than the other(s))emerge in many real world problems(like health care systems or disease diagnosis systems,anomaly detection,fraud detection,stream based malware detection systems,and so on)and these datasets cause some problems(like under-training of minority class(es)and over-training of majority class(es),bias towards majority class(es),and so on)in classification process and application.Therefore,these datasets take the focus of many researchers in any science and there are several solutions for dealing with this problem.The main aim of this study for dealing with IDs is to resample the borderline samples discovered by Support Vector Data Description(SVDD).There are naturally two kinds of resampling:Under-sampling(U-S)and oversampling(O-S).The O-S may cause the occurrence of over-fitting(the occurrence of over-fitting is its main drawback).The U-S can cause the occurrence of significant information loss(the occurrence of significant information loss is its main drawback).In this study,to avoid the drawbacks of the sampling techniques,we focus on the samples that may be misclassified.The data points that can be misclassified are considered to be the borderline data points which are on border(s)between the majority class(es)and minority class(es).First by SVDD,we find the borderline examples;then,the data resampling is applied over them.At the next step,the base classifier is trained on the newly created dataset.Finally,we compare the result of our method in terms of Area Under Curve(AUC)and F-measure and G-mean with the other state-of-the-art methods.We show that our method has betterresults than the other state-of-the-art methods on our experimental study.展开更多
Objective:To determine the most influential data features and to develop machine learning approaches that best predict hospital readmissions among patients with diabetes.Methods:In this retrospective cohort study,we s...Objective:To determine the most influential data features and to develop machine learning approaches that best predict hospital readmissions among patients with diabetes.Methods:In this retrospective cohort study,we surveyed patient statistics and performed feature analysis to identify the most influential data features associated with readmissions.Classification of all-cause,30-day readmission outcomes were modeled using logistic regression,artificial neural network,and Easy Ensemble.F1 statistic,sensitivity,and positive predictive value were used to evaluate the model performance.Results:We identified 14 most influential data features(4 numeric features and 10 categorical features)and evaluated 3 machine learning models with numerous sampling methods(oversampling,undersampling,and hybrid techniques).The deep learning model offered no improvement over traditional models(logistic regression and Easy Ensemble)for predicting readmission,whereas the other two algorithms led to much smaller differences between the training and testing datasets.Conclusions:Machine learning approaches to record electronic health data offer a promising method for improving readmission prediction in patients with diabetes.But more work is needed to construct datasets with more clinical variables beyond the standard risk factors and to fine-tune and optimize machine learning models.展开更多
The Extreme Learning Machine(ELM) and its variants are effective in many machine learning applications such as Imbalanced Learning(IL) or Big Data(BD) learning. However, they are unable to solve both imbalanced ...The Extreme Learning Machine(ELM) and its variants are effective in many machine learning applications such as Imbalanced Learning(IL) or Big Data(BD) learning. However, they are unable to solve both imbalanced and large-volume data learning problems. This study addresses the IL problem in BD applications. The Distributed and Weighted ELM(DW-ELM) algorithm is proposed, which is based on the Map Reduce framework. To confirm the feasibility of parallel computation, first, the fact that matrix multiplication operators are decomposable is illustrated.Then, to further improve the computational efficiency, an Improved DW-ELM algorithm(IDW-ELM) is developed using only one Map Reduce job. The successful operations of the proposed DW-ELM and IDW-ELM algorithms are finally validated through experiments.展开更多
Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. In this paper, a promising method for detecting the domain structure of a pro...Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. In this paper, a promising method for detecting the domain structure of a protein from sequence information alone is presented. The method is based on analyzing multiple sequence alignments derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence. Then they are combined into a single predictor using support vector machine. What is more important, the domain detection is first taken as an imbal- anced data learning problem. A novel undersampling method is proposed on distance-based maximal entropy in the feature space of Support Vector Machine (SVM). The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in the machine learning system on general im- balanced datasets.展开更多
A clustering-based undersampling (CUS) and distance-based near-miss method are widely used in current imbalanced learning algorithms, but this method has certain drawbacks. In particular, the CUS does not consider the...A clustering-based undersampling (CUS) and distance-based near-miss method are widely used in current imbalanced learning algorithms, but this method has certain drawbacks. In particular, the CUS does not consider the influence of the distance factor on the majority of instances, and the near-miss method omits the inter-class(es) within the majority of samples. To overcome these drawbacks, this study proposes an undersampling method combining distance measurement and majority class clustering. Resampling methods are used to develop an ensemble-based imbalanced-learning algorithm called the clustering and distance-based imbalance learning model (CDEILM). This algorithm combines distance-based undersampling, feature selection, and ensemble learning. In addition, a cluster size-based resampling (CSBR) method is proposed for preserving the original distribution of the majority class, and a hybrid imbalanced learning framework is constructed by fusing various types of resampling methods. The combination of CDEILM and CSBR can be considered as a specific case of this hybrid framework. The experimental results show that the CDEILM and CSBR methods can achieve better performance than the benchmark methods, and that the hybrid model provides the best results under most circumstances. Therefore, the proposed model can be used as an alternative imbalanced learning method under specific circumstances, e.g., for providing a solution to credit evaluation problems in financial applications.展开更多
Application programming interface(API)libraries are extensively used by developers.To correctly program with APIs and avoid bugs,developers shall pay attention to API directives,which illustrate the constraints of API...Application programming interface(API)libraries are extensively used by developers.To correctly program with APIs and avoid bugs,developers shall pay attention to API directives,which illustrate the constraints of APIs.Unfortunately,API directives usually have diverse morphologies,making it time-consuming and error-prone for developers to discover all the relevant API directives.In this paper,we propose an approach leveraging text classification to discover API directives from API specifications.Specifically,given a set of training sentences in API specifications,our approach first characterizes each sentence by three groups of features.Then,to deal with the unequal distribution between API directives and non-directives,our approach employs an under-sampling strategy to split the imbalanced training set into several subsets and trains several classifiers.Given a new sentence in an API specification,our approach synthesizes the trained classifiers to predict whether it is an API directive.We have evaluated our approach over a publicly available annotated API directive corpus.The experimental results reveal that our approach achieves an F-measure value of up to 82.08%.In addition,our approach statistically outperforms the state-of-the-art approach by up to 29.67%in terms of F-measure.展开更多
As the complexity of software systems is increasing;software maintenance is becoming a challenge for software practitioners.The prediction of classes that require high maintainability effort is of utmost necessity to ...As the complexity of software systems is increasing;software maintenance is becoming a challenge for software practitioners.The prediction of classes that require high maintainability effort is of utmost necessity to develop cost-effective and high-quality software.In research of software engineering predictive modeling,various software maintainability prediction(SMP)models are evolved to forecast maintainability.To develop a maintainability prediction model,software practitioners may come across situations in which classes or modules requiring high maintainability effort are far less than those requiring low maintainability effort.This condition gives rise to a class imbalance problem(CIP).In this situation,the minority classes’prediction,i.e.,the classes demanding high maintainability effort,is a challenge.Therefore,in this direction,this study investigates three techniques for handling the CIP on ten open-source software to predict software maintainability.This empirical investigation supports the use of resampling with replacement technique(RR)for treating CIP and develop useful models for SMP.展开更多
基金This work is supported by the National Natural Science Foundation of China under Grant Nos. 61602403 and 61402406 and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH17F01.
文摘In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedule and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected time or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification outperform the Fl-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.
基金National Natural Science Foundation of China,Grant/Award Number:61972261Basic Research Foundations of Shenzhen,Grant/Award Numbers:JCYJ20210324093609026,JCYJ20200813091134001。
文摘In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)feature extraction technique.First,dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible.Second,a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space.Third,optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples.Exhaustive experiments have been conducted to evaluate the feasibility,rationality,and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets.Experimental results show that(1)the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data;(2)the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased;and(3)statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms.This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
基金Project supported by the National Natural Science Foundation of China(Nos.61802085 and 61563012)the Guangxi Provincial Natural Science Foundation,China(Nos.2021GXNSFAA220074and 2020GXNSFAA159038)+1 种基金the Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation,China(No.2018A-04)the Guangxi Key Laboratory of Trusted Software Foundation,China(No.kx202011)。
文摘Since traditional machine learning methods are sensitive to skewed distribution and do not consider the characteristics in multiclass imbalance problems,the skewed distribution of multiclass data poses a major challenge to machine learning algorithms.To tackle such issues,we propose a new splitting criterion of the decision tree based on the one-against-all-based Hellinger distance(OAHD).Two crucial elements are included in OAHD.First,the one-against-all scheme is integrated into the process of computing the Hellinger distance in OAHD,thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem.Second,for the multiclass imbalance problem,the distribution and the number of distinct classes are taken into account,and a modified Gini index is designed.Moreover,we give theoretical proofs for the properties of OAHD,including skew insensitivity and the ability to seek a purer node in the decision tree.Finally,we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning(KEEL)repository and the University of California,Irvine(UCI)repository.Experimental and statistical results show that OAHD significantly improves the performance compared with the five other well-known decision trees in terms of Precision,F-measure,and multiclass area under the receiver operating characteristic curve(MAUC).Moreover,through statistical analysis,the Friedman and Nemenyi tests are used to prove the advantage of OAHD over the five other decision trees.
基金grants to HAR and HP.HAR is supported by UNSW Scientia Program Fellowship and is a member of the UNSW Graduate School of Biomedical Engineering.
文摘These days,imbalanced datasets,denoted throughout the paper by ID,(a dataset that contains some(usually two)classes where one contains considerably smaller number of samples than the other(s))emerge in many real world problems(like health care systems or disease diagnosis systems,anomaly detection,fraud detection,stream based malware detection systems,and so on)and these datasets cause some problems(like under-training of minority class(es)and over-training of majority class(es),bias towards majority class(es),and so on)in classification process and application.Therefore,these datasets take the focus of many researchers in any science and there are several solutions for dealing with this problem.The main aim of this study for dealing with IDs is to resample the borderline samples discovered by Support Vector Data Description(SVDD).There are naturally two kinds of resampling:Under-sampling(U-S)and oversampling(O-S).The O-S may cause the occurrence of over-fitting(the occurrence of over-fitting is its main drawback).The U-S can cause the occurrence of significant information loss(the occurrence of significant information loss is its main drawback).In this study,to avoid the drawbacks of the sampling techniques,we focus on the samples that may be misclassified.The data points that can be misclassified are considered to be the borderline data points which are on border(s)between the majority class(es)and minority class(es).First by SVDD,we find the borderline examples;then,the data resampling is applied over them.At the next step,the base classifier is trained on the newly created dataset.Finally,we compare the result of our method in terms of Area Under Curve(AUC)and F-measure and G-mean with the other state-of-the-art methods.We show that our method has betterresults than the other state-of-the-art methods on our experimental study.
基金supported in part by the Key Research and Development Program for Guangdong Province(No.2019B010136001)in part by Hainan Major Science and Technology Projects(No.ZDKJ2019010)+3 种基金in part by the National Key Research and Development Program of China(No.2016YFB0800803 and No.2018YFB1004005)in part by National Natural Science Foundation of China(No.81960565,No.81260139,No.81060073,No.81560275,No.61562021,No.30560161 and No.61872110)in part by Hainan Special Projects of Social Development(No.ZDYF2018103 and No.2015SF 39)in part by Hainan Association for Academic Excellence Youth Science and Technology Innovation Program(No.201515)
文摘Objective:To determine the most influential data features and to develop machine learning approaches that best predict hospital readmissions among patients with diabetes.Methods:In this retrospective cohort study,we surveyed patient statistics and performed feature analysis to identify the most influential data features associated with readmissions.Classification of all-cause,30-day readmission outcomes were modeled using logistic regression,artificial neural network,and Easy Ensemble.F1 statistic,sensitivity,and positive predictive value were used to evaluate the model performance.Results:We identified 14 most influential data features(4 numeric features and 10 categorical features)and evaluated 3 machine learning models with numerous sampling methods(oversampling,undersampling,and hybrid techniques).The deep learning model offered no improvement over traditional models(logistic regression and Easy Ensemble)for predicting readmission,whereas the other two algorithms led to much smaller differences between the training and testing datasets.Conclusions:Machine learning approaches to record electronic health data offer a promising method for improving readmission prediction in patients with diabetes.But more work is needed to construct datasets with more clinical variables beyond the standard risk factors and to fine-tune and optimize machine learning models.
基金partially supported by the National Natural Science Foundation of China(Nos.61402089,61472069,and 61501101)the Fundamental Research Funds for the Central Universities(Nos.N161904001,N161602003,and N150408001)+2 种基金the Natural Science Foundation of Liaoning Province(No.2015020553)the China Postdoctoral Science Foundation(No.2016M591447)the Postdoctoral Science Foundation of Northeastern University(No.20160203)
文摘The Extreme Learning Machine(ELM) and its variants are effective in many machine learning applications such as Imbalanced Learning(IL) or Big Data(BD) learning. However, they are unable to solve both imbalanced and large-volume data learning problems. This study addresses the IL problem in BD applications. The Distributed and Weighted ELM(DW-ELM) algorithm is proposed, which is based on the Map Reduce framework. To confirm the feasibility of parallel computation, first, the fact that matrix multiplication operators are decomposable is illustrated.Then, to further improve the computational efficiency, an Improved DW-ELM algorithm(IDW-ELM) is developed using only one Map Reduce job. The successful operations of the proposed DW-ELM and IDW-ELM algorithms are finally validated through experiments.
基金National Natural Science Foundation of China (Grant No. 60433020, 60673099, 60673023)"985" project of Jilin University
文摘Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. In this paper, a promising method for detecting the domain structure of a protein from sequence information alone is presented. The method is based on analyzing multiple sequence alignments derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence. Then they are combined into a single predictor using support vector machine. What is more important, the domain detection is first taken as an imbal- anced data learning problem. A novel undersampling method is proposed on distance-based maximal entropy in the feature space of Support Vector Machine (SVM). The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in the machine learning system on general im- balanced datasets.
文摘A clustering-based undersampling (CUS) and distance-based near-miss method are widely used in current imbalanced learning algorithms, but this method has certain drawbacks. In particular, the CUS does not consider the influence of the distance factor on the majority of instances, and the near-miss method omits the inter-class(es) within the majority of samples. To overcome these drawbacks, this study proposes an undersampling method combining distance measurement and majority class clustering. Resampling methods are used to develop an ensemble-based imbalanced-learning algorithm called the clustering and distance-based imbalance learning model (CDEILM). This algorithm combines distance-based undersampling, feature selection, and ensemble learning. In addition, a cluster size-based resampling (CSBR) method is proposed for preserving the original distribution of the majority class, and a hybrid imbalanced learning framework is constructed by fusing various types of resampling methods. The combination of CDEILM and CSBR can be considered as a specific case of this hybrid framework. The experimental results show that the CDEILM and CSBR methods can achieve better performance than the benchmark methods, and that the hybrid model provides the best results under most circumstances. Therefore, the proposed model can be used as an alternative imbalanced learning method under specific circumstances, e.g., for providing a solution to credit evaluation problems in financial applications.
基金the National Key Research and Development Plan of China under Grant No.2018YFB1003900the National Natural Science Foundation of China under Grant No.61902181,the China Postdoctoral Science Foundation under Grant No.2020M671489the CCF-Tencent Open Research Fund under Grant No.RAGR20200106.
文摘Application programming interface(API)libraries are extensively used by developers.To correctly program with APIs and avoid bugs,developers shall pay attention to API directives,which illustrate the constraints of APIs.Unfortunately,API directives usually have diverse morphologies,making it time-consuming and error-prone for developers to discover all the relevant API directives.In this paper,we propose an approach leveraging text classification to discover API directives from API specifications.Specifically,given a set of training sentences in API specifications,our approach first characterizes each sentence by three groups of features.Then,to deal with the unequal distribution between API directives and non-directives,our approach employs an under-sampling strategy to split the imbalanced training set into several subsets and trains several classifiers.Given a new sentence in an API specification,our approach synthesizes the trained classifiers to predict whether it is an API directive.We have evaluated our approach over a publicly available annotated API directive corpus.The experimental results reveal that our approach achieves an F-measure value of up to 82.08%.In addition,our approach statistically outperforms the state-of-the-art approach by up to 29.67%in terms of F-measure.
文摘As the complexity of software systems is increasing;software maintenance is becoming a challenge for software practitioners.The prediction of classes that require high maintainability effort is of utmost necessity to develop cost-effective and high-quality software.In research of software engineering predictive modeling,various software maintainability prediction(SMP)models are evolved to forecast maintainability.To develop a maintainability prediction model,software practitioners may come across situations in which classes or modules requiring high maintainability effort are far less than those requiring low maintainability effort.This condition gives rise to a class imbalance problem(CIP).In this situation,the minority classes’prediction,i.e.,the classes demanding high maintainability effort,is a challenge.Therefore,in this direction,this study investigates three techniques for handling the CIP on ten open-source software to predict software maintainability.This empirical investigation supports the use of resampling with replacement technique(RR)for treating CIP and develop useful models for SMP.