Journal Articles
10 articles found
1. High-Impact Bug Report Identification with Imbalanced Learning Strategies (Cited by: 6)
Authors: Xin-Li Yang, David Lo, Xin Xia, Qiao Huang, Jian-Ling Sun. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2017, Issue 1, pp. 181-198 (18 pages).
In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs, so they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs refer to bugs that appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or that break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers become aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best-performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the two state-of-the-art approaches by Thung et al. and by Garcia and Shihab in terms of F1-score.
Keywords: high-impact bug; imbalanced learning; bug report identification
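The paper's best-performing variants pair a resampling strategy with a classifier. A minimal sketch of that pattern follows, assuming the third-party imbalanced-learn package; the synthetic features below are placeholders for vectorized bug reports, not the paper's datasets or exact pipeline.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for vectorized bug reports: ~5% "high-impact" minority class.
X, y = make_classification(n_samples=2000, n_features=50, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Each variant = one imbalanced-learning strategy + one classifier; resampling
# is applied only during fit when wrapped in an imblearn pipeline.
variants = {
    "SMOTE+KNN": make_pipeline(SMOTE(random_state=42), KNeighborsClassifier()),
    "RUS+NB": make_pipeline(RandomUnderSampler(random_state=42), GaussianNB()),
}
for name, pipe in variants.items():
    pipe.fit(X_tr, y_tr)
    print(name, "F1 =", round(f1_score(y_te, pipe.predict(X_te)), 3))
```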
2. Observation points classifier ensemble for high-dimensional imbalanced classification (Cited by: 1)
Authors: Yulin He, Xu Li, Philippe Fournier-Viger, Joshua Zhexue Huang, Mianjie Li, Salman Salloum. CAAI Transactions on Intelligence Technology (SCIE, EI), 2023, Issue 2, pp. 500-517 (18 pages).
In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems, based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm correctly identifies more samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performance on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm for dealing with HDIC problems.
Keywords: classifier ensemble; feature transformation; high-dimensional data classification; imbalanced learning; observation point mechanism
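A sketch of only the MDS dimensionality-reduction step that OPCE builds on; the observation-point classifier ensemble itself is the paper's contribution and is not reproduced here. The toy data is an assumption standing in for the benchmark HDIC sets.

```python
from sklearn.manifold import MDS
from sklearn.datasets import make_classification

# Small high-dimensional imbalanced toy set (placeholder for the benchmark data).
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           weights=[0.9, 0.1], random_state=0)

# Metric MDS seeks low-dimensional coordinates whose pairwise distances
# approximate the original pairwise distances as closely as possible.
X_low = MDS(n_components=5, random_state=0).fit_transform(X)
print(X.shape, "->", X_low.shape)
```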
3. One-against-all-based Hellinger distance decision tree for multiclass imbalanced learning
Authors: Minggang DONG, Ming LIU, Chao JING. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2022, Issue 2, pp. 278-290 (13 pages).
The skewed distribution of multiclass data poses a major challenge to machine learning algorithms, since traditional methods are sensitive to skewed distributions and do not consider the characteristics of multiclass imbalance problems. To tackle such issues, we propose a new decision-tree splitting criterion based on the one-against-all-based Hellinger distance (OAHD). Two crucial elements are included in OAHD. First, the one-against-all scheme is integrated into the process of computing the Hellinger distance, thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem. Second, the distribution and the number of distinct classes are taken into account, and a modified Gini index is designed for the multiclass imbalance problem. Moreover, we give theoretical proofs for the properties of OAHD, including skew insensitivity and the ability to seek a purer node in the decision tree. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the University of California, Irvine (UCI) repository. Experimental and statistical results show that OAHD significantly improves performance compared with five other well-known decision trees in terms of precision, F-measure, and multiclass area under the receiver operating characteristic curve (MAUC). Moreover, the Friedman and Nemenyi tests are used to confirm the advantage of OAHD over the five other decision trees.
Keywords: decision trees; multiclass imbalanced learning; node splitting criterion; Hellinger distance; one-against-all scheme
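For orientation, a sketch of the standard two-class Hellinger-distance split score used by HDDT-style trees; the paper's OAHD extends this with a one-against-all scheme and a modified Gini index, which are not shown. The data and threshold values are illustrative.

```python
import numpy as np

def hellinger_split_score(x, y, threshold, positive=1):
    """Hellinger distance between the positive- and negative-class
    distributions induced by splitting feature x at `threshold`."""
    left, right = x <= threshold, x > threshold
    pos, neg = (y == positive), (y != positive)
    n_pos, n_neg = pos.sum(), neg.sum()
    terms = []
    for branch in (left, right):
        p = (pos & branch).sum() / n_pos   # fraction of positives in this branch
        q = (neg & branch).sum() / n_neg   # fraction of negatives in this branch
        terms.append((np.sqrt(p) - np.sqrt(q)) ** 2)
    return np.sqrt(sum(terms))

# Tiny illustration: a feature that separates a rare positive class well
# scores higher than an uninformative one.
rng = np.random.default_rng(0)
y = np.array([1] * 10 + [0] * 190)
x_good = np.where(y == 1, rng.normal(3, 1, 200), rng.normal(0, 1, 200))
x_bad = rng.normal(0, 1, 200)
print(hellinger_split_score(x_good, y, 1.5), hellinger_split_score(x_bad, y, 0.0))
```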
4. Dealing with Imbalanced Dataset Leveraging Boundary Samples Discovered by Support Vector Data Description (Cited by: 1)
Authors: Zhengbo Luo, Hamïd Parvïn, Harish Garg, Sultan Noman Qasem, Kim-Hung Pho, Zulkefli Mansor. Computers, Materials & Continua (SCIE, EI), 2021, Issue 3, pp. 2691-2708 (18 pages).
Imbalanced datasets (IDs), in which one of the (usually two) classes contains considerably fewer samples than the other(s), emerge in many real-world problems such as health care and disease diagnosis systems, anomaly detection, fraud detection, and stream-based malware detection. They cause problems in classification, such as under-training of the minority class(es), over-training of the majority class(es), and bias towards the majority class(es), and have therefore attracted many researchers, with several solutions proposed. The main aim of this study is to resample the borderline samples discovered by Support Vector Data Description (SVDD). There are naturally two kinds of resampling: under-sampling (U-S) and over-sampling (O-S). The main drawback of O-S is that it may cause over-fitting, while the main drawback of U-S is that it can cause significant information loss. To avoid these drawbacks, this study focuses on the samples that may be misclassified, namely the borderline data points that lie on the border(s) between the majority and minority class(es). First, SVDD is used to find the borderline examples; then data resampling is applied over them, and the base classifier is trained on the newly created dataset. Finally, we compare our method with other state-of-the-art methods in terms of Area Under Curve (AUC), F-measure, and G-mean, and show that it achieves better results in our experimental study.
Keywords: imbalanced learning; classification; borderline examples
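An illustrative sketch of the boundary-discovery idea: here a one-class SVM (closely related to SVDD with an RBF kernel) stands in for SVDD, and samples whose decision score is near zero are treated as borderline. The data, nu value, and 20% cutoff are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# One-class SVM as an SVDD-like boundary model fitted to all samples.
svdd_like = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
scores = svdd_like.decision_function(X)

# Samples with scores closest to zero lie near the learned boundary; these are
# the candidates that a resampler would then focus on.
borderline = np.argsort(np.abs(scores))[: int(0.2 * len(X))]
print("borderline samples:", len(borderline),
      "of which minority:", int(y[borderline].sum()))
```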
5. Predictive modeling of 30-day readmission risk of diabetes patients by logistic regression, artificial neural network, and EasyEnsemble (Cited by: 1)
Authors: Xiayu Xiang, Chuanyi Liu, Yanchun Zhang, Wei Xiang, Binxing Fang. Asian Pacific Journal of Tropical Medicine (SCIE, CAS), 2021, Issue 9, pp. 417-428 (12 pages).
Objective: To determine the most influential data features and to develop machine learning approaches that best predict hospital readmissions among patients with diabetes. Methods: In this retrospective cohort study, we surveyed patient statistics and performed feature analysis to identify the most influential data features associated with readmissions. Classification of all-cause, 30-day readmission outcomes was modeled using logistic regression, an artificial neural network, and EasyEnsemble. The F1 statistic, sensitivity, and positive predictive value were used to evaluate model performance. Results: We identified the 14 most influential data features (4 numeric and 10 categorical) and evaluated 3 machine learning models with numerous sampling methods (oversampling, undersampling, and hybrid techniques). The deep learning model offered no improvement over the traditional models (logistic regression and EasyEnsemble) for predicting readmission, whereas the other two algorithms led to much smaller differences between the training and testing datasets. Conclusions: Machine learning approaches applied to electronic health record data offer a promising method for improving readmission prediction in patients with diabetes, but more work is needed to construct datasets with more clinical variables beyond the standard risk factors and to fine-tune and optimize the models.
Keywords: electronic health records; hospital readmissions; feature analysis; predictive models; imbalanced learning; diabetes
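A hedged sketch comparing a logistic-regression baseline with EasyEnsemble via the imbalanced-learn package; the synthetic features stand in for the study's electronic-health-record variables and the metrics mirror those named in the abstract.

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score, precision_score

# Synthetic stand-in: ~11% positive (readmitted) class over 14 features.
X, y = make_classification(n_samples=3000, n_features=14, weights=[0.89, 0.11],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("EasyEnsemble", EasyEnsembleClassifier(random_state=7))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, "F1=%.3f sensitivity=%.3f PPV=%.3f" % (
        f1_score(y_te, pred), recall_score(y_te, pred),
        precision_score(y_te, pred, zero_division=0)))
```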
6. Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning (Cited by: 10)
Authors: Zhiqiong Wang, Junchang Xin, Hongxu Yang, Shuo Tian, Ge Yu, Chenren Xu, Yudong Yao. Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2017, Issue 2, pp. 160-173 (14 pages).
The Extreme Learning Machine (ELM) and its variants are effective in many machine learning applications such as Imbalanced Learning (IL) or Big Data (BD) learning. However, they are unable to handle learning problems that are both imbalanced and large in volume. This study addresses the IL problem in BD applications. The Distributed and Weighted ELM (DW-ELM) algorithm, based on the MapReduce framework, is proposed. To confirm the feasibility of parallel computation, it is first shown that the matrix multiplication operators involved are decomposable. Then, to further improve computational efficiency, an Improved DW-ELM algorithm (IDW-ELM) is developed that uses only one MapReduce job. The successful operation of the proposed DW-ELM and IDW-ELM algorithms is finally validated through experiments.
Keywords: weighted Extreme Learning Machine (ELM); imbalanced big data; MapReduce framework; user-defined counter
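A single-machine sketch of the standard weighted-ELM closed form that a distributed variant can build on, beta = (I/C + H^T W H)^{-1} H^T W T; the decomposability mentioned in the abstract comes from H^T W H and H^T W T being sums of per-sample (or per-partition) terms, so mappers can accumulate them independently. The hidden-layer size, weighting scheme, and data below are illustrative assumptions, not the paper's exact DW-ELM configuration.

```python
import numpy as np

def weighted_elm_fit(X, y, n_hidden=50, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_in = rng.normal(size=(d, n_hidden))          # random input weights
    b = rng.normal(size=n_hidden)                  # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))      # sigmoid hidden-layer outputs
    T = np.column_stack([(y == c).astype(float) for c in np.unique(y)])
    counts = np.bincount(y)
    w = 1.0 / counts[y]                            # minority samples weighted more
    HtWH = H.T @ (H * w[:, None])                  # decomposable sum over samples
    HtWT = H.T @ (T * w[:, None])                  # decomposable sum over samples
    beta = np.linalg.solve(np.eye(n_hidden) / C + HtWH, HtWT)
    return W_in, b, beta

def elm_predict(X, W_in, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))
    return H @ beta

# Tiny usage example on synthetic imbalanced data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (450, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 450 + [1] * 50)
params = weighted_elm_fit(X, y)
print("training accuracy:", (elm_predict(X, *params).argmax(1) == y).mean())
```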
7. A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy
Authors: Shu-xue Zou, Yan-xin Huang, Yan Wang, Chun-guang Zhou. Journal of Bionic Engineering (SCIE, EI, CSCD), 2008, Issue 3, pp. 215-223 (9 pages).
Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. In this paper, a promising method for detecting the domain structure of a protein from sequence information alone is presented. The method is based on analyzing multiple sequence alignments derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and they are then combined into a single predictor using a support vector machine. More importantly, domain detection is treated as an imbalanced data learning problem, and a novel undersampling method based on distance-based maximal entropy in the feature space of the Support Vector Machine (SVM) is proposed. The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems for general imbalanced datasets.
Keywords: protein domain boundary; SVM; imbalanced data learning; distance-based maximal entropy
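For intuition only, a generic distance-guided undersampling sketch of the majority class; this is not the paper's distance-based maximal-entropy criterion, and the function, data, and keep ratio are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def distance_undersample(X, y, minority=1, keep_ratio=0.3):
    """Keep the majority samples closest to the minority class, on the
    intuition that they carry the most boundary information."""
    maj, mino = np.where(y != minority)[0], np.where(y == minority)[0]
    # For each majority sample, distance to its nearest minority neighbour.
    d = pairwise_distances(X[maj], X[mino]).min(axis=1)
    keep = maj[np.argsort(d)[: int(keep_ratio * len(maj))]]
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (500, 10)), rng.normal(1.5, 1, (40, 10))])
y = np.array([0] * 500 + [1] * 40)
Xb, yb = distance_undersample(X, y)
print("before:", np.bincount(y), "after:", np.bincount(yb))
```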
8. Improved hybrid resampling and ensemble model for imbalance learning and credit evaluation (Cited by: 1)
Authors: Gang Kou, Hao Chen, Mohammed A. Hefni. Journal of Management Science and Engineering, 2022, Issue 4, pp. 511-529 (19 pages).
Clustering-based undersampling (CUS) and the distance-based near-miss method are widely used in current imbalanced learning algorithms, but these methods have certain drawbacks. In particular, CUS does not consider the influence of the distance factor on the majority instances, and the near-miss method overlooks the internal class structure within the majority samples. To overcome these drawbacks, this study proposes an undersampling method combining distance measurement and majority-class clustering. Resampling methods are used to develop an ensemble-based imbalanced-learning algorithm called the clustering and distance-based imbalance learning model (CDEILM), which combines distance-based undersampling, feature selection, and ensemble learning. In addition, a cluster size-based resampling (CSBR) method is proposed for preserving the original distribution of the majority class, and a hybrid imbalanced learning framework is constructed by fusing various types of resampling methods. The combination of CDEILM and CSBR can be considered a specific case of this hybrid framework. The experimental results show that the CDEILM and CSBR methods achieve better performance than the benchmark methods, and that the hybrid model provides the best results under most circumstances. Therefore, the proposed model can be used as an alternative imbalanced learning method under specific circumstances, e.g., for credit evaluation problems in financial applications.
Keywords: imbalanced learning; clustering-based under-sampling; ensemble methods; hybrid methods; credit risk evaluation
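A sketch in the spirit of cluster-aware undersampling: the majority class is clustered with k-means and each cluster is sampled in proportion to its size, roughly preserving the majority class's internal structure. This is an illustration under assumed parameters, not the paper's CDEILM/CSBR implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, minority=1, n_clusters=5, target_ratio=1.0, seed=0):
    rng = np.random.default_rng(seed)
    maj, mino = np.where(y != minority)[0], np.where(y == minority)[0]
    n_keep = int(target_ratio * len(mino))         # desired majority size
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X[maj])
    kept = []
    for c in range(n_clusters):
        members = maj[labels == c]
        # Sample each cluster in proportion to its share of the majority class.
        quota = max(1, round(n_keep * len(members) / len(maj)))
        kept.append(rng.choice(members, size=min(quota, len(members)),
                               replace=False))
    idx = np.concatenate(kept + [mino])
    return X[idx], y[idx]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (600, 8)), rng.normal(2, 1, (60, 8))])
y = np.array([0] * 600 + [1] * 60)
Xb, yb = cluster_undersample(X, y)
print("before:", np.bincount(y), "after:", np.bincount(yb))
```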
9. Discovering API Directives from API Specifications with Text Classification
Authors: Jing-Xuan Zhang, Chuan-Qi Tao, Zhi-Qiu Huang, Xin Chen. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2021, Issue 4, pp. 922-943 (22 pages).
Application programming interface (API) libraries are extensively used by developers. To program correctly with APIs and avoid bugs, developers should pay attention to API directives, which describe the constraints of APIs. Unfortunately, API directives usually have diverse morphologies, making it time-consuming and error-prone for developers to discover all the relevant API directives. In this paper, we propose an approach leveraging text classification to discover API directives from API specifications. Specifically, given a set of training sentences in API specifications, our approach first characterizes each sentence by three groups of features. Then, to deal with the unequal distribution between API directives and non-directives, our approach employs an under-sampling strategy to split the imbalanced training set into several subsets and trains several classifiers. Given a new sentence in an API specification, our approach combines the trained classifiers to predict whether it is an API directive. We have evaluated our approach over a publicly available annotated API directive corpus. The experimental results reveal that our approach achieves an F-measure of up to 82.08%. In addition, our approach statistically outperforms the state-of-the-art approach by up to 29.67% in terms of F-measure.
Keywords: application programming interface (API) directive; API specification; imbalanced learning; text classification
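A sketch of the ensemble-of-balanced-subsets idea described in the abstract: majority (non-directive) samples are split across balanced subsets, one classifier is trained per subset, and predictions are combined by voting. The features, classifier choice, and data are placeholders, not the sentence features or classifiers used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in for featurized sentences: ~10% are "directives" (label 1).
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9, 0.1],
                           random_state=5)
maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]

rng = np.random.default_rng(5)
rng.shuffle(maj)
n_models = len(maj) // len(mino)
models = []
for chunk in np.array_split(maj, n_models):
    idx = np.concatenate([chunk, mino])          # one balanced training subset
    models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

# Majority vote over the ensemble (evaluated on the training data for brevity).
votes = np.mean([m.predict(X) for m in models], axis=0)
pred = (votes >= 0.5).astype(int)
print("predicted directives:", int(pred.sum()), "true directives:", int(y.sum()))
```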
10. Handling class imbalance problem in software maintainability prediction: an empirical investigation
Authors: Ruchika Malhotra, Kusum Lata. Frontiers of Computer Science (SCIE, EI, CSCD), 2022, Issue 4, pp. 5-18 (14 pages).
As the complexity of software systems increases, software maintenance is becoming a challenge for software practitioners. The prediction of classes that require high maintainability effort is of utmost necessity to develop cost-effective and high-quality software. In software engineering predictive-modeling research, various software maintainability prediction (SMP) models have been developed to forecast maintainability. When developing a maintainability prediction model, software practitioners may encounter situations in which the classes or modules requiring high maintainability effort are far fewer than those requiring low maintainability effort. This condition gives rise to the class imbalance problem (CIP), in which predicting the minority classes, i.e., the classes demanding high maintainability effort, is a challenge. Therefore, this study investigates three techniques for handling the CIP on ten open-source software systems to predict software maintainability. This empirical investigation supports the use of the resampling-with-replacement technique (RR) for treating the CIP and developing useful SMP models.
Keywords: software maintenance; software maintainability; imbalanced learning
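A minimal sketch of resampling with replacement (RR): minority-class instances are sampled with replacement until the two classes are the same size. The synthetic data stands in for the study's class-level maintainability datasets.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (300, 6)), rng.normal(1, 1, (30, 6))])
y = np.array([0] * 300 + [1] * 30)    # 1 = high-maintainability-effort class

# Upsample the minority class with replacement to match the majority size.
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=int((y == 0).sum()), random_state=4)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print("before:", np.bincount(y), "after:", np.bincount(y_bal))
```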