Abstract: Most modern technologies, such as social media, smart cities, and the Internet of Things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers become stuck in local optima. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue. The rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling Technique (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that the practical applicability of these techniques is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and the MapReduce strategy, which distributes the operation of the algorithm among a group of computational nodes to address the aforementioned problem. Our proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is then constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results indicate that the hybrid SMOTE scales well within the proposed framework and also reduces processing time.
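The block-wise oversampling step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain SMOTE interpolation rather than the paper's hybrid variant, a sequential loop stands in for the MapReduce map phase, the decision-tree and combination stages are omitted, and the names `split_blocks`, `smote_block`, and `balance_block` are hypothetical.

```python
import random


def split_blocks(rows, n_blocks):
    """Stand-in for the mapping function: deal rows into n_blocks blocks."""
    return [rows[i::n_blocks] for i in range(n_blocks)]


def smote_block(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen sample, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


def balance_block(block):
    """Oversample the minority class (label 1) of one block to parity."""
    minority = [x for x, y in block if y == 1]
    majority = [x for x, y in block if y == 0]
    need = len(majority) - len(minority)
    if need <= 0 or len(minority) < 2:
        return block  # nothing to do, or too few points to interpolate
    return block + [(x, 1) for x in smote_block(minority, need)]


# Toy data: 12 majority rows (label 0) and 4 minority rows (label 1).
rows = [((float(i), 0.0), 0) for i in range(12)]
rows += [((10.0 + i, 10.0 + i), 1) for i in range(4)]
balanced_blocks = [balance_block(b) for b in split_blocks(rows, 2)]
```

Each block is balanced independently, which is what allows the work to be distributed; in a real deployment each call to `balance_block` would run on a separate node.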
Funding: The author Feng Yin was funded by the Shenzhen Science and Technology Innovation Council (No. JCYJ20170307155957688) and by the National Natural Science Foundation of China Key Project (No. 61731018). The authors Feng Yin and Shuguang (Robert) Cui were funded by Shenzhen Fundamental Research Funds under Grant (Key Lab) No. ZDSYS201707251409055, Grant (Peacock) No. KQTD2015033114415450, and the Guangdong Province "The Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017" Data Driven Evolution of Future Intelligent Network Team. The associate editor coordinating the review of this paper and approving it for publication was X. Cheng.
Abstract: Driven by the needs of a plethora of machine learning applications, several attempts have been made to improve the performance of classifiers applied to imbalanced datasets. In this paper, we present a fast maximum entropy machine (MEM) combined with a synthetic minority over-sampling technique for handling binary classification problems with high imbalance ratios, large numbers of data samples, and medium to large numbers of features. A random Fourier feature representation of kernel functions and the primal estimated sub-gradient solver for support vector machines (PEGASOS) are applied to speed up the classic MEM. Experiments were conducted on various real datasets (including two China Mobile datasets and several other standard test datasets) with various configurations. The results demonstrate that the proposed algorithm has extremely low complexity yet excellent overall classification performance (in terms of several widely used evaluation metrics) compared with the classic MEM and some other state-of-the-art methods. The proposed algorithm is particularly valuable in big data applications owing to its significantly lower computational complexity.
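The random Fourier feature representation mentioned in this abstract can be illustrated with a short sketch. It assumes an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2), which the abstract does not specify, and the names `make_rff` and `rbf_kernel` are hypothetical. The point of the technique is that an inner product of explicit features approximates the kernel, so a linear solver such as PEGASOS can be used in place of a kernel method.

```python
import math
import random


def make_rff(dim, n_features, gamma, seed=0):
    """Return a feature map z such that z(x) . z(y) approximates the RBF
    kernel exp(-gamma * ||x - y||^2) (assumed kernel, for illustration)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # std. dev. of the random frequencies
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def transform(x):
        return [
            norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return transform


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


transform = make_rff(dim=3, n_features=2000, gamma=0.5)
x, y = [0.1, 0.2, 0.3], [0.3, 0.1, 0.0]
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = rbf_kernel(x, y, gamma=0.5)
```

The approximation error shrinks as `n_features` grows, so the feature count trades accuracy against the cost of the downstream linear solver.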
Funding: Support from the Cyber Technology Institute (CTI) at the School of Computer Science and Informatics, De Montfort University, United Kingdom, along with financial assistance from Universiti Tun Hussein Onn Malaysia and the UTHM Publisher's office through publication fund E15216.
Abstract: Integrating machine learning and data mining is crucial for processing big data and extracting valuable insights to enhance decision-making. However, imbalanced target variables within big data present technical challenges that hinder the performance of supervised learning classifiers on key evaluation metrics, limiting their overall effectiveness. This study presents a comprehensive review of both common and recently developed Supervised Learning Classifiers (SLCs) and evaluates their performance in data-driven decision-making. The evaluation uses various metrics, with a particular focus on the harmonic mean of precision and recall (F1 score), on an imbalanced real-world bank target marketing dataset. The findings indicate that grid-search random forest and random-search random forest excel in precision and area under the curve, while Extreme Gradient Boosting (XGBoost) outperforms other traditional classifiers in terms of F1 score. Employing oversampling methods to address the imbalanced data yields a significant performance improvement in XGBoost, delivering superior results across all metrics, particularly when using the SMOTE variant known as BorderlineSMOTE2. The study identifies several key factors for effectively addressing the challenges of supervised learning with imbalanced datasets. These factors include selecting appropriate datasets for training and testing, choosing the right classifiers, employing effective techniques for processing and handling imbalanced datasets, and identifying suitable metrics for performance evaluation. They also include the use of effective exploratory data analysis in conjunction with visualisation techniques to yield insights conducive to data-driven decision-making.
Funding: This work is supported by the National Key R&D Program of China under grant 2018YFB1003205, by the National Natural Science Foundation of China under grants U1836208 and U1836110, by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund, and by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) fund, China.
Abstract: In recent years, academic misconduct has been frequently exposed by the media, with serious impacts on the academic community. Current research on academic misconduct focuses mainly on detecting plagiarism in article content by applying character-based and non-text element detection techniques over the entirety of a manuscript. For the most part, these techniques can only detect cases of textual plagiarism, which means that potential culprits can easily avoid discovery through clever editing and alteration of text content. In this paper, we propose an academic misconduct detection method based on scholars' submission behaviors. The model can effectively capture atypical behavior and operations by the author. As such, it is able to detect various types of misconduct, thereby improving the accuracy of detection when combined with text content analysis. The model learns by forming a dual network group that processes text features and user behavior features to detect potential academic misconduct. First, the effect of scholars' behavioral features on the model is considered and analyzed. Second, the Synthetic Minority Oversampling Technique (SMOTE) is applied to address the imbalance between positive and negative samples among contributing scholars. Finally, the text features of the papers are combined with the scholars' behavioral data to improve recognition precision. Experimental results on the imbalanced dataset demonstrate that our model performs highly satisfactorily in terms of accuracy and recall.
Funding: Supported by NSF CNS-2019340, NSF ECCS-2140175, and NIST 60NANB22D144.
Abstract: In recent years, deep learning has gained rapidly growing popularity in the cybersecurity application domain because, compared with traditional machine learning methods, it usually involves less human effort, produces better results, and provides better generalizability. However, the imbalanced data issue is very common in cybersecurity and can substantially deteriorate the performance of deep learning models. This paper introduces a transfer learning based method to tackle the imbalanced data issue in cybersecurity, using return-oriented programming payload detection as a case study. We achieved a 0.0290 average false positive rate, a 0.9705 average F1 score, and a 0.9521 average detection rate on 3 different target domain programs using 2 different source domain programs, with no benign training samples in the target domain. The performance improvement compared to the baseline is a trade-off between false positive rate and detection rate. Using our approach, the total number of false positives is reduced by 23.16%, and as a trade-off, the number of detected malicious samples decreases by 0.68%.
Funding: Supported by the National Natural Science Foundation of China (No. 52277083).
Abstract: With the development of advanced metering infrastructure (AMI), large amounts of electricity consumption data can be collected for electricity theft detection. However, electricity consumption data are severely imbalanced, which makes training a detection model challenging. This paper therefore proposes an electricity theft detection method based on ensemble learning and prototype learning, which performs well on imbalanced datasets and on abnormal data with different levels of abnormality. A convolutional neural network (CNN) and long short-term memory (LSTM) are employed to obtain abstract features from electricity consumption data. The prototype of each class is obtained by calculating the mean of the abstract features and is then used to predict the labels of unknown samples. Meanwhile, training the network on different balanced subsets of the training set makes the prototypes representative. Compared with some mainstream methods, including CNN and random forest (RF), the proposed method is shown to deal effectively with electricity theft detection when abnormal data account for only 2.5% and 1.25% of normal data. The results show that the proposed method outperforms other state-of-the-art methods.
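The prototype step in this abstract (average the abstract features of each class, then label an unknown sample by its nearest prototype) can be sketched as below. This is only an illustration under assumptions: the feature vectors here are hand-written stand-ins for the CNN/LSTM outputs, Euclidean distance is assumed as the similarity measure, and `class_prototypes` and `predict` are hypothetical names.

```python
import math


def class_prototypes(features, labels):
    """Compute one prototype per class as the mean feature vector."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict(prototypes, vec):
    """Assign the label of the closest prototype (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], vec))


# Toy features standing in for learned representations.
features = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
labels = ["normal", "normal", "theft"]
prototypes = class_prototypes(features, labels)
```

Because each class is summarized by a single mean vector, a rare class is represented on equal footing with a frequent one at prediction time, which is one reason prototype methods are attractive for imbalanced data.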
Abstract: Telemarketing is a well-established marketing approach to offering products and services to prospective customers. The effectiveness of such an approach, however, is highly dependent on the selection of the appropriate consumer base, as reaching uninterested customers induces annoyance and consumes costly enterprise resources in vain while missing interested ones. The introduction of business intelligence and machine learning models can positively influence the decision-making process by predicting the potential customer base, and the existing literature in this direction shows promising results. However, the selection of influential features and the construction of effective learning models for improved performance remain a challenge. Furthermore, from the modelling perspective, the class imbalance of the training data, where samples with unsuccessful outcomes greatly outnumber successful ones, further compounds the problem by creating biased and inaccurate models. Additionally, customer preferences are likely to change over time for various reasons, and a fresh group of customers may be targeted for a new product or service, necessitating model retraining, which is not addressed at all in existing works. A major challenge in model retraining is maintaining a balance between stability (retaining older knowledge) and plasticity (being receptive to new information). To address these issues, this paper proposes an ensemble machine learning model with feature selection and oversampling techniques to identify potential customers more accurately. A novel online learning method is proposed for model retraining when new samples become available over time. This newly introduced method equips the proposed approach to deal with dynamic data, improving the readiness of the proposed model for practical adoption. Extensive experiments with real-world data show that the proposed approach achieves excellent results in all cases (e.g., 98.6% accuracy in classifying customers) and outperforms recent competing models in the literature by a considerable margin of 3% on a widely used dataset.
Abstract: Since the overall prediction error of a classifier on imbalanced problems can be potentially misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques, including sampling and cost-sensitive learning, are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is clearly a gap between the measure according to which the classifier is evaluated and how the classifier is trained. This paper investigates the prospect of explicitly using the appropriate measure itself to search the hypothesis space to bridge this gap. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method can achieve consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
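The gap between overall error and G-mean that motivates this abstract can be seen in a small worked example. The sketch below uses the common definition of G-mean as the geometric mean of sensitivity and specificity; the toy labels and the function names `g_mean` and `accuracy` are illustrative assumptions, not taken from the paper.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# 95 negatives and 5 positives; this classifier always predicts "negative".
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
acc = accuracy(y_true, always_negative)   # high despite missing every positive
score = g_mean(y_true, always_negative)   # zero, exposing the failure
```

A classifier that ignores the minority class scores well on accuracy but zero on G-mean, which is why using G-mean directly as the search objective changes what the training process rewards.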
Abstract: Formation characteristics can be parameterized through conventional approaches, and reservoir lithology attributes in particular are used in hydrocarbon exploration. Well logging is a conventional approach that predicts lithology efficiently compared with geophysical modeling and petrophysical analysis, owing to its cost effectiveness and suitable interpretation time. However, manual lithology identification from well logging data requires domain expertise and an extended length of time. Therefore, in this study, a Deep Neural Network (DNN) is deployed to automate lithology identification from well logging data, making lithology monitoring more time-effective. The DNN model is developed for predicting formation lithology and is optimized through a thorough evaluation of parameters and hyperparameters, including the number of neurons, number of layers, optimizer, learning rate, dropout values, and activation functions. The model is assessed with different evaluation metrics after dividing the dataset into training, validation, and testing subsets. Additionally, an attempt is made to remove interception in formation lithology prediction while addressing the imbalanced nature of the dataset during training using class weights. The assessment shows that accuracy is not the only reliable metric for evaluating a lithology classification model. The model with class weights recognizes all the classes but has low accuracy and a low F1-score, while the LSTM-based model has high accuracy and a high F1-score.
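Class weighting of the kind mentioned in this abstract can be illustrated with the widely used inverse-frequency heuristic, weight = n_samples / (n_classes * class_count). The abstract does not say which weighting scheme the authors used, so this formula, the lithology labels, and the function name `balanced_class_weights` are assumptions for illustration only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally more
    weight in the training loss (assumed scheme, not the paper's)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical lithology labels with an 8:2 imbalance.
labels = ["sandstone"] * 8 + ["shale"] * 2
weights = balanced_class_weights(labels)
```

With these weights, each misclassified sample of the rare class contributes four times as much to the loss as a sample of the frequent class, which pushes the model to recognize all classes, possibly at some cost in overall accuracy.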
Abstract: Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree), a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced, or multi-class datasets, for which an average improvement of 5%-9% in AUC is reported.
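One of the statistical tools named in this abstract, the two-proportion test, can be sketched as a standard pooled z-test. How ConfDTree applies the test inside the tree is not described in the abstract, so the example below (comparing the class proportions of two hypothetical branches) and the function name `two_proportion_z_test` are illustrative assumptions.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test.

    Returns (z, two_sided_p_value) for the null hypothesis that both
    groups share the same underlying proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        return 0.0, 1.0  # both groups are all-success or all-failure
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical branches: 45 of 50 vs. 25 of 50 instances of the same class.
z, p_value = two_proportion_z_test(45, 50, 25, 50)
```

A small p-value suggests the two branches really do differ in class composition, whereas a large one flags a split whose outcome is statistically uncertain; in that case, an alternative route through the tree may be worth considering.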