Prediction of machine failure is challenging because the dataset is often imbalanced, with a low failure rate. The common approach to classification with imbalanced data is to balance the data using a sampling approach such as random undersampling, random oversampling, or the Synthetic Minority Oversampling Technique (SMOTE). This paper compared the classification performance of three popular classifiers (Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine) in predicting machine failure in the oil and gas industry. The original machine failure dataset consists of 20,473 hourly records and is imbalanced, with 19,945 (97%) 'non-failure' and 528 (3%) 'failure' records. The three independent variables used to predict machine failure were the pressure indicator, flow indicator, and level indicator. The accuracy of the classifiers was very high, close to 100%, but the sensitivity of all classifiers on the original dataset was close to zero. The performance of the three classifiers was then evaluated on data with different imbalance rates (10% to 50%) generated from the original data using SMOTE, SMOTE-Support Vector Machine (SMOTE-SVM), and SMOTE-Edited Nearest Neighbour (SMOTE-ENN). The classifiers were evaluated on improvement in sensitivity and F-measure. Results showed that the sensitivity of all classifiers increases as the imbalance rate increases. SVM with a radial basis function (RBF) kernel has the highest sensitivity when the data is balanced (50:50) using SMOTE (test sensitivity = 0.5686, test F-measure = 0.6927), compared to Naïve Bayes (test sensitivity = 0.4033, test F-measure = 0.6218) and Logistic Regression (test sensitivity = 0.4194, test F-measure = 0.621). Overall, the Gaussian Naïve Bayes model consistently improves in sensitivity and F-measure as the imbalance ratio increases, but its sensitivity remains below 50%. The classifiers performed better when the data was balanced using SMOTE-SVM than with SMOTE or SMOTE-ENN.
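The gap between near-100% accuracy and near-zero sensitivity reported above can be reproduced with simple arithmetic. The sketch below uses the class counts from the abstract (19,945 non-failure, 528 failure) and a hypothetical baseline that always predicts "non-failure"; that baseline is an illustrative assumption, not one of the paper's classifiers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    # Also called recall or true positive rate.
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Class counts taken from the abstract; label 1 = failure.
y_true = [0] * 19945 + [1] * 528
# Hypothetical baseline (assumption): always predict non-failure.
y_pred = [0] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
baseline_accuracy = accuracy(tp, fp, tn, fn)
baseline_sensitivity = sensitivity(tp, fn)
baseline_f = f_measure(tp, fp, fn)
```

A model that never flags a failure still scores about 97% accuracy here, which is why the abstract evaluates on sensitivity and F-measure instead.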
Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of samples between its classes. In most cases, the performance of a machine learning algorithm such as the Support Vector Machine (SVM) degrades on an imbalanced dataset. Classification accuracy is mostly skewed toward the majority class, and minority-class samples are predicted poorly. In this paper, a hybrid approach combining a data pre-processing technique and an SVM algorithm based on improved Simulated Annealing (SA) is proposed. First, a data pre-processing technique is proposed as the resampling strategy for handling imbalanced datasets. In this technique, data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step that removes redundant and duplicated data. Next, the balanced dataset is used to train an SVM. Since this algorithm requires an iterative search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. The improvement introduces a new acceptance criterion for candidate solutions in the SA algorithm to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets demonstrated higher classification accuracy for the proposed approach than for the conventional implementation of SVM. The proposed approach achieved an average accuracy of 89.65% for binary classification.
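The search for the SVM penalty parameter described above can be sketched with textbook simulated annealing. Everything named here is an assumption for illustration: the acceptance test is the standard Metropolis rule (the paper's new acceptance criterion is not given in the abstract), and `toy_validation_error` is a hypothetical stand-in for training an SVM and measuring its validation error.

```python
import math
import random


def simulated_annealing(objective, x0, lo, hi, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Minimise `objective` over [lo, hi] with standard simulated annealing."""
    rng = random.Random(seed)
    x, fx = x0, objective(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(steps):
        # Propose a neighbour and clamp it to the search interval.
        cand = min(hi, max(lo, x + rng.gauss(0.0, 0.5)))
        fc = objective(cand)
        # Standard Metropolis rule (NOT the paper's improved criterion):
        # always accept improvements, sometimes accept worse candidates.
        if fc <= fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # geometric cooling schedule
    return best_x, best_f


def toy_validation_error(log_c):
    # Hypothetical error surface, minimised at log10(C) = 1.
    # In practice this would train an SVM with C = 10 ** log_c and score it.
    return 0.1 + 0.05 * (log_c - 1.0) ** 2


best_log_c, best_err = simulated_annealing(toy_validation_error, x0=-3.0, lo=-3.0, hi=3.0)
best_c = 10 ** best_log_c
```

Searching over log10(C) rather than C itself is a common design choice, since useful penalty values typically span several orders of magnitude.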
Imbalance in dissolved gas analysis (DGA) data leads to over-fitting, weak generalization, and poor recognition performance in fault diagnosis models based on deep learning. To handle this problem, this paper proposes a novel transformer fault diagnosis method for imbalanced data based on an improved auxiliary classifier generative adversarial network (ACGAN), which both balances the DGA data and supplies accurate diagnosis results. The generator combines one-dimensional convolutional neural networks (1D-CNN) and long short-term memory networks (LSTM), which extract deep features from DGA samples and benefit ACGAN's data balancing and fault diagnosis. The discriminator adopts a multilayer perceptron (MLP), which prevents the discriminator from losing important features of the DGA data, as can happen when the network is too complex and has too many layers. The experimental results suggest that the presented approach can effectively mitigate the adverse effects of DGA data imbalance on deep learning models, enhance fault diagnosis performance, and achieve diagnosis accuracy of up to 99.46%. Furthermore, the comparison results indicate that the fault diagnosis performance of the proposed approach is superior to that of other conventional methods. The method therefore offers reliable fault diagnosis performance across various unbalanced datasets. In addition, the proposed approach can also address insufficient and imbalanced fault data in other practical application fields.
Most modern technologies, such as social media, smart cities, and the internet of things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue. The rapid growth of available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling TEchnique (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that their practical applicability is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and a MapReduce strategy, which distributes the operation of the algorithm among a group of computational nodes to address the aforementioned problem. Our proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is then constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results indicate that the hybrid SMOTE has good scalability within the proposed framework and also cuts down the processing time.
For imbalanced datasets, the focus of classification is to identify samples of the minority class. The performance of current data mining algorithms is not good enough for processing imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets, generating synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE encounters the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN and SMOTE, and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups: core samples, borderline samples, and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in core samples and borderline samples, different strategies are used to over-sample each group. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
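The interpolation step that SMOTE relies on can be sketched in a few lines. This is a minimal illustration of plain SMOTE only, assuming Euclidean distance and uniform random choices; it does not implement the optimized DBSCAN grouping or the separate core/borderline strategies of DSMOTE, and the function name `smote_sample` and the sample points are hypothetical.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority)
```

Because every synthetic point lies on a segment between two existing minority points, noisy minority samples can drag new points into majority regions, which is the overgeneralization problem that removing noise samples first is meant to reduce.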
Imbalanced data are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For this type of dataset, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built using different splitting criteria. Its advantage is the ease of detecting which feature is used at a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic, and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under the receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC can be high even if only a few features are selected, and changes only slightly as more features are selected. However, F-measure reaches excellent values only if 20% or more of the features are chosen. The results help practitioners select a proper feature selection method when facing a practical problem.
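Using a splitting criterion as a feature score can be illustrated with the ordinary Gini index. The sketch below computes the Gini gain of a threshold split and optionally applies per-class weights. The weighting scheme shown is an illustrative assumption only: the abstract does not define the WGI, so `class_weight` and both helper names are hypothetical.

```python
from collections import Counter


def gini(labels, class_weight=None):
    """Gini impurity; optional per-class weights (an illustrative assumption)."""
    cw = class_weight or {}
    mass = {c: cw.get(c, 1.0) * n for c, n in Counter(labels).items()}
    total = sum(mass.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


def gini_gain(values, labels, threshold, class_weight=None):
    """Impurity reduction from splitting one feature at `threshold`."""
    cw = class_weight or {}
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]

    def weighted_size(ys):
        return sum(cw.get(y, 1.0) for y in ys)

    total = weighted_size(labels)
    children = sum(
        weighted_size(part) / total * gini(part, class_weight)
        for part in (left, right)
        if part
    )
    return gini(labels, class_weight) - children


# Hypothetical feature that separates 8 majority from 2 minority samples.
labels = [0] * 8 + [1] * 2
values = [0.1] * 8 + [0.9] * 2
gain_plain = gini_gain(values, labels, threshold=0.5)
gain_weighted = gini_gain(values, labels, threshold=0.5, class_weight={1: 4.0})
```

With the minority class up-weighted, a feature that isolates the minority samples earns a larger score (0.5 instead of 0.32 in this toy case), which is the general intuition behind weighting an impurity measure for imbalanced data.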
Imbalanced datasets, denoted throughout the paper by ID, are datasets containing some (usually two) classes where one contains considerably fewer samples than the other(s). They emerge in many real-world problems, such as health care and disease diagnosis systems, anomaly detection, fraud detection, and stream-based malware detection systems. These datasets cause problems in the classification process and its application, such as under-training of the minority class(es), over-training of the majority class(es), and bias towards the majority class(es). They have therefore become the focus of many researchers across disciplines, and several solutions exist for dealing with this problem. The main aim of this study is to resample the borderline samples discovered by Support Vector Data Description (SVDD). There are two kinds of resampling: under-sampling (U-S) and over-sampling (O-S). The main drawback of O-S is that it may cause over-fitting. The main drawback of U-S is that it can cause significant information loss. In this study, to avoid the drawbacks of these sampling techniques, we focus on the samples that may be misclassified. The data points that can be misclassified are considered to be the borderline data points, which lie on the border(s) between the majority class(es) and minority class(es). First, we find the borderline examples using SVDD; then, resampling is applied to them. Next, the base classifier is trained on the newly created dataset. Finally, we compare the results of our method with other state-of-the-art methods in terms of Area Under Curve (AUC), F-measure, and G-mean. We show that our method achieves better results than the other state-of-the-art methods in our experimental study.
As a promising edge learning framework in future 6G networks, federated learning (FL) faces a number of technical challenges due to the heterogeneous network environment and diversified user behaviors. Data imbalance is one of these challenges and can significantly degrade learning efficiency. To deal with the data imbalance issue, this work proposes a new learning framework called clustered federated learning with weighted model aggregation (weighted CFL). Compared with traditional FL, our weighted CFL adaptively clusters the participating edge devices based on the cosine similarity of their local gradients at each training iteration, and then performs weighted per-cluster model aggregation. The similarity threshold for clustering adapts over iterations in response to the time-varying divergence of local gradients. Moreover, the weights for per-cluster model aggregation are adjusted according to the data balance feature so as to speed up convergence. Experimental results show that the proposed weighted CFL achieves a faster model convergence rate and greater learning accuracy than benchmark methods under the imbalanced data scenario.
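The two steps named above, clustering devices by the cosine similarity of their gradients and aggregating per cluster with weights, can be sketched as follows. The greedy "compare against the first member" clustering, the fixed threshold, and the example weights are all assumptions for illustration; the paper's adaptive threshold and its balance-based weight rule are not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity of two gradient vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_by_similarity(grads, threshold):
    """Greedy clustering (assumption): join the first cluster whose first
    member is similar enough, otherwise start a new cluster."""
    clusters = []
    for i, g in enumerate(grads):
        for members in clusters:
            if cosine(g, grads[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def weighted_average(vectors, weights):
    """Weighted mean of equally sized vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total for d in range(dim)]


# Hypothetical local gradients from four edge devices.
grads = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.0], [-0.8, -0.1]]
# Hypothetical aggregation weights, e.g. reflecting how balanced each device's data is.
weights = [1.0, 3.0, 2.0, 2.0]

clusters = cluster_by_similarity(grads, threshold=0.8)
aggregated = [
    weighted_average([grads[i] for i in c], [weights[i] for i in c]) for c in clusters
]
```

Devices 0 and 1 point in nearly the same direction and end up in one cluster, while devices 2 and 3 form another, so each cluster's aggregate is not diluted by gradients pulling the opposite way.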
Accurate fault diagnosis of heating, ventilation, and air conditioning (HVAC) systems is important for maintaining normal operation, reducing energy consumption, and minimizing maintenance costs. However, in practical applications it is challenging to obtain sufficient fault data for HVAC systems, which leads to imbalanced data where the number of fault samples is much smaller than that of normal samples. Moreover, most existing HVAC fault diagnosis methods rely heavily on balanced training sets to achieve high diagnosis accuracy. To address this issue, a composite neural network fault diagnosis model is proposed that combines SMOTETomek, multi-scale one-dimensional convolutional neural networks (M1DCNN), and a support vector machine (SVM). The method first uses SMOTETomek to augment the minority class samples in the imbalanced dataset, balancing the number of faulty and normal samples. Then, it employs the M1DCNN model to extract feature information from the augmented dataset. Finally, it replaces the original Softmax classifier with an SVM classifier, thus enhancing fault diagnosis accuracy. Using the SMOTETomek-M1DCNN-SVM method, we conducted fault diagnosis validation on both the ASHRAE RP-1043 dataset and an experimental dataset with an imbalance ratio of 1:10. The results demonstrate the superiority of this approach, which provides a novel and promising solution for intelligent building management, with accuracy and F1 scores of 98.45% and 100% for the RP-1043 dataset and the experimental dataset, respectively.
The transition towards carbon-neutral power systems has necessitated optimization of power dispatch in active distribution networks (ADNs) to facilitate integration of distributed renewable generation. Because network topology and line impedance are unavailable in many distribution networks, physical model-based methods may not be applicable to their operation. To tackle this challenge, some studies have proposed constraint learning, which replicates physical models by training a neural network to evaluate the feasibility of a decision (i.e., whether a decision satisfies all critical constraints). To ensure the accuracy of this trained neural network, the training set should contain sufficient feasible and infeasible samples. However, since ADNs mostly operate in a normal state, very few historical samples are infeasible. The historical dataset is thus highly imbalanced, which poses a significant obstacle to neural network training. To address this issue, we propose an enhanced constraint learning method. First, it leverages constraint learning to train a neural network as a surrogate of the ADN's model. Then, it introduces the Synthetic Minority Oversampling Technique to generate infeasible samples and mitigate the imbalance of the historical dataset. By incorporating historical and synthetic samples into the training set, we can significantly improve the accuracy of the neural network. Furthermore, we establish a trust region to constrain the solution and thereby enhance its reliability. Simulations confirm the benefits of the proposed method in achieving desirable optimality and feasibility while maintaining low computational complexity.
Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. Over-sampling is often used to solve this problem. It generates samples according to the distance between minority data points. However, traditional over-sampling may change the original data distribution, which harms classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority samples needed to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method can efficiently improve classification performance compared with other related methods.
In recent years, deep learning has gained rapidly growing popularity in the cybersecurity application domain because, compared to traditional machine learning methods, it usually involves less human effort, produces better results, and provides better generalizability. However, the imbalanced data issue is very common in cybersecurity and can substantially degrade the performance of deep learning models. This paper introduces a transfer learning based method to tackle the imbalanced data issue in cybersecurity, using return-oriented programming payload detection as a case study. We achieved a 0.0290 average false positive rate, 0.9705 average F1 score, and 0.9521 average detection rate on 3 different target domain programs using 2 different source domain programs, with 0 benign training data samples in the target domain. The performance improvement compared to the baseline is a trade-off between false positive rate and detection rate. Using our approach, the total number of false positives is reduced by 23.16%, and as a trade-off, the number of detected malicious samples decreases by 0.68%.
Encrypted traffic classification has become a hot issue in network security research. The class imbalance problem in traffic samples often degrades the performance of machine learning based classifiers. Although the Generative Adversarial Network (GAN) method can generate new samples by learning the feature distribution of the original samples, it suffers from unstable training and mode collapse. To this end, a novel data augmentation approach called Graph CWGAN-GP is proposed in this paper. The traffic data is first converted into grayscale images as the input for the proposed model. Then, the minority class data is augmented with our proposed model, which is built by introducing conditional constraints and a new distance metric into a typical GAN. Finally, a classical deep learning model is adopted as a classifier to classify datasets augmented by the Conditional GAN (CGAN), Wasserstein GAN-Gradient Penalty (WGAN-GP), and Graph CWGAN-GP, respectively. Compared with state-of-the-art GAN methods, Graph CWGAN-GP can not only control the modes of the data to be generated, but also overcome the problem of unstable training and generate more realistic and diverse samples. The experimental results show that the classification precision, recall, and F1-score of the minority class in the balanced dataset augmented in this paper improved by more than 2.37%, 3.39%, and 4.57%, respectively.
With the rapid development of information technology, IoT devices play a huge role in physiological health data detection. The exponential growth of medical data requires storage space to be allocated reasonably between cloud servers and edge nodes. The storage capacity of edge nodes close to users is limited. Hotspot data should be stored in edge nodes as much as possible to ensure response timeliness and access hit rate. However, current schemes cannot guarantee that every sub-message of a complete data item stored by an edge node meets the requirements of hot data. How to detect and delete redundant data in edge nodes while protecting user privacy and dynamic data integrity has become a challenging problem. Our paper proposes a redundant data detection method that meets privacy protection requirements. By scanning the ciphertext, it determines whether each sub-message of the data in the edge node meets the requirements of hot data. It has the same effect as a zero-knowledge proof and does not reveal users' private information. In addition, for redundant sub-data that does not meet the requirements of hot data, our paper proposes a redundant data deletion scheme that preserves the dynamic integrity of the data. We use Content Extraction Signature (CES) to generate the signature of the remaining hot data after the redundant data is deleted. The feasibility of the scheme is demonstrated through security analysis and efficiency analysis.
Missing values are one of the main causes of dirty data. Without high-quality data, there can be no reliable analysis results or precise decision-making. Therefore, the data warehouse needs to integrate high-quality data consistently. In the power system, the electricity consumption data of some large users cannot be collected normally, resulting in missing data, which affects the calculation of power supply and eventually leads to a large error in the daily power line loss rate. For the problem of missing electricity consumption data, this study proposes a data interpolation method for distribution power networks based on the group method of data handling (GMDH) and applies it to the analysis of actually collected electricity data. First, the dependent and independent variables are defined from the original data, and the upper and lower limits of the missing values are determined according to prior knowledge or existing data information. All missing data are randomly interpolated within the upper and lower limits. Then, the GMDH network is established to obtain the optimal complexity model, which is used to predict the missing data and replace the previously imputed electricity consumption data. Finally, this process is repeated iteratively until the missing values no longer change. Under a relatively small noise level (α = 0.25), the proposed approach achieves a maximum error of no more than 0.605%. Experimental findings demonstrate the efficacy and feasibility of the proposed approach, which transforms incomplete data into complete data. The proposed data interpolation approach also provides a strong basis for electricity theft diagnosis and metering fault analysis in electricity enterprises.
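The iterative loop described above (random initial fill within bounds, fit a model, re-predict the missing entries, repeat until they stop changing) can be sketched as follows. As a loudly stated assumption, a one-variable least-squares line stands in for the GMDH network, which is not reproduced here; the data and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def iterative_impute(xs, ys, lower, upper, tol=1e-9, max_iter=200, seed=0):
    """Fill None entries of ys by repeated fit-and-predict until stable."""
    rng = random.Random(seed)
    missing = [i for i, y in enumerate(ys) if y is None]
    # Step 1: random initial values inside the known bounds.
    filled = [rng.uniform(lower, upper) if y is None else y for y in ys]
    for _ in range(max_iter):
        # Step 2: fit the stand-in model (a line, NOT a GMDH network).
        slope, intercept = fit_line(xs, filled)
        # Step 3: replace the previous imputations with new predictions.
        change = 0.0
        for i in missing:
            new = min(upper, max(lower, slope * xs[i] + intercept))
            change = max(change, abs(new - filled[i]))
            filled[i] = new
        # Step 4: stop once the imputed values no longer move.
        if change < tol:
            break
    return filled


# Hypothetical data lying on y = 2x with one missing reading.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, None, 8.0, 10.0]
completed = iterative_impute(xs, ys, lower=0.0, upper=20.0)
```

In this toy case the imputed value converges to the point on the line through the observed data regardless of the random starting value, which is the behaviour the stopping rule is meant to detect.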
Time-series data provide important information in many fields, and their processing and analysis have been the focus of much research. However, detecting anomalies is very difficult due to data imbalance, temporal dependence, and noise. Therefore, methodologies for data augmentation and for converting time series data into images for analysis have been studied. This paper proposes a fault detection model that uses time series data augmentation and transformation to address the problems of data imbalance, temporal dependence, and robustness to noise. Data augmentation is performed by adding Gaussian noise, with the noise level set to 0.002, to maximize the generalization performance of the model. In addition, we use the Markov Transition Field (MTF) method to visualize the dynamic transitions of the data while converting the time series data into images. This enables the identification of patterns in time series data and helps capture the sequential dependencies of the data. For anomaly detection, the PatchCore model is applied and shows excellent performance, and the detected anomaly areas are represented as heat maps. By applying an anomaly map to the original image, it is possible to capture the areas where anomalies occur. The performance evaluation shows that both F1-score and accuracy are high when time series data is converted to images. Additionally, processing the data as images rather than as time series significantly reduced both the size of the data and the training time. The proposed method can provide an important springboard for research on anomaly detection using time series data. It also helps with problems such as analyzing complex patterns in data in a lightweight manner.
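The two preprocessing steps named above, Gaussian-noise augmentation and the Markov Transition Field, can be sketched in plain Python. Two assumptions are made: the "noise level" of 0.002 is read as the standard deviation of zero-mean noise, and the MTF uses rank-based quantile bins with first-order transition probabilities. The example series and bin count are hypothetical.

```python
import random


def add_gaussian_noise(series, sigma=0.002, seed=0):
    """Augment a series with zero-mean Gaussian noise (sigma is an assumption)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


def markov_transition_field(series, n_bins=4):
    """Return an n x n image whose (i, j) entry is the probability of moving
    from the quantile bin of point i to the quantile bin of point j."""
    n = len(series)
    # Assign each point to a quantile bin by its rank.
    order = sorted(range(n), key=lambda i: series[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(n_bins - 1, rank * n_bins // n)
    # Count first-order transitions between consecutive time steps.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1
    # Row-normalise the counts into transition probabilities.
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    # Spread the transition probabilities over all pairs of time steps.
    return [[probs[bins[i]][bins[j]] for j in range(n)] for i in range(n)]


# Hypothetical short sensor series.
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
augmented = add_gaussian_noise(series)
mtf = markov_transition_field(augmented, n_bins=2)
```

The resulting matrix can be rendered as a grayscale image, so an image-based anomaly detector can work on temporal transition structure rather than raw values.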
Large-scale wireless sensor networks (WSNs) play a critical role in monitoring dangerous scenarios and responding to medical emergencies. However, the inherent instability and error-prone nature of wireless links present significant challenges, necessitating efficient data collection and reliable transmission services. This paper addresses the limitations of existing data transmission and recovery protocols by proposing a systematic end-to-end design tailored for medical event-driven cluster-based large-scale WSNs. The primary goal is to enhance the reliability of data collection and transmission services in a comprehensive and practical way. Our approach refines the hop-count-based routing scheme to achieve fairness in forwarding reliability. Additionally, it emphasizes reliable data collection within clusters and establishes robust data transmission over multiple hops. These systematic improvements are designed to optimize the overall performance of the WSN in real-world scenarios. Simulation results validate the performance of the proposed protocol against other prominent data transmission schemes. The evaluation spans varying sensor densities, wireless channel conditions, and packet transmission rates, showing the protocol's advantage in ensuring reliable and efficient data transfer. Our systematic end-to-end design addresses the challenges posed by the instability of wireless links in large-scale WSNs. By prioritizing fairness, reliability, and efficiency, the proposed protocol enhances data collection and transmission services, offering a valuable contribution to the field of medical event-driven WSNs.
Guest Editors: Prof. Ling Tian, University of Electronic Science and Technology of China (lingtian@uestc.edu.cn); Prof. Jian-Hua Tao, Tsinghua University (jhtao@tsinghua.edu.cn); Dr. Bin Zhou, National University of Defense Technology (binzhou@nudt.edu.cn). Since the concept of "Big Data" was first introduced in Nature in 2008, it has been widely applied in fields such as business, healthcare, national defense, education, transportation, and security. With the maturity of artificial intelligence technology, big data analysis techniques tailored to various fields have made significant progress, but still face many challenges in terms of data quality, algorithms, and computing power.
Capabilities to assimilate Geostationary Operational Environmental Satellite "R-series" (GOES-R) Geostationary Lightning Mapper (GLM) flash extent density (FED) data within the operational Gridpoint Statistical Interpolation ensemble Kalman filter (GSI-EnKF) framework were previously developed and tested with a mesoscale convective system (MCS) case. In this study, such capabilities are further developed to assimilate GOES GLM FED data within the GSI ensemble-variational (EnVar) hybrid data assimilation (DA) framework. The results of assimilating the GLM FED data using 3DVar and pure En3DVar (PEn3DVar, using 100% ensemble covariance and no static covariance) are compared with those of EnKF/DfEnKF for a supercell storm case. The focus of this study is to validate the correctness and evaluate the performance of the new implementation rather than to compare the performance of FED DA among different DA schemes. Only the results of 3DVar and PEn3DVar are examined and compared with EnKF/DfEnKF. Assimilation of a single FED observation shows that the magnitude and horizontal extent of the analysis increments from PEn3DVar are generally larger than from EnKF, which is mainly caused by the different localization strategies used in EnKF/DfEnKF and PEn3DVar as well as the integration limits of the graupel mass in the observation operator. Overall, the forecast performance of PEn3DVar is comparable to EnKF/DfEnKF, suggesting correct implementation.
Funding: Supported under research grant (PO Number: 920138936) from the Institute of Technology PETRONAS Sdn Bhd, 32610, Bandar Seri Iskandar, Perak, Malaysia.
Abstract: Predicting machine failure is challenging because the dataset is often imbalanced, with a low failure rate. The common approach to classification with imbalanced data is to balance the data using a sampling approach such as random undersampling, random oversampling, or the Synthetic Minority Oversampling Technique (SMOTE). This paper compares the classification performance of three popular classifiers (Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine) in predicting machine failure in the oil and gas industry. The original machine failure dataset consists of 20,473 hourly records and is imbalanced, with 19,945 (97%) ‘non-failure’ and 528 (3%) ‘failure’ records. The three independent variables used to predict machine failure were the pressure indicator, flow indicator, and level indicator. On the original dataset, the accuracy of all classifiers was very high, close to 100%, but their sensitivity was close to zero. The performance of the three classifiers was then evaluated on data with different imbalance rates (10% to 50%) generated from the original data using SMOTE, SMOTE-Support Vector Machine (SMOTE-SVM), and SMOTE-Edited Nearest Neighbour (SMOTE-ENN). The classifiers were evaluated on improvement in sensitivity and F-measure. Results showed that the sensitivity of all classifiers increases as the imbalance rate increases. SVM with a radial basis function (RBF) kernel has the highest sensitivity when the data is balanced (50:50) using SMOTE (test sensitivity = 0.5686, test F-measure = 0.6927), compared to Naïve Bayes (test sensitivity = 0.4033, test F-measure = 0.6218) and Logistic Regression (test sensitivity = 0.4194, test F-measure = 0.621). Overall, the Gaussian Naïve Bayes model consistently improves in sensitivity and F-measure as the imbalance ratio increases, but its sensitivity stays below 50%. The classifiers performed better when data was balanced using SMOTE-SVM than with SMOTE or SMOTE-ENN.
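The gap between accuracy and sensitivity reported above can be illustrated with a minimal Python sketch. It assumes a hypothetical classifier that always predicts ‘non-failure’; the class counts (19,945 and 528) come from the abstract, while the function names are illustrative and not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity (recall) and F-measure (F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return accuracy, sensitivity, f_measure


# Class counts from the abstract: 0 = non-failure, 1 = failure.
y_true = [0] * 19945 + [1] * 528
# Assumption: a trivial classifier that never predicts failure.
y_pred = [0] * len(y_true)

accuracy, sensitivity, f_measure = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} F={f_measure:.4f}")
```

Even this trivial classifier reaches an accuracy of about 0.974 while detecting no failures, which is why sensitivity and F-measure are the more informative criteria here.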
Abstract: Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of samples between its classes. In most cases, the performance of a machine learning algorithm such as the Support Vector Machine (SVM) suffers when dealing with an imbalanced dataset: classification accuracy is skewed toward the majority class, and minority-class samples are predicted poorly. In this paper, a hybrid approach is proposed that combines a data pre-processing technique with an SVM algorithm based on improved Simulated Annealing (SA). First, a data pre-processing technique is proposed that addresses the resampling strategy for handling imbalanced datasets. In this technique, data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step that removes redundant and duplicated data. Next, the balanced dataset is used to train the SVM. Since this algorithm requires an iterative search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. The improvement introduces a new acceptance criterion for candidate solutions in the SA algorithm to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets demonstrate higher classification accuracy with the proposed approach than with the conventional implementation of SVM. An average accuracy of 89.65% for binary classification demonstrates the good performance of the proposed approach.
Funding: The authors gratefully acknowledge the financial support of the National Natural Science Foundation of China (No. 52067021); the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01C35); the Excellent Youth Scientific and Technological Talents Plan of Xinjiang (No. 2019Q012); and the Major Science & Technology Special Project of Xinjiang Uygur Autonomous Region (2022A01002-2).
Abstract: Imbalance in dissolved gas analysis (DGA) data leads to over-fitting, weak generalization, and poor recognition performance in fault diagnosis models based on deep learning. To handle this problem, this paper proposes a novel transformer fault diagnosis method for imbalanced data based on an improved auxiliary classifier generative adversarial network (ACGAN), which both balances DGA data and supplies accurate diagnosis results. The generator combines one-dimensional convolutional neural networks (1D-CNN) and long short-term memory (LSTM) networks, which extract deep features from DGA samples and benefit ACGAN's data balancing and fault diagnosis. The discriminator adopts a multilayer perceptron (MLP), which prevents it from losing important features of DGA data, as can happen when the network is too complex and has too many layers. The experimental results suggest that the presented approach effectively mitigates the adverse effects of DGA data imbalance on deep learning models, enhances fault diagnosis performance, and delivers diagnosis accuracy of up to 99.46%. Furthermore, the comparison results indicate that the fault diagnosis performance of the proposed approach is superior to that of other conventional methods. Therefore, the method presented in this study has excellent and reliable fault diagnosis performance for various imbalanced datasets. In addition, the proposed approach can also address insufficient and imbalanced fault data in other practical application fields.
Abstract: Most modern technologies, such as social media, smart cities, and the internet of things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima, so new methods for handling large data collections are needed. Several solutions have been proposed for overcoming this issue, but the rapid growth of available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling Technique (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that the practical applicability of these techniques is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and a MapReduce strategy, which distributes the operation of the algorithm among a group of computational nodes to address this problem. The proposed solution is divided into three stages. The first stage splits the data into blocks using a mapping function, followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is then constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results indicate that the hybrid SMOTE scales well within the proposed framework and also cuts down the processing time.
Funding: Supported by the National Key Research and Development Program of China (2018YFB1003700); the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776); the “333” Project of Jiangsu Province (BRA2017228, BRA2017401); and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012).
Abstract: For imbalanced datasets, the focus of classification is to identify samples of the minority class. The performance of current data mining algorithms is not good enough for processing imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets, generating synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN and SMOTE, and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups: core samples, borderline samples, and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in core samples and borderline samples, different strategies are used to over-sample each group. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
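The interpolation step of plain SMOTE described above can be sketched in a few lines of Python. This is a simplified illustration of basic SMOTE only, not the DSMOTE method of the paper; the function name `smote` and its parameters are assumptions made for this sketch.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the other minority samples, nearest first.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Four minority samples at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=10, k=2)
print(len(new_points), new_points[0])
```

Because every synthetic point lies on a segment between two existing minority samples, noisy or borderline samples can pull new points into majority-class regions, which is the overgeneralization problem the abstract refers to.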
Funding: Supported in part by the National Science Foundation of USA (CMMI-1162482).
Abstract: Imbalanced data is a type of dataset frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic, and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC performance can be high even if only a few features are selected and used, and changes only slightly as more features are selected. However, F-measure reaches excellent performance only if 20% or more of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
Funding: Grants to HAR and HP. HAR is supported by a UNSW Scientia Program Fellowship and is a member of the UNSW Graduate School of Biomedical Engineering.
Abstract: Imbalanced datasets (IDs), i.e., datasets containing some (usually two) classes where one has considerably fewer samples than the other(s), now emerge in many real-world problems, such as health care and disease diagnosis systems, anomaly detection, fraud detection, and stream-based malware detection systems. These datasets cause problems in the classification process and its application, such as under-training of the minority class(es), over-training of the majority class(es), and bias towards the majority class(es). They have therefore drawn the attention of many researchers across disciplines, and several solutions exist for dealing with this problem. The main aim of this study is to resample the borderline samples discovered by Support Vector Data Description (SVDD). There are two kinds of resampling: under-sampling (U-S) and over-sampling (O-S). The main drawback of O-S is that it may cause over-fitting, while the main drawback of U-S is that it can cause significant information loss. In this study, to avoid the drawbacks of these sampling techniques, we focus on the samples that may be misclassified. The data points that can be misclassified are considered to be the borderline data points, which lie on the border(s) between the majority class(es) and minority class(es). First, we find the borderline examples using SVDD; then, data resampling is applied to them. Next, the base classifier is trained on the newly created dataset. Finally, we compare the results of our method with other state-of-the-art methods in terms of Area Under Curve (AUC), F-measure, and G-mean. In our experimental study, our method yields better results than the other state-of-the-art methods.
Abstract: As a promising edge learning framework for future 6G networks, federated learning (FL) faces a number of technical challenges due to the heterogeneous network environment and diversified user behaviors. Data imbalance is one of these challenges and can significantly degrade learning efficiency. To deal with the data imbalance issue, this work proposes a new learning framework called clustered federated learning with weighted model aggregation (weighted CFL). Compared with traditional FL, our weighted CFL adaptively clusters the participating edge devices based on the cosine similarity of their local gradients at each training iteration, and then performs weighted per-cluster model aggregation. The similarity threshold for clustering adapts over iterations in response to the time-varying divergence of local gradients. Moreover, the weights for per-cluster model aggregation are adjusted according to the data balance feature so as to speed up convergence. Experimental results show that the proposed weighted CFL achieves a faster model convergence rate and greater learning accuracy than benchmark methods under the imbalanced data scenario.
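The idea of grouping devices by the cosine similarity of their local gradients and then aggregating per cluster can be sketched as follows. This is only an illustration under stated assumptions: the greedy first-match clustering rule, the fixed threshold of 0.8, and the per-device weights are all hypothetical choices for this sketch, and the paper's adaptive threshold and weighting scheme are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_similarity(gradients, threshold):
    """Greedy clustering (an assumption of this sketch): each device joins
    the first cluster whose first member is similar enough, else starts
    a new cluster. Returns lists of device indices."""
    clusters = []
    for idx, grad in enumerate(gradients):
        for members in clusters:
            if cosine_similarity(grad, gradients[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def weighted_aggregate(gradients, members, weights):
    """Weighted average of the gradients of one cluster."""
    total = sum(weights[i] for i in members)
    dim = len(gradients[members[0]])
    return [
        sum(weights[i] * gradients[i][d] for i in members) / total
        for d in range(dim)
    ]


# Four hypothetical devices with two-dimensional local gradients.
gradients = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-0.1, 1.0]]
weights = [1.0, 3.0, 2.0, 2.0]  # hypothetical per-device weights

clusters = cluster_by_similarity(gradients, threshold=0.8)
aggregated = [weighted_aggregate(gradients, c, weights) for c in clusters]
print(clusters, aggregated)
```

Devices 0 and 1 point in nearly the same direction and end up in one cluster, while devices 2 and 3 form a second one; each cluster then gets its own weighted update.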
Funding: The authors of this paper acknowledge the support of the National Natural Science Foundation of China (No. 51975191) and the Funds for Science and Technology Creative Talents of Hubei, China (No. 2023DJC048). This work was also supported by the Xiangyang Hubei University of Technology Industrial Research Institute Funding Program (No. XYYJ2022B01).
Abstract: Accurate fault diagnosis of heating, ventilation, and air conditioning (HVAC) systems is important for maintaining normal operation, reducing energy consumption, and minimizing maintenance costs. However, in practical applications, it is challenging to obtain sufficient fault data for HVAC systems, leading to imbalanced data in which the number of fault samples is much smaller than that of normal samples. Moreover, most existing HVAC fault diagnosis methods rely heavily on balanced training sets to achieve high diagnosis accuracy. To address this issue, a composite neural network fault diagnosis model is proposed that combines SMOTETomek, multi-scale one-dimensional convolutional neural networks (M1DCNN), and a support vector machine (SVM). The method first uses SMOTETomek to augment the minority class samples in the imbalanced dataset, balancing the number of faulty and normal samples. It then employs the M1DCNN model to extract feature information from the augmented dataset. Finally, it replaces the original Softmax classifier with an SVM classifier, thus enhancing fault diagnosis accuracy. Using the SMOTETomek-M1DCNN-SVM method, we conducted fault diagnosis validation on both the ASHRAE RP-1043 dataset and an experimental dataset with an imbalance ratio of 1:10. The results demonstrate the superiority of this approach, providing a novel and promising solution for intelligent building management, with accuracy and F1 scores of 98.45% and 100% for the RP-1043 dataset and the experimental dataset, respectively.
Funding: Supported in part by the Science and Technology Development Fund, Macao SAR, China (File no. SKL-IOTSC(UM)-2021-2023, File no. 0003/2020/AKP, and File no. 0011/2021/AGJ).
Abstract: The transition towards carbon-neutral power systems has necessitated optimization of power dispatch in active distribution networks (ADNs) to facilitate the integration of distributed renewable generation. Because network topology and line impedance are unavailable in many distribution networks, physical model-based methods may not be applicable to their operation. To tackle this challenge, some studies have proposed constraint learning, which replicates physical models by training a neural network to evaluate the feasibility of a decision (i.e., whether or not a decision satisfies all critical constraints). To ensure the accuracy of the trained neural network, the training set should contain sufficient feasible and infeasible samples. However, since ADNs mostly operate in a normal state, very few historical samples are infeasible. The historical dataset is thus highly imbalanced, which poses a significant obstacle to neural network training. To address this issue, we propose an enhanced constraint learning method. First, it leverages constraint learning to train a neural network as a surrogate of the ADN's model. Then, it introduces the Synthetic Minority Oversampling Technique to generate infeasible samples and mitigate the imbalance of the historical dataset. By incorporating historical and synthetic samples into the training set, we can significantly improve the accuracy of the neural network. Furthermore, we establish a trust region to constrain the solution and thereby enhance its reliability. Simulations confirm the benefits of the proposed method in achieving desirable optimality and feasibility while maintaining low computational complexity.
Funding: Partially supported by the Aeronautical Science Foundation of China (No. 201920007001) and the National Natural Science Foundation of China (Nos. U20B2067, 61790552 and 61790554).
Abstract: Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. The over-sampling method is often used to solve this problem; it generates samples according to the distance between minority data points. However, the traditional over-sampling method may change the original data distribution, which harms classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority samples needed to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method efficiently improves classification performance compared with other related methods.
Funding: Supported by NSF CNS-2019340, NSF ECCS-2140175, and NIST 60NANB22D144.
Abstract: In recent years, deep learning has gained growing popularity in the cybersecurity application domain because, compared to traditional machine learning methods, it usually involves less human effort, produces better results, and generalizes better. However, the imbalanced data issue is very common in cybersecurity and can substantially degrade the performance of deep learning models. This paper introduces a transfer learning based method to tackle the imbalanced data issue in cybersecurity, using return-oriented programming payload detection as a case study. We achieved a 0.0290 average false positive rate, 0.9705 average F1 score, and 0.9521 average detection rate on 3 different target domain programs using 2 different source domain programs, with zero benign training samples in the target domain. The performance improvement over the baseline is a trade-off between false positive rate and detection rate. Using our approach, the total number of false positives is reduced by 23.16%, and as a trade-off, the number of detected malicious samples decreases by 0.68%.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61931004, 62072250) and the Talent Launch Fund of Nanjing University of Information Science and Technology (2020r061).
Abstract: Encrypted traffic classification has become a hot issue in network security research. The class imbalance problem of traffic samples often degrades the performance of machine learning based classifiers. Although the Generative Adversarial Network (GAN) method can generate new samples by learning the feature distribution of the original samples, it suffers from unstable training and mode collapse. To this end, a novel data augmentation approach called Graph CWGAN-GP is proposed in this paper. The traffic data is first converted into grayscale images as the input for the proposed model. Then, the minority class data is augmented with our proposed model, which is built by introducing conditional constraints and a new distance metric into a typical GAN. Finally, a classical deep learning model is adopted as a classifier to classify datasets augmented by the Conditional GAN (CGAN), Wasserstein GAN-Gradient Penalty (WGAN-GP), and Graph CWGAN-GP, respectively. Compared with state-of-the-art GAN methods, Graph CWGAN-GP can not only control the modes of the data to be generated, but also overcome the problem of unstable training and generate more realistic and diverse samples. The experimental results show that the classification precision, recall, and F1-score of the minority class in the balanced dataset augmented in this paper improve by more than 2.37%, 3.39%, and 4.57%, respectively.
Funding: Sponsored by the National Natural Science Foundation of China under grant numbers 62172353, 62302114, U20B2046, and 62172115; the Innovation Fund Program of the Engineering Research Center for Integration and Application of Digital Learning Technology of the Ministry of Education (No. 1331007 and No. 1311022); the Natural Science Foundation of the Jiangsu Higher Education Institutions (Grant No. 17KJB520044); and the Six Talent Peaks Project in Jiangsu Province (No. XYDXX-108).
Abstract: With the rapid development of information technology, IoT devices play a huge role in physiological health data detection. The exponential growth of medical data requires storage space to be allocated sensibly between cloud servers and edge nodes. The storage capacity of edge nodes close to users is limited, so hotspot data should be stored in edge nodes as much as possible to ensure timely responses and a high access hit rate. However, current schemes cannot guarantee that every sub-message of a complete data item stored by an edge node meets the requirements of hot data. Detecting and deleting redundant data in edge nodes while protecting user privacy and the dynamic integrity of the data has therefore become a challenging problem. Our paper proposes a redundant data detection method that meets privacy protection requirements. By scanning the ciphertext, it determines whether each sub-message of the data in the edge node meets the requirements of hot data. It has the same effect as a zero-knowledge proof and does not reveal users' private information. In addition, for redundant sub-data that does not meet the requirements of hot data, our paper proposes a redundant data deletion scheme that preserves the dynamic integrity of the data. We use Content Extraction Signature (CES) to generate the signature of the remaining hot data after the redundant data is deleted. The feasibility of the scheme is demonstrated through security analysis and efficiency analysis.
Funding: This research was funded by the National Natural Science Foundation of China (Grant No. 42250410321).
Abstract: Missing values are one of the main causes of dirty data. Without high-quality data, there can be no reliable analysis results or precise decision-making. Therefore, a data warehouse needs to integrate high-quality data consistently. In the power system, the electricity consumption data of some large users cannot be collected normally, resulting in missing data, which affects the calculation of power supply and eventually leads to a large error in the daily power line loss rate. To address missing electricity consumption data, this study proposes a data interpolation method for distribution power networks based on the group method of data handling (GMDH) and applies it to the analysis of actually collected electricity data. First, the dependent and independent variables are defined from the original data, and the upper and lower limits of the missing values are determined according to prior knowledge or existing data. All missing data are randomly interpolated within the upper and lower limits. Then, the GMDH network is established to obtain the optimal complexity model, which is used to predict the missing data and replace the previously imputed electricity consumption data. Finally, this process is repeated iteratively until the missing values no longer change. Under a relatively small noise level (α = 0.25), the proposed approach achieves a maximum error of no more than 0.605%. Experimental findings demonstrate the efficacy and feasibility of the proposed approach, which transforms incomplete data into complete data. The proposed data interpolation approach also provides a strong basis for electricity theft diagnosis and metering fault analysis in electricity enterprises.
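The iterative loop described above (random fill within bounds, fit a model, re-predict the missing values, repeat until they stop changing) can be sketched as follows. Note the substitution: this sketch uses ordinary least-squares regression on a single variable as a stand-in for the GMDH network, which is not implemented here, and the function names, tolerance, and toy data are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def iterative_impute(xs, ys, lower, upper, tol=1e-9, max_iter=1000, seed=0):
    """Fill entries of ys that are None.

    1. Fill each missing value randomly within [lower, upper].
    2. Fit the model on the completed data (a line stands in for GMDH).
    3. Replace the imputed values with model predictions, clipped to bounds.
    4. Repeat until the imputed values change by less than tol.
    """
    rng = random.Random(seed)
    missing = [i for i, y in enumerate(ys) if y is None]
    filled = [rng.uniform(lower, upper) if y is None else y for y in ys]
    for _ in range(max_iter):
        slope, intercept = fit_line(xs, filled)
        largest_change = 0.0
        for i in missing:
            prediction = min(max(slope * xs[i] + intercept, lower), upper)
            largest_change = max(largest_change, abs(prediction - filled[i]))
            filled[i] = prediction
        if largest_change < tol:
            break
    return filled


# Toy data following y = 2x + 1 with two readings missing.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[3] = None
ys[7] = None

completed = iterative_impute(xs, ys, lower=0.0, upper=30.0)
print(round(completed[3], 6), round(completed[7], 6))
```

On this noise-free toy series the imputed values settle at the values implied by the observed points, regardless of the random starting fill.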
Funding: This research was financially supported by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, under the “Project for Research and Development with Middle Markets Enterprises and DNA (Data, Network, AI) Universities” (AI-based Safety Assessment and Management System for Concrete Structures) (Reference Number P0024559), supervised by the Korea Institute for Advancement of Technology (KIAT).
Abstract: Time-series data provide important information in many fields, and their processing and analysis have been the focus of much research. However, detecting anomalies is very difficult due to data imbalance, temporal dependence, and noise. Therefore, methodologies for data augmentation and for converting time series data into images for analysis have been studied. This paper proposes a fault detection model that uses time series data augmentation and transformation to address the problems of data imbalance, temporal dependence, and robustness to noise. Data augmentation is performed by adding Gaussian noise, with the noise level set to 0.002, to maximize the generalization performance of the model. In addition, we use the Markov Transition Field (MTF) method to convert the time series data into images while effectively visualizing the dynamic transitions of the data. This enables the identification of patterns in time series data and helps capture the sequential dependencies of the data. For anomaly detection, the PatchCore model is applied and shows excellent performance, and the detected anomaly areas are represented as heat maps. By applying an anomaly map to the original image, it is possible to locate the areas where anomalies occur. The performance evaluation shows that both F1-score and accuracy are high when time series data is converted to images. Additionally, when the data is processed as images rather than as time series, both the size of the data and the training time are significantly reduced. The proposed method can provide an important springboard for research on anomaly detection using time series data, and it helps with problems such as analyzing complex patterns in data in a lightweight manner.
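The two preprocessing steps named above, Gaussian-noise augmentation and the Markov Transition Field, can be sketched in plain Python. The sketch makes several assumptions not stated in the abstract: the noise level 0.002 is treated as the standard deviation, equal-width bins are used for quantization (quantile bins are also common), and the function names are illustrative.

```python
import random


def add_gaussian_noise(series, sigma=0.002, seed=0):
    """Augment a series by adding zero-mean Gaussian noise.
    Assumption: the 'noise level' is interpreted as the standard deviation."""
    rng = random.Random(seed)
    return [value + rng.gauss(0.0, sigma) for value in series]


def markov_transition_field(series, n_bins=4):
    """Build an MTF image: entry (i, j) is the first-order transition
    probability from the bin of series[i] to the bin of series[j]."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins
    if width == 0:
        bins = [0] * len(series)  # constant series: everything in one bin
    else:
        bins = [min(int((v - lo) / width), n_bins - 1) for v in series]
    # Count transitions between the bins of consecutive time steps.
    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for current, following in zip(bins, bins[1:]):
        counts[current][following] += 1.0
    # Row-normalise the counts into transition probabilities.
    transition = []
    for row in counts:
        total = sum(row)
        transition.append([c / total if total else 0.0 for c in row])
    return [[transition[bi][bj] for bj in bins] for bi in bins]


series = [0.0, 1.0, 0.0, 1.0]
mtf = markov_transition_field(series, n_bins=2)
augmented = add_gaussian_noise(series)
for row in mtf:
    print(row)
```

For the alternating toy series, the low bin always moves to the high bin and vice versa, so the resulting image is a checkerboard of zeros and ones; real signals produce smoother fields that an image-based detector can then consume.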
Abstract: Large-scale wireless sensor networks (WSNs) play a critical role in monitoring dangerous scenarios and responding to medical emergencies. However, the inherent instability and error-prone nature of wireless links present significant challenges, necessitating efficient data collection and reliable transmission services. This paper addresses the limitations of existing data transmission and recovery protocols by proposing a systematic end-to-end design tailored for medical event-driven, cluster-based, large-scale WSNs. The primary goal is to enhance the reliability of data collection and transmission services, ensuring a comprehensive and practical approach. Our approach focuses on refining the hop-count-based routing scheme to achieve fairness in forwarding reliability. Additionally, it emphasizes reliable data collection within clusters and establishes robust data transmission over multiple hops. These systematic improvements are designed to optimize the overall performance of the WSN in real-world scenarios. Simulation results validate the proposed protocol's exceptional performance compared to other prominent data transmission schemes. The evaluation spans varying sensor densities, wireless channel conditions, and packet transmission rates, showcasing the protocol's superiority in ensuring reliable and efficient data transfer. Our systematic end-to-end design successfully addresses the challenges posed by the instability of wireless links in large-scale WSNs. By prioritizing fairness, reliability, and efficiency, the proposed protocol demonstrates its efficacy in enhancing data collection and transmission services, thereby offering a valuable contribution to the field of medical event-driven WSNs.
Abstract: Guest Editors: Prof. Ling Tian, University of Electronic Science and Technology of China, lingtian@uestc.edu.cn; Prof. Jian-Hua Tao, Tsinghua University, jhtao@tsinghua.edu.cn; Dr. Bin Zhou, National University of Defense Technology, binzhou@nudt.edu.cn. Since the concept of “Big Data” was first introduced in Nature in 2008, it has been widely applied in fields such as business, healthcare, national defense, education, transportation, and security. With the maturing of artificial intelligence technology, big data analysis techniques tailored to various fields have made significant progress, but they still face many challenges in terms of data quality, algorithms, and computing power.
Funding: Supported by a NOAA JTTI award via Grant #NA21OAR4590165 and NOAA GOES-R Program funding via Grant #NA16OAR4320115; provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA-University of Oklahoma Cooperative Agreement #NA11OAR4320072, U.S. Department of Commerce; and supported by the National Oceanic and Atmospheric Administration (NOAA) of the U.S. Department of Commerce via Grant #NA18NWS4680063.
Abstract: Capabilities to assimilate Geostationary Operational Environmental Satellite “R-series” (GOES-R) Geostationary Lightning Mapper (GLM) flash extent density (FED) data within the operational Gridpoint Statistical Interpolation ensemble Kalman filter (GSI-EnKF) framework were previously developed and tested with a mesoscale convective system (MCS) case. In this study, such capabilities are further developed to assimilate GOES GLM FED data within the GSI ensemble-variational (EnVar) hybrid data assimilation (DA) framework. The results of assimilating the GLM FED data using 3DVar and pure En3DVar (PEn3DVar, using 100% ensemble covariance and no static covariance) are compared with those of EnKF/DfEnKF for a supercell storm case. The focus of this study is to validate the correctness and evaluate the performance of the new implementation rather than to compare the performance of FED DA among different DA schemes. Only the results of 3DVar and PEn3DVar are examined and compared with EnKF/DfEnKF. Assimilation of a single FED observation shows that the magnitude and horizontal extent of the analysis increments from PEn3DVar are generally larger than those from EnKF, which is mainly caused by the different localization strategies used in EnKF/DfEnKF and PEn3DVar as well as by the integration limits of the graupel mass in the observation operator. Overall, the forecast performance of PEn3DVar is comparable to that of EnKF/DfEnKF, suggesting correct implementation.