The flue temperature is one of the important indicators to characterize the combustion state of an ethylene cracker furnace,the outliers of temperature data can lead to the false alarm.Conventional outlier detection a...The flue temperature is one of the important indicators to characterize the combustion state of an ethylene cracker furnace,the outliers of temperature data can lead to the false alarm.Conventional outlier detection algorithms such as the Isolation Forest algorithm and 3-sigma principle cannot detect the outliers accurately.In order to improve the detection accuracy and reduce the computational complexity,an outlier detection algorithm for flue temperature data based on the CLOF(Clipping Local Outlier Factor,CLOF)algorithm is proposed.The algorithm preprocesses the normalized data using the cluster pruning algorithm,and realizes the high accuracy and high efficiency outlier detection in the outliers candidate set.Using the flue temperature data of an ethylene cracking furnace in a petrochemical plant,the main parameters of the CLOF algorithm are selected according to the experimental results,and the outlier detection effect of the Isolation Forest algorithm,the 3-sigma principle,the conventional LOF algorithm and the CLOF algorithm are compared and analyzed.The results show that the appropriate clipping coefficient in the CLOF algorithm can significantly improve the detection efficiency and detection accuracy.Compared with the outlier detection results of the Isolation Forest algorithm and 3-sigma principle,the accuracy of the CLOF detection results is increased,and the amount of data calculation is significantly reduced.展开更多
Due to the advancements in information technologies,massive quantity of data is being produced by social media,smartphones,and sensor devices.The investigation of data stream by the use of machine learning(ML)approach...Due to the advancements in information technologies,massive quantity of data is being produced by social media,smartphones,and sensor devices.The investigation of data stream by the use of machine learning(ML)approaches to address regression,prediction,and classification problems have received consid-erable interest.At the same time,the detection of anomalies or outliers and feature selection(FS)processes becomes important.This study develops an outlier detec-tion with feature selection technique for streaming data classification,named ODFST-SDC technique.Initially,streaming data is pre-processed in two ways namely categorical encoding and null value removal.In addition,Local Correla-tion Integral(LOCI)is used which is significant in the detection and removal of outliers.Besides,red deer algorithm(RDA)based FS approach is employed to derive an optimal subset of features.Finally,kernel extreme learning machine(KELM)classifier is used for streaming data classification.The design of LOCI based outlier detection and RDA based FS shows the novelty of the work.In order to assess the classification outcomes of the ODFST-SDC technique,a series of simulations were performed using three benchmark datasets.The experimental results reported the promising outcomes of the ODFST-SDC technique over the recent approaches.展开更多
Background Image matching is crucial in numerous computer vision tasks such as 3D reconstruction and simultaneous visual localization and mapping.The accuracy of the matching significantly impacted subsequent studies....Background Image matching is crucial in numerous computer vision tasks such as 3D reconstruction and simultaneous visual localization and mapping.The accuracy of the matching significantly impacted subsequent studies.Because of their local similarity,when image pairs contain comparable patterns but feature pairs are positioned differently,incorrect recognition can occur as global motion consistency is disregarded.Methods This study proposes an image-matching filtering algorithm based on global motion consistency.It can be used as a subsequent matching filter for the initial matching results generated by other matching algorithms based on the principle of motion smoothness.A particular matching algorithm can first be used to perform the initial matching;then,the rotation and movement information of the global feature vectors are combined to effectively identify outlier matches.The principle is that if the matching result is accurate,the feature vectors formed by any matched point should have similar rotation angles and moving distances.Thus,global motion direction and global motion distance consistencies were used to reject outliers caused by similar patterns in different locations.Results Four datasets were used to test the effectiveness of the proposed method.Three datasets with similar patterns in different locations were used to test the results for similar images that could easily be incorrectly matched by other algorithms,and one commonly used dataset was used to test the results for the general image-matching problem.The experimental results suggest that the proposed method is more accurate than other state-of-the-art algorithms in identifying mismatches in the initial matching set.Conclusions The proposed outlier rejection matching method can significantly improve the matching accuracy for similar images with locally similar feature pairs in different locations and can provide more accurate matching results for subsequent computer vision tasks.展开更多
A variety of factors affect air quality, making it a difficult issue. The level of clean air in a certain area is referred to as air quality. It is challenging for conventional approaches to correctly discover aberran...A variety of factors affect air quality, making it a difficult issue. The level of clean air in a certain area is referred to as air quality. It is challenging for conventional approaches to correctly discover aberrant values or outliers due to the significant fluctuation of this sort of data, which is influenced by Climate change and the environment. With accelerating industrial expansion and rising population density in Kolkata City, air pollution is continuously rising. This study involves two phases, in the first phase imputation of missing values and second detection of outliers using Statistical Process Control (SPC), and Functional Data Analysis (FDA), studies to achieve the efficacy of the outlier identification methodology proposed with working days and Nonworking days of the variables NO<sub>2</sub>, SO<sub>2</sub>, and O<sub>3</sub>, which were used for a year in a row in Kolkata, India. The results show how the functional data approach outshines traditional outlier detection methods. The outcomes show that functional data analysis vibrates more than the other two approaches after imputation, and the suggested outlier detector is absolutely appropriate for the precise detection of outliers in highly variable data.展开更多
Human living would be impossible without air quality. Consistent advancements in practically every aspect of contemporary human life have harmed air quality. Everyday industrial, transportation, and home activities tu...Human living would be impossible without air quality. Consistent advancements in practically every aspect of contemporary human life have harmed air quality. Everyday industrial, transportation, and home activities turn up dangerous contaminants in our surroundings. This study investigated two years’ worth of air quality and outlier detection data from two Indian cities. Studies on air pollution have used numerous types of methodologies, with various gases being seen as a vector whose components include gas concentration values for each observation per-formed. We use curves to represent the monthly average of daily gas emissions in our technique. The approach, which is based on functional depth, was used to find outliers in the city of Delhi and Kolkata’s gas emissions, and the outcomes were compared to those from the traditional method. In the evaluation and comparison of these models’ performances, the functional approach model studied well.展开更多
Purpose:The main aim of this study is to build a robust novel approach that is able to detect outliers in the datasets accurately.To serve this purpose,a novel approach is introduced to determine the likelihood of an ...Purpose:The main aim of this study is to build a robust novel approach that is able to detect outliers in the datasets accurately.To serve this purpose,a novel approach is introduced to determine the likelihood of an object to be extremely different from the general behavior of the entire dataset.Design/methodology/approach:This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems.The proposed approach,named Bagged and Voted Local Outlier Detection(BV-LOF),benefits from the Local Outlier Factor(LOF)as the base algorithm and improves its detection rate by using ensemble methods.Findings:Several experiments have been performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method.According to the results,the BV-LOF approach significantly outperformed LOF on 9 datasets of 10 ones on average.Research limitations:In the BV-LOF approach,the base algorithm is applied to each subset data multiple times with different neighborhood sizes(k)in each case and with different ensemble sizes(T).In our study,we have chosen k and T value ranges as[1-100];however,these ranges can be changed according to the dataset handled and to the problem addressed.Practical implications:The proposed method can be applied to the datasets from different domains(i.e.health,finance,manufacturing,etc.)without requiring any prior information.Since the BV-LOF method includes two-level ensemble operations,it may lead to more computational time than single-level ensemble methods;however,this drawback can be overcome by parallelization and by using a proper data structure such as R*-tree or KD-tree.Originality/value:The proposed approach(BV-LOF)investigates multiple neighborhood sizes(k),which provides findings of instances with different local densities,and in this way,it provides more likelihood of outlier detection that LOF may neglect.It also brings many benefits such as easy implementation,improved capability,higher applicability,and interpretability.展开更多
In its broadest sense, this paper reviews the general outlier problem, the means available for addressing the discordancy (or lack thereof) of an outlier (or outliers), and possible strategies for dealing with them. T...In its broadest sense, this paper reviews the general outlier problem, the means available for addressing the discordancy (or lack thereof) of an outlier (or outliers), and possible strategies for dealing with them. Two alternate approaches to the multiple outlier problem, consecutive and block testing, and their respective inherent weaknesses, masking and swamping, are discussed. In addition, the relative susceptibility of several tests for outliers in normal samples to the swamping phenomena is reported.展开更多
Although quality assurance and quality control procedures are routinely applied in most air quality networks, outliers can still occur due to instrument malfunctions, the influence of harsh environments and the limita...Although quality assurance and quality control procedures are routinely applied in most air quality networks, outliers can still occur due to instrument malfunctions, the influence of harsh environments and the limitation of measuring methods. Such outliers pose challenges for data-powered applications such as data assimilation, statistical analysis of pollution characteristics and ensemble forecasting. Here, a fully automatic outlier detection method was developed based on the probability of residuals, which are the discrepancies between the observed and the estimated concentration values. The estimation can be conducted using filtering—or regressions when appropriate—to discriminate four types of outliers characterized by temporal and spatial inconsistency, instrument-induced low variances, periodic calibration exceptions, and less PM_(10) than PM_(2.5) in concentration observations, respectively. This probabilistic method was applied to detect all four types of outliers in hourly surface measurements of six pollutants(PM_(2.5), PM_(10),SO_2,NO_2,CO and O_3) from 1436 stations of the China National Environmental Monitoring Network during 2014-16. Among the measurements, 0.65%-5.68% are marked as outliers. with PM_(10) and CO more prone to outliers. Our method successfully identifies a trend of decreasing outliers from 2014 to 2016,which corresponds to known improvements in the quality assurance and quality control procedures of the China National Environmental Monitoring Network. The outliers can have a significant impact on the annual mean concentrations of PM_(2.5),with differences exceeding 10 μg m^(-3) at 66 sites.展开更多
With the development of data age,data quality has become one of the problems that people pay much attention to.As a field of data mining,outlier detection is related to the quality of data.The isolated forest algorith...With the development of data age,data quality has become one of the problems that people pay much attention to.As a field of data mining,outlier detection is related to the quality of data.The isolated forest algorithm is one of the more prominent numerical data outlier detection algorithms in recent years.In the process of constructing the isolation tree by the isolated forest algorithm,as the isolation tree is continuously generated,the difference of isolation trees will gradually decrease or even no difference,which will result in the waste of memory and reduced efficiency of outlier detection.And in the constructed isolation trees,some isolation trees cannot detect outlier.In this paper,an improved iForest-based method GA-iForest is proposed.This method optimizes the isolated forest by selecting some better isolation trees according to the detection accuracy and the difference of isolation trees,thereby reducing some duplicate,similar and poor detection isolation trees and improving the accuracy and stability of outlier detection.In the experiment,Ubuntu system and Spark platform are used to build the experiment environment.The outlier datasets provided by ODDS are used as test.According to indicators such as the accuracy,recall rate,ROC curves,AUC and execution time,the performance of the proposed method is evaluated.Experimental results show that the proposed method can not only improve the accuracy and stability of outlier detection,but also reduce the number of isolation trees by 20%-40%compared with the original iForest method.展开更多
Blast furnace data processing is prone to problems such as outliers.To overcome these problems and identify an improved method for processing blast furnace data,we conducted an in-depth study of blast furnace data.Bas...Blast furnace data processing is prone to problems such as outliers.To overcome these problems and identify an improved method for processing blast furnace data,we conducted an in-depth study of blast furnace data.Based on data samples from selected iron and steel companies,data types were classified according to different characteristics;then,appropriate methods were selected to process them in order to solve the deficiencies and outliers of the original blast furnace data.Linear interpolation was used to fill in the divided continuation data,the Knearest neighbor(KNN)algorithm was used to fill in correlation data with the internal law,and periodic statistical data were filled by the average.The error rate in the filling was low,and the fitting degree was over 85%.For the screening of outliers,corresponding indicator parameters were added according to the continuity,relevance,and periodicity of different data.Also,a variety of algorithms were used for processing.Through the analysis of screening results,a large amount of efficient information in the data was retained,and ineffective outliers were eliminated.Standardized processing of blast furnace big data as the basis of applied research on blast furnace big data can serve as an important means to improve data quality and retain data value.展开更多
In this paper, we present a cluster-based algorithm for time series outlier mining.We use discrete Fourier transformation (DFT) to transform time series from time domain to frequency domain. Time series thus can be ma...In this paper, we present a cluster-based algorithm for time series outlier mining.We use discrete Fourier transformation (DFT) to transform time series from time domain to frequency domain. Time series thus can be mapped as the points in k -dimensional space.For these points, a cluster-based algorithm is developed to mine the outliers from these points.The algorithm first partitions the input points into disjoint clusters and then prunes the clusters,through judgment that can not contain outliers.Our algorithm has been run in the electrical load time series of one steel enterprise and proved to be effective.展开更多
This paper is concerned with the set-membership filtering problem for a class of linear time-varying systems with norm-bounded noises and impulsive measurement outliers.A new representation is proposed to model the me...This paper is concerned with the set-membership filtering problem for a class of linear time-varying systems with norm-bounded noises and impulsive measurement outliers.A new representation is proposed to model the measurement outlier by an impulsive signal whose minimum interval length(i.e.,the minimum duration between two adjacent impulsive signals)and minimum norm(i.e.,the minimum of the norms of all impulsive signals)are larger than certain thresholds that are adjustable according to engineering practice.In order to guarantee satisfactory filtering performance,a so-called parameter-dependent set-membership filter is put forward that is capable of generating a time-varying ellipsoidal region containing the true system state.First,a novel outlier detection strategy is developed,based on a dedicatedly constructed input-output model,to examine whether the received measurement is corrupted by an outlier.Then,through the outcome of the outlier detection,the gain matrix of the desired filter and the corresponding ellipsoidal region are calculated by solving two recursive difference equations.Furthermore,the ultimate boundedness issue on the time-varying ellipsoidal region is thoroughly investigated.Finally,a simulation example is provided to demonstrate the effectiveness of our proposed parameter-dependent set-membership filtering strategy.展开更多
Various uncertainties arising during acquisition process of geoscience data may result in anomalous data instances(i.e.,outliers)that do not conform with the expected pattern of regular data instances.With sparse mult...Various uncertainties arising during acquisition process of geoscience data may result in anomalous data instances(i.e.,outliers)that do not conform with the expected pattern of regular data instances.With sparse multivariate data obtained from geotechnical site investigation,it is impossible to identify outliers with certainty due to the distortion of statistics of geotechnical parameters caused by outliers and their associated statistical uncertainty resulted from data sparsity.This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation.The proposed approach quantifies the outlying probability of each data instance based on Mahalanobis distance and determines outliers as those data instances with outlying probabilities greater than 0.5.It tackles the distortion issue of statistics estimated from the dataset with outliers by a re-sampling technique and accounts,rationally,for the statistical uncertainty by Bayesian machine learning.Moreover,the proposed approach also suggests an exclusive method to determine outlying components of each outlier.The proposed approach is illustrated and verified using simulated and real-life dataset.It showed that the proposed approach properly identifies outliers among sparse multivariate data and their corresponding outlying components in a probabilistic manner.It can significantly reduce the masking effect(i.e.,missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty).It also found that outliers among sparse multivariate data instances affect significantly the construction of multivariate distribution of geotechnical parameters for uncertainty quantification.This emphasizes the necessity of data cleaning process(e.g.,outlier detection)for uncertainty quantification based on geoscience data.展开更多
The distance-based outlier detection method detects the implied outliers by calculating the distance of the points in the dataset, but the computational complexity is particularly high when processing multidimensional...The distance-based outlier detection method detects the implied outliers by calculating the distance of the points in the dataset, but the computational complexity is particularly high when processing multidimensional datasets. In addition, the traditional outlier detection method does not consider the frequency of subsets occurrence, thus, the detected outliers do not fit the definition of outliers (i.e., rarely appearing). The pattern mining-based outlier detection approaches have solved this problem, but the importance of each pattern is not taken into account in outlier detection process, so the detected outliers cannot truly reflect some actual situation. Aimed at these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on the weight data stream. In particular, a method called MWRPM is proposed in the pattern mining phase to fast mine the minimal weighted rare patterns, and then two deviation factors are defined in outlier detection phase to measure the abnormal degree of each transaction on the weight data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent performance in outlier detection and MWRPM approach outperforms in weighted rare pattern mining.展开更多
基金Sponsored by the National Natural Science Foundation of China(Grant No.61973094)the Maoming Natural Science Foundation(Grant No.2020S004)the Guangdong Basic and Applied Basic Research Fund Project(Grant No.2023A1515012341).
文摘The flue temperature is one of the important indicators to characterize the combustion state of an ethylene cracker furnace,the outliers of temperature data can lead to the false alarm.Conventional outlier detection algorithms such as the Isolation Forest algorithm and 3-sigma principle cannot detect the outliers accurately.In order to improve the detection accuracy and reduce the computational complexity,an outlier detection algorithm for flue temperature data based on the CLOF(Clipping Local Outlier Factor,CLOF)algorithm is proposed.The algorithm preprocesses the normalized data using the cluster pruning algorithm,and realizes the high accuracy and high efficiency outlier detection in the outliers candidate set.Using the flue temperature data of an ethylene cracking furnace in a petrochemical plant,the main parameters of the CLOF algorithm are selected according to the experimental results,and the outlier detection effect of the Isolation Forest algorithm,the 3-sigma principle,the conventional LOF algorithm and the CLOF algorithm are compared and analyzed.The results show that the appropriate clipping coefficient in the CLOF algorithm can significantly improve the detection efficiency and detection accuracy.Compared with the outlier detection results of the Isolation Forest algorithm and 3-sigma principle,the accuracy of the CLOF detection results is increased,and the amount of data calculation is significantly reduced.
文摘Due to the advancements in information technologies,massive quantity of data is being produced by social media,smartphones,and sensor devices.The investigation of data stream by the use of machine learning(ML)approaches to address regression,prediction,and classification problems have received consid-erable interest.At the same time,the detection of anomalies or outliers and feature selection(FS)processes becomes important.This study develops an outlier detec-tion with feature selection technique for streaming data classification,named ODFST-SDC technique.Initially,streaming data is pre-processed in two ways namely categorical encoding and null value removal.In addition,Local Correla-tion Integral(LOCI)is used which is significant in the detection and removal of outliers.Besides,red deer algorithm(RDA)based FS approach is employed to derive an optimal subset of features.Finally,kernel extreme learning machine(KELM)classifier is used for streaming data classification.The design of LOCI based outlier detection and RDA based FS shows the novelty of the work.In order to assess the classification outcomes of the ODFST-SDC technique,a series of simulations were performed using three benchmark datasets.The experimental results reported the promising outcomes of the ODFST-SDC technique over the recent approaches.
基金Supported by the Natural Science Foundation of China(62072388,62276146)the Industry Guidance Project Foundation of Science technology Bureau of Fujian province(2020H0047)+2 种基金the Natural Science Foundation of Science Technology Bureau of Fujian province(2019J01601)the Creation Fund project of Science Technology Bureau of Fujian province(JAT190596)Putian University Research Project(2022034)。
文摘Background Image matching is crucial in numerous computer vision tasks such as 3D reconstruction and simultaneous visual localization and mapping.The accuracy of the matching significantly impacted subsequent studies.Because of their local similarity,when image pairs contain comparable patterns but feature pairs are positioned differently,incorrect recognition can occur as global motion consistency is disregarded.Methods This study proposes an image-matching filtering algorithm based on global motion consistency.It can be used as a subsequent matching filter for the initial matching results generated by other matching algorithms based on the principle of motion smoothness.A particular matching algorithm can first be used to perform the initial matching;then,the rotation and movement information of the global feature vectors are combined to effectively identify outlier matches.The principle is that if the matching result is accurate,the feature vectors formed by any matched point should have similar rotation angles and moving distances.Thus,global motion direction and global motion distance consistencies were used to reject outliers caused by similar patterns in different locations.Results Four datasets were used to test the effectiveness of the proposed method.Three datasets with similar patterns in different locations were used to test the results for similar images that could easily be incorrectly matched by other algorithms,and one commonly used dataset was used to test the results for the general image-matching problem.The experimental results suggest that the proposed method is more accurate than other state-of-the-art algorithms in identifying mismatches in the initial matching set.Conclusions The proposed outlier rejection matching method can significantly improve the matching accuracy for similar images with locally similar feature pairs in different locations and can provide more accurate matching results for subsequent computer vision tasks.
文摘A variety of factors affect air quality, making it a difficult issue. The level of clean air in a certain area is referred to as air quality. It is challenging for conventional approaches to correctly discover aberrant values or outliers due to the significant fluctuation of this sort of data, which is influenced by Climate change and the environment. With accelerating industrial expansion and rising population density in Kolkata City, air pollution is continuously rising. This study involves two phases, in the first phase imputation of missing values and second detection of outliers using Statistical Process Control (SPC), and Functional Data Analysis (FDA), studies to achieve the efficacy of the outlier identification methodology proposed with working days and Nonworking days of the variables NO<sub>2</sub>, SO<sub>2</sub>, and O<sub>3</sub>, which were used for a year in a row in Kolkata, India. The results show how the functional data approach outshines traditional outlier detection methods. The outcomes show that functional data analysis vibrates more than the other two approaches after imputation, and the suggested outlier detector is absolutely appropriate for the precise detection of outliers in highly variable data.
文摘Human living would be impossible without air quality. Consistent advancements in practically every aspect of contemporary human life have harmed air quality. Everyday industrial, transportation, and home activities turn up dangerous contaminants in our surroundings. This study investigated two years’ worth of air quality and outlier detection data from two Indian cities. Studies on air pollution have used numerous types of methodologies, with various gases being seen as a vector whose components include gas concentration values for each observation per-formed. We use curves to represent the monthly average of daily gas emissions in our technique. The approach, which is based on functional depth, was used to find outliers in the city of Delhi and Kolkata’s gas emissions, and the outcomes were compared to those from the traditional method. In the evaluation and comparison of these models’ performances, the functional approach model studied well.
基金supported by the National Natural Science Foundation of China(Grant No.11201003)the Provincial Natural Science Research Project of Anhui Colleges(Grant No.KJ2016A263)+1 种基金the Natural Science Foundation of Anhui Province(Grant No.1408085MA07)the PhD Research Startup Foundation of Anhui Normal University(Grant No.2014bsqdjj34)
文摘Purpose:The main aim of this study is to build a robust novel approach that is able to detect outliers in the datasets accurately.To serve this purpose,a novel approach is introduced to determine the likelihood of an object to be extremely different from the general behavior of the entire dataset.Design/methodology/approach:This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems.The proposed approach,named Bagged and Voted Local Outlier Detection(BV-LOF),benefits from the Local Outlier Factor(LOF)as the base algorithm and improves its detection rate by using ensemble methods.Findings:Several experiments have been performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method.According to the results,the BV-LOF approach significantly outperformed LOF on 9 datasets of 10 ones on average.Research limitations:In the BV-LOF approach,the base algorithm is applied to each subset data multiple times with different neighborhood sizes(k)in each case and with different ensemble sizes(T).In our study,we have chosen k and T value ranges as[1-100];however,these ranges can be changed according to the dataset handled and to the problem addressed.Practical implications:The proposed method can be applied to the datasets from different domains(i.e.health,finance,manufacturing,etc.)without requiring any prior information.Since the BV-LOF method includes two-level ensemble operations,it may lead to more computational time than single-level ensemble methods;however,this drawback can be overcome by parallelization and by using a proper data structure such as R*-tree or KD-tree.Originality/value:The proposed approach(BV-LOF)investigates multiple neighborhood sizes(k),which provides findings of instances with different local densities,and in this way,it provides more likelihood of outlier detection that LOF may neglect.It also brings many benefits such as easy implementation,improved capability,higher applicability,and interpretability.
文摘In its broadest sense, this paper reviews the general outlier problem, the means available for addressing the discordancy (or lack thereof) of an outlier (or outliers), and possible strategies for dealing with them. Two alternate approaches to the multiple outlier problem, consecutive and block testing, and their respective inherent weaknesses, masking and swamping, are discussed. In addition, the relative susceptibility of several tests for outliers in normal samples to the swamping phenomena is reported.
基金supported by the National Natural Science Foundation (Grant Nos.91644216 and 41575128)the CAS Information Technology Program (Grant No.XXH13506-302)Guangdong Provincial Science and Technology Development Special Fund (No.2017B020216007)
文摘Although quality assurance and quality control procedures are routinely applied in most air quality networks, outliers can still occur due to instrument malfunctions, the influence of harsh environments and the limitation of measuring methods. Such outliers pose challenges for data-powered applications such as data assimilation, statistical analysis of pollution characteristics and ensemble forecasting. Here, a fully automatic outlier detection method was developed based on the probability of residuals, which are the discrepancies between the observed and the estimated concentration values. The estimation can be conducted using filtering—or regressions when appropriate—to discriminate four types of outliers characterized by temporal and spatial inconsistency, instrument-induced low variances, periodic calibration exceptions, and less PM_(10) than PM_(2.5) in concentration observations, respectively. This probabilistic method was applied to detect all four types of outliers in hourly surface measurements of six pollutants(PM_(2.5), PM_(10),SO_2,NO_2,CO and O_3) from 1436 stations of the China National Environmental Monitoring Network during 2014-16. Among the measurements, 0.65%-5.68% are marked as outliers. with PM_(10) and CO more prone to outliers. Our method successfully identifies a trend of decreasing outliers from 2014 to 2016,which corresponds to known improvements in the quality assurance and quality control procedures of the China National Environmental Monitoring Network. The outliers can have a significant impact on the annual mean concentrations of PM_(2.5),with differences exceeding 10 μg m^(-3) at 66 sites.
基金supported by the State Grid Liaoning Electric Power Supply CO, LTDthe financial support for the “Key Technology and Application Research of the Self-Service Grid Big Data Governance (No.SGLNXT00YJJS1800110)”
文摘With the development of data age,data quality has become one of the problems that people pay much attention to.As a field of data mining,outlier detection is related to the quality of data.The isolated forest algorithm is one of the more prominent numerical data outlier detection algorithms in recent years.In the process of constructing the isolation tree by the isolated forest algorithm,as the isolation tree is continuously generated,the difference of isolation trees will gradually decrease or even no difference,which will result in the waste of memory and reduced efficiency of outlier detection.And in the constructed isolation trees,some isolation trees cannot detect outlier.In this paper,an improved iForest-based method GA-iForest is proposed.This method optimizes the isolated forest by selecting some better isolation trees according to the detection accuracy and the difference of isolation trees,thereby reducing some duplicate,similar and poor detection isolation trees and improving the accuracy and stability of outlier detection.In the experiment,Ubuntu system and Spark platform are used to build the experiment environment.The outlier datasets provided by ODDS are used as test.According to indicators such as the accuracy,recall rate,ROC curves,AUC and execution time,the performance of the proposed method is evaluated.Experimental results show that the proposed method can not only improve the accuracy and stability of outlier detection,but also reduce the number of isolation trees by 20%-40%compared with the original iForest method.
基金This work is financially supported by the National Nature Science Foundation of China(No.52004096)the Hebei Province High-End Iron and Steel Metallurgical Joint Research Fund Project,China(No.E2019209314)+1 种基金the Scientific Research Program Project of Hebei Education Department,China(No.QN2019200)the Tangshan Science and Technology Planning Project,China(No.19150241E).
文摘Blast furnace data processing is prone to problems such as outliers.To overcome these problems and identify an improved method for processing blast furnace data,we conducted an in-depth study of blast furnace data.Based on data samples from selected iron and steel companies,data types were classified according to different characteristics;then,appropriate methods were selected to process them in order to solve the deficiencies and outliers of the original blast furnace data.Linear interpolation was used to fill in the divided continuation data,the Knearest neighbor(KNN)algorithm was used to fill in correlation data with the internal law,and periodic statistical data were filled by the average.The error rate in the filling was low,and the fitting degree was over 85%.For the screening of outliers,corresponding indicator parameters were added according to the continuity,relevance,and periodicity of different data.Also,a variety of algorithms were used for processing.Through the analysis of screening results,a large amount of efficient information in the data was retained,and ineffective outliers were eliminated.Standardized processing of blast furnace big data as the basis of applied research on blast furnace big data can serve as an important means to improve data quality and retain data value.
文摘In this paper, we present a cluster-based algorithm for time series outlier mining.We use discrete Fourier transformation (DFT) to transform time series from time domain to frequency domain. Time series thus can be mapped as the points in k -dimensional space.For these points, a cluster-based algorithm is developed to mine the outliers from these points.The algorithm first partitions the input points into disjoint clusters and then prunes the clusters,through judgment that can not contain outliers.Our algorithm has been run in the electrical load time series of one steel enterprise and proved to be effective.
基金supported in part by the National Natural Science Foundation of China(61703245,61873148,61933007)the China Postdoctoral Science Foundation(2018T110702)+3 种基金the Postdoctoral Special Innovation Foundation of of Shandong Province of China(201701015)the European Union’s Horizon 2020 Research and Innovation Programme(820776(INTEGRADDE))the Royal Society of the UKthe Alexander von Humboldt Foundation of Germany。
文摘This paper is concerned with the set-membership filtering problem for a class of linear time-varying systems with norm-bounded noises and impulsive measurement outliers.A new representation is proposed to model the measurement outlier by an impulsive signal whose minimum interval length(i.e.,the minimum duration between two adjacent impulsive signals)and minimum norm(i.e.,the minimum of the norms of all impulsive signals)are larger than certain thresholds that are adjustable according to engineering practice.In order to guarantee satisfactory filtering performance,a so-called parameter-dependent set-membership filter is put forward that is capable of generating a time-varying ellipsoidal region containing the true system state.First,a novel outlier detection strategy is developed,based on a dedicatedly constructed input-output model,to examine whether the received measurement is corrupted by an outlier.Then,through the outcome of the outlier detection,the gain matrix of the desired filter and the corresponding ellipsoidal region are calculated by solving two recursive difference equations.Furthermore,the ultimate boundedness issue on the time-varying ellipsoidal region is thoroughly investigated.Finally,a simulation example is provided to demonstrate the effectiveness of our proposed parameter-dependent set-membership filtering strategy.
基金supported by the National Key R&D Program of China(Project No.2016YFC0800200)the NRF-NSFC 3rd Joint Research Grant(Earth Science)(Project No.41861144022)+2 种基金the National Natural Science Foundation of China(Project Nos.51679174,and 51779189)the Shenzhen Key Technology R&D Program(Project No.20170324)The financial support is grateful acknowledged。
文摘Various uncertainties arising during acquisition process of geoscience data may result in anomalous data instances(i.e.,outliers)that do not conform with the expected pattern of regular data instances.With sparse multivariate data obtained from geotechnical site investigation,it is impossible to identify outliers with certainty due to the distortion of statistics of geotechnical parameters caused by outliers and their associated statistical uncertainty resulted from data sparsity.This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation.The proposed approach quantifies the outlying probability of each data instance based on Mahalanobis distance and determines outliers as those data instances with outlying probabilities greater than 0.5.It tackles the distortion issue of statistics estimated from the dataset with outliers by a re-sampling technique and accounts,rationally,for the statistical uncertainty by Bayesian machine learning.Moreover,the proposed approach also suggests an exclusive method to determine outlying components of each outlier.The proposed approach is illustrated and verified using simulated and real-life dataset.It showed that the proposed approach properly identifies outliers among sparse multivariate data and their corresponding outlying components in a probabilistic manner.It can significantly reduce the masking effect(i.e.,missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty).It also found that outliers among sparse multivariate data instances affect significantly the construction of multivariate distribution of geotechnical parameters for uncertainty quantification.This emphasizes the necessity of data cleaning process(e.g.,outlier detection)for uncertainty quantification based on geoscience data.
基金supported by Fundamental Research Funds for the Central Universities (No. 2018XD004)
文摘The distance-based outlier detection method detects the implied outliers by calculating the distance of the points in the dataset, but the computational complexity is particularly high when processing multidimensional datasets. In addition, the traditional outlier detection method does not consider the frequency of subsets occurrence, thus, the detected outliers do not fit the definition of outliers (i.e., rarely appearing). The pattern mining-based outlier detection approaches have solved this problem, but the importance of each pattern is not taken into account in outlier detection process, so the detected outliers cannot truly reflect some actual situation. Aimed at these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on the weight data stream. In particular, a method called MWRPM is proposed in the pattern mining phase to fast mine the minimal weighted rare patterns, and then two deviation factors are defined in outlier detection phase to measure the abnormal degree of each transaction on the weight data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent performance in outlier detection and MWRPM approach outperforms in weighted rare pattern mining.