Journal Articles
11 articles found
1. Data cleaning method for the process of acid production with flue gas based on improved random forest (Cited by: 1)
Authors: Xiaoli Li, Minghua Liu, Kang Wang, Zhiqiang Liu, Guihai Li
Chinese Journal of Chemical Engineering (SCIE, EI, CAS, CSCD), 2023, Issue 7, pp. 72-84
Acid production with flue gas is a complex nonlinear process with multiple variables and strong coupling. The operation data are an important basis for state monitoring, optimal control, and fault diagnosis. However, the operating environment of acid production with flue gas is complex and involves a large amount of equipment, so the data obtained by the detection equipment are seriously polluted and prone to abnormalities such as data loss and outliers. To address abnormal data in this process, a data cleaning method based on an improved random forest is proposed. First, an outlier recognition model based on isolation forest is designed to identify and eliminate outliers in the dataset. Second, an improved random forest regression model is established: a genetic algorithm optimizes the hyperparameters of the random forest, finding the optimal parameter combination in the search space and predicting the trend of the data. Finally, the improved random forest method compensates for the missing data after abnormal data are eliminated, completing the cleaning. Results show that the proposed method accurately eliminates and compensates for abnormal data in the process and improves the accuracy of missing-data compensation. With the cleaned data, a more accurate model can be established, which is significant for subsequent temperature control: the conversion rate of SO₂ can be further improved, thereby increasing the yield of sulfuric acid and the economic benefits.
Keywords: acid production; data cleaning; isolation forest; random forest; data compensation
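The two-stage scheme this abstract describes (isolation forest to eliminate outliers, then a random forest regressor to compensate the removed values) can be sketched with scikit-learn. The data here are a synthetic stand-in for flue-gas operating records, and the paper's genetic-algorithm hyperparameter search is omitted for brevity; all names are ours:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for operating data: the target depends on two process variables.
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, 300)

# Inject a few gross outliers into the target, as in polluted sensor records.
y_dirty = y.copy()
y_dirty[:5] += 50.0

# Step 1: an isolation forest flags outlier records for elimination.
iso = IsolationForest(contamination=0.05, random_state=0)
mask = iso.fit_predict(np.column_stack([X, y_dirty])) == 1
X_clean, y_clean = X[mask], y_dirty[mask]

# Step 2: a random forest regressor, fitted on the cleaned records,
# compensates (predicts) values for the records that were removed.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_clean, y_clean)
y_filled = y_dirty.copy()
y_filled[~mask] = rf.predict(X[~mask])
```

In the paper the random forest's hyperparameters are tuned by a genetic algorithm; here they are simply fixed, which keeps the sketch short without changing the two-stage structure.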
2. A Review of Data Cleaning Methods for Web Information System
Authors: Jinlin Wang, Xing Wang, Yuchen Yang, Hongli Zhang, Binxing Fang
Computers, Materials & Continua (SCIE, EI), 2020, Issue 3, pp. 1053-1075
Web information systems (WIS) are frequently used and indispensable in daily social life, providing information services in many scenarios such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role across WIS scenarios in improving the quality of data services. In this paper, we review the state-of-the-art methods for data cleaning in WIS. Based on the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technology, to classify the existing work. After elaborating and analyzing each category, we summarize the descriptions and challenges of data cleaning methods along sub-elements such as data and user interaction, data quality rules, models, crowdsourcing, and privacy preservation. Finally, we analyze the various problem types and offer suggestions for future research on data cleaning in WIS from the technological and interactive perspectives.
Keywords: data cleaning; web information system; data quality rule; crowdsourcing; privacy preservation
3. Data Cleaning Based on Stacked Denoising Autoencoders and Multi-Sensor Collaborations
Authors: Xiangmao Chang, Yuan Qiu, Shangting Su, Deliang Yang
Computers, Materials & Continua (SCIE, EI), 2020, Issue 5, pp. 691-703
Wireless sensor networks are increasingly used in sensitive event monitoring. However, the various abnormal data generated by sensors greatly decrease the accuracy of event detection. Although many methods have been proposed to deal with abnormal data, they generally detect and/or repair all abnormal data without further differentiation. In fact, besides the abnormal data caused by events, sensor nodes are known to be prone to generating abnormal data due to factors such as sensor hardware drawbacks and random effects of external sources. Dealing with all abnormal data without differentiation results in false or missed detection of events. In this paper, we propose a data cleaning approach based on stacked denoising autoencoders (SDAE) and multi-sensor collaborations. We detect all abnormal data with the SDAE, then differentiate them through multi-sensor collaborations: abnormal data caused by events are left unchanged, while abnormal data caused by other factors are repaired. Simulations based on real data show the efficiency of the proposed approach.
Keywords: data cleaning; wireless sensor networks; stacked denoising autoencoders; multi-sensor collaborations
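The differentiation idea — keep anomalies that several sensors confirm (events), repair anomalies seen by one sensor alone (faults) — can be sketched in numpy. A simple range check stands in for the SDAE's reconstruction-error detector, and the signal, thresholds, and fault/event positions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Five co-located sensors observe the same physical signal over 200 steps.
t = np.linspace(0, 4 * np.pi, 200)
readings = np.sin(t) + rng.normal(0, 0.05, (5, 200))

# A hardware fault on sensor 0 at step 50 (one sensor deviates alone) and a
# real event at step 120 (all sensors deviate together).
readings[0, 50] += 3.0
readings[:, 120] += 3.0

# Detection stand-in: flag readings far outside the signal's normal range
# (the paper uses a stacked denoising autoencoder's reconstruction error).
flags = np.abs(readings) > 1.5

# Multi-sensor collaboration: an anomaly most sensors confirm is an event and
# is kept; an isolated anomaly is a sensor fault and is repaired with the
# median of the other sensors at the same instant.
event_steps = flags.sum(axis=0) >= 3
repaired = readings.copy()
for s, step in zip(*np.where(flags)):
    if not event_steps[step]:
        repaired[s, step] = np.median(np.delete(readings[:, step], s))
```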
4. Cleaning of Multi-Source Uncertain Time Series Data Based on PageRank
Authors: 高嘉伟, 孙纪舟
Journal of Donghua University (English Edition) (CAS), 2023, Issue 6, pp. 695-700
There are errors in multi-source uncertain time series data. Truth discovery methods for time series data are effective in finding more accurate values, but some have limitations in their usability. To tackle this challenge, we propose a new and convenient truth discovery method for time series data. A more accurate sample is closer to the truth and, consequently, to other accurate samples. Because the mutual-confirm relationship between sensors closely resembles the mutual-quote relationship between web pages, we evaluate sensor reliability with PageRank and then estimate the truth from sensor reliability. The method therefore relies on neither smoothness assumptions nor prior knowledge of the data. Finally, we validate the effectiveness and efficiency of the proposed method on real-world and synthetic data sets, respectively.
Keywords: big data; data cleaning; time series; truth discovery; PageRank
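The analogy the abstract draws — sensors confirming each other like web pages quoting each other — can be sketched as PageRank over an agreement graph. The data, weighting scheme, and damping factor below are our illustrative choices, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Five sensors report four time points; sensor 4 carries a constant bias.
truth = np.array([20.0, 21.0, 22.5, 23.0])
obs = truth + rng.normal(0, 0.2, (5, 4))
obs[4] += 4.0

# Mutual-confirm graph: sensor i "links to" sensor j with weight inversely
# proportional to their average disagreement, mirroring mutual quotation.
d = np.abs(obs[:, None, :] - obs[None, :, :]).mean(axis=2)
W = 1.0 / (d + 1e-6)
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

# PageRank power iteration; the stationary score acts as sensor reliability.
r = np.full(5, 0.2)
for _ in range(100):
    r = 0.15 / 5 + 0.85 * (W.T @ r)

# Reliability-weighted estimate of the truth.
estimate = (r[:, None] * obs).sum(axis=0) / r.sum()
```

The biased sensor disagrees with everyone, so little reliability flows to it, and the weighted estimate lands near the consensus of the accurate sensors — with no smoothness assumption anywhere.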
5. IoT data cleaning techniques: A survey
Authors: Xiaoou Ding, Hongzhi Wang, Genglong Li, Haoxuan Li, Yingze Li, Yida Liu
Intelligent and Converged Networks (EI), 2022, Issue 4, pp. 325-339
Data cleaning is considered an effective approach to improving data quality, letting practitioners and researchers devote themselves to downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of cleaning Internet of Things (IoT) data with time-series characteristics: error data detection and data repairing. For error detection, it surveys quantitative methods for detecting single-point errors, continuous errors, and errors in multidimensional time series, as well as qualitative methods for detecting rule-violating errors. It then describes error repairing techniques in detail, covering statistics-based, rule-based, and human-involved repairing. We review the strengths and limitations of current data cleaning techniques in IoT data applications and conclude with an outlook on the future of IoT data cleaning.
Keywords: Internet of Things (IoT); data quality; data cleaning; error detection; data repairing
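Among the quantitative single-point error detection methods such surveys cover, a speed-constraint cleaner is one of the simplest: flag any point whose implied rate of change from its predecessor exceeds what the physical process allows, then clamp it back inside the constraint. A minimal sketch (function names, data, and the speed limit are ours):

```python
def clean_by_speed(values, timestamps, smax):
    """Detect and repair single-point errors violating a max-speed constraint."""
    repaired = list(values)
    errors = []
    for i in range(1, len(repaired)):
        dt = timestamps[i] - timestamps[i - 1]
        step = repaired[i] - repaired[i - 1]
        if abs(step / dt) > smax:
            errors.append(i)
            bound = smax * dt
            # Clamp the jump to the largest change the constraint permits.
            repaired[i] = repaired[i - 1] + max(-bound, min(bound, step))
    return repaired, errors

ts = [0, 1, 2, 3, 4]
vals = [10.0, 10.4, 99.0, 11.1, 11.5]  # a single-point spike at t = 2
fixed, errs = clean_by_speed(vals, ts, smax=1.0)
```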
6. Data Cleaning About Student Information Based on Massive Open Online Course System
Authors: Shengjun Yin, Yaling Yi, Hongzhi Wang
《国际计算机前沿大会会议论文集》, 2020, Issue 1, pp. 33-43
Recently, Massive Open Online Courses (MOOCs) have become a major way of online learning for millions of people around the world, generating a large amount of data in the process. However, due to errors produced during collection, by the system, and so on, these data contain various inconsistencies and missing values. To support accurate analysis, this paper studies data cleaning technology for an online open curriculum system, including time-based filling of missing values in time series and rule-based correction of input errors. The data cleaning algorithm designed in this paper is divided into six parts: pre-processing, missing data processing, format and content error processing, logical error processing, irrelevant data processing, and correlation analysis. For missing data processing, the paper designs and implements a missing-value-filling algorithm based on time series. For the large number of descriptive variables in the format and content error processing module, it proposes one-based and separability-based criteria (Hot+J3+PCA). The online course data cleaning algorithm is analyzed in detail with respect to design, implementation, and testing. After extensive rigorous testing, each module functions normally and the cleaning performance of the algorithm meets expectations.
Keywords: MOOC; data cleaning; time series; intermittent missing; dimension reduction
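Time-based filling of intermittent gaps, the core of the missing-data step above, can be illustrated with pandas: each missing value is interpolated from its temporal neighbours. The per-day study-time series here is a hypothetical example of the kind of record a MOOC log might yield:

```python
import pandas as pd

# Hypothetical daily study-time records (minutes) with intermittent gaps,
# as arise when MOOC log collection drops events.
s = pd.Series([30.0, None, None, 45.0, 50.0, None, 60.0],
              index=pd.date_range("2020-03-01", periods=7, freq="D"))

# Time-based linear interpolation fills each gap from its temporal neighbours.
filled = s.interpolate(method="time")
```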
7. Improve Data Quality by Processing Null Values and Semantic Dependencies
Authors: Houda Zaidi, Faouzi Boufarès, Yann Pollet
Journal of Computer and Communications, 2016, Issue 5, pp. 78-85
Today, the quantity of data continues to increase; furthermore, data are heterogeneous, come from multiple sources (structured, semi-structured, and unstructured), and have different levels of quality. It is therefore very common to manipulate data without knowledge of their structure and semantics; the metadata may be insufficient or totally absent. Data anomalies may be due to the poverty of their semantic descriptions, or even the absence of any description. In this paper, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct both intra-column and inter-column anomalies. We aim to improve data quality by processing the null values and the semantic dependencies between columns.
Keywords: data quality; big data; contextual semantics; semantic dependencies; functional dependencies; null values; data cleaning
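One way a semantic (functional) dependency between columns lets you process null values: if `city → country` holds, a null country can be deduced from any other row with the same city. A pandas sketch on a toy table (the dependency, column names, and data are our illustration, not the paper's algorithm):

```python
import pandas as pd

# Toy records where the functional dependency  city -> country  holds;
# a null country can therefore be deduced from another row with the same city.
df = pd.DataFrame({
    "city":    ["Paris", "Paris", "Lyon", "Lyon"],
    "country": ["France", None, "France", None],
})

# Inter-column repair: propagate the non-null value within each city group.
df["country"] = df.groupby("city")["country"].transform(lambda g: g.ffill().bfill())
```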
8. Corpus Augmentation for Improving Neural Machine Translation (Cited by: 1)
Authors: Zijian Li, Chengying Chi, Yunyun Zhan
Computers, Materials & Continua (SCIE, EI), 2020, Issue 7, pp. 637-650
The translation quality of neural machine translation (NMT) systems depends largely on the quality of the large-scale bilingual parallel corpora available. Research shows that under limited resources the performance of NMT drops greatly, and a large amount of high-quality bilingual parallel data is needed to train a competitive translation model. However, not all languages have large-scale, high-quality bilingual corpus resources. In these cases, improving the quality of the corpora becomes the main lever for increasing the accuracy of the NMT results. This paper proposes a new method to improve data quality through data cleaning, data expansion, and other measures, expanding the data at the word and sentence level and thus improving the richness of the bilingual data. A long short-term memory (LSTM) language model is used to ensure the fluency of constructed sentences, and a variety of processing methods further improve the quality of the bilingual data. Experiments on three standard test sets validate the proposed method, with the fairseq-transformer NMT system used for training. The results show that the proposed method improves translation: the BLEU score increases by 2.34 over the baseline.
Keywords: neural machine translation; corpus augmentation; model improvement; deep learning; data cleaning
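Sentence-level corpus expansion can be as simple as generating variants of an existing sentence. The sketch below uses word dropout, a crude stand-in for the paper's word- and sentence-level expansion measures (which additionally use an LSTM language model to keep the generated sentences fluent); the function name and sentence are ours:

```python
import random

random.seed(0)

def augment(sentence, n_copies=3):
    """Expand a corpus sentence into n_copies variants, each missing one word."""
    words = sentence.split()
    copies = set()
    while len(copies) < n_copies:
        i = random.randrange(len(words))
        copies.add(" ".join(words[:i] + words[i + 1:]))
    return sorted(copies)

extra = augment("the quick brown fox jumps over the lazy dog")
```

In a real pipeline each augmented source sentence would be paired with its target-side counterpart, and a language model would filter out disfluent variants before they enter training.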
9. A Missing Power Data Filling Method Based on Improved Random Forest Algorithm (Cited by: 3)
Authors: Wei Deng, Yixiu Guo, Jie Liu, Yong Li, Dingguo Liu, Liang Zhu
Chinese Journal of Electrical Engineering (CSCD), 2019, Issue 4, pp. 33-39
Missing data filling is a key step in power big data preprocessing that helps to improve the quality and the utilization of electric power data. Given the limitations of traditional methods of filling missing data, an improved random forest filling algorithm is proposed. Because electric power data exhibit time-series characteristics in both the horizontal and vertical directions, the improved method combines linear interpolation, matrix combination, and matrix transposition to fill large amounts of missing electric power data. The filling results show that the improved random forest algorithm is applicable to electric power data with various missing patterns; moreover, the accuracy of the filled results is high and the model is stable, which helps improve the quality of electric power data.
Keywords: big data cleaning; missing data filling; data preprocessing; random forest; data quality
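The idea of exploiting both directions of a load matrix can be sketched with scikit-learn: treat complete days as training rows and predict a day's missing hour from its other hours. This is a simplified stand-in for the paper's matrix-combination and transposition scheme, on invented daily load curves:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical daily load curves: 30 days x 24 hours sharing a daily shape.
hours = np.arange(24)
load = 100 + 30 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, (30, 24))

# Knock out one reading to simulate a missing value.
truth = load[10, 6]
load[10, 6] = np.nan

# Train a random forest to predict the missing hour from the other hours of
# the same day, using the complete days as training rows.
cols = [h for h in range(24) if h != 6]
train = np.delete(load, 10, axis=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(train[:, cols], train[:, 6])
load[10, 6] = rf.predict(load[10, cols].reshape(1, -1))[0]
```

In the paper this is combined with linear interpolation along the time axis and applied over the transposed matrix as well, so both the within-day and across-day structure contribute to the fill.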
10. Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation
Authors: Zhi-Xin Qi, Hong-Zhi Wang, An-Jie Wang
Journal of Computer Science & Technology (SCIE, EI, CSCD), 2021, Issue 4, pp. 806-821
Data quality issues have attracted widespread attention due to the negative impact of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could guide both the selection of an appropriate model under given data quality and the determination of how much data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that dirty-data impacts are related to the error type, the error rate, and the data size. Based on these findings, we suggest users leverage our proposed metrics, sensibility and data quality inflection point, for model selection and data cleaning.
Keywords: data quality; classification; clustering; model selection; data cleaning
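The experimental design behind such studies is an error-rate sweep: inject dirty cells into the training data at increasing rates and watch the model's accuracy degrade, looking for the rate at which it breaks down (the "inflection point"). A minimal scikit-learn sketch on synthetic data; the zero-filling of missing cells and all parameters are our illustrative choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Separable synthetic classification data.
X = rng.normal(0, 1, (600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy_at_missing_rate(rate):
    """Zero out training cells at the given rate, then measure test accuracy."""
    Xd = X_tr.copy()
    Xd[rng.random(Xd.shape) < rate] = 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xd, y_tr)
    return clf.score(X_te, y_te)

accs = [accuracy_at_missing_rate(r) for r in (0.0, 0.3, 0.6)]
```

Repeating the sweep per error type (missing, inconsistent, conflicting), per model, and per data size yields the kind of sensitivity comparison the paper reports.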
11. Developing an Integrated IoT Cloud Based Predictive Conservation Model for Asset Management in Industry 4.0
Authors: Karnam Shanmugam, Kachhti Satyam, Thimma Reddy Sreenivasula Reddy
Journal of Social Computing (EI), 2023, Issue 2, pp. 139-149
With the advent of Industry 4.0 (I4.0), predictive maintenance (PdM) methods have been widely adopted by businesses to manage the condition of their machinery. With the help of I4.0, digital transformation, information techniques, computerised control, and communication networks, large amounts of data on operational and process conditions can be collected from multiple pieces of equipment and used for automated fault detection and diagnosis, with the goals of reducing unscheduled maintenance, improving component utilisation, and lengthening equipment lifespan. In this paper, we use smart approaches to create a PdM planning model. The approach has five key steps: (1) cleaning the data, (2) normalising the data, (3) selecting the best features, (4) deciding on the prediction network, and (5) producing a prediction. At the outset, PdM-related data undergo cleaning and normalisation to bring everything into order and within bounds. The next step is optimal feature selection to eliminate unnecessary data. This research presents the golden search optimization (GSO) algorithm, a powerful population-based optimization technique for efficient feature selection. The first phase of GSO produces a set of possible solutions or objects at random; these objects then interact through a simple mathematical model to find the best feasible answer. Because prediction values fall over a wide range, machine learning and deep learning face challenges in providing reliable predictions. We therefore recommend a multilayer hybrid convolution neural network (MLH-CNN). While conceptually similar to VGGNet, this approach uses fewer parameters while maintaining or improving classification correctness by adjusting the number of network modules and channels. The proposed model is evaluated on two datasets to show that it can accurately predict the future state of components for maintenance planning.
Keywords: Industry 4.0; predictive maintenance; golden search optimization; multilayer hybrid convolution neural network; data cleaning
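The first two steps of the pipeline above — cleaning, then normalising so every feature lies within bounds — can be sketched in numpy. The vibration-style readings, the NaN-drop and percentile-clip cleaning rules, and min-max scaling are our assumptions; the paper's GSO feature selection and MLH-CNN stages are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical condition-monitoring readings: 100 samples x 3 features.
data = rng.normal(50, 5, (100, 3))
data[3, 1] = np.nan    # a dropped reading
data[7, 2] = 500.0     # a sensor spike

# Step (1) cleaning: drop rows with missing cells, clamp gross spikes.
clean = data[~np.isnan(data).any(axis=1)]
lo, hi = np.percentile(clean, [1, 99], axis=0)
clean = np.clip(clean, lo, hi)

# Step (2) normalisation: min-max scale each feature into [0, 1].
norm = (clean - clean.min(axis=0)) / (clean.max(axis=0) - clean.min(axis=0))
```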