20 articles found
Data cleaning method for the process of acid production with flue gas based on improved random forest  Cited by: 3
1
Authors: Xiaoli Li Minghua Liu Kang Wang Zhiqiang Liu Guihai Li 《Chinese Journal of Chemical Engineering》 SCIE EI CAS CSCD, 2023, No. 7, pp. 72-84 (13 pages)
Acid production with flue gas is a complex nonlinear process with multiple variables and strong coupling. The operation data is an important basis for state monitoring, optimal control, and fault diagnosis. However, the operating environment of acid production with flue gas is complex and there is much equipment. The data obtained by the detection equipment is seriously polluted and prone to abnormal phenomena such as data loss and outliers. Therefore, to solve the problem of abnormal data in the process of acid production with flue gas, a data cleaning method based on improved random forest is proposed. Firstly, an outlier data recognition model based on isolation forest is designed to identify and eliminate the outliers in the dataset. Secondly, an improved random forest regression model is established. A genetic algorithm is used to optimize the hyperparameters of the random forest regression model. Then the optimal parameter combination is found in the search space and the trend of the data is predicted. Finally, the improved random forest data cleaning method is used to compensate for the missing data after eliminating abnormal data, and the data cleaning is realized. Results show that the proposed method can accurately eliminate and compensate for the abnormal data in the process of acid production with flue gas. The method improves the accuracy of compensation for missing data. With the data after cleaning, a more accurate model can be established, which is significant for the subsequent temperature control. The conversion rate of SO₂ can be further improved, thereby improving the yield of sulfuric acid and economic benefits.
Keywords: Acid production; data cleaning; Isolation forest; Random forest; data compensation
Data Cleaning Based on Stacked Denoising Autoencoders and Multi-Sensor Collaborations  Cited by: 1
2
Authors: Xiangmao Chang Yuan Qiu Shangting Su Deliang Yang 《Computers, Materials & Continua》 SCIE EI, 2020, No. 5, pp. 691-703 (13 pages)
Wireless sensor networks are increasingly used in sensitive event monitoring. However, various abnormal data generated by sensors greatly decrease the accuracy of event detection. Although many methods have been proposed to deal with abnormal data, they generally detect and/or repair all abnormal data without further differentiation. In fact, besides the abnormal data caused by events, it is well known that sensor nodes are prone to generating abnormal data due to factors such as sensor hardware drawbacks and random effects of external sources. Dealing with all abnormal data without differentiation will result in false detection or missed detection of events. In this paper, we propose a data cleaning approach based on Stacked Denoising Autoencoders (SDAE) and multi-sensor collaborations. We detect all abnormal data by SDAE, then differentiate the abnormal data by multi-sensor collaborations. The abnormal data caused by events are left unchanged, while the abnormal data caused by other factors are repaired. Simulations based on real data show the efficiency of the proposed approach.
Keywords: data cleaning; wireless sensor networks; stacked denoising autoencoders; multi-sensor collaborations
A Review of Data Cleaning Methods for Web Information System
3
Authors: Jinlin Wang Xing Wang Yuchen Yang Hongli Zhang Binxing Fang 《Computers, Materials & Continua》 SCIE EI, 2020, No. 3, pp. 1053-1075 (23 pages)
Web information systems (WIS) are frequently used and indispensable in daily social life. WIS provide information services in many scenarios, such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios to improve the quality of data service. In this paper, we present a review of the state-of-the-art methods for data cleaning in WIS. According to the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technology, to classify the existing works. Then, after elaborating and analyzing each category, we summarize the descriptions and challenges of data cleaning methods with sub-elements such as data & user interaction, data quality rule, model, crowdsourcing, and privacy preservation. Finally, we analyze various types of problems and provide suggestions for future research on data cleaning in WIS from the technology and interactive perspectives.
Keywords: data cleaning; web information system; data quality rule; crowdsourcing; privacy preservation
A method for cleaning wind power anomaly data by combining image processing with community detection algorithms
4
Authors: Qiaoling Yang Kai Chen Jianzhang Man Jiaheng Duan Zuoqi Jin 《Global Energy Interconnection》 EI CSCD, 2024, No. 3, pp. 293-312 (20 pages)
Current methodologies for cleaning wind power anomaly data exhibit limited capabilities in identifying abnormal data within extensive datasets and struggle to accommodate the considerable variability and intricacy of wind farm data. Consequently, a method for cleaning wind power anomaly data by combining image processing with community detection algorithms (CWPAD-IPCDA) is proposed. To precisely identify and initially clean anomalous data, wind power curve (WPC) images are converted into graph structures, which employ the Louvain community recognition algorithm and graph-theoretic methods for community detection and segmentation. Furthermore, the mathematical morphology operation (MMO) determines the main part of the initially cleaned wind power curve images and maps them back to the normal wind power points to complete the final cleaning. The CWPAD-IPCDA method was applied to clean datasets from 25 wind turbines (WTs) in two wind farms in northwest China to validate its feasibility. A comparison was conducted using the density-based spatial clustering of applications with noise (DBSCAN) algorithm, an improved isolation forest algorithm, and an image-based (IB) algorithm. The experimental results demonstrate that the CWPAD-IPCDA method surpasses the other three algorithms, achieving an approximately 7.23% higher average data cleaning rate. The mean value of the sum of the squared errors (SSE) of the dataset after cleaning is approximately 6.887 lower than that of the other algorithms. Moreover, the mean of overall accuracy, as measured by the F1-score, exceeds that of the other methods by approximately 10.49%; this indicates that the CWPAD-IPCDA method is more conducive to improving the accuracy and reliability of wind power curve modeling and wind farm power forecasting.
Keywords: Wind turbine power curve; Abnormal data cleaning; Community detection; Louvain algorithm; Mathematical morphology operation
Cleaning of Multi-Source Uncertain Time Series Data Based on PageRank
5
Authors: 高嘉伟 孙纪舟 《Journal of Donghua University(English Edition)》 CAS, 2023, No. 6, pp. 695-700 (6 pages)
There are errors in multi-source uncertain time series data. Truth discovery methods for time series data are effective in finding more accurate values, but some have limitations in their usability. To tackle this challenge, we propose a new and convenient truth discovery method to handle time series data. A more accurate sample is closer to the truth and, consequently, to other accurate samples. Because the mutual-confirm relationship between sensors is very similar to the mutual-quote relationship between web pages, we evaluate sensor reliability based on PageRank and then estimate the truth by sensor reliability. Therefore, this method does not rely on smoothness assumptions or prior knowledge of the data. Finally, we validate the effectiveness and efficiency of the proposed method on real-world and synthetic data sets, respectively.
Keywords: big data; data cleaning; time series; truth discovery; PageRank
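The PageRank idea in the abstract above (sensors that confirm each other earn reliability, and reliability then weights the truth estimate) can be illustrated with a minimal sketch. This is not the authors' algorithm: the `agreement` similarity, the damping factor, and all function names are assumptions made for illustration only, and the sketch assumes at least two sensors with equal-length series.

```python
import math


def agreement(a, b):
    """Hypothetical similarity: 1.0 for identical series, decaying with the mean absolute gap."""
    gap = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return math.exp(-gap)


def sensor_reliability(series, damping=0.85, iters=100):
    """PageRank-style power iteration over a sensor-to-sensor agreement graph."""
    n = len(series)
    # weights[i][j]: how strongly sensor i "confirms" sensor j (no self-confirmation).
    weights = [
        [0.0 if i == j else agreement(series[i], series[j]) for j in range(n)]
        for i in range(n)
    ]
    row_sums = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * weights[i][j] / row_sums[i] for i in range(n))
            for j in range(n)
        ]
    return rank


def estimate_truth(series, rank):
    """Reliability-weighted average of the sensor readings at each time step."""
    total = sum(rank)
    steps = len(series[0])
    return [sum(r * s[t] for r, s in zip(rank, series)) / total for t in range(steps)]
```

With three sensors that track each other closely and one that does not, the outlying sensor receives the lowest score, so it contributes little to the weighted estimate.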
IoT data cleaning techniques: A survey
6
Authors: Xiaoou Ding Hongzhi Wang Genglong Li Haoxuan Li Yingze Li Yida Liu 《Intelligent and Converged Networks》 EI, 2022, No. 4, pp. 325-339 (15 pages)
Data cleaning is considered an effective approach to improving data quality, helping practitioners and researchers devote themselves to downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time series characteristics: error data detection and data repairing. With respect to error data detection techniques, it gives a categorized overview of quantitative data error detection methods for detecting single-point errors, continuous errors, and multidimensional time series data errors, and of qualitative data error detection methods for detecting rule-violating errors. Besides, it provides a detailed description of error data repairing techniques, involving statistics-based repairing, rule-based repairing, and human-involved repairing. We review the strengths and the limitations of the current data cleaning techniques under IoT data applications and conclude with an outlook on the future of IoT data cleaning.
Keywords: Internet of Things (IoT); data quality; data cleaning; error detection; data repairing
Data Cleaning About Student Information Based on Massive Open Online Course System
7
Authors: Shengjun Yin Yaling Yi Hongzhi Wang 《国际计算机前沿大会会议论文集》 2020, No. 1, pp. 33-43 (11 pages)
Recently, Massive Open Online Courses (MOOCs) have become a major way of online learning for millions of people around the world, generating a large amount of data in the meantime. However, due to errors arising from collection, the system, and so on, these data have various inconsistencies and missing values. In order to support accurate analysis, this paper studies data cleaning technology for online open curriculum systems, including missing-value filling for time series and rule-based input error correction. The data cleaning algorithm designed in this paper is divided into six parts: pre-processing, missing data processing, format and content error processing, logical error processing, irrelevant data processing, and correlation analysis. This paper designs and implements a missing-value-filling algorithm based on time series in the missing data processing part. According to the large number of descriptive variables existing in the format and content error processing module, it proposed one-based and separability-based criteria Hot+J3+PCA. The online course data cleaning algorithm was analyzed in detail in terms of algorithm design, implementation, and testing. After extensive testing, each module functions normally, and the cleaning performance of the algorithm meets expectations.
Keywords: MOOC; data cleaning; Time series; Intermittent missing; Dimension reduction
Can Automatic Classification Help to Increase Accuracy in Data Collection?
8
Authors: Frederique Lang Diego Chavarro Yuxian Liu 《Journal of Data and Information Science》 2016, No. 3, pp. 42-58 (17 pages)
Purpose: The authors aim at testing the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets. Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested here in comparison with manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms. Findings: We found that the performance of the algorithms used varies with the size of the training sample. However, for the classification exercise in this paper the best performing algorithms were SVM and Boosting. The combination of these two algorithms achieved a high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks. Research limitations: The dataset gathered has significantly more records related to the topic of interest compared to unrelated topics. This may affect the performance of some algorithms, especially in their identification of unrelated papers. Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With the help of accuracy, recall, and coverage measures, it is possible to have an estimation of the error involved in this classification, which could open the possibility of incorporating the use of these algorithms in software specifically designed for data cleaning and classification.
Keywords: DISAMBIGUATION; Machine learning; data cleaning; Classification; ACCURACY; RECALL; COVERAGE
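The voting scheme and coverage indicator mentioned in the abstract above can be sketched roughly as follows. The abstract does not define either precisely, so this sketch assumes a unanimous vote (a record is labelled only when every classifier agrees) and takes coverage to be the share of records labelled that way; the function names are hypothetical, and a non-empty prediction list is assumed.

```python
def combine_by_agreement(predictions):
    """Combine per-classifier label lists; None marks records where the classifiers disagree.

    Returns (combined_labels, coverage), where coverage is the fraction of
    records on which every classifier produced the same label.
    """
    combined = [
        labels[0] if len(set(labels)) == 1 else None
        for labels in zip(*predictions)
    ]
    coverage = sum(label is not None for label in combined) / len(combined)
    return combined, coverage


def accuracy_on_covered(combined, truth):
    """Accuracy measured only over the records the classifiers agreed on."""
    pairs = [(p, t) for p, t in zip(combined, truth) if p is not None]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)
```

Under this reading, records left as `None` are the ones that would still go to manual coding, which is how a high-coverage, high-accuracy combination reduces manual work.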
Intelligent Data Pre-processing Model in Integrated Ocean Observing Network System
9
Authors: 韩华 丁永生 刘凤鸣 《Journal of Donghua University(English Edition)》 EI CAS, 2009, No. 5, pp. 499-502 (4 pages)
There are a number of dirty data in the observation data sets derived from the integrated ocean observing network system. Thus, the data must be carefully and reasonably processed before they are used for forecasting or analysis. This paper proposes a data pre-processing model based on intelligent algorithms. Firstly, we introduce the integrated network platform of ocean observation. Next, the pre-processing model of data is presented, and an intelligent cleaning model of data is proposed. Based on fuzzy clustering, the Kohonen clustering network is improved to fulfill the parallel calculation of fuzzy c-means clustering. The proposed dynamic algorithm can automatically find the new clustering center with the updated sample data. The rapid and dynamic performance of the model makes it suitable for real-time calculation, and the efficiency and accuracy of the model are demonstrated by test results from observation data analysis.
Keywords: integrated ocean observing network; intelligent data pre-processing; data cleaning; fuzzy soft clustering
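For readers unfamiliar with the fuzzy c-means clustering that the abstract above builds on, here is a minimal sketch of the standard algorithm on one-dimensional data. It is plain fuzzy c-means, not the improved Kohonen network of the paper; the function name, the fuzzifier default `m=2.0`, and the fixed iteration count are illustrative assumptions.

```python
def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Standard fuzzy c-means on 1-D data, starting from the given initial centers.

    Returns the final centers and the membership rows computed in the last
    iteration (one row per point, one column per center).
    """
    centers = list(centers)
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iters):
        memberships = []
        for x in points:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # The point sits exactly on a center: give it full membership there.
                hit = dists.index(0.0)
                row = [1.0 if i == hit else 0.0 for i in range(len(centers))]
            else:
                row = [
                    1.0 / sum((d_i / d_k) ** power for d_k in dists)
                    for d_i in dists
                ]
            memberships.append(row)
        # Each center moves to the membership-weighted mean of all points.
        centers = [
            sum(u[i] ** m * x for u, x in zip(memberships, points))
            / sum(u[i] ** m for u in memberships)
            for i in range(len(centers))
        ]
    return centers, memberships
```

Because every point keeps a graded membership in every cluster, new samples can shift the centers gradually, which is the property a dynamic cleaning model would rely on.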
Improve Data Quality by Processing Null Values and Semantic Dependencies
10
Authors: Houda Zaidi Faouzi Boufarès Yann Pollet 《Journal of Computer and Communications》 2016, No. 5, pp. 78-85 (8 pages)
Today, the quantity of data continues to increase; furthermore, the data are heterogeneous, from multiple sources (structured, semi-structured and unstructured) and with different levels of quality. Therefore, it is very likely that data are manipulated without knowledge of their structures and their semantics. In fact, the meta-data may be insufficient or totally absent. Data anomalies may be due to the poverty of their semantic descriptions, or even the absence of their description. In this paper, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct the intra-column anomalies and the inter-column ones. We aim to improve the quality of data by processing the null values and the semantic dependencies between columns.
Keywords: Data Quality; Big Data; Contextual Semantics; Semantic Dependencies; Functional Dependencies; Null Values; Data Cleaning
Duplicate identification model for deep web  Cited by: 4
11
Authors: 刘丽楠 寇月 孙高尚 申德荣 于戈 《Journal of Southeast University(English Edition)》 EI CAS, 2008, No. 3, pp. 315-317 (3 pages)
A duplicate identification model is presented to deal with semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the extracted data is converted into entity records in the data preprocessing module; then, the heterogeneous records processing module calculates the similarity degree of the entity records to obtain the duplicate records, based on the weights calculated in the homogeneous records processing module. Unlike traditional methods, the proposed approach is implemented without schema matching in advance. Multiple estimators with selective algorithms are adopted to reach a better matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient.
Keywords: duplicate records; deep web; data cleaning; semi-structured data
Corpus Augmentation for Improving Neural Machine Translation  Cited by: 2
12
Authors: Zijian Li Chengying Chi Yunyun Zhan 《Computers, Materials & Continua》 SCIE EI, 2020, No. 7, pp. 637-650 (14 pages)
The translation quality of neural machine translation (NMT) systems depends largely on the quality of the large-scale bilingual parallel corpora available. Research shows that under the condition of limited resources, the performance of NMT is greatly reduced, and a large amount of high-quality bilingual parallel data is needed to train a competitive translation model. However, not all languages have large-scale and high-quality bilingual corpus resources available. In these cases, improving the quality of the corpora has become the main focus for increasing the accuracy of the NMT results. This paper proposes a new method to improve the quality of data by using data cleaning, data expansion, and other measures to expand the data at the word and sentence level, thus improving the richness of the bilingual data. The long short-term memory (LSTM) language model is also used to ensure the smoothness of the constructed sentences. At the same time, it uses a variety of processing methods to improve the quality of the bilingual data. Experiments using three standard test sets are conducted to validate the proposed method; the most advanced fairseq-transformer NMT system is used in the training. The results show that the proposed method works well in improving the translation results. Compared with the state-of-the-art methods, the BLEU value of our method is increased by 2.34 compared with that of the baseline.
Keywords: Neural machine translation; corpus augmentation; model improvement; deep learning; data cleaning
A Missing Power Data Filling Method Based on Improved Random Forest Algorithm  Cited by: 9
13
Authors: Wei Deng Yixiu Guo Jie Liu Yong Li Dingguo Liu Liang Zhu 《Chinese Journal of Electrical Engineering》 CSCD, 2019, No. 4, pp. 33-39 (7 pages)
Missing data filling is a key step in power big data preprocessing, which helps to improve the quality and the utilization of electric power data. Due to the limitations of the traditional methods of filling missing data, an improved random forest filling algorithm is proposed. Because the electric power data exhibit time series characteristics in both the horizontal and vertical directions, the improved random forest method for filling missing data combines linear interpolation, matrix combination, and matrix transposition to solve the problem of filling large amounts of missing electric power data. The filling results show that the improved random forest filling algorithm is applicable to filling electric power data in various missing forms. What's more, the accuracy of the filling results is high and the stability of the model is strong, which is beneficial in improving the quality of electric power data.
Keywords: Big data cleaning; missing data filling; data preprocessing; random forest; data quality
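Of the ingredients the abstract above lists, linear interpolation is the simplest to illustrate. The sketch below covers only that one step and is not the paper's combined random forest method; the function name `fill_linear`, the use of `None` for missing readings, and the choice to copy the nearest observed value at the edges are all assumptions made for this example.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest observed neighbours.

    Gaps at the start or end of the series have only one neighbour, so they
    copy that nearest observed value instead.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one observed value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            step = (values[right] - values[left]) / (right - left)
            filled[i] = values[left] + (i - left) * step
    return filled
```

For example, `fill_linear([None, 10.0, None, None, 16.0, None])` returns `[10.0, 10.0, 12.0, 14.0, 16.0, 16.0]`. Interpolation of this kind works for short gaps in smooth series; long gaps are where a learned model such as a random forest would be expected to help.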
Annotation Based Query Answer over Inconsistent Database
14
Authors: 吴爱华 谈子敬 汪卫 《Journal of Computer Science & Technology》 SCIE EI CSCD, 2010, No. 3, pp. 469-481 (13 pages)
In this paper, we introduce the concept of Annotation Based Query Answer, and a method for its computation, which can answer queries on relational databases that may violate a set of functional dependencies. In this approach, inconsistency is viewed as a property of data and described with annotations. To be more precise, every piece of data in a relation can have zero or more annotations attached to it, and annotations are propagated along with queries from the source to the output. With annotations, inconsistent data in both input tables and query answers can be marked out but preserved, instead of being filtered out as in most previous work. Thus this approach can avoid information loss, a vital and common deficiency of most previous work in this area. To calculate query answers on an annotated database, we propose an algorithm to annotate the input tables, and redefine the five basic relational algebra operations (selection, projection, join, union and difference) so that annotations can be correctly propagated as the valid set of functional dependencies changes during query processing. We also prove the soundness and completeness of the whole annotation computing system. Finally, we implement a prototype of our system and report some performance experiments, which demonstrate that our approach is reasonable in running time and excellent in information preservation.
Keywords: uncertain data; data quality; consistent query answer; integrity constraints; data cleaning
Impacts of Dirty Data on Classification and Clustering Models:An Experimental Evaluation
15
Authors: Zhi-Xin Qi Hong-Zhi Wang An-Jie Wang 《Journal of Computer Science & Technology》 SCIE EI CSCD, 2021, No. 4, pp. 806-821 (16 pages)
Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied to the selection of an appropriate model with consideration of data quality, and to the determination of the share of data to clean. However, little research has focused on exploring this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that dirty-data impacts are related to the error type, the error rate, and the data size. Based on the findings, we suggest users leverage our proposed metrics, sensibility and data quality inflection point, for model selection and data cleaning.
Keywords: data quality; CLASSIFICATION; CLUSTERING; model selection; data cleaning
Truth Discovery on Inconsistent Relational Data
16
Authors: Jizhou Sun Jianzhong Li Hong Gao Hongzhi Wang 《Tsinghua Science and Technology》 SCIE EI CAS CSCD, 2018, No. 3, pp. 288-302 (15 pages)
In this era of big data, data are often collected from multiple sources that have different reliabilities, and there is inevitable conflict among the various pieces of information obtained about the same object. One important task is to identify the most trustworthy value out of all the conflicting claims, and this is known as truth discovery. Existing truth discovery methods simultaneously identify the most trustworthy information and source reliability degrees, and are based on the idea that more reliable sources often provide more trustworthy information, and vice versa. However, there are often semantic constraints defined upon a relational database, which can be violated by a single data source. To remove violations, an important task is to repair data to satisfy the constraints, and this is known as data cleaning. The two problems above may coexist, and considering them together can provide some benefits, but to the authors' knowledge, this has not yet been the focus of any research. In this paper, therefore, a schema-decomposing based method is proposed to simultaneously discover the truth and clean the data, with the aim of improving accuracy. Experimental results using real-world data sets of notebooks and mobile phones, as well as simulated data sets, demonstrate the effectiveness and efficiency of our proposed method.
Keywords: inconsistent data; truth discovery; data cleaning
Developing an Integrated IoT Cloud Based Predictive Conservation Model for Asset Management in Industry 4.0
17
Authors: Karnam Shanmugam Kachhti Satyam Thimma Reddy Sreenivasula Reddy 《Journal of Social Computing》 EI, 2023, No. 2, pp. 139-149 (11 pages)
With the advent of Industry 4.0 (I4.0), predictive maintenance (PdM) methods have been widely adopted by businesses to deal with the condition of their machinery. With the help of I4.0, digital transformation, information techniques, computerised control, and communication networks, large amounts of data on operational and process conditions can be collected from multiple pieces of equipment and used for automated fault detection and diagnosis, all with the goal of reducing unscheduled maintenance, improving component utilisation, and lengthening the lifespan of the equipment. In this paper, we use smart approaches to create a PdM planning model. The five key steps of the created approach are as follows: (1) cleaning the data, (2) normalising the data, (3) selecting the best features, (4) making a decision about the prediction network, and (5) producing a prediction. At the outset, PdM-related data undergo data cleaning and normalisation to put the data in order and within bounds. The next step is to execute optimal feature selection in order to eliminate unnecessary data. This research presents the golden search optimization (GSO) algorithm, a powerful population-based optimization technique for efficient feature selection. The first phase of GSO is to produce a set of possible solutions or objects at random. These objects then interact with one another using a straightforward mathematical model to find the best feasible answer. Due to the wide range over which the prediction values fall, machine learning and deep learning confront challenges in providing reliable predictions. This is why we recommend a multilayer hybrid convolution neural network (MLH-CNN). While conceptually similar to VGGNet, this approach uses fewer parameters while maintaining or improving classification correctness by adjusting the number of network modules and channels. The proposed model is evaluated on two datasets to show that it can accurately predict the future state of components for maintenance planning.
Keywords: Industry 4.0; predictive maintenance; golden search optimization; multilayer hybrid convolution neural network; data cleaning
A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction  Cited by: 10
18
Authors: Duksan Ryu Jong-In Jang Jongmoon Baik 《Journal of Computer Science & Technology》 SCIE EI CSCD, 2015, No. 5, pp. 969-980 (12 pages)
Software defect prediction (SDP) is an active research field in software engineering to identify defect-prone modules. Thanks to SDP, limited testing resources can be effectively allocated to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP) using external data to build classifiers. The major challenge of CPDP is the different distributions of training and test data. To tackle this, instances of source data similar to target data are selected to build classifiers. Software datasets have a class imbalance problem, meaning the ratio of the defective class to the clean class is very low. This usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs a hybrid classification, selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via naive Bayes). Instances having strong local knowledge are identified via nearest neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which is impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
Keywords: software defect analysis; instance-based learning; nearest-neighbor algorithm; data cleaning
A Novel Approach to Clustering Merchandise Records  Cited by: 3
19
Authors: 程涛远 王珊 《Journal of Computer Science & Technology》 SCIE EI CSCD, 2007, No. 2, pp. 228-231 (4 pages)
Object identification is one of the major challenges in integrating data from multiple information sources. Because global identifiers are lacking, it is hard to find all records referring to the same object in an integrated database. Traditional object identification techniques tend to use character-based or vector space model-based similarity computation in their judgments, but they cannot work well in merchandise databases. This paper puts forward a new approach to object identification. First, we use merchandise images to judge whether two records belong to the same object; then, we use a Naive Bayesian model to judge whether two merchandise names have similar meanings. We conduct experiments on data downloaded from shopping websites, and the results show good performance.
Keywords: object identification; data cleaning; data integration
Matching dependencies: semantics and query answering  Cited by: 2
20
Authors: Jaffer GARDEZI Leopoldo BERTOSSI Iluju KIRINGA 《Frontiers of Computer Science》 SCIE EI CSCD, 2012, No. 3, pp. 278-292 (15 pages)
Matching dependencies (MDs) are used to declaratively specify the identification (or matching) of certain attribute values in pairs of database tuples when some similarity conditions on other values are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the pure case of MD enforcement, an arbitrary value from the underlying data domain can be used for the value in common that is used for a matching. However, the overall number of changes of attribute values is expected to be kept to a minimum. We investigate this case in terms of semantics and the properties of data cleaning through the enforcement of MDs. We characterize the intended clean instances, and also the clean answers to queries, as those that are invariant under the cleaning process. The complexity of computing clean instances and clean query answering is investigated. Tractable and intractable cases depending on the MDs are identified and characterized.
Keywords: databases; data cleaning; duplicate and entity resolution; integrity constraints; matching dependencies