Abstract: As the application software of meteorological satellite ground systems grows more varied, increasing attention is being paid to how to provision hardware resources reasonably and improve software efficiency. This paper proposes a software classification method based on software operating characteristics, which are described by run-time resource consumption. First, principal component analysis (PCA) is used to reduce the dimensionality of the software running-feature data and to interpret the characteristic information. Then a modified K-means algorithm is used to classify the meteorological data processing software. Finally, the clustering results are combined with the PCA results to explain the overall operating characteristics of each software class. The classification serves as a basis for optimizing the allocation of hardware resources to software and improving the efficiency of software operation.
Abstract: Cluster analysis is one of the major data analysis methods and is widely used in practical applications in emerging areas of data mining. A good clustering method produces high-quality clusters with high intra-cluster similarity and low inter-cluster similarity. Clustering techniques are applied in different domains to predict future trends from available data and to put them to real-world use. This work compares the performance of two of the most representative partition-based clustering algorithms, k-Means and k-Medoids. Both algorithms are implemented, and their performance is assessed in terms of clustering quality, execution time, and other factors. Telecommunication data is the source data for the analysis: connection-oriented broadband data is given as input, and the distance between the server locations and their connections is used for clustering. The execution time of each algorithm is measured and the results are compared with one another. The results of the comparison are satisfactory for the chosen application.
Abstract: In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, that minimizes the mean squared distance from each data point to its nearest center. In this paper, we present a simple and efficient clustering algorithm based on the k-means algorithm, which we call the enhanced k-means algorithm. The algorithm is easy to implement: it requires only a simple data structure that keeps some information from each iteration for use in the next. Our experimental results demonstrate that the scheme improves the computational speed of the k-means algorithm in terms of both the total number of distance calculations and the overall computation time.
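To make the objective in the abstract above concrete, here is a minimal Python sketch of the standard Lloyd iteration together with the mean-squared-distance objective. It is not the enhanced k-means algorithm the abstract describes, whose per-iteration data structure is not specified here. The function names, the random initialization, and the `max_iter` and `seed` parameters are illustrative assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd iteration (not the enhanced variant from the abstract)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct points sampled from the data.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

def mean_squared_distance(points, centers):
    """k-means objective: mean squared distance to the nearest center."""
    return sum(
        min(squared_distance(p, c) for c in centers) for p in points
    ) / len(points)

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels, mean_squared_distance(data, centers))
```

Every iteration of this baseline recomputes all n × k distances, which is the cost that distance-saving variants such as the one in the abstract aim to reduce.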
Abstract: As cash register systems have become prevalent in shopping malls, detecting abnormal states of these systems has become a topic of growing interest. This paper analyzes the transaction data of a shopping mall. To measure the degree of difference between data records, the coefficient of variation is used as the attribute weight and the weighted Euclidean distance is used as the measure of difference; k-means clustering is then used to classify the different time periods. The LOF algorithm is applied to compute the outlier degree of the transaction data in each time period, an initial threshold is set to detect outliers, the outliers are deleted, and SAX detection is then performed on the data set. If the data set does not pass the test, the outlying domain is gradually expanded and the process is repeated, optimizing the outlier threshold to improve the sensitivity of the detection algorithm and reduce false positives.
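The weighting step in the abstract above can be sketched in Python as follows. The abstract does not give the exact formulas, so this sketch makes three assumptions: the coefficient of variation is the population standard deviation divided by the absolute mean, the weights are normalized to sum to 1, and every attribute has a nonzero mean. The function names are illustrative.

```python
import math
import statistics

def cv_weights(rows):
    """Weight each attribute by its coefficient of variation.

    Assumptions: CV = population std / |mean|, weights normalized to sum
    to 1, and every attribute has a nonzero mean.
    """
    columns = list(zip(*rows))
    cvs = [statistics.pstdev(col) / abs(statistics.fmean(col)) for col in columns]
    total = sum(cvs)
    return [cv / total for cv in cvs]

def weighted_euclidean(p, q, weights):
    """Euclidean distance with a per-attribute weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))

if __name__ == "__main__":
    # Hypothetical records: (transaction amount, item count) per time period.
    rows = [(10.0, 100.0), (20.0, 100.0), (30.0, 100.0)]
    w = cv_weights(rows)
    print(w, weighted_euclidean(rows[0], rows[2], w))
```

Under these assumptions, an attribute that barely varies relative to its mean receives a weight near zero, so the distance is dominated by the attributes that actually differ between records.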
Abstract: Most clustering algorithms describe the similarity of objects by a predefined distance function. Three distance functions that are widely used in two traditional clustering algorithms, k-means and hierarchical clustering, were investigated, and both theoretical analysis and detailed experimental results are given. The results show that the distance function greatly affects the clustering results, and that comparing the results obtained with different distance functions can be used to detect the outliers of a cluster and to obtain information about cluster shape. In practice, it is suggested to apply the different distance functions separately, compare the clustering results, and pick out the "swing points". Such points may reveal more information to data analysts.
Funding: Supported by the National Defence Science and Technology Research Foundation of China (No. 99J15.3.2.JW0116)
Abstract: The k-means clustering algorithm is one of the most commonly used algorithms for cluster analysis. The traditional k-means algorithm is, however, inefficient on large data sets, and improving its efficiency remains an open problem. This paper focuses on the efficiency of clustering algorithms. A method for refining the initial cluster centers is designed to reduce the number of iterations of the algorithm. A parallel k-means algorithm is also studied to address the computational limits of a single-processor machine on huge data sets. The analytical results demonstrate that these improvements can greatly enhance the efficiency of the k-means algorithm, i.e., they allow large data sets to be grouped more accurately and more quickly. The analysis has theoretical and practical importance for work on the improvement and parallelization of clustering algorithms.
Abstract: With the sharp increase in information volume, analyzing and retrieving data is more essential than ever. Clustering is one of the main techniques for this purpose. It aims to group objects so that objects within a cluster have similar features while objects in different clusters are as distinct as possible. One of the most widely used clustering algorithms, with well-proven performance in many applications, is the k-means algorithm. Its main problem is that its performance is directly affected by the selection of the initial clusters. Neglecting this issue can create empty clusters and slow convergence. In addition, selecting appropriate initial seeds can reduce cluster inconsistency. In this paper, we present a new method for determining the initial seeds of the k-means algorithm that improves accuracy and decreases the number of iterations. The method uses the average distance between objects to determine the initial seeds, and attempts to provide a proper tradeoff between the accuracy and the speed of the clustering algorithm. The experimental results show that the proposed approach outperforms the Chithra method by 1.7% and 2.1% in clustering accuracy on the Wine and Abalone detection data, respectively. Furthermore, the results indicate that the proposed method converges faster than the Reverse Nearest Neighbor (RNN) search approach.
Abstract: Data analysis and automatic processing are often interpreted as knowledge acquisition. In many cases it is necessary to classify data or find regularities in them. In intelligent data analysis applications, the regularities found are mostly represented as IF-THEN rules, which are used for tasks such as prediction, classification, and pattern recognition. Using different approaches (clustering algorithms, neural network methods, fuzzy rule processing methods), we can extract rules that characterize the data in an understandable language. This makes it possible to interpret the data, find relationships in them, and extract new rules that characterize them. In this paper, knowledge acquisition is defined as the process of extracting knowledge from numerical data in the form of rules. Rule extraction here is based on the K-means and fuzzy C-means clustering methods. With the K-means clustering algorithm, rules are derived from trained neural networks; fuzzy C-means is used in the fuzzy rule-based design method. The rule extraction methodology is demonstrated on samples from Fisher's Iris flower data set, and the effectiveness of the extracted rules is evaluated. Clustering and rule extraction methodology can be widely used in evaluating and analyzing various economic and financial processes.
Funding: Project on the Integration of Industry, Education and Research of Henan Province (Grant No. 142107000055, 162107000026); Scientific and Technological Project of Henan Province (Grant No. 152102210190, 162102210202); Natural Science Foundation of Henan Educational Committee (Grant No. 14B416004, 14A416002 and 13A416264); Key Project of Henan Tobacco Company (HYKJ201316); Innovation Ability Foundation of Natural Science (Grant No. 2013ZCX002) of Henan University of Science and Technology.
Abstract: Low Energy Adaptive Clustering Hierarchy (LEACH) is a routing algorithm used in agricultural wireless multimedia sensor networks (WMSNs), with two improved protocols, LEACH_D and LEACH_E. This study addresses shortcomings of these widely used protocols. An improved algorithm is proposed to solve existing problems such as energy source restriction, communication distance, and the energy of the nodes. The optimal number of clusters is calculated from the first-order radio model to determine the percentage of cluster heads in the network. Nodes with high energy that are near the sink node are chosen as cluster heads, based on the residual energy of the nodes and the distance between each node and the sink node. At the same time, the K-means clustering analysis method is used to assign the nodes equally to several clusters in the network. Both simulation and verification results show that the number of surviving nodes under the proposed algorithm, LEACH-ED, increased by 66%. Moreover, the network load was high and the network lifetime was longer. The mathematical model relating the average voltage of the nodes (y) to the running time (x) is given by the equation y = −0.0643x + 4.3694, with a correlation coefficient of R² = 0.9977. The research results can provide a foundation and method for the design and simulation of routing algorithms in agricultural WMSNs.
Abstract: Most of the earlier work on clustering focused on numeric data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, data mining applications frequently involve data sets that consist of mixed numeric and categorical attributes. In this paper we present a clustering algorithm based on the k-means algorithm. The algorithm clusters objects with numeric and categorical attributes in a way similar to k-means. The object similarity measure is derived from both the numeric and the categorical attributes. When applied to purely numeric data, the algorithm is identical to k-means. The main result of this paper is a method for updating the "cluster centers" of objects described by mixed numeric and categorical attributes during the clustering process so as to minimize the clustering cost function. The clustering performance of the algorithm is demonstrated on two well-known data sets, the credit approval and abalone databases.
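One common way to build a k-means-like method for mixed data, sketched below in Python, is the k-prototypes-style combination of squared Euclidean distance on numeric attributes and a weighted mismatch count on categorical attributes, with centers updated by mean and mode. The abstract does not spell out its own similarity measure or update rule, so this is an assumed stand-in rather than the paper's method. The weight `gamma` and all function names are hypothetical.

```python
from collections import Counter

def mixed_dissimilarity(obj, center, numeric_idx, categorical_idx, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus gamma times
    the number of mismatching categorical attributes (assumed measure)."""
    numeric_part = sum((obj[i] - center[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(obj[i] != center[i] for i in categorical_idx)
    return numeric_part + gamma * categorical_part

def update_center(members, numeric_idx, categorical_idx):
    """Assumed update rule: mean for numeric attributes, most frequent
    value (mode) for categorical attributes."""
    center = [None] * len(members[0])
    for i in numeric_idx:
        center[i] = sum(m[i] for m in members) / len(members)
    for i in categorical_idx:
        center[i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return tuple(center)

if __name__ == "__main__":
    # Hypothetical objects: (numeric attribute, categorical attribute).
    cluster = [(1.0, "a"), (3.0, "a"), (5.0, "b")]
    center = update_center(cluster, numeric_idx=[0], categorical_idx=[1])
    print(center, mixed_dissimilarity(cluster[2], center, [0], [1], gamma=2.0))
```

With an empty `categorical_idx`, the measure reduces to squared Euclidean distance and the update to the ordinary mean, which matches the abstract's statement that the algorithm coincides with k-means on purely numeric data.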
Abstract: Objective: Following the RFM model theory of customer relationship management, data mining technology was used to group patients with chronic infectious diseases, in order to explore the effect of customer segmentation on the management of patients with different characteristics. Methods: A total of 170,246 outpatient records were extracted from the hospital management information system (HIS) for January 2016 to July 2016; 43,448 records remained after data cleaning. The K-Means clustering algorithm was used to classify patients with chronic infectious diseases, and the C5.0 decision tree algorithm was then used to predict their treatment situation. Results: Male patients accounted for 58.7%, and patients living in Shanghai accounted for 85.6%. The average age of the patients was 45.88 years, and the high-incidence age range was 25 to 65 years. Patients were grouped into three categories: 1) Cluster 1, important patients (4786 people, 11.72%, R = 2.89, F = 11.72, M = 84,302.95); 2) Cluster 2, major patients (23,103 people, 53.2%, R = 5.22, F = 3.45, M = 9146.39); 3) Cluster 3, potential patients (15,559 people, 35.8%, R = 19.77, F = 1.55, M = 1739.09). In the C5.0 decision tree prediction of the treatment situation, the final treatment time (weeks) was an important predictor, and the accuracy rate was 99.94% as verified by the confusion matrix. Conclusion: Medical institutions should strengthen adherence education for patients with chronic infectious diseases, establish a chronic infectious disease and customer relationship management database, and take the initiative to help patients improve treatment adherence. Chinese governments at all levels should speed up the construction of hospital information systems, establish a chronic infectious disease database, and strengthen the blocking of mother-to-child transmission, in order to effectively curb chronic infectious diseases and reduce disease burden and mortality.
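As a rough illustration of how RFM (recency, frequency, monetary) features might be derived from visit records before clustering, consider the Python sketch below. The record layout, the use of days as the recency unit, and the function name `rfm` are assumptions; the abstract does not state how R, F, and M were measured.

```python
from datetime import date

def rfm(visits, reference_date):
    """Return {patient_id: (recency_days, frequency, monetary)}.

    Assumptions: each visit is (patient_id, visit_date, amount); recency is
    the number of days from the latest visit to reference_date; frequency is
    the visit count; monetary is the total amount.
    """
    summary = {}
    for patient_id, visit_date, amount in visits:
        last, count, total = summary.get(patient_id, (visit_date, 0, 0.0))
        summary[patient_id] = (max(last, visit_date), count + 1, total + amount)
    return {
        pid: ((reference_date - last).days, count, total)
        for pid, (last, count, total) in summary.items()
    }

if __name__ == "__main__":
    # Hypothetical outpatient records.
    visits = [
        ("p1", date(2016, 1, 5), 100.0),
        ("p1", date(2016, 6, 30), 50.0),
        ("p2", date(2016, 3, 1), 20.0),
    ]
    print(rfm(visits, reference_date=date(2016, 7, 31)))
```

The three resulting features have very different scales, so they would normally be standardized before being passed to a distance-based method such as K-Means. Whether the study did so is not stated.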