Effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework based on the Aliyun DTplus platform is proposed that supports both MapReduce and Graph computation for massive monitoring data analysis. First, condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests compare the execution time of the serial and parallel programs, and performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
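To make the permutation entropy feature in the abstract above concrete, the following is a minimal serial sketch for a single signal. It is not the authors' MaxCompute MapReduce implementation. The function name `permutation_entropy` and the parameters `order`, `delay`, and `normalize` are illustrative assumptions. In a parallel setting, a function like this would presumably be applied to each signal segment independently.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy (base 2) of the ordinal patterns in `series`.

    `order` is the pattern length and `delay` the spacing between samples;
    both names are assumptions made for this sketch.
    """
    if order < 2:
        raise ValueError("order must be at least 2")
    n = len(series) - (order - 1) * delay
    if n <= 0:
        raise ValueError("series too short for the chosen order and delay")

    counts = Counter()
    for i in range(n):
        window = [series[i + j * delay] for j in range(order)]
        # Ordinal pattern: the indices that sort the window
        # (ties keep their original order because sorted() is stable).
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] += 1

    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    if normalize:
        # Divide by the maximum possible entropy, log2(order!).
        entropy /= math.log2(math.factorial(order))
    return entropy
```

A strictly increasing signal contains a single ordinal pattern, so its entropy is zero. Irregular signals spread over more patterns and score closer to 1 when normalized.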
Because many factors are involved, with different dimensions, widely varying data, and incomplete information, selecting the optimal regional outburst prevention measures for an outburst-prone mine has become a complicated systems engineering problem. The traditional way of selecting outburst prevention measures is qualitative, which may lead to high gas control costs, large quantities of drilling work, long construction times, and even secondary disasters. To solve these problems, in light of the occurrence status of coal seam gas in the No. 21 mining area of Jinzhushan Tuzhu Mine, a multi-objective classification model for optimizing outburst prevention measures was established using grey fixed weight clustering theory and a combination of qualitative and quantitative analysis. The three weight coefficients of each outburst prevention technology scheme are sorted to determine the advantages and disadvantages of each scheme under a comprehensive multi-objective evaluation. Finally, the problem of quantitatively selecting a regional outburst prevention technology scheme is solved under multi-factor conditions and incomplete information, which provides reasonable and effective technical measures for preventing coal and gas outburst disasters.
A novel multivariate similarity clustering analysis (MSCA) approach was used to estimate a biogeographical division scheme for the global terrestrial fauna and was compared against other widely used clustering algorithms. The faunal dataset included almost all terrestrial and freshwater fauna, a total of 4631 families, 141,814 genera, and 1,334,834 species. Our findings demonstrated that suitable results were obtained only with the MSCA method, which produced distinct hierarchies and reasonable structuring and conformed to biogeographical criteria. A total of seven kingdoms and 20 sub-kingdoms were identified. We discovered that the clustering results for the higher and lower animals did not differ significantly, leading us to consider the analysis result convincing as the first zoogeographical division scheme for all global terrestrial animals.
[Objectives] To use co-clustering analysis and visualization methods to analyze the research on Siraitiae Fructus in the last ten years and to identify the hot spots and trends of that research. [Methods] Relevant research results on S. Fructus in CNKI from January 2007 to December 2016 were retrieved by computer; the retrieval date was February 20, 2017. BICOMB, NetDraw, gCLUTO and SPSS 19.0 software were used to conduct co-clustering analysis and visualization analysis of the included articles. Keywords were analyzed, and a social network graph, visualization matrix, peak image and multidimensional scaling analysis map were drawn. Correlations among high-frequency keywords were analyzed. [Results] A total of 723 articles were included, among which 70 articles were published during 2012-2016; 76 keywords were obtained from the keyword co-occurrence network map, among which mogroside, MOG, extraction process, tissue culture, cultivation technology, varieties, and growth and development were in the core position; the visualization and the peak image showed that the topics in this research field could be divided into 6 categories; the dynamic evolution of research hot spots showed that S. Fructus flower, beverage, total flavonoids, gene expression, gene cloning, enzyme, apoptosis, and S. Fructus seed oil would be the hot spots of further study. [Conclusions] This study reveals that the research on S. Fructus in the last ten years is maturing and expanding to a deeper level. The approach can be extended to evaluating discipline development in the TCM research field.
Neural stem cells, which are capable of multi-potential differentiation and self-renewal, have recently been shown to have clinical potential for repairing central nervous system tissue damage. However, the theme trends and knowledge structures for human neural stem cells have not yet been studied bibliometrically. In this study, we retrieved 2742 articles from the PubMed database from 2013 to 2018 using "Neural Stem Cells" as the retrieval word. Co-word analysis was conducted to statistically quantify the characteristics and popular themes of human neural stem cell-related studies. Bibliographic data matrices were generated with the Bibliographic Item Co-Occurrence Matrix Builder. We identified 78 high-frequency Medical Subject Heading (MeSH) terms. A visual matrix was built with the repeated bisection method in gCLUTO software. A social network analysis network was generated with Ucinet 6.0 software and GraphPad Prism 5 software. The analyses demonstrated that in the 6-year period, hot topics were clustered into five categories. As suggested by the constructed strategic diagram, studies related to cytology and physiology were well developed, whereas those related to neural stem cell applications, tissue engineering, metabolism and cell signaling, and neural stem cell pathology and virology remained immature. Neural stem cell therapy for stroke and Parkinson's disease, the genetics of microRNAs and brain neoplasms, as well as neuroprotective agents, Zika virus, Notch receptor, neural crest and embryonic stem cells were identified as emerging hot spots. These undeveloped themes and popular topics are potential points of focus for new studies on human neural stem cells.
Clustering has recently become a challenging task because of the ever-growing amount of data in scientific research and commercial applications. High-quality, fast document clustering algorithms are in great demand to deal with large volumes of data, and the computational requirements for bringing such a growing amount of data to a central site for clustering are complex. The proposed algorithm uses Particle Swarm Optimization (PSO) to find optimal centroids for K-Means clustering. PSO is used for its global search ability, which yields centroids that help generate more compact clusters with improved accuracy. The proposed methodology uses Hadoop and the MapReduce framework, which provide distributed storage and analysis to support data-intensive distributed applications. Experiments performed on the Reuters and RCV1 document datasets show an improvement in accuracy with reduced execution time.
The application of data mining in astronomical surveys, such as the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, provides an effective approach to automatically analyzing a large amount of complex survey data. Unsupervised clustering could help astronomers find the associations and outliers in a big data set. In this paper, we employ the k-means method to perform clustering for the line index of LAMOST spectra with the powerful software AstroStat. Implementing the line index approach for analyzing astronomical spectra is an effective way to extract spectral features for low-resolution spectra, which can represent the main spectral characteristics of stars. A total of 144,340 line indices for A-type stars are analyzed by calculating their intra and inter distances between pairs of stars. For intra distance, we use the definition of Mahalanobis distance to explore the degree of clustering for each class, while for outlier detection, we define a local outlier factor for each spectrum. AstroStat furnishes a set of visualization tools for illustrating the analysis results. Checking the spectra detected as outliers, we find that most of them are problematic data and only a few correspond to rare astronomical objects. We show two examples of these outliers, a spectrum with abnormal continuum and a spectrum with emission lines. Our work demonstrates that line index clustering is a good method for examining data quality and identifying rare objects.
The goal of this study was to optimize the constitutive parameters of foundation soils using a k-means algorithm with clustering analysis. A database was collected from unconfined compression tests, Proctor tests and grain distribution tests of soils taken from three different types of foundation pits: raft foundations, partial raft foundations and strip foundations. The k-means algorithm with clustering analysis was applied to determine the most appropriate foundation type given the unconfined compression strengths and other parameters of the different soils.
Hierarchical clustering analysis based on statistics is one of the most important mining algorithms, but the traditional hierarchical clustering method is based on global comparison and in practice uses only Q clustering while ignoring R clustering, so it has limitations, especially when the numbers of samples and indexes are very large. Furthermore, because the association between different indexes is ignored, the clustering result is neither good nor reliable. In this paper, we present the model and algorithm of two-level hierarchical clustering, which integrates Q clustering with R clustering. Because two-level hierarchical clustering is based on the separate clustering result of each class, the classification of the indexes directly affects the accuracy of the final clustering result, so how to classify the indexes appropriately is the chief and most difficult problem to handle in advance. Although some of the literature has addressed the classification of indexes, those articles classify the indexes only according to their superficial meaning, which is unscientific. The reasons are as follows. First, the superficial meaning of some indexes can be read in different ways and is easily misunderstood by different people; moreover, this classification method seldom makes use of historical data, so the classification result is not objective. Second, the superficial meaning of some indexes conveys nothing, so they cannot be assigned to particular classes on that basis alone. Third, this classification method requires users to have a high level of domain knowledge, which is not always available; otherwise it is difficult for them to understand the meaning of some indexes. To address this question, we first use the R clustering method to cluster the indexes, dividing the p-dimensional indexes into q classes, and then adopt the two-level clustering method to obtain the final result. The classification result is more objective and accurate. Moreover, after the first step, we obtain the relations among the different indexes and their interactions, and we also learn which samples can be clustered into one class under a given class of indexes. (These intermediate results are sometimes very useful.) The experiments also indicate the effectiveness and accuracy of the algorithms, and the result of R clustering can easily be used in later practice.
The currently used methods for analyzing a number of focal mechanism solutions are often ineffective for large samples. With the aid of the basic concept of hierarchical clustering methods for pattern recognition, and in combination with the expression of focal mechanism solutions themselves, the sum of the angle between the P-axes and the angle between the T-axes of two solutions is defined as a distance, and software for hierarchical clustering analysis by the shortest distance method and the longest distance method is compiled. The number of types in the clustering results can be determined in accordance with different requirements. For focal mechanism solutions of the same type, the average position of each stress axis can be calculated by the method of vector composition, and thereby the spatial orientation of the average focal mechanism solution can be determined. In order to test the feasibility and reliability of the software, hierarchical clustering analyses are made for the focal mechanism solutions of 24
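The distance definition in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the software the authors describe. Each solution is taken to be a pair of 3-component vectors for the P and T axes. The axes are treated as undirected lines, so the angle between two axes never exceeds 90 degrees. The names `axis_angle`, `mechanism_distance`, and `agglomerate` are hypothetical. Passing `min` or `max` as `linkage` corresponds to the shortest and longest distance methods.

```python
import math


def axis_angle(u, v):
    """Acute angle in degrees between two undirected axes (3-vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # abs() makes the result independent of axis direction; min() guards
    # against rounding slightly above 1.
    return math.degrees(math.acos(min(1.0, abs(dot) / norm)))


def mechanism_distance(s1, s2):
    """Sum of the P-axis angle and the T-axis angle between two solutions."""
    (p1, t1), (p2, t2) = s1, s2
    return axis_angle(p1, p2) + axis_angle(t1, t2)


def agglomerate(solutions, n_clusters, linkage=min):
    """Merge clusters of solution indices until `n_clusters` remain.

    linkage=min gives the shortest distance (single linkage) method,
    linkage=max the longest distance (complete linkage) method.
    """
    clusters = [[i] for i in range(len(solutions))]

    def gap(a, b):
        return linkage(
            mechanism_distance(solutions[i], solutions[j]) for i in a for j in b
        )

    while len(clusters) > max(1, n_clusters):
        # Find the closest pair of clusters under the chosen linkage.
        _, a, b = min(
            (gap(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Stopping at a requested number of clusters mirrors the abstract's remark that the number of types can be chosen according to different requirements.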
Objective: The aim of this study is to discover the research status and hotspots of economic evaluation (EE) in the nursing area using co-word cluster analysis. Methods: The Medical Subject Heading (MeSH) term "cost–benefit analysis" was searched in PubMed, and the results were limited to nursing journals with the filter function. The author, country, year, journal, and keywords of the collected papers were extracted and exported to the Bicomb 2.0 system, where high-frequency terms and other data could be further mined. SPSS 19.0 was used for cluster analysis to generate a dendrogram. Results: In all, 3,020 articles were found and 10,573 MeSH terms were detected; among them, 1,909 were MeSH major topics, which generated 42 high-frequency terms. The dendrogram showed seven clusters, representing seven research hotspots: skin administration, infection prevention, education program, nurse education and management, EE research, neoplasm patient, and extension of nurse function. Conclusions: Nursing EE research involves multiple aspects of the nursing area and is an important indicator for decision-making. Although the number of papers is increasing, the quality of the studies is not promising. Therefore, further study may be required to assess nurses' knowledge of economic analysis methods and their attitude toward applying them in nursing research. More nursing economics courses could be offered in nursing schools or hospitals.
A novel Support Vector Machine (SVM) ensemble approach using clustering analysis is proposed. First, the positive and negative training examples are clustered separately with the subtractive clustering algorithm. Then some representative examples are chosen from each of them to construct the SVM components. Finally, the outputs of the individual classifiers are fused by majority voting to obtain the final decision. The performance of the proposed method is compared with that of other popular ensemble approaches, such as Bagging, Adaboost and k-fold cross validation, on synthetic and UCI datasets. The experimental results show that our method has higher classification accuracy because the example distribution information is taken into account during ensemble construction through clustering analysis. They further indicate that our method needs much smaller training subsets than Bagging and Adaboost to obtain satisfactory classification accuracy.
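The final fusion step in the abstract above, majority voting over the component classifiers, can be sketched in a few lines. This covers only the voting step. The subtractive clustering and SVM training are not reproduced. The function name `majority_vote` and the input layout (one list of predicted labels per classifier) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier label lists into one label per sample.

    `predictions[k][i]` is the label that classifier k assigns to sample i.
    On a tie, the label seen first among the votes wins, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    fused = []
    for votes in zip(*predictions):
        label, _ = Counter(votes).most_common(1)[0]
        fused.append(label)
    return fused
```

With an odd number of binary classifiers ties cannot occur, which is one reason ensembles of this kind often use an odd component count.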
In this study, the world's land (except Antarctica) is divided into 67 basic geographical units according to ecological types. Using our newly proposed MSCA (Multivariate Similarity Clustering Analysis) method, 7,591 species of modern terrestrial mammals belonging to 1,374 genera in 162 families, and 2,378 species of mammals known in the Wallace era before 1876, are quantitatively analyzed. Almost the same clustering results are obtained, with clear levels and reasonable clustering, which conform to the principles of geography, statistics, ecology and biology. The results not only affirm and support the reasonable core of Wallace's scheme but also suggest where it should be revised and improved. The larger and smaller differences between the clustering results and the mammalian geographical zoning schemes of contemporary scholars are caused by different analysis methods, and the results are highly consistent with those obtained for the world's chordates, angiosperms and insects by the same method. This again confirms the homogeneity of the global distribution pattern of major biological groups and the possibility of building a unified biogeographic zoning system for the world.
During power system black start after an accident, decomposing the power system into several independent partitions for parallel recovery can help optimize resource allocation and accelerate the recovery process. Taking full account of the fuzziness of black-start zone partitioning, a new algorithm based on fuzzy clustering analysis is presented. Characteristic indexes are extracted fully and accurately. The raw data matrix is made up of the electrical distances between every node and the black-start resources. The transitive closure method is used to obtain the dynamic clustering. Finally, the availability and feasibility of the proposed algorithm are verified on the New England 39-bus system.
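The transitive closure step named in the abstract above is a standard fuzzy clustering technique and can be sketched as follows. The sketch assumes that a reflexive, symmetric fuzzy similarity matrix has already been built. How the authors derive it from electrical distances is not stated in the abstract and is not reproduced here. The function names and the example matrix are hypothetical.

```python
def maxmin_compose(r, s):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(r)
    return [
        [max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Square the relation repeatedly until it stops changing."""
    while True:
        squared = maxmin_compose(r, r)
        if squared == r:
            return r
        r = squared


def lambda_cut_clusters(closure, lam):
    """Group indices whose closure similarity is at least `lam`.

    The closure is a fuzzy equivalence relation, so each row directly
    yields the equivalence class of its index.
    """
    seen, clusters = set(), []
    for i in range(len(closure)):
        if i in seen:
            continue
        group = [j for j in range(len(closure)) if closure[i][j] >= lam]
        seen.update(group)
        clusters.append(group)
    return clusters


# Hypothetical similarity matrix for three nodes.
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
closure = transitive_closure(similarity)
```

Sweeping the threshold `lam` from 1 down to 0 merges groups step by step, which is the sense in which the resulting clustering is dynamic: the partition is chosen by picking a threshold rather than fixing the number of zones in advance.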
Clustering analysis, which identifies unknown heterogeneous subgroups of a population (or a sample), has become increasingly popular along with machine learning techniques. Although many software packages run clustering analysis, there is a lack of packages that conduct clustering analysis within a structural equation modeling framework. The package gscaLCA, which is implemented in the R statistical computing environment, was developed for conducting clustering analysis and has been extended to latent variable modeling. More specifically, by applying both the fuzzy clustering (FC) algorithm and generalized structured component analysis (GSCA), the package gscaLCA computes membership prevalence and item response probabilities as posterior probabilities, which is applicable to mixture modeling such as latent class analysis in statistics. As a hybrid between data clustering in classification and the model-based mixture modeling approach, fuzzy clusterwise GSCA, denoted gscaLCA, combines advantages of both methods: (1) soft partitioning from FC and (2) efficiency in estimating model parameters with the bootstrap method via the resolution of the global optimization problem from GSCA. The main function, gscaLCA, works for both binary and ordered categorical variables. In addition, gscaLCA can be used for latent class regression. Visualization of the profiles of latent classes based on the posterior probabilities is also available in the package. This paper contributes a methodological tool, gscaLCA, that applied researchers such as social scientists and medical researchers can use to apply clustering analysis in their research.
Based on the records of the seismic station networks of China's mainland and the Korean Peninsula, together with historical data, the complete seismicity pattern was obtained for the first time. Seismic zoning was conducted by means of the cluster analysis method. The map of the spatial distribution of seismicity from 1960 to 1994 shows that there are three strong seismic zones: the first strikes in the NE direction, from the Jiangsu plain in China to the central Korean Peninsula; the second strikes in the NW direction, from the Bohai Sea, China, to the southern Korean Peninsula; the third strikes in the NW direction, from western Liaoning Province to Pyongyang. Most earthquakes are located along these three zones; the seismic intensity is lower than that on the mainland and exhibits the features of the fractured crust of a marginal sea basin.
[Objective] The research aimed to cluster six climatic factors in the tobacco planting zone of Yunnan. [Method] Clustering analysis was conducted on 6 meteorological elements in 89 tobacco-growing counties and 12 sub-prefectures. The climate of the tobacco planting area of Yunnan Province was then divided according to the indicators and climate characteristics of each type. [Result] The climate of the tobacco planting area of Yunnan Province could be divided into eight types, represented by Jiangchuan (24 counties, in the northern and central subtropical climate belts), Songming (27 counties, in the northern subtropical and central, south, and north temperate climate belts), Tengchong (3 counties, in the northern subtropical climate belt), Mile (12 counties, in the central and southern subtropical climate belts), Qiubei (11 counties, in the southern subtropical climate belt), Yanjin (4 counties, in the central subtropical humid climate belt), Yuanjiang (4 counties, in the southern subtropical and northern tropical climate belts), and Zhenxiong (3 counties, in the warm temperate and northern subtropical climate belts). Among eco-zones 1-8, the numbers of domestic and foreign cities whose climate reached level-one similarity were 3, 1, 1, 0, 1, 1, 0 and 1, respectively, and the numbers reaching level-two similarity were 12, 15, 3, 13, 13, 1, 5 and 3, respectively. Among the 8 major ecological zones, the similarity distance of cities reaching level-one similarity ranged from 0.28 to 0.45, the highest degree of similarity, so variety introduction among these places would be successful. The similarity distance of cities reaching level-two similarity was between 0.51 and 1.00, a relatively high degree of similarity, so mutual variety introduction among these places would have a high success rate. [Conclusion] The research provides a theoretical basis for selecting new suitable tobacco varieties and optimizing the tobacco variety layout in different zones.
Funding (power device condition monitoring study): This work was supported by the Central University Research Fund (No. 2016MS116, No. 2016MS117, No. 2018MS074) and the National Natural Science Foundation (51677072).
Funding (Siraitiae Fructus co-clustering study): Supported by the Guangxi Major Scientific Research and Technological Development Project (Gui Ke Zhong 1355001-4, 14124002-11).
Funding (neural stem cell bibliometric study): Supported by the National Natural Science Foundation of China, No. 81471308 (to JL); the Stem Cell Clinical Research Project in China, No. CMR-20161129-1003 (to JL); and the Innovation Technology Funding of Dalian in China, No. 2018J11CY025 (to JL).
文摘Neural stem cells,which are capable of multi-potential differentiation and self-renewal,have recently been shown to have clinical potential for repairing central nervous system tissue damage.However,the theme trends and knowledge structures for human neural stem cells have not yet been studied bibliometrically.In this study,we retrieved 2742 articles from the PubMed database from 2013 to 2018 using "Neural Stem Cells" as the retrieval word.Co-word analysis was conducted to statistically quantify the characteristics and popular themes of human neural stem cell-related studies.Bibliographic data matrices were generated with the Bibliographic Item Co-Occurrence Matrix Builder.We identified 78 high-frequency Medical Subject Heading(MeSH)terms.A visual matrix was built with the repeated bisection method in gCLUTO software.A social network analysis network was generated with Ucinet 6.0 software and GraphPad Prism 5 software.The analyses demonstrated that in the 6-year period,hot topics were clustered into five categories.As suggested by the constructed strategic diagram,studies related to cytology and physiology were well-developed,whereas those related to neural stem cell applications,tissue engineering,metabolism and cell signaling,and neural stem cell pathology and virology remained immature.Neural stem cell therapy for stroke and Parkinson’s disease,the genetics of microRNAs and brain neoplasms,as well as neuroprotective agents,Zika virus,Notch receptor,neural crest and embryonic stem cells were identified as emerging hot spots.These undeveloped themes and popular topics are potential points of focus for new studies on human neural stem cells.
Abstract: Clustering has recently become a challenging task because of the ever-growing amount of data in scientific research and commercial applications. High-quality, fast document clustering algorithms are in great demand to deal with large volumes of data. The computational requirements for bringing such a growing amount of data to a central site for clustering are complex. The proposed algorithm uses optimal centroids for K-Means clustering based on Particle Swarm Optimization (PSO). PSO is used to take advantage of its global search ability to provide optimal centroids, which aids in generating more compact clusters with improved accuracy. The proposed methodology utilizes the Hadoop and MapReduce framework, which provides distributed storage and analysis to support data-intensive distributed applications. Experiments were performed on the Reuters and RCV1 document datasets and show an improvement in accuracy with reduced execution time.
Funding: Supported by the Joint Research Fund in Astronomy (U1631239) under cooperative agreement between the National Natural Science Foundation of China (NSFC) and Chinese Academy of Sciences (CAS); the International Science and Technology Cooperation Program of China (2014DFE10030); and the Basic Science and Engineering Special Project of Heilongjiang Province Education Department (135109219).
Abstract: The application of data mining in astronomical surveys, such as the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, provides an effective approach to automatically analyze a large amount of complex survey data. Unsupervised clustering could help astronomers find the associations and outliers in a big data set. In this paper, we employ the k-means method to perform clustering for the line index of LAMOST spectra with the powerful software AstroStat. Implementing the line index approach for analyzing astronomical spectra is an effective way to extract spectral features for low resolution spectra, which can represent the main spectral characteristics of stars. A total of 144 340 line indices for A type stars is analyzed through calculating their intra and inter distances between pairs of stars. For intra distance, we use the definition of Mahalanobis distance to explore the degree of clustering for each class, while for outlier detection, we define a local outlier factor for each spectrum. AstroStat furnishes a set of visualization tools for illustrating the analysis results. Checking the spectra detected as outliers, we find that most of them are problematic data and only a few correspond to rare astronomical objects. We show two examples of these outliers, a spectrum with abnormal continuum and a spectrum with emission lines. Our work demonstrates that line index clustering is a good method for examining data quality and identifying rare objects.
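As a hedged illustration of the Mahalanobis distance mentioned in this abstract, the sketch below computes $d(x) = \sqrt{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}$ for one feature vector against the sample mean and covariance of a class. It does not reproduce AstroStat; the function names and the small linear solver are assumptions, and the covariance matrix is assumed to be non-singular.

```python
import math


def mean_and_cov(rows):
    """Sample mean vector and (n - 1)-normalized covariance matrix."""
    n, p = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
            for b in range(p)] for a in range(p)]
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes `a` is square and non-singular (illustrative helper).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, mu, cov):
    """Mahalanobis distance of x from a class with mean mu and covariance cov."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return math.sqrt(sum(di * yi for di, yi in zip(d, solve(cov, d))))
```

With an identity covariance the value reduces to the Euclidean distance, which is a convenient sanity check before applying it to correlated line indices.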
Abstract: The goal of this study was to optimize the constitutive parameters of foundation soils using a k-means algorithm with clustering analysis. A database was collected from unconfined compression tests, Proctor tests and grain distribution tests of soils taken from three different types of foundation pits: raft foundations, partial raft foundations and strip foundations. The k-means algorithm with clustering analysis was applied to determine the most appropriate foundation type given the unconfined compression strengths and other parameters of the different soils.
Abstract: Hierarchical clustering analysis based on statistics is one of the most important data mining algorithms. The traditional hierarchical clustering method, however, relies on global comparison and in practice performs only Q clustering while ignoring R clustering, so it has limitations, especially when the numbers of samples and indexes are very large. Furthermore, because it ignores the associations between different indexes, the clustering result is neither good nor faithful. In this paper, we present the model and algorithm of two-level hierarchical clustering, which integrates Q clustering with R clustering. Because two-level hierarchical clustering is based on the separate clustering result of each class, the classification of the indexes directly affects the accuracy of the final clustering result, and how to classify the indexes appropriately is the chief and most difficult problem to handle in advance. Some of the literature has addressed the classification of indexes, but only according to their superficial meaning, which is unscientific, for the following reasons. First, the superficial meaning of an index is often ambiguous and easily misunderstood by different people; moreover, this classification method seldom uses historical data, so the result is not objective. Second, the superficial meaning of some indexes conveys nothing, so they cannot be assigned to a class on that basis alone. Third, this classification method requires users to have advanced knowledge of the field, which is not always available; otherwise they can hardly understand the meaning of some indexes.
To address this question, we first use R clustering to cluster the indexes, dividing the p-dimensional indexes into q classes, and then adopt the two-level clustering method to obtain the final result. The resulting classification is more objective and accurate. Moreover, after the first step we obtain the relations and interactions among the different indexes, and we also learn which samples cluster together under a given class of indexes. (These intermediate results are sometimes very useful.) The experiments also indicate that the algorithms are effective and accurate, and the result of R clustering can easily be reused in later practice.
Funding: Supported by the Joint Earthquake Science Foundation under Contract No. 91060.
Abstract: The currently used methods for analyzing a number of focal mechanism solutions are often ineffective for large samples. With the aid of the basic concept of hierarchical clustering methods for pattern recognition, and in combination with the expression of focal mechanism solutions themselves, the sum of the angle between the P-axes and the angle between the T-axes of 2 solutions is defined as a distance, and software for hierarchical clustering analysis by the shortest distance method and the longest distance method is compiled. The number of types in the clustering results can be determined in accordance with different requirements. For focal mechanism solutions of the same type, the average position of each stress axis can be calculated by the method of vector composition, and thereby the spatial orientation of the average focal mechanism solution can be determined. In order to test the feasibility and reliability of the software, hierarchical clustering analyses are made for the focal mechanism solutions of 24
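The distance defined in this abstract (the angle between the P-axes plus the angle between the T-axes of two solutions) can be sketched as follows. This is an illustration, not the authors' software. It assumes each axis is given as a 3-component direction vector and treats an axis as an undirected line, so the angle between two axes is folded into the range 0 to 90 degrees; the source does not state either convention.

```python
import math


def axis_angle(u, v):
    """Angle in degrees between two undirected axes given as 3-vectors.

    Assumption: an axis and its reverse are the same line, so the absolute
    value of the cosine is used and the result lies in [0, 90].
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_angle = min(1.0, abs(dot) / norm)  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def focal_distance(sol1, sol2):
    """Distance between two focal mechanism solutions.

    Each solution is a (p_axis, t_axis) pair of direction vectors
    (hypothetical representation chosen for this sketch).
    """
    (p1, t1), (p2, t2) = sol1, sol2
    return axis_angle(p1, p2) + axis_angle(t1, t2)
```

A pairwise matrix of such distances is what a shortest-distance (single-linkage) or longest-distance (complete-linkage) hierarchical clustering would then operate on.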
Abstract: Objective: The aim of this study is to identify the research status and hot spots of economic evaluation (EE) in the nursing area using co-word cluster analysis. Methods: The Medical Subject Heading (MeSH) term "cost–benefit analysis" was searched in PubMed, and the results were limited to nursing journals with the filter function. The author, country, year, journal, and keywords of the collected papers were extracted and exported to the Bicomb 2.0 system, where high-frequency terms and other data could be further mined. SPSS 19.0 was used for cluster analysis to generate a dendrogram. Results: In all, 3,020 articles were found and 10,573 MeSH terms were detected; among them, 1,909 were MeSH major topics, which generated 42 high-frequency terms. The dendrogram showed seven clusters, representing seven research hot spots: skin administration, infection prevention, education program, nurse education and management, EE research, neoplasm patient, and extension of nurse function. Conclusions: Nursing EE research involves multiple aspects of the nursing area and is an important indicator for decision-making. Although the number of papers is increasing, the quality of the studies is not promising. Therefore, further study may be required to assess nurses' knowledge of economic analysis methods and their attitude toward applying them in nursing research. More nursing economics courses could be offered in nursing schools or hospitals.
Funding: Supported by the National Natural Science Foundation of China (No. 60472072) and the Specialized Research Foundation for the Doctoral Program of Higher Education of China (No. 20040699034).
Abstract: A novel Support Vector Machine (SVM) ensemble approach using clustering analysis is proposed. First, the positive and negative training examples are each clustered with the subtractive clustering algorithm. Then some representative examples are chosen from each cluster to construct the SVM components. Finally, the outputs of the individual classifiers are fused by majority voting to obtain the final decision. The performance of the proposed method is compared with that of other popular ensemble approaches, such as Bagging, Adaboost and k-fold cross validation, on synthetic and UCI datasets. The experimental results show that our method has higher classification accuracy because the example distribution information is considered during ensemble construction through clustering analysis. They further indicate that our method needs much smaller training subsets than Bagging and Adaboost to obtain satisfactory classification accuracy.
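The final fusion step in this abstract, majority voting over the component classifiers, can be sketched as below. Only the voting rule is shown; the subtractive clustering and SVM training are not reproduced, and the function name and label encoding (+1 / -1) are assumptions. With an even number of voters a tie is possible, and this sketch simply returns the label that was seen first among the tied ones.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier predictions by majority voting.

    predictions: list of equal-length label lists, one list per classifier.
    Returns one fused label per example.
    """
    fused = []
    # zip(*predictions) groups the votes that all classifiers cast on one example.
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Three hypothetical component classifiers voting on three examples.
fused = majority_vote([[1, -1, 1], [1, 1, -1], [-1, -1, 1]])
```

Using an odd number of binary classifiers avoids ties altogether, which is a common design choice for this kind of ensemble.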
Funding: Supported by the Key Laboratory Foundation of Henan (112300413221).
Abstract: In this study, the world's land (except Antarctica) is divided into 67 basic geographical units according to ecological types. Using our newly proposed MSCA (Multivariate Similarity Clustering Analysis) method, 7,591 species of modern terrestrial mammals belonging to 1,374 genera in 162 families, and 2,378 species of mammals known in the Wallace era before 1876, are quantitatively analyzed. Almost the same clustering results are obtained, with clear levels and reasonable clustering, which conform to the principles of geography, statistics, ecology and biology. The analysis not only affirms and supports the reasonable kernel of Wallace's scheme, but also puts forward suggestions for its revision and improvement. The differences, large or small, between the clustering results and the mammalian geographical zoning schemes of contemporary scholars are caused by different analysis methods, and the results are highly consistent with those for chordates, angiosperms and insects of the world analyzed by the same method. This once again confirms the homogeneity of the global distribution pattern of major biological groups, and the possibility of building a unified biogeographic zoning system for the world.
Abstract: During power system black start after an accident, decomposing the power system into several independent partitions for parallel recovery helps to optimize resource allocation and accelerate the recovery process. With adequate consideration of the fuzziness of black-start zone partitioning, a new algorithm based on fuzzy clustering analysis is presented. Characteristic indexes are extracted fully and accurately. The raw data matrix is made up of the electrical distances between every node and the black-start resources. The transitive closure method is used to obtain the dynamic clustering. Finally, the availability and feasibility of the proposed algorithm are verified on the New England 39-bus system.
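The closure-based dynamic clustering mentioned in this abstract can be sketched in general form. The code starts from a fuzzy similarity matrix (reflexive and symmetric, entries in [0, 1]); how the paper derives that matrix from electrical distances is not stated, so that step is omitted and the matrix below is invented for illustration. The transitive closure is obtained by repeated max-min self-composition, and cutting it at a threshold $\lambda$ yields the partition, so varying $\lambda$ gives the dynamic clustering.

```python
def max_min_compose(r, s):
    """Max-min composition of two square fuzzy relations."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Square the relation under max-min composition until it stops changing."""
    while True:
        squared = max_min_compose(r, r)
        if squared == r:
            return r
        r = squared


def lambda_cut_clusters(closure, lam):
    """Group items whose closure similarity is at least `lam`."""
    clusters, seen = [], set()
    for i in range(len(closure)):
        if i in seen:
            continue
        group = [j for j in range(len(closure)) if closure[i][j] >= lam]
        seen.update(group)
        clusters.append(group)
    return clusters


# Hypothetical similarity matrix for three nodes.
similarity = [[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.4],
              [0.2, 0.4, 1.0]]
closure = transitive_closure(similarity)
partition = lambda_cut_clusters(closure, 0.5)
```

Because the closure is a fuzzy equivalence relation, lowering the threshold only merges groups and never splits them, which is what makes the resulting hierarchy of partitions well defined.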
Funding: Supported by the Yonsei University Research Fund of 2021 (2021-22-0060).
Abstract: Clustering analysis, which identifies unknown heterogeneous subgroups of a population (or a sample), has become increasingly popular along with machine learning techniques. Although many software packages run clustering analysis, few conduct it within a structural equation modeling framework. The package gscaLCA, which is implemented in the R statistical computing environment, was developed for conducting clustering analysis and has been extended to latent variable modeling. More specifically, by applying both the fuzzy clustering (FC) algorithm and generalized structured component analysis (GSCA), the package gscaLCA computes membership prevalence and item response probabilities as posterior probabilities, which is applicable in mixture modeling such as latent class analysis in statistics. As a hybrid between data clustering in classification and the model-based mixture modeling approach, fuzzy clusterwise GSCA, denoted as gscaLCA, combines advantages of both methods: (1) soft partitioning from FC and (2) efficiency in estimating model parameters with the bootstrap method via resolution of the global optimization problem from GSCA. The main function, gscaLCA, works for both binary and ordered categorical variables. In addition, gscaLCA can be used for latent class regression. Visualization of the profiles of latent classes based on the posterior probabilities is also available in the package. This paper contributes a methodological tool, gscaLCA, with which applied researchers such as social scientists and medical researchers can apply clustering analysis in their research.
Funding: This project was sponsored by the Joint Earthquake Science Foundation of China (94139).
Abstract: Based on the records of the seismic station networks of mainland China and the Korean Peninsula and on historical data, the complete seismicity pattern was obtained for the first time. Seismic zoning was conducted by means of the cluster analysis method. The map of the spatial distribution of seismicity from 1960 to 1994 shows that there are three strong seismic zones: the first strikes in the NE direction, from the Jiangsu plain in China to the central Korean Peninsula; the second strikes in the NW direction, from the Bohai Sea, China, to the southern Korean Peninsula; the third strikes in the NW direction, from western Liaoning Province to Pyongyang. Most earthquakes are located along these three zones; the seismic intensity is lower than that on the mainland and exhibits the features of the fractured crust of a marginal sea basin.
Funding: Supported by the Department of Science and Education, State Bureau, China (04A26).
Abstract: [Objective] The research aimed to cluster the six climatic factors of the Yunnan tobacco planting zone. [Method] Clustering analysis was conducted on 6 meteorological elements in 89 tobacco-growing counties and 12 sub-prefectures. The climate of the tobacco planting area of Yunnan Province was then divided according to the indicators and climate characteristics of each type. [Result] The climate of the tobacco planting area of Yunnan Province could be divided into eight types, each named after a representative county: Jiangchuan (24 counties, belonging to the northern and central subtropical climate belts), Songming (27 counties, belonging to the northern subtropical and central, south, north temperate climate belts), Tengchong (3 counties, belonging to the northern subtropical climate belt), Mile (12 counties, belonging to the central and southern subtropical climate belts), Qiubei (11 counties, belonging to the southern subtropical climate belt), Yanjin (4 counties, belonging to the central subtropical humid climate belt), Yuanjiang (4 counties, belonging to the southern subtropical and northern tropical climate belts), and Zhenxiong (3 counties, belonging to the warm temperate and northern subtropical climate belts). For eco-zones 1-8, the numbers of domestic and foreign cities whose climate reached level-one similarity were 3, 1, 1, 0, 1, 1, 0 and 1, respectively, and the numbers reaching level-two similarity were 12, 15, 3, 13, 13, 1, 5 and 3, respectively. Among the 8 major ecological zones, the similarity distance of cities reaching level-one similarity was in the range of 0.28 to 0.45, and the degree of similarity was the highest; variety introduction among these places would be successful. The similarity distance of cities reaching level-two similarity was between 0.51 and 1.00, and the degree of similarity was higher; mutual variety introduction among these places would have a high success rate. [Conclusion] The research provides a theoretical basis for selecting new suitable tobacco varieties and optimizing the tobacco variety layout in different zones.