A novel strategy using LC-MS (liquid chromatography tandem mass spectrometry) coupled with multivariate data analysis and bioactivity evaluation was established to discriminate between the aqueous extract and the vinegar extract of Shixiao San. Batches of the two kinds of samples were analyzed, and the datasets of sample codes, tR-m/z pairs and ion intensities were processed with principal component analysis (PCA). The score plot showed a clear separation of the aqueous and vinegar groups, and the chemical markers contributing most to the differentiation were screened out on the loading plot. The chemical markers were identified by comparing their mass fragments and retention times with those of reference compounds and/or known compounds published in the literature. Based on the proposed strategy, quercetin-3-O-neohesperidoside, isorhamnetin-3-O-neohesperidoside, kaempferol-3-O-neohesperidoside, isorhamnetin-3-O-rutinoside and isorhamnetin-3-O-(2G-α-L-rhamnosyl)-rutinoside were identified as representative markers distinguishing the vinegar extract from the aqueous extract. The anti-hyperlipidemic activities of the two processed extracts of Shixiao San were examined through serum levels of lipids, lipoproteins and blood antioxidant enzymes in a rat hyperlipidemia model; the vinegar extract exerted strong lipid-lowering and antioxidative effects and was superior to the aqueous extract. Therefore, boiling with vinegar was proposed as the best processing procedure for the anti-hyperlipidemic effect of Shixiao San. Furthermore, combining the changes in metabolic profiling with the bioactivity evaluation suggests that the five representative markers may be related to the observed anti-hyperlipidemic effect.
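The PCA step in the abstract above (score plot for group separation, loading plot for marker screening) can be illustrated with a minimal sketch. The intensity matrix, the group assignment of its rows, and the function name `pca` are all hypothetical; the abstract does not state the scaling or software the authors used, so plain mean-centering is assumed here.

```python
import numpy as np

# Hypothetical intensity matrix: rows are sample batches, columns are
# tR-m/z features. Rows 0-2 stand in for aqueous extracts and rows 3-5
# for vinegar extracts; all values are invented for illustration.
X = np.array([
    [1.0, 2.0, 1.0, 3.0],
    [1.1, 2.1, 1.2, 2.9],
    [0.9, 1.9, 0.9, 3.1],
    [1.0, 2.0, 5.0, 3.0],
    [1.1, 1.9, 5.2, 3.1],
    [0.9, 2.1, 4.9, 2.9],
])


def pca(data, n_components=2):
    """Return (scores, loadings) of a mean-centered PCA computed by SVD."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = vt[:n_components].T                    # feature weights
    return scores, loadings


scores, loadings = pca(X)

# The feature with the largest absolute PC1 loading is the candidate marker.
marker = int(np.argmax(np.abs(loadings[:, 0])))
print("PC1 scores:", np.round(scores[:, 0], 2))
print("feature index with the largest PC1 loading:", marker)
```

In this toy matrix only the third column differs between the two groups, so the PC1 scores split the groups by sign and the loading ranking points back to that column.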
Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, because outliers distort the statistics of geotechnical parameters and data sparsity introduces additional statistical uncertainty. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on the Mahalanobis distance and identifies as outliers those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers by a re-sampling technique and accounts, rationally, for the statistical uncertainty by Bayesian machine learning. Moreover, the proposed approach also suggests an exclusive method to determine the outlying components of each outlier. The approach is illustrated and verified using simulated and real-life datasets. The results show that it properly identifies outliers among sparse multivariate data, and their corresponding outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty). It is also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
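The idea of an outlying probability based on the Mahalanobis distance can be sketched as follows. This is not the authors' Bayesian method: it swaps in a plain bootstrap, in which the mean and covariance are re-estimated on each resample and the outlying probability is the fraction of resamples in which a point exceeds a chi-square cutoff. The function name, the 95% cutoff, the restriction to two variables, and the synthetic data are all assumptions.

```python
import math

import numpy as np


def outlying_probability(data, n_resamples=2000, seed=0):
    """Bootstrap frequency with which each 2-D point exceeds the cutoff."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    if dim != 2:
        raise ValueError("this sketch uses the closed-form 2-dof cutoff")
    # 95% quantile of the chi-square distribution with 2 degrees of freedom.
    cutoff = -2.0 * math.log(0.05)
    flagged = np.zeros(n)
    for _ in range(n_resamples):
        sample = data[rng.integers(0, n, size=n)]      # resample rows
        mean = sample.mean(axis=0)
        cov = np.cov(sample, rowvar=False)
        diff = data - mean
        # Squared Mahalanobis distance of every original point.
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
        flagged += d2 > cutoff
    return flagged / n_resamples


# Hypothetical sparse dataset: 15 regular points plus one planted outlier.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(size=(15, 2)), [[10.0, 10.0]]])

prob = outlying_probability(data)
outliers = np.flatnonzero(prob > 0.5)   # the 0.5 rule from the abstract
print(np.round(prob, 2), outliers)
```

Resamples that happen to exclude the planted point give undistorted statistics, which is one simple way to weaken the masking effect the abstract describes.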
Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented either for unsupervised, semi-supervised, or supervised machine learning tasks, the combination of all three methodologies remains challenging. In this paper, a visual analytics approach is presented, combining a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.
In this paper, a new approach for visualizing multivariate categorical data is presented. The approach uses a graph to represent multivariate categorical data and draws the graph in such a way that patterns, trends and relationships within the data can be identified. A mathematical model for the graph layout problem is derived and a spectral graph drawing algorithm for visualizing multivariate categorical data is proposed. The experiments show that the drawings produced by the algorithm capture the structures of multivariate categorical data well and that the computation is fast.
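The abstract above does not spell out its layout model, so the following is only a generic sketch of spectral graph drawing: node coordinates are taken from the eigenvectors of the graph Laplacian that belong to the smallest nonzero eigenvalues. The example graph, in which nodes could stand for category values and edges for co-occurrence, is a hypothetical illustration, as is the function name `spectral_layout`.

```python
import numpy as np


def spectral_layout(adjacency):
    """2-D coordinates from the 2nd and 3rd Laplacian eigenvectors."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    # eigh returns eigenvalues in ascending order for symmetric matrices;
    # column 0 is the constant vector of a connected graph, so skip it.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvectors[:, 1:3]


# Hypothetical graph: two triangles (nodes 0-2 and 3-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

coords = spectral_layout(adjacency)
print(np.round(coords[:, 0], 3))
```

The first coordinate places the two triangles on opposite sides of zero, which is the kind of cluster structure a spectral drawing is expected to reveal.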
Biometric gait recognition is a lesser-known but emerging and effective biometric recognition method that enables subjects' walking patterns to be recognized. Existing research in this area has primarily focused on feature analysis through the extraction of individual features, which captures most of the information but fails to capture subtle variations in gait dynamics. Therefore, a novel feature taxonomy and an approach for deriving a relationship between a function of one set of gait features and another set are introduced. The gait features extracted from body halves divided by anatomical planes on vertical, horizontal, and diagonal axes are grouped to form canonical gait covariates. Canonical Correlation Analysis (CCA) is utilized to measure the strength of association between the canonical covariates of gait. Thus, gait assessment and identification are enhanced when more semantic information is available through CCA-based multi-feature fusion. Carnegie Mellon University's 3D gait database, which contains 32 gait samples taken at different paces, is utilized in analyzing gait characteristics. The performance of Linear Discriminant Analysis, K-Nearest Neighbors, Naive Bayes, Artificial Neural Networks, and Support Vector Machines improved by 4% on average when the CCA-based gait identification approach was used. A maximum accuracy rate of 97.8% was achieved through CCA-based gait identification. Beyond that, the rate of false identifications and unrecognized gaits was halved, demonstrating state-of-the-art performance for gait identification.
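How CCA measures the strength of association between two feature sets can be sketched with a standard QR-based computation: the canonical correlations are the singular values of the product of orthonormal bases of the two centered feature matrices. The feature matrices below are synthetic stand-ins for gait covariates from two body halves, and the names `upper` and `lower` are hypothetical; the abstract does not describe the authors' implementation.

```python
import numpy as np


def canonical_correlations(x, y):
    """Canonical correlations between two feature sets (rows = samples).

    Assumes both centered matrices have full column rank.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    qx, _ = np.linalg.qr(xc)     # orthonormal basis of the first set
    qy, _ = np.linalg.qr(yc)     # orthonormal basis of the second set
    singular_values = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(singular_values, 0.0, 1.0)


# Synthetic stand-ins: three "upper body" and two "lower body" features.
rng = np.random.default_rng(0)
upper = rng.normal(size=(40, 3))
lower = np.column_stack([
    upper[:, 0] + upper[:, 1],   # fully determined by the upper features
    rng.normal(size=40),         # unrelated noise
])

corrs = canonical_correlations(upper, lower)
print(np.round(corrs, 3))
```

The first canonical correlation is essentially 1 because one lower feature is a linear combination of the upper features, while the second stays well below 1 because it reflects only chance association with the noise column.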
In several LUCC studies, statistical methods are used to analyze land use data. A problem with using conventional statistical methods in land use analysis is that these methods assume the data to be statistically independent. In fact, the data tend to be dependent, a phenomenon known as multicollinearity, especially when there are few observations. In this paper, a Partial Least-Squares (PLS) regression approach is developed to study the relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and the influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region. They also show that the first PLS factor is sufficient to describe land use patterns quantitatively, and that most of the statistical relations derived from it agree with reality. As the explanatory capacity of successive PLS factors decreases, the reliability of the model outcome decreases correspondingly.
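Why PLS regression copes with multicollinearity can be shown with a minimal sketch of the NIPALS algorithm for a single response (PLS1). The data are synthetic and the function name `pls1_fit` is hypothetical; the abstract does not say which PLS variant or software the study used. Two of the four predictor columns are exact linear combinations of the others, so ordinary least squares via the normal equations would face a singular matrix.

```python
import numpy as np


def pls1_fit(x, y, n_components):
    """Fit PLS1 by NIPALS; return (coefficients, intercept)."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xr, yr = x - x_mean, y - y_mean          # centered working copies
    weights, loadings, inner = [], [], []
    for _ in range(n_components):
        w = xr.T @ yr
        w = w / np.linalg.norm(w)            # weight vector
        t = xr @ w                           # score vector (latent factor)
        tt = t @ t
        p = xr.T @ t / tt                    # X loading
        q = yr @ t / tt                      # y loading
        xr = xr - np.outer(t, p)             # deflate X
        yr = yr - q * t                      # deflate y
        weights.append(w)
        loadings.append(p)
        inner.append(q)
    w_mat = np.array(weights).T
    p_mat = np.array(loadings).T
    coef = w_mat @ np.linalg.solve(p_mat.T @ w_mat, np.array(inner))
    return coef, y_mean - x_mean @ coef


# Synthetic collinear predictors: columns 1 and 3 depend on columns 0 and 2.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 2))
x = np.column_stack([base[:, 0], 2 * base[:, 0], base[:, 1],
                     base[:, 0] + base[:, 1]])
y = 3 * base[:, 0] - base[:, 1]

coef, intercept = pls1_fit(x, y, n_components=2)
y_hat = x @ coef + intercept
print(np.round(y_hat - y, 6))
```

Because the regression is carried out on a few orthogonal latent factors rather than on the original correlated variables, the fit stays well defined; here two factors reproduce the noise-free response.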
Numerical weather simulation data usually comprise various meteorological variables, such as precipitation, temperature and pressure. In practical applications, data generated with several different numerical simulation models are usually used together by forecasters to generate the final forecast. However, it is difficult for forecasters to obtain a clear view of all the data due to their complexity. This has greatly limited the ability of domain experts to take advantage of all the data in their routine work. To help explore the multi-variate and multi-model data, we propose a stamp-based exploration framework that assists domain experts in detecting bias patterns between numerical simulation data and observation data. The exploration pipeline originates from a single meteorological variable and extends to multiple variables under the guidance of a designed stamp board. Regional data patterns can be detected by analyzing distinctive stamps on the board or by generating extended stamps using Boolean set operations. Experimental results show that some meteorological phenomena and regional data patterns can be easily detected through the exploration. These can help domain experts conduct the data analysis efficiently and further guide forecasters in producing reliable weather forecasts.
In complex multivariate data sets, different features usually include diverse associations with different variables, and different variables are associated within different regions. Therefore, exploring the associations between variables and voxels locally becomes necessary to better understand the underlying phenomena. In this paper, we propose a co-analysis framework based on biclusters, which are two subsets of variables and voxels with close scalar-value relationships, to guide the process of visually exploring multivariate data. We first automatically extract all meaningful biclusters, each of which only contains voxels with a similar scalar-value pattern over a subset of variables. These biclusters are organized according to their variable sets, and biclusters in each variable set are further grouped by a similarity metric to reduce redundancy and support diversity during visual exploration. Biclusters are visually represented in coordinated views to facilitate interactive exploration of multivariate data from the similarity between biclusters and the correlation of scalar values with different variables. Experiments on several representative multivariate scientific data sets demonstrate the effectiveness of our framework in exploring local relationships among variables, biclusters and scalar values in the data.
Multivariate failure time data are frequently encountered in biomedical research. In this article, we model the marginal hazards with an accelerated hazards model to analyze multivariate failure time data. Estimating equations are derived analogously to the generalized estimating equation method. Under certain regularity conditions, the resultant estimators for the regression parameters are shown to be asymptotically normal. Furthermore, we also establish the weak convergence of the estimators for the baseline cumulative hazard functions.
In the analysis of correlated data, it is ideal to capture the true dependence structure to increase efficiency of the estimation. However, for multivariate survival data, this is extremely
Multivariate longitudinal data arise frequently in a variety of applications, where multiple outcomes are measured repeatedly from the same subject. In this paper, we first propose a two-stage weighted least squares estimation procedure for the regression coefficients when the random error follows an irregular autoregressive (AR) process, and establish asymptotic normality properties for the resulting estimators. We then apply the smoothly clipped absolute deviation (SCAD) variable selection approach to determine the order of the AR error process. We further propose a test statistic to check whether multiple responses are correlated at the same observation time, and derive the asymptotic distribution of the proposed test statistic. Several simulated examples and a real data analysis are presented to illustrate the finite-sample performance of the proposed method.
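The SCAD penalty mentioned in the abstract above has a simple piecewise form, sketched below. Only the penalty function itself is shown, not the authors' order-selection procedure; the function name is hypothetical, and the default a = 3.7 is the value commonly recommended for SCAD rather than one taken from the abstract.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient theta (lam > 0, a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients are penalized like the L1 (lasso) penalty.
        return lam * t
    if t <= a * lam:
        # Quadratic transition that gradually relaxes the penalty rate.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients receive a constant penalty, which limits bias.
    return lam * lam * (a + 1) / 2


for value in (0.5, 1.0, 3.7, 10.0):
    print(value, scad_penalty(value, lam=1.0))
```

The three pieces join continuously at lam and a * lam. In an order-selection setting, AR coefficients whose estimates fall in the first piece can be shrunk to zero, while clearly nonzero coefficients are left nearly unpenalized.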
Internet of Things systems generate a large amount of sensor data that needs to be analyzed to extract useful insights into the health status of the machine under consideration. Sensor data from all possible states of a system are used for building machine learning models. These models are then used to predict possible downtime so that proactive action can be taken on the system condition. Aircraft engine run-to-failure data are used in the current study. The run-to-failure data include states such as new installation, stable operation, first reported issue, erroneous operation, and final failure. In the present work, the non-linear multivariate sensor data are used to understand the health status and anomalous behavior. The methodology is based on different sampling sizes to obtain optimum results with high accuracy. The time series of each sensor is converted to a 2D image with a specific time window. The converted images represent the health of a system in a higher-dimensional space. The created images are fed to a Convolutional Neural Network, which captures both the time variation and the space variation of each sensed parameter. Using these images, a model for estimating the remaining life of the aircraft is developed. Further, the proposed network is also used for predicting the number of engines that would fail in a given time window. The methodology is useful in avoiding health index generation when predicting the remaining useful life of industrial components. Better accuracy in the classification of components is achieved using the TimeImagenet-based approach.
The survival analysis literature has always lagged behind the categorical data literature in developing methods to analyze clustered or multivariate data. While estimators based on
Sustainable development presupposes a development strategy that ensures the interdependence and complementarity of objectives from the social, economic and environmental fields. The degree of priority established for the three dimensions of sustainable development differs from one country to another, a fact that confers a national and local meaning on this issue. For the Central and Eastern European countries, balanced economic development represents one of the fundamental objectives of the reforms started in 1990. Education represents a priority of any country's economic development and an extremely important element of economic growth. This paper presents the characteristics of the Romanian educational system through a comparative analysis with different countries of the European Union, both from a quantitative viewpoint (using the main indicators in the education field) and a qualitative viewpoint (using student performances in international evaluations). In the end, we present some proposals for the improvement of the present state of the Romanian educational system.
We thank all the discussants for their interesting and stimulating contributions. They have touched various aspects that have not been considered by the original articles.
Scatterplots and scatterplot matrix methods have been popularly used for showing statistical graphics and for exposing patterns in multivariate data. A recent technique, called Linkable Scatterplots, provides an interesting idea for interactive visual exploration, offering a set of necessary plot panels on demand together with interaction, linking and brushing. This article presents a controlled study with a mixed-model design to evaluate the effectiveness and user experience of visual exploration when using Sequential-Scatterplots, where a single plot is shown at a time, Multiple-Scatterplots, where the number of plots can be specified and shown, and Simultaneous-Scatterplots, where all plots are shown as a scatterplot matrix. Results from the study demonstrated higher accuracy using the Multiple-Scatterplots visualization, particularly in comparison with the Simultaneous-Scatterplots. While the time taken to complete tasks was longer with the Multiple-Scatterplots technique than with the simpler Sequential-Scatterplots, Multiple-Scatterplots is inherently more accurate. Moreover, the Multiple-Scatterplots technique is the most highly preferred and positively experienced technique in this study. Overall, the results support the strength of Multiple-Scatterplots and highlight its potential as an effective data visualization technique for exploring multivariate data.
The multivariate extension of the Cox model proposed by Wei, Lin and Weissfeld in 1989 has been widely used for analyzing multivariate survival data. Under the model assumption, failure times from an individual are assumed to marginally follow their respective proportional hazards regression relation, leaving the joint distribution completely unspecified. This paper presents a simple approach to efficiency improvement through segmentation of stochastic integrals in the marginal estimating equations and incorporation of the limiting covariance structure. It is shown that when partition of the time interval is done at a suitable rate, the resulting estimator is consistent and asymptotically normal. Through the reproducing kernel Hilbert space arising from the covariance function of the limiting Gaussian process, it is also shown that the proposed estimator is asymptotically optimal within a reasonable class of estimators under marginal specification. Simulations are conducted to assess the finite-sample performance of the proposed method.
Cancer is one of the most serious diseases and causes an enormous number of deaths all over the world. Tumor metabolism differs markedly from that of normal tissues. Exploring tumor metabolism may be one of the best ways to find biomarkers for cancer detection and diagnosis and to provide novel insights into the internal physiological state, where subtle changes may occur in metabolite concentrations. Nuclear Magnetic Resonance (NMR) is nowadays a popular tool for analyzing cell extracts, tissues, biological fluids, etc., since it is a relatively fast and accurate technique that supplies abundant biochemical information at the molecular level for tumor research. In this review, approaches to studying tumor metabolism are discussed, including sample collection, data profiling and multivariate data analysis methods. Some typical applications of NMR in tumor metabolism are also summarized.
2D ^13C-^1H HSQC NMR spectroscopy of acetylated cell walls in solution gives a detailed fingerprint that can be used to assess the chemical composition of the complete wall without extensive degradation. We demonstrate how multivariate analysis of such spectra can be used to visualize cell wall changes between sample types as high-resolution 2D NMR loading spectra. Changes in composition and structure for both lignin and polysaccharides can subsequently be interpreted on a molecular level. The multivariate approach alleviates problems associated with peak picking of overlapping peaks, and it allows the deduction of the relative importance of each peak for sample discrimination. As a first proof of concept, we compare Populus tension wood to normal wood. All well-established differences in cellulose, hemicellulose, and lignin compositions between these wood types were readily detected, confirming the reliability of the multivariate approach. In a second example, wood from transgenic Populus modified in their degree of pectin methylesterification was compared to that of wild-type trees. We show that differences in both lignin and polysaccharide composition that are difficult to detect with traditional spectral analysis, and that could not be predicted a priori, were revealed by the multivariate approach. 2D NMR of dissolved cell wall samples combined with multivariate analysis constitutes a novel approach in cell wall analysis and provides a new tool that will benefit cell wall research.
Funding (LC-MS study of Shixiao San extracts): Natural Science Foundation of China (T11036061/T0108).
Funding (probabilistic outlier detection for geotechnical site investigation data): Supported by the National Key R&D Program of China (Project No. 2016YFC0800200), the NRF-NSFC 3rd Joint Research Grant (Earth Science) (Project No. 41861144022), the National Natural Science Foundation of China (Project Nos. 51679174 and 51779189), and the Shenzhen Key Technology R&D Program (Project No. 20170324). The financial support is gratefully acknowledged.
Funding (visualization of multivariate categorical data): Supported by the National Natural Science Foundation of China (601133010).
Funding (CCA-based gait identification): Supported by the Istanbul University Scientific Research Project Department (Project No. IRP-51706).
Funding: National Natural Science Foundation of China (No. 40301038)
Abstract: In several land use and land cover change (LUCC) studies, statistical methods are used to analyze land use data. A problem with conventional statistical methods in land use analysis is that they assume the data to be statistically independent. In fact, the data tend to be dependent, a phenomenon known as multicollinearity, especially when there are few observations. In this paper, a Partial Least-Squares (PLS) regression approach is developed to study the relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset, and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and its influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region, and illustrate that the first PLS factor is sufficient to describe land use patterns quantitatively; most of the statistical relations derived from it accord with the facts. As the explanatory capacity of successive PLS factors decreases, the reliability of the model outcome decreases correspondingly.
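To make the role of the first PLS factor concrete, here is a minimal sketch of a one-component PLS regression for a single response (PLS1). It is not the study's model: the function name `pls1_first_component` and the toy numbers are assumptions for illustration, and the study itself retains four factors. The toy predictors are perfectly collinear, a case in which ordinary least squares has no unique solution but the PLS factor is still well defined.

```python
def pls1_first_component(X, y):
    """Fit a one-component PLS1 model; return (weights, predict function)."""
    n, p = len(X), len(X[0])
    # Center the predictors and the response.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [yi - y_mean for yi in y]
    # Weight vector: the predictor-space direction with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    w_norm = sum(wj * wj for wj in w) ** 0.5
    w = [wj / w_norm for wj in w]
    # Scores t = Xc w, then regress the centered response on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return w, predict

# Hypothetical data: the second predictor is exactly twice the first.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [4.0, 7.0, 10.0, 13.0]  # y = 3 * x1 + 1
w, predict = pls1_first_component(X, y)
```

Further factors would be obtained by deflating `X` and `y` with the extracted scores and repeating the same steps.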
Funding: Supported by the National Natural Science Foundation of China (61572274, 61672307, 61272225, 51261120376) and the National Key Technologies R&D Program of China (2015BAF23B03)
Abstract: Numerical weather simulation data usually comprise various meteorological variables, such as precipitation, temperature and pressure. In practical applications, forecasters usually combine data generated by several different numerical simulation models to produce the final forecast. However, the complexity of the data makes it difficult for forecasters to obtain a clear view of all of it, which greatly limits domain experts' ability to take advantage of all the data in their routine work. To help explore the multi-variate and multi-model data, we propose a stamp-based exploration framework that assists domain experts in analyzing the data and in detecting bias patterns between numerical simulation data and observation data. The exploration pipeline starts from a single meteorological variable and extends to multiple variables under the guidance of a designed stamp board. Regional data patterns can be detected by analyzing distinctive stamps on the board or by generating extended stamps using Boolean set operations. Experimental results show that some meteorological phenomena and regional data patterns can be easily detected through the exploration. These results can help domain experts conduct data analysis efficiently and further guide forecasters in producing reliable weather forecasts.
Funding: This work was supported by the National Key Research & Development Program of China (2017YFB0202203), the National Natural Science Foundation of China (61472354 and 61672452), and the NSFC-Guangdong Joint Fund (U1611263).
Abstract: In complex multivariate data sets, different features usually have diverse associations with different variables, and different variables are associated within different regions. Exploring the associations between variables and voxels locally therefore becomes necessary to better understand the underlying phenomena. In this paper, we propose a co-analysis framework based on biclusters, which are paired subsets of variables and voxels with close scalar-value relationships, to guide the process of visually exploring multivariate data. We first automatically extract all meaningful biclusters, each of which contains only voxels with a similar scalar-value pattern over a subset of variables. These biclusters are organized according to their variable sets, and biclusters in each variable set are further grouped by a similarity metric to reduce redundancy and support diversity during visual exploration. Biclusters are visually represented in coordinated views to facilitate interactive exploration of multivariate data, based on the similarity between biclusters and the correlation of scalar values with different variables. Experiments on several representative multivariate scientific data sets demonstrate the effectiveness of our framework in exploring local relationships among variables, biclusters and scalar values in the data.
Funding: Supported by the National Natural Science Foundation of China (11171263)
Abstract: Multivariate failure time data are frequently encountered in biomedical research. In this article, we model the marginal hazards with the accelerated hazards model to analyze multivariate failure time data. Estimating equations are derived by analogy with the generalized estimating equation method. Under certain regularity conditions, the resulting estimators of the regression parameters are shown to be asymptotically normal. Furthermore, we establish the weak convergence of the estimators of the baseline cumulative hazard functions.
Abstract: In the analysis of correlated data, it is ideal to capture the true dependence structure to increase the efficiency of the estimation. However, for multivariate survival data, this is extremely
Funding: Supported by the Fundamental Research Funds of Shandong University (Grant No. 2018GN050), the Academic Prosperity Program provided by the School of Economics, Shandong University, and the Taishan Scholar Program of Shandong Province; the National Natural Science Foundation of China (Grant No. 11871323); the State Key Program in the Major Research Plan of the National Natural Science Foundation of China (Grant No. 91546202); and the Program for Innovative Research Team of Shanghai University of Finance and Economics.
Abstract: Multivariate longitudinal data arise frequently in a variety of applications in which multiple outcomes are measured repeatedly on the same subject. In this paper, we first propose a two-stage weighted least squares estimation procedure for the regression coefficients when the random error follows an irregular autoregressive (AR) process, and establish asymptotic normality for the resulting estimators. We then apply the smoothly clipped absolute deviation (SCAD) variable selection approach to determine the order of the AR error process. We further propose a test statistic to check whether multiple responses are correlated at the same observation time, and derive its asymptotic distribution. Several simulated examples and a real data analysis are presented to illustrate the finite-sample performance of the proposed method.
Abstract: Internet of Things systems generate large amounts of sensor data that need to be analyzed to extract useful insights into the health status of the machine under consideration. Sensor data covering all possible states of a system are used to build machine learning models, which are then used to predict possible downtime so that proactive action can be taken on the system's condition. The current study uses run-to-failure data from aircraft engines, which include states such as new installation, stable operation, first reported issue, erroneous operation, and final failure. In the present work, the non-linear multivariate sensor data are used to understand health status and anomalous behavior. The methodology uses different sampling sizes to find the setting that gives the best accuracy. The time series of each sensor is converted to a 2D image over a specific time window. The converted images represent the health of the system in a higher-dimensional space. The images are fed to a Convolutional Neural Network, which captures both the time variation and the space variation of each sensed parameter. Using these images, a model for estimating the remaining life of the aircraft is developed. The proposed network is also used to predict the number of engines that will fail in a given time window. The methodology avoids the need to generate a health index when predicting the remaining useful life of industrial components. Better accuracy in the classification of components is achieved using the TimeImagenet-based approach.
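The abstract does not say how the sensor time series are turned into images, so the following is only one plausible reading, sketched under stated assumptions: each image is a grid with one row per sensor and one column per time step in a sliding window, with every sensor min-max scaled to [0, 1] over the whole run. The function name `windows_to_images` and the `window` and `step` parameters are hypothetical.

```python
def windows_to_images(series, window, step=1):
    """Slice multivariate sensor data into 2D images.

    series: list of per-sensor lists, all of equal length.
    Returns a list of images, each a sensors x window grid of floats in [0, 1].
    """
    n_steps = len(series[0])
    # Min-max scale each sensor over the whole run (an assumed normalization).
    scaled = []
    for sensor in series:
        lo, hi = min(sensor), max(sensor)
        span = (hi - lo) or 1.0  # a constant sensor maps to all zeros
        scaled.append([(v - lo) / span for v in sensor])
    # Slide a fixed-length window along the time axis.
    images = []
    for start in range(0, n_steps - window + 1, step):
        images.append([row[start:start + window] for row in scaled])
    return images

# Hypothetical readings from two sensors over five time steps.
readings = [[0, 1, 2, 3, 4], [10, 10, 10, 10, 10]]
images = windows_to_images(readings, window=3)
```

Each resulting grid could then be passed to an image classifier or regressor; how windows are labeled with remaining useful life is left out because the abstract does not describe it.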
Abstract: The survival analysis literature has always lagged behind the categorical data literature in developing methods to analyze clustered or multivariate data. While estimators based on
Abstract: Sustainable development presupposes a development strategy that ensures the interdependence and complementarity of objectives in the social, economic and environmental fields. The degree of priority given to the three dimensions of sustainable development differs from one country to another, which gives the issue a national and local meaning. For the Central and Eastern European countries, balanced economic development is one of the fundamental objectives of the reforms started in 1990. Education is a priority in any country's economic development and an extremely important element of economic growth. This paper presents the characteristics of the Romanian educational system and compares it with those of other European Union countries, both quantitatively (using the main indicators in the education field) and qualitatively (using student performance in international evaluations). Finally, we present some proposals for improving the present state of the Romanian educational system.
Abstract: We thank all the discussants for their interesting and stimulating contributions. They have touched on various aspects that were not considered in the original articles.
Abstract: Scatterplots and scatterplot matrix methods are widely used for statistical graphics and for exposing patterns in multivariate data. A recent technique, called Linkable Scatterplots, offers an interesting approach to interactive visual exploration that provides a set of necessary plot panels on demand together with interaction, linking and brushing. This article presents a controlled study with a mixed-model design to evaluate the effectiveness and user experience of visual exploration using Sequential-Scatterplots, in which a single plot is shown at a time; Multiple-Scatterplots, in which a number of plots can be specified and shown; and Simultaneous-Scatterplots, in which all plots are shown as a scatterplot matrix. Results from the study showed higher accuracy with the Multiple-Scatterplots visualization, particularly in comparison with Simultaneous-Scatterplots. Although tasks took longer to complete with Multiple-Scatterplots than with the simpler Sequential-Scatterplots, Multiple-Scatterplots was more accurate. Moreover, Multiple-Scatterplots was the most preferred and most positively experienced technique in this study. Overall, the results support the strength of Multiple-Scatterplots and highlight its potential as an effective data visualization technique for exploring multivariate data.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 10471136 and 10971210) and the Knowledge Innovation Program of the Chinese Academy of Sciences (Grant No. KJCX3-SYW-S02)
Abstract: The multivariate extension of the Cox model proposed by Wei, Lin and Weissfeld in 1989 has been widely used for analyzing multivariate survival data. Under the model assumption, failure times from an individual are assumed to marginally follow their respective proportional hazards regression relations, leaving the joint distribution completely unspecified. This paper presents a simple approach to efficiency improvement through segmentation of stochastic integrals in the marginal estimating equations and incorporation of the limiting covariance structure. It is shown that when the time interval is partitioned at a suitable rate, the resulting estimator is consistent and asymptotically normal. Through the reproducing kernel Hilbert space arising from the covariance function of the limiting Gaussian process, it is also shown that the proposed estimator is asymptotically optimal within a reasonable class of estimators under marginal specification. Simulations are conducted to assess the finite-sample performance of the proposed method.
Abstract: Cancer is one of the most serious diseases and causes an enormous number of deaths worldwide. Tumor metabolism differs markedly from that of normal tissues. Exploring tumor metabolism may be one of the best ways to find biomarkers for cancer detection and diagnosis and to provide novel insights into the internal physiological state, where subtle changes in metabolite concentrations may occur. Nuclear Magnetic Resonance (NMR) is now a popular tool for analyzing cell extracts, tissues, biological fluids and similar samples, since it is a relatively fast and accurate technique that supplies abundant biochemical information at the molecular level for tumor research. In this review, approaches to studying tumor metabolism are discussed, including sample collection, data profiling and multivariate data analysis methods. Some typical applications of NMR in tumor metabolism are also summarized.
Abstract: 2D ^13C-^1H HSQC NMR spectroscopy of acetylated cell walls in solution gives a detailed fingerprint that can be used to assess the chemical composition of the complete wall without extensive degradation. We demonstrate how multivariate analysis of such spectra can be used to visualize cell wall changes between sample types as high-resolution 2D NMR loading spectra. Changes in composition and structure for both lignin and polysaccharides can subsequently be interpreted on a molecular level. The multivariate approach alleviates problems associated with peak picking of overlapping peaks, and it allows the relative importance of each peak for sample discrimination to be deduced. As a first proof of concept, we compare Populus tension wood to normal wood. All well-established differences in cellulose, hemicellulose, and lignin composition between these wood types were readily detected, confirming the reliability of the multivariate approach. In a second example, wood from transgenic Populus modified in its degree of pectin methylesterification was compared to that of wild-type trees. We show that the multivariate approach revealed differences in both lignin and polysaccharide composition that are difficult to detect with traditional spectral analysis and that could not be predicted a priori. 2D NMR of dissolved cell wall samples combined with multivariate analysis constitutes a novel approach to cell wall analysis and provides a new tool that will benefit cell wall research.