Aviation accidents are currently one of the leading causes of significant injuries and deaths worldwide, which motivates researchers to investigate aircraft safety using data analysis approaches based on advanced machine learning algorithms. To assess aviation safety and identify the causes of incidents, a classification model using the light gradient boosting machine (LGBM) was developed on data from the Aviation Safety Reporting System (ASRS). The model is improved by k-fold cross-validation with a hybrid sampling model (HSCV), which can boost classification performance and maintain data balance. The results show that the LGBM-HSCV model significantly improves accuracy while alleviating data imbalance. The comparative approach consists of a vertical comparison with other cross-validation (CV) methods and a lateral comparison across different numbers of folds. In addition, two further CV approaches based on the improved method are discussed: one with a different order of sampling and folding, and the other with more CV. According to the assessment indices for the different methods, the proposed LGBM-HSCV model is effective at detecting incident causes. The improved model for imbalanced data classification may serve as a point of reference for similar data processing, and its accurate identification of civil aviation incident causes can help improve civil aviation safety.
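The combination of k-fold cross-validation and hybrid sampling described in the abstract above can be illustrated with a minimal sketch. It assumes one particular ordering (split into folds first, then resample only the training part of each fold) and one particular balancing target (the mean class size); the abstract does not specify either, and it notes that the order of sampling and folding is itself a variant under study. The function names `stratified_kfold` and `hybrid_resample` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions similar in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

def hybrid_resample(train_idx, labels, seed=0):
    """Bring every class to the mean class size (an assumed target):
    undersample larger classes, oversample smaller ones."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    target = round(sum(len(v) for v in by_class.values()) / len(by_class))
    out = []
    for idxs in by_class.values():
        if len(idxs) >= target:
            out.extend(rng.sample(idxs, target))  # undersample without replacement
        else:
            extra = rng.choices(idxs, k=target - len(idxs))  # oversample with replacement
            out.extend(idxs + extra)
    return sorted(out)

# Toy imbalanced labels (hypothetical data): 90 majority, 10 minority.
labels = [0] * 90 + [1] * 10
splits = []
for train, test in stratified_kfold(labels, k=5):
    balanced_train = hybrid_resample(train, labels)
    splits.append((balanced_train, test))
```

Resampling only the training indices keeps each held-out fold at its original class ratio, so the fold score reflects the imbalanced data the classifier would actually face.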
In deriving a regression model, analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. To assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R² of 0.50 and 0.71, we consider 4 scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage and backward elimination (BE) with varying significance levels. We assess whether 2-step approaches using global or parameterwise shrinkage (PWSF) can improve selected models and compare the results to models derived with the LASSO procedure. Besides MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data influences the selected models and the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was not worse than that of the LASSO, and the models selected by BE were much sparser, an important advantage for interpretation and transportability. Compared to global shrinkage, PWSF had better performance. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.
For the nonparametric regression model $Y_{ni} = g(X_{ni}) + \varepsilon_{ni}$, $i = 1, \dots, n$, with regularly spaced nonrandom design, the authors study the behavior of the nonlinear wavelet estimator of $g(x)$. When the threshold and truncation parameters are chosen by cross-validation on the average squared error, strong consistency for the case of dyadic sample size and moment consistency for arbitrary sample size are established under some regularity conditions.
Background Cardiovascular diseases are closely linked to atherosclerotic plaque development and rupture. Plaque progression prediction is of fundamental significance to cardiovascular research and disease diagnosis, prevention, and treatment. The generalized linear mixed model (GLMM) is an extension of the linear model to categorical responses that accounts for the correlation among observations. Methods Magnetic resonance image (MRI) data of carotid atherosclerotic plaques were acquired from 20 patients with consent obtained, and 3D thin-layer models were constructed to calculate plaque stress and strain for plaque progression prediction. Data for ten morphological and biomechanical risk factors, namely wall thickness (WT), lipid percent (LP), minimum cap thickness (MinCT), plaque area (PA), plaque burden (PB), lumen area (LA), maximum plaque wall stress (MPWS), maximum plaque wall strain (MPWSn), average plaque wall stress (APWS), and average plaque wall strain (APWSn), were extracted from all slices for analysis. Wall thickness increase (WTI), plaque burden increase (PBI) and plaque area increase (PAI) were chosen as three measures of plaque progression. GLMMs with a 5-fold cross-validation strategy were used to calculate the prediction accuracy of each predictor and to identify the optimal predictor, with prediction accuracy defined as the sum of sensitivity and specificity. All 201 MRI slices were randomly divided into 4 training subgroups and 1 verification subgroup. The training subgroups were used for model fitting, and the verification subgroup was used to evaluate the model. All combinations (1023 in total) of the 10 risk factors were fed to the GLMM, and the prediction accuracy of each predictor was taken from the point on the receiver operating characteristic (ROC) curve with the highest sum of specificity and sensitivity. Results LA was the best single predictor for PBI, with the highest prediction accuracy (1.3601) and an area under the ROC curve (AUC) of 0.6540, followed by APWSn (1.3363) with AUC = 0.6342. The optimal predictor among all possible combinations for PBI was the combination of LA, PA, LP, WT, MPWS and MPWSn, with prediction accuracy = 1.4146 (AUC = 0.7158). LA was once again the best single predictor for PAI, with the highest prediction accuracy (1.1846) and AUC = 0.6064, followed by MPWSn (1.1832) with AUC = 0.6084. The combination of PA, PB, WT, MPWS, MPWSn and APWSn gave the best prediction accuracy (1.3025) for PAI, with AUC = 0.6657. PA was the best single predictor for WTI, with the highest prediction accuracy (1.2887) and AUC = 0.6415, followed by WT (1.2540) with AUC = 0.6097. The combination of PA, PB, WT, LP, MinCT, MPWS and MPWS was the best predictor for WTI, with prediction accuracy = 1.3140 and AUC = 0.6552. This indicated that PBI was a more predictable measure than WTI and PAI. The combinational predictors improved prediction accuracy by 9.95%, 4.01% and 1.96% over the best single predictors for PAI, PBI and WTI, respectively (AUC values improved by 9.78%, 9.45%, and 2.14%). Conclusions The use of GLMM with a 5-fold cross-validation strategy combining both morphological and biomechanical risk factors could potentially improve the accuracy of carotid plaque progression prediction. This study suggests that a linear combination of multiple predictors can provide potential improvement to existing plaque assessment schemes.
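Two pieces of arithmetic in the plaque abstract above lend themselves to a short sketch: the count of 1023 predictor combinations (all non-empty subsets of 10 risk factors, 2^10 - 1) and the "prediction accuracy" taken as the largest sum of sensitivity and specificity along the ROC curve. The thresholding rule (predict positive when the score is at or above the threshold), the function names, and the toy scores are assumptions for illustration; the GLMM fitting itself is not reproduced.

```python
from itertools import combinations

# The ten risk-factor abbreviations listed in the abstract.
FACTORS = ["WT", "LP", "MinCT", "PA", "PB", "LA", "MPWS", "MPWSn", "APWS", "APWSn"]

def all_nonempty_subsets(items):
    """Yield every non-empty combination of the given items."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)

def best_sens_plus_spec(scores, labels):
    """Scan each observed score as a threshold (positive when score >= threshold)
    and return (largest sensitivity + specificity, threshold achieving it)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    best_sum, best_threshold = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        total = tp / pos + tn / neg
        if total > best_sum:  # strict: the lowest threshold wins ties
            best_sum, best_threshold = total, t
    return best_sum, best_threshold

n_subsets = sum(1 for _ in all_nonempty_subsets(FACTORS))

# Hypothetical predicted scores and true progression labels for six slices.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
best_sum, best_threshold = best_sens_plus_spec(scores, labels)
```

Because the measure is a sum of two rates, it ranges from 0 to 2, which is why the reported accuracies such as 1.4146 exceed 1.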
Model validation is the most important part of building a supervised model. To build a model with good generalization performance, one must have a sensible data splitting strategy, which is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and the sample set partitioning based on joint X-Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models were then moving towards approximations of the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training set and validation set is necessary for a reliable estimate of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather poorly representative sample set for model performance estimation.
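The remark above about systematic sampling can be made concrete with a minimal sketch of the Kennard-Stone max-min selection rule, assuming Euclidean distance and a training-set size of at least two. It starts from the two most distant samples and repeatedly adds the sample whose nearest already-selected neighbour is farthest away. The function name and the one-dimensional toy points are hypothetical.

```python
import math

def kennard_stone(points, n_select):
    """Return (selected, remaining) index lists using the Kennard-Stone max-min rule.
    Assumes n_select >= 2 and points given as equal-length tuples."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Seed the training set with the two mutually most distant samples.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [i for i in range(n) if i not in (i0, j0)]
    while len(selected) < n_select and remaining:
        # Pick the sample whose closest selected sample is farthest away.
        nxt = max(remaining, key=lambda r: min(dist[r][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Hypothetical one-dimensional samples.
points = [(0.0,), (1.0,), (5.0,), (9.0,), (10.0,), (5.5,)]
train_idx, valid_idx = kennard_stone(points, n_select=3)
```

In this toy case the training set takes the two extremes and the centre, leaving the validation set with samples that sit close to already-selected points, which is consistent with the abstract's explanation of why such splits can misestimate performance.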
The cross-validation method is used to choose the three smoothing parameters in nonlinear wavelet regression estimators. The strong consistency and convergence rate of cross-validation nonlinear wavelet regression estimators are obtained.
Given their technical and economic advantages, explosive substances are widely used in rock mass excavation. However, because of serious environmental constraints, there has been an increasing need for complex tools to control the environmental effects of blast-induced ground vibrations. In the present study, an artificial neural network (ANN) with k-fold cross-validation was applied to a dataset of 1114 observations obtained from published results; both quantitative and qualitative parameters were considered for predicting ground vibration amplitude. The best ANN model obtained has a maximum coefficient of determination of 0.840 and a mean absolute error of 5.59, and it comprises 17 input parameters, 12 neurons in a single hidden layer, and a sigmoid transfer function. Compared with traditional models, the model obtained using the proposed methodology demonstrated better generalization ability. Furthermore, the proposed methodology offers an ANN model with higher prediction ability.
With high-precision products for satellite orbits and clocks, uncalibrated phase delays, and atmospheric delay corrections, Precise Point Positioning (PPP) based on a Real-Time Kinematic (RTK) network can rapidly achieve centimeter-level positioning accuracy. In the ionosphere-weighted PPP-RTK model, not only the a priori value of the ionospheric delay but also its precision affects the convergence and accuracy of positioning. This study proposes a method to determine the precision of the interpolated slant ionospheric delay by cross-validation. The new method takes the high temporal and spatial variation into consideration. A distance-dependent function is built to represent the stochastic model of the slant ionospheric delay derived from each reference station, and an error model is built for each reference station on a five-minute piecewise basis. The user can interpolate the ionospheric delay correction and the corresponding precision with an error function related to the distance and time of each reference station. With the European Reference Frame (EUREF) Permanent GNSS (Global Navigation Satellite Systems) Network (EPN) and SONEL (Système d'Observation du Niveau des Eaux Littorales) GNSS stations covering most of Europe, the effectiveness of the wide-area ionosphere constraint method for PPP-RTK is validated and compared with the method using a fixed ionosphere precision threshold. Although the Root Mean Square (RMS) of the interpolated ionosphere error is within 5 cm in most areas, it exceeds 10 cm in some areas with sparse reference stations during some periods. With the proposed method, the 90th-percentile convergence time is 4.0 and 20.5 min for the horizontal and vertical directions, respectively, using the Global Positioning System (GPS) kinematic solution. This convergence is faster than with the fixed ionosphere precision values of 1, 8, and 30 cm; the improvement with respect to these three solutions ranges from 10 to 60%. After integrating the Galileo navigation satellite system (Galileo), the 90th-percentile convergence time for the combined kinematic solution is 2.0 and 9.0 min, an improvement of 50.0% and 56.1% for the horizontal and vertical directions, respectively, over the GPS-only solution. The average convergence times of GPS PPP-RTK for the horizontal and vertical directions are 2.0 and 5.0 min, and those of GPS+Galileo PPP-RTK are 1.4 and 3.0 min, respectively.
Bulked-segregant analysis by deep sequencing (BSA-seq) is a widely used method for mapping quantitative trait loci (QTL) due to its simplicity, speed, cost-effectiveness, and efficiency. However, the ability of BSA-seq to detect QTL is often limited by inappropriate experimental designs, as evidenced by numerous practical studies. Most BSA-seq studies have used small to medium-sized populations, with F2 populations being the most common choice. Nevertheless, theoretical studies have shown that using a large population with an appropriate pool size can significantly enhance the power and resolution of QTL detection in BSA-seq, with F3 populations offering notable advantages over F2 populations. To provide an experimental demonstration, we tested the power of BSA-seq to identify QTL controlling days from sowing to heading (DTH) in a 7200-plant rice F3 population in two environments, with a pool size of approximately 500. Each experiment identified 34 QTL, an order of magnitude more than reported in most BSA-seq experiments, of which 23 were detected in both experiments, with 17 of these located near 41 previously reported QTL and eight cloned genes known to control DTH in rice. These results indicate that QTL mapping by BSA-seq in large F3 populations and multi-environment experiments can achieve high power, resolution, and reliability.
Adaptive fractional polynomial modeling of general correlated outcomes is formulated to address nonlinearity in means, variances/dispersions, and correlations. Means and variances/dispersions are modeled using generalized linear models in fixed effects/coefficients. Correlations are modeled using random effects/coefficients. Nonlinearity is addressed using power transforms of primary (untransformed) predictors. Parameter estimation is based on extended linear mixed modeling, which generalizes both generalized estimating equations and linear mixed modeling. Models are evaluated using likelihood cross-validation (LCV) scores and are generated adaptively using a heuristic search controlled by LCV scores. Cases covered include linear, Poisson, logistic, exponential, and discrete regression of correlated continuous, count/rate, dichotomous, positive continuous, and discrete numeric outcomes treated as normally, Poisson, Bernoulli, exponentially, and discrete numerically distributed, respectively. Example analyses are also generated for these five cases to compare adaptive random effects/coefficients modeling of correlated outcomes to previously developed adaptive modeling based on directly specified covariance structures. Adaptive random effects/coefficients modeling substantially outperforms direct covariance modeling in the linear, exponential, and discrete regression example analyses. It generates equivalent results in the logistic regression example analyses, and it is substantially outperformed in the Poisson regression case. Random effects/coefficients modeling of correlated outcomes can provide substantial improvements in model selection compared to directly specified covariance modeling. However, directly specified covariance modeling can generate competitive or substantially better results in some cases while usually requiring less computation time.
The use of machine learning to predict student employability is important for analysing a student's capability to get a job. Based on the results of this type of analysis, university managers can improve the employability of their students, which can help attract students in the future. In addition, learners can focus during their studies on the essential skills identified through this analysis, to increase their employability. An effective method called OPT-BAG (OPTimisation of BAGging classifiers) was therefore developed to model the problem of predicting student employability. This model can help predict the employability of students based on their competencies and can reveal weaknesses that need to be improved. First, we analyse the relationships between several variables and the outcome variable using a correlation heatmap for a student employability dataset. Next, a standard scaler function is applied in the preprocessing module to normalise the variables in the dataset. The training set is then input to our model to identify the optimal parameters for the bagging classifier using a grid search cross-validation technique. Finally, the OPT-BAG model, based on a bagging classifier with the optimal parameters found in the previous step, is trained on the training dataset to predict student employability. The empirical outcomes in terms of accuracy, precision, recall, and F1 indicate that the OPT-BAG approach outperforms other cutting-edge machine learning models in predicting student employability. In this study, we also analyse the factors affecting employers' recruitment processes and find that general appearance, mental alertness, and communication skills are the most important. This indicates that educational institutions should focus on these factors during the learning process to improve student employability.
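The grid search cross-validation step named in the abstract above can be sketched in a few lines: every parameter combination in a grid is scored by k-fold cross-validated accuracy, and the best-scoring combination is kept. The sketch below is not the OPT-BAG method; as a stand-in for the bagging classifier it uses a toy one-dimensional nearest-neighbour voter, and the function names, the `n_neighbors` parameter, and the data are all hypothetical.

```python
from itertools import product

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train, test) index lists."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for f in range(k):
        test = list(range(bounds[f], bounds[f + 1]))
        train = [i for i in range(n) if i < bounds[f] or i >= bounds[f + 1]]
        yield train, test

def knn_predict(train_x, train_y, x, n_neighbors):
    """Toy stand-in classifier: majority vote of the nearest 1-D training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def grid_search_cv(xs, ys, grid, k=5):
    """Return (best_params, results), where results maps each value tuple to CV accuracy."""
    names = sorted(grid)
    results = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        correct = 0
        for train, test in kfold_indices(len(xs), k):
            tx = [xs[i] for i in train]
            ty = [ys[i] for i in train]
            correct += sum(knn_predict(tx, ty, xs[i], **params) == ys[i] for i in test)
        results[values] = correct / len(xs)
    best = max(results, key=results.get)  # first combination with the top score
    return dict(zip(names, best)), results

# Hypothetical data: one feature, label 1 when the feature is at least 10.
xs = list(range(20))
ys = [1 if x >= 10 else 0 for x in xs]
grid = {"n_neighbors": [1, 3, 5]}
best_params, results = grid_search_cv(xs, ys, grid)
```

After the search, the chosen parameters would normally be used to refit the model on the full training set, which matches the final training step the abstract describes.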
BACKGROUND Our study expands upon a large body of evidence in the field of neuropsychiatric imaging with cognitive, affective and behavioral tasks adapted for the functional magnetic resonance imaging (fMRI) experimental environment. There is sufficient evidence that common networks underpin activations in task-based fMRI across different mental disorders. AIM To investigate whether specific neural circuits underpin differential item responses to depressive, paranoid and neutral items (DN) in patients with schizophrenia (SCZ) and major depressive disorder (MDD), respectively. METHODS Sixty patients with SCZ or MDD were recruited. All patients were scanned on a 3T magnetic resonance tomography platform with a functional MRI paradigm comprising a block design that included blocks with diagnostic paranoid (DP) items, depression-specific (DS) items and DN items from a general interest scale. We performed a two-sample t-test between the two groups, SCZ patients and depressive patients. Our purpose was to observe the different brain networks that were activated during each specific condition of the task (DS, DP and DN). RESULTS Several significant results were found in the comparison between the SCZ and depressive groups performing this task. We identified one component that is task-related and independent of condition (shared between all three conditions), composed of regions within the temporal (right superior and middle temporal gyri), frontal (left middle and inferior frontal gyri) and limbic/salience systems (right anterior insula). Another component is related to both diagnosis-specific conditions (DS and DP), i.e., it is shared between the depressive and SCZ groups, and includes frontal motor/language and parietal areas. One specific component is modulated preferentially by the DP condition and is related mainly to prefrontal regions, whereas two other components are significantly modulated by the DS condition and include clusters within the default mode network, such as the posterior cingulate and precuneus, several occipital areas, including the lingual and fusiform gyri, and the parahippocampal gyrus. Finally, component 12 appeared to be unique to the neutral condition. In addition, circuits across components were identified that are either common or distinct in the preferential processing of the sub-scales of the task. CONCLUSION This study delivers further evidence in support of the model of trans-disciplinary cross-validation in psychiatry.
Maintenance operations have a critical influence on power generation by wind turbines (WT). Advanced algorithms must analyze large volumes of data from condition monitoring systems (CMS) to determine the actual working conditions and avoid false alarms. This paper proposes different support vector machine (SVM) algorithms for the prediction and detection of false alarms. K-fold cross-validation (CV) is applied to evaluate the classification reliability of these algorithms. Supervisory Control and Data Acquisition (SCADA) data from an operating WT are used to test the proposed approach. The quadratic SVM achieved an accuracy of 98.6%. Misclassifications from the confusion matrix, the alarm log and maintenance records are analyzed to obtain quantitative information and to determine whether each case is a false alarm. The classifier reduces the number of false alarms, termed misclassifications, by 25%. These results demonstrate that the proposed approach offers high reliability and accuracy in false alarm identification.
Regression models for survival time data involve estimation of the hazard rate as a function of predictor variables and associated slope parameters. An adaptive approach is formulated for such hazard regression modeling. The hazard rate is modeled using fractional polynomials, that is, linear combinations of products of power transforms of time together with other available predictors. These fractional polynomial models are restricted to generating positive-valued hazard rates and decreasing survival times. Exponentially distributed survival times are a special case. Parameters are estimated using maximum likelihood estimation allowing for right-censored survival times. Models are evaluated and compared using likelihood cross-validation (LCV) scores. LCV scores and tolerance parameters are used to control an adaptive search through alternative fractional polynomial hazard rate models to identify effective models for the underlying survival time data. These methods are demonstrated using two survival time data sets, one for lung cancer patients and one for multiple myeloma patients. For the lung cancer data, the hazard rate depends distinctly on time. However, controlling for cell type provides a distinct improvement, while the hazard rate depends only on cell type and no longer on time. Furthermore, Cox regression is unable to identify a cell type effect. For the multiple myeloma data, the hazard rate also depends distinctly on time. Moreover, consideration of hemoglobin at diagnosis provides a distinct improvement, the hazard rate still depends distinctly on time, and hemoglobin distinctly moderates the effect of time on the hazard rate. These results indicate that adaptive hazard rate modeling can provide unique insights into survival time data.
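A likelihood cross-validation score of the kind mentioned above can be sketched for the exponential special case with right censoring. The abstract does not define the score, so the form used here is an assumption: the held-out log-likelihoods of all folds are summed, divided by the number of observations, and exponentiated, so that larger values indicate better models. The constant-hazard model, the function names, and the fold assignment by index are illustrative only.

```python
import math

def exp_rate_mle(times, events):
    """MLE of a constant hazard rate under right censoring:
    number of observed events divided by total follow-up time."""
    return sum(events) / sum(times)

def exp_loglik(times, events, rate):
    """Exponential log-likelihood with right censoring: an observed event
    contributes log(rate), and every subject contributes -rate * time."""
    return sum(d * math.log(rate) - rate * t for t, d in zip(times, events))

def lcv_score(times, events, k):
    """Assumed LCV form: exp of the mean held-out log-likelihood over k folds.
    Every training fold must contain at least one event, otherwise the
    estimated rate is zero and log(rate) is undefined."""
    n = len(times)
    total = 0.0
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        rate = exp_rate_mle([times[i] for i in train], [events[i] for i in train])
        total += exp_loglik([times[i] for i in test], [events[i] for i in test], rate)
    return math.exp(total / n)

# Hypothetical data: four subjects, all with an event at time 1.0.
times = [1.0, 1.0, 1.0, 1.0]
events = [1, 1, 1, 1]  # 1 = event observed, 0 = right censored
score = lcv_score(times, events, k=2)
```

An adaptive search such as the one described would compare these scores across candidate hazard models and keep a more complex model only when its score improves by more than a tolerance.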
Funding (aviation incident classification study): Supported by the National Natural Science Foundation of China Civil Aviation Joint Fund (U1833110) and Research on the Dual Prevention Mechanism and Intelligent Management Technology for Civil Aviation Safety Risks (YK23-03-05).
Funding (carotid plaque progression study): Supported in part by National Sciences Foundation of China grant (11672001), Jiangsu Province Science and Technology Agency grant (BE2016785), and Postgraduate Research & Practice Innovation Program of Jiangsu Province grant (KYCX18_0156).
Abstract: Background: Cardiovascular diseases are closely linked to atherosclerotic plaque development and rupture. Plaque progression prediction is of fundamental significance to cardiovascular research and disease diagnosis, prevention, and treatment. The generalized linear mixed model (GLMM) is an extension of the linear model for categorical responses that accounts for the correlation among observations. Methods: Magnetic resonance imaging (MRI) data of carotid atherosclerotic plaques were acquired from 20 patients with consent obtained, and 3D thin-layer models were constructed to calculate plaque stress and strain for plaque progression prediction. Data for ten morphological and biomechanical risk factors, namely wall thickness (WT), lipid percent (LP), minimum cap thickness (MinCT), plaque area (PA), plaque burden (PB), lumen area (LA), maximum plaque wall stress (MPWS), maximum plaque wall strain (MPWSn), average plaque wall stress (APWS), and average plaque wall strain (APWSn), were extracted from all slices for analysis. Wall thickness increase (WTI), plaque burden increase (PBI), and plaque area increase (PAI) were chosen as three measures of plaque progression. GLMMs with a 5-fold cross-validation strategy were used to calculate the prediction accuracy of each predictor and to identify the optimal predictor, with prediction accuracy defined as the sum of sensitivity and specificity. All 201 MRI slices were randomly divided into 4 training subgroups and 1 verification subgroup. The training subgroups were used for model fitting, and the verification subgroup was used to assess the model. All combinations (1023 in total) of the 10 risk factors were fed to the GLMM, and the prediction accuracy of each predictor was taken from the point on the receiver operating characteristic (ROC) curve with the highest sum of specificity and sensitivity. Results: LA was the best single predictor for PBI, with the highest prediction accuracy (1.3601) and an area under the ROC curve (AUC) of 0.6540, followed by APWSn (1.3363) with AUC = 0.6342. The optimal predictor among all possible combinations for PBI was the combination of LA, PA, LP, WT, MPWS and MPWSn, with prediction accuracy = 1.4146 (AUC = 0.7158). LA was once again the best single predictor for PAI, with the highest prediction accuracy (1.1846) and AUC = 0.6064, followed by MPWSn (1.1832) with AUC = 0.6084. The combination of PA, PB, WT, MPWS, MPWSn and APWSn gave the best prediction accuracy (1.3025) for PAI, with AUC = 0.6657. PA was the best single predictor for WTI, with the highest prediction accuracy (1.2887) and AUC = 0.6415, followed by WT (1.2540) with AUC = 0.6097. The combination of PA, PB, WT, LP, MinCT, MPWS and MPWS was the best predictor for WTI, with prediction accuracy = 1.3140 and AUC = 0.6552. This indicated that PBI was a more predictable measure than WTI and PAI. The combinational predictors improved prediction accuracy by 9.95%, 4.01% and 1.96% over the best single predictors for PAI, PBI and WTI (AUC values improved by 9.78%, 9.45%, and 2.14%), respectively. Conclusions: The use of GLMM with a 5-fold cross-validation strategy combining both morphological and biomechanical risk factors could potentially improve the accuracy of carotid plaque progression prediction. This study suggests that a linear combination of multiple predictors can provide potential improvement to existing plaque assessment schemes.
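The accuracy measure used above (the highest sum of sensitivity and specificity along the ROC curve) can be illustrated with a minimal sketch. The function name `best_operating_point` and the toy scores are assumptions for illustration only; the study's GLMM fitting and 5-fold splitting are not reproduced here.

```python
def best_operating_point(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing sens + spec.

    A sample is predicted positive when its score is >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best = None
    # Every distinct score is a candidate threshold (one ROC point each).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / positives
        spec = tn / negatives
        if best is None or sens + spec > best[1] + best[2]:
            best = (t, sens, spec)
    return best


# Hypothetical predicted probabilities and true progression labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
threshold, sens, spec = best_operating_point(scores, labels)
print(threshold, sens + spec)
```

Because the measure is a sum of two rates, it ranges from 0 to 2, which is why the reported accuracies exceed 1.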
Funding: YX and RG thank the Wellcome Trust for funding MetaboFlow (Grant 202952/Z/16/Z).
Abstract: Model validation is the most important part of building a supervised model. To build a model with good generalization performance, one must have a sensible data splitting strategy, which is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S), and the sample set partitioning based on joint X-Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models were then moving towards approximations of the central limit theory for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training set and the validation set is necessary for a reliable estimation of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather poorly representative sample set for model performance estimation.
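To make the remark about systematic sampling concrete, here is a minimal sketch of Kennard-Stone selection as it is usually described: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The function name and toy data are assumptions, and using the leftover samples as the validation set is an illustrative choice, not a detail taken from the study.

```python
import math


def kennard_stone(points, n_select):
    """Return (selected, remaining) index lists using Kennard-Stone selection."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")

    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]

    # Add the sample whose minimum distance to the selected set is largest.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


# Hypothetical one-dimensional samples.
points = [(0.0,), (1.0,), (2.0,), (5.0,), (10.0,)]
train_idx, valid_idx = kennard_stone(points, 3)
print(train_idx, valid_idx)
```

In this toy case the training set spans the whole range while the leftover validation samples sit close together at one end, which mirrors the poorly representative validation sets described above.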
Abstract: The cross-validation method is used to choose the three smoothing parameters in nonlinear wavelet regression estimators. The strong consistency and convergence rate of cross-validation nonlinear wavelet regression estimators are obtained.
Funding: Support of CERENA (Center for Natural Resources and Environment; strategic project FCT-UID/ECI/04028/2019), Portugal.
Abstract: Given their technical and economic advantages, explosive substances are widely used in rock mass excavation. However, because of serious environmental constraints, there has been an increasing need for complex tools to control the environmental effects of blast-induced ground vibrations. In the present study, an artificial neural network (ANN) with k-fold cross-validation was applied to a dataset of 1114 observations obtained from published results; quantitative and qualitative parameters were considered for ground vibration amplitude prediction. The best ANN model obtained has a maximum coefficient of determination of 0.840 and a mean absolute error of 5.59, and it comprises 17 input parameters, 12 neurons in a single hidden layer, and a sigmoid transfer function. Compared with traditional models, the model obtained using the proposed methodology demonstrated better generalization ability. Furthermore, the proposed methodology offers an ANN model with higher prediction ability.
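The evaluation loop behind k-fold cross-validation with a coefficient of determination and a mean absolute error can be sketched as follows. This is an illustration under stated assumptions: a one-variable least-squares line stands in for the ANN, the data are synthetic, and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def cross_validate(xs, ys, k=5, seed=0):
    """Return (MAE, R^2) computed from out-of-fold predictions only."""
    preds = [0.0] * len(xs)
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = slope * xs[i] + intercept
    mae = sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for p, y in zip(preds, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return mae, 1.0 - ss_res / ss_tot


# Synthetic, noise-free data: the line is recovered in every fold.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
mae, r2 = cross_validate(xs, ys, k=5)
print(mae, r2)
```

Each observation is predicted by a model that never saw it during fitting, so the two metrics estimate generalization rather than goodness of fit on the training data.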
Funding: The authors acknowledge grant support from the National Science Fund for Distinguished Young Scholars (Grant No. 41825009) and the China Scholarship Council (CSC No. 201806560015 and 202006270072).
Abstract: With high-precision products for satellite orbit and clock, uncalibrated phase delay, and atmospheric delay corrections, Precise Point Positioning (PPP) based on a Real-Time Kinematic (RTK) network can rapidly achieve centimeter-level positioning accuracy. In the ionosphere-weighted PPP-RTK model, not only the a priori value of the ionosphere but also its precision affects the convergence and accuracy of positioning. This study proposes a method to determine the precision of the interpolated slant ionospheric delay by cross-validation. The new method takes the high temporal and spatial variation into consideration. A distance-dependent function is built to represent the stochastic model of the slant ionospheric delay derived from each reference station, and an error model is built for each reference station on a five-minute piecewise basis. The user can interpolate the ionospheric delay correction and the corresponding precision with an error function related to the distance and time of each reference station. With the European Reference Frame (EUREF) Permanent GNSS (Global Navigation Satellite Systems) network (EPN) and SONEL (Système d'Observation du Niveau des Eaux Littorales) GNSS stations covering most of Europe, the effectiveness of our wide-area ionosphere constraint method for PPP-RTK is validated and compared with the method using a fixed ionosphere precision threshold. It is shown that although the Root Mean Square (RMS) of the interpolated ionosphere error is within 5 cm in most areas, it exceeds 10 cm in some areas with sparse reference stations during some periods of time. With the proposed method, the convergence time of the 90th percentile is 4.0 and 20.5 min for the horizontal and vertical directions, respectively, using the Global Positioning System (GPS) kinematic solution. This convergence is faster than that obtained with the fixed ionosphere precision values of 1, 8, and 30 cm. The improvement with respect to the latter three solutions ranges from 10 to 60%. After integrating the Galileo navigation satellite system (Galileo), the convergence time of the 90th percentile for combined kinematic solutions is 2.0 and 9.0 min, an improvement of 50.0% and 56.1% for the horizontal and vertical directions, respectively, compared with the GPS-only solution. The average convergence times of GPS PPP-RTK for the horizontal and vertical directions are 2.0 and 5.0 min, and those of GPS+Galileo PPP-RTK are 1.4 and 3.0 min, respectively.
Funding: Supported by the Natural Science Foundation of Fujian Province (CN) (2020I0009, 2022J01596), the Cooperation Project on University Industry-Education-Research of Fujian Provincial Science and Technology Plan (CN) (2022N5011), the Lancang-Mekong Cooperation Special Fund (2017-2020), and the International Sci-Tech Cooperation and Communication Program of Fujian Agriculture and Forestry University (KXGH17014).
Abstract: Bulked-segregant analysis by deep sequencing (BSA-seq) is a widely used method for mapping QTL (quantitative trait loci) due to its simplicity, speed, cost-effectiveness, and efficiency. However, the ability of BSA-seq to detect QTL is often limited by inappropriate experimental designs, as evidenced by numerous practical studies. Most BSA-seq studies have utilized small to medium-sized populations, with F2 populations being the most common choice. Nevertheless, theoretical studies have shown that using a large population with an appropriate pool size can significantly enhance the power and resolution of QTL detection in BSA-seq, with F3 populations offering notable advantages over F2 populations. To provide an experimental demonstration, we tested the power of BSA-seq to identify QTL controlling days from sowing to heading (DTH) in a 7200-plant rice F3 population in two environments, with a pool size of approximately 500. Each experiment identified 34 QTL, an order of magnitude greater than reported in most BSA-seq experiments, of which 23 were detected in both experiments, with 17 of these located near 41 previously reported QTL and eight cloned genes known to control DTH in rice. These results indicate that QTL mapping by BSA-seq in large F3 populations and multi-environment experiments can achieve high power, resolution, and reliability.
Abstract: Adaptive fractional polynomial modeling of general correlated outcomes is formulated to address nonlinearity in means, variances/dispersions, and correlations. Means and variances/dispersions are modeled using generalized linear models in fixed effects/coefficients. Correlations are modeled using random effects/coefficients. Nonlinearity is addressed using power transforms of primary (untransformed) predictors. Parameter estimation is based on extended linear mixed modeling generalizing both generalized estimating equations and linear mixed modeling. Models are evaluated using likelihood cross-validation (LCV) scores and are generated adaptively using a heuristic search controlled by LCV scores. Cases covered include linear, Poisson, logistic, exponential, and discrete regression of correlated continuous, count/rate, dichotomous, positive continuous, and discrete numeric outcomes treated as normally, Poisson, Bernoulli, exponentially, and discrete numerically distributed, respectively. Example analyses are also generated for these five cases to compare adaptive random effects/coefficients modeling of correlated outcomes to previously developed adaptive modeling based on directly specified covariance structures. Adaptive random effects/coefficients modeling substantially outperforms direct covariance modeling in the linear, exponential, and discrete regression example analyses. It generates equivalent results in the logistic regression example analyses and it is substantially outperformed in the Poisson regression case. Random effects/coefficients modeling of correlated outcomes can provide substantial improvements in model selection compared to directly specified covariance modeling. However, directly specified covariance modeling can generate competitive or substantially better results in some cases while usually requiring less computation time.
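The abstract does not spell out how an LCV score is computed. The sketch below assumes one common formulation: the likelihood of each held-out fold, evaluated at parameters estimated from the remaining folds, is multiplied across folds and normalized by taking the n-th root, where n is the number of observations. A mean-only normal model with independent observations stands in for the paper's mixed models, and the function names are hypothetical.

```python
import math
import random


def normal_loglik(values, mean, var):
    """Log-likelihood of independent normal observations."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - (v - mean) ** 2 / (2.0 * var)
        for v in values
    )


def lcv_score(values, k=5):
    """Assumed LCV: exp of the mean held-out log-likelihood per observation."""
    n = len(values)
    folds = [values[i::k] for i in range(k)]
    total = 0.0
    for h in range(k):
        # Estimate parameters without fold h (maximum likelihood estimates).
        train = [v for j, fold in enumerate(folds) if j != h for v in fold]
        mean = sum(train) / len(train)
        var = sum((v - mean) ** 2 for v in train) / len(train)
        # Score fold h with those parameters.
        total += normal_loglik(folds[h], mean, var)
    return math.exp(total / n)


rng = random.Random(0)
tight = [rng.gauss(0.0, 1.0) for _ in range(30)]
wide = [10.0 * v for v in tight]  # same shape, ten times the spread
print(lcv_score(tight), lcv_score(wide))
```

Under this formulation larger scores are better, and candidate models fitted to the same data can be compared by the ratio of their scores.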
Abstract: The use of machine learning to predict student employability is important in order to analyse a student's capability to get a job. Based on the results of this type of analysis, university managers can improve the employability of their students, which can help in attracting students in the future. In addition, learners can focus on the essential skills identified through this analysis during their studies, to increase their employability. An effective method called OPT-BAG (OPTimisation of BAGging classifiers) was therefore developed to model the problem of predicting the employability of students. This model can help predict the employability of students based on their competencies and can reveal weaknesses that need to be improved. First, we analyse the relationships between several variables and the outcome variable using a correlation heatmap for a student employability dataset. Next, a standard scaler function is applied in the preprocessing module to normalise the variables in the student employability dataset. The training set is then input to our model to identify the optimal parameters for the bagging classifier using a grid search cross-validation technique. Finally, the OPT-BAG model, based on a bagging classifier with the optimal parameters found in the previous step, is trained on the training dataset to predict student employability. The empirical outcomes in terms of accuracy, precision, recall, and F1 indicate that the OPT-BAG approach outperforms other cutting-edge machine learning models in terms of predicting student employability. In this study, we also analyse the factors affecting the recruitment process of employers, and find that general appearance, mental alertness, and communication skills are the most important. This indicates that educational institutions should focus on these factors during the learning process to improve student employability.
Abstract: BACKGROUND: Our study expands upon a large body of evidence in the field of neuropsychiatric imaging with cognitive, affective and behavioral tasks adapted for the functional magnetic resonance imaging (fMRI) experimental environment. There is sufficient evidence that common networks underpin activations in task-based fMRI across different mental disorders. AIM: To investigate whether there exist specific neural circuits which underpin differential item responses to depressive, paranoid and neutral items (DN) in patients with schizophrenia (SCZ) and major depressive disorder (MDD), respectively. METHODS: Sixty patients with SCZ and MDD were recruited. All patients were scanned on a 3T magnetic resonance tomography platform with a functional MRI paradigm comprising a block design, including blocks with items from diagnostic paranoid (DP), depression specific (DS) and DN from a general interest scale. We performed a two-sample t-test between the two groups, SCZ patients and depressive patients. Our purpose was to observe the different brain networks activated during a specific condition of the task, respectively DS, DP, DN. RESULTS: Several significant results were demonstrated in the comparison between the SCZ and depressive groups while performing this task. We identified one component that is task-related and independent of condition (shared between all three conditions), composed of regions within the temporal (right superior and middle temporal gyri), frontal (left middle and inferior frontal gyri) and limbic/salience system (right anterior insula). Another component is related to both diagnostic specific conditions (DS and DP), i.e., it is shared between DEP and SCZ, and includes frontal motor/language and parietal areas. One specific component is modulated preferentially by the DP condition and is related mainly to prefrontal regions, whereas two other components are significantly modulated by the DS condition and include clusters within the default mode network such as the posterior cingulate and precuneus, several occipital areas, including the lingual and fusiform gyrus, as well as the parahippocampal gyrus. Finally, component 12 appeared to be unique to the neutral condition. In addition, circuits across components were determined which are either common or distinct in the preferential processing of the sub-scales of the task. CONCLUSION: This study delivers further evidence in support of the model of trans-disciplinary cross-validation in psychiatry.
Funding: Supported financially by the Ministerio de Ciencia e Innovación (Spain) and the European Regional Development Fund under the Research Grant WindSound Project (Ref.: PID2021-125278OB-I00).
Abstract: Maintenance operations have a critical influence on power generation by wind turbines (WT). Advanced algorithms must analyze large volumes of data from condition monitoring systems (CMS) to determine the actual working conditions and avoid false alarms. This paper proposes different support vector machine (SVM) algorithms for the prediction and detection of false alarms. K-fold cross-validation (CV) is applied to evaluate the classification reliability of these algorithms. Supervisory Control and Data Acquisition (SCADA) data from an operating WT are used to test the proposed approach. The quadratic SVM achieved an accuracy rate of 98.6%. Misclassifications from the confusion matrix, the alarm log and maintenance records are analyzed to obtain quantitative information and determine whether an alarm is false. The classifier reduces the number of false alarms, termed misclassifications, by 25%. These results demonstrate that the proposed approach presents high reliability and accuracy in false alarm identification.
Abstract: Regression models for survival time data involve estimation of the hazard rate as a function of predictor variables and associated slope parameters. An adaptive approach is formulated for such hazard regression modeling. The hazard rate is modeled using fractional polynomials, that is, linear combinations of products of power transforms of time together with other available predictors. These fractional polynomial models are restricted to generating positive-valued hazard rates and decreasing survival times. Exponentially distributed survival times are a special case. Parameters are estimated using maximum likelihood estimation allowing for right censored survival times. Models are evaluated and compared using likelihood cross-validation (LCV) scores. LCV scores and tolerance parameters are used to control an adaptive search through alternative fractional polynomial hazard rate models to identify effective models for the underlying survival time data. These methods are demonstrated using two different survival time data sets including survival times for lung cancer patients and for multiple myeloma patients. For the lung cancer data, the hazard rate depends distinctly on time. However, controlling for cell type provides a distinct improvement while the hazard rate depends only on cell type and no longer on time. Furthermore, Cox regression is unable to identify a cell type effect. For the multiple myeloma data, the hazard rate also depends distinctly on time. Moreover, consideration of hemoglobin at diagnosis provides a distinct improvement, the hazard rate still depends distinctly on time, and hemoglobin distinctly moderates the effect of time on the hazard rate. These results indicate that adaptive hazard rate modeling can provide unique insights into survival time data.