Funding: Supported by the National Natural Science Foundation of China (62033008, 61873143).
Abstract: With increasing intelligence and integration, large-scale industrial processes often contain a great number of two-valued variables (generally stored as 0 or 1). However, these variables cannot be handled effectively by traditional monitoring methods such as linear discriminant analysis (LDA), principal component analysis (PCA), and partial least squares (PLS) analysis. Recently, a mixed hidden naive Bayesian model (MHNBM) was developed for the first time to use both two-valued and continuous variables for abnormality monitoring. Although the MHNBM is effective, it has some shortcomings. In the MHNBM, variables that are more strongly correlated with other variables receive greater weights, which does not guarantee that greater weights are assigned to the more discriminating variables. In addition, the conditional probability P(x_j | x_j′, y = k) must be computed from historical data. When training data are scarce, the conditional probability between continuous variables tends to be uniformly distributed, which degrades the performance of the MHNBM. Here, a novel feature weighted mixed naive Bayes model (FWMNBM) is developed to overcome these shortcomings. In the FWMNBM, variables that are more strongly correlated with the class receive greater weights, so the more discriminating variables contribute more to the model. At the same time, the FWMNBM does not need to calculate the conditional probability between variables, so it is less restricted by the number of training samples. Compared with the MHNBM, the FWMNBM performs better, and its effectiveness is validated through numerical cases of a simulation example and a practical case of the Zhoushan thermal power plant (ZTPP), China.
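As a rough illustration of the weighting idea described in this abstract, the Python sketch below scales each feature's log-likelihood by a weight derived from its correlation with the class. The abstract does not state the FWMNBM weighting formula, so the absolute Pearson correlation, the normalisation to an average weight of 1, the Bernoulli and Gaussian likelihoods, and every function name here are assumptions for illustration, not the authors' method.

```python
import math
from statistics import mean, pvariance


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either sequence is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def fit(X, y, is_binary):
    """X: list of rows, y: class labels, is_binary[j]: True if column j is 0/1."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    raw = [abs_corr(col, y) for col in cols]
    total = sum(raw) or 1.0
    # Assumed weighting: proportional to |corr(feature, class)|, average weight 1.
    weights = [n_feat * r / total for r in raw]
    model = {"weights": weights, "is_binary": is_binary, "classes": {}}
    for k in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == k]
        params = []
        for j in range(n_feat):
            vals = [row[j] for row in rows]
            if is_binary[j]:
                # Laplace-smoothed Bernoulli parameter P(x_j = 1 | y = k).
                params.append((sum(vals) + 1) / (len(vals) + 2))
            else:
                # Gaussian mean and variance (small floor avoids zero variance).
                params.append((mean(vals), pvariance(vals) + 1e-6))
        model["classes"][k] = (math.log(len(rows) / len(X)), params)
    return model


def log_score(model, x, k):
    """Log prior plus the weighted sum of per-feature log-likelihoods."""
    log_prior, params = model["classes"][k]
    score = log_prior
    for j, (xj, p) in enumerate(zip(x, params)):
        if model["is_binary"][j]:
            ll = math.log(p if xj == 1 else 1 - p)
        else:
            mu, var = p
            ll = -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        score += model["weights"][j] * ll
    return score


def predict(model, x):
    """Return the class with the highest weighted log score."""
    return max(model["classes"], key=lambda k: log_score(model, x, k))
```

Note that the score uses only per-feature terms P(x_j | y = k); no probability between pairs of variables is estimated, which is the property the abstract credits for the reduced dependence on training-set size.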
Funding: Our deepest gratitude goes to the editors and anonymous reviewers for their careful work and thoughtful suggestions, which have helped to improve this paper substantially. The work was supported by the National Natural Science Foundation of China (No. 12071382), the Bowang scholar youth talent program (Zhiqiang Ye) of Chongqing Normal University, the Natural Science and Engineering Research Council of Canada, and the Canada Research Chair Program (JWu).
Abstract: Based on lung adenocarcinoma (LUAD) gene expression data from The Cancer Genome Atlas (TCGA) database, the Stromal score, Immune score, and Estimate score of the tumor microenvironment (TME) were computed with the Estimation of Stromal and Immune cells in Malignant Tumor tissues using Expression data (ESTIMATE) algorithm. Gene modules significantly related to the three scores were identified by weighted gene coexpression network analysis (WGCNA). Based on the correlation coefficients and P values, 899 key genes affecting the tumor microenvironment were obtained by selecting the two most correlated modules. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis suggested that these key genes were significantly involved in immune-related or cancer-related terms. Through univariate Cox regression and elastic network analysis, genes associated with the prognosis of LUAD patients were screened out, and their prognostic values were further verified by survival analysis and the University of ALabama at Birmingham CANcer (UALCAN) database. The results indicated that eight genes were significantly related to the overall survival of LUAD. Among them, six genes were found to be differentially expressed between tumor and control samples. Immune infiltration analysis further verified that all six genes were significantly related to tumor purity and immune cells. Therefore, these genes were eventually used to construct a Naive Bayes projection model of LUAD. The model was verified by the receiver operating characteristic (ROC) curve, where the area under the curve (AUC) reached 92.03%, which suggested that the model could accurately discriminate tumor samples from normal samples. Our study provided an effective model for LUAD projection, which improved the clinical diagnosis and cure of LUAD. The results also confirmed that the six genes used in the model construction could be potential prognostic biomarkers of LUAD.
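To make the AUC figure quoted in this abstract more concrete, the sketch below computes the area under the ROC curve from class labels and classifier scores using the pairwise (Mann-Whitney) formulation. The function name and the toy inputs are hypothetical; this is not the evaluation code or data of the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: 1 for tumor (positive), 0 for normal (negative).
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])` gives 0.75, because three of the four positive-negative pairs are ranked correctly. The pairwise loop is quadratic in the sample count, which is acceptable for a sketch but not for large datasets.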
Abstract: The objective of prospectivity modeling is prediction of the conditional probability of the presence T = 1 or absence T = 0 of a target T given favorable or prohibitive predictors B, or construction of a two-class {0,1} classification of T. A special case of logistic regression called weights-of-evidence (WofE) is geologists' favorite method of prospectivity modeling due to its apparent simplicity. However, the numerical simplicity is deceiving, as it is implied by the severe mathematical modeling assumption of joint conditional independence of all predictors given the target. General weights of evidence are explicitly introduced which are as simple to estimate as conventional weights, i.e., by counting, but do not require conditional independence. Complementary to the regression view is the classification view on prospectivity modeling. Boosting is the construction of a strong classifier from a set of weak classifiers. From the regression point of view it is closely related to logistic regression. Boost weights-of-evidence (BoostWofE) was introduced into prospectivity modeling to counterbalance violations of the assumption of conditional independence, even though relaxation of modeling assumptions with respect to weak classifiers was not the (initial) purpose of boosting. In the original publication of BoostWofE a fabricated dataset was used to "validate" this approach. Using the same fabricated dataset it is shown that BoostWofE cannot generally compensate for a lack of conditional independence, whatever the order in which the predictors are processed consecutively. Thus the alleged features of BoostWofE are disproved by way of counterexamples, while theoretical findings are confirmed that logistic regression including interaction terms can exactly compensate violations of joint conditional independence if the predictors are indicators.
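The conventional weights that this abstract describes as estimated "by counting" can be sketched as follows for a single indicator predictor B and target T. The sketch covers only standard weights of evidence, not the general weights or BoostWofE discussed in the abstract; the function names are hypothetical, and all four cells of the 2x2 count table are assumed to be non-empty.

```python
import math


def weights_of_evidence(b, t):
    """Return (W+, W-) for indicator predictor b and indicator target t.

    W+ = ln[P(B=1 | T=1) / P(B=1 | T=0)]
    W- = ln[P(B=0 | T=1) / P(B=0 | T=0)]
    Assumes every cell of the 2x2 table is non-empty (no smoothing applied).
    """
    n11 = sum(1 for bi, ti in zip(b, t) if bi == 1 and ti == 1)
    n10 = sum(1 for bi, ti in zip(b, t) if bi == 1 and ti == 0)
    n01 = sum(1 for bi, ti in zip(b, t) if bi == 0 and ti == 1)
    n00 = sum(1 for bi, ti in zip(b, t) if bi == 0 and ti == 0)
    w_plus = math.log((n11 / (n11 + n01)) / (n10 / (n10 + n00)))
    w_minus = math.log((n01 / (n11 + n01)) / (n00 / (n10 + n00)))
    return w_plus, w_minus


def posterior_probability(prior, weights):
    """Add the applicable weights to the prior logit and map back to a probability.

    Summing weights over several predictors is exact only under joint
    conditional independence of the predictors given the target.
    """
    logit = math.log(prior / (1 - prior)) + sum(weights)
    return 1 / (1 + math.exp(-logit))
```

With one predictor the update reproduces the empirical conditional probability exactly. It is the summation over several predictors that relies on the conditional independence assumption the abstract calls severe.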
Funding: Supported by the National Natural Science Foundation of China (42130610, 41975076, and 42175067) and the National Key Research and Development Program of China (2019YFA0607104).
Abstract: Machine learning methods are effective tools for improving short-term climate prediction. However, commonly used methods often carry out classification and regression prediction modeling separately and independently. Such a single modeling approach may yield inconsistent classification and regression results and thus may not meet the needs of practical applications well. To address this issue, this study proposes a selective Naive Bayes ensemble model (SENB-EM) by introducing causal effect and a voting strategy into Naive Bayes. The new model can not only screen effective predictors but also perform classification and regression prediction simultaneously. When the model is applied to predicting the area of the summer western North Pacific subtropical high (WNPSH) from 2008 to 2021, the accuracy classification score (a metric of overall classification prediction accuracy) and the time correlation coefficient (TCC) of SENB-EM reach 1.0 and 0.81, respectively. After integrating the results of different models [including the multiple linear regression ensemble model (MLR-EM), SENB-EM, and the Chinese Multimodel Ensemble Prediction System (CMME) used by the National Climate Center (NCC)] for 2017-2021, the TCC of the ensemble results of SENB-EM and CMME reaches 0.92 (the highest among them). This indicates that the predictions of the summer WNPSH area provided by SENB-EM have a high reference value for real-time prediction. It is worth noting that, in addition to the numerical prediction results, the SENB-EM model can also give the range of numerical prediction intervals and predictions of the anomalous degree of the WNPSH area, thus providing more reference information for meteorological forecasters. Overall, as a new hybrid machine learning model, the SENB-EM has good prediction ability; the approach of performing classification and regression prediction simultaneously through integration is informative for short-term climate prediction.
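The two evaluation metrics named in this abstract can be sketched in a few lines of Python. The TCC is taken here to be the Pearson correlation between the predicted and observed yearly series, and the accuracy classification score to be the fraction of years whose predicted class matches the observed class. Both readings, and the function names, are assumptions based on common usage rather than definitions given in the abstract.

```python
import math


def tcc(pred, obs):
    """Time correlation coefficient: Pearson correlation over the time axis."""
    n = len(pred)
    mean_p, mean_o = sum(pred) / n, sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    return cov / (sd_p * sd_o)


def accuracy(pred_classes, obs_classes):
    """Fraction of time steps where the predicted class equals the observed class."""
    hits = sum(p == o for p, o in zip(pred_classes, obs_classes))
    return hits / len(obs_classes)
```

Under these definitions, a score of 1.0 on the classification metric means every year in the evaluation period was assigned the observed class, while a TCC of 0.81 describes how closely the predicted values track the year-to-year variation of the observations.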