Abstract
Non-destructive quality detection of agricultural products and food by spectroscopy essentially amounts to building a machine learning model that relates the spectral information of a sample to its quality parameters. A model with good generalization performance usually requires a large number of labeled samples. Spectral information is relatively easy to acquire, but labeling the quality parameters of samples often involves considerable time and economic cost and is destructive. Active learning reduces the number of labeled samples needed in the training set by selecting the most valuable samples for labeling rather than choosing them at random. Active learning can therefore control which samples are added to the training set, so the model no longer passively accepts whatever samples it is given. Many active learning algorithms have been proposed for classification tasks, but comparatively few for regression tasks, and most existing active learning algorithms for regression are supervised, that is, they need a small number of labeled samples to train an initial model. This paper proposes a training sample selection strategy based on unsupervised active learning. The method first applies hierarchical agglomerative clustering to the unlabeled spectral dataset (samples without reference values) to partition it for diversity into clusters. A locally linear reconstruction algorithm then selects the most representative samples from each cluster to form the training set, on which a partial least squares regression model is built to predict the unlabeled samples. To verify the method, partial least squares prediction models for soluble solids content and firmness were built from near-infrared spectral data of three apple varieties collected in two years. The experimental results show that the proposed method outperforms existing sample selection strategies, effectively improving model accuracy and reducing the destructive physical and chemical experiments needed for model training. Compared with random sampling (RS), the Kennard-Stone algorithm (KS), and sample set partitioning based on joint x-y distances (SPXY), three sample selection algorithms commonly used in spectroscopy, the proposed method achieved the best performance: with 200 samples selected as the training set, the root mean square error of prediction of the soluble solids content model was 2.0% to 13.2% lower than with the other three algorithms, and that of the firmness model was 1.2% to 15.7% lower.
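The two-stage selection described in the abstract (cluster the unlabeled spectra, then pick representatives from each cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the linkage criterion, the number of clusters, or the details of the locally linear reconstruction step, so the sketch assumes average linkage, one representative per cluster, and uses the cluster medoid as a simple stand-in for locally linear reconstruction. All function names are hypothetical, and the partial least squares modeling step is omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_clusters(spectra: List[Vector], n_clusters: int) -> List[List[int]]:
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per sample and repeatedly merges the two
    closest clusters until n_clusters remain. Returns lists of indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(spectra))]
    while len(clusters) > n_clusters:
        best = None  # (average distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    euclidean(spectra[i], spectra[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def medoid(spectra: List[Vector], members: List[int]) -> int:
    """Member with the smallest total distance to the rest of its cluster.

    Stand-in for the locally linear reconstruction step, NOT that algorithm.
    """
    return min(
        members,
        key=lambda i: sum(euclidean(spectra[i], spectra[j]) for j in members),
    )


def select_training_indices(spectra: List[Vector], n_select: int) -> List[int]:
    """Pick n_select samples to label: one medoid from each of n_select clusters."""
    clusters = agglomerative_clusters(spectra, n_select)
    return sorted(medoid(spectra, members) for members in clusters)


# Toy two-band "spectra" forming three well-separated groups (made-up values).
toy_spectra = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],
    [10.0, 0.0], [10.1, 0.1], [10.05, 0.05],
]
print(select_training_indices(toy_spectra, 3))
```

Only the samples at the returned indices would then be sent for destructive reference measurement and used to fit the regression model. The naive clustering loop is kept for readability; for real spectral datasets a library implementation of agglomerative clustering would be the practical choice.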
Authors
赵小康
赵鑫
朱启兵
黄敏
ZHAO Xiao-kang; ZHAO Xin; ZHU Qi-bing; HUANG Min (Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China)
Source
《光谱学与光谱分析》
SCIE
EI
CAS
CSCD
PKU Core (北大核心)
2022, Issue 1, pp. 282-291 (10 pages)
Spectroscopy and Spectral Analysis
Funding
Supported by the National Natural Science Foundation of China (61772240, 61775086).
Keywords
Spectroscopy
Quality detection
Active learning
Training sample selection