Abstract
Wake-up word detection (WWD) is a key technology in voice interaction, and the size of the detection window strongly affects WWD performance. This study proposes a multi-model fusion method that improves WWD performance by fusing the detection results obtained with a small and a large detection window. The method comprises two classification models, one using a small detection window and one using a large detection window. Both are built on a lightweight variant of the squeeze-and-excitation Res2Net (SE-Res2Net) module, named Ghost-SE-Res2Net; the multi-scale mechanism of the SE-Res2Net structure significantly improves WWD capability. In Ghost-SE-Res2Net, Ghost convolution first replaces the ordinary convolution in SE-Res2Net to reduce the parameter count, and an attention pooling layer then replaces the global average pooling layer in SE-Res2Net to further improve WWD capability. During detection, the maximum of the detection results from three consecutive small detection windows is fused with the detection result from one large detection window to decide whether the wake-up word has been triggered. During training, a hard sample mining algorithm is introduced so that the models selectively learn wake-up word information that is difficult to detect, which improves the detection performance of the classification models. The system is evaluated on the Mobvoi dataset, which contains two wake-up words. Experimental results show that, at 0.5 false alarms per hour, the system achieves false rejection rates of 0.46% and 0.43% on the two wake-up words, respectively. This is comparable to an advanced baseline, while the system has 31% fewer parameters than the baseline.
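The detection-time fusion described in the abstract can be sketched roughly as follows. The abstract says only that the maximum over three consecutive small-window results is fused with one large-window result; it does not say how the two values are combined. The weighted average, the `weight` and `threshold` parameters, and the function names `fuse_scores` and `is_triggered` are therefore assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def fuse_scores(small_scores: Sequence[float],
                large_score: float,
                weight: float = 0.5) -> float:
    """Fuse small-window and large-window wake-word scores.

    small_scores: scores from three consecutive small detection windows.
    large_score:  score from one large detection window.
    weight:       hypothetical mixing weight (assumption; the abstract
                  does not specify how the two results are combined).
    """
    if len(small_scores) != 3:
        raise ValueError("expected scores from exactly three small windows")
    # Step stated in the abstract: take the maximum over the small windows.
    small_max = max(small_scores)
    # Assumed fusion rule: weighted average of the two scores.
    return weight * small_max + (1.0 - weight) * large_score


def is_triggered(small_scores: Sequence[float],
                 large_score: float,
                 threshold: float = 0.5,
                 weight: float = 0.5) -> bool:
    """Return True if the fused score reaches the (hypothetical) threshold."""
    return fuse_scores(small_scores, large_score, weight) >= threshold


if __name__ == "__main__":
    print(is_triggered([0.2, 0.9, 0.4], 0.7))
    print(is_triggered([0.1, 0.2, 0.1], 0.3))
```

In practice the threshold would be tuned to hit a target operating point, such as the 0.5 false alarms per hour reported above.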
Authors
虞秋辰
周若华
袁庆升
YU Qiuchen; ZHOU Ruohua; YUAN Qingsheng (School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 102616, China; National Computer Network Emergency Response Technical Team and Coordination Center, Beijing 100029, China)
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2024, No. 3, pp. 52-59 (8 pages)
Computer Engineering
Funding
National Natural Science Foundation of China (11590774).