Text punctuation restoration for Vietnamese speech recognition with multimodal features

Abstract: The text output by Vietnamese speech recognition systems lacks punctuation; restoring it helps eliminate ambiguity and makes the text easier to read and understand. Vietnamese speech recognition output often contains erroneous syllables that destroy the semantics of the text, so punctuation restoration models based on the text modality alone predict punctuation inaccurately on such noisy text. A punctuation restoration method for Vietnamese speech recognition text using multimodal features is proposed, in which the intonation pauses and tone changes in Vietnamese speech guide the model toward correct punctuation predictions on noisy text. Specifically, Mel-Frequency Cepstral Coefficients (MFCC) are used to extract speech features, a pre-trained language model is used to extract contextual text features, and a label attention mechanism fuses the speech and text features, thereby enhancing the model's ability to learn contextual information from noisy Vietnamese text. Experimental results show that, compared with punctuation restoration models that extract only text features based on Transformer and BERT (Bidirectional Encoder Representations from Transformers), the proposed method improves precision, recall, and F1 score on a Vietnamese dataset by at least 10 percentage points, demonstrating the effectiveness of fusing speech and text features for improving punctuation prediction accuracy on noisy Vietnamese speech recognition text.

Authors: LAI Hua; SUN Tong; WANG Wenjun; YU Zhengtao; GAO Shengxiang; DONG Ling (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China)

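Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology)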

Source: Journal of Computer Applications (计算机应用), CSCD, Peking University Core Journal, 2024, No. 2, pp. 418-423 (6 pages)

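Funding: National Natural Science Foundation of China (61732005, U21B2027, 61972186); Yunnan High-Tech Industry Development Project (201606); Yunnan Provincial Major Science and Technology Special Project (202103AA080015, 202002AD080001-5); Yunnan Provincial Basic Research Program (202001AS070014); Yunnan Provincial Reserve Talents for Academic and Technical Leaders (202105AC160018).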

Keywords: speech recognition; punctuation restoration; Vietnamese; Bidirectional Encoder Representations from Transformers (BERT); multimodal

Classification number: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]