Abstract
Dialogue generation is a key research direction in natural language processing, and generative adversarial nets (GAN) have recently been applied to it with good results. To further improve the quality of generated dialogue, and to address the low training efficiency caused by the poor reuse of the rewards returned by the discriminative model during GAN training, this paper proposes PPO_GAN, a dialogue generation algorithm based on proximal policy optimization (PPO). The algorithm generates dialogue with the GAN's generative model and distinguishes generated dialogue from real dialogue with its discriminative model. The GAN is trained by proximal policy optimization, which handles the non-differentiable backpropagation that arises when the GAN generates dialogue. While guaranteeing monotonically non-decreasing training of the generative model, it constrains the gradient of generator updates so that the rewards obtained from the discriminative model can be reused. Experimental results show that, compared with dialogue generation algorithms such as maximum likelihood estimation and Adver-REGS, PPO_GAN improves both the efficiency of dialogue training and the quality of the generated dialogue.
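The reward-reuse mechanism the abstract describes matches the standard PPO clipped surrogate objective (Schulman et al., 2017). Below is a minimal PyTorch sketch of that objective, not the paper's actual implementation; all names are hypothetical, and using the discriminator's score directly as the advantage is an assumption:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, reward, eps=0.2):
    """Standard PPO clipped surrogate loss.

    logp_new: log-prob of sampled replies under the current generator
    logp_old: log-prob under the (frozen) generator that drew the samples
    reward:   discriminator score per reply, standing in for the advantage
              (an assumption; the paper may shape the reward differently)
    """
    # Probability ratio between the current and the sampling-time generator.
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * reward
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * reward
    # Taking the elementwise minimum bounds how far one update can move
    # the generator, which is what lets the same discriminator rewards be
    # reused over several optimization epochs.
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # Toy example: per-reply log-probs and discriminator rewards.
    logp_old = torch.tensor([-1.6, -0.7, -2.3])
    logp_new = torch.tensor([-1.4, -0.8, -2.0], requires_grad=True)
    reward = torch.tensor([0.9, 0.4, 0.7])
    loss = ppo_clip_loss(logp_new, logp_old, reward)
    loss.backward()
    print(loss.item())
```

Because the ratio is clipped, one batch of sampled replies and rewards can be optimized over for several epochs before fresh samples are needed, which is the reward-reuse efficiency gain the abstract contrasts with Adver-REGS-style single-use rewards.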
Authors
CAI Yue; YOU Jin-guo; DING Jia-man (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Computer Technology Application Key Laboratory of Yunnan Province, Kunming 650500, China)
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core (北大核心)
2020, No. 9, pp. 1680-1689 (10 pages)
Funding
National Natural Science Foundation of China (61462050, 61562054)
Natural Science Foundation of Yunnan Province (KKSY201603016)
Keywords
dialog generation
proximal policy optimization (PPO)
reinforcement learning
generative adversarial nets (GAN)
sequence-to-sequence model