Abstract
Reinforcement learning has been widely applied across industries. However, the extensive exploration it requires can have serious consequences in many contexts, which has motivated safe reinforcement learning. Utility functions, borrowed from economics, are a commonly used technique in safe reinforcement learning, but their use in search algorithms has not been thoroughly studied. How to use a search algorithm to transfer an agent trained in a risk-neutral context to a risk-sensitive context remains an open problem. To address this, the Expectimax search algorithm is improved by incorporating the utility function technique commonly used in safe reinforcement learning. Taking the game 2048 as an example, different hyperparameters were tested on top of weights trained in a risk-neutral context with the latest NTuple method as of 2022. The results show that applying the sigmoid function as the utility function in the search algorithm significantly improves performance: compared with the original algorithm, the mean score increased by 14053.35 points for 2-level search and by 6522.9 points for 3-level search.
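The idea of applying a utility function inside Expectimax can be illustrated with a minimal sketch. This is not the authors' implementation: the tree classes (`Leaf`, `Chance`, `Max`), the function names, the choice to apply the utility at the leaf values, and the `center` and `scale` parameters of the sigmoid are all assumptions made for illustration, and the toy numbers are unrelated to the paper's experiments.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Leaf:
    value: float  # risk-neutral value estimate (hypothetical stand-in for a learned evaluator)


@dataclass
class Chance:
    outcomes: List[Tuple[float, "Node"]]  # (probability, child) pairs


@dataclass
class Max:
    actions: List["Node"]  # one child per available action


Node = Union[Leaf, Chance, Max]


def sigmoid_utility(v: float, center: float, scale: float) -> float:
    """Sigmoid utility; center and scale are illustrative hyperparameters."""
    return 1.0 / (1.0 + math.exp(-(v - center) / scale))


def expectimax(node: Node, utility: Callable[[float], float]) -> float:
    """Expectimax where leaf values are passed through a utility function."""
    if isinstance(node, Leaf):
        return utility(node.value)
    if isinstance(node, Chance):
        # Chance node: probability-weighted average of child utilities.
        return sum(p * expectimax(child, utility) for p, child in node.outcomes)
    # Max node: the agent picks the action with the highest expected utility.
    return max(expectimax(child, utility) for child in node.actions)


def best_action(root: Max, utility: Callable[[float], float]) -> int:
    """Index of the action with the highest expected utility at the root."""
    scores = [expectimax(child, utility) for child in root.actions]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy tree: action 0 is safe (45 for sure), action 1 is a 50/50 gamble on 0 or 100.
root = Max(actions=[
    Chance(outcomes=[(1.0, Leaf(45.0))]),
    Chance(outcomes=[(0.5, Leaf(0.0)), (0.5, Leaf(100.0))]),
])

risk_neutral_choice = best_action(root, lambda v: v)
risk_sensitive_choice = best_action(root, lambda v: sigmoid_utility(v, 20.0, 10.0))
```

In this toy tree the identity utility prefers the gamble because its mean (50) exceeds 45, while the sigmoid utility with the assumed parameters prefers the certain outcome. The evaluator weights stay unchanged; only the way the search aggregates leaf values differs.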
Authors
WEI Yuxuan (魏语轩)
LI Xinwen (李昕闻)
CHEN Xingguo (陈兴国)
WEI Yuxuan; LI Xinwen; CHEN Xingguo (Bell Honors School, Nanjing University of Posts and Telecommunications; School of Computer Science, Nanjing University of Posts and Telecommunications; Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China)
Source
Software Guide (《软件导刊》)
2023, No. 8, pp. 86-92 (7 pages in total)
Funding
National Natural Science Foundation of China (62276142, 61806096, 61872190).