Abstract
In recent years, deep reinforcement learning (DRL) has become a research hotspot in artificial intelligence. To accelerate DRL training, distributed reinforcement learning methods have been proposed to improve training speed. Existing distributed reinforcement learning falls into on-policy methods, off-policy methods, and the more recent near on-policy methods. Near on-policy methods alleviate the problems of both on-policy and off-policy methods, but because they rely on a shared-memory parallel model, they are difficult to scale to compute clusters interconnected by a network; this low scalability limits the amount of resources near on-policy methods can exploit, increases the load on each compute node, and ultimately lengthens training time. To improve the scalability of near on-policy methods and speed up convergence, this paper proposes a Parallel Actor-Learner Architecture (PALA), a message-passing training framework built on the Gossip algorithm and model averaging, which accelerates convergence by increasing the parallelism and scalability of training. First, the framework takes the Gossip algorithm as its communication basis and, with the help of a global data proxy and a message-passing model, builds a scalable training method composed of multiple parallel single-agent trainers. Second, to keep exploration and exploitation on-policy and stabilize training, this paper designs a process lock that supports implicit synchronization across machines. Third, for model data containing CUDA tensors, this paper proposes a serialization method that ensures model data can be transmitted and aggregated over the network between nodes. Finally, this paper uses a model aggregation method to accelerate training. With these optimizations and improvements, the PALA training method can map the load evenly onto the whole computing cluster, reduce the long waiting times caused by heavy load, and improve convergence speed. Experiments show that, compared with the previous shared-memory approach, agents trained with PALA reach the same performance level with more than 20% less training time; PALA also scales well, supporting more than 6 times the hardware resources of the original method. Compared with other methods, the final policies trained by PALA reach the best level in almost all test environments.
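The abstract describes gossip-based exchange and averaging of model parameters between peer learners but gives no implementation details. Below is a minimal sketch of one way such an exchange could look with PyTorch's torch.distributed point-to-point API; the directed-ring peer selection, the equal 0.5 mixing weight, and the function name gossip_average are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.distributed as dist


def gossip_average(model: torch.nn.Module, step: int) -> None:
    """Average this learner's parameters with one peer chosen for this round."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    if world_size < 2:
        return

    # Directed-ring peer selection: rotate the offset every round so parameter
    # updates eventually diffuse through the whole cluster.
    offset = (step % (world_size - 1)) + 1
    send_peer = (rank + offset) % world_size
    recv_peer = (rank - offset) % world_size

    # Flatten parameters into one CPU buffer to keep communication simple.
    local = torch.nn.utils.parameters_to_vector(model.parameters()).detach().cpu()
    remote = torch.empty_like(local)

    # Post the receive before the send so the pairwise exchange cannot deadlock.
    recv_req = dist.irecv(remote, src=recv_peer)
    send_req = dist.isend(local, dst=send_peer)
    recv_req.wait()
    send_req.wait()

    # Equal-weight mixing corresponds to plain model averaging.
    mixed = 0.5 * local + 0.5 * remote
    device = next(model.parameters()).device
    torch.nn.utils.vector_to_parameters(mixed.to(device), model.parameters())
```

In this sketch, a learner process that has already called dist.init_process_group (e.g. with the gloo backend) would invoke gossip_average(model, step) every few training iterations, so that each node only ever communicates with one peer per round instead of performing a global synchronization.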
In recent years, Deep Reinforcement Learning (DRL) has become a hot spot in the field of Artificial Intelligence; it has achieved great success in many complex environments and even outperforms humans in several complex games. To accelerate DRL training, researchers have proposed distributed reinforcement learning to improve training speed and scalability. At present, there are three types of distributed reinforcement learning: on-policy, off-policy, and near on-policy. Near on-policy methods solve the problems of on-policy and off-policy methods and improve training efficiency, but they are based on the shared-memory parallel model. Due to the limitation of the shared-memory parallel model, near on-policy methods have trouble expanding to a cluster connected by a network. This problem makes the near on-policy method hard to scale and limits the resources it can use, which ultimately leads to a higher workload on each computation peer and longer training time. In order to improve the scalability and speed up the convergence of near on-policy methods, we propose a Parallel Actor-Learner Architecture (PALA). The parallel model of this architecture is based on the message-passing parallel model, using the Gossip algorithm and a model averaging method. First, with the help of a data proxy and a message-passing parallel model organized in the Gossip style, we propose a scalable parallel training architecture for single-agent training. This architecture can expand to multiple nodes connected by a network. Second, in order to stabilize training and keep the policy difference between learner and actor within a small bound, we propose a process lock. This lock can be used among multiple nodes in a cluster for implicit synchronization; without implicit synchronization, parallel training would have little effect. Third, this paper proposes a serialization method for model data containing CUDA tensors to ensure that model data can be transmitted and aggregated through the network between nodes. Finally, this paper uses a model aggregation method to speed up training. Based on the above optimizations and improvements, the PALA training method can map the load onto the entire computing cluster in a balanced manner, reduce the long waiting time caused by high load, and improve the convergence speed. By improving scalability, PALA deploys a large number of actors and learners, which can fully explore the environment and escape local optima. The experiments show that, compared with the former method that uses shared memory, PALA can accelerate training by more than 20%, and it can also scale to more than 6 times the computing resources. In some environments, PALA even outperforms the former method by up to 50% in reward value with the same training time. The final policies trained by PALA outperform other methods in the vast majority of games. Experimental results show that PALA trained for 2.5 million steps already outperforms other methods trained for 4 million steps; at 2.5 million steps, PALA outperforms former methods by up to 90%. We also analyze why PALA reaches higher performance than other methods.
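The abstract also mentions a serialization method for model data containing CUDA tensors so that it can travel over the network between nodes. The paper's actual scheme is not reproduced here; the following is a minimal sketch of the standard approach of copying CUDA tensors to host memory before serializing, with the helper names state_dict_to_bytes and bytes_to_state_dict being illustrative assumptions.

```python
import io
import torch


def state_dict_to_bytes(model: torch.nn.Module) -> bytes:
    # CUDA tensors are tied to a local device, so copy every tensor to host
    # memory first; the resulting byte payload is device-independent.
    cpu_state = {name: tensor.detach().cpu()
                 for name, tensor in model.state_dict().items()}
    buffer = io.BytesIO()
    torch.save(cpu_state, buffer)
    return buffer.getvalue()


def bytes_to_state_dict(payload: bytes, device: torch.device) -> dict:
    # Load onto CPU first, then move the tensors to the receiver's own GPU.
    cpu_state = torch.load(io.BytesIO(payload), map_location="cpu")
    return {name: tensor.to(device) for name, tensor in cpu_state.items()}
```

The byte payload can then be carried by whatever transport the cluster uses (sockets, a message queue, or torch.distributed point-to-point calls), and the receiving node restores it with model.load_state_dict(bytes_to_state_dict(payload, device)) before aggregation.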
Authors
孙正伦
乔鹏
窦勇
李青青
李荣春
SUN Zheng-Lun; QIAO Peng; DOU Yong; LI Qing-Qing; LI Rong-Chun (Department of Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073)
Source
《计算机学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2023, No. 2, pp. 229-243 (15 pages)
Chinese Journal of Computers
Funding
Supported by the National Natural Science Foundation of China (61732018, 61902415, 61972409) and the Key Laboratory Open Fund (WDZC20205500104).