
A Parallel Actor-Learner Training Method for Online Reinforcement Learning (Cited by: 3)

PALA: Parallel Actor-Learner Architecture for Distributed Deep Reinforcement Learning
Abstract: In recent years, Deep Reinforcement Learning (DRL) has become a research hot spot in Artificial Intelligence; it has achieved great success in many complex environments and even outperforms humans in several complex games. To accelerate DRL training, researchers have proposed distributed reinforcement learning methods that improve training speed and scalability. Existing distributed reinforcement learning falls into three types: on-policy, off-policy, and the more recent near on-policy. Near on-policy methods alleviate the problems of both on-policy and off-policy training, but they are built on a shared-memory parallel model. Because of this limitation, near on-policy methods are difficult to extend to a cluster of nodes connected by a network; the resulting low scalability limits the resources they can use, increases the load on each compute node, and ultimately lengthens training time. To improve the scalability and convergence speed of near on-policy training, this paper proposes PALA (Parallel Actor-Learner Architecture), a parallel actor-learner training framework built on a message-passing parallel model that combines the Gossip algorithm with model averaging. First, using the Gossip algorithm as the communication basis, together with a global data proxy and the message-passing model, PALA provides a scalable parallel training method for a single agent that can expand to multiple network-connected nodes. Second, to stabilize training and keep the policy difference between learner and actor within a small bound, i.e., to preserve the on-policy nature of exploration and exploitation, this paper introduces a process lock that performs implicit synchronization across the nodes of a cluster; without such implicit synchronization, parallel training yields little benefit. Third, this paper proposes a serialization method for model data containing CUDA tensors, so that model data can be transmitted over the inter-node network and aggregated. Finally, model aggregation is used to further accelerate training. With these optimizations, PALA maps the load evenly onto the whole computing cluster, reduces the long waiting time caused by heavy load on individual nodes, and improves convergence speed. The improved scalability also lets PALA deploy a large number of actors and learners, which explore the environment more thoroughly and help training escape local optima. Experiments show that, compared with the previous shared-memory method, PALA reduces the time needed to reach the same performance level by more than 20% and can scale to more than 6 times as many hardware resources. In some environments, PALA reaches up to 50% higher reward than the previous method within the same training time. The final policies trained by PALA outperform other methods in the vast majority of the tested games: PALA trained for 2.5 million steps already outperforms other methods trained for 4 million steps, and at 2.5 million steps PALA outperforms the previous methods by up to 90%. We also analyze why PALA reaches higher performance than the other methods.
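Two of the mechanisms described in the abstract lend themselves to a short illustration: making CUDA-resident model data transferable across the inter-node network, and averaging models between peers in a gossip round. The sketch below is not the paper's implementation; it is a minimal example assuming PyTorch with an already-initialized torch.distributed process group (e.g., the "gloo" backend), and the names serialize_cuda_state, deserialize_state, gossip_average, and peer_rank are illustrative, not from the paper.

# A minimal sketch, assuming PyTorch and an initialized torch.distributed group.
import io
import torch
import torch.distributed as dist
import torch.nn as nn

def serialize_cuda_state(model: nn.Module) -> bytes:
    # Copy parameters to CPU before serializing so CUDA tensors can be
    # sent between nodes regardless of the receiver's GPU configuration.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    buf = io.BytesIO()
    torch.save(cpu_state, buf)
    return buf.getvalue()

def deserialize_state(payload: bytes, device: str = "cuda") -> dict:
    # Restore the serialized state dict and move it onto the local device.
    state = torch.load(io.BytesIO(payload), map_location="cpu")
    return {k: v.to(device) for k, v in state.items()}

def gossip_average(model: nn.Module, peer_rank: int) -> None:
    # One push-pull gossip round: exchange parameters with a single peer
    # and keep the element-wise average.
    for p in model.parameters():
        local = p.data.detach().cpu()
        remote = torch.empty_like(local)
        req = dist.isend(local, dst=peer_rank)  # non-blocking send avoids deadlock
        dist.recv(remote, src=peer_rank)        # blocking receive from the same peer
        req.wait()
        p.data.copy_(((local + remote) * 0.5).to(p.data.device))

In a full system of this kind, each learner would call gossip_average periodically against a chosen peer, so that parameters diffuse through the cluster without a central parameter server.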
Authors: SUN Zheng-Lun (孙正伦), QIAO Peng (乔鹏), DOU Yong (窦勇), LI Qing-Qing (李青青), LI Rong-Chun (李荣春) — Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073
Source: Chinese Journal of Computers (《计算机学报》), 2023, No. 2, pp. 229-243 (15 pages); indexed in EI, CAS, CSCD, and the Peking University Core Journal list
Funding: Supported by the National Natural Science Foundation of China (61732018, 61902415, 61972409) and the Key Laboratory Open Fund (WDZC20205500104).
Keywords: Gossip algorithm; reinforcement learning; on-policy learning; distributed reinforcement learning; parallel training