Recursive least-squares actor-critic algorithm in continuous space (Cited by: 2)
Abstract: When applied in continuous spaces, traditional actor-critic (AC) algorithms suffer from low data utilization and slow convergence, while sampling in the real world is often expensive. This paper therefore proposes a new recursive least-squares AC algorithm for continuous spaces that makes full use of the sampled data and improves learning and prediction ability. The algorithm encodes the continuous state space with Gaussian radial basis functions; the critic uses a recursive least-squares temporal-difference method with eligibility traces, and the actor uses a policy-gradient method to search in the continuous action space. Simulation results on the Mountain Car problem show that the proposed algorithm converges well.
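The abstract combines three standard components: Gaussian radial-basis-function (RBF) coding of the continuous state, a recursive least-squares TD(λ) critic, and a Gaussian policy-gradient actor. The sketch below shows how these pieces typically fit together in one update step. It is a minimal illustration assuming the standard RLS-TD(λ) recursion and a Gaussian exploration policy; the names (rbf_features, RLSTDCritic, GaussianActor) and all step-size, width, and trace parameters are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def rbf_features(state, centers, widths):
    """Gaussian radial basis encoding of a continuous state, normalized to sum to 1."""
    phi = np.exp(-np.sum((state - centers) ** 2 / (2.0 * widths ** 2), axis=1))
    return phi / phi.sum()


class RLSTDCritic:
    """Value-function critic using the standard RLS-TD(lambda) recursion."""

    def __init__(self, n_features, gamma=0.99, lam=0.7, mu=1.0, delta=10.0):
        self.gamma, self.lam, self.mu = gamma, lam, mu
        self.theta = np.zeros(n_features)      # value-function weights
        self.P = delta * np.eye(n_features)    # inverse correlation matrix estimate
        self.z = np.zeros(n_features)          # eligibility trace

    def update(self, phi, reward, phi_next):
        self.z = self.gamma * self.lam * self.z + phi             # accumulate trace
        d = phi - self.gamma * phi_next                           # feature difference
        td_error = reward + self.gamma * self.theta @ phi_next - self.theta @ phi
        k = self.P @ self.z / (self.mu + d @ self.P @ self.z)     # RLS gain vector
        self.theta = self.theta + k * td_error                    # value-weight update
        self.P = (self.P - np.outer(k, d @ self.P)) / self.mu     # recursive matrix update
        return td_error


class GaussianActor:
    """Policy-gradient actor with a Gaussian policy over a scalar continuous action."""

    def __init__(self, n_features, alpha=0.01, sigma=0.5):
        self.w = np.zeros(n_features)
        self.alpha, self.sigma = alpha, sigma

    def act(self, phi, rng):
        mean = self.w @ phi
        return rng.normal(mean, self.sigma), mean

    def update(self, phi, action, mean, td_error):
        # Gradient of log N(action | mean, sigma^2) w.r.t. the mean weights,
        # scaled by the critic's TD error (the usual actor-critic policy-gradient step).
        grad_log = (action - mean) / (self.sigma ** 2) * phi
        self.w = self.w + self.alpha * td_error * grad_log
```

On Mountain Car, for instance, the two-dimensional (position, velocity) state would be encoded by a grid of RBF centers, and the critic's TD error drives both the value update and the actor's gradient step; the grid resolution and learning rates above are illustrative only.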
Source: Application Research of Computers (《计算机应用研究》, CSCD, Peking University Core Journal), 2014, No. 7, pp. 1994-1997, 2000 (5 pages)
Fund: Supported by the National Natural Science Foundation of China (61070122, 61070223, 61373094, 60970015), the Natural Science Foundation of Jiangsu Province (BK2009116), the Natural Science Research Program of Jiangsu Higher Education Institutions (09KJA520002), and the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172012K04)
Keywords: reinforcement learning; actor-critic method; continuous state and action space; recursive least-squares; policy gradient; Gaussian radial basis functions

References (12)

  • 1 BARTO A G, SUTTON R S, ANDERSON C W. Neuronlike adaptive elements that can solve difficult learning control problems[J]. IEEE Trans on Systems, Man and Cybernetics, 1983, 13(5): 834-846.
  • 2 SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 1998.
  • 3 WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3): 229-256.
  • 4 SUTTON R S, McALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation[C]// Advances in Neural Information Processing Systems. 1999: 1057-1063.
  • 5 KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]// Advances in Neural Information Processing Systems. 1999: 1008-1014.
  • 6 DEGRIS T, PILARSKI P M, SUTTON R S. Model-free reinforcement learning with continuous action in practice[C]// Proc of American Control Conference. 2012: 2177-2182.
  • 7 DEGRIS T, WHITE M, SUTTON R S. Off-policy actor-critic[C]// Proc of the 29th International Conference on Machine Learning. 2012: 457-464.
  • 8 BRADTKE S J, BARTO A G. Linear least-squares algorithms for temporal difference learning[J]. Machine Learning, 1996, 22(1-3): 33-57.
  • 9 XU Xin, HE Han-gen, HU De-wen. Efficient reinforcement learning using recursive least-squares methods[J]. Journal of Artificial Intelligence Research, 2002, 16: 259-292.
  • 10 WANG Xue-song, CHENG Yu-hu, YI Jian-qiang. An adaptive fuzzy actor-critic learning method[J]. Control and Decision, 2006, 21(9): 1068-1072. (in Chinese)

Secondary references (7)

  • 1 QIN Bin, WU Min, WANG Xin. Hybrid chaos optimization learning algorithm for fuzzy neural network models and its application[J]. Control and Decision, 2005, 20(3): 261-265. (in Chinese)
  • 2 CREIGHTON D C, NAHAVANDI S. Optimizing discrete event simulation models using a reinforcement learning agent[C]// Proc of Winter Simulation Conference. San Diego, 2002: 1945-1950.
  • 3 STER B. An integrated learning approach to environment modeling in mobile robot navigation[J]. Neurocomputing, 2004, 57(1-4): 215-238.
  • 4 SAMEJIMA K, OMORI T. Adaptive internal state space construction method for reinforcement learning of a real-world agent[J]. Neural Networks, 1999, 12(7): 1143-1155.
  • 5 MEESAD P, YEN G G. Accuracy, comprehensibility and completeness evaluation of a fuzzy expert system[J]. Int J of Uncertainty, Fuzziness and Knowledge-Based Systems, 2003, 11(4): 445-466.
  • 6 LEE Y A, CHUNG T C. A function approximation method for Q-learning of reinforcement learning[J]. J of KISS: Software and Applications, 2004, 31(11): 1431-1438.
  • 7 LI Xiao-meng, YANG Yu-pu, XU Xiao-ming. Multi-agent AGV dispatching system based on hierarchical reinforcement learning[J]. Control and Decision, 2002, 17(3): 292-296. (in Chinese)

