
Reinforcement Learning Algorithm for Partially Observable Markov Decision Processes (Cited by: 5)
Abstract: In partially observable Markov decision processes (POMDPs), perceptual aliasing can cause the memoryless policies learned by algorithms such as Sarsa to oscillate. A memory-based reinforcement learning algorithm, CPnSarsa(λ), is studied to address this problem: by redefining states, the agent combines its observation history with the current observation to distinguish aliased states. Applying CPnSarsa(λ) to several typical POMDPs yields optimal or near-optimal policies, and its convergence rate is greatly improved compared with previous algorithms.
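To make the idea in the abstract concrete, the sketch below shows one way a Sarsa(λ) learner can be given short-term memory so that aliased observations map to distinct states: the Q-table is indexed by the tuple of the last n observations rather than by the current observation alone. This is an illustrative Python sketch under assumptions of our own (environment interface, window length n, accumulating eligibility traces, ε-greedy exploration); it is not the authors' CPnSarsa(λ) implementation.

```python
# Minimal sketch (not the paper's reference implementation) of a memory-based
# Sarsa(lambda) learner: the "state" is the tuple of the last n observations,
# so perceptually aliased observations can be told apart by their history.
import random
from collections import defaultdict, deque

class HistorySarsaLambdaAgent:
    def __init__(self, actions, n=2, alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
        self.actions = actions
        self.n = n                      # length of the observation window (assumption)
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.q = defaultdict(float)     # Q[(history, action)]
        self.e = defaultdict(float)     # eligibility traces

    def reset(self, obs):
        self.history = deque([obs], maxlen=self.n)
        self.e.clear()

    def state(self):
        # The "redefined state": current observation plus its recent predecessors.
        return tuple(self.history)

    def act(self, s):
        # epsilon-greedy action selection over the history-indexed Q-table.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def step(self, s, a, reward, next_obs, done):
        # One Sarsa(lambda) update with accumulating traces.
        self.history.append(next_obs)
        s2 = self.state()
        a2 = self.act(s2)
        target = reward if done else reward + self.gamma * self.q[(s2, a2)]
        delta = target - self.q[(s, a)]
        self.e[(s, a)] += 1.0
        for key in list(self.e):
            self.q[key] += self.alpha * delta * self.e[key]
            self.e[key] = 0.0 if done else self.e[key] * self.gamma * self.lam
        return s2, a2
```

A typical episode would call reset(obs) once, then alternate act and step against an environment that returns (next_obs, reward, done); the window length n and the trace decay λ control how much history the agent can use to disambiguate aliased observations.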
Source: Control and Decision (《控制与决策》), EI / CSCD / Peking University Core Journal, 2004, No. 11, pp. 1263-1266 (4 pages)
Funding: Supported by the Key Program of the National Natural Science Foundation of China (60234030) and the Young Scientists Fund (60303012).
Keywords: reinforcement learning; partially observable Markov decision processes; Sarsa learning; memoryless policy; convergence of numerical methods; decision theory; Markov processes; optimization; state space methods

References (11)

  • 1 Tsitsiklis J N, Van Roy B. An analysis of temporal-difference learning with function approximation[J]. IEEE Trans on Automatic Control, 1997, 42(5): 674-690.
  • 2 Chrisman L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach[A]. Proc of the Tenth National Conf on Artificial Intelligence[C]. California, 1992: 183-188.
  • 3 Littman M. Memoryless policies: Theoretical limitations and practical results[A]. Proc of the Third Int Conf on Simulation of Adaptive Behavior[C]. Cambridge, 1994: 238-245.
  • 4 Singh S, Jaakkola T, Jordan M. Learning without state-estimation in partially observable Markov decision processes[A]. Proc of the Eleventh Int Conf on Machine Learning[C]. New Brunswick, 1994: 284-292.
  • 5 Loch J, Singh S. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes[A]. Proc of the Fifteenth Int Conf on Machine Learning[C]. Madison, 1998: 323-331.
  • 6 Kaelbling L, Littman M, Cassandra A. Planning and acting in partially observable stochastic domains[J]. Artificial Intelligence, 1998, 101(1): 99-134.
  • 7 Cassandra A. Exact and approximate algorithms for partially observable Markov decision processes[D]. Brown University, 1998.
  • 8 Sutton R, Barto A. Reinforcement Learning: An Introduction[M]. MIT Press, 1998.
  • 9 Singh S, Jaakkola T, Littman M, et al. Convergence results for single-step on-policy reinforcement-learning algorithms[J]. Machine Learning, 2000, 38(3): 287-308.
  • 10 Parr R, Russell S. Approximating optimal policies for partially observable stochastic domains[A]. Proc of the Int Joint Conf on Artificial Intelligence[C]. San Francisco, 1995: 1088-1094.

Co-cited literature (35)

  • 1 林龙年, Remus Osan, Shy Shoham, 金文军, 左文琪, 钱卓, 梅兵, 陈桂芬. Discovery and identification of functional units for real-time encoding of episodic experience in mouse hippocampal neural networks[J]. Journal of East China Normal University (Natural Science), 2005(Z1): 208-216. (Cited by: 2)
  • 2 俞建成, 张奇峰, 吴利红, 张艾群. Motion-adjusting mechanism design and motion performance analysis of an underwater glider robot[J]. Robot, 2005, 27(5): 390-395. (Cited by: 22)
  • 3 Shani G, Brafman R I, Shimony S E. Forward search value iteration for POMDPs[C]//International Joint Conference on Artificial Intelligence. USA: International Joint Conferences on Artificial Intelligence, 2007: 2619-2624.
  • 4 Roy N, Gordon G, Thrun S. Finding approximate POMDP solutions through belief compression[J]. Journal of Artificial Intelligence Research, 2005, 23: 1-40.
  • 5 Kurniawati H, Hsu D, Lee W S. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces[C]//Robotics: Science and Systems. Zurich, Switzerland, 2008.
  • 6 Wei J Q, Dolan J M, Snider J M, et al. A point-based MDP for robust single-lane autonomous driving behavior under uncertainties[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA: IEEE, 2011: 2586-2592.
  • 7 Theocharous G, Mahadevan S. Approximate planning with hierarchical partially observable Markov decision process models for robot navigation[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA: IEEE, 2002: 1347-1352.
  • 8 Ong S C W, Png S W, Hsu D, et al. Planning under uncertainty for robotic tasks with mixed observability[J]. International Journal of Robotics Research, 2010, 29(5): 1053-1068.
  • 9 Jockel S, Westhoff D, Zhang J W. EPIROME - A novel framework to investigate high-level episodic robot memory[C]//IEEE International Conference on Robotics and Biomimetics. Piscataway, USA: IEEE, 2007: 1075-1080.
  • 10 Endo Y. Anticipatory robot control for a partially observable environment using episodic memories[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA: IEEE, 2008: 2852-2859.

Citing literature (5)

Secondary citing literature (21)
