
Reinforcement Learning Algorithm for Partially Observable Markov Decision Processes (Cited by: 5)
Abstract: In partially observable Markov decision processes (POMDPs), perceptual aliasing can cause the memoryless policies learned by algorithms such as Sarsa to oscillate. A memory-based reinforcement learning algorithm, CPnSarsa(λ), is studied to address this problem: by redefining states, the agent combines its observation history with the current observation to distinguish aliased states. Applying CPnSarsa(λ) to several typical POMDPs yields optimal or near-optimal policies, and its convergence rate is greatly improved compared with previous algorithms.
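To make the idea in the abstract concrete, the sketch below shows one way a Sarsa(λ) learner can be given short-term memory so that aliased observations map to distinct states: the Q-table is indexed by the tuple of the last n observations rather than by the current observation alone. This is an illustrative Python sketch under assumptions of our own (environment interface, window length n, accumulating eligibility traces, ε-greedy exploration); it is not the authors' CPnSarsa(λ) implementation.

```python
# Minimal sketch (not the paper's reference implementation) of a memory-based
# Sarsa(lambda) learner: the "state" is the tuple of the last n observations,
# so perceptually aliased observations can be told apart by their history.
import random
from collections import defaultdict, deque

class HistorySarsaLambdaAgent:
    def __init__(self, actions, n=2, alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
        self.actions = actions
        self.n = n                      # length of the observation window (assumption)
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.q = defaultdict(float)     # Q[(history, action)]
        self.e = defaultdict(float)     # eligibility traces

    def reset(self, obs):
        self.history = deque([obs], maxlen=self.n)
        self.e.clear()

    def state(self):
        # The "redefined state": current observation plus its recent predecessors.
        return tuple(self.history)

    def act(self, s):
        # epsilon-greedy action selection over the history-indexed Q-table.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def step(self, s, a, reward, next_obs, done):
        # One Sarsa(lambda) update with accumulating traces.
        self.history.append(next_obs)
        s2 = self.state()
        a2 = self.act(s2)
        target = reward if done else reward + self.gamma * self.q[(s2, a2)]
        delta = target - self.q[(s, a)]
        self.e[(s, a)] += 1.0
        for key in list(self.e):
            self.q[key] += self.alpha * delta * self.e[key]
            self.e[key] = 0.0 if done else self.e[key] * self.gamma * self.lam
        return s2, a2
```

A typical episode would call reset(obs) once, then alternate act and step against an environment that returns (next_obs, reward, done); the window length n and the trace decay λ control how much history the agent can use to disambiguate aliased observations.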
Source: Control and Decision (《控制与决策》), EI / CSCD / Peking University Core Journal, 2004, No. 11, pp. 1263-1266 (4 pages)
Funding: Supported by the Key Program of the National Natural Science Foundation of China (60234030) and the Young Scientists Fund (60303012).
Keywords: reinforcement learning; partially observable Markov decision processes; Sarsa learning; memoryless policy; convergence of numerical methods; decision theory; Markov processes; optimization; state space methods

References (11)

  • 1 Tsitsiklis J N, Van Roy B. An analysis of temporal-difference learning with function approximation[J]. IEEE Trans on Automatic Control, 1997, 42(5): 674-690.
  • 2 Chrisman L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach[A]. Proc of the Tenth National Conf on Artificial Intelligence[C]. California, 1992: 183-188.
  • 3 Littman M. Memoryless policies: Theoretical limitations and practical results[A]. Proc of the Third Int Conf on Simulation of Adaptive Behavior[C]. Cambridge, 1994: 238-245.
  • 4 Singh S, Jaakkola T, Jordan M. Learning without state-estimation in partially observable Markov decision processes[A]. Proc of the Eleventh Int Conf on Machine Learning[C]. New Brunswick, 1994: 284-292.
  • 5 Loch J, Singh S. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes[A]. Proc of the Fifteenth Int Conf on Machine Learning[C]. Madison, 1998: 323-331.
  • 6 Kaelbling L, Littman M, Cassandra A. Planning and acting in partially observable stochastic domains[J]. Artificial Intelligence, 1998, 101(1): 99-134.
  • 7 Cassandra A. Exact and approximate algorithms for partially observable Markov decision processes[D]. Brown University, 1998.
  • 8 Sutton R, Barto A. Reinforcement Learning: An Introduction[M]. MIT Press, 1998.
  • 9 Singh S, Jaakkola T, Littman M, et al. Convergence results for single-step on-policy reinforcement-learning algorithms[J]. Machine Learning, 2000, 38(3): 287-308.
  • 10 Parr R, Russell S. Approximating optimal policies for partially observable stochastic domains[A]. Proc of the Int Joint Conf on Artificial Intelligence[C]. San Francisco, 1995: 1088-1094.

Co-cited literature (35)

  • 1 林龙年, Remus Osan, Shy Shoham, 金文军, 左文琪, 钱卓, 梅兵, 陈桂芬. Discovery and identification of functional units for real-time encoding of episodic experience in mouse hippocampal neural networks[J]. Journal of East China Normal University (Natural Science), 2005(Z1): 208-216. (Cited by: 2)
  • 2 俞建成, 张奇峰, 吴利红, 张艾群. Motion-adjusting mechanism design and motion performance analysis of an underwater glider robot[J]. Robot, 2005, 27(5): 390-395. (Cited by: 22)
  • 3 Shani G, Brafman R I, Shimony S E. Forward search value iteration for POMDPs[C]//International Joint Conference on Artificial Intelligence. USA: International Joint Conferences on Artificial Intelligence, 2007: 2619-2624.
  • 4 Roy N, Gordon G, Thrun S. Finding approximate POMDP solutions through belief compression[J]. Journal of Artificial Intelligence Research, 2005, 23: 1-40.
  • 5 Kurniawati H, Hsu D, Lee W S. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces[C]//Robotics: Science and Systems. Zurich, Switzerland, 2008.
  • 6 Wei J Q, Dolan J M, Snider J M, et al. A point-based MDP for robust single-lane autonomous driving behavior under uncertainties[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA: IEEE, 2011: 2586-2592.
  • 7 Theocharous G, Mahadevan S. Approximate planning with hierarchical partially observable Markov decision process models for robot navigation[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA: IEEE, 2002: 1347-1352.
  • 8 Ong S C W, Png S W, Hsu D, et al. Planning under uncertainty for robotic tasks with mixed observability[J]. International Journal of Robotics Research, 2010, 29(5): 1053-1068.
  • 9 Jockel S, Westhoff D, Zhang J W. EPIROME - A novel framework to investigate high-level episodic robot memory[C]//IEEE International Conference on Robotics and Biomimetics. Piscataway, USA: IEEE, 2007: 1075-1080.
  • 10 Endo Y. Anticipatory robot control for a partially observable environment using episodic memories[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA: IEEE, 2008: 2852-2859.

Citing literature (5)

Secondary citing literature (21)
