To address the issues of deep learning algorithms for unmanned surface vehicles (USVs) being prone to divergence, and the difficulty reinforcement learning algorithms have in handling continuous action and state spaces, an improved Deep Deterministic Policy Gradient (DDPG) algorithm, termed APF-DDPG, is proposed. The algorithm enhances the composite reward function by incorporating artificial potential fields, optimizes the experience replay scheme and random sampling strategy, introduces Gaussian noise to improve exploration, and modifies the neural network model. A simulation environment is also established to evaluate the exploration strategies of USVs. The results validate the feasibility and performance of APF-DDPG, demonstrating its effectiveness and superiority for path planning in complex environments.
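Two of the abstract's core ideas can be made concrete with a minimal sketch: an APF-shaped composite reward (attractive potential toward the goal, repulsive potential near obstacles) and Gaussian noise added to the deterministic policy's action for exploration. All function names, coefficients, and ranges below are illustrative assumptions, not taken from the paper.

```python
import math
import random

def apf_reward(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=5.0):
    """Hypothetical composite reward shaped by an artificial potential field.

    The attractive term grows (less negative) as the vehicle approaches the
    goal; the repulsive term penalizes proximity to any obstacle within the
    influence radius d0. Coefficients k_att, k_rep, d0 are placeholders.
    """
    reward = -k_att * math.dist(pos, goal)  # attractive potential
    for obs in obstacles:
        d = math.dist(pos, obs)
        if d < d0:  # repulsion only acts inside the influence radius
            reward -= k_rep * (1.0 / d - 1.0 / d0) ** 2
    return reward

def explore_action(policy_action, sigma=0.1, lo=-1.0, hi=1.0):
    """Add zero-mean Gaussian exploration noise, clipped to the action range."""
    return [max(lo, min(hi, a + random.gauss(0.0, sigma))) for a in policy_action]
```

In a DDPG training loop, `explore_action` would wrap the actor network's output at each step, and `apf_reward` would replace a sparse goal-only reward so the critic receives a denser learning signal.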
2026, 48(8): 128-134  Received: 2025-08-04
DOI:10.3404/j.issn.1672-7649.2026.08.020
CLC number: U674.91
Funding: National Natural Science Foundation of China (61403250); Natural Science Foundation of Shanghai Municipal Science and Technology Commission, General Program (21ZR1426600)
Author: HUANG Zhijian (b. 1979), male, Ph.D., associate professor; research interests: intelligent control and computation