基于LSTM–PPO算法的多機空戰智能決策及目標分配

丁云龍; 匡敏馳; 朱紀洪; 祝靖宇; 喬直

doi:10.13374/j.issn2095-9389.2023.10.13.003

Intelligent decision making and target assignment of multi-aircraft air combat based on the LSTM–PPO algorithm

Abstract

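Summary: Conventional multi-aircraft air combat suffers from inefficient intelligent decision-making, difficulty meeting the demands of complex air combat environments, and unreasonable target assignment. To address these problems, this paper proposes a reinforcement-learning-based method for intelligent decision-making and target assignment in multi-aircraft air combat. A long short-term memory (LSTM) network performs feature extraction and situation awareness on the state; the normalized, feature-fused state information is used to train a residual network and a value network; and the agent selects the optimal action for the current situation through proximal policy optimization (PPO). Threat assessment indicators serve as the basis for assignment: a comprehensive threat degree is computed, and the aircraft with the highest threat value is given priority as the attack target. To verify the effectiveness of the algorithm, 4v4 multi-aircraft air combat experiments were carried out in a digital twin simulation environment built by the research group, and the algorithm was compared with other mainstream reinforcement learning algorithms in the same experimental environment. The results show that the LSTM–PPO algorithm achieves a markedly higher win rate in multi-aircraft air combat than the other mainstream reinforcement learning algorithms, which verifies its effectiveness.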
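For orientation, the action-selection step relies on proximal policy optimization. The abstract does not restate the objective, so the block below gives the standard clipped surrogate form of PPO as a reminder. The symbols (the probability ratio $r_t$, the advantage estimate $\hat{A}_t$, the clipping range $\epsilon$, and the LSTM hidden state $h_t$ standing in for the recurrent feature extractor) are notation assumed here, not taken from the paper, and the authors' exact loss may include further terms.

```latex
% Standard PPO clipped surrogate objective (notation assumed, not from the paper).
% s_t: normalized, feature-fused state; h_t: hidden state of the LSTM feature extractor.
\begin{align}
  r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t, h_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t, h_t)}, \\
  L^{\mathrm{CLIP}}(\theta) &= \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right].
\end{align}
```

Clipping the ratio to $[1-\epsilon,\, 1+\epsilon]$ keeps each policy update close to the policy that collected the data, which is the property that makes PPO comparatively stable to train.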

Abstract: With the rapid development of intelligent, information-based air battlefields, intelligent air combat has become increasingly decisive for battlefield outcomes. Conventional multi-aircraft air combat suffers from inefficient intelligent decision-making, difficulty meeting the needs of complex air combat environments, and unreasonable target allocation. To address these problems, we introduce a long short-term memory–proximal policy optimization (LSTM–PPO) algorithm. A long short-term memory network extracts features from the state and perceives the situation; the normalized, feature-fused state information is used to train a residual network and a value network; and the agent chooses the optimal action for the current situation through proximal policy optimization. A reward function containing expert knowledge is embedded in the training process to solve the problem of sparse rewards. We also present a target allocation algorithm based on threat value calculation. Using angle, speed, and height threat values as the basis for allocation, the algorithm computes in real time the ID of the target aircraft with the highest threat value on the battlefield, and it performs target allocation whenever the strategy network outputs an attack action. To confirm the effectiveness of the algorithm, we carried out 4v4 multi-aircraft air combat experiments in a digital twin simulation environment built by our research group. The red team consists of reinforcement learning agents based on the LSTM–PPO algorithm, whereas the blue team is driven by a finite state machine built from an expert knowledge base. After more than 1200 rounds of aerial confrontation, the algorithm converged, and the win rate of the red team reached 82%.
Furthermore, we assessed the performance of four other mainstream reinforcement learning algorithms in 4v4 air combat experiments under the same experimental conditions. The results show that the deep Q-network (DQN) and soft actor-critic (SAC) algorithms have difficulty dealing with high-dimensional continuous action spaces and multi-agent collaboration. The multi-agent deep deterministic policy gradient (MADDPG) algorithm employs a multi-agent strategy and cooperative training, so it achieves a significantly higher win rate than the DQN and SAC algorithms. The multi-agent proximal policy optimization (MAPPO) algorithm has a relatively high failure rate and, in some cases, is not stable enough to cope with the enemy aircraft's strategies. The LSTM–PPO algorithm achieves a significantly higher win rate than the other mainstream reinforcement learning algorithms in multi-aircraft collaborative air combat, which confirms its effectiveness in dealing with high-dimensional continuous action spaces and multi-aircraft collaborative operations.
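The threat-based target allocation described in the abstract can be illustrated with a minimal sketch. The abstract names only the three ingredients (angle, speed, and height threat values) and the rule of picking the aircraft with the highest threat value when the policy outputs an attack action. Everything else below is an assumption made for illustration: the `Aircraft` fields, the individual threat formulas, the weights in `WEIGHTS`, the altitude scale, and the `"attack"` action label are hypothetical and are not the authors' definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Aircraft:
    ident: int       # aircraft ID
    x: float         # horizontal position, m
    y: float         # horizontal position, m
    altitude: float  # m
    speed: float     # m/s
    heading: float   # rad, measured from the +x axis


# Hypothetical weights for the angle, speed, and height terms (sum to 1).
WEIGHTS = (0.4, 0.3, 0.3)
ALTITUDE_SCALE = 2000.0  # m, hypothetical scale for the height term


def angle_threat(own: Aircraft, enemy: Aircraft) -> float:
    """1.0 if the enemy's nose points straight at us, 0.0 if directly away."""
    bearing = math.atan2(own.y - enemy.y, own.x - enemy.x)  # enemy -> own
    # Wrap the heading error into [-pi, pi) and take its magnitude.
    off = abs((enemy.heading - bearing + math.pi) % (2 * math.pi) - math.pi)
    return 1.0 - off / math.pi


def speed_threat(own: Aircraft, enemy: Aircraft) -> float:
    """Share of the combined speed held by the enemy (0.5 when speeds are equal)."""
    total = own.speed + enemy.speed
    return enemy.speed / total if total > 0 else 0.5


def height_threat(own: Aircraft, enemy: Aircraft) -> float:
    """Smoothly increases toward 1.0 as the enemy's altitude advantage grows."""
    return 0.5 + 0.5 * math.tanh((enemy.altitude - own.altitude) / ALTITUDE_SCALE)


def threat(own: Aircraft, enemy: Aircraft) -> float:
    """Weighted sum of the three threat terms; stays within [0, 1]."""
    wa, ws, wh = WEIGHTS
    return (wa * angle_threat(own, enemy)
            + ws * speed_threat(own, enemy)
            + wh * height_threat(own, enemy))


def allocate_target(action: str, own: Aircraft, enemies: list[Aircraft]):
    """Return the ID of the most threatening enemy, but only for an attack action."""
    if action != "attack" or not enemies:
        return None
    return max(enemies, key=lambda e: threat(own, e)).ident
```

As a usage example, an enemy that is pointing at the own aircraft, flying higher, and flying faster should be selected over one that is flying away, lower, and slower:

```python
own = Aircraft(0, 0.0, 0.0, 5000.0, 250.0, 0.0)
closing = Aircraft(1, 10000.0, 0.0, 6000.0, 300.0, math.pi)
fleeing = Aircraft(2, 10000.0, 0.0, 4000.0, 200.0, 0.0)
target_id = allocate_target("attack", own, [closing, fleeing])
```

Each term is normalized to [0, 1] so that the weights alone control the relative importance of the three factors.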
