
    Task assignment in ground-to-air confrontation based on multiagent deep reinforcement learning

Defence Technology, 2023, Issue 1

Jia-yi Liu, Gang Wang, Qiang Fu, Shao-hua Yue, Si-yuan Wang

Air and Missile Defense College, Air Force Engineering University, Xi'an, 710051, China

Keywords: Ground-to-air confrontation; Task assignment; General and narrow agents; Deep reinforcement learning; Proximal policy optimization (PPO)

ABSTRACT The scale of ground-to-air confrontation task assignment is large, and the problem involves many concurrent task assignments and random events. Existing task assignment methods applied to ground-to-air confrontation suffer from low efficiency when dealing with complex tasks and from interaction conflicts in multiagent systems. This study proposes a multiagent architecture based on one general agent with multiple narrow agents (OGMN) to reduce task assignment conflicts. Considering the slow speed of traditional dynamic task assignment algorithms, this paper proposes the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm. Building on the idea of the optimal assignment strategy algorithm and the training framework of deep reinforcement learning (DRL), the algorithm adds a multihead attention mechanism and a stage reward mechanism to the bilateral band clipping PPO algorithm to address low training efficiency. Finally, simulation experiments are carried out on the digital battlefield. The multiagent architecture based on OGMN combined with the PPO-TAGNA algorithm obtains higher rewards faster and achieves a higher win ratio. An analysis of agent behavior verifies the efficiency, superiority and rationality of resource utilization of this method.

    1.Introduction

The core feature of the modern battlefield is a strong confrontation game. Relying on human judgment and decision-making cannot meet the requirements of fast-paced and high-intensity confrontation, and relying on traditional analytical models cannot meet the requirements of complex and changeable scenes. For future operations, intelligence is considered the key to solving the strong confrontation game and turning an information advantage into a decision advantage [1]. Task assignment is a key problem in modern air defense operations. Its main purpose is to allocate each meta-task to the appropriate elements to achieve target interception [2] and to obtain the maximum resource efficiency ratio. The key to solving this problem is establishing a task assignment model and using an assignment algorithm. Weapon target assignment (WTA) is an important problem to be solved in air defense interception task assignment [3]. The optimal assignment strategy algorithm is one of the methods for solving the WTA problem. It takes the Markov decision process (MDP) as its main idea and assumes that new targets appear randomly with a certain probability during the assignment process. Hang et al. [4] argue that dynamic weapon target assignment (DWTA) can be divided into two stages, strategy optimization and matching optimization, and that Markov dynamics can be used to solve DWTA. On this basis, Chen et al. [5] improved the hybrid iterative method based on the policy iteration and value iteration methods in Markov process strategy optimization to solve large-scale WTA problems. He et al. [6] transformed the task assignment problem into a staged decision-making process through an MDP. This method works well for small-scale optimization problems.

Although the optimal assignment strategy algorithm continues to improve, its speed for solving large-scale WTA problems is still somewhat insufficient. In addition, the WTA method based on the MDP cannot immediately deal with emergencies because it must wait for the previous stage of processing to end. This paper combines the optimal assignment strategy algorithm for WTA with a DRL method, applying a neural network to the MDP-based WTA method to address the slow solution speed and the inability to handle emergencies.

This research aims to apply a multiagent system and DRL to the task assignment problem of ground-to-air confrontation. Because of its autonomy and unpredictability, a multiagent system offers strong solution ability, fast convergence and strong robustness when dealing with complex problems [7]. The QMIX algorithm can train decentralised policies in a centralized end-to-end fashion and performs well in StarCraft II micromanagement tasks [8]. Cao et al. [9] proposed the LINDA framework to achieve better collaboration among agents. Zhang proposed a PGP algorithm using the policy gradient potential as the information source for guiding strategy updates [10] and a learning automata-based algorithm known as LA-OCA [11]. Both algorithms perform well in common cooperative tasks and can significantly improve the learning speed of agents. However, because each agent in a multiagent system is self-interested and the capacity scheduling problem of a multiagent system is complex [12], interaction conflicts easily arise when dealing with the task assignment of large-scale ground-to-air confrontation. The multiagent architecture based on OGMN proposed in this paper reduces system complexity and mitigates the tendency of multiagent systems toward interaction conflicts when dealing with complex problems. Task assignment is a typical sequential decision-making process in an incomplete information game. DRL provides a new and efficient method to solve incomplete information games [13]; it relies less on external guidance information, requires no accurate mathematical model of the environment and tasks, and has the characteristics of fast reactivity and high adaptability [14]. However, in the ground-to-air confrontation scenario there is a large quantity of data, and it is difficult to convert the combat task objective directly into a reasonable reward function, so decision-making faces sparse, delayed and inaccurate feedback, resulting in low training efficiency. According to the characteristics of large-scale ground-to-air confrontation, this paper proposes the PPO-TAGNA algorithm, which effectively improves training efficiency and stability through a multihead attention mechanism and a stage reward mechanism. Finally, experiments on the digital battlefield verify the feasibility and superiority of the multiagent architecture based on OGMN and the PPO-TAGNA algorithm for solving the task assignment problem of ground-to-air confrontation.

    2.Background

    2.1.Multiobjective optimization method

Since there may be conflicts or constraints between the various objectives, a multiobjective optimization problem has no unique solution but rather an optimal solution set. The main methods for solving the multiobjective problem in a multiagent system are as follows [15]:

    (1) Linear weighted sum method

The difficulty of solving this problem lies in how to allocate the weights. The scalarized objective is as follows:

F(x) = Σ_{i=1}^{n} ω_i f_i(x),  Σ_{i=1}^{n} ω_i = 1

where f_i(x) is one of the multiobjective functions and ω_i is its weight. The advantage of this algorithm is that it is easy to calculate, but the disadvantage is that the weight parameters are uncertain, and whether the weights are set appropriately directly determines the quality of the optimal solution (see the code sketch after this list).

    (2) Reward function method

The reward function is used as the solution method of the optimization problem. Its design idea comes from the single-agent setting and the pole-balancing system. In the pole-balancing reward design, after the agent transitions to a new state, the reward value for failure is −1 and the reward value for success is 0. This design has several obvious defects. ① During task execution, the agent cannot determine whether a given state transition contributes to the final benefit, nor how much it contributes. ② The reward function design gives the task a goal but only provides a reward value at the last step, so the reward is too sparse. The probability of the agent exploring the winning state and learning a strategy is very low, which is not conducive to achieving the task goal.
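For illustration, the linear weighted sum in method (1) can be written as a small scalarization routine. The following Python sketch uses invented objective functions and weights, which are not from this paper.

```python
import numpy as np

def weighted_sum(objectives, weights):
    """Scalarize a multiobjective vector: F(x) = sum_i w_i * f_i(x).

    `objectives` is a callable returning the vector [f_1(x), ..., f_n(x)];
    `weights` must be nonnegative and sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1"

    def scalarized(x):
        return float(np.dot(weights, objectives(x)))

    return scalarized

# Illustrative example: two competing objectives over a scalar decision x.
objectives = lambda x: np.array([x ** 2, (x - 2) ** 2])
f = weighted_sum(objectives, weights=[0.7, 0.3])
candidates = np.linspace(-1.0, 3.0, 101)
best = candidates[np.argmin([f(x) for x in candidates])]
print(f"best x under this weighting: {best:.2f}")
```

As the section notes, the result depends entirely on how the weights are chosen.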

    2.2.Reinforcement learning

The idea of reinforcement learning (RL) is to use trial and error and rewards to train agent behavior [16]. The basic reinforcement learning environment is an MDP. An MDP contains five quantities, namely (S, A, R, P, γ), where S is a finite set of states, A is a finite set of actions, R is the reward function, P is the state transition probability, and γ is the discount factor. The agent perceives the current state from the environment, takes the corresponding action and obtains the corresponding reward.
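For illustration, the following is a minimal Python sketch of the (S, A, R, P, γ) interaction loop described above; the toy transition and reward tables are invented placeholders.

```python
import random

# Toy MDP: transition table P[s][a] -> list of (next_state, probability),
# reward table R[s][a], and discount factor gamma. All values are illustrative.
states = ["s0", "s1"]
actions = ["a0", "a1"]
P = {"s0": {"a0": [("s0", 0.8), ("s1", 0.2)], "a1": [("s1", 1.0)]},
     "s1": {"a0": [("s0", 1.0)], "a1": [("s1", 1.0)]}}
R = {"s0": {"a0": 0.0, "a1": 1.0}, "s1": {"a0": 0.0, "a1": 2.0}}
gamma = 0.95

def step(s, a):
    """Sample the next state from P and return (next_state, reward)."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[s][a]

# One rollout under a random policy, accumulating the discounted return.
s, ret = "s0", 0.0
for t in range(10):
    a = random.choice(actions)
    s, r = step(s, a)
    ret += (gamma ** t) * r
print(f"discounted return of this rollout: {ret:.2f}")
```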

    2.3.Proximal policy optimization

Deep learning (DL) uses a deep neural network as a function approximator. Combining it with RL produces DRL, which effectively addresses the curse of dimensionality that traditional RL methods face in large-scale, complex problems. Proximal policy optimization (PPO) belongs to a class of DRL optimization algorithms [17]. Unlike value-based methods such as Q-learning, it directly optimizes the policy function via the policy gradient of the cumulative expected return to find the policy parameters that maximize the overall return. The PPO algorithm is described in Table 1.

The objective function defining the cumulative expected return of the PPO is as follows:

L^CLIP(θ) = Ê_t[min(r_t(θ)Â_t, clip(r_t(θ), 1 − ε, 1 + ε)Â_t)]

where the probability ratio r_t(θ) is given as follows:

r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)

Â_t is the advantage estimation function, computed with the truncated generalized advantage estimator as follows:

Â_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T−t+1)δ_{T−1},  δ_t = r_t + γV(s_{t+1}) − V(s_t)
    Table 1 PPO algorithm.

    2.4.Attention mechanism

The attention mechanism is a mechanism that enables agents to focus on some information at a certain point in time and ignore other information. It can enable agents to make better decisions faster and more accurately in local areas [18].

When a neural network is faced with a large amount of input situation information, the attention mechanism selects only some key information for processing. In the model, max pooling and gating mechanisms can be used as approximate simulations, which can be regarded as a bottom-up, saliency-based attention mechanism. In addition, top-down focused attention is also an effective information selection method. For example, given a long article as input and a certain number of questions about it, the questions are related only to part of the article's content and have nothing to do with the rest [19]. To reduce the solution pressure, only the relevant content needs to be selected and processed by the neural network.

Given the task-related input information X and a query vector q, a_i can be defined as the degree to which the ith piece of information is attended to, and the input information is then summarized as follows:

att(X, q) = Σ_{i=1}^{N} a_i x_i

Fig.1 is an example of an attention mechanism. In the red-blue confrontation scene, the computation of actions can adopt the attention mechanism. For example, the input data x are all currently selectable enemy targets, while q is the query vector output from the front part of the network.

    3.Problem modeling

    3.1.Problem formulation

Task assignment is the core link in command and control. It means that, on the basis of situational awareness and according to certain criteria and constraints, our own multiple types of interception resources are used efficiently to reasonably allocate and intercept multiple targets, avoiding missed key targets and repeated shooting, so as to achieve the best effect. The task assignment problem follows the principles of intercepting the maximum number of targets, minimizing damage to the protected objects and using the minimum resources. This paper studies the task assignment of ground-to-air confrontation based on the principle of using the least resources with minimum damage to the protected object.

    Fig.1.Attention mechanism.

The ground-to-air confrontation system is a loosely coupled system with a wide deployment range. It has a large scale and needs to deal with many concurrent task assignments and random events. The main tasks include target identification, determining the firing sequence, determining the interception strategy, missile guidance, killing targets, etc. The purpose of task assignment in this paper is to use the least resources while keeping damage to the protected objects to a minimum. Therefore, the task assignment of ground-to-air confrontation is a many-to-many optimization problem. It requires fast search ability and strong dynamic performance to deal with large-scale regional air attacks, and it must be able to dynamically change the task assignment results according to the situation. DRL relies less on external guidance information and does not need an accurate mathematical model of the environment and task. At the same time, it has the characteristics of fast reactivity and high adaptability, which makes it suitable for ground-to-air confrontation task assignment.

    3.2.MDP modeling

In order to combine the optimal assignment strategy of task assignment with DRL, we establish an MDP model of ground-to-air confrontation consisting of the states, the actions and the reward function.

States: (1) The status of the red base, including base information and the status of the base when it is being attacked. (2) The status of the red interceptors, including the current configuration of the interceptor, the working status of the radar, the attack status of the radar, and the information on the enemy units that the interceptor can attack. (3) The status of the enemy units, including the basic information of the enemy units and their status when being attacked by red missiles. (4) The status of attackable enemy units, including whether they can be attacked by red interceptors.

Actions: (1) When to turn the radar on. (2) Which interception unit to choose. (3) Which launch vehicle to choose and at what time to intercept. (4) Which enemy target to choose.

Reward function: the reward function, based on the principle of using the least resources with minimum damage to the protected object, is as follows:

R = 50·I_win + 5m + n − 0.05i  (10)

The reward value is 50 points for winning, 5 points for intercepting a manned target, such as a blue fighter, and 1 point for intercepting a UAV, and 0.05 points are deducted for each missile launched. In Eq. (10), I_win indicates whether the red side wins, m is the number of blue manned units intercepted, n is the number of blue UAVs intercepted, and i is the number of missiles launched. Since each of these reward stages is a task goal that the red side must achieve to win, the reward can guide the agent to learn step by step and stage by stage.
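A minimal Python sketch of this reward computation is given below; the point values follow Eq. (10), while the function name and win flag are illustrative.

```python
def stage_reward(win: bool, manned_intercepted: int, uavs_intercepted: int,
                 missiles_launched: int) -> float:
    """Reward of Eq. (10): 50 for a win, 5 per manned target, 1 per UAV,
    minus 0.05 per missile launched."""
    reward = 0.0
    if win:
        reward += 50.0
    reward += 5.0 * manned_intercepted
    reward += 1.0 * uavs_intercepted
    reward -= 0.05 * missiles_launched
    return reward

# Example: a winning episode that intercepted 8 fighters and 15 UAVs using 60 missiles.
print(stage_reward(win=True, manned_intercepted=8, uavs_intercepted=15,
                   missiles_launched=60))  # 50 + 40 + 15 - 3 = 102.0
```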

    4.Research on a multiagent system model based on OGMN

    4.1.The general and narrow agents architecture design

The task assignment of large-scale ground-to-air confrontation needs to deal with many concurrent task assignments and random events, and the battlefield situation is full of complexity and uncertainty. A fully distributed multiagent architecture has poor global coordination for random events, which makes it difficult to meet the needs of ground-to-air confrontation task assignment. A centralized assignment architecture can achieve globally optimal results [20], but for large-scale complex problems it is impractical because of the high solving time cost. Aiming at the command-and-control problem of distributed cooperative ground-to-air confrontation operations, and combining the DRL development architecture with the idea of dual data-rule driving, a general and narrow agents command-and-control system is designed and developed. The global situation over a certain period of time is taken as the input of the general agent, which has strong computing power, to obtain combat tasks and achieve tactical objectives. The narrow agents, based on tactical rules, decompose the tasks from the general agent according to their own situations and output specific actions. The main idea of the OGMN architecture proposed in this paper is to retain the global coordination ability of the centralized method while combining it with the efficiency advantage of a multiagent system. The general agent assigns tasks to the narrow agents, and each narrow agent divides its tasks into subtasks and selects the appropriate time to execute them according to its remaining resources, the current number of channels and other information. This method largely retains the overall coordination ability of the general agent and avoids missing key targets, repeated shooting, wasted resources and so on. It also reduces the computing pressure on the general agent and improves assignment efficiency. The general agent assigns tasks to the narrow agents according to the global situation; each narrow agent decomposes its tasks into instructions (such as intercepting a target at a certain time) and assigns them to an interceptor according to its own situation. The structure of the general and narrow multiagent system is shown in Fig.2.

    Fig.2.The framework of the task assignment decision model of general and narrow agents.

Based on the existing theoretical foundation, this paper combines rules and data to drive the general agent and uses rules to drive the narrow agents. The purpose is to improve the speed with which a multiagent system solves complex tasks, reduce system complexity, and overcome the shortcomings of multiagent systems in dealing with complex problems. The expectation is that, within a short time, a general agent with strong computing power obtains the situation information and quickly allocates tasks, and then multiple narrow agents select appropriate times and interceptors to intercept enemy targets according to their specific tasks and their own states, saving as many resources as possible on the premise of achieving the tactical objectives.
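To make the two-level flow concrete, the following Python sketch illustrates how a general agent might hand tasks to narrow agents that emit interception instructions; all class names, fields and the round-robin placeholder policy are illustrative assumptions, not the trained policy described in this paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    target_id: int
    narrow_agent_id: int

@dataclass
class Instruction:
    interceptor_id: int
    target_id: int
    launch_time_s: float

def general_agent(global_situation: dict) -> List[Task]:
    """Assign each detected target to a narrow agent (placeholder policy: round-robin)."""
    targets = global_situation["targets"]
    n_agents = global_situation["n_narrow_agents"]
    return [Task(target_id=t, narrow_agent_id=i % n_agents) for i, t in enumerate(targets)]

def narrow_agent(task: Task, local_state: dict) -> Instruction:
    """Decompose a task into an interception instruction using local rules and resources."""
    interceptor = local_state["free_interceptors"][0]      # pick any available interceptor
    return Instruction(interceptor, task.target_id, local_state["earliest_launch_s"])

situation = {"targets": [101, 102, 103], "n_narrow_agents": 2}
tasks = general_agent(situation)
orders = [narrow_agent(t, {"free_interceptors": [7], "earliest_launch_s": 12.0}) for t in tasks]
print(orders)
```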

    4.2.Markov’s decision process of cooperative behavior

Traditional multiagent collaborative decision-making research mainly focuses on model-based research [21], namely, rational agent research. Traditional task assignment research relies too much on the accuracy of the underlying model; it focuses only on the design from the model to the actuator and not on how the model is generated. In an intelligent countermeasure environment there are many kinds of agents, and it is difficult to obtain accurate decision-making models for them; the task environment is complex and subject to situational disturbances. Additionally, environmental models present a certain randomness and time variability [22-24]. All these factors call for studying how to control agent models under a lack of information.

The essence of the framework shown in Fig.3 is to solve the large-scale task assignment problem based on the idea of an optimal assignment strategy algorithm and a DRL method.

The four MDP elements are set as (S, A, r, p): state S, action A, reward r and transition probability p. The Markov property is p(s_{t+1}|s_0, a_0, …, s_t, a_t) = p(s_{t+1}|s_t, a_t), and the strategy function is π: S → A or π: S × A → [0, 1].

Optimization objective: the objective is to solve for the optimal strategy function π* that maximizes the expected cumulative reward value as follows:

π* = argmax_π E_π[Σ_{t=0}^{∞} γ^t r_t]

    Fig.3.The research framework of the cooperative behavior decision model of general and narrow agents.

The method uses a reinforcement learning algorithm to solve the MDP when p(s_{t+1}|s_t, a_t) is unknown. The core idea is to use temporal-difference (TD) learning to estimate the action-value function, with the one-step update as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
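For illustration, a tabular version of such a one-step TD update can be sketched as follows; the state and action names, learning rate and interaction trace are invented for the example.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.95      # learning rate and discount factor (illustrative values)
Q = defaultdict(float)        # tabular action-value estimates Q[(state, action)]

def td_update(s, a, r, s_next, a_next):
    """One-step TD update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Toy interaction trace of (state, action, reward, next_state, next_action) tuples.
trace = [("wave1", "intercept", 1.0, "wave2", "intercept"),
         ("wave2", "intercept", 5.0, "end", "hold")]
for s, a, r, s_next, a_next in trace:
    td_update(s, a, r, s_next, a_next)
print(dict(Q))
```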

Compared with Alpha C2 [25], the model framework optimizes the agent state so that it better meets the conditions of rationality and integrity. Rationality requires that states with similar physical meanings also have small numerical differences. For example, for the launch angle θ of the interceptor, since θ is a periodic variable, it is unreasonable to take θ directly as part of the state; the launch angle θ should instead be encoded as [cos θ, sin θ].

Integrity requires that the state contain all the information required for the agent's decision-making. For example, in the agent's trajectory tracking problem, trend information about the target trajectory needs to be added. However, if this information cannot be observed, the state needs to be expanded to include historical observations, such as the observed wake of the blue drone.

5.PPO for task assignment of general and narrow agents (PPO-TAGNA)

To solve the task assignment problem in large-scale scenarios, this paper combines the optimal assignment strategy algorithm with the DRL training framework, designs a stage reward mechanism and a multihead attention mechanism for the traditional PPO algorithm, and proposes a PPO algorithm for the task assignment of general and narrow agents.

    5.1.Stage reward mechanism

The reward function design is the key to applying DRL to ground-to-air confrontation task assignment and must be analyzed in detail. For the ground-to-air confrontation task assignment problem, the reward design in Alpha C2 [25] sets a corresponding reward value for each unit type: if a unit is lost, the reward value of the corresponding unit is given, and at the end of each deduction round the reward values of all steps are summed as the final reward. However, in practice, the reward values lost by each side largely offset each other at each step, resulting in a small reward value and low learning efficiency. Conversely, if only the reward value of victory or failure is given at the last step of each game and the reward value of all other steps is 0, no artificial prior knowledge is added, which gives the neural network the maximum learning space. However, it leads to too sparse a reward, and the probability of the neural network exploring the winning state and learning a strategy is very low [26]. Therefore, the ideal reward should be neither too sparse nor too dense and should clearly guide the agent to learn in the winning direction.

The stage reward mechanism decomposes the task objectives and gives reward values periodically to guide the neural network to find winning strategies. For example, a phased reward can be given once the first attack wave is successfully repelled, the corresponding reward value is given after a blue high-value unit is lost, and the winning reward value is given after the red side wins. On this basis, the reward function is optimized according to the different objectives in the actual task, such as maximum accuracy, minimum damage, minimum response time, interception and condition constraints, to increase the influence of the global revenue on the revenue of each agent and reduce the self-interest of the agents as much as possible.

    5.2.The multihead attention mechanism

    5.2.1.Neural network structure

The neural network structure of the multiagent command-and-control model is shown in Fig.4. The situation input data are divided into four categories. The first category is the status of the red base, including base information and the status of the base when it is being attacked. The second category is the status of the red interceptors, including the current configuration of the interceptor, the working status of the radar, the attack status of the radar, and the information on the enemy units that the interceptor can attack. The third category is the status of the enemy units, including the basic information of the enemy units and their status when being attacked by red missiles. The fourth category is the status of attackable enemy units, including whether they can be attacked by red interceptors. The number of units in each category is not fixed and changes with the battlefield situation.

Each type of situation data undergoes feature extraction through two layers of fully connected rectified linear units (FC-ReLU); all feature vectors are then concatenated before global features are generated through one layer of FC-ReLU and gated recurrent units (GRUs) [27]. When making decisions, the neural network should consider not only the current situation but also historical information, so the network continuously interacts with the global situation through the GRU and chooses to retain or forget information. The global feature and the feature vectors of the selectable blue units are combined through the attention mechanism to select the interception unit. Each intercepting unit then selects the interception time and the enemy units through an attention operation according to its own state, combined with the rule base.
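A minimal PyTorch sketch of this pipeline, with illustrative feature dimensions, is given below; it covers only the per-category FC-ReLU encoders, the concatenation into a global feature and the GRU, and omits the attention heads described in Section 5.2.3.

```python
import torch
import torch.nn as nn

class SituationEncoder(nn.Module):
    """Per-category FC-ReLU encoders -> concatenation -> FC-ReLU -> GRU global feature."""
    def __init__(self, category_dims, hidden=128, global_dim=256):
        super().__init__()
        # One two-layer FC-ReLU encoder per situation category (base, interceptors,
        # enemy units, attackable enemy units); dimensions are illustrative.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
            for d in category_dims)
        self.fuse = nn.Sequential(nn.Linear(hidden * len(category_dims), global_dim),
                                  nn.ReLU())
        self.gru = nn.GRU(global_dim, global_dim, batch_first=True)

    def forward(self, category_inputs, h=None):
        # Each input: (batch, n_units_i, dim_i); mean-pool over the variable unit axis.
        feats = [enc(x).mean(dim=1) for enc, x in zip(self.encoders, category_inputs)]
        fused = self.fuse(torch.cat(feats, dim=-1)).unsqueeze(1)   # (batch, 1, global_dim)
        out, h = self.gru(fused, h)                                # carry history across steps
        return out.squeeze(1), h

# Example with invented category dimensions and unit counts.
enc = SituationEncoder(category_dims=[8, 16, 12, 12])
inputs = [torch.randn(4, n, d) for n, d in [(2, 8), (12, 16), (30, 12), (10, 12)]]
global_feat, hidden_state = enc(inputs)
print(global_feat.shape)  # torch.Size([4, 256])
```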

5.2.2.Standardization and filtering of state data

State data standardization is a necessary step before entering the network. The original status data include various quantities, such as radar vehicle position, aircraft speed, aircraft bomb load, and the threat degree of enemy units. The units and magnitudes of these data differ, so they must be normalized before being input into the neural network. During the battle, some combat units join the fight later and some units are destroyed, so their data are missing; the neural network needs to be compatible with these situations.

    Fig.4.Neural network structure.

Different units have different states at different time points. Therefore, when deciding to select some units to perform a task, it is necessary to exclude those units that cannot perform the task at that time point. For example, there must be a certain time interval between two missile launches by an interceptor, and the interceptor must be connected to the radar vehicle to launch a missile.
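A minimal Python sketch of this standardization and feasibility filtering follows; the field names, scales and feasibility rules are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

# Illustrative per-field scales for min-max normalization of raw state data.
SCALES = {"position_km": 400.0, "speed_mps": 600.0, "bomb_load": 8.0, "threat": 1.0}

def normalize(unit_state: dict) -> np.ndarray:
    """Map heterogeneous raw fields into a fixed-order vector in [0, 1]."""
    return np.array([min(unit_state.get(k, 0.0), s) / s for k, s in SCALES.items()])

def feasible(unit: dict, now: float, reload_s: float = 10.0) -> bool:
    """Mask out interceptors that cannot act: still reloading or radar not connected."""
    return unit["radar_connected"] and (now - unit["last_launch_s"]) >= reload_s

units = [{"radar_connected": True,  "last_launch_s": 0.0},
         {"radar_connected": False, "last_launch_s": 0.0},
         {"radar_connected": True,  "last_launch_s": 45.0}]
mask = [feasible(u, now=60.0) for u in units]
print(mask)                                   # [True, False, True]
print(normalize({"position_km": 120.0, "speed_mps": 300.0, "bomb_load": 4.0, "threat": 0.7}))
```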

    5.2.3.The attention mechanism and target selection

In this paper, the decision-making action is processed by multiple heads as the network output, meaning that the action is divided into an action subject (which interception unit to choose), an action predicate (which launch vehicle to choose and at what time to intercept), and an action object (which enemy target to choose).

When selecting interception targets, the network needs to focus on some important targets in local areas. In this paper, the state of each fire unit and the feature vector of the incoming target are used to realize the attention mechanism with an additive model.

X = [x_1, …, x_N] is defined as the N pieces of input information, and the probability a_i of selecting the ith piece of input information given q and X is calculated first. Then a_i is defined as follows:

a_i = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where a_i is the attention distribution and s(x_i, q) is the attention scoring function, calculated with the additive model as follows:

s(x_i, q) = u^T tanh(W[x_i; q; v])

where the query vector q is the feature vector of each fire unit, x_i is the ith currently selectable attack target, W and u are trainable neural network parameters, and v is the global situation feature vector, giving a conditional attention mechanism so that the global situation information participates in the calculation. The attention score of each fire unit for each target is obtained, each element of the score vector is passed through a sigmoid and sampled, and finally the overall decision is generated.
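The following NumPy sketch illustrates this conditional additive attention and the per-target sigmoid sampling; the dimensions, random parameter initialization and the concatenation form [x_i; q; v] are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_q, d_v, d_h = 12, 16, 256, 64          # illustrative feature sizes
N = 5                                         # number of selectable enemy targets

# Trainable parameters of the additive scoring model (randomly initialized here).
W = rng.normal(0, 0.1, size=(d_h, d_x + d_q + d_v))
u = rng.normal(0, 0.1, size=d_h)

def scores(targets, q, v):
    """Additive attention score s(x_i, q) = u^T tanh(W [x_i; q; v]) for each target."""
    return np.array([u @ np.tanh(W @ np.concatenate([x, q, v])) for x in targets])

targets = rng.normal(size=(N, d_x))           # candidate enemy target features
q = rng.normal(size=d_q)                      # fire-unit feature vector (query)
v = rng.normal(size=d_v)                      # global situation feature (condition)

s = scores(targets, q, v)
attn = np.exp(s) / np.exp(s).sum()            # softmax attention distribution a_i
engage = rng.random(N) < 1.0 / (1.0 + np.exp(-s))   # per-target sigmoid sampling
print(attn.round(3), engage)
```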

    5.3.The ablation experiment

To study the impact of the two mechanisms on algorithm performance, this paper designs an ablation experiment. By adding or removing the two mechanisms from the basic PPO algorithm, four different algorithms are set up and their effects compared. The experimental setup is shown in Table 2.

Based on the general and narrow agents framework in Section 3 of this paper, all algorithms are trained iteratively 100,000 times under the same scenario. The experimental results are shown in Fig.5. The performance of the basic PPO algorithm can be improved by adding either the stage reward mechanism or the multihead attention mechanism alone. The multihead attention mechanism increases the average reward from 10 to approximately 38. The stage reward mechanism has a slightly larger and more stable effect, increasing the average reward from 10 to approximately 42. When the two mechanisms are added simultaneously, the algorithm performance improves considerably, and the average reward increases to approximately 65, which shows that the PPO-TAGNA algorithm proposed in this paper is effective for the task assignment problem under the framework of general and narrow agents.

    6.Experiments and results

    6.1.Experimental environment setting

The neural network training in this paper is carried out on a virtual digital battlefield. In the hypothetical combat area, facing a certain number of blue offensive forces, and with important places to protect and limited forces, the red agent needs to make real-time decisions according to the battlefield situation and allocate tasks according to the threat degree of the enemy and other factors, while trying to preserve its strength and protect important places from destruction. In this paper, the task assignment strategy of the red side is trained with a DRL method. Physical constraints, such as earth curvature and terrain shielding, are considered, so the key elements of the digital battlefield are close to those of the real battlefield. The red-blue confrontation scenario is shown in Fig.6.

    Table 2 Ablation experimental design.

    6.1.1.The red army force setting

    Fig.5.Performance comparison of ablation experimental algorithms.

There are 2 important places, the headquarters and the airport.

    One early warning aircraft has a detection range of 400 km.

The long-range interception unit consists of 1 long-range radar and 8 long-range interceptors (each interceptor carries 3 long-range missiles and 4 short-range missiles).

    The short-range interception unit consists of 1 short-range radar and 3 short-range interceptors (each interceptor is loaded with 4 short-range missiles).

Three long-range interception units and three short-range interception units are deployed to defend the red headquarters in a sector, while four long-range interception units and two short-range interception units are deployed to defend the red airport in a sector, for a total of 12 interception units.

    6.1.2.Blue army force setting

    There are 18 cruise missiles.

    There are 20 UAVs,each carrying 2 antiradiation missiles and 1 air-to-ground missile.

    There are 12 fighter planes,each carrying 6 antiradiation missiles and 2 air-to-ground missiles.

    There are 2 jammers for long-distance support jamming outside the defense area.

    6.1.3.Confrontation criterion

If the radar is destroyed, the unit loses its combat capability. The radar needs to remain on during the whole guidance process; when it is on, it radiates electromagnetic waves, which can be captured by the opponent and expose its position. The radar is subject to physical limitations, such as earth curvature and terrain shielding, and the missile flight trajectory is the optimal-energy trajectory. The interception ranges are 160 km (long range) and 40 km (short range). For UAVs, fighters, bombers, antiradiation missiles and air-to-ground missiles, the high kill probability in the kill zone is 75% and the low kill probability is 55%; for cruise missiles, the high kill probability in the kill zone is 45% and the low kill probability is 35%. The antiradiation missile has a range of 110 km and a hit rate of 80%. The air-to-ground missile has a range of 60 km and a hit rate of 80%. The jamming sector of the blue jammer is 15°, and after the red radar is jammed, its kill probability is reduced according to the jamming level.
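For readability, these engagement rules can be collected into a small configuration table; the Python sketch below encodes the ranges and probabilities listed above, and the dictionary layout itself is an illustrative choice.

```python
# Engagement rules from Section 6.1.3, encoded as plain data (layout is illustrative).
INTERCEPT_RANGE_KM = {"long": 160, "short": 40}

KILL_PROBABILITY = {
    # target class: (high-kill-zone probability, low-kill-zone probability)
    "uav": (0.75, 0.55),
    "fighter": (0.75, 0.55),
    "bomber": (0.75, 0.55),
    "antiradiation_missile": (0.75, 0.55),
    "air_to_ground_missile": (0.75, 0.55),
    "cruise_missile": (0.45, 0.35),
}

BLUE_WEAPONS = {
    "antiradiation_missile": {"range_km": 110, "hit_rate": 0.80},
    "air_to_ground_missile": {"range_km": 60, "hit_rate": 0.80},
}

JAMMER_SECTOR_DEG = 15

def kill_probability(target: str, high_zone: bool) -> float:
    """Look up the red-side kill probability for a target in the high or low kill zone."""
    high, low = KILL_PROBABILITY[target]
    return high if high_zone else low

print(kill_probability("cruise_missile", high_zone=True))   # 0.45
```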

    6.2.Experimental hardware configuration

The CPU running the simulation environment is an Intel Xeon E5-2678 v3 with 88 cores and 256 GB of memory; neural network training runs on two GPUs, each an NVIDIA GeForce 2080 Ti with 72 cores and 11 GB of video memory. In PPO, the hyperparameter ε = 0.2, the learning rate is 10⁻⁴, the batch size is 5,120, and the numbers of hidden layer units in the neural network are 128 and 256.
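These settings can be captured in a small configuration block, as sketched below with illustrative key names; only the values reported above are included.

```python
# PPO training configuration reported in Section 6.2 (key names are illustrative).
PPO_CONFIG = {
    "clip_epsilon": 0.2,         # PPO clipping parameter ε
    "learning_rate": 1e-4,
    "batch_size": 5120,
    "hidden_units": (128, 256),  # hidden layer sizes used in the network
}
print(PPO_CONFIG)
```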

    6.3.Analysis of the experimental results

    6.3.1.Agent architecture comparison

The agent architecture based on OGMN proposed in this paper and the framework of Alpha C2 [25] are each iterated 100,000 times in the digital battlefield with the PPO, A3C and DDPG algorithms, and the reward values and win ratios are compared. The comparison results are shown in Fig.7.

    Fig.6.Schematic diagram of an experimental scenario.

It can be seen that the agent architecture based on OGMN proposed in this paper obtains a higher reward value faster during training, and PPO performs best, substantially improving the average reward value and stability. In terms of win ratio, the agent architecture in this paper shows the same pattern, achieving a higher and more stable win ratio. The experiments show that the OGMN architecture not only largely retains the overall coordination ability of the general agent to ensure training stability but also has the efficiency of a multiagent system, which improves training efficiency.

    6.3.2.Algorithm performance comparison

Under the same scenario, the PPO-TAGNA algorithm proposed in this paper is compared with the Alpha C2 algorithm [25] and the rule base based on expert decision criteria [5,28-30]. The three algorithms are iterated 100,000 times in the digital battlefield, and the comparison results are shown in Fig.8.

    6.4.Agent behavior analysis

In the deduction on the digital battlefield, some strategies and tactics emerge in this study. Fig.9 shows the performance of the red agents before training. At this stage, only the unit closest to the target is allowed to defend, there is no awareness of sharing the defense pressure, and the value of targets is not distinguished. Finally, when high-value targets attack, the resources of the units that can intercept are exhausted and the red side fails.

Fig.10 shows the performance of the red agents after training. At this stage, the agent can distinguish the high-threat units of the blue side, share the defense pressure, use resources more rationally, defend key areas more efficiently, and finally take the initiative to attack the high-value targets of the blue side to win.

    7.Conclusions

Aiming at the low processing efficiency and slow solution speed of large-scale task assignment, and based on the idea of the optimal assignment strategy algorithm combined with the DRL training framework, this paper proposes the OGMN-based agent architecture and the PPO-TAGNA algorithm for ground-to-air confrontation task assignment. By comprehensively analyzing the requirements of large-scale task assignment, a reasonable state space, action space and reward function are designed. Using real-time confrontation on the digital battlefield, experiments including algorithm ablation, agent framework comparison and algorithm performance comparison are carried out. The experimental results show that the OGMN task assignment method based on DRL achieves a higher win ratio than the traditional methods, uses resources more reasonably, and achieves better results with limited training. The multiagent architecture based on OGMN and the PPO-TAGNA algorithm proposed in this paper are applicable and superior for ground-to-air confrontation task assignment and have important application value in the field of intelligent aided decision-making.

    Fig.7.Comparison of agent architecture training effect.

    Fig.8.Algorithm performance comparison.

    Fig.9.Performance of agents before training.

    Fig.10.Performance of agents after training.

    Declaration of competing interest

    The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

    Acknowledgements

The authors would like to acknowledge the National Natural Science Foundation of China (Grant Nos. 62106283 and 72001214) and the Natural Science Foundation of Shaanxi Province (Grant No. 2020JQ-484) for providing the funds to conduct the experiments.
