Yongqun ZHANG, Zhenyun SHI, Honging JI, Zhenzhen SU
a School of Electronic Engineering, Xidian University, Xi’an 710071, China
b School of Computer Science and Technology, Xidian University, Xi’an 710071, China
KEYWORDS
Abstract Multi-target tracking is facing the difficulties of modeling uncertain motion and observation noise.Traditional tracking algorithms are limited by specific models and priors that may mismatch a real-world scenario.In this paper, considering the model-free purpose, we present an online Multi-Target Intelligent Tracking (MTIT) algorithm based on a Deep Long-Short Term Memory (DLSTM) network for complex tracking requirements, named the MTIT-DLSTM algorithm.Firstly, to distinguish trajectories and concatenate the tracking task in a time sequence,we define a target tuple set that is the labeled Random Finite Set (RFS).Then, prediction and update blocks based on the DLSTM network are constructed to predict and estimate the state of targets,respectively.Further,the prediction block can learn the movement trend from the historical state sequence,while the update block can capture the noise characteristic from the historical measurement sequence.Finally, a data association scheme based on Hungarian algorithm and the heuristic track management strategy are employed to assign measurements to targets and adapt births and deaths.Experimental results manifest that, compared with the existing tracking algorithms, our proposed MTIT-DLSTM algorithm can improve effectively the accuracy and robustness in estimating the state of targets appearing at random positions, and be applied to linear and nonlinear multi-target tracking scenarios.
Multi-target tracking technology is an important branch of information fusion, and its main purpose is to continuously estimate the number of targets and their states from a sequence of observations in a scenario.1It has a wide application prospect not only in military fields2such as aerial reconnaissance,missile defense,and battlefield surveillance,but also in civilian fields such as robot vision,3traffic control, and autonomous vehicles.4Generally speaking, multi-target tracking mainly faces the following challenges: the number of targets is timevarying due to births and deaths; the accuracy of estimated states is usually degraded by measurement errors and uncertainty of target motions.In addition, data association, detection uncertainty, and false alarms can also bring obstacles.5At present, tracking multiple targets in an unconstrained scenario is still a very difficult task.
A classic multi-target tracking framework is based on Bayesian theory, which has been widely used and achieved good tracking accuracy.Typical algorithms include Multiple Hypothesis Tracking (MHT),6Joint Probabilistic Data Association (JPDA),7Random Finite Set (RFS) filtering,8etc.At present, the most concerned multi-target tracking algorithm is RFS filtering, which recursively propagates the posterior density of target states forward.To address the heavy computational load of RFS filtering,the Probability Hypothesis Density (PHD),9Cardinalized PHD (CPHD),10Multi-target Multi-Bernoulli (MeMBer),8and Cardinality-Balanced MeMBer(CBMeMBer)11filters have been successively developed as approximations.Recently, a Poisson Multi-Bernoulli Mixture(PMBM) filter12is proposed to handle multi-target tracking with higher tracking accuracy.However,the PMBM filter suffers from the heavy computational burden due to the global association history hypotheses.In addition,to further improve the tracking performance of RFS filtering algorithms, several innovative algorithms13–16have been proposed in our early works.Strictly speaking, these tracking algorithms are not multi-target trackers because they cannot distinguish each target.To this end, Generalized Labeled Multi-Bernoulli(GLMB)17,18and LMB19filters are proposed, which formally estimate target tracks by applying the labeled RFS formulation.All of these algorithms require a given system model during initialization, including a target state transition model and a sensor observation model.The former is used to predict the target state at the next time step,while the latter is used to map the predicted state to the measurement space for correction.If the system model does not match the unknown environment,the tracker will fail to track targets.
Recently, deep learning based on neural networks due to powerful nonlinear mapping, learning, and computing power has been growing rapidly and showing great advantages in many fields, such as image recognition,20speech processing,21video processing,22etc.However,there has been relatively less work applying deep learning to multi-target tracking.Milan et al.firstly proposed an end-to-end multi-target tracking algorithm based on deep learning.23Subsequently, Emambakhsh et al.used an Long-Short Term Memory (LSTM) network to predict the state for each target.24To deal with the combination explosion of multi-target data association, Liu et al.employed an LSTM network to learn the association weights between targets and measurements, avoiding traversing all possible association events.25Additionally, Gao et al.presented LSTM-based deep recurrent neural networks to track a single target with linear and nonlinear motions.26,27At present, multi-target tracking algorithms based on deep learning are limited to radar and pedestrian targets in ideal scenarios,and they have not performed well in more complex scenarios,such as target maneuvering, poor cooperation, etc.Consequently, figuring out how to combine the advantages of the deep learning theory and a neural network structure to reliably track multiple targets in complicated scenarios is an urgent issue to be solved.
Inspired by the potential of recurrent neural networks, we propose a Multi-Target Intelligent Tracking(MTIT)algorithm based on a Deep LSTM (DLSTM) network with a recurrent structure to online track each labeled target, i.e., a MTITDLSTM algorithm.It is here emphasized that the proposed data-driven MTIT-DLSTM algorithm does not need to model motions and rely on priors,which is more suitable for scenario diversity.Thus, the proposed algorithm is capable of dynamically tracking multiple targets in complex scenarios involving clutter, birth, and death of targets, as well as linear and nonlinear motions.The main contributions of our work are summarized as follows:
(1) We design a prediction block based on a DLSTM network to predict the state of each target, since the estimation of the target state for continuous time steps can be regarded as a sequential learning problem.The prediction block is required to capture the motion trend from historical states and transfer target states to the next time step.Therefore,the MTIT-DLSTM algorithm can adaptively track multiple targets moving in different ways.
(2)We design an update block based on DLSTM to update the state of each target.The main purpose of the update network is to learn the noise characteristics from historical measurements and update states.
(3) We build a training set corresponding to the network structure in terms of different kinds of target motions to ensure that the network can deal with tracking of multiple targets in complex scenarios.Adam optimizer is then adopted to train prediction and update networks,while a cosine annealing algorithm is used to dynamically change the learning rate to ensure convergence of the model.
(4) To handle the time-varying number of targets, we use heuristic track management to adapt births and deaths.Meanwhile, to find the corresponding measurements for targets,data association based on Hungarian algorithm is used to seek the association map between measurements and targets.
The paper is organized as follows.First of all,the necessary background knowledge about target tracking and DLSTM networks are reviewed in Section 2.Then,Section 3 establishes the overall algorithm framework and introduces the involved prediction and update networks, data association, track management, update survivals, and assign births in detail, respectively.Subsequently, how to train networks is explained in Section 4.Finally, Section 5 is dedicated to experimental results,and Section 6 concludes the paper,giving future working directions.
A brief review of target tracking and DLSTM is given in Sections 2.1 and 2.2,respectively, which is the theoretical basis of the proposed MTIT-DLSTM algorithm.
Target tracking is the ultimate goal of radar back-end data processing, composed of point aggregation, track initiation,track association, track fusion, track extinction, and so on.Traditional target trackers based on Bayesian filtering are to recursively compute the reliability of various values, i.e., the posterior probability of target states given measurements.28In the state space and measurement space,the motion and observation equations are, respectively, modeled as
where xkand zkdenote the target state and measurement vectors at time k, respectively; wk-1and vkrepresent the process noise and observation noise, respectively, which are independent of each other.The motion model, defined by Eq.(1),describes how a target moves, while the observation model,given by Eq.(2), describes the relationship between the target state and measurement.
Bayesian filtering consists of two procedures: prediction to compute the initial density using a motion model; update to compute the posterior density using a measurement model,i.e.,
where fk|k-1(xk|xk-1)is the transition probability density determined by the motion equation,while gk(zk|xk)is the likelihood function determined by the observation equation.The detailed process of Bayesian filtering is shown in Fig.1.
It is worth noting that all information about the state at time k is encapsulated in the posterior density determining the final estimation accuracy.Prediction and update are carried out recursively to achieve a continuous estimation of the target state.In general, to alleviate the computational burden of the above equations,Kalman filter29and the particle filter30are adopted to provide two main implementation approaches.The former performs under linear and Gaussian assumptions for the state and measurement models,while the latter approximates the arbitrary distributions using sequential importance sampling.
Under assumptions of the linear system and Gaussian noise, the state transition and measurement equations can be described as
where Fk-1is the state transition matrix, and Hkis the transformation matrix from the state space to the observation space; wk-1and vkare zero mean Gaussian noise with covariances Qk-1and Rk, respectively.
Considering a two-dimensional tracking scenario,the Constant Velocity(CV)and Constant Turn(CT)motions are modeled as
where Tsis the sampling interval,and Ω is the turn rate.Thus,if a target moves at CV,its state vector is xk=[x,vx,y,vy]T;if a target moves at CT, its state vector is xk=[x,vx,y,vy,Ω]T.
Remark 1.In order to make the proposed algorithm get rid of the limitations of the model, we utilize neural networks of deep learning to predict and update via learning motion trends and noise characteristics from the historical positions of a target,which correspond to estimating the conditional probabilities p(xk|x1:k-1) and p(xk|z1:k), respectively.Therefore, to adapt various motion modes,the state vector is defined as xk?[x,y]Tin this work.
2.2.1.Recurrent neural network
Fig.1 Bayesian filtering process.
As previously analyzed, the target tracking problem is essentially to estimate target states from the measurement sequence.Thus, target tracking can be realized by a Recurrent Neural Network (RNN) that can map sequential measurements to true states.A typical neural network consists of one input layer, one or several hidden layers, and one output layer.31An RNN is a kind of particular neural network with a recurrent structure, through which historical information can be stored and exploited.In the RNN structure, the hidden layer neurons at different times are connected via the hidden state h carrying the historical information,32as shown in Fig.2.For a sequence-to-sequence task, the RNN can produce a specific output sequence, given an input sequence with a temporal structure33.
The trained RNN is capable of estimating the output sequence (y1,y2,???,yK′) given an input sequence (ε1,ε2,???,εK),on condition of maximizing the conditional density p(y1,y2,???,yK′|ε1,ε2,???,εK), where K′may differ from K.This conditional density is represented as a product of the probability densities via the chain rule,34i.e.,
where v is the fixed joint representation of the input sequence and last hidden state of the RNN.Each p(yk|v,y1,y2,???,yk-1)represents a recurrent module defined by a deterministic function33.
Remark 2.The above analysis provides theoretical support for estimating optimal states of targets by means of an RNN.In this work, we will mainly estimate the states that maximize the conditional densities p(xk|x1:k-1) and p(xk|z1:k) mentioned in Section 2.1, which can be implemented by a chain of recurrent modules.Obviously,the output of the last module is just what we need.
2.2.2.DLSTM network
A vanishing gradient phenomenon of an RNN will occur when training data with long temporal dependencies.35To avoid this phenomenon, researchers have designed some modified loop units to improve the traditional RNN, such as LSTM combined with the ‘‘gate” mechanism.36A typical LSTM unit structure is shown in Fig.3.
The core of the LSTM unit consists of two information transmission lines (the cell state ckand the hidden state hk)and three ‘‘gates”, which are usually composed of a sigmoid neural network layer and a pointwise multiplication operation.The cell state ckand the hidden state hkupdated by
where ⊙represents the pointwise multiplication operation;fkis the forgetting decision vector, multiplied by ck-1to determine which information from the previous time should be forgotten;ikis the input decision vector, multiplied by c~k to determine which information at the current time should be added to the cell state ck; okis the output decision vector, multiplied by tanh(ck) to determine which information in ckshould be added to the hidden state,and okare,respectively,calculated as
Fig.2 Recurrent neural network.
Fig.3 Structure of an LSTM network.
where W and b are,respectively,the weight matrix and the bias vector, obtained through network training.Note that these parameters must be optimized to minimize the predictive error of the network.σ(?) and tanh(?) represent the sigmoid activation function and the hyperbolic tangent activation function,respectively, written as
A deep model can be more efficient at representing some functions than a shallow model.37The RNN introduced earlier has only one hidden layer, while stacking more than two hidden layers will get a deep RNN.Similarly,the LSTM network can also be extended in time and space dimensions to obtain a Deep LSTM(DLSTM)network,as shown in Fig.4,which can learn more hidden information to increase the expressive power.
Remark 3.In this work, in order to enable the network to map sequential measurements to true states,an RNN-based structure is considered, as analyzed by Eq.(9).Specifically, the DLSTM network will be applied due to its tremendous superior in extracting features from the input sequence.
In this section, we present multi-target intelligent tracking based on the DLSTM network, called the MTIT-DLSTM algorithm.Firstly, in Section 3.1, we define a problem formulation and a target tuple set in Sections 3.1.1 and 3.1.2 as the theoretical basis of the proposed MTIT-DLSTM algorithm,respectively.Then, an online intelligent tracking framework of the MTIT-DLSTM algorithm is constructed in Section 3.2.Subsequently,Section 3.3 elaborates the network construction process of the most important prediction and update blocks.Finally,data association,track management,update survivals,and assign births, involved in the MTIT-DLSTM algorithm framework, are illustrated in Sections 3.4-3.6, respectively.
3.1.1.Problem formulation
Fig.4 Structure of a DLSTM network.
Different from the framework of recursive Bayesian filtering,we here use two DLSTM networks to implement prediction and update separately.Besides, data association based on Hungarian algorithm is used to assign measurements for targets.The proposed framework has two main advantages: (A)prediction and update are two components that can be trained individually; (B) the framework becomes modular, making it easy to replace another module.The overall framework of the proposed MTIT-DLSTM algorithm is shown in Fig.5.It can be seen that multi-target intelligent tracking based on DLSTM is online, connected with the target tuple set Dk,meaning that the output of the kth iteration will be the partial input of the (k+1)th iteration.The overall process will be introduced in detail as follows.
Firstly, the cumulative historical state setis given into the predictor, which loads the trained DLSTM model for each target and predicts the target state set Xk|1:k-1in a d-dimension space.Since the model is universal to all targets, multiple targets treated as a batch can be fed together into the predictor, which greatly speeds up the learning efficiency and can meet real-time requirements.
Secondly, to enable the targets to match the corresponding measurements, we compute the pairwise-distance matrix Ck?RMk-1×Nkfor data association, where each element is the statistical distance between the predicted state of track ?iand measurement zk,j.The output of the data association block based on Hungarian algorithm is an assignment matrix Akwith the same size as that of Ck.In other words, we obtain an association map θk(?) equivalent to Ak.This is followed by track management to remove deaths, obtain survivals,and assign births with the assistance of the weight set, defined by Eq.(21).Then, put the labels of survivals and births into sets Ls,kand Lb,k, respectively.
Thirdly,the update block aims to estimate the states of targets in Ls,k,while the assignment birth block is to assign target tuples for targets in Lb,k.There are two ways to update,depending on whether the measurement is assigned to the target., produced through appending zk,θk(?i)to, is fed into the update network to estimate the target state when θk(?i)>0,while the update is performed without any measurement but with the target’s predicated state when θk(?i)=0.Consequently, the union Dkof the birth and survival target tuple sets are regarded as the output for the kth iteration.
Finally,the target state setis extracted from Dkaccording to the survival weight wmin.
It is noteworthy that the crux of the proposed MTITDLSTM algorithm is the network construction of prediction and update blocks,which directly affects the multi-target intelligent tracking performance.Subsequently, we will give their specific construction processes in the following subsections.
Fig.5 Overall framework of proposed MTIT-DLSTM algorithm.
3.3.1.Prediction block
In Bayesian filtering, the task of prediction is to transfer the target state from time k-1 to time k by means of a motion model underlying Markov hypothesis.29If the motion model does not match the target motion, the prediction will be paralyzed,resulting in failure to track the target.DLSTM networks provide a data-driven solution to this challenge by learning motion trends intelligently from previous states.
In this work, a DLSTM network with 10 time steps and 2 hidden layers is used to predict, which is illustrated in Fig.6.It can be seen that the prediction network retains the highorder Markov transition state.To predict the state of a mature target, the input and output of the prediction block are represented as a sequence and a vector,respectively,indicating that this block is a many-to-one structure.As a matter of fact, the DLSTM network is a many-to-many structure, as shown in Fig.4, but the prediction block, designed by us, takes only the output of the last time step as the output here.
In general, the length of the input sequence is variable.Obviously,the motion trend is reflected in at least two continuous time steps.In the first two moments, we choose the state of the first moment as the prediction of the second moment.Besides, we hold the view that the historical information from the past 10 time steps is sufficient for learning motion trends.Therefore, the minimum and maximum lengths of the input sequence in the prediction block are set to 2 and 10, respectively.During the first few time steps of tracking,the predicted states will be slightly inaccurate due to the lack of the historical information available.With the accumulation of more and more historical information,the predicted states become closer and closer to the true states.
As shown in Fig.6, the states of the input sequence are firstly normalized, which makes the model converge faster.Supposing that the motion scenario is a larger rectangle in the two-dimensional space, the state can be normalized by
where rmaxand rminrepresent the boundary extremum points of the scenario, i.e., (x1,y1) and (x2,y2), as shown in Fig.7.Then, the normalized sequence is fed into the DLSTM network to extract motion characteristics.Finally, its output is processed to obtain the predicted states via a single-layer linear network.Subsequently, these states will be used to match correct measurements for targets and participate in state estimating.
It is worth noting that the prediction network can directly handle the input from multiple targets because the input of DLSTM is provided as a tensor consisting of ‘Batch’, ‘Seq’,and ‘Feature’.For the above three parameters, ‘Batch’ and‘Seq’are variable,and‘Feature’is fixed,where‘batch’is equal to the number of targets to be tracked, ‘Seq’ is equal to the length of the historical sequence, and ‘Feature’ is the dimension of the target state.Taking two targets as example,Fig.8 shows the process of assembling the input of the prediction network.In Fig.8, the maximum value of ‘Seq’ is set to 10.When preparing the input of the prediction network, the historical information should be truncated for targets’ surviving length being greater than 10 time steps,while the historical information should be filled with initial states for targets’ surviving length being less than 10 time steps.
3.3.2.Update block
The noise,attached to measurements from a trajectory,is generally supposed to follow a Gaussian distribution, whose covariance is influenced by the sensor and environment.Traditional tracking filters rely on priors about noise to estimate the target state.However, in complex tracking environments, a large number of hypotheses and priors cannot be maintained.In this work, in order to solve this problem, we design an update network based on DLSTM to recover states from polluted measurements by capturing noise characteristics from historical measurements, as shown in Fig.9.
Fig.6 Prediction network.
Fig.7 Boundary of the two-dimensional scenario.
The update network,depicted by Fig.9,has a similar structure to that of the prediction network.The only difference is that the number of time steps is more than that of the prediction network.Since the statistical properties of noise are hidden in enough samples, the DLSTM network needs to learn noise information from a longer historical measurement sequence.The length of the input sequence is variable,ranging from 10 to 20.In this work, in order to ensure tracking accuracy,Kalman filter is used to estimate the state when the length of the cumulative measurement sequence is less than 10.Its detailed process is summarized as follows:
where Pk|k-1and Pkare, respectively, the covariances of the predicted and estimated states at time k;the innovation vector ekrepresents the difference between the measurement zkand the prediction obtained by mapping the predicted state xk|k-1to the observation space; accordingly, the matrix Skis the covariance of the innovation vector ek,and Kkis Kalman gain.Substituting Eqs.(24)-(27) into Eq.(28), the equation of updating states can be obtained by
In this work, we define x?[x,y]T, as mentioned in Remark 1,and z=[x,y]T,so Hkis the identity matrix I.Thus,Eq.(30)is transformed to
It can be seen that when estimating the state,a larger Pk|k-1means a higher proportion of measurement, while a larger Rkmeans a higher proportion of prediction.Assuming that Pk|k-1=diag(a,a) and Rk=diag(b,b), Eq.(31) can be approximated as
Remark 4.As mentioned earlier, the prediction block based on the DLSTM network can only learn the dynamics characteristics from the historical states, but cannot obtain the prediction covariance.Through repeatedly adjusting the coefficients and experimental verification, fx=fz=0?5 will acquire satisfactory results.Note that using the update network to estimate the target state requires historical information of at least 10 moments.In the first 10 moments,our main purpose is to ensure a continuous trajectory.During this period, the target state cannot be accurately estimated in general.Due to the lack of the priori of motion and measurement noise, classic estimating methods,such as Kalman filtering, cannot be directly used to accurately estimate the target state.In this work, in order to overcome the above shortcomings, we use fxand fzto represent the reliability of prediction and measurement, respectively.Since we cannot estimate the covariance, we have no alternative but to believe that prediction and measurement are equally reliable.
Fig.8 Process of assembling the input of the prediction network.
Fig.9 Update network.
Data association that follows the prediction block aims to assign the corresponding measurements to targets.In cluttered environment, there are ambiguities in the association due to measurements arising from true targets or false alarms.Assigning wrong measurements to targets often results in lost tracks.Moreover, clutter can produce false tracks that may rob measurements of true targets.Normally, data association schemes are used to deal with the above problems,which have received much attention in multi-target tracking research.38The purpose of data association is to compute the assignment matrix A by minimizing the global association cost, given the cost matrix C ?RM×N, which is described as25
Subject to (s.t.)
where i=1,2,???,M,j=1,2,???,N.The first constraint, given by Eq.(35), implies that each measurement can be assigned to at most one target, while the second one, introduced by Eq.(36), implies that a track can generate at most one measurement at any time step.Therefore, data association is a complex combinatorial optimization problem that can be handled by Hungarian algorithm.
Usually,the target number M will be less than the measurement number N,so Hungarian algorithm will return M association maps, which means that there are all-zero columns and no all-zero rows in the above mentioned assignment matrix A.Note that an all-zero column means that the measurement does not match any existing track.In order to model missed detection,we decide to add an additional column to the assignment mapping matrix, represented as A ?RM×(N+1).Accordingly, the cost matrix C also needs to add an additional column, filled with the association threshold cthr.Hence, the optimization objective and constraint are rewritten as
where i=1,2,???,M,j=1,2,???,N+1.
At time k, the statistical distance is established with the weighted norm of the innovation vector39as follows:
where ckfollows a central chi-squared distribution with n degrees of freedom,and n is the dimension of the measurement vector.Thus,the statistical distance between each survival target and each measurement is calculated to form a cost matrix.
Using the cost matrix Ckat time k as the input,Hungarian algorithm can output the assignment matrix Ak.Then,remove the last column of Akto get the final assignment matrix.An all-zero row i means that track ?iis possibly undetected or dead, while an all-zero column j means that the jthmeasurement may be a false alarm or arise from a birth target.
To adapt the time-varying number of targets, it is indispensable to provide a mechanism to allow birth targets to enter the scenario and remove existing ones that disappear indefinitely.Each unassigned measurement can potentially be either the start of a new trajectory or a false alarm.Additionally,the case of a certain target with missing measurement could mean that the target has disappeared,or that the detector has failed.To address this challenge,a heuristic method,considering consecutive measurements,is proposed to adapt to the appearance and extinction of targets without priors about appearing positions.
In general, a track goes through several stages from its appearance to extinction: track initiation, track maintenance,and track disappearance.In multi-target tracking, a point image is formed at every radar scan.If measurement points,originated from one potential birth target, appear continuously no less than 3 scanning cycles,the track formed by these points is considered as an initial track.Continuous arrival measurements maintain a mature track.When there is no new correlation measurement to maintain one track, it will enter the stage of track disappearance.Further, when one track disappears for two scan cycles, it is considered to have died.
Track management is mainly carried out by changing the weight of each track.In this work, targets with a survival weight wk,?i= 1 are regarded as mature targets, while those at the stages of track initiation and track disappearance(0 In Fig.10, the circle represents the survival weight that is initialized as 0.4.Note that wk,?i≥0?8 means that a new track has successfully started, marked with orange circles.Meanwhile, the letter ‘y’ refers to the situation that targets have assigned measurements, and accordingly, the survival weights will increase.The letter ‘n’ refers to the situation that targets have no assigned measurements, and accordingly, the survival weights will decrease.As we can see, it is strict for raw and dying targets to reduce their weights to 0, which helps to prevent false tracks.However, it is tolerant for mature targets to reduce their weights to 0.8 instead of 0, which aims at dealing with missed detection.Besides, track management needs to append the labels of targets with a survival weight greater than 0 (wk,?i>0) to the set Ls,k.Meanwhile, unassigned measurements, possibly originated from birth targets or false alarms,are temporarily treated as raw targets with an initial survival weight winit=0?4.At last, assign the labels for raw targets and append them to the set Lb,k.For completeness,we summarize the key steps of track management in Algorithm 1. Algorithm 1.Pseudocode for track management. Input: Ak,winit,{wk-1,?i}Mk-1 i=1 Output: Ls,k, Lb,k 1.Initialize: Ls,k =Lb,k ={}; n=0; convert Ak to association map θk(?);2.for i=1,2,???,Mk-1 do 3.if wk-1,?i =1 then /*mature targets */4.append ?i to Ls,k;5.if θk(?i)>0 then 6.wk,?i ←1 7.else 8.decrease survival weight;9.end 10.end 11.if 0 Fig.10 Dynamic change of survival weight in track management. Algorithm 2.Pseudocode for update survivals. Input: Ls,k,{S(?i)z,k-1}Mk-1 i=1 Output: Ds,k 1.for ? in Ls,k do 2.if θk(?)>0 then 3.append zk,θk(?) to S(?)z,k-1, resulting in S(?)z,k;4.if len(S(?)z,k)>10 then 5.^xk,? ←update net(S(?)z,k) /*use update network to update */6.else 7.^xk,? ←fx×xk|1:k-1,?+fz×zk,θk(?)8.end 9.end 10.if θk(?)=0 then /* regard prediction as optimal estimation */11.^xk,? ←xk|1:k-1,?12.end 13.append ^xk,? to S(?)x,k-1, resulting in S(?)x,k;14.ds,k ←(S(?)x,k,S(?)z,k,wk,?)?Ds,k 15.end 16.return In parallel with the above procedure, for each target from Lb,k, a target tuple is assigned as db,k=(zk,θk(?),zk,θk(?),winit)?Db,k.Therefore, the target tuple set at the kthtime step is the union of births and survivals,i.e., Dk=Ds,k∪Db,k.Finally, the target state set X^kis extracted from Dk.Set the minimum weight wmin=0?8 and then select those targets whose weights are greater than wminas estimated targets. The main innovation of our work is to use two DLSTM networks to predict and update independently.The former learns the movement trend from the historical state sequence to estimate the target state at the next moment, which is related to the accuracy of data association.The latter learns the noise characteristics from the historical measurement sequence to estimate the real state of a target, which determines the final tracking accuracy.In order to make the network have the desired function, offline pre-training is an important part.Finding optimal parameters for deep architectures still remains a harsh challenge.In this section,we will point out some of the most important parameters and training details. Deep learning methods, owning appropriate network structures predesigned, need to extract characteristic information from massive training data.Different kinds of target motions of interest need to be sufficiently included to ensure that the network can deal with tracking of targets with different motions.In this work, we mainly consider CV and CT targets over the surveillance area [-2000,2000] m×[-2000,2000] m.The generation rules of states and measurements of targets are shown in Table 1. Following the above rules and dynamic, we generate 256 tracks and add Gaussian noise with different covariances to each track to generate measurements.Then, the training data is encapsulated as the input–output pairs with a time order,i.e., where {Ip,k,Op,k} and {Iu,k,Ou,k} are the pair-wise sequence data used for training the prediction and update networks,respectively.Note thatoriginated from the ithtrack refers to the measurements during the previous 20 moments up to time k.It can be seen that the structure of the training data is consistent with the network structure shown in Figs.6 and 9.Further,Eqs.(40)and(41)can be understood as dividing a long track into several short tracks.Hence,the prediction and update networks are trained by 23040 (90×256) and 20736 (81×256) tracks, respectively, which is enough for the tracking requirement of this work. Table 1 Training data generation rules. The ultimate goal of training the neural network is to optimize its parameters to minimize the network’s loss function.We expect the predicted and updated states to be as close to the true states as possible.Therefore, the Mean-Squared Error(MSE)is here chosen as the loss function to measure the difference between the truth and output sequences.A lower MSE implies a better prediction.Additionally, to speed up training,the training data set is divided into several batches randomly,and the parameters are updated based on the partial evaluation of the loss function in each batch. The trained DLSTM network can be modelled by a deterministic function FΘ(?), where Θ={θ1,θ2,???,θM}, θiis the trainable parameter shared by all LSTM units, and M is the total number of applied LSTM units.The prediction process of networks can be described as ^O=FΘ(I).Let Oiand ^Oibe the truth and estimated state sets of the ithtrack in a batch,respectively.The training objective function is defined as where N is the number of targets contained in a batch.Oiand Iiare known and offered by training data, given by Eqs.(40) and(41).Therefore,the loss is a function of the parameter set Θ.The purpose of the training network is to adjust parameters for many times to minimize the loss function.As mentioned in Section 3.3.1, the DLSTM network has a many-to-many structure,while the prediction block has a many-to-one structure.The loss function, as shown in Eq.(42), contains the MSE between the output of all time steps and the ground truth.We intend to make the output of each time step as close to the true value as possible by minimizing the overall loss.This can be understood as training multiple many-to-one networks at the same time. To obtain the neural network’s optimal parameter set Θ,Adam optimizer is adopted in this work, which is a stochastic gradient-based optimization algorithm.It differs from the traditional Stochastic Gradient Descent (SGD) algorithm in that it assigns adaptive individual learning rates to each neural network parameter separately and updates them based on the estimates of the first and second moments of the gradients.31In training, a learning rate is the step of adjusting parameters,which determines how quickly the parameters change.Note that the disadvantages of using a fixed learning rate are that a too small one will lead to slow convergence and local optimum, while a too large one will lead to failure to converge.However, when getting closer to the global minimum of the loss function,we expect that the learning rate becomes smaller to make a model parameter as close to the optimum as possible.Therefore, to train the network more effectively, we specify the initial learning rate lr=0?0005 for Adam optimizer and use the cosine annealing algorithm to dynamically change it. In a training epoch, all pairs of training sequences are utilized to train the networks.In order to optimize the model parameters, we intend to train the prediction and update networks for 200 epochs, which is enough to reach convergence.The changing curve of the learning rate is illustrated in Fig.11 using the cosine annealing algorithm.It can be seen that in the later stage of training, the learning rate changes slowly and gradually approaches 0, which can help to reach the minimum loss location. In the network construction of the proposed MTIT-DLSTM algorithm,the DLSTM network for state prediction is trained with two layers and 64 hidden units.Compared to the prediction,the state estimation is a more formidable work,requiring the update network to have more representation power.To that end, the DLSTM network employed to update consists of two layers and 128 hidden units.In this work, all experiments are implemented in Python3.8 with the help of PyTorch,40using an NVIDIA GeForce RTX 3090 GPU to accelerate.The losses of the two networks during training are shown in Figs.12(a) and (b), respectively.It can be seen that the losses of the prediction and update networks tend to be stable with slight fluctuation in the later epochs, meaning that the networks have converged. Note that the prediction and update functions of the networks are derived from the training data.As long as the motions of a target are included in the training set,the prediction network can maintain the movement trend.Besides, as long as the measurement noise obeys the Gaussian distribution with its covariance in a consistent range with the training set,the update network can denoise from measurements. In this section, in order to verify the tracking performance of the proposed MTIT-DLSTM algorithm, two complex multitarget tracking scenarios, i.e., multiple CV and CT target tracking, are considered, which involve clutter, birth, and death of targets as well as linear and non-linear motions.In addition, we compare our algorithm with popular RFSbased filters, such as the PHD and LMB filters, whose MATLAB code is available from https://ba-tuong.vo-au.com/codes.html.In the following Sections 5.2 and 5.3, some experimental results are illustrated and analyzed. Fig.11 Learning rate versus epoch. Fig.12 Losses of the two networks in training. In all simulations,the survival probability pSand the detection probability pDare set to 0.99 and 0.98, respectively.A sensor returns position measurements, and the standard deviation of the measurement noise is set to σx=σy=10 m.The tracking period and sampling interval are set to T=100 s and Ts=1 s,respectively.The clutter measurements are uniformly distributed over the observation space and Poisson RFS with the intensity denoted by where u(?) denotes a uniform density over the observation region, V is calculated by V=∫u(z)dz, and λcis the mean value of Poisson distribution and set to λc=10 per scan in simulation. In this work, the OSPA distance is used as the metric to evaluate the performance in multi-target tracking.Considering two arbitrary sets X={x1,x2,???,xm} and Y={y1,y2,???,yn},where m,n ?N, the OSPA distance with order 1 ≤p<∞and cut-off c>0 between them is defined by41. where Πnis the set of permutations on {1,2,???,n}, d(c)(x,y)represents the distance between x and y,cut off c,which is calculated as d(c)(x,y)=min(c,d(x,y)).In the simulation, set p=1 and c=100.Further,the OSPA distance can be divided into a location error and a cardinality error, i.e., Considering a two-dimensional surveillance region[-1000,1000] m×[-1000,1000] m, all targets with random birth and death times travel in straight lines and with different but constant velocities,as shown in Table 2.Fig.13 shows the true trajectories represented by black lines and estimated trajectories represented by other colorful lines for a single run..Different trajectories are distinguished by different colors.From Fig.13, we observe that the proposed MTIT-DLSTM algorithm can successfully track all of the targets with no track label switching, leading to an excellent single run result.However,track label switching occurs occasionally due to occlusion or false alarms, resulting in a trajectory with several colors.Note that the proposed algorithm can re-initialize a lost track at any position to ensure the continuity of tracking. Results are shown in Fig.14, where Figs.14(a)–(d) exhibit the cardinality estimation, OSPA distance, OSPA cardinality error, and OSPA location error of all algorithms for multiple CV target tracking, averaged over 100 Monte Carlo (MC)runs,respectively.According to Figs.14(a)and(c),in the part of estimating the number of targets, it is clear that the performance of our MTIT-DLSTM algorithm is close to that of the LMB filter and obviously better than that of the PHD filter except for the moments of birth and death.Additionally, the MTIT-DLSTM algorithm shows a larger cardinality error near targets appearing and dying, since adaptive birth and death require several consecutive moments to judge.However,the PHD and LMB filters already use the first measurement to update the static birth distributions relying on the priori.Note that the birth positions of targets in the simulation are fixed and known for the compared algorithms, while our algorithm can track targets appearing at arbitrary positions without any priori about targets and environment. From Fig.14(b), it can be seen that our algorithm outperforms the other algorithms in terms of the overall OSPA distance.This is due to a significant drop in the OSPA location error, as shown in Fig.14(d), attributed to the use of the prediction and update networks.The prediction network replaces the motion model to transfer the target state,while the updatenetwork replaces Kalman filter to estimate the true state,which determines the final OSPA location error.In algorithms(e.g., PHD and LMB) based on Kalman filter, the covariance reflects the reliability of the estimated state, and its convergence speed is affected by the model and noise.Different from those algorithms, the update network focuses on data and intends to remove noise from measurements to restore the real state, which clearly brings better tracking results.During the early times (0 Table 2 Notations of important variables. Fig.13 Ground truth and estimated trajectories of CV targets. From Figs.14(c)and(d), we observe that the OSPA cardinality error increases and the corresponding location error decreases near times when a target is born(k=50 s)and a target has died(k=90 s).This phenomenon can be explained by Eqs.(45)and(46),and Eq.(45)can be further transformed towhere cost is the minimum Euclidean distance of X and Y.When targets appear, the OSPA location error drop is caused by an increase of max(m,n);when targets die,the OSPA location error drop is due to a decrease of cost. Table 3 Runtime comparison between all algorithms. Finally, the average single runtimes of all algorithms are shown in Table 3.The runtime of our MTIT-DLSTM algorithm is equivalent to that of the PHD filter, and faster than that of the LMB filter.The result implies that the proposed MTIT-DLSTM algorithm can satisfy the requirement for real-time processing in multi-target tracking. In this section, to verify that the proposed MTIT-DLSTM algorithm is able to handle motion uncertainty with the same trained networks, tracking multiple targets moving in a Constant Turn (CT) rate is simulated.In this scenario, there are 5 targets appearing in a two-dimensional surveillance region[-2000,2000] m×[0,2000] m with various birth and death times,as shown in Table 4.It’s worth noting that the proposed algorithm is free of model and data-driven, so there is no difference between linear and nonlinear scenarios.As mentioned in Section 2.1,both state and measurement vectors are defined as[x,y]T,which means that the proposed algorithm focuses on location information.Hence, no additional processing is required for the nonlinear scenario in this section. Fig.14 Simulation results for multiple CV target tracking. Table 4 Notations of important variables. Fig.15 Ground truth and estimated trajectories of CT targets. Results are shown in Figs.15-16,where Fig.15 exhibits the true trajectories represented by black lines and estimated trajectories represented by other colorful lines for a single run;Figs.16(a)-(d) shows the cardinality estimation, OSPA distance, OSPA cardinality error, and OSPA location error of all algorithms for multiple CT target tracking, averaged over 100 MC runs,respectively.The results are broadly in line with those of multiple CT target tracking in Section 5.2.The fluctuation of the OSPA distance results from the cardinality error.The peaks in Figs.16(b) and (c) represent the emergence or extinction of targets.As long as the number of targets remains constant, the proposed MTIT-DLSTM algorithm performs perfectly, referring to that the OSPA cardinality error is close to 0 and the OSPA location error is lower.It implies that our MTIT-DLSTM algorithm has better application prospects. Fig.17 Ground truth and estimated trajectory of a maneuvering target. In this section, to verify that the proposed MTIT-DLSTM algorithm is able to track a maneuvering target, we design a simulation scenario including a target that moves straight at first and then turns at k=40 s. Fig.16 Simulation results for multiple CT target tracking. Fig.18 Simulation results for single maneuvering target tracking. Results are shown in Figs.17-18,where Fig.17 exhibits the true trajectory represented by the black line and estimated trajectory represented by blue lines for a single run, and Figs.18(a)-(d) shows the cardinality estimation, OSPA distance, OSPA cardinality error, and OSPA location error for single maneuvering target tracking, averaged over 100 MC runs,respectively.The dotted line in Fig.17 indicates the process of track initiation.From Figs.18(a)-(d), we can conclude that the proposed algorithm can accurately estimate the number of targets due to track management strategy and target states.Note that Section 3.3 mentioned that the input sequence length of the prediction network ranges from 2 to 10.Therefore, the prediction network uses historical information from the past 10 time steps at most to predict.In this section, when the target undergoes the maneuver, the prediction network also only adopts historical information from the past 10 time steps instead of especially using long history information in prediction due to the target maneuver. In the above three simulation experiments, codes using the MTIT-DLSTM algorithm to track are the same except for measurement data, which illustrates that the prediction network can flexibly adapt to various target motions.There are two advantages of using the prediction network for state transition:one is to replace the motion model and learn the motion trend from historical states; the other is that the prediction state of the next time is not uniquely determined by the previous time,but jointly determined by historical states,which can successfully deal with missed detection and wrong association at a certain time. In our work, an intelligent tracking algorithm based on two DLSTM networks, called the MTIT-DLSTM algorithm, is proposed to address fixed motion and measurement models’problem in multi-target tracking,.We adopt LSTM-based prediction and update networks with deep structures to estimate target states from noisy measurements in a complex environment.Experimental results prove that our algorithm is able to estimate the time-varying number of targets and their tracks from measurement sets in the presences of data association uncertainty, detection uncertainty, noise, and false alarms.Moreover, in terms of robustness and accuracy, our proposed algorithm outperforms the popular PHD and LMB filters on behalf of classical algorithms.The following conclusions can be extracted from this work: (1) Deep learning provides a novel solution to multi-target tracking and gets rid of the dependence on the model and priors.Combined with the advantages of the DLSTM network,we predict the state at the next time based on the historical state sequence and measurement sequence, which can be tailored to scenario complexity. (2) Applying Adam optimization and a cosine annealing algorithm to efficiently minimize the loss makes the parameters of the model optimal. (3) Hungarian algorithm and heuristic track management are used to assign measurements to targets and adapt births and deaths. (4) The limitation of this work is that the update network only works in a rectangular coordinate system.This is due to the fact that the noise converted from a polar coordinate to a rectangular coordinate will no longer follow the Gaussian distribution.This is worthy of future research. Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgements This work was supported by the National Natural Science Foundation of China (No.62276204), Open Foundation of Science and Technology on Electronic Information Control Laboratory, Natural Science Basic Research Program of Shanxi, China (Nos.2022JM-340 and 2023-JC-QN-0710),and China Postdoctoral Science Foundation (Nos.2020T130494 and 2018M633470).3.6.Update survivals and assign births
4.Training of prediction and update networks
4.1.Generation of training data
4.2.Adam optimizer for training
4.3.Training of networks
5.Simulation results
5.1.Parameter setting and evaluation metric
5.2.Multiple CV target tracking
5.3.Multiple CT target tracking
5.4.Single maneuvering target tracking
6.Conclusions
CHINESE JOURNAL OF AERONAUTICS2023年9期