
      Template-guided frequency attention and adaptive cross-entropy loss for UAV visual tracking

Chinese Journal of Aeronautics, 2023, Issue 9

      Yuanliang XUE, Guodong JIN, Tao SHEN, Lining TAN, Lianfeng WANG

      School of Nuclear Engineering, PLA Rocket Force University of Engineering, Xi’an 710025, China


Abstract This paper addresses the problem of visual object tracking for Unmanned Aerial Vehicles (UAVs). Most Siamese trackers regard object tracking as classification and regression problems. However, it is difficult for these trackers to classify accurately in the face of similar objects, background clutters, and other challenges common in UAV scenes, so a reliable classifier is the key to improving UAV tracking performance. In this paper, a simple yet efficient tracker following the basic architecture of the Siamese neural network is proposed, which improves the classification ability in three stages. First, a frequency channel attention module is introduced to enhance the target features via frequency domain learning. Second, a template-guided attention module is designed to promote information exchange between the template branch and the search branch, which yields reliable classification response maps. Third, an adaptive cross-entropy loss is proposed to make the tracker focus on hard samples that contribute more to the training process, solving the imbalance between positive and negative samples. To evaluate the performance of the proposed tracker, comprehensive experiments are conducted on two challenging aerial datasets, UAV123 and UAVDT. Experimental results demonstrate that the proposed tracker achieves favorable tracking performance on aerial benchmarks at over 41 frames/s. We also conducted experiments in real UAV scenes to further verify the efficiency of our tracker in the real world.

      1.Introduction

With the rapid development of Unmanned Aerial Vehicles (UAVs) and their aerial photography equipment, UAVs with visual tracking capability are widely used in traffic patrolling, wildlife protection, disaster response, and military reconnaissance thanks to their flexible motion, low cost, and high safety.1 Because of its important role in video intelligence processing, object tracking technology receives extensive attention.2 Despite the demonstrated success, it remains a challenge to design a tracker that is robust to various UAV scenes such as small objects, similar distractors, background clutters, scale variation, and frequent occlusions. Therefore, an accurate and robust tracker is of great value for the wide application of UAVs.

Correlation Filter (CF) trackers achieve acceptable performance at high speed thanks to hand-crafted features and the fast Fourier transform. MOSSE3 (Minimum Output Sum of Squared Error) is first proposed to calculate the similarity in the frequency domain, which greatly improves the tracking speed. Henriques et al.4 design a Kernelized Correlation Filter (KCF), incorporating multiple feature channels and adopting circular-shift sampling to enhance target representation. Based on Refs. 3–4, a series of extensions5–8 show state-of-the-art performance.

Recently, deep features provided by Convolutional Neural Networks (CNNs) have demonstrated powerful object characterization capabilities and have gradually replaced traditional hand-crafted features in computer vision tasks such as object detection and tracking. HCF9 (Hierarchical Convolutional Feature tracker), C-COT10 (Continuous Convolution Operators Tracker), and DeepSRDCF11 (Spatially Regularized Discriminative Correlation Filter with Deep features) all make preliminary explorations of combining CF trackers with existing CNNs, complementarily improving performance. Tao et al.12 propose a novel Siamese tracker called SINT, the first tracker to apply Siamese architectures. SiamFC13 (Fully-Convolutional Siamese network) and SiamRPN14 (Siamese Region Proposal Network) improve tracking speed and accuracy by introducing a new similarity calculation and the Region Proposal Network (RPN),15 respectively. However, the feature extraction capability of AlexNet16 used in Refs. 13–14 is not powerful enough for complex challenges. SiamRPN++17 achieves leading performance on several benchmarks by employing ResNet-50.18 Since SiamRPN++ delivers high tracking performance with a simple structure, plenty of Siamese trackers built on it, including works,19–22 have shown outstanding performance.

Numerous trackers have been proposed for UAV tracking in recent years. Li et al.23 develop a new CF-based tracker for UAVs, called AutoTrack (Tracking with Automatic regularization), which can dynamically and automatically adjust the hyperparameters of the spatio-temporal regularization terms. Yang et al.2 train a pruned classifier via least-squares transformation in the spatial and Fourier domains, achieving real-time tracking but suboptimal performance. SiamAPN24 (Siamese Anchor Proposal Network) adopts the no-prior structure of the Anchor Proposal Network (APN) to endow the anchors with adaptivity, enhancing the object perception ability. By virtue of attention modules, SiamAPN++25 further raises the expression ability of features. CIFT26 (Contextual Information Fusion Tracker) inserts different attentions into the template branch and the search branch, improving the tracker's classification ability. Coincidentally, Cui et al.27 introduce spatial attention and channel attention into the two branches to enhance target features and suppress distractor information. The above literature mainly focuses on promoting classification ability by designing constraint terms or adding attention modules. However, these trackers ignore the importance of the template in finding the target and overlook the imbalance of training samples.

This paper proposes a template-guided frequency attention tracker (referred to as TGFAT) and introduces an adaptive cross-entropy loss for training a high-performance classifier.

      The main contributions of this work can be summarized as follows:

      (1) We insert a Frequency Channel Attention (FCA) module into the backbone to filter the features, which can suppress distractor background information via frequency domain learning.

      (2) To enhance the tracker classification capability, we design the Template-Guided Attention (TGA) module between the template branch and search branch, thus utilizing the given template features to guide the generation of search features.

(3) We find that positive samples are far fewer than negative samples in UAV scenes, leading to inefficient training of the Siamese tracker. To eliminate the class imbalance between positive and negative examples, we design the Adaptive Cross-Entropy (ACE) loss by introducing a hyperparameter.

(4) We present a real-time tracker that achieves outstanding tracking performance on several challenging aerial benchmark datasets. In addition, real-world tests strongly demonstrate its impressive practicability and performance at a real-time speed of ~41.3 FPS (Frames Per Second).

      2.Related work

Visual tracking is a popular research topic and plays an eminently important role in many fields. An exhaustive survey of object tracking is provided in Ref. 28; the following is a compendious review of the most representative Siamese trackers, as well as related issues on attention modules and loss functions.

      2.1.Trackers with Siamese architecture

Siamese architecture-based trackers formulate visual object tracking as similarity embedding between a target template and a search region. SINT12 (Siamese Instance Search Tracker) creatively applies Siamese architectures to find the target in the search region based on the highest similarity response. Though SINT performs well, its speed is only 4 FPS. Therefore, SiamFC13 calculates the similarity between the whole candidate image and the template at one time by cross-correlation (Xcorr), improving the speed to 86 FPS.

The emergence of SiamFC has shifted object tracking from correlation filters to Siamese architectures. CFNet29 (Correlation Filter Network) integrates the correlation filter and SiamFC, giving CF a stronger representation ability. SA-Siam30 (Semantic and Appearance twofold branch Siamese network) adds a semantic branch to the original appearance branch, complementarily improving the robustness of SiamFC. SiamRPN14 employs the RPN module to treat object tracking as local object detection and to realize classification and regression, outperforming most trackers and running at 160 FPS. Based on SiamRPN, numerous works have been done for further improvement. SiamMask21 (Siamese Mask tracker) finds some consistency between object tracking and semi-supervised object segmentation; therefore, SiamMask augments the anti-interference ability of SiamFC with binary segmentation masks. SiamRPN++17 is the first tracker to take advantage of deeper features and successfully overcome the negative effects of training with ResNet-50. Besides, SiamDW31 (Deeper and wider Siamese networks) stacks cropping-inside residual modules to deepen the network while also preventing the learning of positional bias. Moreover, anchor-free trackers19,22,32–33 show outstanding advantages in scale adaptation and generalizability owing to pixel-by-pixel prediction.

      2.2.Attention modules in Siamese trackers

Attention modules, such as SE34 (Squeeze-and-Excitation), CBAM35 (Convolutional Block Attention Module), and ECA36 (Efficient Channel Attention), have been demonstrated to offer great potential in improving the performance of deep CNNs. Therefore, plenty of trackers explore the application of attention modules in object tracking, achieving high performance.

SA-Siam30 inserts channel attention into the semantic branch to highly activate the channels where the target is located, suppressing distractor semantic information. SiamBM37 (Siamese Better Match) finds that significant background clutter easily occurs when the aspect ratio of the target is large, and that applying a spatial mask on the feature map suppresses background more strongly and stably than channel attention. RASNet38 (Residual Attentional Siamese Network) superimposes the residual attention module and the general attention module on the template branch to learn the common characteristics and differences of the targets, and integrates the channel attention module to adapt to appearance changes. SiamAPN++25 aggregates and models the self-semantic interdependencies and the cross-interdependencies with two attentional aggregation networks. CIFT26 establishes the long-range relationship using an attention information fusion module and learns the appearance features of the detection frame with a multi-spectral attention module. DAFSiamRPN39 (Distance attention fusion SiamRPN) employs a convolutional attention module and a frequency channel attention module, respectively, to optimize spatial semantic information and channel feature information.

Much work has been done to apply attention modules to Siamese trackers,40–43 achieving satisfactory performance. However, the above literature all adopt self-contained attention modules, meaning that the attention weights are learned only from their own features. Our template-guided attention module learns attention from the template features and uses the learned attention weights to enhance target features in the search features, promoting cross-branch information exchange.

      2.3.Loss functions for Siamese trackers

Following SiamRPN, Siamese trackers regard object tracking as classification and regression problems. Thus, a classification loss function and a regression loss function are employed in the training of these Siamese trackers. The classification loss, such as cross-entropy loss, is used to help trackers learn to distinguish the target from the background, and the regression loss, including IoU (Intersection over Union) loss,44 is adopted to train the tracker's ability to accurately locate the target based on the optimal classification results.

Some work focuses on improving the above loss functions to find a suitable loss for Siamese trackers. Focal loss45 is adopted in SiamFC++32 to alleviate the overfitting problem of cross-entropy loss. However, while focal loss reduces the loss of easy samples, the loss of hard samples is also punished. Thus an adaptive focal loss40 for the Siamese tracker is proposed to adaptively punish easy samples without reducing the loss of hard samples. Considering that IoU loss only works well when the bounding boxes overlap and provides no gradient for non-overlapping cases, LWTransTracker46 (Layer-Wise Transformer Tracker) replaces IoU loss with C-IoU (Complete-IoU) loss for faster convergence and better regression accuracy. Beyond the distance between the target and the bounding box, Refs. 26,39 further consider the overlap rate and the scale variation as factors by employing D-IoU (Distance-IoU) loss in regression. In this paper, we find that data imbalance easily occurs in UAV object tracking, and neither cross-entropy loss nor focal loss can solve this problem effectively. Therefore, we redesign the cross-entropy loss with only a single hyperparameter and achieve better results.

      3.Proposed method

In this section, we describe the proposed TGFAT framework in detail. Following the basic architecture of the Siamese neural network, TGFAT maintains the two-branch design strategy. The frequency channel attention module in the backbone network adaptively suppresses distractor information and enhances the target features via the Two-Dimensional Discrete Cosine Transform (2D DCT). The template-guided attention module collects template feature information to efficiently guide the search features, strengthening the classification ability with negligible computational effort. Then the adaptive cross-entropy loss is introduced to adaptively punish easy samples and reduce meaningless training, solving the data imbalance in the training process. The overall pipeline of our proposed tracker is depicted in Fig. 1. We choose ResNet-50 as the backbone, into which the Frequency Channel Attention (FCA) module is inserted; the Discrete Cosine Transform (DCT) is performed in the FCA module. Then, the template-guided attention (TGA) module is integrated between the template branch and the search branch, and the ACE loss serves as the classification loss. Finally, similarity response maps are calculated in the RPN modules. The target position is located according to the maximum classification response, and then bounding box regression is performed at the predicted position.

      3.1.Architecture of Siamese trackers

As shown in Eq. (1), tracking is modeled as a similarity learning problem by introducing two identical CNNs with shared parameters. Hence, Siamese trackers contain two branches and RPN modules. The two branches, consisting of a template branch and a search branch, are utilized to collect the template features and the search features. One branch processes the template of the tracking target, which is initialized by the manually labeled target area in the first frame of the video sequence and is generally represented by z. The other branch processes the search area of the video sequence. The search area for each frame is selected based on the location of the tracking result in the previous frame; that is, taking the coordinates of this location as the center, a fixed-size area is cropped out as the search area, generally represented by x. After being processed by the backbone network, z and x yield the feature maps φ(z) and φ(x), respectively. Since φ(z) is smaller than φ(x), φ(z) is adopted as a sliding window that slides over φ(x). The whole process is similar to a convolution kernel sliding over an image in a CNN, starting from the top-left and ending at the bottom-right. After the sliding-window operation is completed, the final response map, represented by f(z, x), indicates the similarity between z and x.

      Fig.1 Pipeline of TGFAT.

f(z, x) = φ(z) * φ(x) + bI (1)

where φ(·) is the backbone network, * denotes cross-correlation, b ∈ R is the bias of the convolution layer, and I is the identity matrix, so bI denotes a signal that takes value b at every location.
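
For concreteness, the sliding-window similarity of Eq. (1) can be sketched in PyTorch as a grouped convolution that treats φ(z) as a kernel sliding over φ(x). This is a minimal illustration rather than the paper's implementation; the function name, feature sizes, and the scalar bias handling are assumptions.

```python
import torch
import torch.nn.functional as F

def xcorr(template_feat, search_feat, bias=0.0):
    """Minimal sliding-window cross-correlation between phi(z) and phi(x).

    template_feat: (B, C, Hz, Wz) features of the template z
    search_feat:   (B, C, Hx, Wx) features of the search region x (Hx > Hz)
    Returns a (B, 1, Hx-Hz+1, Wx-Wz+1) response map f(z, x).
    """
    b, c, hz, wz = template_feat.shape
    # Treat each template channel as a convolution kernel and use grouped
    # convolution so sample i is only correlated with its own template i.
    kernel = template_feat.reshape(b * c, 1, hz, wz)
    x = search_feat.reshape(1, b * c, *search_feat.shape[2:])
    resp = F.conv2d(x, kernel, groups=b * c)                  # (1, B*C, H', W')
    resp = resp.reshape(b, c, *resp.shape[2:]).sum(1, keepdim=True)
    return resp + bias                                        # bias term bI of Eq. (1)

# Toy usage: 127x127 template and 255x255 search crop, stride-8 backbone.
z_feat = torch.randn(1, 256, 15, 15)
x_feat = torch.randn(1, 256, 31, 31)
print(xcorr(z_feat, x_feat).shape)                            # torch.Size([1, 1, 17, 17])
```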

      3.2.Frequency channel attention module

Compared with ground scenes, the works23–25 indicate that UAV scenes contain more abundant and complex background information, which easily misleads trackers into tracking the wrong target. Hence, we verify the difference between UAV scenes and ground scenes from the perspective of frequency-domain processing. As shown in Fig. 2, three groups of representative tracking scenes, covering two types of common targets (vehicles and people), are selected for DCT (Discrete Cosine Transform) processing. In the DCT, |F(ω)| is the amplitude and ω is the frequency. The results in Fig. 2 show that videos captured from UAVs have richer high-frequency components than those captured from the ground. Fourier analysis theory assumes that noise is mostly composed of high-frequency components, which indeed proves that UAV scenes contain more noise information. Therefore, filtering the feature information extracted from the backbone is key. Some trackers40–41 insert traditional spatial and channel attention modules into the backbone, lacking effective processing of high-frequency components. So this paper introduces the FCA (Frequency Channel Attention) module proposed in FcaNet,47 fully considering the suppression of noise information.
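
As a rough illustration of the spectrum analysis behind Fig. 2, the following sketch computes the 2D DCT amplitude |F(ω)| of a grayscale frame and measures how much energy lies outside the low-frequency corner; the cutoff choice and the function name are ours, not the paper's.

```python
import numpy as np
from scipy.fft import dctn   # 2D type-II DCT used for the amplitude spectrum

def high_freq_ratio(gray_image, cutoff=0.25):
    """Fraction of DCT amplitude lying outside the lowest-frequency corner.

    gray_image: 2D float array in [0, 1]
    cutoff: fraction of each axis treated as 'low frequency' (top-left block)
    """
    amp = np.abs(dctn(gray_image, norm='ortho'))   # |F(w)| amplitude spectrum
    h, w = amp.shape
    low = amp[: int(h * cutoff), : int(w * cutoff)].sum()
    return 1.0 - low / amp.sum()

# A UAV frame is expected to give a larger ratio than a ground-level frame,
# i.e. a richer share of high-frequency (noise-like) content.
rng = np.random.default_rng(0)
frame = rng.random((256, 256))
print(f"high-frequency energy share: {high_freq_ratio(frame):.3f}")
```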

FcaNet47 points out that using Global Average Pooling (GAP) in channel attention preserves only the lowest-frequency information, and all components from other frequencies are discarded. FcaNet further proves that GAP is a special case of the 2D DCT, so the FCA module is proposed to generalize GAP to more frequency components of the 2D DCT.

      Second, the cosine basis function of 7 × 7 2D DCT is defined as

      where [ui,vi] are the frequency component 2D indices corresponding to Xi.

Fig. 4 shows the whole set of cosine basis functions. Note that the 2D indices of the top-left block are ui = 0, vi = 0, and those of the other blocks can be obtained in the same way. To enhance the learning of frequency components, each group is assigned a corresponding 2D DCT frequency component. Then the 2D DCT is performed on each group to obtain the compression result Freqi, which can be viewed as the representation vector of that group. The compression process can be written as

As shown in Eq. (4), the whole representation vector Freq of the input is obtained by concatenation, and Freq is the multi-spectral vector that contains the importance of different frequency components.

      Next, FCA weight is collected through Eq.(5), which assigns different weights for different frequency components according to their importance.

      Fig.2 Spectrum analysis of ground and UAV images.

      Fig.3 Illustration of FCA module.

      Fig.4 Visualization of 7 × 7 DCT basis functions.

      where δ is the sigmoid activation function, and fc is the fully connected layer.

To ensure that each position on the input feature map X has a corresponding weight, the attention weights are expanded to the same size as X. Finally, the attention feature map X′ is generated by

which suppresses the unimportant frequency components. The FCA module decomposes the input features into combinations of different frequency components by the split operation and adjusts the proportion of each channel via the DCT. FCA can thus obtain better frequency-domain energy concentration and condense the relatively important information in the image, paying more attention to the tracking target.
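
A minimal sketch of an FcaNet-style frequency channel attention following the steps described above: channels are split into groups, each group is compressed by a fixed 2D DCT basis, the results are concatenated into Freq, and FC plus sigmoid layers produce per-channel weights. The chosen frequency indices, group count, reduction ratio, and the 7 × 7 feature size are illustrative assumptions, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn

def dct_basis(u, v, h, w):
    """Cosine basis of the standard 2D DCT-II on an h x w grid."""
    ys = torch.arange(h).float()
    xs = torch.arange(w).float()
    by = torch.cos(math.pi * (ys + 0.5) * u / h)
    bx = torch.cos(math.pi * (xs + 0.5) * v / w)
    return by[:, None] * bx[None, :]                       # (h, w)

class FCAModule(nn.Module):
    """Sketch of frequency channel attention with n frequency groups."""
    def __init__(self, channels, h=7, w=7,
                 freq_idx=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
        super().__init__()
        assert channels % len(freq_idx) == 0
        basis = torch.stack([dct_basis(u, v, h, w) for u, v in freq_idx])  # (n, h, w)
        self.register_buffer("basis", basis)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, C, H, W), H = W = 7 here
        b, c, h, w = x.shape
        n = self.basis.shape[0]
        groups = x.reshape(b, n, c // n, h, w)
        # per-group 2D DCT compression Freq_i with the group's assigned frequency
        freq = (groups * self.basis[None, :, None]).sum(dim=(-1, -2))       # (B, n, C/n)
        freq = freq.reshape(b, c)                          # concatenation into Freq, Eq. (4)
        weight = self.fc(freq).reshape(b, c, 1, 1)         # FC + sigmoid weights, Eq. (5)
        return x * weight                                  # reweight (suppress) channels

feat = torch.randn(2, 64, 7, 7)
print(FCAModule(64)(feat).shape)                           # torch.Size([2, 64, 7, 7])
```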

      3.3.Template-guided attention module

Since the template used in Siamese trackers is usually fixed and not updated, effectively leveraging the template features is key. Most Siamese trackers26,38–42 choose to independently enhance the representation ability of the feature information in the search branch and the template branch, ignoring the great potential of template features in guiding the generation of search features. UAVs have a wide field of view, so similar-object distractors, background clutters, and fast motion are more likely to occur during tracking. As a result, the tracking object's features are not obvious in the search features, and distractor features also appear around the tracking object, hindering the tracker from distinguishing it.

Following Ref. 48, and aiming to explore the great potential of template features, we propose the Template-Guided Attention (TGA) module. As shown in Fig. 5, the overall process is that the template attention weights are collected from the template features, and then the target features are strengthened on the search image under the guidance of these weights. Given the template feature φ(z), we employ Efficient Channel Attention (ECA)36 to collect the template feature information. Firstly, Eq. (7) uses GAP to compress the template feature information on each channel and generate global spatial representations of size C × 1 × 1. Each of them squeezes the spatial dimension from H × W to 1 × 1. The process is formulated as

where φ(z) is the input template feature map with C channels; uc(i, j) represents a single-channel feature map, c ∈ C; and (i, j) is the position coordinate.

where σ is the sigmoid activation function and C1Dk is the 1D-Conv with kernel size k. Note that our parameter k does not need to vary with the number of channels, which differs from the ECA module.

Secondly, as shown in Eq. (8), a One-Dimensional sparse Convolution (1D-Conv) is introduced to learn the relationships between the current channel and its adjacent channels, instead of the 2 FC (Fully Connected) layers used in SE.34 The attention weight of each channel is then predicted according to the object classes. Considering that the global spatial representations describe each channel independently, the template attention weights are further obtained by the 1D-Conv. The 1D-Conv involves few parameters and is a simple yet efficient way to learn local cross-channel information interaction, whose coverage equals the kernel size.

The feature maps are nearly orthogonal, and different channels represent different object classes,17 so it is not necessary to treat all channels equally. The purpose of single object tracking is to track a specific target, while everything else is background and distractors. Channel attention can help trackers concentrate on 'what' is useful for tracking an object.49 By reducing the attention weights on unimportant channels, the ECA module enhances the model's learning ability on template features.

Thirdly, to ensure that each position in the input feature map φ(x) has a corresponding weight, the attention weights ω are expanded to the same dimension as φ(x). Then the search attention features φ′(x) are obtained by

which embeds the template feature relationships into the search features and selectively suppresses the channels where distractor targets are located.
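
The TGA pipeline described above (GAP on the template features, a 1D convolution with fixed kernel size k over channels, a sigmoid, and broadcasting the resulting weights onto the search features) can be sketched as follows; the class name and feature sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TGAModule(nn.Module):
    """Sketch: channel attention learned from the template guides the search features."""
    def __init__(self, k=3):
        super().__init__()
        # ECA-style 1D conv over the channel dimension, fixed kernel size k
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, z_feat, x_feat):
        # z_feat: template features phi(z), (B, C, Hz, Wz)
        # x_feat: search features  phi(x), (B, C, Hx, Wx)
        b, c, _, _ = z_feat.shape
        s = z_feat.mean(dim=(-1, -2))                    # GAP over each channel, Eq. (7)
        w = self.conv1d(s.unsqueeze(1)).squeeze(1)       # local cross-channel interaction, Eq. (8)
        w = self.sigmoid(w).reshape(b, c, 1, 1)          # template attention weights
        return x_feat * w                                # broadcast onto the search features

tga = TGAModule(k=3)
z = torch.randn(1, 256, 15, 15)
x = torch.randn(1, 256, 31, 31)
print(tga(z, x).shape)                                   # torch.Size([1, 256, 31, 31])
```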

Finally, to adapt to the large scale variations in UAV scenes, the bounding box regression and classification are completed in Anchor-Free RPN (AF-RPN) modules, as done in SiamBAN19 (Siamese box adaptive network). As shown in Eq. (10), the similarity between the search attention features and the template features is calculated in the AF-RPNs, and the similarity response maps B and S represent the matching degree.

      3.4.Adaptive cross-entropy loss

      Fig.5 TGA module.

Training a good classifier always requires a sufficient number of high-quality samples, including positive samples and negative samples. However, object tracking in UAV scenes has an imbalanced distribution of training samples50–51: (A) positive samples are far fewer than negative samples, leading to inefficient training of the Siamese trackers; and (B) most negative samples are easy negatives (non-similar, non-semantic background) that contribute little useful information to learning a discriminative classifier.45 As a consequence, the classifier is dominated by easily classified background samples and degrades when encountering difficult, similar semantic distractors.

Leng et al.52 prove that the cross-entropy loss easily overfits easy samples, which causes the model to ignore the hard but important samples. Focal loss tries to avoid this overfitting problem by reducing the loss of easy samples, but the loss of hard samples is also punished, so it cannot effectively deal with data imbalance. Hence neither the cross-entropy loss nor the focal loss is an optimal classification loss for object tracking in UAV scenes. Referring to Ref. 52, this paper further mines the great potential of the cross-entropy loss function in object tracking and proposes an Adaptive Cross-Entropy (ACE) loss.

First, the cross-entropy loss is defined by:

CE(pt) = −log pt (11)

where pt is the model's prediction probability of the target ground-truth class, and p is the probability that the object is predicted to be a positive sample.

Then, the Taylor expansion of the cross-entropy loss in the bases of (1 − pt)^j, j ∈ N* can be written as:

−log pt = Σ_{j=1}^{∞} (1 − pt)^j / j = (1 − pt) + (1 − pt)^2/2 + (1 − pt)^3/3 + ⋯ (12)

Using the gradient descent algorithm to optimize the cross-entropy loss during training requires taking the gradient with respect to pt. Thus, the gradient of the cross-entropy loss is:

−dCE/dpt = Σ_{j=1}^{∞} (1 − pt)^(j−1) = 1 + (1 − pt) + (1 − pt)^2 + ⋯ (13)

The overfitting problem can be seen from Eq. (13): the leading gradient term is 1, which provides a constant gradient regardless of the value of pt. On the contrary, the jth gradient term is strongly suppressed when j ≫ 1 and pt gets close to 1. Note that when pt gets close to 1, the model is more confident in predicting the target class, which usually indicates an easy sample. Therefore, the focal loss adds a modulating coefficient to the cross-entropy loss to reduce the loss of easy samples, as shown in

FL(pt) = −(1 − pt)^γ log pt (14)

The focal loss is proven effective because hard examples contribute more to model training than before. But note that the coefficient also punishes valuable hard samples (1 − pt > 0.5) and hinders their training to some extent, lacking adaptability.40 Feng et al.53 find that tuning the first polynomial term can improve model robustness and performance. Therefore, unlike focal loss, only the first polynomial coefficient of the cross-entropy loss is modified in the ACE loss.
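
Since the exact ACE formula is given by an equation not reproduced in this text, the sketch below assumes a Poly-1-style reading of the description above, i.e. the cross-entropy plus a re-weighted first polynomial term ε(1 − pt), with ε = 2 as in the settings of Section 4.1; the precise form used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def ace_loss(logits, targets, eps=2.0):
    """Hedged sketch of an adaptive cross-entropy (Poly-1-style) loss.

    Assumption: only the first polynomial term of the CE expansion is re-weighted,
    i.e. L = -log(pt) + eps * (1 - pt), with eps = 2 as in the paper's settings.

    logits:  (N, 2) classification scores (background / target)
    targets: (N,)   class indices in {0, 1}
    """
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-sample -log(pt)
    pt = torch.exp(-ce)                                       # confidence on the true class
    # Easy samples (pt close to 1) add almost nothing extra, while hard samples
    # (pt small) receive a larger additional penalty, unlike focal loss which
    # down-weights every sample by (1 - pt)^gamma.
    return (ce + eps * (1.0 - pt)).mean()

logits = torch.randn(8, 2, requires_grad=True)
targets = torch.randint(0, 2, (8,))
loss = ace_loss(logits, targets)
loss.backward()
print(float(loss))
```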

      4.Experiment

      4.1.Implementation details

Our experiments are implemented under the PyTorch 0.4.1 framework on an Intel Core i7-9700 CPU (3.6 GHz) with two NVIDIA GeForce RTX 2080Ti GPUs. The GOT-10k,54 YouTube-BoundingBoxes,55 ImageNet VID,56 DET,56 and COCO57 datasets are used to train TGFAT, comprising 760 thousand video sequences and more than ten million annotated images. Each group of data fed to the model comes from two different frames containing the same target in an annotated video, which simulates the movement of the target and helps the model capture robust features. The template image has a size of 127 × 127 and slides on the 255 × 255 search images with a stride of 8.

Parameter settings: The number of adjacent channels k in the TGA module is 3, and the hyperparameter ε in the ACE loss is 2. Stochastic Gradient Descent (SGD) is utilized to train the model for a total of 20 epochs. A total of 1 × 10^6 pairs of training samples are sampled in each epoch and the batch size is set to 22. We use a warm-up learning rate of 1 × 10^-3 to 5 × 10^-3 for the first 5 epochs, which then decays exponentially from 5 × 10^-3 to 5 × 10^-5 with a momentum of 0.9 for the last 15 epochs. In the first 10 epochs, we only train the heads with the pre-trained ResNet-50 parameters frozen. Then the conv3 and conv4 layers of the backbone are fine-tuned in the last 10 epochs. The overall loss is the sum of the ACE loss for classification and the IoU loss for regression.
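
A minimal sketch of the learning-rate schedule described above (linear warm-up from 1 × 10^-3 to 5 × 10^-3 over the first 5 epochs, then exponential decay to 5 × 10^-5 over the last 15 epochs); the per-epoch interpolation details are assumptions.

```python
def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  warmup_start=1e-3, warmup_end=5e-3, final_lr=5e-5):
    """Per-epoch learning rate: linear warm-up, then exponential decay."""
    if epoch < warmup_epochs:
        # linear warm-up from 1e-3 to 5e-3 over the first 5 epochs
        return warmup_start + (warmup_end - warmup_start) * epoch / (warmup_epochs - 1)
    # exponential decay from 5e-3 to 5e-5 over the remaining 15 epochs
    decay_epochs = total_epochs - warmup_epochs
    step = epoch - warmup_epochs
    gamma = (final_lr / warmup_end) ** (1.0 / (decay_epochs - 1))
    return warmup_end * gamma ** step

for e in range(20):
    print(e, f"{learning_rate(e):.2e}")
```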

Tracking performance is mainly measured by two criteria: success rate and precision rate. The precision rate is based on the Center Location Error (CLE), that is, the Euclidean distance between the tracking result of each frame (given by the tracker) and the corresponding ground truth; the precision rate is the fraction of frames whose CLE is less than 20 pixels. The success rate is the fraction of frames in which the overlap rate (IoU) between the tracking result and the ground truth is greater than a specified threshold (usually 0.5).
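
A simple sketch of how these two criteria can be computed from per-frame bounding boxes in (x, y, w, h) format; the helper names and the toy data are illustrative.

```python
import numpy as np

def center_error(pred, gt):
    """Center Location Error between (x, y, w, h) boxes, per frame."""
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Intersection over Union of (x, y, w, h) boxes, per frame."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def precision_rate(pred, gt, thr=20.0):
    """Fraction of frames whose CLE is below thr pixels."""
    return float((center_error(pred, gt) < thr).mean())

def success_rate(pred, gt, thr=0.5):
    """Fraction of frames whose IoU exceeds thr."""
    return float((iou(pred, gt) > thr).mean())

# Toy usage on two frames
pred = np.array([[10, 10, 40, 40], [100, 100, 50, 50]], float)
gt = np.array([[12, 11, 40, 42], [90, 95, 60, 55]], float)
print(precision_rate(pred, gt), success_rate(pred, gt))
```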

      4.2.Visualization of classification

Fig. 6 shows the visualization of different trackers' classification performance, comparing two leading trackers (SiamRPN++ and SiamBAN) with our tracker, TGFAT.

These images are also called heatmaps, where the temperature represents the prediction confidence of the object class; in other words, the higher the temperature, the stronger the classification ability. It is clear that SiamRPN++ and SiamBAN are challenged by similar objects and fast motion. The tracking object's features become more obvious in the TGFAT similarity response map, which is conducive to distinguishing the object from distractors. Therefore, TGFAT achieves better performance with the help of the FCA module and the TGA module, which filter the backbone features and guide the precise localization of the tracking target, respectively. Besides, the ACE loss adaptively enhances the training of hard samples, which solves the data imbalance in UAV scenes without additional computing overhead.

      4.3.Evaluation on UAV datasets

To verify the effectiveness of the proposed TGFAT, experiments and comparisons are conducted on test sequences from two prestigious UAV benchmarks: UAV12358 and UAVDT59. During UAV tracking, small objects, similar objects, background clutters, and scale variations often occur, making tracking challenging. Please note that the trackers' results used in this paper are taken from the officially provided dataset results, from the authors, and from the test results in the survey.1

      Fig.6 Classification response maps.

      4.3.1.Results on UAV123

The UAV12358 dataset consists of videos captured from the aerial viewpoint of UAVs. Specifically, it contains 123 sequences and more than 110 thousand frames in total, with 12 diverse attributes. Besides, the video sequences captured by UAVs feature complex backgrounds and large scale variations, placing higher demands on object tracking algorithms. In addition, the video length of UAV123 is longer than that of other datasets, which verifies long-term tracking ability. To fully demonstrate the overall performance of our tracker, TGFAT is compared with 17 current leading trackers, including CIFT,26 SiamTPN60 (Siamese Transformer Pyramid Networks), LDSTRT61 (Learning Dynamic Spatial-Temporal Regularization Tracker), HiFT62 (Hierarchical Feature Transformer), DAFSiamRPN,39 BASCF63 (Background cues and Aberrances response Suppression mechanism Correlation Filters), AutoTrack,23 GlobalTrack64 (Global Tracker), SiamRPN++,17 LST2 (Least Squares Transformation), ARCF65 (Aberrance Repressed Correlation Filters), DaSiamRPN66 (Distractor-aware SiamRPN), SiamRPN,14 ECO-gpu67 (Efficient Convolution Operators), BACF68 (Background Aware Correlation Filters), C-COT,10 and SiamFC,13 where the UAV trackers are CIFT, SiamTPN, LDSTRT, HiFT, DAFSiamRPN, BASCF, AutoTrack, LST, and ARCF.

As shown in Table 1, TGFAT shows superior performance compared with other leading trackers. TGFAT achieves the best success rate (61.7%), the highest precision rate (82.7%), and a speed of 41.3 FPS. This can be attributed to two major reasons. First, the TGA and FCA modules enhance the ability of the tracker to extract target feature information and suppress irrelevant information. Second, the distraction caused by easy samples is alleviated by introducing the ACE loss, which helps the tracker adaptively pay attention to the most discriminative samples. Compared with other UAV trackers such as CIFT and DAFSiamRPN, TGFAT achieves favorable results in terms of success rate, precision rate, and real-time speed (≥30 FPS), which means the robustness and speed of TGFAT are competent for UAV object tracking tasks.

As illustrated in Table 2, TGFAT is compared with other trackers on a variety of attributes, where background clutters, similar objects, scale variations, fast motion, and full occlusion are the common attributes in UAV scenes. The results in Table 2 show that, benefitting from the TGA module, FCA module, and the ACE loss, TGFAT achieves the best performance on similar objects, scale variations, fast motion, and full occlusion. Moreover, TGFAT is comparable with DaSiamRPN on background clutters, which shows the reliable discrimination and anti-interference ability of TGFAT.

As displayed in Fig. 7, we select five real-time scenes in the UAV123 dataset to demonstrate the effectiveness of the proposed tracker, including bike2, car6_2, car18, group1_2, and wakeboard6. In the bike2 and group1_2 scenes, trackers such as HiFT, AutoTrack, and SiamRPN++ drift due to the target's small size and similar-object interference, while TGFAT can still track robustly and accurately. In car6_2, the tracking target partly disappears from the field of view, which badly affects the performance of many trackers such as SiamFC and HiFT, but TGFAT can commendably adapt to the disappearance and scale variations. The car18 scene mainly involves fast motion of the tracking target, and the results show that our tracker is also effective in this situation, with accurate calculation of the bounding box. The results on wakeboard6 show that TGFAT can robustly recognize and accurately locate the target in the face of fast motion and background clutters.

Table 1 Experimental results on UAV123. The best two performances are displayed in bold and underline, respectively. Suc. and Pre. mean success rate and precision rate, respectively.

Table 2 Comparisons of algorithms on the attributes common in UAV scenes. The best two performances are displayed in bold and underline, respectively. Suc. and Pre. mean success rate and precision rate, respectively.

      Fig.7 Qualitative evaluation on UAV123.The sequences from left to right and top to bottom are bike2, car6_2, car18, group1_2 and wakeboard6.

In addition, Fig. 8 presents three evaluations in typical UAV scenes, where the IoU between the ground truth and the predicted bounding box reflects the performance of the trackers. The main challenges in these sequences are partial occlusion and small object (the first row), low resolution and scale variation (the second row), and camera motion and background clutter (the third row). At frame 175 of sequence bike3, when the tracking target reappears after occlusion, HiFT's discriminative ability is disturbed and it begins tracking the wrong target; after frame 283, HiFT loses the target. However, our TGFAT is not affected by partial occlusion and small object size and keeps tracking the target accurately. When occlusions occur in sequence truck2, HiFT is also disturbed by background noise and starts to lose the target, leading to subsequent tracking failure. By contrast, TGFAT can still track the target stably in the face of occlusion and low resolution, which exhibits its reliable classification ability. Background noise and camera motion in sequence wakeboard6 are unique and common challenges in UAV scenes. From frame 484 to frame 807, the performance of HiFT in handling camera motion is unsatisfactory. Moreover, when HiFT finds the target again, the tracking bounding box contains too much background due to its insufficient ability to distinguish between the target and the background. In contrast, TGFAT achieves trustworthy performance under camera motion and background noise. Therefore, we can conclude that TGFAT is more competent for aerial object tracking tasks.

      4.3.2.Results on UAVDT

The UAVDT59 dataset is a large-scale and challenging UAV detection and tracking benchmark that mainly focuses on vehicles. It consists of 100 video sequences and 8 × 10^4 frames, selected from more than 10 h of video shot by UAV platforms at multiple urban locations, covering various common scenes including squares, main roads, highways, and intersections. Frames are manually annotated with bounding boxes and 14 challenging attributes, such as weather changes, flight altitude, vehicle category, and occlusion. Videos are recorded at 30 FPS with an image resolution of 1080 × 540 pixels and an average of 10.5 targets per frame. The targets captured in UAVDT are characterized by small size, dense distribution, and fast motion, which severely tests the overall performance of trackers.

As displayed in Fig. 9, TGFAT is compared with 15 other advanced trackers, including SiamCAR22 (Siamese fully convolutional Classification and Regression), SiamRPN++,17 MobileTrack,69 GFSDCF70 (Group Feature Selection and Discriminative Correlation Filter), MDNet71 (Multi-Domain convolutional neural Networks), ARCF,65 ARCF-H,65 AutoTrack,23 DSiam72 (Dynamic Siamese network), ECO-gpu,67 BACF,68 SiamFC,13 SRDCF,7 CSR-DCF73 (Discriminative Correlation Filter with Channel and Spatial Reliability), and C-COT,10 where AutoTrack, ARCF, ARCF-H, and MobileTrack are trackers designed for UAVs. TGFAT has competitive advantages attributed to the TGA module, FCA module, and ACE loss. TGFAT achieves a success rate of 60.6% and a precision rate of 84.4%. The precision rate of TGFAT ranks first, surpassing the second-best SiamCAR (82.3%) and the third-best SiamRPN++ (82.0%) by 2.6% and 2.9%, respectively. Compared with other trackers, TGFAT achieves better performance in success rate, precision rate, and speed, showing its stable overall ability in the face of complex challenges.

      Fig.9 Results on UAVDT.

To intuitively show the performance of our tracker, we use TGFAT to make a more qualitative comparison with some of the trackers mentioned above on UAVDT. From Fig. 10, we can see that TGFAT performs better in the face of challenging scenes including small objects, occlusion, similar objects, fast motion, and background clutters. For example, there are serious background clutters and blurred target information in the S0103 and S1310 sequences; TGFAT can successfully distinguish between background and target, accomplishing better performance than HiFT and SiamRPN++. After the target in the S0601 sequence is fully occluded, some trackers lose the tracking target.

However, TGFAT can still accurately track the target, which shows its reliable re-tracking ability. In the S0301 and S1310 sequences, the background is characterized by low illumination, low resolution, and many similar targets. Most trackers are affected by the blurred target features, but TGFAT still accurately locates the target and adapts to the scale variations. Besides, TGFAT can adaptively adjust the aspect ratio of the target as the camera rotates in the S1702 sequence. It is obvious that TGFAT is better equipped for aerial object tracking tasks than the other trackers.

      4.4.Ablation experiments

In this section, we conduct ablation experiments on the UAV123 and UAVDT datasets to evaluate the contribution of the different components of TGFAT. Table 3 reports the Precision rate (Pre.) and the Success rate (Suc.) of the different variations. Please note that the baseline is SiamBAN trained with the same datasets as TGFAT.

The comparison of the first two rows shows that the FCA module has the potential to enhance backbone robustness on UAV platforms, promoting the success rate and precision rate to about 61.2% and 80.5% on UAV123, and 60.5% and 81.6% on UAVDT, respectively. It shows that TGFAT can benefit from the enhanced features provided by the FCA module. Besides, the TGA module is introduced to guide the generation of the search features, improving the success rate to 61.6% on UAV123 and 60.6% on UAVDT, and the precision rate to 81.1% on UAV123 and 83.0% on UAVDT. Despite these improvements, it is still difficult to solve the data imbalance in the training process. When we employ the ACE loss to train TGFAT, the precision rates increase (+2.0% and +1.7% on UAV123 and UAVDT, respectively), and the success rates are promoted to 61.7% and 60.6% on UAV123 and UAVDT, respectively. This means the ACE loss can adaptively enhance the training of hard samples, solving the data imbalance in UAV scenes. In summary, each component of TGFAT helps boost performance without introducing a notable computation burden.

      4.5.Real-world tests

To verify the tracking performance of TGFAT in real UAV scenes, TGFAT is applied to UAV aerial videos for evaluation, where the resolution is 3840 × 2160 pixels, the frame rate is 30 FPS, and the shooting height is 120 m. Some tracking results are shown in Fig. 11. The tracking targets are people and vehicles on the road. The small size, fast motion, limited feature information, and background clutters of the targets fail to affect the tracking performance of TGFAT. It can be seen that TGFAT has excellent small-target recognition and anti-interference ability in practical applications.

      Fig.10 Qualitative evaluation on UAVDT.The sequences from left to right and top to bottom are S0103, S0601, S0301, S1310 and S1702.

Table 3 Ablation experiments of TGFAT. Suc. and Pre. mean success rate and precision rate, respectively.

      Fig.11 Tracking results of TGFAT in real UAV videos.

      5.Conclusions

In this work, a novel Siamese tracker, referred to as TGFAT, is proposed for accurate and efficient aerial tracking. First, we introduce the FCA module into the backbone network to make full use of frequency feature information and channel correlation, adaptively suppressing distractor information and enhancing the target features. Next, the TGA module is applied to guide the search features by taking advantage of the template features in a cross-branch manner, strengthening the tracker's classification capability. Finally, to avoid the tracker being misled by easy samples, we employ the ACE loss to emphasize the learning of hard samples, adaptively solving the data imbalance common in UAV scenes. In the experiments, we evaluate TGFAT against several state-of-the-art trackers on two benchmarks, UAV123 and UAVDT. Besides, the practical application ability of TGFAT is also verified in real-world scenes. Abundant experiments show that TGFAT performs favorably against these advanced works while maintaining a real-time speed (~41.3 FPS).

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgements

This study was co-supported by the National Natural Science Foundation of China (Nos. 61673017 and 61403398).
