Hu Zhentao(胡振濤), Mao Yihao, Fu Chunling, Liu Xianxing
(*College of Computer and Information Engineering, Henan University, Kaifeng 475004, P.R.China)(**School of Physics and Electronics, Henan University, Kaifeng 475004, P.R.China)
Abstract
Key words: pedestrian tracking, correlation filter, Kalman filter, deep feature
Pedestrian tracking is an important research area of computer vision and pattern recognition. It has been applied in many fields such as video surveillance, autonomous driving, and unmanned aerial vehicles. Especially in the field of security, pedestrian tracking is considered the most fundamental technique for trajectory analysis, traffic monitoring, gait recognition, and so on[1,2]. It is well known that pedestrians are difficult to detect accurately when they are occluded, which can even cause the track to be lost. In terms of duration, pedestrian occlusion is mainly divided into short-term occlusion and long-term occlusion. In terms of area, it can be divided into partial occlusion and complete occlusion[3].
In general, tracking a pedestrian who becomes completely occluded can be divided into 4 stages: (1) target tracking before occlusion; (2) occlusion judgment; (3) prediction of the pedestrian position under complete occlusion; (4) rematch of the pedestrian after reappearance or loss. Bolme et al.[4] used correlation filtering to solve the pedestrian tracking problem. The filter is designed according to the minimum output sum of squared error (MOSSE) and localizes the pedestrian at the maximum of the filter response. However, because MOSSE only uses the gray-level feature, it is prone to tracking drift in complex situations. Henriques et al.[5] proposed the kernelized correlation filter (KCF), in which ridge regression in the linear space is mapped to a high-dimensional space by a kernel function, which effectively improves both the speed and the accuracy of pedestrian tracking. On that basis, Li and Zhu[6] designed a scale adaptive kernel correlation filter tracker with feature integration (SAKCF), which further improves the accuracy of target tracking. To address partial occlusion, Huang et al.[7] proposed an anti-occlusion and scale adaptive kernel correlation filter (ASAKCF), whose occlusion judgment mechanism can effectively deal with partial and short-term occlusion. In Ref.[8], the Kalman filter[9] and the camshift strategy[10] were combined: the Kalman filter is used for position prediction and the camshift strategy for identification of the occluded target.
To address complete occlusion, Ma and Wang[11,12] introduced an online target detection mechanism that is triggered after tracking failure. Ref.[11] used an online support vector machine (SVM) detector to rematch the target after occlusion or loss of tracking. In Ref.[12], the single shot multibox detector (SSD)[13] was combined with the correlation filter to identify and locate the target, which enables long-term tracking. Although the above algorithms can solve partial and complete occlusion problems to some extent, they still have defects such as weak detection ability and poor matching performance.
A novel anti-occlusion pedestrian tracking algorithm based on location prediction and deep feature rematch (ALPDFE) is proposed in this paper. Its goal is to solve the tracking problem of pedestrians under frequent or long-term complete occlusion. The algorithm uses the deep feature of the pedestrian’s appearance to judge occlusion. When the pedestrian is not occluded, SAKCF is used to estimate the pedestrian location and scale. When the pedestrian is occluded, the location is predicted by the Kalman filter. When the pedestrian reappears, the deep feature and the YOLOv3 method[14] are used to confirm and rematch the pedestrian. The main contributions of this paper are as follows. Firstly, a new occlusion judgment method is designed, which uses a deep learning strategy to extract pedestrian features. Secondly, to accurately estimate the pedestrian location under both occluded and non-occluded conditions, a location prediction structure is proposed that combines the correlation filter with the Kalman filter. Thirdly, for pedestrian reappearance, deep features and a target detection method are introduced to realize the rematch.
(1) Filter training
Given the training sample $(u_t, y_t)$ of the $t$th frame, the goal is to train the filter $h_t$ that minimizes the squared error between the sample $u_t$ and its regression target $y_t$. The samples $u_t$ are obtained from the circulant matrix built on the pedestrian’s appearance feature, and $y_t$ is the desired filter response, which takes its maximum value at the pedestrian location. The model is expressed as
$$\min_{h_t}\sum_{i}\left(h_t^{\mathrm{T}}u_t^{i}-y_t^{i}\right)^2+\lambda\lVert h_t\rVert^2,\quad i=1,2,\dots,n \tag{1}$$
where $i$ denotes the index of training samples and $\lambda$ denotes the regularization parameter used to prevent overfitting. According to the kernel method, $h_t$ is represented through the feature mapping function $\varphi$ as
$$h_t=\sum_{i}\alpha_t^{i}\varphi(u_t^{i}) \tag{2}$$
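where $\alpha_t$ denotes the filter parameter after mapping. According to the properties of the circulant matrix and the Fourier transform, the solution for $\alpha_t$ is quickly calculated in the frequency domain.
$$\hat{\alpha}_t=\frac{\hat{y}_t}{\hat{K}(u_t,u_t)+\lambda} \tag{3}$$
Here $\hat{\ }$ denotes the Fourier transform, the division is element-wise, and $K$ corresponds to the kernel function. For the Gaussian kernel function, the kernel matrix[5] $K$ is expressed as
$$K(u_t,u_t)=\exp\left(-\frac{1}{\sigma^2}\left(\lVert u_t\rVert^2+\lVert u_t\rVert^2-2F^{-1}\left(\hat{u}_t^{*}\odot\hat{u}_t\right)\right)\right) \tag{4}$$
where $F^{-1}$ denotes the inverse Fourier transform, $*$ denotes the complex conjugate, $\sigma$ denotes the parameter of the Gaussian kernel, and $\odot$ denotes the element-wise product.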
(2) Pedestrian location
(5)
(3) Filter update
(6)
where $\eta$ denotes the learning rate.
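The three steps above (training, location, update) can be sketched in a few lines. This is a minimal single-channel illustration that follows the standard KCF formulation of Ref.[5], not the implementation used in this paper. The function names, the default values of `sigma`, `lam` and `eta`, the Gaussian label, and the division of the kernel distance by the patch size are all assumptions made for the example.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Regression target y: a Gaussian whose peak sits at index (0, 0)."""
    h, w = shape
    rr, cc = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    y = np.exp(-0.5 * (rr ** 2 + cc ** 2) / sigma ** 2)
    # Move the peak from the patch center to the origin (zero shift).
    return np.roll(y, (-(h // 2), -(w // 2)), axis=(0, 1))

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift of z."""
    cross = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    dist = np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross
    # Assumption: dividing by x.size is a common implementation choice.
    return np.exp(-np.maximum(dist, 0.0) / (sigma ** 2 * x.size))

def train(x, y, sigma=0.5, lam=1e-4):
    """Filter training in the frequency domain: alpha_hat = y_hat / (k_hat + lambda)."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_f, x, z, sigma=0.5):
    """Pedestrian location: response map over all cyclic shifts of candidate z."""
    k = gaussian_correlation(x, z, sigma)
    return np.real(np.fft.ifft2(alpha_f * np.fft.fft2(k)))

def update(old, new, eta=0.02):
    """Filter update: linear interpolation with learning rate eta."""
    return (1.0 - eta) * old + eta * new

# Toy check: a patch shifted by (3, 5) should give a response peak at (3, 5).
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
alpha_f = train(x, gaussian_label(x.shape))
z = np.roll(x, (3, 5), axis=(0, 1))
response = detect(alpha_f, x, z)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
```

The location of the maximum of `response` gives the displacement of the target between the template patch and the candidate patch. In a tracker, `update` would be applied to both the filter coefficients and the template after each frame.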
KCF uses a fixed sample size during tracking, so it cannot handle scale variations of the target. Based on the scale pyramid strategy, SAKCF[6] adapts the pedestrian scale by sampling candidate regions of different sizes. Define the pedestrian size in the $t$th frame and the scale pool as $W\times H$ and $s$, respectively.
$$s=\{s_j\},\quad j=1,2,\dots,k \tag{7}$$
(8)
(9)
Therefore, given the pedestrian scale in the $t$th frame and $s'$, the pedestrian scale in the $(t+1)$th frame is taken as $s'W\times s'H$.
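The scale search can be illustrated with a short sketch. It assumes that $s'$ is the entry of the scale pool whose candidate region yields the largest peak filter response, which is the usual choice in scale-pool trackers. The callback `peak_response` is hypothetical and stands for cropping a candidate of the given size, resizing it to the template size, and evaluating the correlation filter.

```python
def select_scale(width, height, scale_pool, peak_response):
    """Pick the scale s' from the pool and return (s', s'W, s'H).

    peak_response(w, h) -> float is a hypothetical callback that returns the
    maximum filter response for a candidate region of size w x h.
    """
    best = max(scale_pool, key=lambda s: peak_response(s * width, s * height))
    return best, best * width, best * height

# Toy usage: a made-up response that prefers candidates about 55 pixels wide.
toy_response = lambda w, h: -abs(w - 55.0)
s_best, new_w, new_h = select_scale(50.0, 100.0, [0.9, 1.0, 1.1], toy_response)
```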
Although SAKCF has good tracking performance, it cannot effectively handle tracking under complete occlusion. According to the characteristics of the 4 stages of the complete occlusion tracking process, a novel anti-occlusion tracking algorithm is proposed. Specifically, it is divided into the following 3 steps. In the 1st step, a new occlusion judgment approach is designed. In the 2nd step, the pedestrian position is predicted by the Kalman filter during tracking failure and occlusion. In the 3rd step, a new rematch strategy is presented for pedestrian reappearance.
The appearance feature changes when a pedestrian is occluded. Therefore, the tracking quality is calculated from appearance features to determine whether there is occlusion. In recent years, deep learning techniques have emerged as effective methods for representing appearance, because they learn features automatically from data. Wojke and Bewley[15] designed a lightweight convolutional neural network (LWCNN), whose architecture is shown in Fig.1, and used deep cosine metric learning to encode similarity directly into the training objective. The algorithm obtains good results for pedestrian re-identification. Following the appearance feature extraction of Ref.[15], the tracking quality is calculated as follows.
Fig.1 The network architecture of LWCNN
A 128×64 RGB image patch is fed into LWCNN and is reduced to a 16×8 feature map through a series of convolutional layers. A global feature vector $\boldsymbol{m}$ of length 128 is then extracted by the fully-connected layer.
$$\boldsymbol{m}=[m_1,m_2,\dots,m_{128}] \tag{10}$$
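The appearance feature vectors $\boldsymbol{m}_1$ and $\boldsymbol{m}_2$ of 2 image patches are extracted separately. Then the similarity between the feature vectors is calculated as
$$\Phi(\boldsymbol{m}_1,\boldsymbol{m}_2)=\frac{\boldsymbol{m}_1\cdot\boldsymbol{m}_2}{\lVert\boldsymbol{m}_1\rVert_2\,\lVert\boldsymbol{m}_2\rVert_2} \tag{11}$$
where $\lVert\cdot\rVert_2$ denotes the 2-norm of a vector.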
As shown in Fig.2, the same pedestrian has a high similarity across different frames when there is no occlusion (Fig.2(a)). The similarity is low for a different person or when occlusion exists (Fig.2(b) and Fig.2(c)). Thus, this way of calculating the tracking quality is effective.
Fig.2 Cosine distance of deep features for different patches
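A feature template set $M$ is constructed. The maximum cosine similarity between $\boldsymbol{m}$ and the templates in $M$ is used to represent the tracking quality.
$$q=\Psi(\boldsymbol{m},M)=\max\{\Phi(\boldsymbol{m}_r,\boldsymbol{m})\mid\boldsymbol{m}_r\in M\},\quad r=1,2,\dots,N \tag{12}$$
where $N$ denotes the size of $M$.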
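The tracking quality can be computed directly from the feature vectors. The following is a minimal NumPy sketch of the cosine similarity and of the maximum over the template set. The function names are illustrative, and the short 2-dimensional vectors only stand in for the 128-dimensional LWCNN features.

```python
import numpy as np

def cosine_similarity(m1, m2):
    """Cosine similarity between two feature vectors."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    return float(np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2)))

def tracking_quality(m, templates):
    """Tracking quality q: the largest similarity between m and any template."""
    return max(cosine_similarity(m_r, m) for m_r in templates)

# Toy usage with 2-D stand-ins for the deep features.
q = tracking_quality([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```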
It is worth noting that if only the deep feature from the previous frame were used as the template, the similarity would remain high even under occlusion, because the pedestrian appearance changes slowly between adjacent frames. In Ref.[16], the template is updated at a fixed interval, which not only reduces the amount of calculation but also avoids template drift. Thus, the deep feature $\boldsymbol{m}$ of the pedestrian is extracted every $\Upsilon$ frames and the tracking quality $q$ is calculated. If $q$ is greater than the threshold $q_t$, the tracking result is considered normal and $\boldsymbol{m}$ is used to update the feature template.
$$M=(M\oplus\boldsymbol{m})[N] \tag{13}$$
Here, $\oplus$ denotes adding the feature $\boldsymbol{m}$ to the feature template set $M$, and $[N]$ denotes selecting the latest $N$ features. The update process is shown in Fig.3.
According to the size of the pedestrian and the background complexity, $q_t$ is usually taken between 0.77 and 0.83. Depending on the video frame rate and the target moving speed, $\Upsilon$ is usually taken between 6 and 12.
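The template update behaves like a fixed-length queue: a new feature is appended only when the tracking quality is above the threshold, and the oldest feature is dropped once $N$ features are stored. A minimal sketch using a bounded deque is given below. The function names are illustrative, the defaults are values from the ranges stated above, and checking the interval with a modulo on the frame index is an assumption.

```python
from collections import deque

def is_check_frame(frame_index, interval=10):
    """Assumption: the quality is evaluated on every interval-th frame."""
    return frame_index % interval == 0

def update_templates(templates, m, q, q_t=0.8):
    """Append feature m when q exceeds q_t; a bounded deque keeps the latest N."""
    if q > q_t:
        templates.append(m)
        return True
    return False

# Toy usage: integers stand in for feature vectors, N = 4.
templates = deque(maxlen=4)
for feature in range(5):
    update_templates(templates, feature, q=0.9)
rejected = update_templates(templates, 99, q=0.5)
```

After the loop the set holds the 4 most recent features, and the low-quality feature is rejected, so an occluded appearance never enters the template set.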
Fig.3 The update process of feature template
When $q$ is less than $q_t$, the pedestrian is considered occluded and the appearance information of the image is not available. The pedestrian location is then determined by the dynamic model of pedestrian motion. Assuming that the pedestrian motion model between adjacent frames is known, the Kalman filter is used to predict the pedestrian location. The state transition and observation models are
$$x_{t+1}=Ax_t+\omega \tag{14}$$
$$z_t=Hx_t+v \tag{15}$$
where $x_t$ and $z_t$ denote the state vector and the observation vector of the pedestrian at time $t$, respectively. $A$ and $H$ denote the state transition matrix and the observation matrix, respectively. The process noise $\omega$ and the observation noise $v$ are Gaussian noise with covariances $Q$ and $R$, respectively. The state of the pedestrian is defined as
$$x=(c_x,v_x,c_y,v_y)^{\mathrm{T}} \tag{16}$$
where $c_x$ and $c_y$ denote the coordinates of the center point of the pedestrian, and $v_x$ and $v_y$ denote the horizontal and vertical speeds, respectively. The observation vector is defined as $z=[c_x,c_y]^{\mathrm{T}}$, and the Kalman filter is realized as
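$$\begin{aligned}
x_{t+1|t}&=Ax_{t|t}\\
P_{t+1|t}&=AP_{t|t}A^{\mathrm{T}}+Q\\
K_{t+1}&=P_{t+1|t}H^{\mathrm{T}}\left(HP_{t+1|t}H^{\mathrm{T}}+R\right)^{-1}\\
x_{t+1|t+1}&=x_{t+1|t}+K_{t+1}\left(z_{t+1}-Hx_{t+1|t}\right)\\
P_{t+1|t+1}&=\left(I-K_{t+1}H\right)P_{t+1|t}
\end{aligned} \tag{17}$$
where $x_{t+1|t}$ and $P_{t+1|t}$ are the predicted pedestrian state and state error covariance for time $t+1$ given the observations up to time $t$, $x_{t+1|t+1}$ and $P_{t+1|t+1}$ are the estimated pedestrian state and state error covariance at time $t+1$, $K_{t+1}$ is the filter gain at time $t+1$, and $I$ is the identity matrix.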
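The prediction and correction steps can be sketched as follows. The text does not give the entries of $A$ and $H$, so the example assumes a constant-velocity model with a time step of one frame, which is the natural choice for the state $(c_x,v_x,c_y,v_y)^{\mathrm{T}}$. The function names and the noise values in the usage lines are illustrative only.

```python
import numpy as np

def constant_velocity_model(dt=1.0):
    """Assumed A and H for the state (cx, vx, cy, vy) and observation (cx, cy)."""
    A = np.array([[1.0, dt, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, dt],
                  [0.0, 0.0, 0.0, 1.0]])
    H = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    return A, H

def predict(x, P, A, Q):
    """Time update: propagate the state and its error covariance."""
    return A @ x, A @ P @ A.T + Q

def correct(x_pred, P_pred, z, H, R):
    """Measurement update: fuse the observation z into the prediction."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)  # filter gain
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P

# Toy usage: center at (0, 0) moving 2 px/frame right and 1 px/frame down.
A, H = constant_velocity_model()
x0, P0 = np.array([0.0, 2.0, 0.0, 1.0]), np.eye(4)
x_pred, P_pred = predict(x0, P0, A, 0.01 * np.eye(4))
x_est, P_est = correct(x_pred, P_pred, np.array([2.0, 1.0]), H, np.eye(2))
```

During occlusion only `predict` would be called in each frame, so the predicted center keeps moving with the last estimated velocity. When the tracker or the detector supplies a position again, `correct` fuses it into the state.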
In order to rematch the pedestrian, a pedestrian detection method is needed. Traditional detection methods are limited in practical applications because of their slow speed, low precision, and poor anti-interference ability. YOLOv3 is a general object detection algorithm based on deep learning[14], which can determine the spatial location and scale of persons in a given image. In addition, because of its one-stage detection structure, YOLOv3 has good real-time performance.
YOLOv3 is invoked for pedestrian detection when the number of consecutive occluded frames $\theta_n$ exceeds the threshold $\theta_t$. The detection output $E$ is defined as
$$E=\{e_\zeta\},\quad\zeta=1,2,\dots,l \tag{18}$$
where $l$ denotes the number of persons in the current frame and $e_\zeta$ denotes the position and scale of the $\zeta$th person.
(19)
Suppose the maximum similarity $q'$ corresponds to the $\zeta$th person. When $q'$ is greater than $q_t$, the match is considered successful; otherwise, the match fails.
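The rematch decision can be sketched as a search over the detected persons. The example assumes that a deep feature has already been extracted for each detection and that the similarity of a detection is its tracking quality with respect to the template set. The function names and the toy 2-dimensional features are illustrative.

```python
import numpy as np

def quality(m, templates):
    """Largest cosine similarity between feature m and the stored templates."""
    m = np.asarray(m, dtype=float)
    sims = [np.dot(t, m) / (np.linalg.norm(t) * np.linalg.norm(m))
            for t in np.asarray(templates, dtype=float)]
    return float(max(sims))

def rematch(detection_features, templates, q_t=0.8):
    """Return (index, q') of the matched detection, or (None, q') on failure."""
    if len(detection_features) == 0:
        return None, 0.0
    scores = [quality(m, templates) for m in detection_features]
    best = int(np.argmax(scores))
    if scores[best] > q_t:
        return best, scores[best]
    return None, scores[best]

# Toy usage: one stored template, three detected persons.
templates = [[1.0, 0.0]]
index, q_best = rematch([[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]], templates)
```

When `index` is not `None`, the position and scale of the matched detection would be used to reinitialize the tracker.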
The flowchart of ALPDFE is shown in Fig.4, and the implementation is summarized in Algorithm 1.
Fig.4 The flowchart of ALPDFE
Algorithm 1 The implementation of ALPDFE
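As a rough illustration of how the 3 steps interact in each frame, the decision logic described in the preceding sections can be sketched as follows. This is not the authors' Algorithm 1. The function name, the action labels, the per-frame evaluation of $q$, and the reset of the occlusion counter after a normal frame are all assumptions.

```python
def next_action(q, occluded_frames, q_t=0.8, theta_t=4):
    """Choose the action for the current frame.

    q is the tracking quality, occluded_frames the number of consecutive
    occluded frames so far. Returns (action, updated_counter).
    """
    if q >= q_t:
        # Tracking is normal: keep the correlation filter result.
        return "track_with_sakcf", 0
    occluded_frames += 1
    if occluded_frames > theta_t:
        # Occluded for too long: run the detector and try to rematch.
        return "detect_and_rematch", occluded_frames
    # Occluded: rely on the Kalman prediction.
    return "predict_with_kalman", occluded_frames

# Toy usage: one normal frame followed by five low-quality frames.
counter, actions = 0, []
for q in [0.9, 0.5, 0.5, 0.5, 0.5, 0.5]:
    action, counter = next_action(q, counter)
    actions.append(action)
```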
Two scenarios with different occlusion conditions are selected to verify the feasibility and validity of the proposed algorithm. Video 1 is the Human 3 sequence of the visual tracker benchmark[17]. Video 2 is a real pedestrian tracking scene on campus. In these videos, pedestrians are occluded frequently or completely. The one pass evaluation (OPE) mode is used to evaluate performance from both qualitative and quantitative aspects. The qualitative analysis covers 2 aspects: frequent occlusion and long-term occlusion. The quantitative analysis covers 2 aspects: distance accuracy and overlap accuracy. ALPDFE is compared with SiamFC, KCF, ASAKCF, ALP and SAKCF. To compare the effectiveness of the location prediction step and the rematch step, the anti-occlusion pedestrian tracking algorithm based on location prediction (ALP) is introduced as an ablation experiment. Unlike ALPDFE, ALP includes only the location prediction by the Kalman filter. The simulation parameters are set as follows: the occlusion threshold $\theta_t=4$, the tracking quality threshold $q_t=0.8$, the model updating interval $\Upsilon=10$, and the size of the feature template set $N=4$.
(1) Frequent occlusion
The test video is a typical Human 3 video segment, in which the pedestrian is frequently occluded by obstacles and then reappears. The tracking results are shown in Fig.5. At the 10th frame, the pedestrian is not occluded, and all algorithms track accurately. At the 35th frame, the pedestrian is occluded by other fast moving pedestrians. As SiamFC lacks an occlusion processing mechanism, it is the first to drift. However, because the occluded area is small and the occluder has a similar appearance color to the pedestrian, SAKCF and KCF can still continue to track. The pedestrian is occluded frequently from the 60th frame to the 150th frame, and both SAKCF and KCF fail to track. At this point, ASAKCF, ALP and ALPDFE can still track continuously. At the 750th frame, ALP and ALPDFE show better tracking performance after a fast focal length change. As the tracking time increases, ASAKCF fails to track because of error accumulation and the complex background. Although ALP does not include the rematch part, it can effectively predict the target location by the Kalman filter when frequent occlusion occurs, so it almost never loses the target in the whole tracking process. Because ALPDFE adds the rematch strategy for tracking failure, the pedestrian location is corrected constantly. Therefore, ALPDFE has better tracking performance than the other 5 algorithms.
Fig.5 The tracking result of video 1
(2) Long-term occlusion
Video 2 contains a long-term occlusion of the pedestrian. In this video, the pedestrian passes through a parking lot and is continuously occluded for 100 frames. There is interference from vehicles and other moving pedestrians, which increases the difficulty of pedestrian tracking.
The 6 algorithms are given the same initial position in the first frame. Their tracking results are shown in Fig.6. At the 15th frame, all algorithms track normally. At the 160th frame, the pedestrian starts to be occluded, and at the 174th frame the pedestrian is almost completely occluded. After that, SiamFC, KCF, ASAKCF and SAKCF cannot continue to track the pedestrian. ALP and ALPDFE can predict the pedestrian location by the Kalman filter when the pedestrian is completely occluded. It can be seen that only ALP and ALPDFE do not lose the pedestrian track at the 195th frame. As the occlusion time increases, tracking drift appears for ALP. The rematch strategy of ALPDFE is executed when a continuous loss occurs, so ALPDFE tracks the pedestrian again at the 271st frame, while the other 5 algorithms fail to track the moving pedestrian.
Fig.6 The tracking result of video 2
(1) Center location error
Fig.7 shows the center location error curves of the 6 algorithms. In Fig.7(a), the errors of SiamFC, KCF, SAKCF and ASAKCF begin to increase at the 35th, 50th, 50th and 1600th frames, respectively. In the first 50 frames, the center location error of all algorithms is small and the pedestrian is tracked normally. At the 50th frame, the pedestrian is occluded; ASAKCF, ALP and ALPDFE continue to track the pedestrian, while the other 3 algorithms produce large location errors. As tracking goes on, the center location error of ASAKCF also increases markedly at the 1600th frame because of similar interference in the background, which indicates the failure of pedestrian tracking. When there is occlusion or tracking drift, ALP uses the prediction mechanism of the Kalman filter to predict the pedestrian position, which maintains the continuity of tracking to some extent. Compared with ALP, ALPDFE uses not only the prediction mechanism but also the rematch strategy to reposition the pedestrian, so it obtains a low location error.
In Fig.7(b), SiamFC, KCF, ASAKCF and SAKCF have similar location error curves. These 4 algorithms track normally before the 175th frame, while the pedestrian is not occluded. After the 175th frame, when the pedestrian is completely occluded, they stay at the location where the occlusion began. As the movement of the pedestrian is approximately linear, their center location error increases linearly. After the pedestrian reappears, they cannot resume tracking because the pedestrian location after occlusion deviates greatly from that before occlusion. ALP and ALPDFE use the location prediction of the Kalman filter during occlusion, which reduces the center location error in the 175th-270th frames. However, ALP cannot rematch the pedestrian after a long period of occlusion, and the Kalman prediction is prone to large tracking error. In contrast, owing to the rematch strategy, ALPDFE reduces the center location error after the 270th frame, when the pedestrian reappears, and its tracking precision is effectively improved.
Fig.7 Center location error
(2) Precision and success rate of tracking
Tracking precision is the ratio of the number of frames whose center location error is within the error threshold to the total number of frames. Success rate is the ratio of the number of frames whose overlap exceeds the overlap threshold to the total number of frames. Fig.8 shows the normalized tracking precision curves and the success rate curves of the 6 algorithms on the test video. Fig.8(a) shows the precision curves. It can be seen that the tracking precision of ALPDFE is better than that of ALP. This illustrates the effectiveness of the rematch strategy in ALPDFE, because the only difference between ALPDFE and ALP is that ALP lacks the rematch step. As the error threshold increases, the precision curves of SiamFC, KCF, SAKCF, ASAKCF, ALP and ALPDFE gradually increase and then flatten out at the location thresholds of 10, 5, 5, 20, 10 and 10, respectively. SiamFC, KCF and SAKCF lack occlusion judgment and handling, which causes tracking failure or low tracking precision. ASAKCF can only deal with partial occlusion, and its tracking precision is lower than that of ALPDFE with the rematch strategy. Fig.8(b) shows the success rate curves of the 6 algorithms. The larger the overlap threshold, the more demanding the requirement for successful tracking. Therefore, the success rate curves of SiamFC, KCF, SAKCF, ASAKCF, ALP and ALPDFE are first flat and then decline from the overlap thresholds of 0.7, 0.7, 0.7, 0.3, 0.4 and 0.4, respectively. It can be seen in Fig.8(b) that the tracking success rate of ALPDFE is significantly better than that of the other 5 algorithms.
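Both measures can be computed from per-frame bounding boxes. The sketch below assumes boxes in the form (x, y, w, h) with (x, y) the top-left corner, and takes the overlap to be the intersection over union of the tracked box and the ground-truth box, as is usual in one pass evaluation. The function names and the toy boxes are illustrative.

```python
import numpy as np

def center_error(a, b):
    """Euclidean distance between the centers of boxes a and b."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return float(np.hypot(ax - bx, ay - by))

def overlap(a, b):
    """Intersection over union of boxes a and b."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(pred, gt, threshold):
    """Fraction of frames whose center location error is within the threshold."""
    errors = np.array([center_error(p, g) for p, g in zip(pred, gt)])
    return float(np.mean(errors <= threshold))

def success_rate(pred, gt, threshold):
    """Fraction of frames whose overlap exceeds the threshold."""
    overlaps = np.array([overlap(p, g) for p, g in zip(pred, gt)])
    return float(np.mean(overlaps > threshold))

# Toy usage: 2 frames, the second prediction is shifted 5 pixels to the right.
gt = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
pred = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0)]
p_at_4 = precision(pred, gt, 4.0)
s_at_half = success_rate(pred, gt, 0.5)
```

Sweeping the threshold over a range of values and plotting the results gives curves of the kind shown in Fig.8.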
Fig.8 Precision and success rate of tracking
To address pedestrian tracking under occlusion, state prediction and a rematch strategy are jointly considered in the design of the proposed algorithm, and a novel anti-occlusion pedestrian tracking algorithm based on location prediction and deep feature rematch is proposed. In ALPDFE, the pedestrian appearance features are used to determine whether the pedestrian is occluded. When occlusion occurs, the Kalman filter is used to predict the pedestrian position. When the pedestrian reappears, the pedestrian is matched and repositioned by using the YOLOv3 method. The simulation experiments and theoretical analysis show that ALPDFE improves the anti-occlusion performance of pedestrian tracking. In addition, SAKCF can be replaced by other existing trackers within the framework of ALPDFE, so ALPDFE has strong extensibility. In real pedestrian tracking scenes, more complex environmental characteristics need to be considered, such as the pedestrian density, rapid changes of pedestrian scale, and lighting changes in the surrounding environment. Therefore, designing an accurate and fast scale estimation method and enhancing the anti-interference capability of the pedestrian tracking model will be the focus of follow-up study.
High Technology Letters, 2020, Issue 4