Gensheng Hu, Min Li, Dong Liang, Mingzhu Wan and Wenxia Bao
(1.Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei 230039, China; 2.School of Electronics and Information Engineering, Anhui University, Hefei 230601, China; 3.School of Information Science and Technology, Fudan University, Shanghai 200433, China; 4.Anhui Key Laboratory of Polarization Imaging Detection Technology, Hefei 230031, China)
Abstract: A group activity recognition algorithm based on Cayley-Klein metric learning in the complex wavelet domain is proposed to improve recognition accuracy in video surveillance. The non-sampled dual-tree complex wavelet packet transform (NS-DTCWPT) is used to decompose human images in videos into multi-scale, multi-resolution subbands. An improved local binary pattern (ILBP) and an inner-distance shape context (IDSC) combined with a bag-of-words model are adopted to extract features from the decomposed high- and low-frequency coefficients. The coefficient features extracted from the training samples are used to optimize the Cayley-Klein metric matrix by solving a nonlinear optimization problem. Group activities in videos are then recognized with the resulting feature extraction and Cayley-Klein metric learning. Experimental results on the BEHAVE video set, the group activity video set, and a self-built video set show that the proposed algorithm achieves higher recognition accuracy than existing algorithms.
Key words: video surveillance; group activity recognition; non-sampled dual-tree complex wavelet packet transform (NS-DTCWPT); Cayley-Klein metric learning
Group activity refers to the relative motion of two or more individuals who interact with and depend on each other. Group activity recognition technology has been widely applied in public security, video surveillance and other fields. Because it must consider individual information, group information and scene understanding simultaneously[1-2], group activity recognition remains a difficult problem.
The results of individual activity recognition have been used in group activity recognition. Ref.[3] proposed a two-stage learning architecture for group activity recognition: the first stage uses LSTM models to analyze the activity of each individual, and the second stage aggregates the analysis results within the same scene to recognize the group activity. Ref.[4] used a multi-target tracking algorithm to track each individual in the whole image, clustered the output positions of the tracker into groups, and represented the groups' activities with a structural feature set. These methods build on individual activity recognition and can make full use of existing results, but detecting and recognizing individual activities is itself still a challenging problem.
Some researchers proposed group activity recognition algorithms that combine the features of individual and group activities. Ref.[5] used a group activity description vector (GADV), a trajectory description of the group and of the individuals within it, to analyze and identify group activities. By analyzing the trajectories of different individuals, Ref.[6] proposed two representative features of group interaction, an energy feature and an attraction-repulsion feature, to describe group activities in a group interaction area and to address the complexity and ambiguity caused by multiple human targets. Ref.[7] recognized the activities of different groups by extracting the corresponding individual activity features and automatically learning the latent components between different individuals. Refs.[8-9] presented deep neural-network-based hierarchical graphical models for individual and group activity recognition. Because tracking and describing individual trajectories are often affected by occlusion, these algorithms are mainly applicable to scenes where the crowd density is low and occlusion is not severe.
Group activity recognition can also achieve good results by extracting and learning image features of group activities. Ref.[10] learned different group activities with Random Forest, which, however, is not robust to noise. Ref.[11] proposed a semantics-based space-time descriptor that uses the co-occurrence of visual words in videos to discover the semantic structure of group motion. Ref.[12] proposed an automatic group activity identification method using kernel density estimation (KDE) modeling and machine learning classification. Ref.[13] considered the interaction between groups and presented an algorithm to recognize human group activities with multi-group causalities. Owing to scene diversity and mutual occlusion between people, extracting and describing group activity features is difficult, and the above algorithms suffer from high computational complexity and limited recognition accuracy.
Most of the above algorithms extract features in the time or spatial domain and then apply clustering or classification for group activity recognition. This paper presents a group activity recognition algorithm for videos that combines the non-sampled dual-tree complex wavelet packet transform (NS-DTCWPT) with Cayley-Klein metric learning. The dual-tree complex wavelet transform (DTCWT) offers approximate translation invariance, good directional selectivity, perfect reconstruction and limited redundancy. Beyond these advantages, NS-DTCWPT is completely translation invariant because no subsampling is performed after filtering, and it preserves details better because the wavelet packet transform is applied in both trees. The Cayley-Klein metric has an explicit expression similar to that of the Mahalanobis metric, while Cayley-Klein metric learning has a wider range of applications and better recognition performance than Euclidean or Mahalanobis metric learning. After the human images in videos are decomposed by NS-DTCWPT, high- and low-frequency subband coefficients are obtained. By extracting features from these coefficients and classifying them through Cayley-Klein metric learning, the accuracy of group activity recognition can be improved.
The discrete wavelet transform is shift-variant: a small translation of the input signal causes drastic changes in the transform coefficients. Moreover, the two-dimensional discrete wavelet transform can only capture signal details in the horizontal, vertical and diagonal directions[14]. DTCWT consists of two parallel wavelet trees, each applying a discrete wavelet transform to the signal; the outputs of the two trees form the real and imaginary parts of the complex coefficients. At each level, the two wavelet trees provide the relative signal delay required for multi-resolution analysis and jointly double the effective sampling rate, so that aliasing is suppressed and approximate translation invariance is realized[15-16]. Like the discrete wavelet transform, however, DTCWT cannot analyze the high-frequency parts of the signal accurately. The dual-tree complex wavelet packet transform (DTCWPT) further subdivides the high- and low-frequency subbands and obtains a more accurate frequency separation over the whole frequency band[17]. NS-DTCWPT additionally omits subsampling in both trees, so complete translation invariance is achieved after the signal is transformed by the discrete wavelet packet transform. The decomposition and reconstruction process of NS-DTCWPT is shown in Fig.1, where x(t) is the input signal and x̂(t) is the reconstructed output signal; f1 and f2 denote the high- and low-pass decomposition filters of the real tree, h1 and h0 those of the real tree at level two, f3 and f4 the high- and low-pass decomposition filters of the imaginary tree, and g1 and g0 those of the imaginary tree at level two; the corresponding reconstruction filters are indicated in Fig.1.
Fig.1 Schematic diagram of two-level NS-DTCWPT decomposition and reconstruction
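To make the non-sampled property of Fig.1 concrete, the following Python sketch decomposes an image with two undecimated wavelet trees using PyWavelets. It is a simplified stand-in rather than a full NS-DTCWPT: the wavelet pair ("db4", "sym4") is an arbitrary choice instead of a proper half-sample-delay filter pair, and the high-frequency bands are not further subdivided as a packet transform would require.

```python
# Illustrative sketch only: approximates the "non-sampled" behavior of
# NS-DTCWPT with PyWavelets' undecimated 2-D transform. A faithful
# NS-DTCWPT would use two trees with half-sample-delay filter pairs
# and wavelet-packet subdivision of the high-frequency bands.
import numpy as np
import pywt

def two_tree_undecimated(img, wavelets=("db4", "sym4"), level=3):
    """Decompose `img` with two undecimated wavelet trees.

    Returns a dict mapping wavelet name -> list of per-level coefficient
    tuples (cA, (cH, cV, cD)). All subbands keep the original image size,
    mimicking the translation invariance of the non-sampled transform.
    """
    return {name: pywt.swt2(img, name, level=level) for name in wavelets}

img = np.random.rand(256, 256)   # stand-in for a segmented human image
coeffs = two_tree_undecimated(img)
cA, (cH, cV, cD) = coeffs["db4"][0]
assert cA.shape == img.shape     # no subsampling: subbands match input size
```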
Distance metrics play a very important role in many applications of image understanding and pattern recognition. The most widely used metric, the Euclidean metric, treats the input sample space as isotropic and thus does not faithfully reflect the potential relationships among data samples. The Mahalanobis metric instead takes the correlations between the dimensional components of the data samples into account, treats those components non-equally, and performs better than the Euclidean metric in practice. However, being restricted to linear transformations, it cannot describe potential non-linear relationships among training samples, which limits its applications. The Cayley-Klein metric has an explicit expression similar to that of the Mahalanobis metric[18]. By training on samples, Cayley-Klein metric learning yields a fractional linear transformation that reflects the spatial structure or semantic information of the samples, which gives the Cayley-Klein metric better discriminative power and lets it model the potential relationships of training samples better. A fractional linear transformation can be regarded as a special nonlinear transformation; therefore, Cayley-Klein metric learning has a wider range of applications and better recognition performance than Mahalanobis metric learning.
Given an invertible symmetric matrix Ψ ∈ R^{(n+1)×(n+1)}, its associated bilinear form can be denoted as

\[ \psi_{xy} = \psi(x, y) = \hat{x}^{\mathrm{T}}\,\Psi\,\hat{y}, \qquad \hat{x} = (x^{\mathrm{T}}, 1)^{\mathrm{T}} \tag{1} \]
If Ψ is positive definite, the metric on E^n = {x ∈ R^n : ψ_xx > 0} can be defined as

\[ \rho_E(x, y) = k \arccos\frac{\psi_{xy}}{\sqrt{\psi_{xx}\,\psi_{yy}}} \tag{2} \]
If Ψ is indefinite, the metric on B^n = {x ∈ R^n : ψ_xx < 0} can be defined as

\[ \rho_H(x, y) = k\,\mathrm{arcosh}\frac{|\psi_{xy}|}{\sqrt{\psi_{xx}\,\psi_{yy}}} \tag{3} \]
Here (E^n, ρ_E) and (B^n, ρ_H) are called the elliptic geometric space and the hyperbolic geometric space respectively, k is a constant associated with Ψ, 1/k is the curvature of the elliptic geometric space, and −1/k is the curvature of the hyperbolic geometric space. Eqs.(2)(3) can be rewritten in the unified form

\[ d_{CK}(x, y) = \frac{k}{2}\left| \ln\frac{\psi_{xy} + \sqrt{\psi_{xy}^{2} - \psi_{xx}\psi_{yy}}}{\psi_{xy} - \sqrt{\psi_{xy}^{2} - \psi_{xx}\psi_{yy}}} \right| \tag{4} \]
Eq.(4) is defined as the Cayley-Klein metric, and it relies only on the symmetric matrix Ψ; that is, a given symmetric matrix determines a specific Cayley-Klein metric.
Cayley-Klein metric learning aims to find an optimal Cayley-Klein metric matrix from training samples under some learning criterion. The process therefore includes establishing the Cayley-Klein metric learning criterion according to the specific task and obtaining the optimal Cayley-Klein metric matrix by solving a nonlinear optimization problem. Denote

\[ \sigma(x_i, x_j) = \hat{x}_i^{\mathrm{T}}\, G\, \hat{x}_j, \qquad \hat{x} = (x^{\mathrm{T}}, 1)^{\mathrm{T}} \tag{5} \]
where G ∈ R^{(n+1)×(n+1)} is a symmetric positive definite matrix. The elliptic Cayley-Klein metric is then

\[ d_{CK}(x_i, x_j) = k \arccos\frac{\sigma(x_i, x_j)}{\sqrt{\sigma(x_i, x_i)\,\sigma(x_j, x_j)}} \tag{6} \]
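As a concrete illustration, the elliptic Cayley-Klein distance of Eqs.(5)(6) can be computed in a few lines of NumPy. This is a minimal sketch under the reconstruction above; the function name ck_distance and the default k = 1 are our choices, not notation from the paper.

```python
import numpy as np

def ck_distance(x, y, G, k=1.0):
    """Elliptic Cayley-Klein distance of Eq.(6) (sketch).

    x, y : 1-D feature vectors of length n
    G    : (n+1)x(n+1) symmetric positive definite matrix
    k    : scale constant of the metric
    """
    xh = np.append(x, 1.0)        # homogeneous coordinates (x^T, 1)^T
    yh = np.append(y, 1.0)
    s_xy = xh @ G @ yh            # sigma(x, y) of Eq.(5)
    s_xx = xh @ G @ xh
    s_yy = yh @ G @ yh
    u = np.clip(s_xy / np.sqrt(s_xx * s_yy), -1.0, 1.0)  # guard rounding
    return k * np.arccos(u)

# with G = I the distance reduces to a bounded, angle-like dissimilarity
x, y = np.array([0.2, 0.5]), np.array([0.3, 0.1])
print(ck_distance(x, y, np.eye(3)))
```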
Similar to the ν-support vector machine model, this paper formulates the following optimization model of Cayley-Klein metric learning, which minimizes the distances between similar pairs while maximizing the distances between dissimilar pairs:

\[ \begin{aligned} \min_{G,\, \beta \ge 0}\ & \sum_{j \to i} d_{CK}(x_i, x_j) - \nu\beta + \mu \sum_{j \to i,\, l \not\to i} \zeta_{ijl} \\ \mathrm{s.t.}\ & d_{CK}(x_i, x_l) - d_{CK}(x_i, x_j) \ge \beta - \zeta_{ijl},\quad \zeta_{ijl} \ge 0 \end{aligned} \tag{7} \]
where the notation j→i indicates that x_j and x_i form a similar pair, l↛i indicates that x_l and x_i form a dissimilar pair, and μ is a balance constant. The first term of the objective function penalizes large distances between similar pairs, the second term penalizes small distances between dissimilar pairs, and the third term penalizes classification errors. The proportion of misclassified similar pairs is controlled by the constant ν in the second term[19].
Suppose the Cholesky decomposition of G is G = L^T L with L ∈ R^{(n+1)×(n+1)}. Then L, instead of G, is optimized, which ensures the symmetry of G. Denote ζ_ijl(L, β) = [β + d_CK(x_i, x_j) − d_CK(x_i, x_l)]_+, where [z]_+ = z if z ≥ 0, and [z]_+ = 0 otherwise. Substituting the constraint of Eq.(7) into the objective function gives

\[ \varepsilon(L, \beta) = \sum_{j \to i} d_{CK}(x_i, x_j) - \nu\beta + \mu \sum_{j \to i,\, l \not\to i} \zeta_{ijl}(L, \beta) \tag{8} \]
By Eq.(5), the bilinear form entering d_CK can be written as a trace:

\[ \sigma(x_i, x_j) = \hat{x}_i^{\mathrm{T}} L^{\mathrm{T}} L\, \hat{x}_j = \mathrm{tr}(C_{ij} G) = \mathrm{tr}\!\left(C_{ij} L^{\mathrm{T}} L\right), \qquad C_{ij} = \hat{x}_j \hat{x}_i^{\mathrm{T}} \tag{9} \]
The gradient of Eq.(8) with respect to L at the t-th iteration is

\[ \frac{\partial \varepsilon}{\partial L} = \sum_{j \to i} \frac{\partial d_{CK}(x_i, x_j)}{\partial L} + \mu \sum_{\substack{j \to i,\, l \not\to i\\ \zeta_{ijl} > 0}} \left( \frac{\partial d_{CK}(x_i, x_j)}{\partial L} - \frac{\partial d_{CK}(x_i, x_l)}{\partial L} \right) \tag{10} \]

where, applying the chain rule to Eqs.(6)(9) and writing σ_ij = σ(x_i, x_j) for brevity,

\[ \frac{\partial d_{CK}(x_i, x_j)}{\partial L} = -\frac{k}{\sqrt{1 - u_{ij}^{2}}}\, \frac{\partial u_{ij}}{\partial L}, \qquad u_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} \tag{11} \]

\[ \frac{\partial u_{ij}}{\partial L} = \frac{1}{\sqrt{\sigma_{ii}\sigma_{jj}}} \left( \frac{\partial \sigma_{ij}}{\partial L} - \frac{\sigma_{ij}}{2\sigma_{ii}}\, \frac{\partial \sigma_{ii}}{\partial L} - \frac{\sigma_{ij}}{2\sigma_{jj}}\, \frac{\partial \sigma_{jj}}{\partial L} \right) \tag{12} \]

\[ \frac{\partial \sigma_{ij}}{\partial L} = L\left( C_{ij} + C_{ij}^{\mathrm{T}} \right) \tag{13} \]
To improve iterative efficiency, this paper uses the mini-batch stochastic gradient descent (SGD) method to solve the above optimization problem. At each iteration, only b samples are selected to update the gradient, where b is much smaller than the total number of samples. After convergence, the Cayley-Klein metric matrix is obtained from G = L^T L.
The process of solving the model of Eq.(7) for the Cayley-Klein metric matrix G by the mini-batch SGD method is given in Algorithm 1; an illustrative code sketch follows the algorithm.
Algorithm 1: Process of obtaining the Cayley-Klein metric matrix G
Input: training sample data, step size η
Output: Cayley-Klein metric matrix G
① Initialize G as a symmetric positive definite matrix (e.g., the identity matrix).
② Cholesky decompose G: G = L^T L.
③ Randomly select b samples and substitute them into Eq.(10) to obtain the gradient value on the mini-batch.
④ Update L along the negative gradient direction with step size η.
⑤ Repeat step ③ and step ④ until convergence or until the stop criterion is reached.
⑥ Return G = L^T L.
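The following Python sketch illustrates one mini-batch gradient step under the reconstruction of Eqs.(8)-(13) above. The triplet sampling scheme, the hyperparameter values, and the handling of β (held fixed within a step) are simplifying assumptions of ours, not details from the paper.

```python
import numpy as np

def sigma(L, xh_a, xh_b):
    # sigma(x_a, x_b) = tr(C_ab L^T L) with C_ab = xh_b xh_a^T (Eq.(9))
    return xh_a @ (L.T @ L) @ xh_b

def d_ck(L, xh_a, xh_b, k=1.0):
    # elliptic Cayley-Klein distance of Eq.(6)
    u = sigma(L, xh_a, xh_b) / np.sqrt(sigma(L, xh_a, xh_a) *
                                       sigma(L, xh_b, xh_b))
    return k * np.arccos(np.clip(u, -1.0, 1.0))

def grad_d_ck(L, xh_a, xh_b, k=1.0, eps=1e-12):
    # gradient of d_CK w.r.t. L via the chain rule of Eqs.(11)-(13)
    s_ab = sigma(L, xh_a, xh_b)
    s_aa = sigma(L, xh_a, xh_a)
    s_bb = sigma(L, xh_b, xh_b)

    def dsig(x, y):  # d sigma / dL = L (C + C^T), C = y x^T (Eq.(13))
        C = np.outer(y, x)
        return L @ (C + C.T)

    du = (dsig(xh_a, xh_b)
          - s_ab / (2.0 * s_aa) * dsig(xh_a, xh_a)
          - s_ab / (2.0 * s_bb) * dsig(xh_b, xh_b)) / np.sqrt(s_aa * s_bb)
    u2 = min(s_ab ** 2 / (s_aa * s_bb), 1.0 - eps)
    return -k / np.sqrt(1.0 - u2) * du           # Eq.(11)

def sgd_step(L, triplets, beta, mu=1.0, eta=0.01):
    """One mini-batch step of Eq.(10); `triplets` holds (xh_i, xh_j, xh_l)
    with x_j similar and x_l dissimilar to x_i, in homogeneous coordinates."""
    g = np.zeros_like(L)
    for xh_i, xh_j, xh_l in triplets:
        g += grad_d_ck(L, xh_i, xh_j)            # similar-pair term
        if beta + d_ck(L, xh_i, xh_j) - d_ck(L, xh_i, xh_l) > 0:
            g += mu * (grad_d_ck(L, xh_i, xh_j) - grad_d_ck(L, xh_i, xh_l))
    return L - eta * g                           # step ④ of Algorithm 1
```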
Because of the time-frequency localization and detail-preserving characteristics of NS-DTCWPT, group activities are recognized in the complex wavelet domain in this paper. Human images in video frames are decomposed by NS-DTCWPT into multi-scale, multi-resolution subbands. More decomposition levels give the subband images stronger directional information but larger rounding errors; three levels of decomposition are therefore used in this paper. At each level, every human image is decomposed into two low-frequency subband images and multiple high-frequency directional subband images. Since NS-DTCWPT does not subsample the subband images after transformation, all high- and low-frequency subband images have the same size as the original image. Different features are extracted from the high- and low-frequency subband images. The high-frequency directional subband images contain the details of the original images in different directions, so an improved local binary pattern (ILBP), which has rotation invariance and directional selectivity, is used to extract high-frequency coefficient features. The low-frequency subband images are low-frequency approximations of the original images, so the inner-distance shape context (IDSC) combined with bag of words (BOW) is adopted to extract low-frequency coefficient features. The high- and low-frequency coefficient features are cascaded to form the group activity features of each video frame.
Local binary pattern (LBP) can be used to describe the local texture features of images[20], but the traditional LBP is sensitive to the relative positions of the center pixel and its neighbor pixels. On the basis of the traditional LBP descriptor, an ILBP descriptor is constructed.
Each neighbor pixel of the center pixel in the high-frequency subband image block G(x, y) is taken as the current pixel in turn. Denote the current pixel value as f_i, and the mean and standard deviation of the neighbor pixels as f_μ and f_σ. If f_μ − f_i ≥ 0.5 f_σ, the current pixel value is set to 0; otherwise, it is set to 1. In this way, a binary sequence for the center pixel is obtained. A series of initial LBP values is then obtained by rotating the neighbor pixels, and the minimum is eventually taken as the LBP value of the center pixel[21]:
\[ V_L^{r} = \min\{ f_R(V_L, i) \mid i = 0, 1, \dots, 7 \} \tag{14} \]

\[ f_R(V_L, i) = \mathrm{ROR}(V_L, i) \tag{15} \]

where V_L is the 8-bit binary sequence of the center pixel and ROR(V_L, i) circularly shifts V_L by i bits.
According to the imaging principles of cameras, the distribution of pixels in images is non-uniform; thus the centroid C and the geometric center O of G(x, y) are generally not at the same position, and a vector D from O to C can be constructed. The angle θ between D and the X-axis is defined as the main direction of G(x, y) and is calculated as

\[ \theta = \arctan\frac{y_C - y_O}{x_C - x_O} \tag{16} \]
The ILBP feature (V_L^r, θ) is constituted by combining the rotation-invariant LBP feature with the main direction.
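The ILBP computation of Eqs.(14)-(16) can be sketched as follows for a single odd-sized subband block. The function name and the use of arctan2 (in place of the plain arctan of Eq.(16), to keep the correct quadrant) are our choices, and the block is assumed to have non-zero energy.

```python
import numpy as np

def ilbp_feature(block):
    """ILBP (V_L^r, theta) of the center pixel of an odd-sized block,
    following Eqs.(14)-(16); assumes the block has non-zero energy."""
    cy0, cx0 = block.shape[0] // 2, block.shape[1] // 2
    # the 8 neighbors of the center pixel, in circular order
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    nb = np.array([block[cy0 + dy, cx0 + dx] for dy, dx in offs], dtype=float)
    f_mu, f_sigma = nb.mean(), nb.std()
    # bit is 0 if f_mu - f_i >= 0.5*f_sigma, else 1
    bits = (f_mu - nb < 0.5 * f_sigma).astype(int)
    # rotation-invariant value: minimum over all circular shifts (Eqs.(14)(15))
    v_lr = min(int("".join(map(str, np.roll(bits, i))), 2) for i in range(8))
    # main direction: angle of vector from geometric center O to centroid C
    ys, xs = np.mgrid[0:block.shape[0], 0:block.shape[1]]
    m = block.sum() + 1e-12
    c_y, c_x = (ys * block).sum() / m, (xs * block).sum() / m
    theta = np.arctan2(c_y - cy0, c_x - cx0)     # Eq.(16), quadrant-aware
    return v_lr, theta
```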
Ref.[23] proposed the shape context (SC) method for describing shape contours. The SC descriptor is invariant to translation, scaling and rotation, and robust to small nonlinear deformations and local noise. However, the histogram extracted by the SC descriptor changes as the positions of articulated parts change, which weakens its ability to describe objects with articulated structures.
To overcome the drawbacks of the SC descriptor, Ref.[24] used the inner-distance shape context (IDSC) to measure the distance between two points on a contour. IDSC expresses global information as well as local information and is robust to targets with part articulation and non-rigid deformation. Although IDSC handles articulation better, it depends on the number of sampling points on the contour, and a large number of sampling points inevitably increases the complexity of the algorithm. In addition, the IDSC features of each video frame are the combination of the IDSC features of all sampling points, which cannot be used directly by the Cayley-Klein metric to classify group activities. In this paper, the IDSC features extracted from training samples are clustered with the K-means algorithm to create a visual dictionary, from which a quantized BOW vector is formed. The obtained visual dictionary is representative and effectively reduces the influence of the number and positions of contour sampling points on the recognition accuracy of group activities.
Steps for low frequency coefficient feature extraction are as follows.
Step 1 Using IDSC, extract the low-frequency coefficient features of the human body images in the training video frames.
Step 2 Cluster the extracted IDSC features with the K-means algorithm. Each cluster center is regarded as a visual word, and all the visual words are combined into a visual dictionary of capacity k.
Step 3 Extract IDSC features from the low-frequency coefficients of human body images in new video frames. Each IDSC feature is compared with the visual words in the dictionary and assigned to the most similar word.
Step 4 Count the frequency of each visual word and construct the BOW histogram feature of each frame image.
Step 5 Normalize the BOW of each frame to complete the quantization (a code sketch of these steps follows).
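Steps 1-5 map directly onto a clustering-plus-histogram routine. A minimal sketch with scikit-learn's KMeans follows; the array shapes and function names are our assumptions, and k = 30 matches the setting used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(idsc_train, k=30):
    """Cluster training IDSC descriptors into k visual words (Steps 1-2).

    idsc_train: (num_descriptors, dim) array of IDSC features pooled
    over all training frames; k follows the paper's setting k = 30.
    """
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(idsc_train)

def bow_vector(idsc_frame, kmeans):
    """Quantize one frame's IDSC descriptors into a normalized BOW
    histogram (Steps 3-5)."""
    words = kmeans.predict(idsc_frame)            # nearest visual word
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)            # L1 normalization
```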
Fig.2 shows the segmented images of four different activities, and Fig.3 shows the contours and marker points of the segmented images. Figs.4-7 show the IDSC features of different marker points of the four activities respectively. These figures indicate that the IDSC features used in this paper are well distinguishable.
Fig.2 Segmented images of different activities
Fig.3 Contour and marker points of different activity images
Fig.4 IDSC features of different marker points of talking activity image
Fig.5 IDSC features of different marker points of dancing activity image
Fig.6 IDSC features of different marker points of jogging activity image
Fig.7 IDSC features of different marker points of fighting activity image
Firstly, multi-scale and multi-directional decomposition of the human images in video frames is performed with NS-DTCWPT to obtain high- and low-frequency subband coefficients. Secondly, features of the high- and low-frequency coefficients are extracted by ILBP and by IDSC combined with BOW; cascading them yields the group activity feature vectors that constitute the training and testing sample sets. Thirdly, the Cayley-Klein metric matrix is obtained by solving the optimization model of Cayley-Klein metric learning. The Cayley-Klein metric values between testing and training samples are calculated, and the similarity between a testing sample sequence and a training sample sequence is evaluated by the minimum Cayley-Klein metric value. Finally, the feature vector of the human images in a new video frame is used to calculate Cayley-Klein metric values against the testing samples, and the group activities in video frames are thus recognized. The specific steps are as follows.
Step 1 Using NS-DTCWPT, decompose the human images in the training and testing video frames to obtain high- and low-frequency subband coefficients.
Step 2 Extract high- and low-frequency coefficient features with ILBP and with IDSC combined with BOW. Cascading the high- and low-frequency coefficient features yields the group activity feature vectors, which constitute the training and testing sample sets.
Step 3 Solve the constrained optimization problem of Eq.(7) on the training samples to obtain the Cayley-Klein metric matrices G_j, j = 1, 2, …, N, where N is the number of activity categories.
Step 4 Calculate the Cayley-Klein metric values between the testing samples and the training samples of each category under G_j.
Step 5 Evaluate the similarity between the testing sample sequence and each training sample sequence by the minimum Cayley-Klein metric value.
Step 6 Using

\[ j^{*} = \arg\min_{1 \le j \le N}\ \min_{x_t \in S_j} d_{CK}^{(j)}(x, x_t) \tag{17} \]

recognize the group activity of a new video frame as category j*, where x is the feature vector of the frame, S_j is the sample set of category j, and d_CK^(j) is the Cayley-Klein metric induced by G_j. A code sketch of this rule follows.
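Under the reconstruction of Eq.(17), recognition reduces to a nearest-sample, nearest-category search over the per-category metrics G_j. The sketch below restates the ck_distance routine given after Eq.(6) so it is self-contained; names and shapes are our assumptions.

```python
import numpy as np

def ck_distance(x, y, G, k=1.0):
    # elliptic Cayley-Klein distance (same as the sketch after Eq.(6))
    xh, yh = np.append(x, 1.0), np.append(y, 1.0)
    u = (xh @ G @ yh) / np.sqrt((xh @ G @ xh) * (yh @ G @ yh))
    return k * np.arccos(np.clip(u, -1.0, 1.0))

def recognize(frame_feature, class_samples, class_metrics, k=1.0):
    """Nearest-category rule of the reconstructed Eq.(17).

    class_samples : list of (m_j, n) arrays, sample features per category
    class_metrics : list of learned (n+1)x(n+1) matrices G_j
    Returns the index of the recognized activity category.
    """
    best_j, best_d = -1, np.inf
    for j, (samples, G) in enumerate(zip(class_samples, class_metrics)):
        d = min(ck_distance(frame_feature, s, G, k) for s in samples)
        if d < best_d:
            best_j, best_d = j, d
    return best_j
```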
The video sets used in this paper include the BEHAVE video set, the group activity video set and a self-built video set. Two types of group activities, walking together and fighting, are considered in the BEHAVE video set; each type of group activity involves 5 people, and three videos containing a total of 1 080 frames are selected. For every selected video, the first 2/3 of the frames constitute the training video and the remaining 1/3 the test video. The group activity in each frame is manually labeled as the reference. Five types of group activities, crossing, waiting, talking, dancing and jogging, are considered in the group activity video set; each type involves 4-5 people, and eight videos are selected as training and testing videos, split in the same 2/3-1/3 manner as the BEHAVE video set. The self-built video set consists of two fighting scenes with a total of 8 video clips; each clip contains 90 frames of group activities performed by 2-5 people. The videos of the first scene and the second scene are selected as training and testing videos respectively. In extracting IDSC features, the parameters are set as follows: number of contour sampling points N = 150, number of concentric circles M = 5, number of angle bins L = 12, and number of clustering centers k = 30.
The recognition accuracies of different methods on the BEHAVE and group activity video sets are shown in Tab.1 and Tab.2 respectively. As seen from Tab.1, the recognition accuracy of the proposed algorithm for the walking-together activity is higher than that of Ref.[13] and slightly lower than that of Ref.[4], while its accuracy for the fighting activity is higher than those of Ref.[4] and Ref.[13]. In Tab.2, the recognition accuracy of the proposed algorithm for the crossing and jogging activities is similar to that of Ref.[10] but higher than those of Ref.[7] and Ref.[11], and its accuracy for the waiting, talking and dancing activities is higher than that of all the other methods. Tab.3 shows the overall recognition accuracy of the deep learning method and the proposed method on the group activity video set. Compared with the cited methods, the proposed algorithm achieves higher recognition accuracy.
Tab.1 Recognition accuracy of different methods on the BEHAVE video set
Tab.2 Recognition accuracy of different methods on the group activity video set
Tab.3 General recognition accuracy of the deep learning method and the proposed method on the group activity video set
To verify the validity of applying NS-DTCWPT and Cayley-Klein metric learning to group activity recognition, the proposed algorithm is compared with an algorithm combining DTCWT with the Mahalanobis metric. Recognition results on the different video sets are shown in Figs.8-13, where the color boxes of red, blue, rose red, green, yellow, cyan and white represent the activities of fighting, walking together, dancing, talking, jogging, crossing and waiting respectively. The last image in each of Figs.8-13 shows the recognition result when Gaussian white noise is added to the BEHAVE video set, the group activity video set and the self-built video set respectively. Because NS-DTCWPT represents geometrical features such as edges and textures more effectively than DTCWT, and Cayley-Klein metric learning is more suitable than the Mahalanobis metric for modeling and measuring data with complex geometric structure, the proposed algorithm achieves better recognition results than the DTCWT-plus-Mahalanobis algorithm, as seen from Figs.8-13. The results on the noisy images in Figs.8-13 demonstrate the advantage of the complete translation invariance of NS-DTCWPT. Figs.14-15 give quantitative evaluations of the different algorithms using confusion matrices, where each off-diagonal element represents the probability that one type of group activity is recognized as another, and each diagonal element represents the probability of correct recognition. The average recognition accuracies of different feature extraction methods are shown in Tab.4. Figs.14-15 and Tab.4 verify that the proposed algorithm has better recognition accuracy.
Fig.8 Recognition results of the algorithm of DTCWT combined with Mahalanobis metric on behave video set
Fig.9 Recognition results of the proposed algorithm on behave video set
Fig.10 Recognition results of the algorithm of DTCWT combined with Mahalanobis metric on group activity video set
Fig.11 Recognition results of the proposed algorithm on group activity video set
Fig.12 Recognition results of the algorithm of DTCWT combined with Mahalanobis metric on self-built video set
Fig.13 Recognition results of the proposed algorithm on self-built video set
C—crossing; W—waiting; T—talking; D—dancing; J—jogging
Fig.14 Confusion matrix of different algorithms on group activity video set
Fig.15 Confusion matrix of different algorithms on behave video set
Tab.4 Average recognition accuracy by using different feature extraction methods

Data set                    ILBP+IDSC    LBP+SC
BEHAVE video set            92.35%       89.85%
Group activity video set    94.58%       91.22%
Self-built video set        82.425%      80.275%
Group activity recognition needs to consider individual information, group information and scene understanding simultaneously, and existing algorithms suffer from low recognition accuracy. This paper has presented a group activity recognition algorithm combining NS-DTCWPT with Cayley-Klein metric learning. NS-DTCWPT is translation invariant and detail preserving, and it effectively represents geometrical features such as edges and textures. Cayley-Klein metric learning yields a fractional linear transformation that reflects the spatial structure or semantic information of the samples, models data with complex geometric structure, and gives the Cayley-Klein metric good discriminative power. High- and low-frequency coefficients are obtained by decomposing human images in videos with NS-DTCWPT; by extracting features from these coefficients and classifying them with Cayley-Klein metric learning, group activities are recognized. Experimental results show that the proposed algorithm effectively improves the accuracy of group activity recognition.