Gaihua Wang, Meng Lü, Tao Li, Guoliang Yuan and Wenzhou Liu
(1.Hubei Collaborative Innovation Centre for High-Efficiency Utilization of Solar Energy, Hubei University of Technology, Wuhan 430068, China; 2.School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068, China)
Abstract: A novel convolutional neural network based on a spatial pyramid is proposed for image classification. The network exploits image features with a spatial pyramid representation: it first extracts global features from the original image, and then uses grids of different sizes to extract feature maps from different convolutional layers. Inspired by the spatial pyramid, the new network contains two parts. One part is a standard convolutional neural network composed of alternating convolution and subsampling layers. In the other, each of those convolutional layers is average-pooled on a grid to obtain feature maps, which are concatenated into individual feature vectors. Finally, those vectors are sequentially concatenated into a total feature vector that serves as the final input to the fully connected layer. The resulting feature vector benefits from both the last and the earlier convolutional layers, while the grid size, which adjusts the weight of the feature maps, improves the recognition efficiency of the network. Experimental results demonstrate that this model improves accuracy and applicability compared with the traditional model.
Key words: convolutional neural network; multiscale feature extraction; image classification
Image classification is one of the most important and widely applied research directions in the field of computer vision and artificial intelligence, with applications such as target recognition[1], object detection[2], geographic image analysis[3] and scene recognition[4]. Its goal is to divide images into predefined categories according to image attributes. Regular and classical algorithms such as the K-means clustering algorithm[5], local binary patterns (LBP)[6], histograms of oriented gradients (HOG)[7], principal component analysis (PCA)[8] and the scale-invariant feature transform (SIFT) have produced good results in image classification. However, they all rely on manually designed features, which is a drawback when processing massive data. Recently, the convolutional neural network (CNN) has become popular; it achieves good classification performance by learning weights to obtain features automatically[9-11]. First, as a bionic vision system inspired by the cat's visual system, it covers the whole field of view by tiling local receptive fields. Second, every convolutional layer, obtained directly from the images, shares its convolution kernels, and pooling layers decrease the image size.
Fig.1 LeNet-5
LeNet-5, introduced by Lecun et al.[12], is the first and most famous classical neural network model, built from alternating convolutional and pooling layers. The AlexNet model was proposed by Krizhevsky et al.[13] in 2012; it has 5 convolutional layers, about 650 000 neurons and 60 million trained parameters, far exceeding the LeNet-5 model in network scale. Furthermore, AlexNet uses the large image classification database ImageNet[14] as the training set and applies dropout[15] to reduce overfitting. Based on AlexNet, Simonyan[16] proposed the VGG network, which focuses on the depth of the CNN. VGG uses only 3×3 convolutional kernels and proves that increasing CNN depth can improve image classification accuracy. However, there is a limit to increasing depth, beyond which network degeneration occurs; therefore the best depth of VGG is 16-19 layers. Considering the problem of network degeneration, He et al.[17] argued that if every added layer is trained well, the loss should not increase as the network becomes deeper. The problem therefore indicates that not every layer in a deep network is well trained. He et al. put forward the ResNet structure, which maps feature maps directly from low-level to high-level layers by a shortcut connection. Although ResNet uses the same convolutional kernel size as VGG, it can be built into a 152-layer network once the degeneration problem is solved. Compared with VGG, ResNet has a lower training loss and a higher test accuracy. Szegedy et al.[18] paid more attention to reducing network complexity by improving the network structure, proposing a basic CNN module called Inception. The number of trainable parameters of GoogLeNet[19-20], structured with Inception, is one-twelfth that of AlexNet, but its image classification accuracy on ImageNet is around 10% higher than AlexNet's.
Springenberg et al.[19] questioned the down-sampling layer in CNNs and designed a fully convolutional network. In the current research landscape, most advances are made along two major lines: increasing the depth of the CNN and optimizing its structure. Spatial pyramid pooling[21] has shown very good results; its core idea is to pool the feature maps at several different sizes and then aggregate the results into one feature vector. Inspired by this, extracting features from different convolutional layers has also had some success[22-24].
In this paper, we propose a novel CNN based on a spatial pyramid for image classification, combining the spatial pyramid with the classical CNN. The main structure of our model is similar to that of other CNNs, but each convolutional layer is average-pooled on a grid to obtain feature maps, which are concatenated into individual feature vectors. Finally, those vectors are sequentially concatenated into a total feature vector, which is fed to the fully connected layer and the softmax classifier. In addition, the algorithm takes into account the effect of both the total feature vector and the convolutional layers during back-propagation, so the gradients of the grid-pooled convolutional layers are adjusted from two directions. Experimental results show that the proposed method is robust and achieves optimal results.
The convolutional neural network has been one of the best approaches to image processing. It is composed of four parts: an input layer, feature extraction, full connection and a classifier. Fig.1 shows the classical network architecture.
(1)
Fig. 2 Architecture of model
(2)
h_1 = ap{x_1}    (3)
Here, ap is the average-pooling function, which downsamples feature maps by taking the average value over each sub-region.
The remaining levels of feature maps are extracted recursively by performing convolution and average pooling on the feature maps from the preceding level
(4)
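The average-pooling step can be sketched in NumPy; the 2×2 non-overlapping window here is an illustrative assumption, since the actual pooling sizes in the paper depend on the layer:

```python
import numpy as np

def ap(x, size=2):
    """Average-pool a 2D feature map over non-overlapping size x size windows."""
    h, w = x.shape
    h_out, w_out = h // size, w // size
    # Crop to a multiple of the window size, then average each sub-region.
    x = x[:h_out * size, :w_out * size]
    return x.reshape(h_out, size, w_out, size).mean(axis=(1, 3))

x1 = np.arange(16, dtype=float).reshape(4, 4)
h1 = ap(x1)          # h1 = ap{x1}, as in Eq. (3)
print(h1.shape)      # (2, 2)
```

Each output value is the mean of one window, so a 4×4 map shrinks to 2×2.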
In this paper, the feature maps are obtained from every convolutional layer through average pooling with grids, and they are then aggregated into a total feature vector that is sent to the fully connected layers and the softmax classifier. Fig.2 describes the overall framework of our approach.
The model achieves image classification by using a spatial pyramid to perform feature extraction. There are three convolutional layers and two pooling layers in the first part of the model. We extract features with different pooling sizes from the three convolutional layers, combine them into one vector, and add a fully connected layer of 80 neurons and a softmax classifier to complete the network.
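As a sketch of how the total feature vector is assembled, each convolutional layer can be grid-pooled and the results concatenated. The channel counts and map sizes below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def grid_pool(fmap, grid):
    """Average-pool each channel of fmap (C, H, W) onto a grid x grid output,
    so the result has a fixed size regardless of the input resolution."""
    c, h, w = fmap.shape
    out = np.zeros((c, grid, grid))
    for i in range(grid):
        for j in range(grid):
            r0, r1 = i * h // grid, (i + 1) * h // grid
            c0, c1 = j * w // grid, (j + 1) * w // grid
            out[:, i, j] = fmap[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out.ravel()

# Hypothetical feature maps from three convolutional layers
# (channel counts 6, 12, 24 are assumptions, not the paper's values).
x1 = np.random.rand(6, 24, 24)
x2 = np.random.rand(12, 8, 8)
x3 = np.random.rand(24, 4, 4)

# 4x4, 2x2 and 1x1 grids, concatenated into one total feature vector.
feature = np.concatenate([grid_pool(x1, 4), grid_pool(x2, 2), grid_pool(x3, 1)])
print(feature.size)  # 6*16 + 12*4 + 24*1 = 168
```

Because each grid has a fixed number of cells, the concatenated vector length does not depend on the input image size.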
The feature extraction of our method is based on the spatial pyramid. It obtains images at different pixel scales by using a Gaussian function to smooth the image. Every pixel scale is divided into refined grids, features are extracted from each grid, and the features are combined into one large feature vector. The spatial pyramid captures the spatial information of the image by gathering statistics of the feature-point distribution at different resolutions. The Gaussian function used to obtain the different scales is shown as follows
G(i, j) = exp(−(i² + j²)/(2δ²)) / (2πδ²)    (5)
where (i, j) are the coordinates of an image pixel and δ is the scale coordinate, which determines the smoothness of the image. We extract features from the different convolutional layers by pooling at different scales with grids of 4×4, 2×2 and 1×1.
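Assuming the standard 2-D Gaussian G(i, j) = exp(−(i²+j²)/(2δ²))/(2πδ²), a discrete smoothing kernel can be sampled as follows (the window radius and the value of δ are illustrative):

```python
import numpy as np

def gaussian_kernel(delta, radius=2):
    """Sample G(i, j) = exp(-(i^2 + j^2) / (2 delta^2)) / (2 pi delta^2)
    on a (2*radius+1) x (2*radius+1) window and normalise it to sum to 1."""
    ax = np.arange(-radius, radius + 1)
    i, j = np.meshgrid(ax, ax, indexing="ij")
    g = np.exp(-(i**2 + j**2) / (2 * delta**2)) / (2 * np.pi * delta**2)
    return g / g.sum()

k = gaussian_kernel(delta=1.0)
print(k.shape)  # (5, 5)
# A larger delta smooths more strongly, giving a coarser pyramid scale.
```

Convolving an image with this kernel at several values of δ yields the smoothed images at the different scales of the pyramid.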
First, feature maps x1, x2, x3 are obtained from the traditional convolutional neural network. Then, the first convolutional layer is divided into a 4×4 grid, and one feature is obtained from each cell by average pooling, so the first convolutional layer becomes a 4×4 feature map p1. The pooling scale and stride change with the size of the input image. Now, as shown in Fig.3, we can get three maps p1, p2 and p3 by
p_l = ap{h_l}    (6)
Fig.3 Last feature extraction
In our method, the weights and biases are initialized by
(7)
where k_l is the convolution kernel of layer l; this keeps every w_ij between -1 and 1. Then, the output value of every layer is calculated by the forward-propagation formula, and the input image can be processed by our model.
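A minimal sketch of an initialisation that keeps every weight w_ij in [-1, 1], assuming a uniform random draw (the 5×5 kernel size is illustrative; Eq. (7)'s exact scaling may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_kernel(size=5):
    """Draw a convolution kernel whose every weight w_ij lies in [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=(size, size))

k1 = init_kernel()
print(k1.min() >= -1.0 and k1.max() <= 1.0)  # True
```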
Back-propagation starts from the last layer; the last-layer deviation δL is calculated from the image label y(i) and the output value
(8)
(9)
(10)
There are two gradient directions for the feature maps x1, x2. One comes from the last feature vector and the other from the following layer. Our method adds them together to adjust the weights and biases
(11)
(12)
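This two-direction rule can be checked on a toy example: a feature that feeds both the following layer and the final feature vector receives the sum of the gradients from the two paths. The quadratic losses below are illustrative, not the paper's loss:

```python
h = 2.0                     # a feature value used along two paths
# Path 1: h -> next layer          -> loss1 = (3*h)^2 / 2
# Path 2: h -> total feature vector -> loss2 = (h - 1)^2 / 2
grad_path1 = 3.0 * (3.0 * h)        # d loss1 / dh = 9h
grad_path2 = h - 1.0                # d loss2 / dh = h - 1
grad_total = grad_path1 + grad_path2

# Check against a numerical derivative of loss1 + loss2.
eps = 1e-6
loss = lambda v: (3 * v)**2 / 2 + (v - 1)**2 / 2
numeric = (loss(h + eps) - loss(h - eps)) / (2 * eps)
print(abs(grad_total - numeric) < 1e-4)  # True
```

Summing the two partial gradients is exactly what the chain rule prescribes when one variable contributes to the loss along two paths.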
In this section, two widely used methods are evaluated for comparison with our model. The first is the classical LeNet-5 network, which is very successful on the MNIST database; the other is a CNN whose feature-extraction part has two convolutional layers and two pooling layers. The performance of these methods is analyzed on three different public databases.
The MNIST handwritten digit database consists of 28×28 pixel gray images, each containing a digit 0-9 (10 classes). There are 60 000 training images and 10 000 test images in total. Without extra pre-processing, the image pixels are only divided by 255 so that they lie in the range [0, 1]. Tab.1 shows that the CNN achieves 98.15% and LeNet-5 achieves 99%, while our method achieves 99.08%. The learning rate of all methods is set to 1. Our method beats the others in this experiment. All the methods we test are the original networks; we do not use effective optimization techniques such as ReLU or dropout, so the method could reach an even higher accuracy in the future.
The CIFAR-10 database is composed of 10 classes of natural images split into 50 000 training images and 10 000 test images. Each image is a 32×32 pixel RGB image. For this database, we scale the pixels into the range [0, 1] and then convert the images to grayscale. In Tab.2, the CNN achieves 52.06% with the learning rate set to 0.1, while LeNet-5 gets 10% and cannot recognize the database. Compared with the CNN, LeNet-5 has one more pooling layer, and the last average-pooling layer may miss some features. Our method achieves 64.26% with the learning rate set to 0.5, the best among all methods. Although none of the three results is high, the accuracy of our method still exceeds the other two by a large margin.
Tab.1 Test set accuracy rate for MNIST
Tab.2 Test set accuracy rate for CIFAR-10
The vehicle database is composed of 64×64 pixel RGB images split into 13 491 training images and 1 349 test images, each containing a truck, car, bus or van (4 classes). For this database, we scale the pixels into the range [0, 1] and then convert the images to grayscale. In Tab.3, the CNN gets 52.34% with a learning rate of 0.1, while LeNet-5 gets 25% and cannot recognize the database; its result does not change when the learning rate is changed, which shows that LeNet-5's applicability is narrow. Our method achieves 79.26% with a learning rate of 1, the best accuracy among the three methods.
Tab.3 Test set accuracy rate for the vehicle
From Fig.4, the classical LeNet-5 network only performs well on the MNIST database and poorly on the others; obviously it has a narrow scope of application. The CNN also performs well on MNIST and behaves better than LeNet-5 on the other databases, but its accuracy there is only about 50%. According to the tests, our method performs best on all the databases. Its accuracy reaches 99.08% on MNIST, so our model could certainly be used in the handwriting field. Moreover, it reaches 64.26% on CIFAR-10; although not a satisfactory result, it still shows that our method has a certain recognition capability on that database. On the vehicle database, our method almost reaches 80% accuracy, which is already an efficient result.
Fig.4 Comparisons of LeNet-5, CNN and our method’s accuracies for MNIST, CIFAR-10 and the vehicle databases
By comparison, our method can extract features from the three different databases effectively. There are two reasons for this. First, the convolutional layer we add improves feature extraction. Second, the way we extract feature maps from convolutional layers at different levels by grid pooling plays an important role: the final features obtained by this extraction method contain information of different depths. Thus our method achieves the best results in these database tests.
In this work, a novel CNN based on a spatial pyramid is proposed for image classification. The spatial pyramid and spatial pyramid pooling are introduced to explain our method. The model is entirely new; none of its parameters were pre-trained. It extracts features from every convolutional layer and prevents the loss of important features during convolutional extraction, and the algorithm is robust. Moreover, the grid-based feature vector can adjust the weight of each convolutional layer's features and ensures that the feature vectors have a fixed length even when input images are of different sizes. Finally, the adjustment process takes the gradient effects of two different directions into account, which adjusts the network more accurately and yields an optimal result. The experiments show that our method works well: it improves the network accuracy and makes the network applicable to more databases.
There are several research directions for further improving the proposed network. In this paper we tested a three-convolutional-layer network and obtained a good result. In the future we plan to change the pooling method and activation function, and to apply our approach to other neural networks such as AlexNet and VGG, in order to find better depths of convolutional layers and better grid sizes.
Journal of Beijing Institute of Technology, 2018, No. 4