SHOJANAZERI Hamid, TENG Shyh Wei, ARYAL Sunil, ZHANG Dengsheng, LU Guojun, LIU Ying
(1. School of Engineering, Information Technology and Physical Sciences, Federation University, VIC 3824, Australia; 2. School of Information Technology, Deakin University, VIC 3125, Australia; 3. Center for Image and Information Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China)
Abstract: Similarity measure is an important component and research topic in image classification and retrieval. Given a type of image feature, a good similarity measure should be able to retrieve similar images from the database while discarding irrelevant images from the retrieval. Similarity measures in the literature are typically distance based; they measure the spatial distance between two feature vectors in a high-dimensional feature space. However, this type of similarity measure has no perceptual meaning and ignores the influence of neighbouring data in the similarity decision. In this paper, we propose a novel perceptual similarity measure which captures both the distance and the perceptual similarity between two image features in feature space. Our results show that the proposed similarity measure achieves a significant improvement over the traditional distance-based similarity measures most commonly used in the literature.
Keywords: image retrieval; image classification; similarity measure; data-dependent similarity measure
Similarity measurement (SM) is an essential component of many pattern recognition and computer vision applications, including image classification, registration and retrieval [1-4]. The performance of these applications often relies on an effective SM. Many SMs have been proposed in the literature; they are typically distance-based similarity measures (DSMs) that compute the geometric proximity between two feature vectors in feature space, including the Minkowski, cosine, histogram intersection, quadratic, Mahalanobis [5], covariance [6], scene graph similarity [7], co-occurrence pattern [8] and dictionary similarity [9] measures. Given a type of image feature, a good similarity measure should be able to retrieve similar images from the database while discarding irrelevant images from the retrieval. However, due to imperfect features, DSMs often return a high number of irrelevant images among the top K retrieved images while failing to retrieve sufficient relevant images from the database. In other words, DSMs are usually not robust.
A DSM basically measures a pairwise distance between two feature vectors in feature space without considering neighbouring data. This is equivalent to comparing two images or objects in fine detail by matching them feature by feature without referring to other neighbouring images. This kind of image/object matching is counter-intuitive to how human beings compare images/objects: humans tend to interpret the similarity between two objects by also referring to neighbouring objects. This phenomenon of pattern recognition by reference has recently been captured by Aryal et al. [10-11], who found that the perceptual similarity between two instances is affected by the distribution of neighbouring data. Specifically, two instances in a relatively dense neighbourhood are perceived as less similar than they would be in a relatively sparse neighbourhood.
There are many such examples in the physical world. For example, a red pear in a basket of green pears would be perceived as more similar to a red apple than to the green pears, because red objects in the basket are rare. Another example: when we see a few different trees standing on a grassed area such as a golf course, we tend to view them broadly as trees, as in a low-resolution image. However, if we are surrounded by trees, as in a park, we tend to distinguish them as different trees, as in a high-resolution image.
This kind of perceptual blurring and discerning is also demonstrated by a well-known phenomenon in ethnically mixed populations. For example, to local Africans, white Englishmen in Africa may all look alike because white men in Africa are rare.
Neighbourhood reference is thus a powerful mechanism that human beings use for pattern recognition; this perceptual mechanism, however, has so far been ignored by conventional DSMs.
Based on this idea, Aryal et al. proposed a data-dependent similarity measure, called the mass-based similarity measure or mp, in contrast with the traditional lp distance. The idea is to use the data mass of the neighbourhood surrounding the two instances under consideration to replace the difference term in the Minkowski-form distance. The data mass of a neighbourhood is essentially a coarse measure which is more robust than DSMs. However, by relying purely on the neighbourhood data mass and ignoring the spatial distance between the two instances under consideration, mp loses accuracy.
Therefore, in this paper, we propose a novel perceptual SM (PSM) which incorporates both the geometric distance and the region density between two data instances. The new hybrid SM overcomes the limitations of both lp and mp. Furthermore, because the hybrid SM combines coarse and fine measures, it is more robust than traditional SMs. We tested the proposed PSM on three standard benchmark databases and compared its retrieval results with both lp and mp. Our results show that the PSM consistently outperforms both lp and mp.
The rest of the paper is organized as follows. In Section 2, we briefly review related work in the literature. Section 3 presents our proposed PSM. Section 4 discusses the experimental setup and results. The paper is concluded in Section 5.
In this section, we discuss related work in the literature and its limitations. The two types of work related to our research are the Minkowski-form distance and mass-based similarity.
Given two feature vectors in a d-dimensional feature space, x = (x1, x2, ..., xd) and y = (y1, y2, ..., yd), where xi denotes the i-th feature value of x, the Minkowski-form distance is given by

$$ l_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^{p} \right)^{1/p} \qquad (1) $$

Its two most commonly used special cases are the city-block (l1) and Euclidean (l2) distances:

$$ l_1(x, y) = \sum_{i=1}^{d} |x_i - y_i| \qquad (2) $$

$$ l_2(x, y) = \sqrt{ \sum_{i=1}^{d} (x_i - y_i)^{2} } \qquad (3) $$
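As a concrete illustration of Equations (1)-(3), the following minimal NumPy sketch computes the Minkowski-form distance for an arbitrary p; the example vectors are illustrative only and are not taken from the paper.

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski-form (lp) distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# l1 (city-block) and l2 (Euclidean) are the special cases p = 1 and p = 2.
x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
print(minkowski_distance(x, y, p=1))  # 0.2
print(minkowski_distance(x, y, p=2))  # ~0.1414
```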
l1 and l2 are among the best and most widely used similarity measures in the literature [5,12]. However, because image features usually have very high dimensionality and are imperfect, this kind of detailed feature-by-feature matching can produce undesirable outcomes in many situations. For example, two completely different images can have the same feature vector, as shown in Figure 1, and two similar images can have very different feature vectors, as shown in Figure 2. In both cases, lp-based SMs would give a totally incorrect matching result.
Figure 1. The two different images have the same histogram
Figure 2. Two similar images have very different histograms
Another issue associated with lp-based SMs is that the lp sum is often dominated by one or a few dimensions with large values. This may result in a situation where dimensions with smaller values do not contribute proportionally to the similarity calculation between two instances.
These issues demonstrate that lp-based SMs are not robust. The limitations can be overcome by incorporating neighbouring data in the decision-making process. Therefore, our aim in this research is to explore ways of improving lp, or of making use of lp to improve other promising SMs.
Gowda and Krishna [13] used the mutual neighbourhood for image clustering. They first computed the 5 nearest neighbours of each data instance using a conventional distance measure such as lp. Then, a mutual neighbourhood value (MNV) is computed for each pair of data points xi and xj: if xi is the m-th nearest neighbour of xj and xj is the n-th nearest neighbour of xi, then MNV(xi, xj) = m + n. During clustering, data points with both the minimum MNV and the minimum distance are merged into one cluster.
Jarvis and Patrick [14] used shared nearest neighbours (SNN) to cluster image data. To estimate the SNN between two data instances, the N nearest neighbours of each data point are first calculated. The similarity between the two data instances is then determined by the number of shared nearest neighbours between them: the greater the number of SNN, the higher the confidence in their similarity.
Although both of the above methods take the neighbourhood into consideration, only a small number of neighbours are used, and they are still geometric similarity measures because the nearest neighbours are all determined by geometric distance.
A more powerful use of neighbouring data for similarity determination was proposed by Aryal et al. [10-11], who introduced a mass-based similarity measure, mp. The idea is to make a similarity decision collectively using neighbourhood data instead of basing it on the two instances alone or on a few nearest neighbours. Specifically, they propose to use the neighbourhood data mass in each subspace of the d-dimensional feature space to replace the per-dimension difference in the Minkowski-form distance.
The idea of mp is based on the distance-density model described by Krumhansl [15] and on the psychological finding that two instances in a sparse region are perceived as more similar than they would be in a dense region.
Specifically, the mass-based SM mp is defined as [16]

$$ m_p(x, y) = \left( \frac{1}{d} \sum_{i=1}^{d} \left( \frac{|R_i(x, y)|}{N} \right)^{p} \right)^{1/p} \qquad (4) $$

where R_i(x, y) is the region in dimension i that covers x_i and y_i, |R_i(x, y)| is its data mass, i.e., the number of data instances falling in that region, and N is the total number of instances in the dataset.
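The following sketch illustrates Equation (4) under the assumption, following Aryal et al., that the region R_i(x, y) in dimension i is the interval [min(x_i, y_i), max(x_i, y_i)] and that its mass is counted directly over the dataset; the authors' actual implementation may estimate the mass differently (e.g., with trees or binning).

```python
import numpy as np

def mp_dissimilarity(x, y, data, p=2):
    """Mass-based dissimilarity (Equation 4). The region in dimension i is
    assumed to be the interval [min(x_i, y_i), max(x_i, y_i)]; its mass is
    the number of instances in `data` (shape: n x d) falling inside it."""
    x, y, data = np.asarray(x, float), np.asarray(y, float), np.asarray(data, float)
    n, d = data.shape
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    # per-dimension data mass, normalised by the dataset size
    mass = np.array([np.sum((data[:, i] >= lo[i]) & (data[:, i] <= hi[i]))
                     for i in range(d)]) / n
    return float(np.mean(mass ** p) ** (1.0 / p))
```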
Referring to Figure 3, while the geometric distances between the two pairs of data points are the same, the m1 distances are 0.40 for (x, y) and 0.55 for (y, z) respectively, which means y is perceptually more similar to x than to z. This is reasonable: in an actual image analysis and classification scenario, x and y are likely to be grouped into the same cluster, while y and z are likely to fall into different clusters. It can be seen that the mass-based distance is useful in image retrieval.
Figure 3. mp dimension calculation between two data points
Based on the above analysis, it is understood that lp is essentially a geometric similarity measure between two instances. Because it is computed by feature-by-feature matching and is based on the two instances alone, it is not robust; it can produce completely incorrect matches in cases such as those shown in Figures 1 and 2. mp, in contrast, is essentially a mass-based similarity measure, because it is computed from the collective information of the neighbourhood data mass. Consequently, mp can be inaccurate when the features of the two instances are close but the instances have many neighbours.
To overcome the limitations of both lp and mp, we propose to incorporate both in a new SM. The idea is to use mp to assist lp in making better decisions in situations like those shown in Figures 1 and 2. Specifically, we propose two variants of the perceptual SM, called PSM1p and PSM2p, which are described in the following.
The first variant of the new similarity measure we propose is called Perceptual Similarity Measure 1 (PSM1p). It uses the lp distance as a weight for the mass-based similarity in situations where the latter may fail to retrieve accurate results. Generally, when we calculate the similarity between two instances using mp, one of the following four cases may occur.
Case 1: mp is small (the data instances are in a sparse region) and lp is also small.
Case 2: mp is small (the data instances are in a sparse region), but lp is large.
Case 3: mp is large (the data instances are in a dense region) and lp is also large.
Case 4: mp is large (the data instances are in a dense region), but lp is small.
In Cases 1 and 3, mp and lp are not in conflict. In Cases 2 and 4, however, their measurements oppose each other; such cases can arise from missing data or extremely skewed data distributions, and using mp alone in these cases may not be effective. For example, in Case 2, mp indicates that the two instances are similar (small mp value) while lp finds them highly dissimilar. The lp measurement of similarity in this case may not align well with how humans perceive similarity. Both mp and lp measure similarity from a perspective that partially complies with human perception. Therefore, they should not be in extreme conflict; rather, they should complement each other. Because the four cases above are defined in terms of sparse/dense regions and small/large distances, these notions need to be made precise. A threshold is defined in each dimension to identify a sparse/dense region and a small/large distance: the threshold is the mid-point of the minimum and maximum values of the data mass (respectively, the distance) between a query and all other data instances in the dataset.
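The following is a minimal sketch of how the per-dimension thresholds and the four cases could be computed, assuming the mid-point rule described above; the variable and function names are ours.

```python
import numpy as np

def midpoint_thresholds(values):
    """Per-dimension mid-point between the minimum and maximum of `values`
    (rows: comparisons of the query with dataset instances, columns: dimensions)."""
    values = np.asarray(values, dtype=float)
    return (values.min(axis=0) + values.max(axis=0)) / 2.0

def case_labels(dist_i, mass_i, dist_thresh, mass_thresh):
    """Label each dimension with one of the four cases, assuming 'small'/'sparse'
    means at or below the threshold and 'large'/'dense' means above it."""
    dist_i, mass_i = np.asarray(dist_i, float), np.asarray(mass_i, float)
    small_dist = dist_i <= dist_thresh
    sparse = mass_i <= mass_thresh
    cases = np.empty(len(dist_i), dtype=int)
    cases[sparse & small_dist] = 1    # Case 1: sparse region, small distance
    cases[sparse & ~small_dist] = 2   # Case 2: sparse region, large distance
    cases[~sparse & ~small_dist] = 3  # Case 3: dense region, large distance
    cases[~sparse & small_dist] = 4   # Case 4: dense region, small distance
    return cases
```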
To address the conflicts in Cases 2 and 4, we use the distance between the two instances to weight the data mass in the dimensions where these cases occur. The weighted mp is defined as
$$ m_p^{w}(x, y) = \left( \frac{1}{d} \sum_{i=1}^{d} \left( W_i \cdot \frac{|R_i(x, y)|}{N} \right)^{p} \right)^{1/p} \qquad (5) $$
where, in Case 2, when the data mass is low but the lp distance is large, the weight Wi is set to the distance between the two instances in that dimension: Wi = |xi - yi|. In this case, a higher weight Wi is applied to the data mass to moderate the over-estimated similarity implied by the low data mass.
Therefore, PSM1p uses the conventional mp term in Cases 1 and 3 and the weighted mp term in Cases 2 and 4, as follows:
$$ \mathrm{PSM1}_p(x, y) = \left( \frac{1}{d} \sum_{i=1}^{d} s_i^{\,p} \right)^{1/p}, \quad s_i = \begin{cases} \dfrac{|R_i(x, y)|}{N}, & \text{Cases 1 and 3} \\ W_i \cdot \dfrac{|R_i(x, y)|}{N}, & \text{Cases 2 and 4} \end{cases} \qquad (6) $$
In PSM1p, mp is the basis of the similarity calculation and lp is used as a weight in Cases 2 and 4. Cases 2 and 4 may occur in only some dimensions of the feature vectors; by using the weighted mp, PSM1p effectively improves the similarity measurement in those dimensions. PSM1p is essentially a hybrid SM: it overcomes the sensitivity issue of lp while preserving its accuracy.
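To make the combination concrete, the following sketch computes PSM1p from per-dimension distances and data masses. It applies the distance weight Wi = |xi - yi| in both Case 2 and Case 4 (the text above only spells out Case 2), so it should be read as our interpretation of Equation (6) rather than a reference implementation.

```python
import numpy as np

def psm1(dist_i, mass_i, dist_thresh, mass_thresh, p=2):
    """PSM1p sketch (Equation 6): `dist_i` holds per-dimension distances
    |x_i - y_i|, `mass_i` the normalised per-dimension data masses.
    The conventional mass term is used in Cases 1 and 3, and the
    distance-weighted mass in the conflicting Cases 2 and 4."""
    dist_i, mass_i = np.asarray(dist_i, float), np.asarray(mass_i, float)
    small_dist = dist_i <= dist_thresh
    sparse = mass_i <= mass_thresh
    conflict = sparse ^ small_dist            # True in Cases 2 and 4
    term = np.where(conflict, dist_i * mass_i, mass_i)
    return float(np.mean(term ** p) ** (1.0 / p))
```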
Because image features usually have very high dimensionality (e.g., 512 and 1024 dimensions are common), the data distribution in high-dimensional space can be very skewed. Therefore, the data mass in each subspace can vary dramatically. Figure 4 shows an example of the data mass histogram computed from the neighbourhoods of two instances in a 90-dimensional space.
Figure 4. Data masses between two feature vectors.
To prevent the PSM from being disproportionately influenced by a few dominant dimensions, we apply a transform to the per-dimension data mass before computing the PSM. The modified PSM is given as
$$ \mathrm{PSM2}_p(x, y) = \left( \frac{1}{d} \sum_{i=1}^{d} \left( |x_i - y_i| \cdot T\!\left( \frac{|R_i(x, y)|}{N} \right) \right)^{p} \right)^{1/p} \qquad (7) $$
where T() is a transform function, typically a non-linear function such as the square root or the logarithm.
The logarithmic transform is an established method for dealing with highly skewed data distributions [17-19]. It changes a highly skewed distribution into one closer to normal and draws out the small values. As mentioned, we aim to use the data mass as a weight for the geometric distance, and the log transform balances the contribution of very high and/or very low data masses in some dimensions to the overall similarity. The log2-transformed data mass histogram from Figure 4 is shown in Figure 5.
Figure 5. The Log transformed data masses
In this section, we design two experiments to test the performance of the two PSMs. In the first experiment, we compare lp and mp in order to understand the strengths and weaknesses of the two SMs. In the second experiment, we compare the two proposed PSMs against both lp and mp in order to identify the SM with the best performance.
To evaluate the similarity between two images, PSM1p works as follows. With the dataset and query images represented by their feature vectors, PSM1p first calculates, in each dimension of the feature space, the distance and data mass between the query and every image in the dataset. A threshold is then defined as the mid-point between the minimum and maximum of the distance (respectively, the data mass) between the query and all dataset images. PSM1p then checks the distance and data mass against these thresholds. If both the distance and the data mass fall below, or both fall above, their thresholds, PSM1p uses the conventional mp as the similarity measure; otherwise, it uses the weighted mp defined in Equation (6) to calculate the similarity between the two images. Finally, it aggregates the similarities from all dimensions of the feature space to obtain the overall similarity.
PSM2p considers the effect of region density on the perceived similarity by weighting the transformed density with the lp distance in every dimension. Unlike PSM1p, PSM2p does not require any threshold.
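A corresponding sketch of PSM2p is given below. It assumes T is the log2 transform discussed above, applied as log2(1 + mass) so that zero masses are handled, and that the per-dimension term is the distance multiplied by the transformed data mass; this is our reading of Equation (7) rather than the authors' code.

```python
import numpy as np

def psm2(dist_i, mass_i, p=2):
    """PSM2p sketch (Equation 7): the log2-transformed per-dimension data mass
    is weighted by the per-dimension distance in every dimension; no thresholds
    are involved."""
    dist_i, mass_i = np.asarray(dist_i, float), np.asarray(mass_i, float)
    term = dist_i * np.log2(1.0 + mass_i)
    return float(np.mean(term ** p) ** (1.0 / p))
```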
The first benchmark dataset used in this work is a shopping database from eBay [20], which has a ground truth based on color. The key information for the database (DB) is as follows.
· The DB has 11 classes.
· Each class is characterized by one color.
· The DB has four categories of objects.
· Each object has 12 images.
· There are 48 images in each class.
· Total number of images in the DB is 528.
The ground truth (class label) is the primary colour in an image, and objects of the target colour are segmented in the dataset images. Target images containing objects with the same colour as the query image are considered relevant. Figure 6 shows four sample images from the eBay dataset along with their mask images (right-hand side), which segment the primary-colour objects in the original images.
Figure 6. Sample images from the eBay dataset and their segmented objects on the right hand side.
The second benchmark database is a texture database used in [21]. It has 1,000 images categorised into 25 classes, each with 40 images. Each class represents a different texture, such as wood, wallpaper, water and brick.
Figure 7 shows a sample of this dataset from the brick class.
Figure 7. Sample images in the brick class of the texture dataset
The third benchmark database is the Corel dataset used in [22]. This dataset is a collection of 1,000 images categorised into 10 classes. The images are a mixture of objects and natural scenes. Some example classes in this dataset are beach, mountain, flower and bus.
Figure 8 shows a sample of images in the beach class.
Figure 8. Sample images in the beach class of the Corel dataset
Three types of features are used in our experiments: color histogram, LBP and SIFT features.
Color features. For the eBay dataset, because the object colors are homogeneous and are available after segmentation, color features are used to measure image similarity. Specifically, an HSV color histogram is used as the image representation. First, image colors are converted from RGB space to HSV space. Next, each of the three components H, S and V is quantized into 30 bins. Finally, each image is represented as a 30 + 30 + 30 = 90-dimensional histogram.
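A minimal OpenCV sketch of this color feature extraction follows; the unit-sum normalisation and the 8-bit HSV channel ranges are assumptions on our part rather than details stated in the paper.

```python
import cv2
import numpy as np

def hsv_histogram(image_bgr, bins=30):
    """90-dimensional HSV histogram: 30 bins per channel, concatenated.
    OpenCV stores H in [0, 180) and S, V in [0, 256) for 8-bit images."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    s = cv2.calcHist([hsv], [1], None, [bins], [0, 256]).ravel()
    v = cv2.calcHist([hsv], [2], None, [bins], [0, 256]).ravel()
    hist = np.concatenate([h, s, v])
    return hist / hist.sum()  # normalise to unit mass (our assumption)
```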
Texture features. For the texture dataset, LBP features are used due to their simplicity and capability of capturing local structure. The LBP features are extracted from an image using the following procedure.
· Divide the image into 3 × 3 blocks.
· For each pixel in a block, compare the pixel to the center pixel.
· Neighbours in a block with a value greater than that of the center pixel are assigned 1, and 0 otherwise. As a result, each block is described by 8 binary digits.
· Compute a histogram over each block, counting the frequency of the occurring binary values.
· Concatenate the histograms of all blocks to obtain the image-level features.
We extracted the LBP features from the texture dataset, resulting in feature vectors of 256 dimensions.
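A compact sketch of this extraction using scikit-image's LBP implementation is given below. It produces a single 256-bin image-level histogram, which matches the stated 256 dimensions; the authors' block-wise concatenation may differ in detail.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image):
    """256-dimensional LBP histogram: each pixel is coded by comparing its
    8 neighbours (3 x 3 neighbourhood) against the centre pixel, giving an
    8-bit code; the histogram counts the frequency of the 256 codes."""
    codes = local_binary_pattern(gray_image, P=8, R=1, method='default')
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()
```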
SIFT features. For the Corel dataset, SIFT features are used due to the non-homogeneous nature of the images. To obtain image-level features using dense SIFT, we go through an encoding process equivalent to bag of features (BOW) or feature quantization. The BOW procedure is as follows (a code sketch is given after the list):
· Extract SIFT features from the images.
· Quantize the features into N visual words (clusters) using k-means clustering, with N = 100.
· Build the BOW histogram by counting the frequency of occurrence of each visual word in the image.
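The following bag-of-words sketch uses OpenCV SIFT and scikit-learn k-means; for brevity it relies on keypoint-based rather than dense SIFT sampling, so it should be read as an approximation of the pipeline described above.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(images_gray, n_words=100):
    """Bag-of-words sketch: SIFT descriptors from all images are clustered into
    n_words visual words with k-means; each image then becomes a normalised
    histogram of visual-word frequencies."""
    sift = cv2.SIFT_create()
    per_image = []
    for img in images_gray:
        _, des = sift.detectAndCompute(img, None)
        per_image.append(des if des is not None else np.empty((0, 128), np.float32))
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    codebook.fit(np.vstack(per_image))
    histograms = []
    for des in per_image:
        words = codebook.predict(des) if len(des) else np.empty(0, dtype=int)
        hist, _ = np.histogram(words, bins=n_words, range=(0, n_words))
        histograms.append(hist / max(hist.sum(), 1))
    return np.asarray(histograms)
```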
The precision and recall (PR) pair is used to evaluate the performance of image retrieval in our work. In an information retrieval task where an instance can be relevant or non-relevant, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved [23]. The further a PR curve lies from the origin, the better the retrieval performance of the method it represents.
$$ \text{precision} = \frac{r}{n_1} \qquad (8) $$

$$ \text{recall} = \frac{r}{n_2} \qquad (9) $$
where r is the number of relevant retrieved images, n1 is the number of retrieved images, and n2 is the number of relevant images in the DB.
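For completeness, a small sketch of Equations (8) and (9) for a single query is given below; the identifiers are ours.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall (Equations 8 and 9) for one query.
    `retrieved`: ranked list of returned image ids;
    `relevant`: collection of ground-truth relevant image ids."""
    r = len(set(retrieved) & set(relevant))
    precision = r / len(retrieved) if retrieved else 0.0
    recall = r / len(relevant) if relevant else 0.0
    return precision, recall
```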
In this section, we test the retrieval performance of both lp and mp on the three datasets described above, with p set to 2. For each dataset, every image I is used as a query and its precision-recall curve is calculated. The final retrieval result for the dataset is the average of the PR curves over all query images in the dataset.
The results are shown in Figures 9-11. It is found that l2 has an advantage over m2 on the eBay dataset; this is because the colors are homogeneous and the data are typically well clustered in feature space. In this situation, the m2 SM has no advantage over the l2 SM.
On the texture dataset, however, due to the differences in local structure, the data distribution in feature space has a large variance; therefore, the m2 SM shows an advantage over the l2 SM.
On the Corel dataset, due to the complexity and wide variety of image patterns, neither l2 nor m2 performs well. In this situation, segmentation is necessary to divide the images into regions so that they can be retrieved more effectively.
Figure 9. Image retrieval results from the eBay dataset
Figure 10. Image retrieval results from the texture dataset
Figure 11. Image retrieval results from the Corel dataset
From the above tests, it can be concluded that the data distribution affects the perceived similarity, as captured by mp. However, the lp distance between two instances should not be ignored, as it intuitively corresponds to similarity as defined in the three-dimensional world, especially when the magnitude of the feature vectors matters. mp calculates the similarity between two instances solely from the data distribution in the region covering them. Consider the following example. We have two pairs of data instances: the first pair are perceptually similar but located in a dense region, while the second pair are perceptually less similar but located in a sparse region. mp will judge the second pair to be more similar than the first, contrary to the perceptual similarity. This example shows that mp alone is not suitable for measuring perceptual similarity between images. An effective similarity measure should accurately mimic human judgment of similarity. mp treats a lower data mass between two instances as indicating higher similarity, and vice versa. lp measures the difference between the magnitudes of two features using their geometric positions in the feature space. Both mp and lp have their own strengths. Because each of them measures similarity from a perspective that partially complies with human judgment of similarity, neither should be ignored.
Results from the above subsection show that neither mp nor lp alone can adequately represent the similarity between two images; however, both have their advantages. It is therefore attractive to develop a similarity measure that exploits the advantages of both mp and lp. In this section, we compare the performance of the two proposed similarity measures, PSM1 and PSM2, with both l2 and m2. The datasets and features are the same as those used in the previous experiment. The retrieval results for the three datasets are shown in the following subsections.
4.5.1 Retrieval results from the eBay dataset
Figure 12 shows the retrieval results on the eBay dataset. Overall, PSM2 performs the best on this dataset; l2 performs the second best, followed by PSM1 and m2. As visual examples, Figures 13 and 14 show the top 10 retrievals for two query images from the eBay dataset using the four SMs. Non-relevant retrieved images are marked NR.
It can be observed that l2, m2 and PSM1 all retrieve images whose colors differ from the query image. While l2 and PSM1 are unable to handle outlier colors, m2 causes confusion by using only the neighboring data mass while ignoring the distance between the query and target images. In both scenarios, PSM2 gives better results by overcoming the limitations of the other three SMs.
Figure 12. Image retrieval results from eBay dataset using l2,m2,PSM1 and PSM2
Figure 13. Top 10 retrievals of Query 1 from the eBay dataset
Figure 14. Top 10 retrievals of Query 2 from the eBay dataset
4.5.2 Retrieval results of the texture dataset
Figure 15 shows the overall retrieval results on the texture dataset. PSM2 has the best performance and PSM1 performs the second best, followed by m2 and l2. As visual examples, Figures 16 and 17 show the top 10 retrievals for two query images from the texture dataset using the four similarity measures.
Figure 15. Image retrieval results from the texture dataset using l2,m2,PSM1 and PSM2
Figure 16. Top 10 retrievals of Query 1 from the texture dataset
Figure 17. Top 10 retrievals of Query 2 from the texture dataset
4.5.3 Retrieval results of the Corel dataset
Showing a similar trend for the Corel dataset, Figure 18 shows that PSM2 has the best performance along with PSM1, followed by l2 and m2. As visual examples, Figures 19 and 20 show the top 10 retrievals for two query images from the Corel dataset using the above-mentioned similarity measures.
Based on the retrieval results in Figures 12-20, it can be observed that the PSM2 measure outperforms the other three SMs on both the eBay and texture datasets, while PSM1 and PSM2 fare comparably on the Corel dataset. Overall, the results show that PSM2 is a more desirable SM than PSM1, while both PSM1 and PSM2 outperform l2 and m2. This demonstrates that the proposed PSM2 is a promising similarity measure for image retrieval and classification.
Figure 18. Image retrieval results of Corel dataset using l2,m2,PSM1 and PSM2
Figure 19. Top 10 retrievals of Query 1 from the Corel dataset
Figure 20. Top 10 retrievals of Query 2 from the Corel dataset
Similarity measurement is a crucial part of any image retrieval and classification system. Most SMs are distance based, and few of them take human perception of image similarity into account. This research shows that both types of methods have advantages in certain situations; however, neither adequately represents image similarity on its own. In this paper, we have proposed a novel perceptual similarity measure, PSM2, by combining the strengths of two different types of similarity measures, lp and mp. The new PSM overcomes the limitations of both the distance-based similarity measure and the mass-based similarity measure. Essentially, the new similarity measure combines both coarse and fine matching between two image features, which is consistent with human intuition. We have evaluated the proposed PSM against both lp and mp using three standard databases and a standard performance measure. Our experimental results show that the proposed PSM significantly outperforms both the Euclidean distance l2 and the mass-based similarity measure m2, and is promising for image classification and retrieval applications. In future, we intend to apply the proposed PSM to segmented and large image databases such as the ImageNet dataset. We also intend to use machine learning tools to further improve the performance.
Acknowledgement: This research was partially supported by the Australian Research Council Discovery Projects scheme: DP130100024.