Ruikang Xing and Chenghai Li
Abstract: In clustering analysis, the key to clustering quality is determining the optimal number of clusters. At present, most clustering algorithms require the number of clusters to be given in advance before the samples can be analyzed, so how to obtain the correct optimal number of clusters has been an important topic in clustering validation research. By studying and analyzing the FCM algorithm, this paper proposes an accurate and efficient algorithm for determining the optimal number of clusters that addresses the defects of the traditional FCM algorithm. To tackle the time and accuracy problems of the FCM algorithm and of related algorithms that automatically determine the number of clusters, a kernel function, the AP algorithm, and a new evaluation index are applied to reduce the complexity of the traditional fuzzy C-means algorithm, narrow its search scope, and evaluate its clustering results. Three groups of contrast experiments on different datasets were designed for verification. The results show that the improved algorithm improves time efficiency and accuracy to a certain degree.
Keywords: Fuzzy C-means clustering, affinity propagation (AP) clustering, evaluation index, kernel function.
As an important technology in the data mining field, clustering analysis is widely applied in statistics, decision support, machine learning, pattern recognition, image processing, spatial database technology, e-commerce, and other areas. It is a very efficient data analysis method. Classical clustering algorithms mainly include partition-based clustering, hierarchical clustering, density-based methods, grid-based methods, model-based methods, and outlier-based analysis methods. The quality of the clustering algorithm greatly influences the final result of the clustering process.
Clustering is an effective grouping of a set of physical or abstract objects. Each group generated in the clustering result is called a cluster: a set of objects in the database that share certain common features. Concretely, objects within the same cluster have high similarity, while objects that do not belong to the same cluster are relatively dissimilar. Similarity and dissimilarity can be calculated from the attribute values describing the objects; the distance between objects is the most commonly used measure. As an important method widely applied in data analysis, clustering classifies samples according to specific criteria, with the goal of maximizing intra-category similarity and minimizing inter-category similarity. In clustering analysis, the key to clustering quality is determining the optimal number of clusters. At present, most clustering algorithms require the number of clusters to be given in advance, so how to obtain the correct optimal number of clusters has been an important topic in clustering validation research. Existing methods for determining the optimal number of clusters are mainly built on the fuzzy C-means (FCM) algorithm.
2.1.1 Fuzzy C-means algorithm analysis
Clustering analysis aims to classify objects according to their features, degree of closeness, and similarity. The boundaries of relations among things are usually not clear (i.e., the relations are fuzzy), so applying fuzzy methods to clustering analysis is natural. Fuzzy clustering analysis has been successfully applied to large-scale data analysis, data mining, image analysis, pattern recognition, information fusion and other problems, and various fuzzy clustering algorithms have appeared. Among them, the fuzzy C-means (FCM) clustering algorithm [Bezdek (1981)] is the most widely and successfully applied. It is a clustering analysis method based on an objective function: the membership degree of each object to every cluster center is obtained by optimizing the objective function, which decides the category of each object and achieves automatic classification [Chen, Li and Wang (2006)].
In the FCM clustering algorithm, the set of objects to be classified is

$A = \{A_1, A_2, \ldots, A_n\}$
wherein each object has $m$ characteristic indexes:

$A_j = (A_{j1}, A_{j2}, \ldots, A_{jm}), \quad j = 1, 2, \ldots, n$
Now, the object set $A$ is classified into $c$ categories ($2 \le c \le n$). The matrix consisting of the $c$ cluster-center vectors is

$V = (V_1, V_2, \ldots, V_c)$
To obtain an optimal fuzzy classification, the best fuzzy classification is chosen from the fuzzy classification space according to a clustering criterion. The appropriate fuzzy classification matrix $U$ and cluster centers $V$ are those that minimize the objective function

$J_q(U, V) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{\,q}\, d_{ij}^{2}$

wherein $q$ is a given weighting exponent (generally $q = 2$), and $d_{ij} = \|A_j - V_i\|$ represents the distance between object $A_j$ and the $i$-th cluster-center vector.
Usually, iterative computation is used to find an approximate solution of the objective function above. The detailed steps are as follows:
Step 1: Choose the number of categories $c$, $2 \le c \le n$; take an initial fuzzy classification matrix $U^{(0)}$ and iterate for $l = 0, 1, 2, \ldots$
Step 2: For $U^{(l)}$, calculate the cluster centers:

$V_i^{(l)} = \dfrac{\sum_{j=1}^{n}\bigl(u_{ij}^{(l)}\bigr)^{q} A_j}{\sum_{j=1}^{n}\bigl(u_{ij}^{(l)}\bigr)^{q}}, \qquad i = 1, 2, \ldots, c$
Step 3: Update the fuzzy classification matrix $U^{(l)}$. When $\forall i,\ A_j \neq V_i$,

$u_{ij}^{(l+1)} = \left[\sum_{k=1}^{c}\left(\dfrac{d_{ij}}{d_{kj}}\right)^{\frac{2}{q-1}}\right]^{-1}$

If $\exists k$ such that $A_j = V_k$, then $u_{kj}^{(l+1)} = 1$ and $u_{ij}^{(l+1)} = 0$ for $i \neq k$. Steps 2 and 3 are repeated until $U$ stabilizes.
The fuzzy classification matrix $U$ and the cluster centers $V$ obtained from the above algorithm are only locally optimal solutions, depending on the chosen number of categories $c$ and the initial fuzzy classification matrix $U^{(0)}$.
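For concreteness, the iteration in Steps 1-3 can be sketched as follows. This is a minimal NumPy illustration; the function name, the random initialization of $U^{(0)}$ and the stopping tolerance are our own assumptions, since the text above does not fix them.

```python
import numpy as np

def fcm(A, c, q=2, max_iter=100, eps=1e-5, seed=0):
    """Minimal sketch of standard FCM.
    A: (n, m) data matrix; c: number of clusters; q: weighting exponent."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Step 1: random initial fuzzy partition matrix U (rows: clusters, cols: objects)
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Uq = U ** q
        # Step 2: cluster centers V_i = sum_j u_ij^q A_j / sum_j u_ij^q
        V = (Uq @ A) / Uq.sum(axis=1, keepdims=True)
        # Step 3: update U from the distances d_ij = ||A_j - V_i||
        d = np.linalg.norm(A[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                 # guards the A_j == V_k case
        U_new = 1.0 / (d ** (2.0 / (q - 1)))
        U_new /= U_new.sum(axis=0, keepdims=True)
        if np.abs(U_new - U).max() < eps:     # stop when U stabilizes
            U = U_new
            break
        U = U_new
    return U, V
```

For a data matrix `X`, a call such as `U, V = fcm(X, c=3)` returns the membership matrix and cluster centers for a chosen category number.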
Noise in the dataset greatly influences the whole clustering process. At present, many algorithms either fail to handle noise, which affects the classification of the dataset, or produce unsatisfactory noise-processing results: the noise processing is too complex, or the noise reduction has no substantive effect. All of this leads to defects of the FCM clustering algorithm in practical applications.
2.1.2 Objective function optimization based on kernel function
A kernel function is introduced to enhance the optimizing ability of the FCM clustering algorithm. Suppose the cluster center $\Phi(v_i)$ in the high-dimensional feature space has a pre-image $v_i$ in the original space. Then the objective function becomes

$J_m(U, V) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{\,m}\,\|\Phi(x_j) - \Phi(v_i)\|^{2}$
According to the Mercer kernel definition,

$\|\Phi(x_j) - \Phi(v_i)\|^{2} = K(x_j, x_j) - 2K(x_j, v_i) + K(v_i, v_i)$
Meanwhile, the Gaussian radial basis function (for which $K(x, x) = 1,\ \forall x \in X$) is used as the kernel function for simplification. The objective function of the improved fuzzy C-means clustering algorithm then becomes

$J_m(U, V) = 2\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{\,m}\,\bigl(1 - K(x_j, v_i)\bigr)$
The Lagrange multiplier method is used to derive the iterative formulas for the cluster centers and the membership matrix:

$v_i = \dfrac{\sum_{j=1}^{n} u_{ij}^{\,m} K(x_j, v_i)\, x_j}{\sum_{j=1}^{n} u_{ij}^{\,m} K(x_j, v_i)}, \qquad u_{ij} = \dfrac{\bigl(1 - K(x_j, v_i)\bigr)^{-\frac{1}{m-1}}}{\sum_{k=1}^{c}\bigl(1 - K(x_j, v_k)\bigr)^{-\frac{1}{m-1}}}$
The procedure of the improved fuzzy C-means clustering algorithm is as follows:
Step 1: Initialize. Give the weighting exponent $m$ and the number of clustering categories $c$ ($2 \le c \le n$); set the parameter values of the chosen kernel function; set the iteration-stop threshold $\varepsilon$; initialize the membership matrix $U^{(0)}$ and the iteration counter $b = 0$.
Step 2: Compute $K(x_j, v_i)$ for every sample $x_j$ and cluster center $v_i$.
Step 3: Update the membership matrix $U^{(b)}$ to obtain $U^{(b+1)}$.
Step 4: If $\|U^{(b+1)} - U^{(b)}\| < \varepsilon$ (where $\|\cdot\|$ is an appropriate norm), stop updating the membership matrix $U$; otherwise, set $b = b + 1$ and return to Step 2.
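A corresponding sketch of Steps 1-4 with the Gaussian radial basis kernel is given below. The kernel width `sigma`, the random initializations and the function name are illustrative assumptions; the update rules follow the standard Gaussian-kernel FCM formulas stated above.

```python
import numpy as np

def kernel_fcm(X, c, m=2.0, sigma=1.0, max_iter=100, eps=1e-5, seed=0):
    """Sketch of FCM with a Gaussian kernel K(x, v) = exp(-||x - v||^2 / sigma^2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)          # Step 1: initialize U^(0)
    V = X[rng.choice(n, c, replace=False)]     # initial centers drawn from the data
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / sigma ** 2)           # Step 2: K(x_j, v_i), shape (c, n)
        # Step 3: membership update u_ij proportional to (1 - K)^(-1/(m-1))
        U_new = (1.0 - K + 1e-12) ** (-1.0 / (m - 1))
        U_new /= U_new.sum(axis=0, keepdims=True)
        # cluster-center update weighted by u^m * K
        W = (U_new ** m) * K
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < eps:      # Step 4: stop when ||U^(b+1)-U^(b)|| < eps
            U = U_new
            break
        U = U_new
    return U, V
```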
2.1.3 Improved FCM algorithm and analysis
The partitioning criterion of the standard FCM clustering algorithm is that the sum $E$ of the distances between each data object $p$ and its corresponding cluster center is minimized. The computational formula of $E$ is

$E = \sum_{i=1}^{k}\sum_{p \in C_i} d(p, o_i)^{2}$
wherein $o_i$ is the center of cluster $C_i$, $d(\cdot)$ is the distance function, and $E$ is the minimum of the sum of distances between all data objects $p$ and their corresponding cluster centers. When $k = 1$, the time complexity of the algorithm is $O(n^2)$. After the kernel function is added, for the $i$-th initial center point, $i \in [1, k]$, the time complexity $t$ of the algorithm is $t = (n + 1)(i - 1)$.
Summing the time complexity $T_i$ of the $i$-th initial center point over all $k$ initial center points gives the overall time complexity of the algorithm, $O(T) = O(n^2)$. In conclusion, after the kernel function is added, the complexity of the FCM clustering algorithm is reduced.
Since the clustering result of the FCM clustering algorithm depends on the selection of the initial cluster centers, different initial centers generate different clustering results, so the results are unstable. How to determine the optimal clustering with the FCM algorithm is therefore important.
Usually, the basic idea for determining the optimal number of clusters is as follows: for a specific dataset, determine the search range of the number of clusters and run the clustering algorithm to obtain clustering results for each candidate number of clusters; then choose appropriate validity indexes to evaluate the clustering results and determine the optimal number of clusters from the evaluation. Thus, the core of determining the optimal number of clusters is to determine a reasonable search range of the number of clusters and suitable clustering validity indexes. To fix the search range $[k_{\min}, k_{\max}]$, both $k_{\min}$ and $k_{\max}$ must be determined. $k_{\min} = 1$ corresponds to an even distribution of samples without obvious characteristic differences, so the minimum number of clusters in clustering algorithms is usually 2, i.e. $k_{\min} = 2$. There is still no explicit theoretical guidance on how to determine $k_{\max}$. The empirical rule most scholars use is $k_{\max} \le \sqrt{n}$, as described in the literature [Yu and Cheng (2002)]. That conclusion rests on the precondition of the uncertainty function $f(x) = x - 1$, but this precondition is not a sufficient condition, as shown in the literature [Yang, Li, Hu et al. (2006)]. The conclusion is deduced under the precondition that the sample space has fractal geometric characteristics, so it is not universal. Moreover, the sample sizes and actual category numbers of all datasets in the literature [Frey and Dueck (2008)], and of some datasets in the literature [Brusco (2008)], do not have this property either. To sum up, $k_{\max} \le \sqrt{n}$ is only an empirical rule and is not universal. In this study, the AP algorithm proposed by Frey et al. is applied to determine $k_{\max}$. The algorithm is fast and effective and has been successfully applied in multiple fields.
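The overall search procedure described above can be summarized in the following sketch, where `fcm` is the routine sketched earlier and `validity_index` stands in for whichever evaluation index is chosen (a smaller-is-better convention is assumed here, matching the new index defined in Section 2.3.3):

```python
def optimal_k(X, k_max, validity_index, k_min=2):
    """Search the cluster-number range [k_min, k_max]: run FCM for each candidate k,
    score the resulting partition with a validity index, and return the best k."""
    scores = {}
    for k in range(k_min, k_max + 1):
        U, V = fcm(X, k)
        scores[k] = validity_index(U, V, X)
    best_k = min(scores, key=scores.get)   # smallest index value = best partition
    return best_k, scores
```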
2.2.1 AP clustering algorithm
The AP clustering algorithm [Kapp (2007); Xiao and Yu (2008)] is a clustering algorithm based on the propagation of affinity information. Its purpose is to find the set of optimal category representatives so that the sum of the similarities of all samples to their nearest representatives is largest. The AP algorithm regards all $N$ samples in the dataset as candidate category representatives and establishes attraction information between each sample and all other samples. That is, the similarity between any two samples $x_i$ and $x_k$ (when Euclidean distance is used for measurement, $s(i, k) = -\|x_i - x_k\|^2$) is stored in an $N \times N$ similarity matrix. The AP algorithm uses $s(i, k)$ to express the appropriateness of sample $x_k$ as the category representative of sample $x_i$. Initially, all samples are supposed to be equally likely to be chosen as category representatives, so all $s(k, k)$ are set to the same value $p$. To pick out appropriate category representatives, evidence must be gathered from the samples continuously. Thus, the AP algorithm introduces two important message quantities, the responsibility $r$ and the availability $a$, which represent different kinds of competition. $r(i, k)$ points from $x_i$ to $x_k$; it represents evidence sent to $x_k$ and expresses how appropriate $x_k$ is as the category representative of $x_i$. $a(i, k)$ points from $x_k$ to $x_i$; it represents evidence accumulated by $x_i$ and expresses how appropriate it is for $x_i$ to choose $x_k$ as its category representative. For any sample $x_i$, the sum of the responsibility $r(i, k)$ and the availability $a(i, k)$ is calculated over all samples, and the sample $x_k$ with the largest sum is taken as the category representative. The iteration process of the AP algorithm is the alternating update of these two message quantities.
To prevent oscillation during iteration, the AP algorithm introduces a damping factor $\lambda$ with a value between 0 and 1. The updated $r(i, k)$ and $a(i, k)$ are obtained by weighting the current iteration value against the result of the previous iteration.
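A bare-bones sketch of the damped message passing is shown below. The responsibility and availability formulas follow the standard AP update rules; the fixed number of iterations and the function name are simplifications of ours, not part of the paper.

```python
import numpy as np

def ap_cluster(S, p, lam=0.5, max_iter=200):
    """Sketch of affinity propagation message passing.
    S: (N, N) similarity matrix; p: preference placed on the diagonal;
    lam: damping factor in (0, 1) weighing old against new messages."""
    N = S.shape[0]
    S = S.copy()
    np.fill_diagonal(S, p)
    R = np.zeros((N, N))                        # responsibilities r(i, k)
    A = np.zeros((N, N))                        # availabilities  a(i, k)
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(N), idx]
        AS[np.arange(N), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = lam * R + (1 - lam) * R_new         # damped update
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)
        A = lam * A + (1 - lam) * A_new         # damped update
    # exemplars: samples k with r(k,k) + a(k,k) > 0
    return np.where(np.diag(R + A) > 0)[0]
```

The preference $p$ is commonly set to the median of the similarities, e.g. `p = np.median(S)`, which tends to produce a moderate number of exemplars.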
2.2.2 AP feasibility analysis
The AP algorithm does not require the number of clusters to be given; when the algorithm ends, the number of clusters is determined automatically. For clustering structures with intra-category compactness and inter-category separation, the AP algorithm obtains accurate clustering results. But for closely spaced clustering structures, the algorithm tends to generate many local clusters, so the number of clusters is generally large and accurate clustering results cannot be given [Wang, Li, Zhang et al. (2007)]. Because of its speed and effectiveness, the AP algorithm, rather than the C-means clustering algorithm, is used here to complete the initial screening of the category number of the dataset. Since the category number found by the AP algorithm is greater than the true number of clusters, the upper bound $k_{\max}$ of the category number is reduced from $n$ (the sample size) to the number of clusters $k_{AP}$ generated by the AP algorithm. Compared with the $\sqrt{n}$ rule, this scheme takes the clustering structure distribution of the samples into account, which is more scientific. The experiments also verify the feasibility of this scheme.
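In practice, this screening step can be carried out with an off-the-shelf AP implementation. The sketch below uses scikit-learn's AffinityPropagation (the damping value and random seed are illustrative choices) and simply takes the number of exemplars it returns as $k_{AP}$:

```python
from sklearn.cluster import AffinityPropagation

def k_max_from_ap(X, damping=0.5, random_state=0):
    """Run AP once on the dataset and use the number of exemplars it finds
    as the upper bound k_AP of the cluster-number search range [2, k_AP]."""
    ap = AffinityPropagation(damping=damping, random_state=random_state).fit(X)
    return len(ap.cluster_centers_indices_)
```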
At present, many validity evaluation indexes have been proposed to analyze the clustering results of the FCM algorithm, such as the partition coefficient VPC [Bezdek (1974)], the partition entropy VPE [Bezdek (1974)], the VOS index proposed by Kim et al. [Kim, Lee, Lee et al. (2004)], the VXB index proposed by Xie et al. [Xie and Beni (1991)], the VFS index proposed by Fukuyama et al. [Fukuyama and Sugeno (1989)], the VK index proposed by Kwon [Kwon (1998)], the VCWB index proposed by Rezaee et al. [Rezaee, Lelieveldt and Reiber (1998)], the VB index proposed by Boudraa [Boudraa (1999)], the VSV index proposed by Kim et al. [Kim, Park and Park (2001)], the Wint (weighted inter-intra) index [Boudraa (1999)] and the Silhouette index [Silhouette (2004)]. However, due to the defects of these indexes, it is hard for them to judge the clustering results, and their validity-testing effect is not ideal. Thus, in this study the geometric structure of the dataset and clusters of different sizes are fully considered, and the ratio of intra-category compactness to inter-category separation is combined with the clustering membership degrees to define a new clustering validity index. Because the information of the dataset and its geometric structure are fully considered, the optimal partition and the optimal number of clusters of the fuzzy partition can be accurately determined by the FCM algorithm. On this basis, a method for determining the optimal number of clusters of the samples is proposed, which evaluates the clustering results and determines the optimal number of clusters within the range given by the AP algorithm.
2.3.1 Definition of compactness index
The compactness index is used to measure intra-category compactness, and can be expressed with the intra-category weighted squared error sum as follows:
As $c$ increases, the weight $\omega_i$ decreases; weighting each category in this way inhibits the reduction of the measured value.
When a noise point is regarded as a separate category, the weight of that category will be larger than those of the other categories. To make the compactness index more robust, an adjustment term is added.
2.3.2 Definition of separation index
The separation index measures the degree of separation between two fuzzy sets. The dispersion between two categories is defined as follows: a sample belongs to a pair of categories to the extent of the minimum of its membership degrees in those two categories. The separation measurement uses the largest such difference over all pairs of fuzzy clusters. Accordingly, the similarity between two fuzzy sets $F_i$ and $F_j$ is defined as:
The separation measurement of a given fuzzy partition is then
The separation index is therefore bounded by $0 \le S(c, U) \le 1$; when $F_i = F_j$, $S(c, U) = 0$.
2.3.3 New evaluation index
Since compactness and separation have different scales, the normalized results can be expressed as:
These two normalized quantities are combined to obtain the new evaluation index of clustering effect:
In the new validity index, the compactness index $V(c, U)$ reflects the intra-category total variation and expresses the degree of concentration of the samples within each category: the smaller its value, the more compact the category. The separation index $S(c, U)$ reflects the inter-category separation and expresses the distance among fuzzy clusters: the larger the separation, the better the partition. $V(c, U)$ and $S(c, U)$ are combined to reflect the partition features of the dataset: the smaller the combined index, the more compact each category, the better separated the categories, and the better the clustering result.
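Since the exact formulas of the new index are not reproduced here, the following sketch uses a Xie-Beni-style compactness-to-separation ratio purely as an illustrative stand-in with the same smaller-is-better behaviour; it is not the index proposed in this paper.

```python
import numpy as np

def xie_beni_like(U, V, X, m=2.0):
    """Illustrative compactness-to-separation ratio: smaller values indicate
    more compact and better separated clusters."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # (c, n) squared distances
    compactness = ((U ** m) * d2).sum()                       # weighted within-cluster scatter
    centre_d2 = ((V[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    np.fill_diagonal(centre_d2, np.inf)
    separation = centre_d2.min()                              # smallest inter-centre distance
    return compactness / (X.shape[0] * separation)
```

Any such function with the signature `validity_index(U, V, X)` can be plugged into the `optimal_k` search loop sketched earlier.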
To test the validity and operating efficiency of the proposed algorithm, three groups of experiments were carried out, involving simulation tests on artificial datasets and real datasets and comparisons with other algorithms.
Table 1: Datasets and standard number of clusters
There are three artificial datasets: Dataset1, Dataset2 and Dataset3. Dataset1 is composed of two two-dimensional Gaussian distributions with centers (0,0) and (20,20); each category has 400 samples. Dataset2 is composed of four two-dimensional Gaussian distributions with centers (0,0), (5,7), (12,17) and (19,24); each category has 400 samples. Dataset3 is an artificially generated random sample dataset with 150 samples and a true number of clusters of 13. The real data consist of the UCI datasets Iris and Wine. The standard numbers of clusters, sizes and sources of the artificial and UCI datasets are shown in Tab. 1.
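The two Gaussian artificial datasets can be reproduced approximately as follows (unit covariance is an assumption of ours; the paper does not state the spread used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset1: two 2-D Gaussian clusters centred at (0,0) and (20,20), 400 samples each.
centres1 = np.array([[0, 0], [20, 20]])
dataset1 = np.vstack([rng.normal(c, 1.0, size=(400, 2)) for c in centres1])

# Dataset2: four 2-D Gaussian clusters centred at (0,0), (5,7), (12,17), (19,24).
centres2 = np.array([[0, 0], [5, 7], [12, 17], [19, 24]])
dataset2 = np.vstack([rng.normal(c, 1.0, size=(400, 2)) for c in centres2])
```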
Experiment 1: Validity experiment of the kernel-based improved FCM algorithm
The Iris and Wine datasets were chosen as the test samples. The FCM clustering algorithm and the improved FCM clustering algorithm were simulated. The change of the objective function with the number of iterations is shown in Fig. 1.
Figure 1: Convergence comparison chart of the FCM algorithm based on the improved kernel function
In Fig. 1, Line 1 represents the first clustering run of the FCM algorithm; Line 2 represents the second clustering run of the FCM algorithm; Line 3 represents the first clustering run of the improved FCM algorithm; Line 4 represents the second clustering run of the improved FCM algorithm.
According to the figure, the number of iterations required by the improved FCM algorithm to reach the converged objective value is obviously lower than that of the traditional FCM algorithm. Thus, the validity of the algorithm is demonstrated.
Experiment 2: Validity experiment of the AP algorithm determining the upper limit of the search range. Artificial Dataset3 was chosen as the example, and simulation experiments were carried out for $k_{\max} = \sqrt{n}$ and $k_{\max} = k_{AP}$ respectively.
To eliminate the influence of the evaluation index on this comparison, the partition coefficient VPC was used as the evaluation index in the contrast experiment. The optimal numbers of clusters obtained are shown in the following figure.
Figure 2: Comparison chart of the optimal number of clusters
It can easily be seen that, since the search range determined by $k_{\max} = \sqrt{n}$ is smaller than the practical optimal number of clusters, the accurate optimal number of clusters cannot be obtained with it, whereas the accurate optimal number of clusters is obtained when the search range is determined by the AP algorithm.
Experiment 3: Comparison experiment of several validity indexes. Artificial Dataset1 and Dataset2 as well as the UCI datasets Iris and Wine were selected to compare several representative evaluation indexes. The results are shown in Tab. 2. Judging from the results, the evaluation index proposed in this study converges and obtains the accurate number of clusters on all four groups of data, performing better than the other indexes. The theoretical analysis and experimental results indicate that, compared with other indexes and methods, the index proposed in this study has better performance and stability.
Table 2: Comparison of various indicators
The above three groups of contrast experiments verify, respectively, the time efficiency of the improved algorithm, the validity of the search range and the accuracy of the evaluation index. The results show that the proposed fuzzy clustering algorithm, which automatically determines the number of clusters, is reliable.
Based on the analysis of the FCM algorithm, an accurate and efficient algorithm for determining the optimal number of clusters is proposed in this study to address the defects of the traditional FCM algorithm. The algorithm is improved in three aspects: reducing algorithm complexity, determining the search range and constructing a clustering validity index. Multiple groups of contrast experiments verify that the improved algorithm achieves higher efficiency and accuracy.
Some problems still exist in the algorithm: the combined use of several algorithms in the process of automatically determining the number of clusters reduces time efficiency. Future research will address this issue to further improve time efficiency.
Acknowledgement: This research was financially supported by the Natural Science Foundation of China (Grant No. 61703426) and the Postdoctoral Science Foundation of China (Grant No. 2016M602996).