基于邊信任度的混合參數(shù)自適應重疊社區(qū)發(fā)現(xiàn)算法

2019-04-13 03:32:26顧春妹趙建軍洪文興徐文靜

天津大學學報(自然科學與工程技術(shù)版) 2019年6期

汪?清，顧春妹，趙建軍，崔?鑫，洪文興，徐文靜

汪?清1，顧春妹1，趙建軍1，崔?鑫1，洪文興2，徐文靜2

(1. 天津大學自動化與信息工程學院，天津 300072；2. 廈門大學航空航天學院，廈門 361005)

網(wǎng)絡中的社區(qū)結(jié)構(gòu)有助于簡化網(wǎng)絡拓撲結(jié)構(gòu)分析，揭示系統(tǒng)內(nèi)部的規(guī)律，能夠為信息推薦和信息傳播控制提供有力的支撐．網(wǎng)絡重疊社區(qū)結(jié)構(gòu)與真實生活更加接近，但其分析較非重疊社區(qū)結(jié)構(gòu)更加困難．因此，針對重疊社區(qū)發(fā)現(xiàn)問題，在對網(wǎng)絡的邊進行峰值聚類的基礎上提出了一種基于邊信任度的混合參數(shù)的自適應重疊社區(qū)發(fā)現(xiàn)算法．定義了網(wǎng)絡邊的鄰居邊集合及與其鄰居邊之間的信任度函數(shù)，通過信息傳遞獲取邊的總信息量，并且基于此引入混合參數(shù)的概念．基于k-means算法使用混合參數(shù)對網(wǎng)絡中的邊進行聚類，即將網(wǎng)絡中的邊劃分為核心邊集與非核心邊集，每個核心邊作為一個聚類中心．根據(jù)非核心邊到核心邊的距離將所有非核心邊劃分至距離其最近的聚類中心所在社區(qū)．再根據(jù)網(wǎng)絡中邊與節(jié)點的關系實現(xiàn)重疊節(jié)點發(fā)現(xiàn)，最終實現(xiàn)重疊社區(qū)的發(fā)現(xiàn)．該算法的優(yōu)點是每條邊通過獨立地完成信息擴散找到社區(qū)的結(jié)構(gòu)，相比于傳統(tǒng)的峰值聚類算法，不需要人為設置相關參數(shù)，實現(xiàn)重疊社區(qū)的自適應發(fā)現(xiàn)．為驗證算法的可行性，對算法復雜度進行了分析，并且使用兩種社區(qū)劃分評價指標——標準化互信息和模塊度，分別在人工數(shù)據(jù)集及6種真實數(shù)據(jù)集上進行實驗，通過與其他算法進行對比分析，實驗結(jié)果表明該算法更具可行性和有效性．

峰值聚類；邊信任度；混合參數(shù)；重疊社區(qū)發(fā)現(xiàn)；自適應算法

網(wǎng)絡與我們的日常生活息息相關，如交通系統(tǒng)、通信系統(tǒng)和社交網(wǎng)絡等．眾所周知，社區(qū)在揭示網(wǎng)絡隱藏結(jié)構(gòu)方面發(fā)揮著重要作用，并且現(xiàn)實問題中節(jié)點可能屬于多個社區(qū)，這些節(jié)點稱為重疊節(jié)點．如果社區(qū)包含重疊節(jié)點，則稱該社區(qū)為重疊社區(qū)．重疊社區(qū)在現(xiàn)實世界中很常見，如一個人可能對幾個主題感興趣并加入多個社交圈，即重疊的社交網(wǎng)絡．網(wǎng)絡的重疊社區(qū)結(jié)構(gòu)很好地反映了真實的網(wǎng)絡結(jié)構(gòu)．研究重疊社區(qū)結(jié)構(gòu)具有重要的現(xiàn)實意義，不僅有助于揭示復雜系統(tǒng)的內(nèi)部規(guī)則，理解拓撲結(jié)構(gòu)，而且能簡化網(wǎng)絡分析．但是，由于網(wǎng)絡中的節(jié)點具有多重身份，使得獲得準確的社區(qū)結(jié)構(gòu)的難度加大．因此，重疊社區(qū)發(fā)現(xiàn)已成為該領域中廣泛研究的問題，其主要目的是識別由內(nèi)部密集連接的節(jié)點和外部稀疏連接的節(jié)點組成的社區(qū)．

Chen等[1]于2016 年基于密度峰值聚類和節(jié)點局部信息提出一種線性復雜度的社區(qū)發(fā)現(xiàn)算法. Huang等[2]提出了一種基于密度峰值法的重疊檢測算法，并獲得了良好的性能．Zhou等[3]利用蟻群算法和標簽傳播檢測重疊社區(qū)．該算法首先初始化節(jié)點標簽和螞蟻的位置，再按照預先設置的概率進行隨機游走并更新節(jié)點的標簽．當滿足條件后，對網(wǎng)絡節(jié)點所含的標簽序列進行處理并為節(jié)點分配標簽，從而完成網(wǎng)絡的重疊社區(qū)發(fā)現(xiàn)．上述社區(qū)發(fā)現(xiàn)算法本質(zhì)上是對網(wǎng)絡中的節(jié)點進行分類，但是僅僅對節(jié)點集進行劃分并不能獲得網(wǎng)絡的重疊社區(qū)．因此，相關學者將網(wǎng)絡中的邊當作數(shù)據(jù)對象進行聚類研究，即將邊分配到不同的社區(qū)完成社區(qū)發(fā)現(xiàn)．Ahn等[4]提出的算法計算了每對邊緣之間的相似度，然后使用具有相似性的層次聚類來確定邊緣屬性由于是層次聚類，因此算法可得到不同分辨率下的社區(qū)結(jié)構(gòu)．Shi等[5]基于遺傳算法對網(wǎng)絡中的邊進行聚類，針對網(wǎng)絡邊定義網(wǎng)絡社區(qū)結(jié)構(gòu)的基因表達以及交叉和變異算子，從而得到網(wǎng)絡的重疊社區(qū)．Zhang等[6]在標簽傳播算法(label propagation algorithm，LPA)的基礎上引入邊緣聚類系數(shù)用于更新標簽，而不是隨機地進行鄰居節(jié)點標簽的更新，有效地抑制了標簽的隨機傳播，但該算法僅能用于非重疊社區(qū)結(jié)構(gòu)及小規(guī)模重疊社區(qū)的網(wǎng)絡，并且不能達到自適應．

本文從邊的角度出發(fā)，利用網(wǎng)絡比邊與點之間的關系，提出了一種基于邊信任度的混合參數(shù)(mixing parameter with trust degree of edge clustering，MPTD-EC)自適應重疊社區(qū)發(fā)現(xiàn)算法．該算法自動選擇核心邊，引入邊之間信息傳遞，利用邊的總信息量代替峰值聚類中密度，不需要人為地設置截斷距離．

本算法不需要人為設定參數(shù)，實現(xiàn)了自適應．為了評估提出的方法，將其應用于合成和真實網(wǎng)絡．實驗結(jié)果證明了該算法的有效性．

1?基于邊信任度的混合參數(shù)自適應重疊社區(qū)發(fā)現(xiàn)算法

首先，定義鄰居邊之間的信任度函數(shù)，在網(wǎng)絡中進行邊信息傳遞獲得邊信息矩陣，在此基礎上計算出邊距離矩陣．然后進行核心邊的選取，根據(jù)邊距離矩陣將非核心邊進行分配，獲得邊社區(qū)，再將其轉(zhuǎn)換為網(wǎng)絡重疊社區(qū)．

1.1?信息擴散

利用信息擴散，可以獲得每條邊的信息量，利用邊的信息量去標識邊在網(wǎng)絡中的重要程度．

(1)

?(2)

?(3)

(4)

?(5)

步驟2遍歷網(wǎng)絡中的邊，每條邊依次作為信息源，將其原始信息1用廣度優(yōu)先算法擴散到網(wǎng)絡所有邊，且當以該邊為信息源時，并不考慮其他邊所含信息量的大?。?/p>

在本算法中，用邊的總信息替換峰值簇中每個節(jié)點的密度，以避免選擇截止距離．

1.2?核心邊獲取

本文利用改進的k-means算法進行核心邊選?。ㄟ^獲取核心邊可以獲取整個網(wǎng)絡社區(qū)的核心集及網(wǎng)絡所包含的重疊社區(qū)的數(shù)目.

k-means算法利用質(zhì)心(不同聚類的中心)來表示不同類別，具體步驟如下．

步驟1隨機選取個初始聚類中心．

步驟2計算出剩余邊到聚類中心的距離，將每個數(shù)據(jù)點分配到距離其最近的中心所在類別．

步驟3重新計算個聚類的中心．

步驟4重復步驟2、步驟3，直到聚類中心不再改變．

步驟3根據(jù)如下公式重新計算2個聚類的中心.

通過上述方法即可獲得核心邊集．

1.3?社區(qū)劃分

1)信息擴散

步驟2據(jù)信息矩陣得歸一化距離矩陣．

2)核心邊的劃分

步驟4重復步驟2、步驟3至聚類中心不再改變.

3)非核心邊的劃分

2?算法的時間復雜度分析

3?評價指標

為了衡量社區(qū)發(fā)現(xiàn)算法所獲社區(qū)結(jié)構(gòu)的優(yōu)劣，提出了模塊度[8]、標準化互信息(NMI)等[9]評價指標.研究發(fā)現(xiàn)模塊度存在分辨率的限制，即網(wǎng)絡中較小規(guī)模社區(qū)結(jié)構(gòu)的存在會造成模塊度值很大，但劃分的社區(qū)結(jié)構(gòu)并非網(wǎng)絡的最佳劃分．相比模塊度，NMI更能評價社區(qū)劃分的性能．同時因為人工產(chǎn)生的數(shù)據(jù)集網(wǎng)絡本身即重疊社區(qū)，使用NMI進行評價．但對于真實數(shù)據(jù)集，人為將其劃分為非重疊社區(qū)，因此進行重疊社區(qū)劃分時，使用NMI并不能衡量算法的性能，因此在真實數(shù)據(jù)集中，使用改進的模塊度進行算法性能的評價．

3.1?標準化互信息

本文通過標準化互信息(NMI)表征本算法在人工網(wǎng)絡數(shù)據(jù)集的性能．NMI通過信息熵來衡量社區(qū)發(fā)現(xiàn)算法所劃分的社區(qū)結(jié)構(gòu)和網(wǎng)絡已知的社區(qū)結(jié)構(gòu)的差異．其計算公式為

?(6)

3.2?基于模塊度的改進評價指標

?(7)

?(8)

?(9)

?(10)

?(11)

4?仿真實驗

為了驗證本算法的可行性與有效性，分別在合成網(wǎng)絡和真實網(wǎng)絡上進行了實驗，同時將本算法與一些已提出的算法(COPRA[11]和CMP[12])進行了對比?分析．

4.1?人工網(wǎng)絡數(shù)據(jù)集

本文使用LFR模型[13]生成網(wǎng)絡，在此基礎上進行了分析．由于該模型可通過調(diào)整參數(shù)控制整個網(wǎng)絡和社區(qū)屬性，如大小、節(jié)點度分布、重疊節(jié)點數(shù)等，合成的網(wǎng)絡結(jié)構(gòu)更逼近于真實的社交網(wǎng)絡拓撲．

表1?網(wǎng)絡參數(shù)

Tab.1?Parameters of network

4.2?真實網(wǎng)絡數(shù)據(jù)集

為了進一步測試所提算法，同時在幾個經(jīng)典的經(jīng)常被用作測試網(wǎng)絡的真實社交網(wǎng)絡(空手道數(shù)據(jù)集[14]、Dolphins數(shù)據(jù)集[15]、Football數(shù)據(jù)集[16]、Polbooks數(shù)據(jù)集[17]、Email數(shù)據(jù)集[18]和PGP數(shù)據(jù)集[19])上進行實驗．這些社交網(wǎng)絡的社區(qū)結(jié)構(gòu)如表2所示．其中，Email數(shù)據(jù)集和PGP數(shù)據(jù)集社區(qū)數(shù)均為0.

表2?6種真實網(wǎng)絡的參數(shù)

Tab.2?Parameter of six real networks

4.3?實驗結(jié)果及分析

4.3.1?人工網(wǎng)絡數(shù)據(jù)集算法性能

在本節(jié)中，分別在LFR1、LFR2和LFR3上進行了本算法的性能測試．同時，為了進行比較，在相同的網(wǎng)絡上進行COPRA和CMP的測試．實驗結(jié)果如圖1所示．

圖1?MPTD-EC算法性能

4.3.2?真實網(wǎng)絡數(shù)據(jù)集算法性能

為了評價本算法性能，實驗采用基于模塊度的改進評價指標，分別將5種算法應用在6種真實數(shù)據(jù)集上，數(shù)據(jù)集相關參數(shù)參見表3，實驗結(jié)果獲得6種網(wǎng)絡決策圖(根據(jù)邊的總信息量及距離繪制)，如圖2所示．

圖2?MPTD-EC算法在6種真實數(shù)據(jù)集上獲得的決策圖

COPRA是隨機算法，圖3中決策圖紅點個數(shù)表示核心邊的個數(shù)，即表征了網(wǎng)絡中社區(qū)的個數(shù)．表3中相應的實驗結(jié)果是50個獨立結(jié)果的平均值．表3表明本算法在這些社交網(wǎng)絡中具有最佳效果，說明本算法的有效性．

Tab.3?Values of obtained by five algorithms

5?結(jié)?語

本文在密度峰值聚類的基礎上，提出了一種基于邊信任度的混合參數(shù)自適應重疊社區(qū)發(fā)現(xiàn)算法．在本算法中，峰值簇中的密度和距離由網(wǎng)絡邊與邊之間的信息傳遞決定，避免截斷距離選取，并且引入混合參數(shù)，使用k-means算法進行核心邊的選取，從而確定網(wǎng)絡社區(qū)個數(shù)，并且通過邊到點的轉(zhuǎn)換完成重疊節(jié)點和重疊社區(qū)的發(fā)現(xiàn)，不需要人為設定相關參數(shù)，使算法達到自適應．綜合實驗結(jié)果，可知本算法在人工數(shù)據(jù)集上較已提出的基于邊聚類的社區(qū)發(fā)現(xiàn)算法更適用于重疊社區(qū)數(shù)較多的社區(qū)發(fā)現(xiàn)問題，并且在真實數(shù)據(jù)集上性能也有較大提升，驗證了該算法的可行性和有效性．

［1］ Chen Y，Zhao P，Li P，et al. Finding communities by their centers[J]. Scientific Reports，2016，6：24017-1-8.

［2］ Huang L，Wang G，Wang Y，et al. A link density cluster ing algorithm based on automatically selecting density peaks for overlapping community detection[J]. International Journal of Modern Physics B，2016，30(24)：165-167.

［3］ Zhou X，Liu Y，Zhang J，et al. An ant colony based algorithm for overlapping community detection in complex networks[J]. Physica A Statistical Mechanics & Its Applications，2015，427：289-301.

［4］ Ahn Y Y，Bagrow J P，Lehmann S. Link communities reveal multiscale complexity in networks[J]. Nature，2010，466(7307)：761-764.

［5］ Shi C，Cai Y，F(xiàn)u D，et al. A link clustering based over lapping community detection algorithm[J]. Data & Knowledge Engineering，2013，87(9)：394-404.

［6］ Zhang X K，Tian X，Li Y N，et al. Label propagation algorithm based on edge clustering coefficient for community detection in complex networks[J]. International Journal of Modern Physics B，2014，28(30)：1450216-1-15.

［7］ Hu Y，Li M，Zhang P，et al. Community detection by signaling on complex networks[J]. Physical Review E Statistical Nonlinear & Soft Matter Physics，2008，78(2)：016115.

［8］ Clauset A，Newman M E，Moore C. Finding community structure in very large networks[J]. Physical Review E Statistical Nonlinear & Soft Matter Physics，2004，70：066111-1-6.

［9］ Danon L，Díazguilera A，Duch J，et al. Comparing community structure identification[J]. Journal of Statistical Mechanics Theory & Experiment，2005，2005(9)：09008-1-10.

［10］ Nicosia V，Mangioni G，Carchiolo V，et al. Extending the definition of modularity to directed graphs with overlapping communities[J]. Journal of Statistical Mechanics：Theory & Experiment，2009(3)：3166-3168.

［11］ Gregory S. Finding overlapping communities in networks by label propagation[J]. New Journal of Physics，2010，12(10)：2011-2024.

［12］ Palla G，Derényi I，F(xiàn)arkas I，et al. Uncovering the overlapping community structure of complex networks in nature and society[J]. Nature，2005，435(7043)：814-818.

［13］ Lancichinetti A，F(xiàn)ortunato S. Benchmarks for testing com munity detection algorithms on directed and weighted graphs with overlapping communities[J]. Physical Review E Statistical Nonlinear & Soft Matter Physics，2009，80：016118-1-8.

［14］ Zachary W W. An information flow model for conflict and fission in small groups[J]. Journal of Anthropological Research，1977，33(4)：452-473.

［15］ Lusseau D. The emergent properties of a dolphin social network[J]. Proceedings Biological Sciences，2003，270(Suppl 2)：186-188.

［16］ Girvan M，Newman M E J. Community structure in social and biological networks[J]. Proceedings of the National Academy of Sciences，2002，99(12)：7821-7826.

［17］ Newman M E J，Girvan M. Finding and evaluating community structure in networks[J]. Physical Review E Statistical Nonlinear & Soft Matter Physics，2004，69(2)：026113-1-15.

［18］ Guimera R，Danon L，Diaz-Guilera A，et al. Self-similar community structure in a network of human interactions[J]. Physical Review E，2003，68(6)：065103.

［19］ Bogu?á M，Pastor-Satorras R，Díaz-Guilera A，et al. Models of social networks based on social distance attacment[J]. Physical Review E，2004，70(5)：056122-1-8.

Adaptive Overlapping Community Detection Algorithm Based on Mixing Parameter with the Trust Degree of Edge

Wang Qing1，Gu Chunmei1，Zhao Jianjun1，Cui Xin1，Hong Wenxing2，Xu Wenjing2

(1. School of Electrical and Information Engineering，Tianjin University，Tianjin 300072，China； 2. School of Aerospace，Xiamen University，Xiamen 361005，China)

The community structure in a network simplifies the analysis of the network topology，reveals the internal rules of the system，and provides strong support for information recommendation and information dissemination control. The overlapping community structure of the network is closer to real-life scenario，but its analysis is more difficult than the non-overlapping community. Therefore，to solve the overlapping community detection，based on the peak clustering，an adaptive overlapping community detection algorithm based on the mixing parameter with the trust degree of edge is proposed. In this study，the neighbor edge set of the network and the trust function between the edge and its neighbors are defined，and the total information of the edge is obtained through information transfer. Based on this concept，the concept of mixing parameters is introduced. Then，based on the k-means algorithm，clustering is performed using the mixed parameter，i.e.，the edges in the network are divided into a core edge set and a non-core edge set，and each core edge acts as a clustering center. According to the distance from the non-core edge to the core edge，the non-core edges are divided into the community of the nearest cluster center. According to the relation between edges and nodes in the network，overlapping node discovery is achieved. Ultimately the overlapping communities are detected. The advantage of this algorithm is that each edge finds the structure of the community by independently completing information transfer. Moreover，compared to the traditional peak clustering algorithm，the proposed algorithm does not need to set parameters；therefore，adaptive detection of overlapping communities is achieved. To verify the feasibility of our algorithm，the complexity of the algorithm is analyzed. The two evaluation indices of the community detection，normalized mutual information and modularity，are used to experiment on the artificial dataset and the six real datasets respectively. In comparison to other algorithms，the experimental results show that the proposed algorithm is more feasible and effective.

peak clustering；trust degree of edge；mixing parameter；overlapping community detection；adaptive algorithm

10.11784/tdxbz201809051

TN915.01

0493-2137(2019)06-0618-07

2018-09-17；

2018-10-15.

汪?清（1982—），博士，副教授，wangq@tju.edu.cn．

洪文興，hwx@xmu.edu.cn

國家自然科學基金資助項目(61871282)；福建省科技計劃資助項目(2018H0035)；廈門市科技計劃資助項目(3502Z20183011).

the National Natural Science Foundation of China(No.61871282)，the Science and Technology Program of Fujian，China (No.2018H0035)，the Science and Technology Program of Xiamen，China(No.3502Z20183011).

(責任編輯：王曉燕)