Non-continuous hierarchical data mining method based on cloud computing
LI Ying
(School of Computer Engineering Technology, Guangdong Institute of Science and Technology, Zhuhai 519090, Guangdong, China)
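Large cloud storage databases hold massive amounts of non-continuous hierarchical data. Such data exhibit strong self-coupling nonlinear characteristics, which makes them difficult to mine with traditional methods. To address this, a non-continuous hierarchical data mining algorithm based on cloud computing is proposed. The overall data mining model is analyzed; semantic directivity features are extracted from the non-continuous hierarchical data and quantization-coded; on the basis of the quantization coding, the fuzzy C-means clustering algorithm performs directional beam clustering of semantic ontology features, which yields the improved data mining algorithm. Experimental results show that the improved algorithm achieves higher accuracy, better performance, and stronger anti-interference capability than traditional methods.
cloud computing; semantics; data mining; data clustering; information retrieval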
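With the rapid development of network information and big data processing technology, large volumes of data are distributed across cyberspace through cloud storage models, forming Deep Web databases. Using cloud computing for data transmission and scheduling can effectively improve the access and information retrieval capabilities of Deep Web databases. Large cloud storage databases hold massive amounts of non-continuous hierarchical data with strong self-coupling nonlinear characteristics, which are difficult to mine under external interference. To improve semantic retrieval and information analysis for network databases, it is necessary to study non-continuous hierarchical data mining methods based on cloud computing and to build a data mining cloud platform in the cloud computing environment[1-3].
In recent years, many researchers have studied algorithms for mining non-continuous hierarchical data in cloud storage databases. Typical algorithms include those based on evolutionary game theory, statistical signal analysis, semantic feature extraction, and adaptive beamforming[4-8]. Building on these principles, researchers have studied and improved data mining algorithms. Reference [9] proposes a cloud computing non-continuous hierarchical data mining algorithm based on correlation dimension feature extraction: phase space reconstruction yields the high-dimensional trajectory of the cloud storage database, and correlation dimension feature extraction is modeled on that trajectory to mine the non-continuous hierarchical data in the cloud. The method achieves high mining accuracy, but it requires high-dimensional phase space decomposition, so its computational cost is high and the accuracy of its feature extraction is limited. Reference [10] proposes a data mining algorithm based on text detection: in the cloud computing environment, manual annotation and text detection are used to encode the nonlinear features of non-continuous hierarchical data, and accurate data access and information indexing are built on that encoding. This improves data mining and optimized database access, but the algorithm is sensitive to interference, and its mining accuracy and performance are poor at low signal-to-noise ratios[11-14].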
Fig.1 Schematic diagram of the data transmission channels of the cloud computing non-continuous hierarchical data exchange center
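Therefore, a non-continuous hierarchical data mining algorithm based on directional beam clustering of semantic ontology features is proposed. The overall data mining model and the structure of non-continuous hierarchical data are analyzed; semantic directivity features are extracted from the data and quantization-coded; on that basis, the fuzzy C-means clustering algorithm performs directional beam clustering of semantic ontology features, which yields the improved data mining algorithm. Finally, performance is tested by simulation experiments.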
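To realize non-continuous hierarchical data mining based on cloud computing, the overall data mining model is designed first. In a large cloud storage database, the non-continuous hierarchical data exchange center can control the choice of operating system. The model built in this paper adopts the OptorSim structure on a private cloud platform: the non-continuous hierarchical data in the large cloud storage database are divided into a 3×3 topology, and input/output channels are set up for 4 load-region levels. The data transmission channel model of the cloud computing non-continuous hierarchical data exchange center is shown in Fig.1.
In Fig.1, p1, p2 and p3 denote the data frame transmission nodes of the cloud computing non-continuous hierarchical data exchange center. A vector quantization feature coding model of the non-continuous hierarchical data is built around the nearest-neighbor points[15-20], with p2 taken as the data clustering center of the OptorSim structure. The level set function φ is initialized to obtain the fitness function of a single retrieval node in the large cloud storage database. The data from the 4 data exchange channels are clustered and feature-extracted, and the multipath gradient map of the non-continuous hierarchical data is extracted. The input channel models of the semantic ontology model of the data, x1, x2, x3 and x4, are expressed as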
(1)
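where m is the cloud computing association attribute in the information subspace. The overall design framework of the non-continuous hierarchical data mining model based on cloud computing is shown in Fig.2.
Fig.2 Overall implementation framework of the non-continuous hierarchical data mining model based on cloud computing
2.1 Problem statement and quantization coding
Based on the overall framework above, the mining model for the massive non-continuous hierarchical data distributed in large cloud storage databases is improved. Because non-continuous hierarchical data have strong self-coupling nonlinear characteristics, they are difficult to mine under strong interference. For the cloud computing environment, this paper proposes a non-continuous hierarchical data mining algorithm based on directional beam clustering of semantic ontology features. Semantic directivity features are extracted from the data and quantization-coded, and the maximum gradient of the non-continuous hierarchical data within the semantic ontology model window is adaptively weighted, giving the output association directivity weighting vector of the non-continuous hierarchical data as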
(2)
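A 1×N time window is used for feature compression. The size N of the mining time window is determined, the window is divided into many small time intervals, and vector quantization coding is performed. Given a detection function x(t), the distance between consecutive sliding windows of the quantization coding is expressed as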
(3)
where ωj is the maximum-gradient-difference weighting coefficient of the non-continuous hierarchical data, expressed as
(4)
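The windowing and weighting steps above can be illustrated with a short Python sketch. It splits a 1×N window into sub-intervals, takes the largest absolute first difference in each sub-interval as a stand-in for the maximum gradient difference, normalizes those values into weights ωj, and computes a weighted distance between two consecutive windows. The function names, the normalization, and the squared-difference distance are assumptions made for illustration only; they are not the paper's Eqs. (3) and (4).

```python
import math

def split_window(window, n_intervals):
    """Split a 1xN window into n_intervals contiguous sub-intervals."""
    n = len(window)
    size = n // n_intervals
    if size < 2:
        raise ValueError("each sub-interval needs at least 2 samples")
    # The last sub-interval absorbs any remainder.
    return [window[j * size:(j + 1) * size if j < n_intervals - 1 else n]
            for j in range(n_intervals)]

def max_gradient(segment):
    """Largest absolute first difference inside one sub-interval."""
    return max(abs(b - a) for a, b in zip(segment, segment[1:]))

def gradient_weights(window, n_intervals):
    """Weights omega_j: per-interval maximum gradients normalized to sum to 1
    (assumed normalization; uniform weights if the window is flat)."""
    grads = [max_gradient(s) for s in split_window(window, n_intervals)]
    total = sum(grads)
    if total == 0:
        return [1.0 / n_intervals] * n_intervals
    return [g / total for g in grads]

def weighted_window_distance(prev, curr, n_intervals):
    """Weighted Euclidean distance between two consecutive windows
    (assumed form), using the weights of the current window."""
    weights = gradient_weights(curr, n_intervals)
    prev_parts = split_window(prev, n_intervals)
    curr_parts = split_window(curr, n_intervals)
    dist = 0.0
    for w, p, c in zip(weights, prev_parts, curr_parts):
        dist += w * sum((ci - pi) ** 2 for pi, ci in zip(p, c))
    return math.sqrt(dist)
```

Under these assumptions, sub-intervals with sharper changes contribute more to the distance, so a new window is flagged as different from its predecessor mainly where the data jump.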
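Semantic directivity features are extracted from the useful text in the non-continuous hierarchical data. The data are assumed to be piecewise stationary; in each piecewise stationary linear region, the narrow time-domain windows TLX and TLY of the non-continuous hierarchical data are decided separately, giving the decision formula for text feature extraction as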
(5)
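Let the energy density spectrum of the non-continuous hierarchical data be m. At the minimum window distance, the clock sample Nj* of the vector quantization coding is obtained, where the vector space trajectory function of the vector quantization coding is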
(6)
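The extraction region of the semantic ontology feature directional beam is partitioned into a 3×3 topology and a specific window function is selected, giving the output set of vector quantization coded objects Fm(x,y) as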
(7)
(8)
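As a rough illustration of vector quantization coding, the sketch below trains a small codebook and encodes feature vectors as codeword indices. Section 2.2 refers to the result of "LGB coding", which is read here as the Linde-Buzo-Gray (LBG) algorithm; that reading, the splitting perturbation `eps`, the iteration count, and all function names are assumptions, and the code does not reproduce Eqs. (6) to (8).

```python
def nearest(codebook, v):
    """Index of the codeword closest to vector v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def lbg(vectors, n_codewords, eps=0.01, n_iter=20):
    """Train a codebook by LBG splitting; n_codewords should be a power of two."""
    codebook = [centroid(vectors)]
    while len(codebook) < n_codewords:
        # Split every codeword into a slightly perturbed pair.
        codebook = [tuple(c * (1 + s * eps) + s * eps for c in cw)
                    for cw in codebook for s in (1, -1)]
        for _ in range(n_iter):
            # Assign each vector to its nearest codeword.
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(codebook, v)].append(v)
            # Move each codeword to its cell centroid; empty cells keep theirs.
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook

def encode(codebook, vectors):
    """Quantization code of each vector: the index of its nearest codeword."""
    return [nearest(codebook, v) for v in vectors]
```

For example, `encode(lbg(features, 4), features)` would map each feature vector in a hypothetical list `features` to one of four integer codes, which a later clustering stage can consume.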
2.2 Implementation of the data mining algorithm
(9)
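On the t-ω plane, the output semantic directivity features are used to search for data clustering centers. The fuzzy C-means algorithm divides the finite multi-component non-continuous hierarchical data X into c classes, where c > 1:
(10)
(11)
(12)
Combining the LGB coding result above, the kernel function is modified and the weights are adjusted to obtain Nj* and the geometric neighborhood NEj*(t), giving the clustering centers for data mining as
(13)
(14)
The extremum of the data mining objective function is then
(15)
(16)
where m is the adaptive basis function and (dik)^2 is the measured distance between sample xk and text pixel sample Vi. From the semantic directivity feature extraction results above, the Euclidean distance in the feature space for clustering-based mining is
(17)
and satisfies
(18)
With the above processing, when xk and Vi form a complex conjugate pair, the semantic ontology features of the original data are preserved, directional beam clustering of semantic ontology features is achieved, and the accuracy of data mining improves.
3 Simulation experiments and performance tests
To verify the mining performance of the proposed non-continuous hierarchical data mining algorithm based on cloud computing, simulation experiments were carried out. The hardware environment was an Intel(R) Core(TM)2 Duo CPU at 2.94 GHz with 8.00 GB of memory. The algorithm was programmed in Matlab. The test data come from the large cloud storage database Deep Web 200G. The number of samples is 1 024, the sampling period is T=0.04 s, the intersymbol interference during mining corresponds to SNR=0~24 dB, the fundamental frequency of the scalar time series is 100 Hz, and the data contain 3 nonlinear frequency components. Under these settings, the data mining algorithm was simulated. First, the information flow model of the raw data was built and feature extraction was modeled; the time-domain waveform of the raw data information flow is shown in Fig.3.
Fig.3 Time-domain waveform of the raw data information flow
As Fig.3 shows, the raw data information flow distributed in the cloud storage database is disturbed by strong self-coupling nonlinear characteristics, which leads to low mining accuracy and poor performance. The mining model therefore needs to be improved. Semantic directivity feature extraction with the proposed method gives the result shown in Fig.4.
Fig.4 Semantic directivity feature extraction of non-continuous hierarchical data
As Fig.4 shows, extracting semantic directivity features from the non-continuous hierarchical data with the proposed method achieves directional beam clustering of the semantic ontology features; the clustering converges well and mining performance improves. To analyze mining performance quantitatively, the proposed algorithm and the traditional method were compared with mining accuracy as the test metric. 10 000 Monte Carlo experiments were run to obtain the root mean square error (RMSE) of the mining output; the comparison is shown in Fig.5.
Fig.5 Data mining performance comparison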
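The clustering step and the evaluation protocol can be sketched together in Python. The block implements the standard fuzzy C-means updates (memberships from distance ratios, centers as membership-weighted means) on 1-D data, then estimates the RMSE of the recovered cluster centers over repeated noisy trials at a chosen SNR. The synthetic data, the white Gaussian noise model, the fuzzifier value 2, the reduced trial count, and measuring RMSE on cluster centers are all assumptions for illustration; this is not the paper's Matlab code, kernel modification, or data set.

```python
import math
import random

def fcm(data, c, m=2.0, n_iter=50, tol=1e-6):
    """Standard fuzzy C-means on 1-D data; returns the sorted cluster centers."""
    lo, hi = min(data), max(data)
    # Deterministic initial centers spread evenly over the data range.
    centers = [lo + (i + 0.5) * (hi - lo) / c for i in range(c)]
    for _ in range(n_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
        u = []
        for x in data:
            d = [abs(x - v) + 1e-12 for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(c)) for i in range(c)])
        # Center update: mean of the data weighted by u_ik^m.
        new = []
        for i in range(c):
            w = [row[i] ** m for row in u]
            new.append(sum(wk * x for wk, x in zip(w, data)) / sum(w))
        shift = max(abs(a - b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    return sorted(centers)

def add_noise(values, snr_db, rng):
    """Add white Gaussian noise at the given SNR (dB) relative to signal power."""
    power = sum(v * v for v in values) / len(values)
    sigma = math.sqrt(power / (10 ** (snr_db / 10.0)))
    return [v + rng.gauss(0.0, sigma) for v in values]

def rmse_of_centers(true_centers, snr_db, n_trials=50, n_per_cluster=30, seed=0):
    """Monte Carlo RMSE between estimated and true cluster centers."""
    rng = random.Random(seed)
    truth = sorted(true_centers)
    clean = [t for t in truth for _ in range(n_per_cluster)]
    sq_err = 0.0
    for _ in range(n_trials):
        est = fcm(add_noise(clean, snr_db, rng), len(truth))
        sq_err += sum((e - t) ** 2 for e, t in zip(est, truth))
    return math.sqrt(sq_err / (n_trials * len(truth)))
```

Calling `rmse_of_centers([1.0, 5.0, 9.0], snr_db)` for several values of `snr_db` between 0 and 24 gives an RMSE-versus-SNR curve of the same kind as the comparison described above; raising `n_trials` toward 10 000 trades run time for a smoother estimate.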
As Fig.5 shows, the proposed algorithm yields a lower output RMSE for non-continuous hierarchical data mining based on cloud computing, which indicates higher mining accuracy than the traditional method and strong anti-interference performance.
4 Conclusion
Traditional data mining methods suffer from low mining accuracy and large errors. To address this, a non-continuous hierarchical data mining algorithm based on cloud computing is proposed. The overall data mining model is analyzed; semantic directivity features are extracted from the non-continuous hierarchical data and quantization-coded; on the basis of the quantization coding, the fuzzy C-means clustering algorithm performs directional beam clustering of semantic ontology features, which yields the improved data mining algorithm. Experimental results show that the proposed algorithm achieves high mining accuracy and good performance, clusters the semantic ontology feature directional beams well, and has strong anti-interference capability.
References:
[1] ZHOU Lei, SHAN Feng, LIU Peng, et al. The establishment of the enterprise informatization evaluation model based on supply chain[J]. Journal of Xi'an Polytechnic University, 2015, 29(6): 772-779.
[2] LIU Jingnan, FANG Yuan, GUO Chi, et al. Advances in big data analysis and processing location[J]. Geomatics and Information Science of Wuhan University, 2014, 39(4): 379-385.
[3] LI Peng, LIU Sifeng. Interval-valued intuitionist fuzzy numbers decision-making method based on grey incidence analysis and D-S theory of evidence[J]. Acta Automatica Sinica, 2011, 37(8): 993-999.
[4] ELDEMERDASH Y A, DOBRE O A, LIAO B J. Blind identification of SM and Alamouti STBC-OFDM signals[J]. IEEE Transactions on Wireless Communications, 2015, 14(2): 972-982.
[5] XU Y, TONG S, LI Y. Prescribed performance fuzzy adaptive fault-tolerant control of non-linear systems with actuator faults[J]. IET Control Theory and Applications, 2014, 8(6): 420-431.
[6] HUANG X, WANG Z, LI Y, et al. Design of fuzzy state feedback controller for robust stabilization of uncertain fractional-order chaotic systems[J]. Journal of the Franklin Institute, 2015, 351(12): 5480-5493.
[7] LU Xinghua, CHEN Pinghua. Traffic prediction algorithm in buffer based on recurrence quantification union entropy feature reconstruction[J]. Computer Science, 2015, 42(4): 68-71.
[8] TAN Jun, JIA Songmin, LI Xiuzhi, et al. Improved method for variational optical flow field estimation based on CLG[J]. Electronic Design Engineering, 2016(1): 5-8.
[9] CHOI J, YU K, KIM Y. A new adaptive component-substitution-based satellite image fusion by using partial replacement[J]. IEEE Transactions on Geoscience and Remote Sensing, 2011, 49(1): 295-309.
[10] MEZOUAR M C, KPALMA K, TALEB N, et al. A pan-sharpening based on the non-subsampled contourlet transform: Application to worldview-2 imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2014, 7(5): 1806-1815.
[11] GLENTIS G O, JAKOBSSON A, ANGELOPOULOS K. Block-recursive IAA-based spectral estimates with missing samples using data interpolation[C]//International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014: 350-354.
[12] SUN Weize, SO H C, CHEN Yuan, et al. Approximate subspace-based iterative adaptive approach for fast two-dimensional spectral estimation[J]. IEEE Transactions on Signal Processing, 2014, 62(12): 3220-3231.
[13] CHEN Dan, KE Xizheng, ZHANG Lu. Laser intermodulation distortion and characteristic under the turbulence channel[J]. Acta Photonica Sinica, 2016, 45(2): 93-97.
[14] XU Ning, XIAO Xinyao, YOU Hongjian, et al. A pansharpening method based on HCT and joint sparse model[J]. Acta Geodaetica et Cartographica Sinica, 2016, 45(4): 434-441.
[15] CUI Yongjun, ZHANG Yonghua. Linux system dual threshold scheduling algorithm based on characteristic scale equilibrium[J]. Computer Science, 2015, 42(6): 181-184.
[16] LIU Jun, LIU Yu, HE You, et al. Joint probabilistic data association algorithm based on all-neighbor fuzzy clustering in clutter[J]. Journal of Electronics and Information Technology, 2016, 38(6): 1438-1445.
[17] BAE S H, YOON K J. Robust online multiobject tracking with data association and track management[J]. IEEE Transactions on Image Processing, 2014, 23(7): 2820-2833.
[18] JIANG X, HARISHAN K, THAMARASA R, et al. Integrated track initialization and maintenance in heavy clutter using probabilistic data association[J]. Signal Processing, 2014(94): 241-250.
[19] LI L, XIE W. Intuitionistic fuzzy joint probabilistic data association filter and its application to multitarget tracking[J]. Signal Processing, 2014(96): 433-444.
[20] ZHONG F, LI H, ZHONG S, et al. An SOC estimation approach based on adaptive sliding mode observer and fractional order equivalent circuit model for lithium-ion batteries[J]. Communications in Nonlinear Science and Numerical Simulation, 2015, 24(1): 127-144.
Edited and proofread by: ZHAO Fang
The method of non continuous data mining based on cloud computing
LI Ying
(School of Computer Engineering Technology, Guangdong Institute of Science and Technology, Zhuhai 519090, Guangdong, China)
Large cloud storage databases hold massive discontinuous hierarchical data with strong coupled nonlinear characteristics, which are difficult to mine with traditional methods. A discontinuous hierarchical data mining algorithm based on cloud computing is put forward. The data mining model is analyzed as a whole, the semantic directivity characteristics of the discontinuous hierarchical data are extracted, and quantization coding is carried out. On the basis of the quantization coding, the fuzzy C-means clustering algorithm is adopted to complete semantic ontology directional beam clustering, which improves the data mining algorithm. The experimental results show that the improved algorithm has high precision, good performance and strong anti-jamming capability, and that its performance is superior to that of traditional methods.
cloud computing; semantic; data mining; data clustering; information retrieval
Article ID: 1674-649X(2016)04-0498-06
DOI: 10.13338/j.issn.1674-649x.2016.04.016
Received: 2015-12-13
Funding: Teaching reform project of the Guangdong Higher Vocational Education Teaching Management Committee (JGW2013026)
About the author: LI Ying (born 1977), female, from Shaoguan, Guangdong Province, lecturer at Guangdong Institute of Science and Technology; research interests: virtualization and cloud computing. E-mail: wing_lee@126.com
Citation: LI Ying. The method of non continuous data mining based on cloud computing[J]. Journal of Xi'an Polytechnic University, 2016, 30(4): 498-503.
CLC number: TP 391
Document code: A