魏盛娜 盛超
Abstract: Many criminals now use phishing websites to steal users' personal information and property, causing users heavy losses. This paper therefore applies a decision tree learning algorithm: keywords are extracted, a feature model of phishing websites is analyzed and built, and unknown websites are classified against it. CART is one such decision tree algorithm, but its majority-vote rule masks the influence of minority classes. To address this, the paper improves the CART decision tree by introducing a cost function and repeatedly adjusting the feature weights through iteration and the minimum mean square error to increase the penalty. Experimental results show that, when analyzing unknown websites, the improved decision tree reduces the error rate on negative samples and raises the recognition rate.
Key words: decision tree; URL identification; least mean square; cost function
CLC number: TP391 Document code: A Article ID: 1009-3044(2017)33-0079-02
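1 Background
A phishing website is generally one that masquerades as a legitimate site in order to steal account names, passwords, and other private information that users submit. More than ten anti-phishing tools have already appeared. This paper uses a decision tree method to identify the features of phishing URLs; researchers in China and abroad have also proposed many improved decision tree algorithms: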
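The ID3 algorithm, proposed by Quinlan in 1986, selects features by information gain [1]. J. Ma et al. [2] analyzed the lexical and host-based properties of suspicious URLs, represented the features with a bag-of-words model to obtain thousands of features, and detected phishing websites with feature matching combined with ID3. ID3 has shortcomings, however. Features with more distinct values generally yield higher information gain, so ID3 favors them and the resulting tree is often not optimal. ID3 also handles only discrete data, not continuous data.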
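C4.5 is Quinlan's own improvement on ID3 [3]; it introduces the information gain ratio (GainRatio) as the selection criterion. Sujata of Johns Hopkins University, together with researchers at Google, experimented with phishing pattern recognition based on URL features [4] and obtained good results with an improved C4.5 algorithm. During tree construction, however, the training set is sorted and scanned repeatedly, which increases the time complexity of the algorithm.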
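The difference between the ID3 and C4.5 criteria can be seen on a small example. The sketch below is illustrative only: the toy labels and the two features (`has_ip`, `url_id`) are hypothetical and do not come from the paper's data. It shows information gain preferring a feature with one distinct value per sample, while the gain ratio penalizes it.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def partition(values, labels):
    """Group the labels by the value the feature takes on each sample."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """ID3 criterion: entropy reduction obtained by splitting on the feature."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the feature's own value distribution
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
has_ip = ["y", "y", "y", "n", "n", "n", "n", "n"]  # two-valued, imperfect feature
url_id = ["a", "b", "c", "d", "e", "f", "g", "h"]  # a different value for every sample

ig_ip, ig_id = information_gain(has_ip, labels), information_gain(url_id, labels)
gr_ip, gr_id = gain_ratio(has_ip, labels), gain_ratio(url_id, labels)
print(f"information gain: has_ip={ig_ip:.3f}, url_id={ig_id:.3f}")
print(f"gain ratio:       has_ip={gr_ip:.3f}, url_id={gr_id:.3f}")
```

On this toy data information gain ranks `url_id` first even though it cannot generalize, whereas the gain ratio ranks `has_ip` first.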
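2 CART Decision Tree
The CART (Classification And Regression Tree) algorithm was proposed by L. Breiman, J. Friedman, R. Olshen, and C. Stone in 1984 [5]. For a classification problem with K classes in which a sample point belongs to class k with probability p_k, the Gini index of a given sample set D is: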
[\mathrm{Gini}(D)=\sum_{k=1}^{K}\sum_{k'\neq k}p_k p_{k'}=1-\sum_{k=1}^{K}p_k^2] (1)
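The CART decision tree algorithm proceeds as follows. Among all candidate features A and all candidate split points a, select the feature and split point with the smallest Gini index as the optimal feature and optimal split point. Using them, generate two child nodes from the current node and distribute the training data between the two child nodes according to that feature. The algorithm terminates when the number of samples in a node falls below a given threshold, when the Gini index of the sample set falls below a threshold, or when no features remain.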
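The split search described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes numeric features and binary splits of the form `feature <= threshold`, and the toy rows (URL length, number of dots) are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label list: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, feature, threshold):
    """Weighted Gini index after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels):
    """Try every feature and every observed value (except the largest) as a cut point."""
    best = None  # (weighted Gini, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({x[feature] for x in rows})[:-1]:
            score = split_gini(rows, labels, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Hypothetical toy data: (URL length, number of dots); 1 = phishing, 0 = legitimate.
rows = [(20, 1), (25, 2), (30, 1), (75, 4), (80, 2), (90, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(rows, labels))
```

A full tree would apply `best_split` recursively to each child node until one of the stopping conditions in the text is met.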
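3 Experimental Method
1) Algorithm improvement
As a classification and regression tree, CART is applied here to phishing website identification and outputs non-numeric labels. In practice, however, mistaking a phishing website for a legitimate one does far more harm than flagging a normal website as phishing [6]. We therefore introduce a cost function that sacrifices a very small amount of recognition rate on positive samples in order to reduce the error rate on negative samples.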
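Following the basic CART method, the samples are randomly partitioned and a base learner is trained on them. The ratio of the current miss rate (the proportion of phishing websites misjudged as normal websites) to the current false alarm rate (the proportion of normal web pages misjudged as phishing websites) is taken as the error output value d(n) and normalized. At the start of training the miss rate and false alarm rate carry no weight adjustment, so the initial ratio is 1:1. The estimation error is defined as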
[e(n)=d(n)-\hat{d}(n)] (2)
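where d(n) is the desired response and [\hat{d}(n)] is the estimate of d(n). Because the ideal estimation error is 0, e(n) can be regarded as a very small value close to 0. The error function J(w) is then introduced as the cost function.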
[J(w)=E\{|e(n)|^2\}=E\{e(n)e^*(n)\}] (3)
According to the Wiener-Hopf equation [7] and the orthogonality principle [8], the optimal weight vector [w_0] satisfies:
[Rw_0=p] (4)
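The step from the cost function (3) to equation (4) can be written out briefly. The sketch below assumes the linear estimate \hat{d}(n)=w^H u(n) that appears in the update steps, together with the standard definitions R=E\{u(n)u^H(n)\}, p=E\{u(n)d^*(n)\}, and \sigma_d^2=E\{|d(n)|^2\}; the text does not state these definitions explicitly.

```latex
\begin{aligned}
J(w) &= E\{|d(n) - w^H u(n)|^2\} \\
     &= \sigma_d^2 - w^H p - p^H w + w^H R w, \\
\nabla_{w^*} J(w) &= -p + R w = 0
  \quad\Longrightarrow\quad R w_0 = p, \\
-p + R w &\approx -u(n)\,d^*(n) + u(n)\,u^H(n)\,w(n) = -u(n)\,e^*(n).
\end{aligned}
```

The last line replaces R and p with their instantaneous estimates; stepping against this gradient estimate with step size \mu gives an update of the form w(n+1)=w(n)+\mu u(n)e^*(n).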
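To summarize, the improved CART algorithm proceeds as follows:
1) Initialization: set n=0, the weight vector [w(0)=0], the estimation error [e(0)=d(0)-\hat{d}(0)=d(0)], and the input vector [u(0)=[u(0)\ 0\ \cdots\ 0]^T].
2) Draw a fixed number of samples from the data set D and build CART decision trees to obtain learners 1, 2, ..., n, recording the error output value d(n) and the feature weights of each learner.
3) For n=0,1,..., update the weight vector [w(n+1)=w(n)+\mu u(n)e^*(n)], update the estimate of the desired signal [\hat{d}(n+1)=w^H(n+1)u(n+1)] with the step size μ set to 0.02, and obtain the estimation error [e(n+1)=d(n+1)-\hat{d}(n+1)].
4) Repeat steps 2 and 3.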
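The weight update in steps 1 and 3 is the least-mean-square (LMS) recursion. The sketch below shows only that recursion, for real-valued signals, where the complex conjugate has no effect. How u(n) and d(n) are obtained from the individual CART learners is not specified in the text, so the inputs here are synthetic stand-ins and `true_w` is a hypothetical target.

```python
import random

def lms(inputs, desired, mu=0.02):
    """Real-valued LMS: w(n+1) = w(n) + mu * u(n) * e(n), with e(n) = d(n) - w(n)^T u(n)."""
    w = [0.0] * len(inputs[0])  # step 1: w(0) = 0
    errors = []
    for u, d in zip(inputs, desired):
        d_hat = sum(wi * ui for wi, ui in zip(w, u))    # estimate of d(n)
        e = d - d_hat                                   # estimation error e(n)
        w = [wi + mu * ui * e for wi, ui in zip(w, u)]  # step 3: weight update
        errors.append(e)
    return w, errors

# Synthetic stand-in data (an assumption): d(n) is a fixed linear function of u(n).
rng = random.Random(0)
true_w = [0.6, -0.3, 0.1]
inputs = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(3000)]
desired = [sum(t * x for t, x in zip(true_w, u)) for u in inputs]

w, errors = lms(inputs, desired)
early = sum(e * e for e in errors[:100]) / 100   # mean squared error, first 100 steps
late = sum(e * e for e in errors[-100:]) / 100   # mean squared error, last 100 steps
print([round(x, 3) for x in w], early, late)
```

With the step size of 0.02 used in the text, the squared error on this synthetic data shrinks steadily as the weights approach `true_w`; the convergence speed on real data depends on the statistics of u(n).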
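4 Data Analysis
1) Data collection
The data used in this experiment come from the Phishing Websites data set in the UCI Machine Learning Repository [9], whose sources are Google search engine records and PhishTank records. It contains 1491 normal web pages and 1054 phishing websites. The training set and the test set are split in a 1:2 ratio. The main whitelist is drawn from web page data sampled from Alexa. Because the blacklist changes over time and is time-sensitive, it is tracked in real time; the main blacklist comes from the phishing URL list provided by the PhishTank website from June 2016 to June 2017.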
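A 1:2 split of these counts could look like the following sketch. The text does not say whether the split was stratified by class or how it was randomized, so per-class shuffling and the fixed seed are assumptions, and the integer identifiers are placeholders for the actual pages.

```python
import random

def stratified_split(samples, labels, train_fraction, seed=0):
    """Shuffle each class separately and send train_fraction of it to the training set."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [s for s, y in zip(samples, labels) if y == cls]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train += [(s, cls) for s in members[:cut]]
        test += [(s, cls) for s in members[cut:]]
    return train, test

# Placeholder identifiers standing in for the 1491 normal and 1054 phishing pages.
samples = list(range(1491 + 1054))
labels = [0] * 1491 + [1] * 1054

# A train:test ratio of 1:2 means one third of each class goes to training.
train, test = stratified_split(samples, labels, train_fraction=1 / 3)
print(len(train), len(test))
```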
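5 Experimental Evaluation and Result Analysis
Each classifier's predictions are evaluated by its precision and recall in detecting phishing websites [10]. TP is the number of websites the classifier correctly predicts as phishing; TN is the number it correctly predicts as normal; FP is the number it wrongly predicts as phishing; and FN is the number it wrongly predicts as normal.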
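Classification accuracy (the proportion of all samples that are classified correctly, where P+N in the denominator is the total number of positive and negative samples):
[P=\frac{TP+TN}{P+N}] (5)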
False alarm rate (the proportion of normal web pages misjudged as phishing websites):
[FPR=\frac{FP}{FP+TN}] (6)
Miss rate (the proportion of phishing websites misjudged as normal websites):
[FNR=\frac{FN}{FN+TP}] (7)
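The three measures can be computed directly from the four counts. The sketch below follows the verbal definitions given in the text, with phishing as the positive class; the label vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN, treating `positive` (phishing) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def rates(tp, tn, fp, fn):
    """Return (accuracy, false alarm rate, miss rate)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    false_alarm = fp / (fp + tn) if fp + tn else 0.0  # normal pages flagged as phishing
    miss = fn / (fn + tp) if fn + tp else 0.0         # phishing pages passed as normal
    return accuracy, false_alarm, miss

# Hypothetical predictions: 1 = phishing, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
accuracy, false_alarm, miss = rates(*counts)
print(counts, accuracy, false_alarm, miss)
```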
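6 Analysis of Experimental Results
The experimental results show that the decision tree model predicts unknown URLs effectively, classifies well, and reaches a reasonable classification accuracy. With the cost function based on the minimum mean square error, the miss rate is reduced, and because the iterative fitting follows the idea of ensemble learning, the recognition rate of the algorithm also improves; as Figure 2 shows, the error begins to converge at around 400 iterations. The weakness of the experiment, visible in Table 1, is that the improved decision tree lowers the miss rate at the expense of the false alarm rate, which begins to rise again once the number of iterations reaches about 200. More iterations also raise the complexity of the algorithm. Both points need further improvement in future work.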
References:
[1] Sujata Garera, Niels Provos, Monica Chew, et al. A framework for detection and measurement of phishing attacks[J]. WORM '07: Proceedings of the 2007 ACM Workshop on Recurring Malcode, 2007.
[2] Li L, Helenius M. Usability evaluation of anti-phishing toolbars[J]. Comput. Virol., 2014, 3(2):163-184.
[3] Zhang Y, Hong J, Cranor L. CANTINA: A content-based approach to detecting phishing websites[C]. In Proc. 16th Int. Conf. World Wide Web, Banff, Canada, 2016.
[4] Anthony Y Fu, Liu Wenyin, Deng Xiaotie. Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)[J]. IEEE Transactions on Dependable and Secure Computing, 2016, 3(4):301-311.
[5] Zhang Hai-jun, Liu Gang, Chow Tommy W S. Textual and visual content-based anti-phishing: a bayesian approach[J]. IEEE Transactions on Neural Networks / a publication of the IEEE Neural Networks Council, 2011, 22(10):1532-1546.
[6] Artem Vorobiev, Jun Han. Security Attack Ontology for Web Services[J]. 2015 IEEE, 2015.
[7] Reyes Rios-Cabrera, Tinne Tuytelaars, Luc Van Gool. Efficient multi-camera vehicle detection, tracking, and identification in a tunnel surveillance application[J]. Computer Vision and Image Understanding, 2012, 116(6):742-753.
[8] Abdelhamid N, Ayesh A, Thabtah F. Phishing detection based associative classification data mining[J]. Expert Systems With Applications, 2014, 41(13):24-28.
[9] Lin C F, Lin S F. Efficient face detection method with eye region judgment[J]. EURASIP Journal on Image and Video Processing, 2013, 2013(1):1-14.
[10] Reyes Rios-Cabrera, Tinne Tuytelaars, Luc Van Gool. Efficient multi-camera vehicle detection, tracking, and identification in a tunnel surveillance application[J]. Computer Vision and Image Understanding, 2012, 116(6):742-753.
[11] Barraclough P A, Hossain M A, Tahir M A. Intelligent phishing detection and protection scheme for online transactions[J]. Expert Systems With Applications, 2013, 40(11):4697-4706.
[12] Yuan-Hsin Tung, Chen-Chiu Lin, Hwai-Ling Shan. Test as a Service: A framework for Web security TaaS service in cloud environment[J]. IEEE, 2014.
[13] Zhang Hai-jun, Liu Gang, Chow Tommy W S. Textual and visual content-based anti-phishing: a bayesian approach[J]. IEEE Transactions on Neural Networks / a publication of the IEEE Neural Networks Council, 2011, 22(10):1532-1546.
[14] UCI data set[EB/OL]. http://archive.ics.uci.edu/ml/datasets/Phishing+Websites.
[15] Lim J S, Kim W H. Detecting and tracking of multiple pedestrians using motion, color information and the AdaBoost algorithm[J]. Multimedia Tools and Applications, 2013, 65(1):161-179.
[16] Himabindu Lakkaraju. A Machine Learning Framework to Identify Student at Risk of Adverse Academic Outcomes[Z]. SIGKDD2016.
[17] Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis[Z]. SIGKDD2016.