A prediction model of massive 5G network users' revisit behavior based on telecom big data
SUN Yudi
(School of Digital Commerce, Jiangsu Vocational Institute of Commerce, Nanjing 211168, Jiangsu, China)
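Users of 5G networks generate large volumes of access data, which makes their revisit behavior difficult to predict accurately. A prediction model of massive 5G network users' revisit behavior based on telecom big data is therefore proposed. Historical online behavior features are extracted from telecom big data to build a dataset. A multi-order weighted Markov chain model is introduced: the autocorrelation coefficient of each order is calculated to obtain the model weights, and the model statistic is computed. After analysis, the one-step transition probability matrix of the Markov chain for each step length is obtained, which enables accurate prediction of revisit behavior for massive numbers of 5G network users. Experimental results show that the model has the lowest mean error and standard deviation and the highest accuracy, recall, precision, and F1 score, indicating a clear advantage in predicting user revisit behavior.
telecom big data; user revisit behavior prediction; multi-order weighted Markov chain model; one-step transition probability matrix; autocorrelation coefficient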
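With the rapid development of 5G telecom networks, people can browse news, download data, and purchase goods through all kinds of websites, which makes life more convenient and broadens their knowledge. These activities inevitably generate massive amounts of network data. Mining useful information from this data with suitable algorithms, and predicting which websites a user may visit and which goods a user may buy in the future, has become a very active research topic. For users who are likely to revisit or repurchase, targeted recommendations based on their visit history and preferences can increase their willingness to buy to some extent. Users' browsing, operation, and access histories are stored in databases as log files, and how to use this behavioral data to determine whether a user will revisit is of great significance for the sustainable development of network platforms.
Reference [1] combines a deep neural network with different regularization methods: different groups are established, and revisit behavior is predicted on the dataset according to certain data features. Reference [2] predicts user clicks on the basis of user behavior sequences. The user's historical behaviors are sorted by interaction time to obtain a historical behavior sequence; a word embedding model is then introduced into the deep factorization machine (DeepFM) model, which learns adaptively from the sequence to obtain the user's interest list and capture changes in the user's interests, thereby achieving prediction.
Neither of these two methods can adapt to the current 5G big data network environment. This paper therefore proposes a prediction model of massive 5G network users' revisit behavior based on telecom big data. First, browsing data, behavior data, operation data, attribute data, and other information are extracted from server nodes to build a 5G telecom network dataset. Then, a multi-order weighted Markov chain model is constructed, and its transition matrix and initial probability vector are calculated. Finally, weights are calculated from the autocorrelation coefficients of each step length, and analysis of the weights yields the one-step transition probability matrix of the Markov chain for each step length, enabling accurate prediction of 5G network users' revisit behavior. In the experiments, the proposed model is compared with other methods in terms of predictive performance. The results show that the proposed model has clear advantages in several respects: its mean prediction error and standard deviation are consistently lower than those of the other two methods, and its prediction accuracy is considerably higher.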
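Table 1 Breakdown of data collected from the 5G telecom network
Before user revisit behavior can be predicted, a 5G telecom network dataset must be built [3]. To ensure that the user behavior data is accurate and timely, several server nodes in the 5G telecom network are selected and collection devices are deployed on them. The collected content falls into several categories, including user browsing data, user attribute data, user access behavior data [4], and user access depth data; Table 1 gives a breakdown of the data collected from the 5G telecom network.
The collection frequency [5] of the 5G telecom network data is set to 0.2 times/s, that is, one collection every 5 s. According to the type of information collected, the data is stored in 30 separate databases, which together contain more than 280 fields plus a number of extension fields. The data collected in this paper comes from the public database of a real website; each record represents all the browsing and operation behavior of a user during one page visit, so it reflects the user's behavioral characteristics truthfully and effectively.
Figure 1 Construction process of the 5G telecom network dataset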
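1.2.1 Multi-order weighted Markov chain model
Telecom big data is characterized by a large number of users, a large volume of data generated per user, and highly diverse user data, so analyzing and processing it is often inefficient and difficult. For this reason, a Markov chain model [7-9] is introduced to predict the revisit behavior of 5G telecom network users.
The Markov chain model makes the following assumption about users' online behavior: a user's browsing process is a stochastic process, specifically a homogeneous discrete Markov chain. The feature set formed by the user's online behavior can therefore be regarded as the range of a discrete random variable [10]; in other words, the user's browsing process forms a sequence of values of that random variable, and the sequence has the Markov property.
In summary, once the initial probability vector of the Markov chain model is known (together with its transition probability matrix), the user's revisit probability and the network interval of the revisit can be predicted for any time.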
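The role of the initial probability vector can be illustrated with a minimal sketch. It assumes, hypothetically, that user behavior has been coded into three states (0 = no visit, 1 = single visit, 2 = revisit); the state coding, the sample sequence, and the function names are illustrative and do not come from the paper. The sketch estimates a one-step transition matrix from an observed state sequence and then propagates the initial vector n steps forward, p(n) = p(0)P^n.

```python
def transition_matrix(states, n_states, step=1):
    """Estimate the transition matrix at lag `step` from an observed state sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[step:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # Assumption: a state never seen as a source gets a uniform row.
            matrix.append([1.0 / n_states] * n_states)
    return matrix


def propagate(p0, P, n):
    """Return p(n) = p(0) P^n by repeated vector-matrix multiplication."""
    p = list(p0)
    for _ in range(n):
        p = [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]
    return p


# Hypothetical coded history of one user (0 = no visit, 1 = visit, 2 = revisit).
history = [0, 1, 2, 2, 1, 0, 1, 2, 1, 2, 2, 0]
P = transition_matrix(history, n_states=3)

p0 = [0.0, 1.0, 0.0]          # the user is currently in state 1
p3 = propagate(p0, P, 3)      # state distribution three steps ahead
print("P =", P)
print("p(3) =", p3)
```

Each row of `P` sums to 1, so `p3` is again a probability distribution; its last component is the estimated probability that the user is in the revisit state three steps later.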
1.2.2 User revisit behavior prediction
Table 2 Values of … and … under different model orders
(2) Compute the statistic from Table 2.
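The weighting idea summarized in the abstract, in which each order receives a weight derived from its autocorrelation coefficient, can be sketched as follows. This is a common textbook formulation of a weighted Markov chain and is not necessarily the author's exact procedure. The following choices are all assumptions made for illustration: the normalization w_k = |r_k| / Σ|r_k|, the use of transition matrices estimated at lag k, the statistic 2 Σ f_ij |ln(p_ij / p_.j)| for checking the Markov property, and every function name.

```python
import math


def autocorrelation(x, k):
    """Lag-k autocorrelation coefficient r_k (assumes x is not constant)."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def lag_weights(x, max_order):
    """Normalize |r_k| for k = 1..max_order so that the weights sum to 1."""
    r = [abs(autocorrelation(x, k)) for k in range(1, max_order + 1)]
    total = sum(r)
    return [v / total for v in r]


def lag_counts(states, n_states, k):
    """Count transitions from the state at time t to the state at time t + k."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[k:]):
        counts[a][b] += 1
    return counts


def row_normalize(counts):
    """Turn a count matrix into a row-stochastic matrix (uniform row if empty)."""
    n = len(counts)
    return [[c / sum(row) for c in row] if sum(row) else [1.0 / n] * n
            for row in counts]


def markov_statistic(counts):
    """2 * sum_ij f_ij * |ln(p_ij / p_.j)|; usually compared with chi-square, (m-1)^2 dof."""
    total = sum(map(sum, counts))
    col = [sum(row[j] for row in counts) / total for j in range(len(counts))]
    stat = 0.0
    for row in counts:
        s = sum(row)
        for j, f in enumerate(row):
            if f:  # cells with zero frequency contribute nothing
                stat += f * abs(math.log((f / s) / col[j]))
    return 2.0 * stat


def predict_next(states, n_states, max_order):
    """Weighted sum over orders of the transition row of the state k steps back."""
    w = lag_weights(states, max_order)
    scores = [0.0] * n_states
    for k in range(1, max_order + 1):
        P_k = row_normalize(lag_counts(states, n_states, k))
        origin = states[-k]  # state observed k steps before the target time
        for j in range(n_states):
            scores[j] += w[k - 1] * P_k[origin][j]
    return scores


# Hypothetical coded history (0 = no visit, 1 = visit, 2 = revisit).
history = [0, 1, 2, 2, 1, 0, 1, 2, 1, 2, 2, 0]
weights = lag_weights(history, max_order=3)
scores = predict_next(history, n_states=3, max_order=3)
statistic = markov_statistic(lag_counts(history, 3, 1))
predicted_state = max(range(3), key=lambda j: scores[j])
```

The predicted state is the one with the largest weighted probability in `scores`. Under this formulation, a statistic above the chosen chi-square critical value would indicate that the sequence has the Markov property.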
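To verify that the proposed model is also sound and effective in practical applications, comparative experiments were carried out. The experimental data was extracted from the public database of a large network. To facilitate the experiments, the collected data was cleaned in advance by removing entries with a high missing rate, and classification models from the scikit-learn interface were trained on the dataset.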
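The cleaning step can be sketched as follows. The paper gives neither a missing-rate threshold nor field names, so the 0.5 threshold and the fields `user_id`, `page_views`, `referrer`, and `dwell_s` are hypothetical. The sketch uses only the Python standard library and covers the cleaning step alone, not the scikit-learn training.

```python
def missing_rates(records, fields):
    """Fraction of records in which each field is missing (None or absent)."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}


def drop_sparse_fields(records, fields, max_missing=0.5):
    """Keep only the fields whose missing rate does not exceed `max_missing`."""
    rates = missing_rates(records, fields)
    kept = [f for f in fields if rates[f] <= max_missing]
    cleaned = [{f: r.get(f) for f in kept} for r in records]
    return cleaned, kept


# Hypothetical log records; None marks a missing value.
records = [
    {"user_id": 1, "page_views": 5, "referrer": None, "dwell_s": 30},
    {"user_id": 2, "page_views": None, "referrer": None, "dwell_s": 12},
    {"user_id": 3, "page_views": 2, "referrer": "search", "dwell_s": None},
    {"user_id": 4, "page_views": 7, "referrer": None, "dwell_s": 45},
]
fields = ["user_id", "page_views", "referrer", "dwell_s"]

rates = missing_rates(records, fields)
cleaned, kept = drop_sparse_fields(records, fields, max_missing=0.5)
print(rates)
print(kept)
```

Here `referrer` is missing in three of four records and is dropped, while the remaining fields are kept for training.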
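First, the proposed model is compared with the models of references [1] and [2]. The three models are each applied to analyze users' online behavior over the same time period and to produce a final revisit behavior prediction. The mean error and standard deviation of the revisit behavior predictions of the three models are shown in Figure 2 and Figure 3, respectively.
Figure 2 Mean error of user revisit behavior prediction for the three models
Figures 2 and 3 show clearly that, as the data volume increases, the proposed model has the smallest mean error and standard deviation in revisit behavior prediction. The model of reference [2] has a somewhat lower mean error than that of reference [1], whereas the model of reference [1] has a somewhat lower standard deviation than that of reference [2].
Figure 3 Standard deviation of user revisit behavior prediction for the three models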
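Next, five metrics are used to further verify the revisit behavior prediction performance of the three models: recall, precision, the F1 score, accuracy (ACC), and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. User revisit behavior prediction is essentially a binary classification problem. According to each sample's true class and the class predicted by the algorithm, the prediction results fall into four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The sum of TP, FP, TN, and FN equals the total number of samples. TP+FP is the number of samples predicted as positive and TP+FN is the number that are actually positive; FN+TN is the number predicted as negative and FP+TN is the number that are actually negative.
ACC is a performance metric defined as the ratio of correctly classified samples to the total number of samples, that is, (TP+TN)/(TP+FP+TN+FN).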
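The metrics built from TP, FP, TN, and FN can be computed as in the sketch below, which uses the standard definitions of precision, recall, F1, and ACC. The label vectors are made-up illustration data, not results from the paper, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and accuracy from the four confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # TP + FP: predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0      # TP + FN: actually positive
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "acc": acc}


# Made-up labels: 1 = the user revisits, 0 = the user does not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
result = classification_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(result)
```

For these labels the counts are TP = 3, FP = 1, TN = 3, FN = 1, so all four metrics equal 0.75.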
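Predicting on the training set yields a predicted probability for each sample, which is compared with a probability threshold: if the predicted probability exceeds the threshold, the sample is classified as positive; otherwise it is classified as negative. The training set is then sorted by predicted probability to obtain the algorithm's final predictive performance. To compare the three models more fairly and accurately, 10-fold cross-validation is used to compile the final experimental results; the user revisit behavior prediction results of the three models are given in Table 3.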
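The thresholding and evaluation procedure can be sketched under stated assumptions. The 0.5 threshold, the pairwise-ranking computation of AUC, the shuffled fold assignment, and the function names are illustrative choices; the paper does not specify them.

```python
import random


def classify(probabilities, threshold=0.5):
    """Label a sample positive (1) when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]

labels = classify(probs)          # threshold at 0.5
area = auc(y_true, probs)
splits = list(k_fold_indices(25, k=10))
print(labels, area, len(splits))
```

In a full experiment each model would be fitted on the `train` indices of every split and scored on the matching `test` indices, with the ten scores averaged.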
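Table 3 User revisit behavior prediction results of the three models
Table 3 shows that the proposed model consistently gives the best results of the three, which indicates that it is the most accurate in predicting 5G network users' revisit behavior. This is because the proposed model uses a multi-order weighted Markov chain to analyze and process telecom big data order by order: by calculating the one-step transition probability matrix for each step length, it obtains the users' historical online behavior features, and further analysis of these features yields the revisit behavior prediction.
In the 5G telecom network environment, this paper uses a multi-order weighted Markov chain model to extract users' historical online behavior features from big data and analyzes them to determine users' browsing habits and preferences, so that revisit behavior can be predicted accurately and efficiently. Comparative experiments against other models show that the proposed model has the best predictive performance of the models compared and can predict user revisit behavior accurately.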
[1] LU Y H, SONG J L, WANG M, et al. The study on the prediction model based on deep neural network together with sparse group lasso[J]. Chinese Journal of Health Statistics, 2021, 38(6): 821-827. (in Chinese)
[2] GU Y R, WANG Y, YANG H G. Multi-action click prediction model for short video users based on user’s behavior sequence[J]. Journal of Electronics & Information Technology, 2023: 10.11999/JEIT211458. (in Chinese)
[3] CAO W C, WANG K, GAN H C, et al. User online purchase behavior prediction based on fusion model of CatBoost and Logit[J]. Journal of Physics: Conference Series, 2021, 2003(1): 012011.
[4] LI H R, LIN F Q, LU X, et al. Systematic analysis of fine-grained mobility prediction with on-device contextual data[J]. IEEE Transactions on Mobile Computing, 2022, 21(3): 1096-1109.
[5] QIAO S B, PANG S C, WANG M, et al. Online video popularity regression prediction model with multichannel dynamic scheduling based on user behavior[J]. Chinese Journal of Electronics, 2021, 30(5): 876-884.
[6] NIU B, SUI L, TANG J R, et al. Prediction of microblog users’ forwarding behavior based on interactive and active information[C]//Proceedings of the 2020 International Conference on Aviation Safety and Information Technology. New York: ACM Press, 2020: 554-559.
[7] XIAO Y P, LI J H, ZHU Y F, et al. User behavior prediction of social hotspots based on multimessage interaction and neural network[J]. IEEE Transactions on Computational Social Systems, 2020, 7(2): 536-545.
[8] HU G Y, ZHOU Z J, HU C H, et al. Hidden behavior prediction of complex system based on time-delay belief rule base forecasting model[J]. Knowledge-Based Systems, 2020, 203: 106147.
[9] SUDAN B, CANSIZ S, OGRETICI E, et al. Prediction of success and complex event processing in E-learning[C]//Proceedings of 2020 International Conference on Electrical, Communication, and Computer Engineering (ICECCE). Piscataway: IEEE Press, 2020: 1-6.
[10] SOLTANI N Y. Online learning of sparse Gaussian conditional random fields with application to prediction of energy consumers behavior[C]//Proceedings of 2021 IEEE Statistical Signal Processing Workshop (SSP). Piscataway: IEEE Press, 2021: 486-490.
[11] SUN L T, GAO S W, WANG L. An automatic test sequence generation method based on Markov chain model[C]//Proceedings of 2021 World Conference on Computing and Communication Technologies (WCCCT). Piscataway: IEEE Press, 2021: 91-96.
[12] DENNIS L A, FU Y, SLAVKOVIK M. Markov chain model representation of information diffusion in social networks[J]. Journal of Logic and Computation, 2022, 32(6): 1195-1211.
[13] PENG L, WEN L, QIANG L, et al. Research on complexity model of important product traceability efficiency based on Markov chain[J]. Procedia Computer Science, 2020, 166: 456-462.
[14] HAN C, CHEN J, TAN M K, et al. A tensor-based Markov chain model for heterogeneous information network collective classification[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34(9): 4063-4076.
[15] CRUZ I R, LINDSTRÖM J, TROFFAES M C M, et al. Iterative importance sampling with Markov chain Monte Carlo sampling in robust Bayesian analysis[J]. Computational Statistics & Data Analysis, 2022, 176: 107558.
[16] ALAMOUDI A, LIU M L, PAYANI A, et al. Predicting mobile users traffic and access-time behavior using recurrent neural networks[C]//Proceedings of 2021 IEEE Wireless Communications and Networking Conference (WCNC). Piscataway: IEEE Press, 2021: 1-6.
[17] LIU K, TATINATI S, KHONG A W H. A weighted feature extraction technique based on temporal accumulation of learner behavior features for early prediction of dropouts[C]//Proceedings of 2020 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE). Piscataway: IEEE Press, 2021: 295-302.
[18] SETIA S, JYOTI V, DUHAN N. HPM: a hybrid model for user’s behavior prediction based on N-gram parsing and access logs[J]. Scientific Programming, 2020: 1-18.
[19] CHEN L Y, WANG L H, ZHOU Y X. Research on data mining combination model analysis and performance prediction based on students’ behavior characteristics[J]. Mathematical Problems in Engineering, 2022: 1-10.
[20] RASOULI A, ROHANI M, LUO J. Bifold and semantic reasoning for pedestrian behavior prediction[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2022: 15580-15590.
[21] ZHOU H, YU K M, CHEN Y C, et al. A hybrid feature selection method RFSTL for manufacturing quality prediction based on a high dimensional imbalanced dataset[J]. IEEE Access, 2021, 9: 29719-29735.
[22] JIANG L, LIU H, JIANG H, et al. Heuristic and neural network based prediction of project-specific API member access[J]. IEEE Transactions on Software Engineering, 2022, 48(4): 1249-1267.
A prediction model of massive 5G network users’ revisit behavior based on telecom big data
SUN Yudi
School of Digital Commerce, Jiangsu Vocational Institute of Commerce, Nanjing 211168, China
Users in 5G networks generate a large amount of access data, which makes it difficult to accurately predict users’ revisit behavior. Therefore, a prediction model of massive 5G network users’ revisit behavior based on telecom big data was proposed. Users’ historical online behavior characteristic data was extracted from the telecom big data to build a data set. A multi-order weighted Markov chain model was introduced. The model weight value was obtained by calculating the autocorrelation coefficient of each order, and the statistics of the model were calculated. After analysis, the one-step transition probability matrix of the Markov chain with each step size was obtained, so as to accurately predict the revisit behavior of massive users in 5G networks. The experimental results show that the proposed model has the lowest mean error and standard deviation, as well as the highest accuracy, recall, precision, and F1 score, which shows that the proposed method has a clear advantage in predicting users’ revisit behavior.
telecom big data, prediction of users’ revisit behavior, multi-order weighted Markov chain model, one-step transition probability matrix, autocorrelation coefficient
CLC number: TP357
Document code: A
DOI: 10.11959/j.issn.1000-0801.2023026
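SUN Yudi (1981- ), female, is an associate professor at the School of Digital Commerce, Jiangsu Vocational Institute of Commerce. Her main research interests are ontology and knowledge engineering.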
Received: 2022-12-28; Revised: 2023-02-07
Foundation Items: "Qing Lan Project" Excellent Teaching Team Program of Jiangsu Universities in 2021; "Leading Talents" Program of Jiangsu Vocational Institute of Commerce