      DPIF: A Framework for Distinguishing Unintentional Quality Problems From Potential Shilling Attacks

2019-04-29 03:21:46
Computers, Materials & Continua, 2019, Issue 4

Mohan Li, Yanbin Sun, Shen Su, Zhihong Tian, Yuhang Wang, and Xianzhi Wang

Abstract: Maliciously manufactured user profiles are often generated in batches for shilling attacks. These profiles may bring in many quality problems but are not worth repairing. Since repairing data is always expensive, we need to scrutinize the data and pick out the data that really deserves to be repaired. In this paper, we focus on how to distinguish unintentional data quality problems from the batch-generated fake users of shilling attacks. A two-step framework named DPIF is proposed for the distinction. Based on the framework, the metrics of homology and suspicious degree are proposed. The homology can be used to represent both the text similarities and the data quality problems contained by different profiles. The suspicious degree can be used to identify potential attacks. Experiments on real-life data verified that the proposed framework and the corresponding metrics are effective.

Keywords: Data quality, shilling attacks, functional dependency, suspicious degree, homology.

      1 Introduction

Low quality data can severely impact the usability of data, and may lead to huge losses [Eckerson (2002)]. A lot of recent work has been proposed to evaluate data quality, and many data cleaning rules have been proposed [Fan and Geerts (2012); Fan, Geerts, Jia et al. (2008); Cao, Fan and Yu (2013); Fan, Geerts and Wijsen (2012); Chu, Ilyas and Papotti (2013a); Interlandi and Tang (2015); Fan, Li, Ma et al. (2012); Li, Li, Cheng et al. (2018); Li and Li (2016)]. However, most of these rules are adept at detecting errors, but not good at fixing them. For example, the functional dependencies fd1: city, street → zip and fd2: IsMarried, Gender → Relationship can be used to clean the data shown in Fig. 1. A functional dependency φ is violated if two tuples ti and tj are the same in the attributes on the left-hand side, but different in the attribute on the right-hand side. By using the FDs, we conclude that there may exist errors in the colored grids, since these grids violate the dependencies. However, it is difficult to know how to further locate and fix the errors. For instance, t2 and t6 violate fd2, but modifying any one of the attributes IsMarried, Gender and Relationship can resolve the violation. Repairing low quality data is much more complex than locating data quality problems. Since an unreasonable repair is not advisable, we may need a lot of effort or even human power, such as domain experts or crowdsourcing, to find proper repairs [Garcia-Ulloa, Xiong and Sunderam (2017); Zheng, Li and Cheng (2016); Zhang, Chen, Tong et al. (2015); Chu, Morcos, Ilyas et al. (2015)].

Figure 1: The dataset containing user profiles

Repairing low quality data is very expensive, thus we need to scrutinize the data and pick out the data that really deserves to be repaired. One type of data that is not worth repairing is the maliciously manufactured fake user profiles that are ready to be used for shilling attacks. The following example illustrates the situation.

Example 1.1. Consider the data shown in Fig. 1. By using fd1 and fd2, we will find that (1) t2 and t6 violate fd2, and (2) t3, t4 and t5 violate both dependencies. However, if we carefully check t3, t4 and t5, we will find that the three tuples are very similar in text. The similarity exists even in the attributes with different values and in the violations of the dependencies. It is normal for correct values to be similar, but it is suspicious if the wrong values are similar. A possible cause is that someone is trying to automatically generate fake user profiles for a shilling attack.

Shilling attacks [Gunes, Kaleli, Bilge et al. (2014)] aim to change the predictions of a recommender system by creating a set of fake users and the corresponding ratings. In modern recommender systems, a user is always asked to enter some personal information when creating a new account. Since the value fields may be constrained, a labor-saving way is to use similar settings for this information when creating fake users. Moreover, fake users are often created in batches, and the creators may not be familiar with the semantics of the different attributes. Thus these fake profiles tend to contain similar mistakes. In contrast, unintentional data quality problems are often sporadic and do not show strong group similarities.

Based on the above ideas, this paper proposes a two-step framework, named Dirty Profile Identify Framework (DPIF for short), which aims to distinguish unintentional data quality problems from potential shilling attacks. First, a set of candidates that may be erroneous is detected by data quality rules, and then clustering is utilized to find potential shills. The contributions of this paper are as follows.

1. We provide a two-step framework for distinguishing unintentional data quality problems from potential shilling attacks.

      2.We propose the metrics of homology and suspicious degree. The homology is defined to represent both the text similarity and the error similarities of different profiles. The suspicious degree is also defined to identify potential attacks.

3. Experiments on real-life data are carried out to verify the effectiveness of the proposed framework and metrics.

The rest of this paper is organized as follows. Section 2 introduces the related works. Section 3 provides an overview of the framework. Section 4 studies the process of finding candidate dirty tuples. Section 5 gives the methods for distinguishing quality problems and potential attacks. Section 6 shows the experimental study. Section 7 concludes the paper.

      2 Related works

      2.1 Data quality

There are currently a lot of works on data quality evaluation [Fan, Geerts, Jia et al. (2008); Fan, Geerts and Wijsen (2012); Chu, Ilyas and Papotti (2013a); Interlandi and Tang (2015); Fan and Geerts (2012); Ilyas, Chu et al. (2015); Chu, Ilyas and Papotti (2013b); Rammelaere, Geerts and Goethals (2017); Cao, Fan and Yu (2013); Kruse and Naumann (2018); Li, Li, Cheng et al. (2018); Li and Li (2016); Garcia-Ulloa, Xiong and Sunderam (2017); Li, Sun, Jiang et al. (2018)]. Many data quality rules, such as functional dependencies, conditional functional dependencies, denial constraints and editing rules, are used to detect data quality problems in a given dataset. The rules can be roughly classified into two categories.

The first category includes functional dependencies, conditional functional dependencies, denial constraints and some other similar dependencies [Fan, Geerts, Jia et al. (2008); Cao, Fan and Yu (2013); Fan, Geerts and Wijsen (2012); Chu, Ilyas and Papotti (2013a)]. These dependencies are good at finding quality problems but not at fixing them. The reason is that these rules usually indicate which combinations of values in the data are incorrect, but do not indicate how to pinpoint and fix the errors.

The second category of dependencies tries to add information related to data repair in the rules [Interlandi and Tang (2015); Fan, Li, Ma et al. (2012)]. These rules can effectively fix errors in the data, but they are usually defined on small subsets of the data, which limits their application scenarios and makes them expensive to obtain.

Some crowdsourced methods have also been proposed to overcome the difficulties in data repair [Chu, Morcos, Ilyas et al. (2015); Zhang, Chen, Tong et al. (2015); Garcia-Ulloa, Xiong and Sunderam (2017)]. These repair methods provide accurate repair results, but manpower repairs are often expensive and inefficient, so we must check before the repair operation to ensure that the data is worth repairing.

      2.2 Shilling attacks

With the widespread use of big data analytics, security issues derived from big data are also receiving attention [Liu, Peng and Wang (2018); Wang, Liu, Qiu et al. (2018); Wang, Tian, Zhang et al. (2018); Tian, Cui, An et al. (2018); Chen, Tian, Cui et al. (2018); Zhang, Wang, Wang et al. (2018); Tian, Su, Shi et al. (2018); Sun, Li, Su et al. (2018)]. A shilling attack is an attack on big data that negatively impacts the accuracy of a recommender system. The attackers try to inject dirty profiles into the data set, and the injected dirty profiles may severely reduce data quality. Existing techniques, including supervised and unsupervised methods, mainly focus on how to identify a user's abnormal rating behavior [Gunes, Kaleli, Bilge et al. (2014)]. The current research focus on shilling attacks is mainly on how to manipulate ratings to influence the accuracy of the recommender system. The types of attacks include the random attack, average attack, bandwagon attack [Mobasher, Burke, Bhaumik et al. (2007)], average-noise injecting attack, average-target shift attack [Williams, Mobasher, Burke et al. (2006)], etc.

The methods of detecting shilling attacks can be roughly classified into supervised and unsupervised methods. Supervised methods [Wu, Wu, Cao et al. (2012); Yang, Xu, Cai et al. (2016); Zhang and Zhou (2014)] require a well-labeled training data set. Such data are often hard to get in real-life applications, because even for humans it is difficult to judge whether a rating record is a malicious attack. Furthermore, supervised methods also fall short in scenarios where the type of attack is not known in advance. Unsupervised methods [Zhang, Zhang, Zhang et al. (2018); Bryan, O'Mahony and Cunningham (2008); Bhaumik, Mobasher and Burke (2011); Yang, Cai and Guan (2016)] can overcome some of the shortcomings of supervised methods. These methods are mainly based on clustering user rating behaviors to discover shilling attacks.

The existing works basically pay attention to observing the abnormality of the user's rating behavior, but ignore the abnormality of the user's personal information (the so-called user profile). Dependencies between different attributes of a user profile are often readily available, and many existing works can detect anomalous values on dependent attributes. In order to inject fake ratings, attackers often need to create a lot of fake user profiles to register accounts. When creating these accounts, an attacker may ignore the attribute dependencies and cause many different types of data quality problems in the user profiles. Observing the homology of these data quality issues can effectively help us detect abnormal accounts and discover potential shilling attacks.

      3 Overview of DPIF

Figure 2: An overview of DPIF

Let R be a user profile table defined on a set of attributes attr(R). Each tuple t ∈ R corresponds to the information of a user. t[A] is the value of t on attributes A ⊆ attr(R). For example, for the tuple t3 in Fig. 1, t3[city, street, zip] is (Guangzhou, BeijingRoad, 510070). A data quality rule set Φ is defined on attr(R), and can be used to detect the candidate dirty user profiles. The main two-step workflow of DPIF is as follows, and the main components are shown in Fig. 2.

1. The dirty profile detector uses Φ to detect possible errors in R. The possibly erroneous data are collected and organized by provenance, i.e., by the tuples they come from (named candidate dirty user profiles) and the rules they violate.

2. The candidate dirty user profiles are given to the shills identifier. The shills identifier clusters the dirty user profiles based on provenance, and calculates the suspicious degree of each user profile. Based on the suspicious degree, the dirty user profiles are classified as either potential shilling attacks or unintentional quality problems.

For example, if we use DPIF to process the table shown in Fig. 1, the candidate dirty user profiles are {t2, t3, t4, t5, t6}, since the colored grids are detected as violations of fd1 and fd2. Then, the shills identifier may identify {t3, t4, t5} as potential attacks, since they are similar in both text and errors. {t2} and {t6} are identified as unintentional quality problems because they have no suspicious analogues. Technical details of the dirty profile detector and the shills identifier are discussed in Section 4 and Section 5.

      4 Detecting candidate dirty profiles

For ease of discussion, we choose functional dependencies (FDs) as the data quality rules, but it is easy to verify that the other types of data quality rules discussed in Section 2 can also be applied in the dirty profile detector.

A FD is a constraint between two sets of attributes in a relation. Given two attribute sets X and Y in attr(R), a FD φ: X → Y means that each X value in R is associated with precisely one Y value in R. X is called the left-hand side of X → Y, denoted by lhs(φ), and Y is called the right-hand side of φ, denoted by rhs(φ). For example, fd1: city, street → zip means that city and street uniquely determine the value of zip. Therefore, if two records ti and tj are the same in city and street, but different in zip, we can detect an error in ti[city, street, zip] and tj[city, street, zip]. We choose FDs as the quality rules because FDs are defined on the entire dataset rather than on a subset of the data. Thus a high recall can be obtained with a small number of rules.

Only a pair of tuples can violate a FD, thus the process of this step is as follows. For each FD φ ∈ Φ, scan R to find the pairs of tuples that violate φ. Concretely, for all ti, tj ∈ R, if ti[lhs(φ)] = tj[lhs(φ)] but ti[rhs(φ)] ≠ tj[rhs(φ)], we say that ti, tj violate φ. Then, ti[lhs(φ) ∪ rhs(φ)] and tj[lhs(φ) ∪ rhs(φ)] are marked as potential dirty areas, and ti, tj are marked as candidate dirty profiles.

The time complexity is O(|Φ||R|²). Some other existing techniques could also be employed to speed up this step. Since the optimization of dirty data detection is beyond the scope of this paper, we choose the simplest implementation.

      4.1 Organization of the intermediate results

To make the subsequent process effective, we need to carefully organize the intermediate results. Three kinds of lists need to be maintained, i.e., candi(R), vio(t) and vio(ti, tj).

• candi(R) stores all the candidate dirty profiles.

• vio(t) stores all the FDs that are violated by t, i.e., for each pair ti, tj that violates φ, φ is in both vio(ti) and vio(tj).

• vio(ti, tj) stores all the FDs that are jointly violated by ti and tj, i.e., for each pair ti, tj that violates φ, φ is in vio(ti, tj).

Example 4.1. Let the dataset and FDs in Fig. 1 be R and Φ, respectively. Then candi(R) = {t2, t3, t4, t5, t6}; for t2, t3 and t6, we have vio(t2) = {fd2}, vio(t3) = {fd1, fd2}, vio(t2, t3) = ∅, and vio(t2, t6) = {fd2}.

Please note that vio(ti, tj) is not necessarily equal to vio(ti) ∩ vio(tj), since ti and tj can violate the same FD jointly with other tuples. Actually, vio(ti, tj) ⊆ vio(ti) ∩ vio(tj). These lists will be given to the shills identifier and make the identifying process more efficient.
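The pairwise violation scan and the three lists above can be sketched in Python as follows. The data representation is an assumption for illustration: profiles are dicts and each FD is a pair of attribute tuples (lhs, rhs).

```python
from itertools import combinations

def detect_candidates(R, fds):
    """Pairwise FD-violation scan, the O(|Phi| * |R|^2) baseline described above.

    R   : list of dicts, one dict per user profile
    fds : list of (lhs, rhs) attribute tuples, e.g. (("city", "street"), ("zip",))
    Returns candi(R), vio(t) and vio(ti, tj), keyed by tuple index.
    """
    candi, vio, vio_pair = set(), {}, {}
    for k, (lhs, rhs) in enumerate(fds):
        for i, j in combinations(range(len(R)), 2):
            ti, tj = R[i], R[j]
            # ti and tj violate fd k if they agree on lhs but differ on rhs
            if all(ti[a] == tj[a] for a in lhs) and any(ti[a] != tj[a] for a in rhs):
                candi.update((i, j))
                vio.setdefault(i, set()).add(k)
                vio.setdefault(j, set()).add(k)
                vio_pair.setdefault((i, j), set()).add(k)
    return candi, vio, vio_pair
```

On a small table with two tuples that agree on (city, street) but differ on zip, the two tuples end up in candi(R), the violated FD appears in both of their vio(t) lists, and the pair appears in vio(ti, tj).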

      5 Distinguishing unintentional quality problems and potential attacks

The intermediate results are given to the shills identifier, whose aim is to divide candi(R) into two subsets, containing unintentional quality problems and potential attacks respectively. The main idea of the division is that it is suspicious if two profiles have similar errors. Therefore, this step can be further refined into two steps.

      1.We propose a definition of the homology as the similarity of both the text and the errors contained by two different profiles, and cluster the candidate dirty profiles based on homology.

2. Profiles for attacks are usually generated in batches, thus larger clusters are more likely to be potential attacks. Therefore, we can define the suspicious degree based on the size of the clusters.

      The following subsections provide details of the two steps.

      5.1 Clustering the candidate dirty profiles

First we provide the definition of homology, which reflects both the text similarity and the similarity of the "errors" contained by different profiles. Homology is similarity in traits due to shared ancestry. High homology of two profiles means that they are very similar in text and in errors; thus they may be generated in batch and are considered suspicious attacks. The definition is as follows.

Definition 5.1 (profile homology). The homology of ti and tj is denoted by H(ti, tj), and can be computed by Formula (1):

H(ti, tj) = α · textSim(ti, tj) + (1 − α) · vioSim(ti, tj)    (1)

where α is the normalization parameter, textSim(ti, tj) is the text similarity of ti and tj, and vioSim(ti, tj) is the similarity of the violations of ti and tj, that is,

vioSim(ti, tj) = |vio(ti, tj)| / |vio(ti) ∪ vio(tj)|

Please note that the definition of vioSim(ti, tj) is different from the Jaccard similarity of two sets, because vio(ti, tj) is not necessarily equal to vio(ti) ∩ vio(tj). However, when the data has many quality problems, |vio(ti, tj)| might be much lower than |vio(ti)| and |vio(tj)|, which makes vioSim close to 0. In this case, we can use vio(ti) ∩ vio(tj) instead of vio(ti, tj) to avoid excessively low homology. textSim(ti, tj) can be calculated by any reasonable similarity measure. For example, the normalized absolute error can be used for numerical attributes, and the normalized edit distance or cosine distance can be used for string attributes.
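A minimal sketch of the homology computation, assuming the simple textSim of Example 5.1 (the fraction of attributes on which two profiles agree) and the vioSim form reconstructed above:

```python
def text_sim(ti, tj):
    # Fraction of attributes on which ti and tj have identical values
    # (the textSim measure used in Example 5.1).
    attrs = list(ti.keys())
    return sum(ti[a] == tj[a] for a in attrs) / len(attrs)

def vio_sim(i, j, vio, vio_pair):
    # |vio(ti, tj)| / |vio(ti) union vio(tj)|; an assumption consistent with
    # the Jaccard discussion above, since vio(ti, tj) replaces the intersection.
    union = vio.get(i, set()) | vio.get(j, set())
    joint = vio_pair.get((min(i, j), max(i, j)), set())
    return len(joint) / len(union) if union else 0.0

def homology(i, j, R, vio, vio_pair, alpha=0.5):
    # Formula (1): alpha weights textSim against vioSim.
    return alpha * text_sim(R[i], R[j]) + (1 - alpha) * vio_sim(i, j, vio, vio_pair)
```

With alpha = 1 only the text similarity matters; with alpha = 0 only the violation similarity matters, matching the experimental settings in Section 6.2.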

      A cluster is a set of profiles,thus the definition of the homology of clusters can be given based on the definition of profile homology.

Definition 5.2 (cluster homology). The homology of a cluster c is the minimum homology of the profiles in c, that is,

H(c) = min{ H(ti, tj) | ti, tj ∈ c, ti ≠ tj }.

The homology of two clusters ci and cj is the minimum homology of the profiles in ci ∪ cj, that is,

H(ci, cj) = min{ H(ti, tj) | ti, tj ∈ ci ∪ cj, ti ≠ tj }.

Based on the homology, single-linkage agglomerative clustering can be used to cluster profiles that are similar in both errors and text. Given a homology threshold θ, the agglomerative clustering aims to generate a set of clusters C = {c1, c2, ..., cl} which satisfies three conditions: (1) H(ci) ≥ θ for each ci ∈ C, (2) each profile in candi(R) belongs to exactly one cluster in C, and (3) H(ci, cj) < θ for each pair of distinct clusters ci, cj ∈ C.

The procedure of the agglomerative clustering is shown in Algorithm 5.1. The clustering process starts from the bottom layer and merges the two clusters with the highest homology value each time. In Algorithm 5.1, Lines 1 to 3 initialize the clusters in C; each cluster contains exactly one element of candi(R). Please note that profiles not in candi(R) do not need to be clustered, since there are no errors in these profiles. In the loop from Line 4 to 9, tmpc1 and tmpc2 point to the two clusters with the highest homology. The union tmpc1 ∪ tmpc2 is added to C, while tmpc1 and tmpc2 are removed from C.

      Algorithm 5.1 The single-linkage agglomerative clustering of candidate dirty profiles
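A minimal Python sketch of the procedure, following Definition 5.2 and the description above; the function names and the homology callback H are illustrative assumptions:

```python
from itertools import combinations

def cluster_homology(members, H):
    # Definition 5.2: the minimum pairwise homology over the profiles in the set.
    return min(H(i, j) for i, j in combinations(sorted(members), 2))

def agglomerative_cluster(candidates, H, theta):
    """Agglomerative clustering over candi(R) (sketch of Algorithm 5.1).

    Starts from singleton clusters (Lines 1-3) and repeatedly merges the two
    clusters whose union has the highest homology, stopping when no merge
    would keep the homology at or above theta (the loop of Lines 4-9).
    """
    C = [frozenset([t]) for t in candidates]
    while len(C) > 1:
        best, best_h = None, -1.0
        for x, y in combinations(range(len(C)), 2):
            h = cluster_homology(C[x] | C[y], H)
            if h > best_h:
                best, best_h = (x, y), h
        if best_h < theta:      # no merge reaches theta: clustering is done
            break
        x, y = best
        C = [c for k, c in enumerate(C) if k not in (x, y)] + [C[x] | C[y]]
    return C
```

Every returned cluster satisfies H(ci) ≥ θ by construction, and any two distinct clusters have homology below θ, matching the three conditions stated above.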

Example 5.1. For the dataset shown in Fig. 1, we define textSim(ti, tj) as the percentage of identical values in ti and tj, with α = 0.5 and θ = 0.8. The matrix in Fig. 3 shows the homology of the profiles. The grey grids show the homology values higher than θ. If we use Algorithm 5.1 to cluster the profiles, we get C = {{t2}, {t3, t4, t5}, {t6}}.

      5.2 Calculating the suspicious degree

As discussed before, profiles for attacks are usually generated in batches, thus larger or tighter clusters in the clustering results are more likely to be potential attacks. Therefore, larger clusters are more suspicious. We provide a metric of suspicious degree based on the size of the cluster.

Definition 5.3 (suspicious degree). The suspicious degree of a cluster is the ratio of its size to the size of the largest cluster, that is,

susp(c) = |c| / max{ |c′| : c′ ∈ C }.

Figure 3: The homology matrix

The suspicious degree of a profile t is the suspicious degree of the cluster it is located in, that is, susp(t) = susp(c(t)), where c(t) is the cluster containing t.

Based on the results of Example 5.1, the suspicious degree of the clusters can be calculated, that is, susp({t3, t4, t5}) = 3/3 = 1, and susp({t2}) = susp({t6}) = 1/3 ≈ 0.33.

All clusters can be sorted according to the suspicious degree, so that we can check from top to bottom which profiles might be potential attacks. We can set a parameter k; the top-k clusters with the highest suspicious degree are considered potential attacks, while the remaining clusters are considered unintentional quality problems.
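The ranking step above can be sketched directly from Definition 5.3; the helper names are illustrative:

```python
def suspicious_degree(C):
    # Definition 5.3: susp(c) = |c| / size of the largest cluster in C.
    largest = max(len(c) for c in C)
    return {c: len(c) / largest for c in C}

def split_top_k(C, k):
    # The k most suspicious clusters are flagged as potential attacks;
    # the remaining clusters are treated as unintentional quality problems.
    susp = suspicious_degree(C)
    ranked = sorted(C, key=lambda c: susp[c], reverse=True)
    return ranked[:k], ranked[k:]
```

On the clusters of Example 5.1 with k = 1, the batch {t3, t4, t5} (susp = 1) is flagged as a potential attack, while {t2} and {t6} (susp ≈ 0.33) are treated as unintentional quality problems.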

      6 Experimental results

We conduct the experiments on a real-life data set. The code is written in Python and run on a machine with an i7 1.80 GHz Intel CPU and 8 GB of RAM. The dataset used in the experiments is a UCI dataset (https://archive.ics.uci.edu/ml/datasets) consisting of restaurant data with consumer ratings. We use the DPIF proposed in this paper to distinguish the unintentional data quality problems from potential shilling attacks. We calculate the suspicious degree of each cluster and check whether the attacks occur in the clusters with top-k suspicious degree. In this process, there are four important parameters, namely (p1) the method of generating dirty profiles, (p2) the size of k when selecting the top-k clusters, (p3) the homology threshold θ, and (p4) the selection of the FD set. We tested the recall and precision for different settings of the four parameters. In all experiments, 10% of the data are unintentional quality problems. The measures of effectiveness are recall and precision, which are defined as follows.

recall = tp / (tp + fn),  precision = tp / (tp + fp),

where tp, fp, fn are the numbers of true positive attacks, false positive attacks and false negative attacks, respectively.

Figure 4: Recall and precision when varying k and dr

      6.1 Varying dirty profiles and k

We put (p1) and (p2) together for the experiment, because the two parameters are related. There are numerous ways to generate dirty profiles. We choose to change some values in a normal profile to generate dirty profiles in batches. More concretely, we use one normal profile and randomly modify the profile within the range dr (i.e., dr is the range in which modifications are allowed to appear) to get a set of fake profiles. The profiles in the same set tend to be similar to each other, and the set is called a batch. The number of batches is denoted by b. Ideally, each cluster in the result of DPIF corresponds to a batch. Thus, to find all the potential attacks, k should be no less than b. Therefore, in the experiment, we vary k/b (i.e., k = b, k = 2b, etc.) and dr. The recall and precision are shown in Fig. 4.
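A sketch of the batch generator described above. The paper does not specify the concrete perturbation, so the one used here (appending a random digit to at most dr · |attributes| values) is purely an illustrative assumption:

```python
import random

def generate_batch(profile, size, dr, seed=0):
    """Generate one batch of fake profiles from a single normal profile.

    Each fake profile copies `profile` and perturbs a randomly chosen subset
    of at most dr * |attributes| values. The perturbation (appending a random
    digit) is an assumption, not the paper's exact procedure.
    """
    rng = random.Random(seed)
    attrs = list(profile.keys())
    limit = max(1, int(dr * len(attrs)))
    batch = []
    for _ in range(size):
        fake = dict(profile)
        for a in rng.sample(attrs, rng.randint(0, limit)):
            fake[a] = str(fake[a]) + str(rng.randint(0, 9))
        batch.append(fake)
    return batch
```

Profiles in one batch share most of their values with the seed profile, so they tend to have high pairwise homology, which is exactly what the clustering step exploits.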

We find that an increase of dr has a more obvious impact on recall and will also slightly affect precision. This is because a larger value of dr means higher randomness when generating fake profiles, thus the quality issues that appear in the profiles are more diverse. In other words, vioSim will be lower, which decreases homology. dr also impacts precision. However, the normal profiles are not very similar to each other, thus they have low homology, which makes them hard to cluster together. Therefore, dr has less impact on precision. The value of k also impacts the recall and precision. Since we only check the top-k clusters, the higher k is, the higher the recall and precision we get. When k = b and dr = 50%, the recall is only 0.82; when k is increased to 3b, the recall is increased to 0.97. k does not affect precision much. This is because most normal profiles are in small clusters, which are not likely to appear in the top-k suspicious clusters.

      6.2 Varying the homology threshold θ

As discussed in Section 5, the higher the value of θ, the less likely the dirty profiles are to be clustered together. Since we only check the top-k clusters, the larger θ, the lower the recall. The precision is less affected by θ. We test three different cases of α, i.e., α = 1, α = 0.5 and α = 0. The value of α is actually the weight between textSim and vioSim. α = 0 means that we only consider vioSim, α = 0.5 means that textSim and vioSim are weighted equally, and α = 1 means that we only consider textSim. The experimental results are shown in Fig. 5. In the experiments, dr is fixed to 25%.

Figure 5: Recall and precision when varying θ

It can be observed from the results that when α = 1 the recall and precision are obviously lower than in the cases α = 0.5 and α = 0. This is because in our scenario, even if the fake profiles are inserted in batches, these fake profiles are somewhat different from each other. With a high similarity threshold (greater than 0.9 in our experiments), it is hard to distinguish normal quality problems from batch-produced fake profiles. In the experimental results shown in Fig. 5, the results of α = 0 are better than those of α = 0.5. However, this does not mean that a lower α is always better. For example, as we have discussed, an increase of dr will lead to a decrease of vioSim. In that case, a higher α will be better.

      6.3 Varying the FD set

We test three FD sets, consisting of 10, 20 and 30 different FDs, respectively. The results for recall and precision are shown in Fig. 6. It can be observed that the number of FDs does not determine the effectiveness. That is, more FDs do not mean higher recall and precision. For example, in our experiment the set consisting of 10 FDs achieved the highest recall and precision. The recall of the set consisting of 20 FDs is higher than that of the set consisting of 30 FDs, but for precision it is the opposite.

We further examined the performance of each FD throughout the process and found that, in general, FDs with a shorter left-hand side (i.e., with fewer attributes on the left-hand side) tend to be more effective than those with a longer left-hand side. This is because a shorter left-hand side is easier to satisfy.

Figure 6: Recall and precision when varying the FD set

      7 Conclusions

This paper proposes DPIF, which aims to distinguish unintentional data quality problems from potential shilling attacks. The process can be divided into two steps. First, a set of candidates that may be erroneous is detected by data quality rules, and then clustering is used to find potential shills. In future work, we will explore different data quality rules, and will study methods for automatically learning the parameters.

Acknowledgement: This work is supported by the National Natural Science Foundation of China (Nos. 61702220, 61702223, 61871140, 61572153, 61572492, U1636215) and the National Key Research and Development Plan (Grant Nos. 2018YEB1004003, 2018YFB0803504).
