
LIP: A LIfetime and Popularity Based Ranking Approach to Filter out Fake Files in P2P File Sharing Systems

    Qinyuan Feng, Yafei Dai

(Department of Computer Science and Technology, Peking University, Beijing, China) {fqy, dyf}@net.pku.edu.cn

    Abstract

P2P file sharing systems often use incentive policies to encourage sharing. With the decrease of free riders, the amount of cheating behavior has increased. Some users rename a common file with a popular name to attract downloads from other users in order to gain unfair advantages from incentive policies. We call the renamed file a fake file. While techniques have been proposed to combat fake files, an effective approach to filter out fake files in existing systems is lacking, especially before a real file comes out. In this paper, we design two detectors to identify fake files by mining historical logs and find that fake files are indeed pervasive. We introduce a metric called lifetime, which is a file's average retention time in users' computers, and show that it can be used to distinguish between real and fake files. We then propose a lifetime and popularity based ranking approach to filter out fake files. Experiments are designed with the real and fake files collected by the two detectors, and the results show that this approach can reduce the download volume of fake files both before and after a real file comes out.

    1. Introduction

P2P file sharing systems often use incentive policies to encourage sharing. Some incentive policies [1] are based on points, where peers are rewarded points for uploading and spend points for successful downloading, and use service differentiation to reward and punish users for their behaviors. They give downloading preference to users with higher scores and sometimes apply a bandwidth quota to the downloads of users with lower scores.

With these incentive policies, the number of free riders decreases; however, the amount of cheating behavior has increased. Some users rename a common file with a popular name to attract downloads from other users in order to gain unfair advantages from the incentive policies, and we call the renamed file a fake file.

(Supported by NSFC (60673183) and Doctoral Funding of MOE (20060001044).)

We design two detectors to identify fake files by mining historical logs and find that fake files are indeed pervasive. To make things worse, some users even publish fake files before a real file comes out.

Two kinds of reputation mechanisms have been proposed to resolve this problem. One evaluates a user's reputation from his behaviors [2], but a user's reputation sometimes differs from a file's reputation; the other evaluates a file's reputation directly [3], which can reflect the file's essential attributes.

Users tend to retain a real file longer and delete a fake file more quickly. From this phenomenon, we introduce a metric called lifetime, which is a file's average retention time in users' computers, to analyze a file's quality. We show that it can be used to distinguish between real and fake files. Based on it, we propose a lifetime and popularity based ranking approach to filter out fake files.

Experiments are designed with the real and fake files collected by the two detectors, and the results show that this approach can reduce the download volume of fake files both before and after a real file comes out.

The road map of this paper is as follows. In Section 2, we cover the related works. We design two detectors and analyze the time characteristics of files in Section 3. We then propose a lifetime and popularity based ranking approach in Section 4 and design an experiment to evaluate this approach in Section 5. We draw a brief conclusion and identify some directions for future research in Section 6.

    2. Related Works

In the pollution measurement aspect, J. Liang et al. [4] analyzed the severe pollution launched by some companies to protect copyrights in P2P file sharing systems; we analyze the severe pollution launched by some users to gain unfair advantages from incentive policies in P2P file sharing systems, and our approach can also be used in their circumstances. U. Lee et al. [5] analyzed the time interval between download and quality checking; we find the same situation in our work. D. Dumitriu et al. [6] concluded that a user's behaviors, such as willingness to share and removing pollution quickly, have a great impact on the effect of file-targeted attacks; our work improves on these results by quantitative analysis of users' behaviors with lifetime. N. Christin et al. [7] analyzed the differences between pollution and poisoning and their respective impact on content availability in P2P file sharing networks; they define a deliberate injection of decoys to reduce the availability of targeted files as poisoning and an accidental injection of unusable files as pollution. Our work finds that a user's deliberate behavior can also come from cheating incentive policies. J. Liang et al. [8] analyzed index poisoning, and our work is a complement of theirs.

In the pollution identification aspect, D. Dumitriu et al. [6] discussed a random algorithm; J. Liang et al. [8] recommended a distributed blacklist; S. Kamvar et al. [2] proposed the EigenTrust algorithm; K. Walsh et al. [3] proposed an object reputation system; J. Liang et al. [4] proposed an approach that identifies pollution automatically depending on whether a file is decodable and the warp of its duration. But some of these approaches require users to participate actively; some cannot identify fake files before a real file comes out; some need to download all or part of a file; some need a lot of system resources; some can only identify certain types of files. Our approach calculates the retention times of files and filters out fake files automatically, so it bypasses the limitations above.

    3. Analysis of Fake Files

3.1. Term Specification

Title: the common description of a particular content, such as "Movie: The Matrix" and "Music: Yesterday Once More". We use T (Title) to express it.

File: the sharing object in a P2P file sharing system. We use F (File) to express it. A file is composed of two parts:

Name: the name of a file; it should belong to a title. We use N (Name) to express it.

Content hash: the hash value generated by digesting the content of a file with a hash function; it is independent of the name. We use H (Hash) to express it.

Fake file: a file whose name does not match its content hash.

File Owner: a user who has ever owned the file. (A minimal sketch of these terms as data structures is given below.)
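To make these terms concrete, here is a minimal Python sketch of the data model. The class and field names are our own illustration, not part of the paper, and the genuine-hash set passed to is_fake is an oracle for exposition only; a running system does not have it, which is what motivates the detectors in Section 3.2.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class File:
        """F: a shared object, composed of a name and a content hash."""
        name: str           # N: the file's name; it should belong to a title
        content_hash: str   # H: digest of the content, independent of the name

    @dataclass
    class Title:
        """T: the common description of a content, e.g. "Movie: The Matrix"."""
        description: str
        files: set = field(default_factory=set)  # files whose names belong here

    def is_fake(f: File, genuine_hashes: set) -> bool:
        """A fake file's name does not match its content hash: its hash is not
        among the genuine hashes of the title its name claims."""
        return f.content_hash not in genuine_hashes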

Figure 1 gives an explanation of these specifications. In figure A, U1 owns real files F1 and F2, U2 owns real file F1, and N1 belongs to T1 whereas N2 belongs to T2; in figure B, U2 changes F1's name to N3, which belongs to T2, and creates fake file F3; in figure C, U3 downloads F2 from U1 and U4 downloads F3 from U2. We can draw the following conclusions: T1 has real file F1, whose owners are U1 and U2; T2 has real file F2, whose owners are U1 and U3; T2 has fake file F3, whose owners are U2 and U4; content hash H1 corresponds to two names, N1 and N3; content hash H2 corresponds to one name, N2.

Figure 1: Specification sketch

    3.2. Detection of Fake Files

Our historical logs are collected from Maze [9], a large deployed P2P file sharing system with more than 2 million registered users and more than 10,000 users online at any given time. A log server records every downloading action, and each log entry contains the uploading peer-id, the downloading peer-id, the global time, the file's content hash, and the file's name. The logs from Oct 11, 2005 to Aug 11, 2006 are selected. Though our analysis is based on Maze, Maze is similar to many other P2P file sharing systems, so the approach and conclusions are general.

We choose five representative popular titles and use T1, T2, T3, T4, and T5 to express them.

From users' feedback in the Maze forum, we know there are some users who rename a common file with a popular name to attract downloads from other users, so we get the hint to detect fake files from name changes and a file's first publication time.

When a user creates a fake file by renaming a file, the file's content hash comes to correspond to at least two names: one belongs to the original title and the other belongs to the popular title, and the former should appear earlier than the latter. Furthermore, all the files of a title that are published before the title's real file comes out are certain to be fake files.

So, two detectors are designed to detect fake files of a title.

Detector 1: for each file of a title, the detector collects all the names that correspond to the file's content hash; if there exists a name that belongs to another title and appears before this file's first publication time, the detector identifies this file as fake.

Detector 2: for a title, the detector obtains the real file's first publication time from Divx [10] and identifies the files published before this time as fake.
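The following Python sketch illustrates both rules. The input formats (a dictionary of first-publication times and a name-to-title map) are our assumptions, since the paper does not specify data structures.

    def detector1(file_hash, target_title, name_first_seen, title_of_name):
        """Detector 1: flag a file as fake if its content hash also appears
        under a name of another title, and that name was published earlier
        than the file's first publication under the target title.

        name_first_seen: {(content_hash, name): first publication time}
        title_of_name:   {name: title}
        """
        names = [(n, t) for (h, n), t in name_first_seen.items() if h == file_hash]
        first_here = min(t for n, t in names if title_of_name[n] == target_title)
        return any(title_of_name[n] != target_title and t < first_here
                   for n, t in names)

    def detector2(first_publication_time, real_release_time):
        """Detector 2: any file of a title published before the title's real
        file came out (release time obtained from Divx [10]) must be fake."""
        return first_publication_time < real_release_time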


Table 1: Identification results

Title  Real file      Detector 1 only  Detector 2 only  Detector 1&2
T1     827 (51.5%)    53 (3.3%)        432 (26.9%)      294 (18.3%)
T2     409 (83.3%)    12 (2.4%)        21 (4.3%)        49 (10.0%)
T3     291 (91.5%)    8 (2.5%)         2 (0.6%)         17 (5.3%)
T4     411 (68.3%)    42 (7.0%)        49 (8.1%)        100 (16.6%)
T5     151 (74.4%)    52 (25.6%)       0 (0.0%)         0 (0.0%)


When we identify a title, we first collect all the files that belong to the title, then use the two detectors to identify them and mark them as real file (not identified by either detector), detector 1 only (only identified by detector 1), detector 2 only (only identified by detector 2), or detector 1&2 (identified by both detectors). Table 1 shows the results with the percentage of files belonging to each category. The five titles each have their own characteristics: T1 is the most popular title and its fake files are the most pervasive; T2 has a low grade of fake files; T3 has only been published for a short time and has the lowest grade of fake files; T4 has a middle grade of fake files; T5's first publication time is earlier than the start of log collection, so it represents titles with incomplete logs.

Figure 2: The change of cumulative numbers with time (x-axis: time in days; y-axis: number of files; curves: real file, fake file)

Figure 2 shows the change of T1's real and fake files' cumulative numbers with time. The x-axis is the time; the y-axis is the cumulative number of real and fake files that have ever appeared (the figures for the other 4 titles are similar to this one). It can be seen from the figure that fake files can appear both before and after a real file comes out, so an approach that filters out fake files should be effective in both conditions.

From the analysis above, we know that the two simple but efficient detectors can identify a lot of fake files, but there are still some questions. Why is it better to serve a fake file than the real one? Why not download a popular file and serve the real one? Doesn't this give the same benefits as serving a fake file? Why can't users use a completely new file as their fake file? The answer is the same for all these questions: because renaming is the simplest way, and it works. All you need to do is rename a file, and especially if you rename it as a popular movie that has not yet appeared, all the users who want to download this movie will find you. In fact, we have tried some other ways to produce fake files, but they only make things more complex and bring no more benefit.

Even so, we should recognize that the two detectors are not feasible in real systems: we need to prepare a lot to analyze a title, which is why we only choose five titles. Even if they were deployed, users could easily cheat the two detectors. But they do collect the real and fake files from logs for our analysis. With the analysis below, a lifetime and popularity based ranking approach will be designed to filter out fake files automatically. Experiments are also designed with the real and fake files collected by the two detectors to evaluate the effect of the approach.

    3.3. Time Characteristic Analysis

Users tend to retain a real file longer and delete a fake file more quickly. From this phenomenon, we introduce a metric called lifetime, which is a file's average retention time in users' computers, to analyze a file's quality.

Definition of Lifetime and Popularity

If file F has n owners, and each owner's retention time of F is $t_i$ ($1 \le i \le n$), then define

$$L_F = \frac{1}{n} \sum_{i=1}^{n} t_i$$

as file F's lifetime and $n$ as file F's popularity.

In order to calculate a file's lifetime, we check file and user appearance times in the logs. For each file, its owners are gathered first. Then each owner's first and last appearance times with this file are collected, and the difference between the two times is treated as this owner's retention time of this file. For example, if U1 downloads F1 at 10:30 and uploads F1 at 14:00 and 16:30, we calculate U1's retention time of F1 as (16.5 − 10.5) × 60 × 60 = 21600 s. Some owners close their client software after downloading a file or move the file out of the sharing directory, so the retention time calculated by this method will be smaller than the real retention time; but the warp is similar for all files, and since we are mainly comparing the lifetimes of real and fake files, the method is relatively reliable.
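A minimal Python sketch of this computation, assuming each log entry is a (user, content hash, timestamp) tuple, which is our simplification of Maze's log format:

    from collections import defaultdict

    def lifetimes(logs):
        """Compute each file's lifetime: the average, over its owners, of the
        gap between the owner's first and last appearance with the file.
        Returns {file: (lifetime in seconds, popularity)}."""
        first, last = {}, {}
        for user, file_hash, ts in logs:
            key = (user, file_hash)
            first[key] = min(first.get(key, ts), ts)
            last[key] = max(last.get(key, ts), ts)

        retention = defaultdict(list)  # file -> its owners' retention times
        for (user, file_hash), t0 in first.items():
            retention[file_hash].append(last[(user, file_hash)] - t0)

        # L_F = sum(t_i) / n, popularity = n (number of owners)
        return {f: (sum(ts) / len(ts), len(ts)) for f, ts in retention.items()}

    # U1 downloads F1 at 10:30 and uploads it at 14:00 and 16:30 -> 21600 s.
    logs = [("U1", "F1", 10.5 * 3600), ("U1", "F1", 14 * 3600),
            ("U1", "F1", 16.5 * 3600)]
    print(lifetimes(logs))  # {'F1': (21600.0, 1)}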

Figure 3: The change of lifetimes with popularity (x-axis: popularity; y-axis: lifetime in 10,000 s; curves: real files, fake files)

Differentiation and Convergence of Lifetime


Figure 3 shows five real and five fake files belonging to the five titles respectively, and the change of their lifetimes as their popularities increase. We can draw the following conclusions: when the popularity is small, the lifetime fluctuates greatly; when the popularity reaches 100, the lifetime of a fake file converges below 100,000 s, so we can distinguish fake files from real files at that point. Because the lifetime of a file with small popularity does not converge, below we only analyze files whose popularities are larger than 20.

Figure 4: CDF of real and fake files' lifetimes (x-axis: lifetime in 10,000 s; y-axis: percentage of files; curves: real files, fake files)

Distribution of Lifetime

Figure 4 shows T1's CDF of real and fake files' lifetimes. The x-axis is the lifetime; the y-axis is the percentage of real and fake files whose lifetimes are within x (the figures for the other 4 titles are similar to this one). We can conclude from the figure that real files' lifetimes are holistically higher than fake files', which means that, compared to fake files, there are more real files with high lifetimes.

4. Lifetime and Popularity Based Ranking Approach

We describe this approach together with some implementation details in a DHT.

Figure 5: Classification of files (x-axis: popularity; y-axis: file lifetime in 10,000 s; the plane is divided into zones 1-4 by the thresholds)

Lifetime and Popularity Based Ranking Approach

From the analysis above, we know lifetime can be used to distinguish between real and fake files. But when the popularity of a file has not reached a certain threshold, its lifetime does not converge, so we should combine popularity with lifetime. In figure 5, files are divided into four zones by their lifetimes and popularities. The files in zone 1, whose popularities are large and lifetimes are high, are more likely to be real files; the files in zone 4, whose popularities are large but lifetimes are low, are more likely to be fake files; the files in zones 2 and 3 should be judged later.

So we design a lifetime and popularity based ranking approach (LIP): it divides files into four zones with lifetime and popularity thresholds, filters out the files in zone 4, and recommends that the user select the file with the highest lifetime in zones 1, 2, and 3. The settings of the lifetime and popularity thresholds are discussed in the next section.
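Below is a hedged Python sketch of LIP's filter-and-rank step. The record format is our own, and the threshold constants anticipate the values chosen on the training set in Section 5.1.

    POPULARITY_THRESHOLD = 80    # chosen on the training set in Section 5.1
    LIFETIME_THRESHOLD = 80_000  # seconds; about one day

    def lip_rank(files):
        """files: iterable of (file_id, lifetime in seconds, popularity).
        Filter out zone-4 files (popular but short-lived, likely fake) and
        recommend the remaining file with the highest lifetime."""
        kept = [(fid, life, pop) for fid, life, pop in files
                if not (pop >= POPULARITY_THRESHOLD and life < LIFETIME_THRESHOLD)]
        return max(kept, key=lambda rec: rec[1]) if kept else None

    # The popular short-lived file is filtered out; the longest-lived wins.
    print(lip_rank([("F_fake", 50_000, 200), ("F_real", 120_000, 150),
                    ("F_new", 90_000, 10)]))  # ('F_real', 120000, 150)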

Information Collection

This approach only needs the retention times of files. In a DHT, the publication node periodically publishes its sharing information, such as a file's name and content hash, to an index node. When the index node first receives a node's sharing information, it records not only the sharing information but also the publication time. When the index node receives the same sharing information from the same node again, it updates this user's retention time for this file. For example, if U1 publishes the information of F1 to U2 at 13:00 and again at 15:00, U2 will calculate U1's retention time of F1 as (15 − 13) × 60 × 60 = 7200 s.
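A minimal sketch of the bookkeeping an index node could use; the class and method names are our assumptions, and only the rule (record the first publication time, update retention on republication) comes from the text.

    class IndexNode:
        """Tracks, per (user, file), the first publication time and the
        retention time implied by the latest republication."""
        def __init__(self):
            self.first_seen = {}  # (user, file_hash) -> first publication (s)
            self.retention = {}   # (user, file_hash) -> latest retention (s)

        def on_publish(self, user, file_hash, now):
            key = (user, file_hash)
            if key not in self.first_seen:
                self.first_seen[key] = now  # first publication: record time
            else:
                self.retention[key] = now - self.first_seen[key]  # update

    # U1 publishes F1 at 13:00 and again at 15:00 -> retention time 7200 s.
    node = IndexNode()
    node.on_publish("U1", "F1", 13 * 3600)
    node.on_publish("U1", "F1", 15 * 3600)
    print(node.retention[("U1", "F1")])  # 7200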

Information Processing

When calculating a file's lifetime, the index node simply sums up all the owners' retention times and divides by the file's popularity. In a DHT, each index node records all the retention times within its responsible area, so it can calculate the popularity and lifetime by itself.

Information Feedback

The index node filters out fake files with the lifetime and popularity thresholds, and sends files' lifetimes along with search results to users. In a DHT, each index node can do this by itself.

    5. Experiment and Results

In this section, we first describe the experiment, then use a training set to choose the lifetime and popularity thresholds, and finally evaluate the effect of LIP.

In the experiment, we simulate a sequence of downloading actions for a title. For each downloading action, we first find all the files that still have replicas and calculate their numbers of replicas, popularities, and lifetimes. Then we use an approach to choose a file from them and calculate a retention time for this downloaded file.

    We can collect the following contents from the logs.

File set for a title: all the files of a title compose the file set for this title. With the two detectors in Section 3.2, we also label each file as fake or real.

Retention time set for a file: all the owners' retention times of a file compose the retention time set for this file.


Downloading action set for a title: all the downloading actions of a title compose the downloading action set for this title.

    Figure 6: Log collection sketch

Figure 6 shows an explanation of how to collect these contents from the logs of a title. The x-axis is the time, and each unit means one second.¹ Each rectangular interval is a user's retention time; the user, file, first appearance time, last appearance time, and retention time are written in each interval. We can see from the figure that F1 has two owners, U1 and U5, whose retention times of F1 are 12 and 7 respectively; these two retention times compose the retention time set for F1. F2 has four owners, U2, U3, U4, and U6, whose retention times of F2 are 4, 8, 9, and 6 respectively; these four retention times compose the retention time set for F2. The six users' downloading actions are (U1, 1), (U2, 2), (U3, 4), (U4, 6), (U5, 7), (U6, 9); they compose the downloading action set for this title.

¹ This is only a sketch, so the time begins from 1 and the range is very small. In a real situation, it is the global time of the log server and the range is very large.

    We should also simulate the following contents.

File's retention time: in the simulation, when a user downloads a file, we randomly select a retention time from the retention time set for this file as this user's retention time of this file.

Five comparative approaches:

1. Replica priority (RP): select the file with the maximum number of replicas.

2. Lifetime priority (LP): select the file with the highest lifetime.

3. Select randomly (SR).

4. Lifetime and popularity based ranking approach (LIP): select the file with the highest lifetime after filtering.

5. Actual state (AS): the actual state recorded in the logs.

Figure 7 shows a concrete example of the simulation with a title. It has simulated U1 to U5's downloading actions at times 1, 2, 3, 4, and 8 respectively, and is going to simulate U6's downloading action at time 10. It can be seen from the figure that F2's last replica disappeared at time 7, and only F1 and F3 still have replicas. Their numbers of replicas are 2 and 1, their popularities are 2 and 2, and their lifetimes are ((10 − 1) + (10 − 8)) / 2 = 5.5 and ((10 − 3) + (9 − 4)) / 2 = 6. If we use RP, we choose F1 and randomly select a retention time from the retention time set for F1 as U6's retention time of F1; if we use LP, we choose F3 and randomly select a retention time from the retention time set for F3.

Figure 7: Simulation sketch
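The following Python sketch reproduces this simulation step and the Figure 7 numbers; the data layout and function names are our own illustration.

    def simulate_step(now, downloads, deletions, policy):
        """downloads: {file: [(user, download time)]};
        deletions: {(user, file): deletion time}.
        At virtual time `now`, compute each file's live replicas, popularity
        and lifetime, then choose a file by replica priority (RP) or lifetime
        priority (LP)."""
        stats = {}
        for f, owners in downloads.items():
            ends = [deletions.get((u, f), now) for u, _ in owners]
            replicas = sum(1 for e in ends if e >= now)  # copies still shared
            lifetime = sum(e - t for (_, t), e in zip(owners, ends)) / len(owners)
            if replicas > 0:
                stats[f] = (replicas, len(owners), lifetime)
        key = (lambda f: stats[f][0]) if policy == "RP" else (lambda f: stats[f][2])
        return max(stats, key=key)

    # Figure 7 at time 10: F2 has no replicas; F1's lifetime is 5.5, F3's is 6.
    downloads = {"F1": [("U1", 1), ("U5", 8)], "F2": [("U2", 2)],
                 "F3": [("U3", 3), ("U4", 4)]}
    deletions = {("U2", "F2"): 7, ("U4", "F3"): 9}
    print(simulate_step(10, downloads, deletions, "RP"))  # F1 (2 replicas)
    print(simulate_step(10, downloads, deletions, "LP"))  # F3 (lifetime 6.0)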

    5.1. Training Set

LIP needs two thresholds: a lifetime threshold and a popularity threshold. Different thresholds have different effects. If the popularity threshold is too high, a fake file can only be identified after many downloads. If the popularity threshold is too low, the lifetime of a file may not have converged, so we may identify a real file as fake. If the lifetime threshold is too high, we may also identify a real file as fake. If the lifetime threshold is too low, we may identify a fake file as real.

Our approach should reduce the download volume of fake files both before and after a real file comes out. We can see from figure 2 that T1 has many fake files both before and after its real file comes out, so we use T1 as the training set to choose the lifetime and popularity thresholds.

From figure 3, we know a fake file's lifetime converges below 100,000 s when its popularity reaches 100, so we choose the popularity and lifetime thresholds near 100 and 100,000 respectively. We simulate with different threshold settings, and from the perspectives of reducing the download volume of fake files and guaranteeing the download volume of real files, we finally choose a popularity threshold of 80 and a lifetime threshold of 80,000 s. It is interesting to see that 80,000 seconds is about one day, which means most real files' lifetimes are longer than one day and most fake files' lifetimes are shorter than one day.
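A hedged sketch of the threshold selection implied here; the grid values and the scoring rule (fake downloads blocked minus real downloads wrongly blocked) are our own simplification standing in for the paper's full simulation.

    import itertools

    def sweep(files, pop_grid=(60, 80, 100, 120),
              life_grid=(60_000, 80_000, 100_000)):
        """files: list of (is_fake, lifetime in seconds, popularity, downloads).
        Score each threshold pair by the fake downloads it blocks minus the
        real downloads it wrongly blocks, and return the best pair."""
        best, best_score = None, float("-inf")
        for pop_t, life_t in itertools.product(pop_grid, life_grid):
            blocked = [(fake, d) for fake, life, pop, d in files
                       if pop >= pop_t and life < life_t]  # zone-4 files
            score = (sum(d for fake, d in blocked if fake)
                     - sum(d for fake, d in blocked if not fake))
            if score > best_score:
                best, best_score = (pop_t, life_t), score
        return best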

    5.2. Testing Set

We use T2, T3, T4, and T5 as testing sets to evaluate the effect of LIP.

Figure 8 gives the comparison of the download volume of fake files between AS and LIP. We can conclude that LIP significantly reduces the download volume of fake files most of the time.

Figure 8: Comparison of the download volume of fake files (x-axis: title T2-T5; y-axis: download volume in thousands; bars: AS, LIP)

When we use other titles in our experiments, there are also some instances where the download volume of real files decreases or the download volume of fake files increases. Three factors may influence the effect of LIP.

1. The quality of fake files. If a fake file's quality is better than the real file's, which means the original movie behind it is more attractive, users may retain the fake file longer than the real file.

2. The percentage of low-quality files among real files, because LIP may filter out some real files of low quality.

3. The real file's first publication period. Before a real file's lifetime converges, LIP cannot confirm whether it is real, so it may filter out some doubtful real files.

    5.3. Comparison of Different Approaches

We continue to use T1 to compare the five different approaches and analyze their effect on the download volume of fake files both before and after a real file comes out. From the downloading action set for T1, we know T1's real file first comes out at the 5837th downloading action, which means that during the previous 5836 downloading actions, only fake files are downloaded.

Figure 9: Comparison of five different approaches (x-axis: number of downloading actions in thousands; y-axis: download volume of fake files in thousands; curves: RP, LP, LIP, SR, AS)

Figure 9 shows the download volume of fake files with the five different approaches. The x-axis is the number of downloading actions that have happened, and the y-axis is the download volume of fake files. It can be concluded from the figure that RP is the worst, and so is SR; LP can significantly reduce the download volume of fake files after a real file comes out, but only LIP can reduce the download volume of fake files both before and after a real file comes out.

    6. Conclusion and Future Work

Some users create fake files to gain unfair advantages from incentive policies, which can make fake files pervasive. Two detectors are designed to collect data from logs for our analysis and experiments. We show that lifetime can be used to distinguish between real and fake files, and propose a lifetime and popularity based ranking approach to rank the reputation of files and filter out fake files in P2P file sharing systems. The experimental results show that this approach can reduce the download volume of fake files both before and after a real file comes out.

There are many aspects in which LIP can be improved, such as security considerations, the trust of feedback, and how to choose the lifetime and popularity thresholds.

LIP is not only a binary value to identify real and fake files; it can also reflect a file's quality. We will realize this and see how it works in a real situation.

    References

[1] Q. Lian, Y. Peng, M. Yang, Z. Zhang, Y. Dai, and X. Li, "Robust Incentives via Multilevel Tit-for-Tat," in Proc. of IPTPS, Santa Barbara, CA, 2005.

[2] S. Kamvar, M. Schlosser, and H. Garcia-Molina, "The EigenTrust Algorithm for Reputation Management in P2P Networks," in Proc. of ACM WWW '03, pages 640-651, 2003.

[3] K. Walsh and E. G. Sirer, "Experience with an Object Reputation System for Peer-to-Peer Filesharing," in Proc. of USENIX NSDI, 2006.

[4] J. Liang, R. Kumar, Y. Xi, and K. Ross, "Pollution in P2P File Sharing Systems," in Proc. of IEEE INFOCOM '05, Miami, FL, 2005.

[5] U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, and M. Gerla, "Understanding Pollution Dynamics in P2P File Sharing," in Proc. of IPTPS, Santa Barbara, CA, 2006.

[6] D. Dumitriu, E. Knightly, A. Kuzmanovic, I. Stoica, and W. Zwaenepoel, "Denial-of-Service Resilience in Peer-to-Peer File Sharing Systems," in Proc. of ACM SIGMETRICS '05, Banff, AB, Canada, 2005.

[7] N. Christin, A. S. Weigend, and J. Chuang, "Content Availability, Pollution and Poisoning in Peer-to-Peer File Sharing Networks," in Proc. of the Sixth ACM Conference on Electronic Commerce (EC '05), Vancouver, BC, Canada, 2005.

[8] J. Liang, N. Naoumov, and K. W. Ross, "The Index Poisoning Attack in P2P File-Sharing Systems," in Proc. of IEEE INFOCOM 2006, Barcelona, Spain, 2006.

[9] M. Yang, B. Zhao, Y. Dai, and Z. Zhang, "Deployment of a Large-Scale Peer-to-Peer Social Network," in Proc. of the 1st Workshop on Real, Large Distributed Systems, San Francisco, CA, USA, 2004.

[10] Divx Database, http://divx.thu.cn, 2006.
