Graffiti Networks: A Subversive, Internet-Scale File Sharing Model

The proliferation of peer-to-peer (P2P) file sharing protocols is due to their efficient and scalable methods for data dissemination to numerous users. But many of these networks have no provisions to provide users with long term access to files afte…

Authors: Andrew Pavlo, Ning Shi

Andrew Pavlo, Brown University, pavlo@cs.brown.edu
Ning Shi, Brown University, ning@cs.brown.edu

Abstract

The proliferation of peer-to-peer (P2P) file sharing protocols is due to their efficient and scalable methods for data dissemination to numerous users. But many of these networks have no provisions to provide users with long term access to files after the initial interest has diminished, nor are they able to guarantee protection for users from malicious clients that wish to implicate them in incriminating activities. As such, users may turn to supplementary measures for storing and transferring data in P2P systems. We present a new file sharing paradigm, called a Graffiti Network, which allows peers to harness the potentially unlimited storage of the Internet as a third-party intermediary. Our key contributions in this paper are (1) an overview of a distributed system based on this new threat model and (2) a measurement of its viability through a one-year deployment study using a popular web-publishing platform. The results of this experiment motivate a discussion about the challenges of mitigating this type of file sharing in a hostile network environment and how web site operators can protect their resources.

1 Introduction

In just a few years since its inception, the BitTorrent protocol and similar systems have become the predominant P2P file sharing model [11]. But the recent activities of those seeking to take down P2P infrastructures have forced the file sharing community to adapt to a hostile environment [15].
Operators of global BitTorrent trackers now take two notable measures in order to indemnify themselves from legal action: (1) the trackers are located in countries that are not party to international copyright treaties, and (2) access to trackers is controlled by private, invite-only communities with strict membership requirements [9]. The former allows operators to ignore legal threats to shut down their services that a law-abiding ISP would normally have to comply with. But this approach can be both prohibitively expensive and difficult to arrange. Additionally, limiting access to privileged users only temporarily protects a site that has been made private; it takes only a single seditious user to undermine the network and provide damaging evidence to the right parties.

The aforementioned measures may protect tracker operators but they provide little protection to the average file sharing user. This is because the fundamental principle of the BitTorrent protocol is that users download and upload data directly with other untrusted users, rather than download from a single, central source [11]. Although some P2P clients employ communication encryption and protocol obfuscation enhancements, such measures do not protect a user from malicious clients that harvest file sharing activity information for future litigation. Furthermore, it has been shown that while it may not be possible to easily view encrypted packet contents, a third-party observer can still deduce that file sharing is occurring by identifying network pairs based on a tracker’s public peer list [7, 15].

Another limitation of current BitTorrent-like models is that the networks rely on altruistic users to keep files available for others.
This is problematic in an environment where users want to limit their exposure to any traffic logging clients, and thus it is in their interest to disconnect immediately once they have successfully downloaded their desired files. Content in these networks is unavailable once all of the peers that have the complete file depart. Newly arriving clients may be able to download and share some fraction of the data (if any is available), but they must wait and hope that a client returns to the network with the rest of the file. Enhancements to private trackers, such as upload/download ratios, provide incentives for clients to continue to seed files [9], but these economic models are difficult to initiate and do little to maintain less popular older files.

In response to the lack of user anonymity and long-term data persistence in existing P2P systems, some users may seek an alternative. But because traditional data hosting solutions are not a viable option for sharing certain content that may have legal consequences, these users must use more questionable means for sharing data. Motivated by this, we developed the Graffiti Network distributed file sharing protocol that uses multiple third-party storage sites as a data replication and transfer medium between clients. The Graffiti approach is to use publicly available web sites to store multiple copies of shared content. We use the term graffiti for our work since we are storing data in a way that non-network participants may regard as unsightly or unwanted vandalism.
Our approach presents several new security challenges over other existing P2P systems where clients transmit data directly with each other: (1) a newly arriving peer can still download files even if all other peers have long disconnected, (2) a peer does not need to know about the existence of other peers, and (3) a tracker does not need multiple peers to enforce tit-for-tat policies [11].

The layout of this paper is as follows. First, we provide an overview of the Graffiti Network file sharing model. We then discuss our experimental prototype of the Graffiti Network model that is integrated with a BitTorrent system. The results from our one-year study on the efficacy of our prototype in a real-world deployment show that the use of public storage sites in a file sharing system is possible. We then conclude with a discussion about how both administrators and software developers can guard against such a threat.

2 Related Work

We motivate our work by first discussing the related background research and literature.

2.1 BitTorrent

The BitTorrent protocol defines the operations of a P2P network that facilitates the efficient sharing of files in a distributed manner [11]. Our model inherits many of the features of BitTorrent, but employs third-party storage sites as an intermediary for data transfers, rather than allowing clients to directly download files from each other. This indirection makes it difficult to discover the identities of users that are participating in a Graffiti Network.

The overall efficiency and throughput of BitTorrent systems have been shown to scale gracefully to accommodate many users arriving at the same time to download new and popular files [27]. But while the model works well in the short term, it does not ensure the long term availability of esoteric content or files that become less popular over time.
This problem is especially prevalent for content that is released in “episodes”: new content is shared profusely when it is released, but the number of peers decreases as the file becomes older and newer episodes are released. In a five month study of BitTorrent network activity, it was shown that the average time that a client stays in the network to continue sharing a file after it has received the entire file set was only seven hours [19]. These results, however, are based on the sharing activity of copyright-free files, and therefore the clients do not have a vested interest in disconnecting immediately. In contrast, a study [25] explicitly focused on illegal file sharing activity showed that the departure rate of peers is much faster than previously assumed in [27]. The results in [16] show that the average availability of a torrent is less than nine days and that most swarms completely die out in only 13 days. Thus, without the incentives for sharing found in private communities [9], most BitTorrent content becomes unavailable after just a short amount of time.

To overcome the capricious nature of users, Graffiti Networks use storage sites that have the potential to always be available, and thus the shared files are still accessible after the initial interest in the content has subsided. With enough replication, enforced by a strict asynchronous tit-for-tat model, we believe that a Graffiti Network could provide clients with access to files months or years after they were first introduced to the Internet.

2.2 Peer-to-Peer Storage Systems

Much of the previous work on developing P2P storage systems that provide block storage across multiple nodes is based on distributed hash tables [12, 22, 29].
These approaches have the same deficiencies as the BitTorrent model: peers download file blocks directly from other peers, thereby losing anonymity, and the systems do not provide mechanisms to provide long term availability for less popular files after peers disconnect from the network.

Other systems are focused on providing anonymous and secure P2P data storage [32]. The POTSHARDS system provides secure long-term data storage when the content originator no longer exists using secret splitting and data reconstruction techniques to handle partial losses [30]; their approach assumes multiple, semi-reliable storage backends that are willing to host a client’s data. The Freenet anonymous storage system uses key-based routing to locate files stored on remote peers [10]. As discussed in [12], Freenet’s anonymity limits both its reliability and performance: files are not associated with any predictable server, and thus unpopular content may disappear since no one is responsible for maintaining replicas.

2.3 Steganographic Storage Systems

Although the Graffiti Network model is not a pure steganographic-based storage system, it does share similar properties of this class of systems [18, 17]. The Mnemosyne storage service applies the steganography techniques from a local storage system [8] to a distributed hash table [17]. The StegVault proposal uses secret sharing to build a secure P2P storage system on top of reliable multicast [18]. One key benefit of these systems is that users have plausible deniability of the existence of hidden data because it is concealed inside covering data [6].
2.4 Alternative Storage Sites

Since the Graffiti Network model relies on gaining access to and the circumvention of third-party storage sites to host content, we consider the alternative approach of using dedicated storage services that are explicitly designed for the storage and transfer of large files. The Amazon Simple Storage Service provides a well-defined API for writing arbitrary data files, but it currently charges for both the storage space an account uses as well as the network bandwidth used to transfer data [1]. The Gmail Filesystem enables Google email accounts to be used as a network storage medium, but adopting this approach would require users to share account information [20]. The Usenet news service is another potential storage system, but servers often impose a message retention time and many ISPs have discontinued providing this service to customers for free.

[Figure 1 diagram: a Graffiti client, a Graffiti tracker, a fileset, and storage sites (one wiki site and two message boards), with numbered flows 1 to 4 labeled piece information, storage site list, encrypted payload, and HTML result.]

Figure 1: For a given fileset, the client communicates with the tracker in the following manner: (1) the client sends the tracker the list of pieces it already has; (2) the tracker responds with a list of instructions on where the client should download a sub-piece and the location of where to upload a replica; (3) after downloading the new sub-piece, the client then navigates to the target storage site and uploads a new encrypted and encoded sub-piece payload; (4) the storage site returns an HTML page and the client verifies that the upload was successful. This process repeats until the client has all the pieces of the fileset and has produced enough replicas for the tracker.

Free web-based file-hosting sites also do not provide the
robustness that we seek in our file sharing model [4]. One limitation of these sites is that large files are broken into separate downloads and users must wait for some time period before they are allowed to retrieve the next piece. Furthermore, the user must manually enter each segment URL into their browser and repeatedly pass human-validation tests [24]. These free hosting sites are also under scrutiny because many of their users post illegal content, and thus the site operators streamline the removal process for files and the disclosure of offending users’ information for copyright holders in order to quickly defuse any legal action that may disrupt the hosting site’s revenue stream. Despite this, it is possible to include file-hosting sites as just one of the many options available in a Graffiti Network deployment (see Section 3.3).

Lastly, another proposed solution is to create a highly-volatile storage site by sending data packets to unsuspecting network entities to leverage network latency as a type of durability [26]. The idea is to continuously send data to targets that relay the same data back to the source, so that two copies of the data are always theoretically available. This approach is not practical for the Graffiti Network model because it does not allow the data to be shared amongst multiple peers. Furthermore, it requires that the original data source remain online in order to keep cycling the packets back out over the wire.

3 Graffiti Network Model

We now describe how a file-sharing system based on the Graffiti Network model would operate. We discuss various measures and techniques that ensure the system is stable, usable, and scalable. Such qualities are necessary to facilitate wide-spread adoption by file-sharing participants, thereby making the threat a real possibility.
To describe the Graffiti model, we adopt the terminology of the BitTorrent protocol [11]. We define a fileset as a set of one or more files that peers wish to share. The fileset’s data is divided into multiple fixed-length pieces of n bytes (the last piece can contain fewer than n bytes), which are numbered sequentially. Each piece is divided further into fixed-length sub-pieces. A Graffiti Network that is deployed to distribute these pieces is comprised of three distinct components: (1) a tracker coordinates the replication and sharing procedures of a fileset, (2) a client downloads and replicates the fileset data managed by the tracker, and (3) third-party storage sites store and provide access to fileset data for peers. Any client that wishes to download and reconstruct the original fileset is required by the tracker to produce multiple sub-piece replicas on as many storage sites as possible.

A high-level overview of the Graffiti Network protocol is shown in Figure 1. To connect to the Graffiti Network, the client first announces itself to the tracker and provides it with a list of all the pieces that the peer has already downloaded. The tracker responds with a series of sub-piece request pairs for a new piece that the client is missing. Each request pair consists of (1) a download location where the peer can retrieve a sub-piece and (2) instructions to produce a new replica on a different storage site for the data it just downloaded. Graffiti trackers follow a strict tit-for-tat protocol: for each sub-piece that a peer downloads, that peer is required to generate a replica for a previously downloaded sub-piece on a different storage site and send the location of this new replica back to the tracker before it can receive the next piece.

3.1 Central Tracker

The tracker provides a directory service for peers to retrieve a fileset.
For each piece of data in a fileset, the tracker maintains a table of the sub-piece replica locations on sites that were generated by clients. Each replica is annotated with three pieces of meta-data: (1) a unique encryption key for that replica, (2) a checksum for each sub-piece, and (3) the first and last byte sequences of the encrypted data block on the storage site. The tracker uses a different encryption key per entry to ensure that each replica is stored as a unique character sequence to prevent the use of tools to discover other replicas. The checksum and sequence markers also allow peers to determine whether a replica has the proper byte sequence and to locate data boundaries at the storage site location.

For each connected peer, the tracker maintains an active piece set (APS) of download/upload replica pairs that are unfulfilled requests for a client. Each pair consists of a sub-piece identifier that the tracker provided for the client to download and a storage site location where the tracker instructed the client to make a new replica. Once the client provides the tracker with information about a new replica for a download/upload pair, the entry is removed from that client’s APS and the client is allowed to receive new information. The size of the APS is determined by the tracker’s administrator and prevents a client from downloading too many sub-pieces without producing any new replicas. As in the BitTorrent protocol, the Graffiti tracker strives for uniform availability of all data pieces [11]. Since the tracker decrees what pieces the clients must replicate for each request in the APS, it can decide to replicate the “rarest” pieces first.

Malicious clients in Graffiti Networks are quite different from malicious clients in BitTorrent networks [23].
A rogue Graffiti client may have other ulterior goals: (1) to discover all of the storage site locations used by a tracker in order to contact site administrators and have the replica data removed or (2) to falsely identify valid storage sites and replica locations as invalid in an attempt to disrupt operations. In the first case of trying to discover all of a fileset’s replicas, the tracker can use throttling measures to prevent a client from learning too much in a short amount of time. But for the latter problem, the tracker should not actively check whether a client actually uploaded the data at the location it claims it did, due to security and economic reasons. Instead it can employ proxies or other third-party entities to determine whether a client is behaving properly. For example, the tracker can retrieve a page through the Coral Cache or Tor services to determine if the data was stored at the location claimed by a client [14, 13].

3.2 Client

A Graffiti client allows a user to automatically download a fileset stored on one or more storage sites. A user must first obtain a metadata file for a specific fileset uniquely identified by an “info hash” in order to begin downloading [11]. After the client first announces itself to the tracker at the address listed in the metadata file, the tracker places the peer in an “initialization” mode. This is always done regardless of whether the client is connecting for the first time or if it is returning with some pieces already downloaded. The tracker sends every new client the same initial piece set (IPS) that it will use for the first phase of downloading and replication. This initial set is the same for all clients arriving within a certain time period to prevent a client from initiating multiple new connections without ever creating new replicas.
The size of the initial set is the same as that of the APS, and its contents are changed to a different random set of sub-pieces at regular intervals (e.g., hours or days, rather than minutes). Thus, it is possible for a rogue client to retrieve a complete fileset without ever producing a new replica for the network, but it would take several days or weeks to cycle through all of the tracker’s IPS combinations if there were a significantly large number of pieces.

The client is also required to produce two new replicas for each sub-piece in the IPS, even if the client has already downloaded the pieces previously. This policy is akin to a new tenant paying “last month’s rent” before moving into an apartment: it ensures that a client cannot disconnect from the network without creating new replicas for each piece that it downloads.

Once the client successfully downloads and generates sufficient replicas for its IPS, it leaves the initialization phase and is then allowed to receive arbitrary pieces. The protocol works the same as before: the tracker maintains an APS for each client and only gives new download locations once that particular client has produced a new replica on a storage site.

3.3 Storage Sites

A potential Graffiti storage site is any accessible network entity that allows for data to be stored and retrieved using a known network protocol. In practice, peers will likely use publicly available web sites that provide services that Graffiti clients repurpose to store arbitrary blocks of data. This approach has the distinction that all data movement appears as normal HTTP traffic, and thus is immune to current ISP throttling and tracking techniques [15].

The ideal storage site for a Graffiti Network is one that allows for anyone to post data without CAPTCHA protections [24] and is either unmoderated or has long been abandoned by its owner.
A popular and high-traffic wiki site, for example, would not be a good storage site candidate as it is likely that non-malicious visitors would quickly notice the changes made by Graffiti clients to store replica data. With the rise of many open-source web-publishing platforms, there are many potential targets that allow for anonymous or semi-anonymous data posting. Notable examples include paste-bins, wiki sites, message boards, and blogs. An HTML-based storage site also allows the data to be disseminated to peers through disparate channels once it is online, such as through Coral Cache [14] or Tor [13]. The data embedded in the site’s pages could also be picked up by search engine caching and archiving services for longer-term storage.

Other potential storage sites include any photo and file hosting sites that allow for automated data uploading. In the case of the former, the data could also be hidden inside of image files using well-known techniques [21, 28]. As the Internet evolves, new targets will emerge that can be incorporated into existing networks. The system could also allow clients to use storage sites that are password protected for writing data, but where an account is not required to read back the data. This obviates the need for a client to send the tracker account information, which could then be used improperly by other clients to tamper with or destroy the data.

Using involuntary web sites as storage dumps seems counterintuitive if the main goal of the network is data persistence and availability, since replicas are promptly removed when site administrators and moderators discover them. The Graffiti model overcomes this challenge and takes advantage of “free network storage” through a massive replication and obfuscation process.
It is not trivial, however, to automatically store arbitrary data on random web sites nor is it trivial to discover which sites are available with the properties stated above. The prevalence of popular web publishing software means that one only needs to target a small number of platforms in order to circumvent a large portion of the Internet. Furthermore, many sites, such as wikis and message boards, often display the network location of the user responsible for adding new content or making changes to their pages, which makes it difficult to deny responsibility for participating in illegal activities. We argue that by fracturing a fileset’s replicas across hundreds of storage sites, it is difficult to be fully implicated when only a fraction of the evidence is available. A distributed effort to probe web sites and uncover open storage paths could allow peers to draw on a nearly limitless pool of available storage.

4 Experimental Deployment

To determine whether the Graffiti Network model is viable, and thus a potential threat, we implemented a prototype Graffiti tracker and client as an extension to the BitTorrent protocol. We then stored a sample data set on a large number of open sites and measured the availability of our data for almost an entire year.

We built our system on top of the open-source libtorrent [5] BitTorrent library in order to allow clients to participate in torrent swarms concurrently with Graffiti Network activities. When enough peers are available, the client operates strictly in BitTorrent mode. But if the number of distributed copies in the swarm drops below a threshold, the client begins to contact the tracker using the Graffiti protocol in conjunction with its BitTorrent operations. As new pieces are retrieved from storage sites, they are passed to libtorrent’s storage manager for seeding to other peers.
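The tracker exchange that the client falls back to follows the active piece set (APS) rule from Section 3.1. The following is a minimal sketch of that bookkeeping, not the prototype's implementation: the names `ToyTracker`, `request_pair`, and `report_replica`, the in-memory dictionaries, and the site strings are all hypothetical, and as a simplification the client replicates the same sub-piece it was just told to download.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    """Illustrative in-memory tracker (hypothetical, not the paper's code)."""

    aps_limit: int   # maximum unfulfilled pairs per client, set by the administrator
    replicas: dict   # sub-piece id -> list of known replica locations
    aps: dict = field(default_factory=dict)  # client id -> {sub-piece id: target site}

    def request_pair(self, client, have, target_site):
        """Hand out one download/upload pair, or None if the client must replicate first."""
        pending = self.aps.setdefault(client, {})
        if len(pending) >= self.aps_limit:
            return None  # APS is full: no new information until a replica is reported
        missing = [s for s, urls in self.replicas.items()
                   if urls and s not in have and s not in pending]
        if not missing:
            return None
        # Serve the sub-piece with the fewest known replicas first ("rarest first").
        rarest = min(missing, key=lambda s: len(self.replicas[s]))
        pending[rarest] = target_site
        return rarest, self.replicas[rarest][0], target_site

    def report_replica(self, client, sub_piece, location):
        """Record a new replica and clear the matching APS entry."""
        if self.aps.get(client, {}).pop(sub_piece, None) is None:
            return False  # no outstanding pair for this client and sub-piece
        self.replicas[sub_piece].append(location)
        return True


# Hypothetical usage: sub-piece 0 has one replica, sub-piece 1 has two.
tracker = ToyTracker(aps_limit=2,
                     replicas={0: ["site-a/p0"], 1: ["site-b/p1", "site-c/p1"]})
first = tracker.request_pair("client-1", have=set(), target_site="site-d")
# first == (0, "site-a/p0", "site-d") because sub-piece 0 is the rarest
```

Keeping the limit in the tracker rather than in the client is what lets a single tracker enforce tit-for-tat without any other peers being online.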
4.1 Storage Site Discovery

In our experimental prototype, we target the open source MediaWiki [3] platform as the potential storage site for the network. Due to the popularity of sites like Wikipedia that use MediaWiki, we believe that it is the most widely deployed wiki platform with a large number of less-experienced users that install the software without changing the permissive default settings. Another key characteristic is that the MediaWiki platform maintains a complete revision log for each article, which allows Graffiti peers to retrieve data even if the changes are reversed or the content is altered.

Table 1: The categories of protection used by the MediaWiki sites discovered during the collection process and the sites used in the experimental deployment.

Category                  Sites Found   Sites Used
Anonymous Edits                 8,483        3,161
Registration Protected          5,983        2,347
Puzzle Protected                1,157          138
CAPTCHA Protected               1,586            -
Not Publicly Modifiable         5,946            -
Total                          23,156        5,646

We decided to test our system on open MediaWiki sites that we do not have control over as this allows us to best measure whether our assumptions about how long the data will remain on the sites are correct. We developed a distributed web crawler to discover MediaWiki installations through search engines using keywords that are uniquely indicative of a newly installed site. The crawler purposely ignored well-known MediaWiki sites (e.g., those sites that are part of the Wikimedia Foundation) and the commercialized versions of MediaWiki (e.g., Wikia). For each site that the crawler found, we probed it to determine what kind of protection scheme it utilizes and the last time that it was updated (see Table 1). Of the 23,156 unique MediaWiki installations that we found, 8,483 sites allowed for anonymous editing and 5,983 allowed users to register accounts without CAPTCHA or email protections in order to make edits [24].
The default MediaWiki installation provides a primitive arithmetic “puzzle” protection countermeasure that we found in use on 1,157 sites; this puzzle is easily broken with just a few lines of code, and thus did not prevent our system from storing data on these sites. Lastly, in order to minimize the impact of our experiments, we only targeted those sites that had not been updated within the last three months, thereby reducing our list to 5,646 sites; lowering the threshold to two months would have yielded a total of 11,987 potential storage sites.

The Graffiti client stores data on MediaWiki sites as base64-encoded, Blowfish-encrypted blocks of text that are written in a new article titled with a random word from the dictionary. A more resilient approach would be to modify a popular page on a given site, and then immediately reverse the changes and mark the revision as vandalism. This has two significant implications compared to writing data to a newly created article. Foremost is that removing this data completely from the page’s history requires administrators to delete the entire page from the database and restore the latest revision by hand, thereby losing all the previous legitimate revisions. Second, such an attack is more likely to be overlooked by a site’s operators since they may only care whether the changes were reversed. We deemed this technique too malevolent for the purpose of our experiments, and thus chose to not implement it.

Figure 2: Percentage of total replicas removed over time categorized by the type of failure. [Plot: x-axis, number of days since creation (0 to 300); y-axis, percentage of missing replicas (0 to 0.8); series: Site Not Found, Replica Removed.]

To retrieve a sub-piece stored on one of these storage sites, the client downloads the web page and extracts the
text surrounded by the byte sequence markers provided by the tracker. The client then reverses the base64 encoding, decrypts the data, and verifies that it matches the checksum provided by the tracker.

Figure 3: The availability of replicas categorized by its corresponding storage site’s protection schemes. [Plot: x-axis, number of days since creation (0 to 300); y-axis, percentage of missing replicas per category (0 to 0.8); series: Puzzle Protected Replicas, Registration Protected Replicas, Anonymous Replicas.]

4.2 System Configuration

For our experimental deployment, we used a Linux ISO split into 512KB pieces and 64KB sub-pieces as our sample data file that the clients want to share. Even though we were able to store up to 512KB payloads on a single MediaWiki page, we chose to use a smaller sub-piece size. Again, another more malicious approach would be to store a payload with a size that can be uploaded and retrieved but causes either a browser or the server to choke if the operator tries to access the page through the MediaWiki administrative interface. For example, we found that it was possible to store 512KB pieces that would exhaust the default 20MB memory limit of PHP if someone tried to remove the data. Thus, the only way to remove the content is to execute the proper SQL commands directly in the database, which is likely too difficult for most users.

We initiated file sharing activity on April 10th, 2009 using a tracker and five clients deployed in our departmental lab. Each client connects to the tracker and produces a full copy of a sub-piece on one of the 5,600+ MediaWiki sites. We assume that all clients are truthful about whether a replica is available and do not falsify replica URLs. We instrumented the tracker to target each storage site only once (although variations in sub-domains and URL rewriting led to some sites being used more than once).
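With 512KB pieces and 64KB sub-pieces, every full piece holds eight sub-pieces, and only the last piece of the fileset may be shorter. The sketch below enumerates the resulting byte ranges. It is illustrative only: the function name `sub_piece_ranges` and the tuple layout are assumptions rather than the prototype's code, and 1KB is taken to mean 1,024 bytes.

```python
def sub_piece_ranges(fileset_size, piece_size=512 * 1024, sub_piece_size=64 * 1024):
    """Yield (piece_index, sub_piece_index, start, end) byte ranges, end exclusive.

    Hypothetical helper: sizes default to the 512KB/64KB configuration,
    assuming 1KB = 1,024 bytes.
    """
    for piece_index, piece_start in enumerate(range(0, fileset_size, piece_size)):
        # The last piece may contain fewer than piece_size bytes.
        piece_end = min(piece_start + piece_size, fileset_size)
        for sub_index, start in enumerate(range(piece_start, piece_end, sub_piece_size)):
            yield piece_index, sub_index, start, min(start + sub_piece_size, piece_end)


# One full piece splits into eight sub-pieces.
full_piece = list(sub_piece_ranges(512 * 1024))
```

Numbering sub-pieces within their piece, rather than globally, mirrors the two-level piece and sub-piece terminology of Section 3.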
Along with the data payload, at the top of each wiki page we stored a small paragraph with an explanation of the seemingly random text. This description also included a unique tracking link back to our web page with further information about the project. Tracking users’ click-throughs from these links allows us to measure to some extent whether humans were actually discovering our payload pages before they were deleted.

Figure 4: The cumulative availability of replicas categorized by their domain type: .com (42.5%), .edu (3.2%), .org (24.1%), US-based other (14.0%), and Non-US-based other (16.1%). x-axis: # of days since creation; y-axis: percentage of missing replicas per domain type.

Once the clients pushed out all of the data to the sites, we then used a separate tool to check daily whether the data we stored was still in place and had not been modified. We check every replica, even if it has been unavailable for some time, to ensure that the errors are not transient.

4.3 Results

We now report on the availability of the 5,646 replicas that we stored in our experiments from April 10th, 2009 to February 28th, 2010. For each missing replica, we categorize the replica as (1) removed if the site is available but the original page is missing, (2) changed if both the site and the original page are available, but the data does not match our stored checksum, or (3) not found if the site is no longer available (e.g., the domain name has expired or MediaWiki was uninstalled). Our investigation found that the missing replicas were only either removed or not found; no replica had its contents altered. On the last day of our data collection, roughly 40% of the replicas were still available and hosting the original data that the prototype clients uploaded.
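The three failure categories above map onto a small decision procedure. The following is a minimal sketch of how a daily checker might label each replica. The names `ReplicaStatus`, `classify_replica`, and `missing_fraction` are hypothetical and not taken from the authors' tool, and the inputs (reachability, page existence, observed checksum) are assumed to have been gathered by a separate probing step.

```python
from enum import Enum
from typing import Iterable, Optional


class ReplicaStatus(Enum):
    AVAILABLE = "available"
    REMOVED = "removed"      # site is up, original page is missing
    CHANGED = "changed"      # page is present, data fails the checksum
    NOT_FOUND = "not found"  # site itself is no longer available


def classify_replica(site_reachable: bool, page_exists: bool,
                     stored_checksum: str,
                     observed_checksum: Optional[str]) -> ReplicaStatus:
    """Apply the paper's three categories in order of severity."""
    if not site_reachable:
        return ReplicaStatus.NOT_FOUND
    if not page_exists:
        return ReplicaStatus.REMOVED
    if observed_checksum != stored_checksum:
        return ReplicaStatus.CHANGED
    return ReplicaStatus.AVAILABLE


def missing_fraction(statuses: Iterable[ReplicaStatus]) -> float:
    """Fraction of replicas that are not available on a given day."""
    items = list(statuses)
    if not items:
        return 0.0
    missing = sum(1 for s in items if s is not ReplicaStatus.AVAILABLE)
    return missing / len(items)
```

Checking site reachability first matters: an unreachable site says nothing about whether the page was deleted, so it must not be counted as a removal.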
The graph in Figure 2 shows a timeline of the percentage of replicas that are not available on each day that we checked. The first notable data point is that an initial 20% of the replicas were removed within the same week that they were created. The rate at which replicas are removed then tapers off as time progresses. We attribute this drop-off in activity to two possible reasons. Foremost is that by default any changes to a MediaWiki site will appear on the first page of revision logs for seven days after the revision is created, and thus our actions are more likely to be discovered soon after the data is posted. The second possible reason is that a story about our project appeared on the front page of a popular technology news website on the third day of our experiment [2]. We believe that the “notoriety” of the project during this period may have caused administrators to examine their websites to see if they were targeted by our system. Once this initial attention diminished, the slopes of the lines in Figure 2 decrease and it takes another 35 days before another 10% of the replicas are removed. After about 100 days, the growth rate of replicas being removed (i.e., the lower portion of the curve in Figure 2) tapers off and the number of sites that become unavailable begins to rise. This is expected since many of the sites were not actively used by their proprietor, and thus are taken down arbitrarily.

The graph in Figure 3 shows how the replicas were removed over time in relation to their storage site’s protection scheme. The salient aspect of the result is that initially sites that employed some type of protection were faster to remove replicas.
This is expected, since many of the sites that employed some protection were still being used by users despite having not been updated recently, whereas many of the completely open sites still displayed the default MediaWiki homepage message and thus were never even used once they were installed. Such sites are likely long forgotten by their owners, who may never discover the replicas once they pass the default seven day revision log window. But after approximately 120 days, the percentage of missing replicas stored on sites allowing for anonymous edits surpasses that of sites using the basic registration protection.

Lastly, the graph in Figure 4 charts the availability of replicas with respect to the domain name of the storage site. Data stored on .edu and .org sites was more durable than data stored on other domains; we attribute this to such organizations being likely to use open-source software for collaboration and to their internal sites often not being behind corporate firewalls.

5 Discussion

The results presented in the previous section clearly demonstrate the efficacy of the Graffiti Network model as a means for facilitating longer-term file sharing. We therefore argue that the threat of such a system does indeed exist and sites need to take measures to protect themselves from being used in the manner that we have described.

5.1 Countermeasures

Much of the feedback that we received on the project was from administrators that expressed their desire to provide an open wiki site that allowed anonymous contributions, despite the inevitable exposure to vandalism and spam. We counter that sites that do not want to require users to register an account should still use CAPTCHA protections, such as before a user is allowed to edit a page.
In practice, we found that the reCAPTCHA [31] project is the most effective protection as it does not require administrators to install special server-side graphics libraries and strikes a proper balance between availability and complexity. More complex CAPTCHA schemes would not deter future Graffiti clients that are able to solve CAPTCHAs (either manually or programmatically) and may only inhibit legitimate visually impaired users. If sites wish to still remain open, the CAPTCHA could be selectively enabled only when an unverified user tries to post data larger than some low default threshold or creates too many new pages in a short time span.

We also believe that other simple protection measures could be included in popular web applications to prevent abandoned or forgotten sites from being used for unintended purposes. For example, MediaWiki’s default behavior could be to lock down the editing features of a site after a certain number of days if it was installed but then never actually used. This approach is similar to the one used by some blogging platforms to disable comments on older posts. Administrators could easily re-enable this functionality by simply logging into the site again. Another technique is to use a page counter that is invoked on the client-side (e.g., through JavaScript) and then compare the results with server-side logs to determine whether there is an unusually large number of users accessing pages through a non-browser client. Web application frameworks, such as Ruby on Rails and Django, could also provide similar features to protect custom-made sites.

5.2 Variations & Adaptations

Other than for P2P activities, the Graffiti model is also of potential use for large-scale distributed systems used by criminal organizations, often referred to as botnets.
The goal of most botnet operators is to gain access to a large supply of computational resources for purposes of network communication (e.g., sending emails or DoS attacks). If these goals shift towards more data-centric activities, then systems based on some of the principles of the Graffiti Network model may become prevalent in order to store large amounts of data for the botnet. Alternatively, instead of storing replicated data, the commandeered storage sites could also be used as a control channel for other entities in the botnet.

6 Acknowledgments

The authors would like to thank Arvid Norberg at BitTorrent, Inc. for his assistance with the libtorrent library [5].

7 Conclusion

We have presented an overview of Graffiti Networks, a new file sharing model that allows peers to subversively use third-party storage sites as an intermediary for transferring files between users. Our client-tracker paradigm is similar to the BitTorrent protocol, but is designed to provide long term file availability to users while preserving their anonymity. We do not intend the Graffiti model to supplant BitTorrent networks, as it will never achieve the same maximum network throughput nor will it ever be as efficient. We believe, however, that our approach can have a symbiotic relationship with existing deployments: peers would use a Graffiti Network-like system to improve the long term availability of shared files, while leveraging the faster initial transfer rates of direct P2P communication for data dissemination. We have implemented a prototype and shown that data can be stored on publicly accessible sites for extended periods of time, beyond what is often possible in other existing peer-to-peer systems. After almost an entire year, roughly 40% of the data that we stored on sites that are not under our control was still available.
These results indicate that malicious users may adopt the Graffiti Network model, and thus site operators should take measures to prevent their sites from being used in this manner.

References

[1] Amazon S3. http://aws.amazon.com/s3/.
[2] Grad Student Project Uses Wikis To Stash Data, Miffs Admins. http://tech.slashdot.org/article.pl?sid=09/04/13/0120226.
[3] MediaWiki. http://www.mediawiki.org/.
[4] RapidShare.com. http://www.rapidshare.com/.
[5] Rasterbar Software. http://www.rasterbar.com/.
[6] Steghide. http://steghide.sourceforge.net/.
[7] Meeting the Challenge of Today’s Evasive P2P Traffic. White Paper, Sandvine Inc., Waterloo, Canada, 2004.
[8] Anderson, R. J., Needham, R. M., and Shamir, A. The Steganographic File System. In Proc. of the Intl. Workshop on Information Hiding (1998), pp. 73–82.
[9] Andrade, N., Mowbray, M., Lima, A., Wagner, G., and Ripeanu, M. Influences on Cooperation in BitTorrent Communities. In Proc. of the Workshop on Economics of P2P Systems (2005), pp. 111–115.
[10] Clarke, I., Sandberg, O., Wiley, B., and Hong, T. W. Freenet: A Distributed Anonymous Information Storage and Retrieval System. In Intl. Workshop on Designing Privacy Enhancing Technologies (2001), pp. 46–66.
[11] Cohen, B. Incentives Build Robustness in BitTorrent. In Proc. of the Workshop on Economics of P2P Systems (2003).
[12] Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Wide-area cooperative storage with CFS. In Proc. of the ACM Symposium on Operating Systems Principles (2001), pp. 202–215.
[13] Dingledine, R., Mathewson, N., and Syverson, P. Tor: the second-generation onion router. In SSYM’04: Proceedings of the 13th conference on USENIX Security Symposium (Berkeley, CA, USA, 2004), USENIX Association, pp. 21–21.
[14] Freedman, M. J., Freudenthal, E., and Mazières, D. Democratizing content publication with coral. In NSDI’04: Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation (Berkeley, CA, USA, 2004), USENIX Association, pp. 18–18.
[15] Garetto, M., Figueiredo, D. R., Gaeta, R., and Sereno, M. A Modeling Framework to Understand the Tussle between ISPs and P2P File-sharing Users. Performance Evaluation 64, 9-12 (2007), 819–837.
[16] Guo, L., Chen, S., Xiao, Z., Tan, E., Ding, X., and Zhang, X. Measurements, analysis, and modeling of bittorrent-like systems. In Proc. of the Internet Measurement Conference (2005), pp. 4–18.
[17] Hand, S., and Roscoe, T. Mnemosyne: P2P Steganographic Storage. In Revised Papers from the Intl. Workshop on P2P Systems (2002), pp. 130–140.
[18] Hong, G. C. StegVault: Pervasive Information Hiding in an Anonymous P2P Environment. Master’s thesis, National University of Singapore, 2003.
[19] Izal, M., Keller, U. G., Biersack, E., Felber, P., Hamra, A., and Erice, G. L. Dissecting BitTorrent: Five Months in a Torrent’s Lifetime. In Proc. of the Passive and Active Measurement Workshop (2004).
[20] Jones, R. Gmail Filesystem. http://richard.jones.name/.
[21] Katzenbeisser, S., and Petitcolas, F. A., Eds. Information Hiding Techniques for Steganography and Digital Watermarking. Artech House, Inc., 2000.
[22] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. OceanStore: An Architecture for Global-scale Persistent Storage. SIGPLAN Notices 35, 11 (2000), 190–201.
[23] Locher, T., Moor, P., Schmid, S., and Wattenhofer, R. Free Riding in BitTorrent is Cheap. In Workshop on Hot Topics in Networking (2006), pp. 85–90.
[24] Naor, M. Verification of a Human in the Loop or Identification via the Turing Test. 1996.
[25] Pouwelse, J. A., Garbacki, P., Epema, D. H. J., and Sips, H. J. The Bittorrent P2P File-sharing System: Measurements and Analysis. In Intl. Workshop on P2P Systems (2005).
[26] Purczynski, W., and Zalewski, M. Juggling with packets: Floating data storage. http://lcamtuf.coredump.cx/juggling_with_packets.txt, 2003.
[27] Qiu, D., and Srikant, R. Modeling and Performance Analysis of BitTorrent-like P2P Networks. SIGCOMM Comput. Commun. Rev. 34, 4 (2004), 367–378.
[28] Ramkumar, M., and Akansu, A. N. Capacity estimates for data hiding in compressed images. IEEE Transactions on Image Processing 10, 8 (August 2001), 1252–1263.
[29] Rowstron, A., and Druschel, P. Storage management and caching in PAST, a large-scale, persistent P2P storage utility. SIGOPS Operating Systems Review 35, 5 (2001), 188–201.
[30] Storer, M. W., Greenan, K. M., Miller, E. L., and Voruganti, K. POTSHARDS: secure long-term storage without encryption. In Proc. of the USENIX Annual Technical Conference (2007), pp. 143–156.
[31] von Ahn, L., Maurer, B., McMillen, C., Abraham, D., and Blum, M. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science (August 2008), 1465–1468.
[32] Waldman, M., Rubin, A. D., and Cranor, L. F. Publius: A Robust, Tamper-evident, Censorship-resistant Web Publishing System. In Proc. of the Conference on USENIX Security Symposium (2000), pp. 59–72.
