Reduce to the Max: A Simple Approach for Massive-Scale Privacy-Preserving Collaborative Network Measurements (Extended Version)

Fabio Ricciato
University of Salento, Italy
FTW, Austria

Martin Burkhart
ETH Zurich, Switzerland

∗ This is an extended version of the paper presented at the Third International Workshop on Traffic Monitoring and Analysis (TMA'11), Vienna, 27 April 2011.

ABSTRACT

Privacy-preserving techniques for distributed computation have been proposed recently as a promising framework in collaborative inter-domain network monitoring. Several different approaches exist to solve this class of problems, e.g., Homomorphic Encryption (HE) and Secure Multiparty Computation (SMC) based on Shamir's Secret Sharing algorithm (SSS). Such techniques are complete from a computation-theoretic perspective: given a set of private inputs, it is possible to perform arbitrary computation tasks without revealing any of the intermediate results. In fact, HE and SSS can operate also on secret inputs and/or provide secret outputs. However, they are computationally expensive and do not scale well in the number of players and/or in the rate of computation tasks. In this paper we advocate the use of "elementary" (as opposed to "complete") Secure Multiparty Computation (E-SMC) procedures for traffic monitoring. E-SMC supports only simple computations with private input and public output, i.e., it cannot handle secret input or secret (intermediate) output. Such a simplification brings a dramatic reduction in complexity and enables massive-scale implementation with acceptable delay and overhead. Notwithstanding its simplicity, we claim that an E-SMC scheme is sufficient to perform a great variety of computation tasks of practical relevance to collaborative network monitoring, including, e.g., anonymous publishing and set operations. This is achieved by combining an E-SMC scheme with data structures like Bloom Filters and bitmap strings.

1. INTRODUCTION

Privacy-preserving techniques for distributed computation have been proposed recently as a promising tool in collaborative inter-domain network monitoring; see, e.g., the motivating paper by Roughan and Zhang [19]. In the reference scenario, a set of ISPs are unwilling to share local traffic data due to business sensitivity and/or concerns about their users' privacy. On the other hand, they have a collective interest in performing some global computation on such data and sharing the final result. For example, they might want to aggregate local traffic measurements in order to reconstruct global statistics, and these might be further processed in order to unveil global threats (e.g., botnets) or discover macroscopic anomalies. As pointed out already in [8], each ISP would benefit from comparing its own local view (of traffic conditions) with the global view aggregated over all other ISPs, especially in the event of anomalies and alarms, in order to hint at whether the (unknown) root cause is local or global, which is a major discriminator for deciding about the reaction. Also, ISPs might be ready to share with other ISPs information about security incidents observed locally, provided that they can do so anonymously.

Two possible approaches to solve this class of problems are Homomorphic Encryption (HE) and Secure Multiparty Computation (SMC) based on Shamir's Secret Sharing algorithm (SSS for short). Both these techniques are "complete" from a computation-theoretic perspective¹: given a set of private inputs, it is possible, in principle, to compute any arbitrary function, including structured algorithms involving conditional statements, without revealing any of the intermediate results.
In fact, a distinguishing feature of HE and SSS is that they can operate also on secret inputs and/or provide secret outputs (see the graphical representation in Fig. 1(a)). The notions of secret and private are distinct: private data is known in cleartext to at least one player (and usually only to one), while secret data remains unknown to all players and cannot be reconstructed unless a minimum number of players agree to do so. On the other hand, such techniques are computationally expensive (especially HE) and therefore do not scale well in the rate of computation tasks (queries) and/or in the number of players.

In this paper we advocate the use of "elementary" (as opposed to "complete") SMC procedures for collaborative traffic monitoring. Such techniques, hereafter referred to as E-SMC for short, have a fundamental limit: they support only simple computations with private inputs and public output, i.e., they cannot handle secret input or secret (intermediate) output. We show that such a simplification allows for an enormous reduction in computational complexity and overhead, making such techniques amenable to massive-scale implementation. Notwithstanding its simplicity, we claim that E-SMC is sufficient to perform a broad variety of tasks of practical importance in the field of collaborative traffic monitoring. In fact, queries can be chained to build more structured computation tasks (ref. Fig. 1(b)) whenever intermediate results, which are necessarily public in E-SMC, are not regarded as sensitive. Moreover, we show that an additive E-SMC scheme can be combined with local transformations on the private data and/or with particular data structures (e.g., Bloom Filters, bitmap strings) in order to extend the range of supported operations.

¹ A fully homomorphic, computationally complete HE scheme has been introduced recently by Gentry [12]. The completeness of SSS is shown in [3].

In this work we take a first step towards unfolding the potential of E-SMC for traffic monitoring. We make three main contributions. First, we present a simple scheme for E-SMC, called GCR, which is based on additive-only or multiplicative-only secret computation and extends an idea presented earlier in [2]. Second, we highlight some system-design aspects of GCR that enable massive-scale implementation: in particular, we propose to split the computation into offline randomization and online aggregation phases. Third, we describe how GCR can support a number of operations relevant to collaborative traffic monitoring (such as set operations, anonymous publishing and anonymous scheduling) when combined with data structures like Bloom Filters and bitmap strings.

The aim of this report is not to provide definitive results or quantitative assessments, but rather to indicate a direction of work to researchers engaged in inter-domain traffic monitoring. We claim that a broad variety of tasks of practical relevance to this field do not require resorting to "complete" (and complex) privacy-preserving schemes but can be satisfactorily attained by E-SMC. Thanks to their simplicity, collaborative systems based on E-SMC are amenable to massive-scale implementation, with a very large number of players and/or a very high rate of queries. In turn, system scalability paves the way towards customer-driven collaborative monitoring, where participating players do not map to ISPs but rather to their customers: think, e.g., of mid-to-large companies with their own IT security staff. This is indeed a new avenue of collaborative network monitoring that might have in E-SMC its enabling technology.

2. THE GCR METHOD

We consider the classical SMC scenario where a set of $n$ players collaborate to compute a function of some private data, e.g., traffic statistics, network logs, records of security incidents.
As customary in SMC, we assume a semi-honest model (also known as honest-but-curious): all players cooperate honestly to compute the final result, but a subset of them might collude to infer private information of other players. In other words, no malicious player will attempt to interrupt or corrupt the computation process, e.g., by providing incorrect input data.

In this section we present a simple method to perform secure private addition which extends an idea presented earlier by Atallah et al. in [2, § 4.1] based on additive secret sharing. We refer to our method as "Globally-Constrained Randomization", GCR for short. We show that GCR, which is conceptually simple, lends itself very well to massive-scale implementation. We also propose for the first time a variation of the scheme to perform secure multiplication.

2.1 Notation

We consider a set of $n$ players $\{P_i, i = 1 \dots n\}$ with $n \geq 3$ (normally $n \gg 1$). The maximum number of colluding players will be denoted by $l$ (collusion threshold) with $l \leq n - 2$. Note that $l$ is a design parameter that can be set independently from the system size $n$. For each computation task (query), each player $P_i$ involves the following elements:

• $a_i$ is the private input of $P_i$ to the summation. For some queries, it is obtained by applying a local transformation $g()$ on some other inner private data $b_i$, i.e., $a_i = g(b_i)$.

• $r_i$ is the private random element which $P_i$ has previously generated cooperatively with other players in the way presented later.

• $v_i \stackrel{\text{def}}{=} a_i + r_i$ is the public input which $P_i$ eventually announces to the other players.

The collection of random elements across all players constitutes a Random Set (RS) and will be denoted by $\mathbf{r} \stackrel{\text{def}}{=} \{r_i, i = 1 \dots n\}$. The goal of the computation round is to obtain the public output result

$$A \stackrel{\text{def}}{=} f(a_1, a_2, \dots, a_n) = f(g(b_1), g(b_2), \dots, g(b_n))$$

without disclosing the values of the individual $a_i$'s.

For each computation, all input elements ($a_i$, $r_i$, $v_i$) and the output $A$ must be in the same format. For the additive scheme they must be defined over the same additive commutative group (Abelian group). We will consider the following distinct cases:

Real scalars: $a_i$, $r_i$ and $A$ are real numbers defined in the interval $R_p \stackrel{\text{def}}{=} [0, p]$. For the sake of simplicity we will assume $p$ integer, but not necessarily prime. The group operation in this case is modulo-$p$ addition. A generic random element $x$ is a random value extracted uniformly in $[0, p]$, i.e., $x \sim U(0, p)$. The null element is the zero value.

Integer scalars: this is a sub-case of the previous one, where $a_i$, $r_i$ and $A$ are integers in $Z_p \stackrel{\text{def}}{=} [0, p]$. Unless otherwise specified, $p$ is not necessarily a prime number. In practice, it is convenient to choose $p = 2^q$ ($q$ integer) so that modulo-$p$ addition maps to the wrap-around of a $q$-bit counter.

Binary strings: $a_i$, $r_i$ and $A$ are binary strings of length $k$. The group operation is therefore bitwise addition (XORing). In this context a generic random element $x$ is a random string, i.e., a collection of bits set randomly to 1 or 0 independently and with equal probabilities. The null element is a string with all '0's.

Arrays of counters: $a_i$, $r_i$ and $A$ are vectors of $k$ elements, and each element is a $q$-bit counter. The group operation is therefore an array of $k$ parallel modulo-$p$ additions. In this context a generic random element $x$ is a collection of $k$ random values $\langle x_1, x_2, \dots, x_k \rangle$ extracted independently and uniformly in $[0, p - 1]$. The null element is an array of zeros.

The format of the input elements $a_i$, $r_i$, the exact values of the parameters (e.g., $k$, $q$) and, if applicable, the choice of the transformation function $g()$ depend on the particular kind of operation (query) as detailed in § 5. In the following we will use the symbol '+' to refer generically to the addition between two terms and '$\sum$' for multiple terms, without specifying the group operation.

[Figure 1: Graphical representation of a "complete" secure procedure with secret intermediate results (a) and a sequence of "elementary" secure operations chained by public intermediate results (b). Each panel spans the Secret, Private and Public Spaces, from private inputs to public output.]

2.2 Description

The central aspect of the GCR method is that the RS is constructed in a way that guarantees the zero-sum condition, i.e., the composition of random elements across all players sums up to the null element:

$$\sum_{i=1}^{n} r_i = 0. \qquad (1)$$

Moreover, the generation of the RS ensures that the individual $r_i$'s cannot be inferred by other players, provided that the number of colluding players remains below the collusion threshold $l$. Each player $P_i$ then shares with other players (e.g., via a central collector) the sum of data plus random elements, i.e., $v_i = a_i + r_i$, which serves as the public input to the computation. When all input elements $v_i$ are collected, the value of $A$ is obtained by summing them all, formally:

$$\sum_{i=1}^{n} v_i = \sum_{i=1}^{n} (a_i + r_i) = \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} r_i = A + 0 = A. \qquad (2)$$

Note that the value of $A$ can be reconstructed only when the inputs from all players have been collected: it is sufficient that a single player (among those that have contributed to generate the RS $\mathbf{r}$) fails to provide its input element to prevent the computation of $A$.
This is the main disadvantage of GCR compared to SSS, as discussed later in § 3.3.

RS generation. Hereafter we describe how each generic player $P_i$ ($i = 1 \dots n$) constructs its random element $r_i$ in cooperation with other players, so as to collectively build the RS $\mathbf{r}$. Note that the RS generation procedure can be run in parallel by all players and is completely asynchronous. Each random element is initially set to the null element, i.e., $r_i = 0$. Each player $P_i$ extracts $l + 1$ random variables $x_{i,j}$ ($j = 1 \dots l + 1$) and computes their sum $y_i \stackrel{\text{def}}{=} \sum_j x_{i,j}$. It calculates the additive inverse² $\bar{y}_i$ of $y_i$ and adds it to its own random element, i.e., $r_i \leftarrow r_i + \bar{y}_i$. At the same time, $P_i$ contacts $l + 1$ randomly selected other players and sends one variable $x_{i,j}$ to each of them: each contacted player $P_j$ will then increment its random element by $x_{i,j}$, i.e., $r_j \leftarrow r_j + x_{i,j}$. This method is secure against collusion of up to $l$ players. Notably, the value of $l$ is a free parameter, independent from the system size $n$, which can be tuned to trade off communication overhead against robustness to collusion: both scale linearly in $l$.

² In modular arithmetic the additive inverse $\bar{y}$ of $y$ is the element that satisfies $y + \bar{y} = 0$. For real numbers in $[0, p]$, $\bar{y} = p - y + 1$, while for binary strings $\bar{y} = y$.

Computation phase. With GCR the computation is basically a summation over $n$ public inputs, the $v_i$'s, and no particular constraint applies to the aggregation method, which can be centralized or distributed. For the sake of simplicity, we assume in the following a fully centralized scheme, with a single master (not necessarily a player) that is in charge of launching the query, collecting the $n$ public inputs, computing the result and finally publishing it to all the players.
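
The RS generation and the centralized computation phase described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names are hypothetical, all players are simulated in one list, and integer scalars are taken as residues $0 \dots p-1$ with $p = 2^{32}$, which is an assumption made for the example.

```python
import secrets

def generate_random_set(n, l, p):
    """Simulate RS generation for n players over integers modulo p.

    Each player i draws l+1 random values, sends one to each of l+1
    randomly chosen other players, and adds the inverse of their sum
    to its own random element, so the whole set sums to zero.
    """
    assert n >= 3 and 0 <= l <= n - 2
    rng = secrets.SystemRandom()
    r = [0] * n                              # every r_i starts at the null element
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, l + 1):  # l+1 distinct peers of player i
            x = secrets.randbelow(p)         # x_{i,j}, uniform in 0..p-1
            r[j] = (r[j] + x) % p            # peer P_j adds x_{i,j}
            r[i] = (r[i] - x) % p            # P_i accumulates the inverse of y_i
    return r

def public_inputs(a, r, p):
    """v_i = a_i + r_i (mod p): the only values that leave each player."""
    return [(ai + ri) % p for ai, ri in zip(a, r)]

def aggregate(v, p):
    """The master sums all public inputs; the random elements cancel out."""
    return sum(v) % p

p = 2 ** 32                                  # assumed modulus (q = 32)
a = [5, 7, 11, 13]                           # private inputs of four players
r = generate_random_set(n=4, l=2, p=p)
A = aggregate(public_inputs(a, r, p), p)
```

In a deployment the values $x_{i,j}$ would travel over secure pairwise connections, and the master would see only the $v_i$'s.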
Another possible option is tree-based aggregation: players are arranged into a tree, where each node collects the inputs from its children and sends the summation result to its parent node, until the root computes and publishes the final result. More sophisticated peer-to-peer methods can also be adopted at the cost of some additional coordination overhead. The key point is that the GCR method is oblivious to the particular input aggregation scheme.

2.3 Extension to multiplication

It is straightforward to adapt the GCR scheme to support multiplication of positive integers. First, the input and output data $a_i$, $r_i$ and $A$ must be defined over the multiplicative group $[1, p]$ with $p$ a prime number: primality guarantees that each element has a unique multiplicative inverse element (note the difference with additive GCR, which does not require primality of $p$). Second, all modulo-$p$ additions are replaced by modulo-$p$ multiplications. Third, the balancing constraint eq. (1) is replaced by:

$$\prod_{i=1}^{n} r_i = 1. \qquad (3)$$

In this way we obtain a multiplicative variant of the additive sharing scheme, which to the best of our knowledge was never considered in previous literature. It is important to remark that GCR can support either addition or multiplication, but it cannot compose addition and multiplication operations without reconstructing and resharing values. As a secret evaluation scheme, it is therefore not computationally complete.

Finally, note that multiplicative GCR cannot take zero as private input, as that would automatically force the public input to zero as well, i.e., $a_i = 0 \Rightarrow v_i = a_i \cdot r_i = 0 \;\; \forall r_i$, therefore leaking the private value. In practice, before launching a secret multiplication, one can easily check for the presence of zero inputs, e.g., with a preliminary round of Conditional Counting (see § 4.2).
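
A minimal sketch of the multiplicative variant follows, under stated assumptions: players are simulated in one process, the function names are hypothetical, the group is taken as the nonzero residues $1 \dots p-1$ modulo a prime $p$, and the true product is assumed to be smaller than $p$ so that it is recovered exactly.

```python
import secrets

def generate_multiplicative_set(n, l, p):
    """Simulate RS generation whose elements multiply to 1 modulo a prime p."""
    assert n >= 3 and 0 <= l <= n - 2
    rng = secrets.SystemRandom()
    r = [1] * n                                  # null element of multiplication
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, l + 1):
            x = 1 + secrets.randbelow(p - 1)     # uniform in 1..p-1, never zero
            r[j] = (r[j] * x) % p                # peer multiplies by x_{i,j}
            r[i] = (r[i] * pow(x, -1, p)) % p    # modular inverse (Python 3.8+)
    return r

def secure_product(a, r, p):
    """Multiply the public inputs v_i = a_i * r_i (mod p)."""
    out = 1
    for ai, ri in zip(a, r):
        assert ai % p != 0, "zero inputs would leak (see the remark above)"
        out = (out * ((ai * ri) % p)) % p
    return out

p = 2 ** 31 - 1                                  # a Mersenne prime, assumed modulus
a = [2, 3, 5, 7]
r = generate_multiplicative_set(n=4, l=2, p=p)
B = secure_product(a, r, p)
```

The explicit check on zero inputs stands in for the preliminary Conditional Counting round mentioned in the text.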

2.4 Sensitivity of Output

It is important to note that SMC in general (not only E-SMC) only guarantees that no information is leaked from the computation process. That is, it solves the problem of how to compute a function $f()$ on distributed data in a privacy-preserving way. An orthogonal problem is to find out what is safe to compute. Just learning the resulting value $f()$ could allow the inference of sensitive information. For example, if the private input bits must remain secret, computing the logical AND of all input bits is insecure in itself: if the final result is 1, all input bits must be 1 as well and are thus no longer secret. In SMC, it is the responsibility of the input providers to verify that learning $f()$ is acceptable, in the same way as they have to verify this when using a trusted third party. While with SMC this analysis has to be performed for the final result only, in E-SMC it has to be performed individually for each step computing public intermediate results.

A recently suggested approach to deal with this is differential privacy [10, 15], which systematically randomizes answers to database queries to prevent inference of sensitive input data. If data records are independent, it guarantees that it is statistically impossible to infer the presence or absence of single records in the database from answers to queries. Differential privacy and SMC complement each other very well. Using differential privacy, it is possible to specify a randomized output $\tilde{f}()$ that is safe for public release. Using SMC, it is possible to actually compute $\tilde{f}()$ in a privacy-preserving manner, without relying on a trusted third party. Intuitively, the more strongly $f()$ aggregates input data, the less randomness needs to be added.

3. SYSTEM-DESIGN CONSIDERATIONS

In this section we consider a number of system-level aspects.
In particular, we propose to split the GCR operation into an offline RS generation phase and an online aggregation phase, and show how joins and leaves of nodes can be handled efficiently. We also compare the GCR scheme to Shamir's secret sharing scheme, which, among the existing alternatives for performing SMC, allows the most efficient solutions.

3.1 Offline generation of Random Sets

One key advantage of GCR is that the process of generating the RS is completely decoupled, and can be run independently, from the actual computation round. This has important implications for the design of a massive-scale system, enabling efficient management of the communication load and minimal response delay. We devise a system where lists of RS are generated offline and stored for later use. At any time, each player $P_i$ has available a collection of random elements $r_i[u]$, indexed in $u$, which can be readily used for future computation rounds. The communication protocol must ensure that the RS indexing is unambiguous and synchronized across all players. During the online computation phase, the query command broadcast by the central master will indicate explicitly the RS index to be used for the production of the public inputs $v_i$'s.

Performing RS generation offline brings several advantages. First, it reduces the query response delay to that of an equivalent cleartext summation. Second, it reduces the impact of communication overhead on the network load, since the RS generation process can be scheduled in periods of low network load (e.g., at night or on weekends). Moreover, generation of multiple RS can be batched, meaning that in a single secure connection (typically SSL over TCP) two players can exchange multiple <variable, index> pairs $\{x_{i,j}[u], u\}$ which collectively build a collection of RS $\{\mathbf{r}[u]\}$.
This greatly reduces the communication overhead associated with connection establishment (handshaking, authentication, key exchange, etc.).

3.2 Joining and leaving

In the GCR scheme, the set of players participating in the computation round must match exactly the set of players that have previously built the RS: the final result will not be reconstructed if the two sets differ by even a single element. If RSs are generated offline, the set of players might have changed during the interval between the generation of $\mathbf{r}[u]$ and its consumption in a query. It would be very impractical to trash all pre-computed RSs upon every new player joining or leaving the system, an event not infrequent for systems with many players. Fortunately this is not necessary, and each legacy RS can be incrementally adjusted upon a new join or leave with only $l + 1$ operations.

When a new player $P_i$ joins the system, it learns from other players the index range currently in use $\{u_1 \dots u_2\}$ (note this information is public) and computes a set of random variables $x_{i,j}[u]$ for $j = 1 \dots l + 1$ and $u \in \{u_1 \dots u_2\}$. It then sets its local random elements as $r_i[u] = \bar{y}_i[u]$ (recall that $y_i[u] = \sum_{j=1}^{l+1} x_{i,j}[u]$). Then for each index value $u$ it selects $l + 1$ other players to which it sends the individual variables $x_{i,j}[u]$. Similarly, when an existing player $P_i$ wants to leave the system, it must first "release" its random elements $r_i[u]$. The simplest way to accomplish that is to pass the value of $r_i[u]$ to another randomly selected player $P_j$ and let the latter update its local random element as $r_j[u] \leftarrow r_j[u] + r_i[u]$. Note that we are assuming a "cooperative leaving" behavior: players release their unused random elements to the system before leaving.
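
The incremental join and cooperative leave can be sketched as follows for a single RS index $u$. This is an illustrative simulation only: the RS is held in one Python list, the helper names `join` and `leave` are hypothetical, and integer residues modulo $p$ are assumed.

```python
import secrets

def join(r, l, p):
    """Add a new player to an existing zero-sum RS (one index u).

    The newcomer draws l+1 values, sends one to each of l+1 existing
    players, and keeps the additive inverse of their sum.
    Returns the position of the new player in r.
    """
    rng = secrets.SystemRandom()
    y = 0
    for j in rng.sample(range(len(r)), l + 1):
        x = secrets.randbelow(p)
        r[j] = (r[j] + x) % p          # existing player adds x_{i,j}[u]
        y = (y + x) % p
    r.append((-y) % p)                 # newcomer's element is the inverse of y
    return len(r) - 1

def leave(r, i, p):
    """Cooperative leave: player i releases its element to a random peer."""
    released = r.pop(i)
    j = secrets.randbelow(len(r))      # randomly selected remaining player
    r[j] = (r[j] + released) % p

p = 2 ** 16
rs = [3, 10, p - 13]                   # a toy zero-sum RS for three players
join(rs, l=1, p=p)
size_after_join = len(rs)
sum_after_join = sum(rs) % p
leave(rs, 0, p)
```

Both adjustments preserve the zero-sum condition of eq. (1), which is why pre-computed RSs need not be discarded.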
However, if a player shuts down without releasing its random elements (e.g., due to failure, power-off or disconnection), all RSs in the entire system are invalidated and become useless. In large-scale systems such events might not be infrequent, and proper countermeasures must be adopted to minimize their impact (e.g., node redundancy).

3.3 GCR versus Shamir's Scheme

We now compare GCR to Shamir's secret sharing scheme [20], denoted by SSS. E-SMC, along with all the use cases described in the following sections, can be implemented with either GCR or SSS. In GCR, reconstruction of public values is implicitly done after each processing step, while in SSS reconstruction needs to be scheduled explicitly if desired.

In SSS, a secret value $s$ is shared among a set of $n$ players by generating a random polynomial $f$ of degree $t < n$ over a prime field $Z_p$, such that $f(0) = s$. Each player $i = 1 \dots n$ then receives an evaluation point $s_i = f(i)$, called the share of player $i$. The secret $s$ can be reconstructed from any $t + 1$ shares using Lagrange interpolation but is completely undefined for $t$ or fewer shares. Because SSS is linear, addition of two shared secrets can be computed by having each player locally add his shares of the two values. Multiplication of two shared secrets requires an extra round of communication to guarantee randomness and to correct the degree of the new polynomial [11]. Thus, a distributed multiplication requires a synchronization round with $n^2$ total messages. For multiplications to work, the degree must be such that $n \geq 2t + 1$.

There are two main advantages of SSS over GCR. First, the basic operations for addition and multiplication accept public, private, and also secret input data and output secret data. That is, even without reconstructing intermediate values, it is possible to arbitrarily compose secret operations, corresponding to Fig. 1(a).
The GCR scheme allows composition of addition and multiplication only if intermediate results are publicly reconstructed, because the sharing operation to be applied (additive or multiplicative) depends on the next operation type. The second advantage of SSS is that it realizes a $(t + 1)$-out-of-$n$ threshold sharing scheme. That is, any set of $t + 1$ players can reconstruct a secret, which makes the scheme robust against up to $n - t - 1$ "missing" players. In GCR, a single non-responsive player renders reconstruction of secret information impossible.

While E-SMC can also be implemented with SSS, GCR is highly optimized for online processing of queries. SSS requires linear storage overhead ($n$ shares to be stored for each secret value), whereas GCR has constant storage overhead (one random value per private input). When processing the query, GCR involves zero communication overhead, since the players just send their randomized values instead of the original values to the aggregation node(s). In SSS, when $n$ players want to sum up their values, each of them generates $n$ shares ad hoc and distributes them to the others. In principle, the players could pre-generate $t$ random shares and distribute them in a pre-processing phase. In the online phase, they would calculate the remaining $n - t$ shares using Lagrange interpolation, such that the interpolated polynomials represent their actual secrets. However, after distributing the last shares, each player still needs to perform $n - 1$ additions locally and, for final reconstruction, send its share of the sum to the aggregation node(s), which eventually interpolate the final polynomial. It is not obvious how to further split this process into an offline pre-processing phase and an online phase similar to GCR, where a single message and addition operation is enough.

Another advantage of GCR is that the additive scheme is not restricted to prime fields.
This allows setting the field size to $2^{32}$ or $2^{64}$ and therefore using the implicit 32-bit (64-bit) register wrap-around of CPU operations instead of performing an explicit modulo operation³. Furthermore, the multiplicative GCR scheme does not need an additional synchronization round as SSS does.

In summary, provided that intermediate results are not sensitive, GCR allows for a much smaller storage and computation overhead during the online processing phase.

³ In general, $\mathrm{mod}(a, n) = a - n \cdot \lfloor a/n \rfloor$, which uses an additional division, multiplication, and subtraction operation.

4. BASIC OPERATIONS

Here we briefly sketch some basic operations that can be mapped to a secure addition with a public parameter and/or a public conditional statement. As such, they can be accomplished directly by the GCR method or any other scheme for secure addition.

4.1 Summation

The summation of positive real scalars $A = \sum_i a_i$, with $a_i \in [0, p]$, is performed directly as explained above via modulo-$p$ additions. The only significant constraint is on the value of $p$, which must be greater than the total sum, i.e., $p > A$. The method can be easily extended to handle negative elements defined in $[d_1, d_2]$, with $d_1 < 0 < d_2$, by imposing a fixed shift $+|d_1|$ on all inputs $a_i$'s and then subtracting $n|d_1|$ from the output. Note however that summation of negative numbers is unusual in traffic monitoring.

4.2 Conditional Counting

We consider two versions of Conditional Counting (CC) queries: "player counting" and "item counting". In the first version, the goal is to count how many players match a public condition $C$ which is explicitly announced as a public query argument. Each player $P_i$ sets $a_i$ to 1 if it matches the condition $C$ and to 0 otherwise. Therefore CC maps to a particular case of summation, where $a_i \in \{0, 1\}$ and $p \geq n + 1$.
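
The fixed-shift trick for negative inputs and the player-counting query can be sketched as below. For brevity the random elements are produced by a simple stand-in (`zero_sum_masks`, which draws $n - 1$ values and derives the last one), not by the distributed RS generation of § 2.2; all names are hypothetical and integer inputs are assumed.

```python
import secrets

def zero_sum_masks(n, p):
    """Stand-in for an RS: n values modulo p that sum to zero.
    (Not the distributed generation of GCR, only a placeholder.)"""
    r = [secrets.randbelow(p) for _ in range(n - 1)]
    r.append((-sum(r)) % p)
    return r

def secure_sum(a, p):
    """Additive query over non-negative integers; requires p > sum(a)."""
    r = zero_sum_masks(len(a), p)
    v = [(ai + ri) % p for ai, ri in zip(a, r)]   # public inputs
    return sum(v) % p

def secure_sum_signed(a, d1, p):
    """Inputs in [d1, d2] with d1 < 0: shift by |d1|, then subtract n*|d1|."""
    shift = abs(d1)
    return secure_sum([ai + shift for ai in a], p) - len(a) * shift

def count_players(matches, p):
    """Player counting: a_i is 1 if player i matches condition C, else 0."""
    return secure_sum([1 if m else 0 for m in matches], p)

p = 2 ** 32
total = secure_sum_signed([-5, 3, 10, -2], d1=-5, p=p)
matching = count_players([True, False, True, True], p)
```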
In the "item counting" version, instead, the goal is to count the total number of items (e.g., hosts or alarm records) matching the condition $C$, where multiple items might be observed by a single player. Again, counting maps directly to summation of integers.

CC queries can serve as a preliminary round to other more advanced queries, e.g., to identify the presence of zero inputs before multiplication (see § 2.3), or to discover the exact number of active players before a round of Anonymous Scheduling (see § 5.4).

4.3 Histograms and max/min discovery

Each player $P_i$ has a scalar private value $b_i$ and the problem is to derive a $K$-bin histogram of the distribution of the $b_i$'s. This can be easily achieved by using CC queries, indexed in $k$, with condition $C := Y_{k-1} < b_i \leq Y_k$, wherein the threshold values $\{Y_k, k = 1 \dots K\}$ represent the bin boundaries. The number of CC queries is equal to the number of bins $K$. However, since bin boundaries are pre-determined, the queries can be batched in a single round using an array of $K$ counters.

In a similar way it is possible to discover the maximum value of the $b_i$'s. Again, one can resort to a sequence of CC queries where the threshold values $Y_k$ are adjusted dynamically based on the previous result following a binary search. If the $b_i$'s are integers and upper bounded by $p$, the maximum is found in $\log_2 p$ rounds. Note however that the results of all intermediate queries are public, therefore this method discloses more information about the distribution of the $b_i$'s than just the maximum. In a similar way it is possible to discover the minimum.

5. ADVANCED OPERATIONS

Here we show a few examples of more advanced operations which can be mapped to E-SMC queries in combination with specific constraints on the input data elements and/or a proper local transformation function $g()$.
For each of them we illustrate a possible application for collaborative network monitoring. This section is one of the main contributions of the paper: to the best of our knowledge we are the first to "interpret" the following operations as applications of SMC using the additive sharing scheme.

5.1 Multiplication

Multiplication of positive integers can be accomplished directly by the multiplicative version of GCR presented in § 2.3. Alternatively, the multiplication of positive real numbers $B = \prod_i b_i$ (for $b_i > 0$) can be mapped to a summation in the logarithmic domain. Each player locally computes $a_i = \log_c b_i$ and then the computation proceeds as a simple summation of real numbers, leading to $A = \sum_i a_i$. Finally, the result is computed as $B = c^A$. Some numerical issues might arise when the product involves a large number of non-unitary terms, due to the accumulation of rounding errors in the representation of the logarithmic values; these however are well-studied problems.

5.2 Set Operations

In this section, we first describe how (probabilistic) set operations can be implemented using Bloom filters with any SMC scheme that supports both private additions and multiplications (e.g., SSS). We then outline what subpart of that functionality can easily be implemented with GCR.

Bloom filters (BF) are powerful data structures for representing sets [5]. A Bloom filter for representing a set $S = \{x_1, x_2, \dots, x_n\}$ of $n$ elements is described by an array of $m$ bits, initially all set to 0. The BF uses $k$ independent hash functions $h_1, \dots, h_k$ with range $1, \dots, m$. For each element $x \in S$, the bits $h_i(x)$ are set to 1 for $1 \leq i \leq k$. To check whether an element $y$ is a member of $S$, we simply check whether all bits $h_i(y)$ are set to 1.
As long as the BF is not saturated, i.e., $m$ is chosen sufficiently large to represent all elements, the total number of non-zero buckets allows $|S|$ to be estimated accurately. Counting Bloom Filters (CBF) are a generalization of BFs, which use integer arrays instead of bit arrays. Thus, CBFs can represent multisets, in which each element can be represented more than once. Note that, while a (C)BF allows element membership to be checked efficiently, it cannot in general be used to enumerate the contained elements. Compared to state-of-the-art approaches for privacy-preserving set operations, which use homomorphic encryption (e.g., [14]), this allows for very efficient and scalable solutions.

Set Union. If each player $i$ has a local set $S_i$, they can construct the union of their sets $S = S_1 \cup S_2 \cup \dots \cup S_n$ by performing private OR ($\vee$) over their BF arrays. If inputs are multisets, represented by CBFs, the aggregation operation is addition instead of OR. Using CBFs, each player can learn the number of occurrences of specific elements across all players or the number of other players that report each element (by using a BF as input). From the aggregate CBF, one could, for instance, compute the entropy of the empirical element distribution.

Set Intersection. In order to perform set intersection on BFs, the players simply use the AND ($\wedge$) operation for aggregating their sets $S = S_1 \cap S_2 \cap \dots \cap S_n$. Only buckets set to 1 in all the players' BFs will evaluate to 1 in the aggregate BF. In this specific scenario, it is also possible for each player $i$ to enumerate all elements in $S$ simply by iterating over all $x \in S_i$ and checking whether $x \in S$, since $S \subseteq S_i$.

Set Operations with GCR. GCR directly supports the addition operation and therefore set union on multisets.
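The bucket-wise aggregation rules described above (OR for set union, AND for set intersection, addition for multisets) can be illustrated as follows. This is a plain, non-private sketch: the function names are hypothetical, the filters are ordinary Python lists, and the code only shows which function the private aggregation would evaluate, not how the inputs are protected.

```python
def bf_union(filters):
    """Bucket-wise OR over the players' bit arrays (set union)."""
    return [int(any(column)) for column in zip(*filters)]


def bf_intersection(filters):
    """Bucket-wise AND over the players' bit arrays (set intersection)."""
    return [int(all(column)) for column in zip(*filters)]


def cbf_sum(filters):
    """Bucket-wise addition (multiset union on counting Bloom filters).

    With bit arrays as input, each bucket of the result is the number of
    players that reported it.
    """
    return [sum(column) for column in zip(*filters)]


# Three players, each with a 4-bucket filter (toy values).
p1 = [1, 0, 1, 0]
p2 = [1, 1, 0, 0]
p3 = [1, 0, 0, 0]

print(bf_union([p1, p2, p3]))         # [1, 1, 1, 0]
print(bf_intersection([p1, p2, p3]))  # [1, 0, 0, 0]
print(cbf_sum([p1, p2, p3]))          # [3, 1, 1, 0]
```

Of these three aggregations, only the bucket-wise addition corresponds to an operation that GCR supports directly.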
If the counts in each bucket are not sensitive, the union and intersection of sets can be computed from the public union of multisets: the intersection, for instance, is given by selecting all elements with count $n$. However, private union and intersection directly on sets cannot be delivered by GCR. In fact, union requires OR, i.e., a combination of addition and multiplication (see footnote 4) not supported by GCR, while the problem with intersection is that multiplicative GCR does not include 0 (see § 2.3).

5.3 Anonymous publishing

The goal is to let one player $P_1$ publish to all other players a binary string $w$ without revealing its identity. The string $w$ can be, for example, a malware payload that $P_1$ has discovered with an IDS, or the description of an attack which was observed locally. Moreover, $w$ could be used as a public condition for a future Conditional Counting round (§ 4.2), e.g., to discover how many other players have observed the same event. There are several reasons why the publisher wants to remain anonymous. First, knowing that it was hit by the malware might be detrimental to its reputation among customers. Second, such information might benefit other potential attackers.

DC-nets [9] are a basic and unconditionally secure solution for anonymous publishing. In the following, we devise an alternative solution that does not require pair-wise shared secrets, and deals with the problem of i) detecting collisions and ii) scheduling the publication process to avoid collisions.

Let $k$ denote the length of string $w$, and denote by $C(w)$ a Cyclic Redundancy Check (CRC) control field of length $c$ computed on $w$; the need for CRC is explained below. It is straightforward to map an Anonymous Publishing round to a bit-wise summation on strings of length $k + c$.
The publisher $P_1$ sets its data element to the concatenation of $w$ and $C(w)$, i.e., $a_1 = \langle w, C(w) \rangle$, while all other players set their data elements to null ($a_j = 0$, $j \ne 1$). Therefore the public result will return the string $w$ in cleartext, i.e., $A = a_1 = \langle w, C(w) \rangle$, but since the individual data elements remain unknown the identity of the publisher cannot be reconstructed. Such a simple approach works only if exactly one player attempts to publish in the computation round: if two (or more) players $P_1$ and $P_2$ attempt to publish different strings, we have a collision, i.e., the computed result will be the combination $A = \langle w_1 \oplus w_2, C(w_1) \oplus C(w_2) \rangle$ ('$\oplus$' for bit-wise summation) from which neither of the elements $w_1$, $w_2$ can be derived. However, the collision can be easily revealed by CRC failure, as in general $C(w_1 \oplus w_2) \ne C(w_1) \oplus C(w_2)$. The “collision recovery” procedure can simply foresee the repetition of new anonymous publishing rounds associated with a back-off scheme to avoid that the same players collide again in the next round, a mechanism conceptually equivalent to Slotted-Aloha.

Footnote 4: Note that with $a$, $b$ being bits, $a \vee b = a + b - ab$ and $a \wedge b = ab$.

A simple “detection and recovery” approach is not effective when the instantaneous rate of publishing attempts is high; this is of particular concern in large-scale systems with many players ($n \gg 1$) and/or in presence of correlated attempts (e.g., a spreading malware payload caught simultaneously by different domains). In such cases it is preferable to adopt a “collision prevention” method by orderly scheduling the publishing rounds for different players. This can be achieved by a single round of anonymous scheduling, as explained below.

5.4 Anonymous scheduling

The problem is defined as follows.
Out of the total $n$ players, a subset of $m < n$ “active” players are ready to perform a given action, e.g., anonymous publishing. The problem is then to schedule the $m$ active players without knowing nor revealing their identities. This apparently difficult task can be easily accomplished by bit-wise summation over strings of size $k \gg m$. At the query round, the inactive players set their data elements to the null string, while each active player $P_i$ extracts uniformly a random integer $q_i \sim U(1, k)$ and then builds its data element $a_i$ with a single '1' at the $q_i$-th position and all other bits set to '0'. The bitmap length $k$ must be set large enough to ensure that the bit-collision probability (i.e., the probability of two or more players independently picking the same random value $q_i$) is kept acceptably low.

Assuming that no bit-collision has occurred, the final (public) result $A$ is a bitmap with $m$ '1's and $k - m$ '0's. Upon learning $A$, each active player $P_i$ checks whether the bit in the $q_i$-th position is set to '1', and if so it counts the number of '1's in the preceding positions, say $\mu_i$, from which it learns that it has been scheduled in the successive $(\mu_i + 1)$-th query round. If otherwise the $q_i$-th bit is '0', $P_i$ infers that a collision has occurred and waits for the next scheduling round.

Note that in case of bit-collisions the round does not completely fail: if a collision involves only two (or any even number of) players, the colliding players will simply wait for the next scheduling query. If three (or any odd number of) players have collided on the same $q$-th bit, they would again collide in the query round associated with that bit. However, this is not a serious problem as long as collisions in the query rounds can be detected and recovered (e.g., by CRC failure in case of Anonymous Publishing).

The number of active players $m$ is relevant to the setting of the bitmap length $k$ ($k \gg m$).
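The scheduling round described above can be simulated with a short Python sketch. Several things here are assumptions made for illustration: the function names are hypothetical, bit positions are 0-based rather than $1 \dots k$, and the private E-SMC aggregation is replaced by a plain bit-wise XOR over all data elements, which reproduces the public result $A$ but offers no privacy by itself.

```python
import random


def make_element(q, k):
    """Data element of an active player: bitmap of length k with a single 1 at q."""
    return [1 if pos == q else 0 for pos in range(k)]


def aggregate(elements, k):
    """Bit-wise summation (XOR) of all data elements; stands in for the E-SMC round."""
    result = [0] * k
    for element in elements:
        result = [r ^ b for r, b in zip(result, element)]
    return result


def assigned_round(q, result):
    """1-based query round for the player that picked position q, or None if its bit is 0."""
    if result[q] == 0:
        return None  # an even number of players picked q: wait for the next scheduling round
    return sum(result[:q]) + 1  # mu_i ones precede position q, so the slot is mu_i + 1


k = 64                       # bitmap length, chosen much larger than the number of active players
rng = random.Random(7)       # seeded only to make the demo repeatable
picks = [rng.randrange(k) for _ in range(5)]   # five active players
inactive = [[0] * k for _ in range(3)]         # three inactive players send the null string
A = aggregate([make_element(q, k) for q in picks] + inactive, k)
for q in picks:
    print(f"position {q}: round {assigned_round(q, A)}")
```

With picks at positions 5, 2 and 9, for instance, the three players would be assigned rounds 2, 1 and 3 respectively; if three players all pick the same position, each of them computes the same round and they collide again there.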
One conservative approach is to simply assume the worst case $m = n$. Alternatively, a preliminary Conditional Counting query (§ 4.2) might be launched to discover the exact value of $m$. The latter approach has also another advantage: with knowledge of $m$, the occurrence of bit-collisions can be easily revealed by comparing it to the number of '1's in the final result, i.e., $|A|_1$. In fact, the difference $m - |A|_1$ equals the number of colliding players. For example, $m - |A|_1 = 2$ implies that only a two-player collision has occurred, and the master can decide to validate the current scheduling round (implicitly deferring the two colliding players to a future scheduling round) or to invalidate it and immediately re-launch a new scheduling round.

6. RELATED WORKS

SMC is a cryptographic framework introduced by Yao [22] and later generalized by Goldreich et al. [13]. SMC techniques have been widely used in the data mining community. For a comprehensive survey, please refer to [1]. Roughan et al. [19] first proposed the use of SMC techniques for a number of applications relating to traffic measurements, including the estimation of global traffic volume and performance measurements [18]. In addition, the authors identified that SMC techniques can be combined with commonly-used traffic analysis methods and tools, such as time-series algorithms [2] and sketch data structures.

However, for many years, SMC-based solutions have mainly been of theoretical interest due to impractical resource requirements. Only recently, generic SMC frameworks optimized for efficient processing of voluminous input data have been developed [4, 8]. Today, it is possible to process hundreds of thousands of elements distributed across dozens of networks within a few minutes, for instance to generate distributed top-k reports [6].
While these results are compelling, they stick to the completely secret evaluation scheme. Our work aims at boosting scalability even further by relaxing the secrecy constraint for intermediate results. As such, our approach can be applied only in cases where the disclosure of intermediate results is not regarded as critical, a quite frequent case in practical applications. Moreover, we aim at optimizing the sharing scheme for fast computation in the online phase.

When it comes to analyzing traffic data across multiple networks, various anonymization techniques have been proposed for obscuring sensitive local information (e.g., [21]). However, these methods are generally not lossless and introduce a delicate privacy-utility tradeoff [17]. Moreover, the capability of anonymization to protect privacy has recently been called into question, both from a technical [7] and a legal perspective [16].

7. CONCLUSIONS

The use of SMC techniques has recently been proposed to overcome the inhibiting privacy concerns associated with inter-domain sharing of network traffic data. However, the cost at which the cryptographic privacy guarantees of SMC are bought is tremendous. Although the design and implementation of basic SMC primitives have recently been optimized, processing time for queries is still in the order of several minutes and involves significant communication overhead.

In this paper, we further boost the performance of privacy-preserving network monitoring by two means. Firstly, we identify that perfect secrecy of intermediate results is not required in many cases. That is, we advocate the use of “elementary” (as opposed to “complete”) secure multiparty computation (E-SMC) procedures for traffic monitoring. E-SMC supports only simple computations with private input and public output, i.e., it cannot handle secret input nor secret (intermediate) output.
Secondly, we separate the computation into an offline and an online phase. Our proposed scheme GCR is based on additive secret sharing and pre-generates random secret shares during the offline phase with only constant storage overhead. In the online phase, GCR allows actual queries to be processed with zero communication overhead. This enables adoption of SMC techniques on massive scales, both in terms of input data volume and number of participants. In the second part, we introduce a number of high-level primitives supported by GCR that cover a wide range of use cases in network monitoring, including the private generation of histograms, set operations, and anonymous publishing.

In future work, we will evaluate GCR on real network setups and study hybrid approaches combining GCR with SSS to provide scalability and functional completeness.

Acknowledgments

This work was supported by the DEMONS project funded by the EU 7th Framework Programme [G.A. no. 257315] (http://fp7-demons.eu).

8. REFERENCES

[1] C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining: Models and Algorithms. Springer Publishing Company, Incorporated, 2008.
[2] M. Atallah, M. Bykova, J. Li, K. Frikken, and M. Topkara. Private collaborative forecasting and benchmarking. In Proc. ACM WPES'04, October 2004.
[3] M. Ben-Or, S. Goldwasser, and A. Wigderson. Completeness theorems for non-cryptographic fault-tolerant distributed computation. In ACM Symposium on Theory of Computing (STOC), 1988.
[4] D. Bogdanov, S. Laur, and J. Willemson. Sharemind: A Framework for Fast Privacy-Preserving Computations. In European Symposium on Research in Computer Security (ESORICS), 2008.
[5] A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2004.
[6] M. Burkhart and X. Dimitropoulos. Fast privacy-preserving top-k queries using secret sharing. In 19th International Conference on Computer Communications and Networks (ICCCN), Zurich, Switzerland, Aug. 2010.
[7] M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner. The role of network trace anonymization under attack. ACM SIGCOMM Computer Communication Review, 40(1):5–11, Jan. 2010.
[8] M. Burkhart, M. Strasser, D. Many, and X. Dimitropoulos. SEPIA: Privacy-Preserving Aggregation of Multi-Domain Network Events and Statistics. In 19th USENIX Security Symposium, Washington, DC, USA, Aug. 2010.
[9] D. Chaum. The dining cryptographers problem: Unconditional sender and recipient untraceability. Journal of Cryptology, 1(1):65–75, 1988.
[10] C. Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation (TAMC), 2008.
[11] R. Gennaro, M. Rabin, and T. Rabin. Simplified VSS and fast-track multiparty computations with applications to threshold cryptography. In 7th Annual ACM Symposium on Principles of Distributed Computing (PODC), 1998.
[12] C. Gentry. Fully homomorphic encryption using ideal lattices. In ACM Symposium on Theory of Computing. ACM, 2009.
[13] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In ACM Symposium on Theory of Computing (STOC), 1987.
[14] L. Kissner and D. Song. Privacy-Preserving Set Operations. In Proc. of CRYPTO'05, 2005.
[15] F. McSherry and R. Mahajan. Differentially-private network trace analysis. In ACM SIGCOMM, 2010.
[16] P. Ohm. Broken promises of privacy: Responding to the surprising failure of anonymization. 57 UCLA Law Review, 2010. Available at http://ssrn.com/abstract=1450006.
[17] R. Pang, M. Allman, V. Paxson, and J. Lee. The devil and packet trace anonymization. Computer Communication Review, 36(1):29–38, 2006.
[18] M. Roughan and Y. Zhang. Privacy-preserving performance measurements. In SIGCOMM Workshop on Mining Network Data (MineNet), 2006.
[19] M. Roughan and Y. Zhang. Secure distributed data-mining and its application to large-scale network measurements. ACM Computer Communication Review, 36(1), 2006.
[20] A. Shamir. How to share a secret. Communications of the ACM, 22(11), 1979.
[21] A. Slagell, K. Lakkaraju, and K. Luo. Flaim: A multi-level anonymization framework for computer and network logs. In 20th USENIX Large Installation System Administration Conference (LISA), 2006.
[22] A. Yao. Protocols for secure computations. In IEEE Symposium on Foundations of Computer Science, 1982.
