PSI (Ψ): a Private data Sharing Interface* (working paper)

Marco Gaboardi†, James Honaker‡, Gary King§, Jack Murtagh¶, Kobbi Nissim‖, Jonathan Ullman**, Salil Vadhan††

with contributions from Nabib Ahmed, Andreea Antuca, Brendan Avent, Jordan Awan, Christian Baehr, Connor Bain, Victor Balcer, Thomas Brawner, Jessica Bu, Mark Bun, Stephen Chong, Fanny Chow, Katie Clayton, Holly Cunningham, Vito D'Orazio, Gian Pietro Farina, Anna Gavrilman, Benjamin Glass, Caper Gooden, Paul Handorff, Raquel Hill, Alyssa Hu, Jason Huang, Justin Kaashoek, Allyson Kaminsky, Chan Kang, Murat Kantarcioglu, Vishesh Karwa, George Kellaris, Michael Lackner, Jack Landry, Hyun Woo Lim, Giovanni Malloy, Michael Lopiccolo, Nathan Manohar, Ross Mawhorter, Dan Muise, Marcelo Novaes, Ana Luisa Oaxaca, Raman Prasad, Sofya Raskhodnikova, Grace Rehaut, Ryan Rogers, Or Sheffet, Adam D. Smith, Thomas Steinke, Kathryn Taylor, Julia Vasile, Clara Wang, Haoqing Wang, Remy Wang, Lancelot Wathieu, David Xiao, Anton Xue, and Joy Zheng

August 7, 2018

Abstract. We provide an overview of the design of PSI ("a Private data Sharing Interface"), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.

* This work is part of the "Privacy Tools for Sharing Research Data" project at Harvard, supported by NSF grant CNS-1237235 and grants from the Sloan Foundation. This is a working paper describing a vision for work that is still in progress, and is therefore authored by the leadership of the efforts. Future and accompanying publications that emphasize specific technical contributions will be authored by team members responsible for those contributions.
† Department of Computer Science and Engineering, University at Buffalo, SUNY. Work done in part while at the University of Dundee, UK, and visiting the Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University. gaboardi@buffalo.edu
‡ james@hona.kr; http://hona.kr
§ Albert J. Weatherhead III University Professor, Harvard University, Institute for Quantitative Social Science. king@harvard.edu; http://GaryKing.org
¶ Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University. jmurtagh@g.harvard.edu
‖ Department of Computer Science, Georgetown University, and Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University. kobbi.nissim@georgetown.edu
** College of Computer and Information Sciences, Northeastern University. Work done in part while affiliated with the Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University. jullman@ccs.neu.edu
†† Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University. Work done in part while visiting the Shing-Tung Yau Center and the Department of Applied Mathematics at National Chiao-Tung University in Taiwan. Also supported by a Simons Investigator Award. salil@seas.harvard.edu
1 The Problem

Researchers in all empirical fields are increasingly expected to widely share the data behind their published research, to enable other researchers to verify, replicate, and extend their work. Indeed, data-sharing is now often mandated by funding agencies [61, 57, 58] and journals [51, 31, 76]. To meet this need, a variety of open data repositories have been developed to make data-sharing easier and more permanent. The index of the Registry of Research Data Repositories surpassed 1500 different repositories in April 2016 [5]. The largest general-purpose repositories include those that use the open-source Dataverse platform [17, 42] (including Harvard's Dataverse repository, which is, by some measures, the largest repository of social science datasets in the world), CERN's Zenodo, and the commercial Figshare [70] and Dryad [78] repositories.

However, many of the datasets in the social and health sciences contain sensitive personal information about human subjects, and it is increasingly recognized that traditional approaches such as stripping "personally identifying information" are ineffective at protecting privacy, especially if done by a lay researcher with no expertise in deidentification. This leads to two problems, one for privacy and one for utility:

1. There are numerous data sets, such as surveys, that have been "deidentified" via traditional means and increasingly are being deposited in publicly accessible data repositories. As the literature has repeatedly shown, it is likely that many subjects in these surveys can be reidentified by attackers with a moderate amount of background information, and thus their privacy may not be sufficiently well protected.

2. There are numerous other data sets that are not available at all, or only under highly restrictive and time-consuming provisions. Such provisions can include a review by the original data depositor (who may no longer be accessible) and/or an Institutional Review Board (IRB), and a lengthy negotiation between institutions on the terms of use.

Thus, an important problem is to develop and deploy methods that can offer greater privacy protections for datasets of the first type, ideally at little or no cost in utility,¹ and enable the safe sharing of datasets of the second type.

Differential privacy [25] offers an attractive approach to addressing this problem. It provides a formal mathematical framework for measuring and enforcing the privacy guarantees provided by statistical computations. The level of privacy protection that differential privacy can offer is described in terms of two privacy loss parameters, ε and δ; the smaller they are, the greater the level of privacy. To achieve a greater level of privacy protection, a differentially private algorithm will generally inject a greater amount of random "noise" into a statistical computation, thereby yielding less "accurate" results. Using differential privacy enables us to provide wide access to statistical information about a dataset without worries of individual-level information being leaked inadvertently or due to an adversarial attack. There is now both a rich theoretical literature on differential privacy and numerous efforts to bring differential privacy closer to practice, including large-scale deployments by Google [29], Apple [33], and the U.S.
Census Bureau [46]. (See Section 5 for more on previous work.) However, none of the past work simultaneously meets all of our desiderata for such a system:

• Accessibility by non-experts: researchers in the social sciences should be able to use the system to share and explore data with no involvement from experts in data privacy, computer science, or statistics.

• Generality: the system should be applicable and effective on a wide variety of heterogeneous datasets, as opposed to being tailored to a particular data source or domain.

• Workflow-compatibility: the system should fit naturally into the workflow of its users (e.g., researchers in the social sciences), and be positioned to offer clear benefits (e.g., more access to sensitive data or less risk of an embarrassing privacy violation) rather than being an impediment.

¹ Even traditional de-identification techniques have been found to have a significant negative impact on utility [19].

2 Our Contribution: PSI

In this paper, we provide an overview of PSI ("a Private data Sharing Interface"), a system we have developed to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy. It is designed to achieve all of the desiderata mentioned in Section 1 (accessibility by non-experts, generality, and workflow-compatibility). Unique features of PSI include:

• None of its users, including the data depositors who have privacy-sensitive datasets they wish to share and the data analysts who seek to analyze those datasets, are expected to have expertise in privacy, computer science, or statistics. Nevertheless, PSI enables them to make informed decisions about the appropriate use of differential privacy, the setting of privacy loss parameters, the partitioning of a privacy budget across different statistics, and the interpretation of the errors introduced for privacy.

• It is designed to be integrated with existing and widely used data repository infrastructures, such as the Dataverse project [17, 42], as part of a broader collection of mechanisms for handling privacy-sensitive data, including an approval process for accessing raw data (e.g., through IRB review), access control, and secure storage. Consequently, PSI can initially be used to increase the accessibility of privacy-sensitive data, augmenting rather than replacing current means for accessing such data, thereby lowering the adoption barrier for differential privacy.

• Its initial set of differentially private algorithms was chosen to include statistics that have wide use in the social sciences, and these are integrated with existing statistical software designed for lay social science researchers, namely the Zelig [15] package in R and the TwoRavens [37] graphical data exploration interface.

We have developed a prototype of the system; integration with Dataverse is ongoing and will be live in the near future. The features of PSI described in this paper are at differing stages of completion. Some have been implemented and thoroughly evaluated in both user tests and replication experiments (see Sections 14 and 15, respectively); some have been implemented but have not undergone user testing, either intentionally to reduce participant fatigue or because they were still in progress at the time of testing; and others are actively in development.
Throughout the text we indicate which aspects of the system are implemented and which are in development, and the sections on user testing and replication experiments describe which pieces of the tool underwent those evaluations.

A preliminary prototype of PSI is available at http://privacytools.seas.harvard.edu/psi. It does not yet incorporate all of the planned features described in this paper, as a number of them are still under development. The purpose of this paper is to describe the design of PSI, and to initiate a discussion about the choices made and possible alternatives.

3 A Motivating Story

Consider a social scientist, Alice, who is studying the relationship of health status to political participation. Since the rules surrounding health insurance have been a recent focus of political debate, she is interested to see if individuals with health problems have become more engaged in the political process. She searches the catalog of the tens of thousands of datasets archived in a Dataverse data repository and locates several that may contain information to test her hypothesis. Some are broad surveys of attitudes and behavior that contain hundreds of questions across many domains,² so likely contain only a couple of questions on voting turnout or a couple of questions on health status. Some are more focused studies that survey patient populations, with richer questions about their medical issues and questions to judge the impacts on their lives and opinions, or longitudinal studies that revisit these participants repeatedly over a long time scale. She might also locate detailed time diary studies, where respondents agree to provide extensive recordings of how they spend their time each day, spaced with periodic surveys, that might even include biological measurements of cortisol and other hormone levels from saliva.

The broad surveys are generally available for public download from the repository; however, even here, geographic variables such as the state of residence are only available in a special version that is closed to the public.³ The focused studies may or may not be publicly available for download, depending on the original data depositor's wishes, agreements with the depositor's Institutional Review Board (IRB), and whether the data touch on vulnerable protected populations (such as children, felons, minorities, the disabled) or are subject to federal regulations (such as health care and student records). The time diary studies, because of their rich description of individuals and potential for reidentification, will almost certainly be stored but unavailable to the researcher. In summary, as the datasets become richer, and more likely to directly test her hypotheses, they are more likely to be unavailable for download.

So our researcher can use the easily available data and test the broadest implications of her theory with somewhat crude proxy variables for the items of interest. Or she can apply to gain access to sensitive data that would allow her to test all of her hypotheses directly and in nuance. Each application delays the research project, is costly in terms of researcher time and IRB resources, and commonly requires participation or approval by the original data depositor and relevant IRBs, all of which in turn may need to be facilitated by communication through the repository's staff curators.
Even if a researcher is committed to applying for access to private data, the available descriptions of the data are often insufficient to judge which of the potential closed datasets contain the best or most relevant data for the researcher's purposes. Many of these IRB applications may turn out to be costly lost efforts once access is finally granted. For example, there may be an insufficient number of data points of the type required, or the variables may not be measured in the manner expected, or the time period may be wrong.

² For example, the General Social Survey, National Election Study, or Cooperative Congressional Election Study.
³ For example, to get the state of residence of respondents, the General Social Survey requires a signed contract with the researcher, a thorough description of the research to be conducted on the data, a fee of $750, approval from the researcher's IRB, and construction of a data protection plan that generally requires non-networked computers set up in a room with secured, limited access [8].

[Figure 1: Overview of a data analysis repository using PSI. A data analyst reaches public data through open access, private data through a regulated-access authorization process (e.g., IRB), and, with PSI, private data through data exploration tools.]

Imagine instead that the repository enabled Alice to run exploratory statistical procedures on the datasets in Dataverse, so as to learn which datasets would be useful to her research, but only returning statistical answers that are differentially private, so as to uphold the repository's ethical and legal responsibilities to safeguard sensitive data. She could investigate all the sensitive datasets that are closed for download, and use the noisy statistical answers to learn which datasets have information useful for her research. Our PSI system is built to allow such immediate exploratory access to these closed files and to reduce the wasted effort of researchers, data depositors, repositories, and IRBs that comes from applications for access to datasets that eventually prove not to be useful to the applicant.

The situation we have described above is schematically represented in Figure 1. When a data analyst has access to a data repository infrastructure like Dataverse, they currently have only two options: either using the publicly available data, which are offered with open access, or going through an authorization process that may be lengthy and costly. With PSI there will be a valuable alternative: accessing the sensitive data for data exploration. To enhance this opportunity, PSI is designed to be naturally integrated with both the data repository infrastructure and data exploration tools.

4 Differential Privacy

Differential privacy is a formal mathematical framework for measuring the privacy guarantees provided by statistical computations. Consider an algorithm M that takes a dataset x as input and performs a randomized computation to produce an output y. Informally, differential privacy requires that if we change any one individual's data in x, then the distribution of y does not change much. Intuitively, this means that each individual's data is hidden from an adversary that views the output y. To make this intuition precise, we need to define what we mean by "one individual's data," and provide a measure of how much the distribution of y is allowed to change.
For the former, a typical choice is to consider datasets x that consist of n records, where we think of each record as containing one individual's data, and the sample size n is public (not sensitive information). We call two datasets x and x′ neighbors if they agree in all but one record (i.e., x′ is obtained from x by changing one individual's data). Then the formal definition of differential privacy is as follows:

Definition 4.1 (Differential Privacy [25, 24]). For parameters ε ≥ 0 and δ ∈ [0, 1], a randomized algorithm M is (ε, δ)-differentially private if for every two neighboring datasets x, x′ and every set S of outputs,

    Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x′) ∈ S] + δ,

where the probabilities are taken over the randomization of the algorithm M.

The level of privacy protection is governed by the two privacy loss parameters ε and δ; the smaller they are, the closer the distributions of M(x) and M(x′) are, and hence the greater the level of privacy. Typically, ε is taken to be a small constant such as 0.1, whereas δ is taken to be very small, like 2⁻³⁰.

Differentially private algorithms for statistical analysis are often designed by carefully introducing a small amount of random noise into non-private algorithms for the same analyses. The more noise that is introduced, the greater the level of privacy protection (i.e., a smaller ε and/or δ); conversely, less noise produces a more accurate and useful analysis. Thus differentially private algorithms offer a privacy-utility tradeoff.

In order to limit the amount of noise required, most differentially private algorithms for statistical analysis require the number of data points and the explicit ranges of the different data variables. A simple example is the mean of a numeric variable: the amount of noise needed to guarantee differential privacy for it is proportional to the variable's range divided by the number of data points.

By now, there is a large literature giving differentially private algorithms for a wide variety of data analysis tasks. Often, these algorithms are accompanied by a theoretical analysis showing that their performance converges to that of the non-private algorithm as the sample size n tends to infinity. However, such asymptotic performance guarantees do not necessarily translate to good performance at a specific finite sample size, and thus a great deal of work remains to be done to engineer differentially private algorithms to be useful in practice.

In addition, one typically does not want to run just one analysis on a dataset, but rather a large collection of analyses. Fortunately, differential privacy satisfies a variety of composition theorems showing that the privacy protection degrades gracefully when we run many differentially private algorithms. For example:

Theorem 4.2 (Basic Composition [25, 24]). Let M_1, ..., M_k be randomized algorithms where M_i is (ε_i, δ_i)-differentially private for i = 1, ..., k. Then the algorithm M(x) = (M_1(x), ..., M_k(x)) that runs each of the M_i's using independent coin tosses is (Σ_i ε_i, Σ_i δ_i)-differentially private.
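To make these two building blocks concrete, here is a minimal R sketch (R being the language in which PSI's routines are implemented, though this is our illustration rather than PSI's code) of a Laplace-mechanism mean, whose noise scale is the variable's range divided by n·ε, composed across several releases by evenly splitting a global ε as in Theorem 4.2. The function names and the uniform split are illustrative assumptions.

```r
# Sketch: an (epsilon, 0)-DP mean via the Laplace mechanism. Values are
# clipped to a publicly declared range [lower, upper], so the sensitivity
# of the mean is (upper - lower) / n.
rlaplace <- function(k, scale) {
  # Laplace(0, scale) as the difference of two independent exponentials.
  rexp(k, rate = 1 / scale) - rexp(k, rate = 1 / scale)
}

dp_mean <- function(x, lower, upper, epsilon) {
  x <- pmin(pmax(x, lower), upper)        # enforce the stated range
  sensitivity <- (upper - lower) / length(x)
  mean(x) + rlaplace(1, scale = sensitivity / epsilon)
}

# Basic composition (Theorem 4.2): five releases at epsilon_g / 5 each
# are jointly (epsilon_g, 0)-differentially private.
epsilon_g <- 0.5
x <- rnorm(1000, mean = 10, sd = 2)
replicate(5, dp_mean(x, lower = 0, upper = 20, epsilon = epsilon_g / 5))
```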
If we want to achieve a global, overall level of privacy protection given by (ε_g, δ_g), we can think of the pair as a "privacy budget" to be spent on the different analyses M_i we want to run. We can spend more of this budget on a specific analysis M_i (i.e., take ε_i, δ_i larger), but this will consume more of our budget, leaving less for the other analyses if we want to ensure that Σ_i ε_i ≤ ε_g and Σ_i δ_i ≤ δ_g. There are better bounds on the composition of differentially private algorithms than the simple summing bound given above [26, 38, 56], but they still have the same budget-like effect: a larger (ε_i, δ_i) (i.e., higher accuracy, lower privacy) for one computation requires reducing the ε and δ values of other computations in order to maintain the same overall level of privacy.

5 Previous work

Most of the previous work to bring differential privacy to practice can be partitioned into the following categories:

• Programming languages and systems: here the goal is to make it easier for users to write programs that are guaranteed to be differentially private, either by composition of differentially private building blocks [50, 64, 34], by using general frameworks such as "partition-and-aggregate" or "subsample-and-aggregate" [60] to convert non-private programs into differentially private ones [67, 54], or by formal verification from scratch [10]. On one hand, these methods provide much more generality than we seek: our target users are not programmers, and it will already be very useful to provide them with a small, fixed collection of differentially private versions of statistical computations that are common in the social sciences. On the other hand, most of these tools do not provide much guidance for a lay user in deciding how to partition a limited privacy budget among the many statistics or analyses he or she may want to run, or in how to interpret the noisy results given by a differentially private algorithm.

In contrast to the other tools mentioned above, GUPT [54] does enable a user to specify fine-grained accuracy goals and automatically converts these into privacy budget allocations, in a similar spirit to our privacy budgeting tool (described later). However, GUPT is limited to differentially private programs obtained via the subsample-and-aggregate framework, whereas our tool has no such restriction and can be extended to include arbitrary differentially private algorithms. Moreover, our tool allows the privacy budget allocation to be interactively adjusted by users, and supports optimal composition theorems for differential privacy [56].

• Optimization for specific data releases: there have been several successful applications of differential privacy to very specific and structured sources of data like commuter patterns [46], mobility data [52], client-side software data [30], and genome-wide association studies [16]. Here differential privacy experts carefully optimize the choice of differentially private algorithms and the partitioning of the privacy budget to maximize utility for the particular data source. In the context of a broad data repository in the social or health sciences, the collection of data sources and the structure of the datasets are too heterogeneous to allow for such optimization. And it is not scalable to have a differential privacy expert manually involved in each instance of data sharing.
• Optimization and evaluation of specific algorithms: there is a vast literature on the design of differentially private algorithms for specific data analysis tasks, including substantial experimental work on comparing and optimizing such algorithms across a wide range of datasets. As an example, the recent work on DPBench [35] provides a thorough comparison of different algorithms and different ways of optimizing them. Such work is complementary to ours. Algorithms that perform well in such evaluations are natural candidates to add to our library of differentially private routines, but such evaluation does not address how to budget the privacy allocated to one algorithm against the many other analyses one might want to run on the same dataset, or, more generally, how to enable lay users to make appropriate use of differential privacy. Moreover, our use case of a general-purpose social science data repository guides the choices of which algorithms to implement, the measures of accuracy, and the methods for evaluation, as discussed in the later sections.

There are also a number of deployed systems that provide query access to sensitive data using heuristic approaches to protect privacy. These include systems for querying clinical health data [45, 48], education data [4], genomic data [74], and Census data [1]. However, the lack of rigorous privacy guarantees raises a genuine risk, as illustrated by attacks on the Israeli Census query system [80], on genomic data [36, 28], and more generally on releases of aggregate statistics [20, 27]. (Some of the aforementioned systems address this concern by limiting access to a more trusted set of users.)

6 Incentives for use

Differential privacy has sometimes been critiqued for its cost in utility (coming from the noise introduced into statistics), so one might wonder what would motivate researchers to use it in place of the current data-sharing ecosystem. We see at least three different scenarios in which differential privacy can provide a clear benefit over current approaches.

• ("DP works great") In some circumstances, the results of differentially private analyses are virtually indistinguishable from non-private analyses. Currently, this tends to be the case when the number n of samples is large, the data is low-dimensional, and the analyses to be performed are relatively simple and few in number. In such cases, the greater privacy protections of differential privacy come essentially for free. As both theoretical and applied work on differential privacy advances and data gets "bigger" (n gets larger), we can expect an increasingly large set of data-sharing circumstances to fall into this scenario.

• ("Access is wide") When we wish to make sensitive data available to an extremely wide community (for example, when allowing public access), we should be increasingly concerned about attacks from individuals with malicious intent. Such adversaries can include ones who have extensive knowledge about a particular data subject that can be exploited as background information. Thus, the strong protections of differential privacy, which remain meaningful regardless of an adversary's background information, are attractive.
• ("Data is currently unavailable") For data that is currently unavailable except possibly through restrictive and time-consuming provisions, any useful statistical information that differential privacy can offer is a benefit to utility, even if it does not fall into the "DP works great" category. In particular, DP can offer the possibility of rough, exploratory analysis to determine whether a dataset is of sufficient interest to go through the process of applying for access to the raw data.

The architecture of PSI is designed to support all three of these scenarios. In the near term, we expect the third scenario, namely enabling exploratory analysis of data that is currently unavailable, to be the one where PSI is most frequently used. In this scenario, PSI can provide a clear utility benefit, can be applied with the modest sample sizes that are common in social science, and does not require an extensive library of highly optimized and sophisticated differentially private algorithms. However, PSI is extensible to incorporate such a library in the future, and we hope that eventually it will be used more often in the other two scenarios as well, providing high-utility and privacy-protective access to data that is currently shared in a less safe manner [18].

In the future, another potential incentive for the use of a differentially private data analysis system like PSI is the automatic protection that differential privacy provides against false discovery, allowing analysts to perform adaptive data exploration (without "preregistration") and still have confidence that the conclusions they draw are statistically valid [23, 11].

We note that sometimes researchers do not wish to share their data and are only using privacy as an excuse. A system like PSI can help eliminate the excuse. Still, other external incentives may be needed (such as from the research community, funding agencies, or journals) to encourage sharing of data.

On exploratory analysis

Since it is our initial goal for the use of PSI, we elaborate on what we mean by supporting "exploratory data analysis." This term generally refers to a wide-ranging set of techniques to empirically learn features of data by inspection, to familiarize oneself with the nature of the data, or to discover apparent structure in the data [75]. It is inspection and discovery not driven by theory or modeling. In our setting of a social science data repository, we envision at least two uses for exploratory analysis. For lay users (e.g., members of the general public), exploratory analysis can be a way to satisfy curiosity and discover interesting facts in situations where a statistically rigorous analysis may not be necessary (e.g., for a high-school project). For a social science researcher, the goal of exploratory analysis can be to determine which of the many datasets in the repository are of most interest, so that the researchers only invest their time and effort in applying for raw access to those datasets. Any final analyses they wish to perform and publish could then be done on the raw data, not through the differentially private interface. This more modest near-term goal for PSI compensates for the fact that we cannot perform the kinds of optimizations that might be done if we had a differential privacy expert involved in each instance of data sharing.
7 Actors and Workflow

We have three different kinds of actors in PSI: data depositors, data curators, and data analysts. Each of them has a different role and different requirements. We summarize them, their interactions, and the threat model we consider, in terms of the trust placed in the different actors, in Table 1. We now detail the role and the expected expertise of each.

Actors                          Level of trust   DP expertise
data curators                   trusted          modest
data depositors                 trusted          none
data analysts (restricted)      semi-trusted     none
data analysts (general public)  untrusted        none

Table 1: Actors and their level of trust and required expertise.

Data depositors. These are users who come to deposit their privacy-sensitive dataset in a data repository, and may wish to make differentially private access to their dataset available. By interacting with the system, the data depositor supplies basic information about the dataset (e.g., the types and ranges of the variables), sets the overall privacy loss parameters, selects an initial set of differentially private statistics to calculate and release, and determines how the remaining privacy budget will be partitioned among future data analysts. Data depositors are the ones with the initial ethical and/or legal responsibility for protecting the privacy of their data subjects, and they (or their institutions) may be liable if they willfully violate their obligations. Thus, they can be trusted to follow instructions (if not onerous or confusing) and to answer questions truthfully to the best of their knowledge. On the other hand, they cannot be assumed to have expertise in differential privacy, computer science, or statistics, so any questions that involve these areas are explained carefully in the system.

Data curators. These are the data-repository managers who maintain the hardware and software on which PSI runs and the accompanying data repository infrastructure (e.g., Dataverse) and associated statistical tools (e.g., Zelig and TwoRavens). They are trusted, and indeed may also have legal obligations to ensure that the repository does not violate the privacy protections it claims to offer through tools such as PSI. Data curators can be assumed to have expertise in IT systems administration, data stewardship [32], and archiving [6], and can be trained to have at least a modest background in statistics and differential privacy. But they are few in number, and cannot be actively involved in most instances of data sharing or data exploration. Thus PSI is designed to be sufficiently automated to enable data depositors and data analysts to safely use it on their own. Data curators would also be responsible for deciding whether to accept new differentially private routines into the library used by PSI and for correcting bugs or security flaws found in existing routines. These can be difficult tasks even for experts in differential privacy. Thus, in a future version of the system, it would be of interest to minimize the amount of trusted code, and to have tools to formally verify the remaining components (both original components and later contributions), along the lines of the programming languages tools described in Section 5.

Data analysts. These are users who come to access sensitive datasets in the repository, often with the goal of data exploration as discussed in Section 6.
They will have access to all of the differentially private statistics selected by the data depositor, as well as the ability to make their own differentially private queries (subject to staying within the overall privacy budget, as discussed more below). We envision at least two tiers of trust for data analysts once the system has active users. PSI can make access available to a very wide community of analysts (e.g., the general public), in which case the analysts are considered completely untrusted. Alternatively (or additionally), we can restrict to a set of analysts who are identifiable (e.g., as registered users of the data repository), with some accountability (e.g., through their verified affiliation with a home institution). Such analysts may be considered semi-trusted, as we can assume that they will follow basic terms of use not to abuse the system in certain ways. Specifically, we will assume that semi-trusted users will not collude to compromise privacy, and will not create phony accounts. (This will enable us to provide greater utility for such users, as discussed in Section 9.)

8 Pedagogical Materials

In order to enable PSI to be used by empirical researchers without expertise in privacy, computer science, or statistics, we have prepared pedagogical materials explaining differential privacy in an intuitive but accurate manner, with a minimum of technical terminology and notation. These materials are meant to be sufficient for data depositors and data analysts to understand and make appropriate choices in using PSI, such as those described in the forthcoming sections. Data depositors require more background material than data analysts, as the former are concerned with the privacy protections afforded to their data subjects, whereas the latter only need to understand the impact of the system on their analyses (namely, that results will be less accurate or statistically significant than would be obtained on the raw data, and that there is a limited "budget" of queries that they can perform).

Relevant extracts of the pedagogical materials are offered in PSI at each decision point, and can also be included when describing data-sharing plans to Institutional Review Boards (IRBs). In addition, members of our team have started to develop rigorous arguments showing that differential privacy should be deemed to satisfy certain legal obligations of privacy protection, which can also be used to reassure data depositors, data curators, and IRBs that differential privacy is a sufficiently strong form of protection. For example, the combined legal and technical analysis in [59] provides an argument that, when applied to educational data, differentially private computations satisfy the requirements of the Family Educational Rights and Privacy Act of 1974 [2].

As discussed in Section 7, we assume that data curators have expertise in IT systems administration and data stewardship, and at least a modest background in statistics and differential privacy. Thus, they do not need any specialized pedagogical materials other than a thorough documentation of the system.
9 Privacy Budget Management

One of the challenges in enabling non-experts to use differential privacy is that it can be difficult to understand the implications of different selections of the privacy loss parameters (namely ε and δ), both in terms of privacy and utility, especially when these need to be distributed over many different statistics to be computed. To address this issue, PSI is designed to expose these implications to the user in easy-to-understand terms, and is accompanied by a variety of simple explanations of differential privacy and its parameters that are shown to the user at relevant times. We have developed a privacy budgeting tool that guides users through judicious choices of global privacy loss parameters, lets users select statistics to release, automatically distributes the privacy budget across the chosen statistics, and exposes the resulting privacy-accuracy tradeoffs (see Figure 2).

[Figure 2: PSI budgeting interface. The left panel shows variables in the dataset; the middle panel elicits necessary metadata from depositors; the right panel displays selections with a priori error estimates, tailored to each statistic. For example, the error on the race histogram indicates the released count for each bin will differ from the true count by at most 19.079 people with probability .95. The upper right corner displays the privacy loss parameters and the functional boosts from secrecy of the sample.]

Global privacy loss parameters: The data depositor, who carries the initial responsibility for protecting the privacy of her data subjects, is charged with setting the overall ("global") privacy loss parameters ε_g, δ_g for her dataset (seen at the top right of Figure 2). To enable this choice, we provide intuitive (but accurate!) explanations of the meaning of each of these privacy loss parameters, and give recommended settings based on the level of sensitivity of a dataset (e.g., corresponding to an institution's established research data security levels, such as [3], or the similar categories in the DataTags system that integrates with PSI [72]). δ_g is easily explained as the probability of arbitrary leakage of information, like the probability of an adversary breaking an encryption scheme, and thus should be set to be extremely small, like 2⁻³⁰. The main privacy loss parameter, ε_g, we explain with a table comparing an adversary's posterior belief that a data subject has a sensitive trait to the posterior belief had the subject opted out of the study. PSI also confirms with the data depositor that each individual subject's data corresponds to one row of the uploaded dataset (so that the per-row protections of differential privacy translate to per-subject protections).

Secrecy of the sample: The data depositor is asked whether the dataset is a random sample from a larger population, and whether the choice of this sample has been kept confidential. If so, a useful lemma in differential privacy known as "secrecy of the sample"⁴ allows for an effective savings in the privacy loss parameters corresponding to the ratio of sizes between the dataset and the larger population. This means that correspondingly greater utility can be provided for the same level of privacy protection. (To account for the fact that, in practice, population samples are typically not perfectly random, the depositor is instructed to conservatively estimate the overall population size.)

Lemma 9.1 (Secrecy of the sample [40, 71]). Let M be an (ε, δ)-differentially private algorithm for datasets of size n. Let M′ be a randomized algorithm that takes as input a dataset D of size m ≥ n, and then runs M on a dataset D′ obtained by selecting a uniformly random subset of D's records of size n. Then M′ is ((e^ε − 1) · (n/m), δ · (n/m))-differentially private.

⁴ https://adamdsmith.wordpress.com/2009/09/02/sample-secrecy/

In the application of this lemma in PSI, D′ represents a dataset that is being deposited in the repository, D represents a larger population from which D′ was (randomly) drawn, and M represents the differentially private statistics computed by PSI on D′. Note that in typical applications of differential privacy, ε is a small constant, and therefore (e^ε − 1) · n/m ≈ ε · n/m. In concrete applications, especially in the social sciences, this lemma permits large savings in the privacy budget. For this reason, we integrate this property into the budgeting interface (see Figure 2).
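As a concrete illustration of the savings, here is a minimal R sketch of the parameter translation in Lemma 9.1; the function name and the example numbers are ours, not PSI's interface.

```r
# Sketch of the "secrecy of the sample" savings (Lemma 9.1): an
# (epsilon, delta)-DP algorithm run on a confidential uniform sample of
# n records from a population of m is
# ((e^epsilon - 1) * n/m, delta * n/m)-DP with respect to the population.
subsample_savings <- function(epsilon, delta, n, m) {
  stopifnot(m >= n)
  ratio <- n / m
  c(epsilon = (exp(epsilon) - 1) * ratio, delta = delta * ratio)
}

# A survey of 2,000 respondents drawn confidentially from a population of
# 200,000: epsilon = 1 on the sample costs only about 0.017 at the
# population level, leaving far more room in the same global budget.
subsample_savings(epsilon = 1, delta = 2^-30, n = 2000, m = 200000)
```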
Budgeting among different statistics: Once the global privacy loss parameters have been determined, users can select variables from their dataset (from the left-hand panel of the budgeting interface, Figure 2) and choose statistics to release about those variables from PSI's library of differentially private algorithms. At this stage, there is still the challenge of how the global privacy loss parameters should be distributed among the different statistics to be computed. That is, for each statistic to be computed, we need to select privacy loss parameters (i.e., set ε_i and δ_i for statistic i) and then apply composition theorems to ensure that, globally, we achieve (ε_g, δ_g) differential privacy.

This leaves the question of how a user should select the individual privacy loss parameters ε_i (and δ_i). The larger the value of ε_i is taken, the more utility we obtain from the i'th statistic, but this leaves less of the global privacy budget remaining for the other statistics. Since some statistics a user is computing may be more important than others, and different differentially private algorithms have different privacy-utility tradeoffs, the "best" use of the privacy budget is likely to involve a non-uniform distribution of the ε_i's. To enable users to determine this partition without requiring that they be privacy experts, PSI automatically assigns initial privacy loss parameters to each chosen statistic. Similarly to GUPT [54], PSI then exposes the privacy-accuracy tradeoffs to the user (see the summary table in the right-hand panel of Figure 2). Rather than adjusting the individual privacy loss parameters ε_i, the user can instead modify the "accuracy" that will be obtained for different selected statistics (presented as, for example, the size of 95% confidence intervals; see further discussion in the next section). For each differentially private algorithm in PSI, there are accompanying functions that translate between the privacy loss parameters and a measure of accuracy (also depending on other metadata, such as the ranges of the variables involved and the dataset size n). These functions are used by the privacy budgeting tool to translate the accuracy bounds into individual privacy loss parameters and to ensure that the global privacy loss parameters are not exceeded.
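For the Laplace-mechanism mean from Section 4, this translation has a simple closed form, sketched below in R; the function names, and presenting accuracy as a 95% a priori error bound, are our illustrative choices rather than PSI's exact internal API.

```r
# A priori accuracy <-> epsilon translation for a Laplace-mechanism mean
# of a variable with public range R on n records. With noise scale
# b = R / (n * epsilon), the release is within a = b * log(1/alpha) of
# the true sample mean with probability 1 - alpha.
accuracy_from_epsilon <- function(epsilon, R, n, alpha = 0.05) {
  (R / (n * epsilon)) * log(1 / alpha)
}

epsilon_from_accuracy <- function(accuracy, R, n, alpha = 0.05) {
  (R / (n * accuracy)) * log(1 / alpha)
}

# Round trip: the error bound implied by epsilon = 0.1 maps back to 0.1,
# so the budgeting tool can let users edit either quantity.
a <- accuracy_from_epsilon(0.1, R = 20, n = 1000)
epsilon_from_accuracy(a, R = 20, n = 1000)   # 0.1
```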
Optimal composition: To ensure that we get the most utility out of the global privacy budget, we use the Optimal Composition Theorem [56], which in fact was developed for the purpose of our privacy budgeting tool. It characterizes the optimal value of the global privacy budget ε_g (for each possible δ_g ∈ [0, 1)) when composing k algorithms that are (ε_i, δ_i)-DP.

Theorem 9.2 (Optimal Composition Theorem [56]). Let M_1, ..., M_k be randomized algorithms where M_i is (ε_i, δ_i)-differentially private for i = 1, ..., k, and let δ_g ∈ [0, 1). Then the algorithm M(x) = (M_1(x), ..., M_k(x)) that runs each of the M_i's using independent coin tosses is (ε_g, δ_g)-differentially private for the least value of ε_g satisfying the following inequality:

    (1 / ∏_{i=1}^{k} (1 + e^{ε_i})) · Σ_{S ⊆ {1,...,k}} max{ e^{Σ_{i∈S} ε_i} − e^{ε_g} · e^{Σ_{i∉S} ε_i}, 0 } ≤ 1 − (1 − δ_g) / ∏_{i=1}^{k} (1 − δ_i)

While the Basic Composition Theorem gives an upper bound on the degradation of privacy under composition, the above theorem is optimal in the sense that for every set of privacy loss parameters (ε_i, δ_i) for i ∈ {1, ..., k} and every δ_g, there exists a set of algorithms M_1, ..., M_k that are (ε_i, δ_i)-differentially private, respectively, whose composition achieves (ε_g, δ_g) differential privacy exactly. For even moderate values of k, the Optimal Composition Theorem can provide substantial savings in the privacy budget over the other composition theorems in differential privacy. In an effort to maximize utility for users, the budgeting interface uses an implementation of Theorem 9.2 to apportion a global epsilon value across several statistics. Since the Optimal Composition Theorem is infeasible to compute exactly in general, we use an efficient approximation algorithm that still outperforms the alternative composition theorems [56].
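For intuition, the inequality of Theorem 9.2 can be evaluated directly when k is small, as in the following R sketch; this brute-force 2^k enumeration is only illustrative, since PSI uses the efficient approximation from [56].

```r
# Brute-force use of Theorem 9.2 for small k: find the least eps_g such
# that composing mechanisms with parameters (eps[i], delta[i]) is
# (eps_g, delta_g)-DP. The left-hand side is nonincreasing in eps_g, so a
# binary search over [0, sum(eps)] suffices.
optimal_eps_g <- function(eps, delta, delta_g, tol = 1e-6) {
  k <- length(eps)
  rhs <- 1 - (1 - delta_g) / prod(1 - delta)
  lhs <- function(eps_g) {
    total <- 0
    for (bits in 0:(2^k - 1)) {    # enumerate all subsets S of {1,...,k}
      in_S <- as.logical(bitwAnd(bits, 2^(0:(k - 1))))
      total <- total +
        max(exp(sum(eps[in_S])) - exp(eps_g) * exp(sum(eps[!in_S])), 0)
    }
    total / prod(1 + exp(eps))
  }
  lo <- 0; hi <- sum(eps)
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    if (lhs(mid) <= rhs) hi <- mid else lo <- mid
  }
  hi
}

# Ten mechanisms at epsilon = 0.1 each: basic composition gives eps_g = 1,
# while the optimal bound certifies a smaller value for this delta_g.
optimal_eps_g(rep(0.1, 10), rep(0, 10), delta_g = 1e-3)
```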
Budgeting among different actors: Recall that the selection of differentially private statistics to be computed is done both by the data depositor, who selects an initial set of statistics that will be available to all who access the dataset, and by individual data analysts, who may be carrying out novel explorations of their own conception. The privacy budgeting tool described above is designed to support both types of actors (with slightly different settings for each, to reflect their different roles and levels of trustworthiness). The data depositor is tasked with deciding how much of the global privacy budget ε_g to reserve for future data analysts. For example, if the data depositor uses up ε_d units of privacy for the statistics she chooses to release, then at least ε_a = ε_g − ε_d units of privacy will be left for the future analysts. (ε_a might actually be larger, since composition theorems for differential privacy can in some cases give better bounds than simply summing the privacy loss parameters.) In a future version of PSI, different tiers of access will be defined for data analysts. In the case of semi-trusted data analysts (who we assume will not collude, as discussed in Section 7), PSI will provide each analyst a per-user privacy budget of ε_a. In the case of completely untrusted analysts, we will share ε_a among all future analysts. The incorporation of these tiers of access into the system is under development.

The latter model is more conservative with respect to privacy protection, and thus may be appropriate when analysts do not have sufficient accountability or the data is highly sensitive (e.g., with life-or-death or criminal implications). The downside of the more conservative model is that it is vulnerable to a denial-of-service attack, where the first few data analysts, intentionally or inadvertently, deplete the entire privacy budget, leaving future analysts unable to make any queries. This can be partly mitigated by rate-limiting the use of the privacy budget and by sharing all computed statistics publicly. It is also possible to reserve part of the privacy budget for untrusted analysts and part for trusted analysts, with each part being treated as described above.

Budgeting for interactive and adaptive queries: An additional subtlety in privacy budgeting comes from the fact that data analysts may choose their privacy loss parameters (ε_i, δ_i) adaptively, depending on the results of previous queries. In such a case, it is natural to try to use composition theorems as privacy filters [66]: for example, the k'th query would be allowed only if its privacy loss parameters (ε_k, δ_k) do not cause the inequality of the Optimal Composition Theorem (Thm. 9.2) to be violated. Unfortunately, as shown in [66], this strategy does not in general yield (ε_g, δ_g) differential privacy overall. However, more restrictive bounds, such as Basic Composition (Thm. 4.2), do yield valid privacy filters. Consequently, for interactive queries from data analysts in PSI, the Optimal Composition Theorem (and its approximations) is only used within (non-adaptive) batches of queries; to compose across different batches, we use Basic Composition.
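A privacy filter of this kind is simple to state in code. The following R sketch (our illustration, with hypothetical names) admits a new batch only if basic composition keeps the running totals within the analyst's budget; within a batch, the tighter optimal bound could still be applied.

```r
# Sketch of a basic-composition privacy filter for adaptive batches of
# queries: a batch is admitted only if the running sums of epsilon and
# delta stay within the analyst's budget (eps_a, delta_a).
make_privacy_filter <- function(eps_a, delta_a) {
  spent_eps <- 0
  spent_delta <- 0
  function(batch_eps, batch_delta) {
    if (spent_eps + batch_eps > eps_a || spent_delta + batch_delta > delta_a)
      return(FALSE)                        # reject: budget would be exceeded
    spent_eps <<- spent_eps + batch_eps    # record the spend
    spent_delta <<- spent_delta + batch_delta
    TRUE
  }
}

filter <- make_privacy_filter(eps_a = 0.5, delta_a = 2^-20)
filter(0.3, 2^-30)   # TRUE: first batch admitted
filter(0.3, 2^-30)   # FALSE: 0.6 total would exceed eps_a = 0.5
```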
10 Differentially Private Algorithms

Choice of Statistical Procedures: While PSI is designed to be easily extensible so as to incorporate new algorithms from the rapidly expanding literature, the initial set of differentially private algorithms in PSI was chosen to support the statistics most needed to provide immediate utility for social science research and data exploration. Specifically, we include:

• Univariate descriptive statistics, such as means, quantiles, histograms, and approximate cumulative distribution functions. From some of these, post-processing can also provide additional descriptive statistics at no additional privacy cost.

• Basic statistical estimators, for inference about the population from which a dataset was sampled. We have selected some of the most widely used statistical inference procedures in social science, such as matching algorithms and difference-of-means tests for causal inference, and low-dimensional linear, logit, probit, and Poisson regression.

• Per-row transformations for creating new features (variables) out of combinations of already existing ones. These allow the previously described procedures to be leveraged for more sophisticated computations on a broader range of questions.⁵

We have chosen to initially implement differentially private versions of statistical methods that are widely used in social science⁶ and for which the differentially private algorithms give good performance at the sample sizes we found in social science research. For our evaluation, which we will report in Section 15, we examined 80 such datasets from published works that used methods available in our differentially private library. These choices are also motivated in part by the data exploration tools that PSI will integrate with, and which we expect our data analysts to use. In particular, the TwoRavens graphical data exploration tool (http://2ra.vn) provides descriptive statistics for each variable in a dataset, as well as graphical illustrations of its empirical distribution (e.g., a histogram or a probability density function) [22]. PSI replaces these with the differentially private descriptive statistics it computes.

⁵ For example, the (empirical) covariance between two attributes can be estimated by estimating the mean of a new attribute that is the product of the two original attributes (as well as the means of the original attributes), or the mean of a variable in a subpopulation can be computed from the mean of the product of that variable with a binary indicator for the subpopulation of interest, and the mean of the indicator.
⁶ For example, Krueger and Lewis-Beck [44], in a survey of all 1796 quantitative articles published between 1990 and 2005 in three leading political science journals, find that 30% use simple linear regression, and a further 25% use either logit or probit regression. Similar studies show regression is used in 18% of published articles in psychology, 25% in education research [73], and over 40% of articles in the New England Journal of Medicine [68], while in public health, 20% use regression and 43% use logit or probit models [39].

Per-row transformations allow for building more sophisticated analyses. Indeed, these transformations and univariate means are sufficient to express all the statistical queries in the sense of [41]. In order to allow only transformations that are safe, PSI allows only transformations that are per-row and that come from a restricted domain-specific language. This language also allows for either specifying or automatically inferring ranges for transformed variables from those of the original variables, so that these can still be enforced for the transformed variables. A more comprehensive discussion of the variable transformations is in Section 13.
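To illustrate the transformation idea from footnote 5, here is a minimal R sketch (our construction, reusing the dp_mean sketch from Section 4) that estimates an empirical covariance from DP means of x, y, and the per-row product x*y, with the product's range inferred from the original ranges; the even three-way split of ε is an illustrative choice.

```r
# Helpers as in the Section 4 sketch.
rlaplace <- function(k, s) rexp(k, rate = 1 / s) - rexp(k, rate = 1 / s)
dp_mean <- function(x, lo, hi, eps) {
  x <- pmin(pmax(x, lo), hi)
  mean(x) + rlaplace(1, (hi - lo) / (length(x) * eps))
}

# Empirical covariance via per-row transformation: cov = E[xy] - E[x]E[y],
# post-processed from three DP means. The range of the new feature x*y is
# inferred from the ranges of x and y by interval arithmetic.
dp_covariance <- function(x, y, xr, yr, epsilon) {
  pr <- range(c(xr[1] * yr, xr[2] * yr))   # inferred range of x * y
  e <- epsilon / 3                          # basic composition across 3 means
  dp_mean(x * y, pr[1], pr[2], e) -
    dp_mean(x, xr[1], xr[2], e) * dp_mean(y, yr[1], yr[2], e)
}

x <- runif(5000, 0, 10)
y <- 2 * x + rnorm(5000)
dp_covariance(x, y, xr = c(0, 10), yr = c(-5, 25), epsilon = 0.3)
```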
Measuring Accuracy: The choice of accuracy measure, and how to represent it to users, is important both in the privacy budgeting tool and for data exploration by data analysts, who need to know how to interpret the noisy statistics provided by differential privacy. For descriptive statistics, we have determined that 95% confidence intervals are the simplest and most intuitive way to represent the noise introduced by differential privacy. For many of the basic differentially private algorithms for descriptive statistics (such as the Laplace mechanism [25]), a theoretical worst-case analysis is also indicative of typical performance, so we use this to calculate the a priori privacy-accuracy translation needed in the privacy budgeting tool. For statistical inference procedures, the accuracy (e.g., the size of a confidence interval obtained) is necessarily data-dependent, even without privacy. (For example, using a t-test for mean estimation gives a confidence interval whose size depends on the empirical variance of the data.) When incorporating such methods, PSI uses conservative confidence intervals, meaning that it ensures that the differentially private confidence interval includes the true value with probability at least .95. Intuitively, we account for the noise introduced by differential privacy by making the confidence intervals larger; this ensures that analysts do not draw incorrect conclusions from the differentially private statistics (but more analyses may come out inconclusive, as we explain to users of the system). To provide the a priori accuracy bounds needed by the privacy budgeting tool, we intend to use "rules of thumb" based on experimental evaluation given n, ε, the number of variables, and other available metadata.

11 Software Architecture

We have implemented a prototype of PSI that is ready for preliminary deployment in a data repository. As mentioned throughout the text, some features are still under development. In this section we describe the current implementation.

Metadata: Archival data for a research study are commonly stored in repositories as an original data file and a complementary metadata file. The original data file contains the raw numeric values of the observations in the dataset. The metadata file contains auxiliary information about the dataset that increases its ability to be reused by researchers; this might include text descriptions of the variables, summary statistics, provenance [14], and numerical fingerprints for validation [6]. The largest repositories have shared standards for how this metadata file should be constructed [77, 12], so that catalogs of data can be built across repositories [49, 65], and software utilities can be reused and deployed across different institutions [79].

Some of the information that gets recorded in the metadata we consider public, such as the names and text descriptions of the meanings of the variables and the sample size. Some of the metadata, such as variable-level summary statistics, contains private information, even if aggregated. Thus, if the dataset contains private information, we consider its metadata to also be a private file that could potentially leak information. It is compliant with the shared standards, however, for metadata to have missing or empty fields, so we can construct a reduced version of the private metadata that contains only public information. To this we can add differentially private versions of certain summary statistics, and still distribute the metadata file for public use, so long as the total privacy loss after composition (see Section 9) is below the appropriate global parameter. We call this the public metadata.

[Figure 3: Architecture diagram. Users (the data depositor, semi-trusted analysts, and untrusted analysts) interact through client-layer interfaces; an application layer holds the composition and accuracy application and the differentially private release application, separated from users by the privacy barrier; and a storage layer holds the private raw data, the private metadata, and the public and personal DP metadata.]

The bottom of Figure 3 shows the private raw data, its accompanying private metadata, and the public metadata, residing in a storage layer in our system. Surrounding them are the application-layer tools for differential privacy, which run on a remote server.
Figure 3: Architecture diagram. (The diagram shows the private raw data, private metadata, public DP metadata, and personal DP metadata in a secured storage layer; a composition-and-accuracy application and a differentially private release application in the application layer, separated from users by the privacy barrier; and client-layer interfaces for the data depositor, semi-trusted analysts, and untrusted analysts, with credentials required for all but the public explorer and exact values confined to the trusted side.)

The bottom of Figure 3 shows the private raw data, its accompanying private metadata, and the public metadata, residing in a storage layer in our system. Surrounding them are the application-layer tools for differential privacy, which run on a remote server. The differentially private algorithms, the accuracy estimates, and the budgeting coordinated by the composition theorem, each discussed previously, are all implemented in the R programming language, which is widely used in the statistics and quantitative social science communities [63]. We describe how they interlink below, as we trace out a user's interaction with the system. We expect to distribute all of these routines as an R package for easy reuse within the R environment (independently of Dataverse and TwoRavens). In addition to this code on the server, there are client-layer interfaces (written as thin HTML/JavaScript GUIs) that allow different types of users to interact with the system, but with no direct access to the raw data. We now describe our different key users (the same as introduced in Section 7), and how their respective interfaces interact with the larger system, in turn.

Depositor Interaction: At the time of budgeting, the depositor interface or privacy budgeting interface, as for example in Figure 2, allows the data depositor to construct a list of statistics they would like to release on the dataset. This interface has no direct access to either the data or computations on the data; whenever the page requires a new computation (as when the metadata for a statistic is completed, a statistic is deleted, or an accuracy value or any global parameter is edited), it copies the contents of the current page to a remote application that uses differential privacy composition theorems to re-partition the privacy budget among the current set of statistics (by scaling all of the εᵢ by the largest multiplicative factor that stays within the global privacy budget) and recalculates the corresponding accuracies. This remote process then recomputes and returns an updated list of privacy loss parameters and accuracy estimates associated with each selected statistic. The frontend interface then rewrites the summary table in the right-hand panel of Figure 2 with these newly provided values and waits for more actions from the user until another round of computation is required. The backend composition process is memoryless, in the sense that no past version of the page persists or is stored; every request to the backend begins an entirely new set of budgeting and accuracy computations. For this reason, the connection between the frontend and the backend does not have to be persistent.

When the depositor has finalized the list of statistics she wishes to make available, together with their privacy loss parameters, a table containing the chosen statistics and their associated metadata and privacy loss parameters is submitted to a separate remote release application that computes all the requested DP statistics. This release tool checks the composition of the request with a trusted version of the composition application, which means that the code up to this point does not have to be trusted, so long as the global ε can be verified. This is depicted in Figure 3 as a split box, representing that there are two instances of the same code: one listening and replying to client requests, which does not have to be trusted, and another copy that has to be trusted but only interacts with the backend and has no web connection, so it is easier to protect.
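To make the re-partitioning step concrete, here is a minimal sketch under basic composition, where the total privacy loss is the sum of the per-statistic εᵢ; the deployed system can instead use tighter composition theorems [38, 56], which change only the budget arithmetic.

    # Scale all per-statistic epsilons by the largest factor that fits the
    # global budget (basic composition: total loss = sum of the epsilon_i).
    rescale_budget <- function(eps, eps_global) {
      fac <- eps_global / sum(eps)
      eps * fac
    }

    eps <- c(mean_age = 0.05, hist_race = 0.10, mean_income = 0.05)
    rescale_budget(eps, eps_global = 0.5)
    # -> 0.125, 0.250, 0.125: proportions preserved, total exactly 0.5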
The release tool is the only process that has access to the raw data, which sits in secure storage in the Dataverse repository. The application that calculates the DP releases does not reply to the depositor interface. The architecture diagram in Figure 3 shows the directions of communication between every piece of the system, and one can trace from it that any path from the raw data to any data analyst (or even to the data depositor) has to pass through the DP channel from this application to the release of a differentially private value written to a metadata file.

Analyst Interaction: The differentially private statistics that are generated are released in a file of metadata associated with the securely archived data. Everything in this metadata file can be made available for public browsing. In Figure 3, we show an untrusted public analyst who does not need to prove any credentials and is able to access the public metadata file with the differentially private releases. Public analysts can use the public metadata file in whatever manner they prefer. However, since all this information is written in the repository metadata standards, a difficult-to-read XML file, we provide an explorer interface that presents the information in a more easily interpretable graphical form, using a modified version of the TwoRavens software [37], described in the next section. This is a statistical platform that allows users to explore data in repositories by means of their metadata, so it is a good match for this application, where only the metadata is available to the user.

Once PSI is integrated with a data repository, we will provide another tier of access to semi-trusted users. These are users for whom the depositor has granted a user-specific privacy budget εₐ from which they can generate additional differentially private releases, beyond those included in the public release. We expect these users will have some distinct university or research affiliation that can be verified by credentials (for example, Dataverse verifies members of certain universities by Shibboleth [55], using the Security Assertion Markup Language protocol) and that they will agree to terms of use. Their explorer interface includes both the exploratory ability of the untrusted analyst interface and the budgeting ability of the depositor interface. Again, these users can construct a list of queries they would like to make, partitioning their personal εₐ budgets among them, with the assistance of the composition application.

When they have a batch of statistics whose accuracies they find useful, the statistics are submitted to the differentially private release function, which checks that the composition of statistics meets the available budget using a trusted copy of the composition application. This may occur in several adaptive rounds, as analysts learn things about the data that inform their further exploration, until their budget is exhausted. Each semi-trusted user will have their own personal metadata file. This will start out with the same information as the public metadata, but each release will add the additional differentially private releases that have been paid for from that user's personal εₐ budget.
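A minimal sketch of the bookkeeping behind such adaptive rounds follows; it is illustrative only, since in PSI the authoritative check is performed by the trusted copy of the composition application, and under tighter composition theorems the charge is not a simple sum.

    # A per-user privacy budget ledger for adaptive batches of queries.
    new_ledger <- function(eps_total) {
      remaining <- eps_total
      list(
        charge = function(batch_eps) {
          cost <- sum(batch_eps)  # basic composition over the batch
          if (cost > remaining) stop("batch exceeds remaining personal budget")
          remaining <<- remaining - cost
          invisible(remaining)
        },
        remaining = function() remaining
      )
    }

    ledger <- new_ledger(eps_total = 1.0)
    ledger$charge(c(0.2, 0.3))  # first adaptive round of releases
    ledger$remaining()          # 0.5 left for later rounds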
Only the semi-trusted user will have access to this personal metadata file, by means of their credentials; moreover, in the terms of use we are trusting them to observe, they will have agreed not to share these values in collusion with other users (as discussed in the trust model in Section 7).

Security of the prototype: In developing PSI we concentrated on design choices that maximize its usefulness for its potential user community. Nevertheless, we also addressed several of the security and side-channel vulnerabilities that have been raised in the literature about implementations of differential privacy [34, 54]. Most of these concerns are mitigated by our design choices. For example, PSI only allows its users to select built-in differentially private data analyses and to run variable transformations before them, rather than allowing arbitrary analyses that are then automatically verified or converted to satisfy DP. This restriction comes naturally with our goal of allowing exploratory data analysis by users without programming expertise. More details on the vulnerabilities and how PSI addresses them are in Section 13.

12 Exploration Interface

Figure 4: Explorer graphical user interface for inspecting differentially private released values, adapting the TwoRavens platform.

As described in the previous section, all released differentially private values are written to metadata files, either public files or (in a future version of the system) files belonging to only one user. These files can be used by the permitted analyst in whatever manner they prefer, but we provide in our system a user-friendly interface for reading the information stored in the metadata. The TwoRavens platform for statistical inference (http://2ra.vn) is an interface that allows users at all levels of statistical expertise to browse data on repositories, explore summary statistics, and build statistical models on those datasets by means of directed graphs [37, 22]. The interface is a browser-based thin client, with the data remaining in an online repository and the statistical modeling occurring on a remote server. The data remain in the repository and never go to the browser; rather, the statistical exploration is achieved by remote statistical processing and by moving the correct metadata to the browser. This architecture works well with the PSI system, since it relies solely on metadata, and we have been adapting some of the available graphs and summary tables to convey to the user the additional uncertainty inherent in dealing with differentially private releases from noisy mechanisms: for example, providing confidence intervals for differentially private values, and histograms and density plots that represent the uncertainty in the values due to noise.

13 Security

The initial prototypes of PSI do not address all of the security and side-channel issues that have been raised in the literature about implementations of differential privacy [34, 54]. We feel that a higher priority is evaluating whether the design of PSI is useful for its potential user community; if the answer is positive, security issues can be addressed in a future version, before the system is used to handle highly sensitive data.

13.1 Timing, state, and privacy budget attacks

Haeberlen et al.
[34] analyze the possible attacks on a differential privacy system working in a centralized scenario similar to the one we described in Section 7. In their scenario, data analysts are allowed to submit arbitrary analyses to the differential privacy system, and the system is responsible for running these analyses if they pass some formal requirements guaranteeing differential privacy and if there is still some budget left. Even if these formal requirements guarantee differential privacy, this model is prone to three main kinds of side-channel attacks:

Timing attacks: The data analysis may leak information about an individual using a timing (or any other covert) channel.

State attacks: The data analysis may leak information about an individual through an observable change in the application state, for instance by using a global variable.

Privacy budget attacks: The data analysis may leak information about an individual by running a subanalysis that fails because of a lack of privacy budget.

Most of these attacks can be implemented only if data analysts are allowed to submit arbitrary data analyses. In PSI, a data analyst can only select built-in differentially private data analyses and run variable transformations before them. Using only built-in differentially private data analyses prevents these attacks at data analysis time. For instance, there is no risk of a privacy budget attack, since queries cannot run subanalyses that can exhaust the privacy budget. Nevertheless, data analysts can submit to PSI variable transformations that can create new features by combining existing ones, as we discussed in Section 10, and these raise a greater risk of timing attacks and state attacks, which we discuss in the next section.

Figure 5: Workflow schema for safe variable transformations. (The figure shows a private dataset D with attributes A and B, each with range [0, 2]; a safe per-row transformation Cᵢ = Aᵢ · Bᵢ yielding a new private dataset D′ with attribute C of range [0, 4]; and a differentially private mean of C with noise calibrated to the range [0, 4].)

13.1.1 Safe variable transformations

An important property of differential privacy is closure under post-processing.

Lemma 13.1 (Post-processing, [25]). Let M be an (ε, δ)-differentially private randomized algorithm from Xⁿ to Y (see footnote 9), and let f be an arbitrary (possibly randomized) map from Y to Z. Then the composition of M and f, denoted f ∘ M, is an (ε, δ)-differentially private algorithm from Xⁿ to Z.

Footnote 9: We use Xⁿ to describe the possible set of databases with n records of type X.

This says that the result of a differentially private computation can safely be given as input to any other transformation, and the differential privacy property will be maintained. The situation is more involved when transformations occur before applying a differentially private mechanism. Indeed, if we fix a differentially private algorithm M from Xⁿ to Y and arbitrarily pre-process its input dataset z ∈ Zⁿ with an arbitrary map f from Zⁿ to Xⁿ, we can break its privacy guarantee. As a simple example, consider a differentially private mechanism that approximately releases the fraction of individuals with a particular feature B in a database with n records, and a map f that returns a database of n records all having feature B in the case that John Doe is in the database, and that removes all the elements with feature B otherwise.
When n is sufficiently large, a data analyst observing the result of M ∘ f can determine with high probability whether or not John Doe is in the database. Fortunately, there is a class of important transformations that preserves differential privacy: per-row transformations.

Lemma 13.2 (Per-row transformations). Let M(x₁, ..., xₙ) be an (ε, δ)-differentially private randomized algorithm from Xⁿ to Y (see footnote 10), and let f be a map from Z to X. Then the composition of f and M, denoted (M ∘ f)(z₁, ..., zₙ) = M(f(z₁), ..., f(zₙ)), is an (ε, δ)-differentially private randomized algorithm from Zⁿ to Y.

Footnote 10: We make explicit here the fact that M is a function of the records x₁, ..., xₙ of the input dataset.

This kind of transformation can be very useful in practice. For instance, as we mentioned before, the (empirical) covariance between two attributes can be estimated by estimating the mean of a new attribute that is the product of the two original attributes (as well as the means of the original attributes), and the mean of a variable in a subpopulation can be computed from the mean of the product of that variable with a binary indicator for the subpopulation of interest, together with the mean of the indicator.

However, one must be careful in using this lemma. Indeed, if the input data are in Zⁿ, one must consider the possible change of attributes in Xⁿ when reasoning about the differentially private algorithm M. Let us see this with an example. Consider the case where we want a differentially private estimate of the mean of a new attribute C that is the product of two original attributes A and B. A differentially private algorithm for computing the mean must choose noise that is proportional to the range of the attribute. Suppose that we know the range of both A and B is [a, b] for a, b ≥ 0. When we choose the noise for M, we need to reason about the range of C, which is not [a, b] but rather [a², b²].

In order to allow only transformations that are safe in the sense discussed above, PSI takes an approach similar to that of AIRAVAT [67]: it requires the data curators and the data analysts to provide the ranges of each variable before and after the transformations, and it enforces these ranges at run time, i.e., the differentially private algorithms truncate values that fall outside the specified range. This guarantees the correct use of the principle formalized in Lemma 13.2, and so privacy is preserved. To support the design of transformations, PSI uses a restricted domain-specific language and an automated program analysis tracking variable ranges.

The workflow of variable transformations is depicted in Figure 5. Starting from the private dataset D, a variable transformation generates a new private dataset D′, containing the same individuals as D but with potentially new variables, on which the differentially private algorithm is run. In this example, the transformation creates a new attribute C as the product of A and B. This is performed per-row, and the program analysis forwards the range information from the inputs (in this example, the range for both A and B is [0, 2]) to the newly generated variable (in this example, the range for C is then [0, 4]). This range is provided to the user, who can decide to keep it or to use a different range, as in the sketch below.
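Here is a minimal sketch of that workflow in R, with hypothetical helper names; in PSI the range analysis is performed automatically by the domain-specific language.

    # Safe per-row transformation with range propagation, for a product of
    # two attributes with nonnegative declared ranges (as in [0,2] x [0,2]).
    product_range <- function(range_a, range_b) {
      stopifnot(range_a[1] >= 0, range_b[1] >= 0)
      c(range_a[1] * range_b[1], range_a[2] * range_b[2])
    }

    transform_product <- function(D, a, b, range_a, range_b) {
      # clamp inputs to their declared ranges so the inferred output range
      # is actually enforced at run time
      x <- pmin(pmax(D[[a]], range_a[1]), range_a[2])
      y <- pmin(pmax(D[[b]], range_b[1]), range_b[2])
      list(values = x * y, range = product_range(range_a, range_b))
    }

    D <- data.frame(A = c(.9, 1, 1.5, .6, .5, 1.4), B = c(.8, .6, 1, .7, 2, 1.8))
    C <- transform_product(D, "A", "B", c(0, 2), c(0, 2))
    C$range  # [0, 4]: the range to which the DP mean must calibrate its noise

A differentially private mean of C, together with differentially private means of A and B, then yields the empirical covariance estimate mentioned above, since Cov(A, B) = E[AB] − E[A]·E[B].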
The differentially private algorithm will then enforce this range and add noise proportional to it.

The language for variable transformations allows only statistical operations that combine, transform, or separate variables in a value-independent way. This prevents high-level timing attacks (ones where the timing leakage is intentional), even if it does not prevent fine-grained timing analysis of numerical computations, as we discuss below. Moreover, to protect against state attacks, the language for variable transformations only allows access to locally defined variables. The program analysis is based on a flow-sensitive type system that is used to guarantee that information about the changes in the ranges of variables is propagated to the output.

Summing up, our approach of separating variable transformations from the differentially private data analysis (whose code is not accessible by the data analyst) guarantees protection against privacy budget attacks. The use of a domain-specific language further protects against state attacks and (high-level) timing attacks. Finally, the enforcement of the variable ranges at runtime prevents misuse of the variable transformations. To help the user decide the range for each variable, the domain-specific language uses a program analysis propagating range information from the input to the output.

13.1.2 Floating-point rounding attack

Another attack is the floating-point rounding attack identified by Mironov [53]. The idea of this attack is to exploit the irregularities of floating-point implementations of some basic algorithms, such as the Laplace mechanism. When the output is numeric, differential privacy requires every output to be feasible, i.e., returned with some probability for every input, and outputs must have similar probabilities when the inputs differ by one individual. Mironov showed that naive implementations of differential privacy instead lead to results that are concentrated on subsets of the outputs. Even worse, it can be the case that for neighboring databases some outputs are possible only under one of the two databases. This allows adversaries to distinguish the output distributions of the two databases with certainty and violate differential privacy.

The solution proposed by Mironov is the snapping mechanism [53, 21], which essentially tosses out the least significant bits of the differentially private floating-point outputs using a combination of clamping and rounding procedures. This mechanism is also effective when instantiated with imperfect randomness. We have implemented the snapping mechanism and incorporated it into the library of differentially private algorithms that underlies the system. However, we found that it has poor utility compared to the typical Laplace mechanism, so at this stage we do not offer it as a default through the interface.
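For intuition, the following is a simplified clamp-and-round sketch in the spirit of the snapping mechanism. It is not Mironov's exact construction, which additionally constrains how the Laplace noise itself is generated and requires the grid width lambda to be a power of two chosen relative to the noise scale [53].

    # Inverse-CDF sampler for Laplace noise (illustrative; a hardened
    # implementation must itself avoid floating-point irregularities).
    rlaplace <- function(n, scale) {
      u <- runif(n, min = -0.5, max = 0.5)
      -scale * sign(u) * log(1 - 2 * abs(u))
    }

    snap_release <- function(true_value, scale, lo, hi, lambda) {
      y <- true_value + rlaplace(1, scale)
      y <- pmin(pmax(y, lo), hi)       # clamp to the declared output range
      y <- lambda * round(y / lambda)  # round away the least significant bits
      pmin(pmax(y, lo), hi)            # clamp again after rounding
    }

    snap_release(true_value = 0.31, scale = 0.01, lo = 0, hi = 1, lambda = 2^-10)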
13.1.3 Fine-grained side-channel attacks

Side-channel attacks are in general difficult to prevent. We discussed above how the use of built-in differentially private primitives and a domain-specific language for variable transformations can help in mitigating timing channels. Nevertheless, the current implementation may still be prone to fine-grained attacks like the one by [7], which exploits timing leakage due to floating-point computations. We expect such attacks to be further mitigated by the fact that PSI is only accessed remotely, so some of these fine-grained observations are absorbed by delays in the communication. We expect that by using an execution environment where statistical operations have value-independent cost (which can be achieved by padding, thanks to the restricted setting), by using some of the proposed mitigations [7], and by having PSI accessed only remotely, we can prevent further vulnerabilities. Nevertheless, we leave a complete evaluation of these vulnerabilities to a future version of our prototype.

14 Usability testing

We conducted thorough usability testing of PSI, seeing 28 participants in total. UI testing was broken into three phases: a pre-pilot phase (n = 3), a pilot phase (n = 5), and a full study (n = 20). After each phase of testing, improvements were made to both the tool and the study protocols based on feedback from the participants. Because the study procedures differed across the three phases, and because the participants in the full study tested a more up-to-date version of the tool than was used in the pilot phases, we focus primarily on the full study in this section.

User testing was conducted on a pared-down version of the system. We tested the workflow of data depositors releasing univariate statistics through the privacy budgeting interface. From a user perspective, releasing multivariate statistics and using the interactive query interface involve procedures very similar to those required for releasing univariate statistics through the budgeting interface. For this reason, we focused on a core set of tasks that would shed light on the usability of the whole system while minimizing the fatigue of research participants. The tool for variable transformations discussed in Section 13 was not integrated into the system at the time of testing, so it has undergone only internal testing and will be evaluated in future rounds of user experiments as more features are added to the system.

All participants were over 18 years old and had some experience with data analysis, whether through courses, work, or research. The study was approved by Harvard's Institutional Review Board, and all participants were compensated with a $20 Amazon gift card for sessions lasting about one hour. In the full study, participants' education levels varied: 10% had completed some college, 25% held a Bachelor's degree, 50% a Master's degree, and 15% a PhD. 35% of participants reported being unfamiliar with differential privacy, 50% said they were somewhat familiar, 10% described themselves as familiar, 5% as very familiar, and nobody reported being an expert in differential privacy.

Study procedures: The user tests were designed to simulate the experience of data depositors. First, participants were asked to read brief introductory text in PSI broadly describing the purpose of the tool, the concept of differential privacy and privacy loss parameters, the need for metadata, and the idea behind secrecy of the sample. Next, the participants were given a scenario designed to simulate the mindset of a data depositor. They were given a toy dataset in Excel containing the demographic information (age, sex, income, education level, race, and marital status) of 1000 people sampled randomly from a county with a population of 700,000.
They were told that their goal was to advertise their dataset to other social scientists interested in the relationship between race and income for people of various ages. After reviewing the dataset, participants were asked to set privacy loss parameters for the scenario and were given a choice of whether or not to use the secrecy-of-the-sample feature. After setting privacy loss parameters, participants in the full study were led through a sequence of 11 tasks using the interface. The tasks all related to the scenario and toy dataset and required participants to effectively use each feature of the tool, guiding them through the typical workflow of a data depositor. The tasks are listed below, with the features being tested indicated in parentheses.

1. You just entered a tutorial mode in the interface that will highlight some key features of the tool. Go through the tutorial and, when prompted, select a mean of the Age variable as your first statistic. (Tutorial mode, selecting statistics, inputting metadata.)

2. You decide that the income and race variables are also important for future researchers, so you decide to release statistics for these. Add a mean and a quantile for income, as well as a histogram for race. (Selecting statistics, inputting metadata.)

3. You no longer wish to include a quantile for income. Delete this statistic. (Deleting statistics.)

4. You decide that you want to be very confident in your error estimates. Use the tool to set a 98 percent confidence level. (Adjusting confidence level.)

5. You are thinking about your dataset, and you realize that it contains some information that makes it more sensitive than you originally thought. Use the tool to make the changes necessary to reflect this shift. (Adjusting privacy loss parameters.)

6. You have just been informed by a colleague that your dataset was actually randomly sampled from a population of size 1,200,000. Use the tool to make changes to reflect this. Does this make your statistics more or less accurate? (Secrecy of the sample.)

7. You decide that it would be useful to allow other researchers who do not have access to your raw data to make some of their own selections of statistics to calculate from your dataset. Use the tool to make changes to reflect this. (Reserving privacy budget for data analysts.)

8. How much error is there for the mean age statistic? What does this number mean? (Interpreting error estimates.)

9. Make it so that the released mean age is off from its true mean by at most one year. Is this more or less accurate than what you had before? (Redistributing privacy budget via error estimates.)

10. Make it so that each count in the released race histogram is off from the true count by at most 5 people, without changing the error you just set for mean age. (Redistributing privacy budget, hold feature.)

11. You are satisfied with your statistics and your error estimates. Finalize your selections. (Submitting statistics.)

In the two pilot phases, the tasks were much less specific and simply asked participants to use the tool to release some useful statistics about their dataset. Pilot participants reported feeling overwhelmed by the open-ended nature of this request, so we modified the tasks in the full study to specifically target each feature of the tool.
Audio recording was used during tasks, and participants were asked to speak their thoughts aloud as much as possible, which provided valuable qualitative feedback. For each participant, time spent on each task was recorded. Errors were also recorded and classified as either Critical Errors (CEs), in which the participant made choices that led to the inability to complete the task, or Non-Critical Errors (NCEs), in which participants corrected their own errors and were able to complete the task successfully despite early mistakes.

Results: Overall, participants performed well, committing relatively few critical errors during the tasks. The most common mistakes occurred when entering metadata, with every participant reporting some doubt over what values to enter and 45% of participants entering values at some point in the session that would lead to poor results. 30% of participants had some degree of trouble figuring out how to redistribute their privacy budget across their statistics (by raising or lowering the corresponding accuracy estimate). The basic functionalities of selecting and deleting statistics, modifying global parameters, and submitting the statistics for differentially private release came easily to most participants. Mistakes tended to recur across subjects, clearly highlighting the more difficult features of the tool. A quantitative summary of the task results can be found in Figure 6. Anecdotally, most participants said the documentation was helpful but should be simpler, shorter, and distributed as needed throughout the tool. Unsolicited, four participants reported having fun using the tool.

There was no significant relationship between familiarity with differential privacy and the total number of critical or non-critical errors participants made. Likewise, there was no significant relationship between education level and the total number of critical or non-critical errors made.

PSI puts some privacy-critical decisions into the hands of its users, making precise documentation, intuitive design, and thorough error handling imperative. There are three ways that data depositors could accidentally leak more information about their data than intended: mistakenly setting overly large privacy loss parameters, overestimating the size of the population when using the secrecy-of-the-sample feature, or entering values directly from the raw data as metadata in the tool (note that none of these violations is possible for data analysts). Over all participants, including the two pilot phases and all tasks, no one overestimated the population size when using the secrecy-of-the-sample feature, even though the majority of participants elected to use it. One participant in the pilot phase entered empirical range values from the toy dataset in the metadata field. In reaction to this, we added an additional warning about the risks of data-dependent decisions in the tool, which was heeded by all 20 members of the full study.
Only one participant of the 28 made an unsafe choice regarding the privacy loss parameters, by accidentally switching ε and δ, setting ε = 10⁻⁶ and δ = .25. Although this was a rare event, we have taken the error seriously and have implemented more rigorous checks on the privacy loss parameters to prevent similar accidents in the future. All other participants followed the instructions in the documentation, setting ε to a small constant (values chosen by all other participants ranged from .05 to 1) and δ to a negligible number (range: 10⁻⁷ to 10⁻⁵). These results suggest that the system is designed in a way that makes it difficult to violate privacy, even for users who make many mistakes while using the tool.

Task | Average time on task (secs) | # CEs | # NCEs
1  | 349.9 | 2 | 1
2  | 289.7 | 9 | 6
3  |   8.2 | 2 | 1
4  |  29.8 | 2 | 0
5  |  53.7 | 4 | 2
6  |  30.1 | 1 | 0
7  |  95.8 | 5 | 1
8  |  34.3 | 2 | 0
9  |  20.3 | 5 | 3
10 |  52.4 | 6 | 3
11 |  20.6 | 0 | 0

Figure 6: Performance on usability tasks. Average time on task is reported only for participants who successfully completed the task. The # CEs and # NCEs columns list the number of people who committed at least one critical error or non-critical error, respectively, during the task. On the right is a box plot of System Usability Scale scores for PSI (axis from 20 to 80); the black vertical line is the median (73.6), the red vertical line is the mean (69.6), and the circle represents one outlier in the data.

At the end of the usability test, participants were asked to rank how relevant they think PSI is for people who collect human-subjects data on a scale from 1 to 5, with 1 being the least relevant and 5 the most. The average relevance rating given to the project was 4.3. We view this as an indication that people both recognize data privacy as an important issue and believe that PSI would successfully address a need in this space.

System Usability Scale: After the tasks, participants filled out the System Usability Scale (SUS) [13], a ten-item questionnaire that is widely employed in usability studies to assess the quality of a user interface. The SUS is easy to administer and, though only ten questions, has been shown to be a reliable and valid measure of usability [69, 9]. The ten questions are on a five-point Likert scale and yield a total score between 0 and 100, which should not be interpreted as a percentile or letter grade. In the full study, the mean SUS score given to PSI was 69.6, with a median score of 73.6. A box-and-whisker plot of SUS scores can be seen in Figure 6. There was one strong outlier in the data, more than 2.5 standard deviations below the mean. Removing this outlier gives a mean SUS score of 72.5 and a median of 75. These scores are better than the system's average score in the pilot study of 59.5, suggesting that the changes made to the tool between the pilot and the full study made the system significantly more usable. There was no significant relationship between SUS scores and familiarity with differential privacy, nor with education level.

Two meta-analyses have been conducted on a wide range of usability studies and found average SUS scores across all systems to be 69.7 [9] and 68.0 [69], respectively. There has also been work on associating labels with SUS scores [9], including: Worst Imaginable, Awful, Poor, OK, Good, Excellent, and Best Imaginable.
Of the participants who rated a system OK, the average SUS score was 52.0, and of those who selected Good, the average SUS score was 72.8. In a study of well-known user interfaces, Excel received an average SUS score of 56.5, people gave their GPS devices an average rating of 70.8, and the iPhone received a 78.5 average [43]. In light of these results and the complexity of PSI, we find our scores on the System Usability Scale encouraging.

Incorporating Feedback: After each phase of testing, modifications were made to the design of the tool and to the documentation in accordance with user performance and feedback. A major change inspired by the usability testing was the incorporation of a tutorial mode that is automatically triggered when users first encounter the interface. The tutorial orients users to the features of the tool and guides them through selecting their first statistic. There were also substantial changes made to the documentation, with a focus on simpler, more concise language and on more intuitive and visible locations for help text throughout the interface. Many smaller changes were made to improve the usability of the UI, including new buttons, bug fixes, hiding advanced features when they are not needed, and a host of cosmetic adjustments. As mentioned in the results section, a particular effort was made after each testing phase to prevent accidental privacy leaks. The most significant of these was a more rigorous automatic checking system during privacy loss parameter selection. These checks work to prevent accidentally unsafe parameter settings and provide clear alert messages if imprudent choices are made.

15 Empirical Evaluation

In addition to user testing, we have experimentally evaluated all of the differentially private algorithms implemented in PSI using a combination of real and synthetic data. We have performed two main kinds of experimental evaluation: experiments aimed at confirming the feasibility of releasing several statistics with a given budget, and experiments replicating studies from the social science literature.

Experiments on the combined release of statistics: The goal of this category of experiments was to answer the question "can we release basic statistics for all the variables with a fixed budget and with good accuracy?" We analyzed several datasets of different sizes (with n as small as 10³ and as large as 10⁶) available in Dataverse. The overall goal was to release all the univariate statistics currently implemented in PSI under different values of the budget for ε (in the range [0.01, 1]), with fixed δ (set at 2⁻²⁰), and varying the secrecy-of-the-sample assumption (with values 1%, 3%, 5%, and 100%). We considered different splits of the privacy loss parameters among the different statistics, and we experimented using both the optimal composition theorem and the basic composition theorem. We also used different accuracy measures to capture different characteristics of the different data: mean absolute error, mean relative error, mean squared error, root mean squared error, and the ℓ₁, ℓ₂, and ℓ∞ norms, computed as in the sketch below.
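For reference, these error metrics are computed as in the following sketch, comparing a vector of differentially private releases against the corresponding true statistics.

    release_errors <- function(dp, truth) {
      err <- dp - truth
      c(
        mae  = mean(abs(err)),               # mean absolute error
        mre  = mean(abs(err) / abs(truth)),  # mean relative error (truth nonzero)
        mse  = mean(err^2),                  # mean squared error
        rmse = sqrt(mean(err^2)),            # root mean squared error
        l1   = sum(abs(err)),
        l2   = sqrt(sum(err^2)),
        linf = max(abs(err))
      )
    }

    release_errors(dp = c(10.2, 4.9, 7.6), truth = c(10, 5, 8))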
From this experience we learned that in many situations we can provide differentially private results for all the univariate statistics with non-trivial accuracy. For datasets with sample size 100,000, we were able to release several univariate statistics (means, histograms, and CDFs) for all the variables (∼50 attributes) with mean absolute error ≤ 10%, with global ε = 0.3 and global δ = 2⁻²⁰, as shown in Figure 7. As expected, these results show some variability depending on the setup of the parameters (e.g., larger dataset sizes and larger values of epsilon give better accuracy) as well as on the error metric used to measure accuracy. Nevertheless, the experiments we performed met some of the expectations set in Section 6 and confirmed what the theoretical analysis tells us. This step also helped us optimize the code of the different statistics, increasing the scalability of the analysis; e.g., the release of several univariate statistics (means, histograms, and CDFs) on datasets with millions of entries and ∼50 variables takes less than 10 seconds.

Figure 7: Normalized mean absolute error for means, histograms, and CDFs for 5, 10, 20, and 50 variables from the PUMS dataset.

Replication of social science studies: Replicating the results of published works is a higher bar for PSI than its actual initial goal, which is to support data exploration (for determining whether one should apply for access to the raw data). Nevertheless, we created a corpus of 80 datasets from quantitative social science by finding datasets on repositories and by reaching out to the authors of studies. Our goal was to find datasets that could be publicly released, but whose topics and structure closely resembled those that would ordinarily be closed due to the inclusion of sensitive data. These give us a variety of types and sizes of datasets with which we can benchmark the performance of differentially private statistics while also releasing the true dataset values for comparison. From this corpus, we chose twelve studies that had published articles or reports using simple statistical methods that we could emulate with our available differentially private statistics. These ranged in size from 926 observations, in a survey of high school biology teachers and whether they teach evolution in the classroom, to 369,000 observations, in a randomized field experiment testing the ability to mobilize voters through pride or shame by mailing them their previous turnout history.

As an exemplar, we briefly describe one replication study. The Pew Research Center conducts periodic public opinion polling on attitudes and on social and demographic factors. They released a report describing trends over time in their data with regard to attitudes toward same-sex marriage [47, 62], the first of which was that "Public support for same-sex marriage has grown rapidly over the past decade." The report looked at 34 separate nationally representative surveys conducted from 2004 to 2017, each of which asked respondents whether they favor or oppose allowing gays and lesbians to marry (or don't know). We replicate this study by using differential privacy to release the mean favor and oppose rates for each survey. The true and DP survey means are shown in Figure 8 in blue and red, respectively. We used a conservative ε = 0.01 for each survey mean; while the mean and median sample sizes across surveys are slightly over 2000 respondents, the range in sample sizes is relatively large, giving different degrees of noise across the released means.

Figure 8: Pew opinion data over time, and estimated trends.
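A minimal sketch of the per-survey release follows, on synthetic data: the favor rate is the mean of a 0/1 indicator, so one respondent changes it by at most 1/n.

    rlaplace <- function(n, scale) {
      u <- runif(n, min = -0.5, max = 0.5)
      -scale * sign(u) * log(1 - 2 * abs(u))
    }

    dp_proportion <- function(x, epsilon) {  # x: 0/1 indicator vector
      n <- length(x)
      noisy <- mean(x) + rlaplace(1, (1 / n) / epsilon)
      pmin(pmax(noisy, 0), 1)  # clamping is post-processing, so privacy is preserved
    }

    favor <- rbinom(2000, 1, 0.55)        # synthetic survey of ~2000 respondents
    dp_proportion(favor, epsilon = 0.01)  # noise scale = 1/(n * eps) = 0.05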
Although each of the released means is noisy, the expected error is zero, and the trend remains. As shown clearly in the figure, the trend from the best-fit line across the DP releases is very close to the trend line from the raw data, and an analyst exploring this data with PSI could have discovered the same finding with a high level of privacy protection.

References

[1] American FactFinder. http://factfinder.census.gov/.

[2] Family Educational Rights and Privacy Act (FERPA), 20 U.S.C. § 1232g; 34 C.F.R. Part 99.

[3] Harvard University data classification table. http://policy.security.harvard.edu/.

[4] NAEP tools on the web. Technical Report NCES 2011-460, U.S. Department of Education, 2011. https://nces.ed.gov/nationsreportcard/about/naeptools.aspx.

[5] re3data.org reaches a milestone & begins offering badges, 2016. http://www.re3data.org/2016/04/re3data-org-reaches-a-milestone-begins-offering-badges/.

[6] Micah Altman, Margaret O. Adams, Jonathan Crabtree, Darrell Donakowski, Marc Maynard, Amy Pienta, and Copeland H. Young. Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences. The American Archivist, 72(1):170–184, 2009.

[7] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pages 623–639, 2015.

[8] NORC at the University of Chicago. Obtaining GSS sensitive data files.

[9] Aaron Bangor, Philip Kortum, and James T. Miller. An empirical evaluation of the System Usability Scale. International Journal of Human-Computer Interaction, 24(6):574–594, 2008.

[10] Gilles Barthe, Boris Köpf, Federico Olmedo, and Santiago Zanella Béguelin. Probabilistic relational reasoning for differential privacy. In John Field and Michael Hicks, editors, Proceedings of the 39th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2012, Philadelphia, Pennsylvania, USA, January 22-28, 2012, pages 97–110. ACM, 2012.

[11] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In 48th Annual Symposium on the Theory of Computing (STOC'16), June 2016. To appear. Preliminary version available at h

[12] Grant Blank and Karsten Boye Rasmussen. The Data Documentation Initiative: the value and significance of a worldwide standard. Social Science Computer Review, 22(3):307–318, 2004.

[13] John Brooke. SUS: a quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.

[14] James Cheney, Stephen Chong, Nate Foster, Margo Seltzer, and Stijn Vansummeren. Provenance: a future history. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications, pages 957–964. ACM, 2009.

[15] Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau. Zelig: Everyone's statistical software (version 5), 2015.

[16] Xiaoqian Jiang, Yongan Zhao, Xiaofeng Wang, Bradley Malin, Shuang Wang, Lucila Ohno-Machado, and Haixu Tang. A community assessment of privacy preserving techniques for human genomes. BMC Medical Informatics and Decision Making, 14(Suppl 1)(S1), 2014.

[17] Mercè Crosas.
The Dataverse Network: an open-source application for sharing, discovering and preserving data. D-Lib Magazine, 17:1–2, 2011. doi:1045/january2011-crosas.

[18] Mercè Crosas, Gary King, James Honaker, and Latanya Sweeney. Automating open science for big data. The ANNALS of the American Academy of Political and Social Science, 659(1):260–273, 2015.

[19] Jon P. Daries, Justin Reich, Jim Waldo, Elise M. Young, Jonathan Whittinghill, Andrew Dean Ho, Daniel Thomas Seaton, and Isaac Chuang. Privacy, anonymity, and big data in the social sciences. Communications of the ACM, 57(9):56–63, September 2014.

[20] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202–210, 2003.

[21] Yevgeniy Dodis, Adriana López-Alt, Ilya Mironov, and Salil P. Vadhan. Differential privacy with imperfect randomness. In Advances in Cryptology - CRYPTO 2012 - 32nd Annual Cryptology Conference, Santa Barbara, CA, USA, August 19-23, 2012. Proceedings, pages 497–516, 2012.

[22] Vito D'Orazio and James Honaker. A User Guide to TwoRavens: An Overview of Features and Capabilities, 2016.

[23] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.

[24] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.

[25] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In 3rd Theory of Cryptography Conference, pages 265–284, 2006.

[26] Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010.

[27] Cynthia Dwork, Adam D. Smith, Thomas Steinke, and Jonathan Ullman. Hiding in plain sight: A survey of attacks on private data. Manuscript, April 2016.

[28] Yaniv Erlich and Arvind Narayanan. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics, 15(6):409–421, 2014.

[29] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.

[30] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Gail-Joon Ahn, Moti Yung, and Ninghui Li, editors, Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, November 3-7, 2014, pages 1054–1067. ACM, 2014.

[31] Sergiu Gherghina and Alexia Katsanidou. Data availability in political science journals. European Political Science, 12(3):333–349, September 2013.

[32] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Mercè Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. Ten simple rules for the care and feeding of scientific data. PLoS Computational Biology, 10(4):1–5, April 2014.

[33] Andy Greenberg. Apple's 'differential privacy' is about collecting your data – but not your data. Wired, June 2016.

[34] Andreas Haeberlen, Benjamin C.
Pierce, and Arjun Narayan. Differential privacy under fire. In Proceedings of the 20th USENIX Security Symposium, August 2011.

[35] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan Zhang. Principled evaluation of differentially private algorithms using DPBench. In SIGMOD, 2016.

[36] Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V. Pearson, Dietrich A. Stephan, Stanley F. Nelson, and David W. Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8):e1000167, 2008.

[37] James Honaker and Vito D'Orazio. Statistical modeling by gesture: A graphical, browser-based statistical interface for data repositories. In Extended Proceedings of ACM Hypertext 2014. ACM, 2014.

[38] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1376–1385, 2015.

[39] James C. Karran, Erica E. M. Moodie, and Michael P. Wallace. Statistical method use in public health research. Scandinavian Journal of Public Health, 43(7):776–782, 2015.

[40] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.

[41] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. In STOC'93, pages 392–401. ACM, 1993.

[42] Gary King. An introduction to the Dataverse Network as an infrastructure for data sharing. Sociological Methods and Research, 36:173–199, 2007.

[43] Philip Kortum and Aaron Bangor. Usability ratings for everyday products measured with the System Usability Scale. International Journal of Human-Computer Interaction, 29(2):67–76, 2013.

[44] James S. Krueger and Michael S. Lewis-Beck. Is OLS dead? The Political Methodologist, 15(2):2–4, 2008.

[45] H. J. Lowe, T. A. Ferris, P. M. Hernandez, and S. C. Weber. STRIDE – an integrated standards-based translational research informatics platform. AMIA Annual Symposium Proceedings, pages 391–395, 2009.

[46] Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: Theory meets practice on the map. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, México, pages 277–286, 2008.

[47] David Masci, Anna Brown, and Jocelyn Kiley. 5 facts about same-sex marriage. June 26, 2017. http://www.pewresearch.org/fact-tank/2017/06/26/same-sex-marriage/.

[48] Andrew J. McMurry, Shawn N. Murphy, Douglas MacFadden, Griffin Weber, William W. Simons, John Orechia, Jonathan Bickel, Nich Wattanasin, Clint Gilbert, Philip Trevvett, Susanne Churchill, and Isaac S. Kohane. SHRINE: enabling nationally scalable multi-site disease studies. PLoS ONE, 8(3), 2013.

[49] Katherine McNeill. Interoperability between institutional and data repositories: A pilot project at MIT. 2007.

[50] Frank McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 19–30, 2009.

[51] R.
Michael Alvarez and Jonathan N. Katz. Editors' note. Political Analysis, 24(2):131, 2016.

[52] Darakhshan J. Mir, Sibren Isaacman, Ramón Cáceres, Margaret Martonosi, and Rebecca N. Wright. DP-WHERE: differentially private modeling of human mobility. In Xiaohua Hu, Tsau Young Lin, Vijay Raghavan, Benjamin W. Wah, Ricardo A. Baeza-Yates, Geoffrey Fox, Cyrus Shahabi, Matthew Smith, Qiang Yang, Rayid Ghani, Wei Fan, Ronny Lempel, and Raghunath Nambiar, editors, Proceedings of the 2013 IEEE International Conference on Big Data, 6-9 October 2013, Santa Clara, CA, USA, pages 580–588. IEEE, 2013.

[53] Ilya Mironov. On significance of the least significant bits for differential privacy. In the ACM Conference on Computer and Communications Security, CCS'12, Raleigh, NC, USA, October 16-18, 2012, pages 650–661, 2012.

[54] Prashanth Mohan, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David E. Culler. GUPT: privacy preserving data analysis made easy. In K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fuxman, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 349–360. ACM, 2012.

[55] R. L. Morgan, Scott Cantor, Steven Carmody, Walter Hoehn, and Ken Klingenstein. Federated security: The Shibboleth approach. Educause Quarterly, 27(4):12–17, 2004.

[56] Jack Murtagh and Salil P. Vadhan. The complexity of computing the optimal composition of differential privacy. In Theory of Cryptography - 13th International Conference, TCC 2016-A, Tel Aviv, Israel, January 10-13, 2016, Proceedings, Part I, pages 157–175, 2016.

[57] National Institutes of Health. NIH notice NOT-OD-03-032: Final NIH statement on sharing research data, 2003.

[58] National Science Foundation. Award & administration guide (AAG) chapter VI.D.4, 2014.

[59] Kobbi Nissim, Aaron Bembenek, Alexandra Wood, Mark Bun, Marco Gaboardi, Urs Gasser, David O'Brien, Thomas Steinke, and Salil Vadhan. Bridging the gap between computer science and legal approaches to privacy. 2016.

[60] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th STOC. ACM, 2007.

[61] Office of Management and Budget. OMB memorandum M-13-13: Open data policy – managing data as an asset, 2013.

[62] Pew Research Center. Changing attitudes on gay marriage. June 26, 2017. http://www.pewforum.org/fact-sheet/changing-attitudes-on-gay-marriage/.

[63] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.

[64] Jason Reed and Benjamin C. Pierce. Distance makes the types grow stronger: A calculus for differential privacy. In ICFP, 2010.

[65] Robin Rice. DISC-UK DataShare project: Final report. 2009.

[66] Ryan M. Rogers, Aaron Roth, Jonathan Ullman, and Salil P. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. CoRR, abs/1605.08294, 2016.

[67] Indrajit Roy, Srinath T. V. Setty, Ann Kilzer, Vitaly Shmatikov, and Emmett Witchel. Airavat: Security and privacy for MapReduce. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2010, April 28-30, 2010, San Jose, CA, USA, pages 297–312. USENIX Association, 2010.

[68] Yasunori Sato, Masahiko Gosho, Kengo Nagashima, Sho Takahashi, James H. Ware, and Nan M.
Laird. Statistical methods in the Journal – an update. New England Journal of Medicine, 376(11):1086–1087, 2017. PMID: 28296608.

[69] Jeff Sauro. A Practical Guide to the System Usability Scale: Background, Benchmarks & Best Practices. Measuring Usability LLC, 2011.

[70] Jatinder Singh. FigShare. Journal of Pharmacology and Pharmacotherapeutics, 2(2):138, 2011.

[71] Adam Smith. Differential privacy and the secrecy of the sample. https://adamdsmith.wordpress.com/2009/09/02/sample-secrecy/.

[72] Latanya Sweeney, Mercè Crosas, and Michael Bar-Sinai. Sharing sensitive data with confidence: The DataTags system. Technology Science, 2015.

[73] Susan Troncoso Skidmore and Bruce Thompson. Statistical techniques used in published articles: A historical review of reviews. Educational and Psychological Measurement, 70(5):777–795, 2010.

[74] Kimberly A. Tryka, Luning Hao, Anne Sturcke, Yumi Jin, Zhen Y. Wang, Lora Ziyabari, Moira Lee, Natalia Popova, Nataliya Sharopova, Masato Kimura, and Michael Feolo. NCBI's database of genotypes and phenotypes: dbGaP. Nucleic Acids Research, 42(D1):D975–D979, 2014.

[75] John Wilder Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.

[76] Sven Vlaeminck and Lisa-Kristin Herrmann. Data policies and data archives: A new paradigm for academic publishing in economic sciences? In B. Schmidt and M. Dobreva, editors, New Avenues for Electronic Publishing in the Age of Infinite Collections and Citizen Science: Scale, Openness and Trust, pages 145–155.

[77] Stuart Weibel, John Kunze, Carl Lagoze, and Misha Wolf. Dublin Core metadata for resource discovery. Technical report, 1998.

[78] Hollie C. White, Sarah Carrier, Abbey Thompson, Jane Greenberg, and Ryan Scherle. The Dryad data repository: A Singapore Framework metadata architecture in a DSpace environment. Universitätsverlag Göttingen, page 157, 2008.

[79] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016.

[80] Amitai Ziv. Israel's 'anonymous' statistics surveys aren't so anonymous. Haaretz, 7 January 2013.