Routes for breaching and protecting genetic privacy

We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is th…

Authors: Yaniv Erlich, Arvind Narayanan

Routes for breaching and protecting genetic privacy
Review Routes f or brea ch ing and pr otec tin g genet ic pr ivacy Yaniv Erlich 1 ,* and Ar v ind Naray anan 2 1 White head Insti tute for B iomedi cal Rese arch, Nine Cambrid ge Center, Ca mbrid ge, MA USA 0214 2 2 Depa rtment of Co mputer Sc ience, P rinceton Unive rsity, 35 Old en Street , Princ eton, NJ US A 0 8540 * Correspondence to Y.E ( yaniv@wi.mi t.ed u ) Abstract We are entering the era o f ubiquitous geneti c information for research, cl inical care, and p erso nal c urio sity . Shar ing the se dat aset s is v ita l for r apid pro gr ess in un derstand ing the genet ic basis of hu man diseas es. However, on e growing concern is the ability to prote ct the genetic priva cy of the data origin ators. Her e, we techn ical ly map thre ats to gen etic pr ivacy and discu ss pote nt ial mit iga tion strate gie s for priva cy - pr e servin g di sse minat ion of g enet ic dat a. About t he Authors Yaniv Erlich i s a Fellow a t the Whitehead In stitute for Biomedic al Resear ch. Erlich received his P h.D. from C old Spring Harbor Labor atory in 2010 an d B.Sc. fro m Tel - Aviv Uni versity i n 2006. Prio r to that , Erli ch w ork ed in co mput er se cur ity a nd was respon sib le for condu ctin g pe netr ation test s on financia l inst itut es an d co mmer cia l companies. Dr. Erl ich ’s research i nvolves developing new algorithms fo r computational hu man genetics. A rvi nd Naraya na n is an A ssist ant P rof essor in the Depart ment of Computer Science and the Cent er for Infor mation Tec hnology and Policy at P rinceton. He studies infor mat ion pr iva cy an d sec urity . H is re sear ch ha s sh own t hat dat a anon ym izatio n is bro ken in fundamental wa ys, for which h e jointly received the 2 008 P rivacy Enhan cing Tech nol ogie s A ward. His cu rren t re sea rch int ere sts inclu de bu il ding a platfor m for privacy - pres erving data sharing. Draft 1 S ummary • Broad data d issem ination is es sential fo r advancem ents in g enetics, b ut a lso brings to light con cerns regarding privacy. • Privacy breaching techniqu es work by cross - referencing two or more pieces of information to gain new, potentially undesirable knowledge on ind ividuals or their fa milie s. • Bro ad ly s pea kin g, t he mai n r out es t o breach privacy are id entity t racing , attribu te disclosure, and com pletion of sensi tive DNA inform ation. • I dentity tracin g ex ploit s qu asi - identif iers in the DN A data or m etadata to u ncov er the identity of an unknown genetic dataset. • Attrib ute d isclos ure te ch ni ques work on know n DNA datasets. They use the DNA information to link the ident ity of a person w ith a sensitiv e phenotype. • Completion technique s also work on known DNA data. They try to uncover sensitiv e genom ic areas that wer e m asked to pro tect th e par ti cip ant. • In the last few years , we ha ve witness ed a rap id growt h in the r ange of tech nique s and tools to conduct these privacy - breaching attacks. Currently, m ost of the techniqu es are bey ond the reach of the g eneral pu blic, but ca n be ex ecuted by traine d pe rsons with varyi ng degrees of effor t. • There is c onside rable deb ate reg arding r isk m anagement. O ne cam p suppor ts a pragm atic, ad - hoc approac h of priv acy by obscu rity and th e other s upports a system atic, mathem atical ly - backed approach o f pri vacy by design. • Privacy by desig n algorithm s include acc ess control, differen tial priv acy , and cryptog raphic techn iques. So fa r, data cu stodians o f g enetic da tabases m ainly adopted acces s control as a mitig ation strateg y. • New developm ents in cry ptographic techniques may usher in an add itional arsenal of securi ty by design tech niques. 2 INTRODU CTION We produce g enetic inf ormation for r esearch , clinical car e, and genea logy at exponential r ates. Seq uencing st udies with t housands of ind ividu als ha ve bec ome a reality 1,2 and new projec ts aim to sequen ce hund reds of tho usands to mil lion s of individu als 3 . Some gen etici sts envision whole gen ome seq uencing of every person as part of rout ine health car e 4,5 . Shar ing g enet ic f ind ings is v ital for ac celer ating th e pa ce of bio med ical d is cove ries and fully realizing the prom ise s of the genetic rev olution 6 . Recent studies suggest that r obust pred ictio ns of genetic p redispos ition s to compl ex trai ts from genetic data will requ ire the anal ysi s of m ill ions o f sa mpl es 7,8 . Clearly, collect ing cohorts at such scales ar e typical ly b eyond the reach of individ ual investi gators and cannot be achiev ed w ith out comb inin g diff ere nt s ource s. In add itio n, br oad dis semin atio n of gene tic dat a pr omot es s ere ndipit ou s disc ove ries t hro ugh secon dary analy si s, which is ne cessar y to maximi z e its utili ty for p at ients and the g eneral publ ic 9 . One of the key issue s of broad disse min ati on is an adequate balan ce of data priva cy 10 . P rospe ctive p art icipa nt s of scient if ic studie s have ra nked privacy of sens itive inf orma tion as one the ir top con cern s and a major dete rminan t if to participat e i n the stu dy 11 - 13 . Protec ting p ersonally i dentifiable infor mation is als o a demand of an arr ay of r egulatory statutes in United Stat es and the Eu ropean Union 14 . Data de - ident ifying, the remov ing of the pers on identifie r, has b een suggested as a pot ential path t o reconcile data sharin g and privacy demand s 15 . But is this technic all y feas ibl e for genetic data? This r eview char acterize s pri vacy breachi ng techniques of genetic info rmation and maps potent ial counter - measur es. We f irst categorize pri v acy - brea ching s tra tegie s, discu ss th eir und erlyin g techn ica l con cept s, an d e valua te t heir per for man ce an d limita tion s. Then, we present priv acy - preser ving technol ogies, gr oup them accord ing to their met hodolog ical approa ch es , an d discuss their rel evance to gen etic information. A s a general t heme, we focu s only on breaching techniqu es that inv olve data m inin g and fusin g dist inct reso urc es to g ain priva te inf ormat ion rel evan t to DNA data. Data cust odia ns should b e aware t hat se curity t hreats can be much broad er. They can includ e cracking weak dat abase passwor ds, cl assi cal co mpute r hacking tec hniques of t he server t hat holds the d ata, ste aling of stor age devi ces due to po or physic al securi ty, and inten tion al mis cond uct of data custo dian s 16 - 18 . W e do not include t hese threat s sin ce they are n ot unique to geneti c infor mation an d have been exte nsively st udied by the c omp uter se cur ity field 19 . In additi on, this rev iew does no t cov er t he pot ent ial i mpl icatio ns of los s of p rivacy , w hich h eav ily de pen d on cultural, legal , and socio - econom ical context and were cov ered in part b y the broa d privacy literature 20 , 21 . 3 PRIVACY BREACHING OF GENETIC DATA Gene tic pr ivacy brea chin g techniq ue s fall into t hree categ orie s: Id ent ity Tr acin g, Attr ibute Di sclo sure Atta cks via DNA (ADA D) , and Co mplet ion Tech niq ues ( Figur e 1 ). The s hare d con cept of t hese t echn iqu es is g ainin g a new piece of private – poten tially sen sitiv e – informa tion about the target or his f amily by exploi ting DNA data. The three categ ories are di stinct in the typ e of sensit ive informat ion that they reveal. Th e aim of identity tracing i s to link between an un known genome and the concealed i dentity of the data or iginator. In ADAD , the ad versary alr eady has access to the identif ied DNA sam ple of the target and t o a database t hat links D NA -derived data to sens itive at tr ibut es wit hout e xpl icit iden tifi ers, for ex a mple a public database of the genet ic st udy of drug abu se . The A DAD techniq ues matc h the DNA data and as sociate t he identity of the t arget with t he sen sitive attr ibute. In compl etio n te chn ique s, th e adv ersa ry a lso k now s t he ide ntity of a ge no mic d ata set bu t has acc ess o nly t o a sa nitiz ed ve rsion wit hou t sens itive l oc i. The aim here is to 4 expose the sen sitive loci t hat are not part of the ori ginal data. Table 1 summari zes all privacy b reaching tech niques that are p resent ed in this sect ion. Table 1 | Categor ization of techniques fo r breaching g enetic pri vacy Technique Maturation Level Technical complexity Auxiliary inform ation Identity Tr acing Surna me I nfere nce  ●●● Intermediate - Good DNA Phe noty ping  ●● Po or Demographic identifiers  ● Good Pedigree structure  ●● Po or Side channel leakage  ●●● Varies Attribute Disclosure Attacks via DNA N=1  ●●● Good Genotype frequen cies  ●●● Good Lin ka ge disequilib rium  ●●●● Intermediate Effect sizes  ●●● Good Trait inferen ce  ●● Good Gene expression data  ●●●● Po or Completio n Attac ks Imput atio n of a mas ked marke r  ●● Good Genealogical i mputation  ●●●● Po or Matura tio n le vel:  Wor king pr inc ipl es e stabl ishe d wi th sim ul ated d ata .  Small scale pro of of concept w ith re al data in a co ntrol le d e nviro nme nt ( ty picall y only one datase t).  Large s cale exper imen ts i n con trolled env iro nment s with r eal data (ty pical ly more than one datase t).  Brea ch of p ri vacy was rep ort ed i n a real scen ari o. Techn ica l com plexi ty: ● no kno wle dge in ge netic s or spec ial to ols is re quire d. ●● Requ ire gen eti c k nowled ge; comp utatio n can re ason abl y be done on a re gul ar co mpute r . Exi sting tools a re avai lable ●●● Req uire gen eti c know le dge , interme diate scal e pro cess ing of da ta and/o r mol ecula r techn iques. ●●●● Req uire gen eti c kn owled ge ; large s cale p roc essi ng of d ata is a prer equi sit e; may al so req uir e molecu lar techn iqu es. Aux iliary info rmatio n: this colum n ref er s to the l evel of exis tin g ref eren ce d atab ases for t he US pop ula tion in publ ic re s ources . Fo r identity tr acin g, it re fers to the avail ability of organiz ed l ists tha t lin k identi ties and extr act pie ces of inf orma tion. F or AD AD and co mpl etio n tec hnique s, i t ref ers to the ex istence of s upport ing refer ence d at aset s th at a re nec essa ry t o com plete t he a tt ack. Poor – support ing data is hig hl y fr agm ente d and not amena ble t o sea rch in g. Inter medi at e – sup por ting data is har mo nize d and se ar cha ble but re quir es s ome pre - proces sing. G reat – supporting data is ready to use u sing exis ting tools or mi nim al p re - proc essing. IDENT ITY TRAC IN G A TTACKS The goal of identity t racin g attacks is to uniquely identify the data originat or from the population de spite th e absence of expli cit ident ifiers such a s the name and exact address in the publi shed data set. The i dea is to accumul ate quasi - identifier s -- residual pieces of infor mation that are embed ded in the dataset -- a nd to grad ua lly narr ow dow n the p os sibl e indiv idu als t hat match the co mbin atio n o f the se quasi - 5 identifier s to t he point that the data originator i s t he onl y ma tch. The succe ss of t he attack depend s on the informat ion content that t he adversary can ob tain from these quasi - identifi ers relative t o size of the bas e populati on ( Box 1 ). IDENTI TY DIS CLOSU RE BY META - DATA 6 Genetic datas ets are typi cally pub li shed with add itiona l met adat a, su ch a s bas ic demographi c details , inclus ion/exclu sion criter ia, pedigre e stru cture, and health condit ion s that are critical to understand the study and for secondary an alysis. Unrestricted de mogr aph ic inf ormation conve ys substantial p ower for identity tracing. It has been e stimated that the combinat ion of date of birt h, sex, an d 5 digit zip co de uniq uely ide ntif ies mor e tha n 6 0% of US in divid uals 22 , 23 . In add ition, ther e are ex tensiv e pub lic re sour ce s with br oad population coverag e and sear ch inter face s that lin k de mo grap hic quasi - identif ier s an d ind ivid uals, in clud in g vot er regist rie s, p ubli c re cord sear ch en gine s su ch as Pe opl e - Finders.com, d ata broker s, and social med ia . In one o f the pione ering st udies of ide ntity tracing using metadata, Sweeny reported t he successf ul tracing of th e medical re cord of the G overnor of Massa chu sett s usin g de mogr aph ic id entif ier s 24 . At that time, the Massa chu sett s Group In suran ce C omm ission rele ased ho spit al d ischar ge informat ion wit h fiv e dig it zip codes, sex, and date of birth. By searching the voter reg istry, Sweeny was able to uni quely match the hos pital discharge of the G overnor. A more recent study reported the iden tificat ion of 30% of Per sonal Genome Pr oject (PGP ) particip ants by demographi c profiling that included zip cod e and ex act birthdate s that are found in PGP profil es 25 . Since the inception of the H IPAA Rule in 2003, demogr aphic iden tifier s are th e subje cts of tight regulati on in the US health care syste m 26 . The SAFE HARBOR provision require s that th e maxi mal resolut ion of any dat e field, including b irth an d hosp ital admiss ion s , will b e in y ear s . In addit ion, the maxima l resol ution of a geogra phi cal s ubdiv isio n is t he fir st th ree digit s of a z ip co de ( as lo ng a s th ere ar e more than 20,000 l iving i n the reg ions that c orresp ond to the t hree dig it zip codes). Statistical analy s e s of the cens us data have fo und that the Sa fe Harbor p rovision provides r eas onable immun ity a ga inst iden tity tra cing assu min g tha t th e adv er sary has ac cess o nly t o de mog raphi c id entif ier s . The comb ina tion of sex, age, et hnic group, and state is uniqu e to less than 0.25% of the p opulat ions a cross all states 27 . An empirical stu dy eval uated the re - identificat ion of 15,000 r ecords of Hispanic patients in the Ch icago area that included year of birth, 3 - digit zip c ode, and mar ital status (ma rried/unmarr ied) by comparison to vo ting registry data 28 . The authors reported t he correct identificat ion of 2 o ut of the 1 5,000 recor ds and e stimated that less of 0.2 2 % the p opulation is ex pose d wit h th is se t of qua si - ident ifiers . Th ese stud ies sh ow th at wit h a cces s to onl y HIP AA r edacte d de mog raph ic quasi - ident ifier s, ident ity tr acin g is e xtre mely hard . Pedigree stru ctures ar e another p iece of m etadata that are incl ude d in m any genetic stud ies. T hese str uctur es conta in r ich in for mati on , especia lly w hen l arg e kin ship s are available 29 . T he number of offspring, the ir birth order, and other familial even ts such as r ema rriag e, crea te uniq ue co mbin ati on s of quasi - identifier s t hat q uickl y narrow down th e search space. A sy stematic stu dy analyzed t he distrib ution of 2,500 two - gen eration fami ly pedigrees that were sampled fr om obituari es of a town of 60,000 indiv iduals 30 . The pedigre es were uns orted, mean ing that only the number of male and female indivi duals in ea ch generat ion w as availabl e. Despite this l im ite d information , about 30% of the ped igree stru ctures were uniq ue, dem onstrati ng the large informat ion conten t that can be obtaine d from su ch data. Anoth er feature o f pedig ree s for ident ity tra cing is the co mbin at ion of quas i - identif ier s acr os s recor ds . F or example, it i s quite r are that a s urname alone can identi ty an indivi dual. However, the surname co mbination of a couple prior to their m arriage is an 7 extremely stron g ident ifier. In ad dition, once a single indiv idual in a pedig ree is ident ifie d, it i s e asy to link the identities of the other relat ives and thei r gen etic datasets. The main lim itat ion of iden tity trac ing usin g ped igr ee st ruct ure s is t heir low searchabil ity. Perha ps one notable except ion is Israel , w here the en tire popul atio n reg istr y wa s le aked to the w eb in 20 06 and al low s the con str uction of mu lti - generation fa mily t rees of all Israe li cit izen s 31 . But in general d ue to th eir low searchability, th e value of family tree s for re - iden tifi catio n is mo stly limit ed to manual verif ication of the potential iden tity of the targ et rather than a s tarting point of t he proc ess. IDENTI TY TR ACIN G BY GENE ALOGI CAL TRI A NGUL ATION Genetic genea logy att ract s milli ons o f ind ivid ual s int ere sted in the ir anc est ry an d discov er ing dist ant relatives. To that end, t he commun ity h as dev elo ped impr essiv e online platforms to sear ch for genet ic matches and connect indiv iduals. These onl ine resources can be ex ploite d to triangulate t he identit y of an unknown genom e. One potential route of iden tity tracing is surn ame infere nce f rom Y - ch romos ome data 32 , 33 . In most soci eties, s urnames are pa ssed fr om fath er to son , creating a tran sien t corr elat ion w ith speci fic Y chr omo some HAPLOTYPES 34,35 . The ad versary can take advantag e of the Y ch romoso me - surname correl ation and comp are the Y haplotype of the unknown genom e to ha plotyp e record s in re creational genetic genealogy dat abases. A close ma tch wit h a relatively short tim e to the most co mm on recent ancestor (MRCA) would signa l that the u nknown genome likely has the sa me surna me a s the re cor d in the database . The p ower o f surn ame in feren ce st ems f ro m expl oiting inf orma tion fro m dist an t patrilineal r elatives of the unknow n geno me. The ass o ciation betw een sur names and Y - c hromoso me s usua lly spa n s d ozen s of generations, imply ing that ev ery record in a geneal ogical dat abase is capabl e of reve aling the surn ames of hundreds to thousands of ma les . A recent e mp irica l stud y estimated t hat 10 - 14 % of US Caucasian males fr om the m iddle and upper c lass es are subj ect to s urname inference based on scanning the two largest Y - chr omo some g eneal og ical w ebsit es with a bui lt - i n search engine 33 . An i nferred surname has tremen dous p ower for iden tity tra cing. Indi vi dua l su rnames are relative ly ra r e in the populat ion and in most c as es a single surname is shared by less than 40,000 US mal es 33 , whi ch is eq uiv alent to 12 b its o f inf orma tion . In ter ms of i dent ifi cati on, suc cessf ul sur na me re co very is ver y cl ose to dete rmin ing an indi v idua l’ s zip code. A nother feature o f surna me inferen ce is that surn ames are highly searchab le. From public re cord search engin e s to social ne tworks, nume rous online resour ces offer sur name query inter faces, simpli fying the advers ary ’s efforts to complete t he triangulat ion. Surname inferenc e ha s be en ut iliz ed t o bre ach ge netic priv acy i n the p ast 36 - 39 . Several s per m don or con ceived ind ivid uals and adop tee s su cces sfu lly u sed this technique on their ow n DNA t o reveal t he sur names of the ir ance stor s, wh ich eventually lead t o the exp osure of their bi ological fa milies. Thi s te chniq ue cou ld also be applied to whole genome sequen cing dataset s. A recent study report ed five s ucce ssful sur na me inferences fr om Illumina dat asets of thre e large fam ili es tha t 8 were part of the 1000 G enomes pr oject , wh ich eventu ally expo sed the ide ntity of close to fifty res earch part icipants 33 . The ma in l imita tion of surn ame in fer enc e is th at haplotype matc hing r elie s on compa rin g Y ch ro moso me Sh ort Tand em R ep eats ( Y - STR s ). Curre ntly, most sequencing studies d o not rout inely rep ort thes e marker s and the ad versa ry w ould have to proc ess large- scale raw sequen cing file s with a spe cializ ed t ool , wh ich is both ti me and re sour ce cons uming and requ ire s bio infor mat ics exp erien ce 40 . Anot her co mpl icatio n is f alse id ent ific ation of sur na mes an d infe ren ce of s urn ames with spelling var iants c ompared to th e original surname. E l imina tin g incorre ct surna me h its n eces sita tes a cce ss to a ddit ion al quasi - identif ier s su ch a s ped igree structure and typically requires a few h ours of m anual w ork. Finally, the performanc e of surname inference v aries betwee n different soc io - ethnic group s based on no n - pat ern ity r ate s, so ciolo gica l n orm s of surn ame inh eritan ce, and acces s of the group to r ecreation al genealogy . An open researc h que stio n is t he uti li ty of non Y chr omos ome ma rker s f or genealogi cal triangulat ion. Website s su ch as M ito sea rch.o r g and Ged Match. com ru n open searc hable databas es for match ing mit ocho ndr ia l and autos oma l geno type s, respe ctive ly . Our expectation is that mitoc hondr ia l data wil l not be ver y informati ve for tracing ident it ies . The resol uti on of mit och ondr ial s ear che s is m uch l ower d ue t o its smaller size and the absence of highly polymorp hic markers like Y- STRs , meaning that a large number of individuals w ould share the sa me mitochondria l haplotypes. In add ition , mo st hu man socie tie s do no t ex er cise maternally inherited id enti fiers, reduc ing the u tilit y o f su ch sea rch es. A utosomal sear ches on the other han d mi ght be quite power ful. Geneti c genealogy co mpanies ha ve started to market servi ces for dense geno me - wide arra ys tha t enable relative ly sufficien t accur acy to ide ntify distant relat ives on the order of 3 rd to 4 th cousins 41 . The se hit s would reduce t he search spa ce to no more than a few thousand indiv iduals 42 . The main challenge of this ap proa ch wou ld b e translat ing the gen ealogical mat ch to a list of po ten tial people. But with the grow ing interest in genealog y , this techn iq ue might be easi er in the future and should b e taken into con sideration . IDENTI TY TRAC ING BY PHENOTYP IC PRE DICT IO N Severa l reports on genetic privacy envisione d that phenotypic predictio ns from genetic data could s erve as qu asi - identif ier s f or id entit y tra cing 43 , 44 . Twin stu die s have est imate d h igh h erit ab ilitie s f or var iou s v isibl e tr aits such a s heig ht 45 and fa cia l morpholog y 46 . In addit ion, recen t studies showe d that age pr ed ict ion is possib le from DNA spe cimen s der ived from bloo d samp les 47 , 48 . But t he appl icab ility o f the se DNA -deri ved quas i - identifier s fo r ide ntity tra cing has yet to be de monstrated . The major li mitat ion of phen oty pic pr ed iction is th e fast decay of the id entif icat ion power wi t h small inference error s ( Box 1 ) . C urre nt genetic knowl edge explains only a small exten t o f t he ph en otyp ic var iab ility of most visib le t rait s, su ch a s he ight 49 , BMI 50 , and face morp hology 51 , signif ican tly l imit ing thei r utility for identification. For example, p er fect knowledg e about height at o ne - centimeter re solut ion conveys 5 bits o f infor mation . Howev er, with curr ent gen etic kn owledge that explai ns 10% of he ight v aria bil ity 49 , the adversary learn s only 0.1 5 bits of in format ion . Pre dictio ns of most of the other vi sibl e tra its are even worse , implyin g that the ir ut ility as quas i - 9 identifier s wo uld be quite low. The e xcept ion s in visibl e tr aits a re e ye c olor 52 and age pred iction 47 . Re cent stu die s show ed a pr ed ict ion a ccur acy o f 7 5% - 90% of th e phenotyp ic variabi lity of these traits . But eve n these su ccesses translate to approximately 3 - 4 bits of in format ion. Another c hallen ge for phenot ypic pre diction is the l ow search ab ility of most of t hese t ra its. There ar e no pop ulation - based registrie s of height , eye color, or f ace morpho logy and t he adversary would hav e to invest sub stan tia l eff ort s to comp ile s uch a reg ist ry . However, wit h the adv ent of new t ypes o f soc ial me dia, this b arr ier might b e le ss sig nif ican t in the fu tur e. IDENTI TY TRAC ING BY SIDE - CH ANNEL LE AKS Side channel attacks exploit quas i - identifier s that are unintenti onally enco ded in the database build ing blocks an d structure rat her than the actua l data that is meant to be public . A good exa mple for such leaks i s the ex posure of t he full n ames of PGP participant s from 23and Me filenames 25 . The PGP allowed part icipan ts to uploa d 23andMe geno typing files t o their public pro file webpages. The de fault convent ion of these 23 andMe files in cludes the first an d last n ame of the user. A s par t of the upload proc ess, the PGP websit e automati cally compre ssed th e file, name d it wit h the PGP identi fier of the user , and presen ted a link that sh owed the new f ile name that d oes n ot in clud e th e fir st and last n ame s. H owever, aft er downlo ading and deco mpre ssing the 23andMe f ile, the original f ilename app eared. S in ce mos t of t he users d id not cha nge the defa ult nami ng conventio n, it was po ss ib le to trace the identity of a larg e numbe r of PGP profil es. Based on this experien ce, the PGP now forces the part icipant to re name f il es be fore uploading and warns th em that t he file may contain h idden infor mation that c an expose th eir identit ies. Rich da ta files embed multiple layer s of hidden information that pr ovide ample opportun ities for l eakag e of quasi - ident ifier s. P hoto files typically embed Exchangeable I mage File Format (EXIF) fi elds that can include GPS data ab out the location of the p hoto or th e ser ial number of the c amera 53 . This in for matio n c ou ld convey po tential l eads even i f the p hoto it self does not discl os e any se nsit ive informat ion . Microsoft Office pro duct s ty pic ally em bed t he a utho r n ame and cont ain previous rev isions of t he document t hat show d eleted tex t 54 . In general, flat tex t files are the most i mmune format to th ese types of l eaks of un intentiona l content. The mechan ism to generat e database acce ssion number s can also leak pers ona l information. Id eally, t he se numbe rs sh ou ld b e completely r andom but ex perience has h ighl ight ed t hat somet ime s the se numb er s u nintentiona lly reveal residual informat ion due to non - random assignments . For exampl e, in several top medial data min in g cont ests, the acce ssio n n umb ers un intenti onall y re vealed the d isease status of the patient , which was the aim of the co ntest 55 . Another ex ample is the non - rando m as signm ent of So cial Secur ity Nu mber s (SSN ) in the US. P attern analysis of a large amount of p ublic d ata rev ealed te mporal an d spatial comm onalit ie s in the ass ign ment sy ste m that allowed predictions of the SSN from quasi - ident ifier s 56 . S ome sugges ted the assign ment of accessi o n numbers b y applyi ng CRYPTOGRAPHIC H ASHING to the participan t identifier s such a s name or social se curity numbe r 57 . However, this te chnique is extr emely vulne rable to DICTIONARY ATT ACKS due to the relatively l ow sear ch space o f the in put. In 10 general, it is advisable to add some sort of ra ndomization to pro cedure s th at generate acce ssion numb ers in o rder to prevent misu se. ATTRIB UTE DI SCL OSURE ATTACK S VIA DNA (ADA D) In ADAD, the adve rsary create s a statisti cal bri dge that uses DNA data to lin k sensitive attribute s wit h the identity of a person. The f irst piec e of infor mation is a DNA sa mple fr om an i den tifie d target. This c an be ach ieved by successful compl etio n of an ide nt ity tracin g attack, expl oiting id enti fie d DNA data in p ro jec ts such a s Open SNP , gaining internal acces s to restricted d atabases , or simply b y obtaini ng a DNA sample directly fro m the tar get . The second p iece of informat ion is DNA derived dat a that is asso ciated wit h sensitiv e information , such as dis ease, perso nality t rait s, or so cio - economic st atus, whi ch does not otherw ise c ontain explicit iden tifiers. The main differen ce between the A DAD att ack s is t he t ype o f DNA derived dat a that is a ssociated w ith the sensit ive attribute. ADAD : THE N =1 S CENAR IO The simplest scen ario of ADAD is when the sens itive attribute is associated with the genotype dat a of the indiv idual. The advers ary can si mply mat ch the genotype data that is as sociated w ith the id entity of the ind ividual an d the gen otype data that is associated with the attribute. Such an attack requi re s only a small number of autosomal SNPs. Emp irical data showed that a carefully chosen set of 45 SNPs is suffi cient to p rovid e m at ches wit h a TYP E I ERROR of 10 - 15 for most of the major populat ions acro ss the glob e 58 . Moreover, it is expected t hat ran dom su bsets of approximately 300 common SNP s would yi e ld s ufficient in form ation to un iquely match any pers on 59 . With the l ow nu mber of SNP s req uire d for mat ching , ind ivid ual level gen otypes - phenotyp e records in genome - wide a ssoc iati on stud ies (G WAS) are high ly vulnera ble to ADAD. In order t o address this is sue, several organi z atio ns, in clud ing the NIH , ad opted a two tier acce ss sy stem for GWAS datasets : a restricte d ac cess area that st ores indiv idual level g enotype s and phe notypes and a publ ic acce ss are a for high level data summary statistic s of allele frequenci es of all cases and controls 60 . The pr emis e of t his d isti nctio n wa s that summ ary sta tisti c s enable seconda ry data usage for meta - G WAS a na lysis w hil e it wa s tho ught that thi s type of data is i mmun e to ADAD. ADAD : THE SU MMA RY ST ATIS TIC SC EN ARIO A landmark work by Homer et al. r eported th e possibil ity of ADA D o n GWAS datasets that only cons ists of the al lele frequen cies of the study parti cipants 61 . Th e underlying c oncept of t heir appr oach is t hat, with the target gen otypes in the cas e group, the av erage allel e frequen cies w ill be posi tively bia sed toward s the target genotypes compared t o the est imated M AF from t he gen eral populat ion. Conv ersely, when t he target is not part of the study, the av erage allele freq uencies will be 11 negatively b iased compa red to the target genoty pe s . A good illu stration of this concept is cons idering an ex tremely rare variat ion in the subject’ s genome. Non - zero allele frequ ency of this varia tion i n a small - scale st udy incr ease s t he lik el ihood that the targ et wa s part of the study , wherea s zero a llele freq uency strongly r educes this likelihoo d. Hom er et al. showed that by integra ting the slight bias es in t he allele frequen cies ov er a large number of SNPs it is als o poss ible to conduct A DAD with the common var iations th at are analyzed in GWAS. Subsequent st udies exten ded the range of ex ploitat ions for summary statistics. One line o f st udie s imp ro ved the t est stat isti c in the orig in al Homer et al. wo rk and analyzed its mathe matical properties 62 - 64 . Under th e as sump tion of co mmo n SNP s in LINKAGE EQUILIBRIUM , t heir improv ed te st st atis tic is m athe mat ical ly guar anteed to yield maximal POWER fo r any SPECIFICITY level. Wang et al. went beyond allele frequ en cies an d de mon stra ted th at it is poss ible to ex p loit l ocal LD st ruc tures for ADAD 65 . Their test stati sti c score s the c o - app earanc e of two SNP alleles in the target genome with t he bias of LD structure in a G WAS study ver sus the g eneral populat ion. The p ower of thi s approa ch stems from scaven ging for the co - occurren ce of two mildl y uncom mon alleles in diffe rent hap lotype blocks that together create a rar e ev ent. They r eported a pow er of 80 % and sp ecificity of 95% for ADAD on a G WAS with 20 0 samples t hat exp loited the LD structure of 17 4 comm on S NPs in t he F GFR2 locu s. Wit h the same numb er o f SNP s, ADA D met hod s that use only alle le frequen cies yield an ex pected p ower of 24% f or the sam e specificity lev el under th e most opt imal scenar io. Im et al. devel oped a me thod to exploit the EFFECT SIZ ES of GWAS st udie s of q uan titat ive t rait s to de tect the presence of the targ et 66 . Different fro m ADAD with allele freq uencie s, the detection performanc e is better for parti cipants with ex treme p henotype s and worse f or participant s with average phenotyp es. A pow erful development o f this ap proach i s expl oiting GWA S stud ies t hat ut ilize t he sa me co hor t for mul tip le ph enot yp es. Th e adversary rep eats the iden tification proc ess of the targ et with the effect size s of each phenot ype and inte grates the m to boo st the ident ification perfor mance. Afte r determining the p resen ce of the target in a quantitative tra it study, the a dver sary c an further expl oit the G WAS data t o predict the phenotype s with hig h accuracy 67 . This meth od wo rks b y si mply corr elat ing the D NA of the t arg et wit h th e eff ect s ize s and takes a dvantag e of the spurio us assoc iations when regre ssing a large number on markers wit h a single phenoty pe. The theoreti cal performa nce of ADAD is a compl ex functi on of the siz e of the study and the general pop ulation 68 , 69 . On one hand, in any of the tec hniques above, studies with smaller nu mbers o f participant s generate more apparent b iases in t heir summ ary st atist ic s, whi ch incr ease s the p ower an d speci ficity of the AD AD disc rimina tion ( Figure 2A ). On the other h and, a target draw n randomly fr om the general populat ion has a low er a - prior i pr obab ility of having par tic ipated in a stud y with a smaller number o f participant s. This mean s that ADAD on smaller studie s needs t o w ork w ith h ig her sp ecif icit y to a ch iev e the same PRECISION of larger studies, r educing t he pow er of th e attack and the n umber of peop le at r isk ( Figure 2B ). In any case, t he perform ance and risk incr ea se w hen t he bas e pop ula tion is smaller, such as th e Amish or Hutterite pop ulation s, or when the meta - in formation enables strati fication of th e general popul ation ( Figure 2C ). The actual ri sk of ADAD on summary data ha s been th e subject of debate . Following the original Ho mer et al. study, the NIH and other dat a custodian s moved t heir 12 GWAS s umma ry st at istic data f rom p ubl ic dat abas es to a cces s con trol led d at abase s such as dbGA P 70 . A ret ro spect ive analy si s foun d t hat sign ifica ntly few er GWA S studies publicly releas ed their su mmar y stati stic d ata 71 . Most of the studie s publish summary st atistic d ata on 10 - 500 SNPs , whi ch is compat ible with on e sug gest ed guideline to m anage risk 67 . Some warned that t hese p olicies are too ha rsh 72 . There are several pra ctical co mplication s that the adv ersary need s to overco me to laun ch a successful atta ck, such as ac cess to the target’ s DNA data 73 , access to a larg e reference d atabase to asses s the gen eral populat ion freq uency d ata, and a ccurate matching between the ancestr ies of the target w ith those listed in th e refe rence database 74 . Failure to addres s any of these prere quisite s can severely im pact the performanc e of the A DAD. In addition, for a range of GWAS stud ies, t he associat ed attributes are not s ensitiv e or private (e.g. height ) . Thus , even if ADA D occurs, th e impact on t he par tic ipan t sho uld b e min im al. A r ec ent NIH work sh op pr opo sed t he release of su mmary stati stics as the defa ult polic y and developing an exempti on mechan is m for stud ies wi th in crea sed r isk d ue t o the sen sit ivity of t he att ri bute or the vulnerab ility level of t he summary data 75 . 13 ADAD : THE EX PRES SION D ATA S CENA RIO Public databas es such as GEO hold h undred s of thousands of gen e expre s sion profiles of indiv iduals th at are linked to a ran ge of medical attrib utes. Scha dt et al. proposed a poten tial rou te to exploit the se profile s for ADA D 76 . The metho d starts with a t rain ing step that employ s a standard EXPR ESSION QUANTITATIVE TRA IT LOCI (eQTL) analysis with a reference dat aset. The goal of t his step is to id entify several hundred strong eQT Ls and to lear n the distr ibutions of the ex pressi on level for each genotyp e . Then , the algor ith m scan s t he publi c exp re ssio n pro fil es an d calcul ate s the prob abil ity distr ibuti ons o f the genotypes of th e eQT Ls . La s t , t he algorithm matches the target’ s genot ype with the inferred allel ic distribu tions of each expre ssion profil e and test s the hypot hesis th at the mat ch is rando m. If the n ull hypothesis is reject ed, the a lg orithm link s the identity of the targ et to the medical 14 attribute in t he gen e expr ession exp er iment. T his ADA D te chn ique has the potential for relatively h igh a ccura cy in ideal condit ion s . Th e method per fectly match ed 580 indiv idual s wit h th eir exp ressio n pr of iles w hen t he tr ain ing w as con duct ed o n a distin ct dat aset . B ased on larg e- sca le simul at ions, they further predicted t hat the method can reac h a type I error of 1x 10 - 5 with a power of 85 % when test ed on an expression dat abase usi ng the entire US population . There are sever al practical limitations to ADA D via expr ession data. While the traini ng step an d inference st eps are capab le of work i ng with expr es sion profil e s from differ ent tiss ues, the method r eaches it s maximal p ower w hen the tr aining and infere nce util ize eQ TL f rom t he sa me t issue . Mor eo ver, there is a sign ifican t l oss o f accur acy w hen th e exp ress ion data in t he train ing pha se i s col lect ed u sing a different tec hnology tha n the expression dat a in the inferenc e phase. A nother compl icat ion is th at in ord er t o full y exec ut e the t echn ique on a large databa se such as GEO, the adversary w ill need to man age and pr ocess large - scale expression dat a . Due to these pra ctical barrier s, t he NIH did no t issu e any changes to their policies regarding sh aring expre ssion data fr om human sub jects. COM P LET ION A TTACK S Com ple ti on of genet ic informat ion fr om p arti al data is a well - studied task in genet ic studies, called g enotype imputat ion 77 . This method t akes advantage of the LINKAGE DISEQUIL IBRIUM between mark ers and use s reference panel s with complet e g enetic info rmation t o restore missing g enotype v alues in the data of inter est. T he very same strateg ies enable the adver sary to expose certa in region s of interest where only partial acces s to the DNA data i s available . One publici zed case of a compl etio n att ack w as th e infer en ce of J im Wa tso n’s r isk fo r Alz hei mer' s dis eas e. Watso n opted to publish his ent ire ident ified geno me sequence ex cept data from hi s ApoE gene , which is a ssoc iated wit h Al zhe imer ’s diseas e 78 . N yholt et al. restored th e ApoE status using i mput ation wit h marke rs t hat are 15Kb away fr om the mask ed site 79 . As a re sult of t h e s tudy, a 2Mb segment ar ound the Ap oE gene w as remove d from Watson’s pu bli she d genome. In some cases, c ompletion tec hniques al so enable the pred iction of gen omic sequences when there is no access to t he DNA of the target. T hi s techn iq ue is possi ble when t he refer ence pan el is c ombined with geneal ogical informat ion 80 . The algorithm fin ds relatives of the t arget that d onated their DNA t o the refer ence panel and that r eside on a unique p ath that includes the targ et, for example a pair of half - first co usin s when the ta rget is their gran dfather . A shared DNA seg ment betwee n the relatives indicates tha t the target had the same segment. B y scanning mo re pairs of relative s that ar e conn ected thro ugh the targ et, it is possible t o infer t he two copies of autos omal l oci an d coll ect more gen omi c i nforma tion on the tar get witho ut any a ccess to its DNA. B uilding on t he deep geneal ogical reco rds in Icelan d, deCode Genetics w as abl e to leverage t heir large referen ce panel to infer genet ic var iants of an additional 200 ,000 liv ing individual s who never do nated their DN A to the company. While thi s technique i s mostly relev ant to targ ets with a large nu mber of decedents and can b e ex ecute d in o nl y a narrow rang e of scenarios, it emp hasizes the c ompl exit ies of gen etic p riv acy . I n May 2013, I celand's Dat a Pro tection 15 Authority prohibited the us e of this te chn ique un til consent c an be obtained f rom the ind ivid ual s who are not par t o f the origi nal r eference panel 81 . MITIGA TION TEC HNIQUE S Most of the genetic pr ivacy b reaches presented above are quite soph isticated . They require a ba ckgr ound in g enetic s and stat isti cs an d -- importantly -- a motivate d adversary. On e school of thought posit s that th e se pract ical co mple xit ies al most eliminate the prob ability of an adverse event and therefore att enuate the risk to negligible l evels for most stud ies 82 , 83 . A ccord ing t o th is a ppro ach , an appropriate mitig ati on stra teg y i s just re mov i ng very obvious identifiers fro m the datasets befor e pu bli cly sh arin g t he infor mat ion. In th e f ield of c omp uter secu rity , t his r isk management st rat egy is ca lled secur ity by o bscu rity . Thi s app roac h is simp le t o implement an d pose s minima l burden o n data d iss eminatio n. Th e op pon ents of securi ty by obscu rity posit that risk manag eme nt sche mes based on the probabil ity of an adverse event ar e fragile and s hort last ing . Accor din g to t heir vie ws, techno log ies only get bet ter an d w hat is te chn ical ly chall enging b u t possible today will be much easier in the fut ure. Theref ore, the p robabilitie s of adve rse ev ents are non - computable and irrelevant 84 . Known i n cryptography as Shannon’s maxi m 85 , this sc hool of thought assu mes t hat th e adv ersary exists an d is equi p ped with the knowledge an d means to ex ecute the breach. R obust data protect ion, therefore, is achieved by explicit design of t he data access protocol rat her than by the actual chance of a br each 86 . Th is section survey s t he main sec urity by design schemes and their relevance to prote ct ing genetic data. ACCES S CONTRO L One approach to m itigate the chan ce of a privacy br each is t o plac e the s ensit ive data in a secure locatio n and sc ree n the legit imacy of t he appli cants an d their research proje cts. Once approval is made , the applicants are allow ed to dow nload the data under t he con ditions that th ey will store it in a secure locat ion and w ill not attempt to identify individuals . In additi on, the ap plicants shoul d be required to fil e periodic rep orts about the data usage and any adver se events . This app roach i s the co rners to ne of th e acc ess - controlled dbGAP 60 . Based on per iodic rep orts of th e users, a retro spect ive a naly sis of dbGA P access control has ident ified 8 data manag emen t in cide nts in cl ose t o 750 stud ies , mo stly non - adh erenc e to t he techn ical regu lat ions , and no reports of breachi ng the pr ivacy of participants 87 . Despite the ab sence of priv acy breache s t hus far , some have criti cized the fact that access control create s an illusion of se curit y 88 . Once th e data is in the han d of the applicant, th ere is n o real ov ersight of how it i s being st ored, the act ual work, an d wh at exactly is published . To add ress t hes e limit at ions, an alternative appr oach is the tru st - but - ve ri f y model , where the user cannot download the ra w da ta bu t m a y execute cer tain types of querie s that ar e recorded and mon itored by the syste m 89 . S upporters of this model state that mon itor ing has the potential to dete r mali cious user s f rom ac cess ing the data and facilitate s early detection . Anoth er developmen t based on thi s app roach is enfor cing the users and data custodians to have ‘skin i n the game’ 90 , by addi ng penal ties beyo nd d enyin g access to the re sour ce in case o f 16 misu se . The main down sid e of a cces s contr ol is that any of the models lis ted above require const ant manage ment of the resourc e and create admin istrative bu rden to both da ta cust odi ans and users. DATA AN ONYMIZ ATIO N AND AGG REG ATION The p remise of ano nymity is the ability to be ‘los t in the cr owd’. One lin e of stud ies sugge sted restor ing anonymity by restricti ng the gran ularity of the quasi - identif ier s to the point that each rec ord in the dat abase is not uniqu e. A popula r heu ristic is k - a nonymi ty 91 . Usin g th is a ppr oach , the quas i - identi fie r s are binned such tha t each subje ct’ s record is identical to that of at least k - 1 rec ords from other in d ivid ua ls in the data set . T o maximiz e the ut ility of the dat a for subs equent ana lysis, the binning proce ss is ad apt ive . C ertain records will hav e a lowe r resolution depend ing on the distribut ion of t he ot her record s and certa in data c ategories t hat are to o uni que are suppressed entirely . There is a strong trade - off i n the selection of th e value o f k; high values incre ase the s ize of the ba ckground cro wd but at the same t ime reduc e the utility of the dat a. As a r ule of t humb , it w as re co mmen ded to se t k≥5 ( 92 ). More recent work showed that w hile k - anony mity p rot ects a gain st id ent ity tracin g technique s, it is vulnerab le to attribut e disclo sure, especially w hen the adversar y has a certain level of p rior knowledge about the presence of the targ et in the database 93 . Subsequent st udies devel oped more elab orative redact ion techni ques to addr ess the se issues 93 , 94 . These an ony miza tion te chniq ues h ave b een mainl y succe ssful in s afeg uar din g demog rap hi c iden tif iers in med ica l re search . Ho weve r, attempts to adopt the se techn iques to DNA re search are yet to be practi cal 95 . Th e high d imen sion alit y o f DN A da ta d ictat es t hat mo st of the re cord s w ill be uni que and it is not clea r how the dat a can b e reda cted wit hout dest royin g it s val ue for secondary analy sis. Differenti al privacy offer s a distin ct approach to restore anon ymity by produci ng summ ary st atist ic s after sophisticated data p erturbation 96 . It aims to ensure that summa ry stat istics of two data sets that differ by exactly on e individ ual’s record ar e extremely cl ose to ea ch other. This way , the adver sary cannot be sure wh ether the target was part o f the dataset or not and there fore cannot learn sensitive attribute s. The c hallen ge in d iffer en tial p riv acy alg or ithm s is t o min im ize t he p ertu rbat ion while sat isfy ing t he privacy property so that the s ummar y st at istic will stil l con vey useful information on the pop ulation a s a whole . Diff erential priva cy ha s gained popul arit y in c omp uter s cien ce an d stat ist ics a s a v er y v ibrant researc h area and t he US Cen sus Bur eau use s this techniqu e for their O nTheMap tool 97 . Early attempt s have made prog res s tow ards pr ote ctin g GWA S data u sing thi s app roa ch 98 , 99 . Currently, t he main li mitation is t hat the amount of perturbation that needs to be added to the summar y st atist ic grows linearly w ith the numb er of expos ed SNPs, whic h quickl y abol ish es the ab ility to detect fine asso ciations in meta - anal ysis. Whether or n ot ther e is a way to add much smalle r amounts of noise in a w ay that still main tain ing p rivacy fo r GWA S dat aset s rema in s an op en q uesti on. 17 CRYP TOGR APHI C SO LUTI ONS M odern cryptography brough t new advanc ements for data dissem ination beyo nd the tr adit ion al u sage of encry pt ing se nsit ive fil es a nd d istr ibut ing the k ey t o aut ho rized us ers. S ecure mu ltipart y co mputatio n ( SMC) allow s two or more en tities who each have some pr ivate data to exec ute a com putation on t hese private inputs without revealing the input to each other or d isclosing it to a third party. In one class ical e xam ple of S MC , AL ICE a nd B OB can determine who is r icher wit ho ut eit he r one reveal ing th eir ac tu al weal th to the other. Re sear cher s hav e con stru cted SMC protocols in vari ous domai n s— from voting 100 to location - b ased serv ices 101 . In the area of gen etic data, on e line of w ork has developed S MC algorithms fo r genetic m atching. Br uekers et al. presented a priva cy - pre serv ing a lgor ith m to match STR profi les between two partie s without ex posing th e actual genet ic data 102 . Bohanno n et al. suggested searchab le genetic database s for for ensic p urpo ses tha t allow only going from genetic data to identity but not from identity to g enetic data 103 . In t heir sc hem e, the records in the database s a re encrypted w ith the individual’ s genotype a s the key. To toler ate genoty ping err ors or missing dat a, they utilize a fuzzy e ncryption sc heme that c an use a k ey that o nly approximately match es the original one. This way, only acce ss to the ge noty pe inf orma tio n c an reveal the identity but not the opposite . Along simil ar lin es, C ri stofar o et al . constr uct ed cryptographic pr otocol s for pr iva cy - preserving patern ity tests a nd gene tic c omp atibi lity test s 104 , albeit for molecul ar techn iques that are n o lon ger in 18 use , such as R FLP . They also pre sented a smart phone - base d i mple menta tion of th ese proto cols 105 . The perform ance varie s dramat ically between t asks that examine only a few loci and those that depend on th e whole genome. The f ormer com plete in under a second and t he latter take day s of computation and gigabytes of ban dwidth, rendering the m impract ical at the curren t time. In anoth er direct ion , Kamm et al suggeste d a se cure mult i - center GWAS an alysis 106 . In the ir p roto col, each cen ter d eploy s a secr et sh arin g sc heme on it s own c ollect io n of su bjects’ phe notypes and geno types th at divides th e data into small shares, ea ch of which reveals not hing about the or iginal value s on its own. The s hares are t hen sent to the other centers , whic h store t hem in de dicat ed se rve rs. The se r vers ha ve an interface that all ows outsider s to initiate a GWAS study on p henotypes and genotypes of interest. Upon request , the serv ers coord inate to perf orm th e assoc iat ion wit hout reco nst ruct ing t he orig ina l gen otyp es or ph enoty pe s an d only report in plain te xt the sign ifica nt SNP s. A potent ial shor tco min g of the ir app ro ach is that , at least theoreti cally , the end produ ct plain t ext is still v ulnerable to ADAD on summ ary st atist ic d ata, r en derin g the s olut ion f ar fr om co mple te. Another lin e of cry ptogr aphic work looks at priv acy - pre serv ing out sour cing o f compu tat ion s on g enetic infor matio n usin g h omo morp hic e ncr ypti on 107 ( Box 2 ) . Th e conc ept of this appro ach is that , with adven t of ubiqu itous usage of genet ic data, users (or phy sicians) will interact w ith a variety o f genetic inter pretation s ervices (e.g. p romet hease .com) throughout their l ives , whic h increases the chan ce of a genetic privacy breac h. Under th is cryptograp hic wor k, users send an encrypted version of t heir gen ome to the cloud. T he inter pr etat ion se rvic e can ac ces s the cl oud data but do es not have th e key and theref ore cann ot rea d the pl ain gen otype values . Instead, th e int erpr etat ion serv ice ex ecutes the al gebraic operat ions of its genet ic risk pred ict ion a lgor ith m on the encry pt ed ge no type s with out in spect ing the plaintext . After complet ing the algorithm, t he user grabs the cyphertex t re sult s f rom the cloud . Due to th e special mathemat ical proper ties of th e underly ing crypt osy stem, the u ser si mply de cr ypt s t he res ults to obta in h is r isk p red icti on. T his way the user doe s not ex pose any of his g enotype s or disease su sceptibilit y to the ser vice provi der. T he c urr ent scope o f r isk pred ict ion m odel s i s still limite d but this approach is q ui te amenable to futur e impr ovement s. CONCLUSION The i nvention of asy mme tric cr ypt ogra phy in the 1970 ’ s led to a revolution in secu re com munica tion . T oday, a wide v ariety of Internet tra nsactions build up on these security measures in ways that are co mpletely tran sparent to the av erage user. D at a pr ivacy still a wait s a simi lar breakthr ough . The status quo has greatly shift ed in the la st f ew y ears, wit h a torr ent of st udie s showin g th at a mo tivat ed, techn ical ly - so phi sti cated adversary i s capabl e of ex ploiting a wide range of g enetic data for unintend ed pur poses . With the constant inno vation in gene tics and the explosion of onl ine inf ormation , we can expect that n ew privacy breaching techn ique s wi ll be discovered in the next few year s . Rest oring the statu s qu o with techn ical mean s wil l ne cess itat e large strides in t he theory and impl ement ation of mitig ati on alg orit hms . S o me of the approaches, particularly ac cess co ntrol, have 19 bee n qui te us eful . But s o far, mit igati on sc he mes are resou rce and time con suming for both th e data custodian an d users. Due to both te chnical and huma n fac to rs 108 , t he privacy field has yet to come up with a set of methodologie s of c ompa rab le impact to co mmu nica tion secur ity. Succe ssful balan cin g of privacy de mand s an d data shar ing is n ot re stri cted t o technical m eans 109 . Balan ced inf ormed con sent s outli ning both be nefits and risks are key ing redients for mainta inin g lo ng - last ing cred ibil ity in ge net ic re sear ch. Wit h the active engag ements of a wide range of stakeholder s from t he broad gen etics comm unity and the gen era l pub lic, we a s a s ociet y co uld devel op socia l an d et hica l norms, legal framework s, and edu cational pr ograms to re duce the c hance of misuse of genetic data d espit e t he in abil ity t o theoretically prevent privacy brea ches. GLOS SARY SAFE HARBOR : A st andard in th e HIPAA Rule f or de - identification of pr otected health inform ati on by rem oving 18 t ypes of quasi - identifiers. HAPLOT YPES : A set of allele s along the same chr omosome . CRYPTOGR APHIC H ASHING : A procedu re th at yiel ds a fixed le ngth out put from any si ze of input i n a way tha t is ha rd t o de ter mine the inp ut fro m the out put. DICTIONARY AT T ACK S: A b rute force approach to rev erse cryptographic hashing by sc anni ng the rela tively small i np ut sp ac e. TYPE I ERROR : The p robability to obtain a p ositive ans wer fro m a negative item. LINKAGE EQUILIBRIUM: Absence of correlation between th e alleles in two loci. POW ER: T he probability to ob tain a positive ans wer for a positive ite m . SPECIFICITY: T he pr obability to obta in a nega tive a ns wer f or a negative ite m . EFFECT SIZES: In q uantitative traits, t he contributio n of a certa in allele to the value of the trait. EXP RESSION QUANTITATIVE TRAIT LOCI: Genetic variants associate d with variabilit y in gene e xpr essio n. LINKAGE DISEQUILIBRIUM: The correlation betw een alleles in two loci. ALICE AND BOB: C ommon placeh olders in cry ptog raphy to den ote part y A and party B. ACKN OWLEDG EMENTS YE is an Andria and Pau l Heafy Family F ellow and holds a Ca reer Awar d at the Scientifi c Inter face fr om t he Burro ughs W ellco me Fun d. This study w as als o suppo rted by a gift from Cath y and Jim S tone. The authors than k Dina Zielin ski and Melissa Gymr ek for use ful comment s and Shriram S ankarar aman for his n ice introd uction betwee n the authors. COMP ETIN G INTE RES TS STATEM ENT None. REFEREN CES 20 1 Fu, W. et al. Ana lysis of 6,515 exo mes reveals the recent origi n of most huma n pr ote in - coding variants. Nature 4 93, 216 - 220, doi:10.1038/nature11690 (2013). 2 Geno mes Projec t, C. et al. An int egrated ma p of genetic v ariation from 1,092 h uman genomes. Nature 491, 56 - 65, doi:10.1038/nature11632 ( 2012). 3 Roberts, J. P. Million veterans s equenc ed. Nat Biot ech 31, 4 70 - 470, doi:10.1038/nbt0613 - 470 ( 2013). 4 Drmanac, R. M edicine. T he ul timate genetic t est. Scie nce 336, 111 0 - 1112, doi:10.1126/science.1221037 (2012). 5 Burn, J. Sh ould we seq uenc e ev eryo ne's genom e? Yes . Bmj 3 46 , f3133 , doi:10.1136/bmj.f3133 (2013). 6 Kaye, J., Heeney, C., Ha wkins, N., de Vries, J. & Bodd ington, P. Data sha ring in genomics -- re - sh aping scie ntific practic e. Nat R ev Ge net 10, 331 - 335, doi:10.1038/nrg2573 (2009) . 7 Park, J. H. e t al. Estima tion of effect s ize dist ribution fro m genome - wide associa tion studies and implica tions for future discoveri es. Nat Genet 42, 570 - 575, doi:10.1038/ng.610 (2010). 8 Chatterje e, N. et al . Projectin g the per formance of risk pr ediction ba sed on p olygenic analys es of genome - wide a ssociation studi es. Nat Genet 45, 400 - 405, 405e401 - 403, doi:10.1038/ng.2579 (2013 ). 9 Friend, S. H. & Nor man, T. C. M etcalfe's la w and th e biol ogy informa tion com mons. Nature biot echnology 31 , 297 - 303, doi:10.1038/nbt.2555 (2013). 10 Rodriguez, L. L., Brooks , L. D., Greenberg, J. H. & Green, E. D. R esearch ethics. The complexities o f genomic ide ntifiability. Science 339, 275 - 276, doi:10.1126/science.1234593 (2013). 11 Car e, I. o . M. U . R. o . V. S . - D. H. in Clini cal Data as the Basic Staple o f Health Learnin g: Creating and Protecting a Public Good: Worksh op Summary The National Academies C ollection: R eports fund ed by N ational Insti tutes of Health ( 2010). 12 McGuire, A. L. et al . To share o r not to sha re: a randomi zed trial of consent fo r data sharing in genome r esearch. Genetic s in medicine : officia l journal of the American College of M edical Gene tics 13, 94 8- 955, d oi:10.1097/GIM.0b013e3182227589 (2011). 13 Oliver, J. M. et al. Balancing th e risks and benefi ts of genomic data s haring: genom e research participa nts' pe rspectives. Pu blic Health Genom ics 15, 106 - 114, doi:10.1159/000334718 (2012). 14 Schwartz, P. M. & Solov e, D. J. Reco nciling Perso nal Infor mation in th e United Stat es and European Union. S SRN Electronic Journal, doi:10.21 39/ssrn.2271442 (2013). 15 El Emam, K. Heuristic s for De - identi fying Heal th Data. Securi ty & Privacy, IEE E 6, 58 - 61, doi:10.1109/MSP.2008.84 (2008). 16 Lunshof, J. E., Cha dwick, R., Vorhaus , D. B. & Church, G . M. From g enetic privacy to open c onsent. Nat Rev Genet 9, 406 - 411, doi:10.1038/nrg2360 (2008). 17 Bren ner, S. E. Be pr epa red fo r the bi g gen om e lea k. Na ture 49 8, 139 , doi:10.1 038/498139a (2013 ). 18 < http://www. privacyrigh ts.org/da ta - br each > 19 Scambray, J. M. S. K. G . Hack ing exposed n etwork secu rity secrets & s olutions , < http://search .ebscohost.c om/login.a spx?direct=t rue&scope=sit e&db=nl ebk&db=n labk&AN=70568 > (2001). 20 Solove, D. J. A Ta xonomy of Priva cy. University of P ennsyl vania La w Review 154, 477 (2006). 21 Oh m, P. Brok en Promis es of Privacy: R esponding to the Su rprising Fa ilure of Anonymization. UCLA L aw Revie w 57 (2010). 22 Golle, P. i n Proceedings of t he 5th ACM workshop on Priva cy in elect ronic soc iety 77 - 80 (ACM, A lexandria, Virginia, USA , 2006). 23 Swe eney, L. A. Simple Demogr aphics O ften Iden tify Peo ple Unique ly. (20 00). 21 24 Greely , H. T . The une asy ethi cal and le gal und erpi nnings o f large - scale g enomic biobanks. A nnual review of genomics and hu man genetics 8, 343 - 364, doi:10.1146/annurev.genom.7.08050 5.11572 1 (2007). 25 Sweeney, L . A., Abu, A. & W inn, J. Id entifying Par ticipants in the Perso nal Genom e Project by Name (2013). < http://da taprivacyl ab.org/pro jects/pgp/102 1 -1.pdf> . 26 Unite d Sta tes. General Accounting Offic e. & United Sta tes. (U.S. General A ccounting Office, Washington, D.C., 200 2). 27 Benitez, K. & M alin, B. Eva luating re - id entification risks with respect to th e HIPAA privacy rule. Journal of th e America n Medical Inform atics Association : JAMIA 17 , 169 - 177, doi:10.1136/jamia.2009.000026 (2010). 28 Kwok, P., Davern, M., Hair, E. & La fky, D. in NORC at The University of C hicago (Chicago 2011). 29 Bennett, R. L . et al . Recomme ndations f or stand ardized huma n pedigre e nome ncl ature. Pedigr ee Standardiza tion Task Force of the National Soc iety of Genetic Cou nselors. Am J Hu m Genet 56, 745 - 752 (1995) . 30 Malin, B. Re - ide ntification of familia l databas e records. AM IA ... Annual Sym posium procee dings / AMIA Symposium. AMIA Sympo s ium, 52 4 - 528 (2006). 31 Israe l vs. Sh alom Bilik, Avr aham Adam, Yosef Vitman , Haim Ahar on, Mos he Moshkowitz a nd Meir Liv er ( In Hebrew) V erdict 24441 - 05 - 12 32 Gitschier, J. I nferential g enotyping of Y chromosom es in La tter - Day Saints founders and comparison to Uta h samples in the HapMa p project. A m J Hum Genet 84, 251 - 258, doi:S0002 - 9297(09)00025 - 1 [pii] 10.1016/j.ajhg.2 009.01.018 (2009). 33 Gymrek, M., Mc Guire, A. L., G olan, D., Halperin, E . & Erlich, Y. Id entifying personal geno mes by surna me i nfe rence. Sci ence 339, 321 - 324, doi:10.1126/science.1229566 (2013). 34 King, T. E. & Jobling, M. A. Wha t's in a name? Y chrom osomes, surnames a nd the genet ic g enea logy revol u tion. Trends G enet 2 5, 35 1 - 360, doi:S0168 - 9525(09)00133 - 4 [pii] 10.1016/j.tig.2009.06.0 03 (2009 ). 35 King, T. E. & Jobling, M . A. Founders, dr ift, and infidel ity: the relatio nship betwee n Y chromosome div ersity and patrilineal surnames. Mol Biol Evol 26, 1093 - 1102, doi:msp022 [p ii] 10.1093/molbev/msp022 ( 2009). 36 Motluk, A . Anonymous s perm do nor tr aced o n internet. New Sci 188, 2 (2005). 37 Stein, R. F ound on th e Web, With DNA: a Boy's Fa ther. Wa shington Post, 1 (200 5). 38 Naik, G. Fami ly Secre ts: An Adopte d Man's 26 - Ye ar Quest f or H is Fathe r Wall St reet Journal (2009). 39 Lehman n - Haupt, R. A re Sper m Donors R eall y Anonymou s Anymore? Sl ate (2010) . 40 Gymrek, M., Gol an, D., Ros set, S. & Erlich , Y. lobSTR : A short tandem repeat pro filer for personal genom es Genome res earch, doi: ( 2012). 41 Huff, C. D. et al. M aximum - likelihood esti mation of rec ent share d ancestry (ERSA). Geno me r esearc h 2 1, 7 68 - 774, doi:10.1101/g r.115972.110 (2011). 42 Henn, B. M. et al. Cryptic distant relatives a re common i n both isola ted and cosmopolitan g enetic s amples. PLoS One 7, e34 267, doi:10.1371/journal.pone.0034267 (2012) . 43 Lowrance, W. W. & Collins, F. S. Ethics. Identi fiability in genomic researc h. Scienc e 317, 600 - 602, doi:10.1126/s cience.1147699 (2007). 44 Kayser, M. & de Knijff, P. Im proving hu man forensics through adv ances in genetics, genomics and molecula r biology . Na t R ev Gen et 12 , 179 - 192, doi:10.1038/nrg2952 (2011). 45 Silventoine n, K. et al. H eritability of adu lt body height: a comparative study of twi n cohorts in eight cou ntries. Twin research : th e officia l journal of th e Internati onal Society fo r Twin Studi es 6, 399 - 408, doi:10.1375/1 36905203770326402 (20 03). 46 Kohn, L. A. P. The Role of Genetic s in Craniofacial Morphology and Growth. A nnu Rev Anthropol. 20, 261 - 278 ( 1991). 22 47 Zubakov, D. et al. Es timating human ag e from T - cel l DN A rea rrang eme nts. Cur r Bi ol 20, R970 - 971, doi:10.1016/j.c ub.2010.10.022 (2010). 48 Ou, X. L. et al . Predicting huma n age with bloo dstains by s jTREC quantifica tion. PLoS One 7, e42412, doi:10.1371/ journal.pone.0042412 (201 2). 49 Lango A llen, H. et al. H undre ds of variant s cluste red in gen omic loci and biol ogical pathways a ffect human he ight. Nature 467 , 832 - 838, doi:10.1038/natur e09410 (2010). 50 Manning, A. K. et al. A genome - wide ap proach acc ounting fo r body ma ss index identifies g enetic va riants infl uencing fa sting glyc emic tra its and i ns uli n resis tanc e. Nat Genet 44, 659 - 669, do i:10.1038/ng.2274 (2012). 51 Liu, F. e t al. A G eno me - Wide Ass ociation Study Id entifies Five Loc i Influencing Fa cial Morphology in Eu ropeans. PLoS Genet 8, e1002932, doi: 10.1371/journal.pgen.1 002932 (2012). 52 Wal s h, S. et al . IrisPlex: a s ensitive DN A tool fo r accur ate prediction of blue a nd brown eye colour in th e absence of anc estry informati on. Forensic Sc i Int Genet 5, 170 - 180, doi:10.1016/j.fsige n.2010.02.004 (2011). 53 CIPA. Vol. DC - 008 - 2010 (C amera & Ima gi ng Produc t Assoc iation, 2010) . 54 Byers, S. Informatio n leakage ca used by hidden data in published documen ts. Security & Privacy, IEEE 2, 23 - 27, doi:10.1109/MSECP .2004.1281241 (2004). 55 Kaufman, S., Ross et, S. & Perlich, C. in Proc eedings of the 17th A C M SIGKDD interna tiona l conf erenc e on Kno wledg e discov ery a nd dat a mining 55 6 - 563 (ACM, San Diego, California , USA, 2 011). 56 Acquisti, A. & Gross, R. P redicting Social Secu rity num bers from public data. P roc Natl Acad Sci U S A 106, 10975 - 10980, d oi:10.1073/pnas.0904891106 (2009). 57 Noumeir, R., L emay, A. & L ina , J. M. Pseudo nymization o f radiology da ta for research purposes. Jou rnal of digital ima ging 20, 284 - 295, doi:10.1007/s10278 - 006 - 1051 - 4 (2007). 58 Pakstis , A. J. et al. SNPs f or a univer sal ind iv idual ide ntification panel. Hum G enet 127, 315 - 324, doi:10.1007/s 00439 - 009 - 0771 - 1 (2010). 59 Lin, Z., Owen, A. B. & Altma n, R. B. Genetics. G enomic research an d human su bject privacy. Science 305, 183, do i:10.1126/science.1095019 (2004). 60 Mailman, M. D. et al. Th e NC BI dbG aP datab ase of geno type s and p henotype s. Nat Genet 39, 1181 - 1186, doi:10.1038/ng1007 - 1181 (2 007). 61 Homer, N. et al. R esolving individua ls contributi ng trace a mounts of DNA to highl y complex mixtu res using hig h - density SN P genotyping mi croarray s. PLoS Genet 4 , e1000167, doi:10.1371/journal.pgen.10001 67 (2008). 62 Halperin, E. & St ephan, D . A. SNP imputati on in ass ociation studi es. Nature biotechnology 27, 349 - 351, doi:10.1038/nbt0409 - 349 (2009). 63 Jacobs, K. B . et al. A n ew statistic and its pow er to inf er membersh ip in a g enome - wide assoc iation study u sing g enotype fr equencies. Nat Genet 4 1, 1253 - 1257, doi:ng.455 [ pii] 10.1038/ng.455 (2009). 64 Visscher, P. M. & Hill , W. G. The Limits of Indivi dual Identifica tion from Sampl e Allele Freq u encies: Theory a nd Statistica l Analysis. PLoS Genet 5, e100 0628, doi:10.1371/journal.pgen.1000628 (200 9). 65 Wang, R., Li, Y. F., Wang, X., Haixu, T. & Zhou, X. i n CCS’09 (Chicago, I L, USA, 2009). 66 Im, H. K., Gamaz on, E. R., Nicol ae, D. L. & Cox, N. J. On Sharing Quantitativ e Trait GWAS Results in a n Era of Multiple - omics Data a nd the Limits of Genomic Priva cy. Am J Hum Genet 90, 591 - 598, doi:S 0002 - 9297(12)00093 - 6 [pii] 10.1016/j.ajhg.2012.02.008 (2012). 67 Lumley, T. Pot ential for Revealing Indivi dua l- Level Informatio n in Genome - wi de Association Studies. JAMA 303, 659, doi:10.1001/jama.2 010.120 (2010). 68 Craig, D. W. et al. Ass essing and managi ng risk when s haring aggregate gen etic varian t data. Nat Rev Gene t 12, 730 - 736, doi:10.1038/ nrg3067 (2011). 23 69 Braun, R., Rowe, W., Schaefer, C., Zhang, J. & Bu etow, K. Ne edles in the Haystack: Identifying In dividuals Present in P ooled Genomic Da ta . PLoS Genet 5, e1000668, doi:10.1371/journal.pgen.1000668 (200 9). 70 Zerhouni, E. A. & Nabel, E. G. Protecting aggr egate genomic data. Science 322, 44, doi:10.1126/science.1165490 (2008). 71 John son, A. D., Le slie, R. & O'Donnell, C. J. Temp oral tren ds in result s availabi lity from genome - wide a ssocia tion studi es. PL oS Genet 7, e1002269, doi:10.1371/journal.pgen.1002269 (2011). 72 Gilbert, N. R esearchers criticize ge netic data restrictio ns. Nature, doi:10.1038/news.2008.1083 (2008). 73 Malin, B., Karp, D. & Scheuermann, R. H. Tech nical and polic y approaches to balanci ng pati ent pri vacy and data s haring in clin ical and translati onal re sear ch. Journal of inv estigative medicine : the o fficial p ublication of the America n Feder ation f or Cli nical Rese arch 5 8, 11 - 18, doi:10.231/JIM.0b013e3181c9b2ea (2010). 74 Clayton, D. On in ferring pres ence of an individua l in a mixture: a B ay esian ap pro ach. Biostatistic s 11, 661 - 673, doi:10.1093/biostatistics/k xq035 (2010). 75 Workshop o n Establish ing a C entral Resourc e of Data from Genom e Sequencin g Projects (2012). < http://www .genome.gov /Pages/Research/DER/GV P/Data_Aggr egation_Wor ksho p_Summary .pdf> . 76 Schadt, E. E ., Woo, S. & Hao, K. Bay esian method to pre dict individual SNP genotyp es from gen e expression da ta. Nat Genet 4 4, 603 - 608, do i:10.1038/ng.2248 (2012). 77 Marchini, J. & Howie, B . Genotype imputatio n for ge nome - wide as sociation s tudies. Nat Rev Genet 11, 499 - 511, doi:10.1038/nrg2796 (2010). 78 Check, E. James Watson s genome sequ enced . Nature (2007). 79 Nyholt, D. R., Yu, C. E. & Vissch er, P. M. On Jim Wa tson's APOE status: g enetic information is hard to hide. European journal of human genetics : EJHG 17, 147 - 149, doi:10.1038/ejhg.2008.198 (2009). 80 Kong, A. et al. Det ection of sharing by d escent, lo ng - range phasin g and haplot ype imputa tion. N at Genet 40, 10 68 - 1075, doi:10.1038/ng.216 (2008). 81 Kaiser , J. Human gene tics. Agency n ixes deCODE 's new data - mining plan. Sc ience 340, 1388 - 1389, doi:10.112 6/science.340.6139.1388 (2013). 82 Bambauer, J. R. Tragedy of the Data Commons. Ha rvard Jo urnal of Law an d Technology 25, doi: http://d x.doi.org/10.2139/ssrn.1789749 (2011). 83 Hartzog, W. & Stutz man, F. The Ca se for Online Obsc urity. Ca lifornia Law Revi ew 101, 1, doi: http://dx.doi.org /10.2139/ssrn.159774 (2013). 84 Taleb , N. N. The black sw an : the i mpact of t he highly imp robab le. (Random House, 2007). 85 Shannon, C. Communica tion Theo ry of Secr ecy Syste ms". Bell System T echnical Journ a l 28, 656 – 715 (194 9). 86 Cavoukian, A. Privac y by Design. (2009). < http://www.i pc.on.ca/ images/R esources/privac ybydes ign.pdf> . 87 Ramos, E. M . et al. A mechanism f or control led a cc ess t o G WAS d ata : exp eri ence of the GAIN Data Access C ommittee. Am J Hum Genet 92, 479 - 488, doi:10.1016/j.ajhg.2012.08.034 (2013). 88 Church, G. et al . Public ac cess to genome - wid e data: five v iews on balanc ing research with privac y and protection. PLoS Genet 5, e1000665, doi:10.1371/journal.pgen.1000665 (200 9). 89 Cre ating a Glob al Alli ance t o Enab le Re sponsib le Shar ing of G enomi c and Cli ncal Data. (2013). . 90 Sandis, C. & Tal eb, N. N. Ski n in the Game as a Require d Heuristic for Ac ting Unde r Uncer tai nty. Available at SSRN 2298292 (2013). 24 91 Swee ney, L. k - anonymity: a mod el for prot ecting priva cy. Int ernational jou rnal o f uncertainty, fuzziness, a nd k nowledge - based systems 10, 557 - 570 (20 02). 92 El Emam, K. & Da nkar, F . K. Protecti ng privacy using k - anonymity. Jou rnal of th e American Medica l Informatic s Associa tion : JAMIA 15, 627 - 637, doi:10.1197/jam ia.M2716 (2008). 93 Machanavaj jhala, A., Ki fer , D., Gehr ke, J . & Ve nkitasu bramaniam , M. L - dive rsity. ACM Trans. Knowl . Discov. Data 1 , 3 - es, d oi : 10.11 45/1217299.1217302 (2007). 94 Ningh ui, L., Tian cheng, L. & Ve nkatasu bramani an, S . in Data Eng inee ring, 200 7. ICDE 2007. IEEE 23rd Internatio nal Conference on. 10 6 - 115. 95 Malin, B. A . Protecting genomic sequence anony mity with gen eralization la ttices. Methods of i nformatio n in m edicine 44, 68 7 - 692 (2005). 96 Dwork, C. Dif ferential P rivacy . in ICALP. 1 - 12 (2007) . 97 Machana vajjhala, A., Kifer, D., Abowd, J., Gehrke, J. & Vilhuber, L. in Data E ngineering, 2008. ICDE 2008. IEEE 24th International Conference on. 277 - 286. 98 Uhler, C., Slavkovic , A. B. & Fienberg, S. E. Privacy - Pres ervi ng Data Sharing f or Geno me - Wide Assoc iation. CoRR a bs/1205.0739 (2012). 99 Johnson, A. & Shma tikov, V. in Proc eedings of the 19th ACM SIGKDD int ernational conference on Knowledge discov ery and data mi ning 107 9 - 1087 (ACM, Chicago , Illinois, USA, 2013). 100 Cao, G. in Compute r Science and C omputational Technology, 200 8. ISCSCT '08. International Symposium on . 292 - 294. 101 Narayanan , A., Thiagar ajan, N., Lakhan i, M., Hambur g, M. & Boneh, D. (NDSS, 2011). 102 Brueke rs . F. , Stefan, K., Kla us, K. & Pim, T. Privacy - Pr eservi ng Matc hing o f DNA Profiles. 2008 (2008). 103 Boha nnon, P., Jak obsson, M . & Sri kwa n, S. in P ublic K ey Cry ptogra phy V ol. 17 51 Lecture Notes in Com puter Sc ience (eds Hideki Ima i & Yulia ng Zheng) Ch. 25, 373 - 390 (Springer Berlin H eidelberg, 2000). 104 Baldi, P., Ba ronio, R., Cristofa ro, E. D., Gasti, P. & Tsu dik, G. in Proceedings of the 18th ACM confere nce on Comp uter and com municatio ns security 6 91 - 702 (ACM, Chicago, Il linois, USA, 2011). 105 Cristofaro, E . D., Fa ber, S., G asti, P. & Ts udik, G. in Pr oceedings of the 2012 A CM workshop on Privac y in the elect ronic society 97 - 108 (ACM, Ralei gh, North Carolina, USA, 2012). 106 Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A n ew way to p rotect priva cy in la rge - scale geno me - wid e associatio n studies. B ioinfor matics 29, 886 - 893, doi:10.1093/bioinformatics/btt066 ( 2013). 107 Ayday, E., Ra isaro, J. L. & H ubaux, J. P. Privacy - Enhanc ing Technol ogies for Medical Tests Using G enomic Data. T echnical Report (2013). < http://infos cience.epfl.c h/reco rd/182897/fil es/CS_ver sion_technica l_report.pdf> . 108 Naraya nan, A. What Happ end to the Crypto D ream? Secu rity & Privacy, I EEE 11, 75 - 76 (2013). 109 Presidential Commission fo r the Study of B ioethical Issues, Privacy and P rogress in Whol e Genom e S equenc ing . Pri vac y and P rogres s in Wh ole G enom e Sequ enci ng (2012). 110 Paillier, P. in Advanc es in Cryptol ogy — EU ROCRYPT ’99 Vol. 1592 Lecture Not es in Compu ter Scie nce (ed Jacqu es St ern) Ch. 16, 223 - 238 ( Spring er Be rlin H eidelb erg, 1999). 111 Gentry, C. A full y homomorphic encrypt ion sche me. doi:papers2:// publica tion/uuid/E 389BFF9 - B17D - 45A9 - BB67 - 0B586EE445F8 (2009). 25

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment