Segmentation of Encrypted Data


Authors: Eric Järpe, Quentin Gouchet

Eric Järpe* and Quentin Gouchet**

* School of Information Science, Halmstad University, P.O. Box 823, 301 18 Halmstad, Sweden
** atsec Information Security, Austin, TX 78759, USA
* To whom correspondence should be addressed; E-mail: eric.jarpe@hh.se.

2014

Abstract

The retrieval of data from computer hard drives seized in police busts against suspected criminals is sometimes not straightforward. Typically the incriminating data, which may be important evidence in subsequent trials, has been encrypted and quick deleted. The cryptanalysis of what can be recovered from such hard drives is then subject to time-consuming brute-forcing and password guessing. To this end, methods for accurate classification of what is encrypted data and what is not are of the essence. Here a procedure for discriminating encrypted data from non-encrypted is derived. Several methods are suggested and their accuracy is evaluated in different ways. Two methods to detect where encrypted data is located on a hard disk drive, using passive change-point detection, are detailed. The measures of performance of such methods are discussed and a new property for evaluation is suggested. The methods are then evaluated and discussed according to the new performance measure as well as the standard measures.

Keywords: Change-point detection, ciphertext, plaintext, compression.

1 Introduction

Background

Being able to detect encrypted files may be crucial in several settings, such as police investigations where evidence of criminal activity resides as data on a computer hard disk drive (HDD). This is a major issue in heavy criminality and organized crime [9], [8]. When the police seize an HDD containing data belonging to a suspected criminal, that data can be material of evidence in a subsequent trial.
But criminals often try to make sure that the police will be unable to use that data. Possible actions to obstruct data access are to encrypt and/or to quick delete the data. In the case of a quick delete, the pointers to the files are destroyed but the contents of the files are still left. The other action is to encrypt files. Using state-of-the-art tools for file encryption (such as BitLocker and the ciphers it provides), there is no general procedure for breaking the encryption other than a brute force attack: guessing the cipher algorithm and systematically guessing the encryption key. If some of the files are encrypted and others are not, it is a delicate matter to distinguish between the two once the data has been quick deleted. The importance of discriminating between encrypted data (i.e. ciphertext) and non-encrypted data (i.e. plaintext) is also stressed by the fact that brute force attacking all clusters of data on the hard drive would be an impossible task, while being able to separate the encrypted data from the non-encrypted would substantially improve the chances of successful cryptanalysis. The police authorities usually have software to brute force encrypted data, but this procedure may be very time consuming if the amount of data is so large that different parts of it have been encrypted with different keys. In juridical cases time is typically of the essence: since the chances of success in prosecution and proceedings against a criminal depend on deadlines (such as time of arrest, time of trial etc.), any time savings in the procedure of extracting evidence from the hard drives are essential. Thus time has to be spent on the appropriate tasks: code-breaking only on the encrypted data rather than trying to decipher data which is not encrypted.
Given a single file, the task of determining whether it is encrypted or not is usually easy; but given a whole HDD, without knowing where the encrypted files are, this is trickier. There are several software solutions for certifying whether a file is encrypted or not, mostly checking the header of the file and looking for some known header such as that of EFS (Encrypting File System on Windows), BestCrypt, or other software. Such alternatives cannot be used when the user has performed a (quick) delete of the HDD, because then the pointers to all files are lost. Nevertheless, the files actually remain on the HDD: upon deleting a file on an HDD, the pointers to the beginning and end of the physical space containing the information of the file are removed. But the physical space itself is not overwritten, because that operation would be slow, i.e. as slow as writing the same file again. Operating system designers choose to leave the file on the HDD intact but indicate that this location is free to host other data. If nothing has been overwritten, the file is actually still stored on the HDD for some time, which allows recovery software to restore it. Recovery software might help to simply recover the file, but it might also overwrite data contained on the HDD, which could result in loss of evidence in case of an investigation. In this case, the police would have to locate the encrypted files without using any recovery or "header-detector" software. In spite of being a pertinent problem in urgent need of a solution, surprisingly little has been done to investigate the properties of methods for the detection of encrypted data and to compare these with current practice. An attempt in this direction is [1], but this is just a small first step.
Here, methods to locate quick deleted encrypted data are presented and detailed. First, a description of how encrypted data differs from other data is given. This is followed by an introduction to statistical change-point detection methods for discriminating between encrypted and non-encrypted data. Finally, the results of these procedures are presented along with some experimental values to evaluate the methods. However, such a method will only work on mechanical HDDs and not on flash memory devices: in flash memory (such as USB memory sticks or solid state drives (SSDs)), as soon as a file is removed it is actually erased from the memory, because data cannot be overwritten in place. Therefore, as soon as data is deleted, the operating system will choose to delete the pointers. Erasing data in a flash memory also takes longer, since all the file contents have to be removed, which takes as long as copying new files to the device.

Description

If encrypted data is not uniformly distributed, the cryptosystem used to cipher the data has a bias and is in this sense vulnerable to cryptanalytic attacks. For this reason, the characters of the ciphertext produced by any modern high-quality cryptosystem are uniformly distributed [1], [10], i.e. the values of the bytes of the ciphertext are uniformly distributed over some character interval. Unencrypted files do not possess this feature, although some types of files come close to having uniformly distributed characters. The files coming closest to uniformly distributed contents without being encrypted are compressed/zipped files: those files are indeed very close to cipher files in terms of the distribution of their characters' byte values. Albeit small, there is a difference in distribution making it possible to tell compressed files and encrypted files apart.
This paper is about how to detect such small but systematic differences, and consequently how to quickly and accurately segment encrypted HDD data for efficient cryptanalysis.

Methods

Distribution of encrypted data

The working hypothesis is that the data (i.e. characters) constituting encrypted files are uniformly distributed, while the data of non-encrypted files are not (i.e. they are differently distributed depending on the type of non-encrypted file). The goal is to be able to tell apart an encrypted file from a non-encrypted one. Assume the data consists of characters divided into clusters, c_1, c_2, c_3, ..., with N characters in each cluster. The characters range over some alphabet of possible forms. Merging these forms into K character classes, the counts O_{kt} of occurrences of class k characters in cluster t are observed. One method of measuring distribution agreement is by means of a chi-square test statistic,

Q_t = ∑_{k=1}^{K} E_{kt}^{-1} (E_{kt} - O_{kt})²

where E_{kt} is the expected count of occurrences of characters in class k, cluster t. Under the hypothesis of uniformly distributed characters, the expected count within each class is E_{kt} = N/K. Also, by definition, ∑_{k=1}^{K} O_{kt} = N. This reduces the statistic to

Q_t = -N + ∑_{k=1}^{K} (K/N) O_{kt}².

The values of this statistic are henceforth referred to as Q scores. Large Q scores indicate deviance from the corresponding expected frequencies E_{kt}. The smallest possible Q score, 0, would be attained if O_{kt} = E_{kt} for all k. The expected count E_{kt} in each class should not be smaller than 5 for the statistic to be relevant (5 is a commonly used value; other values like 2 or 10 are sometimes used depending on how stable the test statistic should be to small deviances in tail probabilities). Therefore one should use at least 5K bytes of data in each cluster for this test to be relevant.
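As a sketch (our own code, not the authors'), the reduced statistic can be computed per cluster as follows, here with the concrete choices K = 8 classes and N = 64 bytes used below:

```python
import random

K, N = 8, 64  # 8 byte-value classes, 64-byte clusters

def q_score(cluster: bytes) -> float:
    # Class of a byte value b in 0..255: class 1 is [0, 31], ..., class 8 is [224, 255].
    counts = [0] * K
    for b in cluster:
        counts[b * K // 256] += 1
    # Reduced chi-square statistic: Q = -N + (K/N) * sum_k O_k^2.
    return -N + (K / N) * sum(o * o for o in counts)

# A perfectly balanced cluster attains the minimum Q score, 0.
balanced = bytes(32 * k for k in range(K) for _ in range(N // K))
assert q_score(balanced) == 0.0

# Q scores of uniform random ("encrypted-like") clusters behave like a
# chi-square variable with K - 1 = 7 degrees of freedom, whose mean is 7.
random.seed(1)
mean_q = sum(q_score(bytes(random.getrandbits(8) for _ in range(N)))
             for _ in range(5000)) / 5000
print(round(mean_q, 1))  # close to 7
```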
But the larger the number of bytes in each cluster, the worse the precision in locating encrypted data: if too many bytes are classified as unencrypted (when they were actually encrypted), a large amount of encrypted data will not be detected by the procedure.

[Figure 1: Distribution of the Q scores of encrypted files (obtained by using more than 5000 files) with the distribution function of Q ∈ χ²(7).]

Here the alphabet used was the numbers 0, 1, ..., 255, representing the possible values of the bytes representing the characters in the data. These numbers were divided into K = 8 classes (class 1: byte values in [0, 31], up to class 8: byte values in [224, 255]) and the clusters were of size N = 64 bytes, making the expected count in each class E_{kt} = 8, k = 1, 2, ..., 8. Assuming that encrypted data is uniformly distributed, the Q scores based on counts of characters in encrypted data are χ² distributed (see Figure 1), and since 8 classes were chosen, the number of degrees of freedom is 8 - 1 = 7.

Distribution of non-encrypted data

For non-encrypted data, the distribution is more complicated. Basically, each type of file has its own distribution. Consequently, the standardized squared deviances from the expected counts under an assumption of uniform distribution are larger, and so are the Q scores of the χ² statistic. However, two problems emerge. Firstly, the size of these increased deviances depends on the type of data, i.e. whether the data is a text file, an image, a compiled program, a compressed file or some other kind of file; how should this information be properly taken into account? Secondly, what is the distribution of the Q score in the case when the data are not encrypted?
In order to develop a method for distinguishing between encrypted and non-encrypted data, it is sufficient to focus on the non-encrypted data which is most similar to the encrypted, and this turns out to be compressed data. Other types of files, such as images, compiled programs etc., commonly render higher Q scores and are therefore indirectly distinguished from encrypted data by a method calibrated for discriminating between encrypted and compressed data. The second question is not readily answered. Rather, we suggest modelling the Q score as being scaled χ² distributed, i.e. the Q score is assumed to have the distribution of the random variable αX where α > 1 and X ∈ χ². The validity of this approach is supported by an empirical evaluation based on more than 5000 compressed files. The resulting empirical distribution of their Q scores, together with the distribution of αX where X is χ² distributed and the value α = 1.7374 was estimated by the least squares method, is plotted in Figure 2.

Change-point detection

The objective in detecting encrypted data is to quickly and accurately detect a shift in distribution from on-line observation of a random process (see e.g. [2–5]). Change-point detection can be done actively (stop collecting the data as soon as a shift is detected) or passively (continue collecting the data even if a shift is detected, in order to detect more shifts). Here passive on-line change-point detection was used to detect whether the data from an HDD shifts from non-encrypted to encrypted and vice versa. The change-point detection method is a stopping rule

τ = inf{ t > 0 : a_t > C }

where a_t is called the alarm function and C the threshold. The design of the alarm function defines different change-point detection methods, while the value of the threshold reflects the degree of sensitivity of the stopping rule.
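The scaled-χ² calibration described above can be illustrated with a self-contained simulation (our own sketch: the empirical Q scores of compressed files are stood in for by draws of αX with the paper's fitted α = 1.7374, and α is re-estimated by a least-squares fit through the origin on Q–Q pairs against a Monte Carlo χ²(7) reference):

```python
import random

random.seed(0)
k = 7  # degrees of freedom

def chi2_draw() -> float:
    # A chi-square(7) variable as a sum of 7 squared standard normals.
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

# Stand-in for the empirical Q scores of compressed files: simulated
# here as alpha * chi2(7) with the paper's alpha = 1.7374.
true_alpha = 1.7374
q_scores = sorted(true_alpha * chi2_draw() for _ in range(20000))

# Reference chi-square(7) order statistics (Monte Carlo quantiles).
ref = sorted(chi2_draw() for _ in range(20000))

# Least-squares fit of Q ~ alpha * X through the origin on the Q-Q pairs:
# alpha_hat = sum(x*y) / sum(x*x).
alpha_hat = sum(x * y for x, y in zip(ref, q_scores)) / sum(x * x for x in ref)
print(round(alpha_hat, 2))  # close to 1.74
```

With real data, `q_scores` would instead hold the observed Q scores of the compressed-file corpus.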
The alarm function may be based on the likelihood ratio

L(s, t) = f_{Q^t}(q^t | θ = s ≤ t) / f_{Q^t}(q^t | θ > t)

where f_{Q^t}(q^t | A) is the conditional joint density function of the random variables Q^t = (Q_1, ..., Q_t) given A, and q^t is the vector of observed values of Q^t. Assuming independence of the variables Q_1, ..., Q_t, the likelihood ratio simplifies to

L(s, t) = ∏_{u=s}^{t} f_1(q_u) / f_0(q_u)

where f_0(q_u) is the marginal conditional density function of Q_u given that the shift has not occurred by time u, and f_1(q_u) is the marginal conditional density function of Q_u given that the shift occurred at or before time u.

[Figure 2: Distribution of the Q scores of compressed files (obtained by using more than 5000 files) with the distribution function of Q = 1.7374 · X where X ∈ χ²(7).]

The conditional density function of the Q score at time t, given that the data is encrypted (i.e. uniformly distributed), is

f_E(q_t) = q_t^{k/2-1} e^{-q_t/2} / (2^{k/2} Γ(k/2))

where k is the number of degrees of freedom, i.e. the number of classes (which in this study is 8, as explained above). For the non-encrypted files, the conditional Q score is modelled by αX where X ∈ χ²(k) and α > 1, supposedly reflecting the inflated deviances from the uniform distribution had the data been encrypted. Thus

f_NE(q_t) = ∂/∂q_t P(αX < q_t | X ∈ χ²(k)) = (q_t/α)^{k/2-1} e^{-q_t/(2α)} / (α 2^{k/2} Γ(k/2))    (1)

is the density function of the Q score of non-encrypted data. This means that two cases of shift in distribution are possible:

• Shift from non-encrypted to encrypted data, in which case

L(s, t) = ∏_{u=s}^{t} f_E(q_u)/f_NE(q_u) = α^{k(t-s+1)/2} exp( -((α-1)/(2α)) ∑_{u=s}^{t} q_u ).    (2)

• Shift from encrypted to non-encrypted data, in which case

L(s, t) = ∏_{u=s}^{t} f_NE(q_u)/f_E(q_u) = α^{-k(t-s+1)/2} exp( ((α-1)/(2α)) ∑_{u=s}^{t} q_u ).    (3)

To detect whether a shift in distribution has occurred according to the stopping rule τ mentioned above, an alarm function must be specified. Two of the most common choices are:

• CUSUM [6]: a_t = max_{1≤s≤t} L(s, t),
• Shiryaev [7]: a_t = ∑_{s=1}^{t} L(s, t).

Other possible choices are e.g. the Shewhart method, the Exponentially Weighted Moving Average (EWMA), the full Likelihood Ratio method (LR) and others; see e.g. [3] for a more extensive presentation of different methods. For the CUSUM alarm function, since

arg max_{1≤s≤t} L(s, t) = arg max_{1≤s≤t} ln L(s, t),

the alarm function can be simplified, without any loss of generality, by using log likelihood values instead. In both cases, the alarm functions can be expressed recursively, which facilitates running the algorithm in practice when data streams are big. The alarm function for a shift from non-encrypted to encrypted data is, for the

• CUSUM method:

a_0 = 0,  a_t = [a_{t-1} + (k ln α)/2]^+ + ((1-α)/(2α)) q_t,  t = 1, 2, 3, ...

• Shiryaev method:

a_0 = 0,  a_t = α^{k/2} e^{((1-α)/(2α)) q_t} (1 + a_{t-1}),  t = 1, 2, 3, ...

The alarm function for a shift from encrypted to non-encrypted data is, for the

• CUSUM method:

a_0 = 0,  a_t = [a_{t-1} - (k ln α)/2]^+ + ((α-1)/(2α)) q_t,  t = 1, 2, 3, ...

• Shiryaev method:

a_0 = 0,  a_t = α^{-k/2} e^{((α-1)/(2α)) q_t} (1 + a_{t-1}),  t = 1, 2, 3, ...

Evaluation

To quantify the quality of the different methods, their performance is compared with respect to relevant properties such as the time until false alarm, the delay of a motivated alarm, the credibility of an alarm and so on.
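The recursive alarm functions and the stopping rule can be sketched as follows (our own function names; the thresholds in the usage example are the ARL_0 = 500 values from Table 4 below, and each regime is represented by its mean Q score as a stylized input):

```python
import math

ALPHA, K = 1.7374, 8  # alpha from the least-squares fit; 8 byte classes

def tau_cusum(q_scores, C, to_encrypted=True, alpha=ALPHA, k=K):
    """First time t with a_t > C for the recursive (log-scale) CUSUM
    alarm function; to_encrypted selects the direction of the shift."""
    a = 0.0
    drift = k * math.log(alpha) / 2.0
    for t, q in enumerate(q_scores, start=1):
        if to_encrypted:                      # non-encrypted -> encrypted
            a = max(a + drift, 0.0) + (1 - alpha) / (2 * alpha) * q
        else:                                 # encrypted -> non-encrypted
            a = max(a - drift, 0.0) + (alpha - 1) / (2 * alpha) * q
        if a > C:
            return t
    return None

def tau_shiryaev(q_scores, C, to_encrypted=True, alpha=ALPHA, k=K):
    """Same stopping rule with the recursive Shiryaev alarm function."""
    a = 0.0
    for t, q in enumerate(q_scores, start=1):
        sign = -1.0 if to_encrypted else 1.0
        a = alpha ** (-sign * k / 2) * math.exp(sign * (alpha - 1) / (2 * alpha) * q) * (1 + a)
        if a > C:
            return t
    return None

# Stylized stream: 200 "compressed" clusters (mean Q = 7*alpha), then
# "encrypted" clusters (mean Q = 7).
q = [7 * ALPHA] * 200 + [7.0] * 100
print(tau_cusum(q, C=2.6529))        # alarms a few clusters after t = 200
print(tau_shiryaev(q, C=323.0625))
```

On real data the Q scores are noisy, so alarms arrive with some additional delay; the passive procedure would then continue past the alarm with the reversed likelihood ratio.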
The threshold is commonly set with respect to the average run length, ARL_0, which is defined as the expected time until an alarm when no parameter shift actually occurred (meaning that this alarm is a false alarm). It is crucial to have the right threshold values for the methods to perform as specified.

Table 1: Values of the expected delay ED for the CUSUM and Shiryaev methods for ARL_0 = 100, 500, 2 500, 10 000, for a shift from encrypted to compressed data.

  ν       CUSUM                                     Shiryaev
          100      500      2 500    10 000         100      500      2 500    10 000
  0.2     4.7844   7.1087   9.5520   11.6905        4.9672   7.3116   9.7634   11.9159
  0.15    4.7674   7.0788   9.5109   11.6495        4.8933   7.2409   9.6760   11.8015
  0.07    4.7278   7.0162   9.4401   11.5695        4.7455   7.0610   9.4786   11.6114
  0.05    4.7176   7.0017   9.4224   11.5420        4.7021   6.9975   9.4308   11.5441
  0.02    4.6957   6.9712   9.3860   11.4940        4.6422   6.9144   9.3175   11.4477
  0.01    4.6870   6.9581   9.3698   11.4693        4.6150   6.8633   9.2743   11.3973

Setting the threshold such that ARL_0 is 100, 500, 2 500 and 10 000 respectively (the most common values are ARL_0 = 100 and 500, but the higher values are also considered since ARL_0 determines the number of clusters/time points that are treated before a false alarm, and the shift could occur very far into the HDD), properties of the methods regarding delay and credibility of a motivated alarm can be compared. Of course, a low threshold will lead to more false alarms (detection of a change when there is none), but specifying too high a threshold will lead to a drop in the sensitivity of the method (higher delay between a shift and its detection) and consequently an increased probability of missing a real shift in distribution.
The expected delay, ED(ν) = E_ν(τ - θ | θ < τ) (the expectation of the delay of a motivated alarm; see Table 1), and the conditional expected delay, CED(t) = E(τ - θ | τ > θ = t) (the expectation of the delay when the change point is fixed equal to t), are important measures of performance for many applications. However, in the case of detecting encrypted data, expected delays are less relevant as a measure of performance, since the data can be handled without any time aspect: the goal is to detect accurately where the encrypted data is located. A method with long expected or conditional expected delay merely means a slightly less efficient procedure. A more relevant performance indicator in this case is, for instance, the predictive value PV = P(θ < τ) (the probability that the method signals an alarm when the change point has actually occurred; see Table 5 and Figure 5), or the percentage of the encrypted files that is discovered while running the process and how to improve it (see Figure 2).

While running the process, the method will stop at some time, τ, and then estimate the change point, θ, by maximizing the likelihood function over the data between the last previous alarm and the newly detected change point. This estimated change point, θ̂, can be either before or after the true change point θ. One could also inflate the intervals where encrypted data was discovered. This would lead to missing less encrypted data (see Table 2), but also to brute-forcing more non-encrypted data. In Table 2, intervals of the form [τ_1 - i, τ_2 + i] are considered as estimated regions of encrypted data. Even with large i the values are typically very close to, but not exactly equal to, 1; this happens because some change points are very close (i.e. less than 10 clusters apart, for example) and the method then does not detect any change there, so that ciphertext goes undetected.
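The maximum-likelihood estimation of the change point after an alarm might look as follows (a sketch; the function name and the backward scan over candidate values of s are our own, using the likelihood ratios (2) and (3)):

```python
import math

ALPHA, K = 1.7374, 8  # scale parameter alpha and number of classes

def estimate_change_point(q_scores, tau, to_encrypted=True, alpha=ALPHA, k=K):
    """After an alarm at time tau (1-based), return the s in 1..tau that
    maximizes log L(s, tau), with L from (2) or (3).
    q_scores holds the Q scores observed since the previous alarm."""
    sign = 1.0 if to_encrypted else -1.0
    best_s, best_ll, tail = tau, -math.inf, 0.0
    for s in range(tau, 0, -1):          # scan candidates, reusing the tail sum
        tail += q_scores[s - 1]
        n = tau - s + 1
        ll = sign * (n * k * math.log(alpha) / 2 - (alpha - 1) / (2 * alpha) * tail)
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s

# Stylized Q scores: 10 "compressed" clusters (mean 7*alpha), then 5
# "encrypted" ones, with the alarm assumed at tau = 15.
print(estimate_change_point([7 * ALPHA] * 10 + [7.0] * 5, tau=15))  # true shift: 11
```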
Table 2: Percentage of the encrypted files that are detected when the interval of detected change points [τ_1, τ_2] is inflated to [τ_1 - i, τ_2 + i], for i = 0, 1, 2, 3, 4, 5, 10, 50, 100.

  i     CUSUM      Shiryaev
  0     0.960254   0.961280
  1     0.971053   0.971274
  2     0.976242   0.978665
  3     0.979266   0.980872
  4     0.983681   0.985597
  5     0.986248   0.986101
  10    0.990101   0.990101
  50    0.993371   0.994931
  100   0.994326   0.995682

Therefore the difference between the change points and the alarms signalled by the method is calculated. Since the proportion of encrypted data relative to the total amount of data on the HDD is unknown, the expected proportion of error is suggested. That is to say, given two consecutive change points, θ_1 and θ_2, and two corresponding stopping times, τ_1 and τ_2, the expected proportion of error is

E( (|τ_1 - θ_1| + |τ_2 - θ_2|) / (θ_2 - θ_1) ).

But of course, this value makes sense only when there are no false alarms between τ_1 and τ_2. If there are false alarms between τ_1 and τ_2, the proportion of undetected encrypted data is added to the proportion of error, to determine the proportion of the error made relative to the size of the encrypted data. Assuming that there are n false alarms τ'_1 < ... < τ'_n in [τ_1, τ_2], the expected inaccuracy, or EI for short, is defined as follows:

EI(ν) = E_ν( (|τ_1 - θ_1| + |τ_2 - θ_2|) / (θ_2 - θ_1) + ∑_{i=1}^{n/2} (τ'_{2i} - τ'_{2i-1}) / (θ_2 - θ_1) ).

The EI was measured for different values of the parameter ν in the geometric distribution of the change points, for the different methods (see Table 3 and Figure 3).

[Figure 3: Expected inaccuracy EI of the CUSUM and Shiryaev procedures. The Shiryaev method is a little less accurate as ν increases, but slightly more accurate for small ν, compared with the CUSUM procedure.]
Table 3: EI for the CUSUM and Shiryaev change-point detection methods, for some values of the parameter ν.

  ν         0.2      0.15     0.07     0.05     0.02     0.01     0.005    0.001
  CUSUM     0.11379  0.11202  0.09559  0.08683  0.06214  0.04426  0.03096  0.01655
  Shiryaev  0.11693  0.11276  0.09688  0.08890  0.06207  0.04398  0.03038  0.01603

Complete procedure

The complete procedure returns a segmentation separating suspected encrypted data from most likely non-encrypted data on an HDD, information provided in order to carry out the subsequent brute force cryptanalysis efficiently. The procedure runs a likelihood ratio based change-point detection method; as soon as it detects a change, it maximizes the likelihood function to find the most likely estimator of the change point, determining where the real change is most likely located. It then starts over from the location of this estimated change point with the same method for on-line change-point detection, except that the likelihood ratio is reversed, modifying the alarm function to fit the opposite change-point situation, and so on.

Table 4: Values of the thresholds for the CUSUM and Shiryaev methods for ARL_0 = 100, 500, 2 500, 10 000, specified for detecting a shift from non-encrypted to encrypted data (NE → E for short) and for a shift from encrypted to non-encrypted data (E → NE) respectively.

  ARL_0    CUSUM NE → E   CUSUM E → NE   Shiryaev NE → E   Shiryaev E → NE
  100        1.2260         4.5801           64.0313           44.1271
  500        2.6529         6.1250          323.0625          221.8125
  2 500      4.2188         7.7120         1618.219           735.4088
  10 000     5.5296         9.0990         6475.0547         4441.8413

Results

Thresholds and experimental values for the methods

The first step in establishing properties of the change-point detection methods is to determine the thresholds rendering the desired values of the average run length, ARL_0. One purpose of this is simply to link the threshold values to a property which relates to the probability of false alarm.
This is a result in itself, but it is also necessary for calibrating the methods so that they are comparable in terms of other performance measures related to a motivated alarm, such as expected delay, predictive value and expected inaccuracy. Here the values ARL_0 = 100, 500, 2 500 and 10 000 are considered, for both the CUSUM and Shiryaev methods, for a shift from encrypted to non-encrypted data and vice versa. The change points are commonly modelled as geometrically distributed with parameter ν. Here the average time before a change point is expected to be rather high (several hundreds or thousands, maybe), as the methods deal with 64-byte clusters on an HDD of surely several hundreds of giga- or terabytes. Thus, since E(θ) = 1/ν, the focus is on very small values of ν for the methods to be sensibly trigger-happy.

Common values of ARL_0 are 100 or 500, in order to make other properties relevant for comparisons. In this application, however, values of ARL_0 as large as 2 500 and 10 000 are also studied, because the first change point might not occur until far into the HDD. Adjusting the threshold by simulating data can take a very long time if ARL_0 is large (2 500 or 10 000), especially for the Shiryaev method; in this case it can take several hours or even days to compute the threshold. Therefore it would be interesting to have a way of predicting the threshold by extrapolation, i.e. an explicit relation between ARL_0 and the threshold C. Intuitively, if ARL_0 is larger, more data will be taken into account, implying a proportionally larger threshold: when ARL_0 increases, more data is used in the procedure, and the threshold is therefore increased in proportion to how much more data was treated.
In the CUSUM case, since the alarm function is defined by means of the log likelihood ratio, the relationship between the threshold C and ARL_0 is logarithmic:

• for a shift from encrypted data to non-encrypted data:
  C = 0.997767 · ln(0.912316 · ARL_0 + 7.294950)

• for a shift from non-encrypted data to encrypted data:
  C = 0.965524 · ln(0.030655 · ARL_0 + 0.494603)

For Shiryaev, the threshold C is a linear function of ARL_0:

• for a shift from encrypted data to non-encrypted data:
  C = 0.444214 · ARL_0 + 0.294281

• for a shift from non-encrypted data to encrypted data:
  C = 0.647578 · ARL_0 - 0.726563

Conclusions

Using change-point theory, methods to detect encrypted data on HDDs were successfully derived and evaluated. These methods exploit the fact that encrypted data is uniformly distributed, as opposed to other types of files. The methods were designed to detect the difference between encrypted and non-encrypted data, where the kind of non-encrypted data most similar to the encrypted data was compressed data. As the proposed methods detect even this small a difference in the data, any bigger deviance will be detected even more easily.

Table 5: Predictive value PV(ν) = P_ν(θ < τ), i.e. the probability that a shift has occurred when an alarm is signalled, for the CUSUM and Shiryaev methods, for different values of ARL_0 and different values of the parameter ν in the geometric distribution of the change points.

  ν       CUSUM                                   Shiryaev
          100      500     2 500   10 000         100      500     2 500   10 000
  0.20    0.8950   0.9862  0.9984  0.9997         0.9859   0.9984  0.9998  1.0000
  0.15    0.8727   0.9805  0.9974  0.9995         0.9736   0.9967  0.9996  0.9999
  0.07    0.8031   0.9604  0.9933  0.9986         0.9147   0.9852  0.9976  0.9995
  0.05    0.7634   0.9482  0.9907  0.9978         0.8708   0.9754  0.9957  0.9991
  0.02    0.6171   0.8942  0.9784  0.9944         0.6927   0.9235  0.9847  0.9964
  0.01    0.4712   0.8207  0.9590  0.9890         0.5153   0.8463  0.9661  0.9918
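The fitted threshold relations above can be written directly as functions of ARL_0 and cross-checked against the simulated values in Table 4 (a sketch; the function names are ours):

```python
import math

def cusum_threshold(arl0, to_encrypted=True):
    """Fitted logarithmic relation between ARL_0 and the CUSUM threshold C."""
    if to_encrypted:   # shift non-encrypted -> encrypted
        return 0.965524 * math.log(0.030655 * arl0 + 0.494603)
    return 0.997767 * math.log(0.912316 * arl0 + 7.294950)

def shiryaev_threshold(arl0, to_encrypted=True):
    """Fitted linear relation between ARL_0 and the Shiryaev threshold C."""
    if to_encrypted:
        return 0.647578 * arl0 - 0.726563
    return 0.444214 * arl0 + 0.294281

# Cross-check against the simulated thresholds in Table 4:
print(round(cusum_threshold(100), 4))                      # Table 4: 1.2260
print(round(cusum_threshold(500, to_encrypted=False), 4))  # Table 4: 6.1250
print(round(shiryaev_threshold(500), 4))                   # Table 4: 323.0625
```

The fits reproduce most Table 4 entries to within a few decimals, which is what makes the extrapolation to large ARL_0 attractive.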
Quick and accurate detection of a change is commonly the desired property of change-point detection methods. In many applications, time aspects of the methods are of interest, e.g. the expected delay in detecting a shift or the probability of detecting a shift within a specified time interval. Here, however, this time aspect is not of primary interest, since the data remain the same during the whole process; the probability of correctly detecting encrypted data is more relevant. The investigation shows that the proposed methods detect more than 96% of the encrypted data and, by extending the intervals, more than 99% of it. By assuming that the change points are not too close (a plausible assumption, since it is unlikely that files are so small unless the device is very fragmented), the method, by adding a little margin to the intervals, quickly detects 100% of the encrypted data.

The change point itself is a random variable. However, for some of the performance measures, the results depend on whether the change is more likely to occur early or late. In the application of detecting encrypted data, the distance between the shift from encrypted to non-encrypted data and vice versa is typically longer than 20 clusters, which corresponds to a parameter value ν < 0.05 in the geometric distribution of the change point. Therefore these values are the more interesting ones when the hard drive is not very strongly fragmented. In that regime the Shiryaev method turns out to be slightly better than the CUSUM method with respect to expected delay. The Shiryaev method also detects more encrypted data than the CUSUM method and has a slightly higher predictive value PV. All in all, this implies that both methods, designed with the suggested modelling, perform very well, with a slight preference for the Shiryaev method for detecting encrypted data on an HDD.
Acknowledgements

The authors wish to express their gratitude to Mattias Weckstén at Halmstad University for good ideas and previous readings of the manuscript, and to Linus Nissi (formerly Linus Barkman) at the Police Department of Southern Sweden for earlier work in the area.

References

[1] Barkman, L. Detektering av krypterade filer [Detection of encrypted files], Teknologie Kandidatexamen i Datateknik, diva2:428544, Halmstad University.
[2] Frisén, M. Properties and use of the Shewhart method and its followers, Sequential Analysis, 26(2) (2007) 171–193. DOI: 10.1080/07474940701247164
[3] Frisén, M. Statistical surveillance. Optimality and methods, International Statistical Review, 71(2) (2003) 403–434. DOI: 10.1111/j.1751-5823.2003.tb00205.x
[4] Frisén, M. and de Maré, J. Optimal surveillance, Biometrika, 78(2) (1991) 271–280. DOI: 10.2307/2337252
[5] Järpe, E. Surveillance, environmental, Encyclopedia of Environmetrics, 4 (2013) 2150–2153. DOI: 10.1002/9780470057339.vas065.pub2
[6] Page, E.S. Continuous inspection schemes, Biometrika, 41(1/2) (1954) 100–115. DOI: 10.1093/biomet/41.1-2.100
[7] Shiryaev, A.N. On optimum methods in quickest detection problems, Theory of Probability and Its Applications, 8(1) (1963) 22–46. DOI: 10.1137/1108002
[8] Swedish Civil Contingencies Agency. Informationssäkerhet – trender 2015 [Information security – trends 2015], Myndigheten för Samhällsskydd och Beredskap, (2015).
[9] The Swedish Police. Polisens rapport om organiserad brottslighet 2015 [The police report on organized crime 2015], National Operations Department, (2015).
[10] Westfeld, A. and Pfitzmann, A. Attacks on steganographic systems: Breaking the steganographic utilities EzStego, Jsteg, Steganos, and S-Tools, and some lessons learned, 3rd International Workshop on Information Hiding, Dresden, Germany (2000) 61–76.
[Figure 4: Expected delays, ED, for a shift from encrypted to compressed data for the CUSUM procedure (blue) and the Shiryaev procedure (red), for ARL_0 = 100, 500, 2 500 and 10 000.]

[Figure 5: Predictive values for a shift from compressed to encrypted data for the CUSUM procedure (left) and the Shiryaev procedure (right), for ARL_0 = 100, 500, 2 500 and 10 000.]
