Segmentation of Encrypted Data


Authors: Eric Järpe, Quentin Gouchet

Eric Järpe* and Quentin Gouchet**

* School of Information Science, Halmstad University, P.O. Box 823, 301 18 Halmstad, Sweden
** atsec Information Security, Austin, TX 78759, USA
* To whom correspondence should be addressed; E-mail: eric.jarpe@hh.se.

2014

Abstract

The retrieval of data from computer hard drives seized in police busts against suspected criminals is sometimes not straightforward. Typically the incriminating data, which may be important evidence in subsequent trials, has been encrypted and quick deleted. The cryptanalysis of what can be recovered from such hard drives is then subject to time-consuming brute-forcing and password guessing. To this end, methods for accurate classification of what is encrypted data and what is not are of the essence. Here a procedure for discriminating encrypted data from non-encrypted is derived. Several methods are suggested and their accuracy is evaluated in different ways. Two methods to detect where encrypted data is located on a hard disk drive, using passive change-point detection, are detailed. The measures of performance of such methods are discussed and a new property for evaluation is suggested. The methods are then evaluated and discussed according to the new performance measure as well as the standard measures.

Keywords: Change-point detection, ciphertext, plaintext, compression.

1 Introduction

Background

Being able to detect encrypted files may be crucial in several settings, such as police investigations where evidence of criminal activity resides as data on a computer hard disk drive (HDD). This is a major issue in heavy criminality and organized crime [9], [8]. When the police seize an HDD containing data belonging to a suspected criminal, that data can be material of evidence in a subsequent trial.
But criminals often try to make sure that the police will be unable to use that data. Possible actions to obstruct data access are to encrypt and/or to quick delete the data. In the case of a quick delete, the pointers to the files are destroyed but the contents of the files are still left. The other action is to encrypt files. Using state-of-the-art tools for file encryption (such as BitLocker and the ciphers it provides), there is no general procedure for breaking the encryption other than a brute force attack: guessing the cipher algorithm and systematically guessing the encryption key. If some of the files are encrypted and others are not, it is a delicate matter to distinguish between the two once the data has been quick deleted. The importance of discriminating between encrypted data (i.e. ciphertext) and non-encrypted data (i.e. plaintext) is also stressed by the fact that brute force attacking all clusters of data on the hard drive would be an impossible task, while being able to separate the encrypted data from the non-encrypted would substantially improve the chances of successful cryptanalysis. The police authorities usually have software to brute force encrypted data, but this procedure may be very time consuming if the amount of data is so large that different parts of it have been encrypted with different keys. In juridical cases time is typically of the essence: since the chances of success in prosecution and proceedings against a criminal depend on deadlines (such as time of arrest, time of trial etc.), any time savings in the procedure of extracting evidence from the hard drives are essential. Thus time has to be spent on the appropriate tasks: code-breaking only on the encrypted data rather than trying to decipher data which is not encrypted.
Given a single file, the task of determining whether it is encrypted or not is usually easy; but given a whole HDD, without knowing where the encrypted files are, this is trickier. There are several software solutions for certifying whether a file is encrypted or not, mostly checking the header of the file and looking for some known header such as that of EFS (Encrypting File System on Windows), BestCrypt, or other software. Such alternatives cannot be used when the user has performed a (quick) delete of the HDD, because then the pointers to all files are lost. Nevertheless, the files actually remain on the HDD: upon deleting a file on an HDD, the pointers to the beginning and end of the physical space containing the information of the file are removed. But the physical space itself is not overwritten, because that operation would be slow, i.e. as slow as writing the same file again. Operating system designers choose to leave the file on the HDD intact but indicate that this location is free to host other data. If nothing has been overwritten, the file is actually still stored on the HDD for some time, which allows recovery software to restore it. Recovery software might help to simply recover the file, but it might also overwrite data contained on the HDD, which could result in loss of evidence in case of an investigation. In this case, the police would have to locate the encrypted files without using any recovery or "header-detector" software. In spite of being a pertinent problem in urgent need of a solution, surprisingly little has been done to investigate the properties of methods for the detection of encrypted data and to compare these with current practice. An attempt in this direction is [1], but this is just a small first step.
Here, methods to locate quick deleted encrypted data are presented and detailed. First, a description of how encrypted data differs from other data is given. This is followed by an introduction to statistical change-point detection methods for discriminating between encrypted and non-encrypted data. Finally, the results of these procedures are presented along with some experimental values to evaluate the methods. However, such a method will only work on mechanical HDDs and not on flash memory devices: in flash memory (such as USB memory sticks or solid state drives (SSDs)), as soon as a file is removed it is actually erased from the memory, because data cannot be overwritten in place. Therefore, as soon as data is deleted, the operating system will choose to delete the pointers. Erasing data in a flash memory also takes longer, since all the file contents have to be removed, which takes as long as copying new files to the device.

Description

If encrypted data is not uniformly distributed, the cryptosystem used to cipher the data has a bias and is in this sense vulnerable to cryptanalytic attacks. For this reason, the characters of the ciphertext produced by any modern high-quality cryptosystem are uniformly distributed [1], [10], i.e. the values of the bytes of the ciphertext are uniformly distributed over some character interval. Unencrypted files do not possess this feature, although some types of files come close to having uniformly distributed characters. The files coming closest to uniformly distributed contents without being encrypted are compressed/zipped files: those files are indeed very close to cipher files in terms of the distribution of their characters' byte values. Albeit small, there is a difference in distribution making it possible to tell compressed files and encrypted files apart.
This paper is about how to detect such small but systematic differences, and consequently how to quickly and accurately segment encrypted HDD data for efficient cryptanalysis.

Methods

Distribution of encrypted data

The working hypothesis is that the data (i.e. characters) constituting encrypted files are uniformly distributed, while the data of non-encrypted files are not (i.e. they are differently distributed depending on the type of non-encrypted file). The goal is to be able to tell apart an encrypted file from a non-encrypted one. Assume the data consists of characters divided into clusters, c_1, c_2, c_3, ..., with N characters in each cluster. The characters range over some alphabet of possible forms. Merging these forms into K character classes, the counts O_{kt} of occurrences of class k characters in cluster t are observed. One method of measuring distribution agreement is by means of a chi-square test statistic,

Q_t = ∑_{k=1}^{K} E_{kt}^{-1} (E_{kt} - O_{kt})²

where E_{kt} is the expected count of occurrences of characters in class k, cluster t. Under the hypothesis of uniformly distributed characters, the expected count within each class is E_{kt} = N/K. Also, by definition, ∑_{k=1}^{K} O_{kt} = N. This reduces the statistic to

Q_t = -N + ∑_{k=1}^{K} (K/N) O_{kt}².

The values of this statistic are henceforth referred to as Q scores. Large Q scores indicate deviance from the corresponding expected frequencies E_{kt}. The smallest possible Q score, 0, would be attained if O_{kt} = E_{kt} for all k. The expected count E_{kt} in each class should not be smaller than 5 for the statistic to be relevant (5 is a commonly used value; other values like 2 or 10 are sometimes used depending on how stable the test statistic should be to small deviances in tail probabilities). Therefore one should use at least 5K bytes of data in each cluster for this test to be relevant.
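As a sketch (our own code, not the authors'), the reduced statistic can be computed per cluster as follows, here with the concrete choices K = 8 classes and N = 64 bytes used below:

```python
import random

K, N = 8, 64  # 8 byte-value classes, 64-byte clusters

def q_score(cluster: bytes) -> float:
    # Class of a byte value b in 0..255: class 1 is [0, 31], ..., class 8 is [224, 255].
    counts = [0] * K
    for b in cluster:
        counts[b * K // 256] += 1
    # Reduced chi-square statistic: Q = -N + (K/N) * sum_k O_k^2.
    return -N + (K / N) * sum(o * o for o in counts)

# A perfectly balanced cluster attains the minimum Q score, 0.
balanced = bytes(32 * k for k in range(K) for _ in range(N // K))
assert q_score(balanced) == 0.0

# Q scores of uniform random ("encrypted-like") clusters behave like a
# chi-square variable with K - 1 = 7 degrees of freedom, whose mean is 7.
random.seed(1)
mean_q = sum(q_score(bytes(random.getrandbits(8) for _ in range(N)))
             for _ in range(5000)) / 5000
print(round(mean_q, 1))  # close to 7
```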
But the larger the number of bytes in each cluster, the worse the precision in locating encrypted data: if too many bytes are classified as unencrypted (when they were actually encrypted), a large amount of encrypted data will not be detected by the procedure.

[Figure 1: Distribution of the Q scores of encrypted files (obtained by using more than 5000 files) with the distribution function of Q ∈ χ²(7).]

Here the alphabet used was the numbers 0, 1, ..., 255, representing the possible values of the bytes representing the characters in the data. These numbers were divided into K = 8 classes (class 1: byte values in [0, 31], up to class 8: byte values in [224, 255]) and the clusters were of size N = 64 bytes, making the expected count in each class E_{kt} = 8, k = 1, 2, ..., 8. Assuming that encrypted data is uniformly distributed, the Q scores based on counts of characters in encrypted data are χ² distributed (see Figure 1), and since 8 classes were chosen, the number of degrees of freedom is 8 - 1 = 7.

Distribution of non-encrypted data

For non-encrypted data, the distribution is more complicated. Basically, each type of file has its own distribution. Consequently, the standardized squared deviances from the expected counts under an assumption of uniform distribution are larger, and so are the Q scores of the χ² statistic. However, two problems emerge. Firstly, the size of these increased deviances depends on the type of data, i.e. whether the data is a text file, an image, a compiled program, a compressed file or some other kind of file; how should this information be properly taken into account? Secondly, what is the distribution of the Q score in the case when the data are not encrypted?
In order to develop a method for distinguishing between encrypted and non-encrypted data, it is sufficient to focus on the non-encrypted data which is most similar to the encrypted, and this turns out to be compressed data. Other types of files, such as images, compiled programs etc., commonly render higher Q scores and are therefore indirectly distinguished from encrypted data by a method calibrated for discriminating between encrypted and compressed data. The second question is not readily answered. Rather, we suggest modelling the Q score as being scaled χ² distributed, i.e. the Q score is assumed to have the distribution of the random variable αX where α > 1 and X ∈ χ². The validity of this approach is supported by an empirical evaluation based on more than 5000 compressed files. The resulting empirical distribution of their Q scores, together with the distribution of αX where X is χ² distributed and the value α = 1.7374 was estimated by the least squares method, is plotted in Figure 2.

Change-point detection

The objective in detecting encrypted data is to quickly and accurately detect a shift in distribution from on-line observation of a random process (see e.g. [2–5]). Change-point detection can be done actively (stop collecting the data as soon as a shift is detected) or passively (continue collecting the data even if a shift is detected, in order to detect more shifts). Here passive on-line change-point detection was used to detect whether the data from an HDD shifts from non-encrypted to encrypted and vice versa. The change-point detection method is a stopping rule

τ = inf{ t > 0 : a_t > C }

where a_t is called the alarm function and C the threshold. The design of the alarm function defines different change-point detection methods, while the value of the threshold reflects the degree of sensitivity of the stopping rule.
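The scaled-χ² calibration described above can be illustrated with a self-contained simulation (our own sketch: the empirical Q scores of compressed files are stood in for by draws of αX with the paper's fitted α = 1.7374, and α is re-estimated by a least-squares fit through the origin on Q–Q pairs against a Monte Carlo χ²(7) reference):

```python
import random

random.seed(0)
k = 7  # degrees of freedom

def chi2_draw() -> float:
    # A chi-square(7) variable as a sum of 7 squared standard normals.
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

# Stand-in for the empirical Q scores of compressed files: simulated
# here as alpha * chi2(7) with the paper's alpha = 1.7374.
true_alpha = 1.7374
q_scores = sorted(true_alpha * chi2_draw() for _ in range(20000))

# Reference chi-square(7) order statistics (Monte Carlo quantiles).
ref = sorted(chi2_draw() for _ in range(20000))

# Least-squares fit of Q ~ alpha * X through the origin on the Q-Q pairs:
# alpha_hat = sum(x*y) / sum(x*x).
alpha_hat = sum(x * y for x, y in zip(ref, q_scores)) / sum(x * x for x in ref)
print(round(alpha_hat, 2))  # close to 1.74
```

With real data, `q_scores` would instead hold the observed Q scores of the compressed-file corpus.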
The alarm function may be based on the likelihood ratio

L(s, t) = f_{Q^t}(q^t | θ = s ≤ t) / f_{Q^t}(q^t | θ > t)

where f_{Q^t}(q^t | A) is the conditional joint density function of the random variables Q^t = (Q_1, ..., Q_t) given A, and q^t is the vector of observed values of Q^t. Assuming independence of the variables Q_1, ..., Q_t, the likelihood ratio simplifies to

L(s, t) = ∏_{u=s}^{t} f_1(q_u) / f_0(q_u)

where f_0(q_u) is the marginal conditional density function of Q_u given that the shift has not occurred by time u, and f_1(q_u) is the marginal conditional density function of Q_u given that the shift occurred at or before time u.

[Figure 2: Distribution of the Q scores of compressed files (obtained by using more than 5000 files) with the distribution function of Q = 1.7374 · X where X ∈ χ²(7).]

The conditional density function of the Q score at time t, given that the data is encrypted (i.e. uniformly distributed), is

f_E(q_t) = q_t^{k/2-1} e^{-q_t/2} / (2^{k/2} Γ(k/2))

where k is the number of degrees of freedom, i.e. the number of classes (which in this study is 8, as explained above). For the non-encrypted files, the conditional Q score is modelled by αX where X ∈ χ²(k) and α > 1, supposedly reflecting the inflated deviances from the uniform distribution had the data been encrypted. Thus

f_NE(q_t) = ∂/∂q_t P(αX < q_t | X ∈ χ²(k)) = (q_t/α)^{k/2-1} e^{-q_t/(2α)} / (α 2^{k/2} Γ(k/2))    (1)

is the density function of the Q score of non-encrypted data. This means that two cases of shift in distribution are possible:

• Shift from non-encrypted to encrypted data, in which case

L(s, t) = ∏_{u=s}^{t} f_E(q_u)/f_NE(q_u) = α^{k(t-s+1)/2} exp( -((α-1)/(2α)) ∑_{u=s}^{t} q_u ).    (2)

• Shift from encrypted to non-encrypted data, in which case

L(s, t) = ∏_{u=s}^{t} f_NE(q_u)/f_E(q_u) = α^{-k(t-s+1)/2} exp( ((α-1)/(2α)) ∑_{u=s}^{t} q_u ).    (3)

To detect whether a shift in distribution has occurred according to the stopping rule τ mentioned above, an alarm function must be specified. Two of the most common choices are:

• CUSUM [6]: a_t = max_{1≤s≤t} L(s, t),
• Shiryaev [7]: a_t = ∑_{s=1}^{t} L(s, t).

Other possible choices are e.g. the Shewhart method, the Exponentially Weighted Moving Average (EWMA), the full Likelihood Ratio method (LR) and others; see e.g. [3] for a more extensive presentation of different methods. For the CUSUM alarm function, since

arg max_{1≤s≤t} L(s, t) = arg max_{1≤s≤t} ln L(s, t),

the alarm function can be simplified, without any loss of generality, by using log likelihood values instead. In both cases, the alarm functions can be expressed recursively, which facilitates running the algorithm in practice when data streams are big. The alarm function for a shift from non-encrypted to encrypted data is, for the

• CUSUM method:

a_0 = 0,  a_t = [a_{t-1} + (k ln α)/2]^+ + ((1-α)/(2α)) q_t,  t = 1, 2, 3, ...

• Shiryaev method:

a_0 = 0,  a_t = α^{k/2} e^{((1-α)/(2α)) q_t} (1 + a_{t-1}),  t = 1, 2, 3, ...

The alarm function for a shift from encrypted to non-encrypted data is, for the

• CUSUM method:

a_0 = 0,  a_t = [a_{t-1} - (k ln α)/2]^+ + ((α-1)/(2α)) q_t,  t = 1, 2, 3, ...

• Shiryaev method:

a_0 = 0,  a_t = α^{-k/2} e^{((α-1)/(2α)) q_t} (1 + a_{t-1}),  t = 1, 2, 3, ...

Evaluation

To quantify the quality of the different methods, their performance is compared with respect to relevant properties such as the time until false alarm, the delay of a motivated alarm, the credibility of an alarm and so on.
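The recursive alarm functions and the stopping rule can be sketched as follows (our own function names; the thresholds in the usage example are the ARL_0 = 500 values from Table 4 below, and each regime is represented by its mean Q score as a stylized input):

```python
import math

ALPHA, K = 1.7374, 8  # alpha from the least-squares fit; 8 byte classes

def tau_cusum(q_scores, C, to_encrypted=True, alpha=ALPHA, k=K):
    """First time t with a_t > C for the recursive (log-scale) CUSUM
    alarm function; to_encrypted selects the direction of the shift."""
    a = 0.0
    drift = k * math.log(alpha) / 2.0
    for t, q in enumerate(q_scores, start=1):
        if to_encrypted:                      # non-encrypted -> encrypted
            a = max(a + drift, 0.0) + (1 - alpha) / (2 * alpha) * q
        else:                                 # encrypted -> non-encrypted
            a = max(a - drift, 0.0) + (alpha - 1) / (2 * alpha) * q
        if a > C:
            return t
    return None

def tau_shiryaev(q_scores, C, to_encrypted=True, alpha=ALPHA, k=K):
    """Same stopping rule with the recursive Shiryaev alarm function."""
    a = 0.0
    for t, q in enumerate(q_scores, start=1):
        sign = -1.0 if to_encrypted else 1.0
        a = alpha ** (-sign * k / 2) * math.exp(sign * (alpha - 1) / (2 * alpha) * q) * (1 + a)
        if a > C:
            return t
    return None

# Stylized stream: 200 "compressed" clusters (mean Q = 7*alpha), then
# "encrypted" clusters (mean Q = 7).
q = [7 * ALPHA] * 200 + [7.0] * 100
print(tau_cusum(q, C=2.6529))        # alarms a few clusters after t = 200
print(tau_shiryaev(q, C=323.0625))
```

On real data the Q scores are noisy, so alarms arrive with some additional delay; the passive procedure would then continue past the alarm with the reversed likelihood ratio.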
The threshold is commonly set with respect to the average run length, ARL_0, which is defined as the expected time until an alarm when no parameter shift actually occurred (meaning that this alarm is a false alarm). It is crucial to have the right threshold values for the methods to perform as specified.

Table 1: Values of the expected delay ED for the CUSUM and Shiryaev methods for ARL_0 = 100, 500, 2 500, 10 000, for a shift from encrypted to compressed data.

  ν       CUSUM                                     Shiryaev
          100      500      2 500    10 000         100      500      2 500    10 000
  0.2     4.7844   7.1087   9.5520   11.6905        4.9672   7.3116   9.7634   11.9159
  0.15    4.7674   7.0788   9.5109   11.6495        4.8933   7.2409   9.6760   11.8015
  0.07    4.7278   7.0162   9.4401   11.5695        4.7455   7.0610   9.4786   11.6114
  0.05    4.7176   7.0017   9.4224   11.5420        4.7021   6.9975   9.4308   11.5441
  0.02    4.6957   6.9712   9.3860   11.4940        4.6422   6.9144   9.3175   11.4477
  0.01    4.6870   6.9581   9.3698   11.4693        4.6150   6.8633   9.2743   11.3973

Setting the threshold such that ARL_0 is 100, 500, 2 500 and 10 000 respectively (the most common values are ARL_0 = 100 and 500, but the higher values are also considered since ARL_0 determines the number of clusters/time points that are treated before a false alarm, and the shift could occur very far into the HDD), properties of the methods regarding delay and credibility of a motivated alarm can be compared. Of course, a low threshold will lead to more false alarms (detection of a change when there is none), but specifying too high a threshold will lead to a drop in the sensitivity of the method (higher delay between a shift and its detection) and consequently an increased probability of missing a real shift in distribution.
The expected delay, ED(ν) = E_ν(τ - θ | θ < τ) (the expectation of the delay of a motivated alarm; see Table 1), and the conditional expected delay, CED(t) = E(τ - θ | τ > θ = t) (the expectation of the delay when the change point is fixed equal to t), are important measures of performance for many applications. However, in the case of detecting encrypted data, expected delays are less relevant as a measure of performance, since the data can be handled without any time aspect: the goal is to detect accurately where the encrypted data is located. A method with long expected or conditional expected delay merely means a slightly less efficient procedure. A more relevant performance indicator in this case is, for instance, the predictive value PV = P(θ < τ) (the probability that the method signals an alarm when the change point has actually occurred; see Table 5 and Figure 5), or the percentage of the encrypted files that is discovered while running the process and how to improve it (see Figure 2).

While running the process, the method will stop at some time, τ, and then estimate the change point, θ, by maximizing the likelihood function over the data between the last previous alarm and the newly detected change point. This estimated change point, θ̂, can be either before or after the true change point θ. One could also inflate the intervals where encrypted data was discovered. This would lead to missing less encrypted data (see Table 2), but also to brute-forcing more non-encrypted data. In Table 2, intervals of the form [τ_1 - i, τ_2 + i] are considered as estimated regions of encrypted data. Even with large i the values are typically very close to, but not exactly equal to, 1; this happens because some change points are very close (i.e. less than 10 clusters apart, for example) and the method then does not detect any change there, so that ciphertext goes undetected.
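The maximum-likelihood estimation of the change point after an alarm might look as follows (a sketch; the function name and the backward scan over candidate values of s are our own, using the likelihood ratios (2) and (3)):

```python
import math

ALPHA, K = 1.7374, 8  # scale parameter alpha and number of classes

def estimate_change_point(q_scores, tau, to_encrypted=True, alpha=ALPHA, k=K):
    """After an alarm at time tau (1-based), return the s in 1..tau that
    maximizes log L(s, tau), with L from (2) or (3).
    q_scores holds the Q scores observed since the previous alarm."""
    sign = 1.0 if to_encrypted else -1.0
    best_s, best_ll, tail = tau, -math.inf, 0.0
    for s in range(tau, 0, -1):          # scan candidates, reusing the tail sum
        tail += q_scores[s - 1]
        n = tau - s + 1
        ll = sign * (n * k * math.log(alpha) / 2 - (alpha - 1) / (2 * alpha) * tail)
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s

# Stylized Q scores: 10 "compressed" clusters (mean 7*alpha), then 5
# "encrypted" ones, with the alarm assumed at tau = 15.
print(estimate_change_point([7 * ALPHA] * 10 + [7.0] * 5, tau=15))  # true shift: 11
```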
Table 2: Percentage of the encrypted files that are detected when the interval of detected change points [τ_1, τ_2] is inflated to [τ_1 - i, τ_2 + i], for i = 0, 1, 2, 3, 4, 5, 10, 50, 100.

  i     CUSUM      Shiryaev
  0     0.960254   0.961280
  1     0.971053   0.971274
  2     0.976242   0.978665
  3     0.979266   0.980872
  4     0.983681   0.985597
  5     0.986248   0.986101
  10    0.990101   0.990101
  50    0.993371   0.994931
  100   0.994326   0.995682

Therefore the difference between the change points and the alarms signalled by the method is calculated. Since the proportion of encrypted data relative to the total amount of data on the HDD is unknown, the expected proportion of error is suggested. That is to say, given two consecutive change points, θ_1 and θ_2, and two corresponding stopping times, τ_1 and τ_2, the expected proportion of error is

E( (|τ_1 - θ_1| + |τ_2 - θ_2|) / (θ_2 - θ_1) ).

But of course, this value makes sense only when there are no false alarms between τ_1 and τ_2. If there are false alarms between τ_1 and τ_2, the proportion of undetected encrypted data is added to the proportion of error, to determine the proportion of the error made relative to the size of the encrypted data. Assuming that there are n false alarms τ'_1 < ... < τ'_n in [τ_1, τ_2], the expected inaccuracy, or EI for short, is defined as follows:

EI(ν) = E_ν( (|τ_1 - θ_1| + |τ_2 - θ_2|) / (θ_2 - θ_1) + ∑_{i=1}^{n/2} (τ'_{2i} - τ'_{2i-1}) / (θ_2 - θ_1) ).

The EI was measured for different values of the parameter ν in the geometric distribution of the change points, for the different methods (see Table 3 and Figure 3).

[Figure 3: Expected inaccuracy EI of the CUSUM and Shiryaev procedures. The Shiryaev method is a little less accurate as ν increases, but slightly more accurate for small ν, compared with the CUSUM procedure.]
Table 3: EI for the CUSUM and Shiryaev change-point detection methods, for some values of the parameter ν.

  ν         0.2      0.15     0.07     0.05     0.02     0.01     0.005    0.001
  CUSUM     0.11379  0.11202  0.09559  0.08683  0.06214  0.04426  0.03096  0.01655
  Shiryaev  0.11693  0.11276  0.09688  0.08890  0.06207  0.04398  0.03038  0.01603

Complete procedure

The complete procedure returns a segmentation separating suspected encrypted data from most likely non-encrypted data on an HDD, information provided in order to carry out the subsequent brute force cryptanalysis efficiently. The procedure runs a likelihood ratio based change-point detection method; as soon as it detects a change, it maximizes the likelihood function to find the most likely estimator of the change point, determining where the real change is most likely located. It then starts over from the location of this estimated change point with the same method for on-line change-point detection, except that the likelihood ratio is reversed, modifying the alarm function to fit the opposite change-point situation, and so on.

Table 4: Values of the thresholds for the CUSUM and Shiryaev methods for ARL_0 = 100, 500, 2 500, 10 000, specified for detecting a shift from non-encrypted to encrypted data (NE → E for short) and for a shift from encrypted to non-encrypted data (E → NE) respectively.

  ARL_0    CUSUM NE → E   CUSUM E → NE   Shiryaev NE → E   Shiryaev E → NE
  100        1.2260         4.5801           64.0313           44.1271
  500        2.6529         6.1250          323.0625          221.8125
  2 500      4.2188         7.7120         1618.219           735.4088
  10 000     5.5296         9.0990         6475.0547         4441.8413

Results

Thresholds and experimental values for the methods

The first step in establishing properties of the change-point detection methods is to determine the thresholds rendering the desired values of the average run length, ARL_0. One purpose of this is simply to link the threshold values to a property which relates to the probability of false alarm.
This is a result in itself, but it is also necessary for calibrating the methods so that they are comparable in terms of other performance measures related to a motivated alarm, such as expected delay, predictive value and expected inaccuracy. Here the values ARL_0 = 100, 500, 2 500 and 10 000 are considered, for both the CUSUM and Shiryaev methods, for a shift from encrypted to non-encrypted data and vice versa. The change points are commonly modelled as geometrically distributed with parameter ν. Here the average time before a change point is expected to be rather high (several hundreds or thousands, maybe), as the methods deal with 64-byte clusters on an HDD of surely several hundreds of giga- or terabytes. Thus, since E(θ) = 1/ν, the focus is on very small values of ν for the methods to be sensibly trigger-happy.

Common values of ARL_0 are 100 or 500, in order to make other properties relevant for comparisons. In this application, however, values of ARL_0 as large as 2 500 and 10 000 are also studied, because the first change point might not occur until far into the HDD. Adjusting the threshold by simulating data can take a very long time if ARL_0 is large (2 500 or 10 000), especially for the Shiryaev method; in this case it can take several hours or even days to compute the threshold. Therefore it would be interesting to have a way of predicting the threshold by extrapolation, i.e. an explicit relation between ARL_0 and the threshold C. Intuitively, if ARL_0 is larger, more data will be taken into account, implying a proportionally larger threshold: when ARL_0 increases, more data is used in the procedure, and the threshold is therefore increased in proportion to how much more data was treated.
In the CUSUM case, since the alarm function is defined by means of the log likelihood ratio, the relationship between the threshold C and ARL_0 is logarithmic:

• for a shift from encrypted data to non-encrypted data:
  C = 0.997767 · ln(0.912316 · ARL_0 + 7.294950)

• for a shift from non-encrypted data to encrypted data:
  C = 0.965524 · ln(0.030655 · ARL_0 + 0.494603)

For Shiryaev, the threshold C is a linear function of ARL_0:

• for a shift from encrypted data to non-encrypted data:
  C = 0.444214 · ARL_0 + 0.294281

• for a shift from non-encrypted data to encrypted data:
  C = 0.647578 · ARL_0 - 0.726563

Conclusions

Using change-point theory, methods to detect encrypted data on HDDs were successfully derived and evaluated. These methods exploit the fact that encrypted data is uniformly distributed, as opposed to other types of files. The methods were designed to detect the difference between encrypted and non-encrypted data, where the kind of non-encrypted data most similar to the encrypted data was compressed data. As the proposed methods detect even this small a difference in the data, any bigger deviance will be detected even more easily.

Table 5: Predictive value PV(ν) = P_ν(θ < τ), i.e. the probability that a shift has occurred when an alarm is signalled, for the CUSUM and Shiryaev methods, for different values of ARL_0 and different values of the parameter ν in the geometric distribution of the change points.

  ν       CUSUM                                   Shiryaev
          100      500     2 500   10 000         100      500     2 500   10 000
  0.20    0.8950   0.9862  0.9984  0.9997         0.9859   0.9984  0.9998  1.0000
  0.15    0.8727   0.9805  0.9974  0.9995         0.9736   0.9967  0.9996  0.9999
  0.07    0.8031   0.9604  0.9933  0.9986         0.9147   0.9852  0.9976  0.9995
  0.05    0.7634   0.9482  0.9907  0.9978         0.8708   0.9754  0.9957  0.9991
  0.02    0.6171   0.8942  0.9784  0.9944         0.6927   0.9235  0.9847  0.9964
  0.01    0.4712   0.8207  0.9590  0.9890         0.5153   0.8463  0.9661  0.9918
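The fitted threshold relations above can be written directly as functions of ARL_0 and cross-checked against the simulated values in Table 4 (a sketch; the function names are ours):

```python
import math

def cusum_threshold(arl0, to_encrypted=True):
    """Fitted logarithmic relation between ARL_0 and the CUSUM threshold C."""
    if to_encrypted:   # shift non-encrypted -> encrypted
        return 0.965524 * math.log(0.030655 * arl0 + 0.494603)
    return 0.997767 * math.log(0.912316 * arl0 + 7.294950)

def shiryaev_threshold(arl0, to_encrypted=True):
    """Fitted linear relation between ARL_0 and the Shiryaev threshold C."""
    if to_encrypted:
        return 0.647578 * arl0 - 0.726563
    return 0.444214 * arl0 + 0.294281

# Cross-check against the simulated thresholds in Table 4:
print(round(cusum_threshold(100), 4))                      # Table 4: 1.2260
print(round(cusum_threshold(500, to_encrypted=False), 4))  # Table 4: 6.1250
print(round(shiryaev_threshold(500), 4))                   # Table 4: 323.0625
```

The fits reproduce most Table 4 entries to within a few decimals, which is what makes the extrapolation to large ARL_0 attractive.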
Quick and accurate detection of a change is commonly the desired property of change-point detection methods. In many applications, time aspects of the methods are of interest, e.g. the expected delay in detecting a shift or the probability of detecting a shift within a specified time interval. Here, however, this time aspect is not of primary interest, since the data remain the same during the whole process; the probability of correctly detecting encrypted data is more relevant. The investigation shows that the proposed methods detect more than 96% of the encrypted data and, by extending the intervals, more than 99% of it. By assuming that the change points are not too close (a plausible assumption, since it is unlikely that files are so small unless the device is very fragmented), the method, by adding a little margin to the intervals, quickly detects 100% of the encrypted data.

The change point itself is a random variable. However, for some of the performance measures, the results depend on whether the change is more likely to occur early or late. In the application of detecting encrypted data, the distance between the shift from encrypted to non-encrypted data and vice versa is typically longer than 20 clusters, which corresponds to a parameter value ν < 0.05 in the geometric distribution of the change point. Therefore these values are the more interesting ones when the hard drive is not very strongly fragmented. In that regime the Shiryaev method turns out to be slightly better than the CUSUM method with respect to expected delay. The Shiryaev method also detects more encrypted data than the CUSUM method and has a slightly higher predictive value PV. All in all, this implies that both methods, designed with the suggested modelling, perform very well, with a slight preference for the Shiryaev method for detecting encrypted data on an HDD.
Acknowledgements

The authors wish to express their gratitude to Mattias Weckstén at Halmstad University for good ideas and previous readings of the manuscript, and to Linus Nissi (formerly Linus Barkman) at the Police Department of Southern Sweden for earlier work in the area.

References

[1] Barkman, L. Detektering av krypterade filer [Detection of encrypted files], Teknologie Kandidatexamen i Datateknik, diva2:428544, Halmstad University.
[2] Frisén, M. Properties and use of the Shewhart method and its followers, Sequential Analysis, 26(2) (2007) 171–193. DOI: 10.1080/07474940701247164
[3] Frisén, M. Statistical surveillance. Optimality and methods, International Statistical Review, 71(2) (2003) 403–434. DOI: 10.1111/j.1751-5823.2003.tb00205.x
[4] Frisén, M. and de Maré, J. Optimal surveillance, Biometrika, 78(2) (1991) 271–280. DOI: 10.2307/2337252
[5] Järpe, E. Surveillance, environmental, Encyclopedia of Environmetrics, 4 (2013) 2150–2153. DOI: 10.1002/9780470057339.vas065.pub2
[6] Page, E.S. Continuous inspection schemes, Biometrika, 41(1/2) (1954) 100–115. DOI: 10.1093/biomet/41.1-2.100
[7] Shiryaev, A.N. On optimum methods in quickest detection problems, Theory of Probability and Its Applications, 8(1) (1963) 22–46. DOI: 10.1137/1108002
[8] Swedish Civil Contingencies Agency. Informationssäkerhet – trender 2015 [Information security – trends 2015], Myndigheten för Samhällsskydd och Beredskap, (2015).
[9] The Swedish Police. Polisens rapport om organiserad brottslighet 2015 [The police report on organized crime 2015], National Operations Department, (2015).
[10] Westfeld, A. and Pfitzmann, A. Attacks on steganographic systems: Breaking the steganographic utilities EzStego, Jsteg, Steganos, and S-Tools, and some lessons learned, 3rd International Workshop on Information Hiding, Dresden, Germany (2000) 61–76.
[Figure 4: Expected delays, ED, for a shift from encrypted to compressed data for the CUSUM procedure (blue) and the Shiryaev procedure (red), for ARL_0 = 100, 500, 2 500 and 10 000.]

[Figure 5: Predictive values for a shift from compressed to encrypted data for the CUSUM procedure (left) and the Shiryaev procedure (right), for ARL_0 = 100, 500, 2 500 and 10 000.]
