JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
Randomness Testing of Compressed Data
Weiling Chang1*, Binxing Fang1,2, Xiaochun Yun2, Shupeng Wang2, Xiangzhan Yu1
Abstract—Random number generators play a critical role in a number of important applications. In practice, statistical testing is employed to gather evidence that a generator indeed produces numbers that appear to be random. In this paper, we report on studies conducted on data compressed with eight compression algorithms or compressors. The test results suggest that the output of these compression algorithms or compressors has poor randomness, and that the compression algorithms or compressors are therefore not suitable as random number generators. We also found that, for the same compression algorithm, there is a positive correlation between compression ratio and randomness: increasing the compression ratio increases the randomness of the compressed data. As time permits, additional randomness testing efforts will be conducted.
Index Terms— Compression technologies, Data compaction and compression, Random number generation
1 INTRODUCTION
Random number generators are an important primitive widely used in simulation and cryptography. Generating good random numbers is therefore of critical importance for scientific software environments.
There are many different ways to test for randomness, but all of them, in essence, boil down to computing a mathematical metric from the data stream being tested and comparing the result with the expectation value for an infinite sequence of genuinely random data. Output from well-designed pseudo-random number generators should pass assorted statistical tests probing for non-randomness.
This paper describes how the output for each of the
lossless data compressors was collected and then evalu-
ated for randomness. It discusses what was learned utiliz-
ing the NIST statistical and the Diehard tests and offers
an interpretation of the empirical results. In Section 2, the
randomness testing experimental setup is defined and
described. In Section 3, the empirical results compiled to
date are discussed and the interpretation of the test re-
sults is presented. Lastly, a summary of lessons learned is
presented.
2 METHODOLOGY
2.1 Test files
We carried out random tests on the contents of five cor-
pora: the Calgary Corpus [1], the Canterbury Corpus [1,
2], the Maximum Compression Corpus, the 100MB file
enwik8, and the HitIct corpus.
Two well-known data sets, the Calgary Corpus and the Canterbury Corpus, are used by researchers in the universal lossless data compression field. Over the years of use, these two corpora have shown some important disadvantages, the most important of which, in our opinion, are the lack of large files and an over-representation of English-language texts. To avoid these two disadvantages, we introduced three other data sets: the Maximum Compression Corpus, the 100MB file enwik8, and the HitIct corpus.
The Maximum Compression benchmark [3] is a website maintained by Werner Bergmans. It uses two data sets, one public and one private. The Maximum Compression Corpus is the public data set of MaximumCompression, which consists of about 55 MiB in 10 files of various types: text in various formats, executable data, and images. enwik8 [4] is the first 100,000,000 characters of a specific version of the English Wikipedia. The HitIct corpus [5] is a Chinese corpus consisting of 10 files derived from Chinese-language applications.
Four basic compression algorithms and four popular compressors were tested: Huffman coding, arithmetic coding, LZSS and LZW, which were adapted from the code accompanying the data compression book [6], together with PPMVC [7], WinZip 12.1, WinRAR 3.90 and WinRK 3.12. For each of these algorithms/compressors, we compressed and analyzed the five different data sets.
2.2 Randomness tests
Randomness is a probabilistic property; the properties of
a random sequence are characterized and described in
terms of probability. There are an infinite number of pos-
sible statistical tests, each assessing the presence or ab-
sence of a pattern which, if detected, would indicate that
the sequence is non-random. Because there are so many
tests for judging whether a sequence is random or not, no
specific finite set of tests is deemed complete.
————————————————
1. Research Centre of Computer Network and Information Security Technology, Harbin Institute of Technology, Harbin 150001, China
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
* Corresponding author.
Supported by the National High-Tech Development 863 Program of China (Grant Nos. 2009AA01A403, 2007AA01Z406, 2007AA010501, 2009AA01Z437)
© 2009 Journal of Computing, http://sites.google.com/site/journalofcomputing/
In this paper, we will focus on the NIST 800-22 statistical test suite and the Diehard test suite.
The NIST Statistical Test-Suite
NIST has developed a suite of 15 tests to test the ran-
domness of binary sequences produced by either hard-
ware or software based cryptographic random or pseudo-
random number generators. The tests have been docu-
mented in NIST Special Publication (SP) 800-22, “A Sta-
tistical Test Suite for Random and Pseudorandom Num-
ber Generators for Cryptographic Applications” [8].
These tests focus on a variety of different types of non-
randomness that could exist in a sequence. The publica-
tion and the associated tests are intended for individuals
who are responsible for the testing and evaluation of ran-
dom and pseudorandom number generators, including
(P)RNG developers and testers. SP 800-22 provides a
high-level description and examples for each of the 15
tests, along with the mathematical background for each
test. The 15 tests are listed in Table 1 [14].
TABLE 1. GENERAL CHARACTERISTICS OF NIST STATISTICAL TESTS

Test Name                                   | Characteristics (what a failure indicates)
Frequency (Monobit) Test                    | Too many zeroes or ones.
Frequency Test within a Block               | Too many zeroes or ones.
Cumulative Sums Test                        | Too many zeroes or ones at the beginning of the sequence.
Runs Test                                   | A large (small) total number of runs indicates that the oscillation in the bit stream is too fast (too slow).
Test for the Longest Run of Ones in a Block | Deviation of the distribution of long runs of ones.
Binary Matrix Rank Test                     | Deviation of the rank distribution from that of a corresponding random sequence, due to periodicity.
Discrete Fourier Transform (Spectral) Test  | Periodic features in the bit stream.
Non-overlapping Template Matching Test      | Too many occurrences of non-periodic templates.
Overlapping Template Matching Test          | Too many occurrences of m-bit runs of ones.
Maurer's "Universal Statistical" Test       | Compressibility (regularity).
Approximate Entropy Test                    | Non-uniform distribution of m-length words; small values of ApEn(m) imply strong regularity.
Random Excursions Test                      | Deviation from the distribution of the number of visits of a random walk to a certain state.
Random Excursions Variant Test              | Deviation from the distribution of the total number of visits (across many random walks) to a certain state.
Serial Test                                 | Non-uniform distribution of m-length words; similar to Approximate Entropy.
Linear Complexity Test                      | Deviation from the distribution of the linear complexity for finite-length (sub)strings.
The NIST framework is based on hypothesis testing. A
hypothesis test is a procedure for determining if an asser-
tion about a characteristic of a population is reasonable.
In this case, the test involves determining whether or not
a specific sequence of zeroes and ones is random.
For each statistical test, a set of P-values is produced.
The P-value is the probability of obtaining a test statistic
as large or larger than the one observed if the sequence is
random. Hence, small values are interpreted as evidence
that a sequence is unlikely to be random. The decision rule in this case states that, for a fixed significance level α, a sequence fails the statistical test if its P-value < α. A sequence passes a statistical test whenever the P-value ≥ α and fails otherwise. If the significance level α of a test of H0 (the hypothesis that a given binary sequence was produced by a random bit generator) is too high, then the test may reject sequences that were, in fact, produced by a random bit generator (Type I error). On the other hand, if the significance level α of a test of H0 is too low, then there is the danger that the test may accept sequences even though they were not produced by a random bit generator (Type II error). It is, therefore, important that the test be carefully designed to have a significance level that is appropriate for the purpose at hand. However, the calculation of the Type II error is more difficult than the calculation of α because many possible types of non-randomness may exist. Therefore, the NIST statistical test suite adopts two further analyses in order to minimize the probability of accepting a sequence as produced by a good generator when the generator was actually bad. First, for each test, a set of output sequences is subjected to the test, and the proportion of sequences whose corresponding P-value satisfies P-value ≥ α is calculated. If the proportion of success-sequences falls outside of the following acceptance (confidence) interval, there is evidence that the data is non-random.
p̂ ± k·√(p̂(1 − p̂)/n)        (1)

where p̂ = 1 − α, k = 3 is the number of standard deviations, and n is the sample size. If the proportion falls
outside of this interval, then there is evidence that the
data is non-random.
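As an illustration (ours, not part of the paper), the acceptance interval of equation (1) can be computed as follows; with α = 0.01 and n = 958 it reproduces the minimum pass rate of 0.980356 quoted for the ari file in Table 3.

```python
import math

def proportion_interval(alpha: float, n: int, k: float = 3.0):
    """Acceptance interval for the proportion of passing sequences:
    p_hat +/- k*sqrt(p_hat*(1 - p_hat)/n), with p_hat = 1 - alpha."""
    p_hat = 1.0 - alpha
    half_width = k * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# alpha = 0.01 and n = 958 sequences (the ari file in Table 2):
# the lower bound matches the "minimum pass rate" quoted in Table 3.
low, high = proportion_interval(0.01, 958)
print(round(low, 6))  # 0.980356
```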
Second, the distribution of P-values is calculated for each test. If the test sequences are truly random, the P-values are expected to be uniformly distributed in [0, 1). NIST recommends a χ² test in which the interval between 0 and 1 is divided into 10 sub-intervals; this is the test of uniformity of the P-values, with 9 degrees of freedom in this case. Let Fi be the number of P-values falling in the i-th sub-interval and s be the number of sequences; the χ² statistic is then given as below.

χ² = Σ(i=1..10) (Fi − s/10)² / (s/10)        (2)

The P-value of the P-values is calculated as P-value = igamc(9/2, χ²/2), where igamc is the complementary incomplete gamma function. If this P-value ≥ 0.0001, i.e., the statistic falls in the acceptance region χ² ≤ 33.72, then the set of P-values can be considered to be uniformly distributed.
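A minimal sketch of this uniformity check (our own illustration, not the NIST reference implementation):

```python
def chi_square_uniformity(p_values):
    """Equation (2): chi2 = sum over 10 bins of (F_i - s/10)^2 / (s/10)."""
    s = len(p_values)
    bins = [0] * 10
    for p in p_values:
        # P-values lie in [0, 1); bin i covers [i/10, (i+1)/10).
        bins[min(int(p * 10), 9)] += 1
    expected = s / 10
    return sum((f - expected) ** 2 / expected for f in bins)

# Perfectly uniform P-values give chi2 = 0, far inside the 33.72 threshold;
# a histogram like ari's in Table 3 (445, 114, 83, ...) is far outside it.
uniform = [(i + 0.5) / 1000 for i in range(1000)]
print(chi_square_uniformity(uniform) <= 33.72)  # True
```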
The Diehard Test-Suite
The Diehard tests are a battery of statistical tests for
measuring the quality of a set of random numbers. They
were developed by Professor George Marsaglia of Florida
State University over several years and first published in
1995 on a CD-ROM of random numbers. The DIEHARD
suite of statistical tests [9] consists of 18 tests. These tests
are exquisitely sensitive to subtle departures from ran-
domness, and their results can all be expressed as the
probability that the results obtained would be observed in a
genuinely random sequence. Probability values close to
zero or one indicate potential problems, while probabili-
ties in the middle of the range are expected for random
sequences.
3 EMPIRICAL RESULTS & ANALYSIS
In this section, we show the results of the statistical tests for the 5 corpora. Randomness can be defined only statistically over a long sequence; it is impossible to comment on the randomness of a bit sequence from a single bit sample, so we performed the above two tests many times. We constructed eight test files, one per compression algorithm or compressor, as follows: we encoded all the test files using each compression algorithm or compressor; since many compression techniques add a somewhat predictable preface to their output stream, we skipped the first 1024 bytes of each compressed sequence, and then concatenated the results into one file. For each statistical test, further analyses were conducted. Table 2 highlights these categories of data.
TABLE 2. CATEGORIES OF DATA

File Name | Compressor        | Size (Bytes) | Number of Sequences* (NIST test) | Number of Pieces (Diehard test)
ari       | Arithmetic Coding | 125646724    | 958                              | 10
huff      | Huffman Coding    | 126397455    | 964                              | 11
lzssj     | LZSS              | 92964564     | 709                              | 8
lzwj      | LZW               | 125083724    | 954                              | 10
zip       | WinZip            | 67976783     | 518                              | 5
rar       | WinRAR            | 58233804     | 444                              | 5
pmv       | PPMVC             | 51596913     | 393                              | 4
winrk     | WinRK             | 41190604     | 314                              | 3

Each file was constructed by merging all files in the five corpora compressed using one of the eight different algorithms/compressors.
*Except for the random excursions (variant) test.
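The construction procedure described above can be sketched as follows. This is an illustration only: zlib stands in for the eight algorithms/compressors actually used, and `build_test_file` is a hypothetical helper, not code from the paper.

```python
import zlib

def build_test_file(input_blobs, skip=1024):
    """Compress each input, drop the first `skip` bytes of each compressed
    stream (to discard the predictable preface), and concatenate the rest."""
    parts = []
    for blob in input_blobs:
        compressed = zlib.compress(blob, 9)
        parts.append(compressed[skip:])
    return b"".join(parts)

# Usage sketch: one concatenated test file per compressor.
blobs = [bytes(range(256)) * 8192, b"the quick brown fox " * 65536]
test_file = build_test_file(blobs)
```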
3.1 Tests with the NIST Statistical Test Suite
The NIST Statistical Test Suite consists of 15 core statisti-
cal tests that, under different parameter inputs, can be
viewed as 189 statistical tests. Each P-value corresponds
to the application of an individual statistical test on a sin-
gle binary sequence. Randomness testing was performed
using the following strategy:
Input parameters such as the sequence length and significance level were set at 2^20 bits and 0.01, respectively. For each binary sequence and each statistical test, a P-value was reported, and a success/failure assessment was made based on whether it reached the pre-selected significance level. For each statistical test
and each test file, two evaluations were made. First, the
proportion of binary sequences in a test file that passed
the statistical test was calculated. Second, an additional P-
value was calculated, based on a χ2 test applied to the P-
values in the entire sample to ensure uniformity. For both
measures described above, an assessment was made. A
sample was considered to have passed a statistical test if
it satisfied both the proportion and uniformity assess-
ments.
3.1.1 Frequency (Monobit) Test
For a truly random sequence, any value in a given
random sequence has an equal chance of occurring, and
various other patterns in the data should also be distributed equiprobably, i.e. have a uniform distribution. The
focus of the Monobit test is the proportion of zeroes and
ones for the entire sequence. The purpose of this test is to
determine whether the number of ones and zeros in a
sequence are approximately the same as would be ex-
pected for a truly random sequence.
NIST recommends that the Monobit test should be ap-
plied first, since this supplies the most basic evidence for
the existence of non-randomness in a sequence, specifi-
cally, non-uniformity. All subsequent tests depend on
passing this test. If the results of this test support the null
hypothesis, then the user may proceed to apply other sta-
tistical tests.
For this test, the zeros and ones of the input sequence ε = ε1ε2…εn are converted to values of −1 and +1 and are added together to produce
Sn = X1 + X2 + … + Xn
where Xi = 2εi − 1. For example, if ε = 1100101101, then n = 10 and Sn = 1 + 1 + (−1) + (−1) + 1 + (−1) + 1 + 1 + (−1) + 1 = 2. Table 3 shows the results of the NIST Monobit test for the eight test files. All eight files fail. However, the test files generated by arithmetic coding (PPMVC uses arithmetic coding to encode the actual selected symbols) have higher success rates (0.7380 for arithmetic coding and 0.7817 for PPMVC, respectively).
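The Monobit computation can be sketched as follows; the closing P-value formula, P = erfc(|Sn|/√(2n)), is taken from SP 800-22 and added here for completeness.

```python
import math

def monobit_p_value(bits: str) -> float:
    """NIST Frequency (Monobit) test: S_n = sum of (2*eps_i - 1),
    P-value = erfc(|S_n| / sqrt(2 * n))."""
    n = len(bits)
    s_n = sum(2 * int(b) - 1 for b in bits)
    return math.erfc(abs(s_n) / math.sqrt(2 * n))

# The paper's example: eps = 1100101101 gives n = 10, S_n = 2,
# and a P-value of about 0.527 (well above alpha = 0.01).
p = monobit_p_value("1100101101")
```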
TABLE 3. RESULTS FOR THE UNIFORMITY OF P-VALUES AND THE PROPORTION OF PASSING SEQUENCES

File/Compressor       | C1  | C2  | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | P-VALUE  | PROPORTION | Result
ari/Arithmetic Coding | 445 | 114 | 83 | 64 | 55 | 40 | 42 | 34 | 43 | 38  | 0.000000 | 0.7422     | Fail
huff/Huffman Coding   | 964 | 0   | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0   | 0.000000 | 0.0000     | Fail
lzssj/LZSS            | 709 | 0   | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0   | 0.000000 | 0.0000     | Fail
lzwj/LZW              | 954 | 0   | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0   | 0.000000 | 0.0000     | Fail
zip/WinZip            | 501 | 6   | 1  | 5  | 2  | 1  | 1  | 0  | 0  | 1   | 0.000000 | 0.0521     | Fail
rar/WinRAR            | 327 | 21  | 9  | 11 | 16 | 15 | 9  | 12 | 9  | 15  | 0.000000 | 0.3964     | Fail
pmv/PPMVC             | 213 | 44  | 31 | 28 | 24 | 13 | 20 | 8  | 5  | 7   | 0.000000 | 0.7684     | Fail
winrk/WinRK           | 234 | 22  | 15 | 7  | 6  | 5  | 6  | 7  | 7  | 5   | 0.000000 | 0.4618     | Fail

The minimum pass rates are approximately 0.980356 for a sample size of 958 binary sequences (ari), 0.980386 for 964 (huff), 0.978790 for 709 (lzssj), 0.980336 for 954 (lzwj), 0.976885 for 518 (zip), 0.975834 for 444 (rar), 0.974943 for 393 (pmv), and 0.973155 for 314 (winrk).
Figure 1 depicts the differences between the proportion of zeroes and ones for all the 46 test files. In each figure, the x-axis is the test file and the y-axis represents the proportion difference (= 1 − (count of ones)/(count of zeros)).

Fig. 1. The differences between the proportion of zeroes and ones (the x-axis indicates the test file, sorted in descending order by file size).

It can be seen from figure 1 that, for all the compression algorithms/compressors, the numbers of ones and zeros in the output are not uniform. We also notice from figure 1 that the output of WinZip, WinRAR, PPMVC and WinRK is much closer to uniform than that of the other algorithms.
Fig. 2. Standard deviations of the proportion differences (grammar.rk, xargs.rk)

As can be seen from figure 2, WinZip, WinRAR, PPMVC and WinRK have a lower standard deviation than arithmetic coding, Huffman, LZW and LZSS, indicating that their proportion differences tend to be very close to the mean (close to zero), whereas the proportion differences of arithmetic coding, Huffman, LZW and LZSS are spread out over a large range of values. It seems that the distribution of zeros and ones gets much closer to uniform as the compression ratio increases.
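The proportion difference plotted in figures 1 and 2 can be computed for a byte stream as follows (our sketch; it assumes the stream contains at least one zero bit).

```python
def proportion_difference(data: bytes) -> float:
    """1 - (count of ones)/(count of zeros) over the bits of `data`.
    Assumes at least one zero bit is present."""
    ones = sum(bin(byte).count("1") for byte in data)
    zeros = 8 * len(data) - ones
    return 1.0 - ones / zeros

# A stream with perfectly balanced bits gives a difference of 0.
print(proportion_difference(b"\x0f" * 1024))  # 0.0
```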
3.1.2 More NIST Test Results
Table 4 shows the results for the eight test files. For every test file, the first column shows the uniformity of the P-values for each test and the second column shows the passing ratio of each test. None of the eight test files passes the NIST test suite. The two test files pmv and winrk pass 4 of the 15 NIST tests, and they seem to be more random than the other test files. It is also noted that none of the 15 tests is passed by the two test files huff (the output of Huffman coding) and lzwj (the output of LZW). The reason will be explained later.
TABLE 4. NIST TEST RESULTS OF EIGHT TEST FILES (THE √ SYMBOL DENOTES PASS, ‘BLANK’ DENOTES FAIL)
Test Name
ari
huff
lzssj
lzwj
zip
rar
pmv
winrk
Frequency (Monobit)
BlockFrequency
√
CumulativeSums
Runs
√
LongestRun
Rank
√ √
√ √
√ √
√
√
√
√
√ √
FFT
√ √
√
√
NonOverlappingTemplate
OverlappingTemplate
Universal
√
√
√ √
ApproximateEntropy
RandomExcursions
√
√
√ √
√
√
√ √
RandomExcursionsVariant √
√
√
√
√
√ √
Serial
LinearComplexity
√ √
√
√
The information included in a data stream has at least four features: statistical; syntactic or grammatical (the arrangement of symbols to form a message and the structural relationships between these symbols); semantic (to do with the range of possible meanings of symbols, dependent on the context in which they appear, i.e., their meaning); and pragmatic (the context wherein the symbols are used; different contexts can result in different meanings for the same symbols).
There are two kinds of redundancy contained in a data stream: statistical redundancy and non-statistical redundancy. Non-statistical redundancy includes redundancy derived from syntax, semantics and pragmatics. Order-1 statistics-based compressors remove only the statistical redundancy; higher-order statistics-based and dictionary-based compression algorithms exploit both the statistical and the non-statistical redundancy. For example, the strong dependency between adjacent symbols of normal text is usually expressed as a Markov model, with the probability of the occurrence of a particular symbol being expressed as a function of the preceding n symbols.
Huffman coding encodes each symbol using a variable-length code table, where the table has been derived in a particular way based on the estimated probability of occurrence of each possible value of the symbol. Huffman coding only reduces coding redundancy; it has nothing to do with information redundancy, only with the representation of information, i.e., the coding itself. Although the binary digits of Huffman coding's output are nearly evenly distributed, it still maintains the statistical characteristics of the original data at the symbol level; that is, the symbol probability distribution over the variable-length coded symbols of the compressed data and the fixed-length coded symbols of the original data is identical.
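The point that Huffman coding preserves symbol-level statistics can be illustrated with a toy coder (our own sketch, not the code adapted from [6]): because the code is prefix-free, a greedy decode recovers the original text and hence its exact symbol distribution.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a prefix-free Huffman code table from symbol frequencies."""
    heap = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    codes = {s: "" for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1:
            codes[s] = "0" + codes[s]  # prepend branch bits bottom-up
        for s in syms2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (f1 + f2, tie, syms1 + syms2))
        tie += 1
    return codes

def huffman_decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # prefix-free: first match is a symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)

text = "abracadabra" * 100
codes = huffman_codes(Counter(text))
bits = "".join(codes[c] for c in text)
# Decoding recovers the text, hence the exact symbol distribution,
# even though the bit stream itself is shorter and nearly balanced.
assert huffman_decode(bits, codes) == text
```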
Arithmetic coding is a form of variable-length entropy
encoding that converts a string into another representa-
tion that represents frequently used characters using few-
er bits and infrequently used characters using more bits,
with the goal of using fewer bits in total. As opposed to
other entropy encoding techniques that separate the input
message into its component symbols and replace each
symbol with a code word, arithmetic coding encodes the
entire message into a single number, a fraction n where
(0.0 ≤ n < 1.0). Arithmetic coders produce near-optimal
output for a given set of symbols and probabilities. Com-
pression algorithms that use arithmetic coding (such as
PPM, BWT) start by determining a model of the data –
basically a prediction of what patterns will be found in
the symbols of the message. The model is a prediction
algorithm which maintains a statistical model of the data
stream. Huffman or arithmetic coding is merely a coding scheme; its compression ratio depends on the modeling approach used. The more accurate this prediction is, the closer to optimal the output will be. It can be seen from our experimental results that, for Huffman coding and arithmetic coding, there is no correlation between the compression ratio and the randomness of the output.
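The "entire message into a single number" idea can be illustrated with a toy interval coder using exact rational arithmetic (our sketch; a production arithmetic coder works incrementally with finite precision).

```python
from fractions import Fraction

def _cumulative(probs):
    """Map each symbol to its cumulative sub-interval of [0, 1)."""
    cum, c = {}, Fraction(0)
    for s, p in probs.items():
        cum[s] = (c, c + p)
        c += p
    return cum

def arithmetic_encode(msg, probs):
    """Narrow [low, high) once per symbol; any number in the final
    interval identifies the message (given its length)."""
    low, high = Fraction(0), Fraction(1)
    cum = _cumulative(probs)
    for s in msg:
        span = high - low
        lo, hi = cum[s]
        low, high = low + span * lo, low + span * hi
    return low, high

def arithmetic_decode(x, n, probs):
    """Recover n symbols by locating x in successive sub-intervals."""
    cum = _cumulative(probs)
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n):
        span = high - low
        for s, (lo, hi) in cum.items():
            if low + span * lo <= x < low + span * hi:
                out.append(s)
                low, high = low + span * lo, low + span * hi
                break
    return "".join(out)

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "r": Fraction(1, 4)}
lo, hi = arithmetic_encode("abra", probs)
assert arithmetic_decode(lo, 4, probs) == "abra"
```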
In LZSS, the encoded file consists of a sequence of items, each of which is either a single character (a literal) or a pointer of the form (index, length); the probability distribution of the index values is nearly uniform. LZSS undermines the statistical characteristics contained in the original data stream. However, LZSS produces new statistical characteristics in the compressed data: the literal values have an uneven probability distribution that differs from that of the original data, and so do the length values.
LZW builds a string translation table from the text being compressed. The string translation table maps fixed-length codes to strings, and LZW replaces strings of characters with single codes; under LZW, the compressor never outputs single characters, only phrases. LZW alters the statistical characteristics of the original data stream. However, through the characteristics of the original data, which include data locality and semantic or syntactic attributes, the compressed data acquires new statistical characteristics and a KCC attribute stemming from the original data stream.
Altogether, it can be concluded from the experimental results that the output from all the lossless compression algorithms/compressors has poor randomness, and that increasing the compression ratio increases the randomness of the compressed data.
3.2 Tests with the Diehard Test-Suite
The DIEHARD suite has 18 tests, and each test produces one or more P-values; the number of P-values differs from test to test. The tests and their numbers of P-values are listed in Table 5.

TABLE 5. THE DIEHARD TESTS AND THEIR NUMBERS OF P-VALUES

Reference Number | Test Name                                | Symbol     | Number of P-values
1                | Birthday Spacings Test                   | BDAY       | 10
2                | Overlapping 5-Permutation Test           | OPERM5     | 2
3                | Binary Rank Test for 31x31 Matrices      | RANK31x31  | 1
4                | Binary Rank Test for 32x32 Matrices      | RANK32x32  | 1
5                | Binary Rank Test for 6x8 Matrices        | RANK6x8    | 26
6                | Bitstream Test                           | BITSTREAM  | 20
7                | Overlapping-Pairs-Sparse-Occupancy       | OPSO       | 23
8                | Overlapping-Quadruples-Sparse-Occupancy  | OQSO       | 28
9                | DNA Test                                 | DNA        | 31
10               | Count-The-1's Test on a Stream of Bytes  | C1STREAM   | 2
11               | Count-The-1's Test for Specific Bytes    | C1BYTE     | 25
12               | Parking Lot Test                         | PARKLOT    | 11
13               | Minimum Distance Test                    | MINDIST    | 1
14               | 3D-Spheres Test                          | 3D         | 21
15               | Squeeze Test                             | SQEEZE     | 1
16               | Overlapping Sums Test                    | OSUM       | 11
17               | Runs Test                                | RUNS       | 4
18               | Craps Test                               | CRAPS      | 2
The Diehard test suite must be run on a file of at least 80 million bits, so we split our test files into pieces of 11,468,800 bytes. The number-of-pieces column in Table 2 shows the split results. One run of DIEHARD produces 220 P-values, so the total number of P-values is 56 × 220 = 12,320, because we ran the tests 56 times (the 8 test files were split into 56 pieces).
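The counts in Table 2 follow directly from the file sizes; as a quick check (our own arithmetic, with one NIST sequence assumed to be 2^20 bits):

```python
PIECE_BYTES = 11_468_800   # 91,750,400 bits >= 80 million bits (Diehard)
SEQ_BYTES = 2 ** 20 // 8   # one NIST sequence = 2^20 bits = 131,072 bytes

sizes = {
    "ari": 125646724, "huff": 126397455, "lzssj": 92964564,
    "lzwj": 125083724, "zip": 67976783, "rar": 58233804,
    "pmv": 51596913, "winrk": 41190604,
}

# Whole pieces/sequences per file reproduce the last two columns of Table 2.
pieces = {name: size // PIECE_BYTES for name, size in sizes.items()}
sequences = {name: size // SEQ_BYTES for name, size in sizes.items()}

print(sum(pieces.values()))  # 56 pieces -> 56 * 220 = 12320 P-values
```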
Although the Diehard test suite is one of the most comprehensive publicly available sets of randomness tests, passing the Diehard tests is unfortunately not well defined, since Dr. Marsaglia does not provide concrete criteria. Intel [10] assumed that a test is considered failed if it produces a P-value less than or equal to 0.0001 or greater than or equal to 0.9999; i.e., only P-values between 0.0001 and 0.9999 are accepted.
This method was used for our testing. The Diehard test
results are summarized in Table 6. If multiple P-values
are in those results, the worst case value is presented.
TABLE 6: DIEHARD TEST RESULT SUMMARY (WORST-CASE P-VALUES)

Test Name  | huff     | ari      | lzwj     | lzssj    | zip      | rar      | pmv      | winrk
BDAY       | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.006078 | 0.002076 | 1.000000
OPERM5     | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.016173 | 0.986673 | 0.002675
RANK31x31  | 1.000000 | 1.000000 | 1.000000 | 0.896222 | 1.000000 | 0.780829 | 0.859391 | 0.991590
RANK32x32  | 1.000000 | 1.000000 | 1.000000 | 0.940968 | 1.000000 | 0.829265 | 0.697502 | 0.965003
RANK6x8    | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999608 | 0.998732 | 1.000000
BITSTREAM  | 1.00000  | 1.00000  | 1.00000  | 1.00000  | 1.00000  | 1.00000  | 0.00971  | 1.000000
OPSO       | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 0.9972   | 0.9988
OQSO       | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 0.0045   | 0.9988
DNA        | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 1.0000   | 0.0009   | 0.9949
C1STREAM   | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999301 | 1.000000
C1BYTE     | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.001038 | 1.000000
PARKLOT    | 1.000000 | 0.000505 | 1.000000 | 1.000000 | 0.985802 | 0.994722 | 0.009936 | 0.007758
MINDIST    | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.924820 | 1.000000
3D         | 1.000000 | 0.00131  | 1.000000 | 1.000000 | 0.00000  | 0.00089  | 0.00809  | 1.000000
SQEEZE     | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.959771 | 0.142881 | 0.999814
OSUM       | 0.001560 | 0.012183 | 0.001560 | 0.991762 | 0.001516 | 0.994038 | 0.002129 | 0.976676
RUNS       | 1.000000 | 0.001407 | 1.000000 | 1.000000 | 0.975837 | 0.977231 | 0.007877 | 0.025155
CRAPS      | 1.000000 | 0.998402 | 1.000000 | 1.000000 | 0.999989 | 0.996809 | 0.919484 | 0.976432
We can see from table 6 above that the test files huff, lzwj, lzssj, ari and zip pass only 1 to 3 of the 18 Diehard tests, i.e., the output of Huffman coding (WinZip uses a combination of LZ77 and Huffman coding), arithmetic coding, LZSS and LZW is non-random. This is consistent with the NIST tests. The test files rar, winrk and pmv pass most of the Diehard tests. In order to further investigate their randomness, we must consider the distribution of the P-values.
As far as the P-value is concerned, the central limit
theorem does not apply, and large samples do not con-
verge in probability. For a good random number genera-
tor, P-values from tests will be uniformly distributed. The
distribution of the P-values from the Diehard suite of tests
is shown below.
TABLE 7. DISTRIBUTION OF P-VALUES FROM THE DIEHARD TESTS (OBSERVED PERCENT)

P-value range | rar   | pmv   | winrk | "Expected" Percent
0.0 – 0.1     | 23.09 | 11.02 | 8.03  | 10
0.1 – 0.2     | 5.55  | 8.52  | 6.52  | 10
0.2 – 0.3     | 5.27  | 7.27  | 5.76  | 10
0.3 – 0.4     | 4.45  | 9.09  | 6.97  | 10
0.4 – 0.5     | 5.27  | 8.98  | 8.48  | 10
0.5 – 0.6     | 5.45  | 11.14 | 8.03  | 10
0.6 – 0.7     | 5.55  | 11.59 | 10.76 | 10
0.7 – 0.8     | 7.00  | 9.32  | 9.24  | 10
0.8 – 0.9     | 10.18 | 11.36 | 11.36 | 10
0.9 – 1.0     | 28.18 | 11.7  | 24.85 | 10
Uniformity may also be determined via a χ² test. Table 8 shows the evaluation of the uniformity of the P-values for the 3 test files; the statistic is given by equation (2).
TABLE 8: CHI-SQUARE TEST RESULTS OF P-VALUES (Acceptance Region: χ² ≤ 33.72)

Test File | χ²     | Result
rar       | 711.45 | Fail
pmv       | 19.00  | Success
winrk     | 180.09 | Fail
Each Diehard test produces one or more P-values. A P-value can be considered good, bad, or suspect. To investigate the randomness of different test files, some kind of overall quality metric is needed to convert the sets of P-values produced by the test batteries into a single measure, allowing relative comparisons. Meysenburg et al. [11, 12, 13] proposed a scheme which assigns a score to a P-value as follows: if p ≥ 0.998 or p ≤ 0.002, it is classified as bad; if 0.95 ≤ p < 0.998 or 0.002 < p < 0.05, it is classified as suspect; all other P-values are classified as good. The score of a set of P-values is then

score = 4 × (number of bad P-values) + 1 × (number of suspect P-values)

with good P-values scoring 0 (this is consistent with the Max Score column of Table 9 being 4 times each test's number of P-values).
For each test, the scores of its P-values were summed, and the total for each test file is the sum of all the test scores for that file. Using this scheme, high scores indicate poor randomness and low scores indicate good randomness. The results for each test are given in table 9. If a test file was split into multiple pieces, the worst piece is presented.
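The scoring scheme can be sketched as follows (our own sketch); the weights (4 for a bad P-value, 1 for a suspect one, 0 for a good one) are inferred from the Max Score column of Table 9, which equals 4 times each test's number of P-values.

```python
def score_p_value(p: float) -> int:
    """Meysenburg et al.'s classification: 4 = bad, 1 = suspect, 0 = good."""
    if p >= 0.998 or p <= 0.002:
        return 4
    if 0.95 <= p < 0.998 or 0.002 < p < 0.05:
        return 1
    return 0

def battery_score(p_values) -> int:
    """Total score of one test; high scores indicate poor randomness."""
    return sum(score_p_value(p) for p in p_values)

# Ten all-bad P-values reach the BDAY maximum score of 40 in Table 9.
print(battery_score([1.0] * 10))  # 40
```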
TABLE 9. DIEHARD TEST RESULTS

Test Name  | Max Score | huff | ari | lzwj | lzssj | zip | rar | pmv | winrk
BDAY       | 40        | 40   | 40  | 40   | 40    | 20  | 0   | 0   | 40
OPERM5     | 8         | 8    | 5   | 8    | 8     | 5   | 0   | 2   | 0
RANK31x31  | 4         | 4    | 0   | 4    | 0     | 4   | 0   | 0   | 1
RANK32x32  | 4         | 4    | 0   | 4    | 0     | 4   | 0   | 0   | 1
RANK6x8    | 104       | 104  | 3   | 104  | 104   | 104 | 4   | 7   | 68
BITSTREAM  | 80        | 80   | 80  | 80   | 80    | 80  | 53  | 0   | 12
OPSO       | 92        | 92   | 92  | 92   | 92    | 92  | 76  | 1   | 6
OQSO       | 112       | 112  | 112 | 112  | 112   | 112 | 34  | 3   | 6
DNA        | 124       | 124  | 124 | 124  | 124   | 124 | 7   | 5   | 1
C1STREAM   | 8         | 8    | 8   | 8    | 8     | 8   | 8   | 0   | 6
C1BYTE     | 100       | 100  | 100 | 100  | 100   | 100 | 32  | 7   | 100
PARKLOT    | 44        | 38   | 7   | 44   | 44    | 2   | 0   | 1   | 2
MINDIST    | 4         | 4    | 4   | 4    | 4     | 4   | 1   | 0   | 4
3D         | 84        | 25   | 1   | 37   | 15    | 16  | 3   | 4   | 39
SQEEZE     | 4         | 4    | 0   | 4    | 4     | 4   | 0   | 0   | 2
OSUM       | 44        | 3    | 0   | 18   | 1     | 3   | 3   | 0   | 1
RUNS       | 16        | 0    | 1   | 11   | 16    | 0   | 0   | 1   | 1
CRAPS      | 8         | 8    | 0   | 8    | 5     | 3   | 0   | 0   | 0
Total      | 880       | 758  | 577 | 802  | 757   | 685 | 221 | 31  | 290
It is evident that only the output from the compressor PPMVC (the file pmv) appears to be random across all 18 Diehard tests. Unfortunately, passing Diehard does not make an output perfect. What has been demonstrated is an intrinsic nature of data compression:
- Increasing the compression ratio increases the randomness of compressed data.
- Arithmetic coding provides better randomness than the other basic compression algorithms (PPMVC uses arithmetic coding to code symbols).
4 CONCLUSION
The data compressed using 8 different compression algorithms/compressors were tested by means of two popular randomness test suites: Special Publication 800-22, issued by the National Institute of Standards and Technology (NIST), and the DIEHARD tests provided by Dr. Marsaglia. It is impossible to comment on the randomness of a bit sequence from a single bit sample, so we applied the two test suites many times to a data set composed of 5 compression corpora.
The NIST test suite collectively spans many well-known properties that any good cryptographic algorithm should satisfy. None of the tested files passes the NIST test suite, and for the Diehard test suite only the pmv file passed. The main conclusion from this investigation is that lossless compressed data deviates obviously from randomness, i.e., the output of the lossless compression algorithms/compressors has poor randomness. A secondary conclusion is that, for the same compression algorithm, there is a positive correlation between compression ratio and randomness: increasing the compression ratio increases the randomness of the compressed data.
REFERENCES
[1] Tim Bell, The Canterbury Corpus, http://corpus.canterbury.ac.nz/index.html, accessed 3 Sept. 2009.
[2] Ross Arnold and Tim Bell, "A corpus for the evaluation of lossless compression algorithms," Proc. IEEE Data Compression Conference (DCC), pp. 201–210, 1997.
[3] Maximum Compression's English text compression test, http://www.maximumcompression.com, accessed 3 Sept. 2009.
[4] The Hutter Prize, http://prize.hutter1.net/index.htm.
[5] Chang Wei-Ling, Yun Xiao-Chun, Fang Bin-Xing, and Wang Shu-Peng, "HitIct: A Chinese corpus for the evaluation of lossless compression algorithms," Journal on Communication, vol. 30, no. 3, pp. 42–47, March 2009.
[6] M. Nelson and J. Gailly, The Data Compression Book, 2nd ed., M&T Books, New York, NY, 1995.
[7] Homepage of Przemysław Skibiński, http://www.ii.uni.wroc.pl/~inikep/.
[8] A. Rukhin et al., "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications," NIST (revised August 2008), http://csrc.nist.gov/rng/.
[9] George Marsaglia, DIEHARD Statistical Tests, http://www.stat.fsu.edu/pub/diehard/.
[10] Intel Platform Security Division, "The Intel Random Number Generator," Intel technical brief, 1999. Retrieved October 6, 2009 from http://citeseer.ist.psu.edu/435839.html.
[11] M. M. Meysenburg and J. A. Foster, "Random generator quality and GP performance," Proc. International Conference on Genetic and Evolutionary Computation (GECCO), pp. 1121–1126, 1999.
[12] Peter Martin, "An Analysis of Random Number Generators for a Hardware Implementation of Genetic Programming Using FPGAs and Handel-C," GECCO 2002, pp. 837–844.
[13] David B. Thomas and Wayne Luk, "High Quality Uniform Random Number Generation for Massively Parallel Simulations in FPGAs," Proc. ReConFig, 2005.
[14] Juan Soto, "Statistical testing of random number generators," Proc. 22nd National Information Systems Security Conference, 1999.
Weiling Chang was born in Shanxi province, China. He graduated
with a BA in Economics from the University of International Business
and Economics (UIBE) in 1993 and earned his master's degree in
computer science from China Agricultural University (CAU) in 2006.
He is currently in the PhD program in Computer Science at Harbin
Institute of Technology (HIT), Harbin, China. His major research
interests include data compression, computer networks and
information security.
Binxing Fang, born in 1960, is a professor and supervisor of Ph.D.
candidates, an academician of the Chinese Academy of Engineering
and the president of Beijing University of Posts and Telecommunications.
His research interests include computer networks and information
security.
Xiaochun Yun, born in 1971, is a professor and Ph.D. supervisor at
the Institute of Computing Technology, Chinese Academy of Sciences.
His research interests include computer networks and information
security.
Shupeng Wang, born in 1980, holds a Ph.D. His research interests
include computer networks and information security.
Xiangzhan Yu, Ph.D., is an associate professor at Harbin Institute
of Technology. His research fields include computer networks and
information security, and large-scale distributed storage technology.