irodata: Entropy

ent is a very cool tool to quickly determine the entropy of a given file or stream.

Here is a quick demonstration:

The entropy of 'AAAAA':

The entropy of '28183' (5 byte pseudorandom number generated by the rand program):

Now, let's consider these two outputs (much of the content below is from the ent man page):

Entropy
Entropy (according to ent) is the information density of the contents of the file expressed as a number of bits per character. In the first example we see that 'AAAAA' is reported as having 0 entropy and that '28183' has 1.92. The takeaway here is that although '28183' is more dense than 'AAAAA', it is much less dense than one might expect.

Chi-Square Test
The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated. We interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect.
Note: This test clearly shows that ent does not consider '28183' to be not very random at all.

Arithmetic Mean
This is simply the result of summing the all the bytes (bits if the -b option is specified) in the file and dividing by the file length. If the data are close to random, this should be about 127.5 (0.5 for -b option output). If the mean departs from this value, the values are consistently high or low.
Note: If a file is printable characters only (or some other subset of possible byte values), but still pseudorandom inside that set, then the value that represents the pseudorandom mean will be different. Additionally, it is unclear what general impact subsets of data have on the results reported by ent.

Monte Carlo Value for PI
Each successive sequence of six bytes is used as 24 bit X and Y coordinates within a square. If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, the six-byte sequence is considered a "hit". The percentage of hits can be used to calculate the value of Pi. For very large streams (this approximation converges very slowly), the value will approach the correct value of Pi if the sequence is close to random. A 32768 byte file created by radioactive decay yielded: Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).

Serial Correlation Coefficient
This quantity measures the extent to which each byte in the file depends upon the previous byte. For random sequences, this value (which can be positive or negative) will, of course, be close to zero. A non-random byte stream such as a C program will yield a serial correlation coefficient on the order of 0.5. Wildly predictable data such as uncompressed bitmaps will exhibit serial correlation coefficients approaching 1.
Note: Interesting to see that '28183' ends up with a serial correlation somewhere between that of a C program and wildly predictable data.

Tuesday, August 9, 2011

Entropy