Tuesday, August 9, 2011

Entropy

ent is a very cool tool to quickly determine the entropy of a given file or stream. 

Here is a quick demonstration:

The entropy of 'AAAAA':

The entropy of '28183' (5 byte pseudorandom number generated by the rand program):


Now, let's consider these two outputs (much of the content below is from the ent man page):
    Entropy
    Entropy (according to ent) is the  information  density  of  the  contents  of  the  file expressed as a number of bits  per  character.  In the first example we see that 'AAAAA' is reported as having 0 entropy and that '28183' has 1.92.  The takeaway here is that although '28183' is more dense than 'AAAAA', it is much less dense than one might expect.
      Chi-Square Test
      The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive  to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly  random  sequence  would  exceed  the  value  calculated.  We interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 99% or less than 1%,  the sequence  is  almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect.
      Note: This test clearly shows that ent does not consider '28183' to be not very random at all.
        Arithmetic Mean
        This  is  simply the result of summing the all the bytes (bits if the -b option is specified) in the file and dividing by the file length. If the data are close to random, this should be about 127.5 (0.5 for  -b  option output). If the mean departs from this value, the values are consistently high or low.
        Note: If a file is printable characters only (or some other subset of possible byte values), but still pseudorandom inside that set, then the value that represents the pseudorandom mean will be different.  Additionally, it is unclear what general impact subsets of data have on the results reported by ent.
          Monte Carlo Value for PI
          Each  successive sequence of six bytes is used as 24 bit X and Y coordinates within a square. If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, the six-byte sequence  is  considered  a  "hit". The percentage of hits can be used to calculate the value of Pi. For very large streams (this approximation converges very slowly), the value will approach the correct value of Pi  if the sequence is close to random. A 32768 byte file created by radioactive decay yielded: Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
            Serial Correlation Coefficient
            This  quantity  measures the extent to which each byte in the file depends upon the previous byte. For random sequences, this value (which can be positive or negative) will, of course, be close  to  zero.  A  non-random byte  stream such as a C program will yield a serial correlation coefficient on the order of 0.5. Wildly predictable data such as uncompressed bitmaps will exhibit serial correlation coefficients  approaching  1.
            Note: Interesting to see that '28183' ends up with a serial correlation somewhere between that of a C program and wildly predictable data.