Calculate File Entropy

Entropy is a measure of randomness.  The concept originated in the study of thermodynamics, but Claude E. Shannon applied it to digital communications in his 1948 paper, “A Mathematical Theory of Communication.”  Shannon was interested in determining the theoretical maximum amount that a digital file could be compressed.

In simple terms, a file is compressed by replacing patterns of bits with shorter patterns of bits.  Therefore, the more entropy in the data file, the less it can be compressed.  Determining the entropy of a file is also useful to detect if it is likely to be encrypted.
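To make the link between entropy and compression concrete, here is a minimal sketch that compresses two buffers with Python's standard zlib module.  The function name compression_ratio and the two sample buffers are my own illustrative choices, and zlib is just one convenient compressor, so treat the numbers as a rough demonstration rather than a theoretical limit.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller means better compression)."""
    if not data:
        raise ValueError("data must not be empty")
    return len(zlib.compress(data, 9)) / len(data)


# Low entropy: one short pattern repeated many times.
repetitive = b"ABCD" * 25_000

# High entropy: bytes taken from the operating system's random source.
random_like = os.urandom(100_000)

print(f"repetitive data: {compression_ratio(repetitive):.3f}")
print(f"random data:     {compression_ratio(random_like):.3f}")
```

The repetitive buffer shrinks to a tiny fraction of its original size, while the random buffer stays at roughly its original size because there are no patterns for the compressor to replace.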

In the field of cryptology, there are formal proofs built on the following idea: if an adversary can correctly distinguish an encrypted file from a file that is truly random with a probability greater than 50%, that is, better than a coin flip, then he is said to have “the advantage.”  The adversary can then exploit that advantage and possibly break the encryption.  This concept of advantage applies to the mathematical analysis of encryption algorithms.  However, in the real world, files that contain random data rarely have any utility in a file system, so it is highly probable that files with high entropy are actually encrypted or compressed.

A contributor on code.activestate.com wrote a Python program called file_entropy.py that can be run from the shell command line with the following command:
python file_entropy.py [filename]

This is shown below with the output:

This screenshot shows the use of file_entropy.py and typical results

The entropy value is measured in bits per byte, so 8.0 is the maximum; the closer the value is to 8.0, the more random the contents of the file appear.  It is often fun and useful to look at the frequency distribution of the bytes that comprise the file, so I have tweaked the code to create a frequency distribution bar chart using Matplotlib.
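For readers who want to see where the 8.0 ceiling comes from, the value is the Shannon entropy H = -sum(p_i * log2(p_i)), where p_i is the fraction of bytes in the file that have value i.  The sketch below is my own minimal version of that calculation, not the code from file_entropy.py; the function names byte_entropy and file_entropy are illustrative.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a negative zero.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def file_entropy(path: str) -> float:
    """Read a whole file into memory and return its byte entropy."""
    with open(path, "rb") as f:
        return byte_entropy(f.read())


# A single repeated byte carries no information per byte.
print(byte_entropy(b"aaaa"))
# Every one of the 256 byte values appearing equally often gives the maximum.
print(byte_entropy(bytes(range(256))))
```

A file made of one repeated byte scores 0.0, and a file in which all 256 byte values occur equally often scores log2(256) = 8.0, which is why well-encrypted or well-compressed files land just below 8.0.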

Note:  If you do not have Matplotlib and/or Python installed, I highly recommend Python(x,y) to simplify the install and configuration process.  There are awesome tutorials for both on the Internet as well.

I have named the new Python program graph_file_entropy.py, and it is listed below.
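As a rough illustration of the idea, here is a minimal sketch of how the byte frequency chart could be produced.  It is not a copy of graph_file_entropy.py: the function names byte_frequencies and plot_byte_frequencies, the axis labels, and the bar width are assumptions of mine, and it only requires Matplotlib at the moment the chart is drawn.

```python
import sys
from collections import Counter


def byte_frequencies(data: bytes) -> list:
    """Count how often each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def plot_byte_frequencies(frequencies, title):
    """Show a bar chart of byte value (x axis) against occurrences (y axis)."""
    # Imported here so the counting code still works without Matplotlib installed.
    import matplotlib.pyplot as plt

    plt.bar(range(256), frequencies, width=1.0)
    plt.xlim(0, 255)
    plt.xlabel("Byte value")
    plt.ylabel("Occurrences")
    plt.title(title)
    plt.show()


def main(path):
    with open(path, "rb") as f:
        data = f.read()
    plot_byte_frequencies(byte_frequencies(data), path)


# Only runs when invoked as a script with exactly one filename argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

In a chart like this, a plain text file tends to show tall bars clustered around the printable ASCII values, while an encrypted or compressed file tends to look nearly flat across all 256 values.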

 

 

You can run it from the shell command line with the following command:
python graph_file_entropy.py [filename]
I have illustrated this in the screenshots below, along with the results and a bar chart.

This screenshot shows the usage of graph_file_entropy.py and typical results

TestDoc.TXT

In the next posting, we will look at the use of this tool to examine various types of file formats.
