I had this curious thought the other day: what’s the byte value distribution in
binary files, such as an executable? Take for example /bin/echo
on OS X
10.11.1.
What we want to find out, more precisely, is: how likely is it that a certain byte value appears in a file? And what does it look like when we visualize that distribution?
This is going to be a field day thanks to a few handy classes and methods in the Python 3 standard library. Let’s get started then.
Reading the file
When reading binary files, we need to make sure that we open the file in binary mode. You can achieve this by passing the "rb" mode string to the open function. This mode tells Python and the operating system to open the file read-only and to interpret the contents as raw bytes. The r stands for read and the b stands for binary.
The b flag keeps the Python runtime from trying to decode the contents of the file and instead hands them to us as raw bytes. If we don’t do this, Python throws a UnicodeDecodeError: the binary file /bin/echo contains bytes that Python can’t decode as valid UTF-8.
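We can see the same failure without touching the filesystem. This is a minimal sketch using a made-up byte sequence that isn’t valid UTF-8:

```python
raw = b"\xfe\xff\x00"  # 0xfe can never start a valid UTF-8 sequence

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print("decoding failed:", exc.reason)
```

This is the same exception you get when opening a binary file in text mode and reading from it.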
This is one of the biggest changes in Python 3, compared to Python 2. The way
the two handle byte strings and unicode can be a challenge when porting code
from Python 2 to 3. Yet, it makes things a lot easier once you’re done porting.
Byte strings, which are either of type bytes or bytearray in Python 3, serve a different purpose than Unicode strings, which are of type str. Byte strings let you process binary data, such as executables, serialized files and more. Unicode strings are for handling natural language. This could be an HTML template using German with lots of umlauts, or a regular expression looking through texts written in Japanese.
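The distinction is easy to see in a quick sketch: decode turns bytes into str and encode goes the other way.

```python
# The UTF-8 encoding of "café": the last two bytes form one character.
data = bytes([0x63, 0x61, 0x66, 0xc3, 0xa9])
text = data.decode("utf-8")

print(type(data).__name__)  # bytes
print(type(text).__name__)  # str
print(text)                 # café
print(text.encode("utf-8") == data)  # True
```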
We also use open() as a context manager. Python closes the file descriptor immediately after leaving the context manager. This makes exception handling easier as well: the context manager closes the file descriptor even when the inner code raises an exception. When that happens, it first frees all resources before re-raising the exception. This means you can have your cake and eat it too: Python cleanly handles resource access, and you get clean exception handling on top.
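A quick way to convince yourself of this behavior, using a throwaway temp file instead of /bin/echo so the sketch is self-contained:

```python
import os
import tempfile

# Create a throwaway file so the example doesn't depend on /bin/echo.
handle, path = tempfile.mkstemp()
os.close(handle)

try:
    with open(path, "rb") as fd:
        raise RuntimeError("boom")  # simulate a failure while reading
except RuntimeError:
    pass

print(fd.closed)  # True: the file was closed before the exception propagated
os.unlink(path)
```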
With that in mind, reading the whole file as bytes looks like this:
with open("/bin/echo", "rb") as fd:
    contents = fd.read()
A lot of thought has gone into just two lines of Python code. That’s what I like about the language. It’s well designed for these purposes.
Counting bytes
Here, we use the Counter class from the collections module. The Counter class counts the occurrences of unique elements in an iterable, such as the byte contents of a file.
Let’s get going.
from collections import Counter
c = Counter(contents)
This creates a Counter instance that stores the absolute frequency of each byte value over the whole contents byte string. The Counter class supports basic operations like retrieving the most common elements or subtracting one Counter instance from another.
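Subtraction is handy for comparing files; a small sketch on made-up inputs:

```python
from collections import Counter

# Iterating over a bytes object yields ints, so the keys are byte values.
a = Counter(b"abracadabra")
b = Counter(b"banana")

# Subtracting keeps only positive counts: how much more often each
# byte value appears in the first input than in the second.
diff = a - b
print(diff[ord("a")])    # 2 (5 occurrences minus 3)
print(ord("n") in diff)  # False: "n" only appears in the second input
```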
You can retrieve the most common byte values by calling the most_common method, as in the following snippet:
for value, frequency in c.most_common(10):
    print("0x{:02x}: {}".format(value, frequency))
0x00: 12309
0x01: 215
0x30: 149
0x65: 134
0x74: 130
0x20: 127
0x69: 108
0x03: 100
0x72: 97
0x06: 94
It’s interesting to see that 0x00 is the most frequent byte value. I’m not knowledgeable about executable file formats, but I suspect this is due to the padding of the file’s segments.
Getting to the probability distribution
We not only want to calculate the absolute distribution, that is, how many times a certain byte value occurs. We’re also interested in the probability distribution over all occurring bytes. With the probability distribution we know the likelihood that the next byte we read from /bin/echo has a certain value.
The function probability_distribution that I define below calculates this probability distribution for us. Inside it, a helper generator loops over a Counter instance and divides every byte value’s count by the total number of bytes in the file.
We need to wrap the generator in a dict so that the outer Counter doesn’t count the frequency of the (value, probability) tuples the generator yields. Those tuples are all unique, so each would get a count of one. Instead, the Counter initializes itself from the dict’s keys and values, and we keep all the extra features that the Counter class offers.
def probability_distribution(content):
    def _helper():
        absolute_distribution = Counter(content)
        length = len(content)
        for value, frequency in absolute_distribution.items():
            yield int(value), float(frequency) / length
    return Counter(dict(_helper()))
c = probability_distribution(contents)
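As a quick sanity check, running the same function on a tiny made-up input gives probabilities we can verify by hand (the function is repeated here so the sketch runs standalone):

```python
from collections import Counter

def probability_distribution(content):
    def _helper():
        absolute_distribution = Counter(content)
        length = len(content)
        for value, frequency in absolute_distribution.items():
            yield int(value), float(frequency) / length
    return Counter(dict(_helper()))

# Three zero bytes and one 0x01 byte: probabilities 0.75 and 0.25.
dist = probability_distribution(b"\x00\x00\x00\x01")
print(dist[0x00])  # 0.75
print(dist[0x01])  # 0.25
```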
Now, if we want to see the most common elements, we can do the same as before:
print("Probability Distribution")
for value, frequency in c.most_common(10):
    print("0x{:02x}: {:.04f}".format(value, frequency))
Probability Distribution
0x00: 0.6826
0x01: 0.0119
0x30: 0.0083
0x65: 0.0074
0x74: 0.0072
0x20: 0.0070
0x69: 0.0060
0x03: 0.0055
0x72: 0.0054
0x06: 0.0052
The advantage with this way of displaying a value distribution is that we can compare it to other files more directly. The absolute distribution can change depending on the file length, but the relative distribution of certain values might not. This applies to padding bytes as well.
For example, if we run our script on /bin/pwd
instead, we see the following
probability distribution:
Probability Distribution
0x00: 0.6911
0x01: 0.0120
0x30: 0.0081
0x74: 0.0078
0x65: 0.0074
0x20: 0.0072
0x5f: 0.0060
0x69: 0.0058
0x70: 0.0054
0x03: 0.0053
We can see that the first six byte values from /bin/pwd appear in /bin/echo as well, in a slightly different ranking.
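Counter subtraction gives us an even more direct comparison. Here is a small sketch using the rounded probabilities from the two listings above; subtracting one probability Counter from another keeps only the positive differences:

```python
from collections import Counter

# Rounded probabilities copied from the two listings above.
echo_dist = Counter({0x00: 0.6826, 0x01: 0.0119, 0x30: 0.0083})
pwd_dist = Counter({0x00: 0.6911, 0x01: 0.0120, 0x30: 0.0081})

# Byte values that are relatively more frequent in /bin/pwd than in /bin/echo.
diff = pwd_dist - echo_dist
for value, delta in diff.most_common():
    print("0x{:02x}: {:+.04f}".format(value, delta))
```

Note that 0x30 drops out entirely: it is slightly less frequent in /bin/pwd, and subtraction discards non-positive results.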
For the grand finale, we visualize the distribution in the terminal.
max_prob = c.most_common(1)[0][1]
for value, frequency in c.most_common():
    print("0x{:02x}: {}".format(value, "█" * int(frequency * 80 / max_prob)))
0x00: ████████████████████████████████████████████████████████████████████████████████
0x01: █
0x30:
0x74:
0x65:
0x20:
0x5f:
...
That’s not too exciting: the 0x00 byte dwarfs everything else. If we ignore it, things look a lot more interesting.
c = probability_distribution(list(filter(bool, contents)))
max_prob = c.most_common(1)[0][1]
for value, frequency in c.most_common():
    print("0x{:02x}: {}".format(value, "█" * int(frequency * 80 / max_prob)))
0x01: ████████████████████████████████████████████████████████████████████████████████
0x30: █████████████████████████████████████████████████████
0x74: ███████████████████████████████████████████████████
0x65: █████████████████████████████████████████████████
0x20: ███████████████████████████████████████████████
0x5f: ███████████████████████████████████████
0x69: ██████████████████████████████████████
0x70: ███████████████████████████████████
0x03: ███████████████████████████████████
0x06: ███████████████████████████████████
0x72: ████████████████████████████████
0xff: █████████████████████████████
0x04: ████████████████████████████
0x63: ████████████████████████████
0x02: ██████████████████████████
0x61: ██████████████████████████
0x6e: █████████████████████████
0x6f: ████████████████████████
0x31: ██████████████████████
...
That looks much better. Note that in both cases we’ve scaled the values so that the byte with the highest probability is drawn with 80 bar characters.