
Calculating Frequencies with Numpy

NumPy provides a great many utility functions for working with arrays. One of them, bincount, is the star of today's story.

Data Task Description

Given a block of text, one may be interested in finding the character that appears most frequently in each column. For simplicity, let's assume every line has the same length, as in this block:

shellt
compou
whentp
cutest
betwee
comman
attemp
fortex
sagest

From here on we'll assume the data is saved in a file named input.txt and is well formed.

The character that appears most often in the leftmost column is 'c'. In the second column it's 'o', and so on.

Finding The Character That Appears Most Frequently

To find the character that appears the most we'll use numpy's bincount. First read the data into a numpy array:

import numpy as np

with open('input.txt', 'rb') as f:
    # In Python 3, iterating over bytes yields integer byte values
    a = np.array([list(s.strip()) for s in f if s.strip()])

Then calculate the histogram for each column:

bc = [np.bincount(a[:,i]) for i in range(np.shape(a)[1])]

Each histogram is an array of counts: the entry at index k tells us how many times the integer value k appears in that column. For the leftmost column, bc[0] looks like this:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 1])
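To make the indexing concrete, here is a minimal sketch on a tiny made-up array (the values are illustrative and not taken from input.txt):

```python
import numpy as np

# Illustrative input: small non-negative integers
values = np.array([1, 1, 3, 5, 5, 5])

# counts[k] is the number of times k occurs in values
counts = np.bincount(values)
print(counts)  # [0 2 0 1 0 3]
```

The result has max(values) + 1 entries, which is why a histogram of lowercase ASCII letters starts with a long run of zeros.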

Note that even though our data looked like characters, reading the file in binary mode gave us each character's byte value, which for this data is its ASCII code. This code wouldn't work on arbitrary Unicode data, where a single character can span several bytes.
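A short sketch of both points, using in-memory strings rather than a file (the sample strings are assumptions chosen for illustration):

```python
# Iterating over a bytes object in Python 3 yields integers, not characters
ascii_codes = list(b"abc")
print(ascii_codes)  # [97, 98, 99]

# A non-ASCII character occupies more than one byte in UTF-8,
# so a per-byte histogram would no longer count whole characters
utf8_codes = list("\u00e9".encode("utf-8"))
print(utf8_codes)  # [195, 169]
```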

To find the value that appears the most we just need to use np.argmax and convert the index back to the character it represents. This line prints the characters most frequently seen in each column:

print(''.join([chr(np.argmax(x)) for x in bc]))

Finding The Character That Appears Least Frequently

A slightly harder challenge is to find the character that appears least frequently in each column. Trying to pull a similar trick with np.argmin won't work, because 0 is the minimum value in the histogram, and it represents characters that are not in the data at all.

The function np.where is very handy in such situations. I'll use it to replace all zeros with np.nan, and then use np.nanargmin to find the index of the smallest value that is not NaN (which means it wasn't a zero before).

The line is:

print(''.join([chr(np.nanargmin(np.where(x != 0, x, np.nan))) for x in bc]))
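To see why the masking step matters, here is a minimal sketch on a single histogram, built from the leftmost column of the sample block (the byte string is typed in by hand for illustration):

```python
import numpy as np

# Byte values of the leftmost column of the sample block: s c w c b c a f s
column = np.array(list(b"scwcbcafs"))
counts = np.bincount(column)

# Plain argmin lands on index 0, a byte value that never occurs in the data
print(np.argmin(counts))  # 0

# Replacing zero counts with NaN lets nanargmin skip the absent values
masked = np.where(counts != 0, counts, np.nan)
least_common = chr(np.nanargmin(masked))
print(least_common)  # a
```

Several characters in this column appear exactly once. As with np.argmax, np.nanargmin reports the first qualifying index, so the lowest byte value, 'a', is the one returned.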

When I originally looked at this solution I wasn't sure how much better it was than a pure-Python approach. However, the more I learn about NumPy, the more readable I find this kind of code, and it has a beauty of its own.

Numpy commands used in this snippet: numpy.bincount, numpy.nanargmin, numpy.nan, numpy.shape, numpy.where, numpy.argmax.