Histograms with a lot of dimensions in Python

I’m doing a simulation of a stochastic many-body system, and currently I need to obtain a multidimensional probability distribution from generated data. For this purpose, I was trying to use np.histogramdd as in:

    bins = np.linspace(start=-x_max, stop=x_max, num=n_bins)
    hists = np.histogramdd(Data, bins=[bins] * dimensions, density=True)

However, this code produces a MemoryError (or raises an exception about an array being too large) already for n_bins = 20, dimensions = 5 and np.shape(Data) = (1000, 5), which is far below my target values. The number of buckets grows exponentially with the number of dimensions, so it is easy to see why such problems arise. So, the question is: how can a high-dimensional histogram be generated, stored and worked with in Python? Are there existing frameworks for this? Is it better to switch to something else?
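For perspective, a quick back-of-the-envelope sketch (assuming the dense float64 array that np.histogramdd allocates; the exact figures are an illustration, not from the original post) shows why dimensions = 7 already fails:

```python
import numpy as np

n_bins = 20

# A dense float64 histogram needs n_bins**dimensions buckets, 8 bytes each.
for dimensions in (5, 7, 10):
    n_buckets = n_bins ** dimensions
    gigabytes = n_buckets * np.dtype(np.float64).itemsize / 1e9
    print(f"{dimensions} dims: {n_buckets:.2e} buckets, {gigabytes:.4g} GB")
```

At 5 dimensions this is a harmless ~26 MB, but at 7 dimensions it is already ~10 GB, and at 10 dimensions it is far beyond any machine's RAM.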

Edit: MCVE and the error messages.

    import numpy as np

    x_max = 10
    n_bins = 20
    # dimensions is varied below: 7, 10, 15
    Data = np.random.uniform(-x_max, x_max, size=(1000, dimensions))
    bins = np.linspace(start=-x_max, stop=x_max, num=n_bins)
    hists = np.histogramdd(Data, bins=[bins] * dimensions, density=True)

With dimensions = 7, I get:

    lib\site-packages\numpy\lib\histograms.py in histogramdd(sample, bins, range, normed, weights, density)
       1066     # Compute the number of repetitions in xy and assign it to the
       1067     # flattened histmat.
    -> 1068     hist = np.bincount(xy, weights, minlength=nbin.prod())

    MemoryError:

dimensions = 15:

       1062     # Compute the sample indices in the flattened histogram matrix.
       1063     # This raises an error if the array is too large.
    -> 1064     xy = np.ravel_multi_index(Ncount, nbin)
       1065
       1066     # Compute the number of repetitions in xy and assign it to the

    ValueError: invalid dims: array size defined by dims is larger than the maximum possible size.

dimensions = 10:

       1066     # Compute the number of repetitions in xy and assign it to the
       1067     # flattened histmat.
    -> 1068     hist = np.bincount(xy, weights, minlength=nbin.prod())
       1069
       1070     # Shape into a proper matrix

    ValueError: 'minlength' must not be negative
1 Answer

If the histogram has a fixed bin width on each axis, you can do your own book-keeping and use a low-memory datatype for the counts (for example 1 byte per bin). In the following example, the bins are the same for each axis, but you can adapt it for bin ranges that differ along the axes, as long as the bin edges are equidistant.

This code will not do range checking; you need to ensure that the histogram bins are wide enough to fit the data or you’ll get an error.

    import numpy as np

    x_max = 10
    n_dim = 7
    n_data = 100000
    data = np.random.uniform(-x_max, x_max - 0.01, size=(n_data, n_dim))

    # Assume bins are the same for all dimensions. Bin edges at x0 + i*xstep.
    n_bins = 5
    x0 = -x_max
    xstep = 2 * x_max / n_bins

    # High-dimensional histogram with a low-memory count datatype.
    hist = np.zeros((n_bins,) * n_dim, dtype=np.int8)

    # Build the histogram indices corresponding to the data samples.
    ii = ((data - x0) * (1 / xstep)).astype(np.int16)  # shape (n_data, n_dim)

    # Increment the histogram bins. np.add.at correctly handles
    # bins that occur multiple times in the input.
    np.add.at(hist, tuple(ii.T), 1)
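If you do want range checking, one hedged way to add it (a sketch, not part of the answer above) is to mask out samples with any out-of-range coordinate before computing the indices; the names x0, xstep and ii mirror the snippet above, with smaller toy sizes:

```python
import numpy as np

x_max = 10
n_dim = 3
n_bins = 5
x0 = -x_max
xstep = 2 * x_max / n_bins

# Deliberately generate some samples outside the histogram range.
rng = np.random.default_rng(0)
data = rng.uniform(-1.5 * x_max, 1.5 * x_max, size=(1000, n_dim))

# Keep only samples where every coordinate lies inside [x0, x0 + n_bins*xstep).
in_range = np.all((data >= x0) & (data < x0 + n_bins * xstep), axis=1)
ii = ((data[in_range] - x0) / xstep).astype(np.int16)

hist = np.zeros((n_bins,) * n_dim, dtype=np.int32)
np.add.at(hist, tuple(ii.T), 1)
```

Alternatively, np.clip the indices into [0, n_bins - 1] if you prefer to pile outliers into the edge bins rather than drop them.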

But with n_dim=8 or 9 you’ll run out of memory on most systems anyway.

The question is what you’re going to do with a histogram that has 10**10 bins; do you have 10**11 or more samples?

It’s more practical to keep the ii array and generate lower-dimensional histograms when you need them. For example, if you want to reduce the 7D histogram to a 4D histogram on axes 0, 1, 5, 6:

    hist_4d = np.zeros((n_bins,) * 4, dtype=np.int16)
    np.add.at(hist_4d, tuple(ii[:, [0, 1, 5, 6]].T), 1)

Note: I recommend using signed integers for the bin counts. Integer overflow is silent, but negative numbers in the bins will at least indicate that an overflow occurred.
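As a rough illustration of that overflow behaviour (a toy sketch with made-up sizes): an int8 counter wraps past 127, so 200 hits in one bin end up negative, which a simple scan can flag. Note this heuristic only catches wrapped counts that happen to land in the negative range.

```python
import numpy as np

hist = np.zeros((4, 4), dtype=np.int8)

# 200 samples all falling into bin (0, 0): exceeds the int8 maximum of 127.
ii = np.zeros((200, 2), dtype=np.int16)
np.add.at(hist, tuple(ii.T), 1)  # wraps silently: 200 -> 200 - 256 = -56

if (hist < 0).any():
    print("overflow detected: some bin counts wrapped negative")
```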

