Usage#
This package supplies functions similar to those of numpy:
histogram(), histogram2d() and
histogramdd().
Input data#
The first parameters are the DataArray(s) on which to
compute the histogram. The functions also accept an optional argument giving the
weights to apply when computing the histogram. All arguments must be
broadcastable against each other. For Dask arrays, chunks will be adapted using
xarray.unify_chunks().
If only some of the inputs are Dask arrays, the remaining NumPy arrays will be
converted to Dask arrays using xarray.DataArray.chunk(). This embeds their
data in the task graph, which is generally undesirable. It may be preferable to
manually distribute the data or load it from file using Dask (see Dask best
practices for details).
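To make the broadcasting requirement concrete, here is a minimal sketch using only xarray. The array names and shapes are made up for illustration, and the snippet only shows what "broadcastable against each other" means; it does not claim to reproduce what the package does internally.

```python
import numpy as np
import xarray as xr

# Hypothetical data: a (time, lat) field and per-latitude weights.
rng = np.random.default_rng(0)
temp = xr.DataArray(rng.normal(size=(4, 3)), dims=("time", "lat"), name="temp")
weights = xr.DataArray(
    np.cos(np.deg2rad([0.0, 30.0, 60.0])), dims=("lat",), name="weights"
)

# xr.broadcast expands both arrays to a common set of dimensions.
# Inputs for which this fails cannot be combined in a histogram call.
temp_b, weights_b = xr.broadcast(temp, weights)
print(temp_b.dims, weights_b.dims)
```

Here the weights only depend on latitude, yet after broadcasting each data point has a matching weight.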
Bins / Axes#
The bins can be specified by either:
a Boost Axis object for finer control
an int giving the number of bins. The minimum and maximum values are specified with the range argument or computed from data. The axis will be Regular.
The range argument can supply the minimum and maximum values. Either or both
can be set to None, in which case the missing bound is computed with x.min() or
x.max().
So for instance:
xh.histogram(x, bins=10, range=(0., None))
will result in a regular axis (equal width bins) ranging from 0 to the maximum value in x.
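The way a partially specified range is completed can be sketched with NumPy alone. The helper `resolve_range` below is hypothetical and written only for illustration; it is not part of the package.

```python
import numpy as np

def resolve_range(x, bounds=None):
    """Hypothetical helper: replace missing bounds by the data min/max."""
    low, high = (None, None) if bounds is None else bounds
    if low is None:
        low = float(np.min(x))
    if high is None:
        high = float(np.max(x))
    return low, high

x = np.array([-2.0, 0.5, 3.0, 7.5])

# Lower bound fixed at 0, upper bound taken from the data.
low, high = resolve_range(x, (0.0, None))

# Ten equal-width bins need eleven edges.
edges = np.linspace(low, high, 10 + 1)
print(low, high, edges)
```

Note that with a lower bound of 0 the value -2.0 falls outside the axis.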
Directly passing an array of edges as in numpy.histogram() is not
supported. Instead, use a
Variable axis.
Tip
Using regularly spaced bins (even with a transform applied) is more efficient: it avoids having to use binary search to find in which bin a value falls.
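The difference can be illustrated with plain NumPy. This is only a sketch of the two lookup strategies, not the library's actual implementation: with regular bins the index follows from one arithmetic expression, while arbitrary edges require a search.

```python
import numpy as np

x = np.array([0.3, 2.1, 4.9, 9.99])
low, high, nbins = 0.0, 10.0, 5
edges = np.linspace(low, high, nbins + 1)

# Regular bins: the index is obtained by direct arithmetic.
idx_regular = np.floor((x - low) / (high - low) * nbins).astype(int)

# Arbitrary edges: the index is found by binary search in the edges.
idx_search = np.searchsorted(edges, x, side="right") - 1

print(idx_regular, idx_search)
```

Both approaches give the same indices here; the first one simply skips the search.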
Some examples of axes:
import boost_histogram.axis as bha
# regular width bins
bha.Regular(200, 0., 10.)
# logarithmically spaced bins (without performance loss)
bha.Regular(200, 1e-3, 10., transform=bha.transform.log)
# integer bins
bha.Integer(0, 20)
# boolean
bha.Integer(0, 2, underflow=False, overflow=False)
Over/underflow#
By default, Boost axes are configured to keep count of the data points that
fall outside their range. Pass underflow=False and/or overflow=False
when creating an axis to disable this.
However, by default, the values of the flow bins are not kept in the output array.
To keep the flow bins, pass flow=True to the histogram functions. The
coordinate values for the underflow and overflow bins will be set to
Output#
All three functions return a single xarray.DataArray. Its name is
<variable names separated by underscores>_histogram (so for instance
x_y_histogram). The bin edges are contained in coordinates named
<variable>_bins. The right edge of the last bin is stored in a coordinate
attribute when applicable.
The nomenclature is the same as xhistogram to ensure easy transition between the two packages. It also enables the use of an accessor for extra features.
The dtype of the output DataArray is int if the storage is one of
Int64 or AtomicInt64, and float otherwise.
🚧
It could be possible to enforce a given dtype. TODO…
Examples#
Simple histogram:
import xarray_histogram as xh
import boost_histogram.axis as bha
hist = xh.histogram(temp, bins=bha.Regular(100, -5., 40.))
hist.plot.line()
Multi-dimensional histogram, here in 2D for instance:
hist = xh.histogram2d(
    temp, chlorophyll,
    bins=[
        bha.Regular(100, -5., 40.),
        bha.Regular(100, 1e-3, 20, transform=bha.transform.log)
    ]
)
hist.plot.pcolormesh()
So far we have computed histograms over the whole flattened arrays, but a histogram can also be computed along only some dimensions. For instance, reducing over lat and lon gives the time evolution of a histogram:
hist = xh.histogram(
    temp,
    bins=bha.Regular(100, 0., 10.),
    dims=['lat', 'lon']
)
hist.plot.line(x="temp_bins")