Accessor#
An accessor is provided to
ease manipulation and analysis of the histogram outputs. Simply import
xarray_histogram.accessor to register it. It will then be available under
the hist attribute for all DataArrays that meet some conditions (see below).
It gives access to a number of methods.
import xarray_histogram as xh
import xarray_histogram.accessor
h = xh.histogram(data)
h.hist.median()
Operations are vectorized, so you can apply them to entire arrays of histograms. For instance, for data defined along time, latitude and longitude, we can compute one histogram per time step:
>>> h = xh.histogram(data, dims=["lon", "lat"])
>>> h.hist.median()
The result of h.hist.median() will have dimensions ("time",).
Note
Computations are automatically vectorized in Python with
xarray.apply_ufunc(), which is not efficient for a large number of
histograms.
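To illustrate the kind of vectorization the note refers to, here is a minimal sketch that loops a 1D statistic over the "time" dimension with xarray.apply_ufunc() and vectorize=True. It is not the accessor's actual implementation: the helper median_1d, the array contents, and the choice to store left edges in the x_bins coordinate are assumptions made for this example.

```python
import numpy as np
import xarray as xr
from scipy import stats

# Hypothetical data: three histograms (one per time step) over the same 4 bins.
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
counts = xr.DataArray(
    np.array([[1.0, 3.0, 4.0, 2.0], [4.0, 3.0, 2.0, 1.0], [2.0, 2.0, 2.0, 2.0]]),
    dims=("time", "x_bins"),
    coords={"x_bins": edges[:-1]},  # assumption: the coordinate stores left edges
    name="x_histogram",
)


def median_1d(counts_1d, edges):
    """Median of a single 1D histogram (hypothetical helper)."""
    return stats.rv_histogram((counts_1d, edges)).median()


# vectorize=True wraps the function with numpy.vectorize, i.e. a Python-level
# loop over every histogram, which is why it scales poorly to many histograms.
medians = xr.apply_ufunc(
    median_1d,
    counts,
    input_core_dims=[["x_bins"]],
    kwargs={"edges": edges},
    vectorize=True,
)
```

The bins dimension is consumed as the core dimension, so medians only keeps the "time" dimension.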
Conditions of accessibility#
Once registered, an accessor is a cached property that can be accessed on any DataArray. There are some conditions for the hist accessor to be created successfully:
- The coordinates of the bins must be named <variable>_bins.
- The array must be named <variable(s)_name>_<histogram or pdf>: histogram if it is not normalized, and pdf if it is normalized as a probability density function. If the histogram is multi-dimensional, the variable names must be separated by underscores, for instance Temp_Sal_histogram.
Each bins coordinate may contain attributes:
- bin_type: the class name of the Boost axis type that was used. If not present, the accessor will assume the bins are regularly spaced and will try to infer the rightmost edge.
- right_edge: the rightmost edge position, only necessary for Regular and Variable bins.
- underflow and overflow: integers that indicate if the corresponding flow bins are present (0: not present, 1: present). If not specified, the accessor will assume no flow bins.
Note
We use integers instead of booleans to conform with NetCDF attributes.
These conventions are consistent with the output of
xarray_histogram.histogram*, so if you use this package's functions you
should not have to worry. The names of the array and coordinates are also
consistent with those of xhistogram
(although coordinate attributes will be missing).
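As an illustration of these conventions, the sketch below builds a DataArray by hand from a numpy.histogram result. It assumes that the bins coordinate stores the N left edges (which would explain why right_edge is needed); the variable name Temp and the bin values are made up for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical temperature samples and regularly spaced bin edges.
rng = np.random.default_rng(0)
temp = rng.normal(loc=15.0, scale=2.0, size=1000)
edges = np.linspace(5.0, 25.0, 21)
counts, _ = np.histogram(temp, bins=edges)

bins_attrs = {
    "bin_type": "Regular",            # class name of the Boost axis type
    "right_edge": float(edges[-1]),   # rightmost edge position
    "underflow": 0,                   # integers rather than booleans
    "overflow": 0,
}

hist = xr.DataArray(
    counts,
    dims=("Temp_bins",),
    # Assumption: the coordinate holds the left edge of each bin.
    coords={"Temp_bins": ("Temp_bins", edges[:-1], bins_attrs)},
    name="Temp_histogram",  # not normalized, hence the "_histogram" suffix
)
```

A normalized version of the same array would be named Temp_pdf instead.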
Computations#
Bins#
The accessor provides a number of methods that return bins-related values for a
given variable. If the histogram is uni-dimensional (i.e. for a single variable),
the variable name can be omitted. By default, flow bins are kept, but they can be
excluded by passing flow=False.
- bins() returns the corresponding coordinate; this is essentially h.hist.coords["var_bins"].
- edges() returns the N+1 edges (including the rightmost edge). Edges are not available for the discrete bins “IntCategory” and “StrCategory”.
- widths() returns the widths of the bins. The widths of flow bins and StrCategory bins are always 1.
- centers() returns the center positions of the bins. The centers of overflow bins are the same as their position (np.inf for instance).
- areas() returns the areas corresponding to each histogram point. This is the outer product of the widths of all bins. The area of any point that corresponds to a flow bin in at least one dimension is equal to one. For instance, for a 2D histogram with underflow and overflow bins, all the borders of the 2D array of areas will be equal to 1.
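The relationship between edges, widths, centers and areas described above can be sketched with plain NumPy. This is only an illustration of the stated rules, not the accessor's code; the helpers widths_with_flow and flow_mask and the edge values are hypothetical.

```python
import numpy as np


def widths_with_flow(edges, underflow=False, overflow=False):
    """Widths from N+1 edges, with a width of 1 for each flow bin."""
    widths = np.diff(np.asarray(edges, dtype=float))
    if underflow:
        widths = np.concatenate(([1.0], widths))
    if overflow:
        widths = np.concatenate((widths, [1.0]))
    return widths


def flow_mask(n_regular, underflow=False, overflow=False):
    """Boolean mask marking which positions along an axis are flow bins."""
    mask = np.zeros(n_regular, dtype=bool)
    if underflow:
        mask = np.concatenate(([True], mask))
    if overflow:
        mask = np.concatenate((mask, [True]))
    return mask


edges_x = np.array([0.0, 1.0, 2.0, 4.0])  # 3 regular bins
edges_y = np.array([0.0, 0.5, 1.0])       # 2 regular bins

# Centers of the regular bins only (flow bins are left out here).
centers_x = 0.5 * (edges_x[:-1] + edges_x[1:])

# Both axes have underflow and overflow bins in this example.
wx = widths_with_flow(edges_x, underflow=True, overflow=True)
wy = widths_with_flow(edges_y, underflow=True, overflow=True)

# Areas are the outer product of the widths...
areas = np.outer(wx, wy)
# ...except that any point touching a flow bin gets an area of 1.
is_flow = flow_mask(3, True, True)[:, None] | flow_mask(2, True, True)[None, :]
areas[is_flow] = 1.0
```

With flow bins on both axes, the first and last rows and columns of areas are all equal to 1, as described for the 2D case.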
To remove flow bins, remove_flow() returns a
new histogram DataArray without the flow bins of the given variables (by default
all of them). This simply does a .isel operation based on the underflow
and overflow attributes of the specified coordinates. It also sets those
attributes to False in the output.
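A rough equivalent of this operation for a single coordinate could look like the sketch below. The function remove_flow_sketch is hypothetical and is not the package's implementation; the flow bin positions (-inf and +inf) are made up, and the attributes are reset to 0 here, following the integer convention for attributes, which is an assumption of this example.

```python
import numpy as np
import xarray as xr


def remove_flow_sketch(hist, coord):
    """Drop the flow bins of one bins coordinate (hypothetical helper)."""
    attrs = dict(hist[coord].attrs)  # copy, so the input is left untouched
    start = 1 if attrs.get("underflow", 0) else 0
    stop = hist.sizes[coord] - (1 if attrs.get("overflow", 0) else 0)
    out = hist.isel({coord: slice(start, stop)})
    # Mark the flow bins as absent in the output (0 is an assumption here).
    attrs.update(underflow=0, overflow=0)
    return out.assign_coords({coord: (coord, out[coord].values, attrs)})


# Hypothetical 1D histogram with one underflow and one overflow bin.
hist = xr.DataArray(
    np.array([2, 5, 7, 4, 1]),
    dims=("x_bins",),
    coords={
        "x_bins": (
            "x_bins",
            [-np.inf, 0.0, 1.0, 2.0, np.inf],
            {"bin_type": "Regular", "right_edge": 3.0, "underflow": 1, "overflow": 1},
        )
    },
    name="x_histogram",
)

trimmed = remove_flow_sketch(hist, "x_bins")
```

Only the three regular bins remain in trimmed, and the input histogram keeps its original attributes.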
Bins transform#
Arbitrary functions can be applied to bins with
apply_func(). The result is equivalent to
computing the histogram of func(data["variable"]). The function must
transform the N+1 edges given as a DataArray. There is no need to account for
the right_edge attribute or flow bins.
For instance, scale() scales bins by a given
factor. It essentially does hist.apply_func(lambda edges: edges * factor).
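The equivalence between transforming the bins and recomputing the histogram on transformed data can be checked with NumPy alone. This sketch uses plain arrays rather than the DataArray of edges that apply_func() passes to the function; the data, the edges and the helper scale_edges are made up, and the check relies on the function being monotonically increasing.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
edges = np.linspace(-4.0, 4.0, 17)  # N+1 edges for N=16 bins
factor = 2.0


def scale_edges(edges):
    """Transform the N+1 edges, here with a simple scaling."""
    return edges * factor


# Histogram of the original data on the original edges.
counts, _ = np.histogram(data, bins=edges)

# Histogram of the transformed data on the transformed edges.
counts_direct, _ = np.histogram(data * factor, bins=scale_edges(edges))

# The counts are unchanged: only the bin positions differ.
same = np.array_equal(counts, counts_direct)
```

In other words, the counts stay where they are and only the edges move, which is why the function only needs to act on the edges.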
Normalization#
The histogram can be normalized to a probability density function, if it is not
already, using normalize(). Note that for an
N-dimensional histogram, this function can normalize along only some of the variables.
The accessor determines whether the histogram is normalized from the name of its
DataArray: normalized if named <variables>_pdf and non-normalized
if named <variables>_histogram. This is consistent with the output of
histogram().
Important
This is important when computing statistics (see below) where the accessor must know if the histogram is normalized or not.
Normalizing when flow bins are present in the output is allowed. The values in flow bins are not changed and not counted in the normalization.
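A minimal NumPy sketch of this behavior for a 1D histogram is given below. It is an illustration of the rule stated above rather than the code behind normalize(), and the way the flow bins are counted here is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)
edges = np.linspace(-2.0, 2.0, 9)

# Counts in the regular bins, plus hand-made underflow and overflow bins.
inner, _ = np.histogram(data, bins=edges)
under = np.count_nonzero(data < edges[0])
over = np.count_nonzero(data > edges[-1])
counts = np.concatenate(([under], inner, [over])).astype(float)

# Normalize the regular bins only: divide by (total inner count * bin width).
widths = np.diff(edges)
pdf = counts.copy()
pdf[1:-1] = inner / (inner.sum() * widths)
# pdf[0] and pdf[-1] (the flow bins) keep their raw counts.
```

The regular part of pdf integrates to one over the bin widths, and it matches what numpy.histogram returns with density=True on the same edges.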
Statistics#
A number of statistics can be extracted from the histogram. The following
functions are wrappers around methods of scipy.stats.rv_histogram.
These functions work only on 1D histograms, so for ND-histograms a variable
must be specified. Flow bins are not supported: they are removed along the
core dimension (the specified variable).
Note
The histogram cannot be chunked in the core dimension.
- Return the cumulative distribution function at x.
- Return the confidence interval with equal areas around the median.
- Return the median value of the distribution.
- Return the mean value of the distribution.
- Return the nth moment of the distribution.
- Return the percent point function at q.
- Return the standard deviation of the distribution.
- Return the variance of the distribution.
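For reference, the underlying scipy.stats.rv_histogram object can be used directly on a single 1D histogram, as in the sketch below. The counts and edges are made up, and this calls SciPy itself rather than the accessor.

```python
import numpy as np
from scipy import stats

# Hypothetical 1D histogram without flow bins: 4 bins of width 1.
counts = np.array([1.0, 3.0, 4.0, 2.0])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# rv_histogram treats the histogram as a piecewise-constant density.
dist = stats.rv_histogram((counts, edges))

mean = dist.mean()
median = dist.median()
std = dist.std()
cdf_at_2 = dist.cdf(2.0)        # cumulative distribution function at x=2
q90 = dist.ppf(0.9)             # percent point function at q=0.9
low, high = dist.interval(0.5)  # equal-area interval around the median
```

Because the density is piecewise constant, quantiles such as the median are obtained by linear interpolation within the bin where the cumulative distribution crosses the requested level.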