kalepy.kde module

kalepy’s top-level KDE class which provides all direct KDE functionality.

Contents:

KDE : class for interfacing with KDEs and derived functionality.

class kalepy.kde.KDE(dataset, bandwidth=None, weights=None, kernel=None, extrema=None, points=None, reflect=None, covariance=None, neff=None, diagonal=False, helper=True, bw_rescale=None, **kwargs)

Bases: object

Core class and primary API for using kalepy, by constructing a KDE based on given data.

The KDE class acts as an API to the underlying kernel structures and methods. From the passed data, a ‘bandwidth’ is calculated and/or set (optionally controlled through the bandwidth argument). A kernel is then constructed (optionally controlled through the kernel argument), which performs the calculations of the kernel density estimation.
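To make the bandwidth-plus-kernel description concrete, here is a minimal one-dimensional sketch using only the standard library. It is not kalepy's implementation: the Gaussian kernel, Scott's rule for the bandwidth, and the function names `scott_bandwidth` and `gaussian_kde_pdf` are all assumptions made for illustration.

```python
import math
import statistics


def scott_bandwidth(data):
    """Scott's rule for 1D data: sigma * n**(-1/5) (an assumed choice for this sketch)."""
    return statistics.stdev(data) * len(data) ** (-1.0 / 5.0)


def gaussian_kde_pdf(x, data, bandwidth):
    """Average of Gaussian kernels of width `bandwidth` centered on each data point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)


data = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]
bw = scott_bandwidth(data)

# Integrate the estimate on a wide grid; a valid PDF should integrate to ~1.
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
total = sum(gaussian_kde_pdf(x, data, bw) for x in grid) * step
print("bandwidth = {:.3f}, integral = {:.4f}".format(bw, total))
```

The bandwidth sets how far each data point's contribution is smeared out: a larger value gives a smoother but more biased estimate.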

Notes

Reflection

Reflective boundary conditions can be used to better reconstruct a PDF that is known to have finite support (i.e. boundaries outside of which the PDF should be zero).

The pdf and resample methods accept the keyword-argument (kwarg) reflect to specify that a reflecting boundary should be used.

reflect : (D,) array_like, None
Locations at which reflecting boundary conditions should be imposed. For each dimension D, either a pair of boundary locations (lower, upper) or None must be specified. None can also be given in place of either location in a pair, to specify no boundary on that side.

If a pair of boundaries is given, then the first value corresponds to the lower boundary, and the second value to the upper boundary, in that dimension. If there should only be a single lower or upper boundary, then None should be passed as the other boundary value.

For example, reflect=[None, [-1.0, 1.0], [0.0, None]], specifies that the 0th dimension has no boundaries, the 1st dimension has boundaries at both -1.0 and 1.0, and the 2nd dimension has a lower boundary at 0.0, and no upper boundary.
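One common way to implement a reflecting boundary is to add mirror images of the data across each boundary and zero the estimate outside the support. The sketch below shows that idea in one dimension with the standard library. Whether kalepy uses exactly this construction is an assumption, and `kernel_sum` and `reflected_pdf` are hypothetical names.

```python
import math


def kernel_sum(x, data, bw):
    """Unreflected 1D Gaussian KDE evaluated at x."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bw) ** 2) for d in data)


def reflected_pdf(x, data, bw, lower=None, upper=None):
    """KDE with mirrored copies of the data added across each given boundary.

    With both boundaries set, a single mirror per side is only approximate.
    """
    if (lower is not None and x < lower) or (upper is not None and x > upper):
        return 0.0
    pdf = kernel_sum(x, data, bw)
    if lower is not None:
        pdf += kernel_sum(x, [2.0 * lower - d for d in data], bw)
    if upper is not None:
        pdf += kernel_sum(x, [2.0 * upper - d for d in data], bw)
    return pdf


data = [0.05, 0.1, 0.3, 0.7, 1.2, 2.0]   # values known to be non-negative
bw = 0.3
step = 0.001
mids = [(i + 0.5) * step for i in range(10000)]   # midpoints covering [0, 10]
mass_plain = sum(kernel_sum(x, data, bw) for x in mids) * step
mass_reflected = sum(reflected_pdf(x, data, bw, lower=0.0) for x in mids) * step
print("mass inside support: plain={:.3f}, reflected={:.3f}".format(
    mass_plain, mass_reflected))
```

Without reflection, part of the probability mass leaks below the boundary at 0; with reflection, that mass is folded back into the allowed region.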

Projection / Marginalization

The PDF can be calculated for only particular parameters/dimensions. The pdf method accepts the keyword-argument (kwarg) params to specify particular parameters over which to calculate the PDF (i.e. the other parameters are projected over).

params : int, array_like of int, None (default)
Only calculate the PDF for certain parameters (dimensions).

If None, then calculate the PDF along all dimensions. If params is specified, then the target evaluation points pnts must contain only the corresponding dimensions.

For example, if the dataset has shape (4, 100), but pdf is called with params=(1, 2), then the pnts array should have shape (2, M), where the two provided dimensions correspond to the 1st and 2nd variables of the dataset.
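The shape rule can be illustrated with a small standard-library sketch. It assumes a Gaussian product kernel with a diagonal bandwidth, for which integrating over the dropped dimensions leaves a KDE over the kept dimensions only. The function `marginal_pdf` and its arguments are hypothetical and do not reflect kalepy's internals.

```python
import math


def marginal_pdf(pnts, dataset, bandwidths, params):
    """Evaluate a diagonal-bandwidth Gaussian KDE over only the dimensions in `params`.

    dataset: D lists of N values; pnts: len(params) lists of M values.
    """
    if len(pnts) != len(params):
        raise ValueError("pnts must have one row per entry in params")
    ndata = len(dataset[0])
    vals = []
    for m in range(len(pnts[0])):
        total = 0.0
        for n in range(ndata):
            term = 1.0
            for row, dim in enumerate(params):
                bw = bandwidths[dim]
                zz = (pnts[row][m] - dataset[dim][n]) / bw
                term *= math.exp(-0.5 * zz * zz) / (bw * math.sqrt(2.0 * math.pi))
            total += term
        vals.append(total / ndata)
    return vals


dataset = [
    [0.1, 0.4, 0.2, 0.9, 0.5],       # variable 0
    [1.0, 1.2, 0.8, 1.1, 0.9],       # variable 1
    [10.0, 12.0, 11.0, 9.0, 10.5],   # variable 2
    [5.0, 5.5, 4.5, 5.2, 4.8],       # variable 3
]
bandwidths = [0.2, 0.1, 1.0, 0.3]
pnts = [[0.9, 1.0, 1.1], [9.5, 10.0, 11.0]]   # shape (2, M=3): variables 1 and 2
vals = marginal_pdf(pnts, dataset, bandwidths, params=(1, 2))
print(len(vals))
```

Passing points with a row for every dataset dimension while also giving params=(1, 2) raises an error in this sketch, mirroring the requirement stated above.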

TO-DO: add notes on keep parameter

Dynamic Range

When the elements of the covariance matrix between data variables differ by many orders of magnitude, the KDE values (especially marginalized values) can become spurious. One solution is to use a diagonal covariance matrix by initializing the KDE instance with diagonal=True. An alternative is to transform the input data in such a way that each variable’s dynamic range becomes similar (e.g. taking the log of the values). A warning is given if the covariance matrix has a very large dynamic range, but no error is raised.
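The effect of a log transform on the covariance matrix can be checked directly. The sketch below uses only the standard library. The helper names and the "orders of magnitude" measure are illustrative assumptions, not the criterion kalepy uses for its warning.

```python
import math
import random


def covariance_matrix(dataset):
    """Sample covariance of a dataset given as D lists of N values."""
    ndata = len(dataset[0])
    means = [sum(row) / ndata for row in dataset]
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(row_a, row_b)) / (ndata - 1)
            for row_b, mb in zip(dataset, means)
        ]
        for row_a, ma in zip(dataset, means)
    ]


def dynamic_range_dex(cov):
    """Orders of magnitude spanned by the nonzero |elements| of the matrix."""
    mags = [abs(v) for row in cov for v in row if v != 0.0]
    return math.log10(max(mags) / min(mags))


rng = random.Random(1234)
xx = [rng.gauss(0.0, 1.0) for _ in range(500)]
log_yy = [6.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xx]
yy = [10.0 ** v for v in log_yy]   # spans several decades

raw = dynamic_range_dex(covariance_matrix([xx, yy]))
logged = dynamic_range_dex(covariance_matrix([xx, log_yy]))
print("dynamic range: raw={:.1f} dex, after log10={:.1f} dex".format(raw, logged))
```

Working with the logarithm of the second variable brings all covariance elements to a similar scale, which is the transformation the paragraph above suggests.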

Examples

Construct semi-random data:

>>> import numpy as np
>>> np.random.seed(1234)
>>> data = np.random.normal(0.0, 1.0, 1000)

Construct KDE instance using this data, and the default bandwidth and kernels.

>>> import kalepy as kale
>>> kde = kale.KDE(data)

Compare original PDF and the data to the reconstructed PDF from the KDE:

>>> xx = np.linspace(-3, 3, 400)
>>> pdf_tru = np.exp(-xx*xx/2) / np.sqrt(2*np.pi)
>>> xx, pdf_kde = kde.density(xx, probability=True)

>>> import matplotlib.pyplot as plt
>>> ll = plt.plot(xx, pdf_tru, 'k--', label='Normal PDF')
>>> _, bins, _ = plt.hist(data, bins=14, density=True, color='0.5', rwidth=0.9, alpha=0.5, label='Data')
>>> ll = plt.plot(xx, pdf_kde, 'r-', label='KDE')
>>> ll = plt.legend()

Compare the KDE-reconstructed PDF to the true PDF, and make sure the chi-squared is consistent:

>>> dof = xx.size - 1
>>> x2 = np.sum(np.square(pdf_kde - pdf_tru)/pdf_tru**2)
>>> x2 = x2 / dof
>>> x2 < 0.1
True
>>> print("Chi-Squared: {:.1e}".format(x2))
Chi-Squared: 1.7e-02

Draw new samples from the data and make sure they are consistent with the original data:

>>> import scipy as sp
>>> import scipy.stats  # make sure the stats submodule is loaded
>>> samp = kde.resample()
>>> ll = plt.hist(samp, bins=bins, density=True, color='r', alpha=0.5, rwidth=0.5, label='Samples')
>>> ks, pv = sp.stats.ks_2samp(data, samp)
>>> pv > 0.05
True

Initialize the KDE class with the given dataset and optional specifications.

Parameters:

dataset : array_like (N,) or (D,N,)
Dataset from which to construct the kernel-density-estimate.

bandwidth : str, float, array of float, None [optional]
Specification for the bandwidth, or the method by which the bandwidth should be determined. If a str is given, it must match one of the standard bandwidth determination methods. If a float is given, it is used as the bandwidth in each dimension. If an array of float is given, then each value will be used as the bandwidth for the corresponding data dimension.

weights : array_like (N,), None [optional]
Weights corresponding to each dataset point. Must match the number of points N in the dataset. If None, weights are uniformly set to 1.0 for each value.

kernel : str, Distribution, None [optional]
The distribution function that should be used for the kernel. This can be a str specification that must match one of the existing distribution functions, or a Distribution subclass that overrides the _evaluate method.

neff : int, None [optional]
An effective number of datapoints. This is used in the plugin bandwidth determination methods. If None, neff is calculated from the weights array. If weights are all uniform, then neff equals the number of datapoints N.

diagonal : bool
Whether the bandwidth/covariance matrix should be set as a diagonal matrix (i.e. without covariances between parameters). NOTE: see the ‘Dynamic Range’ notes in the KDE class docstring.
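As an illustration of how an effective number of datapoints can follow from weights, the sketch below uses the Kish effective sample size. This formula reproduces the stated behaviour for uniform weights (neff equals N), but that kalepy uses exactly this definition is an assumption, and `effective_number` is a hypothetical helper.

```python
def effective_number(weights):
    """Kish effective sample size: (sum w)**2 / sum(w**2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


uniform = effective_number([1.0] * 8)             # uniform weights: equals N
skewed = effective_number([5.0, 1.0, 1.0, 1.0])   # one dominant point: below N
print(uniform, skewed)
```

When a few points carry most of the weight, the effective number drops below N, which in turn affects bandwidth rules that depend on the sample size.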

__init__(dataset, bandwidth=None, weights=None, kernel=None, extrema=None, points=None, reflect=None, covariance=None, neff=None, diagonal=False, helper=True, bw_rescale=None, **kwargs)

Initialize the KDE class with the given dataset and optional specifications.

property bandwidth

cdf(pnts, params=None, reflect=None)

Cumulative Distribution Function based on KDE smoothed data.

Parameters:

pnts : ([D,]N,) array_like of scalar
Target evaluation points.

Returns:

cdf : (N,) ndarray of scalar
CDF values at the target points.
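For a one-dimensional Gaussian kernel, the CDF of the smoothed data has a closed form: the mean of the normal CDFs centered on each data point. The standard-library sketch below shows this. The function `kde_cdf` is hypothetical and says nothing about how kalepy evaluates its CDF internally.

```python
import math


def kde_cdf(x, data, bw):
    """CDF of a 1D Gaussian KDE: the mean of each kernel's normal CDF at x."""
    return sum(
        0.5 * (1.0 + math.erf((x - d) / (bw * math.sqrt(2.0)))) for d in data
    ) / len(data)


data = [-1.0, 0.0, 0.5, 2.0]
targets = (-5.0, 0.0, 1.0, 6.0)
cdf_vals = [kde_cdf(x, data, 0.4) for x in targets]
print(["{:.3f}".format(v) for v in cdf_vals])
```

The result rises monotonically from 0 far below the data to 1 far above it, as any CDF must.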

property covariance

property dataset

density(points=None, reflect=None, params=None, grid=False, probability=False)

Evaluate the KDE distribution at the given data-points.

This method acts as an API to the Kernel.pdf method for this instance’s kernel.

Parameters:

points : ([D,]M,) array_like of float, or (D,) set of array_like point specifications

The locations at which the PDF should be evaluated. The number of dimensions D must match that of the dataset that initialized this class’ instance. NOTE: If the params kwarg (see below) is given, then only those dimensions of the target parameters should be specified in points. The meaning of points depends on the value of the grid argument:

grid=True : points must be a set of (D,) array_like objects which each give the evaluation points for the corresponding dimension to produce a grid of values. For example, for a 2D dataset, points=([0.1, 0.2, 0.3], [1, 2]), would produce a grid of points with shape (3, 2): [[0.1, 1], [0.1, 2]], [[0.2, 1], [0.2, 2]], [[0.3, 1], [0.3, 2]], and the returned values would be an array of the same shape (3, 2).

grid=False : points must be an array_like (D,M) describing the position of M sample points in each of D dimensions. For example, for a 3D dataset: points=([0.1, 0.2], [1.0, 2.0], [10, 20]), describes 2 sample points at the 3D locations, (0.1, 1.0, 10) and (0.2, 2.0, 20), and the returned values would be an array of shape (2,).

reflect : (D,) array_like, None

Locations at which reflecting boundary conditions should be imposed. For each dimension D (matching the input data), a pair of boundary locations (lower, upper) must be specified, or None. None can also be given as one of the two locations, to specify no boundary at that location. If the data is one-dimensional (D=1), then reflect may be shaped as (2,). See class docstrings:Reflection for more information.

params : int, array_like of int, None

Only calculate the PDF for certain parameters (dimensions). See class docstrings:Projection for more information.

grid : bool

Evaluate the KDE distribution at a grid of points specified by points. See points argument description above.

probability : bool

Normalize the results to sum to unity.

Returns:

points : array_like of scalar
Locations at which the PDF is evaluated.

vals : array_like of scalar
PDF evaluated at the given points.
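The two interpretations of points can be reproduced with the standard library, using the same example values as in the grid=True and grid=False descriptions. The helper names `expand_grid` and `pair_points` are hypothetical and only illustrate how the coordinates are combined.

```python
import itertools


def expand_grid(points):
    """grid=True style: per-dimension axes -> every combination of coordinates."""
    return list(itertools.product(*points))


def pair_points(points):
    """grid=False style: (D, M) coordinates -> M points of D values each."""
    return list(zip(*points))


grid_pts = expand_grid(([0.1, 0.2, 0.3], [1, 2]))
flat_pts = pair_points(([0.1, 0.2], [1.0, 2.0], [10, 20]))

print(grid_pts)
# [(0.1, 1), (0.1, 2), (0.2, 1), (0.2, 2), (0.3, 1), (0.3, 2)]
print(flat_pts)
# [(0.1, 1.0, 10), (0.2, 2.0, 20)]
```

The grid form yields 3 x 2 = 6 evaluation locations from two short axes, while the paired form yields exactly M = 2 locations.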

property extrema

classmethod from_hist(bins, hist, bandwidth='bin width', *args, **kwargs)

Alternative constructor using a histogram as input instead of individual data points.

Parameters:

bins : ([D,]N,) array_like of scalar
Histogram bins. If using multiple dimensions, N can be different for different dimensions.

hist : (N,[N,…]) array_like of scalar
Histogram from which to construct the KDE. In multiple dimensions, each dimension can have a different N.

bandwidth : str or float
Bandwidth. Defaults to the width of the bin in each dimension. Accepts all arguments passed to bandwidth when constructed using __init__.

*args, **kwargs : tuple, dict
Arguments passed to the __init__ constructor.

Returns:

kde : instance of KDE
Initialized KDE instance.
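One plausible way to build a KDE from a histogram is to treat each bin center as a data point weighted by the bin's count, with a bandwidth equal to the bin width. The standard-library sketch below follows that idea. It assumes `bins` holds N+1 bin edges (the parameter description above does not say whether edges or centers are expected), and the helper names are hypothetical rather than kalepy's own.

```python
import math


def hist_to_weighted_points(bins, hist):
    """Represent each bin by its center, weighted by the bin's count."""
    centers = [0.5 * (lo + hi) for lo, hi in zip(bins[:-1], bins[1:])]
    return centers, list(hist)


def weighted_kde_pdf(x, centers, weights, bw):
    """Weighted 1D Gaussian KDE evaluated at x."""
    norm = 1.0 / (sum(weights) * bw * math.sqrt(2.0 * math.pi))
    return norm * sum(
        w * math.exp(-0.5 * ((x - c) / bw) ** 2) for c, w in zip(centers, weights)
    )


bins = [0.0, 1.0, 2.0, 3.0, 4.0]   # assumed to be edges: 5 edges -> 4 bins
hist = [2, 7, 5, 1]
centers, weights = hist_to_weighted_points(bins, hist)
bin_width = bins[1] - bins[0]      # bandwidth set to the bin width

step = 0.01
grid = [-10.0 + step * i for i in range(2401)]
total = sum(weighted_kde_pdf(x, centers, weights, bin_width) for x in grid) * step
print("integral = {:.4f}".format(total))
```

The resulting estimate is a smooth curve that follows the histogram's shape and still integrates to one.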

property kernel

property ndata

property ndim

property neff

pdf(*args, **kwargs)

property points

property reflect

resample(size=None, keep=None, reflect=None, squeeze=True)

Draw new values from the kernel-density-estimate calculated PDF.

The KDE calculates a PDF from the given dataset. This method draws new, semi-random data points from that PDF.

Parameters:

size : int, None (default)
The number of new data points to draw. If None, then the number of datapoints is used.

keep : int, array_like of int, None (default)
Parameters/dimensions where the original data-values should be drawn from, instead of from the reconstructed PDF. TODO: add more information.

reflect : (D,) array_like, None (default)
Locations at which reflecting boundary conditions should be imposed. For each dimension D, either a pair of boundary locations (lower, upper) or None must be specified. None can also be given in place of either location in a pair, to specify no boundary on that side.

squeeze : bool (default: True)
If the number of dimensions D is one, then return an array of shape (L,) instead of (1, L).

Returns:

samples : ([D,]L) ndarray of float
Newly drawn samples from the PDF, where the number of points L is determined by the size argument. If squeeze is True (default), and the number of dimensions in the original dataset D is one, then the returned array will have shape (L,).
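Drawing from a Gaussian KDE can be sketched as: pick a data point at random, add kernel-shaped noise, and fold any value that lands outside a reflecting boundary back inside. The standard-library example below does this in one dimension. It is an illustration under those assumptions, not kalepy's resampling code; `resample_1d` is a hypothetical name, and the keep argument is not modelled.

```python
import random


def resample_1d(data, bw, size=None, lower=None, upper=None, seed=None):
    """Draw samples from a 1D Gaussian KDE, folding values back across boundaries."""
    rng = random.Random(seed)
    size = len(data) if size is None else size
    samples = []
    for _ in range(size):
        value = rng.choice(data) + rng.gauss(0.0, bw)
        # Mirror the value across whichever boundary it crossed until it is inside.
        while ((lower is not None and value < lower)
               or (upper is not None and value > upper)):
            if lower is not None and value < lower:
                value = 2.0 * lower - value
            if upper is not None and value > upper:
                value = 2.0 * upper - value
        samples.append(value)
    return samples


data = [0.1, 0.2, 0.5, 1.0, 1.8]
default_draw = resample_1d(data, bw=0.3, seed=1)            # size defaults to len(data)
bounded_draw = resample_1d(data, bw=0.3, size=1000, lower=0.0, seed=2)
print(len(default_draw), min(bounded_draw) >= 0.0)
```

Folding samples across the boundary is the sampling counterpart of adding mirrored kernels when evaluating the PDF.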

property weights