Bounded Distributions
Real-life data is often bounded by a given domain. For example, attributes such as age, weight, or duration are always non-negative values. In such scenarios, a standard smooth KDE may fail to accurately capture the true shape of the distribution, especially if there’s a density discontinuity at the boundary.
In 1D, with the exception of some exotic cases, bounded distributions typically have either a one-sided bounded domain (e.g. positive values) or a two-sided one (e.g. a finite interval).
As illustrated in the graph below, the smooth kernels are bad at estimating the sharp edges of the uniform distribution and leak outside the bounded domain.
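The plots in this section rely on a basic_kde helper. Its implementation isn't reproduced here, so here is a minimal sketch of what it might look like, assuming a Gaussian kernel and Silverman's rule of thumb (with the optional bandwidth parameter mentioned later):

```python
import numpy as np

def basic_kde(samples, x, bandwidth=None):
    """Plain Gaussian KDE evaluated at points x (no boundary handling)."""
    samples = np.asarray(samples)
    x = np.asarray(x)
    n = samples.size
    if bandwidth is None:
        # Silverman's rule of thumb
        bandwidth = 1.06 * np.std(samples) * n ** (-1 / 5)
    # One Gaussian kernel per sample, averaged over all samples
    z = (x[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Evaluating on a grid wider than [0, 1] reveals the leakage at the edges
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, 1_000)
x = np.linspace(-0.5, 1.5, 201)
density = basic_kde(samples, x)  # non-zero below 0 and above 1
```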
No Clean Public Solution in Python
Unfortunately, popular public Python libraries like scipy and scikit-learn do not currently address this issue. There are existing GitHub issues and pull requests discussing this topic, but regrettably, they have remained unresolved for quite some time.
In R, kde.boundary provides kernel density estimation for bounded data.
There are various ways to take into account the bounded nature of the distribution. Let’s describe the most popular ones: Reflection, Weighting and Transformation.
Warning:
For the sake of readability, we will focus on the unit bounded domain, i.e. [0, 1]. Please remember to standardize the data and scale the density appropriately in the general case [a, b].
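For instance, here is a hedged sketch of this reduction (bounded_kde_ab and unit_kde are hypothetical names; unit_kde stands for any of the unit-interval estimators described below):

```python
import numpy as np

def bounded_kde_ab(samples, x, a, b, unit_kde):
    """Evaluate a unit-interval KDE on data bounded by [a, b]."""
    # Standardize the data and prediction points to [0, 1]
    samples_std = (np.asarray(samples) - a) / (b - a)
    x_std = (np.asarray(x) - a) / (b - a)
    # Scale the density back so that it integrates to 1 over [a, b]
    return unit_kde(samples_std, x_std) / (b - a)
```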
Solution: Reflection
The trick is to augment the set of samples by reflecting them across the left and right boundaries. This is equivalent to reflecting the tails of the local kernels to keep them in the bounded domain. It works best when the density derivative is zero at the boundary.
The reflection technique also implies processing three times as many sample points.
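Here is a minimal sketch of the reflection estimator, reusing the basic_kde sketch above (reflection_kde is a hypothetical name):

```python
import numpy as np

def reflection_kde(samples, x, low=0.0, high=1.0):
    """Boundary-corrected KDE: reflect the samples across both boundaries."""
    samples = np.asarray(samples)
    # Silverman's bandwidth from the original samples only: the reflected
    # copies triple n and would otherwise shrink the bandwidth.
    bandwidth = 1.06 * np.std(samples) * samples.size ** (-1 / 5)
    # Mirror the samples across the left and right boundaries (3x points)
    augmented = np.concatenate((samples, 2 * low - samples, 2 * high - samples))
    # The augmented KDE averages over 3n points, so multiply by 3 to keep
    # the density integrating to 1 over [low, high]
    return 3 * basic_kde(augmented, x, bandwidth=bandwidth)
```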
The graphs below illustrate the reflection trick for three standard distributions: uniform, right triangle and inverse square root. It does a pretty good job at reducing the bias at the boundaries, even for the singularity of the inverse square root distribution.
N.B. The signature of basic_kde has been slightly updated to let you optionally provide your own bandwidth parameter instead of relying on Silverman's rule of thumb.
Solution: Weighting
The reflection trick presented above takes the leaking tails of the local kernels and adds them back to the bounded domain, so that the information isn't lost. However, we could also compute how much of each local kernel has been lost outside the bounded domain and leverage it to correct the bias.
For a very large number of samples, the KDE converges to the convolution between the kernel and the true density, truncated by the bounded domain: E[f̂(x)] = ∫₀¹ K_h(x − t) f(t) dt.
If x is at a boundary, then only half of the kernel area will actually be used. Intuitively, we'd like to normalize the convolution kernel so that it integrates to 1 over the bounded domain. The integral is close to 1 at the center of the bounded interval and falls off to 0.5 near the borders. This accounts for the lack of neighboring kernels at the boundaries.
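Here is a minimal sketch of this cut-and-normalize correction, again reusing the basic_kde sketch (weighted_kde is a hypothetical name):

```python
import numpy as np
from scipy.stats import norm

def weighted_kde(samples, x, low=0.0, high=1.0):
    """Boundary-corrected KDE: renormalize by the kernel mass in-domain."""
    samples = np.asarray(samples)
    x = np.asarray(x)
    bandwidth = 1.06 * np.std(samples) * samples.size ** (-1 / 5)
    # Fraction of a kernel centered at x that falls inside [low, high]:
    # close to 1 in the middle of the interval, ~0.5 at the boundaries
    mass_inside = norm.cdf((high - x) / bandwidth) - norm.cdf((low - x) / bandwidth)
    return basic_kde(samples, x, bandwidth=bandwidth) / mass_inside
```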
Similarly to the reflection technique, the graphs below illustrate the weighting trick for three standard distributions: uniform, right triangle and inverse square root. It performs very similarly to the reflection method.
From a computational perspective, it doesn't require processing three times as many samples, but it does need to evaluate the normal cumulative distribution function (CDF) at the prediction points.
Solution: Transformation
The transformation trick maps the bounded data to an unbounded space, where the KDE can be safely applied. This results in using a different kernel function for each input sample.
The logit function leverages the logarithm to map the unit interval [0, 1] to the entire real line: logit(u) = log(u / (1 − u)).
When applying a transform f to a random variable X, the resulting density is obtained by dividing by the absolute value of the derivative of f: if Y = f(X), then p_Y(f(x)) = p_X(x) / |f′(x)|.
We can now apply this to the special case of the logit transform to retrieve the density on [0, 1] from the one estimated in logit space: since logit′(x) = 1 / (x(1 − x)), the original density is p(x) = g(logit(x)) / (x(1 − x)), where g is the density estimated in logit space.
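Here is a minimal sketch of the logit-transform estimator, reusing the basic_kde sketch (logit_kde is a hypothetical name; the clipping constant eps avoids infinities at the boundaries):

```python
import numpy as np

def logit_kde(samples, x, eps=1e-6):
    """Boundary-corrected KDE: estimate in logit space, then map back."""
    samples = np.clip(np.asarray(samples), eps, 1 - eps)
    x = np.clip(np.asarray(x), eps, 1 - eps)

    def logit(u):
        return np.log(u / (1 - u))

    # Plain KDE in the unbounded logit space
    density_logit = basic_kde(logit(samples), logit(x))
    # Change of variables: p(x) = g(logit(x)) * |logit'(x)|,
    # with logit'(x) = 1 / (x * (1 - x))
    return density_logit / (x * (1 - x))
```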
Similarly to the reflection and weighting techniques, the graphs below illustrate the transformation trick for three standard distributions: uniform, right triangle and inverse square root. It performs quite poorly, creating large oscillations at the boundaries. However, it handles the singularity of the inverse square root extremely well.