Bounded Distributions
Real-life data is often bounded by a given domain. For example, attributes such as age, weight, or duration are always non-negative values. In such scenarios, a standard smooth KDE may fail to accurately capture the true shape of the distribution, especially if there’s a density discontinuity at the boundary.
In 1D, with the exception of some exotic cases, bounded distributions typically have either a one-sided bounded domain (e.g. positive values) or a two-sided one (e.g. a finite interval).
As illustrated in the graph below, the smooth kernels are bad at estimating the sharp edges of the uniform distribution and leak outside the bounded domain.
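The plots in this section rely on a basic_kde helper. Its implementation isn't reproduced here, so here is a minimal sketch of what it might look like, assuming a Gaussian kernel and Silverman's rule of thumb (with the optional bandwidth parameter mentioned later):

```python
import numpy as np

def basic_kde(samples, x, bandwidth=None):
    """Plain Gaussian KDE evaluated at points x (no boundary handling)."""
    samples = np.asarray(samples)
    x = np.asarray(x)
    n = samples.size
    if bandwidth is None:
        # Silverman's rule of thumb
        bandwidth = 1.06 * np.std(samples) * n ** (-1 / 5)
    # One Gaussian kernel per sample, averaged over all samples
    z = (x[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Evaluating on a grid wider than [0, 1] reveals the leakage at the edges
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, 1_000)
x = np.linspace(-0.5, 1.5, 201)
density = basic_kde(samples, x)  # non-zero below 0 and above 1
```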
No Clean Public Solution in Python
Unfortunately, popular public Python libraries like scipy and scikit-learn do not currently address this issue. There are existing GitHub issues and pull requests discussing this topic, but regrettably, they have remained unresolved for quite some time.
In R, kde.boundary provides kernel density estimation for bounded data.
There are various ways to take into account the bounded nature of the distribution. Let’s describe the most popular ones: Reflection, Weighting and Transformation.
Warning:
For the sake of readability, we will focus on the unit bounded domain, i.e. [0, 1]. Please remember to standardize the data and scale the density appropriately in the general case [a, b].
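For instance, here is a hedged sketch of this reduction (bounded_kde_ab and unit_kde are hypothetical names; unit_kde stands for any of the unit-interval estimators described below):

```python
import numpy as np

def bounded_kde_ab(samples, x, a, b, unit_kde):
    """Evaluate a unit-interval KDE on data bounded by [a, b]."""
    # Standardize the data and prediction points to [0, 1]
    samples_std = (np.asarray(samples) - a) / (b - a)
    x_std = (np.asarray(x) - a) / (b - a)
    # Scale the density back so that it integrates to 1 over [a, b]
    return unit_kde(samples_std, x_std) / (b - a)
```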
Solution: Reflection
The trick is to augment the set of samples by reflecting them across the left and right boundaries. This is equivalent to reflecting the tails of the local kernels to keep them in the bounded domain. It works best when the density derivative is zero at the boundary.
The reflection technique also implies processing three times as many sample points.
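Here is a minimal sketch of the reflection estimator, reusing the basic_kde sketch above (reflection_kde is a hypothetical name):

```python
import numpy as np

def reflection_kde(samples, x, low=0.0, high=1.0):
    """Boundary-corrected KDE: reflect the samples across both boundaries."""
    samples = np.asarray(samples)
    # Silverman's bandwidth from the original samples only: the reflected
    # copies triple n and would otherwise shrink the bandwidth.
    bandwidth = 1.06 * np.std(samples) * samples.size ** (-1 / 5)
    # Mirror the samples across the left and right boundaries (3x points)
    augmented = np.concatenate((samples, 2 * low - samples, 2 * high - samples))
    # The augmented KDE averages over 3n points, so multiply by 3 to keep
    # the density integrating to 1 over [low, high]
    return 3 * basic_kde(augmented, x, bandwidth=bandwidth)
```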
The graphs below illustrate the reflection trick for three standard distributions: uniform, right triangle and inverse square root. It does a pretty good job at reducing the bias at the boundaries, even for the singularity of the inverse square root distribution.
N.B. The signature of basic_kde has been slightly updated to let you optionally provide your own bandwidth parameter instead of relying on Silverman's rule of thumb.
Solution: Weighting
The reflection trick presented above takes the leaking tails of the local kernels and adds them back to the bounded domain, so that the information isn't lost. However, we could also compute how much of each local kernel has been lost outside the bounded domain and leverage it to correct the bias.
For a very large number of samples, the KDE converges to the convolution between the kernel and the true density, truncated by the bounded domain: E[f̂(x)] = ∫₀¹ K_h(x − t) f(t) dt.
If x is at a boundary, then only half of the kernel area will actually be used. Intuitively, we'd like to normalize the convolution kernel so that it integrates to 1 over the bounded domain. The integral is close to 1 at the center of the bounded interval and falls off to 0.5 near the borders. This accounts for the lack of neighboring kernels at the boundaries.
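Here is a minimal sketch of this cut-and-normalize correction, again reusing the basic_kde sketch (weighted_kde is a hypothetical name):

```python
import numpy as np
from scipy.stats import norm

def weighted_kde(samples, x, low=0.0, high=1.0):
    """Boundary-corrected KDE: renormalize by the kernel mass in-domain."""
    samples = np.asarray(samples)
    x = np.asarray(x)
    bandwidth = 1.06 * np.std(samples) * samples.size ** (-1 / 5)
    # Fraction of a kernel centered at x that falls inside [low, high]:
    # close to 1 in the middle of the interval, ~0.5 at the boundaries
    mass_inside = norm.cdf((high - x) / bandwidth) - norm.cdf((low - x) / bandwidth)
    return basic_kde(samples, x, bandwidth=bandwidth) / mass_inside
```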
Similarly to the reflection technique, the graphs below illustrate the weighting trick for three standard distributions: uniform, right triangle and inverse square root. It performs very similarly to the reflection method.
From a computational perspective, it doesn't require processing three times as many samples, but it does need to evaluate the normal cumulative distribution function (CDF) at the prediction points.
Solution: Transformation
The transformation trick maps the bounded data to an unbounded space, where the KDE can be safely applied. This results in using a different kernel function for each input sample.
The logit function leverages the logarithm to map the unit interval [0, 1] to the entire real line: logit(u) = log(u / (1 − u)).
When applying a transform f to a random variable X, the resulting density is obtained by dividing by the absolute value of the derivative of f: if Y = f(X), then p_Y(f(x)) = p_X(x) / |f′(x)|.
We can now apply this to the special case of the logit transform to retrieve the density on [0, 1] from the one estimated in logit space: since logit′(x) = 1 / (x(1 − x)), the original density is p(x) = g(logit(x)) / (x(1 − x)), where g is the density estimated in logit space.
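Here is a minimal sketch of the logit-transform estimator, reusing the basic_kde sketch (logit_kde is a hypothetical name; the clipping constant eps avoids infinities at the boundaries):

```python
import numpy as np

def logit_kde(samples, x, eps=1e-6):
    """Boundary-corrected KDE: estimate in logit space, then map back."""
    samples = np.clip(np.asarray(samples), eps, 1 - eps)
    x = np.clip(np.asarray(x), eps, 1 - eps)

    def logit(u):
        return np.log(u / (1 - u))

    # Plain KDE in the unbounded logit space
    density_logit = basic_kde(logit(samples), logit(x))
    # Change of variables: p(x) = g(logit(x)) * |logit'(x)|,
    # with logit'(x) = 1 / (x * (1 - x))
    return density_logit / (x * (1 - x))
```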
Similarly to the reflection and weighting techniques, the graphs below illustrate the transformation trick for three standard distributions: uniform, right triangle and inverse square root. It performs quite poorly, creating large oscillations at the boundaries. However, it handles the singularity of the inverse square root extremely well.