arXiv:0807.4820v1 [physics.data-an] 30 Jul 2008
Data analysis recipes:
Choosing the binning for a histogram1
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics
New York University
david.hogg@nyu.edu
Abstract
Data points are placed in bins when a histogram is created, but
there is always a decision to be made about the number or width of
the bins. This decision is often made arbitrarily or subjectively, but it
need not be. A jackknife or leave-one-out cross-validation likelihood is
defined and employed as a scalar objective function for optimization
of the locations and widths of the bins. The objective is justified as
being related to the histogram’s usefulness for predicting future data.
The method works for data or histograms of any dimensionality.
1 Introduction
There are many situations in experimental science in which one is presented with a collection of discrete measurements x_j and one must bin those points into a set of finite-sized bins i, with centers X_i and full-widths ∆_i, to create a histogram of numbers of points N_i, or the equivalent when the points have non-uniform weights w_j. The problem of binning comes up, for example, when one needs to plot a data histogram, when one needs to perform least-square fitting of a probability distribution function, and when one wants to compute entropies or other measurements on the inferred data probability distribution function.
The choice of bin centers and widths often seems arbitrary. However, there is a non-arbitrary choice, derived below, which emerges when the histogram is thought of as an estimate of the probability distribution function of whatever process generated the data. If the binning is too coarse, the histogram does not give much information about the shape of the probability distribution function. If the binning is too fine, bins become empty and the histogram becomes noisy, so it in some sense "overfits" the data. The best binning lies in between these extremes and can be found simply and quickly by a "jackknife" or cross-validation method, that is, by excluding data subsamples and using the non-excluded data to predict the excluded data. This is not the only data-based binning-choice approach², but it is simple and sensible.

¹ Copyright 2008 David W. Hogg (david.hogg@nyu.edu). You may copy and distribute this document provided that you make no changes to it whatsoever.
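Hogg derives the jackknife objective itself later in the paper; as a minimal illustrative sketch (not the paper's exact formulation), a leave-one-out log-likelihood for choosing among uniform binnings might look like the following. The function name, the restriction to uniform bins, and the default α = 1 are all assumptions made for the example.

```python
import numpy as np

def loo_log_likelihood(x, edges, alpha=1.0):
    """Leave-one-out log-likelihood of data x under a binned, smoothed density.

    Each datum j is scored by the density estimated from the other n-1 points:
    its own bin count is reduced by one before forming the smoothed probability.
    """
    counts, _ = np.histogram(x, bins=edges)
    widths = np.diff(edges)
    n = len(x)
    # Bin index of each datum (matches np.histogram's left-inclusive convention).
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1,
                  0, len(widths) - 1)
    loo_counts = counts[idx] - 1                 # datum j held out of its own bin
    norm = (n - 1) + alpha * len(widths)         # sum over bins of [N_k + alpha]
    p = (loo_counts + alpha) / norm              # smoothed bin probability
    return np.sum(np.log(p / widths[idx]))       # log of piecewise-constant density

# Pick the number of uniform bins that maximizes the leave-one-out objective.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
best_nbins = max(range(2, 41),
                 key=lambda nb: loo_log_likelihood(
                     x, np.linspace(x.min(), x.max(), nb + 1)))
```

Too few bins wash out the shape of the density and too many leave near-empty bins that predict held-out points poorly, so the objective peaks at an intermediate binning, as the text describes.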
In what follows, we are going to consider a data histogram, which we imagine as a set of bins i, with centers X_i and widths (or multi-dimensional volumes) ∆_i. Equivalently (and perhaps more usefully), the parameterization of the bins can be described by a set of edges X_{i-1/2}, so the centers become X_i = [X_{i-1/2} + X_{i+1/2}]/2 and the widths become ∆_i = X_{i+1/2} - X_{i-1/2}. These bins will get filled by a set of (possibly multi-dimensional) data points x_j, leading to each bin i containing a number of data points N_i. We will also make reference to the binning function i(x) which, for a given data value x, returns the bin i.
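In code, the edge parameterization just described can be sketched as follows (the edge values are arbitrary illustrative numbers; the helper name `bin_index` is an assumption):

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.5, 4.0])    # edges X_{i-1/2}: three bins
centers = 0.5 * (edges[:-1] + edges[1:])  # X_i = [X_{i-1/2} + X_{i+1/2}] / 2
widths = np.diff(edges)                   # Delta_i = X_{i+1/2} - X_{i-1/2}

def bin_index(x, edges=edges):
    """The binning function i(x): index of the bin containing value x."""
    return int(np.clip(np.searchsorted(edges, x, side="right") - 1,
                       0, len(edges) - 2))
```

The edge description is handy precisely because K+1 edges determine both the K centers and the K widths, so there is no redundancy in the parameterization.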
2 Model probability distribution function
Our best binning is based on the idea that the histogram is a sampling of a probability distribution function and can therefore be thought of as providing an estimate or model of that probability distribution function.

One possible (approximate) probabilistic model for the data is that they are drawn from a probability distribution function such that, in each bin of the histogram we are making, the probability is constant and proportional to the number of actual data points that landed (by chance) in that bin. This model has the limitation that bins that happen (by chance) to be empty will be assigned zero probability; when a new datum happens to arrive (by chance) inside one of those previously empty bins, it will be assigned a vanishing likelihood and render the probabilistic model false at (arbitrarily) high confidence.
A more well-behaved (approximate) probabilistic model is that the probability p(i) that a data point land in bin i is

    p(i) = \frac{N_i + \alpha}{\sum_k \left[ N_k + \alpha \right]}  ,    (1)

where α is a dimensionless "smoothing" constant of order unity (to be set later). Here, so long as there are a finite number of bins, the probability is non-zero in every bin. The associated (approximate, model) probability distribution function is

    \tilde{f}(x) = \frac{p(i(x))}{\Delta_{i(x)}}  ,    (2)

where i(x) is the function that returns the bin i for any value x. Note that the function \tilde{f}(x) is normalized by construction;

    \int \tilde{f}(x) \, dx = 1  .    (3)

² See, for example, Knuth, K. H., "Optimal data-based binning for histograms," arXiv:physics/0605197, and references cited therein.
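A small numerical check of equations (1)-(3), with made-up counts and widths (including an empty bin, which now correctly receives non-zero probability):

```python
import numpy as np

counts = np.array([4, 0, 6])        # N_i, including an empty middle bin
widths = np.array([1.0, 1.5, 1.5])  # Delta_i (illustrative values)
alpha = 1.0                          # smoothing constant of order unity

p = (counts + alpha) / np.sum(counts + alpha)  # equation (1)
f_tilde = p / widths                            # equation (2): density per bin

# Equation (3): integrating the piecewise-constant density over all bins
# means summing f~ * Delta, which equals 1 by construction.
total = np.sum(f_tilde * widths)
```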
In general, the data points will not all be treated equally, but in fact each data point x_j will come with a weight w_j, and each bin i will contain total weight W_i. The only change this makes is in the inferred probability p(i), which becomes

    p(i) = \frac{W_i + \alpha}{\sum_k \left[ W_k + \alpha \right]}  ,    (4)

where now the smoothing constant α will be of order the mean weight w_j.
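The weighted version, equation (4), changes only the accumulation step: bins collect total weight W_i rather than counts, and α is scaled to the mean weight, as the text prescribes. A minimal sketch with invented weights and bin assignments:

```python
import numpy as np

weights = np.array([0.5, 2.0, 1.0, 0.5])  # w_j for four data points (illustrative)
bin_of_point = np.array([0, 0, 2, 2])     # i(x_j) for each point, given 3 bins

W = np.zeros(3)
np.add.at(W, bin_of_point, weights)       # W_i: total weight landing in each bin

alpha = weights.mean()                    # alpha of order the mean weight w_j
p = (W + alpha) / np.sum(W + alpha)       # equation (4)
```

With unit weights this reduces exactly to equation (1), since then W_i = N_i and the mean weight is 1.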
3
Jackk
…(Full text truncated)…