Data analysis recipes: Choosing the binning for a histogram


📝 Original Info

  • Title: Data analysis recipes: Choosing the binning for a histogram
  • ArXiv ID: 0807.4820
  • Date: 2008-07-31
  • Authors: David W. Hogg (Center for Cosmology and Particle Physics, Department of Physics, New York University)

📝 Abstract

Data points are placed in bins when a histogram is created, but there is always a decision to be made about the number or width of the bins. This decision is often made arbitrarily or subjectively, but it need not be. A jackknife or leave-one-out cross-validation likelihood is defined and employed as a scalar objective function for optimization of the locations and widths of the bins. The objective is justified as being related to the histogram's usefulness for predicting future data. The method works for data or histograms of any dimensionality.

📄 Full Content

arXiv:0807.4820v1 [physics.data-an] 30 Jul 2008

Data analysis recipes: Choosing the binning for a histogram¹

David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
david.hogg@nyu.edu

1 Introduction

There are many situations in experimental science in which one is presented with a collection of discrete measurements x_j and one must bin those points into a set of finite-sized bins i, with centers X_i and full widths ∆_i, to create a histogram of numbers of points N_i (or the equivalent when the points have non-uniform weights w_j). The problem of binning comes up, for example, when one needs to plot a data histogram, when one needs to perform least-squares fitting of a probability distribution function, and when one wants to compute entropies or other measurements on the inferred data probability distribution function.

The choice of bin centers and widths often seems arbitrary. However, there is a non-arbitrary choice, derived below, which emerges when the histogram is thought of as an estimate of the probability distribution function of whatever process generated the data. If the binning is too coarse, the histogram does not give much information about the shape of the probability distribution function. If the binning is too fine, bins become empty and the histogram becomes noisy, so it in some sense "overfits" the data. The best binning lies in between these extremes and can be found simply and quickly by a "jackknife" or cross-validation method, that is, by excluding data subsamples and using the non-excluded data to predict the excluded data. This is not the only data-based binning-choice approach², but it is simple and sensible.

In what follows, we are going to consider a data histogram, which we imagine as a set of bins i, with centers X_i and widths (or multi-dimensional volumes) ∆_i. Equivalently (and perhaps more usefully), the parameterization of the bins can be described by a set of edges X_{i−1/2}, so the centers become X_i = [X_{i−1/2} + X_{i+1/2}] / 2 and the widths become ∆_i = X_{i+1/2} − X_{i−1/2}. These bins will get filled by a set of (possibly multi-dimensional) data points x_j, leading to each bin i containing a number of data points N_i. We will also make reference to the binning function i(x) which, for a given data value x, returns the bin i.

¹ Copyright 2008 David W. Hogg (david.hogg@nyu.edu). You may copy and distribute this document provided that you make no changes to it whatsoever.

2 Model probability distribution function

Our best binning is based on the idea that the histogram is a sampling of a probability distribution function and can therefore be thought of as providing an estimate or model of that probability distribution function. One possible (approximate) probabilistic model for the data is that they are drawn from a probability distribution function such that, in each bin of the histogram we are making, the probability is constant and proportional to the number of actual data points that landed (by chance) in that bin. This model has the limitation that bins that happen (by chance) to be empty will be assigned zero probability; when a new datum happens to arrive (by chance) inside one of those previously empty bins, it will be assigned a vanishing likelihood and render the probabilistic model false at (arbitrarily) high confidence.
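The bin parameterization described above (edges X_{i−1/2} determining centers X_i, widths ∆_i, and the binning function i(x)) can be sketched in a few lines of Python. The function and variable names here are illustrative, not taken from the paper or any published code:

```python
# Sketch of the binning setup: bin i is bounded by edges X_{i-1/2} and
# X_{i+1/2}; centers, widths, and counts all follow from the edge list.

def bin_index(x, edges):
    """The binning function i(x): return the bin i containing x."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    raise ValueError("x lies outside the binned range")

def bin_counts(data, edges):
    """Counts N_i of data points x_j landing in each bin i."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        counts[bin_index(x, edges)] += 1
    return counts

edges = [0.0, 1.0, 2.0, 4.0]                               # X_{i-1/2}; uneven widths allowed
centers = [(a + b) / 2 for a, b in zip(edges, edges[1:])]  # X_i
widths = [b - a for a, b in zip(edges, edges[1:])]         # Delta_i
counts = bin_counts([0.5, 1.5, 1.7, 3.0], edges)           # N_i
```

Note that nothing requires the bins to be equal-width: the edge list, not a single width parameter, is what gets optimized later.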
A more well-behaved (approximate) probabilistic model is that the probability p(i) that a data point land in bin i is

    p(i) = [N_i + α] / Σ_k [N_k + α] ,   (1)

where α is a dimensionless "smoothing" constant of order unity (to be set later). Here, so long as there are a finite number of bins, the probability is non-zero in every bin. The associated (approximate, model) probability distribution function is

    f̃(x) = p(i(x)) / ∆_{i(x)} ,   (2)

where i(x) is the function that returns the bin i for any value x. Note that the function f̃(x) is normalized by construction:

    ∫ f̃(x) dx = 1 .   (3)

In general, the data points will not all be treated equally; in fact each data point x_j will come with a weight w_j, and each bin i will contain total weight W_i. The only change this makes is in the inferred probability p(i), which becomes

    p(i) = [W_i + α] / Σ_k [W_k + α] ,   (4)

where now the smoothing constant α will be of order the mean weight w_j.

² See, for example, Knuth, K. H., "Optimal data-based binning for histograms," arXiv:physics/0605197, and references cited therein.

3 Jackk
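Equations (1) and (2) are straightforward to implement. Since the section defining the jackknife objective is truncated here, the leave-one-out score below follows only the introduction's description (exclude each point, use the rest to predict it); it is a hedged sketch, not the paper's exact formula, and all names are illustrative:

```python
import math

def bin_index(x, edges):
    """The binning function i(x)."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    raise ValueError("x lies outside the binned range")

def smoothed_probs(counts, alpha=1.0):
    """Equation (1): p(i) = (N_i + alpha) / sum_k (N_k + alpha)."""
    total = sum(n + alpha for n in counts)
    return [(n + alpha) / total for n in counts]

def model_pdf(x, counts, edges, alpha=1.0):
    """Equation (2): f(x) = p(i(x)) / Delta_{i(x)}, piecewise constant."""
    i = bin_index(x, edges)
    return smoothed_probs(counts, alpha)[i] / (edges[i + 1] - edges[i])

def loo_log_likelihood(data, edges, alpha=1.0):
    """Hypothetical leave-one-out objective, per the introduction: remove
    each x_j in turn, rebuild the smoothed model from the remaining points,
    and sum log f(x_j). Higher scores mean better predicted data."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        counts[bin_index(x, edges)] += 1
    # sum_k (N_k + alpha) with one data point held out:
    denom = (len(data) - 1) + alpha * len(counts)
    score = 0.0
    for x in data:
        i = bin_index(x, edges)
        p = (counts[i] - 1 + alpha) / denom   # x removed from its own bin
        score += math.log(p / (edges[i + 1] - edges[i]))
    return score
```

Because α > 0 keeps every p(i) strictly positive, the score stays finite even when a held-out point falls in an otherwise empty bin; the binning would then be chosen by maximizing this score over candidate edge sets.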

…(Full text truncated)…

