Data analysis recipes: Choosing the binning for a histogram
Data points are placed in bins when a histogram is created, but there is always a decision to be made about the number or width of the bins. This decision is often made arbitrarily or subjectively, but it need not be. A jackknife or leave-one-out cross-validation likelihood is defined and employed as a scalar objective function for optimization of the locations and widths of the bins. The objective is justified as being related to the histogram’s usefulness for predicting future data. The method works for data or histograms of any dimensionality.
💡 Research Summary
The paper tackles a fundamental yet often overlooked problem in data analysis: how to choose the number and width of bins when constructing a histogram. While histograms are a staple for visualizing distributions and for non‑parametric density estimation, the binning decision is traditionally made by applying simple empirical rules (Sturges, Scott, Freedman‑Diaconis) or by ad‑hoc visual inspection. These approaches ignore the actual shape of the data and can lead to over‑ or under‑smoothing, which in turn distorts downstream inference.
To address this, the authors propose a principled, data‑driven objective function based on leave‑one‑out cross‑validation likelihood (often called a jackknife likelihood). For a dataset D = {x₁,…,xₙ}, the procedure removes each observation xᵢ in turn, builds a histogram H₋ᵢ from the remaining n − 1 points, and evaluates the probability that xᵢ would fall into its corresponding bin under H₋ᵢ. If the bin containing xᵢ has width w_b and contains n_b⁽⁻ᵢ⁾ points (excluding xᵢ), the predictive probability density is pᵢ = n_b⁽⁻ᵢ⁾ / ((n − 1) w_b). The cross‑validation objective is the sum of ln pᵢ over all points, and the bin locations and widths are chosen to maximize it. Note that if any point ends up alone in its bin, its leave‑one‑out probability is zero and the objective diverges to −∞, so the criterion automatically penalizes over‑fine binning.
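The procedure above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: it assumes equal‑width bins over the data range and the leave‑one‑out density pᵢ = n_b⁽⁻ᵢ⁾ / ((n − 1) w_b) described above; the helper names (`loo_log_likelihood`, `best_nbins`) are hypothetical.

```python
import numpy as np

def loo_log_likelihood(data, bin_edges):
    """Leave-one-out cross-validation log-likelihood of a histogram.

    For each point x_i, the predictive density under the histogram built
    from the other n - 1 points is n_b^(-i) / ((n - 1) * w_b), where b is
    the bin containing x_i. A point alone in its bin contributes -inf.
    """
    data = np.asarray(data)
    n = len(data)
    counts, _ = np.histogram(data, bins=bin_edges)
    widths = np.diff(bin_edges)
    # Bin index of each point; clip so the maximum lands in the last bin,
    # matching np.histogram's closed right edge.
    idx = np.clip(np.searchsorted(bin_edges, data, side="right") - 1,
                  0, len(widths) - 1)
    loo_counts = counts[idx] - 1  # remove the point from its own bin
    with np.errstate(divide="ignore"):
        return np.sum(np.log(loo_counts) - np.log((n - 1) * widths[idx]))

def best_nbins(data, candidates=range(2, 51)):
    """Pick the number of equal-width bins maximizing the LOO likelihood."""
    lo, hi = np.min(data), np.max(data)
    scores = {k: loo_log_likelihood(data, np.linspace(lo, hi, k + 1))
              for k in candidates}
    return max(scores, key=scores.get)
```

In practice one would optimize bin locations as well as the count (the paper's objective covers both, and extends to any dimensionality); the grid search over equal‑width binnings shown here is only the simplest special case.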