On the average of inconsistent data

Reading time: 5 minutes

📝 Original Info

  • Title: On the average of inconsistent data
  • ArXiv ID: 1109.5395
  • Date: 2011-09-27
  • Authors: Giovanni Mana, Maria Mirabela Predescu

📝 Abstract

When data do not conform to the hypothesis of a known sampling variance, the fitting of a constant to the set of measured values is a long-debated problem. Given the data, the fitting requires finding which measurand value is the most probable. A fitting procedure is reviewed here which assigns probabilities to the possible measurand values, on the assumption that the uncertainty associated with each datum is a lower bound to the standard deviation. This procedure is applied to derive an estimate of the Planck constant.

📄 Full Content

Given a set of measured values of a constant and the associated uncertainties, the Gauss-Markov theorem states that the unbiased minimum-variance estimator of the measurand is the weighted mean [1]. The uncertainty of the mean, which is smaller than the smallest uncertainty, does not depend on the data spread. This is a consequence of the assumption that the variance of the sampling distribution of each datum is exactly known. In practice, this hypothesis is often false, and the inconsistency of the data, quantified for example by the χ² or Birge-ratio values, suggests that the uncertainties associated with the data are merely lower bounds to the standard deviations of their sampling distributions. In this case, the Gauss-Markov theorem is of no help and the choice of an optimal measurand value is a long-debated issue. The consequences of assuming that the associated uncertainties are point estimates of the standard deviations are illustrated by Dose [2], who considers the estimate of the Newtonian constant of gravitation.
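
As a numerical illustration of the consistency check mentioned above, the following sketch computes the weighted mean, its formal uncertainty, and the Birge ratio for a set of values and quoted uncertainties; the function names and the input numbers are illustrative placeholders, not data from the paper.

```python
import numpy as np

def weighted_mean(x, u):
    """Gauss-Markov (inverse-variance weighted) mean and its formal uncertainty."""
    w = 1.0 / u**2
    mean = np.sum(w * x) / np.sum(w)
    u_mean = 1.0 / np.sqrt(np.sum(w))   # smaller than the smallest u_i, independent of the spread
    return mean, u_mean

def birge_ratio(x, u):
    """Birge ratio R_B = sqrt(chi^2 / (N - 1)); values well above 1 signal inconsistent data."""
    mean, _ = weighted_mean(x, u)
    chi2 = np.sum(((x - mean) / u) ** 2)
    return np.sqrt(chi2 / (len(x) - 1))

# Illustrative placeholder values, not the measurement results discussed in the paper
x = np.array([1.20, 1.50, 0.90, 2.10])
u = np.array([0.10, 0.20, 0.10, 0.15])
print(weighted_mean(x, u), birge_ratio(x, u))
```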

Decision theory and probability calculus help to deal with the discrepancy between the quoted uncertainties and the data scatter. Given the measurement results and the lower bounds of the standard deviations, the first step is to find the probabilities of the possible measurand values. As described by Sivia [3], who solved the related problem of dealing with outliers, probabilities are assigned by application of the product rule and marginalization. Next, given the loss due to a wrong decision, the optimal choice of the measurand value minimizes the expected loss over the assigned probabilities. Since its foundations are in probability theory, this method makes it unnecessary to exclude the data disagreeing with the majority or to scale the uncertainties to make the data consistent. After reviewing Sivia's analysis, the procedure is applied here to estimate the value of the Planck constant from a set of inconsistent measurement results.

Let us consider a set of N measured values xᵢ of a measurand h, which have been independently sampled from Gaussian distributions having unknown variances σᵢ² ≥ uᵢ², where uᵢ is the uncertainty associated with the datum xᵢ. The problem is to find optimal estimates of the measurand value and of the confidence intervals. Clearly, we cannot rely on the weighted mean, because the actual variances of the sampling distributions are not known. On the other hand, we cannot fall back on the arithmetic mean, because it leaves out significant information delivered by the associated uncertainties.

Given the data and their associated uncertainties, the solution is to assign probabilities to the h values and to use these probabilities to minimize any given loss function [4,5,6].

Firstly, we make probability assignments to the possible σᵢ values, for which only the uᵢ lower bounds are known. In the absence of any additional information, we assume that the probability distribution of σᵢ is independent of the measurement unit, that is, aP(aσ) = P(σ), where P(σ) is the sought distribution of the σ values [3]. This implies

where the Heaviside function θ(σ − u) is 1 if σ ≥ u and zero if σ < u, and the subscript i has been dropped. The distribution (1) is improper but, provided the final measurand distribution is integrable, this is not a serious problem. In any case, we can always set a finite upper bound and take the limit to infinity only at the end of the calculation.
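
A plausible reconstruction of equation (1), assuming the scale-invariance condition singles out the truncated Jeffreys form:

```latex
P(\sigma; u) \;\propto\; \frac{\theta(\sigma - u)}{\sigma},
```

which is indeed scale invariant and improper, consistent with the remarks above.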

Next, having assigned P(σ; u), the sampling distribution of each datum x can be marginalized to eliminate the unknown variance. Hence,

where N(x; h, σ²) is a normal distribution of the x values with mean h and variance σ², and erf is the error function.
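
Carrying out this marginalization over the truncated 1/σ prior gives, up to a constant factor, a form consistent with the error function mentioned above; a hedged reconstruction of equation (2) under that assumption is

```latex
Q(x; h, u) \;=\; \int_{u}^{\infty} N(x; h, \sigma^{2})\,
\frac{\theta(\sigma - u)}{\sigma}\,\mathrm{d}\sigma
\;\propto\; \frac{1}{|x - h|}\,
\operatorname{erf}\!\left(\frac{|x - h|}{\sqrt{2}\,u}\right),
```

which remains finite as x → h and decays like 1/|x − h| far from h, giving heavy tails that tolerate outlying data.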

Eventually, we make pre-data probability assignments to the h values. In the absence of any additional information, we assume that they are independent of the unit-scale origin, which implies a uniform distribution.

By application of the product rule of probability, the only post-data probability distribution of the measurand values, logically consistent with the pre-data assignments and sampling distributions, is

where the Q function is given in (2), 1/Z is a normalization factor, and xᵢ and uᵢ are the i-th datum and its uncertainty.
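
A hedged reconstruction of the post-data distribution (3), assuming independent data and the uniform pre-data distribution for h:

```latex
P(h \mid \{x_i\}, \{u_i\}) \;=\; \frac{1}{Z}\,\prod_{i=1}^{N} Q(x_i; h, u_i),
\qquad
Z \;=\; \int_{-\infty}^{+\infty} \prod_{i=1}^{N} Q(x_i; h, u_i)\,\mathrm{d}h .
```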

The post-data distribution (3) is central to our analysis. Given any loss L(h₀ − h) due to a wrong estimate h₀, the optimal choice of the h value minimizes
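
The quantity to be minimized is the expected loss, which in the standard decision-theoretic form (a reconstruction, not quoted from the paper) reads

```latex
\langle L \rangle(h_0) \;=\; \int_{-\infty}^{+\infty} L(h_0 - h)\,
P(h \mid \{x_i\}, \{u_i\})\,\mathrm{d}h .
```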

With a quadratic loss, the optimal estimate is the mean; with a linear loss, it is the median; with a loss that is the same for any non-zero error (a zero-one loss), it is the mode. Confidence intervals are easily calculated from (3).
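
A minimal numerical sketch of the whole procedure, built on the reconstructed Q above: it evaluates the post-data distribution on a grid and extracts the mean, median, mode, and a central interval. The helper names and the input values are ours and purely illustrative.

```python
import numpy as np
from scipy.special import erf

def q(x, h, u):
    """Marginalized sampling density of one datum, up to a constant factor:
    erf(|x - h| / (sqrt(2) u)) / (2 |x - h|), with its finite limit at x = h."""
    a = np.abs(x - h)
    a_safe = np.where(a > 0, a, 1.0)               # avoid 0/0; replaced below
    val = erf(a_safe / (np.sqrt(2.0) * u)) / (2.0 * a_safe)
    return np.where(a > 0, val, 1.0 / (np.sqrt(2.0 * np.pi) * u))

def posterior(h_grid, x, u):
    """Product of the q factors over the data, normalized on the grid."""
    logp = np.zeros_like(h_grid)
    for xi, ui in zip(x, u):
        logp += np.log(q(xi, h_grid, ui))
    p = np.exp(logp - logp.max())
    return p / np.trapz(p, h_grid)

# Illustrative placeholder data, not the Planck-constant results of table 1
x = np.array([1.000, 1.020, 0.995, 1.100])
u = np.array([0.010, 0.020, 0.010, 0.010])
h = np.linspace(0.90, 1.20, 20001)
p = posterior(h, x, u)

dh = h[1] - h[0]
cdf = np.cumsum(p) * dh
mean = np.trapz(h * p, h)                  # optimal under a quadratic loss
median = h[np.searchsorted(cdf, 0.5)]      # optimal under a linear loss
mode = h[np.argmax(p)]                     # optimal under a zero-one loss
ci68 = (h[np.searchsorted(cdf, 0.16)], h[np.searchsorted(cdf, 0.84)])
print(mean, median, mode, ci68)
```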

As an application example, let us consider the choice of the Planck constant value on the basis of the measurement results listed in table 1. We neither presume to account for the interlinks between the data, nor do we presume to compete with the work carried out by the committee on data for science and technology (CODATA) [7].

The Planck co
