We present an identity for an unbiased estimate of a general statistical distribution. The identity computes the distribution density by dividing a histogram sum over a local window by a correction factor obtained from a mean-force integral, where the mean force can be evaluated as a configuration average. We show that the optimal window size is roughly the inverse of the local mean-force fluctuation. The new identity offers a more robust and precise estimate than a previous one by Adib and Jarzynski [J. Chem. Phys. 122, 014114 (2005)]. It also allows a straightforward generalization to an arbitrary ensemble and to a joint distribution of multiple variables. In particular, we derive a mean-force-enhanced version of the weighted histogram analysis method (WHAM). The method can be used to improve distributions computed from molecular simulations. We illustrate its use by computing a potential energy distribution, a volume distribution in a constant-pressure ensemble, a radial distribution function, and a joint distribution of amino acid backbone dihedral angles.
Estimating statistical distributions using an integral identity
We present a method for estimating a general statistical distribution from data collected in a molecular simulation. The method is based on an integral identity and is superior to the common approach of using a normalized histogram, which suffers either from large noise when the bin size is too small or from a systematic bias when the bin size is too large.
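The bin-size tradeoff described above is easy to demonstrate. The following sketch (an illustration, not the paper's code) estimates the density of standard Gaussian samples with a normalized histogram at three bin counts and compares each to the exact density; the finest binning is noisy, the coarsest is biased, and a moderate choice beats both:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)  # standard Gaussian test data

def hist_density(samples, nbins):
    """Normalized-histogram estimate of the density on (-4, 4)."""
    counts, edges = np.histogram(samples, bins=nbins, range=(-4.0, 4.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    return centers, counts / (samples.size * width)

def true_density(t):
    return np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

errors = {}
for nbins in (4000, 40, 4):  # too fine (noisy), moderate, too coarse (biased)
    centers, rho = hist_density(samples, nbins)
    errors[nbins] = np.sqrt(np.mean((rho - true_density(centers)) ** 2))
    print(f"{nbins:5d} bins: rms error = {errors[nbins]:.4f}")
```

The rms error is dominated by statistical noise for 4000 bins and by binning bias for 4 bins, while 40 bins keeps both contributions small.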
Our identity is akin to a previous one derived by Adib and Jarzynski¹ (hence the AJ identity), whereby the distribution density ρ(x) at a point x is estimated from a weighted number of visits to a window surrounding x, plus a correction from integrating the derivative of ρ(x). The AJ identity improves over the histogram-based approach not only by eliminating the systematic bias from binning but also by smoothing out the resulting distribution, as the window contains many more data points than a single bin.
However, the identity is slightly inconvenient, as it neither guarantees a positive output nor determines its optimal parameters.
Here we present a new identity in which we construct a proper correction factor by integrating the “mean force” [the logarithmic derivative of ρ(x)], and use it to divide the number of visits to a local window to reach an unbiased estimate, as schematically illustrated in Fig. 1. It offers not only a nonnegative distribution but also a simple estimate of the optimal window size, as it separates the error contributions of the histogram and of the mean force. The new strategy can also be straightforwardly extended to an arbitrary ensemble and to a joint distribution of multiple variables.
We describe the new identity in section II, present a few numerical examples in section III, and conclude the article in section IV with some discussion.
We wish to find an expression for the distribution density ρ(x) of a variable x. As outlined above, we estimate ρ(x) from the fraction of visits to a window (x⁻, x⁺) containing x, divided by a correction factor built from the mean force f(x) ≡ d ln ρ(x)/dx:

  ρ(x) = ⟨h(x⁻, x⁺)⟩ / D(x⁻, x⁺),  D(x⁻, x⁺) = ∫_{x⁻}^{x⁺} exp[ ∫_x^{x′} f(x″) dx″ ] dx′,  (1)

where ⟨h(x⁻, x⁺)⟩ denotes the average fraction of samples falling in the window. Since ρ(x′) = ρ(x) exp[∫_x^{x′} f(x″) dx″], the numerator averages to ρ(x) D(x⁻, x⁺), so the ratio is unbiased for any window size.
We refer to Eq. (1) as the fractional identity. Unlike in the AJ identity¹, the correction here is applied as a divisor instead of additively; see Fig. 1. Nevertheless, it can be derived as a near-optimal modification of the AJ identity, as shown in Appendix A.
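As a concrete illustration, the following minimal sketch applies the fractional identity in the form just described (window-visit fraction divided by the mean-force correction factor). All names are illustrative, and the exact Gaussian mean force f(x) = −x is used in place of a measured configuration average:

```python
import numpy as np

def trapezoid(y, x):
    """Simple trapezoid rule (signed if x is decreasing)."""
    return float(np.sum(0.5 * (np.asarray(y)[1:] + np.asarray(y)[:-1]) * np.diff(x)))

def fractional_estimate(samples, x, half_width, mean_force, n_grid=201):
    """Sketch of the fractional identity:
    rho(x) ~ (fraction of visits to (x-, x+)) / int_{x-}^{x+} exp(int_x^{x'} f) dx'.
    `mean_force` is f = d ln rho / dx; here it is supplied analytically, whereas in
    practice it would be measured as a configuration average along the trajectory."""
    lo, hi = x - half_width, x + half_width
    frac = np.mean((samples > lo) & (samples < hi))   # histogram sum over the window
    grid = np.linspace(lo, hi, n_grid)
    inner = []
    for t in grid:                                    # int_x^{x'} f(x'') dx''
        seg = np.linspace(x, t, 64)
        inner.append(trapezoid(mean_force(seg), seg))
    denom = trapezoid(np.exp(inner), grid)            # mean-force correction factor
    return frac / denom

rng = np.random.default_rng(1)
samples = rng.normal(size=200_000)                    # N(0,1): exact mean force f(x) = -x
rho_hat = fractional_estimate(samples, 1.0, 0.5, lambda t: -t)
print(rho_hat)  # close to exp(-1/2)/sqrt(2*pi) ~ 0.242
```

Note that the output is nonnegative by construction, since both the visit fraction and the exponential integrand are nonnegative.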
The identity Eq. (1) requires the mean force f(x), which can be computed from a simulation trajectory every few frames; the expressions can be found in Eq. (5) and Eq. (4) or (4′).
We first express x as a function x = X(r^N, s) of both the molecular coordinates r^N and (optionally) some variables s of the simulation ensemble; e.g., s can be the volume in a constant-pressure ensemble, or the temperature in a tempering simulation².³ Note that X(r^N, s) denotes the function of r^N and s, while x denotes its value.
where we have again used the δ-function to collect configurations with X(r^N, s) equal to x, and the denominator is equal to ρ(x). Our objective is to find a quantity
We now evaluate the derivative of
We proceed by introducing a vector field
More generally, it can be constructed from an arbitrary vector field
Note that the vector field is defined on the joint vector space of both r^N and s, so
where we have integrated by parts to shift the derivative onto the rest of the integrand, and defined
The last step follows from
Comparing with Eq. (3), we see that the mean force
For a canonical ensemble, the second term
to the gradient of X [assuming Eq. (4)],
which fits the name “mean force” for f_x.
The above derivation is analogous to that of the dynamic temperature by Rugh⁴. In fact, the latter can be derived as a special case of Eq. (5).
According to Eq. (5), it equals
we thus recover Rugh’s dynamic temperature.
We now determine the two window boundaries x⁻ and x⁺ in Eq. (1) such that they minimize the statistical error in ρ(x).
We first note that the histogram and mean-force data contribute independently to the numerator and the denominator, respectively. How much each contributes is, however, controlled by the window size. The output is dominated by the histogram contribution (numerator) for a narrow window, but by the mean-force contribution (denominator) for a wide window. For a narrow window, the denominator reduces to the window width, and thus the output to a histogram average. At the other extreme, if the window covers the entire domain of x, the numerator becomes a constant, and the distribution density is determined entirely by the mean-force integral in the denominator, i.e.,
(the lower bound of the integral is to be determined by the normalization).
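The wide-window limit can be sketched directly: up to a multiplicative constant fixed by normalization, the density is the exponential of the integrated mean force. The example below (an illustration with an assumed analytic mean force; for the standard Gaussian, f(x) = −x) reconstructs the density from the mean force alone:

```python
import numpy as np

# Wide-window limit: rho(x) is proportional to exp( integral of f(x') dx' ),
# with the constant fixed by normalization over the domain.
xs = np.linspace(-4.0, 4.0, 801)
f = -xs                                                  # mean force d ln rho / dx
# cumulative trapezoid of f gives log rho up to an additive constant
log_rho = np.concatenate(([0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * np.diff(xs))))
rho = np.exp(log_rho - log_rho.max())                    # shift before exp for stability
rho /= np.sum(0.5 * (rho[1:] + rho[:-1]) * np.diff(xs))  # normalize over the grid
print(rho[400])  # value at x = 0, close to 1/sqrt(2*pi) ~ 0.399
```

Shifting log ρ by its maximum before exponentiating avoids overflow when the integrated mean force is large; the shift cancels in the normalization.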
As the window size increases, the relative error of the numerator decreases, as more data points reduce the uncertainty, but that of the denominator increases, as the error in the mean-force integral accumulates. The sum of the two reaches a minimum at the optimal size.
Quantitatively, the relative error of the numerator scales as 1/√n, where n is the number of independent data points included in the window.
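The balance between the two error contributions can be sketched with a toy model. Everything below is an assumed illustration, not the paper's error analysis: the numerator's relative error falls as 1/√n with n ≈ Nρw independent points in a window of width w, while the denominator's error is modeled as growing linearly with w at a rate set by the mean-force fluctuation:

```python
import numpy as np

# Toy error model (assumed, for illustration only):
#   numerator:   relative error ~ 1 / sqrt(N * rho * w)
#   denominator: relative error ~ delta_f * w  (mean-force errors accumulate)
N = 1e5          # number of independent samples (assumed)
rho = 0.4        # local density (assumed)
delta_f = 0.5    # mean-force error per unit x (assumed)

w = np.linspace(0.005, 2.0, 4000)
total_err = 1.0 / np.sqrt(N * rho * w) + delta_f * w
w_opt = w[np.argmin(total_err)]
print(f"optimal window width ~ {w_opt:.3f}")
```

Minimizing the model analytically gives w_opt = (2 δ_f √(Nρ))^(−2/3), consistent with the qualitative statement that the optimal window shrinks as the mean-force fluctuation grows.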
The relative error of the denominator D is harder to compute exactly and is thus estimated from an upper bound. First, since D is an integral of exp[∫_x^{x′} f(x″) dx″], the relative error of D is no larger than the maximal relative error of the integrand over the window.