The measurement of biallelic pairwise association, called linkage disequilibrium (LD), is important for understanding genomic architecture. A large variety of such measures of association has been proposed in the literature. We propose and justify six biometrical postulates which should be fulfilled by a canonical measure of LD. In short, LD measures are defined as a mapping of probability tables to the set of real numbers. They should be zero in case of independence and extremal if one of the entries approaches zero while the marginals are positively bounded. They should reflect the symmetry group of the tables and be invariant under certain transformations of the marginals (selection invariance). Their scale should have maximum entropy relative to a calibrating symmetric distribution. None of the established measures fulfils all of these properties in general. We prove that there is a unique canonical measure of LD for each choice of a calibrating distribution. We compare the canonical LD measures with other candidates from the literature. We recommend the canonical measure derived from Jeffreys' non-informative prior distribution when assessing linkage disequilibrium of SNP array data. In a second part, we consider various estimators for the theoretical LD measures discussed and compare them in a simulation study.
A Canonical Measure of Allelic Association
Modern high-throughput genetic methods increasingly provide medium to large data sets that consist of high-dimensional vectors of binary markers. We have been particularly motivated by the example of SNP chips that address up to one million biallelic single nucleotide polymorphisms (SNPs). Another example of this data type is patterns of genomic aberration in tumours, which can again be measured with SNP-chip technology or by matrix comparative genomic hybridisation (mCGH).
We restrict ourselves to one-sample problems, as opposed to the two- or multi-sample problems encountered in disease-association case-control studies. The focus is on detecting highly linked pairs of markers. In the case of SNPs, this kind of association is called linkage disequilibrium (LD).
Highly linked SNPs are interpreted as being inherited together. LD indicates that recombination events between the two sites were rare in the population under study. However, there may be other reasons for high LD, such as admixture or selection. Linkage has been analysed to understand the genomic architecture, especially with respect to recombination hot-spots and jointly inherited haplotype blocks (Schulze et al., 2004; Service et al., 2006). In the following, we always restrict ourselves to LD between two biallelic markers.
A basic step in analysing such data is to assess associations between markers in a very large number of two-by-two tables and to compare associations between tables. A bewildering plethora of measures of association is used in the literature (Devlin & Risch, 1995; Hedrick, 1987; Thomas, 2004). Some suggestions on the preferred use of single measures have been made (Devlin & Risch, 1995; Mueller, 2004). Most of these arguments are based on biological issues such as dependence on allelic frequencies and rate of decay (Hedrick, 1987), or on practical applications such as correlation of test statistics (Pritchard & Przeworski, 2001) and determination of haplotype blocks (Gabriel et al., 2002).
After a short review of different LD measures, we propose and justify biometrical and statistical postulates for choosing between measures of association in the one-sample case. We conclude that none of the established LD measures fulfils all of the desirable properties in general. We construct a family of canonical linkage disequilibrium measures which fulfil all of our postulates. Family members differ in the choice of a symmetric Dirichlet distribution on the set of all two-by-two contingency tables. These Dirichlet distributions calibrate the scale of the measure, which essentially measures how extreme the LD is relative to the given distribution. The new measures are compared with the established ones. Finally, the problem of estimating the new measure is addressed and different estimators are compared in a simulation study.
We analyse contingency tables of two biallelic markers on one strand of the genome. Let T be the manifold of all tetranomial probability models written as a two-by-two table of probabilities:
\[ t = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \]
T consists of all two-by-two matrices t with entries $p_{ij} \in \mathbb{R}$ ($i, j \in \{0, 1\}$) fulfilling the properties $p_{ij} > 0$ and $\sum_{i,j} p_{ij} = 1$. The $p_{ij}$ denote the probabilities of the corresponding combination of allele i of the first marker with allele j of the second marker. In the following, we abbreviate $\sum_{i=0}^{1} \sum_{j=0}^{1} = \sum_{i,j}$, $p_{i.} = p_{i0} + p_{i1}$ and $p_{.j} = p_{0j} + p_{1j}$ for convenience. Here, the marginals $p_{i.}$ and $p_{.j}$ denote the frequencies of the alleles of the two markers.
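As an illustration of this setup, the following minimal sketch represents a table t as a NumPy array and computes its marginals. The helper names `make_table` and `marginals` and the example probabilities are assumptions made for this illustration; they do not come from the paper.

```python
import numpy as np

def make_table(p00, p01, p10, p11):
    """Build a 2x2 probability table and check that it lies in T."""
    t = np.array([[p00, p01], [p10, p11]], dtype=float)
    if np.any(t <= 0):
        raise ValueError("all entries p_ij must be strictly positive")
    if not np.isclose(t.sum(), 1.0):
        raise ValueError("the entries p_ij must sum to 1")
    return t

def marginals(t):
    """Return (p_0., p_1.) and (p_.0, p_.1): row sums and column sums."""
    return t.sum(axis=1), t.sum(axis=0)

# Hypothetical example table (values chosen for illustration only).
t = make_table(0.4, 0.1, 0.2, 0.3)
row, col = marginals(t)
print("allele frequencies, first marker: ", row)
print("allele frequencies, second marker:", col)
```

Rows index the alleles of the first marker and columns those of the second, matching the subscript order of the entries.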
Statistically, a measure of LD is simply a measure of association in the contingency table t. The following measures have been defined in the literature:
D: The measure D is the absolute deviation of the observed probability from its expectation when the alleles of the first marker are combined at random with the alleles of the second marker, under the assumption of constant marginals. Hence:
\[ D = p_{00} - p_{0.}\,p_{.0} = p_{00}\,p_{11} - p_{01}\,p_{10} \]
This measure is zero in case of independence of the markers, but its extremal values depend on the marginals.
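A small numerical sketch may help here. It assumes the standard textbook form of D, the determinant of the table, and the function name `d_measure` and the example tables are hypothetical.

```python
def d_measure(t):
    """D as the determinant of the 2x2 table (standard definition, assumed here)."""
    (p00, p01), (p10, p11) = t
    return p00 * p11 - p01 * p10

# Independent table: outer product of the marginals (0.3, 0.7) and (0.6, 0.4).
indep = [[0.3 * 0.6, 0.3 * 0.4],
         [0.7 * 0.6, 0.7 * 0.4]]

# Associated table with marginals (0.5, 0.5) and (0.6, 0.4).
assoc = [[0.4, 0.1],
         [0.2, 0.3]]

d_indep = d_measure(indep)
d_assoc = d_measure(assoc)
print(d_indep, d_assoc)
```

For the second table, the determinant agrees with the deviation form: 0.4 - 0.5 * 0.6 = 0.1.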
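Lewontin’s D′ (Lewontin, 1963): The widely used measure D′ is a standardisation of the original measure D:
\[ D' = \frac{D}{D_{\max}}, \qquad D_{\max} = \begin{cases} \min(p_{0.}\,p_{.1},\; p_{1.}\,p_{.0}) & \text{if } D \ge 0, \\ \min(p_{0.}\,p_{.0},\; p_{1.}\,p_{.1}) & \text{if } D < 0. \end{cases} \]
Lewontin’s D′ ranges from -1 to 1 and tends to these values if one of the $p_{ij}$ tends to zero while the marginals are bounded away from zero.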
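The limiting behaviour can be checked numerically. The sketch below assumes the usual definition of D′ as D divided by its largest attainable magnitude given the marginals; the function name `d_prime` and the tables are illustrative assumptions.

```python
def d_prime(t):
    """Lewontin's D' under the usual standardisation (assumed here)."""
    (p00, p01), (p10, p11) = t
    r0, r1 = p00 + p01, p10 + p11   # marginals of the first marker
    c0, c1 = p00 + p10, p01 + p11   # marginals of the second marker
    d = p00 * p11 - p01 * p10
    if d >= 0:
        d_max = min(r0 * c1, r1 * c0)
    else:
        d_max = min(r0 * c0, r1 * c1)
    return d / d_max                # d_max > 0 because all entries are positive

moderate = [[0.4, 0.1], [0.2, 0.3]]
eps = 1e-9
one_small_entry = [[0.5, eps], [0.2, 0.3 - eps]]   # p_01 close to zero

dp_moderate = d_prime(moderate)
dp_extreme = d_prime(one_small_entry)
print(dp_moderate, dp_extreme)
```

A single vanishing entry is enough to push D′ towards 1, even though the other diagonal entry stays at 0.2.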
Correlation coefficient r (Hill & Robertson, 1968): The usual correlation coefficient applied to binary data is about as popular as D′. It also ranges from -1 to 1, where an absolute value of 1 is obtained when a diagonal of t tends to zero:
\[ r = \frac{D}{\sqrt{p_{0.}\,p_{1.}\,p_{.0}\,p_{.1}}} \]
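To contrast this with D′, the following sketch evaluates r on a table with a single near-zero entry and on a table with a near-zero diagonal. It assumes the standard form of the correlation coefficient for a two-by-two table, D divided by the square root of the product of all four marginals; `corr_r` and the example tables are hypothetical.

```python
import math

def corr_r(t):
    """Correlation coefficient of two binary markers (standard form, assumed)."""
    (p00, p01), (p10, p11) = t
    r0, r1 = p00 + p01, p10 + p11
    c0, c1 = p00 + p10, p01 + p11
    d = p00 * p11 - p01 * p10
    return d / math.sqrt(r0 * r1 * c0 * c1)

eps = 1e-9
one_small_entry = [[0.5, eps], [0.2, 0.3 - eps]]        # only p_01 near zero
small_diagonal = [[0.5 - eps, eps], [eps, 0.5 - eps]]   # p_01 and p_10 near zero

r_one = corr_r(one_small_entry)
r_diag = corr_r(small_diagonal)
print(r_one, r_diag)
```

With only one entry near zero, r stays well below 1 (about 0.65 here), while it approaches 1 once both off-diagonal entries vanish.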
Odds ratio λ (Edwards, 1963):
\[ \lambda = \frac{p_{00}\,p_{11}}{p_{01}\,p_{10}} \]
The odds ratio is the first quantity which is not directly dependent on D and the marginals. It is well known that λ is invariant under selection of single rows or columns of the table t. It is thus often used in analysing (two-sample) case-control studies. The odds ratio is extremal if one of the $p_{ij}$ tends to zero while the marginals are bounded away from zero.
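The invariance under selection can be illustrated as follows. In this sketch, selection on a row is modelled as multiplying that row by a positive factor and renormalising the table; this modelling choice, the helper names `odds_ratio`, `d_measure` and `select_row`, and the numbers are assumptions for illustration.

```python
def odds_ratio(t):
    """Cross-product ratio of the 2x2 table."""
    (p00, p01), (p10, p11) = t
    return (p00 * p11) / (p01 * p10)

def d_measure(t):
    """D as the determinant of the table, for comparison."""
    (p00, p01), (p10, p11) = t
    return p00 * p11 - p01 * p10

def select_row(t, i, factor):
    """Scale row i by a positive factor, then renormalise to total mass 1."""
    scaled = [[p * (factor if k == i else 1.0) for p in row]
              for k, row in enumerate(t)]
    total = sum(sum(row) for row in scaled)
    return [[p / total for p in row] for row in scaled]

t = [[0.4, 0.1], [0.2, 0.3]]
t_sel = select_row(t, 0, 0.25)   # first allele of marker one is selected against

lam_before, lam_after = odds_ratio(t), odds_ratio(t_sel)
d_before, d_after = d_measure(t), d_measure(t_sel)
print(lam_before, lam_after)   # unchanged by selection
print(d_before, d_after)       # changes with the marginals
```

The scaling factor cancels in the cross-product ratio, so λ stays at 6, whereas D drops from 0.1 to 0.064 because it depends on the marginals.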
Yule’s Q (Yule, 1900):
Since the common odds ratio λ is not standardised, this quantity has been defined as a function of λ which is bounded to [-1
…(Full text truncated)…