Using a priori knowledge to construct copulas
Our purpose is to model the dependence between two random variables, taking into account a priori knowledge on these variables. For example, in many applications (oceanography, finance...), there exists an order relation between the two variables; wh…
Authors: Dominique Drouet Mari, Valerie Monbet
Since the nineteen sixties and the pionnering works of Gumbel [11], Plackett [22], Mardia [18], the construction of bivariate distributions with fixed margins (i.e. the construction of copulas) has interested many researchers.
Various procedures to contruct copulas have been proposed. A fruitful method is to construct dependence by mixing with respect to a third random variable, called a frailty variable (see Clayton [3] and Oakes [20]). This method has been generalized with two or more frailty variables by Marshall and Olkin [19] and by Joe [12], but their works have not always been well comprised and often rediscovered. In 1995, Koudraji [15] developped a procedure to contruct asymmetric copulas without using a mixing variable, but as a product of two copulas. His work has been generalized by Liebscher [17] to multivariate copulas.
Here, our purpose is to show how to construct a copula by using a priori knowledge on the studied bivariate distribution. We propose a way to construct, step by step, from any basic model more complex models verifying the assumption of asymmetry, as well as the assumption of extremal dependance for minimum and/or maximum. These models with increasing complexity are obtained using mixing procedures. Each considered assumption adds a new parameter to the model and we can control how this parameter acts: it is not a blind method. This procedure is illustrated by using three known copula models which are fit by maximum likelihood.
Within the same family the models are nested, so they can be compared with likelihood ratio tests. For others comparisons, the best model for a given problem is selected using the Bayesian information criterion (BIC) [ [25]1978].
The proposed modelling method is illustrated on buoy data. In oceanometeorology, scientists are interested in modelling the statistical dependence between sea state parameters ( e.g. significative height and period of waves, surge, wind speed ...) because it is used in order to study the reliability and fatigue of structures. In this modelling method, it is important to take into account the extreme dependence between the different processes. Namely, the simultaneous occurence of extreme events can be the cause of great environemental or structural damage. Futhermore, the dependence between the different processes is rarely symmetric: there often exists an order relation between the variables. For example, one cannot observe very high waves with very short periods and, on the contrary, far from a storm in time or in space, waves have generally small height and long period. However, most models used in oceanography are symmetric and often multivariate Gaussian. [ [27]2003]. In 1995, Athanassoulis et al. proposed bivariate models based on Plackett's copulas.
In section two, we describe the construction of bivariate asymmetric distributions with or without extreme dependence and introduce three illustrative copulas that we propose to evaluate. In the third section, we recall how to simulate distributions from models with mixing variables and assymetric distributions. Such simulation tools are useful for any Monte Carlo approach, for instance. The inference and validation methods are detailed in the fourth section. Finally, in the last section, we described the metocean dataset and present the results of the evaluation of the models.
To construct any bivariate distribution, copulas allow for the separate modelling of the univariate margins and the dependence between the variables under weak assumptions. Copulas are a flexible tool for modelling any shape of dependence between two variables, the univariate distributions of these variables being characterized separately. In the same manner, the estimations of the parameters of the joint model can be made in two steps: the parameters of the univariate margins are estimated firstly and those of the copula secondly. Here, we show in particular how, using a priori knowledge on a bivariate distribution -for example, existence of an assymetry in the dependence of the variables, or existence of extreme dependence for the minimum or for the maximum-, we can transform a basic copula into a more complex copula verifying the a priori knowledge.
Let us first recall some definitions.
The copula summarizes the dependence between the two variables. Following the Sklar theorem ( [29]1959), to a cumulative distribution function H(x, y) with continuous margins F 1 (x) and F 2 (y), one associates copula C(u, v), defined by
It is easy to verify that the copula is, therefore, a cumulative distribution function (cdf) defined on the square unit with uniform margins. And that it summarizes the dependence between the two variables.
When the cdf H(x, y) is derivable and if (X, Y ) admits marginal densities f 1 (x) and f 2 (y) with respect to Lebesgue's measure and a joint probability density function h(x, y), then the theorem of Sklar can be rewritten as
where c(u, v) is the density of the copula. When C(u, v) = C(v, u), the copula is said to be non exchangeable, which is the situation of assymetry.
2. The function of dependence and the measure of extreme dependence Let (X i , Y i ), i = 1, ..., n, be a sample of a distribution H(x, y). To study extreme events, one considers the distribution of the couple
where
Constants a in and b in , i = 1, 2, are normalizing constants depending on the margins of X and Y . One defines then
and its associated copula C max (u, v). H(x, y) is said to belong to the domain of attraction of H max (x, y). If the distribution H max (x, y) is not the product of the margins, H(x, y) is said to be asymptotically dependent for the maximum. The copula C max (u, v) is such that
Following Pickands [21], the associated copula C max (u, v) can be written as
where A(.) is the dependence function verifying
The particular case where A(t) ≡ 1 corresponds to the independence case.
The extreme dependence can be quantified by
)
where C(u, v) is the survival function of the copula. If a copula C(., .) is in the attraction domain of a copula C A (., .), then they have the same value of λ (Joe, page 178). This quantity, when it is strictly positive, characterizes the asymptotic dependence. This measure, however, does not seem adequate for non exchangeable variables, where the maximal dependence is not along the first diagonal.
If we have to model the extreme dependence for the minimum of two variables instead of the maximum, the dual couple (1 -U, 1 -V ) can be considered in place of (U, V ). The survival function of the one is the cumulative distribution function of the other. The dual extreme measure is then: λ = lim u→0 C(u, u) u .
2.1. The mixture models and its generalizations 2.1.1. The frailty model To model the dependence between two random variables X and Y , with cumulative distribution functions (cdf) F 1 (x) and F 2 (y), a usual method [ [20]1989] is to suppose that the two variables are conditionally independent from a positive "frailty" variable Z with cdf G:
The two margins of H(., .) are H 1 (x) = F 1 (x) z dG(z) and H 2 (y) = F 2 (y) z dG(z). Calling ϕ -1 , the Laplace transform of G, the cdf H(., .) can be rewritten as:
The associated copula is then:
which is a particular case of an Archimedean copula with generator ϕ ( [8]).
The frailty models have been defined in the context of lifetime data analysis, from the two survival margins S 1 and S 2 . In the square unit, this is written as:
which is the dual copula of the previous copula. Now, two examples of Archimedean copulas, that we will use in the sequel, are introduced.
If Z has a Gamma distribution with Laplace transform
this results in Clayton's copula:
This copula owns extreme dependence on the minimum with
Written with survival functions, it leads to extreme dependence on the maximum. The dependence increases with α. When α tends to ∞, the copula tends to the upper maximal dependence copula. The case α = 0 corresponds to independence. Kendall's tau is equal to α α+2 .
(b) Gumbel's copula If Z has a positive stable distribution
which is the only Archimedean and extreme value copula with
Its lower tail dependence is zero. Its Kendall's tau is given by τ = 1 -α. The dependence decreases according to α.
A max-infinitely divisible (max-id) bivariate cdf F is such that any power of it, F γ , γ > 0, is still a cdf. Joe [ [12]1997] generalizes the mixture model with any max-id copula K(u, v) in place of the product of the marginals, and with ϕ -1 , the Laplace transform of a frailty variable Z. The obtained copula Ĉ(u, v) verifies:
Marshall and Olkin [19] have proposed another generalization of this method using two frailty variables Z 1 and Z 2 . Specifically, let G(., .) be a cdf such that Ḡ(0, 0) = 1 with margins G i (.), i = 1, 2. Then define new copula C(., .) by
where
They presented examples with frailty variables Z 1 and Z 2 such that
where W , U 1 , U 2 are independent random variables. Let ψ 1 , ψ 2 and ψ 0 , be the Laplace transforms of the three variables U 1 , U 2 and W . Then, the copula is written
Let us now present two examples where parameters α 0 , α 1 and α 2 are associated to the three Laplace transforms ψ i , i = 0, 1, 2:
(a) Clayton's family extension Let W , U 1 and U 2 be three variables with Gamma Laplace transforms with parameters α 0 , α 1 and α 2 and let
Using the reparametrization θ = α 1 α 0 +α 1 and δ = α 2 α 0 +α 2 , the copula can be rewritten as:
where C C (u, v) is the Clayton's copula.
(b) Gumbel's family extension Considering two frailty variables and Laplace transforms: ψ i (t) = exp(-θ i s α ) with 0 < α ≤ 1 and θ i ≥ 0 for i = 0, 1, 2, one obtains:
Using the reparametrization, θ 0 = 1, θ = 1 1+θ 1 and δ = 1 1+θ 2 , one obtains the same formal writing as in equation ( 6), and the obtained copula corresponds to Tawn's bivariate extreme value distribution (assymetric bilogistic distribution) [ [31]]
where C G (u, v) is Gumbel's copula.
The procedure of Marshall and Olkin ([19]) and its extension can be replaced in the framework of the assymetrization procedure proposed by Khoudraji (in his thesis [ [15]1995]). Without using frailty variables, he constructs an asymmetrized copula from two symmetric copulas.
One sees easily that, except for particular cases, C(u, v) = C(v, u).
A particularly interesting case is when C 1 (u, v) is the independence copula:
In this last case, the method weakens the dependence, especially if C 2 (u, v) is an extreme value copula for the maximum, then the new copula C(u, v) is still an extreme value copula, but the parameter of extreme dependence λ is smaller than λ [ [6]2004]. This phenomena occurs for instance in the case of Gumbel's copula.
Using Frchet's bounds, we have
The right hand side member of this inequality is the Cuadras-Aug copula (see [15]) whose Kendall τ is equal to
Consequently τ C is smaller than this last quantity. If the symmetric copula has no extreme maximal dependence, then the assymmetrized copula does. Moreover, even if C 2 (u, v) has a lower tail dependence, the lower tail dependence of C(u, v) is zero.
In a few examples, we can suppose that a common cause acts on the two variables U and V , but that one variable has its proper variability cause. In such cases, one can write Z 1 = U 1 and Z 2 = U 2 + W and one obtains the simpler model [ [6]2004]:
When δ = 1, we retrieve the basic copula. We are, therefore, able to deduce a test for the asymmetry.
Using equation ( 8) or (10), we can construct three assymetrized copulas on the basis of three basic copulas:
The Plackett copula is used, for instance, in oceanometeorology to model the dependence between the couples of wave heights and wave periods [2]. This copula is not obtained from a frailty model and has neither dependence for the maximum nor for the minimum.
It is introduced here for comparison.
If we apply the former procedures to the survival Clayton's copula
As Clayton's model is constructed on survival copula, the assymetry is applied to the variable (1 -u), and not to v. The resulting copula shows no extreme dependence for the maximum [ [6]2004].
The third copula is Gumbel's copula, which allows us to test for extremal dependence for the maximum. See earlier (7) or
The three models under consideration do not include any extreme dependence for the minimum. Using Joe's mixture method, we can construct new models with minimal dependence.
Introducing Z, a mixture variable with Gamma Laplace transform ϕ -1 whose parameter is β and from C(u, v), the asymmetrized copula we write
If parameter β = 0, we retrieve the model without minimum dependence. We can then deduce a test for this dependence. In the other cases, this copula has a lower tail dependence greater than the lower tail dependence of the Clayton copula (see Appendix).
For the Gumbel copula, we obtain:
Its lower tail dependence is given by λ
In case of any bivariate extreme value copula, C A (u, v) = exp(log(uv)A( log u log uv )), we obtain:
and then:
)}
When a copula, such as Clayton's copula is assymetrized from its survival function, we use u) , e -ϕ(v) )}
The articifial random generation of samples following the proposed distributions may be useful for Monte Carlo testing. Many simulation methods have been developped in many particular cases: Archimedean copulas (see for example the papers of Genest and Mackay [8], Genest and Rivest [9]), extreme value distributions (Ghoudi et al. [10], Shi [28], Stephenson [30]), mixtures of distributions (Marshall and Olkin [19]). In other cases, a general procedure can be used.
Different simulation methods will now be detailed.
Here, we work with a frailty variable following a Gamma distribution with parameter α, but another distribution could be used. This is the case of Clayton's copula (case 1), and when we add an extreme dependence for the minimum to Gumbel's assymetrized distribution (section 2.2, case 2). We use in that case the procedure of Marshall and Olkin [19] for mixture distributions, as follows.
(a) Simulate a random sample of couples (U i , V i ) of independent variables or according to the distribution C. (b) Generate a random sample of the mixture Gamma variable Z i with parameter α.
α . S i and T i have Clayton's distribution or the Ĉ distribution.
It is possible to generate a sample according to Gumbel's copula using a frailty variable with positive stable distribution Ps(α) with parameter α. See for example A. Stephenson [30] for generation Ps(α) distributions.
Instead of that, we have chosen here a procedure derived from Lee [16] for generating logistic extreme value distributions. The procedure uses the fact that T = ϕ(U ) ϕ(U )+ϕ(V ) , with ϕ(U ) = (-log(U )) α , has a uniform distribution and T and Z = C(U, V ) are independent. Furthermore Z 1 = (ϕ(Z)) α is distributed as a mixture of two Gamma variables. This method can be generalized to more than two variables (trilogistic distributions,...) [30] and also to extreme value distributions different from logistic distributions ([10]) using the two variables T = log(U ) log(U )+log(V ) and Z = C(U, V ) which are not independent but whose joint distribution is a function of the dependence function A(.) (see section 2) and of its second order derivative A ′′ (.).
(a) Simulate a mixture of Gamma variables Γ i with parameter (1, 1) and (2, 1). The Γ(2, 1) is generated in the proportion α. And let Z i = Γ α i . (b) Simulate a random sample of uniform variables W i and construct the products
). U i and V i have the Gumbel distribution 3. Assymmetrization from the cdf C 1 and C 2 with exponent θ and δ.
This procedure is described in Khoudraji [15]. The idea is that if (U 1 , V 1 ) and (U 2 , V 2 )
have respectively cdfs C 1 and C 2 , then max(U
2 ) and max(V
The simulation of couples (U i , V i ) from Ĉ(u, v) (eq. 11) is obtained by a more general method.
(a) Simulate a random sample of couples (U i , T i ) of independent uniform variables. (b) Let Ĉ2|1 (t|u) = ∂C(u,t) ∂u , the conditional distribution of Ĉ(u, v). Make the transformation V i = Ĉ-1 2|1 (T i ). Then (U i , V i ) are sampled from Ĉ(u, v). When no simple analytical expression is available for Ĉ-1 2|1 , then a numerical solution of v i = Ĉ2|1 (u i , t i ) is looked for. This is the case when we use equation ( 16) from Clayton's and Plackett's assymetrized models.
In figure 1, we present three examples of generated datasets from Clayton's survival copula, with one, two or three parameters. Parameter α is the same for the three datasets. Assymetry parameter δ (the other parameter θ is fixed to one) is the same for the second and the third datasets. We can then see how each parameter acts on the dependence. The dataset with one parameter is sampled from a distribution with Kendall's tau equal to τ = 0.50. The dataset generated with two parameters δ and α has its Kendall tau bounded by the assymetry parameter δ (see eq.9). Here, it is equal to 0.44. When we add parameter β corresponding to minimal dependence, the dependence on the third dataset becomes larger than in the second case and τ becomes equal to 0.50.
The estimation of parameters is done in two steps: at first estimating the margins by a moment method, then estimating the parameters of the copula by ML. This method, called IFM (Inference For Margins) was developped by Joe [13] and has mostly good properties (consistancy, asymptotic normality for the parameters). Let θ 1 and θ 2 , be the parameters of the margins and η, the parameters of the copula, then the loglikelihood L of the sample (X i , Y i ), i = 1, ...n can be written using equation (1) as:
Generally, there is no closed-form solution to the problem of maximisation of L, namely all the parameters are linked by the copula, but the loglikelihood being the sum of the two terms: the loglikelihood of the margins, and the loglikelihood of the copula, the maximisation may be split in two sub problems. In practice, the estimation of the margin parameters is firstly performed and the obtained estimations θ1 and θ2 are subtituted in the last term of the loglikelihood for the estimation of parameters η.
Moreover, to avoid misspecifications for the margins and the propagation of this error to the copula, we use a semi-parametric approach to model the margins, already used by Coles and Tawn [[4] 1994]. The idea consists in modelling the distribution of the data over a well chosen high threshold by a generalized Pareto distribution (see Davison and Smith [[5] 1990]) and under the threshold by the empirical distribution. The generalized Pareto distribution is given by
where x 0 is the chosen high threshold and u 0 the corresponding percentile. The main advantage of this model is that it provides a more general and more realistic copula (the maxima of which being not one) than a fully non parametric model.
At the second step, the parameters of the copula are estimated by Maximum Likelihood. All the ML estimations are obtained by using numerical optimization. The BIC-criterion allows us to compare all the models. Futhermore, inside the same family, the models are nested so if we compare the one parameter and the two parameter model, we can test the hypothesis that the assymetry is present (i.e. that the assymetry parameter is equal to one). We can also test if the minimal dependence is present, testing the β parameter at zero.
Finally, from the estimation of α parameter of Clayton's or Gumbel's one parameter models, we can deduce an estimation of the upper tail dependence using formulas ( 3) and ( 5) given in section 2.1.1 and an estimation of its variance by the delta-method. In the case of Clayton's copula, the upper tail dependence is estimated as 2 -1 α and its variance is given by ( 1 α2 ln(2)2 -1 α ) 2 var(α).
In order to study the reliability and/or the fatigue of marine structures, engineers need to know the joint distribution of sea state and atmospheric parameters. The sea state represents the state of the marine environnement at a given location and time. It is described by synthetic parameters like the significant wave height denoted H s and the mean wave period denoted T p . It is also usual to consider the wind speed W s . For reliability, it is determinant to well model the extremal dependence. For fatigue, the distribution close to the mode is generally of greater importance [24].
In order to model the joint distributions, we have selected three candidate models.
• Gumbel's model naturally characterizes a maximal extreme dependence.
• Plackett's model was already used to characterize the dependence between significant wave height and mean period of a sea state [2]; there is no extreme dependence in this model.
• Clayton's model plays the same role as Pareto distribution in univariate case: indeed it is a limit conditional model in the family of Archimedean copula [ [14]2002].
The models have one to four parameters: the basic models have one global dependence parameter, the asymmetrization adds one or two parameters and a last parameter corresponds to the minimal dependence obtained by introducing a mixture Gamma variable.
In this paper, we do not consider any distribution for more than two variables for two main reasons: engineers mostly use only univariate or bivariate distributions. However the presented theoretical results could be generalized to trivariate models, after some calculus.
In this paper, we consider data of the K1 buoy which is located in the North Atlantic close to the French coast, at the geographic coordinates (48.00N,12.40W). Five years of hourly data are recorded for the three variables: H s , T p , W s from 2002 to 2007. The T p recording process leads to integer values for this variable. In order to allow for better estimation of the parameters of the generalised Pareto distribution, a uniform noise defined on [-/2, +1/2] has been added to the obeserved T p . The transformed T p is also used in the sequel.
The margins are modeled as discussed in section 4. The thresholds for the semiparametric transformation chosen empirically so that the Pareto Generalized Distribution good fits the data. In practice, we use the 90% quantile for H s and T p and the 96% quantile for W s . The parameters estimated by a moment method are reported in table 1. This estimator has been chosen rather than others because of its robustness in this application. The scale parameters are difficult to interpret. The shape parameters of H s and W s are as low as expected for these variables. Those of T p is positive and it represents an extreme distribution of Weibull type.
Figure 2 illutrates the fitting of the GPDs on the empirical cumulative distribution functions. The agreement is good.
The joint distributions and the copulas of pairs (H s , W s ) and (H s , T p ) are represented in figure 3. One observes that dependence of the variable H s with W s and T p is quite strong. Both copulas seem to present extremal dependence and the (H s , T p ) copula clearly shows asymmetry. The Kendall tau have been estimated and they are respectively equal to 0.47 and 0.52 for (H s , W s ) and (H s , T p ) which can be considered as a strong dependence.
The parameters of the 3 × 3 models are estimated as described in section 4 and the results are reported in tables 2 and 3. The standard errors are calculated from the Hessian matrix of the log-likelihood. And as it was already remarked, the models are nested so that, when a model is degenerated, the variances of the former model are reported. In the tables, we also report the value of the log-likelihood at the estimated parameters and the BIC. The log-likelihoods allow for the comparison of the nested models by log-likelihood ratio tests.
For the (H s , W s ) couples, the Clayton model with 4 parameters has the smallest Bayesian Information Criterion (BIC). The estimated values of the parameters of this model give back the low asymmetry observed in the plotted copulas (Fig. 3). Plackett's and Gumbel's models do not show the same behaviour. In these models only the global dependence parameter α is significant and equal to 6.76 for Plackett and 0.57 for Gumbel which is quite strong. Furthermore, the standard deviation of the estimators of the parameters are smaller for the Clayton model than for the other ones. Thus this model is shown here to be more flexible and lead to more robust estimators. The one-parameter Clayton model allows us to estimate the upper tail dependence at 0.57 with a standard deviation equal to 4.34e -3 and the Gumbel model at 0.52 with a standard deviation equal to 7.33e -4 . These values are sufficiently large to conclude that the upper tail dependence is present. The lower tail dependence is also present. Namely, we can test it with a likelihood ratio test, comparing the four-parameter Clayton model to the three-parameter Clayton model: here -2(log L 3 -log L 4 ) = 228 and which is significant for a χ 2 statistic with one degree of freedom.
For the (H s , T p ) couples where the asymmetry is stronger, we choose the models defined by equation (10) or equation( 14) for reasons explained in section 1. The introduction of the asymmetry parameter δ (or θ ) allows for better fitting in all the models. Gumbel's and Clayton's models have a minimal dependence parameter greater than one which characterizes a low minimal dependence in the data. It corresponds to sea states with low significant wave heights and short periods. As previously, the Clayton model with three parameters has the smaller BIC. It has been observed that the optimisation procedure used to fit the three-parameter Plackett model is very sensitive to the initialization. It is due to instability of the gradient close to the frontier of β and also to the fact that the likelihood is flat for high levels of α. The upper tail dependence is estimated at .62 (0 .028 ) with the Clayton model and at .56 (6 .73 .e -3 ) with Gumbel's model. Finally, the comparison of the two parameter-Gumbel model with the three-parameter Gumbel model gives a very significant likelihood ratio test -2(log L 2 -log L 3 ) = 154. Using the formula given in the 4th section of the appendix, we can estimate this lower tail dependence and it is equal to λ = 0.63.
In this paper, we have proposed a procedure to construct a distribution model taking into account a priori knowledge on the data. The main idea consists in transforming a basic copula such as the Plackett one to better restore some features of the data distribution. Special attention is paid to asymmetry and extreme dependance. The transformed copulas have one to four parameters which are estimated by maximum likelihood.
The proposed procedure has been applied to sea state data. Two couples are considered. The first one (H s , W s ) presents maximal extreme dependence while the second one (H s , T p ) presents a clear asymmetry. Three basic copulas are studied: Plackett, Gumbel and Clayton. It is shown that the transformation of basic copulas may improve the fitting of the distribution models, especially in the case of the Clayton model. However, we also observe that for the Plackett copula, which is more flexible than the Clayton one, the introduction of new parameters is not really useful.
It is always difficult to have evidence of upper or lower dependence. Our method, by introducing a specific parameter devoted to the extreme dependence, allows us to test it and sometimes to estimate it from this parameter. The obtained estimator inherits the good properties of the ML-estimation (consistency and asymtotic normality). Alternatively, it would be possible to use non parametric estimation of these indexes. But such estimation is often uncertain, linked to the visual appearance of the data, and to the choice of a threshold (see Frahm et al. for more details [7]).
The procedure described here could also be adapted to other assumptions on the copula, for example local dependence located outside the diagonal.
Furthermore, we could have used a different approach than Joe's generalization to obtain a model with minimal dependence. Marsall and Olkin's procedure Ĉ(u, v) = K(e -zϕ(u) , e -zϕ(v) dG(z) could also lead to such model. Finally, to model the assymetry, an alternative method could have consisted in defining an expression for the boundary of the dataset and stipulating that the dependence is maximal in the vicinity of this boundary using a procedure as explained by Rschendorf [23], starting from a function κ(t) = t β , 0 < t <= 1 and modelling the boundary of the dataset. 4. In some cases, we can evaluate the lower tail dependence λ Ĉ of Ĉ(u, v).
(a) C(u, v) has lower tail dependence λC
In that case, C(u, u) is equivalent to λC u when u tends to 0.
λ Ĉ = lim u→0 ϕ -1 (-log C(e -ϕ(u) , e -ϕ(u) )) u
Let t = e -ϕ(u) . When u tends to 0, t also tends to 0. And λ Ĉ = lim
k + , x ≥ x 0
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment