The generalized Pareto distribution (GPD) is a fundamental model for analyzing the tail behavior of a distribution. In particular, the shape parameter of the GPD characterizes the extremal properties of the distribution. In this paper, we propose a method for grouping the shape parameters of the GPD for clustered data via the graph fused lasso. The proposed method simultaneously estimates the model parameters and identifies which clusters can be grouped together. We establish the asymptotic theory of the proposed estimator and demonstrate that its variance is lower than that of the cluster-wise estimator. This variance reduction not only enhances estimation stability but also provides a principled basis for identifying homogeneity and heterogeneity among clusters in terms of their tail behavior. We assess the performance of the proposed estimator through Monte Carlo simulations. As an illustrative example, we apply our method to rainfall data from 996 clustered sites across Japan.
In many fields of risk assessment, interest lies in predicting the tail behavior of a distribution rather than its mean or central characteristics. However, tail prediction is difficult and unstable because data are inherently sparse in extreme regions. To construct efficient estimators and prediction models for tail behavior, we model the tail of the distribution directly rather than the entire distribution. Extreme value theory (EVT) provides a theoretical foundation and statistical tools for modeling tail behavior. As a commonly used EVT model, the generalized Pareto distribution (GPD) is fitted to data exceeding a specified threshold and serves as a model for extreme values. The GPD is characterized by two parameters: a shape parameter and a scale parameter. The shape parameter, also known as the extreme value index, determines the heaviness of the tail and the domain of attraction, whereas the scale parameter describes the magnitude of threshold exceedances (de Haan and Ferreira 2006). The shape and scale parameters are typically estimated by the maximum likelihood method (e.g., Coles 2001). The fundamental theory of the maximum likelihood estimator was presented by Smith (1987) and Drees et al. (2004).
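To make the model concrete, the following is a minimal Python sketch of the GPD distribution function and the negative log-likelihood for threshold exceedances, in the standard parameterization with shape xi and scale sigma. The function names (`gpd_cdf`, `gpd_neg_log_lik`, `exceedances_over`) and the numerical cutoff used for the xi = 0 (exponential) case are illustrative assumptions, not taken from the paper.

```python
import math

def gpd_cdf(y, xi, sigma):
    """GPD distribution function H(y) = 1 - (1 + xi*y/sigma)^(-1/xi), y >= 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if y <= 0:
        return 0.0
    if abs(xi) < 1e-12:              # xi -> 0 limit: exponential distribution
        return 1.0 - math.exp(-y / sigma)
    z = 1.0 + xi * y / sigma
    if z <= 0:                       # beyond the upper endpoint (only when xi < 0)
        return 1.0
    return 1.0 - z ** (-1.0 / xi)

def gpd_neg_log_lik(xi, sigma, exceedances):
    """Negative log-likelihood of the GPD for exceedances y_i = x_i - u > 0."""
    if sigma <= 0:
        return math.inf
    total = 0.0
    for y in exceedances:
        if abs(xi) < 1e-12:          # exponential case
            total += math.log(sigma) + y / sigma
            continue
        z = 1.0 + xi * y / sigma
        if z <= 0:                   # observation outside the support
            return math.inf
        total += math.log(sigma) + (1.0 + 1.0 / xi) * math.log(z)
    return total

def exceedances_over(data, threshold):
    """Keep only observations above the threshold and subtract the threshold."""
    return [x - threshold for x in data if x > threshold]
```

The maximum likelihood estimate for one cluster would be obtained by minimizing `gpd_neg_log_lik` over (xi, sigma) with a numerical optimizer; the sketch deliberately stops at evaluating the likelihood.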
A typical example of a statistical problem using extreme value analysis is risk modeling of climate data (e.g., Bousquet and Bernardara 2021). Climate data are generally available as clustered data because they are observed daily at multiple observation sites. The quantities of interest are then the marginal distribution of each cluster and the dependence between clusters. In this paper, we focus on estimating the marginal distribution of each cluster. The classical approach is cluster-wise estimation of the GPD parameters. In climate data, however, nearby clusters (sites) can be expected to have similar features. This motivates us to construct more accurate estimators of the GPD parameters by incorporating information shared between clusters. Indeed, Hosking and Wallis (1997) introduced regional frequency analysis, which pools data observed at several clusters where the statistical behavior of the data is assumed to be similar. Casson and Coles (1999) also studied pooling information across clusters in spatial extreme value models. Einmahl et al. (2020) developed a novel extreme value theory for pooling clusters (and time dependence). With these approaches, however, it is difficult to choose which clusters to group when the number of sites is very large. Dupuis et al. (2023) and Rohrbeck and Tawn (2023) have also proposed methods for clustering in the context of extreme value analysis. Dupuis et al. (2023) specifically examined block maxima using the generalized extreme value distribution and proposed a heuristic grouping method based on marginal similarity, selecting the best model by BIC. Their method might present scalability issues when the number of clusters is large because the number of candidate models grows exponentially.
Rohrbeck and Tawn (2023) considered clustering sites with extreme value inference. Their Bayesian approach jointly estimates marginal effect, dependence structure, and cluster assignments, incorporating similarity in shape and scale parameters as well as extreme dependence. Details of other grouping methods in extreme value modeling are presented by Rohrbeck and Tawn (2023). Consequently, extreme value modeling with cluster grouping constitutes a useful framework for structurally characterizing tail behavior across clusters.
In this paper, we provide a novel method that simultaneously constructs accurate estimators of the GPD parameters in each cluster and groups the shape parameters of some clusters. We aim to group the shape parameter across clusters because it directly affects the tail behavior and the domain of attraction. Grouping of scale parameters, however, should be treated carefully because the behavior of the scale parameter is closely related to both the threshold and the shape parameter (de Haan and Ferreira 2006, Theorem 1.2.5). Consequently, when the scale parameter is grouped, we should consider grouping the threshold as well as the shape parameter (see Section 3.2). This is difficult because the determination of a threshold is beyond the framework of parameter estimation in the GPD. Therefore, we group only the shape parameters, not the scale parameters. As a result, the scale parameter carries the cluster-specific information of the extreme value model. Our proposed method is a penalized maximum likelihood method. The likelihood part consists of the GPD likelihood for each cluster. For the penalty part, we use the fused lasso penalty on the shape parameters across clusters (Tibshirani et al. 2005).
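The structure of such a penalized objective can be sketched in Python as follows. This is only an illustration under stated assumptions: the unweighted absolute-difference penalty over a user-supplied edge list, the single tuning parameter `lam`, and all function names are hypothetical choices made here, since the exact objective (weights, normalization, and choice of graph) is not specified in this section.

```python
import math

def gpd_nll(xi, sigma, ys):
    """Negative GPD log-likelihood for the exceedances ys of one cluster."""
    if sigma <= 0:
        return math.inf
    total = 0.0
    for y in ys:
        if abs(xi) < 1e-12:          # exponential limit as xi -> 0
            total += math.log(sigma) + y / sigma
            continue
        z = 1.0 + xi * y / sigma
        if z <= 0:                   # observation outside the support
            return math.inf
        total += math.log(sigma) + (1.0 + 1.0 / xi) * math.log(z)
    return total

def fused_penalty(xis, edges, lam):
    """Assumed penalty: lam * sum of |xi_j - xi_k| over graph edges (j, k)."""
    return lam * sum(abs(xis[j] - xis[k]) for j, k in edges)

def penalized_objective(xis, sigmas, ys_by_cluster, edges, lam):
    """Sum of cluster-wise GPD negative log-likelihoods plus the fused penalty.

    Only the shape parameters xis are penalized; each cluster keeps its own
    unpenalized scale parameter.
    """
    nll = sum(gpd_nll(xi, s, ys)
              for xi, s, ys in zip(xis, sigmas, ys_by_cluster))
    return nll + fused_penalty(xis, edges, lam)

# Hypothetical toy data: three clusters connected in a chain 0 - 1 - 2.
ys_by_cluster = [[0.5, 1.2, 3.0], [0.4, 2.1], [0.9, 0.3, 1.5]]
edges = [(0, 1), (1, 2)]
value = penalized_objective([0.2, 0.2, 0.3], [1.0, 1.1, 0.9],
                            ys_by_cluster, edges, lam=5.0)
```

The sketch only evaluates the objective and does not minimize it. Because the penalty is non-smooth, clusters whose shape parameters are joined by an edge can be estimated as exactly equal when `lam` is large enough, which is what produces the grouping.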