Generation of Multivariate Discrete Data with Generalized Poisson, Negative Binomial and Binomial Marginal Distributions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The analysis of multivariate discrete data is crucial in various scientific research areas, such as epidemiology, the social sciences, genomics, and environmental studies. As the availability of such data increases, developing robust analytical and data generation tools is necessary to understand the relationships among variables. This paper builds upon previous work on data generation frameworks for multivariate ordinal data with a prespecified correlation matrix. The proposed algorithm generates multivariate discrete data from marginal distributions that follow the generalized Poisson, negative binomial, and binomial distributions. A step-by-step algorithm is provided, and its performance is illustrated in four simulated data scenarios and three real-data scenarios. This technique has the potential to be applied in a wide range of settings involving the generation of correlated discrete data.


💡 Research Summary

The paper introduces a versatile algorithm for generating multivariate discrete data whose marginal distributions follow the Generalized Poisson, Negative Binomial, or Binomial families. Building on Demirtas (2006) for ordinal data, the authors treat each possible count as a categorical level, collapse it into a binary variable at the marginal median, and then employ the Emrich and Piedmonte (1991) method to generate a high‑dimensional binary dataset with a target correlation structure. An iterative procedure adjusts the binary‑level correlations (δ_bij) so that, after back‑transformation to the original count scale, the resulting Pearson correlations (δ*_ij) match the user‑specified values within a small tolerance. The algorithm also computes the feasible correlation range for each pair from the Fréchet‑Hoeffding limits, ensures the intermediate correlation matrix is positive definite via Higham's nearPD algorithm, and finally rescales the binary outcomes to the original marginal probabilities.
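The Fréchet‑Hoeffding feasibility check can be sketched as follows: the largest attainable Pearson correlation for a given pair of marginals comes from the comonotone coupling (F⁻¹(U), G⁻¹(U)) and the smallest from the antimonotone coupling (F⁻¹(U), G⁻¹(1−U)). This is a generic Python illustration using ordinary Poisson marginals and a grid approximation for U, not the authors' R implementation; all function names are hypothetical.

```python
import math

def poisson_quantile(u, lam):
    """Inverse CDF of a Poisson(lam) variable via sequential search."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k

def frechet_hoeffding_corr_bounds(lam1, lam2, n_grid=100_000):
    """Approximate the attainable Pearson-correlation range for two
    Poisson marginals from the comonotone (upper) and antimonotone
    (lower) couplings implied by the Frechet-Hoeffding bounds."""
    us = [(i + 0.5) / n_grid for i in range(n_grid)]
    x = [poisson_quantile(u, lam1) for u in us]
    y_hi = [poisson_quantile(u, lam2) for u in us]      # comonotone coupling
    y_lo = [poisson_quantile(1 - u, lam2) for u in us]  # antimonotone coupling

    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
        va = sum((ai - ma) ** 2 for ai in a) / n
        vb = sum((bi - mb) ** 2 for bi in b) / n
        return cov / math.sqrt(va * vb)

    return corr(x, y_lo), corr(x, y_hi)
```

A target correlation outside the interval returned here is infeasible for those marginals, which is why such a check must precede the iterative adjustment of the intermediate binary correlations.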

Four simulation studies assess performance. In a five‑dimensional Generalized Poisson scenario (varying rate θ and dispersion λ), both small (N=200) and large (N=2000) samples yield average estimates virtually identical to the true parameters, with relative bias below 5%, standardized bias below 50%, and coverage rates above 90%. Similar accuracy is observed for a five‑dimensional Negative Binomial case (differing r and p) with an exchangeable correlation of 0.5, and for Binomial and mixed‑distribution settings. The results demonstrate that the method reliably reproduces marginal moments and inter‑variable correlations across a range of dispersion levels and correlation strengths.
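The evaluation criteria above follow standard definitions: relative bias compares the mean estimate to the true value, standardized bias scales the same difference by the empirical standard deviation of the estimates, and coverage counts how often a confidence interval contains the truth. A generic sketch of those definitions (not code from the paper; `simulation_metrics` is a hypothetical name):

```python
import math

def simulation_metrics(estimates, std_errors, true_value, z=1.96):
    """Summarize replicated estimates against the true parameter:
    relative bias (%), standardized bias (%), and CI coverage rate."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    # empirical SD of the estimates across replications
    emp_sd = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (n - 1))
    rel_bias = 100.0 * (mean_est - true_value) / true_value
    std_bias = 100.0 * (mean_est - true_value) / emp_sd
    # proportion of Wald-type intervals covering the true value
    coverage = sum(
        1 for e, se in zip(estimates, std_errors)
        if e - z * se <= true_value <= e + z * se
    ) / n
    return rel_bias, std_bias, coverage
```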

An R package implementation (function genMultDiscrete) is provided, allowing users to specify marginal parameters and a target correlation matrix. The package handles truncation of unbounded supports, automatic feasibility checks, and offers parallel processing options for large‑scale simulations.
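Truncating an unbounded support typically means cutting the distribution at the smallest count whose cumulative mass reaches 1 − tol. A minimal Python sketch for a negative binomial marginal, using the standard pmf recurrence; the function name and truncation rule are illustrative assumptions, since the summary does not specify how the package implements this step:

```python
def nb_truncation_point(r, p, tol=1e-6):
    """Smallest k such that P(X <= k) >= 1 - tol for a negative binomial
    variable counting failures before the r-th success,
    with pmf(k) = C(k + r - 1, k) * p**r * (1 - p)**k."""
    k = 0
    pmf = p ** r           # pmf at k = 0
    cdf = pmf
    while cdf < 1 - tol:
        # recurrence: pmf(k + 1) = pmf(k) * (k + r) / (k + 1) * (1 - p)
        pmf *= (k + r) / (k + 1) * (1 - p)
        k += 1
        cdf += pmf
    return k
```

Tightening the tolerance pushes the truncation point further into the tail, so the choice of tol trades fidelity of the upper tail against the number of categorical levels the algorithm must handle.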

Three real‑world applications illustrate practical utility. First, seizure counts from a longitudinal epilepsy trial are simulated, preserving the temporal correlation pattern observed in the original data. Second, a study linking soil helminth infection status to microbial community composition uses a mixture of binomial and count data to replicate observed associations. Third, RNA‑seq differential expression data are simulated, combining negative‑binomial gene counts with binary clinical covariates, enabling downstream method benchmarking.

The authors acknowledge limitations: the median‑based binary collapse may be suboptimal for highly skewed marginals; convergence can be problematic for extreme correlations (absolute value above 0.8) or severe under‑dispersion (λ ≈ −1); and computational cost grows quadratically with the number of variables, potentially limiting high‑dimensional applications. Future work is suggested on optimal threshold selection, faster high‑dimensional approximations, integration with Bayesian hierarchical models, and extension to rank‑based (Spearman/Kendall) correlation structures.

Overall, the paper delivers a robust, flexible framework for generating correlated discrete data with diverse marginal distributions, filling a notable gap in simulation methodology and offering a valuable tool for methodological research across epidemiology, genomics, social sciences, and beyond.

