Modeling Multivariate Missingness with Tree Graphs and Conjugate Odds


📝 Original Info

  • Title: Modeling Multivariate Missingness with Tree Graphs and Conjugate Odds
  • ArXiv ID: 2602.16992
  • Date: 2026-02-19
  • Authors: No author information was provided for this paper (verify against the original if possible).

📝 Abstract

In this paper, we analyze a specific class of missing not at random (MNAR) assumptions called tree graphs, extending upon the work of pattern graphs. We build off previous work by introducing the idea of a conjugate odds family in which certain parametric models on the selection odds can preserve the data distribution family across all missing data patterns. Under a conjugate odds family and a tree graph assumption, we are able to model the full data distribution elegantly in the sense that for the observed data, we obtain a model that is conjugate from the complete-data, and for the missing entries, we create a simple imputation model. In addition, we investigate the problem of graph selection, sensitivity analysis, and statistical inference. Using both simulations and real data, we illustrate the applicability of our method.

💡 Deep Analysis

📄 Full Content

Missing data are pervasive across healthcare, social sciences, economics, and machine learning. They arise from survey nonresponse, equipment failure, privacy concerns, and other sources, and the manner in which data are missing strongly influences the validity of statistical analyses. When ignored, missingness can bias results and reduce statistical power, especially in large-scale studies where incomplete records are common (R. J. A. Little & Rubin, 2002).

Rubin’s framework classifies missingness into three categories (R. J. Little & Rubin, 1989): missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Standard approaches are effective under MCAR or MAR, but MNAR poses a fundamentally harder problem: the probability of missingness depends on unobserved values, rendering the distribution unidentifiable without further assumptions.
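To make the three mechanisms concrete, here is a minimal simulation sketch in Python with NumPy. It is not taken from the paper: the two-variable data-generating model, the logistic missingness probabilities, and the helper names `simulate`, `complete_case_mean`, and `regression_adjusted_mean` are illustrative assumptions. The sketch compares a complete-case estimate of a mean with a simple regression adjustment that is valid under MAR.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate(n=20000, seed=0):
    """Toy data (assumed model): x1 is always observed, x2 may be missing.

    The true mean of x2 is 0.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(size=n)
    # Response indicators: True means x2 is observed.
    observed = {
        "MCAR": rng.random(n) < 0.5,           # ignores the data entirely
        "MAR": rng.random(n) < sigmoid(-x1),   # depends only on the observed x1
        "MNAR": rng.random(n) < sigmoid(-x2),  # depends on x2 itself
    }
    return x1, x2, observed


def complete_case_mean(x2, r):
    """Average x2 over the rows where it is observed."""
    return x2[r].mean()


def regression_adjusted_mean(x1, x2, r):
    """Fit x2 ~ x1 on observed rows, then average predictions over all rows."""
    slope, intercept = np.polyfit(x1[r], x2[r], deg=1)
    return (intercept + slope * x1).mean()


if __name__ == "__main__":
    x1, x2, observed = simulate()
    for name, r in observed.items():
        cc = complete_case_mean(x2, r)
        adj = regression_adjusted_mean(x1, x2, r)
        print(f"{name}: complete-case={cc:.3f}, regression-adjusted={adj:.3f}")
```

In this toy setup, the complete-case mean should land near 0 only under MCAR. Adjusting for the observed `x1` should repair the MAR case, but it leaves a clear bias under MNAR, because the missingness depends on the very values that are unobserved.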

The challenge is particularly acute in multivariate and nonmonotone settings, where missingness occurs irregularly across variables.

Most practical methods rely on imputation, such as multiple imputation by chained equations (mice; van Buuren & Groothuis-Oudshoorn, 2011) or MissForest (Stekhoven & Bühlmann, 2011). These methods are valued for their flexibility, but they implicitly assume MAR or rely on potentially incompatible conditional distributions. Moreover, methods such as MissForest are single-imputation methods, which can lead to inconsistent estimators depending on the parameter of interest. These limitations make them vulnerable to bias or incoherence under MNAR. Direct modeling of imputation distributions is also difficult because of high dimensionality and interdependence among variables, motivating the search for methods that are both interpretable and theoretically principled.

Two classical approaches to MNAR are selection models (Diggle & Kenward, 1994) and pattern-mixture models (R. J. Little, 1993), which respectively specify missingness probabilities or stratify by missingness patterns. While widely used, both require untestable assumptions for identifiability. More recent strategies include “no self-censoring” assumptions (Shpitser, 2016; Sadinle & Reiter, 2017), auxiliary variables (Miao & Tchetgen Tchetgen, 2016), and CCMV-type restrictions (Tchetgen Tchetgen et al., 2018).

Graphical frameworks, such as missing data DAGs (Mohan et al., 2013) and pattern graphs (Chen, 2022), provide powerful representations of missingness assumptions, though their generality can make model selection challenging.

This paper builds on these advances by focusing on a structured and tractable subclass of pattern graphs, which we term tree graphs. Tree graphs simplify model specification, connect naturally to existing MNAR assumptions, and form the basis for scalable imputation strategies. To complement this structure, we introduce the conjugate odds property, which provides a flexible parametric tool for modeling conditional distributions.

Together, tree graphs and conjugate odds yield a unified framework that ensures nonparametric identification, facilitates inference, and enables practical sensitivity analysis.

Outline. In Section 2, we study tree graphs, a special case of pattern graphs with convenient properties, and derive the related theory. In Section 3, we introduce the idea of conjugate odds, which is also useful in domain adaptation. In Section 4, we study how conjugate odds can be used to handle missing data with tree graphs, which yields an imputation model and a model for the observed data simultaneously. In Section 5, we introduce three approaches for selecting a tree graph: prior knowledge, partial ordering, and data-driven approaches. In Section 6, we apply tree graphs and conjugate odds to an Alzheimer’s disease dataset. In the appendices, we investigate tree graph performance via simulation studies (Appendix B) and study the problems of statistical inference (Appendix E) and sensitivity analysis (Appendix F).

We use a capital boldface variable to denote a vector-valued random variable. In this paper, we consider a general problem setup, where $X = (X_1, X_2, \ldots, X_d) \in \mathbb{R}^d$ is a random vector of variables. Each of the $d$ variables can possibly be missing, for a total of up to $2^d$ missing patterns. Let $R \in \mathcal{R} \subseteq \{0, 1\}^d$ be the random binary vector that describes the missing pattern associated with $X$.

We write $R_j = 0$ if and only if variable $X_j$ is missing. For a fixed pattern $r$, let $X_r = (X_j : r_j = 1)$ denote the observed random variables and $X_{\bar r} = (X_j : r_j = 0)$ denote the missing random variables. When we write “For $j$ in $r$,” this refers to the indices at which $r$ equals 1. For example, suppose $X = (X_1, X_2, X_3, X_4)$ and $r = 1001$. We have $X_r = (X_1, X_4)$ and $X_{\bar r} = (X_2, X_3)$. Then, the statement “For $j$ in $r$” corresponds to “For $j$ in $\{1, 4\}$.” We assume that the complete data are generated by sampling i.i.d. from the joint distribution $p(x, r)$, and the resulting associated pattern $r$ generates the observed data. In this paper, we will use the
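The pattern notation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_by_pattern` and `all_patterns` are hypothetical, patterns are represented as strings of 0s and 1s, and indices are 1-based to match the text.

```python
from itertools import product


def split_by_pattern(x, r):
    """Split the entries of x into observed and missing parts for pattern r.

    r is a string such as "1001"; a 1 marks an observed variable and
    a 0 marks a missing one. Returned indices are 1-based, as in the text.
    """
    if len(x) != len(r) or set(r) - {"0", "1"}:
        raise ValueError("r must be a 0/1 string with the same length as x")
    observed_idx = [j + 1 for j, bit in enumerate(r) if bit == "1"]
    missing_idx = [j + 1 for j, bit in enumerate(r) if bit == "0"]
    observed = [x[j - 1] for j in observed_idx]
    missing = [x[j - 1] for j in missing_idx]
    return observed_idx, missing_idx, observed, missing


def all_patterns(d):
    """Enumerate every 0/1 pattern of length d (there are 2**d of them)."""
    return ["".join(bits) for bits in product("01", repeat=d)]


if __name__ == "__main__":
    names = ["X1", "X2", "X3", "X4"]
    obs_idx, mis_idx, obs, mis = split_by_pattern(names, "1001")
    print("For j in r:", obs_idx)   # indices where r equals 1
    print("observed:", obs, "missing:", mis)
    print("number of patterns for d = 4:", len(all_patterns(4)))
```

Representing a pattern as a string keeps the example close to the `r = 1001` notation in the text; a boolean array would serve equally well in numerical code.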

This content is AI-processed based on open access ArXiv data.
