
Fitting directed acyclic graphs with latent nodes as finite mixtures models, with application to education transmission

This paper describes an efficient EM algorithm for maximum likelihood estimation of a system of nonlinear structural equations corresponding to a directed acyclic graph model that can contain an arbitrary number of latent variables. The endogenous variables in the model must be categorical, while the exogenous variables may be arbitrary. The models discussed in this paper are an extended version of finite mixture models suitable for causal inference. An application to the problem of education transmission is presented as an illustration.


💡 Research Summary

The paper introduces a novel framework for fitting directed acyclic graph (DAG) models that may contain an arbitrary number of latent variables, by casting the whole system of nonlinear structural equations into a finite‑mixture representation. The key restriction is that all endogenous variables must be categorical (nominal or ordinal), while exogenous variables can be of any type. Each node in the DAG is associated with a conditional probability table (CPT) that specifies the distribution of the node given its parents. Latent variables are introduced as additional discrete nodes that are not observed; they act as mixture components that generate the observed categorical outcomes.
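The mixture reading of such a DAG can be made concrete with a toy example. This is a minimal sketch under assumed numbers and names (latent Z with two states, observed categorical children X and Y, each carrying a CPT), not the paper's own specification: marginalizing out the latent node turns the observed joint distribution into a finite mixture.

```python
import numpy as np

# Hypothetical toy DAG: latent Z (2 states) -> observed X (2 categories)
# and Y (3 categories). Each node carries a conditional probability table.
pi = np.array([0.6, 0.4])                       # P(Z = z): mixture weights
cpt_x = np.array([[0.9, 0.1],
                  [0.3, 0.7]])                  # cpt_x[z, x] = P(X = x | Z = z)
cpt_y = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])             # cpt_y[z, y] = P(Y = y | Z = z)

def joint(x, y):
    """P(X=x, Y=y) = sum_z P(z) P(x|z) P(y|z): a finite mixture over Z."""
    return float(np.sum(pi * cpt_x[:, x] * cpt_y[:, y]))

# The observed joint sums to 1 over all (x, y) cells.
total = sum(joint(x, y) for x in range(2) for y in range(3))
```

Here `joint(0, 0)` evaluates to 0.6·0.9·0.7 + 0.4·0.3·0.1 = 0.39, i.e. each observed cell is a weighted average of per-component probabilities, which is exactly the finite-mixture structure the paper exploits.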

To estimate the model parameters, the authors develop an efficient Expectation–Maximization (EM) algorithm. In the E‑step, given current estimates of the mixture weights and CPT parameters, the posterior probabilities of the latent states for each observation are computed. Because the graph is acyclic, standard forward‑backward (message‑passing) procedures from Bayesian networks can be used, yielding a computational cost that scales linearly with the number of observations, the number of variables, and the number of latent states. In the M‑step, the expected sufficient statistics obtained in the E‑step are used to update the mixture proportions and each CPT in closed form; no iterative numerical optimization is required. This decomposition makes the algorithm highly scalable and suitable for large‑scale categorical data sets.
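The E-step/M-step split described above can be sketched for the simplest instance: one latent node with three binary observed children (a latent class model). All names, parameter values, and the simulation setup are our own illustration, not the paper's code; the point is that the E-step is a responsibility computation costing O(n·K) per variable, and the M-step is a closed-form ratio of expected counts.

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 5000, 2                                  # observations, latent states

# Simulated ground truth: true_cpt[j, z] = P(X_j = 1 | Z = z).
true_pi = np.array([0.6, 0.4])
true_cpt = np.array([[0.9, 0.2],
                     [0.8, 0.3],
                     [0.7, 0.1]])
z = rng.choice(K, size=n, p=true_pi)
X = (rng.random((3, n)) < true_cpt[:, z]).astype(int)   # shape (3, n)

# Non-symmetric starting values (a symmetric start would stall EM).
pi = np.array([0.5, 0.5])
cpt = np.array([[0.6, 0.4], [0.7, 0.3], [0.55, 0.45]])

def log_likelihood():
    # P(x_i) = sum_z pi[z] * prod_j P(x_ij | z)
    pxz = cpt[:, :, None] ** X[:, None, :] * (1 - cpt[:, :, None]) ** (1 - X[:, None, :])
    return float(np.log((pi[:, None] * pxz.prod(axis=0)).sum(axis=0)).sum())

ll_start = log_likelihood()
for _ in range(200):
    # E-step: responsibilities r[i, z] = P(Z = z | x_i) under current parameters.
    pxz = cpt[:, :, None] ** X[:, None, :] * (1 - cpt[:, :, None]) ** (1 - X[:, None, :])
    r = (pi[:, None] * pxz.prod(axis=0)).T      # (n, K), unnormalized
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form updates from expected counts; no inner optimizer.
    pi = r.mean(axis=0)
    cpt = (X @ r) / r.sum(axis=0)               # (3, K) expected success fractions
ll_end = log_likelihood()
```

Because each M-step update is a ratio of expected sufficient statistics, every iteration is a single pass over the data, matching the linear scaling claimed above; EM's standard monotonicity guarantee means `ll_end` can never fall below `ll_start`.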

The authors address identifiability by stating two sufficient conditions. First, every latent variable must be directly linked to at least two observed variables that are conditionally independent given the latent state; this guarantees that the latent states can be distinguished from the observed data. Second, the number of parameters associated with each latent state must not exceed the information provided by the observed categories, preventing over‑parameterization. Under these conditions the parameters are identified up to a relabeling of the latent states, and the EM algorithm converges to a stationary point of the likelihood.
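The second condition is essentially a counting check: the model's free parameters must not exceed the degrees of freedom of the observed contingency table. A back-of-the-envelope version for one latent node with categorical children can be written as follows (the function names and the formulas used are standard latent-class accounting, not taken from the paper):

```python
def free_params(K, obs_cats):
    """(K-1) mixture weights plus K*(c-1) CPT entries per observed child."""
    return (K - 1) + K * sum(c - 1 for c in obs_cats)

def observed_df(obs_cats):
    """Cells of the observed contingency table minus one normalization."""
    prod = 1
    for c in obs_cats:
        prod *= c
    return prod - 1

# Two binary indicators of a 2-state latent node: 5 parameters > 3 df,
# so the model is over-parameterized; a third binary indicator gives
# 7 parameters <= 7 df, and the counting condition holds.
two = (free_params(2, [2, 2]), observed_df([2, 2]))
three = (free_params(2, [2, 2, 2]), observed_df([2, 2, 2]))
```

This also illustrates why the first condition alone is not enough: two conditionally independent binary children already satisfy it, yet the parameter count still exceeds the observed information.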

An empirical illustration is provided using an education‑transmission data set. The sample consists of about 10,000 households, with parental education levels, family income, and regional characteristics treated as exogenous covariates. The endogenous variables are the child's educational attainment and subsequent labor‑market status, both categorical. Two latent factors, interpreted as "family cultural capital" and "learning motivation", are introduced, each with three discrete states. The EM estimates reveal that the direct causal effect of parental education on child education is modest (≈0.12), whereas the indirect effect mediated through the latent cultural factor is substantially larger (≈0.35). Income and regional variables interact with the latent factors, further influencing outcomes. These findings suggest that policies focusing solely on observable resources may underestimate the role of unobserved family culture and motivation.

The paper concludes with a discussion of extensions. Possible directions include (i) allowing continuous endogenous variables by mixing Gaussian components with categorical ones, (ii) incorporating multiple layers of latent variables to capture hierarchical unobserved structures, (iii) adopting Bayesian priors for regularization and MAP estimation, and (iv) implementing parallelized EM to handle massive data sets. Overall, the work provides a flexible, computationally tractable tool for causal inference in settings where latent constructs are essential, bridging the gap between traditional structural equation modeling and finite‑mixture approaches.

