Causal Discovery on Dependent Mixed Data with Applications to Gene Regulatory Network Inference
Authors: Alex Chen, Qing Zhou
CAUSAL DISCOVERY ON DEPENDENT MIXED DATA WITH APPLICATIONS TO GENE REGULATORY NETWORK INFERENCE

By Alex Chen and Qing Zhou

Department of Statistics and Data Science, University of California, Los Angeles
zhou@stat.ucla.edu

Causal discovery aims to infer causal relationships among variables from observational data, typically represented by a directed acyclic graph (DAG). Most existing methods assume independent and identically distributed observations, an assumption often violated in modern applications. In addition, many datasets contain a mixture of continuous and discrete variables, which further complicates causal modeling when dependence across samples is present. To address these challenges, we propose a de-correlation framework for causal discovery from dependent mixed data. Our approach formulates a structural equation model with latent variables that accommodates both continuous and discrete variables while allowing correlated Gaussian errors across units. We estimate the dependence structure among samples via a pairwise maximum likelihood estimator for the covariance matrix and develop an EM algorithm to impute latent variables underlying discrete observations. A de-correlation transformation of the recovered latent data enables the use of standard causal discovery algorithms to learn the underlying causal graph. Simulation studies demonstrate that the proposed method substantially improves causal graph recovery compared with applying standard methods directly to the original dependent data. We apply our approach to single-cell RNA sequencing data to infer gene regulatory networks governing embryonic stem cell differentiation. The inferred regulatory networks show significantly improved predictive likelihood on test data, and many high-confidence edges are supported by known regulatory interactions reported in the literature.

1. Introduction.
Causal discovery aims to identify the underlying causality among variables from observed data. It is usually formalized through the framework of directed acyclic graphs (DAGs), where the nodes of a DAG represent random variables and edges encode causal relationships between variables. Specifically, let V = {V_1, …, V_p} denote a set of p random variables, and let G = (V, E) be a DAG with variable (node) set V and edge set E such that an edge i → j ∈ E implies V_i is a direct cause of V_j. A causal discovery method learns the structure (i.e., the edge set) of the underlying DAG that best explains the observed conditional independencies among the variables. The directed edges in the learned graph indicate which variables directly influence others, thereby distinguishing association from causation. Causal discovery methods are commonly separated into three categories: constraint-based, score-based, and hybrid approaches. Glymour, Zhang and Spirtes (2019) and Nogueira et al. (2022) provide comprehensive reviews of causal discovery methods. Here, we briefly review the most classical algorithms from each category. Constraint-based methods perform a sequence of conditional independence tests to systematically identify and orient edges in a graph. Typically, these methods begin with a complete undirected graph over the observed variables and then iteratively prune edges when two variables are conditionally independent given a set of other observed variables. The result of the pruning stage is a skeleton, which is then oriented using logical orientation rules derived from graph properties such as v-structures and acyclicity as introduced by Meek (1995). The most well-known constraint-based method is the PC algorithm (Spirtes, Glymour and Scheines, 2000).

Keywords and phrases: Causal discovery, Directed acyclic graph, Mixed data, Sample dependence, Single-cell RNA-seq, Gene regulatory network.
Score-based methods search over a space of graphs to identify a DAG or an equivalence class of DAGs that maximizes a predefined scoring function. A commonly used example is the Bayesian Information Criterion (BIC) (Schwarz, 1978), which balances goodness of fit with model complexity by penalizing overly complex graphs. Common examples of score-based methods include Greedy Equivalence Search (GES) (Chickering, 2002) and Greedy Hill Climbing (Gámez, Mateo and Puerta, 2011), both of which iteratively modify the graph in order to improve the score. Hybrid methods integrate both approaches, constraint- and score-based learning, in order to leverage the different strengths of each approach. One example is the Max-Min Hill-Climbing (MMHC) algorithm (Tsamardinos, Brown and Aliferis, 2006). MMHC first uses conditional independence tests to identify a skeleton. Then, a score-based search is used to orient the edges in the learned skeleton by selecting directions that maximize the scoring criterion.

1.1. Motivation. Most causal discovery methods rely on a key assumption: observations (units) are independent and identically distributed (i.i.d.). However, this assumption is often violated in real-world applications, where dependence among samples naturally arises. For instance, in social network analysis, individuals' behaviors are influenced by their social connections, such as friends, families, or colleagues, resulting in interdependent features across individuals. Another layer of complexity arises when the data consists of both continuous and discrete variables. Social network data can contain a mixture of continuous engagement metrics, binary indicators, and categorical demographic variables. In genomics, some variables are typically represented as discrete states, such as the presence or absence of a protein in the cell, while other variables, say the expression levels of genes, vary continuously.
It is also a common practice to discretize gene expression levels depending on the data quality and generation mechanisms. Mixed data can be particularly challenging when we model between-sample dependence, since it is not straightforward to introduce dependence in the discrete data among different units. To address these challenges, we propose a DAG model for mixed data with sample dependence and develop a de-correlation method that facilitates the structure learning of causal DAGs under these settings.

Although our method is broadly applicable across a range of data domains, the primary application in this work is the inference of gene regulatory networks (GRNs) from single-cell RNA sequencing (scRNA-seq) data. We model a GRN by a causal DAG, where a directed edge V_i → V_j implies that gene i directly regulates the expression of gene j. Cells may originate from the same lineage or follow shared differentiation trajectories, leading to implicit association in gene expression patterns. Such dependency structures induce intrinsic correlations among cells, which may be further amplified by technical factors such as batch effects. As another example, intercellular signaling and spatial organization create structured dependence among cells (Almet et al., 2024), leading to correlated gene expression patterns. As a result, gene expression measurements in scRNA-seq data may violate the i.i.d. assumption underlying many standard causal discovery methods. Moreover, gene expression can be represented as a combination of continuous and discretized data, yielding a mixed data setting.

GRN inference methods have been developed specifically for single-cell applications. Pratapa et al. (2020) provide a comprehensive benchmark study of various GRN inference methods using synthetic GRNs and evaluate their performance across multiple criteria. Among the top performing methods, GENIE3 (Huynh-Thu et al.
, 2010) performs GRN inference via a series of regression problems. Specifically, for each target gene, a tree-based ensemble model (e.g. random forest) is trained to predict its expression as a function of all other genes. Feature importance scores are aggregated across models to infer regulatory edges. Similarly, GRNBoost2 (Moerman et al., 2019) builds upon GENIE3 by using gradient boosting machines to improve performance and scalability to large single-cell datasets. Both GENIE3 and GRNBoost2 rely on predictive models for the conditional distribution [V_i | V_{−i}]. As such, the resulting regulatory networks capture statistical dependencies but lack a causal interpretation. Another top performing method, PIDC (Chan, Stumpf and Babtie, 2017), infers GRN structures via an information-theoretic method based on mutual information between gene pairs, but only estimates an undirected network. While these methods have demonstrated empirical success in recovering plausible biological interactions in single-cell data, they all rely on the implicit assumption that individual cells are independent and identically distributed. As discussed above, cells may exhibit significant dependence with each other. These challenges motivate the development of our causal discovery method that explicitly accounts for sample dependence in a mix of multiple data types.

1.2. Related work. In this work, we focus on learning the underlying causal structure among a set of variables V = (V_1, …, V_p). We assume that data is generated from a DAG G whose structure encodes direct causal relationships. The set of parent nodes of variable V_j is denoted by PA_j ⊆ {1, …, p} \ {j}. Under standard i.i.d. assumptions, each sample x_i = (x_i1, …, x_ip) for i = 1, …, n is independently drawn from a joint distribution that factorizes according to the DAG,

f(V_1, …, V_p) = ∏_{j=1}^p f(V_j | V_{PA_j}).    (1.1)

Causal graph learning under i.i.d. assumptions has been well-explored in a variety of settings; see Glymour, Zhang and Spirtes (2019) and Nogueira et al. (2022) for recent reviews. In contrast, there are few methods that consider dependence among observational units {x_1, …, x_n}. One related direction exists in the causal inference literature on partial interference, where units may influence one another (Bhattacharya, Malinsky and Shpitser, 2020). However, partial interference methods typically assume that the causal DAG over variables is known and focus on estimating unit-level causal effects. Our goal is to recover the causal DAG G while allowing for dependence across observational units.

Let X = (X_1, …, X_p) ∈ R^{n×p} denote the observed data matrix, where each column X_j = (x_1j, …, x_nj)⊤ ∈ R^n is a vector of observations across all n units. Li, Padilla and Zhou (2024) proposed a linear Gaussian structural equation model on network data by introducing dependent exogenous variables ε_j = (ε_1j, …, ε_nj):

X_j = ∑_{k∈PA_j} β_{kj} X_k + ε_j,  ε_j ∼ N_n(0, Σ),    (1.2)

where β_{kj} is the direct causal effect of X_k on X_j along the directed edge k → j in the underlying DAG G. The n × n covariance matrix Σ = (σ_ab) is positive definite and models the covariance between two units such that σ_ab = Cov(ε_aj, ε_bj) for j = 1, …, p. Using the Cholesky factor L of Θ = Σ^{−1} as a means to de-correlate the data matrix X results in de-correlated data X̃ = L⊤X. Applying a standard causal discovery method on the de-correlated data X̃ shows promising results for DAG learning. However, this approach is limited to continuous data and is not applicable to discrete variables. In addition, the support of Θ is restricted to a known network (graph) among the n units, which is somewhat limited for real applications.
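The de-correlation step based on (1.2) can be sketched numerically. The equicorrelated Σ below is a hypothetical example for illustration only, not a structure assumed in the paper:

```python
import numpy as np

n = 5
# Hypothetical equicorrelated covariance among the n units (diag fixed to 1).
rho = 0.5
Sigma = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)

Theta = np.linalg.inv(Sigma)       # precision matrix
L = np.linalg.cholesky(Theta)      # lower-triangular, Theta = L @ L.T

# Cov(L.T @ eps_j) = L.T @ Sigma @ L = I_n, so the transformed
# error vector (and hence the transformed data) has uncorrelated units.
assert np.allclose(L.T @ Sigma @ L, np.eye(n))
```

Applying L⊤ to each column of X then yields data whose rows are uncorrelated, which is the property exploited by the de-correlation framework.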
In our prior work (Chen and Zhou, 2025), we extend the framework of Li, Padilla and Zhou (2024) by introducing a latent-variable formulation for binary data:

z_ij = ∑_{k∈PA_j} β_{kj} x_ik + ε_ij = x_i β_j + ε_ij,    (1.3)
x_ij = I(z_ij > 0),    (1.4)

for i ∈ [n] := {1, …, n} and j ∈ [p] := {1, …, p}, where β_j = (β_1j, …, β_pj) ∈ R^p such that supp(β_j) = PA_j. To introduce between-unit dependence, the noise term ε_j = (ε_1j, …, ε_nj) is assumed to follow a multivariate normal distribution, ε_j ∼ N_n(0, Σ), where Σ captures the dependence among units. Under this latent-variable formulation, x_ij corresponds to thresholding an unobserved latent variable z_ij. Since x_ij is binary, one cannot simply transform it by L̂, the Cholesky factor of an estimated Σ̂^{−1}. Instead, the latent vectors Z_j = (z_1j, …, z_nj)⊤, j ∈ [p], are imputed by sampling from a multivariate truncated Gaussian distribution conditional on the observed binary outcomes. To achieve de-correlation, the imputed latent continuous variables Z_j are de-correlated with L̂. Then, any standard causal discovery method can be applied to the de-correlated Z_j, j ∈ [p], to estimate a causal graph.

1.3. Contributions of this work. This paper makes two primary contributions. First, we apply the de-correlation framework for causal discovery to GRN learning from single-cell RNA-seq data with potential between-cell dependence. However, single-cell expression data may consist of both continuous gene expression and multi-level discrete states, which cannot be handled by existing de-correlation approaches (Li, Padilla and Zhou, 2024; Chen and Zhou, 2025). To address this, we generalize the latent-variable formulation in (1.3) and (1.4) and the associated de-correlation method to a mix of general discrete and continuous variables.
Second, we apply our proposed framework to an extensive study on inferring gene regulatory networks from scRNA-seq data in Chu et al. (2016). We further develop a stability measure through a bootstrap method to quantify uncertainty in estimated gene regulatory interactions (edges in the learned graph). High-confidence edges are indeed supported by known biological interactions reported in the literature, demonstrating the effectiveness of our de-correlation approach for causal discovery in complex biological data.

The rest of the paper is structured as follows. Section 2 introduces a new DAG model for dependent mixed data and presents an overview of de-correlation. In Section 3, we describe procedures for pre-estimating some model parameters, in particular, the covariance matrix among the units. Section 4 develops the latent variable recovery and de-correlation method for mixed data. Section 5 presents simulation studies on both random and real DAGs. In Section 6, we apply our method to the single-cell data and evaluate our estimated gene regulatory relationships against findings reported in the biological literature. Section 7 concludes with a discussion of our main findings and potential directions for future research.

2. Model and overall idea.

2.1. DAG model for dependent mixed data. We generalize the SEM in Equations (1.3) and (1.4) to mixed data, where each variable X_j may be continuous, x_ij ∈ R, or discrete with C_j levels, x_ij ∈ {0, 1, …, C_j − 1}. To construct a unified model for the two types of variables, a set of continuous latent variables z_ij, i ∈ [n] and j ∈ [p], is introduced to define the underlying SEMs and generate observed mixed data through transformations. Let D ⊆ [p] denote the index set of discrete variables and C = [p] \ D the index set of continuous variables.
For j ∈ D, the discrete value of x_ij is determined by z_ij and a set of thresholds T_j = {τ_j,c : τ_j,0 < τ_j,1 < ⋯ < τ_j,C_j}, where τ_j,0 = −∞ and τ_j,C_j = ∞. We define a quantization mapping of z_ij given the thresholds T_j:

Q(z_ij; T_j) = c  if  τ_j,c < z_ij ≤ τ_j,c+1,  c = 0, …, C_j − 1.    (2.1)

Let G be a DAG over p nodes with parent sets PA_j, j ∈ [p]. Our model for mixed data with between-unit dependency is defined by the following SEM associated with G:

z_ij = ∑_{k∈PA_j} β_{kj} x_ik + ε_ij = x_i β_j + ε_ij,  ε_j ∼ N_n(0, Σ),    (2.2)
x_ij = Q(z_ij; T_j) if j ∈ D,  x_ij = z_ij if j ∈ C,    (2.3)

for i ∈ [n] and j ∈ [p], where the coefficient vector β_j = (β_1j, …, β_pj) ∈ R^p such that supp(β_j) = PA_j. Similarly to Chen and Zhou (2025), we introduce dependency across the rows of X by assuming that the error vector ε_j = (ε_1j, …, ε_nj) follows a multivariate normal distribution in (2.2). We fix diag(Σ) = 1 so that the parameters {β_j, T_j, Σ} are identifiable. The error terms ε_ij are regarded as exogenous variables, representing the source of randomness in the structural equation model. The dependence between exogenous noise ε_ij across units induces dependence among the corresponding units x_i. This formulation allows unit-level dependence to be explicitly modeled through the covariance structure of the exogenous variables.

A possible alternative to the quantization mapping approach in Equations (2.2) and (2.3) is to model each discrete variable X_j as a multi-logit regression of its parents, as proposed in Gu, Fu and Zhou (2019). However, multi-logit models introduce a large number of parameters: a C-categorical multi-logit model requires C − 1 sets of regression coefficients, one for each category other than the baseline, leading to a total of (C − 1)p parameters for each discrete variable.
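The quantization mapping (2.1) is simple interval thresholding; a minimal sketch (the function name and the use of `numpy.searchsorted` are our own choices):

```python
import numpy as np

def quantize(z, thresholds):
    """Q(z; T): return the level c such that tau_c < z <= tau_{c+1}.
    `thresholds` holds the finite cut points (tau_1, ..., tau_{C-1});
    tau_0 = -inf and tau_C = +inf are implicit."""
    return int(np.searchsorted(thresholds, z, side="left"))

# With thresholds (-0.5, 0.7) the levels are 0, 1, 2:
levels = [quantize(z, [-0.5, 0.7]) for z in (-1.0, 0.0, 0.7, 2.3)]
# -> [0, 1, 1, 2]
```

Note that `side="left"` implements the half-open intervals (τ_c, τ_{c+1}] of (2.1), so a value exactly equal to a threshold falls in the lower level.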
Instead, our model has far fewer parameters for a discrete variable with a large number of levels. Moreover, the multi-logit formulation ignores any potential ordering among the discrete levels, treating them as categorical rather than ordinal. This limitation is particularly relevant in applications such as gene regulatory network inference, where discretized gene expression levels (e.g. low/medium/high) are ordinal by nature. Our model in Equations (2.2) and (2.3) reflects this ordering, making it more suitable for discretization used in single-cell RNA-seq data analysis.

2.2. Overall idea of de-correlation. Under our model in (2.2) and (2.3), both the continuous columns X_C and discrete columns X_D in the data X are generated through the continuous latent data Z = [Z_C, Z_D] = [X_C, Z_D], where Z_j = ∑_{k∈PA_j} β_{kj} X_k + ε_j for j ∈ [p]. We can rewrite Equation (2.2) in terms of the latent variables:

Z_j = ∑_{k∈PA_j} β_{kj} g(Z_k) + ε_j,  j ∈ [p],    (2.4)

where g(Z_k) = Z_k for continuous parents k ∈ C and g(Z_k) = Q(Z_k; T_k) for discrete parents k ∈ D. Equation (2.4) defines an SEM over Z = [Z_1, …, Z_p] associated with the same DAG G over the observed data X. This observation motivates our proposal to learn the structure of G from the continuous latent data Z.

However, there is between-sample dependency among z_i, the i-th row of Z, i ∈ [n], due to the dependent model for ε in Equation (2.2). De-correlation aims to remove the sample-level dependencies such that each unit contributes independent information, which is a key step in our procedure to accurately estimate causal graphs in the dependent data setting. Let Θ = Σ^{−1} be the precision matrix and L be its Cholesky factor, i.e., a lower-triangular matrix such that Θ = LL⊤.
Since Cov(L⊤ε_j) = L⊤ΣL = L⊤(LL⊤)^{−1}L = I_n, the transformed error vector ε̃_j = L⊤ε_j has independent components ε̃_ij, i ∈ [n]. Let e_i = (ε_i1, …, ε_ip) be the vector of all error variables for the i-th unit and ẽ_i = (ε̃_i1, …, ε̃_ip). The SEMs (2.2) for j = 1, …, p define a mapping F : R^p ↦ R^p such that z_i = F(e_i). Then, z̃_i = F(ẽ_i) becomes independent across the units i ∈ [n], to which one can apply standard DAG learning methods for causal discovery. This is the overall idea of de-correlation of the latent data Z. Two key steps are (1) to estimate the covariance matrix Σ; and (2) to impute and de-correlate the error variables ε_j and reconstruct the de-correlated latent data Z̃ = [Z̃_1, …, Z̃_p].

Our method consists of two phases, which we briefly summarize here with the detailed development provided in the next two sections. Phase 1, which we call the pre-estimation phase, focuses on estimating the covariance matrix Σ and the threshold values T = (T_j)_{j∈D}. We maximize the likelihood of the discrete data to estimate T. Given T̂, we estimate the covariance matrix Σ using a pairwise likelihood approach. Phase 2 is the de-correlation phase. We develop an EM algorithm to iteratively reconstruct de-correlated latent data Z̃ and update the SEM parameters β_j, j ∈ [p], in Equation (2.2), while fixing the estimates Σ̂ and T̂ obtained in Phase 1. This procedure provides de-correlated, continuous data Z̃ on which one may apply any standard DAG learning algorithm to recover the underlying causal graph.

3. Pre-estimation of T and Σ. In the pre-estimation phase, we obtain initial estimates of the threshold parameters T = (T_j)_{j∈D}, regression coefficients β = (β_1, …, β_p), and covariance matrix Σ.
Our approach begins by estimating (T, β) under the initialization Σ^(0) = I_n, and then estimates Σ conditional on the resulting (T̂, β̂). Together, the two components yield estimates (T̂, β̂, Σ̂). During the de-correlation procedure in Section 4, T̂ and Σ̂ will be fixed while β̂ will be used as the initial value for further updates.

The pre-estimation phase requires an initial estimate of the parent set PA_j for each node j ∈ [p]. We apply the mixed data version of the Max-Min Hill Climbing (MMHC) algorithm (Tsamardinos, Brown and Aliferis, 2006) to learn a DAG structure from the observed data. The estimated parent sets, denoted by P̂A_j for j ∈ [p], in the learned DAG are used as the initial estimates.

3.1. Threshold estimation. Let ϕ_n(·; μ, Σ) be the density of N(μ, Σ). Recall that Xβ_j denotes the linear combination formed by the parent variables of node j, as supp(β_j) = PA_j. Our starting point is the likelihood of X_j, j ∈ D, given Xβ_j as specified by the SEMs in (2.2) and (2.3):

P(X_j | Xβ_j; T_j, Σ) = ∫_{D(X_j, T_j)} ϕ_n(Z_j; Xβ_j, Σ) dZ_j,    (3.1)

where D(X_j, T_j) = ∏_{i=1}^n (τ_j,x_ij, τ_j,x_ij+1] is the Cartesian product of the truncation intervals implied by the observed levels x_ij, i ∈ [n]. Joint estimation of β, T, and Σ is extremely challenging due to the high dimensionality of Σ and the complexity of the n-variate integral in (3.1). To address this in a practical way, we first estimate β_j and T_j while fixing Σ = I_n, which leads to two substantial simplifications: (1) the likelihood decomposes across samples, enabling efficient evaluation of the integral in Equation (3.1), and (2) it decouples the estimation of (T_j, β_j) for different discrete variables.
Specifically, we arrive at the following log-likelihood for each j ∈ D:

ℓ(T_j, β_j | X) = ∑_{i=1}^n log[Φ(τ_j,x_ij+1 − x_i β_j) − Φ(τ_j,x_ij − x_i β_j)],    (3.2)

where Φ(·) is the cumulative distribution function of N(0, 1). We alternate between optimizing T_j and β_j until convergence as outlined in Algorithm 1, which is a blockwise coordinate ascent algorithm to maximize the log-likelihood ℓ(T_j, β_j | X) in (3.2).

Algorithm 1 Pre-estimation of T_j and β_j by coordinate ascent
Input: P̂A_j and T_j^(0).
Iterate between the following steps for t = 1, 2, … until a stopping criterion is met:
1. β_j^(t) ← arg max_{β_j} ℓ(T_j^(t−1), β_j | X).
2. T_j^(t) ← arg max_{T_j} ℓ(T_j, β_j^(t) | X).

The initial thresholds, T_j^(0) = (τ_j,1^(0), …, τ_j,C_j−1^(0)), are chosen by the empirical distribution of the observed levels. Specifically, we compute the frequency p̂_j,c of each level in the data X_j and set

τ_j,c^(0) = Φ^{−1}(∑_{k=1}^c p̂_j,k),  c = 1, …, C_j − 1,

where Φ^{−1}(·) is the quantile function of N(0, 1).

Since each discrete observation x_ij, j ∈ D, is obtained by thresholding a latent Gaussian variable z_ij, the likelihood of the observed data involves integrals over multivariate truncated Gaussian distributions. In particular, for each variable j, the latent vector Z_j = (z_1j, …, z_nj)⊤ follows a multivariate Gaussian distribution with constraints on Z_j imposed by the observed discrete outcomes. As a result, the observed data likelihood does not admit a closed-form maximizer with respect to β_j for fixed thresholds T_j. To address this, we develop an EM algorithm (Dempster, Laird and Rubin, 1977) to maximize the observed data likelihood ℓ(T_j^(t−1), β_j | X) over β_j in Step 1.
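Both the log-likelihood (3.2) under Σ = I_n and the empirical-quantile initialization of the thresholds can be sketched with the standard library alone (function names are our own):

```python
from math import inf, log
from statistics import NormalDist

Phi, Phi_inv = NormalDist().cdf, NormalDist().inv_cdf

def ordinal_loglik(x, mean, thresholds):
    """Eq. (3.2): sum_i log[Phi(tau_{x_i+1} - m_i) - Phi(tau_{x_i} - m_i)],
    with tau_0 = -inf and tau_C = +inf wrapped around the finite cut points."""
    tau = [-inf] + list(thresholds) + [inf]
    return sum(log(Phi(tau[c + 1] - m) - Phi(tau[c] - m))
               for c, m in zip(x, mean))

def initial_thresholds(x, n_levels):
    """tau^(0)_c = Phi^{-1}(p_hat_1 + ... + p_hat_c) from level frequencies."""
    n, cum, taus = len(x), 0.0, []
    for c in range(n_levels - 1):
        cum += sum(1 for v in x if v == c) / n
        taus.append(Phi_inv(cum))
    return taus
```

For balanced binary data the initial threshold is Φ^{−1}(1/2) = 0, and a binary observation x = 1 with zero mean and threshold 0 contributes log Φ(0) = log(1/2) to (3.2), as expected from the probit special case.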
In the E-step, we compute the conditional expectation of each z_ij given the observed level x_ij, which is the mean of a truncated normal distribution. In the M-step, we perform a linear regression of the expectation of Z_j on the parent variables X_{P̂A_j} to update β_j. We perform a small fixed number of iterations, which guarantees a monotone increase of the observed data likelihood in (3.2) at each step while substantially reducing computational cost.

In Step 2, we estimate the threshold vector T_j = (τ_j,1, …, τ_j,C_j−1) by maximizing the log-likelihood in (3.2) subject to τ_j,1 < ⋯ < τ_j,C_j−1. To enforce this constraint, we re-parameterize the thresholds in terms of their successive differences:

δ_j,1 = τ_j,1,  δ_j,c = τ_j,c − τ_j,c−1,  c = 2, …, C_j − 1,

where δ_j,c > 0 for c ≥ 2. The resulting optimization problem for δ_j is solved numerically through the L-BFGS-B algorithm, resulting in T̂_j. Empirical convergence of Algorithm 1 appears to be very fast, as shown in the Supplementary Material.

3.2. Covariance estimation. Fixing T̂ and β̂, we estimate the covariance matrix Σ to capture the dependence among units. Recall that diag(Σ) = 1. Due to the high dimension of Σ, we develop a pairwise approach to estimate the off-diagonal correlation ρ_ab for each pair of units a and b, by integrating both discrete and continuous variables.

Without loss of generality, let us consider the estimation of ρ_12, the correlation between the first two units, x_1 and x_2. As diag(Σ) = 1, the two exogenous error variables (ε_1j, ε_2j) follow a bivariate Gaussian distribution,

(ε_1j, ε_2j)⊤ ∼ N((0, 0)⊤, [[1, ρ_12], [ρ_12, 1]])  for all j ∈ [p].    (3.3)

There are two types of unit pairs among {(x_1j, x_2j), j ∈ [p]}: pairs of discrete values and pairs of continuous values.
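For a single coordinate under Σ = I_n, the E-step expectation is the mean of a univariate truncated normal, for which a standard closed form exists; a sketch using only the standard library (the function name is ours):

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def phi(t):
    """Standard normal density."""
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)

def truncated_normal_mean(mu, a, b):
    """E[Z | a < Z <= b] for Z ~ N(mu, 1):
    mu + (phi(a - mu) - phi(b - mu)) / (Phi(b - mu) - Phi(a - mu))."""
    Phi = NormalDist().cdf
    alpha, beta = a - mu, b - mu
    return mu + (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))
```

For an interval symmetric about μ the correction term vanishes, and for the half-line (0, ∞) with μ = 0 the mean is √(2/π) ≈ 0.7979, two quick sanity checks on the formula.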
Based on Equation (2.3), a pair of discrete values implies the following constraints on (ε_1j, ε_2j):

x_1j = c_1 ⟺ τ_j,c_1 < ε_1j + x_1 β_j ≤ τ_j,c_1+1,
x_2j = c_2 ⟺ τ_j,c_2 < ε_2j + x_2 β_j ≤ τ_j,c_2+1.

Given the regression coefficients β_j, we use Equation (2.2) to find the probability mass function of (x_1j, x_2j) through the CDF of a bivariate Gaussian:

P(x_1j, x_2j | D_j^{1,2}, ρ_12) = ∫∫_{D_j^{1,2}} ϕ(u_1, u_2 | ρ_12) du_1 du_2,    (3.4)

where ϕ(·, · | ρ_12) is the density function of the bivariate Gaussian distribution in (3.3) and D_j^{1,2} ⊂ R² is the domain of the integral. If (x_1j, x_2j) = (c_1, c_2), then the domain is

D_j^{1,2} = (τ_j,c_1 − x_1 β_j, τ_j,c_1+1 − x_1 β_j] × (τ_j,c_2 − x_2 β_j, τ_j,c_2+1 − x_2 β_j].

For a continuous variable pair, the joint density is

p(x_1j, x_2j | ρ_12) = ϕ(x_1j − x_1 β_j, x_2j − x_2 β_j | ρ_12).    (3.5)

To estimate ρ_12, we construct a pairwise likelihood that incorporates both discrete and continuous pairs across all p variables:

L(ρ_12 | x_1, x_2) = ∏_{j∈D} P(x_1j, x_2j | D_j^{1,2}, ρ_12) × ∏_{j∈C} p(x_1j, x_2j | ρ_12).    (3.6)

For each pair of units (a, b) ∈ [n] × [n], we estimate the correlation ρ_ab by maximizing the likelihood function (3.6). This univariate optimization can be efficiently solved using standard numerical methods.

Following the setting of Li, Padilla and Zhou (2024), we assume a block-diagonal structure for Σ, where each block captures dependence within a group of units and units from two different blocks are independent. This type of structure naturally arises in settings such as social networks, cell populations sharing a common lineage, or spatially localized units. Sparsity may exist within individual blocks, but we do not impose this assumption in our simulations or empirical analysis.
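The rectangle probability in (3.4) reduces to four bivariate normal CDF evaluations by inclusion-exclusion; a sketch using SciPy's `multivariate_normal` (the helper name is our own):

```python
from scipy.stats import multivariate_normal

def discrete_pair_prob(rho, lo1, hi1, lo2, hi2):
    """P(lo1 < U1 <= hi1, lo2 < U2 <= hi2) for (U1, U2) standard
    bivariate normal with correlation rho, via inclusion-exclusion:
    F(hi1, hi2) - F(lo1, hi2) - F(hi1, lo2) + F(lo1, lo2)."""
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    F = lambda u, v: bvn.cdf([u, v])
    return F(hi1, hi2) - F(lo1, hi2) - F(hi1, lo2) + F(lo1, lo2)
```

With these rectangle probabilities and the Gaussian density for continuous pairs, maximizing (3.6) over ρ is a one-dimensional problem, amenable to, e.g., a grid or golden-section search on the negative log pairwise likelihood.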
To estimate the block-diagonal covariance matrix among the units, we find the pairwise maximum likelihood estimate of each correlation ρ_ab between two units in the same block. After estimating all within-block correlations for each block, we check whether the resulting covariance matrix Σ̂ is positive definite to ensure Θ̂ = Σ̂^{−1} is well defined. If not, negative eigenvalues of Σ̂ are truncated to zero and off-diagonal elements are rescaled by a factor < 1 (e.g. 0.9) to improve numerical stability.

4. De-correlation for structure learning.

4.1. Latent data recovery and de-correlation. In the de-correlation phase, we fix Σ and T to their pre-estimated values, Σ̂ and T̂. Let L̂ be the Cholesky factor of Θ̂ = Σ̂^{−1}. We separate the de-correlation procedure between continuous variables and discrete variables. Continuous variables are de-correlated with L̂ by X̃_j = L̂⊤X_j, j ∈ C. For discrete variables, we estimate the latent data Z_D and coefficients β_j, j ∈ D, via an EM algorithm (Dempster, Laird and Rubin, 1977). Given Z_D, we apply the Cholesky factor L̂ to generate de-correlated data Z̃_j = L̂⊤Z_j, j ∈ D. The full de-correlated dataset is Z̃ = [X̃_C, Z̃_D], in which all variables are continuous.

In the E-step, we impute the latent data by computing its expectation from a truncated Gaussian distribution given the observed discrete outcomes and the current estimated coefficients β̂. Recall that the distribution of the error vector is ε_j ∼ N_n(0, Σ) from (2.2). Conditional on the observed data X, however, ε_j = (ε_ij, i ∈ [n]) must fall within the intervals implied by T_j:

τ_j,x_ij − x_i β_j < ε_ij ≤ τ_j,x_ij+1 − x_i β_j,  i ∈ [n].    (4.1)

This corresponds to sampling ε_j from an n-dimensional truncated Gaussian distribution, N^T(0, Σ; T_j).
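The positive-definiteness repair described above (eigenvalue truncation plus off-diagonal shrinkage) can be sketched as follows; the function name and the repeat-until-PD loop are our own choices:

```python
import numpy as np

def repair_covariance(S, shrink=0.9, tol=1e-10):
    """Clip negative eigenvalues of S to zero and shrink off-diagonal
    entries by `shrink` until the matrix is positive definite."""
    S = np.asarray(S, dtype=float).copy()
    while np.linalg.eigvalsh(S).min() <= tol:
        w, V = np.linalg.eigh(S)
        S = (V * np.clip(w, 0.0, None)) @ V.T   # eigenvalue truncation
        D = np.diag(np.diag(S))
        S = D + shrink * (S - D)                # off-diagonal shrinkage
    return S
```

Shrinking the off-diagonals of the truncated (positive semi-definite) matrix pushes the smallest eigenvalue strictly above zero, so the loop typically terminates after one or two passes for covariance matrices with unit-scale diagonals.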
When n is large, directly sampling from an n-dimensional truncated Gaussian distribution is computationally challenging. To address this, we exploit the block-diagonal structure of Σ, which limits dependence to within-block units. We develop a Gibbs sampler to generate samples within each block. Averaging over N draws yields an approximate expectation E(ε_j | X, β_j) to reconstruct the latent continuous data Ẑ_D via (2.2).

The recovered latent data, Ẑ_D, can then be de-correlated to generate Z̃_j = L̂⊤Ẑ_j, similar to the continuous variables. Then, the M-step updates the regression parameters β_j by regressing Z̃_j = L̂⊤Z_j on L̂⊤X for j ∈ D, which now satisfies, approximately, the assumption of independent errors. Ridge regularization is applied to avoid overfitting.

Algorithm 2 details the iterative process of recovering the latent data Z_j for discrete variables, de-correlating Z, and estimating β. Our de-correlation approach may be understood under a unified perspective for continuous and discrete variables,

Z̃_j = L̂⊤Ẑ_j = L̂⊤Xβ̂_j + L̂⊤ε̂_j,  j ∈ [p],    (4.2)

where Ẑ_j = X_j and Z̃_j = X̃_j for j ∈ C. When Σ̂ is well estimated, L̂⊤ε̂_j effectively removes unit-level dependence, yielding approximately independent units. This de-correlation transformation is the key step in improving causal structure learning on dependent data.

Algorithm 2 EM-based de-correlation for mixed data
Input: Mixed data X = [X_C, X_D], Σ̂, T̂, β̂^(0), and P̂A_j, j ∈ [p]. Let X̃_j = L̂⊤X_j, j ∈ C.
Iterate the following steps for t = 1, 2, … until a stopping criterion is met:
1. For each discrete variable j ∈ D:
   a) Draw samples of ε_j from N^T(0, Σ̂; T̂_j) by a Gibbs sampler.
   b) Approximate the expectation ε̂_j^(t) = E[ε_j | X, β̂_j^(t−1)] via a Monte Carlo average of the draws of ε_j.
   c) Reconstruct latent variables Ẑ_j^(t) = Xβ̂_j^(t−1) + ε̂_j^(t).
Form the de-correlated latent data $\tilde{Z}_D^{(t)} = \hat{L}^{\top} \hat{Z}_D^{(t)}$ and let $\tilde{Z}^{(t)} = [\tilde{X}_C, \tilde{Z}_D^{(t)}]$.
2. For $j \in D$, regress $\tilde{Z}_j^{(t)}$ on $\hat{L}^{\top} X_{\widehat{PA}_j}$ to update $\hat{\beta}_j^{(t)}$.

4.2. Structure learning. Standard structure learning methods for i.i.d. continuous data may now be applied to the de-correlated data $\tilde{Z}^{(t)}$ recovered in the final iteration of Algorithm 2 to infer the underlying DAG. However, the conditional expectation of the latent variables $Z_D$ is obtained via Monte Carlo sampling in the E-step. To reduce variability, we aggregate information across multiple iterations of $\tilde{Z}^{(t)}$ by pooling structure learning results via a consensus vote.

In general, a causal DAG is not identifiable from observational data alone, since multiple DAGs may encode the same set of conditional independence relations. These DAGs are called Markov equivalent. A Markov equivalence class can be uniquely represented as a completed partially directed acyclic graph (CPDAG). Thus, our goal is to estimate a CPDAG from observational data.

We consider two strategies for obtaining the final CPDAG estimate. In the Consensus approach, we run a standard structure learning algorithm (such as MMHC) on $M$ de-correlated datasets $\{\tilde{Z}^{(t)} : t \in S, |S| = M\}$, where $S$ is a subset of iterations from Algorithm 2. An edge is accepted into the final causal graph estimate if it appears in at least half of the estimated CPDAGs. The second strategy, called Average, first computes an element-wise average of the $M$ de-correlated datasets and then runs the same standard causal discovery method on the averaged data.

From a computational perspective, the Average approach requires a single graph estimation, whereas the Consensus approach requires $M$ causal graph estimates. The Consensus approach can yield more stable edge recovery, but for large numbers of variables the Average approach may be more practical.
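Step 1a of Algorithm 2 relies on Gibbs sampling from a truncated multivariate Gaussian within each block. A minimal sketch for one block is below; the function name, inverse-CDF conditional draws, and tuning constants are our own choices, as the paper does not spell out the sampler's implementation.

```python
import numpy as np
from scipy.stats import norm

def gibbs_truncated_mvn(sigma, lo, hi, n_iter=2000, burn=500, rng=None):
    """Gibbs sampler for a zero-mean Gaussian N(0, sigma) truncated
    coordinate-wise to the intervals (lo[i], hi[i]].

    Returns the Monte Carlo average of the post-burn-in draws, an
    approximation of E[eps | lo < eps <= hi] for one block.
    """
    rng = np.random.default_rng(rng)
    d = sigma.shape[0]
    prec = np.linalg.inv(sigma)                        # precision matrix
    eps = np.clip(np.zeros(d), lo + 1e-6, hi - 1e-6)   # feasible start
    total = np.zeros(d)
    kept = 0
    for it in range(n_iter):
        for i in range(d):
            # Conditional of coordinate i given the rest, via the precision.
            s2 = 1.0 / prec[i, i]
            mu = -s2 * (prec[i] @ eps - prec[i, i] * eps[i])
            sd = np.sqrt(s2)
            # Inverse-CDF draw from the truncated univariate conditional.
            a = norm.cdf((lo[i] - mu) / sd)
            b = norm.cdf((hi[i] - mu) / sd)
            eps[i] = mu + sd * norm.ppf(a + (b - a) * rng.uniform())
        if it >= burn:
            total += eps
            kept += 1
    return total / kept
```

Because $\Sigma$ is block-diagonal, this sampler only ever runs on blocks of modest size (10 to 15 units in the simulations below), not on the full $n$-dimensional distribution.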
We benchmark both methods against a Baseline approach, which applies the mixed-data version of the same structure learning method directly to the original data. We first demonstrate the improvement from our method on simulated data, and then move to the motivating application of inferring gene regulatory networks from single-cell RNA-seq data.

5. Simulation results.

For simulated data, we use synthetic random DAGs and real DAGs from the bnlearn repository (Scutari, 2017), which are widely used as benchmark networks for evaluating structure learning algorithms. We compare the true CPDAG to the estimated CPDAG learned by each approach. To compare methods, we compute the number of true positives (TP), false positives (FP), and false negatives (FN). Following the interpretation of undirected edges in a CPDAG, we regard an undirected edge $i - j$ as two directed edges $i \to j$ and $j \to i$. A directed edge counts as a true positive if it appears in both the learned and true CPDAGs; false positives are directed edges present only in the learned CPDAG, and false negatives are directed edges present only in the true CPDAG. Under this metric, incorrect edge orientations are penalized and undirected edges are accounted for in both directions. We summarize the performance of CPDAG recovery using the F1-score,

(5.1)    $\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$

where precision is defined as $TP/(TP + FP)$ and recall as $TP/(TP + FN)$. The F1-score ranges between 0 and 1, and a value of 1 indicates perfect recovery.

5.1. Setup. For each DAG structure, edge weights were sampled independently and uniformly from the interval $[-0.9, -0.6] \cup [0.6, 0.9]$. Exogenous noise terms were then generated according to Equation (2.2) with a specified covariance structure $\Sigma$ (detailed below).
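The CPDAG comparison metric above can be computed directly from adjacency matrices. A sketch, under our own encoding convention (an undirected edge sets both directed entries):

```python
import numpy as np

def cpdag_f1(true_adj, est_adj):
    """F1-score between two CPDAGs given as binary adjacency matrices.

    Entry [i, j] = 1 encodes the directed edge i -> j; an undirected
    edge i - j is encoded by setting both [i, j] and [j, i] to 1, so
    undirected edges count in both directions, as in (5.1).
    """
    t = np.asarray(true_adj, dtype=bool)
    e = np.asarray(est_adj, dtype=bool)
    tp = np.sum(t & e)
    fp = np.sum(~t & e)
    fn = np.sum(t & ~e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note how a reversed edge is penalized twice under this encoding: once as a false positive and once as a false negative.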
Given the sampled exogenous noise terms and edge weights, each latent continuous variable $Z_j$ was generated through (2.2). With probability 0.5, $Z_j$ was discretized by the quantization mapping in (2.3) with thresholds $\tau_{j,1} = -1$ and $\tau_{j,2} = 1$. We repeated this process for $j = 1, \ldots, p$ according to a topological ordering of the DAG, which produced a mixed data matrix $X \in \mathbb{R}^{n \times p}$.

The covariance structure $\Sigma$ was block-diagonal, where each block corresponded to a cluster of correlated units of size 10 to 15. For each block, we randomly chose between Equal and Toeplitz covariance patterns.

• The Equal structure emulates fully connected units, where $\Sigma_{ij} = \theta$ if $i \neq j$, with $\theta \sim U(0.4, 0.7)$. This occurs when all units in the block are equally dependent on one another.
• The Toeplitz structure emulates units connected in a Markov chain, where $\Sigma_{ij} = \theta^{|i-j|/5}$ and $\theta \sim U(0.1, 0.25)$.

These covariance designs allow us to emulate varying levels and types of inter-unit dependence across clusters.

5.2. Evaluation of covariance estimation. We evaluated the accuracy of covariance matrix estimation, as it plays an essential role in effective de-correlation in Algorithm 2. We simulated mixed data according to ten random DAGs, with $n \in \{100, 500\}$ units and $p \in \{100, 500\}$ variables for each DAG. Using the threshold estimates from Algorithm 1, we applied the pairwise MLE method from Section 3.2 to estimate each correlation $\hat{\rho}_{ij}$ within the known block structure and computed the element-wise RMSE between the estimated covariance matrix $\hat{\Sigma} = (\hat{\rho}_{ij})_{n \times n}$ and the true covariance matrix $\Sigma = (\rho_{ij})_{n \times n}$:

$\mathrm{RMSE}(\hat{\Sigma}, \Sigma) = \left( \frac{1}{|H|} \sum_{(i,j) \in H} (\hat{\rho}_{ij} - \rho_{ij})^2 \right)^{1/2},$

where $H$ is the set of non-zero, off-diagonal elements of $\Sigma$. We conducted two simulation studies to evaluate $\mathrm{RMSE}(\hat{\Sigma}, \Sigma)$.
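The block-diagonal covariance designs of Section 5.1 can be simulated with a short helper. This is a sketch under our own function names; block sizes, patterns, and parameter ranges follow the description above.

```python
import numpy as np
from scipy.linalg import block_diag

def simulate_block_sigma(n_blocks, rng=None):
    """Build a block-diagonal covariance with Equal or Toeplitz blocks,
    following the simulation design of Section 5.1 (block sizes 10-15,
    pattern chosen at random for each block)."""
    rng = np.random.default_rng(rng)
    blocks = []
    for _ in range(n_blocks):
        m = rng.integers(10, 16)           # block size in {10, ..., 15}
        idx = np.arange(m)
        if rng.uniform() < 0.5:
            # Equal: all off-diagonal entries share one value theta.
            theta = rng.uniform(0.4, 0.7)
            b = np.full((m, m), theta)
            np.fill_diagonal(b, 1.0)
        else:
            # Toeplitz: correlation decays with lag, theta^{|i-j|/5}.
            theta = rng.uniform(0.1, 0.25)
            b = theta ** (np.abs(idx[:, None] - idx[None, :]) / 5.0)
        blocks.append(b)
    return block_diag(*blocks)
```

Both block types are valid correlation matrices: the Equal block has eigenvalues $1 - \theta$ and $1 + (m-1)\theta$, and the Toeplitz block is an AR(1)-type correlation matrix with ratio $\theta^{1/5} \in (0, 1)$.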
In the first simulation, we assumed knowledge of the true parent sets. In the second, we repeated the same experiment but used parent sets estimated by MMHC on the original mixed data. To quantify the accuracy of parent recovery, we calculated the F1-score of the estimated parent sets. For $n = 500$ and $p = 500$, the average F1-score was 0.75. For smaller sample sizes ($n = 100$, $p = 500$), parent recovery was substantially less accurate, with an average F1-score of 0.41. Figure 1 shows that the results from using the estimated parents and the true parents were similar in all settings. This indicates that the proposed covariance estimation procedure remains robust when the estimated graph is inaccurate. Additionally, covariance estimation improves as $p$ increases, since the likelihood-based estimation in (3.6) effectively treats the number of variables as the sample size.

To further assess the accuracy of the covariance estimates and the effectiveness of the de-correlation procedure, we examined pairwise correlations across observational units. We used only the subset of continuous variables, as correlation measures for discrete data are less straightforward to define and interpret. Using simulated datasets of size $(n, p) = (500, 500)$, we compared the within-block correlations of the original data $X_j$, $j \in C$, with those of the de-correlated variables $\tilde{X}_j$, $j \in C$. As shown in Figure 2, the original data exhibit strong correlations across units, with an average of 0.62, indicating that the data clearly violate the independence assumption commonly required by causal discovery algorithms. After applying our de-correlation method, the distribution of correlations is concentrated around zero, with an average of 0.09. This demonstrates that the transformation substantially removes cross-unit dependence and yields approximately uncorrelated units.

5.3. DAG learning accuracy.
To demonstrate the consistency of our de-correlation algorithm across causal discovery methods, we used three different DAG learning methods: MMHC (Tsamardinos, Brown and Aliferis, 2006), PC, and Copula-PC (Cui, Groot and Heskes, 2016).

FIG 1. RMSE of covariance estimate across data size settings of $n \in \{100, 500\}$ and $p \in \{100, 500\}$, using either estimated parent sets from MMHC on the mixed data or true parent sets.

FIG 2. Histograms of the within-block correlations among the continuous variables of the original data $X_j$, $j \in C$, and the corresponding de-correlated $\tilde{X}_j$, $j \in C$.

MMHC is a hybrid causal discovery algorithm that first identifies potential parent sets using conditional independence tests and then orients edges using a hill-climbing algorithm to optimize a scoring function. PC is a constraint-based method that estimates the skeleton of a DAG using conditional independence tests and then orients edges by a set of orientation rules known as Meek's rules (Meek, 1995). Copula-PC is an extension of the PC algorithm that assumes the data follow a Gaussian copula model. It uses Gibbs sampling to draw covariance matrices from a posterior distribution; the samples are used to compute an average correlation matrix and an effective sample size, which are input to the standard PC algorithm for causal discovery. For each DAG learning method, we use the mixed-data version on the dependent mixed data (Baseline) and the continuous version on the de-correlated data from our algorithm (Average and Consensus approaches).
We then compare the results to quantify the performance of de-correlation. We simulated 10 DAGs in each parameter setting in the experiments below.

FIG 3. F1-scores of MMHC, PC, and Copula-PC algorithms on the original mixed data (Baseline) and de-correlated continuous data (Average and Consensus) for $n = 100, 500$ and $p = 100, 1000$.

We first evaluated the performance of our method on mixed data simulated from random DAGs with $n \in \{100, 500\}$ units, $p \in \{100, 1000\}$ variables, and $2p$ edges. As reported in Figure 3, the Consensus and Average approaches show significant improvement over the Baseline approach in all experiments. The magnitude of improvement varies between the settings with $p = 100$ and $p = 1000$. As discussed in Section 5.2, covariance estimation improves as the number of variables increases. Consequently, the larger performance gains observed when $p = 1000$ are attributable to more accurate estimation of $\Sigma$, which leads to more effective de-correlation. Compared to MMHC and PC, Copula-PC had the worst performance under the Baseline approach, but its improvement from the Baseline to the Consensus approach is the largest. Because Copula-PC relies on sampling correlation matrices, the large number of variables makes accurate sampling difficult. The Consensus approach explicitly addresses sample dependence and discretization to improve causal graph estimation with Copula-PC. Regardless of the DAG learning method used, we observe improved causal structure recovery via our de-correlation algorithm.
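The consensus vote used throughout these comparisons (an edge is kept if it appears in at least half of the $M$ estimated CPDAGs) reduces to a simple frequency threshold. A sketch, with our own adjacency-matrix encoding:

```python
import numpy as np

def consensus_cpdag(cpdags, threshold=0.5):
    """Consensus vote over M estimated CPDAGs.

    Each CPDAG is a binary adjacency matrix where an undirected edge
    i - j sets both [i, j] and [j, i]. An edge enters the consensus
    graph if it appears in at least `threshold` of the M estimates.
    """
    stack = np.stack([np.asarray(g, dtype=float) for g in cpdags])
    freq = stack.mean(axis=0)   # edge frequency across the M estimates
    return (freq >= threshold).astype(int)
```

The Average approach, by contrast, averages the $M$ de-correlated data matrices element-wise and runs the structure learning algorithm once.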
Next, we evaluated our method using eight benchmark DAGs ($p$ variables and $s$ edges) available in the bnlearn repository (Scutari, 2017): Hailfinder ($p = 56$, $s = 66$), Hepar2 ($p = 70$, $s = 123$), Win95pts ($p = 76$, $s = 112$), Munin1 ($p = 186$, $s = 273$), Andes ($p = 223$, $s = 338$), Pigs ($p = 441$, $s = 592$), Diabetes ($p = 413$, $s = 602$), and Link ($p = 724$, $s = 1125$). We simulated mixed data from each DAG and repeated the process 10 times. To reflect varying data dimensionalities, the smaller networks, Hailfinder, Hepar2, Win95pts, and Munin1, were simulated with sample sizes $n \in \{100, 200\}$. For the larger and more complex networks, Andes, Pigs, Diabetes, and Link, we used sample sizes $n \in \{100, 500\}$. This design provided a range of experimental conditions, covering both low-dimensional ($n > p$) and high-dimensional ($p > n$) settings, allowing us to assess performance under different sample-to-variable ratios. Figure 4 summarizes the F1-scores for DAG recovery across the eight networks. For nearly all networks and sample sizes, the de-correlation-based methods, the Average and Consensus approaches, consistently outperformed the Baseline approach.

In experiments with smaller networks such as Hepar2 and Hailfinder, we observe modest but consistent improvements over the Baseline approach with $n = 100$ across MMHC, PC, and Copula-PC. With $n = 200$, the F1-scores for these networks using MMHC are comparable across the Baseline, Average, and Consensus approaches.
However, using PC and Copula-PC yields higher F1-scores for the Consensus and Average approaches than for the Baseline approach.

FIG 4. F1-scores of MMHC, PC, and Copula-PC algorithms on original mixed data (Baseline) and de-correlated continuous data (Average and Consensus) for four smaller networks (Hepar2, Win95pts, Hailfinder, Munin1) with $n = 100, 200$ and four larger networks (Andes, Pigs, Diabetes, Link) with $n = 100, 500$.

For Win95pts and Munin1, the Average and Consensus approaches show greater improvements over the Baseline for both $n = 100$ and $n = 200$ across all causal discovery methods. For example, with $n = 100$, the Munin1 network using the MMHC method sees a 20% increase in F1-score by the Consensus approach over the Baseline. For larger graphs with limited sample size ($n = 100$), including Andes, Pigs, and Link, the Consensus and Average approaches achieve substantially higher F1-scores than the Baseline across causal discovery methods.
In the Diabetes network with $n = 100$, MMHC yields similar performance across all approaches, while PC and Copula-PC show improved F1-scores for the Consensus and Average approaches relative to the Baseline. This pattern largely persists at $n = 500$. For Andes and Pigs with $n = 500$, the Consensus and Average approaches consistently outperform the Baseline across all methods. We observe a 38% and a 16% increase in F1-score comparing the Baseline and Consensus approaches using the MMHC method for Pigs and Andes, respectively. For Diabetes, MMHC and PC show comparable performance across approaches, whereas Copula-PC again sees the most improvement through the Consensus and Average approaches. In the Link network with $n = 500$, PC yields similar F1-scores across all approaches, while MMHC and Copula-PC show improved performance using the Consensus and Average approaches compared to the Baseline. We observe a 24% increase in F1-score of the Consensus approach over the Baseline approach for the MMHC method on the Link network with $n = 500$.

Overall, the Consensus approach achieves the best or comparable DAG recovery in almost all cases, while the Average approach provides a computationally efficient alternative with comparable DAG recovery.

6. Application to GRN inference.

Gene regulatory networks (GRNs) describe how genes regulate each other through activation or repression, providing an understanding of how cells function and differentiate. Uncovering the causal relationships between genes is critical for understanding complex biological processes and how their dysregulation leads to diseases. In therapeutic settings, identifying which genes to up-regulate or inhibit can reveal potential targets for improving treatment efficacy.
Because it is expensive and difficult to directly measure gene regulation, inferring GRNs from single-cell expression data is an active and important research area.

6.1. Background and the data. Chu et al. (2016) generated single-cell RNA-seq data to investigate how human embryonic stem cells (ESCs) transition from a pluripotent state to lineage-specific progenitor states. Pluripotent stem cells are uncommitted cells capable of differentiating into all three germ layers (ectoderm, mesoderm, and endoderm), while lineage-specific progenitors represent cells that have begun to develop towards a particular state. The key motivation of their study was to identify key transcriptional regulators that drive these transitions and to characterize the gene regulation governing differentiation. This dataset is publicly available at the Gene Expression Omnibus under accession number GSE75748.

The dataset consists of 1,018 single cells spanning multiple cell types derived from ESCs and controls. These included 6 different cell types (H1, n = 212; H9, n = 162; DEC, n = 138; EC, n = 105; NPC, n = 173; TB, n = 69) along with human foreskin fibroblasts (HFFs; n = 159) that served as controls. There were 20,000 genes in the dataset, but we focused on learning a GRN among p = 51 genes compiled by Chu et al. (2016) that are known to be important lineage-informative marker genes. The genes not considered for GRN estimation are referred to as background genes.

6.2. Pre-processing. Each expression profile was pre-processed following the normalization and quality control pipeline described in Li and Li (2018). In our analysis, we excluded the HFFs (controls) to focus on the developmental trajectory from pluripotent embryonic stem cells to lineage-specific progenitors. Excluding HFF cells, the data consist of n = 859 cells (units) and p = 51 genes (variables).
The raw measurements of single-cell RNA-seq data are counts of sequencing reads corresponding to transcripts for each gene in a cell. Due to technical noise, dropouts, and data sparsity, the read counts can be unreliable and highly variable. Additionally, many genes in scRNA-seq data exhibit discrete expression dynamics: signaling genes may switch "on" only when a pathway is activated, while other genes may display multi-modal expression patterns. As a result, discretization of such genes can lead to more robust estimation of a causal GRN.

To determine whether each gene should be modeled as continuous or discretized, we performed an additional test after the quality control pipeline. Specifically, for each gene $j$ with expression data $X_j = (x_{ij})_{i=1}^{n}$, we fit both a single Gaussian model,

$x_{ij} \sim N(\mu_j, \sigma_j^2), \quad i = 1, \ldots, n,$

and Gaussian mixture models (GMMs) with $K = 2$ and $K = 3$ components,

$x_{ij} \sim \sum_{k=1}^{K} \pi_{jk} N(\mu_{jk}, \sigma_{jk}^2), \quad \text{where } \sum_{k=1}^{K} \pi_{jk} = 1, \ \pi_{jk} > 0.$

FIG 5. Hierarchical clustering dendrogram of cells with the approximate cell types corresponding to the clusters.

We calculate the Bayesian Information Criterion (BIC) for each candidate model and select the model with the smallest BIC value. If a Gaussian mixture model provides a better fit than the single Gaussian model, we further test the normality of each identified component using the Shapiro-Wilk test ($\alpha = 0.05$). If any mixture component deviates significantly from normality, we discretize the gene based on the GMM cluster assignments of $x_{ij}$, $i \in [n]$. Conversely, if the single Gaussian model best fits the data or if all mixture components pass the normality test, we keep the gene as a continuous variable. After this discretization test, our dataset $X = (x_{ij})_{859 \times 51}$ has $|D| = 20$ discrete and $|C| = 31$ continuous variables.
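The continuous-versus-discrete decision rule above can be sketched as follows. This is a dependency-light sketch: we hand-roll a one-dimensional EM fit rather than use a library GMM, and the function names are ours; the paper does not specify an implementation.

```python
import numpy as np
from scipy.stats import norm, shapiro

def fit_gmm_1d(x, K, n_iter=200):
    """Fit a 1-D K-component Gaussian mixture by EM; return (BIC, labels)."""
    n = len(x)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)   # spread-out init
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        dens = pi * norm.pdf(x[:, None], mu, np.sqrt(var))   # n x K
        resp = dens / dens.sum(axis=1, keepdims=True)
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8
    dens = pi * norm.pdf(x[:, None], mu, np.sqrt(var))
    loglik = np.log(dens.sum(axis=1)).sum()
    bic = (3 * K - 1) * np.log(n) - 2 * loglik   # K means, K vars, K-1 weights
    return bic, dens.argmax(axis=1)

def discretize_decision(x, alpha=0.05):
    """BIC selection over K in {1,2,3}; if a mixture wins and any component
    fails the Shapiro-Wilk normality test, discretize by cluster label."""
    x = np.asarray(x, dtype=float)
    fits = {K: fit_gmm_1d(x, K) for K in (1, 2, 3)}
    best_K = min(fits, key=lambda K: fits[K][0])
    if best_K == 1:
        return "continuous", None
    labels = fits[best_K][1]
    for k in range(best_K):
        comp = x[labels == k]
        if len(comp) >= 3 and shapiro(comp).pvalue < alpha:
            return "discrete", labels
    return "continuous", None
```

A gene with clearly separated, non-Gaussian expression modes is thus mapped to a discrete variable whose levels are the fitted mixture components.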
The distinct cell types in the scRNA-seq data suggest a block structure among cells, since we expect cells of the same type to be more strongly correlated. To infer this structure from the RNA-seq data, we applied hierarchical clustering to the 859 cells, using 2,000 randomly selected background genes as the feature vector. Figure 5 shows that the dendrogram approximately groups cells of the same type into the same clusters. Note that H1 and H9 cells, male and female ESC lines, cluster closely together due to their similar gene expression profiles (Sperger et al., 2003). This verifies that it is reasonable to use the background genes to estimate the block structure among the cells.

Considering the computation time and robustness of the covariance estimation method, we wanted to identify relatively small blocks of cells exhibiting high correlations. This consideration motivated our choice of 100 clusters from the dendrogram. As shown in Figure S4 in the Supplementary Material, 90% of the clusters had size ≤ 30.

6.3. Model Evaluation. Without a true underlying GRN, we evaluated different GRN learning methods using cross-validation. Because likelihood evaluation requires independence across folds, we aggregated independent blocks of cells into a total of 10 folds. Each fold consisted of approximately 85 cells. Let $\{X^{(k)}_{\text{train}}, X^{(k)}_{\text{test}}\}_{k=1}^{10}$ denote the 10-fold cross-validation splits of the scRNA-seq dataset.

We compared three different approaches to estimating the causal graph on each fold $k$: Baseline, Consensus-Ident, and Consensus.

• Baseline: We apply a mixed-data version of MMHC directly to $X^{(k)}_{\text{train}}$, the mixed-type gene expression data.
• Consensus: We apply Algorithm 2 to $X^{(k)}_{\text{train}}$ and apply a continuous-data version of MMHC to the resulting de-correlated continuous data $\tilde{Z}^{(k)}_{\text{train}}$.
• Consensus-Ident: Identical to the Consensus approach, but we fix $\hat{\Sigma} = I_n$ in Algorithm 2, effectively removing the de-correlation step. We apply a continuous-data version of MMHC to the dependent continuous data $Z^{(k)}_{\text{train}}$.

FIG 6. Boxplot comparison of test-data log-likelihood distributions in cross-validation across the Baseline, Consensus-Ident, and Consensus methods on single-cell data.

For each $k$, we applied each of the above methods to $X^{(k)}_{\text{train}}$ to obtain an estimated graph $\hat{G}^{(k)}$ and the associated parameters $\hat{\Psi}^{(k)} := (\hat{\beta}^{(k)}, \hat{T}^{(k)})$ under our model.

Test-data likelihood evaluation requires an estimate of the covariance matrix $\Sigma$. However, directly estimating $\Sigma$ from the held-out test data $X^{(k)}_{\text{test}}$ would lead to bias and inflate the resulting likelihood. On the other hand, we cannot use the training data to estimate the covariance for the test data, because the dependence among units varies across folds. To address these issues, we estimated the covariance structure using a randomly sampled set of 100 background genes. For each fold $k$, let $X^{(k)}_{\text{bg}} \in \mathbb{R}^{n_k \times 100}$ denote the expression matrix of 100 randomly sampled background genes measured on the $n_k$ cells of fold $k$. We applied the covariance estimation method of Section 3.2 to $X^{(k)}_{\text{bg}}$ to estimate $\hat{\Sigma}^{(k)}_{\text{test}}$. Note that the test data $X^{(k)}_{\text{test}}$ were not used in the estimation of $\hat{\Sigma}^{(k)}_{\text{test}}$.

Given $\hat{G}^{(k)}$, $\hat{\Psi}^{(k)}$, and $\hat{\Sigma}^{(k)}_{\text{test}}$, we evaluated the log-likelihood of the test data $X^{(k)}_{\text{test}}$ under our mixed-data model. For each fold, we report the normalized log-likelihood per data point,

$\ell^{(k)} = \frac{1}{|X^{(k)}_{\text{test}}|} \log p\big(X^{(k)}_{\text{test}} \mid \hat{G}^{(k)}, \hat{\Psi}^{(k)}, \hat{\Sigma}^{(k)}_{\text{test}}\big),$

where $|X^{(k)}_{\text{test}}|$ is the number of cells in the corresponding test set of fold $k$.
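The independent blocks underlying these cross-validation folds come from the hierarchical clustering of cells described in Section 6.2. A minimal sketch of estimating block labels from background-gene expression; the linkage method and function name are our assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def estimate_cell_blocks(expr_bg, n_clusters=100):
    """Estimate a block structure among cells by hierarchical clustering
    on background-gene expression, cutting the dendrogram into a fixed
    number of clusters.

    expr_bg: (n_cells, n_background_genes) matrix.
    Returns block labels in {1, ..., n_clusters}.
    """
    tree = linkage(expr_bg, method="ward")   # Ward linkage, Euclidean
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

The resulting labels define both the support of the block-diagonal covariance matrix and the independent units that are aggregated into folds.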
This metric is a normalized log-likelihood that assesses the model fit of each method while accounting for fold size, ensuring comparability across folds. Figure 6 compares the normalized test log-likelihood among the three methods. The Consensus method had a substantially higher median normalized log-likelihood (−0.55) than the Baseline and Consensus-Ident approaches, which had a median of −1.5. Additionally, the distribution of test log-likelihoods for the Consensus method does not overlap with those of the other two methods, indicating a consistent and significant improvement in model fit.

The gap between the Consensus and Consensus-Ident methods isolates the effect of de-correlation: both methods use the same consensus approach and graph aggregation procedure, but only the Consensus method explicitly removes sample dependence. The improved likelihood under the Consensus method highlights the importance of de-correlating dependent units prior to causal discovery. Overall, these results demonstrate that our proposed model provides a better fit to this scRNA-seq dataset. More broadly, they confirm that explicitly modeling dependence among cells can lead to significant improvement in model fitting.

6.4. Analysis of Results. Using the full dataset of $n = 859$ cells, we applied our Consensus approach to obtain a final estimated CPDAG $\hat{G}_{\text{full}}$. Recall that a CPDAG represents a Markov equivalence class and contains both directed (→) and undirected (−) edges.

To quantify the robustness of each inferred regulatory interaction, we performed a resampling-based bootstrap analysis. In each bootstrap iteration $t = 1, \ldots, B$, we sampled 50% of the cells (429 cells) with replacement and re-estimated the causal graph using the Consensus method given in Algorithm 2. Because the block structure may vary across subsets of the data, we re-estimated the block structure for each bootstrap sample $t$.
This procedure yielded a collection of estimated CPDAGs $\{\hat{G}^{(t)}\}_{t=1}^{B}$. For each ordered pair $(a, b)$, let

$I^{(t)}_{a \to b} = \begin{cases} 1, & \text{if } a \to b \in \hat{G}^{(t)}, \\ 0.5, & \text{if } a - b \in \hat{G}^{(t)}, \\ 0, & \text{otherwise}. \end{cases}$

Note that undirected edges contribute equally to both possible orientations. We aggregated $I^{(t)}_{a \to b}$ across the bootstrap graphs and accumulated counts for edges that appear in the full-data estimate $\hat{G}_{\text{full}}$. For a directed edge $a \to b \in \hat{G}_{\text{full}}$, we define its confidence score as

$\mathrm{Conf}(a \to b) = \frac{1}{B} \sum_{t=1}^{B} I^{(t)}_{a \to b}.$

Since an undirected edge $a - b$ in a CPDAG means that both orientations are possible, we define its confidence score as

$\mathrm{Conf}(a - b) = \mathrm{Conf}(a \to b) + \mathrm{Conf}(b \to a).$

We retained all directed edges with $\mathrm{Conf}(a \to b) \ge 0.5$ and undirected edges with $\mathrm{Conf}(a - b) \ge 0.5$ in the final graph. For an undirected edge $a - b \in \hat{G}_{\text{full}}$, we further compare the confidence of the two orientations. Specifically, if

(6.1)    $\frac{\mathrm{Conf}(a \to b)}{\mathrm{Conf}(b \to a)} \ge 3 \quad \text{or} \quad \frac{\mathrm{Conf}(b \to a)}{\mathrm{Conf}(a \to b)} \ge 3,$

we orient the edge in the dominant direction. Otherwise, the edge remains undirected in the final graph.

The resulting graph, $\hat{G}_{\text{cf}}$, contains 39 edges, among which 23 are directed and 16 are undirected. The resulting network is shown in Figure 7, where genes (nodes) without any edge connections are removed. Table 1 presents the subset of predicted edges with either $\mathrm{Conf}(a \to b) \ge 60\%$ or $\mathrm{Conf}(a - b) \ge 80\%$ in the final graph, along with literature support for some of the predicted edges. Top-ranked undirected edges capture tightly coupled regulators within the pluripotency network. For example, interactions among POU5F1, NANOG, and ZFP42 (Wang et al., 2006; Mitsui et al., 2003), as well as the feedback-loop relationship between NODAL and LEFTY1 (Juan and Hamada, 2001), are known regulatory relationships.
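The bootstrap scoring rule above can be sketched with adjacency matrices. The encoding is our own: a directed edge $a \to b$ sets entry $[a, b] = 1$, and an undirected edge sets both $[a, b]$ and $[b, a]$.

```python
import numpy as np

def edge_confidence(boot_graphs):
    """Bootstrap confidence scores Conf(a -> b) for all ordered pairs.

    Each bootstrap CPDAG is a p x p 0/1 matrix. Following the scoring
    rule: a directed edge contributes 1 to its own orientation, an
    undirected edge contributes 0.5 to both orientations.
    Returns conf[a, b] = Conf(a -> b) averaged over the B graphs.
    """
    scores = []
    for g in boot_graphs:
        g = np.asarray(g, dtype=float)
        undirected = (g == 1) & (g.T == 1)
        scores.append(np.where(undirected, 0.5, g))
    return np.mean(scores, axis=0)
```

The undirected-edge confidence is then $\mathrm{Conf}(a - b) = \texttt{conf[a, b]} + \texttt{conf[b, a]}$, and the 3:1 ratio rule in (6.1) compares the two entries directly.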
Several directed edges also align with known regulatory mechanisms, such as PRDM14 stabilizing NANOG-mediated pluripotency (Ma et al., 2011). DNMT3B and POU5F1 also have an experimentally established regulatory link, by which DNMT3B suppresses POU5F1 transcription (Liu et al., 2025), although our method estimated this as an undirected connection.

FIG 7. Gene regulatory network showing the predicted edges found using the bootstrap resampling analysis. Blue edges correspond to undirected edges and black edges correspond to directed edges.

TABLE 1. List of high-confidence predicted edges in the inferred gene regulatory network.

Edge                    Confidence %    Reference
POU5F1 − ZFP42          100.0           Wang et al. (2006)
NANOG − POU5F1          99.0            Mitsui et al. (2003)
LEFTY1 − NODAL          98.5            Juan and Hamada (2001)
PECAM1 − POU5F1         98.5
PECAM1 − ZFP42          97.5
DNMT3B − ZFP42          96.5
DNMT3B − LECT1          91.0
CER1 − EOMES            84.5
DNMT3B − POU5F1         84.5            Liu et al. (2025)
NODAL → PMAIP1          73.0
PAX3 → MAP2             72.5
PAX6 → PAX3             71.3
HAPLN1 → GNG11          66.5
HNF4A → FOXQ1           66.9
GDF3 → DNMT3B           66.3
DLK1 → PAX3             64.5
TNFSF10 → PECAM1        62.0
PRDM14 → NANOG          61.2            Ma et al. (2011)
MAPK10 → PAX3           61.0
CD34 → PECAM1           61.0
HAND1 → SOX2            60.0

Edges without direct literature support may represent unknown causal interactions, or may arise from indirect regulatory pathways rather than direct effects. The full list of edges in our estimated GRN is provided in the Supplementary Material.

7. Discussion and conclusion.

7.1. Summary. In this work, we introduced a principled framework for causal discovery on dependent mixed-type observational data.
To model the data, we formulated a latent-variable SEM in which every observed variable is generated from an underlying continuous latent process. Discrete variables arise through a thresholding mechanism applied to latent Gaussian variables, while continuous variables are observed directly. A central feature of the proposed model is cross-unit dependence via correlated exogenous noise terms. This induces dependence among samples that is orthogonal to the causal relationships encoded by a DAG, which operate across variables. These two sources of dependence, one across units and one across variables, complicate traditional causal discovery methods that assume independence across units.

This motivates our de-correlation approach, which leverages an estimated covariance matrix to transform the latent continuous data. After de-correlation, the samples are approximately independent while the underlying causal structure among variables is preserved. Given the de-correlated data, we may use any standard causal discovery method to estimate the underlying DAG. A key insight is that causal discovery algorithms do not need to be redesigned for dependent data; rather, the data dependence can be addressed upstream through an appropriate transformation.

Across simulated and real networks, our method consistently produced better estimated causal graphs than those estimated from the original data. Importantly, we showed that the improvement from our proposed framework is primarily due to the de-correlation of mixed data and is not limited to a specific class of causal discovery algorithms. The method is particularly effective in high-dimensional settings with $p > n$ and strong dependence among units.
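The core transformation can be checked numerically: multiplying a block of correlated units by the Cholesky factor of $\Sigma^{-1}$ removes the cross-unit correlation. A self-contained sketch with one Equal-correlation block; all names and sizes are our own choices.

```python
import numpy as np

def decorrelate(X, sigma):
    """Apply the de-correlation transform X_tilde = L^T X, where L is the
    Cholesky factor of sigma^{-1}, so that Cov(L^T eps) = I."""
    L = np.linalg.cholesky(np.linalg.inv(sigma))
    return L.T @ X

# Sanity check: correlated units become uncorrelated after the transform.
rng = np.random.default_rng(0)
n, p = 12, 5000                                   # 12 units in one Equal block
sigma = np.full((n, n), 0.6)
np.fill_diagonal(sigma, 1.0)
eps = np.linalg.cholesky(sigma) @ rng.standard_normal((n, p))
before = np.corrcoef(eps)                         # strong off-diagonal correlation
after = np.corrcoef(decorrelate(eps, sigma))      # approximately identity
```

The identity $\mathrm{Cov}(L^{\top} \varepsilon) = L^{\top} \Sigma L = I$ holds exactly when $\Sigma$ is known; with an estimated $\hat{\Sigma}$ the result is approximately uncorrelated, which is the working assumption of the framework.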
In the application to GRN learning from scRNA-seq data, we designed a bootstrap resampling method to quantify the stability of the inferred causal relationships between genes and predicted high-confidence interactions that aligned with the biological literature. We also showed through cross-validation that our method substantially outperformed direct application of causal discovery methods to single-cell data without de-correlation, supporting the notion of between-cell dependence in scRNA-seq data.

7.2. Future work. We outline several directions for future work. First, our method is designed for observational data and does not incorporate experimental interventions or known causal constraints. In many real-world applications, partial interventional data or prior knowledge is available and can substantially improve causal identifiability and estimation accuracy. Extending our framework to integrate interventional data or incorporate prior causal knowledge would enhance its applicability in practical settings. Second, our data-generating model and estimation procedure assume a linear structural equation model on the latent continuous data. While linear SEMs are widely used, they may be insufficient for capturing complex, potentially nonlinear interactions that arise in real-world systems. Generalizing our method to nonlinear SEMs is feasible but introduces several challenges. Suppose we replace Equation (2.2) with
$$x_{ij} = f_j(x_{i,\mathrm{PA}_j}, \varepsilon_{ij}),$$
where $f_j(\cdot)$ is an unknown nonlinear function. Without loss of generality, assuming the variables of $X$ are sorted according to a topological order of the underlying DAG, $X_j$ can be written as a function of the noise variables up to index $j$,
$$X_j = F_j(\varepsilon_1, \ldots, \varepsilon_j).$$
If we can apply a Cholesky de-correlation to the noise terms, then the transformed variables can be expressed as
$$\widetilde{X}_j = F_j(\widetilde{\varepsilon}_1, \ldots, \widetilde{\varepsilon}_j),$$
where $\widetilde{\varepsilon}_j = L^{\top} \varepsilon_j$ are independent. In principle, any nonlinear causal discovery method designed for independent data could be applied to $\{\widetilde{X}_j, j \in [p]\}$. Adapting our current framework to the nonlinear setting is a promising direction for future work.

Funding. This work was supported in part by NIH grant R01GM163245 and NSF grant DMS-2305631 (to QZ).

SUPPLEMENTARY MATERIAL

Web appendices
Appendices containing (a) convergence diagnostics for Algorithm 1, (b) additional simulations on estimation accuracy of thresholds, and (c) additional results and experiments for the scRNA-seq application.