📝 Original Info
- Title: Sparse Linear Identifiable Multivariate Modeling
- ArXiv ID: 1004.5265
- Date: 2011-06-24
- Authors: Ricardo Henao, Ole Winther
📝 Abstract
In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component delta-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling), and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/.
📄 Full Content
arXiv:1004.5265v3 [stat.ML] 23 Jun 2011
Journal of Machine Learning Research 12 (2011) 663-705
Submitted 10/09; Revised 10/10; Published 3/11
Sparse Linear Identifiable Multivariate Modeling
Ricardo Henao
rhenao@binf.ku.dk
Ole Winther
owi@imm.dtu.dk
DTU Informatics
Richard Petersens Plads, Building 321
Technical University of Denmark
DK-2800 Lyngby, Denmark
Bioinformatics Centre
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N, Denmark
Editor: Aapo Hyvärinen
Abstract
In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling), and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/.
Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
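The slab-and-spike prior named in the abstract can be illustrated with a minimal sampling sketch. This is an assumption-laden illustration, not the paper's implementation: the function name, the fixed inclusion probability `pi`, and the dimensions are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_slab_sample(shape, pi=0.3, slab_scale=1.0, rng=rng):
    """Draw a matrix from a two-component spike-and-slab prior: each entry
    is exactly zero (the delta-function spike) with probability 1 - pi and
    Gaussian with standard deviation slab_scale (the slab) otherwise."""
    mask = rng.random(shape) < pi             # Bernoulli inclusion indicators
    slab = rng.normal(0.0, slab_scale, shape)
    return mask * slab                        # zero out the excluded entries

# A sparse 5 x 3 loading matrix: most entries are exactly zero.
A = spike_slab_sample((5, 3), pi=0.3)
```

In the fully Bayesian hierarchy of the paper, quantities playing the role of `pi` and the slab scale are themselves given priors and inferred rather than fixed as above.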
1. Introduction
Modeling and interpretation of multivariate data are central themes in machine learning. Linear latent variable models (or factor analysis) and linear directed acyclic graphs (DAGs) are prominent examples of models for continuous multivariate data. In factor analysis, data is modeled as a linear combination of independently distributed factors, thus allowing a rich underlying co-variation structure to be captured. In the DAG model, each variable is expressed as a regression on a subset of the remaining variables, with the constraint that the total connectivity is acyclic so that the joint distribution is properly defined. Parsimonious (interpretable) modeling, using a sparse factor loading matrix or restricting the number of parents of a node in a DAG, is a good prior assumption in many applications. Recently, there has been a great deal of interest in detailed modeling of sparsity in factor models, for example in the context of gene expression data analysis (West, 2003; Lucas et al., 2006; Knowles and Ghahramani, 2007; Thibaux and Jordan, 2007; Carvalho et al., 2008; Rai and Daumé III, 2009). Sparsity arises, for example, in gene regulation because the latent factors represent driving signals for gene regulatory sub-networks and/or transcription factors, each of which only includes/affects a limited number of genes. A parsimonious DAG is particularly attractive from an interpretation point of view, but the restriction to only having observed variables in the model may be a limitation because one rarely measures all relevant variables. Furthermore, linear relationships might be unrealistic, for example in gene regulation, where it is generally accepted that one cannot replace the driving signal (related to the concentration of a transcription factor protein in the cell nucleus) with the measured concentration of the corresponding mRNA. Bayesian networks represent a very general class of models, encompassing both observed and latent variables. In many situations it will thus be relevant to learn parsimonious Bayesian networks with both latent variable and non-linear DAG parts. Although attractive, by being closer to what one may expect in practice, such modeling is complicated by difficult inference (Chickering (1996) showed that DAG structure learning is NP-hard) and by potential non-identifiability. Identifiability means that each setting of the parameters defines a unique distribution of the data. Clearly, if the model is not identifiable in the DAG and latent parameters, this severely limits the interpretability of the learned model.
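The two model classes contrasted above can be sketched generatively. The following numpy snippet is a hypothetical illustration under assumed names and dimensions; the Laplace draws simply stand in for some non-Gaussian choice of factor and noise distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 4, 2, 100                          # observed dims, factors, samples

# Factor model: x = C f + noise, with independently distributed
# (here heavy-tailed Laplace, hence non-Gaussian) latent factors f.
C = rng.normal(size=(d, m))                  # factor loading matrix
f = rng.laplace(size=(m, n))                 # latent factors
x_fa = C @ f + 0.1 * rng.normal(size=(d, n))

# Linear DAG: each variable is a regression on its predecessors under some
# ordering, x = B x + e, with B strictly lower triangular (acyclicity),
# so data can be generated by solving (I - B) x = e.
B = np.tril(rng.normal(size=(d, d)), k=-1)   # acyclic connectivity matrix
e = rng.laplace(size=(d, n))                 # independent driving noise
x_dag = np.linalg.solve(np.eye(d) - B, e)
```

Making `C` and `B` sparse (e.g., via the slab-and-spike prior of the abstract) is what the parsimony assumption amounts to in both cases.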
Shimizu et al. (2006) provided the important insight that every DAG has a factor model representation, i.e., the connectivity matrix of a DAG gives rise to a triangular mixing matrix in the factor model. This provided the motivation for the Linear Non-Gaussian Acyclic Model (LiNGAM) algorithm which solves the identifiable fa
…(Full text truncated)…
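The triangularity insight attributed to Shimizu et al. above can be checked numerically. This is a small numpy sketch with illustrative dimensions, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
# Strictly lower-triangular connectivity matrix of a linear DAG, i.e. the
# variables are already listed in a causal ordering.
B = np.tril(rng.normal(size=(d, d)), k=-1)

# The DAG x = B x + e rewrites as the factor model x = A e with
# mixing matrix A = (I - B)^{-1}.
A = np.linalg.inv(np.eye(d) - B)

# A is again lower triangular with unit diagonal: a linear DAG is a
# constrained factor model whose "factors" are the driving signals e.
assert np.allclose(np.triu(A, k=1), 0.0)
assert np.allclose(np.diag(A), 1.0)
```

Non-Gaussianity of `e` is what makes the mixing matrix, and hence the variable ordering, recoverable from data in this family of models.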
Reference
This content is AI-processed based on ArXiv data.