MARADONER: Motif Activity Response Analysis Done Right
Inferring the activities of transcription factors from high-throughput transcriptomic or open chromatin profiling, such as RNA-/CAGE-/ATAC-Seq, is a long-standing challenge in systems biology. Identification of highly active master regulators enables mechanistic interpretation of differential gene expression, chromatin state changes, or perturbation responses across conditions, cell types, and diseases. Here, we describe MARADONER, a statistical framework and its software implementation for motif activity response analysis (MARA), utilizing the sequence-level features obtained with pattern matching (motif scanning) of individual promoters and promoter- or gene-level activity or expression estimates. Compared to the classic MARA, MARADONER (MARA-done-right) employs an unbiased variance parameter estimation and a bias-adjusted likelihood estimation of fixed effects, thereby enhancing goodness-of-fit and the accuracy of activity estimation. Further, MARADONER is capable of accounting for heteroscedasticity of motif scores and activity estimates.
💡 Research Summary
The manuscript introduces MARADONER, a novel statistical framework that refines the classic Motif Activity Response Analysis (MARA) methodology for inferring transcription factor (TF) activities from high‑throughput transcriptomic and chromatin accessibility data such as RNA‑Seq, CAGE‑Seq, and ATAC‑Seq. While traditional MARA models gene expression (or chromatin signal) as a linear combination of motif occurrence counts, it suffers from several well‑known statistical shortcomings: biased variance parameter estimates when variance and mean are jointly estimated by maximum likelihood, inefficient fixed‑effect estimation (often ordinary least squares, OLS, instead of generalized least squares, GLS), and unrealistic assumptions such as zero‑mean TF activities that obscure the distinction between activators and repressors.
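The linear model at the heart of classic MARA can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the matrix names `M`, `A_true`, and `E`, the toy dimensions, and the Poisson and Gaussian generators are all hypothetical choices made for illustration. It fits activities by ordinary least squares, the baseline estimator the summary contrasts with GLS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration only).
n_promoters, n_motifs, n_samples = 200, 3, 4

# Hypothetical motif score matrix M: one row per promoter, one column per motif.
M = rng.poisson(1.0, size=(n_promoters, n_motifs)).astype(float)

# Hypothetical ground-truth activities A: one row per motif, one column per sample.
A_true = rng.normal(0.0, 1.0, size=(n_motifs, n_samples))

# Expression is modeled as a linear combination of motif scores plus noise.
E = M @ A_true + rng.normal(0.0, 0.1, size=(n_promoters, n_samples))

# Ordinary least squares estimate of the activities, solved for all samples at once.
A_ols, *_ = np.linalg.lstsq(M, E, rcond=None)

# Largest absolute deviation between estimated and simulated activities.
max_abs_error = float(np.max(np.abs(A_ols - A_true)))
print(A_ols.shape, max_abs_error)
```

With homoscedastic noise, as simulated here, OLS and GLS coincide; the two differ only once the noise covariance departs from a scaled identity.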
MARADONER addresses these issues through three core innovations. First, it employs restricted maximum likelihood (REML) for variance component estimation, thereby eliminating the bias inherent in joint ML estimation. Fixed effects are then estimated via GLS, yielding best linear unbiased estimators (BLUE) for TF activities. Second, rather than using a conventional centering matrix to remove means, MARADONER adopts an orthogonal‑complement operator based on the Helmert matrix. This transformation simultaneously orthogonalizes the data and design matrices, preserving the full rank of the model without imposing a zero‑mean constraint. The authors provide rigorous proofs of semi‑orthogonality, invertibility, and determinant properties for this operator, ensuring numerical stability. Third, the framework explicitly models heteroscedasticity by allowing motif‑specific and promoter‑specific variance matrices, which is especially beneficial in small‑sample regimes where variance heterogeneity can otherwise dominate the inference.
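The link between a Helmert-based orthogonal complement and unbiased variance estimation can be checked numerically. The sketch below uses the standard Helmert contrast matrix; whether the paper's operator is constructed in exactly this way is an assumption, and the function name `helmert_complement` is hypothetical. For a mean-only model, applying plain maximum likelihood to the transformed data gives the variance estimate with the n-1 divisor, which is the REML result in that simple case.

```python
import numpy as np

def helmert_complement(n: int) -> np.ndarray:
    """Return the (n-1) x n Helmert contrast matrix.

    Rows are orthonormal and each is orthogonal to the all-ones vector.
    """
    H = np.zeros((n - 1, n))
    for k in range(1, n):
        H[k - 1, :k] = 1.0            # k leading ones
        H[k - 1, k] = -float(k)       # followed by -k, so the row sums to zero
        H[k - 1] /= np.sqrt(k * (k + 1))  # normalize the row to unit length
    return H

n = 6
H = helmert_complement(n)
ones = np.ones(n)

# The usual centering matrix I - 11^T / n, for comparison with H^T H.
centering = np.eye(n) - np.outer(ones, ones) / n

rng = np.random.default_rng(1)
y = 5.0 + rng.normal(0.0, 2.0, size=n)  # toy data with a nonzero mean

z = H @ y                              # n-1 contrasts, free of the mean
var_ml = float(np.mean((y - y.mean()) ** 2))  # joint ML: divides by n (biased)
var_reml = float(z @ z / (n - 1))      # ML on transformed data: divides by n-1

print(np.allclose(H @ H.T, np.eye(n - 1)), np.allclose(H.T @ H, centering))
print(var_ml, var_reml)
```

Because `H @ H.T` is the identity while `H.T @ H` equals the centering matrix, the transformation removes the mean without producing the rank-deficient covariance that multiplying by the centering matrix itself would.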
Algorithmically, MARADONER avoids a monolithic joint optimization that would be computationally prohibitive for genome‑scale data. Instead, it iteratively updates variance components and fixed effects in a coordinate‑descent fashion, leveraging matrix identities to keep each sub‑step cheap. The implementation also includes optional modules for motif clustering (to reduce dimensionality), Bayesian posterior testing, maximum‑a‑posteriori (MAP) activity estimation, and downstream gene regulatory network (GRN) construction via statistically significant TF‑gene pair identification.
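The idea of alternating between variance components and fixed effects can be sketched on a toy heteroscedastic regression. This is not MARADONER's algorithm: the function `alternating_fit`, the two-group noise structure, and the plain maximum-likelihood variance update (rather than REML) are all assumptions chosen to keep the example short.

```python
import numpy as np

def alternating_fit(X, y, groups, n_iter=20):
    """Alternate a GLS update of the fixed effects with a per-group
    update of the noise variances (plain ML, not REML)."""
    group_ids = np.unique(groups)
    variances = {g: 1.0 for g in group_ids}  # start from equal variances
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Step 1: with a diagonal covariance, GLS is weighted least squares.
        w = np.array([1.0 / variances[g] for g in groups])
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        # Step 2: refresh each group's variance from its current residuals.
        resid = y - X @ beta
        for g in group_ids:
            variances[g] = float(np.mean(resid[groups == g] ** 2))
    return beta, variances

# Simulated data: two groups of observations with very different noise levels.
rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
groups = np.repeat([0, 1], n // 2)
sigma = np.where(groups == 0, 0.2, 2.0)
beta_true = np.array([1.5, -0.7])
y = X @ beta_true + rng.normal(size=n) * sigma

beta_hat, var_hat = alternating_fit(X, y, groups)
print(beta_hat, var_hat)
```

Each sub-step has a closed form, which is what makes this kind of alternating scheme cheaper than optimizing all parameters jointly.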
The authors validate MARADONER on both synthetic benchmarks and real datasets. In synthetic experiments, where ground‑truth TF activities are known, MARADONER achieves markedly lower mean‑squared error and higher Pearson correlation with the true activities compared to classic MARA, particularly when the number of samples is limited (<10). In real‑world applications, the method successfully recovers established master regulators (e.g., NF‑κB, p53) from ATAC‑Seq and CAGE‑Seq data and proposes additional candidate TFs that are biologically plausible. Statistical tests based on asymptotic properties of the REML estimator provide confidence intervals and p‑values for each inferred activity, facilitating hypothesis generation for experimental validation.
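The two benchmark metrics mentioned above, mean-squared error and Pearson correlation against known activities, can be computed as follows. The helper `activity_metrics` and the example vectors are hypothetical and are not taken from the paper's benchmarks.

```python
import numpy as np

def activity_metrics(true_act, est_act):
    """Return (mean-squared error, Pearson correlation) between two activity vectors."""
    true_act = np.asarray(true_act, dtype=float).ravel()
    est_act = np.asarray(est_act, dtype=float).ravel()
    mse = float(np.mean((est_act - true_act) ** 2))
    pearson = float(np.corrcoef(true_act, est_act)[0, 1])
    return mse, pearson

# Made-up activities for four motifs, with one close and one poor estimate.
true_act = [0.5, -1.0, 2.0, 0.0]
good_est = [0.6, -0.9, 1.8, 0.1]
poor_est = [1.5, 0.5, 0.5, -1.0]

mse_good, r_good = activity_metrics(true_act, good_est)
mse_poor, r_poor = activity_metrics(true_act, poor_est)
print(mse_good, r_good)
print(mse_poor, r_poor)
```

The two metrics are complementary: MSE penalizes errors in scale and offset, while Pearson correlation only measures whether the estimates track the true activities linearly.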
MARADONER is released as an open‑source R package (hosted on GitHub), with modular code that supports custom extensions, parallel computation, and optional GPU acceleration. Documentation includes a step‑by‑step tutorial, example pipelines, and guidance on integrating the tool into existing genomic analysis workflows. In summary, MARADONER delivers unbiased variance estimation, bias‑adjusted likelihood for fixed effects, heteroscedastic modeling, and computational efficiency, representing a substantial methodological advance over existing MARA‑based tools and offering a robust platform for deciphering transcriptional regulatory mechanisms across diverse biological contexts.