Latent space models are widely used for analyzing high-dimensional discrete data matrices, such as patient-feature matrices in electronic health records (EHRs), by capturing complex dependence structures through low-dimensional embeddings. However, estimation becomes challenging in the imbalanced regime, where one matrix dimension is much larger than the other. In EHR applications, cohort sizes are often limited by disease prevalence or data availability, whereas the feature space remains extremely large due to the breadth of medical coding systems. Motivated by the increasing availability of external semantic embeddings, such as pre-trained embeddings of clinical concepts in EHRs, we propose a knowledge-embedded latent projection model that leverages semantic side information to regularize representation learning. Specifically, we model column embeddings as smooth functions of semantic embeddings via a mapping in a reproducing kernel Hilbert space. We develop a computationally efficient two-step estimation procedure that combines semantically guided subspace construction via kernel principal component analysis with scalable projected gradient descent. We establish estimation error bounds that characterize the trade-off between statistical error and approximation error induced by the kernel projection. Furthermore, we provide local convergence guarantees for our non-convex optimization procedure. Extensive simulation studies and a real-world EHR application demonstrate the effectiveness of the proposed method.
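The two-step procedure described in the abstract can be sketched as follows. This is a minimal illustration under assumed choices — a Gaussian kernel, a plain logistic link, and all function names, dimensions, and step sizes are our own — not the paper's actual implementation.

```python
import numpy as np

def kpca_subspace(S, q, gamma=1.0):
    """Step 1: build a q-dimensional subspace from semantic embeddings
    S (p x d) via kernel PCA (Gaussian kernel assumed here)."""
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-gamma * sq)
    p = K.shape[0]
    H = np.eye(p) - np.ones((p, p)) / p                  # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)                      # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:q]                     # top-q components
    return vecs[:, idx]                                  # p x q basis

def fit_projected(X, B, r, steps=200, lr=0.1):
    """Step 2: projected gradient descent for a logistic low-rank model,
    logit P(X_ij = 1) = (U V^T)_ij, with column embeddings V constrained
    to the semantic subspace col(B) via V = B W."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n, r))                # row embeddings
    W = 0.1 * rng.standard_normal((B.shape[1], r))       # subspace coefficients
    for _ in range(steps):
        V = B @ W
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))             # predicted probabilities
        G = P - X                                        # gradient of logistic loss in logits
        U -= lr * (G @ V) / p
        W -= lr * (B.T @ (G.T @ U)) / n                  # chain rule through V = B W
    return U, B @ W
```

Constraining V to lie in the kernel-PCA subspace reduces the number of free column parameters from p × r to q × r, which is the mechanism by which semantic side information stabilizes estimation when n ≪ p.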
High-dimensional discrete data arise in many domains, including variant mutation data in genomics, electronic health records (EHRs) in clinical settings, and bipartite relational data in social science (Ma et al., 2023; Li et al., 2020; Wu et al., 2024). These data can be represented as asymmetric matrices, where the two dimensions correspond to different types of entities. Examples include patient-clinical feature matrices in EHR data, patient-variant mutation matrices in genomic studies, and paper-author matrices in co-authorship data. Despite their high dimensionality, such data often exhibit structured dependence patterns across rows and columns that can be explained by a relatively small number of underlying latent factors. A natural and widely used approach in this setting is latent space modeling via representation learning (Van Den Oord et al., 2017; Kopf and Claassen, 2021; Lavrač et al., 2021). These methods assign low-dimensional embeddings to row and column entities and model observed discrete outcomes through interactions between the corresponding latent representations. This modeling perspective provides a flexible yet interpretable framework for capturing complex dependency structures in high-dimensional discrete data, while simultaneously yielding low-dimensional embeddings useful for downstream tasks such as visualization, clustering of similar entities, risk profiling, and missing data imputation (Hoff et al., 2002; Chen et al., 2018; Ma et al., 2020; Wu et al., 2024).

Despite their success in many applications, existing latent space modeling approaches often fall short in a challenging yet common real-world data regime characterized by imbalanced matrix dimensions, where one dimension is much larger than the other. To ground our discussion, we use EHR data as a running example throughout this work, where the number of clinical features can far exceed the number of patients.
The EHR Example. Commonly used EHR data naturally take the form of high-dimensional binary observations, where each patient record consists of a large collection of clinical features that are either present or absent. These features include diagnosis and procedure codes, medication prescriptions, laboratory test indicators, and clinical concepts extracted from free-text notes (Choi et al., 2018; Yu et al., 2015). Such data can be represented as a binary patient-feature matrix, where rows correspond to patients and columns correspond to clinical features, and an entry indicates whether a particular feature occurs in a patient’s medical history. Due to the extensive granularity of modern medical coding systems, many EHR features encode overlapping or closely related clinical semantics.
Representing clinical features using low-dimensional vector embeddings has been shown to be effective for capturing this redundancy (Choi et al., 2018; Hong et al., 2021; Liu et al., 2024; Gan et al., 2025). However, in many EHR applications, particularly rare-disease and disease-specific cohort studies, there exists an imbalance between the number of patients and the number of clinical features. Cohort sizes are often limited by disease prevalence, inclusion criteria, or data availability, whereas the feature space remains extremely large due to the breadth of possible medical codes and extracted concepts. As a result, the patient-feature matrix is highly imbalanced.
In this imbalanced regime, accurate estimation of latent space embeddings for both rows and columns becomes challenging. For the standard generalized latent factor model (GLFM), the average estimation error scales as O((n + p)/(np)) up to logarithmic factors, where n is the number of rows and p is the number of columns (Chen et al., 2019; Wang, 2022; Chen et al., 2023). This suggests that, when n ≪ p in the imbalanced regime, the estimation error is dominated by the limited sample size n. This challenge is further compounded by the pervasive sparsity of observations in high-dimensional binary matrix data (Wu et al., 2024). In EHR data, only a small subset of all possible clinical features appears in each patient’s record. This leads to a binary matrix that is both highly imbalanced and sparse, with most entries equal to zero.
At the same time, a key characteristic of the motivating example is that the column entities correspond to semantic objects, rather than abstract indices. In EHR data, clinical features such as diagnosis and procedure codes can be linked to pretrained embeddings learned from large biomedical corpora or large-scale observational data, which encode clinically meaningful relationships among features (Hong et al., 2021; Gan et al., 2025).
This availability of semantic side information presents an opportunity to regularize latent structure estimation: semantically similar features or tasks should have similar latent embeddings. Leveraging this structure can help mitigate the estimation challenge posed by imbalance and sparsity.
Motivated by this observation, we propose a knowledge-embedded latent projection model that leverages semantic side information to regularize latent representation learning.