Forest-Guided Semantic Transport for Label-Supervised Manifold Alignment
Label-supervised manifold alignment bridges the gap between unsupervised and correspondence-based paradigms by leveraging shared label information to align multimodal datasets. Still, most existing methods rely on Euclidean geometry to model intra-domain relationships. This approach can fail when features are only weakly related to the task of interest, leading to noisy, semantically misleading structure and degraded alignment quality. To address this limitation, we introduce FoSTA (Forest-guided Semantic Transport Alignment), a scalable alignment framework that leverages forest-induced geometry to denoise intra-domain structure and recover task-relevant manifolds prior to alignment. FoSTA builds semantic representations directly from label-informed forest affinities and aligns them via fast, hierarchical semantic transport, capturing meaningful cross-domain relationships. Extensive comparisons with established baselines demonstrate that FoSTA improves correspondence recovery and label transfer on synthetic benchmarks and delivers strong performance in practical single-cell applications, including batch correction and biological conservation.
💡 Research Summary
Forest‑guided Semantic Transport Alignment (FoSTA) tackles the weakly supervised manifold alignment problem, where two heterogeneous domains share a common label space but lack explicit point‑wise correspondences. Traditional label‑supervised methods rely on Euclidean kernels to model intra‑domain geometry, which becomes unreliable when many features are irrelevant to the task or when high‑dimensional noise dominates. FoSTA replaces this fragile geometry with label‑informed Random Forest (RF) proximities (specifically, an extension of the RF‑GAP measure), which remain well defined even when only a subset of samples is labeled.
The method proceeds in several stages. First, a Random Forest is trained on the labeled subset of each domain. Using out‑of‑bag and in‑bag leaf co‑occurrence statistics, a directed proximity p*_l(x_i, x_j) is computed for any query‑target pair, including unlabeled‑to‑labeled interactions. Asymmetry is resolved by symmetrizing the matrix, and unlabeled‑to‑unlabeled proximities are inferred via the expected bootstrap behavior of trees, yielding a full semi‑supervised affinity matrix W* for each domain. These matrices capture task‑relevant neighborhoods while suppressing noise.
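A minimal sketch of the forest-affinity idea using scikit-learn leaf co-occurrence. This is a simplification of the paper's directed, OOB-weighted RF-GAP-style proximities: it trains on the labeled subset, then scores any pair of samples (labeled or not) by the fraction of trees in which they land in the same leaf. The function name and toy data are illustrative, not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_affinity(X, labeled_idx, y_labels, n_trees=100, seed=0):
    """Simplified forest proximity: fraction of trees in which two samples
    share a leaf. The paper's RF-GAP-style measure additionally weights by
    in-bag / out-of-bag membership; that refinement is omitted here."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X[labeled_idx], y_labels)       # trained only on labeled points
    leaves = rf.apply(X)                   # (n_samples, n_trees) leaf ids
    n = X.shape[0]
    W = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        W += leaves[:, t][:, None] == leaves[:, t][None, :]
    W /= leaves.shape[1]
    return 0.5 * (W + W.T)                 # symmetrize (needed for the directed variant)

# toy usage: 30 samples, 20 of them labeled
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
labeled = np.arange(20)
y = (X[labeled, 0] > 0).astype(int)
W = forest_affinity(X, labeled, y)
```

Note that `rf.apply` scores all 30 samples, including the 10 unlabeled ones, which is what makes the affinity semi-supervised.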
Next, class‑wise semantic profiles are built by aggregating affinities for each label, and every sample is projected into a shared ℓ₂‑normalized semantic space. The cross‑domain transport cost is defined as the cosine distance between semantic vectors, which is far cheaper to compute than full pairwise Euclidean or Gromov‑Wasserstein costs.
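The class-wise profiles and cosine cost can be sketched as follows, assuming the natural reading of "aggregating affinities for each label": each sample's profile is its total affinity to the labeled samples of every class, ℓ₂-normalized so that cosine distance reduces to one minus a dot product. Function names are illustrative.

```python
import numpy as np

def semantic_profiles(W, labels, labeled_idx, n_classes):
    """Per-sample semantic vector: summed affinity to labeled samples of
    each class, then l2-normalized so rows live on the unit sphere."""
    S = np.zeros((W.shape[0], n_classes))
    for c in range(n_classes):
        cols = labeled_idx[labels == c]
        S[:, c] = W[:, cols].sum(axis=1)
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    return S / np.maximum(norms, 1e-12)

def cosine_cost(SA, SB):
    # for unit vectors, cosine distance is C[i, j] = 1 - <SA_i, SB_j>
    return 1.0 - SA @ SB.T

# toy usage: one domain with 8 samples, 6 labeled across 3 classes
rng = np.random.default_rng(1)
WA = rng.random((8, 8)); WA = 0.5 * (WA + WA.T)
labeled = np.arange(6)
labs = np.array([0, 0, 1, 1, 2, 2])
SA = semantic_profiles(WA, labs, labeled, 3)
C = cosine_cost(SA, SA)
```

Because the profiles have only `n_classes` dimensions, this cost is far cheaper than any construction over the raw feature spaces.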
FoSTA then employs Hierarchical Refinement (HiRef), a fast implicit optimal transport solver that leverages the hierarchical structure of the semantic space to estimate a bijective transport plan T without constructing the dense cost matrix. HiRef iteratively refines matches at coarser clusters before moving to finer granularity, achieving near‑linear time complexity.
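A toy two-level analogue of coarse-to-fine matching (not the HiRef implementation): cluster both semantic spaces, match clusters on centroid costs, then solve small assignment problems only within matched cluster pairs. The dense cost matrix is never formed; each block is at most cluster-sized.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def hierarchical_match(SA, SB, n_clusters=4, seed=0):
    """Two-level coarse-to-fine matching sketch. With unequal cluster
    sizes, linear_sum_assignment on the rectangular block matches the
    smaller side and leaves the surplus points unmatched."""
    ka = KMeans(n_clusters, random_state=seed, n_init=10).fit(SA)
    kb = KMeans(n_clusters, random_state=seed, n_init=10).fit(SB)
    # coarse level: assign clusters via cosine-style cost between centroids
    Cc = 1.0 - ka.cluster_centers_ @ kb.cluster_centers_.T
    ra, rb = linear_sum_assignment(Cc)
    pairs = []
    for ca, cb in zip(ra, rb):
        ia = np.where(ka.labels_ == ca)[0]
        ib = np.where(kb.labels_ == cb)[0]
        C = 1.0 - SA[ia] @ SB[ib].T        # small within-cluster block
        r, c = linear_sum_assignment(C)
        pairs.extend(zip(ia[r], ib[c]))
    return pairs

# toy usage: matching a unit-normalized point set against a copy of itself
rng = np.random.default_rng(2)
SA = rng.normal(size=(20, 5))
SA /= np.linalg.norm(SA, axis=1, keepdims=True)
pairs = hierarchical_match(SA, SA.copy(), n_clusters=4, seed=0)
```

Two levels already cut the cost from one 20×20 problem to four roughly 5×5 problems; recursing over more levels is what gives the near-linear scaling described above.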
The transport plan is used to propagate correspondences and construct cross‑domain affinity blocks W^{AB} and W^{BA}. Together with the intra‑domain blocks W^{A} and W^{B}, these form a joint block matrix W = [[W^{A}, W^{AB}], [W^{BA}, W^{B}]], a single affinity structure over both domains from which an aligned joint representation can be derived.
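The block assembly is mechanically simple; a sketch with numpy, under the assumption (not spelled out in the text) that the transport plan T itself serves as the cross-domain block, with its transpose on the other side:

```python
import numpy as np

def joint_affinity(WA, WB, T):
    """Assemble the joint block affinity from intra-domain blocks WA
    (nA x nA), WB (nB x nB) and a transport plan T (nA x nB) used as the
    cross block. Any rescaling of T to match the intra-domain affinity
    scale is omitted in this sketch."""
    return np.block([[WA, T], [T.T, WB]])

# toy usage: 3 samples in domain A, 4 in domain B
WA = np.eye(3)
WB = np.eye(4)
T = np.full((3, 4), 0.1)
W = joint_affinity(WA, WB, T)
```

If WA and WB are symmetric, the joint matrix is symmetric by construction, which is what downstream spectral or diffusion-based embeddings typically require.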