Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Existing approaches, however, predominantly align entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments, individual body joints, and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments, body joints, and their corresponding text tokens. The code will be released upon acceptance.


💡 Research Summary

The paper tackles the problem of motion‑language retrieval, where the goal is to match natural‑language descriptions with human motion sequences. Existing methods largely rely on global alignment, treating an entire motion clip and its accompanying sentence as single holistic embeddings. While this approach captures coarse‑grained similarity, it ignores fine‑grained correspondences such as which body joint moves when a particular word is mentioned. The authors argue that human perception of motion follows a pyramidal process—starting from joint dynamics, moving to segment coherence, and finally to a holistic understanding—and they design a model that mirrors this hierarchy.

Key contributions

  1. Pyramidal Shapley‑Taylor (PST) framework – The model decomposes motion into joint‑level tokens (individual joint trajectories) and segment‑level tokens (temporal windows of joints). Text is tokenized into word‑level and phrase‑level units. Three alignment stages are performed sequentially: joint‑wise, segment‑wise, and holistic.
  2. Shapley‑Taylor Interaction (STI) – Building on the Shapley‑Taylor interaction concept, the authors define a metric that quantifies the marginal contribution of a joint‑text token pair when they are added together to a prefix set of tokens. This captures second‑order interactions, i.e., how much a specific joint and word reinforce each other beyond their individual effects.
  3. Lightweight STI estimator – Exact computation of STI would require enumerating all permutations of tokens, which is infeasible. The paper introduces a small neural head (convolution + self‑attention) trained via Monte‑Carlo sampling to approximate the STI distribution. A KL‑divergence loss (“STI distillation”) forces the estimator to mimic the true STI probabilities.
  4. Progressive training – The three pyramid levels are trained in a staged manner. The joint‑wise stage learns fine‑grained interaction maps; the segment‑wise stage compresses joint tokens into segment embeddings and aligns them with phrase embeddings; the holistic stage aggregates everything into global embeddings and applies a standard cosine similarity loss.
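As a toy illustration of contributions 2 and 4's Monte-Carlo approach, the second-order Shapley-Taylor interaction of a token pair can be estimated by sampling random prefix sets and averaging the discrete second difference of a value function. The value function below is a hypothetical additive score with a single pairwise bonus, standing in for the paper's cross-modal similarity; the estimate recovers that bonus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy value function over token subsets: additive per-token scores w,
# plus a pairwise interaction bonus c when tokens T and M co-occur.
# (Hypothetical stand-in for the paper's cross-modal similarity score.)
n_tokens = 8
w = rng.normal(size=n_tokens)
T, M, c = 2, 5, 1.5

def value(subset):
    v = w[list(subset)].sum() if subset else 0.0
    if T in subset and M in subset:
        v += c
    return v

def sti_monte_carlo(t, m, n_samples=200):
    """Monte-Carlo estimate of the second-order Shapley-Taylor interaction
    phi(t, m) = E_pi[ v(S U {t,m}) - v(S U {t}) - v(S U {m}) + v(S) ],
    where S is a random prefix of the remaining tokens."""
    others = [i for i in range(n_tokens) if i not in (t, m)]
    total = 0.0
    for _ in range(n_samples):
        perm = rng.permutation(others)
        cut = rng.integers(0, len(others) + 1)
        S = set(perm[:cut].tolist())
        total += (value(S | {t, m}) - value(S | {t})
                  - value(S | {m}) + value(S))
    return total / n_samples

print(sti_monte_carlo(T, M))  # recovers the interaction strength c = 1.5
```

Note that the additive per-token terms cancel in the second difference, so only the genuine pairwise interaction survives, which is exactly the property the STI metric exploits.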

Methodology details

  • Motion data are represented as a tensor of shape (L × J × 3), where L is the number of frames, J the number of joints, and 3 the Cartesian coordinates. A sliding window of fixed length creates S segments.
  • Text is encoded with a pretrained language model (e.g., BERT). Word tokens are kept for joint‑wise alignment; phrase tokens are extracted (via syntactic parsing or learned pooling) for segment‑wise alignment.
  • For a given joint token \(e_m\) and word token \(e_t\), the STI value is
    \(\phi(e_t, e_m) = \mathbb{E}_\pi \big[ v(S_\pi \cup \{e_t, e_m\}) - v(S_\pi \cup \{e_t\}) - v(S_\pi \cup \{e_m\}) + v(S_\pi) \big]\),
    where \(S_\pi\) is the prefix set of tokens preceding the pair under a random permutation \(\pi\) and \(v(\cdot)\) is the cross-modal value (similarity) function; the expectation is approximated by Monte-Carlo sampling over permutations.
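The segment decomposition in the first bullet can be sketched as follows; the window length, stride, and joint count here are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of the segment decomposition described above: a motion
# tensor of shape (L, J, 3) is cut into S overlapping temporal windows
# by a fixed-length sliding window.
def segment_motion(motion, win=16, stride=8):
    L = motion.shape[0]
    starts = range(0, L - win + 1, stride)
    return np.stack([motion[s:s + win] for s in starts])  # (S, win, J, 3)

motion = np.zeros((64, 22, 3))  # e.g. 64 frames, 22 joints (SMPL-like)
segments = segment_motion(motion)
print(segments.shape)  # (7, 16, 22, 3)
```

Each of the S segment tensors would then be encoded into a segment-level token for the segment-wise alignment stage.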
