Latent Semantic Learning with Structured Sparse Representation for Human Action Recognition
This paper proposes a novel latent semantic learning method for extracting high-level features (i.e., latent semantics) from a large vocabulary of abundant mid-level features (i.e., visual keywords) with structured sparse representation, which can help to bridge the semantic gap in the challenging task of human action recognition. To discover the manifold structure of mid-level features, we develop a spectral embedding approach to latent semantic learning based on an L1-graph, without the need to tune any parameter for graph construction, a key step of manifold learning. More importantly, we construct the L1-graph with structured sparse representation, which can be obtained by structured sparse coding with its structured sparsity ensured by novel L1-norm hypergraph regularization over mid-level features. In the new embedding space, we learn latent semantics automatically from abundant mid-level features through spectral clustering. The learnt latent semantics can be readily used for human action recognition with SVM by defining a histogram intersection kernel. Different from the traditional latent semantic analysis based on topic models, our latent semantic learning method can explore the manifold structure of mid-level features in both L1-graph construction and spectral embedding, which results in compact but discriminative high-level features. The experimental results on the commonly used KTH action dataset and the unconstrained YouTube action dataset show the superior performance of our method.
💡 Research Summary
The paper introduces a novel latent semantic learning framework designed to bridge the semantic gap in human action recognition by converting a large vocabulary of visual keywords (mid‑level features) into a compact set of high‑level latent semantics. Traditional bag‑of‑words (BOW) pipelines rely on thousands of visual words, leading to high computational cost and redundancy, while topic‑model approaches such as PLSA or LDA capture only co‑occurrence statistics and ignore the geometric manifold structure of the feature space.
To address these shortcomings, the authors propose a two‑stage manifold‑aware method. First, each visual keyword is represented as an N‑dimensional vector of its occurrence counts across the training videos. Using L1‑norm sparse coding, each keyword vector is linearly reconstructed from the remaining keywords, yielding a sparse coefficient vector whose nonzero entries directly encode similarity between keywords. Collecting these coefficients for all keywords forms an L1‑graph whose edge weights come straight from the sparse codes, so no Gaussian kernel bandwidth has to be tuned as in previous graph‑based methods.
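The L1-graph step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it solves each keyword's Lasso problem with a plain ISTA loop, and the function names `ista_lasso` and `l1_graph` are invented for this sketch. Absolute sparse coefficients serve as edge weights, and the matrix is symmetrized for the later spectral step:

```python
import numpy as np

def ista_lasso(D, x, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 via ISTA."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)               # gradient of the smooth term
        z = a - g / L                       # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def l1_graph(X, lam=0.1):
    """Build an L1-graph over the columns of X (one column per keyword):
    row i holds keyword i's sparse codes over all remaining keywords."""
    M = X.shape[1]                          # number of visual keywords
    W = np.zeros((M, M))
    for i in range(M):
        idx = [j for j in range(M) if j != i]
        a = ista_lasso(X[:, idx], X[:, i], lam)
        W[i, idx] = np.abs(a)               # nonnegative edge weights
    return np.maximum(W, W.T)               # symmetrize for spectral embedding
```

Note that the Lasso weight `lam` is still a regularization constant; the "parameter-free" claim refers to graph construction not needing a similarity-kernel bandwidth.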
Second, the authors enrich the sparsity pattern with a novel L1‑norm hypergraph regularization. A hypergraph is built in which each video constitutes a hyperedge connecting all keywords that appear in it. The hyperedge weights are derived from the original cluster centers of the keywords, so no hand‑tuned parameters are introduced. The L1‑norm regularization term defined over this hypergraph enforces structured sparsity that respects the underlying manifold geometry while preserving the benefits of plain L1 regularization. This contrasts with conventional Laplacian regularization, whose quadratic L2 form does not induce sparsity and is therefore less suited to L1‑graph construction.
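To make the hypergraph machinery concrete, the sketch below builds the standard normalized hypergraph Laplacian (in the style of Zhou et al.) from a keyword-by-video incidence matrix, a common building block for hypergraph regularizers. It is an assumption of this sketch, not the paper's exact L1-norm regularizer, and the function name `hypergraph_laplacian` is hypothetical:

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Normalized hypergraph Laplacian L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}.
    H: (n_keywords, n_videos) incidence matrix (H[i, j] = 1 if keyword i
    appears in video j); w: one weight per hyperedge (video)."""
    d_v = H @ w                               # weighted vertex degrees
    d_e = H.sum(axis=0)                       # hyperedge cardinalities
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d_v, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(d_e, 1e-12))
    Theta = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.eye(H.shape[0]) - Theta
```

The resulting matrix is symmetric positive semidefinite, so it can serve as the smoothness operator inside a regularized sparse-coding objective.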
With the structured sparse L1‑graph in hand, the method performs spectral embedding by computing the eigenvectors of the graph Laplacian. The resulting low‑dimensional embedding captures intrinsic relationships among keywords: those belonging to similar actions cluster together. A subsequent k‑means clustering on the embedded points yields K clusters, each interpreted as a latent semantic (high‑level feature). Unlike probabilistic topics, these semantics are derived from geometric proximity on the manifold, making them both compact and discriminative.
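The embedding-plus-clustering stage can be sketched as below, assuming a symmetric affinity matrix `W` (e.g. from the L1-graph). This is a generic spectral-clustering recipe, not the paper's code; `spectral_latent_semantics` is an invented name, and plain k-means with deterministic farthest-point initialization stands in for whatever variant the authors used:

```python
import numpy as np

def spectral_latent_semantics(W, K, n_iter=50):
    """Embed keywords with the K smallest eigenvectors of the normalized
    graph Laplacian, then group them into K latent semantics via k-means."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    Y = vecs[:, :K]                          # spectral embedding of keywords
    centers = [Y[0]]                         # farthest-point initialization
    for _ in range(1, K):
        dists = np.min([((Y - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):                  # plain k-means iterations
        labels = np.argmin(((Y[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Y[labels == k].mean(axis=0)
    return labels                            # latent-semantic index per keyword
```

Each keyword's label is its latent semantic; a video's mid-level histogram is then pooled by label into a K-bin high-level histogram.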
For classification, each video is represented as a histogram over the learned latent semantics. The authors employ a histogram intersection kernel within a support vector machine (SVM) framework, which efficiently exploits the additive nature of the histogram representation.
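The histogram intersection kernel itself is straightforward to compute; a minimal sketch (function name `hist_intersection_kernel` is invented here) is:

```python
import numpy as np

def hist_intersection_kernel(A, B):
    """Histogram intersection kernel between rows of A and rows of B.
    A: (n, K) histograms, B: (m, K) histograms -> (n, m) Gram matrix,
    where entry (i, j) = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=-1)
```

The resulting Gram matrix can be fed to any kernel SVM that accepts a precomputed kernel (for instance, scikit-learn's `SVC(kernel='precomputed')`); for L1-normalized histograms, the self-similarity of each video is 1.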
The approach is evaluated on two widely used benchmarks: the controlled KTH dataset and the challenging, unconstrained YouTube action dataset. Experiments show that the proposed method consistently outperforms baseline topic‑model methods (PLSA, LDA) and a recent diffusion‑map based latent semantic approach. Notably, the parameter‑free L1‑graph construction eliminates the sensitivity to kernel bandwidth selection, leading to more stable performance, especially on the noisy YouTube videos. Moreover, the method achieves comparable or higher accuracy while reducing the visual vocabulary size from several thousand to a few hundred latent semantics, thereby lowering both storage and computational demands.
Key contributions of the paper are:
- A parameter‑free L1‑graph construction using structured sparse coding, which directly captures similarity among visual keywords without manual kernel tuning.
- Introduction of an L1‑norm hypergraph regularization that embeds manifold structure into the sparsity pattern, offering a novel way to enforce structured sparsity in graph‑based learning.
- Integration of spectral embedding and clustering to derive latent semantics that are more compact and discriminative than traditional topic models.
- Demonstration that the learned high‑level features, combined with a simple histogram intersection kernel and SVM, achieve state‑of‑the‑art performance on both controlled and real‑world action datasets.
- A generic hypergraph regularization framework that can be applied to other machine‑learning problems that benefit from Laplacian‑type regularization.
In summary, the paper presents a robust, efficient, and theoretically grounded pipeline for latent semantic learning in human action recognition, showing that exploiting manifold geometry both during graph construction and spectral embedding yields superior high‑level representations without the need for extensive parameter tuning.