Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative radicals over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset ($m=1500$), significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples, achieving 92.41% accuracy with only one support sample per class.


💡 Research Summary

The paper introduces the Entropy‑Aware Structural Alignment (EASA) network, a novel framework for zero‑shot handwritten Chinese character recognition (ZSHCCR). Traditional zero‑shot approaches decompose characters into radical sequences but treat all radicals equally and use simple similarity measures to align visual and semantic modalities. This leads to two major problems: (1) information imbalance among radicals—high‑frequency radicals such as “口” or “日” carry low discriminative power, while rare radicals are key identifiers; (2) coarse visual‑semantic alignment that ignores the hierarchical topology of Chinese characters (global layout versus local composition).

EASA addresses these issues with four core components.

  1. Information Entropy Prior and Entropy‑Aware Positional Embedding (EAPE).
    The authors compute an entropy value for each radical based on its corpus frequency, yielding higher entropy for rare radicals. This entropy is multiplied with the standard positional embedding, creating a saliency‑aware positional code that forces the model to attend more to high‑entropy (informative) radicals and suppress ubiquitous ones.

  2. Dual‑View Radical Tree.
    Two complementary tree representations are built for each character: a parent‑centric view that captures global layout relations (left‑right, top‑bottom, surround) and a child‑centric view that preserves fine‑grained local composition. From each view five multi‑granular features (global structure, local position, depth, sibling relations, subtree summary) are extracted, providing a richer structural prior than flat sequences or single‑view trees.

  3. Adaptive Sigmoid‑Gate Fusion and Cross‑Modal Matching Module.
    The multi‑granular features are passed through a sigmoid‑based gating network that learns dynamic importance weights for each feature type. The gated representation is then fused with visual features from a ResNet backbone via a cross‑modal attention mechanism, enabling deep, non‑linear interaction between distorted handwritten strokes and the rigid radical definitions.

  4. Top‑K Semantic Feature Fusion.
    During decoding, the query vector is augmented with the centroid of the K nearest semantic prototypes in the radical embedding space. This consensus‑based augmentation reduces ambiguity when visual cues are noisy or ambiguous, effectively leveraging the clustering structure of the semantic space.
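Step 1 above can be sketched in a few lines of NumPy. The summary does not give the exact entropy formula, so treating each radical's entropy as its self-information ($-\log p$ from corpus frequency) and applying it as a per-position scalar are assumptions; `radical_entropy` and `entropy_aware_pos_embedding` are illustrative names, not the paper's API.

```python
import numpy as np

def radical_entropy(freqs):
    """Self-information -log p for each radical, from corpus frequencies.
    Rare radicals get large values; ubiquitous ones get small values."""
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()
    return -np.log(p)

def entropy_aware_pos_embedding(pos_emb, entropies):
    """Multiplicatively modulate the standard positional embeddings:
    one scalar entropy weight per sequence position, broadcast over
    the embedding dimension."""
    w = np.asarray(entropies, dtype=float)[:, None]  # (L, 1)
    return w * pos_emb                               # (L, d)
```

Because the interaction is a simple elementwise product, a high-entropy (rare) radical's positional code is amplified while a ubiquitous radical's code is damped, which is the saliency effect the paper describes.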
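The sigmoid gating in step 3 can likewise be sketched. The paper learns the gate parameters end-to-end and fuses the result with ResNet visual features via cross-modal attention; the snippet below only shows the gating stage, with a scalar gate per feature type and a sum as the aggregation, both of which are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(features, W, b):
    """Adaptive sigmoid gating over multi-granular structural features.
    features: (n_types, d) stacked features (e.g. the five views:
              global structure, local position, depth, siblings, subtree)
    W: (d, 1), b: (1,) gate parameters (learned in the paper).
    Returns the gated sum over feature types and the per-type gates."""
    gates = sigmoid(features @ W + b)        # (n_types, 1), each in (0, 1)
    fused = (gates * features).sum(axis=0)   # (d,)
    return fused, gates
```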
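Step 4, the Top-K consensus, reduces to a nearest-neighbor centroid lookup. A minimal sketch follows; cosine similarity for the neighbor search and the convex `alpha` blend between query and centroid are assumptions, since the summary only says the query is "augmented with the centroid" of the K nearest prototypes.

```python
import numpy as np

def topk_semantic_fusion(query, prototypes, k=5, alpha=0.5):
    """Augment a decoding query with the centroid of its K nearest
    semantic prototypes (cosine similarity). alpha in [0, 1] controls
    how strongly the neighbor consensus pulls the query."""
    q = query / np.linalg.norm(query)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = P @ q                      # cosine similarity to each prototype
    idx = np.argsort(-sims)[:k]       # indices of the K nearest neighbors
    centroid = prototypes[idx].mean(axis=0)
    return (1.0 - alpha) * query + alpha * centroid
```

The ablation in the summary reports K = 5 as the best trade-off, which is why it is the default here.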

To further improve robustness to handwriting variability, the authors propose a Multi‑grid 2D Elastic Deformation augmentation. A dense grid of control points is overlaid on the radical image; each point is displaced by a Gaussian offset, and bicubic interpolation produces a smoothly warped image. This simulates realistic elastic distortions of strokes, encouraging the model to learn invariance to complex deformations.
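The augmentation above can be approximated compactly: perturb a coarse control grid with Gaussian offsets, upsample the displacement field, and resample the image. In this sketch bilinear upsampling stands in for the paper's bicubic interpolation, and the final resampling is nearest-neighbor; grid size and offset scale are illustrative defaults.

```python
import numpy as np

def elastic_deform(img, grid=4, sigma=2.0, seed=0):
    """Multi-grid 2D elastic deformation of a grayscale image (h, w):
    Gaussian offsets at (grid+1)^2 control points are upsampled to a
    dense per-pixel displacement field, which warps the image."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    # Gaussian (dy, dx) offsets at the control points
    ctrl = rng.normal(0.0, sigma, size=(grid + 1, grid + 1, 2))
    # bilinearly upsample the displacement field to full resolution
    ys = np.linspace(0.0, grid, h)
    xs = np.linspace(0.0, grid, w)
    y0 = np.clip(ys.astype(int), 0, grid - 1)
    x0 = np.clip(xs.astype(int), 0, grid - 1)
    fy = (ys - y0)[:, None, None]
    fx = (xs - x0)[None, :, None]
    field = ((1 - fy) * (1 - fx) * ctrl[y0][:, x0]
             + (1 - fy) * fx * ctrl[y0][:, x0 + 1]
             + fy * (1 - fx) * ctrl[y0 + 1][:, x0]
             + fy * fx * ctrl[y0 + 1][:, x0 + 1])   # (h, w, 2)
    # sample the input at the displaced coordinates (nearest neighbor)
    yy, xx = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(yy + field[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(xx + field[..., 1]).astype(int), 0, w - 1)
    return img[sy, sx]
```

With `sigma=0` the displacement field vanishes and the function is the identity, which makes the warp easy to sanity-check before training with it.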

The experimental protocol includes zero‑shot and few‑shot evaluations on ICDAR 2013 (m = 1500 characters) and CASIA‑HWDB. In the strict zero‑shot setting, EASA achieves 55.04% accuracy, surpassing CLIP‑based baselines by 7–9 percentage points. Remarkably, with only one support sample per unseen class (1‑shot), the model reaches 92.41% accuracy, demonstrating exceptional data efficiency. Extensive ablation studies confirm the contribution of each component: removing the entropy prior drops performance by 3.2 pp, replacing the dual‑view tree with a single view reduces accuracy by 2.7 pp, disabling the sigmoid gate leads to a 1.9 pp loss, and varying K in the Top‑K fusion shows that K = 5 yields the best trade‑off. Visualizations of attention maps reveal that the model indeed focuses on high‑entropy radicals, validating the intended saliency mechanism.

The paper also discusses limitations. The entropy prior relies on accurate radical frequency statistics; a mismatch between training and deployment corpora could degrade performance. The dual‑view tree and gating network introduce a non‑trivial number of parameters, posing challenges for deployment on resource‑constrained devices. The Top‑K fusion is sensitive to the choice of K, suggesting a need for adaptive selection strategies.

Overall, EASA presents a compelling integration of information‑theoretic weighting, hierarchical structural modeling, and deep cross‑modal interaction. It sets a new state‑of‑the‑art for zero‑shot handwritten Chinese character recognition and opens avenues for extending entropy‑aware mechanisms to other compositional scripts or to stroke‑level representations in future work.

