Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing dirt from a single letter-size page based only on the information the page contains. Our approach, therefore, has to learn character representations without supervision and requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model parameterizing pattern features, feature variances, the features’ planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent, independently of their absolute position, planar feature arrangements and their variances. A quality measure defined on the learned representation then allows for an autonomous discrimination between regular character patterns and the irregular patterns making up the dirt. The irregular patterns can thus be removed to clean the document. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, even if heavily corrupted by dirt, we show that a page containing a lower number of character types can efficiently and autonomously be cleaned solely based on the structural regularity of the characters it contains. In different examples using characters from different alphabets, we demonstrate the generality of the approach and discuss its implications for future developments.


💡 Research Summary

The paper tackles the problem of cleaning heavily corrupted scanned text documents using only the information contained on a single page, without any external supervision or prior knowledge of the font. The authors formulate the task as an unsupervised learning problem: they treat each character (or any visual pattern) as a probabilistic generative model composed of a set of planar features (e.g., strokes, edge fragments). For each feature the model stores a mean position, a covariance that captures allowable spatial variation, and an existence probability. Latent variables describe (i) the class of the pattern (which character or which type of noise), (ii) the two‑dimensional position of the pattern on the page, and (iii) a binary mask indicating whether each feature is present in the observed image.
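The per-class model described above can be sketched as a small data structure: per-feature mean positions, covariances, and existence probabilities, plus a class frequency, with the latent pattern position entering as a translation of the feature means. This is a minimal illustrative sketch, not the paper's exact parameterization; all names (`PatternClass`, `log_feature_likelihood`) are assumptions.

```python
import numpy as np

# Illustrative sketch of one pattern class from the generative model:
# each of F planar features has a mean position, a covariance capturing
# allowable spatial variation, and an existence probability; the class
# itself has a frequency (mixing weight). Names are hypothetical.

class PatternClass:
    def __init__(self, means, covs, exist_probs, frequency):
        self.means = np.asarray(means, dtype=float)        # (F, 2) planar feature positions
        self.covs = np.asarray(covs, dtype=float)          # (F, 2, 2) spatial variation
        self.exist_probs = np.asarray(exist_probs, float)  # (F,) P(feature present)
        self.frequency = float(frequency)                  # class mixing weight

    def log_feature_likelihood(self, obs_pos, feat_idx, pattern_pos):
        """Gaussian log-likelihood of one observed feature position,
        shifted by the latent pattern position (translation invariance)."""
        d = np.asarray(obs_pos) - (self.means[feat_idx] + np.asarray(pattern_pos))
        cov = self.covs[feat_idx]
        inv = np.linalg.inv(cov)
        logdet = np.log(np.linalg.det(cov))
        return -0.5 * (d @ inv @ d + logdet + 2.0 * np.log(2.0 * np.pi))
```

Because the latent position shifts all feature means jointly, the learned arrangement is, as the summary notes, independent of where on the page the pattern occurs.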

To estimate the model parameters, the authors devise a novel variational Expectation‑Maximization (EM) scheme. In the E‑step they approximate the posterior over the high‑dimensional latent space by factorising it into independent distributions for class, position, and feature‑mask variables, and they further exploit the sparsity of the mask to limit computation to the most likely features. The M‑step updates the means, covariances, existence probabilities, and class frequencies using the expected sufficient statistics. This iterative process converges to a representation in which the regular, repeated character patterns have tightly clustered feature positions (low covariance) and high feature‑presence rates, whereas irregular artefacts such as manual strokes, ink spills, or scanning glitches exhibit high spatial variance and low presence probabilities.
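A heavily simplified toy version of one EM iteration can illustrate the structure: the E-step computes factorized class responsibilities, and the M-step re-estimates means, covariances, and class frequencies from expected sufficient statistics. This sketch omits the paper's position variable, feature masks, and sparsity truncation (positions are assumed pre-aligned); function and variable names are assumptions.

```python
import numpy as np

def e_step(obs, means, covs, freqs):
    """Factorized posterior over class only (a toy simplification:
    positions assumed aligned, feature masks omitted).
    obs: (N, F, 2) observed feature positions; means: (K, F, 2)."""
    N, F, _ = obs.shape
    K = means.shape[0]
    log_r = np.zeros((N, K))
    for k in range(K):
        for f in range(F):
            d = obs[:, f, :] - means[k, f, :]
            inv = np.linalg.inv(covs[k, f])
            maha = np.einsum('ni,ij,nj->n', d, inv, d)
            log_r[:, k] += -0.5 * (maha + np.log(np.linalg.det(covs[k, f])))
        log_r[:, k] += np.log(freqs[k])
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(obs, gamma, reg=1e-3):
    """Update means, covariances, and class frequencies from the
    expected sufficient statistics gathered in the E-step."""
    N, F, _ = obs.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                      # effective counts per class
    freqs = Nk / N
    means = np.einsum('nk,nfd->kfd', gamma, obs) / Nk[:, None, None]
    covs = np.zeros((K, F, 2, 2))
    for k in range(K):
        for f in range(F):
            d = obs[:, f, :] - means[k, f, :]
            covs[k, f] = (gamma[:, k, None, None] * (d[:, :, None] * d[:, None, :])).sum(0) / Nk[k]
            covs[k, f] += reg * np.eye(2)       # regularize against collapse
    return means, covs, freqs
```

In this simplified picture one can already see the effect the summary describes: repeated, regular arrangements end up with small covariances, while scattered observations inflate them.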

After learning, a quality measure Q is defined for each discovered pattern class. Q combines the average feature covariance and the average existence probability; low Q indicates a regular, well‑structured pattern (i.e., a genuine character), while high Q signals an irregular, noisy pattern. By thresholding Q, the system automatically separates characters from dirt. The identified noisy patterns are then removed by replacing their pixel locations with the background colour or by interpolating surrounding clean pixels, resulting in a cleaned document.
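The thresholding step above can be sketched as follows. The exact definition of Q in the paper may differ; here, as an assumed illustrative combination, spatial spread (mean covariance trace) is divided by mean existence probability, so tight, reliable patterns score low and diffuse, unreliable ones score high.

```python
import numpy as np

def quality_score(covs, exist_probs):
    """Illustrative quality measure: higher = more irregular.
    covs: (F, 2, 2) feature covariances; exist_probs: (F,)."""
    avg_spread = np.mean([np.trace(c) for c in covs])  # mean spatial variance
    avg_exist = float(np.mean(exist_probs))            # feature reliability
    return avg_spread / max(avg_exist, 1e-9)

def classify_patterns(class_params, threshold):
    """Split learned pattern classes into 'character' vs 'dirt' by
    thresholding the quality score."""
    labels = {}
    for name, (covs, exist_probs) in class_params.items():
        q = quality_score(np.asarray(covs), np.asarray(exist_probs))
        labels[name] = 'character' if q < threshold else 'dirt'
    return labels
```

Pixels belonging to classes labeled 'dirt' would then be replaced with the background color or interpolated from surrounding clean pixels, as described above.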

Experimental evaluation shows that a full Latin alphabet (26 characters) cannot be reliably learned from a single page because each character appears too few times. However, when the alphabet is reduced—so that a limited set of character types appears many times on the page—the method achieves high cleaning accuracy even when up to 30 % of the pixels belong to dirt. The approach is also demonstrated on scripts from different language families (Korean Hangul, Cyrillic, Arabic), confirming that the model does not depend on any specific script but only on the statistical regularity of the visual patterns.

The main contributions are: (1) an unsupervised generative model that captures both the spatial arrangement and variability of character features; (2) a scalable variational EM algorithm that can learn from a single corrupted page; (3) a principled quality metric that discriminates regular characters from irregular noise; and (4) empirical evidence of cross‑script generality. Limitations include sensitivity to the number of distinct character classes per page (few examples per class degrade performance) and the assumption of a uniform white background. Future work is suggested in the direction of multi‑page joint learning, incorporation of weak font priors, more sophisticated background modelling, and acceleration via GPU‑based parallel inference. In sum, the study presents a promising generative‑model‑based framework for fully autonomous document de‑noising, opening avenues for robust preprocessing in OCR pipelines and archival digitisation projects.

