The use of image transformations is essential for efficient modeling and learning of visual data. But the class of relevant transformations is large: affine transformations, projective transformations, elastic deformations, ... the list goes on. Therefore, learning these transformations, rather than hand-coding them, is of great conceptual interest. To the best of our knowledge, all related work so far has been concerned with either supervised or weakly supervised learning (from correlated sequences, video streams, or image-transform pairs). In this paper, by contrast, we present a simple method for learning affine and elastic transformations when no examples of these transformations are explicitly given, and no prior knowledge of space (such as the ordering of pixels) is included either. The system only has access to a moderately large database of natural images arranged in no particular order.
Biological vision remains largely unmatched by artificial visual systems across a wide range of tasks. Among its most remarkable capabilities are the aptitude for unsupervised learning and the efficient use of spatial transformations. Indeed, the brain's proficiency in various visual tasks suggests that complex internal representations are used to model visual data. Even though the nature of those representations is far from understood, it is often presumed that learning them in an unsupervised manner is central to biological neural processing [2] or, at the very least, highly relevant for modeling neural processing computationally [9,14,22]. Likewise, it is poorly understood how the brain implements various transformations in its processing. Yet it seems clear that the level of learning efficiency demonstrated by humans and other biological systems can only be achieved by means of transformation-invariant learning. This follows, for example, from the observation that people can learn to recognize objects fairly well from only a small number of views.
Covering both topics (unsupervised learning and image transformations) at once, by way of learning transformations without supervision, appears interesting to us for two reasons.
Firstly, it can potentially further our understanding of unsupervised learning: what can be learned, how it can be learned, and what its strengths and limitations are. Secondly, the class of transformations important for representing visual data may be too large for manual construction. In addition to transformations describable by a few parameters, such as affine ones, transformations requiring infinitely many parameters, such as elastic deformations, are deemed important [4]. Transformations need not be limited to spatial coordinates; they can involve the temporal dimension or color space. Transformations can be discontinuous, can be composed of simpler transformations, or can be non-invertible. All these cases are likely to be required for an efficient representation of, say, an animal or a person. Unsupervised learning opens the possibility of capturing such diversity.
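To make this distinction concrete, the following minimal sketch (ours, not part of the method presented in this paper; it relies on NumPy and SciPy, and the function names are hypothetical) contrasts the two classes: an affine warp is fully specified by six numbers, whereas an elastic warp requires a dense displacement field, i.e. one offset vector per pixel.

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def affine_warp(img, A, t):
    # Six parameters in total: the 2x2 matrix A and the 2-vector t.
    # Each output pixel samples the input image at the location A x + t.
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()])          # shape (2, h*w)
    src = A @ coords + t[:, None]
    return map_coordinates(img, src, order=1, mode='nearest').reshape(h, w)

def elastic_warp(img, sigma=4.0, alpha=8.0, seed=0):
    # A smoothed random displacement field: 2*h*w parameters, one 2D offset per pixel.
    h, w = img.shape
    rng = np.random.default_rng(seed)
    dy = gaussian_filter(rng.standard_normal((h, w)), sigma) * alpha
    dx = gaussian_filter(rng.standard_normal((h, w)), sigma) * alpha
    ys, xs = np.mgrid[0:h, 0:w]
    return map_coordinates(img, [ys + dy, xs + dx], order=1, mode='nearest')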
A number of works have been devoted to learning image transformations [11,12,13,16,17,18].
Other works were aimed at learning perceptual invariance with respect to the transformations [5,19,21], but without explicitly extracting them. Often, no knowledge of spatial structure was assumed (such methods are invariant with respect to random pixel permutations), and in some cases the learning was termed unsupervised. In this paper we adopt a more stringent notion of unsupervised learning, requiring that no ordering of the image dataset be provided. In contrast, the authors of the cited references considered some sort of temporal ordering: either sequential (synthetic sequences or video streams) or pairwise (grouping original and transformed images). Obviously, a learning algorithm can greatly benefit from temporal ordering, just as an ordering of pixels opens the problem to a host of otherwise inapplicable strategies. An ordering of images provides explicit examples of transformations; without ordering, no explicit examples are given. It is in this sense that we speak of learning without (explicit) examples.
The main goal of this paper is to demonstrate learning of affine and elastic transformations from a set of natural images by a rather simple procedure. Inference is done on a moderately large set of random images, and not just on a small set of strongly correlated images. The latter case is a (simpler) special case of our more general problem setting.
The possibility of inferring even simple transformations from an unordered dataset of images seems intriguing in itself. Yet we think that dispensing with temporal order has a wider significance. Temporal proximity of visual percepts can be very helpful for learning some transformations but not others. Even the case of 3D rotations will likely require the generation of hidden parameters encoding higher-level information, such as shape and orientation. That, in turn, will likely require processing a large number of images off-line, in a batch mode incompatible with temporal proximity.
The paper is organized as follows. A brief overview of related approaches is given in section II. Our method is introduced in section III. In section IV the method is tested on synthetic and natural sets of random images. In section V we conclude with a discussion of limitations and possible extensions of the current approach, outlining a potential application to the learning of 3D transformations.
It is recognized that transformation-invariant learning, and hence transformations themselves, possess great potential for artificial cognition. Numerous systems attempting to realize this potential have been proposed over the last few decades. In most cases the transformation-invariant capabilities were built in. In the context of neural networks, for example, translational invariance can be built in.
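As an illustrative sketch of what "built in" means here (ours, not taken from any of the cited systems): a convolution commutes with circular shifts of its input, so pooling the resulting feature map over all positions yields a response that is invariant to translations of the image.

import numpy as np
from scipy.signal import correlate2d

def invariant_response(img, kernel):
    # Convolutional feature map (periodic boundary) followed by global max pooling.
    return correlate2d(img, kernel, mode='same', boundary='wrap').max()

rng = np.random.default_rng(0)
img = rng.random((32, 32))
kernel = rng.random((5, 5))
shifted = np.roll(img, shift=(3, -2), axis=(0, 1))       # translated copy of the image
print(np.isclose(invariant_response(img, kernel),
                 invariant_response(shifted, kernel)))    # True: identical response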