Generic 3D Representation via Pose Estimation and Matching

Learning using ConvNets

A joint feature descriptor was learnt by supervising a Convolutional Neural Network (ConvNet) to perform 6DOF camera pose estimation and wide baseline matching between pairs of image patches. For the purpose of training, any two image patches depicting the same physical target point in the street view dataset were labelled as matching and other pairs of images were labelled as non-matching. The training for camera pose estimation was performed using matching patches. The patches were always cropped from the center of the collected street view image to keep the optical center at the target point.

The camera pose between each pair of matching patches was represented by a 6D vector; the first three dimensions were Tait-Bryan angles (roll, yaw, pitch) and the last three dimensions were cartesian (x, y, z) translation coordinates expressed in meters. For the purpose of training, 6D pose vectors were pre-processed to be zero mean and unit standard deviation (i.e., z-scoring). The ground-truth and predicted pose vectors for the $`i^{th}`$ example are denoted by $`p^{*}_i, ~p_i`$ respectively. The pose estimation loss $`L_{pose}(p^{*}_i, p_i)`$ was set to be the robust regression loss described in equation [eq:robust]:

\begin{equation}
\label{eq:robust}
L_{pose}(p^{*}_i, p_i) = 
    \left\{
    \begin{array}{ll}
        e & \mbox{if } e \leq 1 \\
        1 + \log e & \mbox{if } e > 1 
    \end{array}
\right.
\mbox{   where   }   e={||p^{*}_i-p_i||}_{l_2}.
\end{equation}
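For concreteness, a minimal numpy sketch of this robust loss (assuming the 6D pose vectors have already been z-scored as described above; the function name is ours, not from the released code):

```python
import numpy as np

def robust_pose_loss(p_true, p_pred):
    """Robust regression loss of Eq. [eq:robust]: linear in the l2 error
    below 1, logarithmic above it, so outliers yield bounded gradients."""
    e = np.linalg.norm(np.asarray(p_true, float) - np.asarray(p_pred, float))
    return e if e <= 1.0 else 1.0 + np.log(e)
```

The logarithmic branch is what damps the influence of large pose errors early in training.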

The loss function for patch matching $`L_{match}({m_i^{*}, m_i})`$ was set to be sigmoid cross entropy, where $`m_i^{*}`$ is the ground-truth binary variable indicating matching/non-matching and $`m_i`$ is the predicted probability of matching.

ConvNet training was performed to optimize the joint matching and pose estimation loss ($`L_{joint}`$) described in equation [eq:joint]. The relative weighting between the pose ($`L_{pose}`$) and matching ($`L_{match}`$) losses was controlled by $`\lambda`$ (we set $`\lambda = 1`$).

\begin{equation}
\label{eq:joint}
    L_{joint}(p^{*}_i, m_i^{*}, p_i, m_i) = L_{pose}(p^{*}_i, p_i) + \lambda L_{match}(m_i, m^{*}_i).
\end{equation}
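A sketch of the matching and joint losses (since pose supervision exists only for matching pairs, the pose term is masked out for non-matching pairs; this masking is an assumption of the sketch, consistent with the text's statement that pose training used matching patches only):

```python
import numpy as np

def match_loss(m_true, m_prob, eps=1e-12):
    """Sigmoid cross entropy on the predicted matching probability m_prob,
    with m_true the 0/1 ground-truth matching label."""
    m_prob = float(np.clip(m_prob, eps, 1.0 - eps))
    return -(m_true * np.log(m_prob) + (1 - m_true) * np.log(1.0 - m_prob))

def joint_loss(pose_loss_value, m_true, m_prob, lam=1.0):
    """Eq. [eq:joint] with lambda = 1: pose loss plus weighted matching loss,
    with the pose term contributing only for matching pairs."""
    return m_true * pose_loss_value + lam * match_loss(m_true, m_prob)
```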

Our training set consisted of patch pairs drawn from a wide distribution of baseline changes ranging from $`0\degree`$ to over $`120\degree`$. We considered patches of size 192x192 ($`<15\%`$ of the actual image size) and rescaled them to 101x101 before passing them into the ConvNet.

A ConvNet model with a siamese architecture containing two identical streams with an identical set of weights was used for computing the relative pose and the matching score between the two input patches. A standard ConvNet architecture was used for each stream: C(20, 7, 1)-ReLU-P(2, 2)-C(40, 5, 1)-ReLU-P(2, 2)-C(80, 4, 1)-ReLU-P(2, 2)-C(160, 4, 2)-ReLU-P(2, 2)-F(500)-ReLU-F(500)-ReLU. The naming convention is as follows: C($`n, k, s`$): convolutional layer with $`n`$ filters of spatial size $`k\times k`$ and stride $`s`$. P($`k, s`$): max pooling layer of size $`k\times k`$ and stride $`s`$. ReLU: rectified linear unit. F($`n`$): fully connected linear layer with $`n`$ output units. The feature descriptors of both streams were concatenated and fed into a fully connected layer of 500 units, which was then fed into the pose and matching losses. With this ConvNet configuration, the size of the image representation (i.e., the last FC vector of one siamese half - see Figure 11) is 500. Our architecture is deliberately common and standard. This allows us to evaluate whether our good end performance is attributable to our hypothesis of learning on foundational tasks and to the new dataset, rather than to a novel architecture.
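Assuming unpadded ('valid') convolutions and pooling, the spatial resolution through one 101x101 stream can be traced as follows; the stream ends at a 160-channel 1x1 map before the F(500) layers (a sketch of the size arithmetic only, not the training code):

```python
def out_size(n, k, s):
    """Output spatial size of an unpadded ('valid') conv/pool layer."""
    return (n - k) // s + 1

def trace_stream(input_size=101):
    """Trace one siamese stream; None marks a pooling layer, which leaves
    the channel count unchanged."""
    layers = [(20, 7, 1), (None, 2, 2),    # C(20,7,1)-ReLU-P(2,2)
              (40, 5, 1), (None, 2, 2),    # C(40,5,1)-ReLU-P(2,2)
              (80, 4, 1), (None, 2, 2),    # C(80,4,1)-ReLU-P(2,2)
              (160, 4, 2), (None, 2, 2)]   # C(160,4,2)-ReLU-P(2,2)
    size, channels = input_size, 3
    for n_filters, k, s in layers:
        size = out_size(size, k, s)
        if n_filters is not None:
            channels = n_filters
    return channels, size
```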

We trained the ConvNet model from scratch (i.e., randomly initialized weights) using SGD with momentum (initial learning rate of 0.001, divided by 10 every 60K iterations), gradient clipping, and a batch size of 256. We found gradient clipping essential for training, as even robust regression losses produce unstable gradients at the start of training. Our network converged after 210K iterations. Training using Euler angles performed better than quaternions ($`17.7\degree`$ vs $`29.8\degree`$ median angular error), and the robust loss outperformed the non-robust $`l_2`$ loss ($`17.7\degree`$ vs $`22.3\degree`$ median angular error). Additional details about the training procedure can be found in the supplementary material.
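The schedule and clipping can be sketched as follows (the clipping threshold `max_norm=1.0` is an assumed value; the text does not state the one used):

```python
import numpy as np

def learning_rate(iteration, base_lr=1e-3, drop_every=60_000):
    """Step schedule from the text: divide the rate by 10 every 60K iterations."""
    return base_lr * 0.1 ** (iteration // drop_every)

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient when its l2 norm exceeds max_norm."""
    grad = np.asarray(grad, dtype=float)
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)
```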

Introduction

Suppose an image is given and we are interested in extracting some 3D information from it, such as the scene layout or the pose of the visible objects. One potential approach would be to annotate a dataset for every single desired problem and train a fully supervised system for each (i.e., supervised learning). This is undesirable, as an annotated dataset would be needed for each problem and the problems would be treated independently. In addition, unlike semantic annotations such as object labels, certain annotations in 3D are cumbersome to collect and often require special sensors (imagine manually annotating the exact pose of an object or surface normals). An alternative approach is to develop a system with a rather generic perception that can conveniently generalize to novel tasks. In this paper, we take a step towards developing a generic 3D perception system that 1) can solve novel 3D problems without fine-tuning, and 2) is capable of certain abstract generalizations in the 3D context (e.g., reasoning about pose similarity between two drastically different objects).

But how could one learn such a generalizable system? Cognitive studies suggest that living organisms can perform cognitive tasks for which they have received no supervision through supervised learning of other, foundational tasks. Learning the relationship between visual appearance and a changing vantage point (self-motion) is among the first visual skills developed by infants and plays a fundamental role in developing other skills, e.g., depth perception. A classic experiment showed that a kitten deprived of self-motion experienced fundamental issues in 3D perception, such as failing to understand depth when placed on the Visual Cliff. Later works argued that this finding was not, at least not fully, due to motion intentionality, and that the supervision signal of self-motion was indeed a crucial element in learning basic visual skills. What these studies essentially suggest is: 1) by receiving supervision on a certain proxy task (in this case, self-motion), other tasks (depth understanding) can be solved sufficiently well without requiring explicit supervision, and 2) some vision tasks are more foundational than others (e.g., self-motion perception vs depth understanding).

Learning a generic 3D representation: we develop a supervised joint framework for camera pose estimation and wide baseline matching. We then show the internal representation of this framework can be used as a 3D representation generalizable to various 3D prediction tasks.

Inspired by the above discussion, we develop a supervised framework where a ConvNet is trained to perform 6DOF camera pose estimation. This basic task allows learning the relationship between an arbitrary change in the viewpoint and the appearance of an object/scene-point. One property of our approach is performing the camera pose estimation in an object/scene-centric manner: the training data is formed of image bundles that show the same point of an object/scene while the camera moves around (i.e., it fixates - see figure 10-(c)). This is different from existing video+metadata datasets, the problem of Visual Odometry, and recent works on ego-motion estimation, where in the training data the camera moves independently of the scene. Our object/scene-centric approach is equivalent to allowing a learner to focus on a physical point while moving around and observing how the appearance of that particular point transforms according to the viewpoint change. Therefore, the learner receives an additional piece of information that the observed pixels are indeed showing the same object, giving more information about how the element looks under different viewpoints and providing better grounds for learning a visual encoding of an observation. Infants also explore object-motion relationships in a similar way as they hold an object in hand and observe it from different views.

Our dataset also provides supervision for the task of wide baseline matching, defined as identifying whether two images/patches show the same point regardless of the magnitude of the viewpoint change. Wide baseline matching is also an important 3D problem and is closely related to object/scene-centric camera pose estimation: to identify whether two images could be showing the same point despite drastic changes in appearance, an agent could learn how viewpoint change impacts the appearance. Therefore, we perform our supervised training in a multi-task manner to simultaneously solve for both wide baseline matching and pose estimation. This has the advantage of learning a single representation that encodes both problems. In the experiments section (17.1.5), we show it is possible to have a single representation solving both problems without a performance drop compared to having two dedicated representations. This provides practical computational and storage advantages. Also, training ConvNets using multiple tasks/losses is desirable as it has been shown to be better regularized.

We train the ConvNet (siamese structure with weight sharing) on patch pairs extracted from the training data and use the last FC vector of one siamese tower as the generic 3D representation (see figure 11). We empirically investigate whether this representation can be used for solving novel 3D problems (we evaluated on scene layout estimation, object pose estimation, and surface normal estimation), and whether it can perform any 3D abstraction (we experimented on cross-category pose estimation and relating the pose of synthetic geometric elements to images).

Dataset: We developed an object-centric dataset of street view scenes from the cities of Washington DC, NYC, San Francisco, Paris, Amsterdam, Las Vegas, and Chicago, augmented with camera pose information and point correspondences. It includes 25 million images, 118 million matching image pairs, camera metadata, and 3D models of 8 cities. We release the dataset, trained models, and an online demo at http://3Drepresentation.stanford.edu/.

Novelty in the Supervised Tasks: Independent of providing a generic 3D representation, our approach to solving the two supervised tasks is novel in a few aspects. There is a large amount of previous work on detecting, describing, and matching image features, either through handcrafting the feature or learning it. Unlike the majority of such features that utilize pre-rectification (within either the method or the training data), we argue that rectification prior to descriptor matching is not required; our representation can learn the impact of viewpoint change, rather than canceling it (by directly training on non-rectified data and supplying camera pose information during training). Therefore, it does not need a priori rectification and is capable of performing wide baseline matching at the descriptor level. We report state-of-the-art results on feature matching. Wide baseline matching has also been the topic of many papers, with the majority of them focused on leveraging various geometric constraints for ruling out incorrect ‘already-established’ correspondences, as well as a number of methods that operate by generating exhaustive warps or assuming 3D information about the scene is given. In contrast, we learn a descriptor that is supervised to internally handle a wide baseline in the first place.

In the context of pose estimation, we show that estimating a 6DOF camera pose given only a pair of local image patches, and without the need for several point correspondences, is feasible. This is different from many previous works from both the visual odometry and SfM literature that perform the estimation through a two-step process consisting of finding point correspondences between images followed by pose estimation. Koser and Koch also demonstrate pose estimation from a local region, though the plane on which the region lies is assumed to be given. Recent works supervise a ConvNet on the camera pose from image patches but do not provide results on matching and pose estimation. We report human-level accuracy on this task.

Existing Unsupervised Learning and ConvNet Initialization Works: The majority of previous unsupervised learning, transfer learning, and representation learning works have been targeted towards semantics. It has been well observed in practice that the representation of a ConvNet trained on ImageNet can generalize to other, mostly semantic, tasks. A number of methods investigated initialization techniques for ConvNet training based on unsupervised/weakly supervised data to alleviate the need for a large training dataset for various tasks. Very recently, several methods explored using the motion metadata associated with videos (KITTI dataset) as a form of supervision for training a ConvNet. However, they either do not investigate developing a 3D representation or intend to provide initialization strategies that are meant to be fine-tuned with supervised data for a desired task. In contrast, we investigate developing a generalizable 3D representation, perform the learning in an object-centric manner, and evaluate its unsupervised performance on various 3D tasks without any fine-tuning of the representation. We experimentally compare against the related recent works that made their models available.

Primary contributions of this paper are summarized as: I) A generic 3D representation with empirically validated abstraction and generalization abilities. II) A learned joint descriptor for wide baseline matching and camera pose estimation at the level of local image patches. III) A large-scale object-centric dataset of street view scenes including camera pose and correspondence information.

Evaluating the 3D Representation on Novel Tasks

The results of evaluating our representation on novel 3D tasks are provided in this section. The tasks as well as the images (e.g., Airship images from ImageNet) used in these evaluations are significantly different from what our representation was trained for (i.e., camera pose estimation and matching on local patches of street view images). The fact that, despite such differences, our representation achieves the best results among all unsupervised methods and gets close to supervised methods on each of the tasks empirically validates our hypothesis on learning on foundational tasks (see section 14).

Our ways of evaluating and probing the representation in an unsupervised manner are: 1) tSNE: large-scale 2D embedding of the representation. This allows visualizing the space and getting a sense of similarity from the perspective of the representation. 2) Nearest Neighbors (NN) on the full-dimensional representation. 3) Training a simple classifier (e.g., KNN or a linear classifier) on the frozen representation (i.e., no fine-tuning) to read out a desired variable. The latter quantifies whether the information required for solving a novel task is encoded in the representation and can be extracted using a simple function. We compare against the representations of related methods that made their models available, various layers of AlexNet trained on ImageNet, and a number of supervised techniques for some of the tasks. Additional results are provided in the supplementary material and the website.
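The second probe can be sketched as a 1-NN readout on frozen features (a minimal illustration with a hypothetical function name, not the evaluation code itself):

```python
import numpy as np

def nn_readout(train_feats, train_labels, query_feats):
    """1-NN readout on frozen features: the representation is never
    fine-tuned, only probed with a distance-based classifier."""
    train_feats = np.asarray(train_feats, float)
    query_feats = np.asarray(query_feats, float)
    # Pairwise l2 distances between queries and the labelled pool.
    d = np.linalg.norm(query_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    return np.asarray(train_labels)[np.argmin(d, axis=1)]
```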

2D embedding of our representation on 3,000 unseen patches using tSNE. An organization based on the Manhattan pose of the patches can be seen. See comparable AlexNet’s embedding in the supplementary material’s section 6. (best seen on screen)

Surface Normals and Vanishing Points


Figure 2 shows the tSNE embedding of 3,000 unseen patches, showing that the organization of the representation space is based on geometry and not semantics/appearance. The ConvNet was trained to estimate the pose between matching patches only, while in the embedding, non-matching patches with a similar pose are placed nearby. This suggests the representation has generalized the concept of pose to non-matching patches. This indeed relates to surface normals, as the relative pose between an arbitrary patch and a frontal patch is equal to the pose of the arbitrary patch; figure 2 can be perceived as the organization of the patches based on their surface normals.

a) tSNE of a superset of various vanishing point benchmarks (to battle the small size of the datasets). b) Inversion of our representation. Both plots show traits of vanishing points.

To better understand how this was achieved, we visualized the activations of the ConvNet at different layers. Similar to other ConvNets, the first few layers formed general gradient-based filters, while in higher layers, edges that are parallel in the physical world seemed to persist and cluster together. This is similar to the concept of vanishing points, and from the theoretical perspective, it is intriguing and would explain the pose estimation results, since three common vanishing points are theoretically enough for a full angular pose estimation. To further investigate this, we generated the inversion of our representation (see figure 3-(b)), which shows patterns correlating with the vanishing points of the image. Figure 3-(a) also illustrates the tSNE of a superset of several vanishing point benchmarks, showing that images with similar vanishing points are embedded nearby. Therefore, we speculate that the ConvNet has developed a representation based on the concept of vanishing points. This would also explain the results shown in the following sections.

Surface normal estimation on NYUv2: numerical evaluation on unsupervised surface normal estimation is provided in supplementary material sec. 4.

Scene layout NN search results between LSUN images and synthetic concave cubes defining abstract 3D layouts. Images with yellow boundary show the ground truth layout.

Scene Layout Estimation


We evaluated our representation on the LSUN layout benchmark using the standard protocol. Table [tab:Layout-est] provides the results of layout estimation using a simple NN classifier on our representation along with two supervised baselines, showing that our representation (with no fine-tuning) achieved a performance close to Hedau et al.'s supervised method on this novel task. Table [tab:Layout-Class] provides the results of layout classification using an NN classifier on our representation compared to AlexNet's FC7 and Pool5.

Abstraction: Cube$`\leftrightarrows`$Layout: To evaluate the abstract generalization abilities of our representation, we generated a sparse set of 88 images showing the interior of a simple synthetic cube parametrized over different view angles. The rendered images can be seen as an abstract cubic layout of a room. We then performed NN search between these images and the LSUN dataset using our representation and several baselines. As apparent in figure 4, our representation retrieves meaningful NNs, while the baselines mostly overfit to appearance and retrieve either an incorrect or always the same NN. This suggests our representation can abstract away the irrelevant information and encode information essential to the 3D structure of the image.

NN search results between EPFL dataset images and a synthetic cube defining an abstract 3D pose. See the supplementary material (section 5) for tSNE embedding of all cubes and car poses in a joint space. Note that the 3D poses defined by the cubes are $`90\degree`$-congruent.

3D Object Pose Estimation


Abstraction: Cube$`\leftrightarrows`$Object: We performed a similar abstraction test between a set of 88 convex cubes and the images of the EPFL Multi-View Car dataset, which includes a dense sampling of various viewpoints of cars in an exhibition. We picked this simple cube pattern as it is the simplest geometric element that defines three vanishing points. The same observation as in the LSUN abstraction experiment is made here, with our NNs being meaningful while the baselines mostly overfit to appearance with no clear geometric abstraction trait (figure 5).

ImageNet: Figure 6 shows the tSNE embedding of several ImageNet categories based on our representation and the baselines. The embeddings of our representation are geometrically meaningful, while the baselines either perform a semantic organization or overfit to other aspects, such as color.

tSNE of several ImageNet categories using our unsupervised representation along with several baselines. Our representation manifests a meaningful geometric organization of objects. tSNE of more categories in the supplementary material and the website. (best seen on screen)

PASCAL3D: Figure 7 shows cross-category NN search results for our representation along with several baselines. This experiment also evaluates a certain level of abstraction, as some of the object categories can look drastically different. We also quantitatively evaluated 3D object pose estimation on PASCAL3D. For this experiment, we trained a ConvNet from scratch, fine-tuned AlexNet pre-trained on ImageNet, and fine-tuned our network; we read the pose out using a linear regressor layer. Our results outperform the scratch network and come close to AlexNet, which has seen thousands of images from the same categories in ImageNet, as well as other objects. Note that certain aspects of object pose estimation, e.g., distinguishing between the front and back of a bus, are more of a semantic task than a geometric/3D one. This explains a considerable part of the failures of our representation, which is object/semantic agnostic.

Qualitative results of cross-category NN-search on PASCAL3D using our representation along with baselines.

Experimental Discussions and Results

We implemented our framework using data parallelism on a cluster of 5-10 GPUs. At test time, computing the representation is a feed-forward pass through one siamese half of the ConvNet and takes $`\sim2.9ms`$ per image on a single processor. Sections 17.1 and 17.2 provide the evaluations of the learned representation on the supervised and novel 3D tasks, respectively.

Evaluations on the Supervised Tasks

Evaluations on the Street View Dataset.

The test set for pose estimation is composed of 7725 pairs of matching patches from our dataset. The test set for matching includes 4223 matching and 18648 non-matching pairs. We made sure that no data from the test areas or their vicinity was used in training. Each patch pair in the test sets was checked by three Amazon Mechanical Turk workers to verify that the ground truth is indeed correct. For the matching pairs, the Turkers also ensured that the center pixels of the patches are no more than 25 pixels ($`\sim3\%`$ of image width) apart. Visualizations of the test set can be seen on our website.

(a) Sample qualitative results of camera pose estimation. The 1st and 2nd rows show the patches. The 3rd row depicts the estimated relative camera poses on a unit sphere (black: patch 1's camera (reference), red: ground-truth pose of patch 2, blue: estimated pose of patch 2). Rightward and upward are the positive directions. (b) Sample wide baseline matching results. Green and red represent ‘matching’ and ‘non-matching’, respectively. Three failure cases are shown on the right.

Pose Estimation.

Figure 8-(a) provides qualitative results of pose estimation. The angular evaluation metric is the standard overall angular error, defined as the angle between the predicted pose vector and the ground-truth vector in the plane defined by their cross product. The translational error metric is the $`l_2`$ norm of the difference between the normalized predicted translation vector and the ground truth. The translation vector was normalized to enable comparison with up-to-scale SfM.
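The two metrics reduce to simple vector computations; a sketch (hypothetical helper names) is:

```python
import numpy as np

def angular_error_deg(v_pred, v_true):
    """Angle in degrees between the predicted and ground-truth vectors,
    measured in the plane defined by their cross product."""
    v_pred, v_true = np.asarray(v_pred, float), np.asarray(v_true, float)
    c = np.dot(v_pred, v_true) / (np.linalg.norm(v_pred) * np.linalg.norm(v_true))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_error(t_pred, t_true):
    """l2 norm of the difference between unit-normalized translation vectors,
    making the metric comparable with up-to-scale SfM."""
    t_pred, t_true = np.asarray(t_pred, float), np.asarray(t_true, float)
    return np.linalg.norm(t_pred / np.linalg.norm(t_pred)
                          - t_true / np.linalg.norm(t_true))
```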

Figure 9-right provides the quantitative evaluations. Plots (a) and (c) illustrate the distribution of the test set with respect to pose estimation error for each method (the more skewed to the left, the better). The green curve shows pose estimation results by human subjects. Two users with computer vision knowledge, but unaware of the particular use case, were asked to estimate the relative pitch and yaw between a random subset of 500 test pairs. They were allowed to train themselves with as many training samples as they wished. The ConvNet outperformed the humans on this task with a margin of $`8\degree`$ in median error.

Left: Quantitative evaluation of matching. ROC curves of each method and the corresponding AUC and FPR@95 values are shown in (a). Right: Quantitative evaluation of camera pose estimation. VO and SfM denote Visual Odometry (LIBVISO2) and Structure-from-Motion (visualSfM), respectively. Evaluation of robustness to wide baseline camera shifts is shown in the (b) plots.

Pose Estimation Baselines: We compared against Structure-from-Motion (visualSfM with default components and tuned hyper-parameters for pairwise pose estimation on $`192\times 192`$ patches and full images) and LIBVISO2 Visual Odometry on full images. Both SfM and LIBVISO2 VO suffer from a large RANSAC failure rate, mostly due to the wide baselines in the test pairs.

Figure 9-right (b) shows how the median angular error (Y axis) changes as the baseline of the test pairs (X axis) increases. This is achieved by binning the test set into 8 bins based on baseline size. This plot quantifies the ability of the evaluated methods to handle wide baselines. We adopt the slope of the curves as the quantification of the deterioration in accuracy as the baseline increases.
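The binning step can be sketched as follows (equal-width bins are an assumption of this sketch; the bin edges used for the plot are not stated in the text):

```python
import numpy as np

def median_error_by_baseline(baselines, errors, n_bins=8):
    """Bin test pairs into n_bins equal-width baseline bins and return the
    median angular error per bin (None for empty bins)."""
    baselines = np.asarray(baselines, float)
    errors = np.asarray(errors, float)
    edges = np.linspace(baselines.min(), baselines.max(), n_bins + 1)
    idx = np.clip(np.digitize(baselines, edges) - 1, 0, n_bins - 1)
    return [float(np.median(errors[idx == b])) if np.any(idx == b) else None
            for b in range(n_bins)]
```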

Wide Baseline Matching.

Figure 8-(b) shows sample feature matching results using our approach, with three failure cases on the right. Figure 9-left provides the quantitative results. The standard metric for descriptor matching is the ROC curve, acquired by sorting the test set pairs according to their matching score. For unsupervised methods, e.g., SIFT, the matching score is the $`l_2`$ distance. False Positive Rate at 95% recall (FPR@95) and Area Under Curve (AUC) of the ROC are standard scalar quantifications of descriptor matching.
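FPR@95 can be computed directly from the score distributions of the matching and non-matching pairs; a minimal sketch (assuming higher score means more likely matching, so $`l_2`$-distance scores would be negated first):

```python
import numpy as np

def fpr_at_recall(pos_scores, neg_scores, recall=0.95):
    """Fraction of non-matching pairs accepted at the score threshold
    that still retains `recall` of the matching pairs."""
    thresh = np.quantile(np.asarray(pos_scores, float), 1.0 - recall)
    return float(np.mean(np.asarray(neg_scores, float) >= thresh))
```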

Matching Baselines: We compared our results with the handcrafted features of SIFT, Root-SIFT, DAISY, VIP (which requires surface normals as input, for which we used the normals from the 3D models), and ASIFT. The matching score of ASIFT was the number of correspondences found in the test pair given the full images. We also compared against the learning-based features of Zagoruyko & Komodakis (using the authors' models), Simonyan & Zisserman (with and without retraining), and Simo-Serra et al. (using the authors' best pretrained model), as well as human subjects (the red dot on the ROC plot). Figure 9-left(b) provides the evaluations in terms of handling wide baselines, similar to Figure 9-right(b).

Brown et al. Benchmark & Mikolajczyk’s Benchmark.

We performed evaluations on the non-street view benchmarks of Brown et al. and Mikolajczyk & Schmid to determine 1) whether our representation performs well only on street view scenery, and 2) whether the wide baseline handling capability was achieved at the expense of lower performance on small baselines (these benchmarks have, for the most part, narrower baselines than our dataset). Tables [tab:brown] & [tab:Mikola] provide the quantitative results. We include a thorough description of the evaluation setup and detailed discussions in the supplementary material (section 2).

Joint Feature Learning.

We studied different aspects of jointly learning the representation and of information sharing among the core supervised tasks. In the interest of space, we provide the quantitative results in the supplementary material (section 1). The tests led to two conclusions: first, the problems of wide baseline matching and camera pose estimation share a great deal of information; second, one descriptor can encode both problems with no performance drop.

Discussion and Conclusion

To summarize, we developed a generic 3D representation through solving a set of supervised foundational proxy tasks. We reported state-of-the-art results on the supervised tasks and showed the learned representation manifests generalization and abstraction traits. However, a number of questions remain open:

Though we were inspired by cognitive studies in defining the foundational supervised tasks leading to a generalizable representation, this remains at the level of inspiration. Given that a ‘taxonomy’ among basic 3D tasks has not been developed, it is not concretely defined which tasks are foundational and which are secondary. Developing such a taxonomy (i.e., whether task A is inclusive of, overlapping with, or disjoint from task B), or more generally, efforts toward understanding the task space, would be a rewarding step towards soundly developing the 3D-complete representation. Also, the semantic and 3D aspects of the visual world are tangled together. So far, we have developed independent semantic and 3D representations, but investigating concrete techniques for integrating them (beyond simplistic late fusion or ConvNet fine-tuning) is a worthwhile future direction for research. Perhaps inspiration from the partitioning of the visual cortex could be insightful towards developing the ultimate vision-complete representation.

Acknowledgement: We gratefully acknowledge the support of ICME/NVIDIA Award (1196793-1-GWMUE), MURI (1186514-1-TBCJE), and Nissan (1188371-1-UDARQ).

Object-Centric Street View Dataset

The dataset for the formulated task needs to not only provide a large amount of training data, but also show a rich variety of camera poses, while the scale of the aimed learning problem invalidates any manual procedure. We present a procedure that allows acquiring a large amount of training data in an automated manner, based on two sources of information: 1) Google street view, which is an almost inexhaustible source of geo-referenced and calibrated images, and 2) 3D city models that cover thousands of cities around the world. The dataset, including 25 million images and 118 million matching image pairs with their camera pose, and 3D models of 8 cities, is available at http://3Drepresentation.stanford.edu/.

The core idea of our approach is to form correspondences between the geo-referenced street view camera and physical 3D points that are given by the 3D models. More specifically, at any given street view location, we densely shoot rays into space in order to find intersections with nearby buildings. Each ray back-projects one image pixel into the 3D space, as shown in figure 10-(a). By projecting the resulting intersection points onto adjacent street view panoramas (see figure 10-(b)), we can form image-to-image correspondences (see figure 10-(c)). Each image is then associated with a (virtual) camera that fixates on the physical target point on a building by placing it at the optical center. To make the ray intersection procedure scalable, we perform occlusion reasoning on the 3D models to pre-identify from which GPS locations an arbitrary target would be visible, and perform the ray intersection on those points only.
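The projection of an intersection point onto an adjacent panorama reduces to converting the camera-to-point direction into panorama coordinates. A simplified sketch (assuming equirectangular panoramas, z-up world coordinates, and an identity camera heading; the real pipeline additionally uses each panorama's compass heading from the metadata):

```python
import numpy as np

def project_to_panorama(point_w, cam_pos, width=512, height=256):
    """Map a world point to (column, row) pixel coordinates in an
    equirectangular panorama captured at cam_pos."""
    d = np.asarray(point_w, float) - np.asarray(cam_pos, float)
    d = d / np.linalg.norm(d)                 # unit viewing direction
    lon = np.arctan2(d[1], d[0])              # azimuth in [-pi, pi]
    lat = np.arcsin(np.clip(d[2], -1.0, 1.0)) # elevation in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width     # column
    v = (0.5 - lat / np.pi) * height          # row (0 = zenith)
    return u, v
```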

Illustration of the object-centric data collection process. We use large-scale geo-registered 3D building models to register pixels in street view images on world coordinates system (see (a)) and use that for finding correspondences and their relative pose across multiple street view images (see (b)). Each ray represents one pixel-3D world coordinate correspondence. Each of the red, green, and blue colors represent one street view location. Each row in (c) shows a sample collected image bundle. The center pixel (marker) is expected to correspond to the same physical point.

Pixel Alignment and Pruning: This system requires integration of multiple resources, including elevation maps, GPS from street view, and 3D models. Though the quality of the output exceeded our expectations (see samples in Figure 10(c)), any slight inaccuracy in the metadata or 3D models can cause pixel misalignment in the collected images (examples shown in the first and last rows of Figure 10(c)). Also, there are undocumented objects, such as trees or moving objects, that cause occlusions. Thus, a content-based post-alignment and pruning step was necessary. We again used metadata in our alignment procedure to be able to handle image bundles with arbitrarily wide baselines (note that the collected image bundles can show large, often $`>100\degree`$, viewpoint changes). In the interest of space, we describe this procedure in the supplementary material (section 3).

This process forms our dataset, composed of matching and non-matching patches as well as the relative camera pose for the matching pairs. We stopped collecting data when we reached a coverage of $`>200{km}^2`$ across the 7 cities mentioned in section 14. The collection procedure is currently performed on Google street view, but can be performed with any geo-referenced calibrated imagery. We will show experimentally that the representation trained on this data does not manifest a clear bias towards street view scenes and outperforms existing feature learning methods on non-street-view benchmarks.

Noise Statistics: We performed a user study through Amazon Mechanical Turk to quantify the amount of noise in the final dataset. Please see the supplementary material (section 3.2) for the complete discussion and results. Briefly, $`68\%`$ of the patch pairs were found to have at least $`25\%`$ overlap in their content. The mean and standard deviation of pixel misalignment were 16.12 ($`\approx11\%`$ of patch width) and 11.55 pixels, respectively. We did not perform any filtering or geo-fencing on top of the collected data, as the amount of noise appeared to be within the robustness tolerance of ConvNet training, and the networks converged.

Introduction

A representation performs the task of converting visual information into a mathematical form (often a vector). This mathematical form is expected to encode the information required for solving the task it was devised for and, ideally, to generalize to related problems. While a considerable number of existing works have explored the possibility of forming generic semantic representations, transfer learning, and initialization strategies of ConvNets useful for scaling recognition tasks, an equivalent representation for 3D is yet to be proposed. We take a step towards developing a generic 3D representation that 1) can solve novel 3D problems without the need for fine-tuning the representation, and 2) is capable of certain abstract generalizations in the 3D context (e.g., reasoning about pose similarity between two drastically different objects).

But how could one learn such a generalizable representation? Cognitive studies suggest living organisms can perform cognitive tasks for which they have not received supervision through supervised learning of other foundational tasks. Learning the relationship between visual appearance and a change of vantage point (self-motion) is among the first visual skills developed by infants and plays a fundamental role in developing other skills, e.g., depth perception. A classic experiment showed that a kitten deprived of self-motion experienced fundamental issues in 3D perception, such as failing to understand depth when placed on the Visual Cliff. Later studies argued this finding was not, at least not fully, due to motion intentionality, and that the supervision signal of self-motion was indeed a crucial element in learning basic visual skills. What these experiments essentially suggest is: 1) by receiving supervision over a certain proxy task (in this case, visual perception of self-motion), other tasks (depth understanding) can be solved sufficiently without requiring explicit supervision, and 2) some vision tasks are more foundational than others (e.g., self-motion perception vs. depth understanding).

Learning a generic 3D representation: we develop a supervised joint framework for camera pose estimation and wide baseline matching. We then show the internal representation of this framework can be used as a 3D representation generalizable to various 3D prediction tasks.

Inspired by the above discussion, we develop a supervised framework in which a ConvNet is trained to perform 6DOF camera pose estimation. This basic task allows learning the relationship between an arbitrary change in viewpoint and the appearance of an object/scene-point. One property of our approach is that camera pose estimation is performed in an object/scene-centric manner: the training data is formed of image bundles that show the same point of an object/scene while the camera moves around (i.e., it fixates; see Figure 10(c)). This is different from existing video+metadata datasets, the problem of Visual Odometry, and recent works on ego-motion estimation, where, in the training data, the camera moves independently of the scene. Our object/scene-centric approach is equivalent to allowing a learner to focus on a physical point while moving around and observing how the appearance of that particular point transforms with viewpoint change. Therefore, the learner receives an additional piece of information, namely that the observed pixels indeed show the same object, giving more information about how the element looks under different viewpoints and providing better grounds for learning a visual encoding of an observation. Infants also explore object-motion relationships in a similar way, as they hold an object in hand and observe it from different views.

Our dataset also provides supervision for the task of wide baseline matching, defined as identifying whether two images/patches show the same point regardless of the magnitude of viewpoint change. Wide baseline matching is also an important 3D problem and is closely related to object/scene-centric camera pose estimation: to identify whether two images could be showing the same point despite drastic changes in appearance, an agent could learn how viewpoint change impacts appearance. Therefore, we perform our supervised training in a multitask manner to simultaneously solve both wide baseline matching and pose estimation. This has the advantage of learning a single representation that encodes both problems. In the experiments section (17.1.5), we show it is possible to have a single representation solving both problems without a performance drop compared to having two dedicated representations. This provides practical computational and storage advantages. Also, training ConvNets using multiple tasks/losses is desirable, as it has been shown to be better regularized.4
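The joint objective described above (the robust pose loss of equation [eq:robust] plus sigmoid cross entropy for matching, weighted by $`\lambda`$) can be sketched numerically. This is a minimal NumPy illustration with our own function names, not the training code:

```python
import numpy as np

def pose_loss(p_true, p_pred):
    """Robust regression loss of equation [eq:robust]: e if e <= 1,
    else 1 + log(e), with e the l2 distance between 6D pose vectors."""
    e = np.linalg.norm(p_true - p_pred)
    return e if e <= 1.0 else 1.0 + np.log(e)

def match_loss(m_true, m_prob):
    """Sigmoid cross entropy on the predicted matching probability."""
    eps = 1e-12
    return -(m_true * np.log(m_prob + eps)
             + (1.0 - m_true) * np.log(1.0 - m_prob + eps))

def joint_loss(p_true, p_pred, m_true, m_prob, lam=1.0):
    """One plausible form of L_joint: L_match + lambda * L_pose
    (lambda = 1 in our training)."""
    return match_loss(m_true, m_prob) + lam * pose_loss(p_true, p_pred)
```

The capping of the pose loss at $`1 + \log e`$ keeps gradients bounded for the large relative-pose errors that wide baselines inevitably produce early in training.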

We train the ConvNet (siamese structure with weight sharing) on patch pairs extracted from the training data and use the last FC vector of one siamese tower as the 3D representation (see figure 11). We will empirically investigate if this representation can be used for solving novel 3D problems (we evaluated on scene layout estimation, object pose estimation, surface normal estimation), and whether it can perform any 3D abstraction (we experimented on cross category pose estimation and relating the pose of synthetic geometric elements to images).

Dataset: We developed an object-centric dataset of street view scenes from the cities of Washington DC, NYC, San Francisco, Paris, Amsterdam, Las Vegas, and Chicago, augmented with camera pose information and point correspondences (with $`>`$ half a billion training data points). We release the dataset, trained models, and an online demo at http://3Drepresentation.stanford.edu/.

Novelty in the Supervised Tasks: Independent of providing a generic 3D representation, our approach to solving the two supervised tasks is novel in a few aspects. There is a large amount of previous work on detecting, describing, and matching image features, either through handcrafting the feature or learning it. Unlike the majority of such features, which utilize pre-rectification (within either the method or the training data), we argue that rectification prior to descriptor matching is not required; our representation can learn the impact of viewpoint change, rather than canceling it (by directly training on non-rectified data and supplying camera pose information during training). Therefore, it does not need a priori rectification and is capable of performing wide baseline matching at the descriptor level. We report state-of-the-art results on feature matching. Wide baseline matching has also been the topic of many papers, with the majority focused on leveraging various geometric constraints to rule out incorrect 'already-established' correspondences, as well as a number of methods that operate by generating exhaustive warps or by assuming 3D information about the scene is given. In contrast, we learn a descriptor that is supervised to internally handle a wide baseline in the first place.

In the context of pose estimation, we show that estimating a 6DOF camera pose given only a pair of local image patches, without the need for several point correspondences, is feasible. This differs from many previous works in both the visual odometry and SfM literature, which perform the estimation through a two-step process consisting of finding point correspondences between images followed by pose estimation. Koser and Koch also demonstrate pose estimation from a local region, though the plane on which the region lies is assumed to be given. The recent works of  supervise a ConvNet on the camera pose from image patches but do not provide results on matching and pose estimation. We report human-level accuracy on this task.

Existing Unsupervised Learning and ConvNet Initialization Methods: The majority of previous unsupervised learning, transfer learning, and representation learning works have been targeted towards semantics. It has been well observed in practice that the representation of a ConvNet trained on ImageNet can generalize to other, mostly semantic, tasks. A number of methods investigated initialization techniques for ConvNet training based on unsupervised/weakly supervised data to alleviate the need for a large training dataset for various tasks. Very recently, a few methods investigated using motion metadata associated with videos (KITTI dataset) as a form of supervision for training a ConvNet. However, they either do not investigate developing a 3D representation or intend to provide initialization strategies that are meant to be fine-tuned with supervised data for the desired task. In contrast, we investigate developing a generalizable 3D representation, perform the learning in an object-centric manner, and evaluate its unsupervised performance on various 3D tasks without any fine-tuning of the representation. We experimentally compare against the related recent works that made their models available.

The main contributions of this paper can be summarized as:
I) A generic 3D representation with empirically validated abstraction and generalization advantages.
II) A jointly learned descriptor for wide baseline matching and camera pose estimation at the level of local image patches.
III) A large-scale object-centric dataset of streetview scenes including camera pose and correspondence information.

Joint Feature Learning

We investigated different aspects of jointly learning the representation and of information sharing between the two supervised tasks in Tables [tab:trans] and [tab:joint-net]. To quantify the amount of information shared between the matching and pose estimation tasks, we trained a single-task network dedicated to each problem; the error of each network on its respective task is reported in the "Direct" row of Table [tab:trans]. "Transduction" provides the error rate when a linear classifier was trained on the frozen representation of one task to solve the other task. The fact that the transduction setup achieves reasonable performance suggests the two problems share a great deal of information in their representations. Table [tab:joint-net] compares the performance of single- vs. multi-task networks. The multi-task network performs comparably to its dedicated counterparts, showing that it encodes both problems with no performance drop.
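The transduction setup amounts to a linear probe on frozen features. A sketch with synthetic stand-ins for the descriptors (the choice of logistic regression and all data here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for frozen descriptors from the pose-only network; in the
# paper these come from the ConvNet, here they are synthetic.
feats_train = rng.standard_normal((200, 64))
labels_train = (feats_train[:, 0] > 0).astype(int)   # proxy matching label
feats_test = rng.standard_normal((50, 64))
labels_test = (feats_test[:, 0] > 0).astype(int)

# "Transduction": the representation stays frozen; only a linear
# classifier is trained on top of it for the other task.
clf = LogisticRegression(max_iter=1000).fit(feats_train, labels_train)
acc = clf.score(feats_test, labels_test)
```

The key point is that no gradient flows back into the representation; only the linear readout is fit.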

Brown et al. Benchmark.

We evaluated the performance of our representation on the benchmarks of Brown et al.  and Mikolajczyk & Schmid  (next subsection). Compared to our dataset, these benchmarks mostly include narrower baselines (except for a subset of ), and therefore do not exercise wide baseline handling abilities; our method also sees more training data than the baselines. However, these benchmarks reveal 1) whether our representation performs well only on street view scenery, and 2) whether the wide baseline handling capability was achieved at the expense of lower performance on small baselines.

We compared our results with six baselines, including Zagoruyko & Komodakis's  and MatchNet , using the variants whose descriptor dimensionality (512) and network architecture are most similar to ours. For this experiment, we mixed our training dataset with the corresponding training split of Brown's (see Table [tab:brown]). Our representation outperforms the baselines, except on two splits from Yosemite National Park that are substantially foliage covered (we speculate the foliage coverage is the reason, since our ConvNet is mostly agnostic to foliage, trees being uninformative for matching and pose in street view scenery).

Mikolajczyk & Schmid Benchmark.

The evaluation results on the benchmark of Mikolajczyk & Schmid  are provided in Table [tab:Mikola] using the standard protocol . Following , we performed the matching on MSER features. The last two rows show our results on the MSER patches with and without rectification (i.e., skipping MSER rectification). Our representation outperforms the baselines in both cases, and skipping the rectification actually improves performance.

Dataset and Data Collection Details

Our dataset was collected from a broad geographical area spanning multiple cities. Some representative city locations from which data was collected are shown in Figure 12.

Some of the areas from which we have collected our dataset.

Sample images from our dataset that were used for training the pose estimation and patch matching network are shown in Figures 13 and 14. Each row shows a single target location from different viewpoints. The target location is marked with a red dot.

Sample images from our dataset. Each row shows one image bundle showing one target. The marker is on the center pixel and should show the target point. The columns show different views.
Sample images from our dataset. Each row shows one image bundle showing one target. The marker is on the center pixel and should show the target point. The columns show different views.

Pixel Alignment and Pruning

The data collection system required integration of multiple resources, including GPS from street view, elevation maps, and 3D models. Any slight inaccuracy in the metadata or 3D models can cause pixel misalignment in the collected images. Therefore, we performed the following post-processing procedure to cancel some of these errors.

We wish to verify whether the centers of the images in a bundle show the same physical target and adjust them if necessary. The collected image bundles can show large (often $`>100\degree`$) viewpoint changes, while existing registration methods do not handle such large angular changes in unconstrained scenes. To solve this, we utilize the metadata again: we extract the relative pose of each camera with respect to the desired physical point's surface normal (acquired from the 3D models). Then a homography is applied to project one patch in the bundle onto another such that the local image planes on which the desired physical point lies become parallel. This cancels the perspective transformation caused by the baseline change (see 'Warped' in Figure 15). Thus, the registration transformation between the two rectified images can be effectively approximated by a similarity transformation. To avoid quadratic complexity in the number of images, we select the most frontal view of each target point and align the rest of the images of that target with respect to it.
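The rectifying warp relies on the standard plane-induced homography. The sketch below assumes the plane convention $`n^\top X = d`$ in the first camera frame and known intrinsics $`K`$; our actual pipeline involves additional metadata handling not shown here.

```python
import numpy as np

def plane_induced_homography(K, R, t, n, d):
    """Homography H mapping pixels of view 1 to view 2 for 3D points on
    the plane n . X = d (first camera frame), where the second camera
    is X' = R X + t:  H = K (R + t n^T / d) K^{-1}."""
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
```

With $`R = I`$ and $`t = 0`$ the homography reduces to the identity, and for any point on the plane, mapping its view-1 pixel through $`H`$ reproduces its view-2 projection exactly.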

Since the patches often show non-planarity, we found a nonrigid transformation (we used SIFT flow ) more robust for registering the rectified patches. We use RANSAC to fit a similarity transformation to the resulting flow field and apply the inverse of the rectification transformation to find the translation vector that should be applied to the original image to perform the alignment (see 'Aligned' in Figure 15). We use only the translation component for the alignment, since we want to preserve the original camera rotation information for training. Finally, to remove the remaining unreliable and occluded patches, we extract two registration metrics (the Structural Similarity Index, which measures pixel similarity among registered images, and the magnitude of the estimated transformation) and threshold the dataset based on them (see 'Occluded' in Figure 15(b)).
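A minimal version of the RANSAC similarity fit on the flow field might look as follows; this is a 2D Umeyama-style least-squares fit with a simple RANSAC loop, and the thresholds and iteration counts are illustrative, not the values we used.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity (scale s, rotation R, translation t)
    mapping src -> dst, in the style of Umeyama's method."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # forbid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / sc.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_similarity(src, dst, iters=200, thresh=2.0, seed=0):
    """Robustly fit a similarity transform to a noisy flow field:
    src/dst are (n, 2) arrays of corresponding pixel positions."""
    rng = np.random.default_rng(seed)
    best_inl = None
    for _ in range(iters):
        idx = rng.choice(len(src), 2, replace=False)  # minimal sample
        s, R, t = fit_similarity(src[idx], dst[idx])
        err = np.linalg.norm((s * src @ R.T + t) - dst, axis=1)
        inl = err < thresh
        if best_inl is None or inl.sum() > best_inl.sum():
            best_inl = inl
    return fit_similarity(src[best_inl], dst[best_inl])  # refit on inliers
```

Only the translation component of the recovered transform is then used for the alignment, as described above.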

Pixel alignment process for two sample targets. Blue arrow shows the adjustment vector between the initial and updated position of the center. (b) shows a case with occlusion (view 4).

Noise Statistics

A user study through Amazon Mechanical Turk was performed to quantify the amount of noise in the final dataset. 15,124 random pairs of matching patches from the dataset were each annotated by three Turkers to identify whether they have at least $`25\%`$ overlap in their content and, if not, what the reason was. Also, to quantify the magnitude of pixel misalignment, they were asked to click on the pixel in the $`2^{nd}`$ patch corresponding to the center pixel of the $`1^{st}`$ patch.

About $`68\%`$ of the pairs were found to have at least $`25\%`$ overlap (by majority voting). For the subset with less than $`25\%`$ overlap, the originating issues were identified as: $`49.5\%`$ due to inaccuracies in metadata and 3D models, $`27.7\%`$ due to tree occlusion, $`16.5\%`$ due to other occlusions, and $`6.1\%`$ other reasons. In heavily tree-occluded areas (e.g., Washington D.C. suburbs), the percentage of matching pairs dropped to $`58\%`$. Such areas can easily be avoided by geo-fencing (e.g., by collecting data from downtowns, which show more buildings and fewer trees than suburbs).

The mean and standard deviation of the magnitude of misalignment (defined as the magnitude of the vector connecting the center of the $`2^{nd}`$ patch to the pixel location where the Turkers clicked) was 16.12 ($`\approx11\%`$ of patch width) and 11.55 pixels, respectively.

In our training dataset, we did not perform any manual filtering or geo-fencing on top of the automatically collected data, since the amount of noise appeared to be within the robustness tolerance of ConvNets: their training converged and performed well on the carefully filtered test set.

Surface Normal Estimation

Estimating surface normals is a fundamental problem in 3D computer vision. We evaluated the usefulness of the representation learned by our method for surface normal estimation on the standard NYU2 benchmark . The dataset contains 795 training images and 654 testing images.

In this work, our goal is not to develop the best method for a specific task, but rather to investigate how useful the features learned by training for pose estimation are for various 3D tasks without any fine-tuning. For this task, we employed two methods to read the surface normal value out of the representation of an image/patch: a linear classifier and nearest neighbors. While running our experiments, we noticed that surface normal estimation on the NYU dataset is heavily biased, as most pixels belong to one of the three categories of ceiling, floor, and walls. This inevitably means that prediction results are dominated by performance on those categories. To account for this, we propose reporting binned errors. Binned errors are calculated by first binning pixelwise surface normals into 20 bins (in a manner similar to ) and calculating the error within each bin; we then compute average statistics, such as the mean or the median, across the bins. The idea is similar to reporting the mean of per-class accuracies in classification tests. In addition to this setup, we also report numbers according to the standard benchmark evaluation: pixelwise median error and the percentage of pixels within $`11.5\degree`$, $`22.5\degree`$, and $`30\degree`$, respectively.
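The binned error computation can be sketched as follows; the assignment of ground-truth normals to 20 bins is assumed to be given (e.g., from a pre-computed clustering), and the function name is ours.

```python
import numpy as np

def binned_angular_errors(pred, gt, bin_ids, n_bins=20):
    """Per-bin mean angular error between predicted and ground-truth unit
    normals (arrays of shape (n, 3)); bin_ids assigns each pixel's
    ground-truth normal to one of n_bins pre-computed clusters.
    Returns the mean and median of the per-bin errors."""
    dots = np.clip((pred * gt).sum(-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(dots))          # pixelwise angular error
    per_bin = [ang[bin_ids == b].mean()
               for b in range(n_bins) if (bin_ids == b).any()]
    return np.mean(per_bin), np.median(per_bin)
```

Averaging over bins rather than pixels prevents the dominant wall/ceiling/floor bins from swamping the statistic, analogous to mean per-class accuracy.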

Prediction via Linear Classification

In previous works, such as , coarse surface normals were estimated by predicting normals on a 20x20 spatial grid. The normal of each grid block is predicted by solving a 20-way classification problem, where the 20 classes correspond to 20 pre-computed surface normal clusters. The final estimate is produced by upsampling this 20x20 grid to the full image resolution.

Following a similar experimental setup, we first report the accuracy of predicting surface normals on 20x20 grids. We trained a linear classifier on top of our representation as well as on the methods of  and AlexNet trained for ImageNet. We report the classification accuracy at individual grid locations and the mean/median angular error of the estimated surface normals. The angular error is calculated as the angular distance between the predicted and ground-truth surface normal clusters. We report accuracies with and without binning in Table [table:nyubin], indicating that our representation outperforms AlexNet and .


Evaluation of surface normal estimation on the NYU2 dataset using the common evaluation protocol, which is affected by the dominating wall, ceiling, and floor pixels. We used the 1-NN technique on each representation to predict the normals. We report the pixelwise median error and the percentage of pixels within 11.5°, 22.5°, and 30°, respectively. Our method outperforms the feature learning approaches of Agrawal et al. and Wang et al. and is comparable to layer-7 features of AlexNet trained for ImageNet classification under this evaluation setup.
| Net | Median (lower better) | 11.5° (higher better) | 22.5° (higher better) | 30° (higher better) |
|---|---|---|---|---|
| Agrawal et al. | 20.2 | 29.1 | 54.5 | 67.9 |
| Wang et al. | 20.4 | 28.9 | 54.0 | 67.7 |
| AlexNet | 19.7 | 30.3 | 55.7 | 69.5 |
| Ours | 19.6 | 30.7 | 55.6 | 68.8 |

Prediction via Nearest Neighbors

In addition to the results presented above, we also used 1-Nearest Neighbor to compute the pixelwise surface normal errors. For each image in the test set, we found the closest image in the training set under each representation. The surface normals of this closest image were taken as the predicted surface normals for the query image. The pixelwise error in surface normal estimation for the various ConvNets is reported in Table 1. The results are consistent with the linear classifier experiments.
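The 1-NN readout is straightforward to sketch; the function name and the toy data below are illustrative, and in practice the descriptors come from the ConvNet.

```python
import numpy as np

def nn_normal_prediction(test_feats, train_feats, train_normal_maps):
    """1-NN readout: for each test image, copy the surface-normal map of
    the closest training image in the representation space."""
    # pairwise squared l2 distances between descriptors
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    return train_normal_maps[nearest]
```

This readout involves no training at all, so its accuracy directly reflects how well the frozen representation organizes images by 3D structure.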

Joint Embedding of Synthetic Cubes and Images

In order to further investigate the nature of our representation, we performed a joint tSNE embedding of synthetic cubes along with a single car category from the EPFL dataset (see Figure 16). This plot is closely related to the Cube$`\leftrightarrows`$Object nearest neighbor results in Figure 8 of the main paper (here we show all images, but using a lossy 2D embedding; in Figure 8 of the main paper, the full-dimensional representation was used for retrieving nearest neighbors, but only a few sample queries could be shown).

If a representation mostly encodes semantic (or low-level appearance) information, then cars and cubes should cluster in two different parts of the feature space, whereas if it encodes more geometric information, the organization should be based on pose. Figure 16 shows that our method embeds cubes and cars according to their geometric pose. In addition, in our feature space, cars and cubes are close together, with one enclosed by the other, whereas the baselines linearly separate them and place them far apart. This indicates that, compared to the other representations, ours predominantly captures geometric information over semantics or low-level appearance. It should be noted that the synthetic cubes were sampled uniformly in the pose space, whereas the cars in the EPFL dataset are not sampled uniformly (no camera pitch). As a result, many cubes have no corresponding car and the tSNE plots are slightly skewed.

tSNE on Affinity Matrix:

The tSNE embeddings in Figure 16 are plotted in an affinity space. That is, given $`N`$ car and $`M`$ cube images, we compute the representation of each image and form an $`(N+M) \times M`$ affinity matrix whose element $`(i,j)`$ is the $`l_2`$ distance between the representations of the $`i^{th}`$ image and the $`j^{th}`$ cube. In other words, the representations of all car and cube images are expressed with the cube collection as the reference. We then perform the tSNE embedding on this $`(N+M) \times M`$ affinity matrix, with each row being the $`M`$-dimensional representation of an image, rather than its original representation.
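A sketch of this affinity-space construction, with random stand-ins for the learned descriptors (the use of scikit-learn's TSNE and all sizes here are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for the learned descriptors (real ones come from the ConvNet).
car_feats = rng.standard_normal((40, 16))   # N car images
cube_feats = rng.standard_normal((30, 16))  # M cube images

# (N+M) x M affinity matrix: row i holds the l2 distances from image i
# (car or cube) to every cube, i.e., every image re-expressed with the
# cube collection as the reference.
all_feats = np.vstack([car_feats, cube_feats])
affinity = np.linalg.norm(all_feats[:, None, :] - cube_feats[None, :, :], axis=-1)

# Embed the rows of the affinity matrix instead of the raw descriptors.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(affinity)
```

Each row of `affinity` describes an image purely by how far it sits from every cube, which is what suppresses raw appearance in the resulting embedding.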

We adopted this approach to let the tSNE plots bring out the geometric encoding and not be dominated by the appearance/semantic information in the representation. For successful pose estimation, encoding both geometric and appearance information is essential.5 However, when a tSNE plot is formed, the user has no control over whether geometry or appearance (or both) governs the embedded space; in general, all of the manifolds existing in a representation contribute to the final 2D embedding. The workaround of finding the tSNE embedding in a unified affinity space lowers the impact of appearance, since the appearance of all cubes is roughly the same while their poses cover a wide range. All of the plots in Figure 16 (including the baselines) are formed using this approach.

tSNE embedding of the synthetic cubes and one category of EPFL dataset. Our representation shows a geometric organization while the baselines perform a clear semantic/appearance based separation.

Illustration of Pose Induction via tSNE Embeddings of more ImageNet Classes and MIT Places

tSNE of more ImageNet categories of our representation vs various baselines.
tSNE of more ImageNet categories of our representation vs various baselines.

We only trained our network to estimate 6DOF pose on street view scenes. However, it appears that our representation learned something generic about the geometry of objects (see the discussion on unsupervised evaluations in section 4 of the main paper). This is elucidated by tSNE embeddings of objects in the ImageNet dataset (Figures 17 and 18). These embeddings show that the feature space produced by our network performs pose induction for unseen object classes without any additional training.

However, it should be noted that a purely geometric approach is sometimes insufficient for full object pose estimation, for example, distinguishing between the front and back of a car. Such knowledge depends on semantics, not merely geometry. It is therefore not surprising that our method places such poses close to each other. One further mode of confusion of our method is placing together poses that are 90 degrees apart. If the pose is indeed estimated based on vanishing points (see section 4.2 of the main paper), it is to be expected that objects 90 degrees apart in azimuth would have the same vanishing points, and teasing these poses apart would require semantic knowledge. Since our method shows this confusion, it supplies further support that our method may perform pose estimation analogously to a method that estimates pose based on vanishing points.

tSNE of our representation vs AlexNet (ImageNet) on a non-streetview dataset: MIT Places Benchmark (class ‘library’ which is one of the pose rich categories). Our network shows a clear pose based embedding, unlike AlexNet trained on ImageNet. For both networks, the descriptor was computed over the entire image, and not patches.
This figure shows the same tSNE plot as the one in Figure 7 of the main paper, as well as AlexNet’s tSNE for the sake of comparison. The AlexNet is trained on ImageNet and shows a non-geometric organization.

In addition to pose induction on ImageNet, our method also performs pose induction on scene categories outside our street view dataset. For instance, in Figure 19, the images of the MIT Places dataset (category 'library') are embedded according to pose. Finally, additional visualizations of the embedding of image patches from our dataset using our representation and AlexNet's (trained on ImageNet) are shown in Figure 20.

Additional Training Details

For pose training, we followed a curriculum strategy: we first trained only on angles within $`[-90\degree, 90\degree]`$ and then extended to all angles. We found that this performed slightly better than directly training on all angles. We also experimented with quaternions, but found that predicting Euler angles performed better (quantitative results in the main paper).
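The two-stage curriculum amounts to a simple data filter; the function name and the per-pair relative-angle array below are illustrative assumptions about how such a filter might be wired into the data loader.

```python
import numpy as np

def curriculum_filter(pair_ids, rel_angle_deg, stage):
    """Two-stage curriculum sketch: stage 0 keeps only training pairs
    whose relative angle lies in [-90, 90] degrees; stage 1 keeps all."""
    if stage == 0:
        return pair_ids[np.abs(rel_angle_deg) <= 90.0]
    return pair_ids
```

Training would first run to convergence with `stage=0` batches and then continue with `stage=1`, so the network sees the easier, narrower-baseline pairs before the full pose range.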



  2. We attempted to quantitatively evaluate this, but the largest vanishing point datasets (e.g., York  and PKU ) include only 102-200 images for both training and testing. Given a 500D descriptor, it was not feasible to provide statistically significant evidence. ↩︎

  3. The classes of boat, sofa, and chair were showing a performance near statistically informed random for all methods and were removed from the evaluations. ↩︎

  4. Though visual matching/tracking is also one of the early-developed cognitive skills, we are unaware of any studies investigating its foundational role in developing visual perception. Therefore, we presume (and empirically observe) that the generality of our 3D representation is mostly attributable to the camera pose estimation component. ↩︎

  5. For instance, if the pose estimation is performed based on vanishing points, then besides the extraction of at least 3 vanishing points from the input images, the correspondences among vanishing points need to be established as well. Therefore, the representation must encode some appearance information in order to enable finding the correspondences among vanishing points. ↩︎