Multiview Aggregation for Learning Category-Specific Shape Reconstruction



Srinath Sridhar (1), Davis Rempe (1), Julien Valentin (2), Sofien Bouaziz (2), Leonidas J. Guibas (1,3)
(1) Stanford University  (2) Google Inc.  (3) Facebook AI Research
ssrinath@cs.stanford.edu  geometry.stanford.edu/projects/xnocs

Abstract

We investigate the problem of learning category-specific 3D shape reconstruction from a variable number of RGB views of previously unobserved object instances. Most approaches for multiview shape reconstruction operate on sparse shape representations, or assume a fixed number of views. We present a method that can estimate dense 3D shape, and aggregate shape across multiple and varying numbers of input views. Given a single input view of an object instance, we propose a representation that encodes the dense shape of the visible object surface as well as the surface behind the line of sight, occluded by the visible surface. When multiple input views are available, the shape representation is designed to be aggregated into a single 3D shape using an inexpensive union operation. We train a 2D CNN to learn to predict this representation from a variable number of views (1 or more). We further aggregate multiview information by using permutation equivariant layers that promote order-agnostic exchange of view information at the feature level. Experiments show that our approach is able to produce dense 3D reconstructions of objects that improve in quality as more views are added.

1 Introduction

Learning to estimate the 3D shape of objects observed from one or more views is an important problem in 3D computer vision, with applications in robotics, 3D scene understanding, and augmented reality. Humans and many animals perform well at this task, especially for known object categories, even when the observed object instances have never been encountered before [27].
We are able to infer the 3D surface shape both of object parts that are directly visible and of parts that are occluded by the visible surface. When provided with more views of the instance, our confidence about its shape increases. Endowing machines with this ability would allow us to operate and reason in new environments and enable a wide range of applications. We study this problem of learning category-specific 3D surface shape reconstruction given a variable number of RGB views (1 or more) of an object instance.

There are several challenges in developing a learning-based solution for this problem. First, we need a representation that can encode the 3D geometry of both the visible and occluded parts of an object while still being able to aggregate shape information across multiple views. Second, for a given object category, we need to learn to predict the shape of new instances from a variable number of views at test time. We address these challenges by introducing a new representation for encoding category-specific 3D surface shape, and a method for learning to predict shape from a variable number of views in an order-agnostic manner.

Representations such as voxel grids [6], point clouds [9, 17], and meshes [11, 40] have previously been used for learning 3D shape. These representations can be computationally expensive to operate on, often produce only sparse or smoothed-out reconstructions, or decouple 3D shape from 2D projection, losing 2D-3D correspondence. To overcome these issues, we build upon the normalized object coordinate space maps (NOCS maps) representation [39]: a 2D projection of a shared category-level 3D object shape space that can encode intra-category shape variation (see Figure 1). A NOCS map can be interpreted as a 3D surface reconstruction, in a canonical space, of the object pixels directly visible in an image. NOCS maps retain the advantages of point clouds and are implicitly grounded to the image since they provide a strong pixel-shape correspondence, a feature that allows us to copy object texture from the input image. However, a NOCS map only encodes the surface shape of object parts directly in the line of sight. We extend it to also encode the 3D shape of object parts that are occluded by the visible surface by predicting the shape of the object surface furthest and hidden from the view, called X-NOCS maps (see Figure 1).

Given a single RGB view of an object instance, we aim to reconstruct the NOCS map corresponding to the visible surface and the X-NOCS map of the occluded surface. Given multiple views, we aggregate the predicted NOCS and X-NOCS maps from each view into a single 3D shape using an inexpensive union operation. To learn to predict these visible and occluded NOCS maps for one or more views, we use an encoder-decoder architecture based on SegNet [3]. We show that a network can learn to predict shape independently for each view.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An input RGB view of a previously unseen object instance (a). Humans are capable of inferring the shape of the visible object surface (original colors in (b)) as well as the parts that are outside the line of sight (separated by the red line in (b)). We propose an extended version of the NOCS map representation [39] to encode both the visible surface (c) and the occluded surface furthest from the current view, the X-NOCS map (d). Note that (c) and (d) are in exact pixel correspondence to (a), and their point set union yields the complete 3D shape of the object. RGB colors denote the XYZ position within NOCS. We learn category-specific 3D reconstruction from one or more views.
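The union-based aggregation just described is computationally trivial: each (X-)NOCS map is read out as a point set and the sets are concatenated, since every view maps into the same canonical NOCS space. A minimal NumPy sketch (the array shapes and the boolean-mask convention are our illustration, not the exact data format used in the paper):

```python
import numpy as np

def readout(nocs_map: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 (X-)NOCS map to an Nx3 point set.
    Pixels where mask is False are background and are dropped."""
    return nocs_map[mask]  # boolean indexing flattens to (N, 3)

def aggregate(maps_and_masks) -> np.ndarray:
    """Point-set union of any number of (X-)NOCS maps.
    NOCS is a shared canonical space, so concatenation suffices."""
    points = [readout(m, k) for m, k in maps_and_masks]
    return np.concatenate(points, axis=0)

# Toy example: two 2x2 views of the same object, one object pixel each.
m1 = np.zeros((2, 2, 3)); m1[0, 0] = [0.1, 0.2, 0.3]
k1 = np.zeros((2, 2), dtype=bool); k1[0, 0] = True
m2 = np.zeros((2, 2, 3)); m2[1, 1] = [0.9, 0.8, 0.7]
k2 = np.zeros((2, 2), dtype=bool); k2[1, 1] = True

cloud = aggregate([(m1, k1), (m2, k2)])
print(cloud.shape)  # (2, 3): one point from each view, in one canonical frame
```

No explicit correspondence search or registration is needed; this is the appeal of predicting shape in a canonical space.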
However, independent learning does not exploit multiview overlap information. We therefore propose to aggregate multiview information in a view order-agnostic manner by using permutation equivariant layers [43] that promote information exchange among the views at the feature level. Thus, our approach aggregates multiview information both at the shape level and at the feature level, enabling better reconstructions. Our approach is trained on a variable number of input views and can be used on a different variable number of views at test time. Extensive experiments show that our approach outperforms other state-of-the-art approaches, is able to reconstruct object shape with fine details, and accurately captures dense shape while improving reconstruction as more views are added, both during training and testing.

2 Related Work

Extensive work exists on recognizing and reconstructing the 3D shape of objects from images. This review focuses on learning-based approaches, which have dominated recent state of the art, but we briefly summarize below the literature on techniques that rely purely on geometry and constraints.

Non-Learning 3D Reconstruction Methods: The method presented in [22] requires user input to estimate both camera intrinsics and multiple levels of reconstruction detail using primitives, which allows complete 3D reconstruction. The approach of [38] also requires user input but is more data-driven and targets class-based 3D reconstruction of the objects in the Pascal VOC dataset. [4] is another notable approach for class-based 3D reconstruction: a parametric 3D model and corresponding parameters per object instance are predicted with minimal user intervention. We now focus on learning-based methods for single and multiview reconstruction.

Single-View Reconstruction: Single-view 3D reconstruction of objects is a severely under-constrained problem.
Probabilistic or generative techniques have been used to impose constraints on the solution space. For instance, [21] uses structure from motion to estimate camera parameters and learns category-specific generative models. The approach of [9] learns a generative model of unordered point clouds. The method of [15] also argues for learning generative models that can predict 3D shape, pose, and lighting from a single image. Most techniques implicitly or explicitly learn class-specific generative models, but there are some, e.g., [36], that take a radically different approach and use multiple views of the same object to impose a geometric loss during training. The approach of [42] predicts 2.5D sketches in the form of depth, surface normal, and silhouette images of the object. It then infers the 3D object shape using a voxel representation. In [13], the authors present a technique that uses silhouette constraints. That loss is not well suited for non-convex objects, and hence the authors propose to use another set of constraints coming from a generative model which has been taught to generate 3D models. Finally, [44] proposes an approach that first predicts depth from a 2D image, which is then projected onto a spherical map. This map is inpainted to fill holes and backprojected into a 3D shape.

Multiview Reconstruction: Multiple views of an object add more constraints to the reconstructed 3D shape. Some of the most popular constraints in computer vision are multiview photometric consistency, depth error, and silhouette constraints [18, 41]. In [20], the authors assume that the pose of the camera is given and extract image features that are un-projected in 3D and iteratively fused with the information from other views into a voxel grid. Similarly, [16] uses structure from motion to extract camera calibration and pose.
[23] proposes an approach to differentiable point-cloud rendering that effectively deals with the problem of visibility. Some approaches jointly perform the tasks of estimating the camera parameters as well as reconstructing the object in 3D [17, 45].

Permutation Invariance and Equivariance: One of the requirements of supporting a variable number of input views is that the network must be agnostic to the order of the inputs. This is not the case with [6], since their RNN is sensitive to input view order. In this work, we use ideas of permutation invariance and equivariance from DeepSets [29, 43]. Permutation invariance has been used in computer vision in problems such as burst image deblurring [2], shape recognition [35], and 3D vision [28]. Permutation equivariance is not as widely used in vision but is common in other areas [29, 30]. Other forms of approximate equivariance have been used in multiview networks [7]. A detailed theoretical analysis is provided by [25].

Shape Representations: There are two dominant families of shape representations used in the literature: volumetric and surface representations, each with trade-offs in terms of memory, closeness to the actual surface, and ease of use in neural networks. We offer a brief review and refer the reader to [1, 34] for a more extensive study. The voxel representation is the most common volumetric representation because of its regular grid structure, making convolutional operators easy to implement. As illustrated in [6], which performs single and multiview reconstructions, voxels can be used as an occupancy grid, usually resulting in coarse surfaces. [26] demonstrates high-quality reconstruction and geometry completion results. However, voxels have a high memory cost, especially when combined with 3D convolutions.
This has been noted by several authors, including [31], who propose to first predict a series of 6 depth maps observed from each face of a cube containing the object to reconstruct. Each series of 6 depth maps represents a different surface, allowing both the outside and the inside (occluded) parts of objects to be captured efficiently. These series of depth maps are coined shape layers and are combined in an occupancy grid to obtain the final reconstructions. Surface representations have advantages such as compactness, and are amenable to differentiable operators that can be applied on them. They are gaining popularity in learning 3D reconstruction with works like [19], where the authors present a technique for predicting category-specific mesh (and texture) reconstructions from single images, or explorations like [9], which introduces a technique for reconstructing the surface of objects using point clouds. Another interesting representation is scene coordinates, which associates each pixel in the image with a 3D position on the surface of the object or scene being observed. This representation has been successfully used for several problems, including camera pose estimation [37] and face reconstruction [10]. However, it requires a scene- or instance-specific scan to be available. Finally, geometry images [12] have been proposed to encode 3D shape in images. However, they lack input RGB pixel to shape correspondence. In this work, we propose a category-level surface representation that has the advantages of point clouds but encodes a strong pixel-3D shape correspondence, which allows multiview shape aggregation without explicit correspondences.

Figure 2: Given canonically aligned and scaled instances from an object category [5], the NOCS representation [39] can be used to encode intra-category shape variation. For a single view (a), a NOCS map encodes the shape of the visible parts of the object (b).
We extend this representation to also encode the occluded parts, called an X-NOCS map (c). Multiple (X-)NOCS maps can be trivially combined using a set union operation (∪) into a single dense shape (rightmost). We can also efficiently represent the texture of object surfaces that are not directly observable (d). Inputs to our method are shown in green boxes, predictions are in red, and optional predictions are in orange.

3 Background and Overview

In this section, we provide a description of our shape representation, relevant background, and a general overview of our method.

Shape Representation: Our goal is to design a shape representation that can capture the dense shapes of both the visible and occluded surfaces of objects observed from any given viewpoint. We would like a representation that can support computationally efficient signal processing (e.g., 2D convolution) while also having the advantages of 3D point clouds. This requires a strong coupling between image pixels and 3D shapes. We build upon the NOCS map [39] representation, which we describe below.

Figure 3: We use depth peeling to extract X-NOCS maps corresponding to different ray intersections. The top row shows 4 intersections. The bottom row shows our representation, which uses the first and last intersections.

The Normalized Object Coordinate Space (NOCS) can be described as the 3D space contained within a unit cube, as shown in Figure 2. Given a collection of shapes from a category which are consistently oriented and scaled, we build a shape space where the XYZ coordinates within NOCS represent the shape of an instance. A NOCS map is a 2D projection of the 3D NOCS points of an instance as seen from a particular viewpoint. Each pixel in the NOCS map denotes the 3D position of that object point in NOCS (color coded in Figure 2).
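For illustration, placing a canonically oriented model into NOCS amounts to a centering-and-scaling step into the unit cube. The exact normalization convention below (uniform scaling by the longest bounding-box side) is our assumption; dataset pipelines may instead scale by the box diagonal:

```python
import numpy as np

def to_nocs(points: np.ndarray) -> np.ndarray:
    """Map an Nx3 point cloud (already canonically oriented) into the
    unit cube: center it, scale uniformly by the largest bounding-box
    side (preserving aspect ratio), then shift into [0, 1]^3."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()
    return (points - center) / scale + 0.5

pts = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 1.0]])
nocs = to_nocs(pts)
print(nocs.min(), nocs.max())  # all coordinates lie within [0, 1]
```

Because every instance of a category is normalized the same way, the XYZ value of a surface point doubles as a color code, which is how the figures visualize NOCS maps.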
NOCS maps are dense shape representations that scale with the size of the object in the view: objects that are closer to the camera, covering more image pixels, are denser than objects further away. They can readily be converted to a point cloud by reading out the pixel values, while still retaining 3D shape-pixel correspondence. Because of this correspondence, we can obtain camera pose in the canonical NOCS space using the direct linear transform algorithm [14]. However, NOCS maps only encode the shape of the visible surface of the object.

Depth Peeling: To overcome this limitation and encode the shape of the occluded object surface, we build upon the ideas of depth peeling [8] and layered depth images [33]. Depth peeling is a technique used to generate more accurate order-independent transparency effects when blending transparent objects. As shown in Figure 3, this process refers to the extraction of object depth or, alternatively, NOCS coordinates corresponding to the k-th intersection of a ray passing through a given image pixel. By peeling a sufficiently large number of layers (e.g., k = 10), we can accurately encode the interior and exterior shape of an object. However, using many layers can be unnecessarily expensive, especially if the goal is to estimate only the external object surface. We therefore propose to use 2 layers to approximate the external surfaces corresponding to the first and last ray intersections. These intersections faithfully capture the visible and occluded parts of most common convex objects. We refer to the maps corresponding to the occluded surface (i.e., the last ray intersection) as X-NOCS maps, similar to X-ray images. Both NOCS and X-NOCS maps support multiview shape aggregation into a single 3D shape using an inexpensive point set union operation. This is because NOCS is a canonical and normalized space where multiple views correspond to the same 3D space.
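Given the per-pixel intersection lists produced by depth peeling, keeping only the first and last layers is trivial. A pure-Python sketch (the list-of-hits input format is our illustration of the idea, not the rendering pipeline's actual output):

```python
def peel_first_last(intersections):
    """intersections: per-pixel list of ray-hit depths (or NOCS points),
    sorted from nearest to farthest; an empty list marks a background
    pixel. Returns (visible, occluded): the 1st and the last hit."""
    visible, occluded = [], []
    for hits in intersections:
        if hits:                       # object pixel
            visible.append(hits[0])    # NOCS map (nearest surface)
            occluded.append(hits[-1])  # X-NOCS map (farthest surface)
        else:                          # background pixel
            visible.append(None)
            occluded.append(None)
    return visible, occluded

# A ray through a closed object has >= 2 hits; a grazing ray may have 1.
vis, occ = peel_first_last([[0.2, 0.35, 0.6, 0.8], [0.5], []])
print(vis, occ)  # [0.2, 0.5, None] [0.8, 0.5, None]
```

Interior intersections (the 0.35 and 0.6 above) are discarded; for non-convex objects they would be needed to capture concavities exactly, which is the approximation noted in the text.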
Since these maps preserve pixel-shape correspondence, they also support estimation of object or camera pose in the canonical NOCS space [39]. We can use the direct linear transform [14] to estimate camera pose, up to an unknown scale factor (see supplementary document). Furthermore, we can support the prediction of the texture of the occluded parts of the object by hallucinating a peeled color image (see Figure 2 (d)).

Learning Shape Reconstruction: Given the X-NOCS map representation that encodes the 3D shape of both visible and occluded object surfaces, our goal is to learn to predict both maps from a variable number of input views and aggregate the multiview predictions. We adopt a supervised approach for this problem. We generated a large corpus of training data with synthetic objects from 3 popular categories: cars, chairs, and airplanes. For each object we render multiple viewpoints, as well as the corresponding ground truth X-NOCS maps. Our network learns to predict the (X-)NOCS maps corresponding to each view using a SegNet-based [3] encoder-decoder architecture. Learning independently on each view does not exploit the available multiview overlap information. We therefore aggregate multiview information at the feature level by using permutation equivariant layers that combine input view information in an order-agnostic manner. The multiview aggregation that we perform at the NOCS shape and feature levels allows us to reconstruct dense shape with details, as we show in Section 5.

4 Method

Our goal is to learn to predict both the NOCS and X-NOCS maps corresponding to a variable number of input RGB views of previously unobserved object instances. We adopt a supervised learning approach and restrict ourselves to specific object categories. We first describe our general approach to this problem and then discuss how we aggregate multiview information.
4.1 Single-View (X-)NOCS Map Prediction

The goal of this task is to predict the NOCS map for the visible parts (N_v) and the X-NOCS map for the occluded parts (N_o) of the object, given a single RGB view I. We assume that no other multiview inputs are available at train or test time. For this pixel-level prediction task we use an encoder-decoder architecture similar to SegNet [3] (see Figure 4). Our architecture takes a 3-channel RGB image as input and predicts 6 output channels corresponding to the NOCS and X-NOCS maps (N_i = {N_v, N_o}), and optionally also predicts a peeled color map (C_p) encoding the texture of the occluded object surface (see Figure 2 (d)). We include skip connections between the encoder and decoder to promote information sharing and consistency. To obtain the 3D shape of an object instance, the output (X-)NOCS maps are combined into a single 3D point cloud as P = R(N_v) ∪ R(N_o), where R denotes a readout operation that converts each map to a 3D point set.

(X-)NOCS Map Aggregation: While single-view (X-)NOCS map prediction is trained independently on each view, it can still be used for multiview shape aggregation. Given multiple input views {I_0, ..., I_n}, we predict the (X-)NOCS maps {N_0, ..., N_n} for each view independently. NOCS represents a canonical and normalized space, and thus (X-)NOCS maps can also be interpreted as dense correspondences between pixels and the 3D NOCS space. Therefore any set of (X-)NOCS maps will map into the same space; multiview consistency is implicit in the representation. Given multiple independent (X-)NOCS maps, we can combine them into a single 3D point cloud as P_n = ∪_{i=0}^{n} R(N_i).

Loss Functions: We experimented with several loss functions for (X-)NOCS map prediction, including a pixel-level L2 loss and a combined pixel-level mask and L2 loss.
The L2 loss is defined as

    L_e(y, ŷ) = (1/n) Σ ‖y − ŷ‖₂,  ∀ y ∈ {N_v, N_o}, ∀ ŷ ∈ {N̂_v, N̂_o},    (1)

where y, ŷ ∈ R³ denote the ground truth and predicted 3D NOCS values, N̂_v, N̂_o are the predicted NOCS and X-NOCS maps, and n is the total number of pixels in the (X-)NOCS maps. However, this function computes the loss for all pixels, even those that do not belong to the object, thus wasting network capacity.

Figure 4: We use an encoder-decoder architecture based on SegNet [3] to predict NOCS and X-NOCS maps from an input RGB view independently. To better exploit multiview information, we propose to use the same architecture but with added permutation equivariant layers (bottom right) to combine multiview information at the feature level. Our network can operate on a variable number of input views in an order-agnostic manner. The features extracted for each view during upsampling and downsampling operations are combined using permutation equivariant layers (orange bars).

We therefore use object masks to restrict the loss computation only to the object pixels in the image. We predict 2 masks corresponding to the NOCS and X-NOCS maps (8 channels in total). We predict 2 independent masks since they could be different for thin structures like airplane tail fins. The combined mask loss is defined as L_m = L_v + L_o, where the loss for the visible NOCS map and mask is defined as

    L_v(y, ŷ) = w_m M(M_v, M̂_v) + w_l (1/m) Σ ‖y − ŷ‖₂,  ∀ y ∈ N_v, ∀ ŷ ∈ N̂_v,    (2)

where M̂_v is the predicted mask corresponding to the visible NOCS map, M_v is the ground truth mask, M is the binary cross entropy loss on the mask, and m is the number of masked pixels. L_o is identical to L_v but for the X-NOCS map. We empirically set the weights w_m and w_l to 0.7 and 0.3, respectively.
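The combined loss L_v above can be sketched as follows, with w_m = 0.7 and w_l = 0.3 as in the text (a NumPy stand-in for the actual network tensors; the training code operates on batched outputs):

```python
import numpy as np

def bce(mask_gt, mask_pred, eps=1e-7):
    """Binary cross entropy over all pixels of the predicted mask."""
    p = np.clip(mask_pred, eps, 1 - eps)
    return -np.mean(mask_gt * np.log(p) + (1 - mask_gt) * np.log(1 - p))

def loss_v(nocs_gt, nocs_pred, mask_gt, mask_pred, w_m=0.7, w_l=0.3):
    """Masked NOCS loss: BCE on the mask plus per-pixel L2 restricted
    to object pixels, so no capacity is spent on background pixels."""
    m = mask_gt.astype(bool)
    diff = np.linalg.norm(nocs_gt[m] - nocs_pred[m], axis=-1)
    return w_m * bce(mask_gt, mask_pred) + w_l * diff.mean()

# Tiny 2x2 example with one object pixel.
gt = np.zeros((2, 2, 3)); gt[0, 0] = [0.5, 0.5, 0.5]
pred = gt.copy()
mask = np.zeros((2, 2)); mask[0, 0] = 1.0
print(loss_v(gt, pred, mask, mask_pred=mask * 0.999 + 0.0005))
# near zero: perfect NOCS values, near-perfect mask
```

The full objective L_m sums this term with the analogous X-NOCS term L_o computed on the occluded map and its own mask.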
Experimentally, we observe that the combined pixel-level mask and L2 loss outperforms the plain L2 loss, since more network capacity can be utilized for shape prediction.

4.2 Multiview (X-)NOCS Map Prediction

The above approach predicts (X-)NOCS maps independently and aggregates them to produce a 3D shape. However, multiview images of an object carry strong inter-view overlap information which we have not made use of. To promote information exchange between views both during training and testing, and to support a variable number of input views, we propose to use permutation equivariant layers [43] that are agnostic to the order of the views.

Feature Level Multiview Aggregation: Our multiview aggregation network is illustrated in Figure 4. The network is identical to the single-view network except for the addition of several permutation equivariant layers (orange bars). A network layer is said to be permutation equivariant if and only if the off-diagonal elements of the learned weight matrix are equal, as are the diagonal elements [43]. In practice, this can be achieved by passing each feature map through a pool-subtract operation followed by a non-linear function. The pool-subtract operation pools features extracted from different viewpoints and subtracts the pooled feature from the individual features (see Figure 4). We use multiple permutation equivariant layers after each downsampling and upsampling operation in the encoder-decoder architecture (vertical orange bars in Figure 4). Both average pooling and max pooling can be used, but experimentally average pooling worked best. Our permutation equivariant layers consist of an average-subtraction operation and the non-linearity from the next convolutional layer.

Hallucinating Occluded Object Texture: As an additional feature, we train both our single and multiview networks to also predict the texture of the occluded surface of the object (see Figure 2 (d)).
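Returning to the aggregation layer: the average-pool-subtract operation described above can be sketched in a few lines, and its permutation equivariance checked directly. This is a minimal NumPy sketch on per-view feature vectors; in the network it acts on convolutional feature maps and is followed by the next layer's learned weights and non-linearity:

```python
import numpy as np

def pool_subtract(features: np.ndarray) -> np.ndarray:
    """features: (n_views, d) per-view feature vectors. Average-pool
    across views and subtract the pooled feature from each view's
    feature; works for any number of views."""
    pooled = features.mean(axis=0, keepdims=True)
    return features - pooled

f = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = pool_subtract(f)
perm = [2, 0, 1]
out_perm = pool_subtract(f[perm])
# Equivariance: permuting the input views permutes the outputs identically.
print(np.allclose(out[perm], out_perm))  # True
```

Because the pooled term is identical for every view, each view's output depends on the set of views, not on their order, which is what allows training and testing with different, variable view counts.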
This is predicted as 3 additional output channels with the same loss as L_v. This optional prediction can be used to hallucinate the texture of hidden object surfaces.

5 Experiments

Dataset: We generated our own dataset, called ShapeNetCOCO, consisting of object instances from 3 categories commonly used in related work: chairs, cars, and airplanes. We use thousands of instances from the ShapeNet [5] repository, render 20 different views for each instance, and additionally augment backgrounds with randomly chosen COCO images [24]. This dataset is harder than previously proposed datasets because of the random backgrounds and widely varying camera distances. To facilitate comparisons with previous work [6, 17], we also generated a simpler dataset, called ShapeNetPlain, with white backgrounds and 5 views per object, following the camera placement procedure of [17]. Except for comparisons and Table 3, we report results on the more complex dataset. We follow the train/test protocol of [36]. Unless otherwise specified, we use a batch size of 1 (multiview) or 2 (single-view), a learning rate of 0.0001, and the Adam optimizer.

Metrics: For all experiments, we evaluate point cloud reconstruction using the 2-way Chamfer distance multiplied by 100. Given two point sets S_1 and S_2, the Chamfer distance is defined as

    d(S_1, S_2) = (1/|S_1|) Σ_{x ∈ S_1} min_{y ∈ S_2} ‖x − y‖₂² + (1/|S_2|) Σ_{y ∈ S_2} min_{x ∈ S_1} ‖x − y‖₂².    (3)

5.1 Design Choices

Table 1: Single-view reconstruction performance using various losses and outputs. For each category, the Chamfer distance is shown. Using the joint loss with L2 and the mask significantly outperforms L2 alone. Predicting peeled color further improves reconstruction.

    Loss      Output               Cars     Airplanes  Chairs
    L2        (X-)NOCS+Peel        3.6573   7.9072     4.4716
    L2+Mask   (X-)NOCS+Mask        0.5093   0.3037     0.4401
    L2+Mask   (X-)NOCS+Mask+Peel   0.3714   0.2659     0.4288

We first justify our loss function choice and network outputs.
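The two-way Chamfer metric of Eq. (3), including the factor-of-100 scaling used when reporting, can be sketched as:

```python
import numpy as np

def chamfer(s1: np.ndarray, s2: np.ndarray) -> float:
    """Two-way Chamfer distance between Nx3 and Mx3 point sets, using
    squared Euclidean distances, scaled by 100 as in the tables."""
    # (N, M) matrix of pairwise squared distances via broadcasting
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(axis=-1)
    return 100.0 * (d2.min(axis=1).mean() + d2.min(axis=0).mean())

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer(a, a))        # 0.0 for identical sets
print(chamfer(a, a + 0.1))  # small but non-zero for a perturbed copy
```

The brute-force pairwise matrix is fine for point sets of a few thousand points; larger clouds would call for a KD-tree nearest-neighbor query instead.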
As described, we experiment with two loss functions: L2 losses with and without a mask. Further, there are several outputs that we predict in addition to the NOCS and X-NOCS maps, i.e., mask and peeled color. In Table 1, we summarize the average Chamfer distance for all variants trained independently on single views (ShapeNetCOCO dataset). The loss function which jointly accounts for the NOCS map, X-NOCS maps, and mask output clearly outperforms a vanilla L2 loss on the NOCS and X-NOCS maps. We also observe that predicting peeled color along with the (X-)NOCS maps gives better performance on all categories.

5.2 Multiview Aggregation

Table 2: Comparison of different forms of multiview aggregation. Aggregating multiple views using set union improves performance, with further improvements using feature space aggregation.

    Category   Model        2 views  3 views  5 views
    Cars       Single-View  0.4206   0.3974   0.3692
               Multiview    0.3789   0.3537   0.2731
    Airplanes  Single-View  0.1760   0.1677   0.1619
               Multiview    0.2387   0.1782   0.1277
    Chairs     Single-View  0.4249   0.3813   0.3600
               Multiview    0.3649   0.2860   0.2457

Next we show that our multiview aggregation approach is capable of estimating better reconstructions when more views are available (ShapeNetCOCO dataset). Table 2 shows that the reconstruction from the single-view network improves as we aggregate more views into NOCS space (using set union) without any feature space aggregation. When we train with feature space aggregation from 5 views using the permutation equivariant layers, we see further improvements as more views are added. Table 3 shows variations of our multiview model: one trained on a fixed number of views, one trained on a variable number of views up to a maximum of 5, and one trained on a variable number of up to 10 views. All these models are trained on the ShapeNetPlain dataset for 100 epochs.
We see that both fixed and variable models take advantage of the additional information from more views, almost always increasing performance from left to right. Although the fixed multiview models perform best, we hypothesize that the variable view models will better handle a widening gap between the number of train-time and test-time views. In Figure 5, we visualize our results in 3D, which shows the small-scale details, such as airplane engines, reconstructed by our method.

Figure 5: Qualitative reconstructions produced by our method. Each row shows the input RGB views, the NOCS map ground truth and prediction for the central view, and the ground truth and predicted 3D shape. These visualizations are produced by the variable multiview model trained on up to 5 views and evaluated on 5 views. We post-process both the NOCS and X-NOCS maps with a bilateral filter followed by a statistical outlier filter [32], and use the input RGB images to color the point cloud. Best viewed zoomed and in color.

5.3 Comparisons

Table 3: Multiview reconstruction variations. We observe that both fixed and variable models take advantage of the additional information from more views.

    Category   Model                2 views  3 views  5 views
    Cars       Fixed Multi          0.2645   0.1645   0.1721
               Variable Multi (5)   0.2896   0.1989   0.1955
               Variable Multi (10)  0.2992   0.2447   0.3095
    Airplanes  Fixed Multi          0.1318   0.1571   0.0604
               Variable Multi (5)   0.1418   0.1006   0.0991
               Variable Multi (10)  0.1847   0.1309   0.1049
    Chairs     Fixed Multi          0.2967   0.1845   0.1314
               Variable Multi (5)   0.2642   0.2072   0.1695
               Variable Multi (10)  0.2643   0.2070   0.1693

We compare our method to two previous works. The first, called differentiable point clouds (DPC) [17], directly predicts a point cloud given a single image of an object. We train a separate single-view model for cars, airplanes, and chairs to predict the NOCS maps, X-NOCS maps, mask, and peeled color (ShapeNetPlain dataset).
To evaluate the Chamfer distance for DPC outputs, we first scale the predicted output point cloud such that the bounding box diagonal is one, then we follow the alignment procedure from their paper to calculate the transformation from the network's output frame to the ground truth point cloud frame. As seen in Table 4, the X-NOCS map representation allows our network to outperform DPC in all three categories.

Table 4: Single-view reconstruction comparison to DPC [17].

    Method  Cars    Airplanes  Chairs
    DPC     0.2932  0.2549     0.4314
    Ours    0.1569  0.1855     0.3803

We next compare our multiview permutation equivariant model to the multiview method 3D-R2N2 [6]. In each training batch, both methods are given a random subset of 5 views of an object, so that they may be evaluated with up to 5 views at test time. Since 3D-R2N2 outputs a volumetric 32x32x32 voxel grid, we first find all surface voxels of the output and then place a point at the center of each surface voxel to obtain a 3D point cloud. This point cloud is scaled to have a unit-diagonal bounding box to match the ground truth ShapeNet objects. We limit our comparison to chairs only, since we were unable to make their method converge on the other categories.

Table 5: Multiview reconstruction performance compared to 3D-R2N2 [6] on the chairs category in ShapeNetPlain.

    Method   2 views  3 views  5 views
    3D-R2N2  0.2511   0.2191   0.1932
    Ours     0.2508   0.1952   0.1576

Table 5 shows the performance of both methods when trained on the chairs category and evaluated on 2, 3, and 5 views (ShapeNetPlain dataset). For 2 views the methods perform similarly, but when combining more views to reconstruct the shape, our method becomes more accurate. We again see the trend of increasing performance as more views are used.

Figure 6: More qualitative reconstructions produced by our method. For each box, ground truth is shown leftmost.
Here we show reconstructions from (a) the permutation equivariant network trained and tested on 10 views for a car, and (b, c) the permutation equivariant network trained on chairs with 5 views and tested on 5. A reconstruction with higher shape variance that fails to capture small-scale detail is shown in (d). Finally, in (e) we show a visual comparison with the reconstruction produced by [6], which lacks detail such as the armrest even though it sees 5 different views. Best viewed in color.

Limitations and Future Work: While we reconstruct dense 3D shapes, there is still some variance in our predicted shape. We can further improve the quality of our reconstructions by incorporating surface topology information. We currently use the DLT algorithm [14] to predict camera pose in our canonical NOCS space; however, we would need extra information such as depth [39] to estimate metric pose. Jointly estimating pose and shape is a tightly coupled problem and an interesting future direction. Finally, we observed that the Chamfer distance, although widely used to evaluate shape reconstruction quality, is not the ideal metric to differentiate fine-scale detail from overall shape. We plan to explore the use of other metrics to evaluate reconstruction quality.

6 Conclusion

In this paper we introduced X-NOCS maps, a new and efficient surface representation that is well suited to the task of 3D reconstruction of objects, even of occluded parts, from a variable number of views. We demonstrated how this representation can be used to estimate the first and the last surface points that project onto any pixel in the observed image, and also to estimate the appearance of these surface points.
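The permutation equivariant aggregation used by our network belongs to the Deep Sets family [29, 30, 43]; the following is a minimal numpy sketch of one such layer, where the choice of mean pooling and the shapes are illustrative assumptions rather than the paper's exact architecture:

```python
import numpy as np

def perm_equivariant_layer(x, lam, gam):
    """Deep Sets-style permutation equivariant layer.
    Each view's features are transformed individually (lam) and mixed
    with an order-agnostic pooled summary of all views (gam), so
    permuting the input views permutes the output identically.
    x: (n_views, d_in); lam, gam: (d_in, d_out)."""
    pooled = x.mean(axis=0, keepdims=True)  # invariant to view order
    return x @ lam + pooled @ gam
```

Because the only cross-view interaction goes through a symmetric pooling, the layer handles any number of views with the same weights, which is what makes the variable-view training possible.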
We then showed how adding a permutation equivariant layer allows the proposed method to be agnostic to the number of views and their associated viewpoints, and how our aggregation network is able to efficiently combine these observations to yield even higher quality results than those obtained with a single observation. Finally, extensive analysis and experiments validate that our method reaches state-of-the-art results using a single observation, and significantly improves upon existing techniques.

Acknowledgments: This work was supported by the Google Daydream University Research Program, the AWS Machine Learning Awards Program, and the Toyota-Stanford Center for AI Research. We would like to thank Jiahui Lei, the anonymous reviewers, and members of the Guibas Group for useful feedback. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

References

[1] Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada, and Bjorn Ottersten. A survey on deep learning advances on different 3D data representations. arXiv preprint arXiv:1808.01462, 2018.

[2] Miika Aittala and Frédo Durand. Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 731-747, 2018.

[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481-2495, 2017.

[4] Thomas J Cashman and Andrew W Fitzgibbon. What shape are dolphins? Building 3D morphable models from 2D images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):232-244, 2012.
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[6] Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. CoRR, abs/1604.00449, 2016.

[7] Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. Equivariant multi-view networks. arXiv preprint arXiv:1904.00993, 2019.

[8] Cass Everitt. Interactive order-independent transparency. White paper, NVIDIA, 2(6):7, 2001.

[9] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605-613, 2017.

[10] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 534-551, 2018.

[11] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. AtlasNet: A papier-mâché approach to learning 3D surface generation. CoRR, abs/1802.05384, 2018.

[12] Xianfeng Gu, Steven J Gortler, and Hugues Hoppe. Geometry images. In ACM Transactions on Graphics (TOG), volume 21, pages 355-361. ACM, 2002.

[13] JunYoung Gwak, Christopher B Choy, Manmohan Chandraker, Animesh Garg, and Silvio Savarese. Weakly supervised 3D reconstruction with adversarial constraint. In 2017 International Conference on 3D Vision (3DV), pages 263-272. IEEE, 2017.

[14] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[15] Paul Henderson and Vittorio Ferrari.
Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. arXiv preprint arXiv:1901.06447, 2019.

[16] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821-2830, 2018.

[17] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems, pages 2807-2817, 2018.

[18] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2307-2315, 2017.

[19] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371-386, 2018.

[20] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, pages 365-376, 2017.

[21] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966-1974, 2015.

[22] Akash M Kushal, Gaurav Chanda, Kanishka Srivastava, Mohit Gupta, Subhajit Sanyal, TVN Sriram, Prem Kalra, and Subhashis Banerjee. Multilevel modelling and rendering of architectural scenes. In Proc. EuroGraphics, 2003.

[23] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[25] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.

[26] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.

[27] Alex Pentland. Shape information from shading: a theory about human perception. In [1988 Proceedings] Second International Conference on Computer Vision, pages 404-413. IEEE, 1988.

[28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30, pages 5099-5108, 2017.

[29] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.

[30] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2892-2901. JMLR.org, 2017.

[31] Stephan R. Richter and Stefan Roth. Matryoshka networks: Predicting 3D geometry via nested shape layers. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1936-1944, 2018.

[32] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13, 2011.

[33] Jonathan Shade, Steven Gortler, Li-wei He, and Richard Szeliski. Layered depth images. 1998.
[34] Daeyun Shin, Charless C Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061-3069, 2018.

[35] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945-953, 2015.

[36] Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2897-2905, 2018.

[37] Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip HS Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4400-4408, 2015.

[38] Sara Vicente, Joao Carreira, Lourdes Agapito, and Jorge Batista. Reconstructing PASCAL VOC. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41-48, 2014.

[39] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. arXiv preprint arXiv:1901.02970, 2019.

[40] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52-67, 2018.

[41] Olivia Wiles and Andrew Zisserman. SilNet: Single- and multi-view reconstruction by learning from silhouettes. arXiv preprint arXiv:1711.07888, 2017.
[42] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems, pages 540-550, 2017.

[43] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401, 2017.

[44] Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Josh Tenenbaum, Bill Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen classes. In Advances in Neural Information Processing Systems, pages 2263-2274, 2018.

[45] Rui Zhu, Chaoyang Wang, Chen-Hsuan Lin, Ziyan Wang, and Simon Lucey. Object-centric photometric bundle adjustment with deep shape prior. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 894-902. IEEE, 2018.

Appendix

In this appendix, we show more qualitative results of our reconstructions and the raw NOX maps predicted by our network. We also show that our approach can support articulating object categories like the hand, and show example camera pose estimation results. We will make all datasets and code public upon publication of the paper.

Dataset: Figure 7 shows some sample data from our synthetic dataset for each of the 3 categories we consider. As opposed to previous datasets, our dataset features challenging backgrounds and widely varying camera pose and distance to objects.

Figure 7: Two sample views of an instance from each category in our dataset. Our dataset features widely varying camera viewpoints and backgrounds, which are more challenging than previous datasets.

NOX Maps: Figure 8 shows a zoomed-in version of the ground truth and predicted NOX maps (i.e., visible and occluded NOCS maps). Our predictions are able to capture the dense 3D shape of the visible and occluded object surfaces while maintaining pixel-shape correspondence.
Qualitative Results: In Figure 9, we show more detailed qualitative results, including the predicted NOX maps and peeled color maps. Please see the main paper for details on each of the results.

Articulating Category: Our approach generalizes to other object categories. In particular, we show that we can extend our approach to an articulating category like the human hand. We generated a dataset with a virtual hand animated to various poses and augmented with background images. We generated 5 views for each pose. We train the single-view NOCS map prediction network (only the visible NOCS map) with the combined mask and L2 loss function. Figure 10 shows a test pose from the validation set with previously unseen hand shape and pose. Figure 11 shows the 3D point cloud reconstructions of the predicted NOCS maps.

Camera Pose Estimation: Finally, we demonstrate that our approach can also be used for estimating camera pose from the predicted NOCS maps. We use the direct linear transform algorithm [14] to obtain camera intrinsic and extrinsic parameters in the NOCS space. Thus, the estimated pose is accurate up to an unknown scale factor. We leave evaluation of camera pose for future work.

Figure 8: For the input RGB view on the left, we predict a NOX map, i.e., NOCS maps for the visible and occluded object surfaces. For each box on the right, the ground truth is shown on the left and our prediction from the permutation equivariant network on the right.

Figure 9: Detailed qualitative results for the reconstructions shown in Figure 5 of the main paper. Here, for each example, we show two views with the corresponding NOX maps (ground truth on left, prediction on right), the peeled color map, and the 3D point cloud (ground truth on top, reconstruction on bottom).

Figure 10: Reconstruction of an articulating shape category: hands. Here we trained a single-view network to predict only the visible NOCS map.
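The DLT step described above can be sketched as follows: each object pixel pairs its 2D location with its predicted 3D NOCS coordinate, and a 3x4 projection matrix is recovered up to scale from these 2D-3D correspondences. This is a textbook DLT [14] with Hartley normalization omitted for brevity, not our exact implementation:

```python
import numpy as np

def dlt_projection_matrix(X, x):
    """Estimate a 3x4 projection matrix P (up to scale) from n >= 6
    correspondences between 3D points X (n, 3) and pixels x (n, 2)
    using the direct linear transform."""
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])  # homogeneous 3D points
    A = []
    for i in range(n):
        u, v = x[i]
        # Each correspondence contributes two linear equations in
        # the 12 entries of P.
        A.append(np.concatenate([Xh[i], np.zeros(4), -u * Xh[i]]))
        A.append(np.concatenate([np.zeros(4), Xh[i], -v * Xh[i]]))
    # P is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A))
    return vt[-1].reshape(3, 4)
```

Intrinsics and extrinsics can then be separated from P by RQ decomposition of its left 3x3 block; since the NOCS cube is normalized, the recovered pose is only defined up to scale, as noted above.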
Figure 11: Example of camera pose estimation from the predicted NOCS maps for the hands dataset. For each view, we use the DLT algorithm to obtain camera pose in the canonical NOCS space (axes colored red-green-blue). We can then use the set union operation to aggregate the 3D shape.
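The set union aggregation is inexpensive precisely because every view's predictions already live in the shared canonical NOCS space. A hypothetical numpy sketch, assuming each predicted (X-)NOCS map is an (H, W, 3) array with a boolean object mask (the function name and array layout are illustrative):

```python
import numpy as np

def aggregate_nocs_maps(nocs_maps, masks):
    """Union-aggregate per-view NOCS maps into one point cloud.
    Every valid pixel already stores a 3D point in the shared
    canonical (NOCS) space, so aggregation is just concatenation.
    nocs_maps: list of (H, W, 3) arrays; masks: list of (H, W) bool."""
    clouds = [m[mask] for m, mask in zip(nocs_maps, masks)]
    return np.concatenate(clouds, axis=0)
```

No registration or alignment between views is needed; adding a view simply adds its points to the cloud, which is why reconstruction quality improves monotonically with more views.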
