CNN-based synthesis of realistic high-resolution LiDAR data
Larissa T. Triess 1,2, David Peter 1, Christoph B. Rist 1, Markus Enzweiler 1, and J. Marius Zöllner 2,3

Abstract—This paper presents a novel CNN-based approach for synthesizing high-resolution LiDAR point cloud data. Our approach generates semantically and perceptually realistic results with guidance from specialized loss functions. First, we utilize a modified per-point loss that addresses missing LiDAR point measurements. Second, we align the quality of our generated output with real-world sensor data by applying a perceptual loss. In large-scale experiments on real-world datasets, we evaluate both the geometric accuracy and semantic segmentation performance using our generated data vs. ground truth. In a mean opinion score testing we further assess the perceptual quality of our generated point clouds. Our results demonstrate a significant quantitative and qualitative improvement in both geometry and semantics over traditional non-CNN-based up-sampling methods.

I. INTRODUCTION

LiDAR scanners are a key enabler for autonomous driving. They are required to have a very high resolution to provide detailed information on the environment and ensure a high detection performance. Due to the unique properties of the LiDAR data structure, irregularity and sparsity, it is not trivial to increase the point density of LiDAR scans. Typical approaches start by accumulating scans over time or by the guidance of high-resolution RGB camera images. The former faces the difficulty of motion: it is possible to eliminate this effect for static objects, since the ego-motion of the sensor is usually known, but this does not account for the movement of passing objects, which remains a restriction of this approach. For the second case, it is necessary to have an RGB camera in the sensor setup, which is not always guaranteed.
Furthermore, if the distance between the two sensors is large, it becomes more difficult to translate useful data from one sensor to the other. To overcome these issues, we focus on single-frame up-sampling with only one modality, the LiDAR sensor.

LiDAR scans are generated by periodically emitting laser pulses while rotating around a vertical axis. Detailed three-dimensional geometric information about the vehicle's surroundings is obtained, see fig. 1 (top). The resulting point clouds are typically irregular and sparse in three-dimensional space. Processing high-resolution data directly with three-dimensional convolutions is challenging without compromising accuracy [1]. Given that our three-dimensional point clouds exhibit a regular structure, we can use cylindrical two-dimensional projections to represent the data in a 2D (image-like) fashion, see fig. 1 (bottom).

1 Daimler AG, Research and Development, Stuttgart, Germany
2 Karlsruhe Institute of Technology, Karlsruhe, Germany
3 FZI Research Center for Information Technology, Karlsruhe, Germany
Primary contact: larissa.triess@daimler.com

Fig. 1: Up-sampled point cloud: The top left shows a three-dimensional point cloud recorded by a Velodyne VLP32 LiDAR scanner, with every other layer removed. On the bottom left, the corresponding cylindrical projection, the LiDAR distance image, is depicted in range −90° < θ < +90°. The color coding depicts closer objects in brighter colors and marks missing measurements in black. For better visibility, the vertical pixel size is five times the actual size. The right side shows the same scene synthesized with our approach. It is up-sampled with a factor of two and every other layer is colored in violet in the top image. The scene shows an urban street crossing with cars, pedestrians, bicycles, buildings, and trees.
This allows us to design two-dimensional convolutional neural networks (CNNs), which are more efficient than three-dimensional CNNs. One observes that LiDAR sensors provide data with a high horizontal resolution, e.g. 1800 px. However, the vertical resolution is only a small fraction of that, depending on the number of lasers within the LiDAR sensor, for example 16, 32, or 64 for commonly used Velodyne LiDARs [2]. Therefore, the objective of this work is to up-sample the vertical resolution of LiDAR scans, thereby synthesizing LiDAR data as if recorded by a scanner with more layers. Operating in 2D preserves the regular structure in our data, see fig. 1 (bottom), which is beneficial for downstream perception algorithms. Further, since the industry is constantly moving towards higher-resolution sensors, our approach enables the re-usage of recorded (and possibly annotated) data when moving towards a higher-resolution LiDAR in the future.

978-1-7281-0559-8/19/$31.00 ©2019 IEEE

Fig. 2: Overview of the proposed architecture: It is divided into three separate networks.
The top shows the overall architecture with a detailed view of the residual block (green). The input to the network is a down-sampled distance image of size L/2 × W with information about the missing measurements. The residual up-sampling network outputs an up-sampled distance image of size L × W with in-network up-scaling. Both distance images are inputs to the loss (yellow). The bottom shows the three different loss functions under consideration (only one is used at a time).

II. RELATED WORK

The aim of up-sampling is to estimate the high-resolution visual output of a corresponding low-resolution input. In this work we consider cylindrical two-dimensional projections of structured LiDAR point clouds; it is therefore vital to also take into account analogous approaches on RGB images. A sizable amount of literature exists on RGB image up-sampling; we focus on what we consider most relevant to this paper. Yang et al. present a comprehensive evaluation of prevailing RGB up-sampling techniques prior to the adoption of convolutional neural networks [3]. More advanced techniques, such as SRCNN [4], outperform these traditional methods. However, they cannot cope with data that features missing measurements, since dense input representations are required. The traditional methods, on the other hand, can easily be applied to cylindrical LiDAR projections, and due to their low computational complexity they can be used for real-time applications. However, the traditional re-sampling techniques are not able to restore high-frequency information, i.e. fine details in the resized input, due to the low-pass behavior of the interpolation filters [5].

The literature on up-sampling three-dimensional data falls far behind the one on RGB image up-sampling. A number of methods consider point cloud up-sampling as a depth completion task by projecting the laser scans into sparse depth maps.
They either directly operate on the depth input [6] or require guidance, e.g. from a high-resolution camera image [7], [8], [9], [10]. Here, the original structure of the input point cloud is lost and transformed into a high-resolution depth map at camera image resolution. Yu et al. recently proposed PU-Net, which directly operates on three-dimensional point clouds [11]. The up-sampling network learns multilevel features per point and expands the point set via multi-branch convolution units. The expanded feature is then split into a multitude of features, which are then reconstructed to an up-sampled point set. The point set is unordered and forms a generic point cloud. However, for our application, it is important to maintain the ordered point cloud structure provided by LiDAR sensors. First, we are able to apply downstream perception algorithms which have been designed for the structured low-resolution data. Second, it is possible to re-use valuable data recordings by up-sampling them to higher resolutions, especially when new LiDAR sensors with more layers are introduced to the market, or when algorithms like semantic segmentation [12] or stixels [13] have to be adapted to the higher resolution.

The aforementioned methods all focus on optimizing a pixel-level error of the prediction towards the target. Especially for RGB images, the literature agrees that high-resolution image predictions typically do not appear visually realistic to humans [14]. Resulting high-resolution images often lack high-frequency details and are perceptually unsatisfying in terms of failing to match the fidelity expected at the higher resolution. Therefore, a variety of perceptual optimization methods have evolved since. Johnson et al. proposed a style transfer and up-sampling network using a perceptual loss based on VGG-16 [15], [16].
In contrast to previous methods, it uses an in-network re-sizing layer, making it independent from bicubic interpolation pre-processing. In 2017, Ledig et al. proposed SRGAN, which uses a perceptual loss function consisting of an adversarial loss and a content loss [17]. The adversarial loss pushes the output to the natural image manifold using a discriminator network. This network is trained to differentiate between super-resolved images and the original photo-realistic images. Additionally, a content loss was used which enforces perceptual similarity instead of similarity in the pixel space. To the best of our knowledge, no literature on perceptual losses applied to LiDAR data exists.

Our main contributions are three-fold:
• we present a CNN-based up-sampling approach that synthesizes semantically and perceptually realistic point clouds with three specialized loss functions
• to the best of our knowledge, our approach is the first to employ perceptual losses for LiDAR-based applications
• besides quantitative performance evaluation on large-scale real-world data, we also analyze qualitative performance through a mean opinion score study involving 30 human subjects

III. METHOD

Fig. 2 depicts our overall system architecture in three different variants, indicated by the yellow rectangles at the bottom of the figure. The up-sampling network transforms a low-resolution LiDAR scan into a corresponding high-resolution output. This prediction is compared with the ground truth high-resolution scan in either the point-wise, perceptual, or semantic consistency loss function (yellow rectangles). An error is calculated which is then minimized by an Adam optimizer [18].

A. Up-Sampling Network

The up-sampling network is a deep residual convolutional neural network [19]. It up-samples the resolution of a LiDAR distance image to produce a high-resolution output.
The output data can be understood as the equivalent of a recording from a LiDAR sensor with twice as many layers. The design of the up-sampling network is inspired by the image transformation network by Johnson et al. [15] and uses a fractionally strided convolution for the actual up-sampling (cf. trans block in fig. 2). Performing the resolution change in-network is advantageous over alternative approaches where the re-scaling is implemented in a bicubic interpolation step prior to the actual network [4], as it enables the re-scaling parameters to be learned. In contrast to the architecture by Johnson et al., our network consists of 16 residual blocks [19] and does not need a normalizing tanh-activation at the output layer. Furthermore, the kernel of the fractionally strided convolution has a size of (4, 1). Following [15], the convolutional layers within the residual blocks are followed by spatial batch normalization and a ReLU nonlinearity. The first and last layers use 9 × 9 kernels while all remaining convolutions have kernel sizes of 3 × 3. The input to the network is a LiDAR scan with L/2 layers, represented by a two-dimensional projection of shape L/2 × W. With up-sampling factors (f_i, f_j) = (2, 1), the output is a high-resolution distance image of shape L × W. Since the network is fully convolutional, it can be applied to inputs of any resolution.

B. Cylindrical LiDAR Projection

A LiDAR scanner determines the distance to surrounding objects by measuring the time of flight of emitted laser pulses. The kind of scanner used in this work consists of L vertically stacked send-and-receive modules which revolve around a common vertical axis. While rotating, each module periodically measures the distance r_ij at its current orientation, which can be described by an elevation angle θ_i and an azimuth angle φ_j. The indices i = 1...L and j = 1...W represent the possible discrete orientations.
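As a small illustration, each tuple (θ_i, φ_j, r_ij) corresponds to one 3D point. The following numpy sketch assumes the common convention that the elevation θ is measured from the horizontal plane and the azimuth φ from the forward axis; the exact convention of the sensor is not stated here:

```python
import numpy as np

def to_cartesian(r, theta, phi):
    """Map a range measurement r at elevation theta and azimuth phi
    (both in radians) to Cartesian coordinates (x, y, z)."""
    x = r * np.cos(theta) * np.cos(phi)
    y = r * np.cos(theta) * np.sin(phi)
    z = r * np.sin(theta)
    return x, y, z

# A point 10 m away, level and straight ahead, lands on the x-axis.
x, y, z = to_cartesian(10.0, 0.0, 0.0)
```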
Points of a full 360° rotation are referred to as a frame or scan.

There are multiple ways in which a LiDAR sensor can fail to provide a point distance measurement. First, the maximum distance is limited due to beam divergence and atmospheric absorption. Second, outgoing laser pulses might hit specular reflective surfaces and never return to the sensor. Third, the laser might not be pointed towards an object at all (but towards the sky). To account for these missing measurements, we first define the set of all valid measurements as

V = { (i, j) | reflection at (θ_i, φ_j) received } .  (1)

A two-dimensional LiDAR distance image d_ij can then be constructed by setting

d_ij = r_ij / m   if (i, j) ∈ V
d_ij = d*         otherwise  (2)

where we represent all measured ranges in units of meters [m] and define a proxy value d* for the missing measurements. The latter is necessary to provide a dense image structure for the convolutional network. Experiments show no significant difference in choosing this value, so we set d* = 0 for simplicity and handle it appropriately within our loss function design. The missing measurements are one of the essential differences between RGB images and LiDAR distance images. This prevents us from directly applying methods designed for RGB image up-sampling to LiDAR distance images.

Note that the distance image representation is a cylindrical projection without any loss of information, as there are no mutual point occlusions. Since all orientation angles are known, the image can always be transformed back to a 3D point cloud { (x_ij, y_ij, z_ij) | (i, j) ∈ V } with a spherical-to-Cartesian mapping.

C. Modified Point-wise Loss

In a supervised setting, up-sampling is a classic regression problem where a loss function L(d_pred, d_gt) compares the generated high-resolution distance image d_pred = {d_pred_ij} with its corresponding ground truth counterpart d_gt = {d_gt_ij}.
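The distance-image construction of eqs. (1) and (2) amounts to filling a dense image with measured ranges and the proxy value elsewhere; a minimal numpy sketch, where the boolean mask plays the role of the valid set V:

```python
import numpy as np

D_MISSING = 0.0  # proxy value d* for missing measurements, as in the text

def build_distance_image(ranges_m, valid):
    """Dense L x W distance image of eq. (2): the measured range in
    meters where a reflection was received ((i, j) in V), d* elsewhere."""
    return np.where(valid, ranges_m, D_MISSING)

ranges = np.array([[5.0, 80.0], [12.5, 3.0]])    # raw module readings [m]
valid = np.array([[True, False], [True, True]])  # the valid set V of eq. (1)
d = build_distance_image(ranges, valid)          # [[5.0, 0.0], [12.5, 3.0]]
```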
The most commonly used error functions for this application are the L1 and L2 loss functions. In the case of LiDAR distance images, we modify these loss functions to mask the missing measurements which have been replaced by d*. We therefore define the modified point-wise loss functions

L^α_dist = (1/|V|) Σ_{(i,j) ∈ V} |d^gt_ij − d^pred_ij|^α ,  α = 1, 2  (3)

where α = 1 describes the mean absolute error and α = 2 describes the mean squared error. Refer to the leftmost loss block in fig. 2.

D. Perceptual Loss

The previously introduced point-wise loss encourages the network to predict high-resolution LiDAR scans where each point is close to its ground truth counterpart in a purely spatial sense. A perfect match would be ideal in theory, but this approach can fail to output realistic point clouds in practice. To see this, note that a perfectly realistic point cloud constructed from a slightly rotated ground truth point cloud would lead to high loss values. Similarly, while an actual scan of a treetop looks like a seemingly random collection of points, an L^α-guided optimization will tend to produce smooth surfaces to decrease the overall distance error. We make use of a perceptual loss function to circumvent this problem.

In order to address the shortcomings of the per-pixel losses and to allow the loss function to measure semantic and perceptual differences between LiDAR scans, the perceptual loss function utilizes a deep convolutional network itself. This network is pre-trained for point-wise semantic segmentation in LiDAR scans [12], and can therefore be used as a feature extractor which encodes semantic information. This feature extractor φ is used to compare the scans on a more abstract level (see middle loss block in fig. 2).
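A numpy sketch of the masked point-wise loss of eq. (3), averaging only over the valid set V (deriving the mask from d* = 0 is a simplification for illustration; an exact range of 0 m does not occur in practice):

```python
import numpy as np

def masked_pointwise_loss(d_gt, d_pred, valid, alpha=1):
    """Modified point-wise loss of eq. (3): mean absolute error for
    alpha=1, mean squared error for alpha=2, restricted to valid pixels."""
    diff = np.abs(d_gt[valid] - d_pred[valid])
    return (diff ** alpha).mean()

d_gt = np.array([[1.0, 2.0], [3.0, 0.0]])   # 0.0 marks a missing return (d*)
d_pred = np.array([[1.5, 2.0], [3.0, 7.0]])
valid = d_gt != 0.0                         # exclude missing measurements
l1 = masked_pointwise_loss(d_gt, d_pred, valid, alpha=1)  # mean over 3 pixels
l2 = masked_pointwise_loss(d_gt, d_pred, valid, alpha=2)
```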
To achieve semantic and perceptual similarity with the ground truth scan, the perceptual loss function propagates both scans through the feature extractor φ and computes an L1 error on the resulting high-dimensional feature maps:

L_feat = Σ_{c,i,j} |φ(d^gt)_cij − φ(d^pred)_cij|  (4)

Here, c iterates over the different channels of the feature map. Note that the weights in the feature extractor φ stay fixed during the training. The feature maps can be extracted at various points within the network. In contrast to our definition of the point-wise loss function, it is not necessary to exclude missing measurements from the loss calculation, as the semantic feature extractor uses context information to correctly label missing input points.

TABLE I: Overview of the dataset split

                  Training   Validation   Test
Semantics [12]    344,027    73,487       137,682
KITTI Raw [21]    28,548     5,982        11,499

E. Semantic Consistency Loss

In addition to the point-wise and the perceptual losses, we propose a semantic consistency loss function that is designed to maintain the semantic content of the LiDAR scan during the up-sampling process. It uses a pre-trained semantic segmentation network (same as in III-D) to compare the two scans in a cross-entropy fashion (see rightmost loss block in fig. 2). To that end, it propagates the high-resolution prediction d_pred through the weight-fixed network to compute logits for the 13 semantic classes (road, person, car, building, ...). In L_cross-entropy, the result is compared with the one-hot encoded ground truth annotations from the Semantics dataset. Working with the cross-entropy loss in isolation is not enough, since the spatial structure of the predicted point cloud would then be completely unconstrained. To account for this, we additionally compute a point-wise L^1_dist loss (see eq.
3) and combine the two loss functions in the following multi-task semantic consistency (SC) loss function:

L_SC = (1/(2σ_r)) L^1_dist + log σ_r + (1/σ_c) L_cross-entropy + log σ_c .  (5)

Here, σ_r and σ_c are trainable variables that balance the relative weights of the two tasks, cf. Kendall et al. [20].

IV. EXPERIMENTS

In the following, we introduce the experimental setup of our performance evaluation (section IV-A) as well as a discussion of quantitative (section IV-B) and qualitative (section IV-C) results.

A. Experimental Setup

1) Training Data: To train our networks we use two large-scale LiDAR datasets. The first one is the dataset introduced by Piewak et al. [12], which we refer to as the "Semantics" dataset. Second, the raw dataset of the public KITTI benchmark is used ("KITTI Raw"). Details are given in Table I. The dataset split into training (0.62), validation (0.13), and test (0.25) has been performed on a sequence basis in order to prevent correlations between subsets.

Both datasets contain a variety of different scenes captured in urban, rural, and highway traffic. However, there are important differences between the two. Most significantly, the Semantics dataset was recorded with a Velodyne VLP32, whereas KITTI used a Velodyne HDL64 sensor. The number in the names corresponds to the layer count (number of rows) in the LiDAR scan. For the HDL64, these layers have an equidistant spacing, whereas the VLP32 has a higher layer density in the middle. The VLP32 has a higher range, whereas the HDL64 is limited to distances lower than 80 meters. The normalized distance distributions of both datasets are shown as shaded areas in fig. 3.

For the application of point cloud up-sampling it is straightforward to generate the training input and the corresponding ground truth.
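For illustration, one evaluation of the uncertainty-weighted combination in eq. (5), with hypothetical loss values and balancing variables (during training, σ_r and σ_c are updated by the optimizer together with the network weights):

```python
import numpy as np

def semantic_consistency_loss(l_dist, l_ce, sigma_r, sigma_c):
    """Multi-task SC loss of eq. (5): uncertainty-weighted sum of the
    point-wise L1 term and the cross-entropy term (cf. Kendall et al.)."""
    return (l_dist / (2.0 * sigma_r) + np.log(sigma_r)
            + l_ce / sigma_c + np.log(sigma_c))

# Hypothetical per-step loss values and balancing variables.
l_sc = semantic_consistency_loss(l_dist=0.7, l_ce=1.2, sigma_r=1.0, sigma_c=2.0)
```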
The presented datasets serve as our high-resolution target data with a shape of 32 × 1800 for Semantics and 64 × 1565 for KITTI. In order to obtain the low-resolution frames of size L/2 × W, every other layer is simply removed from the input scans. This procedure is different from most image up-sampling applications, where a bicubic down-sampling is used to generate the low-resolution data. For LiDAR point clouds, this method generates unrealistic results due to the large vertical spacing between the layers.

Fig. 3: Distance-dependent error: The plot shows the mean absolute error of the L^1_dist network as a function of the (ground truth) distance on the Semantics and KITTI Raw datasets. The shaded areas depict the normalized distance distribution over each of the datasets.

2) Evaluation Metrics: We employ four different evaluation metrics for the performance assessment. The first three contribute to our quantitative results, whereas the fourth metric is based on human opinion and is used for the qualitative assessment in section IV-C. Since the point-wise losses L^1_dist and L^2_dist minimize the mean absolute error (MAE) and the mean squared error (MSE), respectively, we also use these two error functions as evaluation metrics by computing the corresponding average distance error on the whole test set. Note that the numbers obtained for the L^1_dist error can readily be interpreted as the average point deviation in meters. Upon convergence we select the training state with the lowest errors on the validation set and report the performance metrics on the test set of our dataset in table II.

The third quantitative metric is constituted by a semantic segmentation network. The pre-trained network is applied to the generated point clouds, allowing us to compute the mean intersection over union (mIoU) with respect to the ground truth annotations.
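The mIoU computation over per-point labels can be sketched as follows (the labels are hypothetical; classes absent from both prediction and ground truth are skipped):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across all classes that occur in
    either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                 # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])         # hypothetical per-point class labels
target = np.array([0, 1, 1, 1])
miou = mean_iou(pred, target, num_classes=2)   # (1/2 + 2/3) / 2
```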
This mIoU score can then be compared to the score which is obtained when using the original ground-truth high-resolution input (56.8%, see table II). This gives valuable information about the semantic information contained in the generated scans. We assume that semantics are important high-level features of real-world LiDAR data and thus an indicator for realistic LiDAR scans. The network used for this purpose has the same configuration as the feature extractor in the perceptual loss function, but uses filter counts of n_b = {32, 64, 96, 96, 64}, respectively.

TABLE II: Test set results for different metrics

                     Semantics Dataset             KITTI Raw
Networks             MSE [m]  MAE [m]  mIoU [%]    MSE [m]  MAE [m]
Ground truth         0.0      0.00     56.8        0.0      0.00
Bilinear             88.2     2.29     34.1        11.6     0.81
Bicubic              97.2     2.59     28.7        13.7     0.95
Nearest neighbor     147.5    2.83     28.3        19.6     0.95
L^1_dist             20.9     0.68     34.6        2.23     0.21
L^2_dist             17.6     0.86     12.5        1.95     0.28
L_feat,0             74.1     1.33     41.2        -        -
L_feat,1             110.4    3.05     45.0        -        -
L_feat,2             112.1    2.45     49.4        -        -
L_feat,3             74.1     1.49     49.1        -        -
L_SC                 18.1     0.86     47.4        -        -

Furthermore, we conducted a mean opinion score survey in which human subjects evaluated the visual quality of the generated point clouds. The results of the study, which was conducted on the Semantics dataset, are visualized in fig. 5.

3) Baseline and Methods: As a baseline, we evaluated three traditional interpolation techniques: bilinear, bicubic, and nearest neighbor, see table II.

The overall architecture combined with the leftmost loss block of fig. 2 illustrates the setup of our training with point-wise losses. The results of the two experiments are shown as L^1_dist and L^2_dist in table II. The perceptual loss network has been investigated in four different variants which differ in the exact location where the feature map has been extracted. They are illustrated by the L_feat,b blocks (b = 0...3) in the middle loss block of fig. 2. The same identifier is used in table II.
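The bilinear baseline mentioned above reduces, for a pure vertical doubling, to inserting interpolated rows between neighboring layers. A minimal sketch under that assumption (it ignores missing-measurement handling, which the traditional baselines do not model, and the exact row alignment of a real resampler may differ):

```python
import numpy as np

def bilinear_upsample_rows(img):
    """Double the vertical resolution of a distance image by inserting
    the average of each pair of neighboring rows."""
    n_rows, width = img.shape
    out = np.empty((2 * n_rows, width), dtype=img.dtype)
    out[0::2] = img                       # keep the original rows
    out[1:-1:2] = (img[:-1] + img[1:]) / 2.0   # interpolate between rows
    out[-1] = img[-1]                     # replicate the last row
    return out

low = np.array([[0.0, 2.0], [4.0, 6.0]])  # toy L/2 x W distance image
up = bilinear_upsample_rows(low)          # shape (4, 2)
```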
We encountered major up-sampling performance degradation when using large numbers of filters in the feature extractor. This can be attributed to the storage of irrelevant information in the superfluous channels of the feature map. We have therefore drastically reduced the number of filters in the feature extractor to n_b = {32, 64, 96, 96, 64}, as compared to the original architecture by Piewak et al. [12]. This change slightly reduces the performance of the semantic segmentation by 1.5 percentage points in mIoU. The remaining difference between the original mIoU score of 60.2% and the network used in this work (56.8%) is due to the fact that our network is trained on distance images alone and does not use the additional reflectivity channel. For all perceptual loss up-sampling network trainings, we use the L^1_dist network weights for initialization, which speeds up the training significantly.

Finally, table II shows the results of the network trained with the semantic consistency loss L_SC, which uses the same network architecture as the feature extractor of the perceptual trainings in order to predict the logits for 13 semantic classes, a subset of the Cityscapes label set [22]. Neither the perceptual loss networks nor the semantic-consistency guided model can be evaluated on the KITTI dataset, due to the absence of ground truth semantic annotations for the LiDAR scans.

Fig. 4: Examples of the different methods: (a) Ground Truth, (b) Low-resolution Input, (c) Bilinear Interpolation, (d) L^1_dist Network, (e) L^2_dist Network, (f) L_feat,1 Network, (g) L_feat,2 Network, (h) Semantic consistency L_SC. Synthesize (c)-(h) from (b) and compare to (a). Reconstruction quality mainly differs in high-frequency perturbations at object boundaries, especially for the L^2_dist network, and in the overall noise level, e.g. for bilinear interpolation. The red rectangle enlarges the van visible in the scene.

B.
Quantitative Results

In all experiments, we encountered a significantly better performance on the KITTI dataset compared to the Semantics dataset. We attribute this to the following reasons. First, the KITTI dataset was recorded with a LiDAR sensor that has twice as many layers as the one used for the Semantics dataset. This makes the interpolation easier, as neighboring points have a higher probability to lie on the same object. Second, the HDL64 sensor used in KITTI has a smaller range. As the error generally increases with distance (see fig. 3), the KITTI dataset is less challenging in this respect. Last, the VLP32 sensor does not have an equidistant layer spacing, rendering the interpolation task on the Semantics dataset more challenging.

Considering the traditional methods first, we notice that the bilinear interpolation performs better than the bicubic interpolation. This is in contrast to results on RGB images, where up-sampling with bicubic interpolation typically achieves better results. The fact that this does not seem to hold for LiDAR distance images can be attributed to the low vertical resolution of the input scan, which makes next-to-nearest neighbors unlikely to contribute any useful information. As the bilinear interpolation achieves the lowest errors, it is considered as our baseline in the following. Unsurprisingly, it works very well for smooth surfaces but fails to properly reconstruct sharp edges.

The L^1_dist and L^2_dist-guided convolutional networks are designed to address these shortcomings. Both outperform the baseline and achieve far lower prediction errors. Each network obtained the lowest overall error on the metric which it is designed to minimize. Considering the mIoU, we can clearly observe a very low performance on the L^2_dist generated point clouds. A look at the respective label predictions reveals that a majority of the points have been labeled as vegetation or terrain.
As we will later see in the qualitative results, the L^2_dist generated point clouds exhibit high-frequency perturbations, just like actual samples from the vegetation and terrain classes. The reason why this effect is so prominent for the L^2_dist trained network lies in the mathematical definition of the mean squared error function. The L^2_dist error penalizes larger errors more than the L^1_dist loss. As the interpolation error generally increases with distance (see fig. 3), the optimizer tries to minimize those errors at the expense of accuracy in the near field.

In addition to those two different point-wise loss networks, we also investigated how performance changes when additionally feeding the reflectivity channel to the network. In contrast to a semantic segmentation network, the up-sampling model did not show any performance gain. Further, we evaluated sparse convolutions by Uhrig et al. [6] (with varying pooling sizes), but did not achieve competitive results on the given metrics, which can probably be explained by the fact that our input data is rather dense and only has a small fraction of missing points.

Since the perceptual networks are not specialized in minimizing the point error between prediction and target, their errors are higher than the ones for the point-wise loss networks. However, we see a significant gain in mIoU. The mIoU score can be improved by extracting the feature map at later stages in the network, with the highest mIoU of 49.4% being obtained for L_feat,2. The increased semantic segmentation performance leads us to the assumption that the perceptually trained networks produce perceptually more realistic point clouds than all previous methods. This proposition will be further discussed in the qualitative assessment of the next section.

Fig. 5: Mean opinion score survey: Color-coded distribution of mean opinion scores for ten randomly selected frames from the validation dataset. 300 votes (10 images × 30 subjects) were assessed for each method. The circular marker shows the mean opinion score (MOS).

From the given experiments, we can also see that there is no method that simultaneously minimizes all three metrics. The preferred choice of method is therefore highly dependent on the application context.

C. Qualitative Results

To gain further insight, we conducted a survey among humans to obtain a mean opinion score (MOS) for our proposed networks. In the survey we asked LiDAR experts to evaluate the visual quality of the generated point clouds. In contrast to similar surveys on RGB images (showing everyday scenes) where random people have been selected [17], the people participating in our survey were required to be familiar with LiDAR data. This restriction was necessary because we found that laypersons could not reliably judge the point cloud quality due to a lack of domain knowledge. The provided MOS results are illustrated in fig. 5; an example scene for a selection of methods is shown in fig. 4.

Specifically, we asked 30 subjects to assign a score from one (bad quality) to five (excellent quality) to the generated high-resolution LiDAR scans. To this end, we rendered a view of the 3D point cloud, similar to the images in fig. 4. The experts rated nine versions of each image: bilinear interpolation, networks trained with the point-wise losses L^α_dist (α = 1, 2), four versions of the perceptual loss L_feat,b (b = 0, 1, 2, 3), the semantic consistency loss network, and the original high-resolution LiDAR scan (ground truth). Each subject thus rated 90 instances (nine versions of ten scenes) that were presented in a randomized fashion.
Prior to the testing, subjects were given examples for category five (ground truth) and category one (nearest neighbor interpolation, random interpolation).

Fig. 6: Example scene KITTI: (a) An original sensor recording of an HDL64 from the KITTI Raw dataset with 64 layers. (b) The same scene, up-sampled to 128 layers with the L^1_dist network. Since the network is fully convolutional, it is able to up-sample from 64 to 128 layers, even though it was trained on 32 layers.

Fig. 5 shows that the perceptually trained networks achieve higher mean opinion scores than all other networks. The semantic-consistency network did not perform well, in contrast to what the quantitative results showed. A look at the example scene in fig. 4h shows that the generated point cloud is rather noisy. Only the samples generated by the L^2_dist network, fig. 4e, exhibit even stronger high-frequency perturbations, which is reflected in low MOS and mIoU scores. The L^1_dist-trained network achieved far better results than the bilinear interpolation or the semantic-consistency network, but its output still cannot pass as a real sensor recording.

On average, our subjects assigned the highest ratings to the ground truth. The rather small difference between ground truth and our L_feat,1 approach indicates that it was sometimes difficult to distinguish between ground truth and synthesized data. This is exactly what we wanted to achieve with the proposed approach: synthesizing data that is almost indistinguishable from ground truth by minimizing the error not on a point level, but on a perceptual level. We assume that our synthesis performance scales with the initial resolution of the input data. Synthesizing 128-layer LiDAR data from 64-layer (KITTI) data as input, for example, would generate even more convincing results, see fig. 6.
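The quadratic weighting of far-field residuals under L^2_dist, discussed in the quantitative results, can be illustrated numerically. This is a toy sketch with made-up residual values, unrelated to the paper's measurements: a single large far-field residual dominates the mean squared error far more than the mean absolute error, so an L2-minimizing optimizer trades near-field accuracy against it.

```python
import numpy as np

# Toy per-point interpolation residuals (meters). The values are hypothetical
# and chosen so the last (far-field) residual is large, since interpolation
# error generally grows with distance.
residuals = np.array([0.1, 0.2, 0.1, 5.0])

l1 = np.abs(residuals).mean()   # L1 loss: residuals weighted linearly
l2 = (residuals ** 2).mean()    # L2 loss (MSE): residuals weighted quadratically

# Fraction of the total loss contributed by the single far-field point:
l1_share = np.abs(residuals)[-1] / np.abs(residuals).sum()
l2_share = (residuals ** 2)[-1] / (residuals ** 2).sum()
print(l1_share, l2_share)
```

Under L1 the outlier accounts for roughly 93% of the loss, under L2 for more than 99%, which matches the observed behavior: the L^2_dist network spends its capacity on distant points and produces noisier near-field geometry.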
Given the lack of ground truth data, this cannot be quantitatively evaluated today and is left for future work.

V. CONCLUSIONS

This paper presented a novel approach for synthesizing high-resolution LiDAR scans with high semantic and perceptual realism, involving different variants of CNNs. In extensive experiments we demonstrated that all of our system variants outperform several baseline approaches. From a quantitative perspective, the choice of the best performing model variant is highly application-specific, as different models excel in different evaluation metrics with respect to geometric and semantic accuracy. In our qualitative performance assessment, human subjects favored the model variants involving a perceptual loss, with visual realism as the performance criterion. Designing a single method that optimizes all our performance metrics at the same time is left for future work.

ACKNOWLEDGMENT

The authors thank Rainer Ott (University of Stuttgart) for valuable discussions and feedback. They also thank all participants of the mean opinion score survey.

REFERENCES

[1] G. Riegler, A. O. Ulusoy, and A. Geiger, "OctNet: Learning deep 3D representations at high resolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] Velodyne. Velodyne LiDAR. [Online]. Available: https://velodynelidar.com/
[3] C.-Y. Yang, C. Ma, and M.-H. Yang, "Single-image super-resolution: A benchmark," in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer International Publishing, 2014, pp. 372–386.
[4] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, Feb. 2016.
[5] A. Gavade and P. Sane, "Super resolution image reconstruction by using bicubic interpolation," in National Conference on Advanced Technologies in Electrical and Electronic Systems, 2014.
[6] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity invariant CNNs," in 2017 International Conference on 3D Vision (3DV), Oct. 2017, pp. 11–20.
[7] J. Dolson, J. Baek, C. Plagemann, and S. Thrun, "Upsampling range data in dynamic environments," in CVPR. IEEE Computer Society, 2010, pp. 1141–1148.
[8] M. Liu, O. Tuzel, and Y. Taguchi, "Joint geodesic upsampling of depth images," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp. 169–176.
[9] X. Song, Y. Dai, and X. Qin, "Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network," in Computer Vision – ACCV 2016, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part IV, 2016, pp. 360–376.
[10] T.-W. Hui, C. C. Loy, and X. Tang, "Depth map super-resolution by deep multi-scale guidance," in Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[11] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng, "PU-Net: Point cloud upsampling network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[12] F. Piewak, P. Pinggera, M. Schäfer, D. Peter, B. Schwarz, N. Schneider, M. Enzweiler, D. Pfeiffer, and J. M. Zöllner, "Boosting LiDAR-based semantic labeling by cross-modal training data generation," in Computer Vision – ECCV 2018 Workshops, Munich, Germany, September 8–14, 2018, Proceedings, Part VI, 2018, pp. 497–513.
[13] F. Piewak, P. Pinggera, M. Enzweiler, D. Pfeiffer, and M. Zöllner, "Improved semantic stixels via multimodal sensor fusion," in GCPR, 2018.
[14] R. Dahl, M. Norouzi, and J. Shlens, "Pixel recursive super resolution," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 5449–5458.
[15] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision, 2016.
[16] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[17] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[18] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
[20] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research (IJRR), 2013.
[22] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.