Example-based super-resolution for point-cloud video
Authors: Diogo C. Garcia, Tiago A. Fonseca, Ricardo L. de Queiroz
Universidade de Brasilia, Brasilia, Brazil
{diogo,tiago}@image.unb.br, queiroz@ieee.org

(Work partially supported by CNPq under grant 308150/2014-7. This paper was submitted to ICIP-2018 and its copyright may be transferred to IEEE.)

ABSTRACT

We propose a mixed-resolution point-cloud representation and an example-based super-resolution framework, from which several processing tools can be derived, such as compression, denoising and error concealment. By inferring the high-frequency content of low-resolution frames based on the similarities between adjacent full-resolution frames, the proposed framework achieves an average 1.18 dB gain over low-pass versions of the point cloud, for a projection-based distortion metric [1, 2].

Index Terms: Point-cloud processing, 3D immersive video, free-viewpoint video, octree, super-resolution (SR).

1. INTRODUCTION

Recent demand for AR/VR applications has accelerated interest in electronic systems to capture, process and render 3D signals such as point clouds [3, 4]. Nonetheless, there are no established standards regarding the capture, representation, compression and quality assessment of point clouds (PC). This lack of standards has attracted attention to the field and motivated a sequence of recent advances in 3D signal processing.

These signals can be captured using a set of RGBD cameras and represented as a voxelized point cloud. A voxelized PC consists of a set of points (x, y, z) constrained to lie on a regular 3D grid [5, 6]. Each point can be regarded as the address of a volumetric element, or voxel, which is either occupied or unoccupied; for each occupied position, the surface color (RGB) is recorded. Instead of using a dense volumetric signal with all possible RGB samples for each frame, the point cloud can be represented by a list of occupied voxels (the geometry) and their color attributes, from now on referred to as the point cloud.

A very efficient geometry representation can be obtained using the octree method, where the 3D space is recursively divided into fixed-size cubes, or octants, allowing for data compression, fast searching and spatial scalability [4, 6, 7]; a toy sketch of this list representation and its dyadic down-sampling is given at the end of this section. Mixed-resolution scenarios (interleaved low- and full-resolution frames) emerge naturally from this representation: a low-resolution, error-protected base layer, for instance, can be enhanced from a previously decoded full-resolution frame [8, 9].

Super-resolution has already been explored for 3D signals, with works aiming to increase the resolution of depth maps [10, 11, 12]. They usually exploit geometric properties to improve the level of detail of a depth map. The contribution of this work is to infer the high-frequency content (detail) of a point cloud by exploiting the similarities between time-adjacent frames of already voxelized point-cloud signals, as illustrated in Fig. 1. A concise signal model is derived in Section 2 and drives an example-based super-resolution framework, which is able to enhance the point-cloud level of detail, as shown in Section 3. Conclusions are presented in Section 4.

Fig. 1. The super-resolution framework (SR) outputs a super-resolved frame {V̂_T, Ĉ_T} by exploring the high-frequency-content similarities between a down-sampled point cloud {V_T^D, C_T^D} and a time-adjacent frame {V_R, C_R}.
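As a concrete illustration of the list representation and of the octree's resolution scalability, the minimal Python/NumPy sketch below stores a frame as the pair {V, C} and down-samples it by a factor D_F. The function name and the keep-first color-merge rule are our own illustrative assumptions, not part of the paper.

```python
# A frame is a pair of lists {V, C}: integer voxel coordinates on a
# 2^J regular grid, and the RGB color of each occupied voxel. Dropping
# the least-significant coordinate bits mimics the octree's spatial
# scalability: moving one level up the octree halves the resolution.
import numpy as np

def downsample(V, C, factor=2):
    """Nearest-neighbor down-sampling of a voxel list by `factor`.

    Voxels that collapse onto the same coarse cell are merged; the
    color of one representative (here, the first) is kept.
    """
    Vd = V // factor                                     # coarse-grid coordinates
    Vd, idx = np.unique(Vd, axis=0, return_index=True)   # merge duplicates
    return Vd, C[idx]

# Toy frame: 4 occupied voxels on a 512^3 (9-level) grid.
V = np.array([[10, 20, 30], [11, 20, 30], [100, 200, 300], [101, 201, 301]])
C = np.array([[255, 0, 0], [250, 5, 5], [0, 255, 0], [0, 250, 5]])
V_D, C_D = downsample(V, C, factor=2)   # D_F = 2: one octree level coarser
print(V_D)   # -> [[5 10 15], [50 100 150]]  (one voxel per coarse cell)
```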
2. PROPOSED METHOD

Traditional super-resolution (SR) techniques [8] generate high-resolution images from either multiple low-resolution images (multi-image SR) or databases of low- and high-resolution image pairs (example-based SR). The proposed method borrows ideas from the latter in order to increase the resolution and the detail of a low-resolution point cloud. Instead of relying on a database of low- and high-resolution point-cloud pairs, the proposed method extracts high-resolution information from an adjacent frame in a mixed-resolution point-cloud video sequence.

In this scenario, each point-cloud frame is composed of a pair of lists {V, C}, representing the XYZ positions of the occupied voxels and their corresponding color channels, respectively. Each down-sampled target frame {V_T^D, C_T^D} is preceded by a full-resolution reference frame {V_R, C_R}. The proposed method generates the super-resolved version of the target frame, {V̂_T, Ĉ_T}, by adding the high-resolution information from the reference frame to the down-sampled target frame.

In order to extract high-resolution information from {V_R, C_R}, the algorithm could estimate the motion of the voxels from one frame to the other. However, only a down-sampled version of the target frame is available. By down-sampling the reference frame as well, yielding {V_R^D, C_R^D}, motion can be estimated at the lower resolution of the point clouds; however, the resolution of the motion estimation is also lowered. For example, if nearest-neighbor down-sampling by a factor D_F = 2 is employed, motion estimation cannot find any 1-voxel motion at full resolution, regardless of the motion direction.

A better estimate of the point cloud's motion can be achieved by generating several down-sampled versions of the reference frame, considering incremental translations in all directions. For example, for D_F = 2, 8 down-sampled versions can be generated by considering XYZ translations of [0,0,0], [0,0,1], [0,1,0], [0,1,1], [1,0,0], [1,0,1], [1,1,0] and [1,1,1]. At full resolution, this is equivalent to dilating the reference point cloud with a 2×2×2 cube, rendering V_R^DL. Figure 2 illustrates this concept in a 2D scenario, where only XY translations are considered and dilation is performed with a 2×2 square.

Fig. 2. Down-sampling of a 2D symbol by a factor of 2, considering 1-pixel translations in the XY directions, and posterior grouping, which can be seen as a dilation of the original symbol with a 2×2 square. The circles, squares, multiplication signs and stars indicate from which down-sampled version each position in the dilated version came.

Super-resolution of the target frame requires that D_F × D_F × D_F voxels be estimated for each voxel in V_T^D. To do so, motion estimation is performed between V_T^D and V_R^DL on a per-voxel basis, using the N×N×N neighborhood around each voxel as support. Each neighborhood, n_T^D and n_R^DL, is defined as a binary string indicating which voxels are occupied and which are not. Since V_T^D and V_R^DL have different resolutions, the neighborhoods in V_R^DL are obtained by skipping D_F − 1 positions around each of its voxels, which acts as motion estimation between V_T^D and each of the down-sampled versions of V_R. Figure 3 illustrates this concept in a 2D scenario for a 3×3 neighborhood.

Fig. 3. 3×3 neighborhood around the 2D symbol down-sampled by a factor of 2, and the up-sampled version of these neighborhoods. Note that only every other position is considered in the up-sampled version, as indicated by the thicker lines around the considered pixels.

In the motion-estimation process, an L×L×L search window around each voxel is chosen by the user, creating a trade-off between speed and super-resolution accuracy. The cost function C(i, j) between the i-th voxel in V_T^D and the j-th voxel in V_R^DL is defined as

    C(i, j) = \frac{H(i, j) + w \, D(i, j)}{w + 1},    (1)

where H(i, j) is the Hamming distance between the neighborhoods n_T^D(i) and n_R^DL(j), D(i, j) is the Euclidean distance between V_T^D(i) and V_R^DL(j), and w is the inverse of the Euclidean distance between the centers of mass of V_T^D and V_R^DL. A small Euclidean distance between these centers of mass indicates small motion between the reference and target frames and increases the value of w; i.e., C(i, j) favors smaller motion vectors.
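To make the matching step concrete, the sketch below builds the dilated reference V_R^DL, extracts the binary neighborhood strings, and evaluates the cost of Eq. (1) for every candidate inside the search window, following the NumPy voxel-list convention of the earlier sketch. All function names, the set-based occupancy lookup, and the scale handling in `best_match` are our illustrative assumptions rather than the authors' implementation.

```python
import itertools
import numpy as np

def dilate_reference(V_R, D_F=2):
    """V_R^DL: V_R dilated by a D_F x D_F x D_F cube, at full resolution.

    Equivalent to grouping all D_F^3 shifted down-sampled versions of
    the reference frame, as described above.
    """
    shifts = itertools.product(range(D_F), repeat=3)
    return {tuple(v + s) for s in map(np.array, shifts) for v in V_R}

def neighborhood(occupied, center, N=3, step=1):
    """Binary occupancy string of the N^3 neighborhood around `center`.

    step = D_F skips D_F - 1 positions, aligning neighborhoods taken on
    the full-resolution V_R^DL with the coarse grid of the target frame.
    """
    half = N // 2
    offsets = itertools.product(range(-half, half + 1), repeat=3)
    return tuple(tuple(center + step * np.array(o)) in occupied
                 for o in offsets)

def weight(V_t, V_r):
    """w in Eq. (1): inverse distance between the two centers of mass.

    Computed once per frame pair and reused for every voxel.
    """
    d = np.linalg.norm(V_t.mean(axis=0) - V_r.mean(axis=0))
    return 1.0 / max(d, 1e-9)            # guard against division by zero

def cost(n_t, n_r, v_t, v_r, w):
    """Eq. (1): C(i, j) = (H(i, j) + w * D(i, j)) / (w + 1)."""
    H = sum(a != b for a, b in zip(n_t, n_r))        # Hamming distance
    D = float(np.linalg.norm(np.asarray(v_t) - np.asarray(v_r)))
    return (H + w * D) / (w + 1.0)

def best_match(v_t, occ_t, occ_r_dl, w, L=5, N=3, D_F=2):
    """Lowest-cost voxel of V_R^DL in the L^3 window around a target voxel.

    `v_t` lies on the coarse grid of V_T^D; `occ_t` is the set of coarse
    V_T^D voxels and `occ_r_dl` the set of full-resolution V_R^DL voxels.
    We take the window and D(i, j) at full resolution by scaling v_t by
    D_F (an assumption; the paper does not spell out this detail).
    """
    n_t = neighborhood(occ_t, np.asarray(v_t), N, step=1)
    v_full = D_F * np.asarray(v_t)
    half = L // 2
    window = itertools.product(range(-half, half + 1), repeat=3)
    cands = [tuple(v_full + np.array(d)) for d in window]
    cands = [c for c in cands if c in occ_r_dl]
    if not cands:                        # empty window: no match found
        return None
    return min(cands, key=lambda c: cost(
        n_t, neighborhood(occ_r_dl, np.asarray(c), N, step=D_F),
        v_full, c, w))
```

The set-based lookup keeps the sketch short; a production implementation would use an octree or k-d tree for the occupancy queries.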
2.1. Performance Metrics

This work uses two measures to evaluate the achieved signal enhancement: a geometric quality metric [13], referred to as GPSNR, and a distortion metric referred to as the projection peak signal-to-noise ratio (PPSNR) [14, 1]. GPSNR uses point-to-plane distances between point clouds to measure their geometric fidelity. The following steps are performed to evaluate the PPSNR (a rough code sketch follows the list):

1. Each target frame is represented in three versions: the original {V_T, C_T}, low-pass {V_T^DL, C_T^DL} and super-resolved {V̂_T, Ĉ_T} point clouds. For each version, project the 3D signal onto the faces of its surrounding cube, π_i({V, C}) = f(x, y), to obtain 6 2D signals, Π({V, C}) = ∪_{i=1}^{6} π_i({V, C}). Each projection π_i({V, C}) outputs a 512×512 YUV [15] color image; these represent the scene views under an orthographic projection onto each cube face.

2. For each cube face, evaluate the luma PSNR between the projections of the original π_i({V_T, C_T}) and the super-resolved π_i({V̂_T, Ĉ_T}) signals, and between the projections of the original and the up-sampled π_i({V_T^DL, C_T^DL}) signals. This provides 6 PSNR values for each signal pair.

3. Average the 6 PSNRs to get two quality assessments: one evaluating the super-resolved version and another evaluating the low-pass version. Then subtract the second value from the first to get the SR quality enhancement.

4. To get a PSNR value that represents the SR-derived improvement, take the average of the PSNR differences along the frames of the sequence.
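A rough sketch of the PPSNR procedure above, under simplifying assumptions of ours: only the luma channel is considered, images are 512×512, and occlusions are resolved with a painter's rule; the function names are illustrative.

```python
import numpy as np

def project_faces(V, Y, S=512):
    """Six orthographic luma projections of {V, Y} onto the cube faces."""
    faces = []
    for axis in range(3):                  # project along x, y and z
        u, v = [a for a in range(3) if a != axis]
        for sign in (+1, -1):              # the two opposite faces
            img = np.zeros((S, S))
            # Painter's rule: draw from hidden to visible, so the voxel
            # nearest to this face is written last and wins.
            for k in np.argsort(sign * V[:, axis])[::-1]:
                img[V[k, u], V[k, v]] = Y[k]
            faces.append(img)
    return faces

def psnr(a, b, peak=255.0):
    """Luma PSNR between two projection images."""
    mse = np.mean((a - b) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def ppsnr(V_orig, Y_orig, V_test, Y_test):
    """Steps 2-3: per-face luma PSNRs against the original, averaged."""
    pairs = zip(project_faces(V_orig, Y_orig),
                project_faces(V_test, Y_test))
    return float(np.mean([psnr(a, b) for a, b in pairs]))

# Step 4 (sketch): average the per-frame PPSNR differences between the
# super-resolved and the low-pass versions over the whole sequence.
```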
3. EXPERIMENTAL RESULTS

Tests of the proposed method were carried out with seven point-cloud sequences: Andrew, David, Loot, Man, Phil, Ricardo and Sarah [5, 16, 6]. The test set comprises five upper-body scenes of subjects captured at 30 fps and recorded at a spatial resolution of 512×512×512 voxels, i.e. a 9-level octree resolution; Man and Loot are full-body scenes recorded at 9- and 10-level resolution, respectively. The average PPSNR was calculated according to Sec. 2.1. Table 1 summarizes the PPSNR performance of the proposed SR method for the test set. The comparison metric is the difference between the average PPSNR of the SR method and that of the low-pass version of each sequence. Table 2 shows the average PPSNR and GPSNR performance gains.*

For all sequences, our method outperforms the low-pass version in inferring the high-frequency content. The Man sequence benefited the most from the inferred high frequencies, while Phil achieved the most modest enhancement. The observed trend is that the more complex** the geometry, the greater the potential to infer high-frequency content.

Figures 4(a) and (b) present the PPSNR on a frame-by-frame basis for the low-pass point-cloud version and for the SR version of sequences Man and Phil, respectively. Figure 4(a) shows the best high-frequency inference observed in the test set, owing to a relative lack of motion in the first 30 frames of Man; conversely, an abrupt scene change around the 150th frame penalizes the quality of both versions (low-pass and SR). Despite the more challenging scenario in Fig. 4(b), the SR framework still yields, on average, a better enhancement than the low-pass signal.

Fig. 4. PPSNR on a frame basis for SR and for the up-sampled (low-pass) version. Sequences: (a) Man; (b) Phil.

Figures 5(a)-(c) allow for a subjective evaluation of some point-cloud projections for sequences Man and Phil.

Fig. 5. Point-cloud projections for sequences (a)-(b) Man, frames 23 and 93, and (c) Phil, frame 175. For each image, from left to right, the columns correspond to the projections of: the original signal, the super-resolved signal, the residue of the super-resolved signal, the low-pass signal and the residue of the low-pass signal.

Figure 5(a) shows the best SR performance in the test set, with an average 16.9 dB PPSNR gain over the low-pass version, mainly due to the low movement of the test subject. The worst performance can be seen in Figs. 5(b) and (c), with average PPSNR losses of 4.12 and 1.84 dB relative to their low-pass versions, respectively.

Table 1. SR performance results. PPSNR-SR and PPSNR-LP stand for the average projected PSNR of the SR signal and of the low-pass version, respectively. All values are in dB.

Sequence   PPSNR-SR   PPSNR-LP
Andrew       30.83      30.17
David        30.91      29.90
Loot         41.61      39.73
Man          33.52      31.59
Phil         30.15      29.89
Ricardo      33.60      32.36
Sarah        31.90      30.72

Table 2. SR performance improvements. PPSNR Gain and GPSNR Gain stand for the average gains in projected PSNR and in the geometric quality metric [13, 14], respectively. All values are in dB.

Sequence   PPSNR Gain   GPSNR Gain
Andrew        0.76         4.99
David         1.01         4.25
Loot          1.84         5.40
Man           1.93          ∞
Phil          0.27         4.61
Ricardo       1.24         5.16
Sarah         1.18         4.64
Average       1.18         4.84

* The average gains for Loot consider only its first 50 frames. For some frames of Man, the SR inserted geometric artifacts orthogonal to the PC normals, perfectly recovering the geometry in a point-to-plane sense [13]; this resulted in an infinite GPSNR.
** In terms of bits per voxel to encode the full octree [17].

4. CONCLUSIONS

In this paper, an example-based super-resolution framework to infer the high-frequency content of a voxelized point cloud was presented. Based on an already efficient point-cloud representation [7], we benefited from its inherent resolution scalability to explore similarities between point-cloud frames of the test sequences.
Experiments carried out with seven point-cloud sequences show that the proposed method successfully infers the high-frequency content of all test sequences, yielding an average improvement of 1.18 dB over their low-pass versions. These results can benefit a point-cloud coding framework aimed at efficient transmission, error concealment or even storage.

5. REFERENCES

[1] R. L. de Queiroz, E. Torlig, and T. A. Fonseca, "Objective metrics and subjective tests for quality evaluation of point clouds," ISO/IEC JTC1/SC29/WG1 input document M78030, January 2018.

[2] R. L. de Queiroz and P. A. Chou, "Compression of 3D point clouds using a region-adaptive hierarchical transform," IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3947-3956, August 2016.

[3] S. Orts-Escolano et al., "Holoportation: virtual 3D teleportation in real-time," in Proceedings of the 29th Annual Symposium on User Interface Software and Technology, 2016, UIST '16, pp. 741-754.

[4] A. Collet et al., "High-quality streamable free-viewpoint video," ACM Transactions on Graphics, vol. 34, no. 4, pp. 69:1-69:13, July 2015.

[5] C. Loop, Q. Cai, S. O. Escolano, and P. A. Chou, "Microsoft voxelized upper bodies - a voxelized point cloud dataset," ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) input document m38673/M72012, May 2016.

[6] R. L. de Queiroz and P. A. Chou, "Compression of 3D point clouds using a region-adaptive hierarchical transform," IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3947-3956, August 2016.

[7] D. Meagher, "Geometric modeling using octree encoding," Computer Graphics and Image Processing, vol. 19, no. 2, pp. 129-147, June 1982.

[8] E. M. Hung, R. L. de Queiroz, F. Brandi, K. F. Oliveira, and D. Mukherjee, "Video super-resolution using codebooks derived from key frames," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 9, pp. 1321-1331, September 2012.

[9] E. M. Hung, D. C. Garcia, and R. L. de Queiroz, "Example-based enhancement of degraded video," IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1140-1144, September 2014.

[10] D. B. Mesquita, M. F. M. Campos, and E. R. Nascimento, "A methodology for obtaining super-resolution images and depth maps from RGB-D data," in Proc. Conference on Graphics, Patterns and Images, August 2015.

[11] S. A. Ganihar, S. Joshi, S. Setty, and U. Mudenagudi, "3D object super resolution using metric tensor and Christoffel symbols," in Proc. 2014 Indian Conference on Computer Vision, Graphics and Image Processing, December 2014, pp. 87:1-87:8.

[12] Y. Diskin and V. K. Asari, "Dense point-cloud creation using superresolution for a monocular 3D reconstruction system," in Proc. SPIE 8399, May 2012, vol. 8399, pp. 83990N1-83990N9.

[13] D. Tian, H. Ochimizu, C. Feng, R. Cohen, and A. Vetro, "Geometric distortion metrics for point cloud compression," in Proc. IEEE Intl. Conf. Image Processing, September 2017.

[14] R. L. de Queiroz and P. A. Chou, "Motion-compensated compression of dynamic voxelized point clouds," IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3886-3895, August 2017.

[15] I. E. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons, 2004.

[16] E. d'Eon, B. Harrison, T. Myers, and P. A. Chou, "8i voxelized full bodies, version 2 - a voxelized point cloud dataset," ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) input document m40059/M74006, January 2017.
[17] D. C. Garcia and R. L. de Queiroz, "Context-based octree coding for point-cloud video," in Proc. IEEE Intl. Conf. Image Processing, September 2017.