Lidar-based Object Classification with Explicit Occlusion Modeling
Xiaoxiang Zhang (1), Hao Fu (1), and Bin Dai (1,2,3)

(1) College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; zhangxiaoxiang17@qq.com; fuhao@nudt.edu.cn
(2) Unmanned Systems Research Center, National Innovation Institute of Defense Technology, Beijing 100071, China
(3) Correspondence: bindai.cs@gmail.com

Abstract. LIDAR is one of the most important sensors for Unmanned Ground Vehicles (UGVs). Object detection and classification based on the lidar point cloud is a key technology for UGVs. In object detection and classification, the mutual occlusion between neighboring objects is an important factor affecting accuracy. In this paper, we consider occlusion to be an intrinsic property of the point cloud data and propose a novel approach that models the occlusion explicitly. The occlusion property is then taken into account in the subsequent classification step. We perform experiments on the KITTI dataset. Experimental results indicate that by utilizing the modeled occlusion property, the classifier obtains much better performance.

Keywords: Object classification · LIDAR · UGV · Occlusion.

1 Introduction

LIDAR is one of the most popular sensors for unmanned vehicles due to its highly precise range measurements. Object detection and classification based on the lidar point cloud is an extremely important technology for unmanned vehicles. However, the sparseness of the lidar point cloud and the mutual occlusion between neighboring objects pose significant challenges for object detection and classification algorithms. Fig. 1 shows a typical traffic scene in which many occlusions can be observed.
Ideally, the lidar point cloud corresponding to an object should be relatively complete and fully reflect the spatial distribution characteristics of the object. However, due to the mutual occlusion of neighboring objects, the object point cloud is usually incomplete, which may result in the wrong classification of the object. An illustrative example is shown in Fig. 2. In the training phase, many positive samples, including samples A and B shown in the top row of Fig. 2, are fed into the classifier. Sample B is occluded by another obstacle, making its point cloud incomplete. The classifier is then trained to adapt to this intra-class variation. In the testing phase, the classifier encounters two samples, C and D. Sample D is a true positive, while C is composed of two small objects, E and F. The classifier will have difficulty distinguishing C from D, and it is very likely to classify C as a false positive or D as a false negative.

Fig. 1. In a typical traffic scenario, mutual occlusion between neighboring objects is common. The lidar point cloud of the object to be classified is often incomplete and fragmented, which can easily lead to wrong classification results.

In this paper, we consider occlusion to be an intrinsic property of the point cloud data. The occlusion area can be accurately computed by considering the relative position between the LIDAR itself and each detected LIDAR point using the ray-casting technique [1]. We therefore add a pre-processing step that attaches the occlusion property to the point cloud before any further processing. As shown in the bottom row of Fig. 2, the occluded area is colored in yellow. With the help of the occlusion area, the classifier can now easily distinguish object C from D, so both the false positive rate and the false negative rate may be reduced.
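The basic occlusion test can be illustrated with a minimal sketch. This is a simplified illustration, not the paper's implementation: the function name and the angular-tolerance comparison are our own assumptions. A point is treated as occluded when some other lidar return lies in (nearly) the same direction from the sensor but at a shorter range.

```python
import numpy as np

def is_occluded(query, obstacles, origin=np.zeros(3), angle_tol=0.02):
    """Return True if `query` is occluded by any point in `obstacles`.

    A point is occluded when an obstacle point lies in nearly the same
    direction from the sensor origin (within `angle_tol` radians) but
    at a shorter range, so the lidar ray would hit the obstacle first.
    """
    dq = query - origin
    rq = np.linalg.norm(dq)
    for p in obstacles:
        dp = p - origin
        rp = np.linalg.norm(dp)
        # Cosine of the angle between the two rays from the origin.
        cos = dp @ dq / (rp * rq)
        if rp < rq and cos > np.cos(angle_tol):
            return True
    return False
```

A point directly behind an obstacle on the same ray is reported as occluded, while a point in an unrelated direction is not.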
We test our approach on the KITTI dataset, choosing PointNet [2] as the basic classifier. We modify PointNet to enable it to utilize the occlusion property. Experimental results show that our method obtains a significant improvement over the original PointNet, in both overall classification accuracy and per-class classification accuracy.

2 Related Work

There is a large literature on object detection approaches based on point clouds. Petrovskaya et al. proposed an object detection algorithm based on object geometry and a motion model [3,4,5], and used Bayesian filters to estimate the model parameters. Himmelsbach et al. extracted geometric features of the point cloud using point feature histograms [6,7], and then used an SVM to classify the objects. Building on the work of [3,4,5], Wojke et al. [8] proposed an object detection algorithm based on the combination of line features and angular features. Cheng et al. proposed to use histogram features for object detection and recognition [9].

Fig. 2. In the training phase of the traditional approach, many positive samples, including samples A and B shown in the top row, are fed into the classifier. Sample B is occluded by another obstacle, making its point cloud incomplete. In the testing phase, the classifier encounters two samples, C and D. Sample C is likely to be classified as a false positive, while sample D is likely to be classified as a false negative. In our approach, we add a pre-processing step that computes the occlusion property of the point cloud before the classification module. As shown in the bottom row, the occluded area is colored in yellow. With the help of the occlusion area, the classifier can now easily distinguish object C from D, so both the false positive rate and the false negative rate may be reduced.
Recently, deep learning based approaches have become popular due to their outstanding performance. MV3D [10] first projects the point cloud onto the bird's eye view and then trains a region proposal network (RPN) to generate 3D bounding box proposals. However, MV3D does not perform well in detecting small objects such as pedestrians and cyclists. VoxelNet [11] is an end-to-end object detection framework. It divides the point cloud into equally spaced three-dimensional voxels and transforms the points in each voxel into a uniform feature representation through the newly introduced Voxel Feature Encoding (VFE) layer. The point cloud is then encoded as a volumetric representation on which detection and classification are performed. Different from these approaches, which rely on a mid-level representation such as image grids or 3D voxels, Qi et al. proposed a new type of network called PointNet [2] that works directly on the original point cloud. PointNet is a unified framework that can be applied to object classification, part segmentation, and scene semantic parsing, and it obtains competitive results on several 3D object classification benchmarks.

For occlusion handling, several works [12,13,14,15] try to directly predict the occlusion mask. However, most of these are image-based approaches. There has been little work on lidar-based approaches that directly model occlusion and utilize the occlusion property to aid classification tasks.

3 The Proposed Approach

3.1 Point Cloud Definition

A point cloud is represented as a set of three-dimensional points {P_i | i = 1, ..., n}, where each point P_i is a vector of (x, y, z). We define the point cloud within the object bounding box as the object point cloud P_raw = {P_raw^1, P_raw^2, ..., P_raw^n}.
The point cloud outside the object bounding box is defined as the obstacle point cloud P_ob = {P_ob^1, P_ob^2, ..., P_ob^m}. The obstacle point cloud blocks the lidar ray from passing through it, resulting in an incomplete object point cloud. In Fig. 3, we can see that the object point cloud is divided into two parts. The occlusion area generated by the point cloud using the ray-casting technique is defined as the occluded point cloud P_oc = {P_oc^1, P_oc^2, ..., P_oc^k}, colored in yellow and pink respectively.

Fig. 3. Point cloud definition. The top figure is the 3D view and the bottom figure is the corresponding bird's-eye view. The gray cube represents the obstacle point cloud. The object point cloud is colored in blue. The occlusion area generated by the point cloud is colored in pink and yellow.

3.2 Occlusion Area Modeling

For each point P_ob^i = (x_ob^i, y_ob^i, z_ob^i), i = 1, ..., m, of the obstacle point cloud and each point P_raw^j = (x_raw^j, y_raw^j, z_raw^j), j = 1, ..., n, of the raw object point cloud, we use the ray-casting technique to model the occlusion. We define the position of the LIDAR as the origin O. For each point P_ob^i and P_raw^j, we add occluded points P_oc^l along the direction from O through P_ob^i or P_raw^j at a fixed step. Occluded points are added until their height falls below the ground plane. The ground plane is estimated using a block recursive Gaussian process regression algorithm [16].
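The marching procedure just described can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name is hypothetical, and the ground plane is assumed flat at a fixed height instead of the Gaussian process estimate of [16].

```python
import numpy as np

def add_occluded_points(points, lidar_origin=np.zeros(3),
                        ground_z=0.0, step=0.3, max_range=100.0):
    """March along the ray from the LIDAR origin through each point,
    appending occluded points at a fixed step until they fall below
    the (assumed flat) ground plane.

    points: (N, 3) array of (x, y, z) returns.
    Returns an (M, 4) array (x, y, z, o): o = 0 for original points,
    o = 1 for the added occluded points.
    """
    out = [np.hstack([points, np.zeros((len(points), 1))])]  # o = 0
    for p in points:
        d = p - lidar_origin
        dist = np.linalg.norm(d)
        if dist == 0:
            continue
        u = d / dist  # unit direction from origin through the point
        k = 1
        while True:
            # Occluded point at range L(OP) + k*s along the same ray.
            q = lidar_origin + u * (dist + k * step)
            if q[2] < ground_z or dist + k * step > max_range:
                break
            out.append(np.array([[q[0], q[1], q[2], 1.0]]))  # o = 1
            k += 1
    return np.vstack(out)
```

For a sensor mounted above the ground, the appended shadow points descend along each ray and terminate at the ground plane, filling in the region hidden behind each return.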
For each point P_ob^i of the obstacle point cloud:

    L(O P_oc^{l1}) / L(O P_ob^i) = x_oc^{l1} / x_ob^i = y_oc^{l1} / y_ob^i = z_oc^{l1} / z_ob^i    (1)

    L(O P_oc^{l1}) = L(O P_ob^i) + k_1 s    (2)

For each point P_raw^j of the object point cloud:

    L(O P_oc^{l2}) / L(O P_raw^j) = x_oc^{l2} / x_raw^j = y_oc^{l2} / y_raw^j = z_oc^{l2} / z_raw^j    (3)

    L(O P_oc^{l2}) = L(O P_raw^j) + k_2 s    (4)

where k_1 and k_2 are positive integers and s is the step size (in our experiments we set s = 0.3 m). The function L(OP) denotes the distance from the point P to the origin O.

To distinguish the added occluded points from the original point cloud, we add a new dimension named 'occluded' to the point cloud data, expanding each point from three dimensions (x, y, z) to four dimensions (x, y, z, o). We set the occlusion property of the original points to 0 and that of the newly added occluded points to 1. We apply this procedure to both the object and the obstacle point clouds. Fig. 4 compares the object point cloud with and without the occluded points: the first and third rows show the raw object point clouds without occluded points, while the second and fourth rows show the new object point clouds with occluded points. The object point cloud with occluded points is clearly more complete than the raw one.

3.3 Deep Learning Based Point Cloud Classification Approach

We choose PointNet as the classification approach. PointNet [2], proposed by Qi et al., is a method that directly processes the original point cloud. PointNet mainly consists of several transformation layers and several Multi-Layer Perceptron (MLP) blocks.
The first layer of PointNet takes n points as input and learns a D × D transformation matrix through T-Net, where D represents the feature dimension.

Fig. 4. Comparison of the object point cloud with and without the occluded points. The first and third rows show the raw object point clouds without occluded points. The second and fourth rows show the new object point clouds with occluded points. The object point cloud with occluded points is clearly more complete than the original one.

The transformed data then passes through several Multi-Layer Perceptron (MLP) blocks shared by each point, an intermediate max pooling layer, a spatial transformation layer, and two fully connected layers. The initial value of the spatial transformation matrix is set to the identity matrix. Except for the last layer, ReLU and Batch Normalization are applied to all layers. The MLP in PointNet is implemented by convolutions with shared weights: the convolution kernel of the first layer is 1 × 3, and the subsequent kernels are 1 × 1.

3.4 Deep Learning Based Point Cloud Classification Approach with Occlusion Modeling

Based on the original PointNet network, we make some modifications to utilize the occlusion property proposed in this paper. We expand the input data from 3D to 4D, i.e., n × 4, so that PointNet can process the new point cloud format. We also modify the D × D transformation matrix learned by T-Net so that the new transformation matrix becomes 4 × 4, and make corresponding modifications to the subsequent modules of the network.
The kernel size of the first MLP convolution is modified to 1 × 4 according to the input data dimension, and the output dimension of the last layer is set to the number of classes.

Fig. 5. The main structure of PointNet's classification network and the differences between the original PointNet and ours. Few changes to the structure of the network itself are needed.

In Fig. 5, we compare PointNet and our modified PointNet. The top figure is the original PointNet; the bottom figure is our modified PointNet, with the changed parts shown in the bottom bounding box. We can see that few changes to the network structure are needed. Our approach can be applied to any network that directly processes raw lidar point cloud data.

4 Experimental Results

We divide our experiments into two parts, both performed on the KITTI dataset. We first run experiments on the seven categories of the KITTI dataset ('car', 'van', 'truck', 'pedestrian', 'cyclist', 'tram', and 'misc'). As 'car', 'van', and 'truck' share many similarities, and in fact all belong to the 'vehicle' category, we then merge them into a single category and run experiments on the resulting five categories.

4.1 Classification Results on the 7 Categories

We separately train the PointNet network on the original point cloud and on the point cloud with occluded points. The classification results are shown in Table 1 and Fig. 6. Both the overall accuracy and the per-class accuracy of our approach improve significantly compared with the original PointNet.

Table 1. Classification results on the KITTI 7 categories dataset.

        dataset   avg. class accuracy   overall accuracy
Ours    KITTI     0.784                 0.920

Fig. 6. Classification results on the KITTI 7 categories dataset.

In Fig. 7, we show the confusion matrices of the original PointNet and our approach. In Fig. 8, we show the comparison between the point cloud with and without the added points. For many samples occluded by obstacles, the incomplete point cloud often results in wrong classification, as with sample C in Fig. 8. Due to the incompleteness of its point cloud, sample C is classified into the 'misc' category by the original PointNet. In our approach, with the help of the added occluded points, it is correctly classified as a 'car'.

Fig. 7. Confusion matrices on the 7 categories using (a) the original PointNet and (b) our approach.

Fig. 8. The original point cloud is colored in blue; the added occluded points are colored in red. The original point cloud is mostly occluded and can easily lead to a wrong classification result. With the help of the occluded points, these samples are now correctly classified.

4.2 Classification Results on the 5 Categories

We merge car, van, and truck into a single class and perform experiments on the five categories. We believe that these three categories all belong to the 'vehicle' class and are equally important to self-driving cars. The classification results are shown in Table 3 and Fig. 10.

Table 2. The percentage of each category's samples.

              car     van     truck   pedestrian   cyclist   tram    misc
Testing data  0.626   0.108   0.038   0.142        0.036     0.026   0.024

Fig. 9. The object point cloud with and without occluded points for the van and cars. Sample A is a van; samples B and C are cars.

Table 3. Classification results on the KITTI 5 categories dataset. Our modified PointNet performs better than the original PointNet.

        dataset   avg. class accuracy   overall accuracy
Ours    KITTI     0.808                 0.962

In Fig. 10, it is easily seen that each category's classification accuracy is improved by our approach. Some qualitative examples are shown in Fig. 9. The confusion matrices are shown in Fig. 11.

Fig. 10. Classification results on the KITTI 5 categories dataset.

Fig. 11. Confusion matrices on the KITTI 5 categories using (a) the original PointNet and (b) our approach.

5 Concluding Remarks

In this paper, we investigate the lidar classification problem in occluded scenarios. We model occlusion as an intrinsic property of the lidar point cloud and add a pre-processing step to the lidar point cloud processing pipeline. It is important to emphasize that our approach is not only applicable to enhancing PointNet's classification performance; we believe that our occlusion modeling is an important pre-processing step that can enhance any classification approach.

References

1. Scott D. Roth. Ray casting for modeling solids. Computer Graphics and Image Processing, 18(2):109–144, 1982.
2. Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
3. Anna Petrovskaya and Sebastian Thrun. Model based vehicle detection and tracking for autonomous urban driving. Autonomous Robots, 26(2–3):123–139, 2009.
4. Anna Petrovskaya and Sebastian Thrun. Model based vehicle tracking in urban environments. In IEEE International Conference on Robotics and Automation, Workshop on Safe Navigation, volume 1, pages 1–8, 2009.
5.
Anna Petrovskaya and Sebastian Thrun. Efficient techniques for dynamic vehicle detection. In Experimental Robotics, pages 79–91. Springer, 2009.
6. Michael Himmelsbach, Thorsten Luettel, and H.-J. Wuensche. Real-time object classification in 3D point clouds using point feature histograms. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 994–1000. IEEE, 2009.
7. Chieh-Chih Wang, Charles Thorpe, and Sebastian Thrun. Online simultaneous localization and mapping with detection and tracking of moving objects: Theory and results from a ground vehicle in crowded urban areas. In 2003 IEEE International Conference on Robotics and Automation, volume 1, pages 842–849. IEEE, 2003.
8. Nicolai Wojke and Marcel Häselich. Moving vehicle detection and tracking in unstructured environments. In 2012 IEEE International Conference on Robotics and Automation, pages 3082–3087. IEEE, 2012.
9. Jian Cheng, Zhiyu Xiang, Teng Cao, and Jilin Liu. Robust vehicle detection using 3D lidar under complex urban environment. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 691–696. IEEE, 2014.
10. Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
11. Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
12. Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
13.
Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In The European Conference on Computer Vision (ECCV), September 2018.
14. Pierre Baque, Francois Fleuret, and Pascal Fua. Deep occlusion reasoning for multi-camera multi-target detection. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
15. Edward Hsiao and Martial Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1803–1815, 2014.
16. 3D LIDAR-based Dynamic Vehicle Detection and Tracking. PhD thesis, 2016.