Lidar-based Object Classification with Explicit Occlusion Modeling
Xiaoxiang Zhang (1), Hao Fu (1), and Bin Dai (1,2,3)

(1) College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; zhangxiaoxiang17@qq.com; fuhao@nudt.edu.cn
(2) Unmanned Systems Research Center, National Innovation Institute of Defense Technology, Beijing 100071, China
(3) Correspondence: bindai.cs@gmail.com

Abstract. LIDAR is one of the most important sensors for Unmanned Ground Vehicles (UGVs). Object detection and classification based on the lidar point cloud is a key technology for UGVs. In object detection and classification, the mutual occlusion between neighboring objects is an important factor affecting accuracy. In this paper, we consider occlusion to be an intrinsic property of the point cloud data and propose a novel approach that models the occlusion explicitly. The occlusion property is then taken into account in the subsequent classification step. We perform experiments on the KITTI dataset. Experimental results indicate that by utilizing the modeled occlusion property, the classifier obtains much better performance.

Keywords: Object classification · LIDAR · UGV · Occlusion.

1 Introduction

LIDAR is one of the most popular sensors for unmanned vehicles due to its highly precise range measurements. Object detection and classification based on the lidar point cloud is an extremely important technology for unmanned vehicles. However, the sparseness of the lidar point cloud and the mutual occlusion between neighboring objects pose significant challenges for object detection and classification algorithms. Fig. 1 shows a typical traffic scene in which many occlusions can be observed.
Ideally, the lidar point cloud corresponding to an object should be relatively complete and fully reflect the spatial distribution characteristics of the object. However, due to the mutual occlusion of neighboring objects, the object point cloud is usually incomplete, which may result in the wrong classification of the object. An illustrative example is shown in Fig. 2. In the training phase, many positive samples, including samples A and B shown in the top row of Fig. 2, are fed into the classifier. Sample B is occluded by another obstacle, making its point cloud incomplete. The classifier is then trained to adapt to this intra-class variation. In the testing phase, the classifier encounters two samples, C and D. Sample D is a true positive, while C is composed of two small objects, E and F. The classifier will have difficulty distinguishing C from D, and it is very likely to classify C as a false positive or D as a false negative.

Fig. 1. In a typical traffic scenario, mutual occlusion between neighboring objects is common. The lidar point cloud of the object to be classified is often incomplete and fragmented, which can easily lead to wrong classification results.

In this paper, we consider occlusion to be an intrinsic property of the point cloud data. The occlusion area can be accurately computed by considering the relative position between the LIDAR itself and each detected LIDAR point using the ray-casting technique [1]. We therefore add a pre-processing step that attaches the occlusion property to the point cloud before any further processing. As shown in the bottom row of Fig. 2, the occluded area is colored in yellow. With the help of the occlusion area, the classifier can now easily distinguish object C from D, so both the false positive rate and the false negative rate may be reduced.
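The basic occlusion test can be illustrated with a minimal sketch. This is a simplified illustration, not the paper's implementation: the function name and the angular-tolerance comparison are our own assumptions. A point is treated as occluded when some other lidar return lies in (nearly) the same direction from the sensor but at a shorter range.

```python
import numpy as np

def is_occluded(query, obstacles, origin=np.zeros(3), angle_tol=0.02):
    """Return True if `query` is occluded by any point in `obstacles`.

    A point is occluded when an obstacle point lies in nearly the same
    direction from the sensor origin (within `angle_tol` radians) but
    at a shorter range, so the lidar ray would hit the obstacle first.
    """
    dq = query - origin
    rq = np.linalg.norm(dq)
    for p in obstacles:
        dp = p - origin
        rp = np.linalg.norm(dp)
        # Cosine of the angle between the two rays from the origin.
        cos = dp @ dq / (rp * rq)
        if rp < rq and cos > np.cos(angle_tol):
            return True
    return False
```

A point directly behind an obstacle on the same ray is reported as occluded, while a point in an unrelated direction is not.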
We test our approach on the KITTI dataset, choosing PointNet [2] as the basic classifier. We modify PointNet to enable it to utilize the occlusion property. Experimental results show that our method obtains a significant improvement over the original PointNet, in both overall classification accuracy and per-class classification accuracy.

2 Related Work

There is a large literature on object detection approaches based on point clouds. Petrovskaya et al. proposed an object detection algorithm based on object geometry and a motion model [3,4,5], and used Bayesian filters to estimate the model parameters. Himmelsbach et al. extracted geometric features of the point cloud using point feature histograms [6,7], and then used an SVM to classify the objects. Building on the work of [3,4,5], Wojke et al. [8] proposed an object detection algorithm based on the combination of line features and angular features. Cheng et al. proposed to use histogram features for object detection and recognition [9].

Fig. 2. In the training phase of the traditional approach, many positive samples, including samples A and B shown in the top row, are fed into the classifier. Sample B is occluded by another obstacle, making its point cloud incomplete. In the testing phase, the classifier encounters two samples, C and D. Sample C is likely to be classified as a false positive, while sample D is likely to be classified as a false negative. In our approach, we add a pre-processing step that computes the occlusion property of the point cloud before the classification module. As shown in the bottom row, the occluded area is colored in yellow. With the help of the occlusion area, the classifier can now easily distinguish object C from D, so both the false positive rate and the false negative rate may be reduced.
Recently, deep learning based approaches have become popular due to their outstanding performance. MV3D [10] first projects the point cloud onto the bird's eye view and then trains a region proposal network (RPN) to generate 3D bounding box proposals. However, MV3D does not perform well in detecting small objects such as pedestrians and cyclists. VoxelNet [11] is an end-to-end object detection framework. It divides the point cloud into equally spaced three-dimensional voxels and transforms the points in each voxel into a uniform feature representation through the newly introduced Voxel Feature Encoding (VFE) layer. The point cloud is then encoded as a volumetric representation on which detection and classification are performed. Different from these approaches, which rely on a mid-level representation such as image grids or 3D voxels, Qi et al. proposed a new type of network called PointNet [2] that works directly on the original point cloud. PointNet is a unified framework that can be applied to object classification, part segmentation, and scene semantic parsing, and it obtains competitive results on several 3D object classification benchmarks.

For occlusion handling, several works [12,13,14,15] try to directly predict the occlusion mask. However, most of these are image-based approaches. There has been little work on lidar-based approaches that directly model occlusion and utilize the occlusion property to aid classification tasks.

3 The Proposed Approach

3.1 Point Cloud Definition

A point cloud is represented as a set of three-dimensional points {P_i | i = 1, ..., n}, where each point P_i is a vector of (x, y, z). We define the point cloud within the object bounding box as the object point cloud P_raw = {P_raw^1, P_raw^2, ..., P_raw^n}.
The point cloud outside the object bounding box is defined as the obstacle point cloud P_ob = {P_ob^1, P_ob^2, ..., P_ob^m}. The obstacle point cloud blocks the lidar ray from passing through it, resulting in an incomplete object point cloud. In Fig. 3, we can see that the object point cloud is divided into two parts. The occlusion area generated by the point cloud using the ray-casting technique is defined as the occluded point cloud P_oc = {P_oc^1, P_oc^2, ..., P_oc^k}, colored in yellow and pink respectively.

Fig. 3. Point cloud definition. The top figure is the 3D view and the bottom figure is the corresponding bird's-eye view. The gray cube represents the obstacle point cloud. The object point cloud is colored in blue. The occlusion area generated by the point cloud is colored in pink and yellow.

3.2 Occlusion Area Modeling

For each point P_ob^i = (x_ob^i, y_ob^i, z_ob^i), i = 1, ..., m, of the obstacle point cloud and each point P_raw^j = (x_raw^j, y_raw^j, z_raw^j), j = 1, ..., n, of the raw object point cloud, we use the ray-casting technique to model the occlusion. We define the position of the LIDAR as the origin O. For each point P_ob^i and P_raw^j, we add occluded points P_oc^l along the direction from O through P_ob^i or P_raw^j at a fixed step. Occluded points are added until their height falls below the ground plane. The ground plane is estimated using a block recursive Gaussian process regression algorithm [16].
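The marching procedure just described can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name is hypothetical, and the ground plane is assumed flat at a fixed height instead of the Gaussian process estimate of [16].

```python
import numpy as np

def add_occluded_points(points, lidar_origin=np.zeros(3),
                        ground_z=0.0, step=0.3, max_range=100.0):
    """March along the ray from the LIDAR origin through each point,
    appending occluded points at a fixed step until they fall below
    the (assumed flat) ground plane.

    points: (N, 3) array of (x, y, z) returns.
    Returns an (M, 4) array (x, y, z, o): o = 0 for original points,
    o = 1 for the added occluded points.
    """
    out = [np.hstack([points, np.zeros((len(points), 1))])]  # o = 0
    for p in points:
        d = p - lidar_origin
        dist = np.linalg.norm(d)
        if dist == 0:
            continue
        u = d / dist  # unit direction from origin through the point
        k = 1
        while True:
            # Occluded point at range L(OP) + k*s along the same ray.
            q = lidar_origin + u * (dist + k * step)
            if q[2] < ground_z or dist + k * step > max_range:
                break
            out.append(np.array([[q[0], q[1], q[2], 1.0]]))  # o = 1
            k += 1
    return np.vstack(out)
```

For a sensor mounted above the ground, the appended shadow points descend along each ray and terminate at the ground plane, filling in the region hidden behind each return.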
For each point P_ob^i of the obstacle point cloud:

    L(O P_oc^{l1}) / L(O P_ob^i) = x_oc^{l1} / x_ob^i = y_oc^{l1} / y_ob^i = z_oc^{l1} / z_ob^i    (1)

    L(O P_oc^{l1}) = L(O P_ob^i) + k_1 s    (2)

For each point P_raw^j of the object point cloud:

    L(O P_oc^{l2}) / L(O P_raw^j) = x_oc^{l2} / x_raw^j = y_oc^{l2} / y_raw^j = z_oc^{l2} / z_raw^j    (3)

    L(O P_oc^{l2}) = L(O P_raw^j) + k_2 s    (4)

where k_1 and k_2 are positive integers and s is the step size (in our experiments we set s = 0.3 m). The function L(OP) denotes the distance from the point P to the origin O.

To distinguish the added occluded points from the original point cloud, we add a new dimension named 'occluded' to the point cloud data, expanding each point from three dimensions (x, y, z) to four dimensions (x, y, z, o). We set the occlusion property of the original points to 0 and that of the newly added occluded points to 1. We apply this procedure to both the object and the obstacle point clouds. Fig. 4 compares the object point cloud with and without the occluded points: the first and third rows show the raw object point clouds without occluded points, while the second and fourth rows show the new object point clouds with occluded points. The object point cloud with occluded points is clearly more complete than the raw one.

3.3 Deep Learning Based Point Cloud Classification Approach

We choose PointNet as the classification approach. PointNet [2], proposed by Qi et al., is a method that directly processes the original point cloud. PointNet mainly consists of several transformation layers and several Multi-Layer Perceptron (MLP) blocks.
The first layer of PointNet takes n points as input and learns a D × D transformation matrix through T-Net, where D represents the feature dimension.

Fig. 4. Comparison of the object point cloud with and without the occluded points. The first and third rows show the raw object point clouds without occluded points. The second and fourth rows show the new object point clouds with occluded points. The object point cloud with occluded points is clearly more complete than the original one.

The transformed data then passes through several Multi-Layer Perceptron (MLP) blocks shared by each point, an intermediate max pooling layer, a spatial transformation layer, and two fully connected layers. The initial value of the spatial transformation matrix is set to the identity matrix. Except for the last layer, ReLU and Batch Normalization are applied to all layers. The MLP in PointNet is implemented by convolutions with shared weights: the convolution kernel of the first layer is 1 × 3, and the subsequent kernels are 1 × 1.

3.4 Deep Learning Based Point Cloud Classification Approach with Occlusion Modeling

Based on the original PointNet network, we make some modifications to utilize the occlusion property proposed in this paper. We expand the input data from 3D to 4D, i.e., n × 4, so that PointNet can process the new point cloud format. We also modify the D × D transformation matrix learned by T-Net so that the new transformation matrix becomes 4 × 4, and make corresponding modifications to the subsequent modules of the network.
The kernel size of the first MLP convolution is modified to 1 × 4 according to the input data dimension, and the output dimension of the last layer is set to the number of classes.

Fig. 5. The main structure of PointNet's classification network and the differences between the original PointNet and ours. Few changes to the structure of the network itself are needed.

In Fig. 5, we compare PointNet and our modified PointNet. The top figure is the original PointNet; the bottom figure is our modified PointNet, with the changed parts shown in the bottom bounding box. We can see that few changes to the network structure are needed. Our approach can be applied to any network that directly processes raw lidar point cloud data.

4 Experimental Results

We divide our experiments into two parts, both performed on the KITTI dataset. We first run experiments on the seven categories of the KITTI dataset ('car', 'van', 'truck', 'pedestrian', 'cyclist', 'tram', and 'misc'). As 'car', 'van', and 'truck' share many similarities, and in fact all belong to the 'vehicle' category, we then merge them into a single category and run experiments on the resulting five categories.

4.1 Classification Results on the 7 Categories

We separately train the PointNet network on the original point cloud and on the point cloud with occluded points. The classification results are shown in Table 1 and Fig. 6. Both the overall accuracy and the per-class accuracy of our approach improve significantly compared with the original PointNet.

Table 1. Classification results on the KITTI 7 categories dataset.

        dataset   avg. class accuracy   overall accuracy
Ours    KITTI     0.784                 0.920

Fig. 6. Classification results on the KITTI 7 categories dataset.

In Fig. 7, we show the confusion matrices of the original PointNet and our approach. In Fig. 8, we show the comparison between the point cloud with and without the added points. For many samples occluded by obstacles, the incomplete point cloud often results in wrong classification, as with sample C in Fig. 8. Due to the incompleteness of its point cloud, sample C is classified into the 'misc' category by the original PointNet. In our approach, with the help of the added occluded points, it is correctly classified as a 'car'.

Fig. 7. Confusion matrices on the 7 categories using (a) the original PointNet and (b) our approach.

Fig. 8. The original point cloud is colored in blue; the added occluded points are colored in red. The original point cloud is mostly occluded and can easily lead to a wrong classification result. With the help of the occluded points, these samples are now correctly classified.

4.2 Classification Results on the 5 Categories

We merge car, van, and truck into a single class and perform experiments on the five categories. We believe that these three categories all belong to the 'vehicle' class and are equally important to self-driving cars. The classification results are shown in Table 3 and Fig. 10.

Table 2. The percentage of each category's samples.

              car     van     truck   pedestrian   cyclist   tram    misc
Testing data  0.626   0.108   0.038   0.142        0.036     0.026   0.024

Fig. 9. The object point cloud with and without occluded points for the van and cars. Sample A is a van; samples B and C are cars.

Table 3. Classification results on the KITTI 5 categories dataset. Our modified PointNet performs better than the original PointNet.

        dataset   avg. class accuracy   overall accuracy
Ours    KITTI     0.808                 0.962

In Fig. 10, it is easily seen that each category's classification accuracy is improved by our approach. Some qualitative examples are shown in Fig. 9. The confusion matrices are shown in Fig. 11.

Fig. 10. Classification results on the KITTI 5 categories dataset.

Fig. 11. Confusion matrices on the KITTI 5 categories using (a) the original PointNet and (b) our approach.

5 Concluding Remarks

In this paper, we investigate the lidar classification problem in occluded scenarios. We model occlusion as an intrinsic property of the lidar point cloud and add a pre-processing step to the lidar point cloud processing pipeline. It is important to emphasize that our approach is not only applicable to enhancing PointNet's classification performance; we believe that our occlusion modeling is an important pre-processing step that can enhance any classification approach.

References

1. Scott D. Roth. Ray casting for modeling solids. Computer Graphics and Image Processing, 18(2):109–144, 1982.
2. Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
3. Anna Petrovskaya and Sebastian Thrun. Model based vehicle detection and tracking for autonomous urban driving. Autonomous Robots, 26(2–3):123–139, 2009.
4. Anna Petrovskaya and Sebastian Thrun. Model based vehicle tracking in urban environments. In IEEE International Conference on Robotics and Automation, Workshop on Safe Navigation, volume 1, pages 1–8, 2009.
5.
Anna Petrovskaya and Sebastian Thrun. Efficient techniques for dynamic vehicle detection. In Experimental Robotics, pages 79–91. Springer, 2009.
6. Michael Himmelsbach, Thorsten Luettel, and H.-J. Wuensche. Real-time object classification in 3D point clouds using point feature histograms. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 994–1000. IEEE, 2009.
7. Chieh-Chih Wang, Charles Thorpe, and Sebastian Thrun. Online simultaneous localization and mapping with detection and tracking of moving objects: Theory and results from a ground vehicle in crowded urban areas. In 2003 IEEE International Conference on Robotics and Automation, volume 1, pages 842–849. IEEE, 2003.
8. Nicolai Wojke and Marcel Häselich. Moving vehicle detection and tracking in unstructured environments. In 2012 IEEE International Conference on Robotics and Automation, pages 3082–3087. IEEE, 2012.
9. Jian Cheng, Zhiyu Xiang, Teng Cao, and Jilin Liu. Robust vehicle detection using 3D lidar under complex urban environment. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 691–696. IEEE, 2014.
10. Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
11. Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
12. Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
13.
Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In The European Conference on Computer Vision (ECCV), September 2018.
14. Pierre Baque, Francois Fleuret, and Pascal Fua. Deep occlusion reasoning for multi-camera multi-target detection. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
15. Edward Hsiao and Martial Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1803–1815, 2014.
16. 3D LIDAR-based Dynamic Vehicle Detection and Tracking. PhD thesis, 2016.