Light Field Saliency Detection with Deep Convolutional Networks
Jun Zhang, Yamei Liu, Shengping Zhang, Ronald Poppe, and Meng Wang

Abstract—Light field imaging presents an attractive alternative to RGB imaging because of the recording of the direction of the incoming light. The detection of salient regions in a light field image benefits from the additional modeling of angular patterns. For RGB imaging, methods using CNNs have achieved excellent results on a range of tasks, including saliency detection. However, it is not trivial to use CNN-based methods for saliency detection on light field images because these methods are not specifically designed for processing light field inputs. In addition, current light field datasets are not sufficiently large to train CNNs. To overcome these issues, we present a new Lytro Illum dataset, which contains 640 light fields and their corresponding ground-truth saliency maps. Compared to current light field saliency datasets [1], [2], our new dataset is larger, of higher quality, and contains more variation and more types of light field inputs. This makes our dataset suitable for training deeper networks and for benchmarking. Furthermore, we propose a novel end-to-end CNN-based framework for light field saliency detection. Specifically, we propose three novel MAC (Model Angular Changes) blocks to process light field micro-lens images. We systematically study the impact of different architecture variants and compare light field saliency with regular 2D saliency. Our extensive comparisons indicate that our novel network significantly outperforms state-of-the-art methods on the proposed dataset and has the desired generalization ability on other existing datasets.

I. INTRODUCTION

LIGHT field imaging [3] not only captures the color intensity of each pixel but also the directions of all incoming light rays. The directional information inherent in a light field implicitly defines the geometry of the observed scene [4]. In recent years, commercial and industrial light field cameras with a micro-lens array inserted between the main lens and the photosensor, such as Lytro [5] and Raytrix [6], have taken light field imaging into a new era. The obtained light field can be represented by the 4D parameterization (u, v, x, y) [7], where uv denotes the viewpoint plane and xy denotes the image plane, as shown in Figures 1 (a) and (c). The 4D light field can be further converted into multiple 2D light field images, such as multi-view sub-aperture images [7], micro-lens images [8], and epipolar plane images (EPIs) [9]. These light field images have been exploited to improve the performance of many applications, such as material recognition [10], face recognition [11], [12], depth estimation [13]–[17] and super-resolution [9], [18], [19].

J. Zhang, Y. Liu, and M. Wang are with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui, 230601 China. S. Zhang is with the School of Computer Science and Technology, Harbin Institute of Technology, Weihai, Shandong, 264209 China. R. Poppe is with the Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, Netherlands.
Corresponding author: Jun Zhang (e-mail: zhangjun1126@gmail.com)

Fig. 1. Illustrations of light field representations. (a) Micro-lens image representation for a given location (x*, y*). (b) Micro-lens images at sampled spatial locations. (c) Sub-aperture image representation for a given viewpoint (u*, v*). (d) Sub-aperture images at sampled viewpoints, where (u_0, v_0) represents the central viewpoint.

This paper studies saliency detection on light field images. Previous work [1], [2], [20], [21] has focused on developing hand-crafted light field features at the superpixel level by utilizing heterogeneous types of light field images (e.g., color, depth, focusness, or flow). These methods strongly rely on low-level cues and are less capable of extracting high-level semantic concepts. This makes them unsuitable for handling highly cluttered backgrounds or predicting uniform regions inside salient objects.

In recent years, convolutional neural networks (CNNs) have been successfully applied to learn an implicit relation between pixels and salience in RGB images [22]–[28]. These CNN-based methods have been combined with object proposals [22], post-processing steps [23], contextual features [24], [25], attention models [26], [27], and recurrent structures [28]. Although these approaches achieve improved saliency detection performance on benchmark datasets, they often adopt complex network architectures, which limits generalization and complicates training. Besides, the limited information in RGB images does not allow us to fully exploit geometric constraints, which have been shown to be beneficial in saliency detection [29], [30].

Fig. 2. Architecture of our network. The MAC building block converts the micro-lens image array of the light field into feature maps, which are processed by a modified DeepLab-v2 backbone model.

In this paper, we propose a novel method to predict the salience of light fields by utilizing deep learning technologies. Even with the emergence of CNNs for RGB images, there are still two key issues for saliency detection on light field images: (1) Dataset. The RGB datasets [31]–[33] are not sufficient to address significant variations in illumination, scale and background clutter. The previous light field saliency datasets LFSD [1] and HFUT-Lytro [2] include only 100 and 255 light fields, respectively, captured by first-generation Lytro cameras. They are not large enough to train deep convolutional networks without severe overfitting. In addition, the unavailability of multiple views in the LFSD dataset and the color distortion of the sub-aperture images in the HFUT-Lytro dataset impede an evaluation of existing methods. (2) Architecture.
The adoption of CNN-based architectures in light field saliency detection is not trivial because the existing CNNs for 2D images do not support the representation of 4D light field data. Thus, novel architectures must be developed for saliency detection in light field images.

Based on the aforementioned issues, we introduce a comprehensive, realistic and challenging benchmark dataset for light field saliency detection. Using a Lytro Illum camera, we collect 640 light fields with significant variations in terms of size, amount of texture, background clutter and illumination. For each light field, we provide a high-quality micro-lens image array that contains multiple viewpoints for each spatial location.

To explore the spatial and multi-view properties of light fields for saliency detection, we further propose a novel deep convolutional network based on a modified DeepLab-v2 model [34] as well as several architecture variants (termed MAC blocks) specifically designed for light field images. Our blocks aim to Model Angular Changes (MAC) from micro-lens images in an explicit way. One block type is similar to the angular filter explored in [10] for material recognition. The main difference is that [10] reorganizes the 4D information of the light field in different ways and divides the input into patches, which are further processed by the VGG-16 network for patch classification. We observe that this study does not pay much attention to the angular changes that may arise due to the network parameters, as it lacks an in-depth analysis of the relationship between its learned features and the information beneficial to material recognition. In contrast, inspired by the micro-lens array hardware configuration of light field cameras, the proposed MAC blocks are specially tailored to process micro-lens images in an explicit way. The network parameters are designed to sample different views and capture view dependencies by performing non-overlapping convolution on each micro-lens image. We experimentally show that the angular changes are consistent with the viewpoint variations of micro-lens images, and that effective angular changes at each pixel may increase depth selectivity and the ability to detect saliency accurately. Figure 2 provides an overview of the proposed network.

Our contributions are summarized as follows:
• We construct a new light field dataset for saliency detection, which comprises 640 high-quality light fields and the corresponding per-pixel ground-truth saliency maps. This is the largest light field dataset that enables efficient deep network training for saliency detection, and it addresses new challenges in saliency detection such as inconsistent illumination and small salient objects in cluttered or similar backgrounds.
• We propose an end-to-end deep convolutional network for predicting saliency on light field images. To the best of our knowledge, no work has been reported on employing deep learning techniques for light field saliency detection.
• We provide an analysis of the proposed architecture variants specifically designed for light field inputs. We also quantitatively and qualitatively compare our best-performing architecture with a 2D model that uses the central viewing image and with other 2D RGB-based methods. We show that our network outperforms state-of-the-art methods on the proposed dataset and generalizes well to other datasets.

The remainder of this paper is structured as follows.
The next section summarizes related work on light field datasets, saliency detection from light field images, and saliency detection using deep learning technologies. We introduce our novel Lytro Illum saliency dataset in Section III. We introduce our novel MAC blocks in Section IV and evaluate them in Section V. We conclude in Section VI.

II. RELATED WORK

A. Light field datasets for saliency detection

There are only two existing datasets designed for light field saliency detection, both recorded with Lytro's first-generation cameras. The Light Field Saliency Database (LFSD) [1] contains 100 light fields with a 360 × 360 spatial resolution. A rough focal stack and an all-focus image are provided for each light field. The images in this dataset usually have one salient foreground object and a background with good color contrast. The limited complexity of the dataset is not sufficient to address the variety of challenges for saliency detection when using a light field camera, such as illumination variations and small objects on similar or cluttered backgrounds. Later, Zhang et al. [2] proposed the HFUT-Lytro dataset, which consists of 255 light fields with complex backgrounds and multiple salient objects. Each light field has a 7 × 7 angular resolution and a 328 × 328 pixel spatial resolution. Focal stacks, sub-aperture images, all-focus images, and coarse depth maps are provided in this dataset. However, the color channels of its sub-aperture images are distorted owing to under-sampling during decoding [35]. In this work, we use a Lytro Illum camera to build a larger, higher-quality and more challenging saliency dataset by capturing more variation in illuminance, scale, and position. We also generate the micro-lens image array for each light field, which is not provided in previous datasets.

B. Saliency detection on light field images

Previous methods for light field saliency detection rely on superpixel-level hand-crafted features [1], [2], [20], [21], [36]. Pioneering work by Li et al. [1], [36] shows the feasibility of detecting salient regions using all-focus images and focal stacks from light fields. Zhang et al. [20] explored the light field depth cue in saliency detection, and further computed light field flow fields over focal slices and multi-view sub-aperture images to capture depth contrast [2]. In [21], a dictionary learning-based method is presented to combine various light field features using a sparse coding framework. Notably, these approaches share the assumption that dissimilarities between image regions imply salient cues. In addition, some of them [2], [20], [21] also utilize refinement strategies to enforce neighboring constraints for saliency optimization. In contrast to the above methods, we propose a deep convolutional network that learns efficient angular kernels without additional refinement on the upsampled image.

C. Deep learning for saliency prediction

Recently, remarkable advances in deep learning have driven research towards the use of CNNs for saliency detection [22]–[28]. Different from conventional learning-based methods, CNNs can directly learn a mapping between 2D images and saliency maps. Since the task is closely related to pixel-wise image classification, most works have built upon successful architectures for image recognition on the ImageNet dataset [37], often initializing their networks with the VGG network [38].
For example, several methods directly use CNNs to learn effective contextual features and combine them to infer saliency [24], [25]. Other methods extract features at multiple scales and generate saliency maps in a fully convolutional way [39], [40]. Recently, attention models [26], [27] have been introduced to saliency detection to mimic the visual attention mechanism by focusing on informative regions in visual scenes. Another direction for improving the quality of the saliency maps is the use of a recurrent structure [28], which mainly serves as a refinement stage to correct previous errors. Although deep CNNs have achieved great success in saliency detection, none of them addresses the challenges of the 4D light field. Directly applying existing network architectures to light field images would not be appropriate because a standard network is not particularly good at capturing viewpoint changes in light fields. Our work is the first to address light field saliency detection with end-to-end deep convolutional networks.

D. Deep learning technologies on light field data

Recently, learning-based techniques have been explored for processing the different types of light field images. Yoon et al. [18] proposed a deep learning framework for spatial and angular super-resolution, in which two adjacent sub-aperture images are employed to generate the in-between view. Wang et al. [41] built a bidirectional recurrent CNN to super-resolve horizontally and vertically adjacent sub-aperture image stacks separately and then combined them using a multi-scale fusion scheme to obtain complete view images. Very recently, Zhang et al. [42] designed a residual network structure to process one central view image and four stacks of sub-aperture images from four angular directions. Residual information from the different directions is then combined to yield the high-resolution central view image. Kalantari et al. [43] proposed the first deep learning framework for view synthesis. They applied two sequential CNNs to only four corner sub-aperture images to model depth and color estimation simultaneously by minimizing the error between synthesized views and ground truth images. Wu et al. [19] introduced a blur-restoration-deblur framework for light field reconstruction on 2D EPIs (epipolar plane images). In order to directly synthesize novel views of dense 4D light fields from sparse views, Wang et al. [44] assembled 2D strided convolutions operated on stacked EPIs and two detail-restoration 3D CNNs connected with angular conversion to build a pseudo 4D CNN. Heber et al. [45] applied a CNN in a sliding-window fashion for shape from EPIs, which allows estimating the depth map of a predefined sub-aperture image. In successive work [46], they designed a U-shaped network for disparity estimation operating on an EPI volume with two spatial dimensions and one angular dimension.

The work most similar to ours is [10], which proposes several CNN architectures based on different light field image types. One of the architectures is developed for images similar to raw micro-lens images. However, the network is mainly designed to verify the advantages of the multi-view information of light fields over 2D RGB images for material recognition. In contrast, our work specifically focuses on the angular features learned from micro-lens images and their relationship with salient/non-salient cues.
III. THE LYTRO ILLUM SALIENCY DATASET

To train and evaluate our network for saliency detection, we introduce a comprehensive novel light field dataset.

Fig. 3. Flowchart of the dataset construction. (a) Lytro Illum camera. (b) Sub-aperture images. (c) Micro-lens image array. (d) Ground-truth map for the central viewing image. (e) The generation of a micro-lens image array from sub-aperture images. The digits indicate viewpoints.

A. Light field representation

There are various ways to represent the light field [7], [47], [48]. We adopt the two-plane parameterization [7] to define the light field as a 4D function L(u, v, x, y), where u × v indicates the angular resolution and x × y indicates the spatial resolution. As illustrated in Figures 1 (a) and (b), the set of all incoming rays from the uv plane intersecting a given micro-lens location (x*, y*) produces a micro-lens image with multiple viewpoints, L_M(u, v, x*, y*). The micro-lens images from different locations can be arranged into a micro-lens image array. As shown in Figures 1 (c) and (d), all micro-lens regions on the xy plane receive the incoming rays from a given angular position (u*, v*), which produces a sub-aperture image with all locations, L_S(u*, v*, x, y). The central viewing image is formed by the rays that pass through the main lens optical center (u = u_0, v = v_0). Since the sub-aperture images contain optical distortions caused by the light rays passing through the lens [49], [50], in this paper we build our network on the micro-lens images, which have been shown to be advantageous over sub-aperture images for scene reconstruction [51].

B. Dataset construction

Figure 3 illustrates the procedure of our light field dataset construction. First, a set of 4D light fields is obtained using a Lytro Illum camera (Figure 3 (a)). Second, we use Lytro Power Tools (LPT) [52] to decode the light fields from raw 4D data to 2D sub-aperture images, so that each light field has a spatial resolution of 540 × 375 and an angular resolution of 14 × 14. To reach a compromise between training time and detection accuracy, we sample 9 × 9 viewpoints from each light field to generate new sub-aperture images, as shown in Figure 3 (b). Third, we generate a micro-lens image by sampling the same spatial location from each sub-aperture image (see Figure 3 (e)), which further produces a micro-lens image array of size 4860 × 3375, shown in Figure 3 (c). The red region in Figure 3 (c) indicates one pixel with 9 × 9 observation viewpoints, compared to one pixel with only the central view in Figure 3 (d).

We initially collect 800 light fields and manually annotate the per-pixel ground-truth label for each central viewing image. To reduce label inconsistency, each image is annotated by five independent annotators. We only regard a pixel as salient if it is verified by at least three annotators, and we only keep those images with sufficient agreement. In the end, our new dataset contains 640 light fields with 81 views.
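To make the construction in Figure 3 (e) concrete, the sketch below shows one way to interleave the 9 × 9 sampled sub-aperture images into a micro-lens image array and to read the representations of Section III-A back from the same data. It is an illustrative NumPy sketch using the dataset dimensions; the array names and axis ordering are our own assumptions, not the released implementation.

```python
# Minimal sketch of the rearrangement in Figure 3 (e): sub-aperture images
# L_S(u, v, x, y) are interleaved so that every spatial location (x, y)
# becomes a 9 x 9 micro-lens image in the output array.
import numpy as np

N_u = N_v = 9          # angular resolution (sampled viewpoints)
N_x, N_y = 540, 375    # spatial resolution (width, height)

# sub_aperture[v, u] is one 2D view; axis order (v, u, y, x, channel)
sub_aperture = np.zeros((N_v, N_u, N_y, N_x, 3), dtype=np.uint8)

# Micro-lens image array: (N_y * N_v) x (N_x * N_u) x 3 = 3375 x 4860 x 3.
# Transposing to (y, v, x, u, c) and reshaping places the 9 x 9 block of
# viewpoints of each pixel next to each other, as in Figure 3 (c).
micro_lens_array = sub_aperture.transpose(2, 0, 3, 1, 4).reshape(
    N_y * N_v, N_x * N_u, 3)

# Reading the representations of Section III-A back from the same 4D data:
central_view = sub_aperture[N_v // 2, N_u // 2]        # L_S(u0, v0, x, y)
one_micro_lens_image = sub_aperture[:, :, 100, 200]    # L_M(u, v, x*=200, y*=100)
```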
Figure 4 shows eight examples of central viewing images and their corresponding ground-truth saliency maps. There are significant variations in illumination, spatial distribution, scale and background. Besides, some saliency annotations consist of multiple regions.

IV. LIGHT FIELD SALIENCY NETWORK

We propose an end-to-end deep convolutional network framework for light field saliency detection, as shown in Figure 2. Based on the micro-lens image array, the MAC (Model Angular Changes) blocks are designed to transfer the light field inputs to feature maps in different ways. Then, the feature maps are fed to a modified DeepLab-v2 [34] to predict saliency maps. We first discuss the backbone model and then detail the different MAC block variants.

Fig. 4. Example central viewing images (top) and their corresponding ground-truth saliency maps (bottom) from our novel Lytro Illum dataset.

Fig. 5. Network structure of the backbone model based on DeepLab-v2 [34]. The reduction in resolution is shown at the top of each box.

A. Backbone model

We formulate light field saliency detection as a binary pixel labeling problem. Saliency detection and semantic segmentation are closely related because both are pixel-wise labeling tasks and require low-level cues as well as high-level semantic information. Inspired by previous literature on semantic segmentation [34], [53], [54], we design our backbone model based on DeepLab [54], a variant of FCNs [53] modified from the VGG-16 network [38]. There are several variants of DeepLab [34], [55], [56]. In this work, we use DeepLab-v2 [34], which introduces ASPP to capture multi-scale information and long-range spatial dependencies among image units. The modified network is composed of five convolutional (conv) blocks, each of which is divided into convolutions followed by a ReLU. A max-pooling layer is connected after the top conv layer of each conv block. The ASPP is applied on top of block5 and consists of four branches with atrous rates r = {6, 12, 18, 24}. Each branch contains one 3 × 3 convolution and one 1 × 1 convolution. The resulting features from all branches are then passed through another 1 × 1 convolution and summed to generate the final score. The network further employs bilinear interpolation to upsample the fused score map to the original resolution of the central viewing image, which produces the saliency prediction at the pixel level. In addition, we add dropout to all the conv layers of the five blocks to avoid overfitting, and we set a 1 × 1 conv layer with 2 channels after the ASPP to produce saliency and non-saliency score maps. The detailed architecture is illustrated in Figure 5.
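The ASPP head described above can be sketched as follows. This is a simplified PyTorch-style sketch that follows the atrous rates and the 2-channel score output described in the text; the class name, channel widths, example feature size, and the omission of dropout and initialization details are our own simplifications rather than the authors' released Caffe implementation.

```python
# Simplified sketch of the ASPP head: four atrous branches over the block5
# features, each reduced to a 2-channel (saliency / non-saliency) score map,
# summed, and bilinearly upsampled to the central-view resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=1024, num_classes=2,
                 rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_channels, mid_channels, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_channels, num_classes, 1),
            )
            for r in rates
        ])

    def forward(self, x, out_size):
        # Sum the per-branch score maps ...
        score = sum(branch(x) for branch in self.branches)
        # ... and upsample the fused score map to the central-view resolution.
        return F.interpolate(score, size=out_size, mode="bilinear",
                             align_corners=False)

# e.g. block5 features at roughly 1/8 resolution of a 375 x 540 central view
scores = ASPPHead()(torch.randn(1, 512, 47, 68), out_size=(375, 540))
```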
B. MAC blocks

Our network is essentially a modified DeepLab-v2 network augmented with a light field input process. As shown in Figure 2, a MAC block is a basic computational unit operating on a micro-lens image array input M ∈ R^(W×H×C) and producing an output feature map F ∈ R^(W'×H'×C'). Here, W = N_x × N_u and H = N_y × N_v, where (N_x, N_y) is the spatial size and (N_u, N_v) is the view size. The motivation of the MAC block is to model angular changes at one pixel location in an explicit manner. An essential part of learning angular features is the design of the convolutional kernels applied to the micro-lens images. However, it is unclear what defines good angular filters and how many angular directions should be chosen for better performance. In this paper, we propose three different MAC block variant architectures to process light field micro-lens image arrays before block1 of the backbone model, in which the convolution types, kernel sizes, stride sizes and sampled viewpoints are all designed to capture angular changes in light fields, as shown in Figure 6.

For design simplicity of the MAC block, some default settings are fixed to guarantee that the predicted map and the ground-truth map have the same spatial resolution in a fully convolutional network architecture. First, the spatial dimension of the output of the MAC block is ensured to be the same as that of the 2D sub-aperture image, i.e., W' = N_x and H' = N_y. Second, the number C' of convolutional kernels in the MAC block is the same as the number of convolution kernels in block1 of DeepLab-v2. In our case, where the data are captured by a Lytro Illum camera, the MAC block converts the light field input data into a 540 × 375 × 64 feature map. The parameters of the MAC block variants, including the kernel size k × k × C and the convolutional stride s, should meet the above two conditions.

Fig. 6. Architectures of the proposed MAC blocks. (a) MAC block-9×9. (b) MAC block-3×3. (c) MAC block-star shaped. The selected viewpoints are highlighted in red.

We now discuss the detailed architectures of the three proposed MAC blocks.

1) MAC block-9×9: As described in Section III-B, each micro-lens image has 9 × 9 viewpoints and corresponds to one pixel location. The spatial resolution is 540 × 375, thus the size of the whole micro-lens image array is 4860 × 3375. In this architecture, we design angular convolutional kernels across all viewpoint directions, as shown in Figure 6 (a). The kernel size matches the angular resolution of one micro-lens image, and the number of kernels and the stride size are set to extract angular features for each micro-lens image. Specifically, we propose 64 angular kernels, each of which is a 9 × 9 filter. The stride of the convolution operations is 9, which leads to 540 × 375 × 64 feature maps. Each point on the feature map can be considered as being captured by the 81 lenslets. These kernels differ from common convolutional kernels applied to 2D images in that they only detect the angular changes in the micro-lens image array. This architecture directly learns the angular information from light field images, and is thus expected to distinguish salient foregrounds and backgrounds with similar colors or textures.
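As an illustration, MAC block-9×9 amounts to a single non-overlapping convolution whose kernel and stride both equal the angular resolution. The PyTorch-style sketch below shows how the 4860 × 3375 micro-lens image array is reduced to a 540 × 375 × 64 angular feature map before block1; the variable names and the random input are our own placeholders.

```python
# Hedged sketch of MAC block-9x9: a non-overlapping 9x9 convolution over the
# micro-lens image array, so each output position aggregates exactly the
# 9 x 9 = 81 viewpoints of one spatial location.
import torch
import torch.nn as nn

angular_res = 9
mac_block_9x9 = nn.Conv2d(in_channels=3, out_channels=64,
                          kernel_size=angular_res, stride=angular_res)

# Micro-lens image array from Section III-B: height 3375 = 375 * 9,
# width 4860 = 540 * 9 (batch of one, RGB channels first).
micro_lens_array = torch.randn(1, 3, 375 * angular_res, 540 * angular_res)

features = mac_block_9x9(micro_lens_array)
print(features.shape)  # torch.Size([1, 64, 375, 540]) -> fed into block1
```

Because the stride equals the kernel size, no two output positions share input pixels, so the 64 learned kernels act purely as angular filters, matching the non-overlapping convolution described above.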
2) MAC block-3×3: Motivated by the effectiveness of smaller kernels in VGG-16 [38] and Inception v2 [55], we replace the 9 × 9 convolution in MAC block-9×9 with two layers of 3 × 3 convolution (stride = 3), as shown in Figure 6 (b), which increases the number of parameters while enhancing the nonlinearity of the network.

3) MAC block-star shaped: We design atrous angular convolutional kernels to capture long-range angular features. The atrous rates are set to sample representative viewpoint directions. It has been shown that using selected angular directions is beneficial in the context of depth estimation [15], [57]. Here, we test this idea for saliency detection. Different from MAC block-9×9, we select star-shaped viewpoints (i.e., the four directions θ = {0°, 45°, 90°, 135°}) from each micro-lens image. To implement viewpoint sampling and angular filtering, we use atrous convolution with five atrous rates, as shown in Figure 6 (c). The resulting feature maps are concatenated and combined using 1 × 1 convolutions for later processing.

Adaptation. Note that although the proposed framework is specially tailored to process micro-lens based light field data for saliency detection, in theory the proposed network could be adapted to other types of light field data as well, provided the camera captures the same pixel from sufficiently many different views. However, directly using the proposed network to process other types of light field data has the following potential problems. (i) The number of views is usually limited and the resulting images suffer from angular aliasing due to the poor angular sampling of other cameras, such as multi-camera arrays [58] and light field gantries [59]. (ii) For sparse and wide-baseline light fields captured by multi-camera arrays, convolution over the full resolution of the light field may be prohibitively memory intensive and computationally expensive. (iii) Compared to micro-lens based light field cameras, which offer a simple design, flexibility and low marginal cost, other light field systems are generally impractical for outdoor data collection due to the complexity, weight, and cost of the capturing system, which makes it difficult to construct a realistic and challenging saliency dataset of a certain scale. Therefore, the proposed network is currently only applicable to micro-lens based light field data with a set of dense views.

V. EXPERIMENTAL RESULTS

Our experimental evaluation is split into three main parts. The first part evaluates the three variants of the MAC blocks to identify the network design that works best for light field saliency detection. The second part discusses the angular resolution, the overfitting issue and the advantages of the selected MAC block variant compared to 2D saliency detectors. Finally, the third part shows the performance we can attain with the best performing model. We show state-of-the-art results on the new Lytro Illum, HFUT-Lytro [2] and LFSD [1] datasets, based on a network pre-trained on the proposed dataset.

Fig. 7. Light field image examples. (a) The HFUT-Lytro dataset. (b) The proposed Lytro Illum dataset. Left: the all-focus images and the central viewing images are shown for the two datasets, respectively. Right: nine sub-aperture images are randomly sampled for each light field.

A. Settings

1) Implementation and training: The computational environment has an Intel i7-6700K CPU @ 4.00 GHz, 15 GB RAM,
and an NVIDIA GTX 1080 Ti GPU. We trained our network using the Caffe library [60] with a maximum of 160K iterations. We initialize the backbone model with DeepLab-v2 [34] pre-trained on the PASCAL VOC 2012 segmentation benchmark [61]. The newly added conv layers in the MAC block, the first layer of block1, and the score layer are initialized using the Xavier algorithm [62]. The whole network is trained end-to-end using the stochastic gradient descent (SGD) algorithm. To balance training time and image size, we use a batch size of one image. Momentum and weight decay are set to 0.9 and 0.0005, respectively. The base learning rate is initialized as 0.01 for the newly added conv layers in the MAC block and the first layer of block1, and as 0.001 with the poly decay policy for the remaining layers. Dropout layers with probabilities p = [0.1, 0.1, 0.2, 0.2, 0.3, 0.5] are applied after the conv layers of block1–block5 and the ASPP, respectively. We use the softmax loss function defined as

L = -\frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \log \frac{e^{z_{i,j}^{y_{i,j}}}}{e^{z_{i,j}^{0}} + e^{z_{i,j}^{1}}}    (1)

where W and H indicate the width and height of an image, z_{i,j}^{0} and z_{i,j}^{1} are the last two activation values of pixel (i, j), and y_{i,j} is the ground-truth label of pixel (i, j). Note that y_{i,j} is 1 only when pixel (i, j) is salient. Our code and dataset are available at https://github.com/pencilzhang/MAC-light-field-saliency-net.git.

2) Datasets: Three datasets are used for benchmarking: the proposed Lytro Illum dataset, the HFUT-Lytro dataset [2], and the LFSD dataset [1]. Our network is trained and evaluated on the proposed Lytro Illum dataset using five-fold cross-validation. The trained model is further tested on the other two datasets to evaluate the generalization ability of our network. Note that the unavailable viewpoints in the LFSD dataset and the color distortion of the sub-aperture images in the HFUT-Lytro dataset (see the examples in Figure 7 for a visual comparison) are unsuitable for the evaluation of our method. To apply the trained model to these two datasets, we pad the angular resolution to 9 × 9 using the all-focus image.

3) Data augmentation: In order to obtain enough training data to achieve good performance without overfitting, we augment the training data aggressively on-the-fly. For this augmentation, we use geometric transformations (i.e., rotation, flipping and cropping), changes in brightness, contrast, and chroma, as well as additive Gaussian noise. Specifically, we rotate the micro-lens image array by 90, 180, and 270 degrees, and perform horizontal and vertical flipping. To change the relative position of the salient region in the image, we randomly crop two subimages of size 3519 × 2907 from the micro-lens image array. Then, for one subimage and the image arrays with 0, 90, and 180 degrees of rotation, we adjust the brightness by multiplying all pixels by 1.5 and 0.6, respectively, and both chroma and contrast by a factor of 1.7. Finally, we add zero-mean Gaussian noise with a variance of 0.01 to all images. In total, we expand each micro-lens image array by a factor of 48 ((4 × 4 + 8) × 2), such that the whole training set is increased from 512 to 24,576 samples.
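The augmentation pipeline can be summarized by the following sketch. It is a schematic NumPy illustration of the individual transforms listed above, not the authors' training code; the function names, the simplified brightness/contrast handling, and the clipping to [0, 1] are our own assumptions, the chroma adjustment is omitted, and the exact pairing of transforms that yields the 48-fold expansion follows the description in the text.

```python
# Schematic sketch of the on-the-fly augmentation, operating directly on the
# micro-lens image array (H x W x 3, float values in [0, 1]).
import numpy as np

def rotations_and_flips(img):
    views = [np.rot90(img, k) for k in range(4)]   # 0 / 90 / 180 / 270 degrees
    views += [np.fliplr(img), np.flipud(img)]      # horizontal / vertical flip
    return views

def random_crop(img, size=(2907, 3519), rng=np.random):
    # crop height x width = 2907 x 3519 from the 3375 x 4860 array
    top = rng.randint(0, img.shape[0] - size[0] + 1)
    left = rng.randint(0, img.shape[1] - size[1] + 1)
    return img[top:top + size[0], left:left + size[1]]

def photometric(img, brightness, contrast=1.7):
    out = np.clip(img * brightness, 0.0, 1.0)       # brightness x1.5 or x0.6
    mean = out.mean()
    return np.clip((out - mean) * contrast + mean, 0.0, 1.0)  # contrast x1.7

def add_noise(img, var=0.01, rng=np.random):
    return np.clip(img + rng.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)
```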
4) Evaluation metrics: We adopt five metrics to evaluate our network. The first is the precision-recall (PR) curve. Specifically, saliency maps are first binarized and then compared to the ground truths under varying thresholds. The second metric is the F_β-measure, which considers both precision and recall:

F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}    (2)

where β² is set to 0.3 as suggested in [31]. The third metric is Average Precision (AP), which is computed by averaging the precision values at evenly spaced recall levels. The fourth metric is the Mean Absolute Error (MAE), which computes the average absolute per-pixel difference between the predicted map and the corresponding ground-truth map. Additionally, to amend several limitations of the above four metrics, such as the interpolation flaw of AP, the dependency flaw of the PR curve and the F_β-measure, and the equal-importance flaw of all metrics, we follow [63] and use the weighted F_β^w (WF) measure based on weighted precision and recall as the fifth metric:

F_\beta^w = \frac{(1 + \beta^2) \cdot Precision^w \cdot Recall^w}{\beta^2 \cdot Precision^w + Recall^w}    (3)

where w is a weighting function based on the Euclidean distance that captures the importance of a pixel given the ground truth.

B. Evaluation of MAC blocks

We present a detailed performance comparison of the different MAC block variant architectures on the proposed Lytro Illum dataset. As described in Section IV-B, these variants only differ in the convolution operations applied to their light field inputs. The quantitative results of the comparison are shown in Table I, from which we can see that the MAC block-9×9 architecture achieves the best performance on all metrics on the proposed dataset. We hypothesize that treating every micro-lens image as a whole and applying angular kernels that have the same size as the angular resolution of the light field helps to exploit the multi-view information in the micro-lens image array. The detection performance of the two other variants is lower, probably because the increased number of parameters makes the networks more difficult to train.

TABLE I
QUANTITATIVE RESULTS ON THE PROPOSED LYTRO ILLUM DATASET

Method                  F-measure   WF-measure   MAE      AP
MAC block-star shaped   0.8045      0.7426       0.0555   0.9120
MAC block-3×3           0.8066      0.7471       0.0562   0.9118
MAC block-9×9           0.8116      0.7540       0.0551   0.9124

Fig. 8. Visual comparison of different MAC block variants. (a) Central viewing images. (b) Ground truth maps. (c) MAC block-9×9. (d) MAC block-3×3. (e) MAC block-star shaped.

Figure 8 presents qualitative results of all variants. As illustrated in the figure, these variants can separate the most salient regions from similar or cluttered backgrounds. Compared to the other variants, MAC block-9×9 outputs cleaner and more consistent predictions for regions with specular reflections (row 1), small salient objects (row 2), and similar foreground and background (rows 2 and 3). Moreover, we can see that MAC block-9×9 better predicts salient regions without being strongly affected by the light source (row 4). These results demonstrate that the proposed network variants are likely to extract potential depth cues by learning angular changes, which are helpful for saliency detection. Kernels with the same size as the angular resolution show a better capability for depth discrimination.

C. Model analysis

Here, we perform all following experiments using MAC block-9×9, since this setup performed best in the previous evaluation.
1) Effectiveness of the MAC block: To further delve into the difference between regular image saliency and light field saliency, we present some important properties of light field features that facilitate saliency detection. We compare our 4D light field saliency model (i.e., MAC block-9×9) to a 2D model that uses the central viewing image as input (2D-central view). The quantitative results are shown in Table II. We observe that light field saliency detection with multiple views performs better than the 2D detector with only the central view.

TABLE II
QUANTITATIVE COMPARISON BETWEEN OUR 4D MODEL AND 2D-CENTRAL VIEW ON THE PROPOSED LYTRO ILLUM DATASET

Method            F-measure   WF-measure   MAE      AP
Ours              0.8116      0.7540       0.0551   0.9124
2D-central view   0.8056      0.7446       0.0597   0.9016

To provide complementary insight into why light field saliency works, we visualize the weights of the first conv layers of our network and of 2D-central view in Figure 9 (a) to compare angular and spatial patterns. We can see that the learned weights of our MAC block have noticeable changes in angular space, which suggests that the viewpoint cue of the light field data is well captured. The angular changes are also consistent with the viewpoint variations of the micro-lens images, as shown in Figure 9 (b). These results are attributed to the newly designed convolution, in which the kernel size is the same as the angular resolution of the micro-lens image and the stride guarantees that angular features are extracted for each micro-lens image. As a consequence, our 4D saliency detector produces more accurate saliency maps than the 2D detector, as shown in Figure 9 (c).

In addition, we show the feature maps obtained from the two models in Figure 10. It can be seen that different layers encode different types of features. Higher layers capture semantic concepts of the salient region, whereas lower layers encode more discriminative features for identifying the salient region. The proposed 4D saliency detector can well discriminate the white spout from the white pants, as shown in Figure 10 (a). However, as illustrated in the block1-conv1 and block5 feature maps of Figure 10 (b), most feature maps from the 2D detector have small values that are not discriminative enough to separate the salient tea cup from the pants. Thus, the 2D detector produces features cluttered with background noise in the subsequent ASPP and score fusion. More comparisons of saliency maps between the two models can be seen in Figure 11.

2) Effect of the angular resolution: To show the effect of the angular resolution on the network, we compare the performance of our architecture with a varying number of viewpoints in Table III. Note that we change the kernel size to match the angular resolution. From the table, we can see that the network using 9 × 9 viewpoints shows the best overall performance. Increasing the angular resolution to 11 × 11 does not improve the performance, which can be explained by the fact that the viewing angles at the boundary are very oblique [64] and that the narrow baseline of the light field camera leads to high viewing redundancy at higher angular resolutions [7], [65].

TABLE III
EFFECTS OF THE ANGULAR RESOLUTION ON THE PROPOSED DATASET

Angular resolution   F-measure   WF-measure   MAE      AP
7 × 7                0.8018      0.7406       0.0567   0.9135
9 × 9                0.8116      0.7540       0.0551   0.9124
11 × 11              0.8006      0.7392       0.0567   0.9109
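The kernel-size rule used for Table III, where the angular kernel and stride always equal the number of sampled viewpoints per micro-lens image, can be expressed as a small variation of the MAC block-9×9 sketch in Section IV-B. The helper below is an illustrative assumption on our part, not part of the released code.

```python
# Hypothetical helper mirroring Table III: kernel size and stride both track
# the angular resolution, so the output keeps the 540 x 375 spatial resolution
# regardless of how many viewpoints are used.
import torch.nn as nn

def make_mac_block(angular_res: int, out_channels: int = 64) -> nn.Conv2d:
    return nn.Conv2d(3, out_channels, kernel_size=angular_res, stride=angular_res)

mac_blocks = {a: make_mac_block(a) for a in (7, 9, 11)}
```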
Fig. 9. Visual comparison of our 4D model (top) and the 2D model using the central view (bottom). (a) Visualization of the first conv layers. (b) Light field input with highlighted regions. (c) Saliency predictions. (d) Ground truth maps.

Fig. 10. Feature maps obtained from (a) the 4D model and (b) the 2D model using the central view, at different layers. From top to bottom: the first conv features of block1, block5 output features, ASPP features with four atrous rates, and the score fusion maps via sum-pooling. For ASPP and sum fusion, the non-salience and salience scores are shown in the left and right subfigures, respectively.

Fig. 11. Qualitative comparison of the 4D model and 2D-central view. (a) Central viewing images. (b) Ground-truth maps. (c) Ours. (d) 2D-central view.

3) Overfitting issues: Overfitting is a common problem when training a CNN with limited data. In this section, we analyse the proposed network by introducing different strategies to handle overfitting: data augmentation (DG) and dropout. The results obtained for our best performing model are shown in Figure 12. Clearly, the network overfits when trained on the original training data, as shown in Figure 12 (a). As expected, both DG and dropout are crucial to minimize overfitting, as shown in Figures 12 (b)–(d). Figure 12 (e) presents the corresponding PR curves. It can be seen that by increasing the amount and diversity of the data and the amount of dropout between different layers during training, the performance of the network increases as well.

Fig. 12. Training and validation loss for our model on the proposed Lytro Illum dataset. (a) Original training data. (b) Training with DG. (c) Training with dropout. (d) Training with DG and dropout. (e) PR curves for the different strategies.

D. Comparison with 2D models

To understand the additional information contained in the micro-lens light field images, we compare our best performing approach (i.e., MAC block-9×9) to 8 existing methods on the test set of our proposed dataset. Our comparison includes 4 traditional approaches, MST [66], SMD [67], MDC [68], and WFD [69], and 4 CNN-based ones: PiCANet [25], Amulet [70], LFR [71], and HyperFusion [72]. To facilitate a fair comparison and effective model training, we use the recommended parameter settings provided by the authors to initialize these models. All CNN-based methods are based on DNNs pre-trained on the ImageNet [37] classification task. We retrain these CNN models on the proposed dataset in a five-fold
cross-validation manner and apply the same data augmentation method as used in our work. The quantitative results are shown in Table IV. We can see that, in general, our model outperforms the other methods in terms of the F-measure, MAE, and AP metrics. Amulet [70] obtains the second best performance on the proposed dataset. CNN-based methods consistently perform better than traditional methods.

TABLE IV
QUANTITATIVE COMPARISON OF OUR APPROACH AND OTHER 2D MODELS ON THE PROPOSED DATASET. BOLD: BEST, UNDERLINED: SECOND BEST

Model                  F-measure   WF-measure   MAE      AP
Traditional
  MST [66]             0.6695      0.5834       0.1243   0.6967
  SMD [67]             0.7246      0.5371       0.1234   0.7789
  MDC [68]             0.7407      0.5891       0.1094   0.7552
  WFD [69]             0.7260      0.6408       0.1024   0.7604
CNN-based
  PiCANet [25]         0.7908      0.6745       0.0782   0.8362
  Amulet [70]          0.8059      0.7686       0.0552   0.8485
  LFR [71]             0.7756      0.7242       0.0702   0.8463
  HyperFusion [72]     0.7549      0.6945       0.0752   0.8359
Ours                   0.8116      0.7540       0.0551   0.9124

E. Comparison to state-of-the-art light field methods

We compare our best performing model, MAC block-9×9, to four state-of-the-art methods tailored to light field saliency detection: Multi-cue [2], DILF [20], WSC [21], and LFS [1]. We train our network on the novel dataset and evaluate it on the others without fine-tuning. The results of the other methods are obtained using the authors' implementations. Tables V–VII and Figure 13 show quantitative results on the three datasets. Overall, our approach outperforms the other methods on all three datasets without any post-processing for refinement, which demonstrates the advantage of the proposed deep convolutional network for light field saliency detection. In particular, we observe that the proposed approach shows significant performance gains compared to previous methods on the proposed dataset for all metrics. The performance is lower on the HFUT-Lytro and LFSD datasets, which is due to the limited viewpoint information in these datasets: a large number of the filters learnt on the proposed dataset are underused there. This demonstrates that different light field datasets do affect the accuracy of methods. The Multi-cue [2] and DILF [20] methods show better performance than our approach in terms of F-measure and AP on the LFSD dataset. The reason is that these methods use external depth features and post-processing refinement to improve performance.

TABLE V
QUANTITATIVE RESULTS ON THE PROPOSED LYTRO ILLUM DATASET. BOLD: BEST, UNDERLINED: SECOND BEST

Method          F-measure   WF-measure   MAE      AP
LFS [1]         0.6107      0.3596       0.1697   0.6193
WSC [21]        0.6451      0.5945       0.1093   0.5958
DILF [20]       0.6395      0.4844       0.1389   0.6921
Multi-cue [2]   0.6648      0.5420       0.1197   0.6593
Ours            0.8116      0.7540       0.0551   0.9124

TABLE VI
QUANTITATIVE RESULTS ON THE HFUT-LYTRO DATASET. BOLD: BEST, UNDERLINED: SECOND BEST

Method          F-measure   WF-measure   MAE      AP
LFS [1]         0.4868      0.3023       0.2215   0.4718
WSC [21]        0.5552      0.5080       0.1454   0.4743
DILF [20]       0.5543      0.4468       0.1579   0.6221
Multi-cue [2]   0.6135      0.5146       0.1388   0.6354
Ours            0.6721      0.6087       0.1029   0.7390
TABLE VII
QUANTITATIVE RESULTS ON THE LFSD DATASET. BOLD: BEST, UNDERLINED: SECOND BEST

Method          F-measure   WF-measure   MAE      AP
LFS [1]         0.7525      0.5319       0.2072   0.8161
WSC [21]        0.7729      0.7371       0.1453   0.6832
DILF [20]       0.8173      0.6695       0.1363   0.8787
Multi-cue [2]   0.8249      0.7155       0.1503   0.8625
Ours            0.8105      0.7378       0.1164   0.8561

Fig. 13. Comparison on the three datasets in terms of PR curves. (a) The proposed Lytro Illum dataset. (b) The HFUT-Lytro dataset. (c) The LFSD dataset.

Some qualitative results are shown in Figure 14. We can see that our approach handles various challenging scenarios, including multiple salient objects (rows 1 and 2), cluttered backgrounds (rows 3 and 5), small salient objects (rows 4 and 7), inconsistent illumination (rows 1 and 6), and salient objects against similar backgrounds (rows 8, 9 and 10). It is also worth noting that, without any post-processing, our approach highlights salient objects more uniformly than the other methods.

Fig. 14. Visual comparison of our best MAC block variant (Ours) and state-of-the-art methods on three datasets. (a) Central viewing/all-focus images. (b) Ground truth maps. (c) Ours. (d) LFS [36]. (e) DILF [20]. (f) WSC [21]. (g) Multi-cue [2]. The first five samples are taken from the proposed Lytro Illum dataset, the middle three samples are taken from the HFUT-Lytro dataset, and the last two samples are taken from the LFSD dataset.

VI. CONCLUSION

This paper introduces a deep convolutional network for saliency detection on light fields that exploits the multi-view information in micro-lens images. Specifically, we propose MAC block variants to process the micro-lens image array. This paper can be viewed as the first work that addresses light field saliency detection with an end-to-end CNN. To facilitate the training of such a deep network, we introduce a challenging saliency dataset with light field images captured with a Lytro Illum camera. In total, 640 high-quality light fields are produced, making the dataset the largest among existing light field saliency datasets. Extensive experiments demonstrate that, compared to 2D saliency based on the central view alone, 4D light field saliency can exploit additional angular information, contributing to an increase in saliency detection performance. The proposed network is superior to saliency detection methods designed for 2D RGB images on the proposed dataset, outperforms the state-of-the-art light field saliency detection methods on the proposed dataset, and generalizes well to the existing datasets. In particular, our approach is capable of detecting salient regions in challenging cases, such as similar foregrounds and backgrounds, inconsistent illumination, multiple salient objects, and cluttered backgrounds. Our work suggests promising future directions for exploiting spatial and angular patterns in light fields and deep learning technologies to advance the state of the art in pixel-wise prediction tasks.

REFERENCES

[1] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, "Saliency detection on light field," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, pp. 1605–1616, 2017.
[2] J. Zhang, M. Wang, L. Lin, X. Yang, J. Gao, and Y. Rui, "Saliency detection on light field: A multi-cue approach," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 13, no. 3, 2017.
[3] E. H. Adelson and J. Y. Wang, "Single lens stereo with a plenoptic camera," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 99–106, 1992.
[4] S. Wanner and B. Goldluecke, "Globally consistent depth labeling of 4D light fields," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 41–48.
[5] "Lytro, Inc.," https://www.lytro.com/.
[6] "Raytrix GmbH," https://raytrix.de/.
[7] M. Levoy and P. Hanrahan, "Light field rendering," in SIGGRAPH, 1996, pp. 31–42.
[8] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, "Light field photography with a hand-held plenoptic camera," Stanford University Computer Science, Technical Report, 2005.
[9] S. Wanner and B. Goldluecke, "Variational light field analysis for disparity estimation and super-resolution," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 606–619, 2014.
[10] T.-C. Wang, J.-Y. Zhu, and E. Hiroaki, "A 4D light-field dataset and CNN architectures for material recognition," in European Conference on Computer Vision, 2016.
[11] R. Raghavendra, K. B. Raja, and C. Busch, "Presentation attack detection for face recognition using light field camera," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1060–1075, 2015.
[12] A. Sepas-Moghaddam, M. A. Haque, P. L. Correia, K. Nasrollahi, T. B. Moeslund, and F. Pereira, "A double-deep spatio-angular learning framework for light field based face recognition," arXiv:1805.10078v2, 2018.
[13] M. W. Tao, P. P. Srinivasan, and S. Hadap, "Shape estimation from shading, defocus, and correspondence using light-field angular coherence," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 546–560, 2017.
[14] W. Williem, I. K. Park, and K. M. Lee, "Robust light field depth estimation using occlusion-noise aware data costs," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, doi: 10.1109/TPAMI.2017.2746858.
[15] C. Shin, H.-G. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim, "EPINET: A fully-convolutional neural network using epipolar geometry for depth from light field images," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[16] H. Schilling, M. Diebold, C. Rother, and B. Jähne, "Trust your model: Light field depth estimation with inline occlusion handling," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4530–4538.
[17] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. W. Tai, and I. S. Kweon, "Depth from a light field image with learning-based matching costs," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, doi: 10.1109/TPAMI.2018.2794979.
[18] Y. Yoon, H.-G. Jeon, and D. Yoo, "Learning a deep convolutional network for light-field image super-resolution," in IEEE International Conference on Computer Vision Workshop, 2015.
[19] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, "Light field reconstruction using convolutional network on EPI and extended applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1681–1694, 2019.
[20] J. Zhang, M. Wang, J. Gao, Y. Wang, X. Zhang, and X. Wu, "Saliency detection with a deeper investigation of light field," in International Joint Conference on Artificial Intelligence, 2015, pp. 2212–2218.
[21] N. Li, B. Sun, and J. Yu, "A weighted sparse coding framework for saliency detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[22] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, "Deep networks for saliency detection via local estimation and global search," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3183–3192.
[23] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[24] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[25] N. Liu, J. Han, and M.-H. Yang, "PiCANet: Learning pixel-wise contextual attention for saliency detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[26] J. Kuen, Z. Wang, and G. Wang, "Recurrent attentional networks for saliency detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[27] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[28] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in European Conference on Computer Vision, 2016, pp. 825–841.
[29] Y. Niu, Y. Geng, X. Li, and F. Liu, "Leveraging stereopsis for saliency analysis," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 454–461.
[30] V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, B. Masia, and G. Wetzstein, "Saliency in VR: How do people explore virtual environments?" IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 4, pp. 1633–1642, 2018.
[31] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-tuned salient region detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597–1604.
[32] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[33] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The secrets of salient object segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 280–287.
[34] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[35] D. G. Dansereau, O. Pizarro, and S. B. Williams, "Decoding, calibration and rectification for lenselet-based plenoptic cameras," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1027–1034.
1027–1034.
[36] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, "Saliency detection on light field," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[39] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[40] S. S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. V. Babu, "Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[41] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan, "LFNet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution," IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4274–4286, 2018.
[42] S. Zhang, Y. Lin, and H. Sheng, "Residual networks for light field image super-resolution," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11046–11055.
[43] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, "Learning-based view synthesis for light field cameras," ACM Transactions on Graphics, vol. 35, no. 6, p. 193, 2016.
[44] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan, "End-to-end view synthesis for light field imaging with pseudo 4DCNN," in European Conference on Computer Vision, 2018, pp. 333–348.
[45] S. Heber and T. Pock, "Convolutional networks for shape from light field," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3746–3754.
[46] S. Heber, W. Yu, and T. Pock, "Neural EPI-volume networks for shape from light field," in IEEE International Conference on Computer Vision, 2017.
[47] L. McMillan and G. Bishop, "Plenoptic modeling: An image-based rendering system," in Computer Graphics and Interactive Techniques, 1995, pp. 39–46.
[48] E. H. Adelson and J. R. Bergen, "The plenoptic function and the elements of early vision," in Computational Models of Visual Processing. The MIT Press, 1991.
[49] H. Tang and K. N. Kutulakos, "What does an aberrated photo tell us about the lens and the scene?" in IEEE International Conference on Computational Photography, 2013, pp. 1–10.
[50] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. S. Kweon, "Accurate depth map estimation from a lenslet light field camera," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1547–1555.
[51] S. Zhang, H. Sheng, D. Yang, J. Zhang, and Z. Xiong, "Micro-lens-based matching for scene recovery in lenslet cameras," IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1060–1075, 2018.
[52] "Lytro Power Tools," https://github.com/kmader/lytro-power-tools.
[53] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[54] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," arXiv:1412.7062v4, 2016.
[55] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[56] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," arXiv preprint arXiv:1802.02611, 2018.
[57] M. Strecke, A. Alperovich, and B. Goldluecke, "Accurate depth and normal maps from occlusion-aware focal stack symmetry," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2529–2537.
[58] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, "High performance imaging using large camera arrays," ACM Transactions on Graphics, vol. 24, no. 3, pp. 765–776, 2005.
[59] S. Wanner, S. Meister, and B. Goldluecke, "Datasets and benchmarks for densely sampled 4D light fields," Vision, Modeling, and Visualization, vol. 13, pp. 225–226, 2013.
[60] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[61] M. Everingham, S. A. Eslami, L. V. Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2014.
[62] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[63] R. Margolin, L. Zelnik-Manor, and A. Tal, "How to evaluate foreground maps," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[64] M. Levoy, Z. Zhang, and I. McDowall, "Recording and controlling the 4D light field in a microscope using microlens arrays," Journal of Microscopy, vol. 235, no. 2, pp. 144–162, 2009.
[65] M. L. Pendu, X. Jiang, and C. Guillemot, "Light field inpainting propagation via low rank matrix completion," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1981–1993, 2018.
[66] W.-C. Tu, S. He, Q. Yang, and S.-Y. Chien, "Real-time salient object detection with a minimum spanning tree," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2334–2342.
[67] H. Peng, B. Li, H. Ling, W. Hu, W. Xiong, and S. J. Maybank, "Salient object detection via structured matrix decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 818–832, 2017.
[68] X. Huang and Y.-J. Zhang, "300-fps salient object detection via minimum directional contrast," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4243–4254, 2017.
[69] X. Huang and Y. Zhang, "Water flow driven salient object detection at 180 fps," Pattern Recognition, vol. 76, pp. 95–107, 2018.
[70] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, "Amulet: Aggregating multi-level convolutional features for salient object detection," in IEEE International Conference on Computer Vision, 2017, pp. 202–211.
[71] P. Zhang, W. Liu, H. Lu, and C.
Shen, "Salient object detection by lossless feature reflection," in International Joint Conference on Artificial Intelligence, 2018, pp. 1149–1155.
[72] P. Zhang, W. Liu, Y. Lei, and H. Lu, "Hyperfusion-Net: Hyper-densely reflective feature fusion for salient object detection," Pattern Recognition, vol. 93, pp. 521–533, 2019.