Concise and Effective Network for 3D Human Modeling from Orthogonal Silhouettes

Bin Liu 1,2, Xiuping Liu 2, Zhixin Yang 3, Charlie C.L. Wang 4*

1 School of Mathematics and Information Science, Nanchang Hangkong University, Nanchang, China
2 School of Mathematical Sciences, Dalian University of Technology, Dalian, China
3 State Key Laboratory of Internet of Things for Smart City / Dept. of Electromechanical Engineering, University of Macau, Macau, China
4 Department of Mechanical, Aerospace and Civil Engineering, University of Manchester, United Kingdom

Abstract: In this paper, we revisit the problem of 3D human modeling from two orthogonal silhouettes of individuals (i.e., front and side views). Different from our prior work [1], a supervised learning approach based on convolutional neural networks (CNNs) is investigated to solve the problem by establishing a mapping function that can effectively extract features from two silhouettes and fuse them into coefficients in the shape space of human bodies. A new CNN structure is proposed in our work to extract not only the discriminative features of the front and side views but also their mixed features for the mapping function. 3D human models with high accuracy are synthesized from coefficients generated by the mapping function. Existing CNN approaches for 3D human modeling usually learn a large number of parameters (from 8.5M to 355.4M) from two binary images. Differently, we investigate a new network architecture and take the sample points on silhouettes as input. As a consequence, more accurate models can be generated by our network with only 2.4M coefficients. The training of our network is conducted on samples obtained by augmenting a publicly accessible dataset. Transfer learning using datasets with a smaller number of scanned models is applied to our network to enable the generation of results with gender-oriented (or geographical) patterns.
1 Introduction

In recent years, with the growth of demand in applications such as virtual try-on, customized design and body health monitoring, simple and effective human shape estimation techniques have attracted more and more attention. Many approaches have been developed to generate 3D human models as mesh surfaces from photos, as this is nowadays the easiest way to capture information using smart phones. Considering the factor of a user-friendly interface, two images from two orthogonal views (i.e., the front and side views as conducted in [1]) become an optimal solution if not the best.

* Corresponding Author; Email: changling.wang@manchester.ac.uk

With the help of a statistical model such as SCAPE [2] that can capture the variation of the human body's shape space, efforts have been made to reconstruct 3D human models from silhouettes by minimizing the error between input silhouette images and the projected profiles of a parameterized 3D body shape [3, 4, 5]. These methods in general need a complex matching algorithm to fit poses and shapes, which may result in a prohibitive complexity that prevents their usage in real-time applications.

To solve this problem, some researchers considered establishing the mapping between silhouettes and 3D models in a parametric representation of their shape space. They attempt to learn the mapping from handcrafted features by integrating linear [6, 7] or non-linear [8, 9] projection. However, these approaches based on handcrafted features are not robust enough for input from real applications. Different from handcrafted features, convolutional neural networks (CNNs) provide an end-to-end process that automatically extracts features from a training dataset, in which the application-oriented knowledge is embedded. Recently, researchers have made significant progress on CNN-based 3D human shape estimation from silhouette images [10, 11, 12].
They use CNNs to learn the non-linear mapping function between silhouette images and the shape space of human bodies. The input silhouettes represent the region of the human body and the background as binary images, which contain a lot of redundant information. Directly learning features from binary images is less efficient, as the meaningful information, i.e. the silhouettes, is inherently sparse [13]. To achieve an estimation with high accuracy, a large number of parameters must be trained in a time-consuming learning process, which also results in large memory consumption. Here we argue that points on the silhouettes alone are good enough to extract effective features for establishing the mapping from orthogonal silhouettes to the shape space of human bodies in a concise network. A summary comparing the pros and cons of existing methods is given in Table 1.

Table 1. Comparison of existing methods.

    Methods                                  Pros and cons
    Shape matching algorithms [3, 4, 5]      Concise but time-consuming & not robust
    Handcrafted features [6, 7, 8, 9]        Efficient but not robust
    CNN deep-learning [10, 11, 12]           Robust but with high memory consumption

To overcome all existing problems, we propose a novel CNN that is concise and effective. The network is trained on an augmented set of a 3D human database, which was released by the authors of [14]. First, we construct a parameterized representation of the shape space of 3D human bodies by applying principal component analysis (PCA) to models normalized by their height. After that, each 3D human model can be compactly represented by the coefficients of the first k most important principal components (PCs) instead of its original polygonal mesh. This significantly reduces the difficulty of learning an effective network. Then, the render-to-texture technique in OpenGL is used to generate the binary silhouette images from 3D models.
The silhouette of every 3D human model is sampled into a fixed number of points, the x- and y-coordinates of which are employed as input for our CNN-based learning approach. The output of our network, which serves as a non-linear mapping function, is the coefficients of the first k PCs for the corresponding human model. A novel structure of CNN is proposed in our work, which can extract both the discriminative features from the respective front and side silhouettes and the mixed features that fuse the information from the two views. To generate results with geographical (or gender-oriented) patterns, we apply transfer learning to the above trained network by using datasets with a smaller number of scanned models. In summary, our method shows the following merits:

• Effectiveness: Having two images taken from the front and side views as input, our network can generate 3D human models more accurately than other CNN-based methods [10, 12].

• Conciseness: The number of parameters used in our network is about 2.4M, which is much less than the two other CNN-based methods realizing the same function (i.e., 355.4M and 8.5M in [10] and [12] respectively).

Both merits benefit from the novel network structure proposed in our method. Moreover, a user study on a variety of individuals has been conducted to verify the robustness of our approach. As a common merit of end-to-end approaches, the computation for 3D human model estimation is very efficient, which allows it to be used in real-time applications. A mobile APP has been developed by using our network to produce customized clothes as an Industry 4.0 application.

The rest of our paper is organized as follows. After reviewing the related work in Section 2, we present the technical approach of our CNN-based 3D human model generation in Section 3. The details of implementation are given in Section 4. Experimental results, comparisons and the user study on a variety of individuals are presented in Section 5.
The application of customized design by using our approach will be presented in Section 6. Lastly, our paper ends with the conclusion section.

2 Related Work

Estimating 3D human shape from images is a popular research topic that has been widely studied for many years [3, 8, 15, 16, 17]. Providing a comprehensive survey of all past research is beyond the scope of this paper. We only review the most relevant approaches that inspired this work.

2.1 Human shape representation space

Building statistical models for human bodies has made significant progress in recent years. The pioneering work presented in [18] used PCA to model the representation space of human body shapes without considering human poses. With the help of this PCA-based statistical model, they could reconstruct a complete human surface from range scans based on template fitting. To obtain a better statistical model, researchers have tried to learn the shape variation space and the pose variation space separately (e.g., the SCAPE [2] and SMPL [19] models widely employed in many applications). The variation of human bodies is usually encoded in terms of deformations on triangular meshes with the same connectivity, which can be obtained by either template fitting (e.g., [18, 2, 20]) or the more advanced cross-parameterization technique (e.g., [21, 22, 23]). Some researchers intend to use a representation space derived from intrinsic features on meshes. Freifeld and Black [24] adopted principal geodesic analysis [25] to capture shape variation. Dibra et al. [11] employed heat kernel signatures to obtain the shape representation space through a convolutional neural network. In our approach, we adopt a dataset of human models with a nearly unique pose to generate the shape space representation of human models by a PCA-based statistical model. The skeleton-based representation in SMPL [19] can be used to further enrich our work by adding more poses into the training dataset.
2.2 Shape estimation from images

Estimating from images is a direct and low-cost mode for reconstructing the shapes and poses of 3D human bodies, which has many applications [5, 26, 27]. Some early methods (e.g., [1, 28, 29]) use a template model to minimize shape approximation errors, which heavily depends on the positions of feature points on 2D contours or the correspondence relationship between a 2D contour and the projected silhouette of a 3D model. These methods need to solve a complex correspondence problem, and therefore are not robust in general. Following a similar strategy, Zhou et al. [5] presented an image retouching technique for realistic reshaping of human bodies in a single image. Zhu et al. [15] integrated both image-based and example-based modelling techniques to create human models for individual customers based on two orthogonal-view photos. Song et al. [30] selected critical vertices on 3D registered human bodies as landmarks. After that, they learned a regression function from these vertices to the human shape representation space defined by SCAPE [2]. A time-consuming error minimization process is involved in these approaches. Unlike our end-to-end approach, they are not efficient enough to be used in real-time applications.

Researchers have also developed techniques to estimate 3D human shape from a single or a few silhouettes in a standard posture by handcrafted features. A probabilistic framework representing the shape variation of human bodies was proposed by Chen et al. [31] to predict 3D human shape from a single silhouette. Boisvert et al. [9] used geometric and statistical priors to reconstruct the human shape from a frontal and a lateral silhouette. Differently, Alldieck et al. [32] employed a video to estimate the shape and texture by transforming the silhouette cones corresponding to dynamic human silhouettes.
The major problem of these approaches based on handcrafted features is that their time complexity is prohibitive for practical applications.

To resolve the efficiency problem, more and more research effort has been devoted to developing end-to-end approaches with the help of CNN techniques. Toward this trend, Dibra et al. [10] and Ji et al. [12] reconstructed 3D human shapes from two orthogonal silhouette images based on a modified AlexNet [33] and multiple dense blocks [34] respectively. The major difference between these two approaches is their training mode. Ji et al. [12] first learned two networks from the front and the side views separately, and then trained a new network from the features learned in the previously determined two networks. To be less influenced by the poses of human bodies, Dibra et al. [11] conducted multi-view learning on a quite complex cross-modal neural network, which cannot be trained simultaneously. A common problem of these methods [10, 11, 12] is that direct learning from binary images by convolution operators is inefficient, as the silhouette features are very sparsely represented in the binary images. More discussion about the learning efficiency influenced by the sparsity of information can be found in [13].

Recently, researchers have developed a variety of techniques for estimating the 3D human shape from a single RGB image with the help of CNNs (e.g., [35, 36, 37, 38, 39]). However, the accuracy of the resultant models generated by these approaches is in general worse than the results obtained from orthogonal silhouettes because of the absence of other views. Moreover, when photos with tight clothes are used as input, the privacy of a user is better protected by silhouette images.

2.3 Convolutional neural network

In many computer vision tasks such as image classification, recognition and segmentation, obtaining highly discriminative image descriptors plays an important role.
For this purpose, a wide range of neural networks have been proposed, such as AlexNet [33], VGGNet [40], GoogLeNet [41], U-net [42], ResNet [43] and DenseNet [34]. These prior researches have introduced some frequently used operations in CNN architectures, such as the convolution layer, pooling layer and activation layer. In addition, they also provided inspirational ideas on other aspects of network design. For example, the dropout layer is used to prevent over-fitting, batch normalization is utilized to accelerate the convergence of a network, and the identity block is proposed to enhance feature reuse. All these concepts inspire the architecture design of our network.

Fig. 1. In our CNN-based approach, 22 principal components (PCs) are employed to span the shape space of 3D human models (k = 22). The first seven PCs are displayed to visualize the shape variations.

As the input of our network is a well-organized set of points, approaches with similarly well-organized information such as TextCNN [44] and PointNet [45] also provide inspiration for the design of our network. TextCNN has been applied in sentence-level classification tasks and performs convolution on top of word vectors. PointNet was used to deal with point sets and it also provided an architecture for applications ranging from 3D shape classification to semantic segmentation. In their approaches, filters are first used to extract features from input datasets, followed by a maximal pooling operation over the feature map to capture the global descriptors. In the final steps, fully connected layers are employed to obtain the labels of the input. We use a similar routine to design the architecture of our network. However, to capture more local structural information of sample points on silhouettes, maximal pooling operations are only performed in the local region of each feature map in our approach.
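To illustrate why pooling is restricted to local regions, the toy sketch below (our own illustration, not from the paper's code) applies four MaxPool(3,3) stages to a length-648 feature map, shrinking it to (1/3)^4 of its size while retaining one representative activation per local region; a single whole-map pooling would keep only one value overall.

```python
import numpy as np

def max_pool(x, size=3, stride=3):
    """Regional max pooling: keep the maximum of every window of `size`."""
    assert len(x) % stride == 0
    return x.reshape(-1, size).max(axis=1)

x = np.arange(648, dtype=float)     # a toy length-648 feature map
for _ in range(4):                  # four MaxPool(3,3) stages
    x = max_pool(x)
# x now has length 648 / 3**4 = 8, one maximum per region of 81 inputs
whole_map = np.arange(648, dtype=float).max()   # whole-map pooling keeps 1 value
```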
3 Technical Approach

In this section, we first introduce the PCA-based shape representation that spans the space of 3D human models. Then, the method of data augmentation is presented to enrich the training dataset. After these steps of preparation, the structure and the loss function of our network are presented.

3.1 Shape space representation

Principal component analysis (PCA) has been widely used to construct statistical models for representing the shape variation of 3D human bodies [2, 18, 46, 47, 14], where all models have the same mesh connectivity. We adopt the strategy of normalizing the height of all models before PCA [48] to better capture the shape variation. In our approach, we represent the shape of each human body by a vector ϕ_s ∈ R^k as

    B_s = B(ϕ_s) = B̄ + Ω ϕ_s    (1)

where B_s ∈ R^{3N} denotes a body shape with N vertices, B̄ represents the average body shape of all models in the dataset, and Ω ∈ R^{3N×k} is a linear space formed by k principal component vectors. In our implementation, k is set as the minimum value for which the PCA-based statistical model can capture more than 97% of the original shape variation. The shape of each human model is then represented by the vector ϕ_s with k components, named the shape coefficients. This representation is much more compact than the original mesh representation with 3N coefficients for the N vertices. When the poses of all models in a dataset are the same, PCA mainly captures the shape variation of bodies and is not influenced by pose variation.

We use the dataset of 4,308 individuals released by the authors of [14] to build the space of human shape representation, where all models are in a neutral pose and are generated by fitting a template mesh to the CAESAR dataset [49]. In detail, we first select 4,306 models from this database; two are left out as they are incomplete and highly distorted.
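The PCA construction of Eq. (1) can be sketched as follows; the array layout, the helper names and the use of SVD are our assumptions for illustration, not the authors' code.

```python
import numpy as np

def build_shape_space(bodies, variance_ratio=0.97):
    """bodies: (H, 3N) height-normalized, vertex-aligned meshes (one row each).
    Returns the mean shape, the PC basis Omega, per-model coefficients, and k."""
    B_mean = bodies.mean(axis=0)                   # average body shape B-bar
    X = bodies - B_mean                            # centered data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)          # variance ratio per PC
    # smallest k whose PCs capture more than `variance_ratio` of the variance
    k = int(np.searchsorted(np.cumsum(explained), variance_ratio)) + 1
    Omega = Vt[:k].T                               # (3N, k), orthonormal columns
    phi = X @ Omega                                # shape coefficients per model
    return B_mean, Omega, phi, k

def reconstruct(B_mean, Omega, phi_s):
    """Eq. (1): B_s = B-bar + Omega @ phi_s."""
    return B_mean + Omega @ phi_s
```

A model is then stored as its k shape coefficients and recovered with `reconstruct`, instead of keeping all 3N vertex coordinates.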
Then, these models are split into two datasets with a ratio of 4:1, where the set with 3,444 models is used to build the training dataset (which will be further augmented below) and the set with 862 models is employed as the test dataset. The PCA model is built on the set with 3,444 models. An illustration of the first 7 principal components can be seen in Fig. 1.

3.2 Training data augmentation

In general, the set of training data should be large enough for machine learning and is expected to be uniformly distributed to cover the whole space, especially for the task of regression [50]. For this purpose, the training set is further augmented by inserting more samples according to a metric defined on the Euclidean distances between the vectors of shape coefficients. We aim at enlarging the dataset while improving its distribution to become more uniform.

During data augmentation, we use the K-nearest neighbor search to obtain the neighboring set N_s for each sample ϕ_s in the dataset. Here K = 11 is adopted in our experiments. We define the neighboring radius of a sample ϕ_s as

    r_s = (1/K) Σ_{ϕ_i ∈ N_s} ‖ϕ_i − ϕ_s‖.    (2)

The minimal neighboring radius r_min among all samples is calculated by r_min = min{r_s}. Then, the following refinement step is iteratively applied: for any pair of neighboring samples ϕ_s and ϕ_q, if ‖ϕ_s − ϕ_q‖ > 1.8 r_min, we insert a new sample in the middle as ϕ_p = (ϕ_s + ϕ_q)/2. The iteration is repeated two or three times. After refinement, we obtain an augmented dataset of human models with 8,941 samples (see Fig. 2).

Fig. 2. An illustration of the sample distribution on PC1 and PC2: (a) our initial dataset with 3,444 samples; (b) the augmented dataset with 8,941 samples. After applying our refinement-based data augmentation, the distribution has become more uniform and the newly added samples are all inside the convex hull of the original samples.
Sampling from the Gaussian distribution along each PC may generate models with unrealistic body shapes (see the three examples in (b), shown as red points) that are located outside the convex hull of the original data points.

Here we do not employ the method of sampling from a Gaussian distribution along each PC, which is used in other papers [10, 11, 12]. This is because simply generating a new sample from a Gaussian distribution may lead to unrealistic body shapes (see the samples shown in Fig. 2(b)). Our method of data augmentation can effectively prevent the generation of unrealistic samples.

3.3 Silhouette sampling

Now we introduce how to generate the silhouette images of a 3D human model and the sampling points on the silhouette contours. For each sample ϕ_s, we reconstruct its 3D mesh model by Eq. (1). Then, the 3D model is rendered in a window with 500 × 600 pixels as binary images along its front view and side view (see the leftmost part of Fig. 3).

Fig. 3. The overview of our CNN's architecture, which takes the two orthogonal silhouettes as input and outputs the estimated shape coefficients. It employs two types of 'Blocks'; each 'Block' contains 3 or 4 operations, and the details can be found in Fig. 4. '+' represents the additive operation between output layers of 'Blocks', which facilitates the information communication between 'Blocks' of different views. FC denotes a fully connected layer and the following number gives the number of hidden neurons. ϕ*_s represents the output of the network and it is supervised by the shape coefficients ϕ_s of the ground truth.

Fig. 4. The structures of BlockA(c1, c2) and BlockB(c1, c2), where c1 and c2 represent the input and output numbers of components in feature maps. BatchNorm means the batch normalization operation and ReLU is the activation function.

In order to simulate the real scenario of taking photos of individuals, we conduct perspective projection to render the models and also add a Gaussian perturbation to the position of the camera (i.e., setting the mean position at the center of the image and the standard deviation to 0.05m with the image height as 2m). The contour extraction technique [51] is applied to obtain the closed boundary from each silhouette image. Then, the contour is uniformly sampled into M ordered points, the coordinates of which are transformed back to the scale of unit body height. The origin of all sample points is located at their average position. Every list of ordered sample points starts from the highest point located at the top and center of the model's head. All points are sorted in anti-clockwise order. Moreover, a copy of the ending point and a copy of the starting point are inserted at the beginning and the end of the list respectively to solve the problem of convolution at the boundary of the input. The number of sample points has an influence on the accuracy of 3D shape estimation. How to choose the number of samples will be discussed later in Section 4.

3.4 Network architecture for learning

The front view and side view can adequately convey the main shape of a human body from different aspects. Their boundary points also encode information with local and global structures. A good network architecture is important for the effectiveness of learning a mapping function between contour sampling points and the shape coefficients. Convolutional neural networks (CNNs) are well known for their capability of automatic feature extraction and adaptability to highly nonlinear mappings. We propose a novel CNN architecture for the problem of 3D human shape estimation from orthogonal silhouettes. Our network will be trained to extract features from the local and global structures, which will be mapped to the shape coefficients for generating 3D human models from PCs.
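The sampling procedure of Section 3.3 might be sketched as follows, assuming the closed boundary polyline has already been extracted from the binary image and is oriented anti-clockwise; the arc-length resampling strategy and the function name are our own illustration.

```python
import numpy as np

def sample_contour(contour, M=648):
    """contour: (n, 2) closed polyline, anti-clockwise, last vertex joins the first.
    Returns M+2 points: M uniform samples plus one wrap-around copy at each end."""
    pts = np.asarray(contour, dtype=float)
    seg = np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1)  # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])   # arc length at each vertex
    targets = np.linspace(0.0, cum[-1], M, endpoint=False)
    idx = np.searchsorted(cum, targets, side='right') - 1
    t = (targets - cum[idx]) / seg[idx]             # position within each edge
    nxt = (idx + 1) % len(pts)
    samples = (1 - t)[:, None] * pts[idx] + t[:, None] * pts[nxt]
    samples -= samples.mean(axis=0)                 # origin at the average position
    start = int(np.argmax(samples[:, 1]))           # start at the highest point
    samples = np.roll(samples, -start, axis=0)
    # duplicate end points to handle convolution at the input boundary
    return np.vstack([samples[-1], samples, samples[0]])
```

The M interior rows give the x- and y-coordinates fed to the network; the two duplicated rows are the wrap-around padding described above.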
There are two existing CNN architectures for solving the problem of 3D human modeling from orthogonal silhouettes. Both use binary images as input.

• Dibra et al. [10] proposed an architecture similar to AlexNet [33] to learn the mapping function. They first stack the front-view and the right-view images to form a two-channel image considered as the input of their network. The network employed in this approach is very dense.

• Ji et al. [12] addressed the same problem by an architecture that contains three pipelines. Two of them are used to extract features from the front-view and the right-view silhouette images. The third one is used to concatenate the features learned from the two separated pipelines, to be further processed by fully connected layers.

Differently, we use sample points on silhouette contours as input. A novel network is proposed with the architecture illustrated in Figs. 3 and 4. Our network can extract discriminative features from the silhouette sample points of two orthogonal views and also fuse the features learned from different views after each block of learning. The additional blocks in the middle pipeline facilitate information communication between the two views and help to fuse the local structures of the human body in the two views. Moreover, different from [12], where three pipelines are trained one by one, the three pipelines in our architecture can be trained together in an end-to-end mode. This is easier for implementation. In order to extract local and global structure information from the sample points of silhouette contours, the following three strategies are employed in our network design, which facilitate feature propagation, feature reuse and feature aggregation.

• A convolution layer with size 3 and step 1 (Conv(3,1)) is applied to perceive local structures on the silhouette contour. Furthermore, multiple such layers are employed to enlarge the receptive field of the convolution.
• The layer of maximal pooling with size 3 and step 3 (MaxPool(3,3)) is adopted to aggregate local features into global features step by step. In our network architecture, the maximal pooling operation is applied 4 times to reduce the size of the feature maps to (1/3)^4 = 1/81 of their original size of 648 × 1.

• Information communication between the blocks of two views can help to detect more coherent local structures of human bodies. Note that the additive operation is employed in our fusion pipeline. The benefit is twofold. While it makes the features from different views fuse together, it also largely decreases the number of parameters compared to the commonly used concatenation operator for fusion.

All fully connected layers use the ReLU activation function except the final layer. In short, we intend to provide a concise and effective solution, which takes full advantage of the samples from two orthogonal silhouette contours.

Unlike TextCNN [44] and PointNet [45], we do not conduct a final maximal pooling operation applied to the whole feature map in our network. This is because such an operation would discard too much local information. It is very useful for a classification task but not for a regression task such as the one we propose in this paper. Therefore, we only embrace the maximal pooling operation in a region with size restriction, which can effectively extract and use the information of local structures. A comparison will be given in Section 4.4 to further verify the effectiveness of this choice (see Fig. 7).

3.5 Loss function

We now discuss the loss function used in training. For every 3D model in the training dataset represented as ϕ_s in the human shape space, with the predicted shape coefficients denoted ϕ*_s, the error metric for the shape difference between the original 3D model and the predicted 3D model can be defined as the sum of squared vertex-to-vertex distances.
The total error over all models is then used as the loss function:

    L = Σ_s ‖B(ϕ*_s) − B(ϕ_s)‖²₂    (3)

with B(·) defined in Eq. (1). The error for each model can then be re-written as

    ‖B(ϕ*_s) − B(ϕ_s)‖²₂ = ‖(B̄ + Ω ϕ*_s) − (B̄ + Ω ϕ_s)‖²₂
                         = ‖Ω (ϕ*_s − ϕ_s)‖²₂
                         = (ϕ*_s − ϕ_s)ᵀ ΩᵀΩ (ϕ*_s − ϕ_s).    (4)

When using orthonormal vectors to form the shape space Ω (i.e., the PCs obtained by PCA), we have ΩᵀΩ = I. As a consequence, the loss function can be simplified to

    L = Σ_s ‖ϕ*_s − ϕ_s‖²₂.    (5)

This is the loss function used in our training process. Compared to Eq. (3), evaluating the loss function defined in Eq. (5) saves a lot of memory and computing time.

4 Metrics Evaluation and Implementation Details

We have implemented the proposed network in PyTorch, and the source code is available to the public¹. Our network is trained with all parameters randomly initialized, and we train the network using the Adam optimization algorithm [52] running on a PC with an Intel(R) Core(TM) i7-8700 CPU @ 3.2GHz, 16GB of RAM and a GeForce GTX 2080 GPU. The number of epochs and the batch size are set as 500 and 128 respectively with a learning rate of 1.0 × 10⁻⁵, and we use the default values for all other parameters.

Fig. 5. Comparing the performance of our network when using different numbers of sampling points on silhouette contours. The CDF curves are generated on the training dataset to tune the number of sample points used for each silhouette.

In this section, we will discuss a few detailed decisions about our approach, including the number of sampling points on silhouette contours, the functionality of the fusion blocks and the usage of regional max pooling. To facilitate the discussion of these decisions, the metrics used for evaluating the performance of a network are first introduced below.

¹ https://github.com/liubindlut/SilhouettesbasedHSE
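The step from Eq. (3) to Eq. (5) relies only on Ω having orthonormal columns. A quick numerical check (with illustrative dimensions of our own choosing) confirms that the vertex-space error and the coefficient-space error coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim = 22, 3 * 500                       # illustrative: k PCs, N = 500 vertices
Omega, _ = np.linalg.qr(rng.normal(size=(dim, k)))   # orthonormal columns
phi = rng.normal(size=k)                   # ground-truth shape coefficients
phi_star = rng.normal(size=k)              # predicted shape coefficients
vertex_err = np.sum((Omega @ phi_star - Omega @ phi) ** 2)  # Eq. (3) per-model term
coeff_err = np.sum((phi_star - phi) ** 2)                   # Eq. (5) per-model term
```

Supervising the k coefficients directly therefore avoids materializing the 3N-dimensional vertex vectors during training.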
Table 2. Statistics for different numbers of contour points.

    # of Pnts.                       324      486      648      972
    # of Coefficients in Network     1.84M    2.14M    2.43M    3.02M
    Average Comp. Time (ms)          33.3     33.8     34.3     35.2

Fig. 6. The architecture of our network after removing the fusion blocks.

Fig. 7. Comparison of CDF curves generated on the training dataset by using different strategies of network: 1) with vs. without fusion blocks and 2) regional or whole-map max pooling.

4.1 Metrics for evaluation

We employ two shape-based metrics to provide a quantitative evaluation of the accuracy of the reconstructed 3D models. All samples in the test dataset have ground-truth surfaces represented as triangular meshes. From an estimated vector of shape coefficients ϕ_s, the estimated 3D shape can be obtained as a triangular mesh with the same connectivity. Therefore, we can use the maximal error E_max and the average error E_aver of vertex-to-vertex distances to measure the deviation between the reconstructed and the ground-truth surfaces. Specifically, these two metrics are defined as

    E_max = max_{i=1,...,N} ‖v_i − v*_i‖₂    (6)

and

    E_aver = (1/N) Σ_{i=1}^{N} ‖v_i − v*_i‖₂,    (7)

where v_i and v*_i are the positions of the i-th vertex on the ground-truth and the reconstructed meshes respectively, and

Fig. 8. Comparison of reconstruction results to illustrate the effectiveness of fusion blocks: (a) the sample points on the original silhouette and the ground-truth model, (b) the results generated by a network without fusion blocks (i.e., the architecture in Fig. 6) and (c) the results generated by our network with fusion blocks. It is easy to find from the color error maps given in (b) and (c) that the reconstruction errors can be significantly reduced after adding the fusion blocks. These two examples are from the training dataset.

Fig. 9. The distributions of geometric errors on test samples. The unit of errors is centimeters (cm).
The left shows the histogram of the frequency distribution, and the corresponding cumulative distribution functions (CDFs) are shown on the right.

N is the total number of vertices on the mesh. With the help of these two metrics, we can use the histogram of the frequency distribution and the cumulative distribution function (CDF) to measure the performance of each algorithm on a whole dataset. In addition, we can also make a rough estimation by the average values among the H samples in a dataset as

    Ē_max = (1/H) Σ_{j=1}^{H} E_max^j,    Ē_aver = (1/H) Σ_{j=1}^{H} E_aver^j    (8)

with E_max^j and E_aver^j being the corresponding errors of the j-th sample, and H being the number of samples in the dataset.

As all human models generated by the network have the same mesh connectivity, feature curves for measuring intrinsic metrics can be predefined on the mesh surface and mapped to newly generated human models by barycentric coordinates. Besides the above geometric metrics, we also follow the strategy of Dibra et al. [10] by evaluating intrinsic metrics on human bodies. Specifically, the errors on three girths (B: Chest, W: Waist and H: Hip) are evaluated in our experiments.

4.2 Number of contour points

Given silhouette images in a fixed resolution, the number of points employed to sample the silhouette contours will influence the capability to reconstruct an accurate 3D human model. Here we tune this parameter by studying the accuracy of reconstruction; i.e., 324, 486, 648 and 972 contour points are tried respectively. CDFs for both E_max and E_aver are generated as shown in Fig. 5. It is easy to find from the figure that the network trained with more contour points has more samples with errors less than a given threshold for both E_max and E_aver. In other words, the network trained from more contour points is more accurate. However, there is no free lunch: training with more contour points leads to higher memory cost and therefore also longer computing time in reconstruction.
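The geometric metrics of Eqs. (6) and (7) reduce to per-vertex Euclidean distances. A minimal sketch (our own helper, assuming both meshes share the same vertex order):

```python
import numpy as np

def vertex_errors(V_gt, V_rec):
    """V_gt, V_rec: (N, 3) vertex positions of meshes with identical connectivity.
    Returns (E_max, E_aver) of Eqs. (6) and (7)."""
    d = np.linalg.norm(V_gt - V_rec, axis=1)   # ||v_i - v_i*||_2 for each vertex
    return d.max(), d.mean()
```

Averaging these two values over the H test samples then gives Ē_max and Ē_aver of Eq. (8).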
Specifically, the resultant numbers of coefficients for storing a trained network and the average computing time for reconstructing a 3D human model are given in Table 2. We choose 648 contour points in practice by considering the balance between computational cost and accuracy.

4.3 With or without fusion blocks

Now we study the functionality of the blocks used in our network for fusing features between the front and side views. After removing all the blocks for fusion, our network degenerates into another architecture as shown in Fig. 6, which is similar to the network presented in [12]. The training dataset is then employed to check the effectiveness of these two networks. The curves of CDF are generated for both E_max and E_aver as shown in Fig. 7. It can be observed that the accuracy of reconstruction is significantly improved after adding the fusion blocks. The values of Ē_aver and Ē_max drop from 0.31 to 0.27 and from 1.44 to 1.15 respectively. The comparison of reconstruction with vs. without fusion blocks on two examples can also be found in Fig. 8, where the shape deviation errors are visualized as color maps. Based on the results obtained from these experiments, we believe that our architecture can serve as a basic network, from which different variants can be derived by modifying the structures of BlockA and BlockB according to the strategy of multi-modal fusion learning [53].

4.4 Regional or whole map pooling

After verifying the functionality of the fusion blocks, we study the effectiveness of using regional max pooling in our network design. As discussed above, when applying the max pooling operation to the whole feature map, the network discards too many local features which, however, are important to our human model reconstruction problem. We compare the results obtained by using regional max pooling (as MaxPool(3,3) in our BlockB design – see Fig. 4) with the results obtained by using the whole-map max pooling operation. From the curves of CDF given in Fig. 7, it is found that our design using regional pooling gives more accurate reconstruction.

Table 3. Quantitative comparisons for different methods on test samples (Unit: centimeter). The best method is highlighted.

           Dibra et al. [10]   Ji et al. [12]   Our Approach
Ē_aver          1.87               0.97             0.62
Ē_max           5.46               3.63             2.76

5 Results

The work presented in this paper focuses on reconstructing 3D human models from two orthogonal silhouette images by using deep learning techniques. In this section, we compare the results generated by our approach with the state-of-the-art methods [10, 12] in different aspects, including the number of parameters, the geometric accuracy of reconstruction and the errors on intrinsic measurements. For fair comparisons, we move the centers of the silhouettes in both views to the center of the binary images used as the input of their methods. Moreover, a user study is conducted on a variety of individuals to verify the performance and robustness of our approach.

5.1 Number of parameters

Dibra et al. [10] used the architecture of AlexNet [33] to estimate the shape parameters. Considering the limited memory of GPU, we choose the implementation introduced in their paper that takes two silhouette images as two channels. They tested silhouette images with a resolution of 192 × 264, and the number of their parameters is about 355.4M. Ji et al. [12] employed the structure of DenseNet [34] in their architecture, which results in a network with 8.5M parameters when using input images at a resolution of 64 × 64. Moreover, they also discussed the influence of image resolution in their paper and claimed that results with similar accuracy will be obtained even after doubling the resolution of input images. Differently, our network needs to store only 2.4M parameters.
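The parameter budgets quoted above follow from the standard counting formulas for convolutional and fully connected layers; the sketch below, with hypothetical layer sizes, shows how such counts are computed:

```python
def conv_params(c_in, c_out, k, bias=True):
    """Learnable parameters of one k x k 2D convolution layer."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def fc_params(n_in, n_out, bias=True):
    """Learnable parameters of one fully connected layer."""
    return n_out * (n_in + (1 if bias else 0))

# e.g., a 3x3 convolution mapping 64 channels to 128 channels:
n_conv = conv_params(64, 128, 3)     # 128 * (64*9 + 1) = 73,856
# and a fully connected layer mapping 1000 inputs to 100 outputs:
n_fc = fc_params(1000, 100)          # 100 * (1000 + 1) = 100,100
```

Fully connected layers on large feature maps dominate such counts, which is one reason a network fed with a few hundred contour points can stay far smaller than one fed with full-resolution images.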
In practical applications, the memory consumption of a network is a very important cost to control, as the AI engine is often operated on cloud servers. In terms of demanded storage, our approach needs only 10M while the other two methods need about 1.4G and 34M of storage space respectively. The superiority in accuracy of reconstruction is discussed below.

5.2 Accuracy of reconstruction

We now evaluate the accuracy of 3D human models reconstructed by our method and compare it to the state-of-the-art [10, 12]. Both histograms and color error maps are employed to visualize the errors.

Fig. 10. The illustration of the three worst reconstructions with the metric E_aver (left) and the metric E_max (right) among all the samples in the test dataset for different methods – (a) the results of Dibra et al. [10], (b) the results of Ji et al. [12] and (c) our results. For every example, a pair of models is displayed where the ground-truth models are shown in gray and color maps on the reconstructed models visualize the distribution of point-to-point distance errors in the unit of centimeter.

Fig. 11. The distributions of girth errors on test samples (Unit: centimeter).

Figure 9 shows the distribution of the shape errors E_aver and E_max on all samples in the test dataset. It is obvious that our method is superior to the other two methods. Specifically, when checking the CDF curve of E_aver, our network leads to 94% of test samples having an error less than 1.0 cm. Only 63% of test samples result in a reconstruction with E_aver less than 1.0 cm by using the approach of Ji et al. [12], and all human models reconstructed by the method of Dibra et al. [10] have an error E_aver larger than 1.0 cm. The maximal E_aver of our approach on the test samples is 3.4 cm, which is also much smaller than for the other two methods.
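The percentage figures above come from evaluating the empirical CDF of the per-sample errors at a given threshold; a minimal sketch of that evaluation, using made-up error values rather than our measured ones:

```python
import numpy as np

def error_stats(errors, threshold):
    """Fraction of samples with error below `threshold` (one point of the
    empirical CDF) and the dataset average as in Eq. (8)."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors < threshold)), float(errors.mean())

# Hypothetical per-sample E_aver values in centimeters:
e_aver = [0.4, 0.6, 0.8, 1.2, 0.5]
frac, mean = error_stats(e_aver, threshold=1.0)   # 4 of 5 samples below 1.0 cm
```

Sweeping the threshold over a range of values and plotting `frac` against it produces the CDF curves compared in the figures.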
In terms of the CDF curves of E_max, 84% of test samples reach a maximal error less than 4.0 cm by using our method. However, the other two methods only have 70% and 18% of samples reaching this level of accuracy. The histograms of frequency distribution for the E_aver and E_max errors shown on the left of Fig. 9 also lead to the conclusion that most of our reconstructed models fall in the region with smaller errors.

Table 4. Quantitative comparisons on three semantic curves with statistical errors evaluated on all 862 models in the test dataset. The mean and the standard deviation of absolute errors (Unit: centimeter) are denoted as µ and σ. The best results are highlighted in bold.

(µ, σ)   Dibra et al. [10]   Ji et al. [12]   Our Approach
Hip      (4.09, 3.86)        (1.44, 1.34)     (0.67, 0.66)
Waist    (6.26, 4.21)        (1.66, 1.70)     (0.76, 0.81)
Chest    (4.99, 4.03)        (1.47, 1.40)     (0.81, 0.69)

The average errors are also listed and compared in Table 3. Besides the statistical results, we also show the accuracy of reconstruction on the individual models that have the largest average error E_aver and the largest maximal error E_max respectively. The three worst cases are displayed in Fig. 10 by using color error maps. The results from [10, 12] and ours are shown with the same range of errors (Unit: centimeter). The color distribution changing from blue to red denotes the variation of geometric error from zero to maximum. Again, our approach performs the best.

In addition, we also evaluate the girth errors (B: Chest, W: Waist and H: Hip) on all models in our test dataset. Our results are compared to the other two approaches [10, 12]. The distributions of girth errors are given in Fig. 11. The error statistics are reported in Table 4. It can be observed that our method performs the best. The errors of our results on these three girths are always less than one centimeter.

Table 5.
Error estimation between the predicted feature curves and the real ones in the corresponding test datasets. The mean absolute error and the standard deviation (µ, σ) are given (Unit: centimeter).

Error (µ, σ)   Hip            Waist          Chest
Female         (2.02, 1.58)   (1.48, 1.21)   (2.01, 1.60)
Male           (2.52, 2.11)   (3.83, 2.38)   (2.48, 2.32)

6 Application

In this section, we discuss practical issues in using our network of human shape modeling in an Industry 4.0 application – design automation for customized clothes. First, a transfer learning approach is applied to solve the problem of datasets with a small number of 3D human models. After that, we demonstrate a smartphone APP that is based on our network to reconstruct 3D human models for generating customized clothes.

6.1 Transfer learning

By using the large dataset of human models released in [14], we are able to train a regression model for estimating the 3D shape of a human model with high accuracy. However, this dataset is not able to reflect gender-oriented (or geographical) shape patterns. Ideally, different datasets would be needed to train networks for different genders or races. As shown in Fig. 12(c), the human model reconstructed by the network trained on a database with mixed genders (i.e., [14]) does not show a strong gender pattern.

In order to avoid collecting a large dataset of 3D human models, we apply the strategy of transfer learning [54] to obtain gender-oriented networks from datasets with a limited number of samples. Two registered small datasets with 31 female models and 40 male models respectively, released by Hasler et al. [55], are used for the transfer learning. Specifically, we fix the parameters of all BlockA and BlockB blocks in our network, and only update the parameters in the fully connected layers (see the yellow FC layers in Fig. 3) by using these small datasets.
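The freezing scheme just described follows the common fine-tuning pattern of disabling gradients on feature-extraction parameters while keeping the head trainable. The sketch below mimics that pattern with a stand-in `Param` class and hypothetical parameter names rather than real framework tensors:

```python
class Param:
    """Stand-in for a framework tensor with a requires_grad flag."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True

def freeze_feature_blocks(params, trainable_prefixes=("fc",)):
    """Disable gradients on every parameter except those whose name starts
    with a trainable prefix; return the parameters kept for the optimizer."""
    kept = []
    for p in params:
        if p.name.startswith(tuple(trainable_prefixes)):
            kept.append(p)
        else:
            p.requires_grad = False
    return kept

# Hypothetical parameter names for the feature blocks and the FC layers:
params = [Param("blockA.conv1.weight"), Param("blockB.conv2.weight"),
          Param("fc1.weight"), Param("fc2.bias")]
kept = freeze_feature_blocks(params)
```

Only the parameters in `kept` would be handed to the optimizer, so the small dataset updates the FC layers while the learned feature extractors stay fixed.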
Again, each training sample is still represented by two silhouette images and the corresponding shape coefficients ϕ_s. We still follow the strategy introduced in Section 3 to generate our new training and test datasets by using 1.2 r_min as the threshold, K = 11 for the neighborhood search and uniformly inserting three samples between two neighboring samples, which results in two training databases with 425 models for female and 472 models for male. Table 5 shows the errors of the feature curves between the predicted results and the real ones. We can find that most mean absolute errors are less than three centimeters. As a byproduct of transfer learning, the human models employed can have mesh connectivity different from the samples used for training the original network.

We apply the real photos obtained in the user study (i.e., Fig. 1 of [56]) to the new network obtained from transfer learning. The measurement errors in three girths are evaluated and compared with the results generated by the network without transfer learning (see Table 6).

Table 6. Comparison of results generated by the networks before vs. after transfer learning by using the real photos of the user study [56]. The mean absolute error and the standard deviation (µ, σ) are given (Unit: centimeter).

Error (µ, σ)             Hip            Waist          Chest
Original network         (3.16, 2.45)   (5.83, 4.93)   (4.13, 2.41)
With transfer learning   (2.76, 2.74)   (3.72, 2.84)   (2.30, 2.47)

Fig. 12. Reconstruction results by the network before vs. after transfer learning – examples are obtained from the real photos of the user study: (a) the original images with background removed (ground truth of chest, waist and hip girths are given), (b) the sampled silhouette points, (c) results obtained by using the original network (i.e., before transfer learning), and (d) results obtained by the network after transfer learning. From these results, we can find that the network updated by transfer learning captures more female/male features – i.e., gender-oriented patterns are captured more clearly.

Besides these statistical analyses, examples of reconstruction are shown in Fig. 12. As can be found from Fig. 12(d), the gender-oriented patterns and measurement accuracy have been significantly enhanced after applying transfer learning to our network with the help of a small gender-separated training dataset.

6.2 Customized design

Based on the network proposed in this paper, we have developed a smartphone APP to reconstruct 3D human models from photos of individuals – see the scenario demonstrated in Fig. 13 and the APP interface shown in Fig. 14.

Fig. 13. The scenario of using our smartphone APP for 3D human model reconstruction.

Fig. 14. Interface of our smartphone APP.

Background is removed by using a function that is built into the smartphone. Note that a strategy similar to [57] is employed to supervise the process of taking photos for the side view of a human model (i.e., by checking the completeness of the leg overlap). A video demonstration of this APP can be found at: https://youtu.be/JEPAmiB0wYI. By using barycentric coordinates, the feature curves predefined on a human model with standard shape can be automatically generated on the reconstructed models for measurement purposes. Moreover, perfectly fitting clothes for individual customers can be automatically generated by using the design transfer technology² [20, 58]. Figure 15 demonstrates the design transfer result on a female model reconstructed from two silhouette images.

7 Conclusions and Future Work

In this paper, we have developed a novel architecture of CNN for modeling 3D human bodies from two silhouette images.
Our network is concise and efficient, which benefits from developing an architecture with fusion blocks and from taking samples on the silhouettes as input. More accurate models can be generated by our network with only 2.4M coefficients.

² An implementation of this design transfer on 3D human models can be accessed at: https://zishun.github.io/projects/3DHBGen/.

Fig. 15. The dress originally designed for a standard model (a) can be automatically transferred to a customized shape perfectly fitting the 3D human body generated from our network (b).

The learning of our network is first conducted on samples obtained by augmenting a publicly accessible dataset, and the transfer learning method is applied to make it usable when only a dataset with a small number of human models is available. Experimental tests have been conducted to prove the effectiveness of our network.

The major limitation of our current work is that the input sample points should capture the shape of the human silhouette accurately. This requirement is too strong in some scenarios of practical usage – e.g., the long hair of female users may lead to an incorrect shape in the region of the neck. In addition, we argue that integrating 3D information into our network, or modifying the structures of BlockA and BlockB, may further improve the accuracy of shape prediction, which will be explored in our future work. Lastly, the input side view should have an orientation consistent with the training samples. We plan to eliminate this requirement in our future development (i.e., make the network automatically adaptive to either left or right views as input). Moreover, as our method performs best in neutral poses, we plan to enhance the smartphone APP by providing a pose similarity guidance function based on skeleton extraction.

Acknowledgement

Part of this work was completed when B. Liu and C.C.L. Wang worked at the Chinese University of Hong Kong.
This work is partially supported by the HKSAR Innovation and Technology Commission (ITC) Innovation and Technology Fund (Project Ref. No.: ITT/032/18GP), the Natural Science Foundation of China (61976040, 61702079, 61762064), and the Science and Technology Development Fund of Macau SAR (File No.: SKL-IOTSC-2018-2020, 0018/2019/AKP and 0008/2019/AGJ). The authors would also like to thank Zishun Liu for the valuable comments given in private communications.

References

[1] C. C. Wang, Y. Wang, T. K. Chang, M. M. Yuen, Virtual human modeling from photographs for garment industry, Computer-Aided Design 35 (6) (2003) 577–589.
[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, J. Davis, SCAPE: Shape completion and animation of people, ACM Trans. Graph. 24 (3) (2005) 408–416.
[3] A. O. Balan, L. Sigal, M. J. Black, J. E. Davis, H. W. Haussecker, Detailed human shape and pose from images, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.
[4] A. O. Bălan, M. J. Black, The naked truth: Estimating body shape under clothing, in: European Conference on Computer Vision, Springer, 2008, pp. 15–29.
[5] S. Zhou, H. Fu, L. Liu, D. Cohen-Or, X. Han, Parametric reshaping of human bodies in images, ACM Trans. Graph. 29 (4) (2010) 126:1–126:10.
[6] P. Xi, W.-S. Lee, C. Shu, A data-driven approach to human-body cloning using a segmented body database, in: 15th Pacific Conference on Computer Graphics and Applications, IEEE, 2007, pp. 139–147.
[7] E. Dibra, C. Öztireli, R. Ziegler, M. Gross, Shape from selfies: Human body shape estimation using CCA regression forests, in: European Conference on Computer Vision, Springer, 2016, pp. 88–104.
[8] Y. Chen, R. Cipolla, Learning shape priors for single view reconstruction, in: 2009 IEEE 12th International Conference on Computer Vision Workshops, IEEE, 2009, pp. 1425–1432.
[9] J. Boisvert, C. Shu, S. Wuhrer, P. Xi, Three-dimensional human shape inference from silhouettes: reconstruction and validation, Machine Vision and Applications 24 (1) (2013) 145–157.
[10] E. Dibra, H. Jain, C. Öztireli, R. Ziegler, M. Gross, HS-Nets: Estimating human body shape from silhouettes with convolutional neural networks, in: 2016 Fourth International Conference on 3D Vision, IEEE, 2016, pp. 108–117.
[11] E. Dibra, H. Jain, C. Öztireli, R. Ziegler, M. Gross, Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4826–4836.
[12] Z. Ji, X. Qi, Y. Wang, G. Xu, P. Du, X. Wu, Q. Wu, Human body shape reconstruction from binary silhouette images, Computer Aided Geometric Design.
[13] B. Graham, L. van der Maaten, Submanifold sparse convolutional networks, CoRR abs/1706.01307.
[14] L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, B. Schiele, Building statistical shape spaces for 3D human modeling, Pattern Recognition 67 (2017) 276–286.
[15] S. Zhu, P. Mok, Predicting realistic and precise human body models under clothing based on orthogonal-view photos, Procedia Manufacturing 3 (2015) 3812–3819.
[16] A. Kanazawa, M. J. Black, D. W. Jacobs, J. Malik, End-to-end recovery of human shape and pose, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7122–7131.
[17] G. Pavlakos, L. Zhu, X. Zhou, K. Daniilidis, Learning to estimate 3D human pose and shape from a single color image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 459–468.
[18] B. Allen, B. Curless, Z. Popović, The space of human body shapes: Reconstruction and parameterization from range scans, ACM Trans. Graph. 22 (3) (2003) 587–594.
[19] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, M. J. Black, SMPL: A skinned multi-person linear model, ACM Transactions on Graphics 34 (6) (2015) 248.
[20] C. C. Wang, K.-C. Hui, K. Tong, Volume parameterization for design automation of customized free-form products, IEEE Transactions on Automation Science and Engineering 4 (1) (2007) 11–21.
[21] V. Kraevoy, A. Sheffer, Cross-parameterization and compatible remeshing of 3D models, ACM Transactions on Graphics 23 (3) (2004) 861–869.
[22] T.-H. Kwok, Y. Zhang, C. C. Wang, Efficient optimization of common base domains for cross-parameterization, IEEE Transactions on Visualization and Computer Graphics 18 (10) (2012) 1678–1692.
[23] T.-H. Kwok, Y. Zhang, C. C. Wang, Constructing common base domains by cues from Voronoi diagram, Graphical Models 74 (4) (2012) 152–163.
[24] O. Freifeld, M. J. Black, Lie bodies: A manifold representation of 3D human shape, in: European Conference on Computer Vision, Springer, 2012, pp. 1–14.
[25] P. T. Fletcher, C. Lu, S. Joshi, Statistics of shape via principal geodesic analysis on Lie groups, in: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, IEEE, 2003, pp. I–I.
[26] L. Rogge, F. Klose, M. Stengel, M. Eisemann, M. Magnor, Garment replacement in monocular video sequences, ACM Transactions on Graphics 34 (1) (2014) 6.
[27] A. Neophytou, A. Hilton, A layered model of human body and garment deformation, in: 2014 2nd International Conference on 3D Vision, Vol. 1, IEEE, 2014, pp. 171–178.
[28] A. Hilton, D. Beresford, T. Gentils, R. Smith, W. Sun, J. Illingworth, Whole-body modelling of people from multiview images to populate virtual worlds, The Visual Computer 16 (7) (2000) 411–436.
[29] W. Lee, J. Gu, N. Magnenat-Thalmann, Generating animatable 3D virtual humans from photographs, in: Computer Graphics Forum, Vol. 19, Wiley Online Library, 2000, pp. 1–10.
[30] D. Song, R. Tong, J. Chang, X. Yang, M. Tang, J. J. Zhang, 3D body shapes estimation from dressed-human silhouettes, in: Computer Graphics Forum, Vol. 35, Wiley Online Library, 2016, pp. 147–156.
[31] Y. Chen, T.-K. Kim, R. Cipolla, Inferring 3D shapes and deformations from single views, in: European Conference on Computer Vision, Springer, 2010, pp. 300–313.
[32] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, G. Pons-Moll, Video based reconstruction of 3D people models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8387–8397.
[33] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[34] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[35] G. Pavlakos, L. Zhu, X. Zhou, K. Daniilidis, Learning to estimate 3D human pose and shape from a single color image.
[36] A. Kanazawa, M. J. Black, D. W. Jacobs, J. Malik, End-to-end recovery of human shape and pose, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[37] Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. LeGendre, L. Luo, C. Ma, H. Li, Deep volumetric video from very sparse multi-view performance capture, in: Proceedings of the 15th European Conference on Computer Vision (ECCV), 2018, pp. 336–354.
[38] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, H. Li, PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[39] Z. Zheng, T. Yu, Y. Wei, Q. Dai, Y. Liu, DeepHuman: 3D human reconstruction from a single image, in: The IEEE International Conference on Computer Vision (ICCV), 2019.
[40] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint.
[41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[42] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[43] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[44] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882.
[45] C. R. Qi, H. Su, K. Mo, L. J. Guibas, PointNet: Deep learning on point sets for 3D classification and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[46] C.-H. Chu, Y.-T. Tsai, C. C. Wang, T.-H. Kwok, Exemplar-based statistical model for semantic parametric design of human body, Computers in Industry 61 (6) (2010) 541–549.
[47] C. C. Wang, Geometric Modeling and Reasoning of Human-Centered Freeform Products, Springer Science & Business Media, 2012.
[48] T.-H. Kwok, K.-Y. Yeung, C. C. Wang, Volumetric template fitting for human body reconstruction from incomplete data, Journal of Manufacturing Systems 33 (4) (2014) 678–689.
[49] K. M. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming, Civilian American and European Surface Anthropometry Resource (CAESAR), final report, Volume 1: Summary, Tech. rep. (2002).
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[51] S. Suzuki, et al., Topological structural analysis of digitized binary images by border following, Computer Vision, Graphics, and Image Processing 30 (1) (1985) 32–46.
[52] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[53] J. Gao, P. Li, Z. Chen, J. Zhang, A survey on deep learning for multimodal data fusion, Neural Computation 32 (5) (2020) 829–864.
[54] L. Torrey, J. Shavlik, Transfer learning, in: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global, 2010, pp. 242–264.
[55] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, H.-P. Seidel, A statistical model of human pose and body shape, in: Computer Graphics Forum, Vol. 28, Wiley Online Library, 2009, pp. 337–346.
[56] B. Liu, X. Liu, Z. Yang, C. C. L. Wang, User study: Concise network for 3D human modeling from orthogonal silhouettes, https://mewangcl.github.io/pubs/imageHuman3DUserStudy.pdf (2020).
[57] C. C. Wang, T. K. Chang, M. M. Yuen, From laser-scanned data to feature human model: a system based on fuzzy logic concept, Computer-Aided Design 35 (3) (2003) 241–253.
[58] Y. Meng, C. C. Wang, X. Jin, Flexible shape control for automatic resizing of apparel products, Computer-Aided Design 44 (1) (2012) 68–76.