Low-latency Cloud-based Volumetric Video Streaming Using Head Motion Prediction



Serhan Gül (Fraunhofer HHI, Berlin, Germany), serhan.guel@hhi.fraunhofer.de
Dimitri Podborski (Fraunhofer HHI, Berlin, Germany), dimitri.podborski@hhi.fraunhofer.de
Thomas Buchholz (Deutsche Telekom AG, Berlin, Germany), thomas.buchholz@telekom.de
Thomas Schierl (Fraunhofer HHI, Berlin, Germany), thomas.schierl@hhi.fraunhofer.de
Cornelius Hellge (Fraunhofer HHI, Berlin, Germany), cornelius.hellge@hhi.fraunhofer.de

ABSTRACT
Volumetric video is an emerging key technology for immersive representation of 3D spaces and objects. Rendering volumetric video requires considerable computational power, which is challenging especially for mobile devices. To mitigate this, we developed a streaming system that renders a 2D view from the volumetric video at a cloud server and streams a 2D video stream to the client. However, such network-based processing increases the motion-to-photon (M2P) latency due to the additional network and processing delays. To compensate for the added latency, prediction of the future user pose is necessary. We developed a head motion prediction model and investigated its potential to reduce the M2P latency for different look-ahead times. Our results show that the presented model reduces the rendering errors caused by the M2P latency compared to a baseline system in which no prediction is performed.

CCS CONCEPTS
• Information systems → Multimedia streaming; • Human-centered computing → Ubiquitous and mobile computing systems and tools; • Networks → Cloud computing.

KEYWORDS
volumetric video, augmented reality, mixed reality, cloud streaming, head motion prediction

1 INTRODUCTION
Recent advances in hardware for displaying immersive media have aroused huge market interest in virtual reality (VR) and augmented reality (AR) applications.
Although the initial interest was focused on omnidirectional (360°) video applications, with the improvements in capture and processing technologies, volumetric video has recently started to become the center of attention [21]. Volumetric videos capture the 3D space and objects and enable services with six degrees of freedom (6DoF), allowing a viewer to freely change both position in space and orientation.

Although the computing power of mobile end devices has dramatically increased in recent years, rendering rich volumetric objects is still a very demanding task for such devices. Moreover, there are as yet no efficient hardware decoders for volumetric content (e.g. point clouds or meshes), and software decoding can be prohibitively expensive in terms of battery usage and real-time rendering requirements. One way of decreasing the processing load on the client is to avoid sending the volumetric content and instead send a 2D rendered view corresponding to the position and orientation of the user. To achieve this, the expensive rendering process needs to be offloaded to a server infrastructure. Rendering 3D graphics on a powerful device and displaying the results on a thin client connected through a network is known as remote (or interactive) rendering [26]. Such rendering servers can be deployed on a cloud computing platform such that resources can be flexibly allocated and scaled up when more processing load is present.

In a cloud-based rendering system, the server renders the 3D graphics based on the user input (e.g. head pose) and encodes the rendering result into a 2D video stream. Depending on the user interaction, the camera pose of the rendered video is dynamically adapted by the cloud server. After a matching view has been rendered and encoded, the obtained video stream is transmitted to the client. The client can efficiently decode the video using its hardware video decoders and display the video stream.
Moreover, network bandwidth requirements are reduced by avoiding the transmission of the volumetric content.

Despite these advantages, one major drawback of cloud-based rendering is an increase in the end-to-end latency of the system, typically known as motion-to-photon (M2P) latency. Due to the added network latency and processing delays (rendering and encoding), the amount of time until an updated image is presented to the user is greater than in a local rendering system. It is well known that an increase in M2P latency may cause an unpleasant user experience and motion sickness [1, 3]. One way to reduce the network latency is to move the volumetric content to an edge server geographically closer to the user. Deployment of real-time communication protocols such as WebRTC is also necessary for ultra-low latency video streaming applications [10]. The processing latency at the rendering server is another significant latency component. Therefore, using fast hardware-based video encoders is critical for reducing the encoding latency.

Another way of reducing the M2P latency is to predict the future user pose at the remote server and send the corresponding rendered view to the client. Thus, it is possible to reduce or even completely eliminate the M2P latency if the user pose is predicted for a look-ahead time (LAT) equal to or larger than the M2P latency of the system [6]. However, mispredictions of head motion may potentially degrade the user's Quality of Experience (QoE). Thus, the design of accurate prediction algorithms has been a popular research area, especially for viewport prediction for 360° videos (see Section 2.3). However, the application of such algorithms to 6DoF movement (i.e. translational and rotational) has not yet been investigated.
In this paper, we describe our cloud-based volumetric streaming system, present a prediction model to forecast the 6DoF position of the user, and investigate the achieved rendering accuracy using the developed prediction model. Additionally, we present an analysis of the latency contributors in our system and a simple latency measurement technique that we used to characterize the M2P latency of our system.

2 BACKGROUND
2.1 Volumetric video streaming
Some recent works present initial frameworks for streaming of volumetric videos. Qian et al. [18] developed a proof-of-concept point cloud streaming system and introduced optimizations to reduce the M2P latency. Van der Hooft et al. [29] proposed an adaptive streaming framework compliant with the recent point cloud compression standard MPEG V-PCC [22]. They used their framework for HTTP adaptive streaming of scenes with multiple dynamic point cloud objects and presented a rate adaptation algorithm that considers the user's position and focus. Petrangeli et al. [17] proposed a streaming framework for AR applications that dynamically decides which virtual objects should be fetched from the server, as well as their levels of detail (LODs), depending on the proximity of the user and the likelihood of the user viewing the object.

2.2 Cloud rendering systems
The concept of remote rendering was first put forward to facilitate the processing of 3D graphics rendering when PCs did not have sufficient computational power for intensive graphics tasks. A detailed survey of interactive remote rendering systems in the literature is presented in [26]. Shi et al. [25] proposed a Mobile Edge Computing (MEC) system to stream AR scenes containing only the user's field of view (FoV) and a latency-adaptive margin around the FoV. They evaluated the performance of their prototype on a MEC node connected to a 4G (LTE) testbed. Mangiante et al.
[15] proposed an edge computing framework that performs FoV rendering of 360° videos. Their system aims to optimize the required bandwidth as well as reduce the processing requirements and battery utilization. Cloud rendering has also started to receive increasing interest from the industry, especially for cloud gaming services. Nvidia CloudXR [16] provides an SDK to run computationally intensive extended reality (XR) applications on Nvidia cloud servers and deliver advanced graphics performance to thin clients.

2.3 Head motion prediction techniques
Several sensor-based methods have been proposed in the literature that attempt to predict the user's future viewport for optimized streaming of 360° videos. Those can be divided into two categories. Works such as [4, 6, 7, 13, 20] were specifically designed for VR applications and use the sensor data from head-mounted displays (HMDs), whereas the works in [5, 8, 12] attempt to infer user motion based on physiological data such as electroencephalogram (EEG) and electromyography (EMG) signals. Bao et al. [6] collected head orientation data and exploited the correlations in different dimensions to predict the head motion using regression techniques. Their findings indicate that LATs of 100–500 ms are a feasible range for sufficient prediction accuracy. Sanchez et al. [20] analyzed the effect of M2P latency on a tile-based streaming system and proposed an angular acceleration-based prediction method to mitigate the impact on the observed fidelity. Barniv et al. [8] used the myoelectric signals obtained from EMG devices to predict impending head motion. They trained a neural network to map EMG signals to trajectory outputs and experimented with combining EMG output with inertial data. Their findings indicate that LATs of 30–70 ms are achievable with low error rates.
Most of the previous works target three-degrees-of-freedom (3DoF) VR applications and thus focus on predicting only the head orientation in order to optimize the streaming of 360° videos. However, little work has been done so far on prediction of 6DoF movement for advanced AR and VR applications. In Sec. 5, we present an initial statistical model for 6DoF prediction and discuss our findings.

3 SYSTEM ARCHITECTURE
This section presents the system architecture of our cloud rendering-based volumetric video streaming system and describes its different components. A simplified version of this architecture is shown in Fig. 1.

3.1 Server architecture
The server-side implementation is composed of two main parts: a volumetric video player and a cross-platform cloud rendering library, each described in more detail below.

Volumetric video player. The volumetric video player is implemented in Unity and plays a single MP4 file which has one video track containing the compressed texture data and one mesh track containing the compressed mesh data of a volumetric object. Before the playback of a volumetric video starts, the player registers all the required objects. For example, the virtual camera of the rendered view and the volumetric object are registered, and those can later be controlled by the client. After initialization, the volumetric video player can start playing the MP4 file. During playout, both tracks are demultiplexed and fed into the corresponding decoders: the video decoder for the texture track and the mesh decoder for the mesh track. After decoding, each mesh is synchronized with the corresponding texture and rendered to a scene. The rendered view of the scene is represented by a Unity RenderTexture that is passed to our cloud rendering library for further processing.
While rendering the scene, the player concurrently asks the cloud rendering library for the latest positions of the relevant objects that were previously registered in the initialization phase.

Cloud rendering library. We created a cross-platform cloud rendering library written in C++ that can be integrated into a variety of applications. In the case of the Unity application, the library is integrated into the player as a native plugin. The library utilizes the GStreamer WebRTC plugin for low-latency video streaming between the server and client, which is integrated into a media pipeline as described in Sec. 3.3.

Figure 1: Overview of the system components and interfaces.

In addition, the library provides interfaces for registering the objects of the rendered scene and retrieving the latest client-controlled transformations of those objects while rendering the scene. In the following, we describe the modules of our library, each of which runs asynchronously in its own thread to achieve high performance.

The WebSocket Server is used for exchanging signaling data between the client and the server. Such signaling data includes Session Description Protocol (SDP) and Interactive Connectivity Establishment (ICE) messages, as well as application-specific metadata for scene description. In addition, the WebSocket (WS) connection can also be used for sending control data, e.g. changing the position and orientation of any registered game object or camera. Both plain WebSockets and Secure WebSockets are supported, which is important for practical operation of the system.

The GStreamer module contains the media processing pipeline, which takes the rendered texture and compresses it into a video stream that is sent to the client using the WebRTC plugin. The most important components of the media pipeline are described in Sec. 3.3.
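To make the control-data path concrete, the snippet below sketches what a pose-update message sent over the WS connection might look like. The paper does not specify the message schema, so the field names (`type`, `objectId`, `position`, `rotation`) and the helper `make_pose_update` are purely hypothetical illustrations.

```python
import json

def make_pose_update(object_id, position, quaternion):
    """Serialize a hypothetical 6DoF pose update for a registered object.

    position: (x, y, z) in meters; quaternion: (qx, qy, qz, qw).
    The actual schema used by the system is not given in the paper.
    """
    return json.dumps({
        "type": "poseUpdate",    # hypothetical message type
        "objectId": object_id,   # name registered during initialization
        "position": dict(zip("xyz", position)),
        "rotation": dict(zip(("qx", "qy", "qz", "qw"), quaternion)),
    })

msg = make_pose_update("mainCamera", (0.1, 1.6, -0.4), (0.0, 0.0, 0.0, 1.0))
```

Sending such compact JSON over the already-open signaling channel avoids setting up a separate data path for control input.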
The Controller module represents the application logic and controls the other modules depending on the application state. For example, it closes the media pipeline if the client disconnects, re-initializes the media pipeline when a new client has connected, and updates the controllable objects based on the output of the Prediction Engine.

The Prediction Engine implements a regression-based prediction method (please refer to Sec. 5) and provides interfaces for the use of other potential methods. Based on the previously received input from the client and the implemented algorithm, the module updates the positions of the registered objects such that the rendered scene corresponds to the predicted positions of the objects after a given LAT.

3.2 Client architecture
The client-side architecture is depicted on the left side of Fig. 1. Before the streaming session starts, the client establishes a WS connection to the server and asks the server to send a description of the rendered scene. The server responds with a list of objects and parameters which the client is later allowed to update. After receiving the scene description, the client replicates the scene and initiates a peer-to-peer (P2P) WebRTC connection to the server. The server and client begin the WebRTC negotiation process by exchanging SDP and ICE data over the established WS connection. Finally, the P2P connection is established, and the client starts receiving a video stream corresponding to the current view of the volumetric video. At the same time, the client can use the WS connection, as well as the RTCPeerConnection, for sending control data to the server in order to modify the properties of the scene. For example, the client may change its 6DoF position, or it may rotate, move and scale any volumetric object in the scene.

We have implemented both a web player in JavaScript and a native application for the HoloLens, the untethered AR headset from Microsoft.
While our web application targets VR, our HoloLens application is implemented for AR use cases. In the HoloLens application, we perform further processing to remove the background of the video texture before rendering the texture onto the AR display. In general, the client-side architecture remains the same for both VR and AR use cases, and the most complex client-side module is the video decoder. Thus, the complexity of our system is concentrated largely in our cloud-based rendering server.

3.3 Media pipeline
The simplified structure of the media pipeline is shown in the bottom part of Fig. 1. A rendered texture is given to the media pipeline as input using the AppSource element of GStreamer. Since the rendered texture is originally in RGB format but the video encoder requires YUV input, we use the VideoConvert element to convert the RGB texture to the I420 format (https://www.fourcc.org/pixel-format/yuv-i420). After conversion, the texture is passed to the encoder element, which can be set to any supported encoder on the system. Since encoder latency is a significant contributor to the overall M2P latency, we evaluated the encoding performance of different encoders for a careful selection. For detailed results on encoder performance, please refer to Sec. 4.1.

After the texture is encoded, the resulting video bitstream is packaged into RTP packets, encrypted and sent to the client using WebRTC. WebRTC was chosen as the delivery method since it allows us to achieve ultra-low latency over the P2P connection between the client and server. In addition, WebRTC is already widely adopted by different web browsers, allowing our system to support several different platforms.

4 MOTION-TO-PHOTON LATENCY
The different components of the M2P latency are illustrated in Fig.
2 and related by

    T_M2P = T_server + T_network + T_client    (1)

where T_server, T_network and T_client consist of the following component latencies:

    T_server = T_rend + T_enc                  (2)
    T_network = T_up + T_down + T_trans        (3)
    T_client = T_dec + T_disp                  (4)

Figure 2: Components of the motion-to-photon latency for a remote rendering system.

In our analysis, we neglect the time for the HMD to compute the user pose using its tracker module. This computation is typically based on a fusion of the sensor data from the inertial measurement units (IMUs) and visual data from the cameras. Although AR device cameras typically operate at 30–60 Hz, IMUs are much faster, and a reliable estimation of the user pose can be performed at a frequency of multiple kHz [14, 30]. Thus, the expected tracker latency is on the order of microseconds.

T_enc is the time to compress a frame and depends on the encoder type (hardware or software) and the picture resolution. We present a detailed latency analysis of the different encoders that we tested for our system in Section 4.1.

T_network is the network round-trip time (RTT). It consists of the propagation delay (T_up + T_down) and the transmission delay T_trans. T_up is the time for the server to retrieve sensor data from the client, and T_down is the time for the server to transmit a compressed frame to the client.

T_rend is the time for the server to generate a new frame by rendering a view from the volumetric data based on the actual user pose. In general, it can be set to match the frame rate of the encoder for a given rendered texture resolution.

T_dec is the time to decode a compressed frame on the client device and is typically much smaller than T_enc, since video decoding is inherently a faster operation than video encoding. Also, end devices typically have hardware-accelerated video decoders that further reduce the decoding latency.
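As a sanity check on Eqs. (1)–(4), the latency budget can be tallied in a few lines of code. The component values below are illustrative placeholders, not measurements from this system.

```python
def m2p_latency(t_rend, t_enc, t_up, t_down, t_trans, t_dec, t_disp):
    """Total motion-to-photon latency per Eqs. (1)-(4); all values in ms."""
    t_server = t_rend + t_enc               # Eq. (2)
    t_network = t_up + t_down + t_trans     # Eq. (3)
    t_client = t_dec + t_disp               # Eq. (4)
    return t_server + t_network + t_client  # Eq. (1)

# Illustrative budget: 60 fps rendering (16.7 ms), fast hardware encode,
# a ~20 ms RTT, and the average 60 Hz display latency of 8.3 ms.
total = m2p_latency(t_rend=16.7, t_enc=2.0, t_up=10.0, t_down=10.0,
                    t_trans=1.0, t_dec=1.5, t_disp=8.3)  # → 49.5 ms
```

A breakdown like this makes clear which terms a given optimization (hardware encoding, edge deployment, pose prediction) actually attacks.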
T_disp is the display latency and mainly depends on the refresh rate of the display. For a typical refresh rate of 60 Hz, the average value of T_disp is 8.3 ms, and the worst-case value is 16.6 ms in case the decoded frame misses the current VSync signal and has to wait in the frame buffer for the next VSync signal.

4.1 Encoder latency
We characterized the encoding speeds of different encoders using the test dataset provided by the ISO/ITU Joint Video Exploration Team (JVET) for the next-generation video coding standard Versatile Video Coding (VVC) [24]. In our measurements, we used the FFmpeg libraries of the encoders NVENC, x264, x265, and Intel SVT-HEVC, enabled their low-latency presets and measured the encoded frames per second (FPS) using the FFmpeg -benchmark option. The measurements were performed on an Ubuntu 18.04 machine with 16 Intel Xeon Gold 6130 CPUs (2.10 GHz), using the default threading options for the tested software-based encoders. For x264 and x265, we used the ultrafast preset and zerolatency tuning [9]. For NVENC, we evaluated the presets default, high-performance (HP), low-latency (LL) and low-latency high-performance (LLHP). A brief description of the NVENC presets can be found in [28].

Table 1 shows the mean FPS over all tested sequences for the different encoders. We observed that both the H.264 and HEVC encoders of NVENC are significantly faster than x264 and x265 (both using the ultrafast preset and zerolatency tuning) as well as SVT-HEVC (Low delay P). NVENC is able to encode the 1080p and 4K videos in our test dataset at encoding speeds of up to 800 fps and 200 fps, respectively. We also observed that for some sequences, HEVC encoding turned out to be faster than H.264 encoding. We believe that this difference is caused by a more efficient GPU implementation for HEVC. All the low-latency presets tested in our experiments turn off B-frames to reduce latency.
Despite that, we observed that the picture quality obtained by NVENC in terms of PSNR is comparable to the other tested encoders (using low-latency presets). As a result of our analysis, we decided to use NVENC H.264 (HP preset) in our system.

Table 1: Mean encoding performance over all tested sequences for different encoders and presets.

    Standard  Encoder   Preset       Mean FPS
    H.264     x264      Ultrafast    81
    H.264     NVENC     Default      353
    H.264     NVENC     HP           465
    H.264     NVENC     LL           359
    H.264     NVENC     LLHP         281
    HEVC      x265      Ultrafast    33
    HEVC      SVT-HEVC  Low delay P  74
    HEVC      NVENC     Default      212
    HEVC      NVENC     HP           492
    HEVC      NVENC     LL           278
    HEVC      NVENC     LLHP         211

4.2 Latency measurements
We developed a framework to measure the M2P latency of our system. In our setup, we run the server application on an Amazon EC2 instance in Frankfurt, and the client application runs in a web browser in Berlin, connected to the Internet over WiFi.

We implemented a server-side console application which uses the same cloud rendering library as described in Sec. 3.1, but instead of sending the rendered textures from the volumetric video player, the application sends predefined textures (known by the client) depending on the received control data from the client. These textures consist of simple vertical bars with different colors. For example, if the client instructs the server application to move the main camera to position P1, the server pushes the texture F1 into the media pipeline. Similarly, another camera position P2 results in the texture F2. On the client side, we implemented a web-based application that connects to the server application and renders the received video stream to a canvas. Since the client knows exactly what those textures look like, it can evaluate the incoming video stream and determine when the requested texture was rendered on the screen.
As soon as the client application sends P1 to the server, it starts a timer and checks the canvas for F1 at every web browser window repaint event. According to the W3C recommendation [19], the repaint event matches the refresh rate of the display. As soon as the texture F1 is detected, the client stops the timer and computes the M2P latency T_M2P. Once the connection is established, the user can start the session by defining the number of independent measurements. Since we are using the second smallest instance type of Amazon EC2 (t2.micro), we set the size of each video frame to 512 × 512 pixels. We encode the stream using x264 configured with the ultrafast preset and zerolatency tuning, at an encoding speed of ~80 fps. As an example, we set the client to perform 100 latency measurements and calculated the average, minimum and maximum M2P latency. Our results show that T_M2P fluctuates between 41 ms and 63 ms, and the measured average M2P latency is 58 ms.

5 HEAD MOTION PREDICTION
One important technique to mitigate the increased M2P latency in a cloud-based rendering system is the prediction of the user's future pose. In this section, we describe our statistical prediction model for 6DoF head motion prediction and evaluate its performance using real user traces.

5.1 Data collection
We collected motion traces from five users while they were freely interacting with a static virtual object using a Microsoft HoloLens. We recorded the users' movements in 6DoF space; i.e., we collected position samples (x, y, z) and rotation samples represented as quaternions (qx, qy, qz, qw). Since the raw sensor data we obtained from the HoloLens was unevenly sampled (i.e. had different temporal distances between consecutive samples) at 60 Hz, we interpolated the data to obtain temporally equidistant samples.
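The rotation-resampling step relies on spherical linear interpolation between quaternion samples. The sketch below is a generic textbook SLERP implementation, not the authors' code, and assumes unit quaternions in (x, y, z, w) order.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (x, y, z, w)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # Negate one endpoint if needed to interpolate along the shorter arc.
    if dot < 0.0:
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:  # nearly parallel: fall back to normalized lerp
        q = [a + t * (b - a) for a, b in zip(q0, q1)]
        n = math.sqrt(sum(c * c for c in q))
        return [c / n for c in q]
    theta = math.acos(dot)  # angle between the two quaternions
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]

# Resample halfway between identity and a 90° rotation about z:
q_mid = slerp((0, 0, 0, 1),
              (0, 0, math.sin(math.pi / 4), math.cos(math.pi / 4)), 0.5)
# q_mid ≈ (0, 0, 0.3827, 0.9239), i.e. a 45° rotation about z
```

Unlike componentwise linear interpolation, SLERP keeps intermediate samples on the unit sphere and yields constant angular velocity between the two endpoints, which is why it is the standard choice for resampling orientation traces.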
We upsampled the position data using linear interpolation and the rotation data (quaternions) using Spherical Linear Interpolation of Rotations (SLERP) [27]. Thus, we obtained an evenly sampled dataset with a sampling rate of 200 Hz (one sample every 5 ms).¹

5.2 Prediction method
We use a simple autoregressive (AutoReg) model to predict the future user pose based on a time series of its past values. AutoReg models use a linear combination of the past values of a variable to forecast its future values [11]. An AutoReg model of lag order ρ can be written as

    y_t = c + φ_1 y_{t−1} + φ_2 y_{t−2} + · · · + φ_ρ y_{t−ρ} + ε_t    (5)

where y_t is the true value of the time series y at time t, ε_t is white noise, and φ_i are the coefficients of the model. Such a model with ρ lagged values is referred to as an AR(ρ) model. Some statistics libraries can determine the lag order automatically using statistical tests such as the Akaike Information Criterion (AIC) [2].

We used the x and qx values from one of the collected traces as training data and created two AutoReg models using the Python library statsmodels [23], for the translational and rotational components, respectively. Our model has a lag order of 32 samples, i.e. it considers a history window (hw) of the past 32 × 5 = 160 ms and predicts the next sample using (5). Typically we need to predict not only the next sample but multiple samples into the future to achieve a given LAT; therefore, we repeat the prediction step by adding the just-predicted sample to the history window and iterating (5) until we obtain the future sample corresponding to the desired LAT. The process is then repeated for each frame, e.g. every 10 ms for an assumed 100 Hz display refresh rate. We used the trained model to predict the users' translational (x, y, z) and rotational motion (qx, qy, qz, qw).
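The iterative multi-step forecast described above can be sketched as follows. For clarity this uses hand-set AR coefficients rather than a model fitted with statsmodels; the point is the loop structure, which feeds each prediction back into the history window until the desired look-ahead time is reached.

```python
def ar_forecast(history, c, phi, steps):
    """Iterate an AR(len(phi)) model 'steps' samples ahead (Eq. 5, noise term omitted).

    history: past samples, most recent last; phi[0] weights y_{t-1},
    phi[1] weights y_{t-2}, and so on.
    """
    window = list(history)
    out = []
    for _ in range(steps):
        y_next = c + sum(p * window[-1 - i] for i, p in enumerate(phi))
        window.append(y_next)  # feed the prediction back as input
        out.append(y_next)
    return out

# Toy AR(1) with phi_1 = 0.5: each predicted step halves the last value.
# With 5 ms samples, 8 iterations would correspond to a 40 ms LAT.
preds = ar_forecast([4.0], c=0.0, phi=[0.5], steps=3)  # → [2.0, 1.0, 0.5]
```

A fitted model would supply c and phi per component (e.g. from the statsmodels AutoReg results), with a window of 32 past samples as in the paper.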
We perform the prediction of rotations in the quaternion domain, since we readily obtain quaternions from the sensors and they allow smooth interpolation using techniques like SLERP. After prediction, we convert the predicted quaternions to Euler angles (yaw, pitch, roll) and evaluate the prediction accuracy in the domain of Euler angles, since they are better suited for understanding the rendering offsets in terms of angular distances.

¹ Our 6DoF head movement dataset is freely available for further usage by the research community at https://github.com/serhan-gul/dataset_6DoF.

Figure 3: Mean absolute error (MAE) for different translational and rotational components averaged over five users. Results are given for look-ahead times T_LAT in the range 20–100 ms.

5.3 Evaluation
In our evaluation, we investigated the effect of prediction on the accuracy of the rendered image displayed to the user, i.e. the rendering offset. Specifically, we compared the predicted user pose (given a certain look-ahead time T_LAT) to the real user pose as obtained from the sensors. As a benchmark, we evaluated a baseline case in which the rendered pose lags behind the actual user pose by a delay corresponding to the M2P latency (T_M2P), i.e., no prediction is performed. For each user trace, we evaluated the prediction algorithm for a T_LAT ranging between 20–100 ms. In each experiment, we assume that the M2P latency is equal to the prediction time (T_LAT = T_M2P), such that the prediction model attempts to predict the pose that the user will attain at the time the rendered image is displayed to the user. We evaluated our results by computing the mean absolute error (MAE) between the true and predicted values for the different components.
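The baseline error is straightforward to reproduce on any trace: shift the signal by the M2P delay (in samples) and compute the MAE against the unshifted truth. The sketch below uses a synthetic trace, not data from the paper's dataset.

```python
def mae(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def baseline_mae(trace, lag_samples):
    """MAE of the no-prediction baseline: the rendered pose equals the
    true pose delayed by the M2P latency (expressed in samples)."""
    truth = trace[lag_samples:]
    lagged = trace[:-lag_samples]
    return mae(truth, lagged)

# Synthetic linearly increasing x-position, one sample every 5 ms.
trace = [0.01 * i for i in range(100)]
# A 40 ms M2P latency corresponds to a lag of 8 samples;
# for this linear trace each rendered sample is off by 8 * 0.01 = 0.08.
err = baseline_mae(trace, lag_samples=8)
```

Comparing this value against the MAE of the predictor's output at the same T_LAT reproduces the prediction-vs-baseline comparison of Fig. 3.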
Figure 4: Comparison of the prediction (blue) and baseline (red) results for the x and roll components of one of the traces (sample time 5 ms; showing the time range 0–1 s) for T_LAT = 40 ms. The dashed gray line shows the recorded sensor data.

Fig. 3 compares the average rendering errors over the five traces obtained using our prediction method to the baseline. We observe that for all considered T_LAT, prediction reduces the average rendering error for both the positional and rotational components. Fig. 4 shows, for one of the traces, the predicted and baseline (lagged by the M2P latency) values for the x and roll components. At the beginning of the session, the prediction module collects the required number of samples for a hw of 160 ms and makes the first prediction, i.e., the pose that the user is predicted to attain after a time of T_LAT = 40 ms (green shaded). We observe that the accuracy of the prediction depends on the frequency of abrupt, short-term changes of the user pose. If a component of the user pose changes linearly over a hw (without changing direction), the resulting predictions for that component are fairly accurate. Otherwise, if short-term changes are present within a hw, the prediction tends to perform worse than the baseline.

6 CONCLUSION
We presented a cloud-based volumetric streaming system that offloads the rendering to a powerful server and thus reduces the rendering load on the client side. To compensate for the added network and processing latency, we developed a method to predict the user's head motion in six degrees of freedom. Our results show that the developed prediction model reduces the rendering errors caused by the added latency due to cloud-based rendering. In our future work, we will analyze the effect of motion-to-photon latency on the user experience through subjective tests and develop more advanced prediction techniques, e.g. based on Kalman filtering.

REFERENCES
[1] Bernard D. Adelstein, Thomas G.
Lee, and Stephen R. Ellis. 2003. Head Tracking Latency in Virtual Environments: Psychophysics and a Model. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 47, 20 (Oct. 2003), 2083–2087. https://doi.org/10.1177/154193120304702001
[2] Hirotugu Akaike. 1973. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika 60, 2 (1973), 255–265. https://doi.org/10.1093/biomet/60.2.255
[3] R.S. Allison, L.R. Harris, M. Jenkin, U. Jasiobedzka, and J.E. Zacher. 2001. Tolerance of temporal delay in virtual environments. In Proceedings IEEE Virtual Reality 2001. IEEE Comput. Soc, 247–254. https://doi.org/10.1109/vr.2001.913793
[4] Tamay Aykut, Mojtaba Karimi, Christoph Burgmair, Andreas Finkenzeller, Christoph Bachhuber, and Eckehard Steinbach. 2018. Delay compensation for a telepresence system with 3D 360 degree vision based on deep head motion prediction and dynamic FoV adaptation. IEEE Robotics and Automation Letters 3, 4 (2018), 4343–4350. https://doi.org/10.1109/lra.2018.2864359
[5] Ou Bai, Varun Rathi, Peter Lin, Dandan Huang, Harsha Battapady, Ding-Yu Fei, Logan Schneider, Elise Houdayer, Xuedong Chen, and Mark Hallett. 2011. Prediction of human voluntary movement before it occurs. Clinical Neurophysiology 122, 2 (2011), 364–372. https://doi.org/10.1016/j.clinph.2010.07.010
[6] Yanan Bao, Huasen Wu, Tianxiao Zhang, Albara Ah Ramli, and Xin Liu. 2016. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 1161–1170. https://doi.org/10.1109/bigdata.2016.7840720
[7] Yanan Bao, Tianxiao Zhang, Amit Pande, Huasen Wu, and Xin Liu. 2017. Motion-Prediction-Based Multicast for 360-Degree Video Transmissions. In 2017 14th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON).
IEEE, 1–9. https://doi.org/10.1109/sahcn.2017.7964928 [8] Y air Barniv , Mario Aguilar , and Erion Hasanbelliu. 2005. Using EMG to anticipate head motion for virtual-environment applications. IEEE Transactions on Biomedi- cal Engineering 52, 6 (2005), 1078–1093. https://doi.org/10.1109/tbme.2005.848378 [9] FFmpeg. 2019. H.264 Video Encoding Guide. https://trac.mpeg.org/wiki/Encode/ H.264. Online; accessed: 2020-03-26. [10] C. Holmberg, S. Hakansson, and G. Eriksson. 2015. W eb real-time communication use cases and requirements . RFC 7478. https://doi.org/10.17487/rfc7478 [11] Rob J Hyndman and George Athanasopoulos. 2018. Forecasting: principles and practice . OT exts. [12] Kishor Koirala, Meera Dasog, Pu Liu, and Edward A Clancy . 2015. Using the electromyogram to anticipate torques about the elbow . IEEE Transactions on Neural Systems and Rehabilitation Engineering 23, 3 (2015), 396–402. https: //doi.org/10.1109/tnsre.2014.2331686 [13] Steve LaV alle and Peter Giokaris. 2015. Perception base d predictive tracking for head mounted displays. US Patent No. 9348410B2, Filed May 22, 2014, Issued Jun. 6., 2015. [14] Peter Lincoln, Alex Blate, Montek Singh, T urner Whitte d, Andrei State, Anselmo Lastra, and Henry Fuchs. 2016. From motion to photons in 80 microseconds: T owards minimal latency for virtual and augmente d reality . IEEE transactions on visualization and computer graphics 22, 4 (2016), 1367–1376. https://doi.org/10. 1109/tvcg.2016.2518038 [15] Simone Mangiante, Guenter Klas, Amit Navon, Zhuang GuanHua, Ju Ran, and Marco Dias Silva. 2017. VR is on the edge: How to deliver 360 videos in mobile networks. In Proceedings of the W orkshop on Virtual Reality and A ugmented Reality Network . A CM, 30–35. https://doi.org/10.1145/3097895.3097901 [16] NVIDIA. 2019. N VIDIA CloudXR Delivers Low-Latency AR/VR Streaming Over 5G Networks to Any Device . https://blogs.nvidia.com/blog/2019/10/22/nvidia- cloudxr. Online; accessed: 2020-03-26. 
[17] Stefano Petrangeli, Gwendal Simon, Haoliang W ang, and Vishy Swaminathan. 2019. Dynamic Adaptive Streaming for Augmented Reality Applications. In 2019 IEEE International Symposium on Multimedia (ISM) . IEEE, 56–567. https: //doi.org/10.1109/ism46123.2019.00017 [18] Feng Qian, Bo Han, Jarrell Pair , and Vijay Gopalakrishnan. 2019. T oward practical volumetric video streaming on commodity smartphones. In Proceedings of the 20th International W orkshop on Mobile Computing Systems and Applications . ACM, 135–140. https://doi.org/10.1145/3301293.3302358 [19] James Robinson and Cameron McCormack. 2015. Timing control for script- based animations . W3C W orking Draft. https://w ww .w3.org/TR/2015/NOTE- animation- timing- 20150922 [20] Y ago Sanchez, Gurdeep Singh Bhullar , Rob ert Skupin, Cornelius Hellge, and Thomas Schierl. 2019. Delay impact on MPEG OMAFâĂŹs tile-based viewport- dependent 360Âř video streaming. IEEE Journal on Emerging and Selecte d T opics in Circuits and Systems (2019). https://doi.org/10.1109/jetcas.2019.2899516 [21] O Schreer , I Feldmann, P Kau, P Eisert, D T atzelt, C Hellge, K Müller , T Ebner , and S Bliedung. 2019. Lessons learnt during one year of commercial volumetric video production. In 2019 IBC conference . IBC. [22] Sebastian Schwarz, Marius Preda, Vittorio Baroncini, Madhukar Budagavi, Pablo Cesar , Philip A Chou, Rob ert A Cohen, Maja Krivokuća, Sébastien Lasserre, Zhu Li, et al . 2018. Emerging MPEG standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2018), 133–148. https://doi.org/10.1109/dcc.2016.91 [23] Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statistical modeling with python. In 9th Python in Science Conference . inproceedings. [24] Andrew Segall, Vittorio Baroncini, Jill Boyce , Jianle Chen, and T eruhiko Suzuki. 2017. Joint call for proposals on video compression with capability beyond HEVC. In JVET -H1002 . 
[25] Shu Shi, V arun Gupta, Michael Hwang, and Rittwik Jana. 2019. Mobile VR on edge cloud: a latency-driven design. In Proceedings of the 10th ACM Multimedia Systems Conference . ACM, 222–231. https://doi.org/10.1145/3304109.3306217 [26] Shu Shi and Cheng-Hsin Hsu. 2015. A survey of interactive remote rendering systems. Comput. Surveys 47, 4 (May 2015), 1–29. https://doi.org/10.1145/2719921 [27] Ken Shoemake. 1985. Animating rotation with quaternion cur ves. In ACM SIGGRAPH computer graphics , V ol. 19. ACM, 245–254. https://doi.org/10.1145/ 325165.325242 [28] T witch. 2018. Using Netix machine learning to analyze T witch stream picture quality . https://streamquality .report/docs/report.html. Online; accessed: 2020- 03-26. [29] Jeroen van der Hooft, Tim W auters, Filip De T urck, Christian Timmerer, and Hermann Hellwagner . 2019. T owards 6DoF HT TP adaptive streaming through point cloud compression. In Proceedings of the 27th ACM International Conference on Multimedia . 2405–2413. https://doi.org/10.1145/3343031.3350917 [30] Daniel W agner . 2018. Motion-to-photon latency in mobile AR and VR. https: //daqri.com/blog/motion- to- photon- latency. Online; accessed: 2020-03-26.
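To illustrate the prediction scheme discussed above, the following is a minimal sketch of per-component pose extrapolation over a 160 ms history window with a 40 ms look-ahead. It is not the authors' implementation: the least-squares linear fit is a stand-in predictor (the paper's actual model is more elaborate, and the function names are illustrative), but it reproduces the observed behavior that linearly changing components are predicted accurately while the baseline simply replays the pose lagged by the M2P latency.

```python
import numpy as np

# Parameters taken from the paper's evaluation setup.
SAMPLE_INTERVAL_MS = 5    # sensor sample time
HISTORY_WINDOW_MS = 160   # history window (hw)
LOOKAHEAD_MS = 40         # look-ahead time T_LAT

def predict_pose_component(samples, lookahead_ms=LOOKAHEAD_MS,
                           sample_interval_ms=SAMPLE_INTERVAL_MS):
    """Extrapolate one pose component (e.g. x or roll) lookahead_ms into
    the future from the samples of one history window, using a
    least-squares linear fit (an illustrative stand-in predictor)."""
    t = np.arange(len(samples)) * sample_interval_ms
    slope, intercept = np.polyfit(t, samples, deg=1)
    t_future = t[-1] + lookahead_ms
    return slope * t_future + intercept

def baseline_pose_component(samples):
    """Baseline: reuse the most recent pose, i.e. the value the renderer
    would act on after the M2P latency, with no prediction."""
    return samples[-1]
```

For a component changing linearly within the window, the fit recovers the trend exactly and the extrapolated value leads the baseline by the look-ahead time; for abrupt, short-term direction changes inside the window, the fit averages them out and the prediction degrades, matching the observation above.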
