OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework For Future NUMA-Based Multi-GPU Systems
Evaluation
We model the object-oriented VR rendering framework (OO-VR) by extending ATTILA-sim. To obtain the graphical properties of objects (e.g., viewports, number of triangles, and texture data), we profile the rendering traces from our real-game benchmarks shown in Table [table:benchmarks]. In ATTILA-sim, we then implement the OO-VR programming model in the GPUDriver, the object distribution engine in the command processor, and the distributed hardware composition in the color-write stage. To evaluate the effectiveness of our proposed OO-VR design, we compare it with several design scenarios: (i) Baseline - the baseline multi-GPU system with the single programming model (Section 26); (ii) 1TB/s-BW - the baseline system with 1 TB/s inter-GPU link bandwidth; (iii) Object-level - Object-level SFR, which distributes objects among GPMs (Section 27); (iv) Frame-level - AFR, which renders an entire frame within each GPM; and (v) OO_APP - the proposed object-oriented programming model (Section 29). We provide results and detailed analysis of our proposed design on performance, inter-GPU memory traffic, sensitivity to inter-GPM link bandwidth, and performance scalability with the number of GPMs.
Effectiveness On Performance
Fig.1 shows the performance results with respect to single-frame latency under the five design scenarios. We gather the total rendering cycles from the beginning to the end of each frame and normalize the speedup to the Baseline case. We report single-frame latency because keeping it low is critical for avoiding motion sickness in VR. From the figure, we make several observations.
First, without hardware modifications, OO_APP improves performance by about 99%, 39% and 28% on average compared to Baseline, Object-level SFR and 1TB/s-BW, respectively. It combines the two views of the same object and enables multi-view rendering to share texture data. In addition, by grouping objects into large batches, it further increases data locality within each GPM to reduce inter-GPM memory traffic. However, it still suffers from serious workload imbalance. For instance, Object-level SFR slightly outperforms OO_APP when executing DM3-1280 and DM3-1600. This is because some batches within these two benchmarks require much longer rendering time than others, and the software scheduling policy alone in OO_APP cannot balance execution time across GPMs without runtime information. Second, we observe that on average, OO-VR outperforms Baseline, Object-level SFR and OO_APP by 1.58x, 99% and 59%, respectively. With the software-hardware co-design, OO-VR distributes batches based on predicted rendering time and provides better workload balance than OO_APP. It also increases the pixel rate by fully utilizing the ROPs of all GPMs.
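The scheduling contrast above can be sketched as follows. This is an illustrative model, not the paper's actual implementation: OO-VR's predictive distribution is approximated here as longest-processing-time-first (LPT) assignment to the least-loaded GPM, and `predicted_times` stands in for the output of a hypothetical rendering-time predictor.

```python
import heapq

def distribute_batches(predicted_times, num_gpms):
    """Assign each batch to the currently least-loaded GPM, heaviest batch
    first, so predicted rendering time stays balanced across GPMs."""
    heap = [(0.0, gpm, []) for gpm in range(num_gpms)]  # (load, gpm id, batches)
    heapq.heapify(heap)
    for batch, t in sorted(enumerate(predicted_times), key=lambda x: -x[1]):
        load, gpm, batches = heapq.heappop(heap)  # least-loaded GPM
        batches.append(batch)
        heapq.heappush(heap, (load + t, gpm, batches))
    return sorted(heap, key=lambda e: e[1])  # order by GPM id

# With skewed batch costs (as in DM3-1280/DM3-1600), a static split without
# runtime information can leave one GPM with most of the work, while the
# prediction-driven assignment above evens out the per-GPM load.
```

A greedy heap-based assignment like this needs only one pass over the batches, which is why a hardware distribution engine can apply it at command-submission time.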
We also observe that OO-VR achieves performance similar to Frame-level parallelism, which is considered to provide ideal performance on overall rendering cycles across all frames (as shown in Fig.[fig:AFR](left)). In terms of single-frame latency, however, Frame-level parallelism suffers a 40% slowdown while OO-VR significantly improves performance.
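The latency-versus-throughput trade-off can be made concrete with a small model. This is a hedged sketch under simplifying assumptions (AFR overlaps frames perfectly; SFR-style splitting is discounted by a parallel-efficiency parameter); the numbers are illustrative, not measured.

```python
# Toy model of AFR vs. SFR-style (OO-VR) rendering on n GPMs.
# T is the single-GPM rendering time of one frame; all values illustrative.

def afr_metrics(T, n):
    """AFR: each GPM renders a whole frame, so per-frame latency is
    unchanged while throughput scales with n."""
    return {"frame_latency": T, "throughput": n / T}

def sfr_metrics(T, n, efficiency):
    """SFR: one frame is split across n GPMs; latency shrinks with n,
    discounted by a parallel-efficiency factor in (0, 1]."""
    latency = T / (n * efficiency)
    return {"frame_latency": latency, "throughput": 1.0 / latency}
```

For T = 100 and n = 4 with perfect efficiency, AFR keeps a 100-cycle frame latency while the split reduces it to 25 cycles at equal throughput, which mirrors why Frame-level matches OO-VR on overall rendering cycles but loses on single-frame latency.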
Effectiveness On Inter-GPU Memory Traffic
Reducing inter-GPM memory traffic is another important criterion for judging the effectiveness of OO-VR. Fig.[fig:result-mem] shows the impact of OO-VR on inter-GPM memory traffic. Baseline and 1TB/s-BW generate the same inter-GPM traffic, and Frame-level processes each frame within one GPM and thus has near-zero inter-GPM traffic. Moreover, because the traffic reduction is mainly caused by our software-level design, OO_APP and OO-VR generate the same inter-GPM traffic. Therefore, Fig.[fig:result-mem] only shows results for Baseline, Object-level and OO-VR, and we focus on these three techniques in the following subsections. From the figure, we observe that OO-VR saves 76% and 36% of inter-GPM memory accesses compared to Baseline and Object-level SFR, respectively. This is because OO-VR allocates the required rendering data in the local DRAM of each GPM. The majority of the remaining inter-GPM memory accesses come from the distributed hardware composition, command transmission, and Z-testing during fragment processing. We observe that the delay caused by these inter-GPM accesses can be fully hidden by executing thousands of threads simultaneously across the numerous shader cores. In addition, data transfer via the inter-GPM links incurs high power dissipation (e.g., 10 pJ/bit for on-board links or 250 pJ/bit across nodes, depending on the integration technology). By reducing inter-GPM memory traffic, OO-VR therefore also achieves significant energy and cost savings.
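The energy argument can be approximated with a back-of-envelope calculation. The pJ/bit costs come from the text above; the traffic volume is a hypothetical placeholder, so only the proportional saving is meaningful.

```python
# Inter-GPM link energy ~ (bytes moved) * 8 bits/byte * (pJ/bit).
PJ_PER_BIT = {"on_board": 10.0, "inter_node": 250.0}  # from the text above

def link_energy_joules(traffic_bytes, pj_per_bit):
    """Energy spent moving traffic_bytes over a link at pj_per_bit."""
    return traffic_bytes * 8 * pj_per_bit * 1e-12

baseline_traffic = 1.0e9                      # hypothetical bytes per frame
oovr_traffic = baseline_traffic * (1 - 0.76)  # 76% reduction reported above
saved_j = (link_energy_joules(baseline_traffic, PJ_PER_BIT["on_board"])
           - link_energy_joules(oovr_traffic, PJ_PER_BIT["on_board"]))
```

Because link energy is linear in bytes moved, the 76% traffic reduction translates directly into a 76% cut of the inter-GPM link energy, and the saving grows 25x larger at the 250 pJ/bit inter-node cost.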
Sensitivity To Inter-GPM Link Bandwidth
Inter-GPU link bandwidth is one of the most important factors in multi-GPU systems. Previous works have shown that increasing inter-processor link bandwidth is difficult and incurs high fabric cost. To understand how inter-GPM link bandwidth impacts the design choice, we examine the performance gain of OO-VR under a range of link bandwidths. Fig.[fig:memsensitive] shows the speedup under different link bandwidths for Baseline, Object-level SFR and our proposed OO-VR. In this figure, performance is normalized to Baseline with a 64 GB/s inter-GPM link. We observe that inter-GPU link bandwidth strongly affects Baseline and Object-level SFR. This is because these two designs cannot capture data locality within a GPM to minimize inter-GPU memory accesses during rendering; the large amount of data shared across GPMs significantly stalls rendering. In contrast, OO-VR fairly distributes the rendering workload across GPMs and converts numerous remote accesses into local ones. By doing so, it fully utilizes the high-speed local memory bandwidth and is insensitive to the inter-GPM link bandwidth, even though inter-GPM memory accesses are not entirely eliminated. As local memory bandwidth scales in future GPU designs (e.g., High-Bandwidth Memory (HBM)), the performance of future multi-GPU systems is more likely to be constrained by inter-GPU links. In this case, we expect OO-VR to benefit future multi-GPU systems by reducing inter-GPM memory traffic.
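The sensitivity trend can be sketched with a simple bottleneck model: per-frame time is the larger of compute time and inter-GPM transfer time. The traffic volumes and compute time below are hypothetical; only the shape of the curves matters.

```python
def frame_time_ms(compute_ms, remote_bytes, link_gb_per_s):
    """Frame time bounded by the slower of compute and inter-GPM transfer."""
    transfer_ms = remote_bytes / (link_gb_per_s * 1e9) * 1e3
    return max(compute_ms, transfer_ms)

# Baseline ships most rendering data over the link; OO-VR keeps it local.
for bw in (64, 128, 256):  # GB/s
    t_base = frame_time_ms(10.0, 4.0e9, bw)  # transfer-bound: drops with bw
    t_oovr = frame_time_ms(10.0, 0.2e9, bw)  # compute-bound: flat at 10 ms
```

In this model a transfer-bound design scales with the link while a mostly-local design is flat, matching the observation that Baseline and Object-level SFR track link bandwidth while OO-VR does not.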
Scalability of OO-VR
Fig.2 shows the average speedup of Baseline, Object-level SFR and OO-VR as the number of GPMs increases. The results are normalized to a single-GPU system. As the figure shows, Baseline and Object-level SFR suffer limited performance scalability due to the NUMA bottleneck. With 8 GPMs, Baseline and Object-level SFR improve overall performance by only 2.08x and 3.47x on average over single-GPU processing. On the other hand, OO-VR provides scalable performance improvement by distributing independent rendering tasks to each GPM. Hence, with 4 and 8 GPMs, it achieves 3.64x and 6.27x speedup over single-GPU processing, respectively.
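The saturating curves can be read through an Amdahl-style lens: treat the non-parallelizable inter-GPM (NUMA) overhead as a serial fraction s of the rendering work. The fractions below are illustrative back-fits to the reported 8-GPM speedups, not values measured in the paper.

```python
def amdahl_speedup(n_gpms, serial_fraction):
    """Speedup over one GPM when a fraction of the work does not parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_gpms)

# s ~ 0.41 reproduces Baseline's ~2.08x at 8 GPMs; s ~ 0.04 gives an
# OO-VR-like ~6.25x, i.e. near-linear scaling.
baseline_like = amdahl_speedup(8, 0.41)
oovr_like = amdahl_speedup(8, 0.04)
```

Under this reading, OO-VR's gain comes from shrinking the effective serial fraction, which is why its advantage widens as GPMs are added.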