Surface Compression Using Dynamic Color Palettes

Off-chip memory traffic is a major source of power and energy consumption on mobile platforms. A large amount of this off-chip traffic is used to manipulate graphics framebuffer surfaces. To cut down the cost of accessing off-chip memory, framebuffer…

Authors: Ayub A. Gubran, Felix Huang, Tor M. Aamodt

Ayub A. Gubran, University of British Columbia, Vancouver, BC, Canada, ayoubg@ece.ubc.ca
Felix Huang, Carnegie Mellon University, Pittsburgh, PA, United States, felixh@andrew.cmu.edu
Tor M. Aamodt, University of British Columbia, Vancouver, BC, Canada, aamodt@ece.ubc.ca

ABSTRACT

Off-chip memory traffic is a major source of power and energy consumption on mobile platforms. A large amount of this off-chip traffic is used to manipulate graphics framebuffer surfaces. To cut down the cost of accessing off-chip memory, framebuffer surfaces are compressed to reduce the bandwidth consumed on surface manipulation when rendering or displaying.

In this work, we study the compression properties of framebuffer surfaces and highlight the fact that surfaces from different applications have different compression characteristics. We use the results of our analysis to propose a scheme, Dynamic Color Palettes (DCP), which achieves higher compression rates with UI and 2D surfaces.

DCP is a hardware mechanism for exploiting inter-frame coherence in lossless surface compression; it implements a scheme that dynamically constructs color palettes, which are then used to efficiently compress framebuffer surfaces. To evaluate DCP, we created an extensive set of OpenGL workload traces from 124 Android applications. We found that DCP improves compression rates by 91% for UI and 20% for 2D applications compared to previous proposals [1, 2]. We also evaluate a hybrid scheme that combines DCP with a generic compression scheme [1], and found that compression rates improve over previous proposals [1, 2] by 161%, 124% and 83% for UI, 2D and 3D applications, respectively.

1. INTRODUCTION

Off-chip memory traffic, including that of framebuffer surfaces, is one of the major sources of power consumption on mobile systems-on-chip (SoCs).
In some cases, the energy consumed to access data in off-chip memory can dominate that of computation [3]. In this work, we study the properties of framebuffer surfaces and propose a set of unique compression techniques to reduce the bandwidth consumed by framebuffer operations.

In graphics rendering, a framebuffer surface is an off-chip memory space that contains pixels generated by the graphics processing unit (GPU) and then used by the display controller to read pixels to the screen. In some cases, the display controller operates on multiple framebuffer surfaces, which are composited to a single surface for screen display. Also, GPUs can use framebuffer surfaces as inputs to additional rendering stages, e.g., render to texture and deferred shading; as a result, any given application may utilize one or more framebuffer surfaces.

This work studies a large set of Android workloads to infer the compression properties of framebuffer surfaces generated by mobile UI, 2D and 3D applications. Our study found that framebuffers from different classes of workloads have different compression properties. We exploit these properties to propose an effective palette-based framebuffer compression scheme that focuses on common UI and 2D applications. In addition, we exploit temporal coherence in graphics, where applications exhibit minor changes between frames that can be exploited for compression.

Using temporal coherence, and by focusing on common use cases, we propose and evaluate our Dynamic Color Palettes (DCP) technique. DCP uses palette-based compression and focuses on reducing the traffic caused by framebuffer operations in UI and 2D applications. To evaluate our compression scheme, we created an extensive set of workloads from 124 Android applications.
We show that by combining DCP with other compression techniques [1], DCP is able to improve compression rates between 83% and 161% across UI, 2D and 3D applications.

This paper makes the following contributions:

1. Characterizes compression properties of framebuffer surfaces from user-interface (UI) as well as non-UI 2D and 3D applications;
2. Uses characterization results to propose and evaluate dynamic color palettes (DCP), a compression technique that offers higher compression rates for common UI and non-UI 2D applications;
3. Proposes two DCP variations that dynamically choose an optimal palette size based on the frequencies of the values in color palettes;
4. Evaluates our compression schemes using an extensive set of workloads created from the OpenGL traces of 124 Android applications.

Figure 1: The life-cycle of a surface from rendering to display. In (1), applications render to their corresponding framebuffers. In (2), a compositor combines the surfaces generated by different applications. In (3), the composited surface is used and displayed on the screen by the display controller.

2. BACKGROUND AND RELATED WORK

2.1 The life-cycle of a framebuffer surface

Figure 1 summarizes the life-cycle of a framebuffer surface in contemporary mobile systems (Android Ice Cream Sandwich 4.0 and later [4]).

Figure 1 shows a typical scenario of drawing multiple surfaces simultaneously from multiple processes: the status bar, Facebook, and the navigation bar. Each process independently renders to its own surface (1); for example, Facebook renders a new surface when the user scrolls or clicks, while the navigation bar updates the corresponding surface when the user clicks on one of its buttons.

For display, a system compositor, such as SurfaceFlinger [4] in Android, combines surfaces from multiple applications before sending them to the screen (2). The compositor actively monitors the surfaces of all applications, and when a process updates a surface, the compositor subsequently updates the composited surface. Simultaneously, the display controller hardware continuously reads the composited surface to the screen at 60 frames per second (FPS) or higher (3). Note that because using the same surface for updates and screen refresh operations can cause artifacts, such as flickering and tearing, double (or triple) buffering is used [5].

The example in Figure 1 shows how a surface can be used and re-used multiple times, which is why it is important to reduce the overhead of framebuffer manipulation through compression.

2.2 Surface compression techniques

Surface compression is used to reduce off-chip memory traffic, which can improve performance and/or reduce energy consumption. Graphics pipeline implementations utilize compression for textures [6], surfaces [1, 7, 8], depth [9] and vertex data [10].

Many of the compression techniques use lossy compression as well as lossless compression.
For framebuffer surfaces, lossless compression is used to avoid error accumulation upon reading then re-writing surfaces (as is the case with composition).

Surface compression differs from texture compression in that both encoding and decoding are performed in real-time. Also, unlike surface compression schemes, most texture compression algorithms are lossy [6, 11, 12].

Another crucial aspect of surface compression is random accessibility. Techniques like Run-Length Encoding (RLE) are unable to provide such accessibility. However, it is important to be able to randomly access a surface when it is used for sampling (e.g., used as a texture), resizing, or composition. Compression algorithms have used block-based schemes to enable random access for their simplicity and practicality. Block-based compression mechanisms define preconfigured compression sizes that allow random access to compressed surfaces. Block-based mechanisms have been used for compressing integer (e.g., RGBA) surfaces [1], floating-point surfaces [13, 14] and depth buffers [9, 14].

The work by Rasmusson et al. [1] (which we refer to as RAS) evaluated several surface compression proposals [15, 16, 17] and compared them against their technique. RAS is a lossless block-based compression technique for integer buffers that encodes the difference between adjacent pixel values. RGB pixels are converted to the YCoCg (luminance-chrominance) format to increase compression efficiency. We compare against RAS in this paper since it reports better compression results versus prior work.

We also evaluate our scheme against the compression scheme proposed by Nvidia [2]. In this scheme, for each block going to memory, the algorithm checks if 4 × 2 pixels in sub-blocks within a block are identical. If so, the block is compressed 1:8.
When that is not possible, the algorithm then checks if 2 × 2 regions have identical colors; if so, the block is compressed 1:4; otherwise, the block remains uncompressed. This algorithm works well with regions of identical color values, as is the case with UI surfaces.

Other compression work includes the work by Danielson [18], which proposes using a dictionary-based compression in which the operating system and/or program specify the colors to configure a dictionary. In contrast to Danielson's work, our work exploits temporal coherence to dynamically construct dictionaries (palettes), avoiding the need for software changes.

Another work by Shim et al. [19] uses a dictionary-based compression mechanism targeted at display buffer compression. Shim et al.'s approach compresses surfaces using Huffman coding after rendering is completed to reduce the bandwidth of display refresh operations. Rendered surfaces are read to construct critical color differences, which are used in a second stage to construct a Huffman dictionary. The third stage re-reads the surface buffer and writes out a compressed buffer that is then used for screen refresh operations. In contrast to previous work, we propose employing temporal coherence to predict the values for the dictionary, avoiding submitting uncompressed surfaces to memory or requiring additional surface read/write operations. Also, we propose an adaptive compression scheme that avoids Huffman coding inefficiencies with probability distributions that are not exact powers of two.

Finally, a body of work has exploited temporal coherence in real-time rendering through inter-frame data reuse. These techniques, in addition to off-line rendering techniques like ray-tracing, are summarized in the survey work of Scherzer et al. [20]. Here we propose a different application for temporal coherence by exploiting it for compression.
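As an illustration of the fixed-ratio check attributed to [2] earlier in this section, the following is a minimal Python sketch. It is not the implementation from [2]: the function names are hypothetical, a block is assumed to be a list of rows of pixel values, "4 × 2" is assumed to mean 4 pixels wide by 2 pixels tall, and block dimensions are assumed to be multiples of the region size.

```python
def uniform_tiles(block, tile_w, tile_h):
    """Return True if every tile_w x tile_h region of the block holds one color."""
    height, width = len(block), len(block[0])
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            first = block[y0][x0]
            for y in range(y0, y0 + tile_h):
                for x in range(x0, x0 + tile_w):
                    if block[y][x] != first:
                        return False
    return True


def constant_color_ratio(block):
    """Pick a fixed compression ratio (8, 4 or 1) for one pixel block.

    Assumption: the 1:8 mode requires every 4x2 region to be uniform,
    the 1:4 mode requires every 2x2 region to be uniform.
    """
    if uniform_tiles(block, 4, 2):
        return 8
    if uniform_tiles(block, 2, 2):
        return 4
    return 1
```

A block that fails both checks is left uncompressed, which is why such a scheme favors surfaces with large regions of identical color.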
2.3 GPU Architectures

GPU architectures are broadly categorized as either tile-based or immediate-mode architectures. Tile-based architectures aim to save bandwidth by handling all raster operations, like blending and depth testing, using an on-chip buffer. Most mobile GPUs are tile-based architectures (including Qualcomm, Imagination and ARM GPUs). In tiled rendering, the screen is divided into render tiles (e.g., 32 × 32 or 64 × 64 pixels). For each tile, the GPU renders all primitives that map to that tile using an on-chip buffer before committing that buffer to the off-chip memory. As a result, what is being compressed and sent to the off-chip memory is the final surface value of each tile.

On the other hand, immediate-mode architectures render primitives in their drawing order. They avoid the overhead of the tiling process, but the values sent to the off-chip memory will contain intermediate surface values. In the case of overdrawing (i.e., when a pixel location is covered by more than one primitive), the same memory location will be written to multiple times. When a compression scheme is deployed in an immediate-mode GPU, it will compress the values sent to the off-chip memory as rendering progresses; this means compressing blocks from the GPU's LLC instead of an on-chip buffer.

As tile-based architectures are the dominant choice for mobile GPUs, going forward, we assume a tile-based architecture when evaluating surfaces for compression.

2.4 Mobile Use Patterns

In this work, we focus on developing an effective scheme for compressing UI and 2D framebuffer surfaces. The reason for this choice is that studies found users to spend 70% of their time running UI applications [21, 22], where over half of the time is spent on web browsing, messaging and social media alone, whereas games of all types, 2D and 3D, account for only 30% of the usage time.
Thus, we saw the opportunity of designing an effective scheme that targets such common use cases. In Section 7.5, we show how our DCP scheme can be combined with other generic compression algorithms to provide a comprehensive compression solution for all use cases.

3. TEMPORAL COHERENCE IN MOBILE GRAPHICS

Temporal coherence is the property of inter-frame similarity [20]; this means that in a sequence of frames, content only gradually changes from one frame to the next. To quantify temporal coherence, we use two measurements: Color change and Pixel change.

Color change is the total difference in pixel color frequencies between two frames regardless of the locations of the pixels. On the other hand, Pixel change is the total number of pixels that change color between frames, which is measured by counting the number of pixel locations that differ in value between two frames. Color change estimates how similar two frames are only with regard to color frequency, while Pixel change captures the movement of content on a surface.

To illustrate color and pixel change, we use an example from the Google Chrome browser in Figure 2. The example shows pixel and color change for different events. Notice that in some cases, when new content is displayed, both pixel and color change values are high (e.g., new web search). In most cases, however, pixel change is higher than color change; this means that in many cases the content is moving but not changing, as is the case with BBC News in Figure 2.

Looking at a range of mobile workloads, we found that temporal color coherence is reflected by low Color change values, especially in UI applications. We analyzed a set of nine Android UI applications and games (UI: Twitter, Facebook, Chrome, and Android Home Screen; 3D: Fruit Ninja, Need 4 Speed, Gunship 2, and Temple Run 2).
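The two measurements defined above can be sketched in a few lines of Python. This is a minimal illustration and not the measurement tool used for the study: frames are assumed to be equal-length flat lists of pixel values, both metrics are normalized to a fraction of the frame (the normalization is an assumption), and the function names are hypothetical.

```python
from collections import Counter


def pixel_change(prev, curr):
    """Fraction of pixel locations whose value differs between two frames."""
    assert len(prev) == len(curr)
    changed = sum(1 for a, b in zip(prev, curr) if a != b)
    return changed / len(curr)


def color_change(prev, curr):
    """Fraction of pixels in `curr` not accounted for by the color
    histogram of `prev`, regardless of where the pixels are located."""
    assert len(prev) == len(curr)
    prev_hist, curr_hist = Counter(prev), Counter(curr)
    # Count only the surplus of each color relative to the previous frame.
    surplus = sum(max(0, n - prev_hist[c]) for c, n in curr_hist.items())
    return surplus / len(curr)
```

With this formulation, content that only moves (for example, during scrolling) changes many pixel locations while leaving the color histogram almost untouched, which matches the observation that pixel change tends to exceed color change.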
In 3D applications, color and pixel change rates are 15.7% and 65%, respectively. On the other hand, UI applications have rates of 3.3% and 14.5% for color and pixel change values, respectively. These numbers show that 2D and UI applications exhibit higher temporal color coherence relative to 3D applications.

In addition to higher temporal coherence, we found that UI applications tend to use fewer colors. Figure 3 demonstrates how a small number of frequent pixel color values dominates a typical UI application compared to a 3D one. Figure 3 shows the cumulative distribution function of colors used in Twitter UI (a), compared to a 3D game, Temple Run 2 (b). In (a), the top 100 most common color values cover over 80% of the frame's surface, while coverage is 10% for Temple Run 2. Measuring compressibility with Shannon entropy, we found that Twitter has an entropy of 4.5 bits per pixel, while it is 14 bits per pixel for Temple Run, indicating higher compressibility for Twitter.

Figure 2: Pixel change and Color change in Google Chrome (x-axis: frame number; y-axis: change as % of pixels; annotated events: chrome, new web search, scrolling, loading BBC News, loading Amazon). In most cases, pixel change is higher than color change; this means that content is moving but not changing most of the time. Our compression scheme takes advantage of lower color change between frames to predict compression palettes.

The next section shows how to take advantage of temporal coherence and the color characteristics of UI applications to design a dynamic color palette scheme for compressing UI and 2D surfaces.

4. DYNAMIC COLOR PALETTES (DCP) COMPRESSION

DCP is a technique to exploit graphics temporal color coherence for framebuffer compression. For each frame, DCP carries out two operations in parallel: color frequency collection and framebuffer compression. For color frequency collection, DCP tracks the most frequently used colors as the rendering of a frame progresses; meanwhile, DCP works on compressing the pixels in the frame with a palette constructed using the frequency information of the previous frame.

DCP has two main advantages over previous dictionary-based techniques [18, 19]. First, it employs sampling to exploit temporal coherence to predict future dictionary values on-the-fly, alleviating the need for software hints or a multi-stage dictionary update process. This allows DCP to compress intermediate surfaces (i.e., application surfaces) as well as the framebuffer surface used by the display unit. Second, as we will show later, DCP maximizes compression using adaptive dictionary sizing, which puts to use the color frequency data collected for each frame.

DCP relies on two structures (shown in Figure 4): the Frequent Values Collector (FVC) for color frequency collection, and the Common Colors Dictionary (CCD) for compressing new pixels. The FVC identifies the most commonly occurring colors, while the CCD encodes the most frequent colors as identified by the FVC from the previous frame. As shown in Figure 4, in each frame the FVC collects color frequency information that is then used to construct the CCD of the next frame.

4.1 DCP workflow

Figure 5 shows the DCP workflow. In (1), the GPU commits tiles to the off-chip memory in multiple batches, i.e., blocks of spatially adjacent pixels [23, 24, 25]. For the example in Figure 5, we use a block size of 4 × 4 pixels and a sub-block size of 2 × 2. Pixels in each block are sent to the FVC (a1) and the CCD (b1).
In (a1), the FVC uses pixel values in each block to update the common color frequencies of the current frame (more details on that in the next section).

In (b1), the CCD compresses pixel blocks in batches of sub-blocks. In (b2), if all pixel values in a sub-block have an entry in the CCD, the sub-block is determined to be compressible. Each color value in a compressible sub-block is represented using $\log_2(\text{CCD size})$ bits, e.g., 6 bits per pixel for a CCD with 64 entries. If one of the pixel values in a sub-block does not have a CCD entry, then the whole sub-block remains uncompressed. Compressed and non-compressed sub-blocks are buffered, and once a full block is processed, it is then written to the off-chip memory (b3).

Like other block-based compression schemes [1, 26], DCP uses a metadata compression status buffer (CSB) that contains a compression status bit for each sub-block. Upon compressing a sub-block, the corresponding entry in the CSB is set (b4), and upon reading a compressed surface, the CSB is consulted to determine how much data should be fetched from memory.

4.2 The Frequent Values Collector (FVC)

The FVC is a relatively small (e.g., 16 to 128 entries) associative memory structure. The FVC stores a set of pixel values and their corresponding frequencies as value-frequency pairs. For each pixel access, the FVC determines if a pixel already has an entry in the FVC; if so, the FVC increases the corresponding frequency counter by one. However, because FVC size is limited, the FVC uses an eviction policy to determine which pixel frequencies to keep track of. Similar to a fully associative cache, the FVC uses the least frequent color (LRC) policy, where it evicts the pixel value with the smallest frequency when an entry is needed to track the frequency of a new pixel value.

Hardware Cost.
Each FVC entry contains a color value (32 bits for RGBA), a validity flag (1 bit), and a counter with $\log_2(\text{number of screen pixels})$ bits. For example, a 64-entry FVC sized for a 4k × 4k display will only require 456 bytes of storage.

Figure 3: The cumulative distribution function (CDF) of unique color values in UI (Twitter) and 3D (Temple Run 2) Android applications. (a) Twitter. (b) Temple Run 2.

Figure 4: Using DCP across frames.

Figure 5: DCP stages.

Figure 6: Reading a DCP compressed surface.

4.3 The Common Colors Dictionary (CCD)

The CCD is used to encode compressed pixels. At the end of each frame, the FVC holds the frequencies of the estimated most common colors. The FVC is then used to construct the CCD for the next frame. Each CCD entry maps a pixel value to a dictionary (encoding) value. The CCD is implemented using a fully associative structure.

When reading a surface, the mapping of the CCD is reversed to decompress encoded pixels. We call the direct mapped structure that holds this reversed mapping the rCCD.
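The interplay between the FVC, the CCD and the rCCD described in Sections 4.1 to 4.3 can be sketched in software as follows. This is a behavioral toy model under stated assumptions, not the hardware design: the class and function names are hypothetical, dictionary codes are assumed to be assigned in order of decreasing frequency, and ties in the eviction policy are broken arbitrarily.

```python
class FrequentValuesCollector:
    """Toy FVC: a bounded table of color -> frequency counters."""

    def __init__(self, entries=64):
        self.entries = entries
        self.counts = {}

    def observe(self, color):
        """Count one pixel, evicting the least frequent color if full."""
        if color in self.counts:
            self.counts[color] += 1
            return
        if len(self.counts) >= self.entries:
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[color] = 1

    def build_ccd(self):
        """Return (ccd, rccd): color -> code, and code -> color as a list."""
        ranked = sorted(self.counts, key=self.counts.get, reverse=True)
        ccd = {color: code for code, color in enumerate(ranked)}
        return ccd, ranked


def compress_sub_block(pixels, ccd):
    """Return (csb_bit, payload): codes if every pixel hits the CCD."""
    if all(p in ccd for p in pixels):
        return 1, [ccd[p] for p in pixels]
    return 0, list(pixels)  # one miss leaves the whole sub-block raw


def decompress_sub_block(csb_bit, payload, rccd):
    """Invert compress_sub_block using the reversed mapping."""
    return [rccd[code] for code in payload] if csb_bit else list(payload)


def fvc_storage_bytes(entries, screen_pixels, color_bits=32):
    """Storage for entries of color + 1 validity bit + frequency counter."""
    counter_bits = (screen_pixels - 1).bit_length()  # ceil(log2(pixels))
    total_bits = entries * (color_bits + 1 + counter_bits)
    return (total_bits + 7) // 8
```

As a check on the arithmetic, a 4k × 4k display has 2^24 pixels, so each entry needs 32 + 1 + 24 = 57 bits, and 64 entries need 3648 bits, i.e., the 456 bytes quoted in Section 4.2.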
Upon compressing a frame, or a set of frames, the rCCD mapping is attached to the frame and stored in main memory. Later on, when the frame is read, the rCCD is used to decompress the frame as described in Section 4.5 below.

Hardware Cost. A CCD/rCCD with 64 entries only requires 264 bytes of storage.

4.4 The Compression Status Buffer (CSB)

Similar to other block-based compression algorithms [1, 13, 14, 9], a metadata buffer is used to hold the status of each compression block. For DCP, the CSB indicates whether a given sub-block is compressed, where the CSB holds one bit per sub-block. In our baseline, this translates to a cost of 1 bit per 128 bits of surface data. Both the CSB and the rCCD are needed to read a compressed frame, as explained in the next section.

4.5 Reading a Compressed Framebuffer Surface

Figure 6 shows the process of reading a compressed surface. It starts with loading the corresponding rCCD and CSB. To read a pixel, CSB entries are decoded to determine the size of compressed data and how many bytes should be fetched for each block. To avoid double latency, and since CSB size is relatively small, the CSB can be prefetched to a small on-chip buffer/cache. Once the CSB is used to determine the size of a compressed block, the block is then fetched and the rCCD is used to decompress the values in each sub-block, as shown in Figure 6.

4.6 Multi-Surface Support

Multiple Render Targets (MRT): Some graphics applications may render to multiple target surfaces. Techniques that use MRT, like deferred shading, are popular in 3D applications and used to render scenes with complex lighting [27]. To support multiple render targets, we need to replicate some of the structures in Figure 5 to match the maximum possible number of target surfaces. DCP will need a single FVC and a single CCD unit per render target.
However, there is no need for additional FVC and CCD units if multiple passes are used to process MRT. Since most UI and 2D workloads render to a single target, a typical hardware implementation may only need to support a single render target, and the rare case of multiple targets is handled by using DCP with just a single surface. However, as discussed in Section 7.4, adding extra structures is relatively cheap and costs little chip area.

Multi-Surface Composition: Contemporary compositor engines can composite up to 16 surfaces in one pass [28]. To support multi-surface composition, the number of rCCD structures in Figure 6 should match the number of surfaces that can be composited in parallel.

4.7 Coupling DCP with Other Compression Algorithms

DCP targets common UI and 2D applications. Other compression algorithms are better suited to 3D and some 2D applications. Industry practitioners have proposed supporting multiple compression algorithms [2, 26]. This means that in a hybrid scheme, each block can be compressed either using DCP or an alternative algorithm. In Section 7.5 we evaluate the results of combining DCP with RAS.

4.8 Dynamically Enabling DCP

In this section, we explain how DCP can be enabled/disabled based on the expected compression performance. DCP performance can be predicted using the frequencies collected by the FVC at the end of a frame. By adding the frequency values in the FVC, then comparing the sum to the total number of pixels (sample size), we can calculate what we call FVC coverage, which can be used to predict DCP performance, where:

$$\text{FVC coverage} = \frac{\text{Sum of FVC frequencies}}{\text{Number of samples}}$$

By defining a coverage threshold (CT) and comparing it to FVC coverage, DCP can be used only if FVC coverage ≥ CT. By periodically enabling the FVC, e.g., once every n frames, FVC coverage can be updated and used to determine if DCP should be enabled. For an N-entry FVC, calculating coverage takes N − 1 integer additions and one division per frame.

Figure 7: Compression rates vs. FVC coverage (x-axis: workload FVC coverage, sorted by compression rate; y-axis: compression rate).

Figure 7 shows FVC coverage vs. compression rates across the workloads in Table 3. It is clear that higher compression rates are achieved with higher FVC coverage. In our set of workloads, using DCP with FVC coverage ≥ 0.7 seems to achieve good compression rates (> 2). Figure 7 also shows some cases where larger FVC coverage yields lower compression. These cases represent workloads that exhibit sudden changes in frames; as a result, temporal coherence is lower than that of other benchmarks with similar FVC coverage. Two examples from Figure 7 (the two large dips at the right end) are Unwind, which exhibits a UI with changing color brightness, and Super Hexagon, which exhibits an interface that continuously switches theme colors.

5. DCP SCHEMES

5.1 Baseline DCP

In baseline DCP, the CCD is constructed using all FVC entries; thus, the number of entries in the CCD will always match the FVC, and compressed blocks will have a fixed size of $\log_2(\text{FVC size}) \times (\text{pixels per block})$ bits.

Memory layout and effective compression rates. Figure 8 shows the memory layout of a DCP compressed surface. Space allocated to DCP blocks (0-2) is fixed ($S_0$, i.e., the size of an uncompressed block). On the other hand, the actual utilized space is determined by the size of compressed data ($S_2$).
But because DRAM reads and writes data blocks in multiples of the burst size, a block that should be compressed by $S_0/S_2$ will have an effective compression rate of $S_0/S_1$, where $S_1$ is the size of the DRAM bursts needed to read the compressed data. In this work, we use the effective compression rate, which reflects the reduction in memory bandwidth.

Figure 8: DCP memory layout. Each block is allocated $S_0$ bits in DRAM; the DCP compressed data occupies $S_2$ bits, and the effective size read from or written to DRAM is $S_1 = N \times$ (DRAM burst size).

In the remainder of this section, two variants of DCP (ADCP and VDCP) are introduced in addition to a hybrid scheme combining DCP and RAS (HDCP).

5.2 Adaptive DCP (ADCP)

ADCP is a variation of DCP that uses the distribution of frequent color values in the FVC to adjust the number of CCD entries. ADCP looks for the best trade-off between the number of compressible blocks and the size of their encoding. Frames of different applications, or different frames within the same application, may perform better/worse under larger/smaller palette sizes.

Figure 9: DCP compression vs. CCD size (x-axis: number of FVC/CCD entries, 8 to 512; series: Kindle and Facebook).

Figure 9 shows DCP compression rates for Facebook and Kindle using 16 to 512 entry CCDs. Kindle, with its simple text, achieves higher compression rates using smaller CCDs. On the other hand, Facebook achieves the best compression rate using a 256-entry CCD. A larger CCD covers a wider range of values and is able to compress more blocks, while a smaller CCD uses smaller encoding sizes.
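The trade-off described above can be made concrete with a rough estimate. The sketch below is illustrative only: it assumes a fixed fraction of pixels (`coverage`) falls in compressible blocks, ignores CSB metadata and DRAM burst rounding, and uses hypothetical names and made-up coverage values.

```python
import math


def palette_compression_rate(coverage, ccd_entries, pixel_bits=32):
    """Estimated compression rate when `coverage` of the pixels sit in
    compressible blocks encoded with log2(ccd_entries) bits per pixel.
    Assumes ccd_entries >= 2 so the average size is never zero."""
    code_bits = math.log2(ccd_entries)
    avg_bits = coverage * code_bits + (1.0 - coverage) * pixel_bits
    return pixel_bits / avg_bits


# Hypothetical coverage curve for one frame (illustrative numbers only):
# larger palettes cover more pixels but cost more bits per pixel.
coverage_by_size = {2: 0.40, 16: 0.75, 64: 0.93, 256: 0.97}
rates = {n: palette_compression_rate(c, n) for n, c in coverage_by_size.items()}
best_size = max(rates, key=rates.get)
```

In this made-up curve the best palette is neither the smallest nor the largest one, which is the situation ADCP is designed to detect.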
F or example, if a frame that uses 32-bit pixels with blo cks that are 80% white, 18% blue, 1% black and 1% red uses a 2-entry CCD, 98% of the blo cks can be compressed using 1 bit p er pixel for a total compression rate of 19.75:1 (ignoring metadata ov erhead). Another option is to use a 4-entry CCD to compress all the frame using 2 bits per pixel producing a compression rate of 16:1. ADCP tries to optimize CCD size for eac h case b y activ ely predicting the optimal num b er of CCD entries. CCD size determines encoding sizes and subsequently the size of compressed blo cks. ADCP uses FV C to pre- dict the optimal CCD size using Algorithm 1. In Al- gorithm 1, FVC frequencies, sorted from most to least frequen t in FVC V al, are used as input. Note that to simplify calculations, DRAM burst size and pixels lay- out w ere ignored. ADCP has a negligible ov erhead; the num b er of iter- ations in Algorithm 1 dep ends on the n umber of FVC en tries. F or example, for 64-en try FV C, the loop will only execute six times (i.e., log 2(FVC size)). Algorithm 1 Predicting optimal CCD size INPUTS (F rameSizePixels, PixelSizeBits, FVC V al, Max FVC Size) . predicted compressed frame size in bits expected frame size = F rame W × H*PixelSizeBits . Optimal CCD entries = 2 opt CCD opt CCD = 0 for i =0 to log 2(Max FVC Size) do sum = SumF requencies(FV C V al(0) to FVC V al(2 i -1)) frame size = sum * i + (F rameSizePixels-sum)* Pixel- SizeBits if frame size < exp ected frame size then expected frame size = frame size opt CCD = i end if end for return 2 opt CCD C D D C 0 C 1 C 2 C 3 C 4 C 5 C 6 C 7 0 1 2 3 4 5 6 7 CCD C 0 C 1 C 2 C 3 C 4 C 5 C 6 C 7 000 001 010 01 1 100 101 1 10 111 { 10,1 1} { 0,1} { φ } { 001,101 } { 1 10,1 1 1} { C 3 ,C y } { C x ,C 4 } { C 2 , C 3 } {C 0 , C 1 } {C 0 , C 0 } { C 1 , C 5 } { C 6 , C 7 } { C 3 , C y } { C x , C 4 } 010 001 000 01 1 01 1 111 111 B l ock E ncoding C SB (a) (b) Figure 10: VDCP example enco ding (a). CCD entries (b). 
In Figure 10, panel (b) shows the encoding of 2-pixel blocks using the CCD in panel (a).

5.3 Variable DCP (VDCP)

VDCP is another DCP variation that goes further than ADCP by adapting palette sizes to optimize compression at the sub-block level. VDCP uses variable-length coding by changing the number of rCCD entries used to encode/decode each sub-block. VDCP reduces the number of encoding bits per pixel to $i = \lceil \log_2(\max(\text{pixel color index})) \rceil$, which means that i is determined by the pixel within the sub-block that has the highest index (i.e., lowest frequency) in the CCD. Note that the example in Figure 10 matches this formula when CCD positions are counted from 1 (so a sub-block whose least frequent color is C1 needs 1 bit). With VDCP, the CSB is used to determine the number of rCCD entries used for each sub-block, where the number of rCCD entries equals $2^{\text{CSB value}}$ (i.e., encoded colors fall in the first $2^{\text{CSB value}}$ CCD entries), and a special CSB value is used for uncompressed sub-blocks.

Figure 10 shows a VDCP example. In Figure 10.a, an example CCD is shown, where the most frequent color value, C0, is encoded to 000 and the least frequent value, C7, is encoded to 111. Figure 10.b shows the VDCP encoding for seven 2-pixel sub-blocks. As shown in the figure, the CSB tracks each sub-block's encoding. 000₂ in the CSB indicates that only the most frequent color in the CCD, C0, is used, while 001₂ indicates that the top 2 CCD colors, {C0 and C1}, are used, and so on. 111₂ is used for uncompressed sub-blocks.

In Figure 10.b, the first row shows a sub-block with values C2 and C3, which means that only the top 2^2 entries in the CCD are used for encoding the sub-block. Subsequently, the corresponding CSB entry is set to 010₂, and each pixel color is encoded using 2 bits. The second sub-block in Figure 10.b contains {C0, C1}, encoded using the top 2^1 CCD entries (1 bit per color), and the corresponding CSB entry is set to 001₂. The third sub-block contains only C0, and the corresponding CSB value in this case is 000₂, which indicates the content of the entire sub-block (since only the top entry in the table is used). Note that, to make this example easy to follow, the CCD is shown with eight entries (instead of 64), so the CSB values 100₂ to 110₂ are not in use.

Table 1: Baseline configurations
  Baseline DCP configurations
  FVC size                  64 entries
  FVC replacement policy    Least-frequent value
  CCD size                  64 entries
  Pixel block size          8 × 8
  Pixel sub-block size      2 × 2
  CSB bits per sub-block    1 (DCP, ADCP and HuffDCP), 3 (VDCP), 5 (HDCP)
  Memory burst size         128 bits
  Pixel sampling rate       1:1

5.4 Hybrid DCP (HDCP)

DCP is only effective on a subset of applications. Ideally, it should be used with other compression algorithms. We evaluate a Hybrid DCP that combines DCP with RAS. HDCP compresses each block using both DCP and RAS and uses the result with the higher compression rate. To support the additional compression modes, the number of CSB bits is increased. Results in Section 6 show that this technique produces higher compression rates at the cost of additional on-chip computations.

5.5 DCP Implementation

In addition to hardware structures (FVC, CCD and rCCD), DCP requires some support from the software layer. To implement DCP, the graphics driver attaches DCP data as part of the state associated with a surface (along with other state data such as size and formatting). For VDCP, Algorithm 1 can be added to the driver as well, where it can calculate the next CCD size at the end of each frame.

6. METHODOLOGY

Our experimentation configurations are listed in Table 1.
We calculated compression rates using a model that assumes a tile-based GPU architecture. Our model works as follows:

• First, we feed the frames of each workload to our model, which then splits each frame into 8×8 blocks.
• For each block, the model calculates the compressed size of each sub-block. The sizes of compressed and uncompressed sub-blocks are added to calculate the compressed size of the block.
• The compressed block size is then used to calculate the number of DRAM bursts required. The model then calculates the total bandwidth consumed by a compressed frame by summing the number of DRAM bursts of all the blocks in the frame.

Note that the model computes compression rates starting from the second frame, using the first frame to populate the first FVC and CCD.

We evaluated surface compression using our set of randomly chosen popular Android applications (Table 2). Our traces will be published and made available for any future studies.

We split applications into three groups: UI applications, 2D applications, and 3D games. All of our benchmarks use OpenGL ES and render to a single target buffer (up to OpenGL ES 2.0, MRT is only supported through vendor extensions [29]).

We manually interacted with each application to execute a simple task. In total, we used 34468 frames that represent 124 applications (shown in Table 3). We only consider regions of interest in each workload that represent the typical use case of the workload (i.e., loading/initialization frames are not considered).

The rest of the configurations are listed in Table 1. The effective compression rate and metadata overhead are taken into account when calculating the total compression rate. We use a block size of 8 × 8 pixels (256 bytes), which matches the block size used by RAS. In addition to DCP, we evaluate two lossless methods described in Section 2.
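The burst-rounding step of the evaluation model can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the names are hypothetical, the input is assumed to be the per-block compressed size in bits (already including any CSB metadata), and the constants follow Table 1 (128-bit bursts, 8 × 8 blocks) with 32-bit pixels, which gives the 256-byte block mentioned in the text.

```python
import math

BURST_BITS = 128        # memory burst size from Table 1
BLOCK_PIXELS = 8 * 8    # pixel block size from Table 1
PIXEL_BITS = 32         # assumed pixel size (8 x 8 x 32 bits = 256 bytes)


def effective_block_bits(compressed_bits):
    """Size actually transferred for one block: whole DRAM bursts only (S1)."""
    return math.ceil(compressed_bits / BURST_BITS) * BURST_BITS


def effective_compression_rate(compressed_bits_per_block):
    """S0 / S1 over a frame, given each block's compressed size in bits (S2)."""
    s0 = len(compressed_bits_per_block) * BLOCK_PIXELS * PIXEL_BITS
    s1 = sum(effective_block_bits(bits) for bits in compressed_bits_per_block)
    return s0 / s1


# Three blocks compressed to 100, 128 and 129 bits need 1, 1 and 2 bursts.
print(effective_compression_rate([100, 128, 129]))
```

In this example the raw rate S0/S2 would be 6144/357, roughly 17.2, while the effective rate drops to 6144/512 = 12 because the 129-bit block occupies two full bursts.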
RED is based on Nvidia's compression [2], and RAS is based on the work of Rasmusson et al. [1]. RAS is a prediction-based algorithm that predicts the value of a pixel using neighboring pixel values. The difference between the prediction and the actual value is then encoded using Golomb-Rice coding. We used the parameters suggested by Rasmusson et al. [1], namely 8 × 8 blocks and, as described in the paper, we set the value of the Golomb-Rice parameter k by exploring values between 0 and 6, use k = 7 for the "special mode", and use the suggested "3 sizes mode" for higher compression rates. We organize color values by their color channel as described in Ström et al. [14]. We experimented with RAS using RGBA and YCoCg formats and found that for many applications, particularly UI and 2D, RAS shows favorable results using RGBA channels. We therefore used RAS with RGBA channels in our comparison.

For CSB, DCP and ADCP use 1 bit per sub-block. VDCP uses 3 bits per sub-block; with an FVC size of 64, seven combinations are used (1, 2, 4, 8, 16, 32 and 64), plus a combination for non-compressed sub-blocks.

To compare against techniques that use Huffman coding [19], a Huffman coded DCP (HuffDCP) is implemented, where FVC frequencies are used to construct a CCD with variable-length Huffman coding.

7. RESULTS AND DISCUSSION

7.1 DCP Schemes

To compare DCP schemes, we isolate the effect of memory burst size and only take into account CSB overhead. Later, bursts are taken into account when comparing DCP, RAS and RED. We compare baseline DCP against ADCP, VDCP and HuffDCP.

Table 2: List of Android workloads
  UI (1-92): 1 Android Settings, 2 Morecast, 3 Poweramp, 4 Speedest, 5 Twitter, 6 Facebook, 7 Twitch, 8 Wish, 9 Imgur, 10 Soundcloud, 11 Automate, 12 Musixmatch, 13 Airbnb, 14 CBS Sports, 15 Etsy, 16 Android Home, 17 Pinterest, 18 Aldiko, 19 Letgo, 20 Yelp, 21 Android Messaging, 22 BBC iPlayer, 23 Tachiyomi, 24 gReader, 25 Google Maps, 26 Pocket, 27 ES File Explorer, 28 Chrome, 29 Applock, 30 Accuweather, 31 Flipboard, 32 Booking.com, 33 Shazam, 34 Zedge, 35 Indeed, 36 Runkeeper, 37 Steam, 38 Khan Academy, 39 The Weather Channel, 40 Yahoo Finance, 41 Tapatalk, 42 Kickstarter, 43 Amazon Store, 44 Zomato, 45 Spotify, 46 Runtastic, 47 theScore, 48 Food Network, 49 MX Player, 50 VLC, 51 Yellowpages, 52 Eye in the Sky, 53 OfficeSuite, 54 Dictionary.com, 55 Walgreens, 56 Walmart, 57 CNN, 58 File Commander, 59 Terminal Emulator, 60 Adobe Acrobat, 61 Android Call, 62 Gallery, 63 Feedly, 64 Baconreader, 65 aCalendar, 66 Bakareader, 67 Kindle, 68 eBay, 69 Venmo, 70 Mcdonalds, 71 Colornote, 72 Reddit, 73 Checkout 51, 74 Tasker, 75 IFTTT, 76 Textra, 77 WPS Office, 78 People Contacts, 79 Unit Conv. Ult., 80 Skyscanner, 81 Calendar, 82 Merriam Webster, 83 ESPN, 84 Tumblr, 85 Quickpic, 86 Duolingo, 87 Clock, 88 Google Messenger, 89 Calculator, 90 Soundhound, 91 Translate, 92 Any.do
  2D (93-112): 93 Candy Crush Saga, 94 Trainyard, 95 Mines, 96 Cut the Rope 2, 97 Angry Birds, 98 Strata, 99 Brain it On, 100 Super Hexagon, 101 Unwind, 102 Color Switch, 103 Impossible Game, 104 Flow, 105 2048, 106 Gyro, 107 99 Problems, 108 Dumb Ways to Die, 109 Piano Tiles, 110 loop, 111 Ultraflow, 112 Okay
  3D (113-124): 113 Traffic Rider, 114 Extreme Car Driving, 115 3D Bowling, 116 Dr. Driving, 117 Paper Toss, 118 Rolling Sky, 119 Stack, 120 Zigzag, 121 Stargather, 122 Commute H. Traffic, 123 Crossy Road, 124 Smashy Road

Figure 11: Compression rates of workloads ordered from left to right following their order in Table 2. (a) DCP schemes compression. (b) Comparing RAS, RED, and VDCP compression rates. (Workloads are grouped as UI, 2D and 3D along the x-axis; circled markers ① to ⑨ in panel (a) identify the regions discussed in the text.)

Figure 12: Harmonic mean of DCP schemes compression rates per application category.
  Scheme     UI     2D     3D
  DCP        2.87   2.22   1.59
  ADCP       4.22   2.62   1.70
  VDCP       5.56   3.02   1.78
  HuffDCP    5.19   2.90   1.80

Figure 13: Comparing RAS, RED and VDCP effective compression rates.
  Scheme     UI     2D     3D
  RAS        2.73   2.90   2.54
  RED        2.77   1.93   1.35
  VDCP       5.26   2.90   1.75

Table 3: System configurations and workloads summary
  System configurations
  Operating system           Android 4.2.2 (API 17)
  Display size               720 × 1280
  Android workloads
  Category                   # of workloads
  UI applications            92 (24031 frames)
  2D applications            20 (7888 frames)
  3D applications            12 (2549 frames)
  Total # of applications    124 (34468 frames)

Figure 11a shows compression rates in each category sorted by baseline DCP compression rate (the same order used in Table 2). The figure shows that baseline DCP is the least effective scheme. UI applications at ①, such as Android Settings, Morecast, and Poweramp, show low compression rates of less than 2. After examining these applications, we found that they feature gradient backgrounds and graphical elements that DCP cannot compress using small palettes.

In ② (Zedge) and ③ (Spotify), ADCP achieves higher compression rates than HuffDCP and VDCP. Looking at these applications, we found that they contain a mix of frames with solid backgrounds and frames that contain images, which DCP will mostly not be able to compress.
ADCP can compress frames with solid backgrounds with lower overhead than VDCP since it has a lower CSB overhead. On the other hand, in frames containing images, neither ADCP nor VDCP will be able to perform well, but ADCP will incur lower CSB overhead.

For applications with simple color schemes, such as OfficeSuite ④ and Any.do ⑤, VDCP and DCP achieve high compression rates. Nevertheless, VDCP, ADCP, and HuffDCP were all able to achieve even higher compression rates.

Looking at 2D applications, performance varies significantly. In ⑥, applications with sophisticated graphics, like Candy Crush, Trainyard, Mines, Cut the Rope and Angry Birds, have low compression rates (< 1.7). On the other hand, applications using simpler graphics (e.g., loop, Ultraflow, and Okay) achieve high compression rates, especially with VDCP ⑦. A similar trend is exhibited in ⑧, where graphically rich 3D games (Traffic Rider, Extreme Car Driving, and 3D Bowling) show low compression rates. On the other hand, games like Smashy Road show good compression rates (highest with VDCP, at 10.63). Also, Stargather ⑨, with similar characteristics to the UI applications in ② and ③, shows higher rates with HuffDCP and VDCP.

Figure 12 summarizes the results in Figure 11a. VDCP shows better compression rates for UI and 2D applications, with 5.56 and 3.02 respectively. For 3D games, HuffDCP shows the highest rate (1.80). HuffDCP and VDCP do better with 3D workloads since their compression rates are similar to VDCP but with lower CSB overhead. Interestingly, using Huffman encoding in HuffDCP achieves lower compression rates than VDCP in UI and 2D workloads. This is due to Huffman inefficiencies with probability distributions that are not exact powers of two.
For example, if we have 32-bit values with frequencies of A (49.5%), B (49.5%), C (0.5%) and D (0.5%), then Huffman encoding will assign codes of 1 bit to A, 2 bits to B and 3 bits to C and D, with a total compression rate of 21.12. ADCP and VDCP encode A and B using 1 bit, while keeping C and D uncompressed, resulting in a compression rate of 24.4.

7.2 Comparing VDCP, RAS and RED

Figure 11b compares VDCP with RAS and RED, and Figure 13 summarizes the results in Figure 11b. Memory bursts and CSB overhead were taken into account. For UI applications, VDCP achieves a mean effective compression rate of 5.26, compared to 2.73 for RAS and 2.77 for RED. VDCP performs well with UI and 2D applications. On the other hand, RAS, a more generic compression algorithm, has consistent performance across all workloads. RAS outperforms VDCP in 3D games (2.54, compared to 1.75 for VDCP). Similar to VDCP, RED performs well with UI and 2D applications, but with lower rates than VDCP. VDCP's performance with 3D workloads is the reason for suggesting a hybrid approach consisting of DCP and another general-purpose compression algorithm, similar to what is described in some implementations [2, 26]. A hybrid VDCP-RAS scheme is discussed in Section 7.5.

7.3 Factors affecting FVC Fidelity

In this section we discuss and quantitatively evaluate four factors that affect the FVC and should be considered when using DCP.

FVC Size. Larger FVC sizes can capture frequent colors more accurately, as they are less likely to evict a frequent value from the FVC because of capacity. To evaluate how FVC size affects accuracy, we use relative coverage. For an N-entry FVC, we calculate relative coverage by dividing the number of pixels represented by the N top colors collected by the FVC by the number of pixels represented by the actual N most frequent colors.

Figure 14a shows the effect of FVC size for UI applications.
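The relative coverage metric can be sketched in Python as shown here. This is an illustration under explicit assumptions rather than the paper's hardware design: the FVC is modeled as a fixed-capacity table of counters that evicts the entry with the smallest count when full (a simplified stand-in for a least-frequent replacement policy), and the function names are hypothetical.

```python
from collections import Counter


def fvc_colors(pixels, capacity):
    """Colors left in a simplified FVC model after scanning all pixels.

    Assumption: on a miss with a full table, the entry with the smallest
    count is evicted and the new color starts with a count of 1.
    """
    table = {}
    for color in pixels:
        if color in table:
            table[color] += 1
        elif len(table) < capacity:
            table[color] = 1
        else:
            victim = min(table, key=table.get)   # least-frequent entry
            del table[victim]
            table[color] = 1
    return set(table)


def relative_coverage(pixels, capacity):
    """Pixels covered by the FVC's colors divided by pixels covered by the true top-N colors."""
    exact = Counter(pixels)
    ideal = sum(count for _, count in exact.most_common(capacity))
    captured = sum(exact[color] for color in fvc_colors(pixels, capacity))
    return captured / ideal


print(relative_coverage([1, 1, 1, 2], capacity=1))
```

In this tiny example a 1-entry table ends up holding color 2 after evicting color 1, so it covers one pixel where the true most frequent color covers three, which shows how capacity evictions lower coverage.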
A 16-entry FVC has a relative coverage of 94%, compared to 98.3% for a 512-entry FVC. This means a 16-entry FVC is able to capture colors that cover 94% of the area covered by the actual 16 most frequent colors, while the 512-entry FVC is able to capture 98.3% of the coverage that the actual 512 most common colors are able to cover. Figure 14b shows how accuracy affects compression rates, as frequencies collected using larger FVCs are a better representation of the actual most common colors.

Figure 14: Comparing FVC size with relative coverage (a) and its effect on compression rates (b). (a) FVC size vs. relative coverage (x-axis: FVC size, 16 to 512; series: UI and 2D). (b) FVC size vs. VDCP compression rate:
  FVC entries   16     32     64     128    256    512
  UI            4.53   4.85   5.26   5.84   6.57   7.22
  2D            2.57   2.77   2.90   3.21   3.51   3.89

Replacement Policy and Associativity. We evaluated a number of replacement policies: the baseline least-frequent color (LFC), second least-frequent (2LFC), least-recently-used (LRU), and random replacement. The idea behind including 2LFC is to see the effect of avoiding thrashing of newly discovered colors that are prone to eviction. Using UI workloads with a 64-entry FVC, the mean compression rate with LFC is 5.26, while it is 5.25 for 2LFC. On the other hand, LRU and random achieve lower rates of 3.04 and 2.92, respectively.

Figure 15: 64-entry FVC associativity vs. the compression rate of a fully associative FVC. (x-axis: number of sets, 2 to 64; y-axis: normalized compression rate; series: UI and 2D.)

We also evaluated changing FVC associativity from fully associative to direct-mapped, and used color channel values to determine the set.
As expected, FVC performance degrades as we increase the number of sets (as shown in Figure 15).

Pixel Sampling. We noticed that the FVC can be constructed using a subset of frame pixels, i.e., by sampling only one in every n pixels to collect frequent color statistics. Figure 16 illustrates the effect of pixel sampling on VDCP. We evaluate sampling rates from 1:1 (every pixel accesses the FVC) to 1:16384. 1:16 sampling achieves 98.7% (UI) and 102% (2D) of the compression achieved by 1:1 sampling. We expect that the slightly higher compression rate for 2D workloads is caused by sampling working as a noise filter.

Figure 16: Normalized compression rates vs. FVC pixel sampling rate for UI applications. (x-axis: pixel sampling rate; y-axis: normalized compression rate; series: UI and 2D.)

Frame Sampling. In frame sampling, the same CCD is used for a number of frames (N) instead of for just one frame. We vary the sampling period (N) for VDCP between 1 (every frame) and 60 frames. Figure 17 shows compression rates relative to N=1. VDCP maintains good compression rates with N=2, with a relative compression rate of 97%. Compression rates, however, decrease significantly with higher N values, reaching 44.6% and 43.3% for N values of 50 and 60, respectively.

Figure 17: VDCP normalized (to sampling period of 1) compression rates vs. FVC frame sampling period. (x-axis: sampling period in frames, 2 to 60; y-axis: normalized compression rate; series: UI and 2D.)

7.4 Implementation Cost and Energy Savings

Section 5 mentions the storage requirements associated with DCP. Specifically, for a 64-entry FVC/CCD, 456 bytes are needed for the FVC and 264 bytes for the CCD. The cost of rCCDs is (264 bytes) × (maximum number of surfaces that can be read in parallel).
Current systems support up to 16 surfaces [28].

For energy, we used DRAMPower v4.0 [30] to estimate the energy cost of accessing a MICRON 1600 x32 LPDDR3 DRAM. We found the cost of DRAM accesses to be around 451.2 pJ/byte (this number excludes DRAM idle energy and other system energy costs such as the interconnection network). For DCP, we used CACTI v7.0 [31] with the 22nm process to estimate the area, energy and latency of DCP structures, as shown in Table 4.

Table 4: DCP structures hardware cost.
  Structure  Type   Area cost (mm^2)  Leakage power (mW)  Access cost (pJ)  Access latency (ns)  Max bandwidth (MPixels/s)
  FVC        CAM    0.00304232        0.881899            0.766572          0.131695             7241
  CCD        CAM    0.00197988        0.39                0.402             0.128338             7431
  rCCD       Cache  0.000281592       0.281112            0.104106          0.0722227            13203

Using the numbers in Table 4, the total DCP area cost with support for 16 surfaces equals 0.009527672 mm^2. To compare this area with current hardware, it is less than 0.003% of Nvidia's Xavier die area [32]. For the dynamic energy cost of compressing/decompressing a byte using DCP, we found it to be around 1.3 pJ/byte, i.e., less than 0.29% of the DRAM access cost.

Energy savings. DRAM consumes around 199.6 mW (629.4 mW including static power) for framebuffer operations under a typical rate of 60 FPS using HD frames (GPU writing/display controller reading, or 949.21 MB/s). We calculated the total DCP compression/decompression static and dynamic energy consumption (4.83 pJ/byte) and compared it to the DRAM dynamic energy consumption only (451.2 pJ/byte). We found that VDCP reduces the energy consumed by framebuffer operations by 79.9% for UI apps, 64.4% for 2D apps, and 41.8% for 3D apps.

Figure 18: Ratio of DCP vs. RAS compressed blocks across all workloads. (x-axis: workloads ordered by the ratio of VDCP compressed blocks; y-axis: ratio of blocks compressed using VDCP, 0 to 1; series: VDCP and RAS; average VDCP ratio: 0.49.)
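The cost figures quoted in this section can be cross-checked with a few lines of Python. This is a minimal sketch: the constant and function names are hypothetical, the numbers are copied from Table 4 and the text, and the cost model (one FVC, one CCD and one rCCD per concurrently read surface) is an assumption that happens to reproduce the quoted total area.

```python
# Values copied from Table 4 and Section 7.4 of the text.
AREA_MM2 = {"FVC": 0.00304232, "CCD": 0.00197988, "rCCD": 0.000281592}
STORAGE_BYTES = {"FVC": 456, "CCD": 264, "rCCD": 264}
MAX_SURFACES = 16                 # surfaces read in parallel [28]
DRAM_PJ_PER_BYTE = 451.2          # DRAM access cost
DCP_DYNAMIC_PJ_PER_BYTE = 1.3     # DCP dynamic compression/decompression cost


def total_cost(per_structure, surfaces=MAX_SURFACES):
    """Assumed cost model: one FVC, one CCD, and one rCCD per surface."""
    return per_structure["FVC"] + per_structure["CCD"] + surfaces * per_structure["rCCD"]


area_mm2 = total_cost(AREA_MM2)
storage_bytes = total_cost(STORAGE_BYTES)
energy_ratio = DCP_DYNAMIC_PJ_PER_BYTE / DRAM_PJ_PER_BYTE

print(f"area: {area_mm2:.9f} mm^2")
print(f"storage: {storage_bytes} bytes")
print(f"DCP dynamic energy relative to DRAM access: {energy_ratio:.2%}")
```

The storage total is derived here from the per-structure byte counts under the same assumed model; it is not a number stated in the text.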
7.5 Hybrid Schemes

Our hybrid compression scheme uses RAS and VDCP. We compress using both algorithms and then use the better of the two. This exploits VDCP's high compression rates for simpler surfaces while falling back on RAS for other cases. RAS+VDCP outperforms RAS and VDCP (with rates of 7.2, 5.206 and 3.23 for UI, 2D and 3D applications, respectively). The ratio of VDCP vs. RAS compressed blocks varies by application, as shown in Figure 18. However, we found that, on average, VDCP and RAS compress an equal number of blocks.

8. CONCLUSION

This work presents surface compression techniques that reduce the off-chip bandwidth of framebuffer operations in energy-constrained mobile devices. In this work, we analyze and characterize the framebuffer surfaces of UI, 2D and 3D applications and highlight the unique characteristics of each.

To evaluate our compression schemes, we created and used a set of workloads that represents 124 popular mobile applications. Our results show that VDCP improves compression by an average of 93% relative to RAS for UI applications, while improving UI and 2D applications over RED by 89% and 50%, respectively.

DCP focuses on 2D and UI applications and can complement other generic compression algorithms. We evaluated a hybrid VDCP+RAS (HDCP) scheme; the scheme was able to increase compression rates by 163%, 79% and 27% over RAS, and by 159%, 169% and 139% over RED for UI, 2D and 3D applications, respectively.

9. REFERENCES

[1] J. Rasmusson, J. Hasselgren, and T. Akenine-Möller, “Exact and error-bounded approximate color buffer compression and decompression,” in SIGGRAPH/EUROGRAPHICS Conference On Graphics Hardware: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, vol. 4, no. 05, 2007, pp. 41–48.
[2] NVIDIA, “NVIDIA Tegra X1 Whitepaper,” 2015. [Online]. Available: http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
[3] T. J. Olson, “Saving the planet, one handset at a time: Designing low-power, low-bandwidth gpus,” in ACM SIGGRAPH 2012 Mobile, ser. SIGGRAPH ’12. New York, NY, USA: ACM, 2012, pp. 1:1–1:1. [Online]. Available: http://doi.acm.org/10.1145/2341910.2341912
[4] A. SurfaceFlinger, “SurfaceFlinger and Hardware Composer,” 2016. [Online]. Available: https://source.android.com/devices/graphics/arch-sf-hwc.html
[5] Android, “Android: Graphics architecture,” 2016. [Online]. Available: https://source.android.com/devices/graphics/architecture.html
[6] J. Ström and T. Akenine-Möller, “iPACKMAN: high-quality, low-complexity texture compression for mobile phones,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. ACM, 2005, pp. 63–70.
[7] T. Akenine-Möller and J. Ström, “Graphics for the masses: a hardware rasterization architecture for mobile phones,” in ACM Transactions on Graphics (TOG), vol. 22. ACM, 2003, pp. 801–808.
[8] I. Antochi, B. Juurlink, S. Vassiliadis, and P. Liuha, “Memory bandwidth requirements of tile-based rendering,” in Computer Systems: Architectures, Modeling, and Simulation. Springer, 2004, pp. 323–332.
[9] J. Hasselgren and T. Akenine-Möller, “Efficient depth buffer compression,” in SIGGRAPH/EUROGRAPHICS Conference On Graphics Hardware: Proceedings of the 21st ACM SIGGRAPH/Eurographics symposium on Graphics hardware: Vienna, Austria, vol. 3, 2006, pp. 103–110.
[10] A. Khodakovsky, P. Schröder, and W. Sweldens, “Progressive geometry compression,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 271–278.
[11] J. Ström and M. Pettersson, “ETC 2: texture compression using invalid combinations,” in Graphics Hardware, 2007, pp. 49–54.
[12] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and T. Olson, “Adaptive scalable texture compression,” in Proceedings of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics. Eurographics Association, 2012, pp. 105–114.
[13] J. Pool, A. Lastra, and M. Singh, “Lossless compression of variable-precision floating-point buffers on GPUs,” in Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. ACM, 2012, pp. 47–54.
[14] J. Ström, P. Wennersten, J. Rasmusson, J. Hasselgren, J. Munkberg, P. Clarberg, and T. Akenine-Möller, “Floating-point buffer compression in a unified codec architecture,” in Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware. Eurographics Association, 2008, pp. 75–84.
[15] T. J. Van Hook, “Method and apparatus for compression and decompression of color data,” May 2 2006, US Patent 7,039,241.
[16] S. E. Molnar, B.-O. Schneider, J. Montrym, J. M. Van Dyke, and S. D. Lew, “System and method for real-time compression of pixel colors,” Nov. 30 2004, US Patent 6,825,847.
[17] S. L. Morein and M. A. Natale, “System, method, and apparatus for compression of video data using offset values,” Jul. 13 2004, US Patent 6,762,758.
[18] B. H. Danielson, J. J. Watters, and T. J. McDonald, “Method and apparatus for displaying computer graphics data stored in a compressed format with an efficient color indexing system,” Apr. 14 1998, US Patent 5,740,345.
[19] H. Shim, Y. Cho, and N. Chang, “Frame buffer compression using a limited-size code book for low-power display systems,” in Embedded Systems for Real-Time Multimedia, 2005. 3rd Workshop on. IEEE, 2005, pp. 7–12.
[20] D. Scherzer, L. Yang, O. Mattausch, D. Nehab, P. V. Sander, M. Wimmer, and E. Eisemann, “Temporal Coherence Methods in Real-Time Rendering,” in Computer Graphics Forum, vol. 31, no. 8. Wiley Online Library, 2012, pp. 2378–2408.
[21] Flurry Analytics, “Flurry Five-Year Report: It’s an App World. The Web Just Lives in It,” 2013. [Online]. Available: http://www.flurry.com/bid/95723/Flurry-Five-Year-Report-It-s-an-App-World-The-Web-Just-Lives-in-It
[22] Nielsen, “All about Android,” 2011. [Online]. Available: http://www.nielsen.com/us/en/insights/webinars/2011/all-about-android-insights-from-nielsens-smartphone-meters.html
[23] JEDEC, “JEDEC LPDDR2 standard (JESD209-2F),” 2013. [Online]. Available: https://www.jedec.org/standards-documents/results/JESD209-2F
[24] JEDEC, “JEDEC LPDDR3 standard (JESD209-3C),” 2015. [Online]. Available: http://www.jedec.org/standards-documents/results/jesd209-3c
[25] JEDEC, “JEDEC LPDDR4 standard (JESD209-4A),” 2015. [Online]. Available: http://www.jedec.org/standards-documents/results/jesd209-4a
[26] N. Kulshrestha, D. K. McAllister, and S. E. Molnar, “Selecting and representing multiple compression methods,” Oct. 7 2010, US Patent App. 12/900,362.
[27] Unity3D, “Deferred Shading Rendering Path.” [Online]. Available: https://docs.unity3d.com/Manual/RenderTech-DeferredShading.html
[28] Vivante Corporation, “COMPOSITION PROCESSING CORES (CPC).” [Online]. Available: http://www.vivantecorp.com/index.php/en/technology/composition.html
[29] Khronos, “OpenGL ES Extension 91,” 2016. [Online]. Available: https://www.khronos.org/registry/gles/extensions/NV/NV_draw_buffers.txt
[30] K. Chandrasekar, C. Weis, Y. Li, B. Akesson, N. Wehn, and K. Goossens, “DRAMPower: Open-source DRAM power & energy estimation tool,” URL: http://www.drampower.info, vol. 22, 2012.
[31] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “CACTI 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Trans. Archit. Code Optim., vol. 14, no. 2, pp. 14:1–14:25, Jun. 2017. [Online]. Available: http://doi.acm.org/10.1145/3085572
[32] M. Ditty, A. Karandikar, and D. Reed, “Nvidia’s Xavier SoC,” 2018. [Online]. Available: https://www.hotchips.org/hc30/1conf/1.12_Nvidia_XavierHotchips2018Final_814.pdf
