A Multiplierless Pruned DCT-like Transformation for Image and Video Compression that Requires 10 Additions Only

Vitor A. Coutinho∗, Renato J. Cintra, Fábio M. Bayer†, Sunera Kulasekera, Arjuna Madanayake‡

Abstract

A multiplierless pruned approximate 8-point discrete cosine transform (DCT) requiring only 10 additions is introduced. The proposed algorithm was assessed in image and video compression, showing competitive performance with state-of-the-art methods. Digital synthesis in 45 nm CMOS technology up to place-and-route level indicates a clock speed of 288 MHz at a 1.1 V supply. The 8 × 8 block rate is 36 MHz. The DCT approximation was embedded into HEVC reference software; resulting video frames, at up to 327 Hz for 8-bit RGB HEVC, presented negligible image degradation.

Keywords: Approximate discrete cosine transform, pruning, pruned DCT, HEVC

1 Introduction

The discrete cosine transform (DCT) plays a fundamental role in signal processing techniques [1] and is part of modern image and video standards, such as JPEG [2], MPEG-1 [3], MPEG-2 [4], H.261 [5], H.263 [6], H.264/AVC [7, 8], and the high efficiency video coding (HEVC) [9, 10]. In particular, the transform coding stage of the H.264 and HEVC standards employs the 8-point DCT of type II [10, 11] among other transforms of different blocklengths, such as 4, 16, and 32 points [12–14]. In [15], the 8-point DCT stage of the HEVC was optimized. Among the above-mentioned standards, the HEVC is capable of achieving high compression performance at approximately half the bit rate required by H.264/AVC with the same image quality [10, 13, 15, 16]. However, HEVC possesses a significant computational complexity in terms of arithmetic operations [11, 13, 15, 16]. In fact, HEVC can be 2–4 times more computationally demanding when compared to H.264/AVC [13, 15].
Therefore, the proposal of efficient low-complexity DCT-like approximations can benefit future video codecs, including emerging HEVC-based systems.

Recently, low-complexity DCT approximations have been considered for image and video processing [12, 15, 17–24]. Such approximate transforms can offer meaningful DCT estimations at the expense of small errors. Such a trade-off is often acceptable, leading to low-power, high-speed hardware realizations [15] while ensuring adequate numerical accuracy.

In some applications, such as data compression [25], high-frequency components are often zeroed by the quantization process. Thus, one may judiciously restrict the computation to the quantities that are likely to remain significant [26]. This approach is called pruning and was originally proposed as a method for computing the discrete Fourier transform (DFT) [27, 28].

∗ Vitor A. Coutinho and Renato J. Cintra are with the Signal Processing Group, Departamento de Estatística and the Graduate Program in Electrical Engineering, Universidade Federal de Pernambuco, Recife, PE, Brazil (e-mail: rjdsc@stat.ufpe.org).
† Fábio M. Bayer is with the Departamento de Estatística and Laboratório de Ciências Espaciais de Santa Maria (LACESM), Universidade Federal de Santa Maria, Santa Maria, RS, Brazil (e-mail: bayer@ufsm.br).
‡ Sunera Kulasekera and Arjuna Madanayake are with the Department of Electrical and Computer Engineering, The University of Akron, Akron, OH, USA (e-mail: arjuna@uakron.edu).

1.1 Related Works

In that context, pruning was applied in the time domain, i.e., particular input samples were ignored and the operations involving them were avoided [29]. Frequency-domain pruning, which discards transform coefficients, is an alternative approach.
This latter approach has been recently applied in mixed-radix FFT algorithms [30], cognitive radio design [31], and wireless communications [32]. Another example of a pruning-like algorithm is the well-known Goertzel method for DFT computation [33–35].

For the DCT case, pruning was originally proposed by Wang in [36] considering a decimation-in-time algorithm for power-of-two blocklengths. In [37], such an algorithm was generalized for arbitrary blocklengths. In [38], Lecuire et al. extended the pruning method to the two-dimensional case, referring to the method as the zonal DCT, which is an alternative terminology. In [39], Karakonstantis et al. proposed a hardware-based pruning approach for the DCT computation. Instead of discarding high-frequency DCT coefficients, a VLSI system capable of computing low-frequency components using faster paths was suggested. Such a method was applied in the context of voltage scaling for low-power dissipation. In the context of low-powered wireless vision sensor networks, a pruned approximate DCT was proposed in [40] based on the DCT approximation theory advanced in [20].

In [41], the pruning terminology was employed in a different context. It was used to describe a hardware computation of the DCT which maintains the system word size constant by means of a controlled discarding of least-significant bits.

1.2 Aims

In response to the growing need for high compression ratios for image and moving pictures as prescribed in [9], we propose a further reduction of the computational cost of the DCT computation in the context of JPEG- and HEVC-like coding and processing. The goal of this paper is to offer a comprehensive analysis of pruning schemes in combination with approximate transforms.
The sought schemes must be capable of reducing the number of computed approximate DCT coefficients and, at the same time, the effected degradation on picture quality must be negligible. In the present work, a multiplication-free pruned approximate 8-point DCT is sought. We also aim at VLSI realizations of both 1-D and 2-D versions of the proposed pruned approximate transform. The sought methods are intended to be fully embedded into an open source HEVC reference software [42] for performance assessment in real time video coding.

2 Proposed pruned approximate DCT

2.1 Proposed approximation

In [22], a very low-complexity 8-point DCT approximation was introduced. It is referred to as the modified rounded DCT (RDCT) and is associated to the following low-complexity matrix:

$$
\mathbf{T} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 \\
0 & 0 & -1 & 0 & 0 & 1 & 0 & 0 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
0 & -1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & -1 & 1 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 & 0 & 0
\end{bmatrix}.
\tag{1}
$$

Its associated fast algorithm requires only 14 additions, having the lowest computational complexity among the meaningful DCT approximations archived in literature [15, 17–22, 43]. Considering the orthogonalization methods described in [44], an orthonormal approximation for the DCT is given by:

$$
\hat{\mathbf{C}} = \mathbf{D} \cdot \mathbf{T}
= \frac{1}{2} \cdot \operatorname{diag}\left( \frac{1}{\sqrt{2}}, \sqrt{2}, 1, \sqrt{2}, \frac{1}{\sqrt{2}}, \sqrt{2}, 1, \sqrt{2} \right) \cdot \mathbf{T},
\tag{2}
$$

where $\operatorname{diag}(\cdot)$ returns a diagonal matrix with the elements of its argument.

By means of analyzing fifty 512 × 512 8-bit representative standard images [45], we noticed that the 2-D version of the 8-point modified RDCT [22] can concentrate, on average, ≈ 98% of the total image energy in the 16 lower frequency coefficients. Additionally, in JPEG-like image compression, the quantization step is prone to zero the high-frequency coefficients [38].
Therefore, computational efforts may be saved by not computing the high-frequency coefficients, keeping only the low-frequency coefficients. These considerations yield the following transformation derived from the low-complexity matrix associated to the modified RDCT:

$$
\mathbf{T}_4 =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 \\
0 & 0 & -1 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}.
\tag{3}
$$

The above transformation computes the four lower frequency components of the 1-D original modified RDCT low-complexity matrix, which corresponds to the 16 lower frequency components of the associated 2-D version. Thus, considering the orthogonalization methods described in [44], we can obtain a semi-orthogonal matrix given by:

$$
\hat{\mathbf{C}}_4 = \mathbf{D}_4 \cdot \mathbf{T}_4
= \frac{1}{2} \cdot \operatorname{diag}\left( \frac{1}{\sqrt{2}}, \sqrt{2}, 1, \sqrt{2} \right) \cdot \mathbf{T}_4.
\tag{4}
$$

Matrix $\hat{\mathbf{C}}_4$ is the pruned version of the modified RDCT. For image and video compression, the scaling diagonal matrix $\mathbf{D}_4$ does not introduce any computational overhead, since it can be merged into the quantization step [15, 18, 19, 21, 22]. Therefore, in such context, the computational complexity of $\hat{\mathbf{C}}_4$ is essentially confined into the low-complexity matrix $\mathbf{T}_4$.

Aiming at the efficient implementation of $\mathbf{T}_4$, we factorized it into a product of extremely low-complexity sparse matrices. Such factorization is based on decimation-in-time methods as described in [21, 43, 46]. Thus, the following expression is obtained:

$$
\mathbf{T}_4 = \mathbf{P} \cdot \mathbf{A}_3 \cdot \mathbf{A}_2 \cdot \mathbf{A}_1,
\tag{5}
$$

where

$$
\mathbf{P} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix},
\quad
\mathbf{A}_3 =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix},
\quad
\mathbf{A}_2 =
\begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix},
\quad
\mathbf{A}_1 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}.
\tag{6}
$$

Here, $\mathbf{A}_1$, $\mathbf{A}_2$, and $\mathbf{A}_3$ are pre-addition matrices [46] and $\mathbf{P}$ is a permutation matrix. Fig.
1 provides the signal flow graph of the fast algorithm for $\mathbf{T}_4$, relating the input signal $x_n$, $n = 0, 1, \ldots, 7$, to the output signal $X_k$, $k = 0, 1, \ldots, 7$. Transform-domain components $X_4, \ldots, X_7$ are not represented, being set to zero.

Figure 1: Signal flow graph relating input data $x_n$, $n = 0, 1, \ldots, 7$, to output $X_k$, $k = 0, 1, 2, 3$, according to $\mathbf{T}_4$. Dashed arrows are multiplications by −1.

Based on the 2-D computation of the DCT [43], the approximate 2-D DCT is given by [18]:

$$
\mathbf{B} = \hat{\mathbf{C}} \cdot \mathbf{A} \cdot \hat{\mathbf{C}}^\top,
\tag{7}
$$

where $\mathbf{A}$ is an input 8 × 8 image subblock, $\mathbf{B}$ is the associated transform-domain 8 × 8 output image subblock, and superscript $\top$ indicates matrix transposition. For instance, JPEG-like schemes are entirely based on the DCT-based transformation of 8 × 8 subblocks [2]. In a similar fashion, the 2-D forward pruned transformation can be derived [38, 40, 47] and it is described according to:

$$
\mathbf{B}' = \hat{\mathbf{C}}_4 \cdot \mathbf{A} \cdot \hat{\mathbf{C}}_4^\top,
\tag{8}
$$

where $\mathbf{B}'$ is the pruned 4 × 4 output image subblock. Matrix $\mathbf{B}'$ contains a subset of the transform-domain coefficients $\mathbf{B}$ furnished by the modified RDCT. The approximate 2-D spectrum is given by the 8 × 8 matrix $\hat{\mathbf{B}}$ constituted of $\mathbf{B}'$ in the upper-left corner and zeros elsewhere, as shown below:

$$
\mathbf{B} \approx \hat{\mathbf{B}} =
\begin{bmatrix}
\mathbf{B}' & \mathbf{0}_4 \\
\mathbf{0}_4 & \mathbf{0}_4
\end{bmatrix},
\tag{9}
$$

where $\mathbf{0}_4$ represents the 4 × 4 null matrix. The inverse transformation can be computed by taking the inverse transformation of the above zero-padded matrix $\hat{\mathbf{B}}$. However, this is equivalent to the following computation:

$$
\mathbf{A} \approx \hat{\mathbf{A}} = \hat{\mathbf{C}}_4^\top \cdot \mathbf{B}' \cdot \hat{\mathbf{C}}_4.
\tag{10}
$$

Therefore, padding becomes unnecessary. Additionally, we noticed that $\hat{\mathbf{C}}_4^\top$ is the Moore-Penrose pseudo-inverse of $\hat{\mathbf{C}}_4$ [48].
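The factorization in Eqs. (5) and (6) can be read as a three-stage butterfly. The following is a minimal sketch, not the authors' implementation: it evaluates the stages one after another, compares the result with the direct product by the pruned matrix of Eq. (3), and checks numerically that the scaled matrix of Eq. (4) has orthonormal rows. The names `pruned_rdct_fast`, `matvec`, and `gram_deviation` are hypothetical helpers introduced only for this illustration.

```python
import math

# Pruned low-complexity matrix T4 from Eq. (3).
T4 = [
    [1, 1,  1,  1,  1, 1, 1,  1],
    [1, 0,  0,  0,  0, 0, 0, -1],
    [1, 0,  0, -1, -1, 0, 0,  1],
    [0, 0, -1,  0,  0, 1, 0,  0],
]

def pruned_rdct_fast(x):
    """Compute T4 * x stage by stage, following T4 = P * A3 * A2 * A1."""
    assert len(x) == 8
    # Stage A1: six additions/subtractions.
    a0 = x[0] + x[7]
    a1 = x[1] + x[6]
    a2 = x[2] + x[5]
    a3 = x[3] + x[4]
    a4 = x[2] - x[5]
    a5 = x[0] - x[7]
    # Stage A2: three additions/subtractions; a4 only changes sign.
    b0 = a0 + a3
    b1 = a1 + a2
    b2 = a0 - a3
    b3 = -a4
    b4 = a5
    # Stage A3: one addition, for a total of 6 + 3 + 1 = 10.
    c0 = b0 + b1
    # Permutation P reorders the outputs to (X0, X1, X2, X3).
    return [c0, b4, b2, b3]

def matvec(m, v):
    """Plain matrix-vector product used as the reference result."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gram_deviation(m):
    """Largest entry-wise deviation of M * M^T from the identity."""
    n = len(m)
    return max(
        abs(sum(a * b for a, b in zip(m[i], m[j])) - (1.0 if i == j else 0.0))
        for i in range(n) for j in range(n)
    )

# Scaled matrix C4 = D4 * T4 from Eq. (4).
SQRT2 = math.sqrt(2.0)
D4 = [0.5 / SQRT2, 0.5 * SQRT2, 0.5, 0.5 * SQRT2]
C4 = [[d * v for v in row] for d, row in zip(D4, T4)]

x = [3, -1, 4, 1, -5, 9, 2, -6]      # arbitrary test vector
print(pruned_rdct_fast(x) == matvec(T4, x))
print(gram_deviation(C4) < 1e-12)
```

Because the rows of the scaled matrix are orthonormal, its transpose acts as a right inverse, which is consistent with the pseudo-inverse remark and with Eq. (10).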
2.2 Complexity Assessment

By counting the number of multiplications, additions, and bit-shifting operations, we assessed the computational cost of the proposed 1-D and 2-D pruned modified RDCT. Table 1 compares the obtained complexities with the computational costs associated to traditional and state-of-the-art DCT methods. Selected DCT approximations include: (i) the signed DCT (SDCT) [17]; (ii) the rounded DCT (RDCT) [21]; (iii) the modified RDCT [22]; and (iv) a set of DCT approximations introduced by Bouguezel-Ahmad-Swamy, namely, BAS-2008 [18], BAS-2009 [19], and BAS-2013 [20]. Here, we also included the computational cost of the exact DCT computation according to Chen's DCT algorithm [49], which is the algorithm employed in the HEVC codec [42]. Each of the selected methods was assessed both in its full and pruned versions. The 1-D and 2-D versions retained the four and 16 lower frequency coefficients, respectively.

Table 1: Computational complexity assessment

1-D

| Method | Nonpruned Mult. | Nonpruned Add. | Nonpruned Shift | Pruned Mult. | Pruned Add. | Pruned Shift |
|---|---|---|---|---|---|---|
| DCT (by definition) | 64 | 56 | 0 | 32 | 28 | 0 |
| Chen's DCT [49] | 16 | 26 | 0 | 6 | 12 | 0 |
| SDCT [17] | 0 | 24 | 0 | 0 | 20 | 0 |
| BAS-2008 [18] | 0 | 18 | 2 | 0 | 14 | 1 |
| BAS-2009 [19] | 0 | 18 | 0 | 0 | 14 | 0 |
| BAS-2013 [20, 40] | 0 | 24 | 0 | 0 | 20 | 0 |
| RDCT [21] | 0 | 22 | 0 | 0 | 16 | 0 |
| Modified RDCT [22] | 0 | 14 | 0 | 0 | 10 | 0 |

2-D

| Method | Nonpruned Mult. | Nonpruned Add. | Nonpruned Shift | Pruned Mult. | Pruned Add. | Pruned Shift |
|---|---|---|---|---|---|---|
| DCT (by definition) | 1024 | 896 | 0 | 384 | 336 | 0 |
| Chen's DCT [49] | 256 | 416 | 0 | 72 | 144 | 0 |
| SDCT [17] | 0 | 384 | 0 | 0 | 240 | 0 |
| BAS-2008 [18] | 0 | 288 | 32 | 0 | 168 | 12 |
| BAS-2009 [19] | 0 | 288 | 0 | 0 | 168 | 0 |
| BAS-2013 [20, 40] | 0 | 384 | 0 | 0 | 240 | 0 |
| RDCT [21] | 0 | 352 | 0 | 0 | 192 | 0 |
| Modified RDCT [22] | 0 | 224 | 0 | 0 | 120 | 0 |

The proposed 1-D method demands only 10 additions. The associated percent complexity reduction compared to selected state-of-the-art methods is presented in Table 2, for both the 1-D and 2-D cases. A certain arithmetic complexity reduction was already expected from the pruning approach. However, when considering the 2-D transformation, the arithmetic complexity reduction effected by the pruning procedure is even more significant, as shown in Table 2. In fact, the 8 × 8 nonpruned 2-D transformation can be decomposed into eight nonpruned row-wise 1-D transformations, followed by eight column-wise instantiations of the same 1-D transformation. In contrast, the proposed pruned 2-D transformation can be decomposed into eight pruned row-wise 1-D transformations, followed by only four pruned 1-D transformations. Therefore, the pruned 2-D transformation calls the 1-D algorithm fewer times when compared with the nonpruned case. The complexity values presented in Table 1 were calculated according to the above considerations. As a consequence, the proposed 2-D transformation requires 120 additions.

Table 2: Percent complexity reduction of the proposed method compared to state-of-the-art methods

| Method | 1-D | 2-D |
|---|---|---|
| SDCT [17] | 58.3% | 68.8% |
| BAS-2008 [18] | 50.0% | 62.5% |
| BAS-2009 [19] | 44.4% | 58.3% |
| BAS-2013 [20] | 58.3% | 68.8% |
| RDCT [21] | 54.5% | 65.9% |
| Modified RDCT [22] | 28.6% | 46.4% |

The proposed method outperforms the recently proposed pruned approximation described in [40], requiring 50% fewer operations for both 1-D and 2-D versions. Moreover, the comparison among pruned-only versions of the above methods shows that the proposed approximation demands 28.5% fewer operations, in both 1-D and 2-D cases, than the best competing methods, namely BAS-2008 [18] and BAS-2009 [19].
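The counting argument above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. It takes the nonpruned 1-D counts from Table 1, applies the 16-call (nonpruned) and 12-call (pruned) decompositions, and recomputes the percentages of Table 2. It assumes, as the BAS-2008 row of Table 2 suggests, that the reduction is measured on the total operation count with shifts included. The names `FULL_1D`, `PROPOSED_1D`, `full_2d`, `pruned_2d`, and `reduction` are hypothetical.

```python
# Nonpruned 1-D counts (multiplications, additions, shifts) from Table 1.
FULL_1D = {
    "SDCT": (0, 24, 0),
    "BAS-2008": (0, 18, 2),
    "BAS-2009": (0, 18, 0),
    "BAS-2013": (0, 24, 0),
    "RDCT": (0, 22, 0),
    "Modified RDCT": (0, 14, 0),
}
PROPOSED_1D = (0, 10, 0)  # pruned modified RDCT

def full_2d(counts):
    """Eight row-wise plus eight column-wise calls of the 1-D transform."""
    return tuple(16 * c for c in counts)

def pruned_2d(counts):
    """Eight calls along one dimension plus four along the other."""
    return tuple(12 * c for c in counts)

def reduction(new_ops, old_ops):
    """Percent reduction in total operation count (assumed metric)."""
    return 100.0 * (1.0 - sum(new_ops) / sum(old_ops))

for name, counts in FULL_1D.items():
    r1 = reduction(PROPOSED_1D, counts)
    r2 = reduction(pruned_2d(PROPOSED_1D), full_2d(counts))
    print(f"{name:14s} 1-D: {r1:5.1f}%   2-D: {r2:5.1f}%")
```

The 2-D gain exceeds the 1-D gain because pruning reduces both the cost of each 1-D call and the number of calls (12 instead of 16).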
3 Image compression

We processed the set of images mentioned in the previous section according to the image compression simulation detailed in [18, 21, 22]. Images were subdivided into 8 × 8 blocks and were submitted to 2-D transformation according to the proposed pruned approximate DCT and competing methods. The resulting coefficients in the transform domain were submitted to the standard quantization operation for luminance [50, p. 155]. We adopted a variable length coding approach, where the number of zeroed transform-domain coefficients is determined by the quantization step. The maximum number of non-zero coefficients is 16, as imposed by the pruning scheme. Subsequently, inverse transformations were considered. For the proposed method, the inverse procedure described in Section 2 was applied and compressed images were reconstructed.

Original and processed images were evaluated for image degradation using the peak signal-to-noise ratio (PSNR) [50, p. 9] and the structural similarity (SSIM) [51]. We also computed the number of zeros (NZ) after quantization, which provides the percentage of zeroed coefficients after the quantization step and furnishes a measure of energy compaction in the transform domain. High values of NZ translate into longer runs of zeros, which are beneficial for subsequent run-length encoding and Huffman coding stages [50].

Table 3: Performance assessment in image compression

| Method | Nonpruned PSNR (dB) | Nonpruned SSIM | Nonpruned NZ (%) | Pruned PSNR (dB) | Pruned SSIM | Pruned NZ (%) |
|---|---|---|---|---|---|---|
| Chen's DCT [49] | 33.10 | 0.90 | 81.83 | 30.40 | 0.86 | 86.19 |
| SDCT [17] | 29.28 | 0.84 | 80.20 | 27.14 | 0.77 | 86.27 |
| BAS-2008 [18] | 32.17 | 0.89 | 80.87 | 29.24 | 0.83 | 86.00 |
| BAS-2009 [19] | 31.72 | 0.88 | 80.59 | 28.69 | 0.82 | 86.16 |
| BAS-2013 [20, 40] | 31.82 | 0.88 | 80.52 | 28.72 | 0.82 | 86.10 |
| RDCT [21] | 31.91 | 0.88 | 81.03 | 28.93 | 0.82 | 86.45 |
| Modified RDCT [22] | 30.94 | 0.86 | 79.83 | 26.37 | 0.72 | 86.75 |
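Two of the figures of merit used in this section are simple enough to sketch directly. The code below is an illustration and not the evaluation script behind Table 3. It assumes the usual PSNR definition for 8-bit data (peak value 255, mean squared error over all pixels) and takes NZ as the percentage of quantized coefficients equal to zero. The function names `psnr` and `nz_percent` are hypothetical, and images are represented as flat lists of pixel values.

```python
import math

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

def nz_percent(quantized):
    """Percentage of zero-valued coefficients after quantization."""
    if not quantized:
        raise ValueError("input must be non-empty")
    return 100.0 * sum(1 for c in quantized if c == 0) / len(quantized)

# Toy data: every pixel is off by one gray level, so the MSE equals 1.
orig = [10, 20, 30, 40]
recon = [11, 19, 31, 39]
print(f"PSNR = {psnr(orig, recon):.2f} dB")
print(f"NZ   = {nz_percent([5, 0, 0, 0]):.1f} %")
```

In practice, both quantities would be averaged over all images in the test set, in line with the average measurements reported in Table 3.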
Figure 2: Compressed Lena images: (a) modified RDCT [22] (nonpruned); (b) proposed pruned transform.

In contrast with [18–20, 40], we adopted average measurements, which are less prone to variance effects and fortuitous data. Table 3 shows the average PSNR values and percent values for NZ based on the selected image set for each considered method. Results indicate that the proposed method can significantly reduce the computational complexity while maintaining good PSNR figures. For instance, considering the original and pruned modified RDCT, we noticed a ≈ 15% decrease in PSNR and SSIM; however, the associated arithmetic complexity reduction is of ≈ 50%. A qualitative comparison between the Lena [45] compressed images obtained from the above described procedure using the modified RDCT [22] and the proposed pruned transform is shown in Fig. 2.

4 VLSI architectures

To further investigate the capabilities of the proposed algorithm, we selected the modified RDCT and the proposed pruned approximation for hardware synthesis and evaluation in the actual HEVC scheme.

4.1 FPGA Architecture

These approximations were realized as a separable 2-D block transform using two 1-D transform blocks and a transpose buffer. Such blocks were initially modeled and tested in Matlab Simulink and then combined to furnish the complete 2-D transform.

Table 4: Resource consumption on Xilinx XC6VLX240T-1FFG1156 device

| Method | CLB | FF | CPD (ns) | F_max (MHz) | D_p (mW/MHz) | Q_p (W) |
|---|---|---|---|---|---|---|
| Modified RDCT | 445 | 1696 | 3.390 | 294.98 | 2.74 | 3.44 |
| Proposed | 247 | 961 | 2.946 | 339.44 | 1.35 | 3.43 |

Table 5: Resource consumption for CMOS 45 nm synthesis

| Method | Area (mm²) | AT (mm²·ns) | AT² (mm²·ns²) | CPD (ns) | F_max (MHz) | D_p (mW/MHz) | Q_p (W) |
|---|---|---|---|---|---|---|---|
| Modified RDCT | 0.073 | 0.261 | 0.936 | 3.582 | 279.17 | 0.050 | 0.039 |
| Proposed | 0.043 | 0.149 | 0.518 | 3.471 | 288.10 | 0.012 | 0.011 |
The resulting architecture was physically realized on a Xilinx Virtex-6 XC6VLX240T-1FFG1156 field programmable gate array (FPGA) device and validated using hardware-in-the-loop testing through the JTAG interface. The DCT approximation FPGA prototype was verified using more than 10000 test vectors with complete agreement with theoretical values. Quantities were obtained from the Xilinx FPGA synthesis by accessing the xflow.results report file for each run of the design flow. Metrics, including configurable logic blocks (CLB) and flip-flop (FF) count, critical path delay (CPD, in ns), and maximum operating frequency (F_max, in MHz), are provided. In addition, static (Q_p, in W) and frequency-normalized dynamic power (D_p, in mW/MHz) consumptions were estimated using the Xilinx XPower Analyzer.

4.2 CMOS Place-Route

Following FPGA-based verification, the hardware description language code was ported to 45 nm CMOS technology and subject to synthesis and place-and-route steps using Cadence Encounter. FPGA synthesis and CMOS place-and-route results are tabulated in Tables 4 and 5, respectively. For the CMOS place-and-route, critical path delay (CPD, in ns), area (in mm²), area-time complexity (AT, in mm²·ns), area-time-squared complexity (AT², in mm²·ns²), maximum operating frequency (F_max, in MHz), static (Q_p, in W) and frequency-normalized dynamic power (D_p, in mW/MHz) consumptions are also provided.

The FPGA realization of the proposed pruned DCT approximation showed a reduction of 44.49% in area as measured by the number of CLBs and a 50.72% reduction in frequency-normalized dynamic power consumption when compared with the full DCT approximation.
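The FPGA percentages quoted above follow from the Table 4 entries. As a rough cross-check, and only as a sketch, the snippet below recomputes them and also recovers the maximum operating frequency as the reciprocal of the critical path delay. The dictionary layout and the helper names `percent_reduction` and `fmax_mhz` are assumptions made for this illustration; the numbers are transcribed from Table 4, so small differences in the last digit (for example 50.73% against the reported 50.72%) presumably stem from rounding in the tabulated values.

```python
# Values transcribed from Table 4 (FPGA synthesis results).
FPGA = {
    "Modified RDCT": {"clb": 445, "cpd_ns": 3.390, "dp_mw_per_mhz": 2.74},
    "Proposed": {"clb": 247, "cpd_ns": 2.946, "dp_mw_per_mhz": 1.35},
}

def percent_reduction(old, new):
    """Relative saving of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old

def fmax_mhz(cpd_ns):
    """Maximum clock frequency in MHz for a critical path delay in ns."""
    return 1000.0 / cpd_ns

full, pruned = FPGA["Modified RDCT"], FPGA["Proposed"]
print(f"CLB reduction: {percent_reduction(full['clb'], pruned['clb']):.2f} %")
print(f"D_p reduction: "
      f"{percent_reduction(full['dp_mw_per_mhz'], pruned['dp_mw_per_mhz']):.2f} %")
for name, row in FPGA.items():
    print(f"{name}: F_max = {fmax_mhz(row['cpd_ns']):.2f} MHz")
```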
Synthesis at the 45 nm CMOS technology node using FreePDK45 standard cells revealed a 41.09% reduction in area and a 76% reduction in frequency-normalized dynamic power for a supply voltage fixed at V_DD = 1.1 V. All metrics indicate clear advantages of using the proposed pruned DCT approximation over the full 8-point approximation. Further, the 288 MHz CMOS clock indicates a block rate of 36 MHz and a frame rate of 327 Hz, assuming 8-bit RGB video at 1920 × 1080 resolution.

5 HEVC software simulation

Figure 3: Selected frame from the 'BasketballPass' test video coded by means of (a) Chen's DCT algorithm (PSNR 37.62 dB), (b) the modified RDCT (PSNR 37.42 dB), and (c) the proposed pruned approximation, with 76.2% fewer arithmetic operations than Chen's DCT (PSNR 37.41 dB).

We considered real time video coding by embedding the proposed algorithms into the HEVC reference software by the Fraunhofer Heinrich Hertz Institute [42]. The original transform included in the HEVC reference software is a scaled approximation of Chen's DCT algorithm. Our methodology consists of replacing the 8 × 8 DCT algorithm of the reference software by the modified RDCT and the proposed pruned approximation.

Figure 4: RD curves for the 'BasketballPass' test sequence, comparing the RDCT and the pruned RDCT: (a) PSNR (dB) values related to bit rate (bits/frame); (b) PSNR (dB) values related to QP.

Fig. 3 shows three 416 × 240 frames of the 'BasketballPass' test sequence [52] obtained from the HEVC simulation. Resulting frames were coded using Chen's DCT algorithm (Fig. 3(a)), the modified RDCT (Fig. 3(b)), and the proposed pruned approximation (Fig. 3(c)). The PSNR values for these three frames are shown in Fig. 3. The pruned approximation effected minimal image degradation (less than 0.25 dB).
On the other hand, the computational complexity of the 8-point DCT was significantly reduced: 76.2% fewer arithmetic operations when compared with the original Chen's DCT algorithm.

We have also computed rate distortion (RD) curves for both the RDCT and the proposed pruned approximation using standard video sequences [52]. For such, we varied the quantization parameter (QP) from 0 to 50 and computed the PSNR of the proposed pruned approximation with reference to the RDCT along with the bits/frame of the encoded video. As a result, we obtained the curves shown in Fig. 4(a). The PSNR values related to QP are shown in Fig. 4(b). The difference in the rate points between the RDCT and the proposed pruned approximation is less than 0.57 dB, which is smaller than 1.3%.

6 Conclusion

In this paper, we presented a very low-complexity DCT approximation obtained via pruning. The resulting approximate transform requires only 10 additions and possesses performance metrics comparable with state-of-the-art methods, including the recent architecture presented in [40]. The proposed pruning approach can be adapted to other transformation methods, regardless of the transform size. By means of computational simulation, VLSI hardware realizations, and a full HEVC implementation, we demonstrated the practical relevance of our method as an image and video codec. Our goal with the image and video simulations is not to suggest the modification of existing standards, which would be unfeasible. Instead, we aim at showing that (i) pruned approximations can be considered in tailored low-complexity, low-power systems for accelerated decoding of JPEG and HEVC and (ii) approximation methods combined with pruning are a viable alternative to the design of future standards.

For future work, we intend to apply the pruned approach to other discrete transform methods for different blocklengths.
In particular, the 4-, 16-, and 32-point DCT-based approximations are naturally fitted to the proposed approach. Moreover, a prospective study on the energy distribution in the transform domain could indicate the optimal number of coefficients to retain in the pruning process. Forthcoming applications include low-power wireless vision sensor networks and accelerated image decoding.

Acknowledgments

Authors acknowledge partial support from CNPq, FACEPE, FAPERGS, and The University of Akron.

References

[1] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego, CA: Academic Press, 1990.
[2] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, pp. xviii–xxxiv, 1992.
[3] N. Roma and L. Sousa, "Efficient hybrid DCT-domain algorithm for video spatial downscaling," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 30–30, 2007.
[4] International Organisation for Standardisation, "Generic coding of moving pictures and associated audio information – Part 2: Video," ISO, ISO/IEC JTC1/SC29/WG11 - Coding of Moving Pictures and Audio, 1994.
[5] International Telecommunication Union, "ITU-T recommendation H.261 version 1: Video codec for audiovisual services at p × 64 kbits," ITU-T, Tech. Rep., 1990.
[6] International Telecommunication Union, "ITU-T recommendation H.263 version 1: Video coding for low bit rate communication," ITU-T, Tech. Rep., 1995.
[7] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560–576, 2003.
[8] J. V. Team, "Recommendation H.264 and ISO/IEC 14496-10 AVC: Draft ITU-T recommendation and final draft international standard of joint video specification," ITU-T, Tech. Rep., 2003.
[9] International Telecommunication Union, "High efficiency video coding: Recommendation ITU-T H.265," ITU-T Series H: Audiovisual and Multimedia Systems, Tech. Rep., 2013.
[10] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, "HEVC: The new gold standard for video compression: How does HEVC compare with H.264/AVC?" IEEE Consumer Electronics Magazine, vol. 1, pp. 36–46, Jul. 2012.
[11] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, pp. 1649–1668, Dec. 2012.
[12] F. M. Bayer, R. J. Cintra, A. Madanayake, and U. S. Potluri, "Multiplierless approximate 4-point DCT VLSI architectures for transform block coding," Electronics Letters, vol. 49, pp. 1532–1534, 2013.
[13] J. Park, W. Nam, S. Han, and S. Lee, "2-D large inverse transform (16 × 16, 32 × 32) for HEVC (High Efficiency Video Coding)," Journal of Semiconductor Technology and Science, vol. 2, pp. 203–211, 2012.
[14] S. Park and P. K. Meher, "Flexible integer DCT architectures for HEVC," in IEEE International Symposium on Circuits and Systems (ISCAS), 2013, pp. 1376–1379.
[15] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, "Improved 8-point approximate DCT for image and video compression requiring only 14 additions," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 6, pp. 1727–1740, 2014.
[16] J. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards - including High Efficiency Video Coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, pp. 1669–1684, Dec. 2012.
[17] T. I. Haweel, "A new square wave transform based on the DCT," Signal Processing, vol. 82, pp. 2309–2319, 2001.
[18] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Low-complexity 8 × 8 transform for image compression," Electronics Letters, vol. 44, pp. 1249–1250, Sep. 2008.
[19] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "A fast 8 × 8 transform for image compression," in International Conference on Microelectronics (ICM), Dec. 2009, pp. 74–77.
[20] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Binary discrete cosine and Hartley transforms," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, pp. 989–1002, 2013.
[21] R. J. Cintra and F. M. Bayer, "A DCT approximation for image compression," IEEE Signal Processing Letters, vol. 18, pp. 579–582, Oct. 2011.
[22] F. M. Bayer and R. J. Cintra, "DCT-like transform for image compression requires 14 additions only," Electronics Letters, vol. 48, pp. 919–921, 2012.
[23] R. J. Cintra, F. M. Bayer, and C. J. Tablada, "Low-complexity 8-point DCT approximations based on integer functions," Signal Processing, vol. 99, pp. 201–214, 2014.
[24] C. J. Tablada, F. M. Bayer, and R. J. Cintra, "A class of DCT approximations based on the Feig-Winograd algorithm," Signal Processing, vol. 113, pp. 38–51, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0165168415000341
[25] K. R. Rao and P. Yip, The Transform and Data Compression Handbook. CRC Press LLC, 2001.
[26] Y. Huang, J. Wu, and C. Chang, "A generalized output pruning algorithm for matrix-vector multiplication and its application to compute pruning discrete cosine transform," IEEE Transactions on Signal Processing, vol. 48, pp. 561–563, 2000.
[27] J. Markel, "FFT pruning," IEEE Transactions on Audio and Electroacoustics, vol. 19, pp. 305–311, 1971.
[28] D. P. Skinner, "Pruning the decimation in-time FFT algorithm," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, pp. 193–194, 1976.
[29] R. G. Alves, P. L. Osorio, and M. N. S. Swamy, "General FFT pruning algorithm," in 43rd IEEE Midwest Symposium on Circuits and Systems, vol. 3, 2000, pp. 1192–1195.
[30] L. Wang, X. Zhou, G. E. Sobelman, and R. Liu, "Generic mixed-radix FFT pruning," IEEE Signal Processing Letters, vol. 19, pp. 167–170, Mar. 2012.
[31] R. Airoldi, O. Anjum, F. Garzia, A. M. Wyglinski, and J. Nurmi, "Energy-efficient fast Fourier transforms for cognitive radio systems," IEEE Micro, vol. 30, pp. 66–76, Nov. 2010.
[32] P. N. Whatmough, M. R. Perrett, S. Isam, and I. Darwazeh, "VLSI architecture for a reconfigurable spectrally efficient FDM baseband transmitter," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, pp. 1107–1118, May 2012.
[33] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing, 3rd ed. Pearson, 2010.
[34] J. H. Kim, J. G. Kim, Y. Ji, Y. Jung, and C. Won, "An islanding detection method for a grid-connected system based on the Goertzel algorithm," IEEE Transactions on Power Electronics, vol. 26, pp. 1049–1055, Apr. 2011.
[35] I. Carugati, S. Maestri, P. G. Donato, D. Carrica, and M. Benedetti, "Variable sampling period filter PLL for distorted three-phase systems," IEEE Transactions on Power Electronics, vol. 27, pp. 321–330, Jan. 2012.
[36] Z. Wang, "Pruning the fast discrete cosine transform," IEEE Transactions on Communications, vol. 39, pp. 640–643, May 1991.
[37] A. N. Skodras, "Fast discrete cosine transform pruning," IEEE Transactions on Signal Processing, vol. 42, pp. 1833–1837, Jul. 1994.
[38] V. Lecuire, L. Makkaoui, and J.-M. Moureaux, "Fast zonal DCT for energy conservation in wireless image sensor networks," Electronics Letters, vol. 48, pp. 125–127, 2012.
[39] G. Karakonstantis, N. Banerjee, and K. Roy, "Process-variation resilient and voltage-scalable DCT architecture for robust low-power computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, pp. 1461–1470, 2009.
[40] N. Kouadria, N. Doghmane, D. Messadeg, and S. Harize, "Low complexity DCT for image compression in wireless visual sensor networks," Electronics Letters, vol. 49, pp. 1531–1532, 2013.
[41] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, "Efficient integer DCT architectures for HEVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, pp. 168–178, Jan. 2014.
[42] Joint Collaborative Team on Video Coding (JCT-VC), "HEVC reference software documentation," Fraunhofer Heinrich Hertz Institute, Tech. Rep., 2013.
[43] V. Britanak, P. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximation. Elsevier, 2007.
[44] R. J. Cintra, "An integer approximation method for discrete sinusoidal transforms," Journal of Circuits, Systems, and Signal Processing, vol. 30, pp. 1481–1501, 2011. [Online]. Available: http://www.springerlink.com/content/nw5u0267254t3683/
[45] "The USC-SIPI image database," http://sipi.usc.edu/database/, 2011, University of Southern California, Signal and Image Processing Institute.
[46] R. E. Blahut, Fast Algorithms for Signal Processing. Cambridge University Press, 2010.
[47] L. Makkaoui, V. Lecuire, and J. Moureaux, "Fast zonal DCT-based image compression for wireless camera sensor networks," 2nd International Conference on Image Processing Theory Tools and Applications (IPTA), pp. 126–129, 2010.
[48] G. A. F. Seber, A Matrix Handbook for Statisticians. John Wiley & Sons, Inc, 2008.
[49] W. H. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for the Discrete Cosine Transform," IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[50] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards. Boston: Kluwer Academic Publishers, 1997.
[51] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, pp. 600–612, Apr. 2004.
[52] "HEVC Test Video Sequence," ftp://hvc:US88Hula@ftp.tnt.uni-hannover.de/testsequences, 2013, Heinrich Hertz Institute.
