An Orthogonal 16-point Approximate DCT for Image and Video Compression

T. L. T. da Silveira∗, F. M. Bayer†, R. J. Cintra‡, S. Kulasekera§, A. Madanayake§, A. J. Kozakevicius¶

∗ T. L. T. da Silveira is with the Programa de Pós-Graduação em Informática, Universidade Federal de Santa Maria, Santa Maria, RS, Brazil, thiago@inf.ufsm.br
† F. M. Bayer is with the Departamento de Estatística and LACESM, Universidade Federal de Santa Maria, Santa Maria, RS, Brazil, bayer@ufsm.br
‡ R. J. Cintra is with the Signal Processing Group, Departamento de Estatística, Universidade Federal de Pernambuco. E-mail: rjdsc@stat.ufpe.org
§ S. Kulasekera and A. Madanayake are with the Department of Electrical and Computer Engineering, University of Akron, Akron, OH, USA, arjuna@uakron.edu
¶ A. J. Kozakevicius is with the Departamento de Matemática, Universidade Federal de Santa Maria, Santa Maria, RS, Brazil, alicek@ufsm.br

Abstract

A low-complexity orthogonal multiplierless approximation for the 16-point discrete cosine transform (DCT) is introduced. The proposed method was designed to possess a very low computational cost. A fast algorithm based on matrix factorization is proposed, requiring only 60 additions. The proposed architecture outperforms classical and state-of-the-art algorithms when assessed as a tool for image and video compression. Digital VLSI hardware implementations are also proposed, being physically realized in FPGA technology and implemented in 45 nm CMOS up to the synthesis and place-and-route levels. Additionally, the proposed method was embedded into a high efficiency video coding (HEVC) reference software as a proof of concept. The obtained results show negligible video degradation when compared to the Chen DCT algorithm in HEVC.

Keywords

16-point DCT approximation, Low-complexity transforms, Image compression, Video coding

1 Introduction

The discrete cosine transform (DCT) [1, 2] is a pivotal tool for digital image processing [3–5]. Indeed, the DCT is an important approximation for the optimal Karhunen-Loève transform (KLT), being employed in a multitude of compression standards due to its remarkable energy compaction properties [5–9]. Because of this, the DCT has found applications in image and video coding standards, such as JPEG [10], MPEG-1 [11], MPEG-2 [12], H.261 [13], H.263 [14], and H.264 [15]. Moreover, numerous fast algorithms were proposed for its computation [16–22].

Designing fast algorithms for the DCT is a mature area of research [17, 23–25]; thus it is not realistic to expect major advances by means of standard techniques. On the other hand, the development of low-complexity approximations for the DCT is an open field of research. In particular, the 8-point DCT was given several approximations, such as the signed DCT [26], the level 1 DCT approximation [27], the Bouguezel-Ahmad-Swamy (BAS) series of transforms [4, 5, 7, 28, 29], the rounded DCT (RDCT) [8], the modified RDCT [30], the multiplier-free DCT approximation for RF imaging [31], and the improved approximate DCT proposed in [9]. Such approximations reduce the computational demands of the DCT evaluation, leading to low-power consumption and high-speed hardware realizations [9]. At the same time, approximate transforms can provide adequate numerical accuracy for image and video processing.

In response to the growing need for higher compression rates related to real-time applications [32], the high efficiency video coding (HEVC) video compression format [33] was proposed. Unlike its predecessors, HEVC employs not only 8 × 8 blocks, but also blocks of size 4 × 4, 16 × 16, and 32 × 32.
Several approximations for the 16-point DCT based on the integer cosine transform [34] were proposed in [35], [36] and [37]. These transformations are derived from the exact DCT after scaling the elements of the DCT matrix and approximating the resulting real-numbered entries to integers [36]. Therefore, real multiplications can be completely eliminated, at the expense of a noticeable increase in both the additive complexity and the number of required bit-shifting operations [2].

A more restricted class of DCT approximations prescribes transformation matrices with entries defined on the set $\mathcal{C} = \{0, \pm 1/2, \pm 1, \pm 2\}$. Because the elements of $\mathcal{C}$ imply almost null arithmetic complexity, the resulting transformations defined over $\mathcal{C}$ have very low complexity, requiring no multiplications and a reduced number of bit-shifting operations. In this context, methods providing 16-point low-cost orthogonal transforms include the Walsh–Hadamard transform (WHT) [38, 39], BAS-2010 [4], BAS-2013 [29], and the approximate transform proposed in [40], here referred to as the BCEM approximation. To the best of our knowledge, these are the only 16-point DCT approximations defined over $\mathcal{C}$ archived in the literature. Approximations over $\mathcal{C}$ are adequate to the HEVC structure [9] and are capable of minimizing the associated hardware power consumption, as required by the current multimedia market [32].

The aim of this paper is to contribute to image and video compression methods related to JPEG-like schemes and HEVC. Thus, we propose a 16-point approximate DCT that requires neither multiplications nor bit-shifting operations. Additionally, a fast algorithm is sought, aiming to minimize the overall computational complexity. The proposed transform is assessed and compared with competing 16-point DCT approximations.
The realization of the proposed DCT approximation in digital VLSI hardware, as well as its embedding into an HEVC reference software, is also sought.

This paper unfolds as follows. In Section 2, we propose a new 16-point DCT approximation and detail its fast algorithm. Section 3 presents the performance analysis of the introduced transformation and compares it to competing tools in terms of computational complexity, coding measures, and similarity metrics with respect to the exact DCT. In Section 4, a JPEG-like image compression simulation is described and results are presented. In Section 5, digital hardware architectures for the proposed algorithm are supplied for both 1-D and 2-D analysis. A practical real-time video coding scenario is also considered: the proposed method is embedded into an open source HEVC standard reference software. Conclusions and final remarks are given in the last section.

2 Proposed transform

Several fast algorithms for the DCT allow recursive structures, for which the computation of the $N$-point DCT can be split into the computation of $N/2$-point DCTs [1, 2, 17, 41–43]. This is usually the case for algorithms based on decimation-in-frequency methods [23, 41]. On account of the above observation, and judiciously considering permutations and sign changes, we designed a 16-point transformation that splits itself into two instantiations of the low-complexity matrix associated with the 8-point RDCT [8].
The proposed transformation, denoted as $\mathbf{T}$, is given by:

$$
\mathbf{T} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 \\
1 & 1 & 1 & 0 & 0 & -1 & -1 & -1 & -1 & -1 & -1 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & -1 & -1 \\
1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 & 1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & 1 & 1 & -1 & -1 \\
1 & 0 & -1 & -1 & 1 & 1 & 0 & -1 & -1 & 0 & 1 & 1 & -1 & -1 & 0 & 1 \\
0 & 0 & -1 & 1 & 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & -1 & 1 & 0 & 0 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & -1 & -1 & 1 & 0 & 0 & 1 & -1 & 1 & -1 & 0 & 0 & -1 & 1 & 1 & -1 \\
1 & -1 & 0 & 1 & -1 & 0 & 1 & -1 & -1 & 1 & 0 & -1 & 1 & 0 & -1 & 1 \\
0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 & 1 & -1 & 0 & 0 & -1 & 1 & 0 & 0 & 1 & -1 & 0 \\
1 & -1 & 1 & -1 & 1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 \\
0 & -1 & 1 & -1 & 1 & -1 & 1 & 0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 & 0 \\
1 & -1 & 0 & 0 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & 0 & 0 & 1 & -1
\end{bmatrix}.
$$

Because the entries of $\mathbf{T}$ are in $\{0, \pm 1\} \subset \mathcal{C}$, the proposed matrix is a multiplierless operator. Bit-shifting operations are also unnecessary; only simple additions are required. Additionally, the above matrix obeys the condition $\mathbf{T} \cdot \mathbf{T}^\top = [\text{diagonal matrix}]$, where the superscript $\top$ denotes matrix transposition. Thus, the necessary conditions for orthogonalizing it according to the methods described in [44], [8] and [45] are satisfied. Such a procedure yields the following orthogonal 16-point DCT approximation matrix:

$$
\hat{\mathbf{C}} = \mathbf{S} \cdot \mathbf{T},
$$

where

$$
\mathbf{S} = \operatorname{diag}\left( \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{\sqrt{8}}, \tfrac{1}{\sqrt{8}}, \tfrac{1}{4}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{4}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{\sqrt{8}}, \tfrac{1}{\sqrt{8}}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{\sqrt{12}}, \tfrac{1}{\sqrt{12}} \right).
$$

In the context of image and video compression, the diagonal matrix $\mathbf{S}$ can be absorbed into the quantization step [4, 5, 7, 8, 25, 45, 46]. Therefore, under these conditions, the complexity of the approximation $\hat{\mathbf{C}}$ can be equated to the complexity of the low-complexity matrix $\mathbf{T}$ [40, 46].
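The orthogonality condition and the entries of the scaling matrix can be checked numerically. The following is a minimal sketch in plain Python, assuming the rows of T are transcribed exactly as printed above; the helper name `gram` and the variable names are illustrative and do not come from the paper.

```python
from math import sqrt

# Rows of the low-complexity matrix T, transcribed from the text.
T = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1],
    [1, 1, 1, 0, 0, -1, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, -1, -1, 1, 1, 0, 0, 0, 0, -1, -1],
    [1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1],
    [1, 1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1],
    [1, 0, -1, -1, 1, 1, 0, -1, -1, 0, 1, 1, -1, -1, 0, 1],
    [0, 0, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, 0, 0],
    [1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1],
    [1, -1, -1, 1, 0, 0, 1, -1, 1, -1, 0, 0, -1, 1, 1, -1],
    [1, -1, 0, 1, -1, 0, 1, -1, -1, 1, 0, -1, 1, 0, -1, 1],
    [0, 0, 1, 1, -1, -1, 0, 0, 0, 0, 1, 1, -1, -1, 0, 0],
    [0, -1, 1, 0, 0, 1, -1, 0, 0, -1, 1, 0, 0, 1, -1, 0],
    [1, -1, 1, -1, 1, -1, 0, 0, 0, 0, 1, -1, 1, -1, 1, -1],
    [0, -1, 1, -1, 1, -1, 1, 0, 0, 1, -1, 1, -1, 1, -1, 0],
    [1, -1, 0, 0, -1, 1, -1, 1, -1, 1, -1, 1, 0, 0, 1, -1],
]


def gram(m):
    """Return the product of m and its transpose as a list of lists."""
    return [[sum(a * b for a, b in zip(r, s)) for s in m] for r in m]


G = gram(T)

# T * T^T is diagonal when every pair of distinct rows is orthogonal.
off_diagonal_zero = all(
    G[i][j] == 0 for i in range(16) for j in range(16) if i != j
)

# The diagonal holds the squared norm of each row (number of nonzero entries).
row_energy = [G[i][i] for i in range(16)]

# Scaling each row by 1/sqrt(row energy) gives the diagonal of S.
S = [1 / sqrt(e) for e in row_energy]
C_hat = [[S[i] * t for t in T[i]] for i in range(16)]

print(off_diagonal_zero)
print(row_energy)
```

The row energies take only the values 16, 12, and 8, which is why the diagonal of S contains only 1/4, 1/√12, and 1/√8.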
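Matrix-based fast algorithm design techniques yield a sparse matrix factorization of $\mathbf{T}$ as given below:

$$
\mathbf{T} = \mathbf{P}_2 \cdot \mathbf{M}_4 \cdot \mathbf{M}_3 \cdot \mathbf{M}_2 \cdot \mathbf{P}_1 \cdot \mathbf{M}_1,
$$

where

$$
\begin{aligned}
\mathbf{M}_1 &= \begin{bmatrix} \mathbf{I}_8 & \bar{\mathbf{I}}_8 \\ \bar{\mathbf{I}}_8 & -\mathbf{I}_8 \end{bmatrix}, \\
\mathbf{P}_1 &= \operatorname{diag}\left( \mathbf{I}_9,
\begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \right), \\
\mathbf{M}_2 &= \operatorname{diag}\left(
\begin{bmatrix} \mathbf{I}_4 & \bar{\mathbf{I}}_4 \\ \bar{\mathbf{I}}_4 & -\mathbf{I}_4 \end{bmatrix},
\begin{bmatrix} \mathbf{I}_4 & \bar{\mathbf{I}}_4 \\ \bar{\mathbf{I}}_4 & -\mathbf{I}_4 \end{bmatrix}
\right), \\
\mathbf{M}_3 &= \operatorname{diag}\left(
\begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix},
\begin{bmatrix} 0 & 1 & 1 & 1 \\ -1 & -1 & 0 & 1 \\ -1 & 1 & -1 & 0 \\ 1 & 0 & -1 & 1 \end{bmatrix},
\begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix},
\begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & -1 \\ 1 & -1 & 1 & 0 \\ 1 & 0 & -1 & 1 \end{bmatrix}
\right), \\
\mathbf{M}_4 &= \operatorname{diag}\left(
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \mathbf{I}_6,
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \mathbf{I}_6
\right),
\end{aligned}
\tag{1}
$$

and matrix $\mathbf{P}_2$ performs the simple permutation (0)(1 8)(2 4 3 11 10 7 12 2)(5 9 13 14 6 5)(15) in cyclic notation [47, p. 77]. Matrices $\mathbf{I}_n$ and $\bar{\mathbf{I}}_n$ denote the identity and counter-identity matrices of order $n$, respectively.

3 Computational Complexity and Evaluation

In this section, we aim at (i) assessing the computational complexity of the proposed approximation, (ii) evaluating it in terms of approximation error, and (iii) measuring its coding performance [2]. For comparison purposes, we selected the following state-of-the-art 16-point DCT approximations: BAS-2010 [4], BAS-2013 [29] and the BCEM method [40]. We also considered the classical WHT [38] and the exact DCT as computed according to the Chen DCT algorithm [42]. This latter method is the algorithm employed in the HEVC codec by [48].

Table 1: Arithmetic complexity assessment (operation count)

| Transform | Multiplication | Addition | Bit-shifting | Total |
|---|---|---|---|---|
| Chen DCT [42] | 44 | 74 | 0 | 118 |
| WHT | 0 | 64 | 0 | 64 |
| BAS-2010 | 0 | 64 | 8 | 72 |
| BAS-2013 | 0 | 64 | 0 | 64 |
| BCEM | 0 | 72 | 0 | 72 |
| Proposed | 0 | 60 | 0 | 60 |

3.1 Arithmetic Complexity

The computational cost of a given transformation is traditionally measured by its arithmetic complexity, i.e., the number of required arithmetic operations for its computation [2, 41, 49].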
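As an illustration of this definition, the additive cost of the fast algorithm can be tallied directly from the sparse factors in equation set (1). The sketch below is a hedged example, not the authors' procedure: it assumes that a row with k nonzero entries in {±1} costs k − 1 additions (subtractions counted as additions) and that the permutations P1 and P2 are free. The helper names `butterfly`, `block_diag`, `identity`, and `additions` are hypothetical.

```python
def identity(n):
    """Identity matrix of order n."""
    return [[int(i == j) for j in range(n)] for i in range(n)]


def butterfly(n):
    """Return [[I, J], [J, -I]], with I the identity and J the counter-identity of order n."""
    rows = []
    for i in range(n):  # top half: x[i] + x[2n-1-i]
        row = [0] * (2 * n)
        row[i], row[2 * n - 1 - i] = 1, 1
        rows.append(row)
    for i in range(n):  # bottom half: x[n-1-i] - x[n+i]
        row = [0] * (2 * n)
        row[n - 1 - i], row[n + i] = 1, -1
        rows.append(row)
    return rows


def block_diag(*blocks):
    """Place square blocks along the diagonal of a larger zero matrix."""
    size = sum(len(b) for b in blocks)
    out = [[0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                out[offset + i][offset + j] = value
        offset += len(b)
    return out


def additions(matrix):
    """Additions needed to apply a {0, +1, -1} matrix: (nonzeros - 1) per row."""
    return sum(max(sum(1 for v in row if v != 0) - 1, 0) for row in matrix)


H2 = [[1, 1], [1, -1]]
M1 = butterfly(8)
M2 = block_diag(butterfly(4), butterfly(4))
M3 = block_diag(
    [[1, 0, 0, 1], [0, 1, 1, 0], [0, -1, 1, 0], [1, 0, 0, -1]],
    [[0, 1, 1, 1], [-1, -1, 0, 1], [-1, 1, -1, 0], [1, 0, -1, 1]],
    [[1, 0, 0, 1], [0, 1, 1, 0], [0, -1, 1, 0], [-1, 0, 0, 1]],
    [[0, 1, 1, 1], [1, 1, 0, -1], [1, -1, 1, 0], [1, 0, -1, 1]],
)
M4 = block_diag(H2, identity(6), H2, identity(6))

per_stage = [additions(m) for m in (M1, M2, M3, M4)]
total_additions = sum(per_stage)
print(per_stage, total_additions)
```

Under these counting assumptions the four stages contribute 16, 16, 24, and 4 additions, which agrees with the 60 additions reported for the proposed transform in Table 1.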
Considered operations are multiplications, additions, and bit-shifting operations [41]. Table 1 lists the operation count for each arithmetical operation for all considered methods. The total operation count is also provided. The proposed transform requires 6.25% fewer operations in total than the WHT or the BAS-2013 approximation (60 against 64). Compared with BAS-2010 or the BCEM approximation, the introduced approximation requires 16.67% fewer operations overall (60 against 72). As a stricter complexity assessment, even if we take only the additive complexity into account, the proposed transformation still outperforms all considered methods. It is also noteworthy that the proposed method, like the other considered approximations, requires no multiplications, whereas the Chen DCT algorithm requires 44. Moreover, to the best of our knowledge, the proposed transformation has a lower arithmetic complexity than any meaningful 16-point DCT approximation archived in the literature.

3.2 Similarity Measures

For the approximation error analysis, we considered three tools: the DCT distortion [50], the total error energy [8], and the mean square error (MSE) [1, 2]. This set of measures determines the similarity between the exact DCT matrix and a given approximation. These quality metrics are briefly described as follows.

Let $\mathbf{C}$ be the exact $N$-point DCT matrix and $\tilde{\mathbf{C}}$ be a given $N$-point DCT approximation. Adopting the notation employed in [37], the DCT distortion of $\tilde{\mathbf{C}}$ is given by:

$$
d_2(\tilde{\mathbf{C}}) = 1 - \frac{1}{N} \cdot \left\| \operatorname{diag}\left( \mathbf{C} \cdot \tilde{\mathbf{C}}^\top \right) \right\|^2,
$$

where $\| \cdot \|$ is the Euclidean norm [51]. The DCT distortion captures the difference between the exact DCT matrix and a candidate approximation by quantifying the orthogonality among the basis vectors of both transforms [37].

Taking the basis vectors of the exact DCT and a given approximation as filter coefficients, the total error energy [8] measures the spectral proximity between the corresponding transfer functions [26].
Invoking Parseval's theorem [52], the total error energy can be evaluated according to:

$$
\epsilon(\tilde{\mathbf{C}}) = \pi \cdot \left\| \mathbf{C} - \tilde{\mathbf{C}} \right\|_{\mathrm{F}}^2,
$$

where $\| \cdot \|_{\mathrm{F}}$ is the Frobenius norm [51].

The MSE is a well-established proximity measure [2]. The MSE between $\mathbf{C}$ and $\tilde{\mathbf{C}}$ is given by [2, 24]:

$$
\operatorname{MSE}(\tilde{\mathbf{C}}) = \frac{1}{N} \cdot \operatorname{tr}\left( (\mathbf{C} - \tilde{\mathbf{C}}) \cdot \mathbf{R} \cdot (\mathbf{C} - \tilde{\mathbf{C}})^\top \right),
$$

where $\operatorname{tr}(\cdot)$ is the trace function [53] and $\mathbf{R}$ is the covariance matrix of the input signal. Assuming the first-order stationary Markov process model for the input data, we have that $\mathbf{R}_{[i,j]} = \rho^{|i-j|}$, for $i, j = 1, 2, \ldots, N$, and the correlation coefficient $\rho$ is set to 0.95 [2, 24]. This particular model is suitable for real signals and natural images [2, 24, 45]. Small MSE values indicate proximity to the exact DCT [2].

3.3 Coding Measures

We adopted two coding measures: the transform coding gain [37] and the transform efficiency [2]. The transform coding gain quantifies the coding or data compression performance of an orthogonal transform [2, 37]. This measure is given by [2, 54]:

$$
C_g(\tilde{\mathbf{C}}) = 10 \cdot \log_{10} \left\{ \frac{ \frac{1}{N} \sum_{i=0}^{N-1} s_{ii} }{ \left[ \prod_{i=0}^{N-1} \left( s_{ii} \cdot \sqrt{ \sum_{j=0}^{N-1} \tilde{c}_{ij}^2 } \right) \right]^{\frac{1}{N}} } \right\},
$$

where $s_{ij}$ and $\tilde{c}_{ij}$ are the $(i,j)$-th entries of $\tilde{\mathbf{C}} \cdot \mathbf{R} \cdot \tilde{\mathbf{C}}^\top$ and $\tilde{\mathbf{C}}$, respectively.

On the other hand, the transform efficiency [2] is an alternative method to compute the compression performance. Denoted by $\eta$, the transform efficiency is furnished by:

$$
\eta(\tilde{\mathbf{C}}) = \frac{ \sum_{i=0}^{N-1} |s_{ii}| }{ \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |s_{ij}| } \times 100.
$$

The quantity $\eta(\tilde{\mathbf{C}})$ indicates the data decorrelation capability of the transformation. The KLT achieves optimality with respect to this measure, presenting a transform efficiency of 100 [2].
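The five figures of merit defined in Sections 3.2 and 3.3 translate almost line by line into code. The following plain-Python sketch is one possible transcription under stated assumptions: the exact DCT is taken to be the orthonormal DCT-II, the covariance follows the first-order Markov model with ρ = 0.95, and all function names are illustrative rather than taken from the paper.

```python
from math import cos, log10, pi, sqrt


def dct_matrix(n):
    """Orthonormal DCT-II matrix of order n (assumed form of the exact DCT)."""
    return [
        [(sqrt(1 / n) if k == 0 else sqrt(2 / n)) * cos((2 * j + 1) * k * pi / (2 * n))
         for j in range(n)]
        for k in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def markov_cov(n, rho=0.95):
    """Covariance of a first-order stationary Markov process: R[i][j] = rho^|i-j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def dct_distortion(c, c_approx):
    n = len(c)
    p = matmul(c, transpose(c_approx))
    return 1 - sum(p[i][i] ** 2 for i in range(n)) / n


def total_error_energy(c, c_approx):
    return pi * sum((x - y) ** 2 for r, s in zip(c, c_approx) for x, y in zip(r, s))


def mse(c, c_approx, rho=0.95):
    n = len(c)
    d = [[x - y for x, y in zip(r, s)] for r, s in zip(c, c_approx)]
    m = matmul(matmul(d, markov_cov(n, rho)), transpose(d))
    return sum(m[i][i] for i in range(n)) / n


def transformed_cov(c_approx, rho=0.95):
    """Matrix with entries s_ij, i.e. C~ R C~^T."""
    n = len(c_approx)
    return matmul(matmul(c_approx, markov_cov(n, rho)), transpose(c_approx))


def coding_gain(c_approx, rho=0.95):
    n = len(c_approx)
    s = transformed_cov(c_approx, rho)
    arithmetic_mean = sum(s[i][i] for i in range(n)) / n
    # Work with logarithms so the product of N terms is a sum.
    log_geometric_mean = sum(
        log10(s[i][i] * sqrt(sum(v * v for v in c_approx[i]))) for i in range(n)
    ) / n
    return 10 * (log10(arithmetic_mean) - log_geometric_mean)


def efficiency(c_approx, rho=0.95):
    s = transformed_cov(c_approx, rho)
    diagonal = sum(abs(s[i][i]) for i in range(len(s)))
    everything = sum(abs(v) for row in s for v in row)
    return 100 * diagonal / everything


C = dct_matrix(16)
print(dct_distortion(C, C), total_error_energy(C, C), mse(C, C))
print(coding_gain(C), efficiency(C))
```

When the exact DCT is compared with itself, the three similarity measures vanish by construction (up to floating-point rounding), and the two coding measures can be checked against the DCT row of Table 2. Passing a scaled approximation such as Ĉ = S · T as `c_approx` evaluates any other row.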
Table 2: Performance analysis (approximation measures: $d_2$, $\epsilon$, MSE; coding measures: $C_g$, $\eta$)

| Transform | $d_2$ | $\epsilon$ | MSE | $C_g$ | $\eta$ |
|---|---|---|---|---|---|
| DCT | 0 | 0 | 0 | 9.4555 | 88.4518 |
| WHT | 0.8783 | 92.5631 | 0.4284 | 8.1941 | 70.6465 |
| BAS-2010 | 0.6666 | 64.749 | 0.1866 | **8.5208** | **73.6345** |
| BAS-2013 | 0.5108 | 54.6207 | 0.132 | 8.1941 | 70.6465 |
| BCEM | **0.1519** | **8.0806** | **0.0465** | 7.8401 | 65.2789 |
| Proposed | **0.3405** | **30.323** | **0.0639** | **8.295** | **70.8315** |

3.4 Results

Table 2 summarizes the results for the similarity and coding measures detailed above. For each figure of merit, we emphasize in bold the two best measurements among the approximations. The proposed transform displays consistently good performance according to all considered criteria. This contrasts with existing transformations, which tend to excel in terms of similarity measures but show limited coding performance, or vice versa. Therefore, the proposed transform offers a compromise, while still achieving state-of-the-art performance.

4 Application to image compression

The proposed approximation was submitted to the image compression simulation methodology originally introduced in [26] and employed in [4, 5, 7, 8, 29, 40]. In our experiments, a set of 45 512 × 512 8-bit grayscale images obtained from a standard public image bank [55] was considered to validate the proposed algorithm. We adapted the JPEG-like compression scheme for the 16 × 16 matrix case, as suggested in [4]. The adopted image compression method is detailed as follows.

An input 512 × 512 image was divided into 1024 disjoint 16 × 16 blocks $\mathbf{A}_k$, $k = 0, 1, \ldots, 1023$. Each block $\mathbf{A}_k$ was 2-D transformed according to $\mathbf{B}_k = \tilde{\mathbf{C}} \cdot \mathbf{A}_k \cdot \tilde{\mathbf{C}}^\top$, where $\mathbf{B}_k$ is a frequency domain image block and $\tilde{\mathbf{C}}$ is a given approximation matrix. Matrix $\mathbf{B}_k$ contains the 256 transform domain coefficients for each block.
Adapting the zigzag sequence [56] for the 16 × 16 case, we retained only the $r$ initial coefficients and set the remaining coefficients to zero [4, 5, 7, 8, 26, 29, 40], generating $\mathbf{B}'_k$. Subsequently, the inverse transformation was applied to $\mathbf{B}'_k$ according to: $\mathbf{A}'_k = \tilde{\mathbf{C}}^\top \cdot \mathbf{B}'_k \cdot \tilde{\mathbf{C}}$. The above procedure was repeated for each block. The rearrangement of all blocks $\mathbf{A}'_k$ reconstructs the image, which can be assessed for quality.

Image degradation was evaluated using two different quality measures: (i) the peak signal-to-noise ratio (PSNR) and (ii) the structural similarity index (SSIM) [57], a generalization of the universal image quality index [58]. In contrast to the PSNR, the SSIM definition takes advantage of known characteristics of the human visual system [57]. Following the methodology adopted in [8] and [40], we calculated average PSNR and SSIM values for all 45 images. Fig. 1(a) and Fig. 2(a) show average PSNR and SSIM measurements, respectively. Additionally, we considered absolute percentage error (APE) measurements of PSNR and SSIM with respect to the exact DCT. Results are displayed in Fig. 1(b) and Fig. 2(b), for PSNR and SSIM, respectively. APE figures for the WHT are absent because their values were exceedingly high, being located outside of the plot range.

Figure 1: PSNR results for all considered transforms under several compression rates. (a) Average PSNR (dB) versus $r$; (b) absolute percentage error of PSNR versus $r$.

According to Figs. 1 and 2, the proposed transform outperforms the other methods for $r \leq 50$, which corresponds to high compression rates. Therefore, the proposed transform is in consonance with the ITU recommendation for high-compression coding in real-time applications [32].
For $r > 50$, the discussed methods are essentially comparable in terms of image degradation. As a qualitative comparison, Fig. 3 shows the compressed Lena image at $r = 16$ (93.75% compression) obtained from each considered method. The proposed transform offered less pixelation and fewer block artifacts, demonstrating its adequacy for high-compression rate scenarios.

Figure 2: SSIM results for all considered transforms under several compression rates. (a) Average SSIM versus $r$; (b) absolute percentage error of SSIM versus $r$.

Figure 3: Compressed Lena image using all considered transforms, for $r = 16$. (a) DCT (PSNR = 28.55); (b) WHT (PSNR = 21.20); (c) BAS-2010 (PSNR = 25.27); (d) BAS-2013 (PSNR = 25.79); (e) BCEM (PSNR = 25.75); (f) Proposed (PSNR = 27.13).

5 Digital Architecture and Realization

In this section, hardware architectures for the proposed 16-point approximate DCT are detailed. Both 1-D and 2-D transformations are addressed. The introduced architectures were submitted to (i) Xilinx field programmable gate array (FPGA) implementations and (ii) CMOS 45 nm application specific integrated circuit (ASIC) implementation up to the synthesis level. Additionally, in order to assess the performance of the proposed algorithm in real-time video coding, the introduced approximation was also embedded into an HEVC reference software [59].

Figure 4: Architecture of the proposed 16-point DCT approximation.

5.1 Architecture for the 16-point DCT approximation

The 2-D version of the 16-point DCT approximation architecture was realized using two 1-D transforms and a transpose buffer. This is possible because the proposed approximation inherits the separable kernel property of the exact DCT [60]. The first instantiation of the approximate DCT block furnishes a row-wise transform computation of the input image, while the second instantiation furnishes a column-wise transformation of the previous intermediate result. A real-time row-parallel transposition buffer circuit is required in between the 1-D transformation blocks. Such a block ensures data ordering for converting the row-transformed data from the first DCT approximation circuit to a transposed format, as required by the second DCT approximation circuit. Both 1-D transformation blocks and the transposition buffer were initially modeled and tested in Matlab Simulink; then they were combined to furnish the complete 2-D approximate transform.

Fig. 4 depicts the architecture for the proposed 1-D approximate DCT. We emphasize in dashed boxes the blocks M1, M2, M3, and M4, which correspond to the realization of the sparse matrices $\mathbf{M}_1$, $\mathbf{M}_2$, $\mathbf{M}_3$, and $\mathbf{M}_4$, respectively, as shown in equation set (1). Fig. 5 shows the implementation of the 2-D transform by means of the 1-D transforms.
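The row-column scheme described in this subsection can be mimicked in software. The sketch below is illustrative only: it uses a generic transform matrix and hypothetical function names (`transform_rows`, `transform_2d`), and it models the transposition buffer as a plain matrix transpose rather than as the row-parallel circuit of the hardware design.

```python
def transform_rows(matrix, block):
    """First or second 1-D pass: apply the transform to every row of `block`."""
    return [[sum(t * x for t, x in zip(basis, row)) for basis in matrix] for row in block]


def transpose(block):
    """Software stand-in for the transposition buffer."""
    return [list(col) for col in zip(*block)]


def transform_2d(matrix, block):
    """Row-wise 1-D pass, transposition, second 1-D pass.

    The second pass emits the result column by column, so a final transpose
    restores the usual layout of matrix * block * matrix^T.
    """
    row_pass = transform_rows(matrix, block)                   # block * matrix^T
    column_pass = transform_rows(matrix, transpose(row_pass))  # transposed result
    return transpose(column_pass)


def matmul(a, b):
    """Direct matrix product, used only as a reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Small stand-in example with a 2-point butterfly instead of the 16-point T.
H = [[1, 1], [1, -1]]
A = [[1, 2], [3, 4]]
separable = transform_2d(H, A)
direct = matmul(matmul(H, A), transpose(H))
print(separable == direct)
```

The same functions accept the 16 × 16 matrix T and a 16 × 16 image block unchanged, since nothing in the sketch depends on the transform size.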
5.2 FPGA and ASIC realizations and results

The architecture discussed above was physically realized on an FPGA-based rapid prototyping system for various register sizes and tested using on-chip hardware-in-the-loop co-simulation. The architecture was designed for digital realization within the MATLAB environment using the Xilinx System Generator. A Xilinx Virtex-6 XC6VLX240T-1FFG1156 device was employed to physically realize the architecture on FPGA with fine-grain pipelining for increased throughput. The realization was verified on the FPGA chip using a Xilinx ML605 board at a clock frequency of 50 MHz. The FPGA realization was tested with 10,000 random 16-point input test vectors using hardware co-simulation. Test vectors were generated from within the MATLAB environment and routed to the physical FPGA device using JTAG-based hardware co-simulation. The measured data from the FPGA was then routed back to the MATLAB memory space.

Figure 5: Two-dimensional approximate transform by means of the 1-D approximate transform. Signal $x_{k,0}, x_{k,1}, \ldots$ corresponds to the rows of the input image; $X_{k,0}, X_{k,1}, \ldots$ indicates the transformed rows; $X_{0,j}, X_{1,j}, \ldots$ indicates the columns of the transposed row-wise transformed image; and $X^{(\text{2-D})}_{0,j}, X^{(\text{2-D})}_{1,j}, \ldots$ indicates the columns of the final 2-D transformed image.

Table 3: Hardware resource consumption and power consumption for the proposed 2-D 16-point DCT approximation

| CLB | FF | $T_\text{cpd}$ (ns) | $F_\text{max}$ (MHz) | $D_p$ (mW/MHz) | $Q_p$ (W) |
|---|---|---|---|---|---|
| 1408 | 4600 | 3.7 | 270.27 | 6.91 | 3.481 |
Evaluation of hardware complexity and real-time performance considered the following metrics: the number of used configurable logic blocks (CLB), flip-flop (FF) count, critical path delay ($T_\text{cpd}$), and the maximum operating frequency ($F_\text{max}$) in MHz. The xflow.results report file was accessed to obtain the above results. Dynamic ($D_p$) and static ($Q_p$) power consumptions were estimated using the Xilinx XPower Analyzer. Results are shown in Table 3.

For the ASIC implementation, the hardware description language code was ported to 45 nm CMOS technology and subjected to synthesis and place-and-route steps using the Cadence Encounter. The FreePDK, a free open-source ASIC standard cell library at the 45 nm node, was used for this purpose. The supply voltage of the CMOS realization was fixed at $V_\text{DD} = 1.1$ V during estimation of power consumption and logic delay. The adopted figures of merit for the ASIC synthesis were: area ($A$) in mm², area-time complexity ($AT$) in mm² · ns, area-time-squared complexity ($AT^2$) in mm² · ns², dynamic power ($D_p$) in mW/MHz, static power consumption ($Q_p$) in mW, critical path delay ($T_\text{cpd}$) in ns, and maximum operating frequency ($F_\text{max}$) in MHz. Results are displayed in Table 4.
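The derived ASIC figures of merit follow from the area and the critical path delay by simple arithmetic. The short sketch below assumes that AT and AT² are the products of the area with the critical path delay and its square, and that the maximum operating frequency is the reciprocal of the critical path delay; these relations are assumptions that happen to be consistent with the values reported in Table 4, and the function name `asic_figures` is hypothetical.

```python
def asic_figures(area_mm2, t_cpd_ns):
    """Derive AT (mm^2 ns), AT^2 (mm^2 ns^2) and F_max (MHz) from area and delay."""
    return {
        "AT": area_mm2 * t_cpd_ns,
        "AT2": area_mm2 * t_cpd_ns ** 2,
        "F_max_MHz": 1e3 / t_cpd_ns,  # 1 / ns = 1000 MHz
    }


# Area and critical path delay of the proposed 2-D design, as reported in Table 4.
figures = asic_figures(area_mm2=0.585, t_cpd_ns=8.37)
for name, value in figures.items():
    print(name, round(value, 3))
```

The same function applied to the 1-D entries of Table 6 gives a quick consistency check on the reported area-time products.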
Table 4: Hardware resource consumption for CMOS 45 nm ASIC place-route implementation of the proposed 2-D 16-point DCT approximation

| Area (mm²) | $AT$ | $AT^2$ | $T_\text{cpd}$ (ns) | $F_\text{max}$ (MHz) | $D_p$ (mW/MHz) | $Q_p$ (mW) |
|---|---|---|---|---|---|---|
| 0.585 | 4.896 | 40.98 | 8.37 | 119.47 | 0.311 | 216.2 |

Among the considered competitors, the BAS-2010 [4] showed arithmetic complexity and coding performance similar to the proposed transform. For comparison purposes, the 1-D versions of the BAS-2010 approximation and the proposed 16-point approximation were realized on a Xilinx Virtex-6 XC6VLX240T-1FFG1156 device and were also ported to 45 nm CMOS technology and subjected to synthesis and place-and-route steps using the Cadence Encounter. The results are shown in Table 5 and Table 6. Compared to the BAS-2010, the proposed transform is faster in both the FPGA implementation and the CMOS synthesis, while having similar hardware usage and dynamic power consumption. Importantly, the proposed transform is better in image quality, as evidenced by Fig. 3.

Table 5: Hardware resource consumption of the 1-D approximations using a Xilinx Virtex-6 XC6VLX240T-1FFG1156 device

| Transform | CLB | FF | $T_\text{cpd}$ (ns) | $F_\text{max}$ (MHz) | $D_p$ (mW/MHz) | $Q_p$ (W) |
|---|---|---|---|---|---|---|
| BAS-2010 | 430 | 1440 | 1.950 | 512.82 | 4.54 | 3.49 |
| Proposed | 421 | 1372 | 1.900 | 526.31 | 4.22 | 3.49 |

5.3 Real time video compression software implementation

In order to assess real-time video coding performance, the proposed approximation was embedded into the open source HEVC standard reference software by the Fraunhofer Heinrich Hertz Institute [59]. The original transform prescribed in the selected HEVC reference software is the scaled approximation of the Chen DCT algorithm [42, 48], and the software can process image block sizes of 4 × 4, 8 × 8, 16 × 16, and 32 × 32.
Our methodology consists of replacing the 16 × 16 DCT algorithm of the reference software by the proposed 16-point approximate algorithm. Algorithms were evaluated for their effect on the overall performance of the encoding process. To this end, we obtained rate-distortion (RD) curves for standard video sequences [61]. The quantization parameter (QP) was varied from 0 to 50 to obtain the curves, and the resulting PSNR values, together with the bit rate values measured in bits per frame, were recorded for the proposed algorithm, the Chen DCT algorithm, and the BAS-2010 [4] algorithm. Fig. 6 depicts the obtained RD curves for the ‘BasketballPass’ test sequence. Fig. 7 shows particular 416 × 240 frames for the test video sequence ‘BasketballPass’ with QP ∈ {0, 32, 50}.

Table 6: Hardware resource consumption for CMOS 45 nm ASIC place-route implementation of the 1-D approximations

| Transform | Area (mm²) | $AT$ | $AT^2$ | $T_\text{cpd}$ (ns) | $F_\text{max}$ (MHz) | $D_p$ (mW/MHz) | $Q_p$ (mW) |
|---|---|---|---|---|---|---|---|
| BAS-2010 | 0.169 | 0.843 | 4.21 | 4.994 | 200.24 | 0.093 | 70.47 |
| Proposed | 0.183 | 0.895 | 4.38 | 4.895 | 204.29 | 0.095 | 78.73 |

Figure 6: RD curves for the ‘BasketballPass’ test sequence (PSNR in dB versus bits/frame; curves for the proposed transform, the Chen DCT, and BAS-2010).

The RD curves and selected frames reveal that the difference between the original HEVC and the implementation with the proposed approximation is negligible. In fact, in Fig. 6 the maximum PSNR difference is 0.56 dB, which is very low. Fig. 7 shows that both encoded video streams are almost identical. These results confirm the adequacy of the proposed scheme.

6 Conclusion

This work proposed a new orthogonal 16-point DCT approximation. The introduced transform offers a very low computational cost, outperforming, to the best of our knowledge, all competing methods.
Moreover, the proposed transform performed well as an image compression tool, especially in high-compression rate scenarios. By means of (i) comprehensive computational simulations, (ii) hardware implementation (both in FPGA and ASIC), and (iii) software embedding, we demonstrated the adequacy and efficiency of the proposed method, which is suitable for codec schemes such as HEVC. Additionally, the introduced transformation offers an unusually good performance balance among several metrics, as shown in Table 2. This suggests that the applicability of the proposed transform is not limited in scope to the image and video compression context.

Figure 7: Selected frames from the ‘BasketballPass’ test video coded by means of the Chen DCT and the proposed 16-point DCT approximation for QP = 0 (a–b), QP = 32 (c–d), and QP = 50 (e–f). (a) Chen DCT (QP = 0); (b) Proposed DCT (QP = 0); (c) Chen DCT (QP = 32); (d) Proposed DCT (QP = 32); (e) Chen DCT (QP = 50); (f) Proposed DCT (QP = 50).

Acknowledgments

This work was partially supported by CNPq, FACEPE, FAPERGS and FIT/UFSM (Brazil), and by the College of Engineering at the University of Akron, Akron, OH, USA.

References

[1] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego, CA: Academic Press, 1990.
[2] V. Britanak, P. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms. Academic Press, 2007.
[3] M. C. Lin, L. R. Dung, and P. K. Weng, “An ultra-low-power image compressor for capsule endoscope,” BioMedical Engineering OnLine, vol. 5, no. 1, pp. 1–8, Feb. 2006.
[4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “A novel transform for image compression,” in 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2010, pp. 509–512.
[5] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “A Low-Complexity Parametric Transform for Image Compression,” in Proceedings of the 2011 IEEE International Symposium on Circuits and Systems, 2011.
[6] T.-S. Chang, C.-S. Kung, and C.-W. Jen, “A simple processor core design for DCT/IDCT,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 3, pp. 439–447, Apr. 2000.
[7] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “Low-complexity 8 × 8 transform for image compression,” Electronics Letters, vol. 44, no. 21, pp. 1249–1250, Sep. 2008.
[8] R. J. Cintra and F. M. Bayer, “A DCT approximation for image compression,” IEEE Signal Processing Letters, vol. 18, no. 10, pp. 579–582, Oct. 2011.
[9] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, “Improved 8-point approximate DCT for image and video compression requiring only 14 additions,” IEEE Transactions on Circuits and Systems I, vol. 61, no. 6, pp. 1727–1740, 2014.
[10] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York, NY: Van Nostrand Reinhold, 1992.
[11] N. Roma and L. Sousa, “Efficient hybrid DCT-domain algorithm for video spatial downscaling,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 2, pp. 30–30, 2007.
[12] International Organisation for Standardisation, “Generic coding of moving pictures and associated audio information – Part 2: Video,” ISO, ISO/IEC JTC1/SC29/WG11 - Coding of Moving Pictures and Audio, 1994.
[13] International Telecommunication Union, “ITU-T recommendation H.261 version 1: Video codec for audiovisual services at p × 64 kbits,” ITU-T, Tech. Rep., 1990.
[14] International Telecommunication Union, “ITU-T recommendation H.263 version 1: Video coding for low bit rate communication,” ITU-T, Tech. Rep., 1995.
[15] Joint Video T eam, “Recommenda tio n H.26 4 and ISO/IEC 14 496–1 0 A V C: Draft ITU-T recommenda- tion and ﬁnal draft international sta ndard of join t video sp eciﬁca tion,” ITU-T, T ech . Rep., 2003. 15 [16] Y. Arai, T. Agui, and M. Nak a jima, “A fa s t DCT-SQ s cheme fo r images ,” T r ansactions of the IEICE , vol. E- 71, no. 11, pp. 1095– 1 097, Nov. 1988. [17] C. Lo e ﬄer , A. Lig ten b erg , and G. Mo sch ytz, “ P ractical fast 1D DCT algo rithms with 11 multiplications,” in Pr o c e e dings of the In t ernational Confer enc e on A c oustics, Sp e e ch, and Signal Pr o c essing , 19 8 9, pp. 988–9 91. [18] Z. W ang, “F ast algorithms for the discrete W tra nsform a nd for the discrete Fo urier tr ansform,” IEEE T r ansactions on Ac oustics, S p e e ch and Signal Pr o c essing , v ol. ASSP-32 , pp. 80 3–81 6, Aug. 198 4. [19] B. G. Lee, “A new algo rithm for computing the discrete cos ine transfor m,” IEEE T r ansactions on A c oustics, Sp e e ch and Signal Pr o c essing , vol. ASSP-3 2 , pp. 1243– 1245 , Dec. 1984. [20] M. V etterli and H. Nuss baumer, “Simple FFT and DCT algor ithms with reduced n umber o f op eratio ns,” Signal Pr o c essing , vol. 6, pp. 267–2 78, Aug. 19 84. [21] H. S. Hou, “A fas t recurs ive algo r ithm for computing the discr ete cosine tr ansform,” IEEE T r ansactions on A c oustic, Signal, and Sp e e ch Pr o c essing , vol. 6, no. 10 , pp. 1455–1 461, 1987 . [22] E. F eig and S. Winograd, “F ast algorithms for the discr ete cosine tra nsform,” IEEE T ra nsactions on Signal Pr o c essing , vol. 40, no. 9, pp. 2174– 2193 , 199 2. [23] M. T. Heideman and C. S. Burrus, Multiplic ative c omplexity, c onvolution, and the DFT , ser . Signal Pro cess ing and Digita l Filtering. Springer-V erla g, 19 88. [24] J. Lia ng and T. D. T ra n, “F ast m ultiplierless approximation of the DCT with the lifting scheme,” IEEE T r ansactions on S ignal Pr o c essing , v ol. 49 , pp. 3032–3 044, 2001 . [25] A. E dirisuriya, A. Ma danay ake, V. 
Dimitrov, R. J. Cintra, and J. Adik ar i, “VLSI architecture for 8- po int AI-ba s ed Arai DCT having lo w ar ea-time complexity a nd p ow er at improv ed accuracy ,” Journal of L ow Power Ele ctr onics and A pplic ations , vol. 2, no. 2 , pp. 127–1 42, 201 2 . [26] T. I. Haw eel, “A new squa re w av e transfor m based on the DCT,” S ignal Pr o c essing , vol. 8 2, pp. 230 9– 2319, 2001. [27] K. Lengwehasatit and A. Orteg a, “Scalable v ar iable complexity approximate forward DCT,” IEEE T r ansactions on Cir cuits and Systems for Vide o T e chnolo gy , vol. 14 , no. 11, pp. 1236–1 248, Nov. 2 0 04. [28] S. Bo uguezel, M. O . Ahmad, and M. Swam y , “A fast 8 × 8 transform for ima g e co mpression,” in 2009 International Confer enc e on Micr o ele ct r onics (ICM) , Dec. 2009 , pp. 7 4–77 . [29] S. Bouguezel, M. O. Ahmad, and M. N. S. Sw amy , “ Binary discrete cosine and Har tley tr ansforms,” IEEE T r ansactions on Cir cuits and Systems I: R e gular Pap ers , vol. 60 , no. 4, pp. 989– 1002, 2 013. [30] F. M. Bay er and R. J. Cintra, “DCT-like transform for ima g e compression req uir es 14 additions only ,” Ele ctr onics L etters , vol. 48, no. 15, pp. 919–9 2 1, 2012 . [31] U. S. Potluri, A. Ma danay ake, R. J. Cintra, F. M. Bay er, and N. Ra japaksha, “ Multiplier-free DCT approximations for RF multi-beam digital a per ture-ar r ay space imaging and directiona l sensing,” Me a- sur ement Scienc e and T e chnolo gy , vol. 23, no . 11, p. 11400 3, 20 12. [32] In ternationa l T elecommunication Union, Infr astructure of audiov isual servic es - Co ding of moving vide o - High eﬃciency vide o c o ding . T eleco mmu nicatio n Standarization Sector of ITU, 2013. 16 [33] M. T. Pouraza d, C. Doutr e, M. Azimi, and P . Nasiop oulo s , “HEVC: The new gold standar d for video compressio n: Ho w do es HEVC compare with H.264/ A VC?” IEEE Consumer Ele ctr onics Magazine , vol. 1, no. 3, pp. 36–46 , Jul. 20 12. [34] W. K . 
Cham, “Development of in teger cosine transforms by the principle of dyadic symmetry ,” in IEE Pr o c e e dings I Commun ic ations, Sp e e ch and Vision , vol. 13 6, no. 4 , 1989, pp. 276–2 82. [35] W. K. Cham and Y. T. Chan, “ An o rder-16 integer co sine tra nsform,” IEEE T ra nsactions on Signal Pr o c essing , vol. 39, no. 5, pp. 1205–1 208, 19 91. [36] J. Dong, K . N. Ngan, C.-K. F ong , a nd W.-K. Cham, “2-D or der-16 in teger tr a nsforms for HD video co ding,” IEEE T r ansactions on Cir cuits and Systems for Vide o T e chnolo gy , vol. 1 9, no . 10, pp. 1462– 1474, 2009 . [37] C.-K. F ong and W.-K . Cham, “LLM integer cosine tr ansform and its fast a lg orithm,” IEEE T r ansactions on Cir cuits and Systems for Vide o T e chnolo gy , vol. 22 , no. 6, pp. 844– 854, 20 12. [38] R. K. Y arlaga dda and J. E. Hershey , Hadamar d Matrix Analysis and Synthesis With Applic ations to Communic ations and Signal/Image Pr o c essing . Kluw er Academic Publishers, 1997. [39] I. V alov a and Y. K osugi, “Hadamard- based ima ge decomp osition and c o mpression,” IEEE T r ansactions on Information T e chnolo gy in Biome dic ine , v ol. 4, no. 4, pp. 306 –319, 2000. [40] F. M. B ay er, R. J. Cintra, A. Edirisuriya, and A. Madanayak e, “ A digital hardware fast a lg orithm and FPGA-based prototype for a nov el 16 - p o int appr oximate DCT for image compress ion applications,” Me asur emen t S cienc e and T e chnolo gy , vol. 23, no. 8, pp. 114 010– 114 019, 2012. [41] R. E . Blahut, F ast Algorithms for Signal Pr o c essing . Cambridge Universit y Pres s , 2 010. [42] W. H. Chen, C. Smith, and S. F ralick, “A fast co mputatio nal algo rithm for the discrete co sine transfor m,” IEEE T r ansactions on Communic ations , v ol. 25, no. 9, pp. 100 4–10 09, Sep. 197 7. [43] P . Yip a nd K. R. Ra o, “The decimation-in-frequency algorithms for a family o f discrete sine and co sine transform,” Cir cuits, Syst ems, and Signal Pr o c essing , pp. 4–19 , 1 988. [44] R. J . 
Cin tra, “An in teger approximation method for discrete sin uso idal transforms,” Cir cuits, Syst ems, and S ignal Pr o c essing , vol. 30, no. 6 , pp. 1481– 1 501, 2011 . [45] R. J . Cin tra, F. M. Bay er, and C. J. T ablada, “Low-complexit y 8- po int DCT approximations based on int eger functions,” S ignal Pr o c essing , v ol. 99 , pp. 201–21 4, 2014 . [46] T. D. T ran, “The binDCT: F ast multiplierless a pproximation of the DCT,” IEEE Signal Pr o c essing L etters , v ol. 6, no. 7, pp. 1 41–14 4, 2000 . [47] I. N. Her stein, T opics in Algebr a , 2nd e d., W. India, Ed. John Wiley & Sons, 1975 . [48] M. Capelo , “Adv ances on trans fo rms for high eﬃciency video co ding,” Master ’s thesis, Instituto Sup erior T´ ecnico, Lisboa , 2011. [49] J. G. Proakis and D. K. Manolakis, Digital Signal Pr o c essing , 4th ed. Prentice Hall, 2006. [50] M. Wien and S. Sun, “ICT compar ison for adaptiv e blo ck transforms,” T ech. Rep., 2001 . 17 [51] D. S. W atkins, F undamentals of Matrix Computations , s e r. P ur e and Applied Mathematics: A Wiley Series of T exts, Monogr aphs and T rac ts. Wiley , 2004. [52] A. V. Oppenheim, D iscr ete-Time Signal Pr o c essing . Pearson Educa tio n, 2006. [53] G. A. F. Seber , A Matrix Handb o ok for St atisticians . Wiley-Interscience, 2007 . [54] H. S. Ma lv ar, Signal Pr o c essing with L app e d T r ans forms . Artech House, 199 2 . [55] “The USC-SIPI image databas e,” http://sipi.usc.edu/data base/, 201 1 , Universit y of Southern Califor- nia, Signal and Image Pro cessing Institute. [56] I.-M. Pao and M.-T. Sun, “Approximation of calcula tions for for ward discrete cosine transform,” IEEE T r ansactions on Cir cuits and Systems for Vide o T e chnolo gy , vol. 8, no. 3, pp. 2 64–26 8, Jun. 199 8. [57] Z. W ang, A. C. Bovik, H. R. Sheikh, and E. P . Simoncelli, “Image quality a ssessment: from er r or visibility to structural similar ity ,” IEEE T ra nsactions Image Pr o c essing , vol. 13, no. 4, pp. 600 –612 , Apr. 2004 . [58] Z. 
W ang and A. Bovik, “A universal image qualit y index,” IEEE Signal Pr o c essing L etters , vol. 9, no. 3, pp. 81–8 4 , 200 2. [59] Joint Collab or ative T eam o n Video Coding (JCT-VC), “HEVC references soft ware do cumentation,” 2013, Fraunhofer Heinrich Hertz Ins titute. [O nline]. Av ailable: https://hevc.hhi.fraunhofer.de/ [60] S. W. Smith, The Scientist & Engine er’s Gu ide to Digital Signal Pr o c essing , 199 7. [61] A. O rtega and K. Ra mchandran, “Rate-distortio n metho ds for image and video compre ssion,” IEEE Signal Pr o c essing Magazine , vol. 15, no. 6, pp. 23–5 0, Nov 1998. 18
