Energy-efficient 8-point DCT Approximations: Theory and Hardware Architectures

Renato J. Cintra∗, Fábio M. Bayer†, Vitor A. Coutinho‡, Sunera Kulasekera, Arjuna Madanayake§

Abstract

Due to its remarkable energy compaction properties, the discrete cosine transform (DCT) is employed in a multitude of compression standards, such as JPEG and H.265/HEVC. Several low-complexity integer approximations for the DCT have been proposed for both 1-D and 2-D signal analysis. The increasing demand for low-complexity, energy-efficient methods requires algorithms with even lower computational costs. In this paper, new 8-point DCT approximations with very low arithmetic complexity are presented. The new transforms are obtained by pruning state-of-the-art DCT approximations. The proposed algorithms were assessed in terms of arithmetic complexity, energy retention capability, and image compression performance. In addition, a metric combining performance and computational complexity measures was proposed. Results showed good performance and extremely low computational complexity. The introduced algorithms were mapped into systolic-array digital architectures, physically realized as digital prototype circuits using FPGA technology, and mapped to 45 nm CMOS technology. All hardware-related metrics showed low resource consumption for the proposed pruned approximate transforms. The best proposed transform according to the introduced metric presents a reduction in power consumption of 21–25%.

Keywords: DCT approximation, image compression, FPGA, pruned transforms

1 Introduction

Transform-based methods are widely employed in digital signal processing applications [1]. In this context, the efficient computation of discrete transforms has constantly attracted community efforts and the proposition of fast algorithms [2].
In particular, the 8-point discrete cosine transform (DCT) has a proven record of scientific and industrial applications, as demonstrated by the multitude of image and video coding standards that adopt it, such as: JPEG [3], MPEG [4–6], H.261 [7, 8], H.263 [5, 9], H.264/AVC [10, 11], and the recent high efficiency video coding (HEVC) [12, 13]. The HEVC is capable of achieving high compression performance at approximately half the bit rate required by its predecessor H.264/AVC with the same image quality [13–16]. On the other hand, the HEVC requires a significantly higher computational complexity in terms of arithmetic operations [14–17], being 2–4 times more computationally costly than H.264/AVC [14, 16]. In this context, the efficient computation of the DCT is an avenue for improving the performance of the above-mentioned codecs.

∗ Renato J. Cintra is with the Signal Processing Group, Departamento de Estatística, Universidade Federal de Pernambuco, Recife, PE, Brazil; Equipe Cairn, IRISA-INRIA, Université de Rennes 1, Rennes, France; LIRIS, Institut National des Sciences Appliquées, Lyon, France (e-mail: rjdsc@stat.ufpe.org).
† Fábio M. Bayer is with the Departamento de Estatística and LACESM, Universidade Federal de Santa Maria, Santa Maria, RS, Brazil (e-mail: bayer@ufsm.br).
‡ Vitor A. Coutinho is with the Graduate Program in Electrical Engineering and the Signal Processing Group, Departamento de Estatística, Universidade Federal de Pernambuco, Recife, PE, Brazil (e-mail: vitor.andrade.coutinho@gmail.com).
§ Sunera Kulasekera and Arjuna Madanayake are with the Department of Electrical and Computer Engineering, The University of Akron, Akron, OH, USA (e-mail: arjuna@uakron.edu).

Since its inception, several fast algorithms for the DCT have been proposed [18–23].
However, traditional algorithms aim at the computation of the exact DCT, which requires several multiplication operations. Additionally, several algorithms have achieved the theoretical lower bounds of multiplicative complexity [21, 24]. As a consequence, progress in this area has headed toward approximate methods [25–27]. In some applications, a simple DCT approximation can provide meaningful results at low arithmetic complexity [28]. Thus, approximation techniques for the DCT are becoming increasingly popular [25, 27, 29–31]. Such approximations can reduce the computational demands of the DCT, leading to low-power, high-speed realizations [16], while ensuring adequate numerical accuracy.

Furthermore, it is a well-known fact that in many DCT applications [32–34], the most useful signal information tends to be concentrated in the low-frequency coefficients. This is because the DCT presents good energy compaction properties, which are closely related to the Karhunen-Loève transform [35]. Therefore, only the low-frequency DCT components need to be computed in these applications. A typical example of this situation occurs in data compression applications [36], where high-frequency components are often zeroed by the quantization process [37, p. 586]. Then, only the quantities that are likely to be significant should be computed [38]. This approach is called frequency-domain pruning and has been employed for computing the discrete Fourier transform (DFT) [39–43]. Such methodology was originally applied in the DCT context in [44] and [45]. In [32, 46], the two-dimensional (2-D) version of the pruned DCT was proposed. In the context of low-powered wireless vision sensor networks, a pruned DCT was proposed in [47] based on the binary DCT [30]. In [48], Meher et al.
proposed an HEVC architecture where the wordlength was kept fixed by discarding least significant bits. In that context, the goal was the minimization of the computational complexity at the expense of wordlength truncation. Such approach was also termed 'pruning'. However, it is fundamentally different from the approach discussed in the current paper. This terminology distinction is worth observing.

Thus, in response to the growing need for high compression of image and moving pictures for various applications [12], we propose a further reduction of the computational cost of the 8-point DCT computation in the context of JPEG-like compression and HEVC processing. In this work, we introduce pruned DCT approximations for image and video compression. Essentially, DCT-like pruning consists of extracting from a given approximate DCT matrix a submatrix that aims at furnishing similar mathematical properties. We advance the application of pruning techniques to several DCT approximations listed in recent literature. In this paper, we aim at identifying adequate pruned approximations for image compression applications. VLSI realizations of both the 1-D and 2-D versions of the proposed methods are also sought.

This paper is organized as follows. In Section 2, a mathematical review of DCT approximation and pruning methods is furnished. Exact and approximate DCTs are presented and the pruning procedure is mathematically described. In Section 3, we propose several pruned methods for approximate DCT computation and assess them by means of arithmetic complexity, coefficient energy distribution in the transform domain, and image compression performance. A combined figure of merit considering performance and complexity is introduced. In Section 4, a VLSI realization of the optimum pruned method according to the suggested figure of merit is proposed.
Both FPGA and ASIC realizations are assessed in terms of area, time, frequency, and power consumption. Section 5 concludes the paper.

2 Mathematical Background

2.1 Discrete Cosine Transform

Let $\mathbf{x} = [x_0\ x_1\ \cdots\ x_{N-1}]^\top$ be an $N$-point input vector. The one-dimensional DCT is a linear transformation that maps $\mathbf{x}$ into an output vector $\mathbf{X} = [X_0\ X_1\ \cdots\ X_{N-1}]^\top$ of transform coefficients, according to the following expression [49]:

$$X_k = \alpha_k \cdot \sqrt{\frac{2}{N}} \cdot \sum_{n=0}^{N-1} x_n \cdot \cos\left[\frac{(n + \frac{1}{2})\,k\pi}{N}\right], \tag{1}$$

where $k = 0, 1, \ldots, N-1$, $\alpha_0 = 1/\sqrt{2}$, and $\alpha_k = 1$ for $k > 0$. In matrix formalism, (1) is given by:

$$\mathbf{X} = \mathbf{C} \cdot \mathbf{x}, \tag{2}$$

where $\mathbf{C}$ is the $N$-point DCT matrix whose entries are given by $c_{m,n} = \alpha_m \cdot \sqrt{2/N} \cdot \cos\left[(n + \frac{1}{2})\,m\pi/N\right]$, $m, n = 0, 1, \ldots, N-1$ [23]. Because the DCT is an orthogonal transform, the inverse transformation is given by $\mathbf{x} = \mathbf{C}^\top \cdot \mathbf{X}$.

Because the DCT satisfies the kernel separability property, the 2-D DCT can be expressed in terms of the 1-D DCT. Let $\mathbf{A}$ be an $N \times N$ matrix. The forward 2-D DCT operation applied to $\mathbf{A}$ yields a transform-domain image $\mathbf{B}$ furnished by $\mathbf{B} = \mathbf{C} \cdot \mathbf{A} \cdot \mathbf{C}^\top$. In fact, the 2-D DCT can be computed by eight column-wise calls of the 1-D DCT to $\mathbf{A}$; the resulting intermediate image is then submitted to eight row-wise calls of the 1-D DCT. In this paper, we devote our attention to the case $N = 8$.

2.2 DCT Approximations

In general terms, a DCT approximation $\hat{\mathbf{C}}$ is the product of a low-complexity matrix $\mathbf{T}$ and a scaling diagonal matrix $\mathbf{S}$ that ensures orthogonality or quasi-orthogonality [31]. Thus, we have $\hat{\mathbf{C}} = \mathbf{S} \cdot \mathbf{T}$ [16, 27, 28, 50]. The entries of the low-complexity matrix are defined over the set $\{0, \pm 1, \pm 2\}$, which results in a multiplierless operator: only addition and bit-shifting operations are required.
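As a concrete illustration of the exact transform in (1)–(2) and of the separable 2-D form $\mathbf{B} = \mathbf{C} \cdot \mathbf{A} \cdot \mathbf{C}^\top$, the following minimal Python sketch builds the DCT matrix and applies it to a block. The function names (`dct_matrix`, `dct2`, `idct2`, and the list-based helpers) are illustrative choices and are not taken from the paper.

```python
import math

def dct_matrix(n=8):
    """Build the n-point DCT matrix C of Eq. (2); row m, column k."""
    rows = []
    for m in range(n):
        alpha = 1.0 / math.sqrt(2.0) if m == 0 else 1.0
        scale = alpha * math.sqrt(2.0 / n)
        rows.append([scale * math.cos((k + 0.5) * m * math.pi / n)
                     for k in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def dct2(block):
    """Separable 2-D DCT of a square block: B = C . A . C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def idct2(coeffs):
    """Inverse 2-D DCT, relying on orthogonality: A = C^T . B . C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)
```

Under this construction, `matmul(c, transpose(c))` is numerically the identity, which is the orthogonality property used for the inverse transformation.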
Usually possessing irrational elements, the scaling diagonal matrix $\mathbf{S}$ does not pose any extra computational overhead for image and video compression applications. This is due to the fact that the matrix $\mathbf{S}$ can be conveniently merged into the quantization step of compression algorithms [16, 27, 29, 50].

Among the various DCT approximations archived in literature, we selected the following methods: (i) the signed DCT (SDCT), which is the seminal method in the DCT approximation field [25]; (ii) the Bouguezel-Ahmad-Swamy approximations [27, 29, 30]; (iii) the rounded DCT (RDCT) [28]; and (iv) the modified RDCT (MRDCT) [50]. These approximations were selected because they collectively exhibit a wide range of complexity vs. performance trade-off figures [50]. Moreover, such approximations have been demonstrated to be useful in image compression. The low-complexity matrices of the above methods are shown in Table 1. Additionally, we also considered the 8-point naturally ordered Walsh-Hadamard transform (WHT), which is a well-known low-complexity transform with applications in image processing [30, 51].

Table 1: Approximate DCT methods (low-complexity matrix T and orthogonality)

SDCT [25], orthogonal: No
   1    1    1    1    1    1    1    1
   1    1    1    1   -1   -1   -1   -1
   1    1   -1   -1   -1   -1    1    1
   1   -1   -1   -1    1    1    1   -1
   1   -1   -1    1    1   -1   -1    1
   1   -1    1    1   -1   -1    1   -1
   1   -1    1   -1   -1    1   -1    1
   1   -1    1   -1    1   -1    1   -1

WHT [51], orthogonal: Yes
   1    1    1    1    1    1    1    1
   1   -1    1   -1    1   -1    1   -1
   1    1   -1   -1    1    1   -1   -1
   1   -1   -1    1    1   -1   -1    1
   1    1    1    1   -1   -1   -1   -1
   1   -1    1   -1   -1    1   -1    1
   1    1   -1   -1   -1   -1    1    1
   1   -1   -1    1   -1    1    1   -1

BAS-2008 [27], orthogonal: Yes
   1    1    1    1    1    1    1    1
   1    1    0    0    0    0   -1   -1
   1  1/2 -1/2   -1   -1 -1/2  1/2    1
   0    0   -1    0    0    1    0    0
   1   -1   -1    1    1   -1   -1    1
   1   -1    0    0    0    0    1   -1
 1/2   -1    1 -1/2 -1/2    1   -1  1/2
   0    0    0   -1    1    0    0    0

BAS-2009 [29], orthogonal: Yes
   1    1    1    1    1    1    1    1
   1    1    0    0    0    0   -1   -1
   1    1   -1   -1   -1   -1    1    1
   0    0   -1    0    0    1    0    0
   1   -1   -1    1    1   -1   -1    1
   1   -1    0    0    0    0    1   -1
   1   -1    1   -1   -1    1   -1    1
   0    0    0   -1    1    0    0    0

BAS-2013 [30], orthogonal: Yes
   1    1    1    1    1    1    1    1
   1    1    1    1   -1   -1   -1   -1
   1    1   -1   -1   -1   -1    1    1
   1    1   -1   -1    1    1   -1   -1
   1   -1   -1    1    1   -1   -1    1
   1   -1   -1    1   -1    1    1   -1
   1   -1    1   -1   -1    1   -1    1
   1   -1    1   -1    1   -1    1   -1

RDCT [28], orthogonal: Yes
   1    1    1    1    1    1    1    1
   1    1    1    0    0   -1   -1   -1
   1    0    0   -1   -1    0    0    1
   1    0   -1   -1    1    1    0   -1
   1   -1   -1    1    1   -1   -1    1
   1   -1    0    1   -1    0    1   -1
   0   -1    1    0    0    1   -1    0
   0   -1    1   -1    1   -1    1    0

MRDCT [50], orthogonal: Yes
   1    1    1    1    1    1    1    1
   1    0    0    0    0    0    0   -1
   1    0    0   -1   -1    0    0    1
   0    0   -1    0    0    1    0    0
   1   -1   -1    1    1   -1   -1    1
   0   -1    0    0    0    0    1    0
   0   -1    1    0    0    1   -1    0
   0    0    0   -1    1    0    0    0

2.3 Pruned Exact and Approximate DCT

Essentially, DCT pruning consists of extracting from the 8 × 8 DCT matrix $\mathbf{C}$ a submatrix that aims at furnishing similar mathematical properties as $\mathbf{C}$. Pruning is often realized in the transform domain by computing fewer transform coefficients than prescribed by the full transformation.
Usually, only the $K < N$ coefficients that retain more energy are preserved. For the DCT, this corresponds to the first $K$ rows of the DCT matrix. Therefore, this particular type of pruning implies the following $K \times 8$ matrix:

$$\mathbf{C}_K = \begin{bmatrix} c_{0,0} & c_{0,1} & \cdots & c_{0,7} \\ c_{1,0} & c_{1,1} & \cdots & c_{1,7} \\ \vdots & \vdots & \ddots & \vdots \\ c_{K-1,0} & c_{K-1,1} & \cdots & c_{K-1,7} \end{bmatrix}, \tag{3}$$

where $0 < K \leq 8$ and $c_{m,n}$, $m, n = 0, 1, \ldots, 7$, are the entries of $\mathbf{C}$. The case $K = 8$ corresponds to the original transformation. Such procedure was proposed in [32, 46] for the DCT in the context of wireless sensor networks. For the 2-D case, the pruned DCT is given by $\tilde{\mathbf{B}} = \mathbf{C}_K \cdot \mathbf{A} \cdot \mathbf{C}_K^\top$. Notice that $\tilde{\mathbf{B}}$ is a $K \times K$ matrix over the transform domain. Lecuire et al. [46] showed that retaining the transform-domain coefficients in a $K \times K$ square pattern at the upper-left corner, which holds the low-frequency coefficients selected by $\mathbf{C}_K$, leads to a better energy-distortion trade-off when compared to the alternative triangle pattern [32].

The pruning approach can be applied to DCT approximations. By discarding the lower rows of the low-complexity matrix $\mathbf{T}$, we obtain the following $K \times N$ pruned matrix transformation:

$$\mathbf{T}_K = \begin{bmatrix} t_{0,0} & t_{0,1} & \cdots & t_{0,7} \\ t_{1,0} & t_{1,1} & \cdots & t_{1,7} \\ \vdots & \vdots & \ddots & \vdots \\ t_{K-1,0} & t_{K-1,1} & \cdots & t_{K-1,7} \end{bmatrix}, \tag{4}$$

where $t_{m,n}$, $m, n = 0, 1, \ldots, 7$, are the entries of $\mathbf{T}$ (cf. Table 1). Considering the orthogonalization method described in [31], the $K \times 8$ pruned approximate DCT is given by:

$$\hat{\mathbf{C}}_K = \mathbf{S}_K \cdot \mathbf{T}_K, \tag{5}$$

where $\mathbf{S}_K = \sqrt{\operatorname{diag}\{(\mathbf{T}_K \cdot \mathbf{T}_K^\top)^{-1}\}}$ is a $K \times K$ diagonal matrix and $\operatorname{diag}(\cdot)$ returns a diagonal matrix with the diagonal elements of its argument. If $\mathbf{T}$ is orthogonal, then $\mathbf{T}_K$ satisfies semi-orthogonality [52, p. 84]. The 2-D pruned DCT of a matrix $\mathbf{A}$ is given by

$$\tilde{\mathbf{B}} = \mathbf{T}_K \cdot \mathbf{A} \cdot \mathbf{T}_K^\top. \tag{6}$$

The resulting transform-domain matrix $\tilde{\mathbf{B}}$ is of size $K \times K$.
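The pruning steps in (4)–(6) can be sketched in a few lines of Python for the MRDCT matrix of Table 1. This is only an illustrative sketch: the function names are hypothetical, and the scaling routine assumes that the rows of $\mathbf{T}_K$ are mutually orthogonal (as is the case for the MRDCT), so that $(\mathbf{T}_K \cdot \mathbf{T}_K^\top)^{-1}$ is diagonal and no general matrix inversion is needed.

```python
import math

# Low-complexity MRDCT matrix T, copied from Table 1.
MRDCT = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [1,  0,  0,  0,  0,  0,  0, -1],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [0,  0, -1,  0,  0,  1,  0,  0],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [0, -1,  0,  0,  0,  0,  1,  0],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0,  0,  0, -1,  1,  0,  0,  0],
]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def prune(t, k):
    """Eq. (4): keep only the first k rows of t."""
    return [row[:] for row in t[:k]]

def scaling(t_k):
    """Diagonal entries of S_K in Eq. (5).

    Assumption: rows of t_k are mutually orthogonal, so each entry is
    simply the reciprocal of the corresponding row norm.
    """
    return [1.0 / math.sqrt(sum(v * v for v in row)) for row in t_k]

def pruned_approx_dct(t, k):
    """Eq. (5): C_hat_K = S_K . T_K, a k x 8 matrix."""
    t_k = prune(t, k)
    return [[s * v for v in row] for s, row in zip(scaling(t_k), t_k)]

def pruned_2d(t, k, block):
    """Eq. (6): B_tilde = T_K . A . T_K^T, a k x k matrix."""
    t_k = prune(t, k)
    return matmul(matmul(t_k, block), transpose(t_k))
```

For instance, `pruned_approx_dct(MRDCT, 6)` returns a 6 × 8 matrix whose rows are orthonormal, which is the semi-orthogonality property mentioned above.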
3 Complexity and Performance Assessment

In this section, we analyze the arithmetic complexity of the selected pruned DCT approximations. We also assess their performance in terms of energy retention and image compression for each value of $K$.

3.1 Arithmetic complexity

Because all considered approximate DCTs are natively multiplierless operators, the pruned DCT approximations inherit this property. Therefore, the arithmetic complexity of the pruned approximations is simply given by the number of additions and bit-shifting operations required by their respective fast algorithms. To illustrate the complexity assessment, we focus on the MRDCT [50], whose fast algorithm signal flow graph (SFG) is shown in Figure 1(a). The full computation of the MRDCT requires 14 additions. By judiciously considering the computational cost of only the first $K$ transform-domain components, we derived fast algorithms for the pruned MRDCT matrices as shown in Figure 1. The same procedure was applied to each of the discussed approximations based on their fast algorithms [25, 27–30, 50, 51]. The obtained additive complexity is presented in Table 2. We notice that the pruned MRDCT exhibited the lowest computational complexity for all values of $K$. Such mathematical properties of the MRDCT translate into good hardware designs. Indeed, in [16], several DCT approximations were physically realized in FPGA devices. Hardware and performance assessments revealed that the MRDCT outperformed several competitors, including BAS-2008 [27] and RDCT [28], in terms of speed, hardware resource consumption, and power consumption [16].

An examination of (6) reveals that the 2-D pruned approximate DCT is computed by eight column-wise calls of the 1-D pruned approximate DCT and $K$ row-wise calls of the 1-D pruned approximate DCT.
Let $A_{\text{1-D}}(\mathbf{T}_K)$ be the additive complexity of $\mathbf{T}_K$. Therefore, the additive complexity of the 2-D pruned approximate DCT is given by:

$$A_{\text{2-D}}(\mathbf{T}_K) = 8 \cdot A_{\text{1-D}}(\mathbf{T}_K) + K \cdot A_{\text{1-D}}(\mathbf{T}_K) = (8 + K) \cdot A_{\text{1-D}}(\mathbf{T}_K). \tag{7}$$

For the particular case of the pruned MRDCT, we can derive the expressions below:

$$A_{\text{1-D}}(\mathbf{T}_K) = K + 6, \tag{8}$$

$$A_{\text{2-D}}(\mathbf{T}_K) = K^2 + 14 \cdot K + 48, \tag{9}$$

for $K = 1, 2, \ldots, 8$.

3.2 Retained energy

To further examine the performance of the pruned approximations, we investigate the signal energy distribution in the transform domain for each value of $K$. This analysis is relevant because a higher energy concentration implies that $K$ can be reduced without severely degrading the transform coding performance [23]. In fact, a higher energy concentration yields a large number of zeros in the transform domain after quantization. In turn, a large number of zeros translates into longer runs of zeros, which are beneficial for the subsequent run-length encoding and Huffman coding stages [53].

We analyzed a set of fifty 512 × 512 256-level grayscale standard images from [54]. Images that were originally in color were converted to grayscale by extracting the luminance. Image types included textures, satellite images, landscapes, portraits, and natural images. Such variety is intended to ensure that selection bias is not introduced in our experiments. Thus our results are expected to be robust in this sense. Images were split into 8 × 8 subimages. The resulting subimages were submitted to each of the discussed pruned DCT approximations for all values of $K$. Subsequently, the relative amount of retained energy in the transform domain was computed. The obtained values are displayed in Table 2.
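The retained-energy measurement described above can be sketched for a single 8 × 8 subimage as follows. The exact formula is not spelled out in the text, so this sketch assumes that the retained energy is the ratio between the energy of the $K \times K$ coefficients of the scaled pruned transform and the total energy of the subimage; the names `retained_energy` and `pruned_orthonormal` are hypothetical, and the scaling again assumes mutually orthogonal rows (true for the MRDCT).

```python
import math

# Low-complexity MRDCT matrix T, copied from Table 1.
MRDCT = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [1,  0,  0,  0,  0,  0,  0, -1],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [0,  0, -1,  0,  0,  1,  0,  0],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [0, -1,  0,  0,  0,  0,  1,  0],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0,  0,  0, -1,  1,  0,  0,  0],
]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def pruned_orthonormal(t, k):
    """First k rows of t, each scaled to unit norm (orthogonal rows assumed)."""
    out = []
    for row in t[:k]:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

def retained_energy(block, t, k):
    """Fraction of the block energy kept by the k x k pruned coefficients.

    Assumed definition: sum of squared retained coefficients divided by
    the sum of squared pixel values of the 8 x 8 block.
    """
    c_k = pruned_orthonormal(t, k)
    coeffs = matmul(matmul(c_k, block), transpose(c_k))
    kept = sum(v * v for row in coeffs for v in row)
    total = sum(v * v for row in block for v in row)
    return kept / total if total > 0 else 1.0
```

Averaging `100 * retained_energy(...)` over all subimages of all test images would give a percentage comparable in spirit to the figures reported in Table 2, although the published values depend on the actual image set.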
Figure 1: Signal flow graph for the MRDCT matrix and pruned MRDCT matrices. Panels: (a) Original MRDCT (14 additions); (b) K = 7 (13 additions); (c) K = 6 (12 additions); (d) K = 5 (11 additions); (e) K = 4 (10 additions); (f) K = 3 (9 additions); (g) K = 2 (8 additions); (h) K = 1 (7 additions).

Table 2: Complexity and performance assessment of pruned DCT approximations

Additive complexity (number of additions)
Method              K=1   K=2   K=3   K=4   K=5   K=6   K=7   K=8
Exact DCT [32]        7    20    23    24    25    26    28    29
WHT [51]              7     8    11    12    19    20    23    24
SDCT [25]             7    14    17    19    20    22    23    24
BAS-2008 [27]         7    10    13    14    15    16    17    18
BAS-2009 [29]         7    10    13    14    15    16    17    18
BAS-2013 [30, 47]     7    14    17    20    21    22    23    24
RDCT [28]             7    12    13    16    17    19    20    22
MRDCT [50]            7     8     9    10    11    12    13    14

Mean retained energy (%)
Method              K=1     K=2     K=3     K=4     K=5     K=6     K=7     K=8
Exact DCT         95.46   97.47   98.55   99.13   99.49   99.71   99.87  100.00
WHT               95.46   95.57   96.03   96.25   98.24   98.52   99.63  100.00
SDCT              95.46   96.39   97.30   98.16   98.52   99.26   99.61  100.00
BAS-2008          95.46   97.08   98.10   98.86   99.20   99.51   99.68  100.00
BAS-2009          95.46   97.08   97.96   98.71   99.04   99.35   99.68  100.00
BAS-2013          95.46   97.18   98.08   98.76   99.10   99.44   99.77  100.00
RDCT              95.46   97.36   98.28   98.81   99.16   99.41   99.75  100.00
MRDCT             95.46   96.41   97.22   97.91   98.22   99.34   99.68  100.00

Mean PSNR (dB)
Method              K=1     K=2     K=3     K=4     K=5     K=6     K=7     K=8
Exact DCT         23.17   26.08   28.52   30.40   31.71   32.39   32.78   33.12
WHT               23.17   23.17   23.63   23.81   26.88   27.22   29.40   30.17
SDCT              23.17   24.28   25.23   27.15   27.59   28.43   28.82   29.84
BAS-2008          23.17   25.30   27.04   29.34   30.15   30.97   31.33   32.20
BAS-2009          23.17   25.30   26.95   28.70   29.47   30.14   30.96   31.76
BAS-2013          23.17   24.41   26.95   28.73   29.51   30.31   31.12   31.84
RDCT              23.17   25.83   27.64   28.94   29.79   30.41   31.21   31.96
MRDCT             23.17   24.29   25.26   26.37   26.77   29.58   30.29   30.98

Mean SSIM
Method              K=1    K=2    K=3    K=4    K=5    K=6    K=7    K=8
Exact DCT          0.48   0.66   0.79   0.86   0.89   0.90   0.90   0.90
WHT                0.48   0.49   0.55   0.58   0.74   0.76   0.82   0.83
SDCT               0.48   0.59   0.67   0.77   0.80   0.81   0.82   0.84
BAS-2008           0.48   0.62   0.74   0.83   0.85   0.87   0.88   0.89
BAS-2009           0.48   0.62   0.73   0.82   0.84   0.85   0.87   0.88
BAS-2013           0.48   0.64   0.74   0.82   0.85   0.87   0.87   0.88
RDCT               0.48   0.66   0.76   0.82   0.85   0.87   0.88   0.88
MRDCT              0.48   0.55   0.65   0.72   0.76   0.83   0.84   0.86

Mean SR-SIM
Method              K=1     K=2     K=3     K=4     K=5     K=6     K=7     K=8
Exact DCT         0.816   0.953   0.988   0.995   0.995   0.996   0.997   0.997
WHT               0.815   0.815   0.823   0.823   0.955   0.955   0.982   0.982
SDCT              0.816   0.943   0.971   0.986   0.986   0.994   0.994   0.994
BAS-2008          0.816   0.936   0.973   0.993   0.993   0.993   0.993   0.995
BAS-2009          0.816   0.936   0.974   0.993   0.993   0.993   0.993   0.995
BAS-2013          0.815   0.951   0.982   0.997   0.997   0.997   0.997   0.997
RDCT              0.816   0.952   0.981   0.988   0.988   0.988   0.992   0.993
MRDCT             0.816   0.898   0.932   0.958   0.958   0.982   0.986   0.988

3.3 Image Compression

The proposed methods were submitted to an image compression simulation to evaluate their performance as image/video coding tools. We based our experiments on the image compression simulation described in [25, 27, 36, 53, 55], which is briefly outlined next. We considered the same above-mentioned set of images, subimage decomposition, and 2-D pruned transformation, as detailed in the previous subsection. The resulting data were quantized by dividing each term of the transformed matrix by the corresponding element of the standard quantization matrix for luminance [53, p. 153]. Differently from [25, 27, 28], we included the quantization step in the image compression simulation. This is a more realistic and suitable approach for pruned methods, which take advantage of the quantization step. An inverse procedure was applied to reconstruct the images, considering the 2-D inverse transform operation.
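The simulation outlined above (pruned 2-D transform, quantization, dequantization, inverse transform) can be sketched for one 8 × 8 subimage as follows. Several points are assumptions made only for illustration: the quantization matrix is taken to be the example luminance table of the JPEG standard, the scaling is applied explicitly rather than merged into the quantizer, no level shift is applied, the PSNR uses a peak value of 255, and all function names are hypothetical.

```python
import math

# Low-complexity MRDCT matrix T, copied from Table 1.
MRDCT = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [1,  0,  0,  0,  0,  0,  0, -1],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [0,  0, -1,  0,  0,  1,  0,  0],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [0, -1,  0,  0,  0,  0,  1,  0],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0,  0,  0, -1,  1,  0,  0,  0],
]

# JPEG example luminance quantization table (assumed to be the one meant).
JPEG_LUMA_Q = [
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def pruned_orthonormal(t, k):
    """First k rows of t, each scaled to unit norm (orthogonal rows assumed)."""
    out = []
    for row in t[:k]:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

def compress_block(block, t, k, q=JPEG_LUMA_Q):
    """Transform, quantize, dequantize, and reconstruct one 8 x 8 block."""
    c_k = pruned_orthonormal(t, k)
    coeffs = matmul(matmul(c_k, block), transpose(c_k))        # k x k
    levels = [[round(coeffs[i][j] / q[i][j]) for j in range(k)]
              for i in range(k)]
    dequant = [[levels[i][j] * q[i][j] for j in range(k)] for i in range(k)]
    return matmul(matmul(transpose(c_k), dequant), c_k)        # 8 x 8

def psnr(original, recovered, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized blocks."""
    n = len(original) * len(original[0])
    mse = sum((a - b) ** 2
              for ro, rr in zip(original, recovered)
              for a, b in zip(ro, rr)) / n
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)
```

In this sketch the transpose of the scaled pruned matrix plays the role of the inverse transform, which is justified only because its rows are orthonormal.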
Recovered images were assessed for image degradation by means of the peak signal-to-noise ratio (PSNR) [53, p. 9], the structural similarity index (SSIM) [56], and the spectral residual based similarity (SR-SIM) [57]. The SSIM compares an original image $I$ with the recovered image $R$ according to the following expression:

$$\operatorname{SSIM}(I, R) = \frac{\left(2 \mu_I \mu_R + L \cdot 10^{-2}\right) \cdot \left(2 \sigma_{IR} + 3L \cdot 10^{-2}\right)}{\left(\mu_I^2 + \mu_R^2 + L \cdot 10^{-2}\right) \cdot \left(\sigma_I^2 + \sigma_R^2 + 3L \cdot 10^{-2}\right)}, \tag{10}$$

where $\mu_I = \sum_{i=1}^{8} \sum_{j=1}^{8} \omega_{i,j} \cdot I_{i,j}$, $\sigma_I = \left[\sum_{i=1}^{8} \sum_{j=1}^{8} \omega_{i,j} \cdot (I_{i,j} - \mu_I)^2\right]^{1/2}$, $\sigma_{IR} = \sum_{i=1}^{8} \sum_{j=1}^{8} \omega_{i,j} \cdot (I_{i,j} - \mu_I) \cdot (R_{i,j} - \mu_R)$, $L = 255$ is the dynamic range of pixel values, and $\omega_{i,j}$ is an entry of a Gaussian weighting function $\mathbf{w} = [\omega_{i,j}]$, $i, j = 1, 2, \ldots, 8$, with standard deviation of 1.5 and normalized to unit sum. The SR-SIM between the original image $I$ and the recovered image $R$ is calculated as described in [57]. Average PSNR, SSIM, and SR-SIM values over all images were computed and are shown in Table 2.

For a qualitative analysis, Figure 2 displays the reconstructed Lena image computed via the MRDCT for all values of $K$. The associated PSNR, SSIM, and SR-SIM values are also shown. Visual inspection suggests $K = 6$ as a good compromise between quality and complexity. Indeed, we notice that the PSNR improvement from $K = 5$ to $K = 6$ is 3.92 dB, while the PSNR difference from $K = 6$ to $K = 7$ is just 0.4 dB.

3.4 Combined analysis

In order to compare the discussed approximations, we consider a combined figure of merit that takes into account some of the previously discussed measures. Although popular and worth reporting, mean retained energy and PSNR are closely related measures. Similarly, the SR-SIM is a derivative of SSIM.
For a combined figure of merit, we aim at selecting unrelated measures; thus we selected the 2-D additive complexity, PSNR, and SSIM values, whose numerical values are listed in Table 2. Such combined measure is proposed as the following linear cost function:

$$\operatorname{cost}(\mathbf{T}_K) = \alpha_1 \cdot A_{\text{2-D}}(\mathbf{T}_K) + (1 - \alpha_1) \cdot \left\{ \alpha_2 \cdot \left[-\operatorname{NMSSIM}(\mathbf{T}_K)\right] + (1 - \alpha_2) \cdot \left[-\operatorname{NMPSNR}(\mathbf{T}_K)\right] \right\}, \tag{11}$$

where $\alpha_1, \alpha_2 \in [0, 1]$ are weights, and NMSSIM and NMPSNR represent the normalized mean SSIM and the normalized mean PSNR, respectively, for all considered images submitted to a particular approximation $\mathbf{T}_K$. The above cost function is a multi-objective function of the kind commonly found in the optimization literature [58].

Figure 2: Reconstructed Lena image according to the pruned MRDCT. Panels: (a) K = 1 (PSNR=23.66, SSIM=0.63, SR-SIM=0.852); (b) K = 2 (PSNR=25.29, SSIM=0.69, SR-SIM=0.926); (c) K = 3 (PSNR=26.24, SSIM=0.75, SR-SIM=0.946); (d) K = 4 (PSNR=27.48, SSIM=0.79, SR-SIM=0.967); (e) K = 5 (PSNR=27.69, SSIM=0.80, SR-SIM=0.967); (f) K = 6 (PSNR=31.62, SSIM=0.86, SR-SIM=0.986); (g) K = 7 (PSNR=32.02, SSIM=0.87, SR-SIM=0.988); (h) K = 8 (PSNR=32.38, SSIM=0.87, SR-SIM=0.989).

Figure 3: Optimality regions for the cost function: (a) optimal transform and (b) optimal pruning value. Both panels are plotted over $\alpha_1$ and $\alpha_2$ in [0, 1]; the regions in (a) are labeled Several, MRDCT, BAS-2008, and Exact DCT, and the regions in (b) are labeled with values of K from 1 to 8.

Two types of metrics (arithmetic complexity and performance measurements) are subject to a convex combination according to $\alpha_1$. The performance measurements are themselves a convex combination of SSIM and PSNR measurements, balanced according to $\alpha_2$. Thus the weights $\alpha_1$ and $\alpha_2$ control the relative importance of the constituent metrics of the cost function.
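To make the role of the weights concrete, the sketch below evaluates the cost function (11) for the pruned MRDCT using the values listed in Table 2, with the 2-D additive complexity obtained from (7). The normalization is an assumption made here for illustration: each measure, including the complexity, is divided by its maximum over $K$ so that all terms lie in a comparable range. The function names are hypothetical, and the sketch is not claimed to reproduce Figure 3.

```python
# Values for the pruned MRDCT, K = 1..8, copied from Table 2.
A1D = [7, 8, 9, 10, 11, 12, 13, 14]
PSNR = [23.17, 24.29, 25.26, 26.37, 26.77, 29.58, 30.29, 30.98]
SSIM = [0.48, 0.55, 0.65, 0.72, 0.76, 0.83, 0.84, 0.86]

def additive_2d(k, a1d):
    """Eq. (7): A_2-D = (8 + K) * A_1-D."""
    return (8 + k) * a1d

def normalize(values):
    """Assumed normalization: divide every value by the largest one."""
    top = max(values)
    return [v / top for v in values]

def costs(alpha1, alpha2, a1d=A1D, psnr=PSNR, ssim=SSIM):
    """Evaluate the cost of Eq. (11) for K = 1..len(a1d)."""
    a2d = normalize([additive_2d(k, a) for k, a in enumerate(a1d, start=1)])
    nmssim = normalize(ssim)
    nmpsnr = normalize(psnr)
    return [alpha1 * a + (1 - alpha1) * (alpha2 * -s + (1 - alpha2) * -p)
            for a, s, p in zip(a2d, nmssim, nmpsnr)]

def best_k(alpha1, alpha2):
    """Pruning value K that minimizes the cost for the given weights."""
    c = costs(alpha1, alpha2)
    return c.index(min(c)) + 1
```

With $\alpha_1 = 1$ only the complexity matters and the minimum is attained at the smallest $K$; with $\alpha_1 = 0$ only the quality measures matter and the minimum moves to the largest $K$.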
For large values of $\alpha_1$, the cost function emphasizes the minimization of the computational complexity; whereas, for small values of $\alpha_1$, the cost function is more prone to capture measures of image quality performance. The quantity $\alpha_2$ balances the composition of the performance measurement between NMSSIM and NMPSNR. Because we consider $\alpha_1, \alpha_2 \in [0, 1]$, all possible combinations of weights are taken into account. Only the particular context, application, and user requirements can determine the final choice of the weight values.

Figures 3(a) and 3(b) show, respectively, the regions for the optimal transformation and the optimal pruning value $K$, considering any choice of weight values $\alpha_1$ and $\alpha_2$. For large $\alpha_1$ (emphasis on complexity minimization), the optimal choice tends to small $K$ regardless of the transform. Indeed, for small $K$, most pruned transformations collapse to the same matrix. For small $\alpha_1$ (emphasis on performance maximization), optimality favors more complex transformations with large values of $K$, with the full exact DCT as the limiting case. For mid-range values of $\alpha_1$, we have less trivial scenarios. In Figure 3(a), considering the optimal transform, we notice that for mid-range values of $\alpha_1$ the MRDCT and the BAS-2008 occupy most of the central area of the discussed region. Around the same region in Figure 3(b), for the MRDCT, we obtain mostly $K = 6$; whereas for the BAS-2008 we have $K = 6, 8$. We emphasize that the proposed pruned MRDCT with $K = 6$ requires only 12 additions. The fast algorithm for this particular case is presented in Figure 1(c).

3.5 HEVC Simulation

Taking into account the previous combined analysis, we embedded the proposed pruned MRDCT ($K = 6$), the BAS-2008 approximation ($K = 8$), and the pruned BAS-2008 ($K = 6$) in the widely employed HEVC reference software HM 10.0 [59].
This embedding consisted of substituting the original 8-point integer-based DCT transform present in the codec with each of the above-mentioned approximations. We considered nine CIF video sequences with 300 frames at 25 frames per second from a public video bank [60]. Such sequences were submitted to encoding according to: (i) the original software, and (ii) the modified software. We assessed mean PSNR metrics for luminance by varying the quantization parameter (QP) from 10 to 50 in steps of 5 units. Results are shown in Figure 4 considering both QP and bitrate. The obtained curves are almost indistinguishable. The mean PSNR values at QP = 30 correspond to 37.06 dB, 36.92 dB, 36.96 dB, and 36.93 dB for the original integer DCT, the pruned MRDCT ($K = 6$), BAS-2008, and the pruned BAS-2008 ($K = 6$), respectively. The degradation of the pruned approximation methods relative to the unmodified software was smaller than 0.15 dB for such QP value.

Figure 4: Video coding performance assessment. Panels: PSNR vs. QP and PSNR vs. bitrate (kbps); curves: Unmodified HEVC, Pruned MRDCT, BAS-2008, Pruned BAS-2008.

Figure 5 shows the relative percent PSNR of each approximate method compared to the original HEVC according to QP and bitrate values. The curves show very close performance to the original codec. In Figure 5(a), for low QP values, the approximations show even higher PSNR, i.e., more than 100% relative PSNR, suggesting better compaction capability at low compression rates. However, the same QP values do not necessarily generate the same compression ratio for each method, since distinct coefficients are derived from each transformation and submitted to the same quantization table.
Figure 5(b) indicates that the approximations possess slightly lower coding performance than the original HEVC when compared at the same bitrate. At the same time, the approximate methods present considerably lower computational cost and the loss of performance is smaller than 1%. Figure 6 shows a qualitative comparison considering the first frame of the standard "Foreman" video sequence at QP = 30. The degradation is hardly perceived.

Figure 5: Video coding performance assessment relative to the original HEVC. Panels: (a) Relative PSNR (%) vs. QP; (b) Relative PSNR (%) vs. bitrate (kbps); curves: Pruned MRDCT, BAS-2008, Pruned BAS-2008.

4 VLSI Architectures

We aim at the physical realization of pruned designs based on the MRDCT, BAS-2008, and BAS-2013. The MRDCT and BAS-2008 were selected in accordance with the discussion in the previous section. The BAS-2013 was also included because it is the basis for the only pruned approximate DCT competitor in literature [47]. Such designs were realized as a separable 2-D block transform using two 1-D transform blocks with a transpose buffer between them. Such blocks were designed and simulated, using bit-true cycle-accurate modeling, in Matlab/Simulink. Thereafter, the proposed architecture was ported to a Xilinx Virtex-6 field programmable gate array (FPGA) as well as to a custom CMOS standard-cell integrated circuit (IC) design.

The transform was applied in a row-parallel fashion to the blocks of data and all blocks were 8 × 8, irrespective of pruning. When $K$ decreases, the number of null elements in the blocks increases. The row-transformed data were subject to transposition and then the same pruned algorithm was applied, albeit in the column direction.
Figure 7 shows the architectures for the MRDCT. The remaining designs have similar realizations.

Figure 6: Reconstructed first frame of the "Foreman" video sequence encoded according to the considered methods. Panels: (a) Unmodified HEVC (PSNR=37.1154); (b) MRDCT K = 6 (PSNR=37.0613); (c) BAS-2008 K = 8 (PSNR=37.0757); (d) BAS-2008 K = 6 (PSNR=37.0669).

Figure 7: Digital architectures of the MRDCT matrix and pruned MRDCT matrices for K = 6, 4, 2. Panels: (a) Original MRDCT (14 additions); (b) K = 6 (12 additions); (c) K = 4 (10 additions); (d) K = 2 (8 additions).

Table 3: Resource consumption on Xilinx XC6VLX240T-1FFG1156 device (three rows per value of K, one per design)

K    CLB     FF    Tcpd (ns)   Fmax (MHz)   Dp (mW/MHz)
1    107    376    2.263       441.89       0.67
     107    376    2.263       441.89       0.67
     107    376    2.263       441.89       0.67
2    136    568    2.300       434.78       0.97
     203    672    2.600       384.61       1.33
     204    751    2.450       408.10       1.45
3    210    783    2.509       398.56       0.87
     252    956    2.878       347.46       1.74
     263    978    2.534       394.63       2.11
4    247    961    2.946       339.44       1.35
     343   1170    3.100       322.58       2.06
     339   1216    2.900       344.82       2.50
5    290   1123    2.877       347.58       1.70
     362   1331    3.067       326.05       2.76
     377   1374    2.902       344.58       3.13
6    350   1286    2.735       365.63       2.07
     438   1531    3.214       311.13       3.07
     382   1557    2.784       359.19       3.80
7    424   1487    3.300       303.03       2.21
     501   1709    3.286       304.32       3.58
     445   1720    3.432       291.37       3.87
8    445   1696    3.390       294.98       2.74
     559   1962    3.300       303.03       3.85
     517   1910    3.200       312.5        5.07

4.1 FPGA Rapid Prototypes

The pruned architectures were physically realized on a Xilinx Virtex-6 XC6VLX240T-1FFG1156 FPGA device with fine-grain pipelining for increased throughput. The FPGA realizations were verified using hardware-in-the-loop testing, which was achieved through a JTAG interface.
The proposed approximations were verified using more than 10000 test vectors, with complete agreement with theoretical values. Evaluation of hardware complexity and real-time performance considered the following metrics: the number of employed configurable logic blocks (CLB), flip-flop (FF) count, critical path delay (T_cpd), and the maximum operating frequency (F_max) in MHz. The reported results were taken from the xflow.results report file of the Xilinx FPGA tool flow. Frequency-normalized dynamic power (D_p, in mW/MHz) was estimated using the Xilinx XPower Analyzer software tool. The above measurements are shown in Table 3 for the proposed pruned MRDCT (highlighted in green), the pruned version of the BAS-2008 introduced in [27] (highlighted in blue), and the pruned BAS-2013 introduced in [30].

4.2 ASIC Synthesis

For the ASIC synthesis, the hardware description language code from the Xilinx System Generator FPGA design flow was ported to 45 nm CMOS technology and subjected to synthesis using Cadence Encounter.
Standard ASIC cells from the FreePDK, which is a free open-source cell library at the 45 nm node, were used for this purpose. The supply voltage of the CMOS realization was fixed at V_DD = 1.1 V during estimation of power consumption and logic delay. The adopted figures of merit for the ASIC synthesis were: area (A) in mm², area-time complexity (AT) in mm²·ns, area-time-squared complexity (AT²) in mm²·ns², frequency-normalized dynamic power (D_p, in mW/MHz), critical path delay (T_cpd) in ns, and maximum operating frequency (F_max) in GHz. ASIC synthesis results for the proposed pruned MRDCT (highlighted in green), the pruned version of the BAS-2008 (highlighted in blue), and the pruned BAS-2013 algorithm are displayed in Table 4.

Table 4: Resource consumption for CMOS 45 nm ASIC synthesis. For each K, the three rows are the pruned MRDCT, the pruned BAS-2008, and the pruned BAS-2013, in that order.

K   Design      Area (mm²)   AT (mm²·ns)   AT² (mm²·ns²)   T_cpd (ns)   F_max (GHz)   D_p (mW/MHz)
1   MRDCT       0.011        0.011         0.010           0.961        1.040         0.018
    BAS-2008    0.011        0.011         0.010           0.961        1.040         0.018
    BAS-2013    0.011        0.011         0.010           0.961        1.040         0.018
2   MRDCT       0.017        0.016         0.015           0.962        1.039         0.028
    BAS-2008    0.021        0.020         0.020           0.980        1.020         0.035
    BAS-2013    0.022        0.022         0.022           0.995        1.005         0.036
3   MRDCT       0.022        0.021         0.020           0.963        1.038         0.038
    BAS-2008    0.031        0.030         0.030           0.990        1.010         0.051
    BAS-2013    0.029        0.028         0.027           0.981        1.019         0.047
4   MRDCT       0.027        0.027         0.026           0.970        1.030         0.047
    BAS-2008    0.037        0.037         0.038           1.016        0.984         0.063
    BAS-2013    0.037        0.036         0.036           0.997        1.003         0.059
5   MRDCT       0.032        0.034         0.037           1.075        0.930         0.057
    BAS-2008    0.042        0.042         0.043           1.011        0.989         0.069
    BAS-2013    0.041        0.041         0.041           1.007        0.993         0.068
6   MRDCT       0.038        0.038         0.037           0.995        1.005         0.067
    BAS-2008    0.048        0.048         0.048           1.000        1.000         0.081
    BAS-2013    0.046        0.046         0.046           1.008        0.992         0.077
7   MRDCT       0.043        0.047         0.051           1.085        0.921         0.079
    BAS-2008    0.053        0.053         0.054           1.014        0.986         0.091
    BAS-2013    0.051        0.054         0.057           1.050        0.952         0.087
8   MRDCT       0.046        0.051         0.057           1.103        0.906         0.084
    BAS-2008    0.060        0.062         0.065           1.047        0.955         0.104
    BAS-2013    0.057        0.057         0.058           1.008        0.992         0.097
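The derived figures of merit in Table 4 follow directly from the area and the critical path delay. The short sketch below illustrates the arithmetic; the function name asic_figures_of_merit is hypothetical, and because the tabulated inputs are rounded to three decimals, the recomputed values can differ from the table in the last digit.

```python
def asic_figures_of_merit(area_mm2, t_cpd_ns):
    """Derive AT, AT^2 and F_max from area (mm^2) and critical path delay (ns)."""
    return {
        "AT": area_mm2 * t_cpd_ns,         # area-time complexity, mm^2 * ns
        "AT2": area_mm2 * t_cpd_ns ** 2,   # area-time-squared complexity, mm^2 * ns^2
        "F_max_GHz": 1.0 / t_cpd_ns,       # reciprocal of a delay in ns is a rate in GHz
    }


if __name__ == "__main__":
    # First K = 6 row of Table 4: area 0.038 mm^2, T_cpd 0.995 ns.
    merits = asic_figures_of_merit(0.038, 0.995)
    for name, value in merits.items():
        print(f"{name}: {value:.4f}")
```

The same reciprocal relation links T_cpd and F_max in Table 3, with the delay in ns and the frequency expressed in MHz instead of GHz.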
4.3 Discussion

The FPGA realization of the proposed pruned MRDCT showed a drastic reduction in both area (measured by the number of CLBs) and frequency-normalized dynamic power consumption, compared to the full MRDCT. Table 5 shows the percentage reduction of area and frequency-normalized dynamic power for both the FPGA implementation and the CMOS synthesis for different pruning values. All metrics indicate lower hardware resource consumption when the number of outputs is reduced from 8 to 1. In particular, for K = 6, which minimizes the discussed cost function (cf. (11)), we notice a power consumption reduction of approximately 20–25%.

Table 5: Percentage reduction in area and dynamic power for FPGA and ASIC

     FPGA                 ASIC
K    Area %    D_p %      Area %    D_p %
1    71.65     83.11      75.32     76.66
2    54.59     72.29      64.93     66.00
3    44.88     62.33      53.24     54.66
4    30.18     51.94      41.55     43.33
5    19.16     34.63      32.46     34.00
6    3.14      20.77      23.37     24.66
7    1.57      12.12      10.38     12.66

In order to compare the hardware resource consumption of the introduced pruned DCT approximation with competing transforms, we physically realized the pruned BAS-2013 algorithm [30] and the pruned BAS-2008 algorithm [27] on the same Xilinx Virtex-6 XC6VLX240T-1FFG1156 device and submitted them to synthesis using 45 nm CMOS ASIC technology. By comparing the results in Tables 3 and 4, it can be seen that the proposed transform outperforms both the pruned BAS-2008 and the pruned BAS-2013 in terms of hardware resource consumption and power consumption, while being on par in terms of speed.

5 Conclusion

In this paper, we present a set of 8-point pruned DCT approximations derived from state-of-the-art methods. All possible frequency-domain pruning schemes were considered and analyzed in terms of arithmetic complexity, energy compaction in the transform domain, and image compression performance.
A new combined metric was defined considering the 2-D arithmetic complexity and average values of PSNR and SSIM. The pruned transform based on the MRDCT presented the lowest arithmetic complexity and showed competitive performance. Thus, the pruned MRDCT approximations were digitally implemented using both Xilinx FPGA tools and CMOS 45 nm ASIC technology. The proposed pruned transforms demonstrated practical relevance in image/video compression. The proposed algorithms are fully compatible with modern codecs. We have embedded the proposed methods into a standard HEVC reference software [59]. Results presented very low qualitative and quantitative degradation at a considerably lower computational cost. Additionally, low-complexity designs are required in several contexts where very high quality imagery is not a strong requirement, such as environmental monitoring, habitat monitoring, surveillance, structural monitoring, equipment diagnostics, disaster management, and emergency response [61]. All of the above contexts can benefit from the proposed tools when embedded into wireless sensors with low-complexity codecs and low-power hardware [62].

We summarize the contributions of the present work:
• The pruning approach for DCT approximations was generalized by not only considering all possible pruning variations but also investigating a wide range of DCT approximations;
• An analysis covering all cases under different figures of merit, including arithmetic complexity and image quality measures, was presented;
• A combined figure of merit to guide the decision-making process in terms of hardware realization was introduced;
• The 2-D case was also analyzed, and it was concluded that the pruning approach is even better suited for 2-D transforms;
• The considered pruned DCT approximation was implemented using Xilinx FPGA tools and synthesized using CMOS 45 nm ASIC technology. Such implementations demonstrated the low resource consumption of the proposed pruned transform.

Acknowledgements

This work was partially supported by CNPq, FACEPE, and FAPERGS (Brazil), and by the College of Engineering at the University of Akron, Akron, OH, USA.

References

[1] N. Ahmed and K. R. Rao, Orthogonal Transforms for Digital Signal Processing. Springer, 1975.
[2] R. E. Blahut, Fast Algorithms for Signal Processing. Cambridge University Press, 2010.
[3] G. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[4] D. J. L. Gall, “The MPEG video compression algorithm,” Signal Processing: Image Communication, vol. 4, pp. 129–140, 1992.
[5] N. Roma and L. Sousa, “Efficient hybrid DCT-domain algorithm for video spatial downscaling,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 2, pp. 30–30, 2007.
[6] International Organisation for Standardisation, “Generic coding of moving pictures and associated audio information – Part 2: Video,” ISO, ISO/IEC JTC1/SC29/WG11 - Coding of Moving Pictures and Audio, 1994.
[7] International Telecommunication Union, “ITU-T recommendation H.261 version 1: Video codec for audiovisual services at p × 64 kbits,” ITU-T, Tech. Rep., 1990.
[8] M. L. Liou, “Visual telephony as an ISDN application,” IEEE Communications Magazine, vol. 28, pp. 30–38, 1990.
[9] International Telecommunication Union, “ITU-T recommendation H.263 version 1: Video coding for low bit rate communication,” ITU-T, Tech. Rep., 1995.
[10] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, Jul. 2003.
[11] J. V. Team, “Recommendation H.264 and ISO/IEC 14 496–10 AVC: Draft ITU-T recommendation and final draft international standard of joint video specification,” ITU-T, Tech. Rep., 2003.
[12] International Telecommunication Union, “High efficiency video coding: Recommendation ITU-T H.265,” ITU-T Series H: Audiovisual and Multimedia Systems, Tech. Rep., 2013.
[13] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, “HEVC: The new gold standard for video compression: How does HEVC compare with H.264/AVC?” IEEE Consumer Electronics Magazine, vol. 1, no. 3, pp. 36–46, Jul. 2012.
[14] J.-S. Park, W.-J. Nam, S.-M. Han, and S. Lee, “2-D large inverse transform (16 × 16, 32 × 32) for HEVC (High Efficiency Video Coding),” Journal of Semiconductor Technology and Science, vol. 2, pp. 203–211, 2012.
[15] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, “Comparison of the coding efficiency of video coding standards - including High Efficiency Video Coding (HEVC),” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1669–1684, Dec. 2012.
[16] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, “Improved 8-point approximate DCT for image and video compression requiring only 14 additions,” IEEE Transactions on Circuits and Systems I, vol. 61, no. 6, pp. 1727–1740, 2014.
[17] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[18] W. H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[19] H. S. Hou, “A fast recursive algorithm for computing the discrete cosine transform,” IEEE Transactions on Acoustic, Signal, and Speech Processing, vol. 6, no. 10, pp. 1455–1461, 1987.
[20] Y. Arai, T. Agui, and M. Nakajima, “A fast DCT-SQ scheme for images,” Transactions of the IEICE, vol. E-71, no. 11, pp. 1095–1097, Nov. 1988.
[21] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, “Practical fast 1-D DCT algorithms with 11 multiplications,” ICASSP International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 988–991, 1989.
[22] E. Feig and S. Winograd, “Fast algorithms for the discrete cosine transform,” IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2174–2193, 1992.
[23] V. Britanak, P. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms. Academic Press, 2007.
[24] S. Winograd, Arithmetic Complexity of Computations. CBMS-NSF Regional Conference Series in Applied Mathematics, 1980.
[25] T. I. Haweel, “A new square wave transform based on the DCT,” Signal Processing, vol. 82, pp. 2309–2319, 2001.
[26] K. Lengwehasatit and A. Ortega, “Scalable variable complexity approximate forward DCT,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 11, pp. 1236–1248, Nov. 2004.
[27] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “Low-complexity 8 × 8 transform for image compression,” Electronics Letters, vol. 44, no. 21, pp. 1249–1250, Sep. 2008.
[28] R. J. Cintra and F. M. Bayer, “A DCT approximation for image compression,” IEEE Signal Processing Letters, vol. 18, no. 10, pp. 579–582, Oct. 2011.
[29] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “A fast 8 × 8 transform for image compression,” in 2009 International Conference on Microelectronics (ICM), Dec. 2009, pp. 74–77.
[30] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “Binary discrete cosine and Hartley transforms,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 4, pp. 989–1002, 2013.
[31] R. J. Cintra, F. M. Bayer, and C. J. Tablada, “Low-complexity 8-point DCT approximations based on integer functions,” Signal Processing, vol. 99, pp. 201–214, 2014.
[32] L. Makkaoui, V. Lecuire, and J. Moureaux, “Fast zonal DCT-based image compression for wireless camera sensor networks,” 2nd International Conference on Image Processing Theory Tools and Applications (IPTA), pp. 126–129, 2010.
[33] A. Docef, “The quantized DCT and its application to DCT-based video coding,” IEEE Transactions on Image Processing, vol. 11, pp. 177–187, 2002.
[34] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego, CA: Academic Press, 1990.
[35] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan. 1974.
[36] K. R. Rao and P. Yip, The Transform and Data Compression Handbook. CRC Press LLC, 2001.
[37] H. Malepati, Digital Media Processing: DSP Algorithms Using C. Newnes, 2010.
[38] Y.-M. Huang, J.-L. Wu, and C.-L. Chang, “A generalized output pruning algorithm for matrix-vector multiplication and its application to compute pruning discrete cosine transform,” IEEE Transactions on Signal Processing, vol. 48, pp. 561–563, 2000.
[39] L. Wang, X. Zhou, G. Sobelman, and R. Liu, “Generic mixed-radix FFT pruning,” IEEE Signal Processing Letters, vol. 19, no. 3, pp. 167–170, Mar. 2012.
[40] R. Airoldi, O. Anjum, F. Garzia, A. M. Wyglinski, and J. Nurmi, “Energy-efficient fast Fourier transforms for cognitive radio systems,” IEEE Micro, vol. 30, no. 6, pp. 66–76, Nov. 2010.
[41] P. Whatmough, M. Perrett, S. Isam, and I. Darwazeh, “VLSI architecture for a reconfigurable spectrally efficient FDM baseband transmitter,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 5, pp. 1107–1118, May 2012.
[42] J.-H. Kim, J.-G. Kim, Y.-H. Ji, Y.-C. Jung, and C.-Y. Won, “An islanding detection method for a grid-connected system based on the Goertzel algorithm,” IEEE Transactions on Power Electronics, vol. 26, no. 4, pp. 1049–1055, Apr. 2011.
[43] I. Carugati, S. Maestri, P. Donato, D. Carrica, and M. Benedetti, “Variable sampling period filter PLL for distorted three-phase systems,” IEEE Transactions on Power Electronics, vol. 27, no. 1, pp. 321–330, Jan. 2012.
[44] Z. Wang, “Pruning the fast discrete cosine transform,” IEEE Transactions on Communications, vol. 39, no. 5, pp. 640–643, May 1991.
[45] A. Skodras, “Fast discrete cosine transform pruning,” IEEE Transactions on Signal Processing, vol. 42, no. 7, pp. 1833–1837, Jul. 1994.
[46] V. Lecuire, L. Makkaoui, and J.-M. Moureaux, “Fast zonal DCT for energy conservation in wireless image sensor networks,” Electronics Letters, vol. 48, no. 2, pp. 125–127, 2012.
[47] N. Kouadria, N. Doghmane, D. Messadeg, and S. Harize, “Low complexity DCT for image compression in wireless visual sensor networks,” Electronics Letters, vol. 49, no. 24, pp. 1531–1532, 2013.
[48] P. Meher, S. Y. Park, B. Mohanty, K. S. Lim, and C. Yeo, “Efficient integer DCT architectures for HEVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 168–178, Jan. 2014.
[49] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing, 3rd ed. Pearson, 2010.
[50] F. M. Bayer and R. J. Cintra, “DCT-like transform for image compression requires 14 additions only,” Electronics Letters, vol. 48, no. 15, pp. 919–921, 2012.
[51] D. F. Elliot and K. R. Rao, Fast Transforms: Algorithms, Analyses, Applications. Academic Press, 1982.
[52] K. M. Abadir and J. R. Magnus, Matrix Algebra. Cambridge University Press, 2005.
[53] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards. Boston: Kluwer Academic Publishers, 1997.
[54] USC-SIPI Image Database. Signal and Image Processing Institute. University of Southern California. [Online]. Available: http://sipi.usc.edu/database/
[55] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York, NY: Van Nostrand Reinhold, 1992.
[56] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
[57] L. Zhang and H. Li, “SR-SIM: A fast and high performance IQA index based on spectral residual,” 19th IEEE International Conference on Image Processing (ICIP), pp. 1473–1476, 2012.
[58] M. Ehrgott, Multicriteria Optimization, ser. Lecture Notes in Economics and Mathematical Systems. Springer-Verlag GmbH, 2005.
[59] JCT-VC. (2015) HM 10.0. Joint Collaborative Team on Video Coding (JCT-VC). Fraunhofer Heinrich Hertz Institute. [Online]. Available: https://hevc.hhi.fraunhofer.de/
[60] xiph.org. (2015) Xiph.org video test media. [Online]. Available: https://media.xiph.org/video/derf/
[61] N. Kimura and S. Latifi, “A survey on data compression in wireless sensor networks,” International Conference on Information Technology: Coding and Computing, ITCC, vol. 2, pp. 8–13, 2005.
[62] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, “A survey on wireless multimedia sensor networks,” Computer Networks, vol. 51, pp. 921–960, 2007.
