Low-complexity 8-point DCT Approximation Based on Angle Similarity for Image and Video Coding


Authors: Raíza S. Oliveira, Renato J. Cintra, Fábio M. Bayer, Thiago L. T. da Silveira, Arjuna Madanayake, André Leite

Abstract

The principal component analysis (PCA) is widely used for data decorrelation and dimensionality reduction. However, the use of PCA may be impractical in real-time applications, or in situations where energy and computing constraints are severe. In this context, the discrete cosine transform (DCT) becomes a low-cost alternative to data decorrelation. This paper presents a method to derive computationally efficient approximations to the DCT. The proposed method aims at the minimization of the angle between the rows of the exact DCT matrix and the rows of the approximated transformation matrix. The resulting transformation matrices are orthogonal and have extremely low arithmetic complexity. Considering popular performance measures, one of the proposed transformation matrices outperforms the best competitors in both matrix error and coding capabilities. Practical applications in image and video coding demonstrate the relevance of the proposed transformation. In fact, we show that the proposed approximate DCT can outperform the exact DCT for image encoding under certain compression ratios. The proposed transform and its direct competitors are also physically realized as digital prototype circuits using FPGA technology.

Keywords: DCT approximation, fast algorithms, image/video encoding

1 Introduction

Data decorrelation is a central task in many statistical and signal processing problems [1-3]. Decorrelation can be accomplished by means of a linear transformation that converts correlated observations into linearly uncorrelated values. This operation is commonly performed by principal component analysis (PCA) [2].
PCA is widely used to reduce the dimensionality of data [2, 4], where the information contained in all the original variables is replaced by the data variability information of the first few uncorrelated principal components. The quality of such an approximation depends on the number of components used and the proportion of variance explained, or energy retained, by each of them.

In the field of analysis and processing of images and signals, PCA, also known as the discrete Karhunen-Loève transform (KLT) [3], is considered the optimal linear transformation for data decorrelation when the signal is a first-order Markov process [3, 5]. The KLT has the following features [3]: (i) it completely decorrelates the input data in the transform domain; (ii) it minimizes the mean square error in data compression; and (iii) it concentrates the energy (variance) in a few coefficients of the output vector. Because the KLT matrix depends on the variance-covariance matrix of the input data, deriving computationally efficient algorithms for real-time processing becomes a very hard task [3, 6-13].

If the input data follows a stationary, highly correlated first-order Markov process [3, 12, 14], then the KLT is very closely approximated by the discrete cosine transform (DCT) [3, 12]. Natural images fall into this particular statistical model category [15]. Thus the DCT inherits the decorrelation and compaction properties of the KLT, with the advantage of having a closed-form expression independent of the input signal. The image and video communities widely adopt the DCT in their most successful compression standards, such as JPEG [16] and MPEG [17]. Often such standards include two-dimensional (2-D) versions of the DCT applied to small image blocks ranging from 4x4 to 32x32 pixels. The 8x8 block is employed in a large number of standards, for example: JPEG [16], MPEG [18], H.261 [19], H.263 [20], H.264/AVC [21], and HEVC [22]. The arithmetic cost of the 8-point DCT is 64 multiplications and 56 additions when computed by definition. Fast algorithms can dramatically reduce this cost to 11 multiplications and 29 additions, as in the Loeffler DCT algorithm [23].

However, the number of DCT calls in common image and video encoders is extraordinarily high. For instance, a single image frame of high-definition TV (HDTV) contains 32,400 8x8 image subblocks. Therefore, computational savings in the DCT step may effect significant performance gains, both in terms of speed and power consumption [24, 25]. Being quite a mature area of research [26], there is little room for improvement on the exact DCT computation. Thus, one approach to further minimize the computational cost of computing the DCT is the use of matrix approximations [14, 27].

Affiliations:
R. S. Oliveira is with the Programa de Pós-Graduação em Engenharia Elétrica, Universidade Federal de Pernambuco (UFPE), Recife, Brazil; and with the Signal Processing Group, Departamento de Estatística, UFPE.
R. J. Cintra is with the Signal Processing Group, Departamento de Estatística, UFPE, Recife, Brazil; and with ECE, University of Calgary, Calgary, AB, Canada. E-mail: rjdsc@de.ufpe.br
F. M. Bayer is with the Departamento de Estatística, Universidade Federal de Santa Maria, Santa Maria, and LACESM, Brazil.
T. L. T. da Silveira is with the Programa de Pós-Graduação em Computação, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.
A. Madanayake is with the Department of Electrical and Computer Engineering, University of Akron, Akron, OH.
A. Leite is with the Departamento de Estatística, Universidade Federal de Pernambuco, Recife, Brazil. E-mail: leite@de.ufpe.br
Such approximations provide matrices with similar mathematical behavior to the exact DCT while presenting a dramatically lower arithmetic cost.

The goals of this paper are as follows. First, we aim at establishing an optimization problem to facilitate the derivation of 8-point DCT approximations. To this end, we adopt a vector-angle-based objective function to minimize the angle between the rows of the approximate and the exact DCT matrices, subject to orthogonality constraints. Second, the sought approximations are (i) submitted to a comprehensive assessment based on well-known figures of merit and (ii) compared to state-of-the-art DCT approximations. Third, fast algorithms are derived and realized in FPGA hardware, with comparisons against competing methods. We also examine the performance of the obtained transformations in the context of image compression and video coding. We demonstrate that one of our DCT approximations can outperform the exact DCT in terms of effected quality after image compression.

This paper is organized as follows. In Section 2, the 8-point DCT and popular DCT approximations are discussed. Section 3 introduces an optimization problem to pave the way for the derivation of new DCT approximations. In Section 4 the proposed approximations are detailed and assessed according to well-known performance measures. In Section 5 a fast algorithm for the proposed approximation is presented. Moreover, a field-programmable gate array (FPGA) design is proposed and compared with competing methods. Section 6 furnishes computational evidence of the appropriateness of the introduced approximate DCT for image and video encoding.

Table 1: Computational cost of fast algorithms for the DCT

  Algorithm                     Multiplications   Additions
  Loeffler et al. [23, 35, 36]        11              29
  Yuan et al. [28, 37]                12              29
  Lee [32, 38, 39]                    12              29
  Hou [33, 40, 41]                    12              29
  Arai et al. [29, 35, 42]            13              29
  Chen et al. [30, 35, 43]            16              26
  Feig-Winograd [31, 44]              22              28
Section 7 concludes the paper.

2 DCT Approximations

Let x and X be 8-point column vectors related by the DCT. Therefore, they satisfy the following expression: X = C · x, where

  C = (1/2) ·
      [ γ3  γ3  γ3  γ3  γ3  γ3  γ3  γ3
        γ0  γ2  γ4  γ6 -γ6 -γ4 -γ2 -γ0
        γ1  γ5 -γ5 -γ1 -γ1 -γ5  γ5  γ1
        γ2 -γ6 -γ0 -γ4  γ4  γ0  γ6 -γ2
        γ3 -γ3 -γ3  γ3  γ3 -γ3 -γ3  γ3
        γ4 -γ0  γ6  γ2 -γ2 -γ6  γ0 -γ4
        γ5 -γ1  γ1 -γ5 -γ5  γ1 -γ1  γ5
        γ6 -γ4  γ2 -γ0  γ0 -γ2  γ4 -γ6 ]

and γk = cos(2π(k+1)/32), for k = 0, 1, ..., 6.

Common algorithms for efficient DCT computation include: (i) Yuan et al. [28], (ii) Arai et al. [29], (iii) Chen et al. [30], (iv) Feig-Winograd [31], (v) Lee [32], and (vi) Hou [33]. Table 1 lists the computational costs associated with such methods. The theoretical minimal multiplicative complexity is 11 multiplications [23, 34], which is attained by the Loeffler algorithm [23].

A DCT approximation is a matrix Ĉ capable of furnishing X̂ = Ĉ · x, where X̂ ≈ X according to some prescribed criterion, such as matrix proximity or coding performance [3]. In general terms, as shown in [3, 45-48], Ĉ is a real-valued matrix which consists of two components: (i) a low-complexity matrix T and (ii) a diagonal matrix S. Such matrices are given by:

  Ĉ = S · T,                  (1)

where

  S = sqrt((T · T⊤)⁻¹).       (2)

The operation sqrt(·) is the matrix square root [49, 50]. Hereafter low-complexity matrices are referred to as T∗, where the subscript ∗ indicates the considered method; similarly, DCT approximations are referred to as Ĉ∗. If the subscript is absent, then we refer to a generic low-complexity matrix or DCT approximation.

A traditional DCT approximation is the signed DCT (SDCT) [51], whose matrix is obtained according to (1/√8) · sgn(C), where sgn(·) is the entry-wise signum function.
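As a quick numerical check, the matrix C above and the SDCT construction just described can be sketched in a few lines. This is a NumPy-based sketch with our own variable names, not code from the paper.

```python
import numpy as np

# Exact 8-point DCT matrix assembled from the gamma_k entries defined
# above (gamma_k = cos(2*pi*(k+1)/32)), including the 1/2 scaling that
# makes the rows orthonormal.
g = np.cos(2 * np.pi * (np.arange(7) + 1) / 32)

C = 0.5 * np.array([
    [ g[3],  g[3],  g[3],  g[3],  g[3],  g[3],  g[3],  g[3]],
    [ g[0],  g[2],  g[4],  g[6], -g[6], -g[4], -g[2], -g[0]],
    [ g[1],  g[5], -g[5], -g[1], -g[1], -g[5],  g[5],  g[1]],
    [ g[2], -g[6], -g[0], -g[4],  g[4],  g[0],  g[6], -g[2]],
    [ g[3], -g[3], -g[3],  g[3],  g[3], -g[3], -g[3],  g[3]],
    [ g[4], -g[0],  g[6],  g[2], -g[2], -g[6],  g[0], -g[4]],
    [ g[5], -g[1],  g[1], -g[5], -g[5],  g[1], -g[1],  g[5]],
    [ g[6], -g[4],  g[2], -g[0],  g[0], -g[2],  g[4], -g[6]],
])

# C is orthonormal: C @ C.T is the identity, hence C^-1 = C.T.
assert np.allclose(C @ C.T, np.eye(8))

# Signed DCT: every entry of C is nonzero, so sgn(C) has entries +/-1.
T_sdct = np.sign(C)
C_sdct = T_sdct / np.sqrt(8)
```

The same matrix can be obtained from the usual DCT-II closed form cos(π(2n+1)k/16), scaled by 1/2 and with the first row further scaled by 1/√2, which is a convenient cross-check.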
Therefore, in this case, the entries of the associated low-complexity matrix T_SDCT = sgn(C) are in {0, ±1}. Thus the matrix T_SDCT is multiplierless.

Notably, in the past few years, several approximations for the DCT have been proposed; for example: the rounded DCT (RDCT, T_RDCT) [27], the modified RDCT (MRDCT, T_MRDCT) [48], the series of approximations proposed by Bouguezel-Ahmad-Swamy (BAS) [52-56], the Lengwehasatit-Ortega approximation (LO, T_LO) [57], the approximation proposed by Pati et al. [58], and the collection of approximations introduced in [14]. Most of these approximations are orthogonal, with low-complexity matrix entries.

Table 2: Common 8-point low-complexity matrices associated with DCT approximations

  T_RDCT [27] =
    [ 1  1  1  1  1  1  1  1
      1  1  1  0  0 -1 -1 -1
      1  0  0 -1 -1  0  0  1
      1  0 -1 -1  1  1  0 -1
      1 -1 -1  1  1 -1 -1  1
      1 -1  0  1 -1  0  1 -1
      0 -1  1  0  0  1 -1  0
      0 -1  1 -1  1 -1  1  0 ]

  T_BAS-2008b [52] =
    [ 1  1  1  1  1  1  1  1
      1  1  1  0  0 -1 -1 -1
      1  1 -1 -1 -1 -1  1  1
      1  0 -1  0  0  1  0 -1
      1 -1 -1  1  1 -1 -1  1
      1 -1  1  0  0 -1  1 -1
      1 -1  1 -1 -1  1 -1  1
      1 -1  1 -1  1 -1  1 -1 ]

  T_LO [57] =
    [  1    1    1    1    1    1    1    1
       1    1    1    0    0   -1   -1   -1
       1   1/2 -1/2  -1   -1  -1/2  1/2   1
       1    0   -1   -1    1    1    0   -1
       1   -1   -1    1    1   -1   -1    1
       1   -1    0    1   -1    0    1   -1
      1/2  -1    1  -1/2 -1/2   1   -1   1/2
       0   -1    1   -1    1   -1    1    0 ]

  T_6 [14] =
    [ 1  1  1  1  1  1  1  1
      2  1  1  0  0 -1 -1 -2
      2  1 -1 -2 -2 -1  1  2
      1  0 -2 -1  1  2  0 -1
      1 -1 -1  1  1 -1 -1  1
      1 -2  0  1 -1  0  2 -1
      1 -2  2 -1 -1  2 -2  1
      0 -1  1 -2  2 -1  1  0 ]

  T_4 [14] =
    [ 1  1  1  1  1  1  1  1
      1  1  1  0  0 -1 -1 -1
      1  1 -1 -1 -1 -1  1  1
      1  0 -1 -1  1  1  0 -1
      1 -1 -1  1  1 -1 -1  1
      1 -1  0  1 -1  0  1 -1
      1 -1  1 -1 -1  1 -1  1
      0 -1  1 -1  1 -1  1  0 ]
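The orthogonalization in (1)-(2) is easy to sketch numerically. Because the rows of these low-complexity matrices are mutually orthogonal, T·T⊤ is diagonal and the matrix square root in (2) reduces to scalar reciprocals of the row norms. The following NumPy sketch (variable names ours) illustrates this with T_RDCT from Table 2.

```python
import numpy as np

T_rdct = np.array([
    [1,  1,  1,  1,  1,  1,  1,  1],
    [1,  1,  1,  0,  0, -1, -1, -1],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [1,  0, -1, -1,  1,  1,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [1, -1,  0,  1, -1,  0,  1, -1],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0, -1,  1, -1,  1, -1,  1,  0],
])

G = T_rdct @ T_rdct.T
assert np.allclose(G, np.diag(np.diag(G)))   # rows are mutually orthogonal

# S = sqrt((T T^T)^-1) degenerates to a diagonal of reciprocal row norms.
S = np.diag(1.0 / np.sqrt(np.diag(G)))
C_hat = S @ T_rdct                           # orthogonal DCT approximation
assert np.allclose(C_hat @ C_hat.T, np.eye(8))
```

The resulting diagonal of T·T⊤ here is (8, 6, 4, 6, 8, 6, 4, 6), so S contains only values of the form 1/√m, consistent with the general construction in (2).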
Essentially, these approximations are matrices defined over the set {0, ±1/2, ±1, ±2}, with multiplication by powers of two implying simple bit-shifting operations. Such approximations were demonstrated to be competitive substitutes for the DCT and its related integer transformations, as shown in [14, 27, 48, 52-57]. Table 2 illustrates some common integer transformations linked to DCT approximations.

3 Greedy Approximations

3.1 Optimization Approach

Approximate DCT matrices are often obtained by fully considering the exact DCT matrix C, including its symmetries [59], fast algorithms [23, 33], parametrizations [31], and numerical properties [28]. Usually the low-complexity component of a DCT approximation is found by solving the following optimization problem:

  T = arg min_{T'} approx(T', C),

where approx(·, ·) is a particular approximation assessment function (such as a proximity measure or a coding performance metric [3]), and the minimization is subject to various constraints, such as orthogonality and low complexity of the candidate matrices T'.

However, the DCT matrix can be understood as a stack of row vectors c⊤_k, k = 1, 2, ..., 8, as follows:

  C = [ c1  c2  c3  c4  c5  c6  c7  c8 ]⊤.    (3)

In the current work, to derive an approximation for C, we propose to individually approximate each of its rows, in the hope that the set of approximate rows generates a good approximate matrix. Such a heuristic can be categorized as a greedy method [60]. Therefore, our goal is to derive a low-complexity integer matrix

  T = [ t1  t2  t3  t4  t5  t6  t7  t8 ]⊤    (4)

such that its rows t⊤_k, k = 1, 2, ..., 8, satisfy:

  t_k = arg min_{t ∈ D} error(t, c_k),  k = 1, 2, ..., 8,    (5)

subject to constraints such as (i) low complexity of the candidate vector t and (ii) orthogonality of the resulting matrix T. The objective function error(·, ·) returns a given error measure, and D is a suitable search space.

3.2 Search Space

In order to obtain a low-complexity matrix T, its entries must be computationally simple [3, 11]. We define the search space as the collection of 8-point vectors whose entries are in a set, say P, of low-complexity elements. Therefore, we have the search space D = P⁸. Some choices for P include: P1 = {0, ±1} and P2 = {0, ±1, ±2}. Tables 3 and 4 display some elements of the search spaces D1 = P1⁸ and D2 = P2⁸. These search spaces have 6,561 and 390,625 elements, respectively.

Table 3: Examples of candidate vectors from the search space D1

    n      Candidate vector
    1      [  1  1  1  1  1  1  1  1 ]⊤
    2      [  1  1  1  1  1  1  1 -1 ]⊤
    3      [  1  1  1  1  1  1  1  0 ]⊤
    4      [  1  1  1  1  1  1 -1  1 ]⊤
    ...    ...
    6558   [ -1 -1 -1 -1 -1 -1  1 -1 ]⊤
    6559   [ -1 -1 -1 -1 -1 -1 -1  1 ]⊤
    6560   [ -1 -1 -1 -1 -1 -1 -1  0 ]⊤
    6561   [ -1 -1 -1 -1 -1 -1 -1 -1 ]⊤

Table 4: Examples of candidate vectors from the search space D2

    n        Candidate vector
    1        [  2  2  2  2  2  2  2 -1 ]⊤
    2        [  2  2  2  2  2  2  2  0 ]⊤
    3        [  2  2  2  2  2  2  2  1 ]⊤
    4        [  2  2  2  2  2  2  2  2 ]⊤
    ...      ...
    390622   [ -2 -2 -2 -2 -2 -2 -2 -2 ]⊤
    390623   [ -2 -2 -2 -2 -2 -2 -2 -1 ]⊤
    390624   [ -2 -2 -2 -2 -2 -2 -2  1 ]⊤
    390625   [ -2 -2 -2 -2 -2 -2 -2  0 ]⊤

3.3 Objective Function

The problem posed in (5) requires the identification of an error function to quantify the "distance" between the candidate row vectors from D and the rows of the exact DCT. The related literature often considers error functions based on matrix norms [46], proximity to orthogonality [61], and coding performance [3]. In this work, we propose the utilization of a distance based on the angle between vectors as the objective function to be minimized. Let u and v be two vectors defined over the same Euclidean space. The angle between them is simply given by:

  angle(u, v) = arccos( ⟨u, v⟩ / (∥u∥ · ∥v∥) ),

where ⟨·, ·⟩ is the inner product and ∥·∥ indicates the norm induced by the inner product [62].
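The row-wise search in (5) can be sketched directly. Below we enumerate D1 and look for the candidate with the smallest angle to the second row of the exact DCT (a NumPy sketch; variable names are ours).

```python
import itertools
import numpy as np

def angle(u, v):
    # Angle between two vectors, as defined in Section 3.3.
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

# Second row of the exact 8-point DCT: entries cos(pi*(2n+1)/16)/2.
c2 = 0.5 * np.cos(np.pi * (2 * np.arange(8) + 1) / 16)

# Enumerate D1 = {0, +1, -1}^8 (6561 vectors) and keep the candidate
# with the smallest angle to c2; the all-zero vector is skipped.
D1 = list(itertools.product([0, 1, -1], repeat=8))
best = min(D1[1:], key=lambda t: angle(np.array(t, float), c2))
print(best)  # -> (1, 1, 1, 0, 0, -1, -1, -1)
```

The winner is exactly the second row of the RDCT in Table 2, which is consistent with the result reported in Section 4.1 that the search over D1 recovers previously published approximations.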
3.4 Orthogonality and Row Order

In addition, we require that the ensemble of rows t⊤_k, k = 1, 2, ..., 8, forms an orthogonal set. This ensures that an orthogonal approximation can be obtained. As shown in [45, 47], for this property to be satisfied, it suffices that:

  T · T⊤ = [diagonal matrix].

Because we aim at approximating each of the exact DCT matrix rows individually, the sequential order in which the rows are approximated may matter. Notice that we approximate the rows of the DCT based on a set of low-complexity rows, the search space. For instance, let us assume that we approximate the rows in the following order: ℘ = (1, 2, 3, 4, 5, 6, 7, 8). Once we find a good approximate row for the first exact row, i.e., a row vector in the search space which has the smallest angle in relation to that exact row, the second row is approximated considering only the row vectors in the search space that are orthogonal to the approximation of the first row. After that, the third exact row is approximated considering only the row vectors in the search space that are orthogonal to the first and second rows already chosen. And so on. This procedure characterizes the greedy nature of the proposed algorithm.

Consider now the approximation order (4, 3, 5, 6, 1, 2, 7, 8), a permutation of ℘. In this case, we start by approximating the fourth exact row considering the whole search space. Hence, the obtained approximate row might be different from the one obtained by considering ℘, since in that case the search space is restricted in a different manner. As an example, consider the DCT matrix of length 8 introduced in Section 2.
If we consider the low-complexity set {-1, 0, 1} and the approximation order (1, 2, 3, 4, 5, 6, 7, 8), we obtain the following approximate matrix:

  [ 1  1  1  1  1  1  1  1
    1  1  1  0  0 -1 -1 -1
    1  0  0 -1 -1  0  0  1
    1  0 -1 -1  1  1  0 -1
    1 -1 -1  1  1 -1 -1  1
    1 -1  0  1 -1  0  1 -1
    0 -1  1  0  0  1 -1  0
    0 -1  1 -1  1 -1  1  0 ].

On the other hand, if we consider the reverse approximation order, (8, 7, 6, 5, 4, 3, 2, 1), we obtain the following matrix:

  [ 1  1  1  1  1  1  1  1
    1  1  1  0  0 -1 -1 -1
    1  1 -1 -1 -1 -1  1  1
    1  0 -1 -1  1  1  0 -1
    1 -1 -1  1  1 -1 -1  1
    1 -1  0  1 -1  0  1 -1
    1 -1  1 -1 -1  1 -1  1
    0 -1  1 -1  1 -1  1  0 ].

Therefore, the row order considered during the search matters for the resulting matrix. Thus, the row vectors c⊤_k of the exact matrix must be approximated in all possible orders. For a systematic procedure, all the 8! = 40320 possible permutations of the sequence ℘ must be considered.

Figure 1: Algorithm for the proposed method

   1: procedure ABMAPPROX(C, ℘, D)
   2:   approximations ← null 3-D matrix of size 8 × 8 × |℘|
   3:   for m ← 1 to |℘| do
   4:     ℘_m ← ℘(m, :)
   5:     for k ← 1 to 8 do
   6:       θ_min ← 2π
   7:       index ← 1
   8:       for i ← 1 to |D| do
   9:         aux ← approximations(:, :, m) · (D(i, :))⊤
  10:         if sum(aux) = 0 then
  11:           θ ← angle(C(℘_m(k), :), D(i, :))
  12:           if θ < θ_min then
  13:             θ_min ← θ
  14:             index ← i
  15:           end if
  16:         end if
  17:       end for
  18:       approximations(℘_m(k), :, m) ← D(index, :)
  19:     end for
  20:   end for
  21: end procedure

Let ℘_m, m = 1, 2, ..., 40320, be the sequence that determines the m-th permutation; e.g., ℘_1250 = (1, 3, 7, 6, 5, 4, 8, 2). The particular elements of a sequence are denoted by ℘_m(k), k = 1, 2, ..., 8. Then, for the example above, we have ℘_1250(2) = 3.
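A minimal NumPy sketch of the greedy row-by-row search of Figure 1, for a single permutation, is shown below. Function and variable names are ours; the tie-breaking (keep the first candidate found) is an implementation choice, not taken from the paper.

```python
import itertools
import numpy as np

def angle(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def greedy_approximation(C, order, D):
    # Visit the rows of C in 'order'; match each to the candidate in D
    # with the smallest angle, restricted to candidates orthogonal to
    # every row already chosen (the greedy constraint of Section 3.4).
    T = np.zeros((8, 8))
    chosen = []
    for k in order:
        best, best_theta = None, np.inf
        for d in D:
            v = np.array(d, float)
            if not v.any():
                continue                      # skip the all-zero vector
            if any(np.dot(v, t) != 0 for t in chosen):
                continue                      # violates orthogonality
            theta = angle(C[k], v)
            if theta < best_theta - 1e-9:     # ties keep the first hit
                best, best_theta = v, theta
        T[k] = best
        chosen.append(best)
    return T

# Exact orthonormal 8-point DCT.
n = np.arange(8)
C = 0.5 * np.array([np.cos(np.pi * (2 * n + 1) * k / 16) for k in range(8)])
C[0] /= np.sqrt(2)

D1 = list(itertools.product([0, 1, -1], repeat=8))
T = greedy_approximation(C, range(8), D1)
```

Run with the natural order over D1, this returns the RDCT of Table 2 (the paper reports that the reverse order leads to T4 instead).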
3.5 Proposed Optimization Problem

Considering the above discussion, we can re-cast (5) in more precise terms. For each permutation sequence ℘_m, we have the following optimization problem:

  t_{℘_m(k)} = arg min_{d ∈ D} angle(c_{℘_m(k)}, d),  k = 1, 2, ..., 8,    (6)

subject to:

  ⟨t_{℘_m(i)}, t_{℘_m(j)}⟩ = 0,  i ≠ j,

for m = 1, 2, ..., 40320 and a fixed search space D ∈ {D1, D2}.

For each m, the solution of the above problem returns eight vectors, t⊤_{℘_m(1)}, t⊤_{℘_m(2)}, ..., t⊤_{℘_m(8)}, that are used as the rows of the desired low-complexity matrix. Note that each sequence ℘_m may result in a different solution. Effectively, there are 8! = 40320 problems to be solved; in principle, each permutation ℘_m can furnish a different matrix.

Because the search space is relatively small, we solved (6) by means of exhaustive search. Although simple, such an approach ensures the attainment of a solution and avoids convergence issues [60]. Figure 1 shows the pseudo-code for the adopted procedure to solve (6). It is important to highlight that, although the proposed formulation is applicable to arbitrary transform lengths, it may not be computationally feasible for larger sizes. For this reason, we restrict our analysis to the 8-point case. Section 6.2 discusses an alternative way of generating higher-order DCT approximations.

4 Results and Evaluation

In this section, we apply the proposed method aiming at the derivation of new approximations for the 8-point DCT. Subsequently, we analyze and compare the obtained matrices with a representative set of DCT approximations described in the literature, according to several well-known figures of merit [63].

4.1 New 8-point DCT Approximations

Considering the search spaces D1 and D2 (cf. Tables 3 and 4, respectively), we apply the proposed algorithm to solve (6).
Because the first and fifth rows of the exact DCT are trivially approximated by the row vectors [1 1 1 1 1 1 1 1] and [1 -1 -1 1 1 -1 -1 1], respectively, we limited the search to the remaining six rows. As a consequence, the number of possible candidate matrices is reduced to 6! = 720. For D1, only two different matrices were obtained, which coincide with previously archived approximations, namely: (i) the RDCT [27] and (ii) the matrix T4 introduced in [14]. These approximations are depicted in Table 2. On the other hand, considering the search space D2, the following two new matrices were obtained:

  T1 = [ 1  1  1  1  1  1  1  1
         2  2  1  0  0 -1 -2 -2
         2  1 -1 -2 -2 -1  1  2
         1  0 -2 -2  2  2  0 -1
         1 -1 -1  1  1 -1 -1  1
         2 -2  0  1 -1  0  2 -2
         1 -2  2 -1 -1  2 -2  1
         0 -1  2 -2  2 -2  1  0 ],

  T2 = [ 1  1  1  1  1  1  1  1
         2  1  2  0  0 -2 -1 -2
         2  1 -1 -2 -2 -1  1  2
         2  0 -2 -1  1  2  0 -2
         1 -1 -1  1  1 -1 -1  1
         1 -2  0  2 -2  0  2 -1
         1 -2  2 -1 -1  2 -2  1
         0 -2  1 -2  2 -1  2  0 ].

According to (1) and (2), the above low-complexity matrices T1 and T2 can be modified to provide orthogonal transformations Ĉ1 and Ĉ2. The selected orthogonalization procedure is based on the polar decomposition, as described in [45, 47, 64]. Thus, the orthogonal DCT approximations associated with the matrices T1 and T2 are given by Ĉ1 = S1 · T1 and Ĉ2 = S2 · T2, where S_i = sqrt((T_i · T_i⊤)⁻¹), i = 1, 2. It follows that:

  S1 = S2 = diag( 1/√8, 1/√18, 1/√20, 1/√18, 1/√8, 1/√18, 1/√20, 1/√18 ).

Other simulations were performed considering extended sets of elements. In particular, the following sets were considered: {0, ±1, ±4}, {0, ±1, ±8}, {0, ±1, ±2, ±4}, {0, ±1, ±2, ±8}, and {0, ±2, ±4, ±8}. Generally, the resulting matrices did not perform as well as the ones being proposed.
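Returning to the proposed matrices: the orthogonality of the rows of T1 and the diagonal form of S1 stated above can be verified directly (a NumPy sketch; variable names are ours).

```python
import numpy as np

T1 = np.array([
    [1,  1,  1,  1,  1,  1,  1,  1],
    [2,  2,  1,  0,  0, -1, -2, -2],
    [2,  1, -1, -2, -2, -1,  1,  2],
    [1,  0, -2, -2,  2,  2,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [2, -2,  0,  1, -1,  0,  2, -2],
    [1, -2,  2, -1, -1,  2, -2,  1],
    [0, -1,  2, -2,  2, -2,  1,  0],
])

G = T1 @ T1.T
# Rows of T1 are mutually orthogonal, so G is diagonal and the matrix
# square root in (2) reduces to reciprocal square roots of row norms.
assert np.allclose(G, np.diag(np.diag(G)))
print(np.diag(G))  # -> [ 8 18 20 18  8 18 20 18]

S1 = np.diag(1.0 / np.sqrt(np.diag(G)))
C1 = S1 @ T1
assert np.allclose(C1 @ C1.T, np.eye(8))   # C1 is orthogonal
```

The diagonal entries 8, 18, 20, 18, ... reproduce exactly the values 1/√8, 1/√18, 1/√20, 1/√18 of S1 = S2 given above.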
Moreover, the computational cost associated with these extended sets was consistently higher. The number of vectors in the search space can be calculated as |D| = |P|⁸ (cf. Section 3.2). Therefore, including more elements in P effects a noticeable increase in the size of the search space. As a consequence, the processing time to derive all the 6! candidate matrices increases accordingly.

4.2 Approximation Measures

Approximation measurements are computed between an approximate matrix Ĉ (not the low-complexity matrix T) and the exact DCT. To evaluate the performance of the proposed approximations, Ĉ1 and Ĉ2, we selected traditional figures of merit: (i) total error energy (ε(·)) [46]; (ii) mean square error (MSE(·)) [3, 65]; (iii) coding gain (C_g(·)) [3, 66, 67]; and (iv) transform efficiency (η(·)) [3]. The MSE and total error energy are suitable measures to quantify the difference between the exact DCT and its approximations [3]. The coding gain and transform efficiency are appropriate tools to quantify compression, redundancy removal, and data decorrelation capabilities [3]. Additionally, due to the angular nature of the objective function required by the proposed optimization problem, we also considered descriptive circular statistics [68, 69]. Circular statistics allow the quantification of the approximation error in terms of the angle difference between the row vectors of the approximate and exact matrices.

Hereafter we adopt the following quantities and notation: the interpixel correlation is ρ = 0.95 [3, 35, 66], Ĉ is an approximation for the DCT, and R̂_y = Ĉ · R_x · Ĉ⊤, where R_x is the covariance matrix of x, whose elements are given by ρ^|i-j|, i, j = 1, 2, ..., 8. We detail each of these measures below.

4.2.1 Total Error Energy

The total error energy is a similarity measure given by [46]:

  ε(Ĉ) = π · ∥C - Ĉ∥²_F,

where ∥·∥_F represents the Frobenius norm [70].
4.2.2 Mean Square Error

The MSE of a given approximation Ĉ is furnished by [3, 65]:

  MSE(Ĉ) = (1/8) · tr( (C - Ĉ) · R_x · (C - Ĉ)⊤ ),

where tr(·) represents the trace operator [3]. The total error energy and the mean square error are appropriate measures for capturing the approximation error in a Euclidean distance sense.

4.2.3 Coding Gain

The coding gain quantifies the energy compaction capability and is given by [3]:

  C_g(Ĉ) = 10 · log10( ( (1/8) · Σ_{i=1}^{8} r_{i,i} ) / ( Π_{i=1}^{8} r_{i,i} · ∥ĉ_i∥² )^{1/8} ),

where r_{i,i} is the i-th diagonal element of R̂_y [3] and ĉ⊤_i is the i-th row of Ĉ. However, as pointed out in [67], the previous definition is suitable for orthogonal transforms only. For non-orthogonal transforms, such as the SDCT [51] and the MRDCT [48], we adopt the unified coding gain [67]. For i = 1, 2, ..., 8, let ĉ⊤_i and ĝ⊤_i be the i-th rows of Ĉ and Ĉ⁻¹, respectively. Then, the unified coding gain is given by:

  C*_g(Ĉ) = 10 · log10( Π_{i=1}^{8} 1 / (A_i · B_i)^{1/8} ),

where A_i = su( (ĉ_i · ĉ⊤_i) ⊙ R_x ), su(·) returns the sum of all elements of the input matrix, ⊙ represents the element-wise product, and B_i = ∥ĝ_i∥².

4.2.4 Transform Efficiency

The transform efficiency is an alternative measure to the coding gain, expressed according to [3]:

  η(Ĉ) = ( Σ_{i=1}^{8} |r_{i,i}| / Σ_{i=1}^{8} Σ_{j=1}^{8} |r_{i,j}| ) · 100,

where r_{i,j} is the (i, j)-th entry of R̂_y, i, j = 1, 2, ..., 8 [3].

4.2.5 Circular Statistics

Because the objective function in (6) is the angle operator, its associated values are distributed on the unit circle. This type of data is suitably analyzed by circular statistics tools [68, 69, 71]. Let a be an arbitrary 8-point vector and q = [1 0 0 0 0 0 0 0]⊤. The angle function is given by [68]:

  θ = angle(a', q),

where a' = a/∥a∥ is the normalized version of a.
The mean angle (circular mean) is given by [69, 71]:

  θ̄ =
    arctan(S/C),        if C > 0 and S ≥ 0,
    π/2,                if C = 0 and S > 0,
    arctan(S/C) + π,    if C < 0,
    arctan(S/C) + 2π,   if C ≥ 0 and S < 0,
    undefined,          if C = 0 and S = 0,

where C = Σ_i cos(θ_i), S = Σ_i sin(θ_i), and {θ_i} is a collection of angles. The circular variance is given by [68]:

  V = 1 - sqrt(C² + S²)/8.

The minimal variance occurs when all observed angles are identical; in this case, V = 0. On the other hand, the maximum variance occurs when the observations are uniformly distributed around the unit circle; then V = 1 [69].

Considering the rows of the 8-point DCT matrix and of a given 8-point approximate matrix, the angle function furnishes the following angles, respectively: θ_{c_k} = angle(c_k, q) and θ_{t_k} = angle(t_k, q), k = 1, 2, ..., 8 (cf. (3) and (4)). In this particular case, the mean circular difference, which measures the mean difference between the pairs of angles, is defined as follows:

  D̄ = (1/8²) · Σ_{i=1}^{8} Σ_{j=1}^{8} ( π - | π - |θ_{c_i} - θ_{t_j}| | ).

The expression above considers the difference between all possible pairs of angles. However, we are interested in comparing the angle between the i-th row of the DCT and the corresponding row of the approximate matrix, i.e., the cases where i = j. Thus we have the modified circular mean difference:

  D̄_mod = (1/8) · Σ_{i=1}^{8} ( π - | π - |θ_{c_i} - θ_{t_i}| | ).

4.3 Results and Comparisons

Table 5 shows the obtained measurements for all approximations derived, according to (1), from the low-complexity matrices considered in this paper. Table 6 brings a summary of the descriptive circular statistics. We also included the exact DCT and the integer DCT (IDCT) [72] for comparison. The considered IDCT is the 8-point approximation adopted in the HEVC standard [72]. A more detailed analysis of the performance of the proposed approximation in comparison with the IDCT is provided in Section 6.2.

Table 5: Performance measures for DCT approximations derived from low-complexity matrices. Exact DCT measures are listed for reference

  Method               ε         MSE          C*_g     η
  DCT [12]             0         0            8.8259   93.9912
  IDCT (HEVC) [72]     0.0020    8.66x10^-6   8.8248   93.8236
  Ĉ1 (proposed)        1.2194    0.0046       8.6337   90.4615
  Ĉ2 (proposed)        1.2194    0.0127       8.1024   87.2275
  Ĉ_LO [57]            0.8695    0.0061       8.3902   88.7023
  Ĉ_SDCT [51]          3.3158    0.0207       6.0261   82.6190
  Ĉ_RDCT [27]          1.7945    0.0098       8.1827   87.4297
  Ĉ_MRDCT [48]         8.6592    0.0594       7.3326   80.8969
  Ĉ_BAS-2008a [53]     5.9294    0.0238       8.1194   86.8626
  Ĉ_BAS-2008b [52]     4.1875    0.0191       6.2684   83.1734
  Ĉ_BAS-2009 [54]      6.8543    0.0275       7.9126   85.3799
  Ĉ_BAS-2011 [55]      26.8462   0.0710       7.9118   85.6419
  Ĉ_BAS-2013 [56]      35.0639   0.1023       7.9461   85.3138
  Ĉ'1 [14]             3.3158    0.0208       6.0462   83.0814
  Ĉ4 [14]              1.7945    0.0098       8.1834   87.1567
  Ĉ5 [14]              1.7945    0.0100       8.1369   86.5359
  Ĉ6 [14]              0.8695    0.0062       8.3437   88.0594

The proposed DCT approximation Ĉ1 outperforms all competing methods in terms of MSE, coding gain, and transform efficiency. It also ranks second best in total error energy. It is unusual for an approximation to simultaneously excel in measures based on Euclidean distance (ε and MSE) as well as in coding-based measures. The Lengwehasatit-Ortega approximation (Ĉ_LO) [57] achieves the second best results in MSE and η.

Because of its relatively inferior performance, we removed the new approximation Ĉ2 from our subsequent analysis. Nevertheless, Ĉ2 could outperform the approximations Ĉ_BAS-2008b [52], Ĉ_BAS-2009 [54], Ĉ_BAS-2011 [55], Ĉ_BAS-2013 [56], Ĉ_SDCT [51], Ĉ_MRDCT [48], and Ĉ'1 [14] in all considered measures; Ĉ4 [14] and Ĉ5 [14] in terms of total error energy and transform efficiency; Ĉ_RDCT [27] in terms of total error energy; and Ĉ_BAS-2008a [53] in terms of total error energy, MSE, and transform efficiency.
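As a numerical cross-check of Tables 5 and 6, the total error energy, MSE, and circular quantities can be evaluated for the proposed transformation. This is a NumPy sketch with our own helper names; note that np.arctan2 implements the piecewise circular-mean definition of Section 4.2.5 directly.

```python
import numpy as np

# Exact orthonormal 8-point DCT and the proposed approximation C1.
n = np.arange(8)
C = 0.5 * np.array([np.cos(np.pi * (2 * n + 1) * k / 16) for k in range(8)])
C[0] /= np.sqrt(2)

T1 = np.array([
    [1,  1,  1,  1,  1,  1,  1,  1],
    [2,  2,  1,  0,  0, -1, -2, -2],
    [2,  1, -1, -2, -2, -1,  1,  2],
    [1,  0, -2, -2,  2,  2,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [2, -2,  0,  1, -1,  0,  2, -2],
    [1, -2,  2, -1, -1,  2, -2,  1],
    [0, -1,  2, -2,  2, -2,  1,  0],
])
C1 = np.diag(1.0 / np.sqrt(np.diag(T1 @ T1.T))) @ T1

# Euclidean measures of Sections 4.2.1 and 4.2.2.
rho = 0.95
i, j = np.meshgrid(n, n, indexing="ij")
Rx = rho ** np.abs(i - j)                  # first-order Markov covariance
eps = np.pi * np.linalg.norm(C - C1, "fro") ** 2
E = C - C1
mse = np.trace(E @ Rx @ E.T) / 8

# Circular statistics of Section 4.2.5 for the rows of T1.
def angle(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

q = np.eye(8)[0]
theta_c = np.array([angle(r, q) for r in C])
theta_t = np.array([angle(r, q) for r in T1])
Cs, Ss = np.cos(theta_t).sum(), np.sin(theta_t).sum()
mean_angle = np.degrees(np.arctan2(Ss, Cs)) % 360    # circular mean
V = 1 - np.hypot(Cs, Ss) / 8                         # circular variance
D_mod = np.mean(np.pi - np.abs(np.pi - np.abs(theta_c - theta_t)))

print(eps, mse, mean_angle, V, D_mod)
```

The printed values can be compared with the Ĉ1 row of Table 5 (ε ≈ 1.22, MSE ≈ 0.005) and the T1 row of Table 6 (θ̄ ≈ 71.1°, V ≈ 0.012, D̄_mod ≈ 0.071).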
Hereafter we focus our attention on the proposed approximation b C 1 . The proposed search algorithm is greedy , i.e., it makes local optimal choices hoping to find the global optimum solution [60]. Therefore, it is not guaranteed that the obtained solutions are globally optimal. This is exactly what happens here. As can be seen in T able 6, the proposed matrix T 1 is not the transformation matrix that provides the lowest cir- cular mean difference among the approximations on literature. Despite this fact , the proposed matrix has outstanding performance according to the considered measures. Figure 2 shows the effect of the interpixel correlation ρ on the per- formance of the discussed approximate transforms as measured by the unified coding gain difference compared to the exact DCT [73]. The proposed method outperforms the competing methods as its coding gain difference is smaller for any choice of ρ . For highly correlated data the coding degradation in dB is roughly reduced by half when the proposed approximation b C 1 is considered. 5 F ast Algorithm and Hardwar e Realization 5.1 Fast Algorithm The direct implementation of T 1 requires 48 additions and 24 bit- shifting operations. Ho wever , such computational cost can be sig- nificantly reduced by means of sparse matrix factorization [11]. 
Table 6: Descriptive circular statistics.

Method              θ̄ (degrees)   V        D̄_mod
Exact DCT [12]      70.53         0.0089   0
IDCT (HEVC) [72]    70.50         0.0086   0.0022
T1 (proposed)       71.12         0.0124   0.0711
T2 (proposed)       71.12         0.0124   0.0343
T_LO [57]           70.81         0.0102   0.0483
T_SDCT [51]         69.29         0        0.1062
T_RDCT [27]         71.98         0.0174   0.0716
T_MRDCT [48]        75.58         0.0392   0.1646
T_BAS-2008a [53]    72.35         0.0198   0.1036
T_BAS-2008b [52]    67.29         0.0015   0.1097
T_BAS-2009 [54]     72.10         0.0183   0.1334
T_BAS-2011 [55]     73.54         0.0265   0.1492
T_BAS-2013 [56]     69.29         0        0.1062
T′1 [14]            73.54         0.0265   0.0901
T4 [14]             70.57         0.0085   0.0781
T5 [14]             72.45         0.0209   0.0730
T6 [14]             71.27         0.0139   0.0497

Figure 2: Curves for the coding gain error of Ĉ1, Ĉ_LO, and Ĉ6, relative to the exact DCT, for 0 < ρ < 1.

In fact, considering butterfly-based structures as commonly found in decimation-in-frequency algorithms, such as [5, 33, 74], we could derive the following factorization for T1:

T1 = D · A4 · A3 · A2 · A1,

where

A1 =
[ 1  0  0  0  0  0  0  1 ]
[ 0  1  0  0  0  0  1  0 ]
[ 0  0  1  0  0  1  0  0 ]
[ 0  0  0  1  1  0  0  0 ]
[ 0  0  0  1 -1  0  0  0 ]
[ 0  0  1  0  0 -1  0  0 ]
[ 0  1  0  0  0  0 -1  0 ]
[ 1  0  0  0  0  0  0 -1 ],

A2 =
[ 1  0  0  1  0  0  0  0 ]
[ 0  1  1  0  0  0  0  0 ]
[ 0  1 -1  0  0  0  0  0 ]
[ 1  0  0 -1  0  0  0  0 ]
[ 0  0  0  0  1  0  0  0 ]
[ 0  0  0  0  0  1  0  0 ]
[ 0  0  0  0  0  0  1  0 ]
[ 0  0  0  0  0  0  0  1 ],

A3 =
[ 1  1  0  0  0  0  0  0 ]
[ 1 -1  0  0  0  0  0  0 ]
[ 0  0  1  0  0  0  0  0 ]
[ 0  0  0  1  0  0  0  0 ]
[ 0  0  0  0  1  0  0  0 ]
[ 0  0  0  0  0  1  0  0 ]
[ 0  0  0  0  0  0  1  0 ]
[ 0  0  0  0  0  0  0  1 ],

A4 =
[ 1  0   0  0   0    0    0    0  ]
[ 0  0   0  0   0   1/2   1    1  ]
[ 0  0   1  2   0    0    0    0  ]
[ 0  0   0  0  -1   -1    0   1/2 ]
[ 0  1   0  0   0    0    0    0  ]
[ 0  0   0  0  1/2   0   -1    1  ]
[ 0  0  -2  1   0    0    0    0  ]
[ 0  0   0  0  -1    1  -1/2   0  ],

and D = diag(1, 2, 1, 2, 1, 2, 1, 2). Figure 3 shows the signal flow graph (SFG) related to the above factorization. The computational cost of this algorithm is only 24 additions and six multiplications by powers of two, which are extremely simple to perform, requiring only bit-shifting operations [3].
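The factorization can be checked numerically. Below is a sketch of the resulting 24-addition, 6-shift algorithm in Python (the intermediate variable names are ours; the outputs are produced in the natural row order of T1, and the multiplications by powers of two are written explicitly):

```python
import numpy as np

def t1_fast(x):
    """Fast algorithm for the proposed transform T1:
    24 additions and 6 bit-shifting operations."""
    # A1: first butterfly stage -- 8 additions
    a = [x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4],
         x[3] - x[4], x[2] - x[5], x[1] - x[6], x[0] - x[7]]
    # A2: butterfly on the even half -- 4 additions
    b = [a[0] + a[3], a[1] + a[2], a[1] - a[2], a[0] - a[3]]
    # A3 -- 2 additions
    c0, c1 = b[0] + b[1], b[0] - b[1]
    # A4 and D -- 10 additions and 6 shifts (each 2*(...) is one bit-shift)
    return np.array([
        c0,                          # X0
        a[5] + 2 * (a[6] + a[7]),    # X1
        b[2] + 2 * b[3],             # X2
        a[7] - 2 * (a[4] + a[5]),    # X3
        c1,                          # X4
        a[4] + 2 * (a[7] - a[6]),    # X5
        b[3] - 2 * b[2],             # X6
        2 * (a[5] - a[4]) - a[6],    # X7
    ])
```

Multiplying T1 directly by a test vector gives the same result, which confirms the factorization while using half the additions and a quarter of the shifts of the direct implementation.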
The proposed fast algorithm requires 50% fewer additions and 75% fewer bit-shifting operations than the direct implementation. The computational costs of the considered methods are shown in Table 7. The additive cost of the discussed approximations varies from 14 to 24 additions.

Table 7: Computational cost comparison.

Method             Multiplications   Additions   Bit-shifts
DCT [23]           11                29          0
IDCT (HEVC) [72]   0                 50          30
T1 (proposed)      0                 24          6
T_LO [57]          0                 24          2
T_SDCT [51]        0                 24          0
T_RDCT [27]        0                 22          0
T_MRDCT [48]       0                 14          0
T_BAS-2008a [53]   0                 18          2
T_BAS-2008b [52]   0                 21          0
T_BAS-2009 [54]    0                 18          0
T_BAS-2011 [55]    0                 16          0
T_BAS-2013 [56]    0                 24          0
T′1 [14]           0                 18          0
T4 [14]            0                 24          0
T5 [14]            0                 24          4
T6 [14]            0                 24          6

In general terms, DCT approximations exhibit a trade-off between computational cost and transform performance [61]: less complex matrices effect poorer spectral approximations [3]. Departing from this general behavior, the proposed transformation T1 has (i) excellent performance measures and (ii) lower or similar arithmetic cost when compared to competing methods, as shown in Tables 5, 6, and 7. Regarding the considered performance measures, three transformations are consistently placed among the five best methods: T1, T_LO, and T6. Thus, we selected them for hardware analysis.

5.2 FPGA Implementation

The proposed design, along with T_LO and T6, was implemented on an FPGA chip using the Xilinx ML605 board. Considering hardware co-simulation, the FPGA realization was tested with 100,000 random 8-point input test vectors. The test vectors were generated from within the MATLAB environment and, using JTAG-based hardware co-simulation, routed to the physical FPGA device, where each algorithm was realized in the reconfigurable logic fabric. The computational results obtained from the FPGA implementations were then routed back to the MATLAB memory space. The diagrams for the designs can be seen in Figure 4.
The metrics employed to evaluate the FPGA implementations were: configurable logic block (CLB) usage, flip-flop (FF) count, and critical path delay (T_cpd, in ns). The maximum operating frequency, F_max in MHz, was determined as the reciprocal of the critical path delay: F_max = 1/T_cpd. Values were obtained from the Xilinx FPGA synthesis and place-route tools by accessing the xflow.results report file. Using the Xilinx XPower Analyzer, we estimated the static (Q_p, in W) and dynamic (D_p, in mW/GHz) power consumption. In addition, we calculated the area-time (AT) and area-time-square (AT²) figures of merit, where area is measured in CLBs and time as the critical path delay. The values of these metrics for each design are shown in Table 8.

The design linked to the proposed approximation T1 possesses the smallest T_cpd among the considered methods. Such a critical path delay allows for operation at a frequency 8.55% and 19.96% higher than the designs associated with T_LO and T6, respectively. In terms of the area-time and area-time-square measures, the design linked to the approximation T_LO presents the best results, followed by the one associated with T1.

6 Computational Experiments

6.1 Still Image Compression

6.1.1 Experiment Setup and Results

To evaluate the efficiency of the proposed transformation matrix, we performed JPEG-like image compression experiments as described in [14, 24, 27]. Input images were divided into 8 × 8 pixel sub-blocks, which were submitted to a bidimensional (2-D) transformation, such as the DCT or one of its approximations. Let A be an 8 × 8 sub-block. The 2-D approximate transform of A is the 8 × 8 sub-block B obtained as follows [14, 46]:

B = Ĉ · A · Ĉᵀ.

Considering the zig-zag scan pattern detailed in [75], the initial r elements of B were retained, whereas the remaining (64 − r) elements were discarded.
Considering 8-bit images, this approach implies a fixed average of r/8 bits per pixel (bpp). The previous operation results in a matrix B′ populated with zeros, which is suitable for entropy encoding [16]. Each processed sub-block was submitted to the corresponding 2-D inverse transformation and the image was reconstructed. For orthogonal approximations, the 2-D inverse transform is given by:

A = Ĉᵀ · B′ · Ĉ.

We considered 44 standardized images obtained from the 'miscellaneous' volume of the USC-SIPI image bank [76], which includes common images such as Lena, Boat, Baboon, and Peppers. Without loss of generality, images were converted to 8-bit grayscale and submitted to the procedure described above. The reconstructed images were compared with the original images and evaluated quantitatively according to popular figures of merit: the mean square error (MSE) [3], the peak signal-to-noise ratio (PSNR) [63], and the structural similarity index (SSIM) [77]. We consider the MSE and PSNR measures because of their good mathematical properties and historical usage. However, as discussed in [65], the MSE and PSNR are not the best measures for predicting human perception of image fidelity and quality, for which SSIM has been shown to be a better measure [65, 77].

Figure 5 shows the average MSE, PSNR, and SSIM, respectively, for the 44 images, considering 1 < r < 64 retained coefficients (bpp from 0 to 8). The proposed approximation Ĉ1 outperforms Ĉ_LO and Ĉ6 in terms of MSE and PSNR for any value of r. In terms of SSIM, Ĉ1 outperforms Ĉ6 for any value of r, and Ĉ_LO for r ∈ [7, 63].

In order to better visualize the previous curves, we adopted the relative difference, which is given by [78]:

RD = (μ(C) − μ(Ĉ)) / μ(C),

where μ(C) and μ(Ĉ) indicate measurements according to the exact and approximate DCT, respectively, and μ ∈ {MSE, PSNR, SSIM}.
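The block-level portion of this experiment can be sketched compactly in Python/NumPy. Here the T1 entries and the row-wise normalization follow Section 5, and the zig-zag ordering is the standard JPEG one; the function names are ours:

```python
import numpy as np

# Low-complexity matrix T1 (Section 5) and its orthogonalized version C
T1 = np.array([
    [1,  1,  1,  1,  1,  1,  1,  1],
    [2,  2,  1,  0,  0, -1, -2, -2],
    [2,  1, -1, -2, -2, -1,  1,  2],
    [1,  0, -2, -2,  2,  2,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [2, -2,  0,  1, -1,  0,  2, -2],
    [1, -2,  2, -1, -1,  2, -2,  1],
    [0, -1,  2, -2,  2, -2,  1,  0],
], dtype=float)
C = np.diag(1.0 / np.sqrt((T1 ** 2).sum(axis=1))) @ T1   # orthogonal rows

# Zig-zag scan order for an 8x8 block: diagonals alternate direction
zigzag = sorted(((i, j) for i in range(8) for j in range(8)),
                key=lambda p: (p[0] + p[1],
                               p[0] if (p[0] + p[1]) % 2 else -p[0]))

def encode_decode(A, r):
    """2-D transform, retain the first r zig-zag coefficients, invert."""
    B = C @ A @ C.T
    Bp = np.zeros_like(B)
    for i, j in zigzag[:r]:
        Bp[i, j] = B[i, j]
    return C.T @ Bp @ C
```

Because C is orthogonal, retaining all 64 coefficients reconstructs the block exactly; smaller r trades reconstruction error for a lower bitrate of r/8 bpp.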
The relative differences for the MSE, PSNR, and SSIM are presented in Figure 6. Figure 6(c) shows that, for 12 < r < 60 (bpp from 1.5 to 7.5), Ĉ1 outperforms not only Ĉ_LO and Ĉ6 but the DCT itself. To the best of our knowledge, this particularly good behavior was never described in the literature, where the performance of DCT approximations is routinely assumed to be bounded by the performance of the exact DCT.

A qualitative evaluation is provided in Figures 7 and 8, which show the reconstructed Lena images [76] for r = 3 (0.375 bpp) and r = 14 (1.75 bpp), respectively, according to the exact DCT, Ĉ1, Ĉ_LO, and Ĉ6. As expected from the results shown in Figure 6(c), for a bitrate lower than 1.5 bpp the proposed approximate transform matrix is not the one that performs best (Figure 7), although its results are very similar to those furnished by the exact DCT. For a bitrate larger than 1.5 bpp, Figure 8 demonstrates a situation where the proposed approximation overcomes the other transforms, including the DCT.

Figure 3: Signal flow graph of the proposed transform, relating the input data x_n, n = 0, 1, ..., 7, to the corresponding coefficients X̃_k, k = 0, 1, ..., 7, where X̃ = x · T1. Dashed arrows represent multiplication by −1.

Figure 4: Architectures for (a) T1, (b) T_LO, and (c) T6.

Table 8: Hardware resource and power consumption using the Xilinx Virtex-6 XC6VLX240T 1FFG1156 device.

Approximation    CLB   FF    T_cpd (ns)   F_max (MHz)   D_p (mW/GHz)   Q_p (W)   AT    AT²
T1 (proposed)    135   408   1.750        571           2.74           3.471     236   413
T_LO [57]        114   349   1.900        526           2.82           3.468     217   412
T6 [14]          125   389   2.100        476           2.57           3.460     262   551

Figure 5: Curves for the average (a) MSE, (b) PSNR, and (c) SSIM over the 44 images.

Figure 6: Relative difference curves for the (a) MSE, (b) PSNR, and (c) SSIM of Ĉ1, Ĉ_LO, and Ĉ6, relative to the exact DCT.
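For reference, the PSNR figures quoted for 8-bit images throughout this section follow directly from the MSE via PSNR = 10 · log10(255²/MSE). A minimal check (the function name is ours) against the DCT panel of Figure 7, which reports MSE = 119.91:

```python
import numpy as np

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio (dB) from the mean square error."""
    return 10 * np.log10(peak ** 2 / mse)

print(round(psnr(119.91), 2))  # 27.34, matching Figure 7(a)
```

Captions that round the MSE to two decimals may differ from this computation in the last PSNR digit.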
In both cases, the visual difference between the DCT and the proposed approximate transform matrix is very small.

6.1.2 Discussion

The obtained approximation was capable of outperforming the DCT under the conditions described above. We believe this is relevant because it directly offers a counter-example to the belief that the coding performance of an approximation is always inferior to that of the DCT. The theoretical argument for the optimal performance of the DCT rests on the assumption that the considered images follow a first-order Markov process with high correlation (ρ → 1). In practice, natural images tend to fit this assumption, but at lower correlation values the ideal case (ρ → 1) may not be met as strongly. For instance, the average correlation of the considered image set was roughly 0.86. This practical deviation from the ideal case (ρ → 1) may also play a role in explaining our results.

Finding matrix approximations as described in this work is a computational and numerical task. To the best of our knowledge, no available methodology furnishes the mathematical tools needed to design optimal approximations for image compression in an a priori manner, i.e., without search-space exploration, optimization problem solving, or numerical simulation. In [3, 5, 25, 61, 79, 80], a large number of methods is listed, all of them aiming at good approximations. Although the problem of matrix approximation is quite simple to state, it is also very tricky and exhibits several non-linearities when combined with a more sophisticated system, such as an image compression codec. Finding low-complexity matrices can be categorized as an integer optimization problem. Thus, navigating the low-complexity matrix search space may generate non-trivial performance curves, usually exhibiting discontinuities, which seems to be the case for the proposed approximation matrix T1.
The navigation path through the search space is highly dependent on the search method and its objective function. Although approximation methods are very likely to provide reasonable approximations, it is very hard to tell beforehand whether a given approximation method is capable of furnishing results good enough to outperform the state of the art; only after experimentation with the obtained approximations can one know better. In particular, this work advances an optimization setup based on a geometrically intuitive objective function (the angle between vectors) that could find good matrices, as demonstrated by our a posteriori experiments.

6.2 Video Coding

In order to assess the proposed transform Ĉ1 as a tool for video coding, we embedded it into a publicly available HEVC reference software [81]. HEVC presents several improvements relative to its predecessors [82] and aims at providing high compression rates [22]. Differently from other standards (cf. Section 1), HEVC employs not only an 8-point integer DCT (IDCT) but also transforms of sizes 4, 16, and 32 [72]. Such a feature enables a series of optimization routines allowing the processing of large smooth or textureless areas [22].

Figure 7: Compression of Lena using (a) DCT, (b) Ĉ1, (c) Ĉ_LO, and (d) Ĉ6, considering r = 3 (0.375 bpp). (a) MSE = 119.91, PSNR = 27.34, SSIM = 0.8814. (b) MSE = 124.44, PSNR = 27.18, SSIM = 0.8767. (c) MSE = 131.08, PSNR = 26.95, SSIM = 0.8781. (d) MSE = 129.03, PSNR = 27.02, SSIM = 0.8763.

The computational search used in the proposed method proved feasible for the 8-point case. However, as N increases, the size of the search space grows very quickly. For the case N = 16, considering the low-complexity set P = {−1, 0, 1}, we have a search space with 3^16 ≈ 4.3 × 10^7 elements. For this blocklength, the proposed algorithm takes about 6 minutes to find an approximation for a fixed row
sequence, considering a machine with the following specifications: 16-core 2.4 GHz Intel Xeon E5-2630 v3 CPU with 32 GB RAM, running Ubuntu 16.04.3 LTS 64-bit. Therefore, since we have N! matrices to be generated, running the whole algorithm would take approximately 16! × 6 minutes, i.e., roughly 2.4 × 10^8 years. Thus, for now, we limited the scope of our computational search to 8-point approximations.

Figure 8: Compression of Lena using (a) DCT, (b) Ĉ1, (c) Ĉ_LO, and (d) Ĉ6, considering r = 14 (1.75 bpp). (a) MSE = 27.17, PSNR = 33.78, SSIM = 0.9888. (b) MSE = 33.48, PSNR = 32.88, SSIM = 0.9893. (c) MSE = 40.07, PSNR = 32.10, SSIM = 0.9849. (d) MSE = 42.18, PSNR = 31.87, SSIM = 0.9844.

For this reason, aiming to derive larger blocklength transforms for HEVC embedding, we submitted the proposed transform matrix T1 to the Jridi–Alfalou–Meher (JAM) scalable algorithm [83]. This method yielded 16- and 32-point versions of the proposed matrix T1 that are suitable for the sought video experiments. Although the JAM algorithm is similar to Chen's DCT [30], it exploits redundancies that allow concise and highly parallelizable hardware implementations [83]. From a low-complexity N/2-point transform, the JAM algorithm generates an N × N transformation matrix by combining two instances of the smaller one. The larger N-point transform is recursively defined by:

T^(N) = (1/√2) · M_per^(N) · [ T^(N/2)  Z_(N/2) ; Z_(N/2)  T^(N/2) ] · M_add^(N),    (7)

where Z_(N/2) is the zero matrix of order N/2.
Matrices M_add^(N) and M_per^(N) are, respectively, given by:

M_add^(N) = [ I_(N/2)   Ī_(N/2) ; Ī_(N/2)   −I_(N/2) ]

and

M_per^(N) = [ P_(N−1,N/2)   Z_(1,N/2) ; Z_(1,N/2)   P_(N−1,N/2) ],

where I_(N/2) and Ī_(N/2) are, respectively, the identity and counter-identity matrices of order N/2, and P_(N−1,N/2) is an (N−1) × (N/2) matrix whose row vectors are defined by:

P^(i)_(N−1,N/2) = Z_(1,N/2), if i = 1, 3, 5, ..., N−1;
P^(i)_(N−1,N/2) = I^(i/2)_(N/2), if i = 0, 2, 4, ..., N−2,

where I^(k)_(N/2) denotes the k-th row of I_(N/2) and Z_(1,N/2) is the 1 × (N/2) zero row. The scaling factor 1/√2 in (7) can be merged into the image/video compression quantization step. Furthermore, (1) can be applied to generate orthogonal versions of the larger transforms. The computational cost of the resulting N-point transform is given by twice the number of bit-shifting operations of the original N/2-point transform, and twice the number of additions plus N extra additions. Following the described algorithm, we obtained the proposed 16- and 32-point low-complexity transform matrices.
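One JAM scaling step can be sketched directly in terms of its effect on the rows: each row t of the smaller matrix yields the consecutive pair [t | t·Ī] and [t | −t·Ī], which is equivalent to (7) up to the 1/√2 factor absorbed into quantization (possibly up to row signs). The function name is ours:

```python
import numpy as np

def jam_scale(T):
    """Build the 2N-point low-complexity matrix from an N-point one.
    Each row t gives the row pair [t | t.J] and [t | -t.J],
    where J is the counter-identity (row reversal)."""
    Tr = T[:, ::-1]                          # T @ J: reverse each row
    out = np.empty((2 * T.shape[0], 2 * T.shape[1]))
    out[0::2] = np.hstack([T, Tr])           # even-indexed rows
    out[1::2] = np.hstack([T, -Tr])          # odd-indexed rows
    return out

T1 = np.array([
    [1,  1,  1,  1,  1,  1,  1,  1],
    [2,  2,  1,  0,  0, -1, -2, -2],
    [2,  1, -1, -2, -2, -1,  1,  2],
    [1,  0, -2, -2,  2,  2,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [2, -2,  0,  1, -1,  0,  2, -2],
    [1, -2,  2, -1, -1,  2, -2,  1],
    [0, -1,  2, -2,  2, -2,  1,  0],
], dtype=float)
T16 = jam_scale(T1)
T32 = jam_scale(T16)
```

Row orthogonality is preserved at each step, and the addition count doubles plus N (24 → 64 → 160), matching Table 9.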
More explicitly, we obtained the following 16-point matrix:

T^(16) =
[ 1  1  1  1  1  1  1  1    1  1  1  1  1  1  1  1 ]
[ 1  1  1  1  1  1  1  1   -1 -1 -1 -1 -1 -1 -1 -1 ]
[ 2  2  1  0  0 -1 -2 -2   -2 -2 -1  0  0  1  2  2 ]
[ 2  2  1  0  0 -1 -2 -2    2  2  1  0  0 -1 -2 -2 ]
[ 2  1 -1 -2 -2 -1  1  2    2  1 -1 -2 -2 -1  1  2 ]
[ 2  1 -1 -2 -2 -1  1  2   -2 -1  1  2  2  1 -1 -2 ]
[ 1  0 -2 -2  2  2  0 -1   -1  0  2  2 -2 -2  0  1 ]
[ 1  0 -2 -2  2  2  0 -1    1  0 -2 -2  2  2  0 -1 ]
[ 1 -1 -1  1  1 -1 -1  1    1 -1 -1  1  1 -1 -1  1 ]
[ 1 -1 -1  1  1 -1 -1  1   -1  1  1 -1 -1  1  1 -1 ]
[ 2 -2  0  1 -1  0  2 -2   -2  2  0 -1  1  0 -2  2 ]
[ 2 -2  0  1 -1  0  2 -2    2 -2  0  1 -1  0  2 -2 ]
[ 1 -2  2 -1 -1  2 -2  1    1 -2  2 -1 -1  2 -2  1 ]
[ 1 -2  2 -1 -1  2 -2  1   -1  2 -2  1  1 -2  2 -1 ]
[ 0 -1  2 -2  2 -2  1  0    0  1 -2  2 -2  2 -1  0 ]
[ 0 -1  2 -2  2 -2  1  0    0 -1  2 -2  2 -2  1  0 ]

Each row t of T1 thus generates the consecutive pair of rows [t | t·Ī_8] and [t | −t·Ī_8], where Ī_8 is the 8 × 8 counter-identity. The 32-point matrix T^(32) follows from one further application of (7) to T^(16): each row u of T^(16) analogously generates the consecutive pair of rows [u | u·Ī_16] and [u | −u·Ī_16].
The resulting approximations for the above low-complexity matrices can be found from (1) and (2). The diagonal matrices implied by (2) are

D^(16) = 4 · I_2 ⊗ diag(4, 9, 10, 9) ⊗ I_2  and  D^(32) = 2 · D^(16) ⊗ I_2,

respectively, where ⊗ denotes the Kronecker product [50] and I_2 is the 2 × 2 identity matrix. Figures 9 and 10 display the SFGs of the low-complexity transform matrices T^(16) and T^(32) derived from T1.

Figure 9: SFG of the proposed 16-point low-complexity transform matrix.

Table 9 lists the computational costs of the proposed transforms of sizes N = 8, 16, 32 compared to an efficient implementation of the IDCT [84].

In our experiments, the original 8-, 16-, and 32-point integer transforms of HEVC were replaced by Ĉ1 and its scaled versions. The original 4-point transform was kept unchanged because it is already a very low-complexity transformation. We encoded the first 100 frames of one video sequence from each of the classes A to F, in accordance with the common test conditions (CTC) documentation [85]. Namely, we used the 8-bit videos: PeopleOnStreet (2560 × 1600 at 30 fps), BasketballDrive (1920 × 1080 at 50 fps), RaceHorses (832 × 480 at 30 fps), BlowingBubbles (416 × 240 at 50 fps), KristenAndSara (1280 × 720 at 60 fps), and BasketballDrillText (832 × 480 at 50 fps).

Figure 10: SFG of the proposed 32-point low-complexity transform matrix, where T^(16) is the 16-point matrix presented in Figure 9.

As suggested in [83], all test parameters were set according to the CTC documentation. We tested the proposed transforms in the All Intra (AI), Random Access (RA), Low Delay B (LD-B), and Low Delay P (LD-P) configurations, all in the Main profile. We selected the frame-by-frame MSE and PSNR [72] for each YUV color channel as figures of merit. Then, for all test videos, we computed the rate-distortion (RD) curves considering the recommended quantization parameter (QP) values, i.e., 22, 27, 32, and 37 [85].
The resulting RD curves are depicted in Figure 11. We have also measured Bjøntegaard's delta PSNR (BD-PSNR) and delta rate (BD-Rate) [86, 87] for the modified HEVC software; these values are summarized in Table 10. Replacing the IDCT by the proposed transform and its scaled versions results in a quality loss of at most 0.47 dB for the AI configuration, which corresponds to an increase of 5.82% in bitrate. The worst performance for the other configurations (RA, LD-B, and LD-P) is found for the KristenAndSara video sequence, where approximately 0.55 dB is lost compared to the original HEVC implementation.

Table 9: Computational cost comparison for the 8-, 16-, and 32-point transforms embedded in the HEVC reference software.

       IDCT [84]                Proposed transform
N      Additions   Bit-shifts   Additions   Bit-shifts
8      50          30           24          6
16     186         86           64          12
32     682         278          160         24

Despite its very low computational cost when compared to the IDCT (cf. Table 9), the proposed transform does not introduce significant errors. Figure 12 illustrates the tenth frame of the BasketballDrive video encoded with the default HEVC IDCT and with Ĉ1 and its scaled versions, for each coding configuration, with the QP set to 32. Visual degradations are virtually imperceptible, demonstrating the real-world applicability of the proposed DCT approximations for high-resolution video coding.

7 Conclusion

In this paper, we set up and solved an optimization problem aiming at the proposition of new approximations for the 8-point DCT. The obtained approximations were determined according to a greedy heuristic that minimizes the angle between the rows of the approximate and the exact DCT matrices, under constraints of orthogonality and low computational complexity. One of the obtained approximations outperformed all the considered approximations in the literature according to popular performance measures.
We also introduced the use of circular statistics for assessing approximate transforms. For the proposed transform T1, a fast algorithm requiring only 24 additions and 6 bit-shifting operations was derived. The fast algorithms for the proposed method and its directly competing approximations were given FPGA realizations; simulations were performed and the hardware resource and power consumption were measured. The maximum operating frequency of the proposed method was 37.4% higher when compared with the well-known Lengwehasatit–Ortega (LO) approximation [57]. In addition, the applicability of the proposed approximation in the context of image compression and video coding was demonstrated. Our experiments demonstrate that DCT approximations can not only effectively approximate the DCT behavior but also, under particular conditions, outperform the DCT itself for image coding. The proposed approximation is fully HEVC-compliant, being capable of video coding with HEVC quality at lower computational costs.

References

[1] G. H. Dunteman, Principal Components Analysis. Sage, 1989, vol. 69.
[2] I. Jolliffe, Principal Component Analysis. Wiley Online Library, 2002.
[3] V. Britanak, P. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms. Academic Press, 2007.
[4] A. N. Gorban, B. Kégl, D. C. Wunsch, and A. Zinovyev, Principal Manifolds for Data Visualization and Dimension Reduction, 1st ed. Springer Publishing Company, Incorporated, 2007.
[5] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego, CA: Academic Press, 1990.
[6] H. Chen and B. Zeng, "New transforms tightly bounded by DCT and KLT," IEEE Signal Processing Letters, vol. 19, no. 6, pp. 344–347, 2012.
[7] S. Álvarez-Cortés, N. Amrani, M. Hernández-Cabronero, and J.
Serra-Sagristà, "Progressive lossy-to-lossless coding of hyperspectral images through regression wavelet analysis," International Journal of Remote Sensing, pp. 1–21, 2017.

Figure 11: Rate-distortion curves of the modified HEVC software for the test sequences: (a) PeopleOnStreet, (b) BasketballDrive, (c) RaceHorses, (d) BlowingBubbles, (e) KristenAndSara, and (f) BasketballDrillText.

Table 10: BD-PSNR (dB) and BD-Rate (%) of the modified HEVC reference software for the tested video sequences.

Video sequence        AI                  RA                  LD-B                LD-P
                      BD-PSNR  BD-Rate    BD-PSNR  BD-Rate    BD-PSNR  BD-Rate    BD-PSNR  BD-Rate
PeopleOnStreet        0.2999   −5.5375    0.1467   −3.4323    N/A      N/A        N/A      N/A
BasketballDrive       0.1692   −6.1033    0.1412   −6.1876    0.1272   −5.2730    0.1276   −5.2407
RaceHorses            0.4714   −5.8250    0.5521   −8.6149    0.5460   −7.9067    0.5344   −7.6868
BlowingBubbles        0.0839   −1.4715    0.0821   −2.1612    0.0806   −2.1619    0.0813   −2.2370
KristenAndSara        0.2582   −5.0441    N/A      N/A        0.1230   −4.1823    0.1118   −4.0048
BasketballDrillText   0.1036   −1.9721    0.1372   −3.2741    0.1748   −4.3383    0.1646   −4.1509

(a) MSE-Y = 10.4097, MSE-U = 3.5872, MSE-V = 3.3079, PSNR-Y = 37.9564, PSNR-U = 42.5832, PSNR-V = 42.9353.
(b) MSE-Y = 10.8159, MSE-U = 3.8290, MSE-V = 3.5766, PSNR-Y = 37.7902, PSNR-U = 42.2999, PSNR-V = 42.5961.
(c) MSE-Y = 10.1479, MSE-U = 3.4765, MSE-V = 3.1724, PSNR-Y = 38.0670, PSNR-U = 42.7194, PSNR-V = 43.1170.
(d) MSE-Y = 10.3570, MSE-U = 3.6228, MSE-V = 3.3113, PSNR-Y = 37.9785, PSNR-U = 42.5403, PSNR-V = 42.9308.
(e) MSE-Y = 14.0693, MSE-U = 4.0741, MSE-V = 4.4404, PSNR-Y = 36.6481, PSNR-U = 42.0304, PSNR-V = 41.6566.
(f) MSE-Y = 14.5953, MSE-U = 4.1377, MSE-V = 4.6053, PSNR-Y = 36.4887, PSNR-U = 41.9632, PSNR-V = 41.4982.
(g) MSE-Y = 14.6155, MSE-U = 4.1349, MSE-V = 4.5502, PSNR-Y = 36.4827, PSNR-U = 41.9661, PSNR-V = 41.5505.
(h) MSE-Y = 15.0761, MSE-U = 4.
2812, MSE-V = 4.6444, PSNR-Y = 36.3479, PSNR-U = 41.8151, PSNR-V = 41.4615.

Figure 12: Compression of the tenth frame of BasketballDrive using (a),(c),(e),(g) the default and (b),(d),(f),(h) the modified versions of the HEVC software, for QP = 32 and the AI, RA, LD-B, and LD-P coding configurations, respectively.

[8] J. Bae and H. Yoo, "Analysis of color transforms for lossless frame memory compression," International Journal of Applied Engineering Research, vol. 12, no. 24, pp. 15664–15667, 2017.
[9] D. Thomakos, "Smoothing non-stationary time series using the discrete cosine transform," Journal of Systems Science and Complexity, vol. 29, no. 2, pp. 382–404, 2016.
[10] J. Zeng, G. Cheung, Y.-H. Chao, I. Blanes, J. Serra-Sagristà, and A. Ortega, "Hyperspectral image coding using graph wavelets," in Proc. IEEE International Conference on Image Processing (ICIP), 2017.
[11] R. E. Blahut, Fast Algorithms for Signal Processing. Cambridge University Press, 2010.
[12] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan. 1974.
[13] R. J. Clarke, "Relation between the Karhunen-Loève and cosine transforms," IEE Proceedings F: Communications, Radar and Signal Processing, vol. 128, no. 6, pp. 359–360, Nov. 1981.
[14] R. J. Cintra, F. M. Bayer, and C. J. Tablada, "Low-complexity 8-point DCT approximations based on integer functions," Signal Processing, 2014.
[15] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2006.
[16] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992.
[17] A. Puri, "Video coding using the H.264/MPEG-4 AVC compression standard," Signal Processing: Image Communication, vol. 19, 2004.
[18] D. J.
Le Gall, "The MPEG video compression algorithm," Signal Processing: Image Communication, vol. 4, no. 2, pp. 129–140, 1992.
[19] International Telecommunication Union, "ITU-T recommendation H.261 version 1: Video codec for audiovisual services at p × 64 kbits," ITU-T, Tech. Rep., 1990.
[20] ——, "ITU-T recommendation H.263 version 1: Video coding for low bit rate communication," ITU-T, Tech. Rep., 1995.
[21] A. Luthra, G. J. Sullivan, and T. Wiegand, "Introduction to the special issue on the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 557–559, Jul. 2003.
[22] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, "HEVC: The new gold standard for video compression: How does HEVC compare with H.264/AVC?" IEEE Consumer Electronics Magazine, vol. 1, no. 3, pp. 36–46, Jul. 2012.
[23] C. Loeffler, A. Ligtenberg, and G. Moschytz, "Practical fast 1D DCT algorithms with 11 multiplications," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 1989, pp. 988–991.
[24] U. Sadhvi Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, "Improved 8-point approximate DCT for image and video compression requiring only 14 additions," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 6, pp. 1727–1740, 2014.
[25] V. A. Coutinho, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Madanayake, "A multiplierless pruned DCT-like transformation for image and video compression that requires ten additions only," Journal of Real-Time Image Processing, pp. 1–9, 2015.
[26] F. M. Bayer, R. J. Cintra, A. Edirisuriya, and A. Madanayake, "A digital hardware fast algorithm and FPGA-based prototype for a novel 16-point approximate DCT for image compression applications," Measurement Science and Technology, vol. 23, no. 8, p.
114010, Nov. 2012.
[27] F. M. Bayer and R. J. Cintra, "Image compression via a fast DCT approximation," IEEE Latin America Transactions, vol. 8, no. 6, pp. 708–713, Dec. 2010.
[28] W. Yuan, P. Hao, and C. Xu, "Matrix factorization for fast DCT algorithms," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, May 2006, pp. 948–951.
[29] Y. Arai, T. Agui, and M. Nakajima, "A fast DCT-SQ scheme for images," Transactions of the IEICE, vol. E-71, no. 11, pp. 1095–1097, Nov. 1988.
[30] W. H. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[31] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2174–2193, Sep. 1992.
[32] B. G. Lee, "A new algorithm for computing the discrete cosine transform," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, pp. 1243–1245, Dec. 1984.
[33] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 6, no. 10, pp. 1455–1461, Oct. 1987.
[34] M. T. Heideman and C. S. Burrus, Multiplicative Complexity, Convolution, and the DFT, ser. Signal Processing and Digital Filtering. Springer-Verlag, 1988.
[35] J. Liang and T. D. Tran, "Fast multiplierless approximation of the DCT with the lifting scheme," IEEE Transactions on Signal Processing, vol. 49, pp. 3032–3044, Dec. 2001.
[36] M. Masera, M. Martina, and G. Masera, "Odd type DCT/DST for video coding: Relationships and low-complexity implementations," 2017.
[37] Z. Wang, "Combined DCT and companding for PAPR reduction in OFDM signals," Journal of Signal and Information Processing, vol. 2, no. 2, pp.
100–104, 2011. 2 [38] F . S. Snigdha, D. Sengupta, J. Hu, and S. S. Sapatnekar , “Optimal design of JPEG hardware under the approximate computing paradigm, ” in Pr oceed- ings of the 53r d Annual Design Automation Conference . A CM, 2016, p. 106. 2 [39] T . Suzuki and M. Ikehara, “Inte ger DCT based on direct-lifting of DCT- IDCT for lossless-to-lossy image coding, ” IEEE Tr ansactions on Image Pr ocessing , vol. 19, no. 11, pp. 2958–2965, No v . 2010. 2 [40] C.-K. Fong and W .-K. Cham, “LLM integer cosine transform and its fast algorithm, ” IEEE Tr ansactions on Circuits and Systems for V ideo T echnol- ogy , vol. 22, no. 6, pp. 844–854, 2012. 2 [41] K. Choi, S. Lee, and E. S. Jang, “Zero coefficient-a ware IDCT algorithm for fast video decoding, ” IEEE T ransactions on Consumer Electr onics , vol. 56, no. 3, 2010. 2 [42] M. Masera, M. Martina, and G. Masera, “ Adaptive approximated dct archi- tectures for hevc, ” IEEE T ransactions on Cir cuits and Systems for V ideo T echnology , vol. 27, no. 12, pp. 2714–2725, 2017. 2 [43] J.-S. Park, W .-J. Nam, S.-M. Han, and S.-S. Lee, “2-D large in verse trans- form (16 × 16, 32 × 32) for HEVC (high efficiency video coding), ” JSTS: Journal of Semiconductor T echnology and Science , vol. 12, no. 2, pp. 203– 211, 2012. 2 [44] X. Xu, J. Li, X. Huang, M. Dalla Mura, and A. Plaza, “Multiple morpho- logical component analysis based decomposition for remote sensing im- age classification, ” IEEE T ransactions on Geoscience and Remote Sensing , vol. 54, no. 5, pp. 3083–3102, 2016. 2 [45] N. J. Higham, “Computing the polar decomposition—with applications, ” SIAM Journal on Scientific and Statistical Computing , vol. 7, no. 4, pp. 1160–1174, Oct. 1986. 2, 3, 4 [46] R. J. Cintra and F . M. Bayer, “ A DCT approximation for image compres- sion, ” IEEE Signal Pr ocessing Letters , vol. 18, no. 10, pp. 579–582, Oct. 2011. 2, 3, 5, 7 [47] R. J. 
Cintra, “ An integer approximation method for discrete sinusoidal transforms, ” Journal of Cir cuits, Systems, and Signal Processing , vol. 30, no. 6, pp. 1481–1501, December 2011. 2, 3, 4 [48] F . M. Bayer and R. J. Cintra, “DCT-like transform for image compression requires 14 additions only , ” Electronics Letters , v ol. 48, no. 15, pp. 919– 921, Jul. 2012. 2, 5, 6, 7 [49] N. J. Higham, “Computing real square roots of a real matrix, ” Linear Alge- bra and its Applications , v ol. 88–89, pp. 405–430, April 1987. 2 [50] G. A. F . Seber, A Matrix Handbook for Statisticians , ser . Wiley Series in Probability and Mathematical Statistics. W iley , 2008. 2, 11 [51] T . I. Haweel, “ A ne w square wave transform based on the DCT, ” Signal Pr ocessing , vol. 82, pp. 2309–2319, No vember 2001. 2, 5, 6, 7 [52] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy , “Low-complexity 8 × 8 transform for image compression, ” Electr onics Letters , vol. 44, no. 21, pp. 1249–1250, Sep. 2008. 2, 6, 7 [53] ——, “Low-complexity 8 × 8 transform for image compression, ” Electron- ics Letters , vol. 44, no. 21, pp. 1249–1250, 2008. 2, 6, 7 15 [54] ——, “ A fast 8 × 8 transform for image compression, ” in 2009 International Confer ence on Micr oelectr onics (ICM) , Dec. 2009, pp. 74–77. 2, 6, 7 [55] ——, “ A low-complexity parametric transform for image compression, ” in Pr oceedings of the 2011 IEEE International Symposium on Circuits and Systems , May 2011. 2, 6, 7 [56] ——, “Binary discrete cosine and Hartley transforms, ” IEEE T ransactions on Circuits and Systems I: Regular P apers , vol. 60, no. 4, pp. 989–1002, April 2013. 2, 6, 7 [57] K. Lengwehasatit and A. Ortega, “Scalable variable complexity approxi- mate forward DCT, ” IEEE Tr ansactions on Cir cuits and Systems for V ideo T echnology , vol. 14, no. 11, pp. 1236–1248, No v . 2004. 2, 6, 7, 8, 12 [58] R. K. Senapati, U. C. Pati, and K. K. 
Mahapatra, “ A lo w complexity or - thogonal 8 × 8 transform matrix for fast image compression, ” Pr oceeding of the Annual IEEE India Confer ence (INDICON), K olkata, India , pp. 1–4, 2010. 2 [59] W . K. Cham, “Development of integer cosine transforms by the principle of dyadic symmetry , ” in IEE Proceedings I Communications, Speech and V ision , vol. 136, no. 4, August 1989, pp. 276–282. 2 [60] T . Cormen, C. Leiserson, R. Ri vest, and C. Stein, Intr oduction T o Algo- rithms . MIT Press, 2001, ch. 16. 3, 4, 6 [61] C. J. T ablada, F . M. Bayer, and R. J. Cintra, “ A class of DCT approxima- tions based on the Feig–Winograd algorithm, ” Signal Pr ocessing , vol. 113, pp. 38–51, 2015. 3, 7, 9 [62] G. Strang, Linear Algebra and Its Applications . Brooks Cole, Feb. 1988. 3 [63] D. Salomon, G. Motta, and D. Bryant, Data Compr ession: The Complete Refer ence , ser . Molecular biology intelligence unit. Springer , 2007. 4, 7 [64] N. J. Higham and R. S. Schreiber , “F ast polar decomposition of an arbitrary matrix, ” Ithaca, NY , USA, T ech. Rep., October 1988. 4 [65] Z. W ang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures, ” IEEE Signal Processing Magazine , vol. 26, no. 1, pp. 98–117, Jan. 2009. 5, 7 [66] V . K. Goyal, “Theoretical foundations of transform coding, ” IEEE Signal Pr ocessing Magazine , v ol. 18, no. 5, pp. 9–21, September 2001. 5 [67] J. Katto and Y . Y asuda, “Performance e valuation of subband coding and optimization of its filter coefficients, ” Journal of V isual Communication and Image Repr esentation , vol. 2, no. 4, pp. 303–313, December 1991. 5 [68] K. Mardia and P . Jupp, Dir ectional Statistics , ser . W iley Series in Probabil- ity and Statistics. W iley , 2009. 5 [69] S. Jammalamadaka and A. Sengupta, T opics in Circular Statistics , ser. Se- ries on multiv ariate analysis. W orld Scientific, 2001. 5 [70] D. S. W atkins, Fundamentals of Matrix Computations , ser . 
Pure and Ap- plied Mathematics: A W iley Series of T exts, Monographs and T racts. W i- ley , 2004. 5 [71] R. P . Mahan, Circular Statistical Methods: Applications in Spatial and T emporal P erformance Analysis , ser . Special report. U.S. Army Research Institute for the Behavioral and Social Sciences, 1991. 5 [72] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T . K. T an, and T . Wiegand, “Com- parison of the coding ef ficiency of video coding standards - including High Efficienc y Video Coding (HEVC), ” IEEE T ransactions on Circuits and Sys- tems for V ideo T echnology , vol. 22, no. 12, pp. 1669–1684, Dec. 2012. 5, 6, 7, 9, 12 [73] J. Han, Y . Xu, and D. Mukherjee, “ A butterfly structured design of the hybrid transform coding scheme, ” in Picture Coding Symposium (PCS), 2013 . IEEE, 2013, pp. 17–20. 6 [74] P . Y ip and K. Rao, “The decimation-in-frequency algorithms for a family of discrete sine and cosine transforms, ” Cir cuits, Systems and Signal Pro- cessing , vol. 7, no. 1, pp. 3–19, 1988. 6 [75] I.-M. Pao and M.-T . Sun, “ Approximation of calculations for forward dis- crete cosine transform, ” IEEE T ransactions on Circuits and Systems for V ideo T echnology , vol. 8, no. 3, pp. 264–268, Jun. 1998. 7 [76] (2017) USC-SIPI Image Database. Uni versity of Southern California. [Online]. A vailable: http://sipi.usc.edu/database/ 7 [77] Z. W ang, A. C. Bovik, H. R. Sheikh, and E. P . Simoncelli, “Image quality assessment: from error visibility to structural similarity , ” IEEE T ransac- tions on Image Pr ocessing , vol. 13, no. 4, pp. 600–612, Apr . 2004. 7 [78] N. J. Higham, Functions of Matrices: Theory and Computation , ser . SIAM e-books. Society for Industrial and Applied Mathematics (SIAM, 3600 Market Street, Floor 6, Philadelphia, P A 19104), 2008. 7 [79] R. K. W . Chan and M.-C. Lee, “Multiplierless fast DCT algorithms with minimal approximation errors, ” in International Confer ence on P attern Recognition , vol. 3. 
Los Alamitos, CA, USA: IEEE Computer Society , 2006, pp. 921–925. 9 [80] V . Dimitrov , G. Jullien, and W . Miller , “ A ne w DCT algorithm based on en- coding algebraic integers, ” in Proceedings of the 1998 IEEE International Confer ence on Acoustics, Speec h and Signal Pr ocessing, 1998. , v ol. 3, May 1998, pp. 1377–1380 vol.3. 9 [81] Joint Collaborati ve T eam on V ideo Coding (JCT -VC), “HEVC reference software documentation, ” 2013, Fraunhofer Heinrich Hertz Institute. [Online]. A vailable: https://hevc.hhi.fraunhofer .de/ 9 [82] G. J. Sulli v an, J.-R. Ohm, W .-J. Han, and T . W iegand, “Overvie w of the high ef ficiency video coding (HEVC) standard, ” IEEE T rans. Cir cuits Syst. V ideo T echnol. , vol. 22, no. 12, pp. 1649–1668, Dec. 2012. 9 [83] M. Jridi, A. Alf alou, and P . K. Meher , “ A generalized algorithm and recon- figurable architecture for efficient and scalable orthogonal approximation of DCT, ” IEEE T rans. Circuits Syst. I , vol. 62, no. 2, pp. 449–457, 2015. 11, 12 [84] P . K. Meher, S. Y . Park, B. K. Mohanty , K. S. Lim, and C. Y eo, “Efficient integer DCT architectures for HEVC, ” IEEE T ransactions on Cir cuits and Systems for V ideo T echnology , vol. 24, no. 1, pp. 168–178, Jan 2014. 11, 12 [85] F . Bossen, “Common test conditions and software reference configura- tions, ” San Jose, CA, USA, Feb 2013, document JCT -VC L1100. 11, 12 [86] G. Bjøntegaard, “Calculation of a verage PSNR differences between RD- curves, ” in 13th VCEG Meeting , Austin, TX, USA, Apr 2001, document VCEG-M33. 12 [87] P . Hanhart and T . Ebrahimi, “Calculation of av erage coding efficienc y based on subjective quality scores, ” Journal of V isual Communication and Image Repr esentation , vol. 25, no. 3, pp. 555 – 564, 2014, qoE in 2D/3D V ideo Systems. 12 16