Seria ELECTRONICĂ și TELECOMUNICAȚII TRANSACTIONS on ELECTRONICS and COMMUNICATIONS

Tom 49(63), Fascicola 1, 2004

# A Linear Systolic Array Architecture for a VLSI Implementation of Type IV Discrete Cosine Transform

Doru Florin Chiper<sup>1</sup> and Tiberiu Teodorescu<sup>1</sup>

Abstract: An efficient design approach to derive a linear systolic array architecture for a prime-length type IV discrete cosine transform is presented. The approach is based on a VLSI algorithm that uses a circular correlation structure. The proposed algorithm is mapped onto a linear systolic array with a small number of I/O channels and low I/O bandwidth that can be efficiently implemented in a VLSI chip. A highly efficient VLSI chip can thus be obtained that outperforms existing designs in terms of architectural topology, computing parallelism, processing speed, hardware complexity and I/O costs.

Keywords: Type IV discrete cosine transform, systolic algorithms, systolic architectures.

## I. INTRODUCTION

Type IV DCT and DST (DCT-IV and DST-IV) were first introduced by Jain as members of the sinusoidal family of unitary transforms. Since their introduction, DCT-IV and DST-IV have found important applications in signal processing, the most recent being in the fast implementation of lapped orthogonal transforms for signal and image coding.

Since their proposal, several software implementations of the type IV DCT have been presented, but no efficient hardware solution has been implemented until now. Because the DCT-IV is computationally intensive, it is useful to develop efficient hardware implementations. To do this, new VLSI algorithms that meet the requirements of real-time applications have to be developed. The efficiency of a VLSI algorithm depends more on its communication complexity than on its computational complexity, because the data flow plays a central role in the design of a VLSI architecture. Thus, regular and modular computational structures such as cyclic convolution and circular correlation have been shown to offer good systolic array implementations of discrete transforms, leading to low I/O cost, reduced hardware complexity, high speed and a regular and modular hardware structure.

In this paper, an efficient VLSI architecture for a linear systolic array implementation of a prime-length type IV discrete cosine transform is proposed. It employs a VLSI algorithm that can be efficiently implemented on a linear systolic array using a small number of I/O channels placed at the two ends of the array. The proposed systolic algorithm decomposes the type IV DCT into a circular correlation structure that can be efficiently computed on a linear systolic array, together with a control structure based on the tag-control scheme. The decomposition uses two auxiliary input and output sequences and reorders them using the properties of the Galois field of the indexes. In this way, the advantages of systolic array implementations of the circular correlation, such as high speed, low I/O cost and reduced hardware complexity with high regularity and modularity, can be used to obtain an efficient VLSI architecture.

Using the data-dependence graph-based procedure we obtain a linear systolic array that can be efficiently controlled using the tag-control technique. The pre-processing stage converts the input sequence, using some multiplications, a recursive computation and data reordering operations, into an auxiliary sequence that can be processed by a circular correlation structure. The post-processing stage converts two auxiliary output sequences into the final output sequence using some recursive computations and multiplications together with data reordering operations. The computational complexity of the operations implemented in the pre- and post-processing stages is $O(N)$, as opposed to that of the hardware kernel that implements the circular correlation operation, which is $O(N^2)$. The tag-control scheme is used to control the loading and draining of the data sequences into/from the internal registers of the systolic array and to select the operations and the sign of the operands in each processing element. Thus, the input and output data can be loaded and drained using only I/O channels placed at the two ends of the linear array. The "sign" control tags are used to select the correct operands and the right sign in the operations executed by the processing elements that form the hardware kernel.

<sup>1</sup> Faculty of Electronics and Telecommunications, Technical University "Gh. Asachi" Iasi, Bd. Carol I, 6600, Iasi

Thus, using an appropriate decomposition and choosing a linear systolic array as the VLSI architecture paradigm, we can obtain high computing speed with low I/O cost and hardware complexity for a prime-length type IV DCT, together with all the other advantages of linear systolic array implementations of circular correlation structures, such as regularity, modularity and local connectivity, with I/O channels placed only at the two ends of the array. Using the tag-control scheme we can appropriately select the operations in each processing element and control the loading and draining of the data into/from the internal registers.

## II. A VLSI ALGORITHM FOR TYPE IV DCT

For the real input sequence $\{x(i): i = 0, 1, ..., N-1\}$, the 1-D type IV DCT (DCT-IV) is defined as:

$$X(k) = \sqrt{\frac{2}{N}} \cdot \sum_{i=0}^{N-1} x(i) \cdot \cos[(2i+1)(2k+1)\alpha], \quad k = 0, 1, ..., N-1 \tag{1}$$

with

$$\alpha = \frac{\pi}{4N} \tag{2}$$

In the following, to simplify our presentation, we will drop the constant coefficient  $\sqrt{\frac{2}{N}}$  from the definition of the DCT-IV. We will add at the end of the VLSI array a multiplier to scale the output sequence with this constant.
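As a point of reference for what follows, relations (1) and (2) can be evaluated directly. The sketch below is only an illustrative software model, not part of the proposed architecture; the function name `dct4_direct` and the `scaled` flag are our own choices, the latter switching the constant $\sqrt{2/N}$ on or off.

```python
import math

def dct4_direct(x, scaled=True):
    """Evaluate relation (1) term by term; O(N^2) operations."""
    n = len(x)
    alpha = math.pi / (4 * n)                      # relation (2)
    scale = math.sqrt(2.0 / n) if scaled else 1.0  # constant dropped in the text
    return [
        scale * sum(x[i] * math.cos((2 * i + 1) * (2 * k + 1) * alpha)
                    for i in range(n))
        for k in range(n)
    ]

if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, 4.0]
    y = dct4_direct(x)
    # With the sqrt(2/N) factor the DCT-IV matrix is symmetric and orthogonal,
    # so applying the transform twice should give back the input.
    print(max(abs(a - b) for a, b in zip(dct4_direct(y), x)))
```

The unscaled variant (`scaled=False`) corresponds to the form used in the rest of this section.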

In order to reformulate relation (1) as a circular correlation, we introduce some auxiliary sequences and use the properties of the Galois field of indexes to appropriately permute the input and output sequences.

The output sequence $\{X(k): k = 0, 1, ..., N-1\}$ can be computed as follows:

$$X(0) = \sum_{i=0}^{N-1} x_{C}(i) \tag{3}$$

$$X(k) = 2T_{C}(k) - X(k-1), \quad k = 1, ..., N-1 \tag{4}$$

with

$$T_{C}(k) = [x_a(0) + 2T(k)] \cdot \cos[2k\alpha] \tag{5}$$

where the auxiliary input sequence $\{x_{a}(i): i = 0, ..., N-1\}$ is defined as follows:

$$x_{a}(N-1) = x_{C}(N-1) \tag{6}$$

$$x_a(i) = x_C(i) - x_a(i+1), \quad i = N-2, ..., 0 \tag{7}$$

with

$$x_{C}(i) = x(i) \cdot \cos[(2i+1)\alpha] \tag{8}$$
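Relations (3)-(8) can be checked numerically with a few lines of code. The sketch below is a software model only, under one assumption of ours: $T(k)$ is evaluated as the plain sum $\sum_{i=1}^{N-1} x_a(i)\cos(4ik\alpha)$, i.e. the unpermuted form of the sum that relation (9) rewrites as a circular correlation. Relation (4) itself follows from $\cos A + \cos B = 2\cos\frac{A+B}{2}\cos\frac{A-B}{2}$. The function names are hypothetical.

```python
import math

def dct4_unscaled(x):
    """Relation (1) without the sqrt(2/N) factor, used as the reference."""
    n = len(x)
    alpha = math.pi / (4 * n)
    return [sum(x[i] * math.cos((2 * i + 1) * (2 * k + 1) * alpha)
                for i in range(n)) for k in range(n)]

def dct4_via_recurrence(x):
    """Unscaled DCT-IV through relations (3)-(8)."""
    n = len(x)
    alpha = math.pi / (4 * n)
    # Relation (8): scale the input samples.
    x_c = [x[i] * math.cos((2 * i + 1) * alpha) for i in range(n)]
    # Relations (6)-(7): backward recursion for the auxiliary input.
    x_a = x_c[:]
    for i in range(n - 2, -1, -1):
        x_a[i] = x_c[i] - x_a[i + 1]

    def t(k):
        # Assumed unpermuted form of the auxiliary output T(k).
        return sum(x_a[i] * math.cos(4 * i * k * alpha) for i in range(1, n))

    out = [sum(x_c)]                                         # relation (3)
    for k in range(1, n):
        t_c = (x_a[0] + 2 * t(k)) * math.cos(2 * k * alpha)  # relation (5)
        out.append(2 * t_c - out[k - 1])                     # relation (4)
    return out

if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0, 4.0]
    err = max(abs(a - b)
              for a, b in zip(dct4_via_recurrence(x), dct4_unscaled(x)))
    print(err)  # expected to be at rounding-error level
```

This part of the computation does not depend on N being prime; primality only matters for the index permutation introduced next.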

The new auxiliary output sequence $\{T(k): k = 1, 2, ..., N-1\}$ can be computed as a circular correlation, if the transform length N is a prime number, as follows:

$$T(\langle g^{k} \rangle_{N}) = \sum_{i=1}^{(N-1)/2} \left[ (-1)^{\psi(k,i)} \cdot x_{a}(\langle g^{i} \rangle_{N}) - (-1)^{\psi(k,i+(N-1)/2)} \cdot x_{a}(\langle g^{i+(N-1)/2} \rangle_{N}) \right] \cdot \cos(\langle g^{i+k} \rangle_{N} \times 4\alpha) \tag{9}$$

where $\langle x \rangle_N$ denotes the result of x modulo N, g is a primitive root of N, and

$$\psi(k,i) = \left\lfloor \frac{\langle g^k \rangle_N \times \langle g^i \rangle_N - \langle g^{i+k} \rangle_N}{N} \right\rfloor \tag{10}$$

where $\lfloor x \rfloor$ denotes the floor function, i.e. the greatest integer less than or equal to x.
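A short justification of the sign exponent in relation (10) may help here. The following is our own sketch rather than a derivation taken from the text; it assumes that, before the index permutation, the auxiliary output is the plain sum $T(k) = \sum_{i=1}^{N-1} x_a(i)\cos(4ik\alpha)$, which is what relations (3)-(8) imply, and that g is a primitive root of the odd prime N.

$$\langle g^k \rangle_N \cdot \langle g^i \rangle_N = \psi(k,i)\,N + \langle g^{i+k} \rangle_N, \qquad 4\alpha = \frac{\pi}{N}$$

$$\cos\left(\langle g^k \rangle_N \langle g^i \rangle_N \cdot 4\alpha\right) = \cos\left(\psi(k,i)\,\pi + \langle g^{i+k} \rangle_N \cdot 4\alpha\right) = (-1)^{\psi(k,i)} \cos\left(\langle g^{i+k} \rangle_N \cdot 4\alpha\right)$$

$$g^{(N-1)/2} \equiv -1 \pmod N \;\Rightarrow\; \cos\left(\langle g^{i+k+(N-1)/2} \rangle_N \cdot 4\alpha\right) = \cos\left(\pi - \langle g^{i+k} \rangle_N \cdot 4\alpha\right) = -\cos\left(\langle g^{i+k} \rangle_N \cdot 4\alpha\right)$$

The first identity accounts for the factors $(-1)^{\psi}$, and the second explains why the sum runs over only (N-1)/2 terms with a minus sign between the two halves.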

We have used the properties of the Galois field of indexes to convert the computation of the auxiliary output sequence $\{T(k): k = 1, 2, ..., N-1\}$ into a circular correlation.
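The circular correlation form can be compared numerically with the unpermuted sum. The sketch below is a minimal software check, not a model of the systolic array; it assumes that N is an odd prime, that g is a primitive root of N, and that the unpermuted auxiliary output is $T(k) = \sum_{i=1}^{N-1} x_a(i)\cos(4ik\alpha)$. The function names are ours.

```python
import math

def psi(k, i, g, n):
    """Relation (10): floor((<g^k> * <g^i> - <g^(i+k)>) / N)."""
    return (pow(g, k, n) * pow(g, i, n) - pow(g, i + k, n)) // n

def t_correlation(x_a, g):
    """Relation (9): T indexed in natural order, entry 0 unused."""
    n = len(x_a)
    alpha = math.pi / (4 * n)
    h = (n - 1) // 2
    t = [0.0] * n
    for k in range(1, n):
        acc = 0.0
        for i in range(1, h + 1):
            first = (-1) ** psi(k, i, g, n) * x_a[pow(g, i, n)]
            second = (-1) ** psi(k, i + h, g, n) * x_a[pow(g, i + h, n)]
            acc += (first - second) * math.cos(pow(g, i + k, n) * 4 * alpha)
        t[pow(g, k, n)] = acc          # result belongs to index <g^k>_N
    return t

def t_direct(x_a):
    """Assumed unpermuted form of the same sum, for comparison."""
    n = len(x_a)
    alpha = math.pi / (4 * n)
    return [sum(x_a[i] * math.cos(4 * i * k * alpha) for i in range(1, n))
            for k in range(n)]

if __name__ == "__main__":
    x_a = [0.3, -1.2, 2.0, 0.7, -0.4, 1.5, 0.9, -2.1, 0.2, 1.1, -0.6]
    a, b = t_correlation(x_a, g=2), t_direct(x_a)
    print(max(abs(a[k] - b[k]) for k in range(1, len(x_a))))
```

Python's three-argument `pow(g, e, n)` computes $\langle g^e \rangle_N$ without forming large powers.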

## III. AN EXAMPLE

To illustrate our approach, we will consider an example for 1-D type IV DCT with the length N=11 and the primitive root g=2.

We can write (9) in matrix-vector product form as:

$$\begin{bmatrix} T(2) \\ T(4) \\ T(8) \\ T(5) \\ T(10) \\ T(9) \\ T(7) \\ T(3) \\ T(6) \\ T(1) \end{bmatrix} = \begin{bmatrix} c(4) & c(8) & c(5) & c(10) & c(9) \\ c(8) & c(5) & c(10) & c(9) & c(4) \\ c(5) & c(10) & c(9) & c(4) & c(8) \\ c(10) & c(9) & c(4) & c(8) & c(5) \\ c(9) & c(4) & c(8) & c(5) & c(10) \\ c(4) & c(8) & c(5) & c(10) & c(9) \\ c(8) & c(5) & c(10) & c(9) & c(4) \\ c(5) & c(10) & c(9) & c(4) & c(8) \\ c(10) & c(9) & c(4) & c(8) & c(5) \\ c(9) & c(4) & c(8) & c(5) & c(10) \end{bmatrix} \cdot \begin{bmatrix} \pm[x_a(2) \pm x_a(9)] \\ \pm[x_a(4) \pm x_a(7)] \\ \pm[x_a(8) \pm x_a(3)] \\ \pm[x_a(5) \pm x_a(6)] \\ \pm[x_a(10) \pm x_a(1)] \end{bmatrix} \tag{11}$$

where $c(k)$ denotes $\cos(4k\alpha)$, in agreement with the cosine factor in relation (9), and the sign of the items in relation (11) is given by the following matrix:

$$\mathrm{SIGN} = \begin{bmatrix} 00 & 00 & 10 & 00 & 10 \\ 00 & 10 & 00 & 10 & 00 \\ 10 & 00 & 10 & 00 & 00 \\ 01 & 11 & 01 & 11 & 11 \\ 10 & 00 & 00 & 10 & 00 \\ 01 & 01 & 11 & 11 & 11 \\ 01 & 11 & 01 & 01 & 01 \\ 11 & 01 & 11 & 11 & 01 \\ 00 & 10 & 00 & 00 & 10 \\ 11 & 01 & 01 & 01 & 01 \end{bmatrix} \tag{12}$$

where:

• The first bit designates the sign before the brackets

• The second bit denotes the sign inside the brackets

and a "1" bit indicates the minus sign (first bit) or the subtraction operation (second bit).
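The index order and the sign codes of this example can be regenerated from relation (10). The sketch below is our own reading of how the codes arise, not the authors' procedure: it assumes that the coefficient matrix only uses $c(\langle g^m \rangle_N)$ with exponents m = 2, ..., (N-1)/2+1, and that any other exponent is shifted by (N-1)/2, which flips the sign of the cosine and is absorbed into the first bit. The function names are hypothetical.

```python
def index_order(n, g):
    """Output order <g^k>_N for k = 1..N-1 (row labels of relation (11))."""
    return [pow(g, k, n) for k in range(1, n)]

def sign_bits(n, g):
    """Two-bit codes: first bit = sign before the brackets,
    second bit = 1 when the bracket holds a subtraction."""
    h = (n - 1) // 2
    rows = []
    for k in range(1, n):
        gk = pow(g, k, n)
        row = []
        for i in range(1, h + 1):
            psi1 = (gk * pow(g, i, n)) // n       # relation (10) for (k, i)
            psi2 = (gk * pow(g, i + h, n)) // n   # relation (10) for (k, i + h)
            # Assumed folding: exponents outside 2..h+1 are shifted by h,
            # which changes the sign of the cosine coefficient.
            e = (i + k) % (n - 1)
            flip = 0 if 2 <= e <= h + 1 else 1
            outer = (psi1 + flip) % 2
            inner = 1 - (psi1 + psi2) % 2
            row.append(f"{outer}{inner}")
        rows.append(row)
    return rows

if __name__ == "__main__":
    print(index_order(11, 2))   # [2, 4, 8, 5, 10, 9, 7, 3, 6, 1]
    for row in sign_bits(11, 2):
        print(" ".join(row))
```

For N = 11 and g = 2 the first printed row is `00 00 10 00 10`, and the remaining rows can be compared entry by entry with the SIGN matrix.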

## IV. THE VLSI ARCHITECTURE FOR 1-D TYPE IV DCT

Using the method presented in [10] and the recursive form of equations (3)-(5) and (9), we can obtain the data dependence graph of the proposed algorithm. The data dependence graph clearly shows the data dependencies, data operations and control signals involved in the proposed algorithm. Using the proposed VLSI algorithm and the data dependence graph-based procedure presented in [10], we obtain the systolic array shown in Fig. 1. The function of the PEs in the VLSI array of Fig. 1 is presented in Fig. 2. The PEs of the circular correlation module, which represent the hardware core of the VLSI architecture, execute the operations in relation (9).

In order to keep all I/O channels at the boundary PEs, the tag control scheme presented in [9] has been used. The control signal tc is used to select the correct operand in the operations executed by the PEs. The control signal "sign" is used to select the right operand and the right sign in the operations of the PEs, as shown in Fig. 1 and Fig. 2.

A pre-processing stage has been introduced to obtain the appropriate form of the auxiliary input sequences. It implements equations (6)-(8). The pre-processing stage consists of a multiplication module that implements equation (8) and a subtraction module that implements equation (7), followed by a permutation module and an addition/subtraction module. Thus, the intermediate input sequence is permuted and used to generate the required combinations of data operands.

In order to obtain the output sequence in natural order, a post-processing stage has been included. It implements equations (3)-(5) and the permutation of the auxiliary output sequence into the right order. The post-processing stage contains a permutation block consisting of a multiplexer and some latches that permute the auxiliary output sequence in a fully pipelined mode. The auxiliary output results, which arrive in permuted order, are shifted serially into the shift registers and then loaded in parallel into the latches, so that the next block of data can be shifted into the permutation module without any time delay.

The average computation time of the proposed VLSI array for an N-point type IV DCT is (N-1)Tcycles and the number of multipliers is only (N-1)/2+1. Hence, high processing speed with low hardware complexity can be obtained.

## V. CONCLUSION

In this paper, an efficient design approach for obtaining a VLSI systolic array implementation of a prime-length type IV DCT is presented. It uses a VLSI algorithm for the type IV DCT computation that leads to an efficient VLSI implementation. The proposed algorithm is based on a circular correlation structure that can be efficiently implemented using the systolic array architecture paradigm with low I/O costs, a high degree of parallelism and a good architectural topology with a high degree of regularity and modularity. Thus, a new systolic array with high computing speed and parallelism and low computational and I/O costs can be obtained.

## REFERENCES

[1] A.K. Jain, "A sinusoidal family of unitary transforms," *IEEE Trans. Pattern Anal. Machine Intell.*, vol. 1, pp. 356-365, Oct. 1979.

[2] H.S. Malvar, "Lapped transforms for efficient transform/subband coding," *IEEE Trans. Acoust., Speech, Signal Processing*, vol. 38, June 1990.

[3] H.S. Malvar, *Signal Processing with Lapped Transforms*. Norwood, MA: Artech House, 1991.

[4] C.M. Rader, "Discrete Fourier transforms when the number of data samples is prime," *Proc. IEEE*, vol. 56, pp. 1107-1108, 1968.

[5] J. Guo, C.M. Liu, C.W. Jen, "A New Array Architecture for Prime-Length Discrete Cosine Transform," *IEEE Trans. on Signal Processing*, vol. 41, no. 1, pp. 436-442, Jan. 1993.

[6] D.F. Chiper, "Novel Systolic Array Design for Discrete Cosine Transform with High Throughput Rate," *Proc. IEEE Symp. on Circuits and Systems*, Atlanta, Georgia, 1996, pp. 746-749.

[7] Y.H. Chan and W.C. Siu, "On the Realization of Discrete Cosine Transform using the Distributed Arithmetic," *IEEE Trans. on Circuits and Systems-Part I*, vol. 39, no. 9, pp. 705-712, Sept. 1992.

[8] H.T. Kung, "Why Systolic Architectures," *Computer*, vol. 15, Jan. 1982.

[9] S.A. White, "Applications of the Distributed Arithmetic to Digital Signal Processing: A tutorial," *IEEE ASSP Mag.*, vol. 6, July 1989.

[10] S.Y. Kung, *VLSI Array Processors*. Englewood Cliffs, New Jersey: Prentice Hall, 1988.

[11] C.W. Jen and H.Y. Hsu, "The design of a systolic array with tags input," *Proc. IEEE Int. Conf. on Circuits and Systems*, pp. 2263-2266, 1988.




