EFFICIENT FPGA IMPLEMENTATION OF CORDIC ALGORITHM FOR CIRCULAR AND LINEAR COORDINATES

F. Angarita, A. Perez-Pascual, T. Sansaloni, J. Valls

Departamento de Ingeniería Electrónica, Universidad Politécnica de Valencia
46730 Grao de Gandía, Valencia, Spain
e-mail: [email protected], {asperez, tmsansal, jvalls}@eln.upv.es

ABSTRACT

This paper proposes an efficient FPGA implementation of a common CORDIC architecture for circular and linear coordinates. The proposed circuit is derived from the single-coordinate CORDIC architectures, and its mapping onto the Xilinx slices is fully detailed. Relationally placed macros in VHDL have been designed to demonstrate the quality of the proposed architecture. All the circuits have been implemented in Virtex-E and Virtex-II devices, and the results show that the area of the common architecture is hardly larger than the area of a single-coordinate or single-mode CORDIC architecture. It is also shown that if the common architecture is modeled in RTL style, its implementation requires twice the area and the maximum throughput drops by more than half.

1. INTRODUCTION

CORDIC (COordinate Rotation DIgital Computer) [1][2] is a well-known algorithm that, thanks to its versatility and a very simple hardware implementation requiring only add and shift operations, is widely used in VLSI digital signal processing systems [3]. Broadband communication systems are one of the digital signal processing applications whose computational load can only be supported by an FPGA or ASIC. In this paper we present an efficient implementation of the CORDIC algorithm on Xilinx FPGAs suitable for 802.11a/g and HiperLAN/2 WLAN [4][5]. These communication systems, based on OFDM (Orthogonal Frequency Division Multiplexing) [6][7], require the receiver to perform three different tasks that can be computed by the CORDIC algorithm: first, an angle is calculated (atan(Y/X)) to estimate the frequency offset; second, the received preamble and symbols are rotated ((X+jY)e^{jα}) by the estimated frequency offset; third, part of the received preamble is inverted (1/x) to estimate the channel. These operations can be performed by a CORDIC configured in circular coordinates in vectoring mode, circular coordinates in rotation mode, and linear coordinates in vectoring mode, respectively.

In this paper we propose an efficient mapping of the CORDIC algorithm on FPGA that computes the three configuration modes with hardly any extra hardware cost with respect to a single-mode, single-coordinate CORDIC. Most CORDIC implementations are built for a specific pair of mode and coordinate system [8][9][10]. Nevertheless, it is possible to create an optimized architecture that executes different pairs of mode and coordinate system and achieves almost the same area and speed as an optimized specific CORDIC implementation. Because the processing in the receiver is sequential, this generic CORDIC can be reused in different stages of the receiver while respecting the signal processing rates. This leads to a significant overall area saving at the receiver without reducing the signal processing data rate.

This work is organized as follows: Section 2 introduces the CORDIC algorithm; Section 3 presents the conventional parallel hardware implementations; Section 4 details the proposed common architecture; Section 5 summarizes the main results; and finally, Section 6 presents the conclusions.

0-7803-9362-7/05/$20.00 ©2005 IEEE

2. CORDIC ALGORITHM

CORDIC is an iterative algorithm for calculating the rotation of a two-dimensional vector in linear, circular and hyperbolic coordinate systems using only add and shift operations. It has two operating modes: the rotation mode (RM), where a vector (X0, Y0) is rotated by an angle θ to obtain a new vector (XN’, YN’), and the vectoring mode (VM), where the algorithm computes the length R and the angle α with respect to the x-axis of a vector (X0, Y0). The algorithm, executed as a finite number of micro-rotations indexed by i = 0:N-1, was originally described only for circular coordinates [1] and was later extended to a generalized form [2] described by the set of equations (1):

$$
\begin{aligned}
X_{i+1} &= X_i - m\, d_i\, 2^{-i}\, Y_i \\
Y_{i+1} &= Y_i + d_i\, 2^{-i}\, X_i \\
Z_{i+1} &= Z_i - d_i\, \alpha_i
\end{aligned}
\qquad
\alpha_i = \begin{cases} \operatorname{atan}\left(2^{-i}\right), & \text{circular} \\ 2^{-i}, & \text{linear} \end{cases}
\qquad
d_i = \begin{cases} \operatorname{sign}(Z_i), & \text{for RM} \\ -\operatorname{sign}(Y_i), & \text{for VM} \end{cases}
\tag{1}
$$

Table 1. Generalized CORDIC algorithm.

| m | RM ($Z_N = 0$) | VM ($Y_N = 0$) |
|---|---|---|
| 1 | $X_N = K (X_0 \cos Z_0 - Y_0 \sin Z_0)$, $Y_N = K (Y_0 \cos Z_0 + X_0 \sin Z_0)$ | $X_N = K \sqrt{X_0^2 + Y_0^2}$, $Z_N = Z_0 + \operatorname{atan}(Y_0 / X_0)$ |
| 0 | $X_N = X_0$, $Y_N = Y_0 + X_0 Z_0$ | $X_N = X_0$, $Z_N = Z_0 + Y_0 / X_0$ |

By selecting appropriate values for the parameters m and αi, a different coordinate system can be selected. In this work the CORDIC algorithm for hyperbolic coordinates (m = -1) is not implemented because it is not needed in our application. To make the algorithm easier to interpret, Table 1 lists, for each combination of the parameters Mode and m, the resulting expressions for XN, YN and ZN.

It is well known that the convergence range of the CORDIC algorithm is given by the sum of all αi, i = 0:N-1. For linear coordinates the maximum value of this sum is approximately ±2. For circular coordinates the convergence range is limited to ±π/2. Nevertheless, an extension of the algorithm widens this range to ±π. The extension requires an extra pre-operation, described by the set of equations (2):

$$
\begin{aligned}
X_0 &= -d_{-1}\, Y_{-1} \\
Y_0 &= d_{-1}\, X_{-1} \\
Z_0 &= Z_{-1} - d_{-1}\, \alpha_{-1}, \qquad \alpha_{-1} = \frac{\pi}{2}
\end{aligned}
\tag{2}
$$
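As an illustration of equations (1) and (2), the following Python sketch iterates the generalized algorithm in floating point. It is only a behavioral model and not the hardware architecture discussed in this paper. The function names (`sign`, `control`, `pre_rotate`, `cordic`, `gain`), the convention that sign(0) = +1, and the choice of generating d_{-1} with the same rule as d_i are assumptions made for the example.

```python
import math

def sign(v):
    # Assumption: zero counts as non-negative, like a two's-complement sign bit.
    return 1 if v >= 0 else -1

def control(y, z, mode):
    # d_i of equation (1): sign(Z) in rotation mode, -sign(Y) in vectoring mode.
    return sign(z) if mode == "RM" else -sign(y)

def pre_rotate(x, y, z, mode):
    # Equation (2): rotate by +/- pi/2 before the first micro-rotation.
    # Assumption: d_{-1} follows the same rule as d_i.
    d = control(y, z, mode)
    return -d * y, d * x, z - d * math.pi / 2

def cordic(x, y, z, m, mode, n=16):
    # m = 1 selects circular coordinates, m = 0 selects linear coordinates.
    for i in range(n):
        alpha = math.atan(2.0 ** -i) if m == 1 else 2.0 ** -i
        d = control(y, z, mode)
        x, y, z = x - m * d * 2.0 ** -i * y, y + d * 2.0 ** -i * x, z - d * alpha
    return x, y, z

def gain(n=16):
    # Scale factor K accumulated by n circular micro-rotations.
    return math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n))
```

For instance, `cordic(1.0, 1.0, 0.0, m=1, mode="VM")` should leave a value close to atan(1) in Z and K·√2 in X, while `cordic(1.0, 0.5, 0.0, m=0, mode="VM")` should leave a value close to the quotient 0.5 in Z, matching the VM column of Table 1.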

3. CONVENTIONAL FPGA IMPLEMENTATION

In the present work the CORDIC processor is intended to perform a significant part of the signal processing required by an OFDM-based broadband wireless digital receiver. The CORDIC implementation must therefore have an architecture suited to high data rates and low power, so a parallel architecture is chosen. Traditionally a CORDIC is implemented for a specific combination of the pair Mode and m (see Table 1). Each of the four possible pairs of Mode and m implemented in this work has specific optimizations, which are discussed below. The aim of this section is to show that both architectures (circular and linear) are strongly similar, and that this similarity can be used to derive a common architecture implementing a dual-mode/dual-coordinates CORDIC in a Xilinx FPGA without a significant hardware increase in comparison with the single-mode/single-coordinate CORDIC.

Fig. 1. Circular coordinates common parallel architecture.

Fig. 2. Linear coordinates common parallel architecture.

3.1. Circular Coordinates (m = 1)

Fig. 1 represents the parallel architecture of the CORDIC algorithm for circular coordinates. This architecture is common to VM and RM. The only difference is the way the control signal di is generated in equation (1). The architecture can be split into rows, representing the data-paths, and columns, representing the N-1 iterations. Depending on the operation mode, Zi (for RM) or Yi (for VM) approaches zero after each iteration. Therefore, the number of bits of Zi or Yi for the next iteration i can be reduced by i-1. On the other hand, before the computations start, the sign of the inputs (X0, Y0 for RM and X0, Z0 for VM) must be extended in order to avoid overflows in the addition and subtraction operations.

The control signal di is used in each data-path (X, Y and Z) to calculate the iterations. For RM, di is the MSB (Most Significant Bit) of Zi, and for VM, di is obtained by complementing the MSB of Yi. The NOT operation does not need extra hardware, because it can be realized inside each add-sub operator. Deriving the control signal directly from Zi or Yi causes a significant speed reduction because of the high fanout, so the control signals are replicated three times, once for each data-path, which adds three slices per iteration [8]. With LXY denoting the word length of X0 and Y0, and LZ the word length of Z0, equation (3) gives the total area of the hardware implementation in Xilinx slices.

$$
N_{slices} = \tfrac{1}{2} \left( M \left( 2 L_{XY} + L_Z + 5 \right) - \left( 2M - 1 \right) \right), \quad \text{for } M = N - 1
\tag{3}
$$

Equation (3) takes into account both the area optimization obtained by reducing the number of bits of Yi or Zi per iteration and the replication of the control hardware.
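The control-signal generation described above (MSB of Zi for RM, complemented MSB of Yi for VM) can be sketched at bit level as follows. This is a minimal Python illustration, not the authors' VHDL. The helper names are hypothetical, and the default word lengths of 20 and 17 bits are simply the data-path widths quoted in Section 5, used here without any sign-extension bits.

```python
def msb(value: int, width: int) -> int:
    # Sign bit of a `width`-bit two's-complement word.
    # Python integers sign-extend on right shift, so negative values work too.
    return (value >> (width - 1)) & 1

def control_bit(y: int, z: int, mode: str, lxy: int = 20, lz: int = 17) -> int:
    # RM: MSB of Z_i. VM: complemented MSB of Y_i.
    if mode == "RM":
        return msb(z, lz)
    return msb(y, lxy) ^ 1

def d_from_bit(bit: int) -> int:
    # Map the control bit back to d_i of equation (1): 0 -> +1, 1 -> -1.
    return 1 - 2 * bit
```

With this convention a control bit of 0 corresponds to d_i = +1 in both modes, so one bit per iteration is enough to drive the add-sub operators of the three data-paths.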


Fig. 3. (a) add-sub, (b) add-sub/buffer, (c) 2’sComplement/buffer and (d) muxKs/add-sub hardware implementation in a Xilinx FPGA.

Because the architecture is common to RM and VM, a dual-mode CORDIC (RM and VM) for circular coordinates (see the m = 1 row of Table 1) can be implemented with the same number of slices already used by the control replication, simply by adding a multiplexer and a NOT gate as control hardware that selects between sign(Zi) and -sign(Yi) under the control of a Mode signal. When the control signal is replicated, an additional register per replica is needed, so the LUTs associated with those registers can be used to implement the additional operations. Nevertheless, this extension does not allow the area optimization mentioned previously, which relied on reducing the number of bits of data-path Y or Z. Thus, equation (4) gives the area required by the dual-mode circular CORDIC.

$$
N_{slices} = \tfrac{1}{2} \left( M \left( 2 L_{XY} + L_Z + 7 \right) \right), \quad \text{for } M = N - 1
\tag{4}
$$

3.2. Linear Coordinates (m = 0)

The architecture for linear coordinates is quite similar to that for circular coordinates. The main differences are the signal αi and the data-path X (see equation (1)). Fig. 2 shows the parallel hardware architecture of the CORDIC algorithm for linear coordinates. In this architecture the control hardware does not need to be replicated for data-path X, which translates into a saving of half a slice per iteration. Furthermore, data-path X performs no operation other than registering the data in each iteration, as required by the pipelined architecture, i.e. Xi = X0. Since the control signal is the same as in the architecture for circular coordinates, and the same optimizations can be applied in RM and VM to data-paths Y and Z, equations (5) and (6), which give the total area in slices for linear coordinates and for dual-mode linear coordinates, are derived from (3) and (4) respectively.

$$
N_{slices} = \tfrac{1}{2} \left( M \left( 2 L_{XY} + L_Z + 3 \right) - \left( 2M - 1 \right) \right), \quad \text{for } M = N - 1
\tag{5}
$$

$$
N_{slices} = \tfrac{1}{2} \left( M \left( 2 L_{XY} + L_Z + 5 \right) \right), \quad \text{for } M = N - 1
\tag{6}
$$

4. COMMON CORDIC IMPLEMENTATION

Based on the dual-mode/single-coordinate CORDICs introduced previously, this section presents a common CORDIC for dual-mode/dual-coordinates operation. For clarity, each CORDIC data-path is explained separately, beginning with data-path X, which is the most complex.

The difficulty in developing a common architecture for data-path X is that the linear and the circular dual-mode architectures have no elements in common. A new macro that uses the same hardware as the previous architectures must therefore be designed. This new macro is called add-sub/buffer. To build it, the LUT logic of the add-sub (Fig. 3a) has been rearranged as shown in Fig. 3b. In circular coordinates a sign extension is necessary for X0, so in the common implementation it affects both coordinate systems. Another issue arises when moving from the single-coordinate to the common architecture with the angle extension pre-operation. This operation (a two’s complement) is only needed for circular coordinates. Even so, an extra register stage (buffer) is needed before the first iteration to keep the structure consistent. The element that performs these operations is the 2’sComplement/buffer. Once again the LUT logic of the add-sub has to be rearranged to obtain this new macro, but in this case the MULT_AND element of the slice must also be used, as shown in Fig. 3c.

As can be seen from Fig. 1 and Fig. 2, the structure of data-path Y is the same in both architectures, and it is reproduced in the common CORDIC implementation. Nevertheless, as for data-path X, the angle extension pre-operation must affect only the circular coordinates, so the 2’sComplement/buffer component is used again.

The operations performed in data-path Z involve the αi, which are treated as constants. From equation (1), the values of αi differ depending on the coordinate system. Therefore, the common implementation first needs to multiplex the two possible values and then perform either an addition or a subtraction. Because the αi are constants and can be pre-calculated, these operations can be implemented with the same resources used by the single-coordinate architectures for data-path Z. To do this, a new macro called muxKs/add-sub has been designed. In this macro the constant values are folded into the LUT logic, as shown in Fig. 3d.

Fig. 4 presents the resulting dual-mode/dual-coordinates CORDIC architecture proposed in this work. In this figure the new macros are represented by their equivalent blocks for simplicity. In this way the block diagram represents exactly the hardware implementation. The resulting area equation for this implementation is:

$$
N_{slices} = \tfrac{1}{2} \left( M \left( 2 L_{XY} + L_Z + 6 \right) \right), \quad \text{for } M = N - 1
\tag{7}
$$

Fig. 4. Dual-mode/dual-coordinates architecture.

Table 2. Area and max. frequency for different CORDIC implementations in Virtex-E and Virtex-II devices.

| Architecture | xc2v1000-6 | xcv400e-8 |
|---|---|---|
| dual-mode / dual-coordinates | 213 MHz / 553 slices | 164 MHz / 553 slices |
| dual-mode / circular | 213 MHz / 544 slices | 164 MHz / 544 slices |
| rotation mode / circular | 218 MHz / 511 slices | 172 MHz / 511 slices |
| vectoring mode / circular | 218 MHz / 511 slices | 172 MHz / 511 slices |
| dual-mode / linear | 214 MHz / 496 slices | 164 MHz / 496 slices |
| rotation mode / linear | 222 MHz / 480 slices | 180 MHz / 480 slices |
| vectoring mode / linear | 222 MHz / 480 slices | 180 MHz / 480 slices |
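The behavior of the muxKs/add-sub macro of data-path Z (select the pre-calculated αi for the active coordinate system, then add or subtract it) can be modeled as in the Python sketch below. It only mirrors the arithmetic, not the LUT-level mapping of Fig. 3d. The fixed-point scaling `FRAC_BITS`, the table `KS` and the function names are hypothetical choices made for the example and are not taken from the paper.

```python
import math

FRAC_BITS = 15  # hypothetical fixed-point scaling: 1.0 is represented by 1 << 15

def constants(n: int = 16):
    # Pre-calculated alpha_i of equation (1) for both coordinate systems.
    scale = 1 << FRAC_BITS
    circular = [round(math.atan(2.0 ** -i) * scale) for i in range(n)]
    linear = [round(2.0 ** -i * scale) for i in range(n)]
    return {"circular": circular, "linear": linear}

KS = constants()

def mux_ks_add_sub(z: int, i: int, coord: str, bit: int) -> int:
    # Multiplex the constant for the selected coordinate system, then apply
    # Z_{i+1} = Z_i - d_i * alpha_i, where control bit 0 stands for d_i = +1.
    alpha = KS[coord][i]
    return z - alpha if bit == 0 else z + alpha
```

Because both constant tables are fixed at design time, the selection between them depends only on the coordinate-select input, which is what allows a hardware implementation to absorb the multiplexer into the same logic that performs the addition or subtraction.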

5. FPGA IMPLEMENTATION

Table 3. Area utilization and max. frequency of the common architecture for different design methods in a Virtex-II -6 device.

| Design method | Max. frequency / area |
|---|---|
| Structural | 213 MHz / 553 slices |
| RTL-VHDL | 102 MHz / 1208 slices |
| RTL-Handel-C | 77 MHz / 1343 slices |
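To put the figures of Tables 2 and 3 in perspective, the short Python sketch below computes area and frequency ratios relative to the structural common architecture on the xc2v1000-6 device. The numbers are copied from the tables. The dictionary keys and the `ratios` helper are illustrative names only.

```python
# (max. frequency in MHz, area in slices), copied from Tables 2 and 3 (xc2v1000-6).
COMMON = (213, 553)  # structural dual-mode/dual-coordinates architecture
OTHERS = {
    "dual-mode/circular": (213, 544),
    "single-mode/circular": (218, 511),
    "dual-mode/linear": (214, 496),
    "single-mode/linear": (222, 480),
    "common, RTL-VHDL": (102, 1208),
    "common, RTL-Handel-C": (77, 1343),
}

def ratios(ref, other):
    # Returns (area ratio, frequency ratio) of `other` relative to `ref`.
    return other[1] / ref[1], other[0] / ref[0]

for name, figures in OTHERS.items():
    area, freq = ratios(COMMON, figures)
    print(f"{name}: area x{area:.2f}, fmax x{freq:.2f}")
```

For the RTL-VHDL description the area ratio is 1208/553, roughly 2.18, and the frequency ratio is 102/213, roughly 0.48, which is the sense in which the RTL style needs about twice the area and reaches less than half the throughput.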

Relationally placed macros (RPMs) of all the architectures presented in Sections 3 and 4 have been designed. CORDIC architectures with 20-bit data-paths X and Y, a 17-bit data-path Z and 16 iterations have been implemented in Virtex-E and Virtex-II Xilinx devices. Table 2 shows the resulting area and the maximum operating frequency for all the architectures. Table 3 presents the results of implementing the common CORDIC with three different design styles: structural (RPM) with VHDL, register transfer level (RTL) coded in VHDL, and RTL coded in Handel-C.

6. CONCLUSION

This paper has proposed an efficient FPGA mapping of a common CORDIC architecture for circular and linear coordinates and for rotation and vectoring modes that can be applied to OFDM-based WLAN systems. RPMs in VHDL have been designed and implemented in Xilinx FPGA devices. The results show that the area of the common architecture is hardly larger than the area of a single-coordinate or single-mode CORDIC architecture, and that if the common architecture is modeled in RTL style, its implementation requires twice the area and the maximum throughput falls by more than half.

7. REFERENCES

[1] J.E. Volder: The CORDIC trigonometric computing technique. IRE Trans. Elec. Comp. (1959), vol. EC-8, no. 3, pp. 330-334.

[2] J.S. Walther: A unified algorithm for elementary functions. Proc. Spring Joint Comp. Conf. (1971), vol. 38, pp. 379-385.

[3] Y. Hu: CORDIC-based VLSI architectures for digital signal processing. IEEE Signal Proc. Mag. (Jul. 1992), pp. 16-35.

[4] ETSI TS 101 475: Technical spec: Broadband Radio Access Networks; HIPERLAN Type 2; Physical Layer (2001).

[5] IEEE Std. 802.11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: High Speed Physical Layer in the 5 GHz band (1999).

[6] J. Heiskala, J. Terry: OFDM Wireless LANs: A Theoretical and Practical Guide. Sams Publishing (2002).

[7] M. Engels (ed.): Wireless OFDM Systems: How to Make Them Work?. Kluwer Academic Publishers (May 2002).

[8] J. Valls, M. Kuhlmann, K.K. Parhi: Evaluation of CORDIC Algorithms for FPGA design. Journal of VLSI Signal Processing (Nov. 2002), vol. 32, no. 3, pp. 207-222.

[9] R. Andraka: A survey of CORDIC algorithms for FPGAs. Proceedings of the 1998 ACM/SIGDA 6th International Symposium on FPGAs, Monterey, CA (1998), pp. 191-200.

[10] Xilinx: CORDIC v3.0 Product Spec. DS-249 (May 2004).
