CUSTOM IMPLEMENTATION OF THE COARSE-GRAINED RECONFIGURABLE ADRES ARCHITECTURE FOR MULTIMEDIA PURPOSES

Francisco-Javier Veredas*, Michael Scheppler

Will Moffat†, Bingfeng Mei

Advanced Systems and Circuits Infineon Technologies AG. Rosenheimerstr. 116, Munich, Germany email: [email protected]

IMEC vzw, Kapeldreef 75, Leuven, Belgium email: [email protected]

The simulations are complemented by the physical results obtained by a custom layout study. This paper is organized as follows. Section 2 gives an overview of the ADRES architecture and its design framework. Section 3 shows the simulation of the multimedia application test case with the ADRES design framework. Section 4 describes the sample custom implementation of the accelerator matrix and shows the area, performance and power consumption results. Section 5 concludes the paper.

ABSTRACT

Portable wireless multimedia approaches traditionally achieve the specified performance and power consumption with a hardwired accelerator implementation. Due to the increase in algorithm complexity (Shannon’s law), flexibility is needed to achieve shorter development cycles. A coarse-grained reconfigurable computing concept for these requirements is discussed, which supports both flexible control decisions and repetitive numerical operations. The concept includes an architecture template and a compiler and simulator environment. The architecture provides flexible time-multiplexing of code for high-performance data processing while keeping the configuration bandwidth and power requirements low. The purpose of this study is to use the coarse-grained architecture for H.264/AVC in order to determine at the physical level whether reconfigurable computing, high performance and low power can be obtained together.

2. ADRES/DRESC FRAMEWORK

2.1. The ADRES Architecture Template

The ADRES architecture template has two functional views (Fig. 1): a VLIW view and an accelerator view as an array of reconfigurable cells (RCs). While the VLIW processor is optimized for control and load/store operations, the accelerator is optimized for data-flow kernels. The VLIW is programmable in the traditional sense; in particular, it has a virtually unlimited capacity of operations. In contrast, the accelerator has a limited stack of operations (or contexts). Notable is the local data register file in the configurable processing units, which supports the modulo technique. In acceleration mode the functional units of the VLIW form the first row of the array. Orthogonal buses facilitate data transport within the array. Two configurable port sides in the functional units facilitate data input from various possible sources. Data output is likewise possible for both horizontal and vertical distribution. No further switching in the interconnect is required. The data width throughout the whole architecture is 32 bits. A more detailed explanation of the ADRES architecture template can be found in [2]. The VLIW part includes up to eight functional units organized in a row. The units communicate with each other through a horizontal data bus for data exchange. Some of the units can communicate vertically with a common register file for data load and store. Optionally, a few of the units may be hardwired multipliers. In VLIW mode a long

1. INTRODUCTION

This paper discusses a versatile processor architecture template for SoC, called ADRES (Architecture for Dynamically Reconfigurable Embedded System) [1,2]. The architecture is used to implement an H.264/AVC decoder and the results of this evaluation are discussed. ADRES includes a VLIW processor as the main processing unit and a tightly coupled array of configurable processing cells for the purpose of acceleration. The accelerator array is a coarse-grained, reconfigurable architecture comprising a multitude of processing elements with a fixed word width. It is provided for the processing of loop kernels. The array allows flexible configuration of datapaths. In particular, the array supports data-dependent branching by means of predicate logic. The simulator and compiler environment of the architecture is used for cycle-based investigations of an H.264/AVC decoder [3,4] implemented on the architecture. This is used as a realistic application scenario.

* Ph.D. student at the University of Ulm, Ulm, Germany
† Currently on leave

0-7803-9362-7/05/$20.00 ©2005 IEEE


Fig. 1. ADRES core

Fig. 2. Reconfigurable Cell

processing including move and predicate logic. The register file for our evaluation has a size of eight 32-bit words. The local data storage is beneficial for code mapping, as several iterations of local data processing can be executed in a single RC without the need to transport intermediate data through the interconnect (modulo technique). Besides the power aspects, this virtually increases the feasible depth of the datapaths. Data delays can also be implemented with it. Currently ADRES is directly connected to a multi-port L1 cache for data access. Since this is an expensive solution, we are working on replacing it with multiple banks of single-port memory. For our system evaluations we assume the same cache sizes as in the Texas Instruments c64x DSP processor: 16 kB L1 I-Cache and 16 kB L1 D-Cache [5]. For the picture data the capacity must be at least the size of one frame. One of the fundamental characteristics of the ADRES architecture is the introduction of a new section in the memory hierarchy (configuration memory, program memory and data memory).
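As a rough check of the frame-capacity requirement stated above, the following sketch computes the size of one decoded frame at the CIF resolution (352x288) used for the test case in Section 3.1. The 8-bit YUV 4:2:0 sampling is an assumption made only for illustration, since the pixel format is not stated here, and the function name is hypothetical.

```python
def frame_bytes(width, height, bits_per_sample=8):
    """Bytes needed for one frame, assuming YUV 4:2:0 sampling (an assumption)."""
    luma = width * height
    # Two chroma planes, each subsampled by 2 horizontally and vertically.
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * bits_per_sample // 8


cif = frame_bytes(352, 288)
print(cif, "bytes per CIF frame")
print(round(cif / 1024, 1), "KiB per CIF frame")
```

Under this assumption one frame is considerably larger than the 16 kB L1 D-Cache, which suggests that the picture data would have to be held at a higher level of the memory hierarchy.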

instruction word from a central program memory flexibly controls the operations. The attached array is composed of several rows of RCs organized in matrix form. The behavior of each RC is controlled by a locally stored set of configuration vectors (several contexts are possible). The current implementation of the RC assumes a static usage of the local configuration memory, i.e. as in FPGAs, the configuration is loaded from an external memory during the boot phase of the device. During the execution phase a central pointer allows dynamic reconfiguration of the RCs within a cycle. In acceleration mode the VLIW units in the first row can also be controlled by reconfiguration; hence, for instance, an 8x8 matrix is possible. If the first row contains multipliers, the whole column contains multipliers for layout reasons. Data exchange with external memory goes through the default path of the VLIW processor, i.e. at the first level this means access to the central register file and at the second level access to the higher memory hierarchy, e.g. cache or memory banks. In some cases the bandwidth of this memory interface may limit the acceleration. Additional interfaces with direct and parallel access to special data buffers allow, for instance, parallel numerical operation (parallel datapaths). The data exchange between RCs is done with orthogonal interconnect networks. There are two levels of interconnect for internal data exchange between the units. A global bus for each row or column spans the entire array. Additionally, the array is subdivided into four quadrants. Within a quadrant, supplementary local interconnect is provided such that an RC can get input data directly from each of its horizontal or vertical neighbors. The components inside an RC (Fig. 2) include the local configuration memory, the ALU, input and output multiplexers and a register file for local data storage. The ALU instruction set comprises 20 operations for data

2.2. The DRESC Compiler Framework

As a compiler front-end the IMPACT C-compiler [6] is used for parsing the sources and generating Lcode, an intermediate code representation suited for VLIW processors. The compiler also outputs profiling information on all function calls in the source. It uses symbolic instructions and flattened function calls; loop descriptions, however, are preserved. For further processing of the code, the target coarse-grained architecture is described in an XML-based language. The parser and abstraction steps transform the architecture into an internal graph representation [2]. Taking the program and architecture representations as input, the modulo scheduling algorithm is applied to achieve high parallelism for the kernels.
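Modulo scheduling overlaps successive loop iterations at a fixed initiation interval (II). The sketch below shows the textbook resource-constrained lower bound on II. It is a general illustration rather than the DRESC algorithm itself, it assumes that every operation can execute on any unit in one cycle, and the operation count in the usage line is hypothetical.

```python
import math


def res_mii(num_ops, num_units):
    """Resource-constrained minimum initiation interval (textbook bound).

    Assumes each operation occupies one unit for one cycle and that any
    unit can execute any operation.
    """
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    return max(1, math.ceil(num_ops / num_units))


# Hypothetical kernel with 100 operations on an 8x8 array (64 units).
print(res_mii(100, 64))
```

A real scheduler for a coarse-grained array must additionally respect routing and recurrence constraints, so the achievable II can be higher than this bound.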


parameter settings derived in the first step. As a result, blocks of content data with associated control parameters are obtained. This data can beneficially be stored in the content buffer for later parallel access. In the third step decompression is executed, i.e. the data is transformed into the two-dimensional image domain. A parallel step comprises motion compensation. Finally the predicted or interpolated frames are reconstructed. The post-processing step contains a filtering of the image data to reduce decoding artifacts. Of these steps, the processing of block data is most suited for acceleration in the array. Examples are the IDC and the loop filter.

Table 1. ADRES performance speed-up
Columns: Processor (C-Source), Kernels, AVC Decoder
Rows: VLIW (AVC Reference), VLIW (Modified Reference), ADRES (Modified AVC Ref.)
Values in extraction order: N.A., x7.14, x4.24, x1.88, x1, x1

For the purpose of partitioning it is assumed that ADRES has an ideal memory environment through the standard interface, i.e. only the pure computation effort is profiled. As a result of this step, candidate loops for mapping onto the reconfigurable matrix are identified. The further decision criteria are based on the execution time and the potential speed-up. The next, iterative step requires optimizing the coding style of the C sources (source-level transformations) in order to support pipelining in the datapaths and maximize the performance.

3.2. Results of Cycle-based Simulation

Cycle-based studies were done with the software framework explained in Section 2. After profiling, six functions were selected for acceleration in the array: itrans1, get block, decode one chroma macroblock, avg block, copy mv and alloc storable picture. A detailed explanation of these functions can be found in [4]. The selected functions account for 60% of the total execution time (on a VLIW). 16 kernels were detected in the functions and were heavily recoded. After new profiling, we observed that the recoded functions account for 27.5% of the total execution time. We compare the results of the functions mapped onto ADRES with the same functions mapped onto the VLIW only (Table 1). We see that for a complete decoding cycle ADRES is about 88% faster than a VLIW processor. The required clock frequency is 184 MHz for the VLIW processor alone and 98 MHz for the ADRES architecture.
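The 88% figure can be cross-checked against the two required clock frequencies: if both processors must decode the same stream in real time, the ratio of the required clocks gives the speed-up in cycles. A minimal sketch of this arithmetic, with a hypothetical function name:

```python
def speedup_from_clocks(f_baseline_mhz, f_accelerated_mhz):
    """Speed-up implied by the clock each processor needs for the same workload."""
    if f_accelerated_mhz <= 0:
        raise ValueError("frequency must be positive")
    return f_baseline_mhz / f_accelerated_mhz


# 184 MHz for the VLIW alone, 98 MHz for ADRES (values from the text).
s = speedup_from_clocks(184.0, 98.0)
print(f"speed-up: x{s:.2f} ({(s - 1) * 100:.0f}% faster)")
```

The ratio rounds to x1.88, which agrees with the 88% quoted for the complete decoding cycle.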

3. DESIGN APPLICATION TEST-CASE

3.1. The H.264/AVC Decoder

Multimedia applications have real-time constraints, which are dictated by the specified data rates at the boundaries of the system. For instance, a decoder has at its input side a variable-rate serial bitstream, which is limited by the maximum channel capacity, while at the output side the video stream is invariably defined by the frame rate and the picture resolution. At the input side the equivalent of several frames is merged into one transport package, which is decoded in one decoding cycle. The decoded and reconstructed pictures are streamed out in the correct order and at the given frequency for video display. H.264/AVC is a high-compression digital video standard written by the Joint Video Team (JVT) [4]. We used the C sources of the JM 7.5b reference model. Our code modifications simplified the data types, mainly by flattening pointer chains. This brought some improvements in the code density after compilation with IMPACT. Our sample video input bitstream for running the decoder includes 50 frames of the foreman.264 test sequence. The resolution of the decoded data is CIF (352x288). We used the European video standard with 25 frames/sec and a channel capacity of 320 kbps. The profile was set to "IDC extended". The serial input stream contains a combination of protocol information and content data for groups of pictures of the video stream. In a first processing step the data stream is parsed: symbols and subsections of the content are identified. In a second processing step the subsections of the data are decoded on the symbol level according to

4. ADRES PHYSICAL IMPLEMENTATION

The RC described in Section 2 is custom implemented in the Infineon Technologies 130 nm CMOS process technology with six metal layers. The objective of the work was to gain an understanding of the microstructure of the architecture in order to take advantage of regularities for an optimized physical implementation. This step was important because large parts of the architecture were defined at the IMEC institute from a software perspective. Based on the experience from former custom design projects, the methodology was to start with standard cells. Design entry was in schematics, placement was manual, and routing was partly automatic and partly manual.

4.1. Microarchitecture of the Reconfigurable Cell

The schematics of an RC were newly developed from a paper-based specification [7]. The first level of the hierarchy was identified with the subblocks: configuration


block, external interfaces for input and output, predicate logic, ALU and data register file. The configuration block itself includes an SRAM with 32 words of 40 bits. It also includes an addressing unit. Since the basic structure of a loop kernel is a sequence of instructions, a counter performs the address increments. The start and end addresses are set externally (by the central program counter). The 5-bit output of the counter is demultiplexed to select one of the 32 physical addresses. Read and write enable are also provided externally. The 40-bit data output of the RAM is the configuration word for the rest of the RC. One enhancement is the so-called stage counter. It is used to select between different states of the predicate logic registers and the data register file. Hence, the 3-bit output of the stage counter is added to the five 3-bit configuration segments of the predicate logic registers and the data register file. The external inputs are 32-bit wide multiplexers. They select from the six possible external data sources or from the internal output of the ALU. (Unlike in FPGAs, no switch matrix is used for the interconnect; instead the input multiplexers and the output demultiplexer of each RC organize the communication.) There are four input multiplexers: two for the ALU inputs, one for the local register file input and one for the predicate logic. The output demultiplexer selects the bus (horizontal or vertical) on which the output data is broadcast. The ALU includes a crossbar at its two source inputs. It can swap the position of the two operands, which is necessary for the move and the sub operation. The two operands are then distributed to the compare unit, the logical unit, the shifter and the arithmetic unit. Optimization of the microarchitecture led to merging the arithmetic unit and the compare unit into a single unit, because both require the adder functionality. The result values of the four units are fed to a multiplexer, which selects one value. The result of an operation is stored either in the register file of the generating RC, in the output register, or in the register file of another RC. The register file has eight words of 32 bits. It has one data_in port and two data_out ports, which can be read simultaneously. The addressing requires ten of the 40 configuration bits. The predicate logic is used for data-dependent branching. The result of a compare operation is stored either in the local predicate register file (8 x 1 bit) or in another RC. At a later time the predicate value controls a multiplexer function, which is implemented in the logical unit of the ALU. In our implementation, which is not fully optimized, the schematics show 63266 transistors in one RC.
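The addressing scheme of the configuration block described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class and method names are hypothetical, and the wrap-around from the end address back to the start address is assumed for loop kernels rather than taken from the specification.

```python
class ConfigBlock:
    """Behavioral sketch of the RC configuration block (hypothetical names)."""

    WORDS = 32       # SRAM depth given in the text
    WORD_BITS = 40   # configuration word width given in the text

    def __init__(self):
        self.sram = [0] * self.WORDS
        self.start = 0
        self.end = 0
        self.counter = 0

    def write(self, address, word):
        """Load one configuration word, as done during the boot phase."""
        if not 0 <= address < self.WORDS:
            raise ValueError("address out of range")
        if not 0 <= word < (1 << self.WORD_BITS):
            raise ValueError("word exceeds 40 bits")
        self.sram[address] = word

    def set_loop(self, start, end):
        """Start and end address are set externally (central program counter)."""
        if not 0 <= start <= end < self.WORDS:
            raise ValueError("invalid address range")
        self.start, self.end, self.counter = start, end, start

    def step(self):
        """Return the current configuration word and advance the 5-bit counter."""
        word = self.sram[self.counter]
        # Wrapping back to the start address is an assumption of this sketch.
        self.counter = self.start if self.counter == self.end else self.counter + 1
        return word


cfg = ConfigBlock()
for addr in range(3):
    cfg.write(addr, 0x100 + addr)
cfg.set_loop(0, 2)
print([hex(cfg.step()) for _ in range(5)])
```

The model keeps all contexts resident in the local memory, matching the static usage of the configuration memory: only the counter changes during execution, not the stored words.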

4.2. Layout Study

Our current layout implementation is based on standard cells. Medium drive strength is used inside a subblock and strong buffering for the lines that broadcast inside the RC. The cells are mostly hand-placed, because many parts of the microarchitecture are regular. Routing inside a subblock is done automatically. The routing between the subblocks is done manually. The total area of the layout is 0.196 mm². The contributions of the subblocks are as follows: configuration block 50%, external interfaces for input and output 6%, register file 9% and ALU 19%. About 15% of the area of an RC is consumed by the interconnect between the subblocks. The routing of the gates inside the subblocks uses metal 2 and 3. The routing between the subblocks can use metal 1 up to metal 4. The metal 4 layer was required for traversing the three functional units inside the ALU. We found that floorplanning of an RC is difficult, because the standard cell areas do not align well. Certainly a trade-off can be made for the routing between area and the height of the metal stack. In particular, area can be saved if the orthogonal interconnect lines of the array can be put on top of the RC. Furthermore, lines for the configuration load and the control signals from the central program counter need to be routed, as well as global clock and reset. Very similar to FPGAs, a tile can be implemented, which is instantiated several times to form an array. Although some effort is required, we believe that an optimized layout of a tile bears enough potential for savings to make it worthwhile in comparison to automatic place and route.

4.3. Performance and Power Consumption of an RC

From the layout implementation of the RC a transistor netlist was extracted, which also contained the parasitics of the routing. The extraction considers capacitances only and neglects resistance. This appeared justified at this stage of the studies. We used a lumped capacitance to VSS (CVSS) and a cross-coupling capacitance (Ccoup) to adjacent lines. The external load caused by the array interconnect was modeled by a capacitive load at the outputs. The circuit simulator was Synopsys Nanosim version 2003.03-SP1. The transistor model was BSIM3v2. All simulations used a power supply of 1.2 V. The simulation vectors mimic the situation when a function of the H.264/AVC decoder is accelerated. The function Get Block (GetBlk) is selected as a representative function. The function is modeled by invoking several instructions: ADD, AND, ASR and LSR with 48%, 22%, 18% and 11% frequency, respectively. We assume that the context memory contains 27 configuration contexts. The data input vectors of the first operand (src1 in Fig. 2) are 50%


Fig. 3. Floorplan

Fig. 4. Layout of the Reconfigurable Cell

complementary to the second operand (src2 in Fig. 2). For example, if the src1 vector is “0101”, the src2 vector is “0110”. Also, for the first operand 50% of ‘1’ and 50% of ‘0’ is assumed, for example "1010". With our simulations at nominal process and temperature conditions we observed a maximum clocking of the RC of 60 MHz or 16 ns cycle time. A major contribution to the signal delay comes from the long carry chain (64 bits) in the subtraction unit. We assume that the critical path starts at the registered output of the central program counter, passes through the local configuration memory and ends with the data output of the ALU, which is destined either for the internal register file or for another RC in the array. If we add 4 ns for the reconfiguration pointer, we get a total cycle time of about 20 ns or 50 MHz performance. Compared with the 100 MHz that we found in the cycle-based simulations, a speed-up of 100% is required. In our current implementation no circuit optimizations have been performed yet. To reach the required performance it is possible to implement new circuit architectures (e.g. a carry look-ahead adder instead of a ripple carry adder). In addition, we can use a hardware pipeline technique to speed up the critical path. As a reference for the power evaluation we used the CAPE tool, an Infineon Technologies in-house tool for architecture-level prognosis. For the purpose of modeling the signal activity the reconfigurable cell was partitioned into sequential configuration logic, combinational configuration logic, ALU and register file. The sequential part is assumed to toggle every clock cycle with a probability of 25% of the nets changing their state. In the SRAM, the assumption is that 10% of the network nodes toggle. For both ALU and register file 25% activity is assumed. We note that the major contributions to the power consumption come from the ALU and the sequential part of the configuration unit. At a

target frequency of 100 MHz we get a consumption of 1.7 mW for one RC. This, however, is a peak consumption, which occurs only when acceleration is required; dynamic disabling is therefore suggested. In terms of prognosis accuracy, the power consumption stays within a 20% - 30% range.

4.4. Assessment of the System for H.264/AVC Application

For the system assessment we additionally require area figures for the multipliers and the surrounding memory architecture. We did not implement either of them, but use estimates as input for the assessment. For the area of a reconfigurable multiplier we assumed a low configuration overhead and the replacement of the ALU of the previous RC with one multiplier. Hence, the area of the multiplier RC is estimated at 0.1 mm². For one functional unit of the VLIW we assume a slightly larger area than an RC, namely 0.2 mm². For estimating the area of the memory we assumed a 256x32-bit common register file, a 16 kB L1 I-Cache and a 16 kB L1 D-Cache. The estimated area for all the memories is 4 mm². We used the statistics-based CAPE tool to find this value. The architecture template used in the H.264/AVC decoding was an 8x8 array. After simulation we note that no more than 50% of the array is utilized. Nevertheless, due to the compiler requirements and future reprogramming, it is recommended to maintain all the units of the array. Assuming an ADRES implementation with 48 RCs and 16 multipliers, the total area is 15 mm². We note that the memory (configuration memory, register file and caches) accounts for 83% of the total area. With the example of accelerating only the function GetBlk, the above reported value of about 2 mW for an RC running at 100 MHz is a peak consumption. In the context


of the complete decoding cycle, 73% of the machine cycles are without acceleration, i.e. the accelerator can be disabled to save power. Normalized to 100% of the cycles, the average consumption of the array is about 0.06 mW/MHz. With the 100 MHz required for running the H.264/AVC decoder, the average consumption of the whole array would be 6 mW with a peak of about 31.15 mW. For the VLIW processor, we take a scaled version of the power consumption of a Texas Instruments c64x [5]. For this DSP processor the average power consumption at 100 MHz is 51.9 mW. The total power consumption of ADRES with the 27% contribution of the array is 46.3 mW. As the contribution of the array is small, the average power consumption is close to that of a VLIW processor implementation.
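The total area quoted in this section can be approximately reproduced from the per-unit figures. The sketch below shows one plausible composition of the sum (48 RCs, 16 multiplier RCs and the memories); the exact breakdown behind the 15 mm² figure is not spelled out in the text, so this composition is an assumption, and the names are hypothetical.

```python
# Per-unit area figures quoted in the text (mm^2).
AREA_RC = 0.196        # custom layout of one RC
AREA_MULT_RC = 0.1     # estimated multiplier RC
AREA_MEMORIES = 4.0    # register file plus L1 caches (CAPE estimate)


def array_area(num_rc, num_mult):
    """Area of the array cells only, assuming no extra global routing area."""
    return num_rc * AREA_RC + num_mult * AREA_MULT_RC


def total_area(num_rc=48, num_mult=16):
    """Array cells plus the estimated memories."""
    return array_area(num_rc, num_mult) + AREA_MEMORIES


print(f"array: {array_area(48, 16):.2f} mm^2")
print(f"total: {total_area():.2f} mm^2")
```

Under these assumptions the sum comes out close to 15 mm²; area for global interconnect, the central program counter and clock distribution is not included in this estimate.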

From today’s point of view, running an H.264/AVC video decoder at 100 MHz, the average power consumption would be around 46.3 mW and the total area would be 15 mm². The contribution of the array to the total power consumption of ADRES is small in comparison with the VLIW part.

6. REFERENCES

[1] B. Mei, S. Vernalde, D. Verkest, and R. Lauwereins, “Design methodology for tightly coupled VLIW/reconfigurable matrix architecture: a case study,” Proc. of Design Automation and Test Conference in Europe, March 2004.

[2] B. Mei, “A coarse-grained reconfigurable architecture template and its compilation techniques,” Ph.D. thesis, Katholieke Universiteit Leuven, Jan. 2005.

[3] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. on Circuits and Systems for Video Technology, pp. 560-576, Jul. 2003.

[4] H.264/AVC Software, http://iphome.hhi.de/suehring/tml/

[5] S. Agarwala, et al., “A 600-MHz VLIW DSP,” IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1532-1544, Nov. 2002.

[6] The IMPACT group, http://www.crhc.uiuc.edu/impact

[7] W. Hai-Tajeb, “Full-Custom Design Studien zu einem Coprocessor Array,” Master thesis, Technical University of Munich, Oct. 2004.

[8] F.J. Veredas-Ramirez, M. Scheppler, and H.J. Pfleiderer, “A Survey on Reconfigurable Computing Systems: Taxonomy and Metrics,” Proc. of IV workshop on Reconfigurable Computing and Applications, pp. 25-36, Sept. 2004.

[9] R. Hartenstein, “Coarse grain reconfigurable architecture,” ACM Proc. of the 2001 Conference on ASIA South Pacific Design Automation, 2001.

5. CONCLUSIONS AND FUTURE WORK

The architecture of the coarse-grained, reconfigurable ADRES for multimedia applications has been described. The H.264/AVC video decoder has been selected as a test-case scenario. Six functions of the H.264/AVC decoder have been identified for acceleration on the ADRES array. Cycle-based simulations with the six functions show a speed-up of 4.24 with respect to the VLIW processor. For a complete H.264/AVC decoding cycle, the average speed-up is 88%. The architecture is well suited for embedded applications in SoCs with low cost targets in terms of area and power consumption. The functional units in the architecture are tightly linked and communicate fully synchronously. Task distribution and communication are defined at compilation time. In acceleration mode a central program counter globally controls the reconfiguration of the array. After an initial boot phase, reconfiguration can be done fast (dynamically) without stalling the array. Programming of the array takes advantage of the fact that, in the case of a loop kernel, a dataflow graph is a succession of processing and register storage. Clustering subnets in the dataflow graph yields small program sequences, which can be mapped onto a functional unit for local operation. In future work further optimization effort needs to be put into the handling of the loop indices. A layout study implementation of an RC of ADRES has been presented, which showed the particularities of the RC. We believe that full-custom optimization of the layout is necessary and beneficial. However, our first attempts showed non-regularity at the level of the subblocks, which makes alignment difficult. To cope with this, in future work we need to use cells of differing height for the various subblocks. We also need to improve the inter-block routing inside an RC. Further work needs to be put into the architectural and physical implementation of the whole array, including the common register file, the central program counter and the addressing unit of the VLIW.

Coordination.

[10] S.C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R.R. Taylor, and R. Laufer, “PipeRench: A Coprocessor for Streaming Multimedia Acceleration,” Proc. of the International Symposium on Computing Architectures, pp. 28-39, May 1999.

[11] H. Singh, M.H. Lee, G. Lu, F.J. Kurdahi, N. Bagherzadeh, E.M. Filho, “MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications,” IEEE Trans. on Computers, May 2000.

[12] S. Hauck, et al., “The Chimaera Reconfigurable Functional Unit,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Feb. 2004.

[13] C. Ebeling, C. Fisher, G. Xing, M. Shen, and H. Liu, “Implementing an OFDM Receiver on the RaPiD Reconfigurable Architecture,” IEEE Trans. on Computers, pp. 1436-1448, Nov. 2004.
