Clustered Programmable-Reconfigurable Processors

Derek B. Gottlieb, Jeffrey J. Cook, Joshua D. Walstrom, Steven Ferrera, Chi-Wei Wang, Nicholas P. Carter
University of Illinois at Urbana-Champaign
{dgottlie, jjcook, walstrom, ferrera, cwang12, npcarter}@crhc.uiuc.edu

Abstract

In order to pose a successful challenge to conventional processor architectures, reconfigurable computing systems must achieve significantly better performance than conventional programmable processors by both greatly reducing the number of clock cycles required to execute a wide range of applications and achieving high clock rates when implemented in deep-submicron fabrication technologies. In this paper, we describe the architecture of Amalgam, a clustered programmable-reconfigurable processor that integrates multiple conventional processors and blocks of reconfigurable logic onto a single chip. Amalgam’s distributed architecture allows implementation at high clock rates by limiting the impact of wire delay on cycle time and delivers an average of 13.7x speedup on our benchmark applications when compared to an equivalent architecture that contains only a single programmable processor.

1. Introduction

The “90-10” rule of computer software, which states that most applications spend 90% of their time executing only 10% of their code, makes a strong argument in favor of computer systems that integrate both reconfigurable logic and conventional programmable processors. Computing systems based on reconfigurable logic have demonstrated impressive performance on a variety of compute-intensive applications [5], but the amount of chip area required to implement general-purpose applications in reconfigurable logic is often prohibitive. In contrast, programmable processors represent programs as instruction sequences that are stored in memory and executed on a fixed set of hardware resources. This representation allows programmable processors to implement complex applications in a limited amount of chip area, but their performance is often limited by the need to express computations in terms of the instructions supported by the hardware [6]. Providing both programmable and reconfigurable hardware in an architecture allows critical regions of programs to be implemented in reconfigurable logic, while less-critical regions are implemented as software programs, yielding high performance in reasonable amounts of chip area.

For programmable-reconfigurable processors (architectures that combine programmable and reconfigurable logic) to be successful in either the general-purpose or embedded systems markets, they must achieve much better performance than conventional architectures on a wide range of applications by exploiting both reconfigurable logic and more traditional task- and data-level parallelism. In addition, it must be possible to implement programmable-reconfigurable processors at clock rates that match or exceed those of conventional processors, so that increases in cycle time do not counteract the performance benefits from reconfigurable logic. Finally, programmable-reconfigurable architectures must support automatic compilation of programs written in high-level programming languages to prevent application development time from becoming a barrier to their adoption.

Clustered programmable-reconfigurable processors have been designed to meet all of these requirements. Like chip multiprocessors (CMPs), clustered processors integrate several independent processing resources, known as clusters, and a shared memory system onto a single chip. Unlike CMPs, however, clustered processors treat the combined register files of the clusters as a single, distributed register file. Operations executing on a cluster may only read their operands from the register file in that cluster, but may write their results into any register on the chip over an on-chip network. This decoupled architecture significantly reduces the length of the longest wire that a signal must traverse in a clock cycle, preventing wire delays from becoming the limiting factor on cycle time [10] [1]. Because the clusters on a chip are independent, the clock period of a clustered processor is determined by the critical path of the logic within each cluster, not the propagation delay of global interconnect.
Operations that do use the on-chip network see the propagation delay across the network as one or more additional cycles of latency. This latency is exposed to the compiler, which can take it into account when scheduling operations. Chip multiprocessors also limit wire lengths to allow implementation at high clock rates. However, previous work has shown that adding a register-based communication mechanism to a clustered processor made up solely of programmable processors yields significantly better performance than an equivalent architecture in which clusters communicate only through memory [12]. The register-based communication mechanism also provides a clean abstraction barrier between heterogeneous processing elements, making it easier to integrate programmable and reconfigurable logic onto a single chip.

In this paper, we present and evaluate the architecture of Amalgam, a clustered programmable-reconfigurable processor. Unlike previous clustered processor architectures, which contained only programmable clusters, half of the clusters on an Amalgam chip contain reconfigurable logic, while the other half contain programmable processors. We show that this hardware configuration achieves an average speedup of 2.84 over an architecture with an equal number of clusters that all contain programmable processors. In addition, we demonstrate that optimizations that transform memory accesses into inter-cluster communication can yield significant speedups for some applications on this architecture. The remainder of this paper begins with a description of Amalgam. This is followed by a discussion of our experimental methods and results. We then present related work, discuss future plans for our research, and conclude.

2. Architecture of a Clustered Programmable-Reconfigurable Processor

Figure 1 shows a block diagram of the Amalgam clustered programmable-reconfigurable processor, which was designed to take advantage of the capabilities and constraints of 2005-era VLSI fabrication processes. Eight independent processor clusters, four of which contain conventional programmable processors and four of which contain blocks of reconfigurable logic, communicate with each other and the memory system over an on-chip network. The 64KB on-chip data cache is divided into four banks of 16KB each to support up to four memory operations per cycle, matching the peak rate at which memory requests may be issued by the programmable clusters. Memory addresses are interleaved across the banks on a word-by-word basis, so bank 0 contains all words whose addresses end in 0 (mod 4), bank 1 contains the words whose addresses end in 1 (mod 4), and so on. To simplify data cache writebacks, the data cache banks are required to contain the same set of cache lines, so a cache line is never present in some of the banks and invalid in others. As illustrated in Figure 2, each programmable cluster contains a dual-issue in-order microprocessor, including a 32-entry register file and a 4KB instruction cache, while the reconfigurable clusters contain register files and blocks of reconfigurable logic. Each cluster has one read and one write port onto the on-chip network, which are used for both memory references and inter-cluster communication.

Figure 1: The Amalgam Processor (block diagram: off-chip memory, four data cache banks, and four programmable and four reconfigurable clusters connected by the on-chip network)

Amalgam’s programmable clusters execute independent instruction streams, which are fetched from main memory and cached in each cluster’s instruction cache. They implement a RISC ISA with three extensions to support clustered architectures that are based on the instruction set used on the M-Machine [7]: a barrier instruction to reduce synchronization overhead, specification of both a destination cluster and a destination register for the result of most instructions, and an EMPTY instruction that prepares registers to receive results from other clusters. The EMPTY instruction is necessary because the delay required for inter-cluster communication makes it impossible to automatically invalidate the destination registers of instructions that target clusters other than the one they execute in. Instead, any cluster that expects to receive a result from another cluster must first execute an EMPTY instruction to invalidate the register that will receive the result. This clears the appropriate valid bit in the receiving cluster’s register scoreboard, preventing instructions that depend on the register from issuing. When the instruction that writes the register completes, it sends its result to the receiving cluster and the destination register is marked valid, allowing dependent instructions to proceed.
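The EMPTY/scoreboard handshake described above can be sketched in a few lines. This is an illustrative toy model only; the class and method names are our own, not part of the real ISA or microarchitecture.

```python
# Toy model of a receiving cluster's register scoreboard (our own
# simplification, not the actual Amalgam hardware design).

class ClusterRegisters:
    def __init__(self, n=32):
        self.value = [0] * n
        self.valid = [True] * n          # scoreboard valid bits

    def empty(self, r):
        """EMPTY instruction: invalidate r before a remote result arrives."""
        self.valid[r] = False

    def remote_write(self, r, v):
        """A result arriving over the on-chip network fills r and revalidates it."""
        self.value[r] = v
        self.valid[r] = True

    def can_issue(self, *regs):
        """An instruction may issue only if all its source registers are valid."""
        return all(self.valid[r] for r in regs)

cluster = ClusterRegisters()
cluster.empty(5)                  # expect a result from another cluster
assert not cluster.can_issue(5)   # dependent instructions stall
cluster.remote_write(5, 42)       # result arrives over the network
assert cluster.can_issue(5)       # dependent instructions may now proceed
```

The key point the model captures is that the receiver, not the sender, invalidates the register: the communication delay makes it impossible for the sending cluster to clear the valid bit in time.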

Figure 2: Cluster Detail (programmable cluster: network interface, ALUs, register file, and instruction cache; reconfigurable cluster: network interface, register file, and reconfigurable logic)

2.1. Reconfigurable Cluster Architecture

Each of Amalgam’s reconfigurable clusters contains a 32-entry register file and a 32x32 array of 4-input logic blocks. To improve performance on multi-bit computations, each logic block can be configured to generate either a one-bit function of its four inputs or two independent functions of three inputs, and an optimized carry chain is included in each row of the array.

The main challenge in the design of the reconfigurable clusters was deciding how the reconfigurable array should interface with the register file. Our original design was similar to the conceptual diagram presented in Figure 2, and had the reconfigurable array and register file interacting in the same way as the register file and execution units in a conventional processor. A fixed number of read and write ports connected the two components of the reconfigurable cluster, and one of the tasks of the reconfigurable array was generating the appropriate register indices to control the set of registers being read and written on each cycle. This approach proved to have several significant problems. First, the bandwidth between the register file and the reconfigurable array was very limited, making it hard to perform multiple computations in parallel. Second, the logic required to generate the register indices on each cycle was complex and interfered with the usage of the array for computation. Finally, allowing each row in the array to access every register in the register file would have required large numbers of long wires, increasing the area and the cycle time of the reconfigurable cluster. To address these issues, we developed the architecture shown in Figure 3. The register file has been divided into four equal-sized banks and interleaved with the reconfigurable array, which has also been divided into four segments.
An array control unit (described later) generates control signals for the cluster, and the network interface handles communication over the on-chip network.
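The logic-block behavior described in Section 2.1 (a 4-input block that can instead compute two independent 3-input functions) can be modeled as a lookup table. A minimal sketch, with truth-table encodings of our own choosing:

```python
# Illustrative model of one logic block (our own simplification of the
# block described in the text; the real block also has a carry chain).

def lut4(config, a, b, c, d):
    """Evaluate a 4-input LUT: config is a 16-bit truth table."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (config >> index) & 1

def dual_lut3(cfg_lo, cfg_hi, a, b, c):
    """Two independent 3-input functions (each an 8-bit truth table)
    sharing the same three inputs."""
    index = (c << 2) | (b << 1) | a
    return (cfg_lo >> index) & 1, (cfg_hi >> index) & 1

# Example configurations: a 4-input AND, and (in dual mode) a 3-input
# XOR alongside a 3-input majority function.
AND4 = 1 << 15                    # only input pattern 1111 is true
XOR3 = 0b10010110                 # parity of the three inputs
MAJ3 = 0b11101000                 # majority of the three inputs
assert lut4(AND4, 1, 1, 1, 1) == 1 and lut4(AND4, 1, 1, 1, 0) == 0
assert dual_lut3(XOR3, MAJ3, 1, 1, 0) == (0, 1)
```

Splitting one 4-input block into two 3-input functions is what lets a single row compute, for example, both the sum and carry bits of an adder stage.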


Figure 3: Reconfigurable Cluster Block Diagram

Each register in a bank continuously drives its output on a set of wires that are visible to each of the rows of logic blocks in the array segment “below” the bank. Logic blocks in a given column of a segment can then access the contents of the corresponding bit in any register in the bank “above” them by appropriately configuring their input multiplexors. Similarly, the input to each register bit is taken from a vertical wire that can be driven by any of the logic blocks in the corresponding column of the array segment “above” the bank that contains the bit. This configuration makes the entire contents of the register file available to the reconfigurable array on each cycle, greatly increasing register bandwidth and the number of computations that can be carried out in parallel.

Interleaving register banks and array segments in this manner causes data to flow through the cluster in a “counter-clockwise” fashion similar to that used in PipeRench [8], and the interconnections between logic blocks reinforce this. Each logic block’s output is visible to the logic blocks in the eight rows “below” it in the array (ensuring that each row can directly drive at least one row in the next segment), to the logic blocks adjacent to it in the same row (to support shifts and accumulates), and to the eight registers in the register bank “below” it. The pattern of horizontal and vertical wires that provide these connections is described in more detail in [22].

The Array Control Unit (ACU) contains a programmable finite state machine that controls computation in the reconfigurable cluster. The ACU also has the ability to transfer values between register banks and to/from the network interface, subject to the limitation that it may only read and write one register in each bank per cycle. Since the ACU cannot predict when data will arrive over the on-chip network, it stalls the entire cluster when arriving data and an internal register transfer try to write to a given register bank simultaneously.

This architecture allows the reconfigurable cluster to efficiently implement both fine-grained and coarse-grained tasks. Each cluster’s reconfigurable array is large enough to map complex computations such as nested loops, allowing it to provide substantial performance improvements on programs with coarse-grained tasks. For programs with large numbers of fine-grained tasks, the register-based communication mechanism allows the cluster to respond quickly to input data and to deliver results directly to the cluster that needs them, reducing overhead and improving performance even on very small computations.

One disadvantage of our current architecture is that configurations of the reconfigurable cluster may contain logic paths with delays greater than the processor’s cycle time. For the experiments reported in this paper, we address this limitation by verifying that our configurations for the reconfigurable clusters do not perform unreasonable amounts of computation in a single cycle. We are currently investigating the use of pipelined interconnects such as those described in [17] and [20] to ensure that the delay through the reconfigurable array matches the cycle time of the programmable clusters, eliminating the need to perform delay estimation as part of compiling a program for Amalgam.
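The counter-clockwise flow of data through the interleaved register banks and array segments can be illustrated with a toy model. The per-segment functions and the one-value-per-bank simplification below are ours, not the hardware's; the only property carried over from the text is that segment i reads the bank “above” it and writes the bank “below” it.

```python
# Toy model of the ring of four register banks and four array segments
# (heavily simplified: each bank holds a single value per step).

NUM_BANKS = 4

def ring_step(banks, segment_fns):
    """One cycle: segment i reads bank i ("above" it) and writes its
    result into bank (i + 1) % 4 ("below" it)."""
    results = [segment_fns[i](banks[i]) for i in range(NUM_BANKS)]
    return [results[(i - 1) % NUM_BANKS] for i in range(NUM_BANKS)]

# Example: each segment adds a distinct constant, so a value circulating
# once around the ring passes through every segment exactly once.
fns = [lambda v, k=k: v + k for k in (1, 10, 100, 1000)]
banks = [0, 0, 0, 0]
for _ in range(NUM_BANKS):
    banks = ring_step(banks, fns)
assert banks == [1111, 1111, 1111, 1111]   # every value saw all four segments
```

After four steps every bank holds 1111, showing that the ring topology pipelines a computation across all four segments without any segment needing global access to the register file.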

3. Experimental Methodology

In order to evaluate our architecture, we have implemented amalsim, a cycle-accurate simulator for clustered programmable-reconfigurable processors. Amalsim allows the user to define a wide range of clustered programmable-reconfigurable architectures through a configuration file interface. Users can define the number of programmable and reconfigurable clusters, the number of ALUs in each programmable cluster, the size and configuration of each reconfigurable cluster’s reconfigurable array, the memory hierarchy, and the topology of the on-chip network, making it possible to evaluate a wide range of design trade-offs without modifying the simulator.

For the experiments described in this paper, we varied the number of reconfigurable and programmable clusters while keeping the configuration of each cluster, network latencies, and the memory system constant. Programmable clusters were modeled as dual-issue in-order processors with five-stage pipelines. Reconfigurable clusters were modeled as having 32 rows of 32 logic blocks each, using the interleaving of register banks and array segments described earlier. Our timing model for the reconfigurable clusters assumed that a 16-bit add or similar computation could be performed in one clock cycle. SPICE simulations indicate that the latency of a 16-bit add in our reconfigurable array is 18.5 fan-out-of-four inverter delays (FO4). This is consistent with the cycle times of recent Intel processors, which range from 12-20 FO4 delays [11], validating our delay assumption.

The memory system used for these experiments has 64KB of on-chip data cache (four banks of 16KB each). Each programmable cluster’s instruction cache is 4KB in size, and both the instruction caches and data cache banks are four-way set-associative. Total access time for a hit in the cache is five cycles (one cycle to access the cache plus two cycles in each direction across the network), and fetching a line from the off-chip memory takes 40 cycles.

We evaluate the performance of our architecture using five applications that have been implemented on amalsim: IDCT, Rijndael, N-Queens, DNA, and Dither. IDCT is an 8x8 inverse discrete-cosine transform of the type used in a number of video compression/decompression algorithms. Rijndael implements the Rijndael block encryption algorithm [4] using a block size of 128 bits and ten iterations, while N-Queens computes the number of possible arrangements of N queens on an NxN chessboard such that no queen can capture another in one move. (N=12 for the experiments reported here.) DNA uses the dynamic programming algorithm described in [16] to compute the edit distance between two sequences of genetic information, and Dither uses Floyd-Steinberg error diffusion [21] to convert a 1,024x786 pixel image from a 24-bit RGB format into one that can be encoded in 8 bits/pixel.
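The DNA benchmark computes an edit distance by dynamic programming. A minimal scalar reference version of that recurrence (our own simplification for illustration, not the parallelized mapping used on Amalgam or the exact algorithm of [16]):

```python
# Reference edit-distance kernel: classic dynamic-programming recurrence
# over two sequences, keeping only one row of the DP table at a time.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))      # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute / match
        prev = curr
    return prev[-1]

assert edit_distance("GATTACA", "GATTACA") == 0
assert edit_distance("GATTACA", "GACTATA") == 2
```

The anti-diagonals of the DP table are mutually independent, which is the property that makes this kernel a natural fit for a wide reconfigurable array.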

4. Experimental Results

To assess the effectiveness of Amalgam’s architecture, we evaluate the speedup achieved by mapping our applications onto Amalgams with different numbers of programmable and reconfigurable clusters, relative to the performance of the same application on an Amalgam consisting solely of one programmable cluster. Memory system and network parameters are held constant for all experiments.

4.1. Baseline Clustered Processor Performance

As a baseline for our discussion of Amalgam, Figure 4 shows the speedup achieved by parallelizing our benchmarks across multiple programmable clusters. These results illustrate that the clustered programmable processor architecture that Amalgam builds on is very efficient, achieving speedups of between 2.9 and 3.8 on four clusters. Three of our applications, N-Queens, IDCT, and DNA, also see significant improvement between their four-cluster and eight-cluster implementations, achieving total speedups of 7.29, 5.42, and 4.13, respectively. One benchmark, Rijndael, sees very little performance improvement between its four-cluster and eight-cluster versions, and the Dither benchmark actually runs more slowly on eight clusters than on four. In the case of Rijndael, this is due to a lack of parallelism in the 128-bit input to our benchmark, and running the algorithm on a 256-bit or larger input would give significantly better performance. Dither’s low performance on eight programmable clusters is due to contention for the memory system, which can handle four memory requests per cycle in the default Amalgam configuration. Increasing the number of requests that the memory system can handle each cycle to eight produces almost no change in the execution time of the four-cluster programmable version of Dither, but reduces the execution time of the eight-cluster programmable version by 49.4%, supporting the claim that memory bandwidth is the limiting factor in this case.

Figure 4: Speedup Using Programmable Clusters Only (speedup vs. one programmable cluster, plotted against number of clusters for N-Queens, 2D IDCT, Rijndael Encryption, Image Dithering, and DNA Pattern Match)

Figure 5: Speedup Using Reconfigurable Clusters (same axes and benchmarks as Figure 4)

4.2. Performance Using Reconfigurable Clusters

Figure 5 plots the performance of our benchmarks against the number of clusters used, for configurations in which half of the clusters on the chip are programmable, and half reconfigurable. In these experiments, clusters were grouped into pairs containing one programmable and one reconfigurable cluster. Each application was then parallelized across the number of cluster pairs in the architecture, and the work assigned to each cluster pair was divided between the pair’s programmable and reconfigurable cluster to maximize performance. As the figure shows, replacing half of the programmable clusters on an Amalgam with reconfigurable clusters greatly improves performance. Total speedups over the one-cluster programmable version ranged from 7.94 to 21.08 for the eight-cluster configuration consisting of four programmable and four reconfigurable clusters, with the average speedup being 11.78. Speedups over the configuration with eight programmable clusters varied from 1.89 to 3.90 for the different benchmarks, with an average speedup of 2.84.

Table 1 shows the utilization of reconfigurable array resources in each benchmark. (In our experiments, the configuration of the reconfigurable arrays was the same for all versions of a benchmark that made use of the reconfigurable clusters.) Interestingly, resource utilization varies significantly from benchmark to benchmark and does not correlate with the speedup achieved. N-Queens, which achieves the highest performance of any of our benchmarks, uses only 12% of the logic blocks and very few of the wiring resources in each reconfigurable cluster. In contrast, IDCT, whose performance is at the median of our benchmark set, utilizes all of the logic blocks and many of the wiring resources in each reconfigurable cluster, and is, in fact, performance-limited by the amount of logic available in the reconfigurable array. One common trait among all of the benchmarks is that they consume relatively little of the available register file bandwidth, as indicated by the “Read Channels Used” and “Write Channels Used” lines in the table, which show the fraction of the reconfigurable cluster’s registers that are read or written each cycle on average. Similarly, the total number of registers used at any point in each application is relatively small, as shown in the “Total Registers Read” and “Total Registers Written” lines. One possible explanation for this low register bandwidth usage is that our reconfigurable arrays do not contain enough logic blocks to take advantage of all of the register bandwidth provided by our architecture, a conjecture that is supported by the fact that all of our benchmarks use a greater percentage of the logic blocks in each cluster than the register bandwidth. However, given that only one of our benchmarks fully utilizes the logic blocks in each cluster, we believe that experiments with less-structured benchmarks are required before deciding to change the configuration of the reconfigurable arrays.

Table 1: Reconfigurable Cluster Resource Utilization

                          N-Queens  2D IDCT  Rijndael  Dither  DNA
  Logic Blocks               12%      100%      76%      53%    39%
  Horizontal Wires            0%       32%      19%      21%    27%
  Vertical Wires              4%       53%      56%      29%    20%
  Read Channels Used          6%       25%      33%      20%    13%
  Write Channels Used         3%       25%       3%      30%     9%
  Total Registers Read       16%       25%      43%      31%    47%
  Total Registers Written     9%       25%       3%      43%    16%


4.3. Using Register-Based Communication to Improve Performance

In many applications, a substantial fraction of the execution time is devoted to reading and writing flow variables (variables that exist to pass results from one phase of a program to another and are not part of the program’s final output) to and from memory. For example, many implementations of two-dimensional IDCT computations first compute the one-dimensional IDCT of each row in their input matrix and then find the 1D IDCT of each column in the matrix of transformed rows. Because the results of the row transformations are read by a later phase of the program but are not part of the program’s output, they can be classified as flow variables.

On applications that contain substantial numbers of flow variables, clustered processors can significantly improve performance by transforming memory references that access flow variables into inter-cluster communication, sending the results of operations that compute flow variables directly to the clusters that need them instead of storing them in memory to be read later. In addition to reducing the memory bandwidth required by an application, this transformation eliminates any operations that might be required to compute the address of each flow variable, further improving performance.

Figure 6 shows the effect of applying this transformation to IDCT and Rijndael, two applications that have significant numbers of flow variables. On an eight-cluster Amalgam with four reconfigurable and four programmable clusters, the performance of IDCT improves by 18% when this optimization is applied, giving a peak speedup of 12.09 over the one-cluster programmable version. Rijndael sees an improvement of 69% on the same configuration, with a total speedup of 19.52.

Figure 6: Performance With Register-Based Communication (speedup of IDCT and Rijndael vs. one programmable cluster, comparing reconfigurable-only versions with versions that forward flow variables between clusters)

Converting memory-based communication into inter-cluster register writes has the potential to greatly improve the performance of many applications on clustered processors, and we are currently investigating algorithms that apply this transformation as part of compilation. The large number of flow variables in some applications, such as the Dither benchmark described in this paper, is a barrier to this optimization on clustered programmable processors, because the limited size of the register file on each cluster limits the amount of data that can be sent to the cluster between synchronizations. For these programs, we are developing techniques to allocate data buffers/queues in the reconfigurable clusters to reduce synchronization overhead.
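The flow-variable classification can be made concrete with a separable two-pass transform in the style of the 2D IDCT example in Section 4.3. The 1D transform used below is a hypothetical placeholder (prefix sums), not the real IDCT; the point is the intermediate matrix, which is consumed by the second pass but never output.

```python
# Separable two-pass transform: pass 1 transforms rows, pass 2 transforms
# columns of the intermediate result. The intermediate matrix consists
# entirely of flow variables, so a clustered machine can forward it
# cluster-to-cluster through registers instead of spilling it to memory.
# (transform_1d is a stand-in; a real 2D IDCT has the same structure.)

def transform_1d(vec):
    out, acc = [], 0
    for x in vec:            # placeholder 1D transform: prefix sums
        acc += x
        out.append(acc)
    return out

def transform_2d(matrix):
    rows = [transform_1d(r) for r in matrix]            # pass 1: rows
    cols = [transform_1d(list(c)) for c in zip(*rows)]  # pass 2: columns
    return [list(r) for r in zip(*cols)]                # transpose back

assert transform_2d([[1, 1], [1, 1]]) == [[1, 2], [2, 4]]
```

In a memory-based implementation, `rows` would be stored and reloaded between the two passes; the optimization described above instead sends each row result directly to the cluster computing the corresponding column transform.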

5. Related Work

Amalgam is inspired by two separate bodies of previous work: distributed programmable processor architectures and architectures that combine programmable processors with reconfigurable logic. Distributed VLIW processors, such as Multiflow [3], partition their register files among their functional units to reduce register file size and access time. However, the performance of VLIW architectures is limited by the fact that instructions are issued in lockstep across all of the functional units in a processor, making them very sensitive to cache misses and other events that cause instructions to have variable latencies. Clustered processor architectures, such as the M-Machine [7] and Multiscalar [19], address this limitation by dividing the processor’s functional units into independent clusters, which provides greater tolerance for variable-latency operations while still limiting wire lengths and register file sizes.

Previous programmable-reconfigurable processor architectures generally fit into one of two categories, depending on the size of the computations they map onto reconfigurable logic. Fine-grained programmable-reconfigurable processors, such as PRISC [15] and Chimaera [24], integrate small blocks of reconfigurable logic into superscalar processor architectures, treating the reconfigurable logic as programmable ALUs that can be configured to implement application-specific instructions. These systems can achieve better performance than conventional superscalar processors on a wide range of applications by mapping commonly-executed sequences of instructions onto their reconfigurable units, but the maximum speedup they can achieve is limited by the small amount of logic in their reconfigurable units. Coarse-grained programmable-reconfigurable processors, such as REMARC [14], MorphoSys [18], Garp [9] [2], and OneChip [23], provide larger blocks of reconfigurable logic that are less tightly-coupled with the programmable portions of the processor. These architectures can achieve extremely good performance on applications that contain long-running nested loops that can be mapped onto the processor’s reconfigurable arrays, but perform less well on applications that require frequent communication between the programmable and reconfigurable portions of the processor. Systems, such as Pilchard [13], that integrate FPGAs into conventional workstations over the processor’s memory bus display similar behavior, although the relatively low bandwidth of a processor’s memory bus makes them even more sensitive to the amount of communication that an application requires between the processor and the FPGA.

Amalgam’s clustered architecture allows both fine-grained and coarse-grained computations to be mapped into reconfigurable logic.
Placing the reconfigurable arrays in separate clusters from the programmable processors allows them to be large enough to implement coarse-grained computations, while the register-based communication mechanism keeps inter-cluster communication times low enough to allow small regions of code to be profitably mapped onto reconfigurable logic. In addition, the integration of the register file with the reconfigurable array makes it easy to map multiple small functions onto reconfigurable logic, as the function to be performed on any input data is selected by the registers that the data is placed in.

6. Future Directions

The work described in this paper has demonstrated that clustered processor architectures are an effective way to integrate reconfigurable logic with more-conventional programmable processors. We are currently extending this work into the design of memory systems and compilers for clustered programmable-reconfigurable processors. In memory system design, we are studying architectures that integrate local data memories into the clusters to reduce memory latency, including the trade-offs between the hardware complexity required to provide coherency between the local memories and the software/compiler complexity required if the local memories in each cluster are placed under program control. Our compiler research focuses on the integration of techniques for parallel compilation and compilation of high-level languages onto reconfigurable logic.

7. Conclusion

In this paper, we have described the architecture of the Amalgam clustered programmable-reconfigurable processor, which integrates four programmable and four reconfigurable processor clusters onto a single chip. Clusters communicate through a register-based mechanism in which operations on any cluster may write to any cluster’s register file, although clusters may only read from their own register file. Clusters may also communicate through the memory system.

One of the key contributions of this paper is the design of Amalgam’s reconfigurable clusters. To provide maximum register bandwidth, each cluster’s register file is divided into four banks of equal size. Similarly, the reconfigurable array in each reconfigurable cluster is divided into four equal-sized segments, which are interleaved with the register banks in a ring structure. This architecture makes the full contents of each register bank available to the array segment “below” it in the ring, without requiring the reconfigurable array to generate the indices of the registers to be read on each cycle. The logic blocks in each segment of the array may write any of the registers in the bank below them, allowing data to flow through the ring as computation proceeds.

We have implemented five benchmark applications in simulation: N-Queens, an 8x8 IDCT, Rijndael encryption, image dithering, and a DNA pattern matching algorithm. Results from these simulations show that an Amalgam that has four programmable and four reconfigurable clusters achieves an average speedup of 11.78 over an Amalgam with one programmable cluster when inter-cluster communication goes through memory. When the register-based inter-cluster communication mechanism is used to reduce communication latency, average speedup increases to 13.74.
These results suggest that clustered programmable-reconfigurable processors will be able to achieve high performance on a wide range of applications, making them good candidates for both general-purpose computing and embedded applications. Our investigation of these architectures is ongoing, and we are currently exploring compilation techniques for clustered programmable-reconfigurable processors as well as network and memory system designs to maximize performance.

8. Acknowledgements

This work was funded by the Office of Naval Research under award number N00014-01-1-0824.

References

[1] Agarwal, V., Hrishikesh, M. S., Keckler, S. W., and Burger, D., "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," Proceedings of the 27th Annual International Symposium on Computer Architecture, Vancouver, British Columbia, Canada, pp. 248-259, 2000.
[2] Callahan, T. J., Hauser, J. R., and Wawrzynek, J., "The Garp Architecture and C Compiler," IEEE Computer, vol. 33, pp. 62-69, Apr. 2000.
[3] Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K., "A VLIW Architecture for a Trace Scheduling Compiler," International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 180-192, 1987.
[4] Daemen, J. and Rijmen, V., "AES Proposal: Rijndael," Mar. 1999.
[5] DeHon, A., "The Density Advantage of Reconfigurable Computing," IEEE Computer, vol. 33, pp. 41-49, Apr. 2000.
[6] DeHon, A. and Wawrzynek, J., "Reconfigurable Computing: What, Why, and Implications for Design Automation," Proceedings of the 36th Design Automation Conference, pp. 610-615, 1999.
[7] Fillo, M., Keckler, S. W., Dally, W. J., Carter, N. P., Chang, A., Gurevich, Y., and Lee, W. S., "The M-Machine Multicomputer," Proceedings of the 28th International Symposium on Microarchitecture, pp. 146-156, 1995.
[8] Goldstein, S. C., Schmit, H., Moe, M., Budiu, M., Cadambi, S., Taylor, R. R., and Laufer, R., "PipeRench: A Coprocessor for Streaming Multimedia Applications," Proceedings of the 26th International Symposium on Computer Architecture, pp. 28-38, 1999.
[9] Hauser, J. R. and Wawrzynek, J., "Garp: A MIPS Processor With a Reconfigurable Coprocessor," IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 12-21, 1997.
[10] Ho, R., Mai, K. W., and Horowitz, M. A., "The Future of Wires," Proceedings of the IEEE, vol. 89, pp. 490-504, Apr. 2001.
[11] Hrishikesh, M. S., Jouppi, N. P., Farkas, K. I., Burger, D., Keckler, S. W., and Shivakumar, P., "The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays," Proceedings of the 29th International Symposium on Computer Architecture, Anchorage, Alaska, pp. 14-24, 2002.
[12] Keckler, S. W., Dally, W. J., Maskit, D., Carter, N. P., Chang, A., and Lee, W. S., "Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor," Proceedings of the 25th International Symposium on Computer Architecture, pp. 306-317, 1998.
[13] Leong, P. H. W., Leong, M. P., Cheung, O. Y. H., Tung, T., Kwok, C. M., Wong, M. Y., and Lee, K. H., "Pilchard -- A Reconfigurable Computing Platform with Memory Slot Interface," IEEE Symposium on Field-Programmable Custom Computing Machines, 2001.
[14] Miyamori, T. and Olukotun, K., "REMARC: Reconfigurable Multimedia Array Coprocessor," IEICE Transactions on Information and Systems, vol. E82-D, pp. 389-397, Feb. 1999.
[15] Razdan, R. and Smith, M. D., "A High-Performance Microarchitecture With Hardware-Programmable Functional Units," Proceedings of the 27th International Symposium on Microarchitecture, pp. 172-180, 1994.
[16] Sankoff, D. and Kruskal, J., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Reading, Mass.: Addison-Wesley, 1983.
[17] Singh, D. P. and Brown, S. D., "The Case for Registered Routing Switches in Field Programmable Gate Arrays," International Symposium on Field-Programmable Gate Arrays, pp. 161-169, 2001.
[18] Singh, H., Lee, M.-H., Lu, G., Kurdahi, F. J., Bagherzadeh, N., and Chaves Filho, E. M., "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications," IEEE Transactions on Computers, vol. 49, pp. 465-481, May 2000.
[19] Sohi, G. S., Breach, S. E., and Vijaykumar, T. N., "Multiscalar Processors," Proceedings of the 22nd International Symposium on Computer Architecture, pp. 414-425, 1995.
[20] Tsu, W., Macy, K., Joshi, A., Huang, R., Walker, N., Tung, T., Rowhani, O., George, V., Wawrzynek, J., and DeHon, A., "HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array," International Symposium on Field-Programmable Gate Arrays, pp. 125-134, 1999.
[21] Ulichney, R., Digital Halftoning, Cambridge, Mass.: MIT Press, 1987.
[22] Walstrom, J., The Design of the Amalgam Reconfigurable Cluster, Master's Thesis, University of Illinois at Urbana-Champaign, 2002.
[23] Wittig, R. D. and Chow, P., "OneChip: An FPGA Processor With Reconfigurable Logic," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 126-135, 1996.
[24] Ye, Z. A., Moshovos, A., Hauck, S., and Banerjee, P., "CHIMAERA: A High-Performance Architecture With a Tightly-Coupled Reconfigurable Functional Unit," Proceedings of the 27th International Symposium on Computer Architecture, pp. 225-235, 2000.