Computers and Electrical Engineering 31 (2005) 93–111 www.elsevier.com/locate/compeleceng

A systematic approach to profiling for hardware/software partitioning

M. Finc, A. Zemva *

Faculty of Electrical Engineering, Laboratory for Integrated Circuits Design, University of Ljubljana, Trzaska c. 25, 1000 Ljubljana, Slovenia

Received 13 April 2004; accepted 26 July 2004
Available online 7 April 2005

Abstract

In this paper, we present an efficient approach to profiling for HW/SW partitioning. The execution of arbitrary SW code regions is analyzed with clock-cycle accuracy, without introducing additional profiling-induced performance overhead. Based on the profiling principle, the performance analysis of the initial functional SW description and the performance estimation of various HW/SW partitioning configurations are carried out systematically and iteratively. For an efficient evaluation of different partitioning possibilities, no design or implementation of HW co-processing blocks is necessary. The principle equally covers the simulation and implementation domains, and the approach is highly suitable for embedded soft-core SoPC applications. In order to demonstrate its use, we developed the COMET Profiler tool. The design flow is illustrated with two case studies.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: Hardware/software (HW/SW) co-design; HW/SW partitioning; Field programmable gate array (FPGA); Embedded soft-core processor; Performance analysis; Profiling; System-on-programmable-chip (SoPC)

* Corresponding author. Tel.: +386 14768346; fax: +386 14264630. E-mail address: [email protected] (A. Zemva).

0045-7906/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.compeleceng.2004.07.003


1. Introduction

HW/SW partitioning is the process of distributing the processing load of an application between SW, executed sequentially in a microprocessor, and additional HW co-processors working in parallel with the microprocessor [39,41,42,45,46]. Adequate utilization of parallelism can result in significant improvements in terms of speed and/or energy consumption [17,18,38,40,51]. The balance between time-to-market and application performance often depends on the methodologies and design tools used in the design process. In order to navigate efficiently through the large design space of HW/SW possibilities, it is crucial to make adequate partitioning decisions in the earliest stages of the design process, and such early exploration of the HW/SW design space requires adequate methodologies and tools.

In higher level design approaches, HW/SW partitioning can be explored by abstract behavioral modeling (SystemC [31], OCAPI-xl [47], SpecC [37]) of the targeted application defined in the specification. The advantage is that, besides providing accurate functional descriptions, higher level models are highly portable, since the microprocessor SW code, the co-processing HW parts, the communication and the system architecture are modeled gradually and simultaneously along the iterative design process. The designer is not bound to specific architectures and devices until the functionality is well defined. A further advantage is that parallelism can be explored efficiently with portable automated tools for dataflow graph analysis of the application model descriptions [2,10,14]. In this way, proper system partitioning decisions and functional verification can be carried out before the actual HW/SW design and implementation. On the other hand, the accuracy of the predicted results depends highly on an adequate description of the application models, which are later mapped to the available physical architectures for implementation. Since the high level design process is behaviorally oriented, closing the gap between modeling and final implementation is often non-trivial, and a number of model refinement/analysis iterations are required to adequately migrate the descriptions to the SW and HW implementations.

In contrast, lower level approaches are oriented towards a design process with particular device families and architectures in mind [10,12]. The designer is bound to the available development tools, which often do not satisfy the requirements for efficient HW/SW co-design and partitioning. This offers direct implementation support, but implies limited portability and device coverage. In this implementation oriented design process, it is typically required to implement all modifications before the analysis and verification of the considered partitioning configuration, which significantly increases the design iteration cycle time.

Platform-based design combines aspects of both aforementioned approaches [21]. With a specific range of architectures in mind from the initial design steps, the design space for partitioning is significantly narrowed. The design process must, however, be supported by adequate design platforms and tools which enable optimal exploitation of the considered architectures. In our approach, we rely on systematic profiling adapted to performance analysis and estimation in order to support adequate HW/SW partitioning decisions.
The advantage is that, as in the high level approaches, the design and implementation of additional HW is not necessary for the evaluation of different partitioning decisions. In addition, the profiling approach does not alter the analyzed SW code and does not hinder its execution by introducing SW code overhead for profiling support. Because of their high degree of configurability, soft-core processor SoPC systems are the most appropriate targets for our approach.


Currently, various FPGA platforms offer a high degree of such SoPC integration supporting soft-core processor implementations (e.g. MicroBlaze from Xilinx [52], Nios from Altera [3], Xtensa from Tensilica [43], OpenRISC from OpenCores [30]). Besides the FPGA devices, there is a range of custom configurable SoPC platform architectures targeting specific fields of application (e.g. the Adaptive Computing Machine from QuickSilver [34] and the PicoArray from PicoChip [32]).

2. Related work

Different approaches are available for the profiling and analysis of SW code execution on a microprocessor system [1,4,22]. One SW oriented option is based on emulation or simulation of the processor system on an independent host platform before the physical implementation (e.g. SimOS [5,15,35,50], the Java Virtual Machine profiler [23], Simics [26], Asim [9]). In this approach, accurate performance models of the processing units are required. The accuracy of simulation-based profilers is reflected in the speed of the simulation: the greater the accuracy, the greater the execution time overhead. The benefit is that the SW code and its execution are not altered.

Another approach is based on application code instrumentation, which covers two general principles: constantly tracking or statistically sampling the required SW execution parameters; often a combination of both is available (e.g. SpeedShop [36]). Profilers using the first principle track the SW execution at the function level (e.g. gprof [13]) or at the instruction level (e.g. iprof [22]). Profilers based on the second principle use statistical sampling to collect information during the SW execution (e.g. prof [33], DCPI [4]); they rely on an existing source of interrupts (e.g. timer interrupts) to generate program-counter samples of the required execution parameters, which are statistically evaluated afterwards. The instrumentation oriented approaches introduce a noticeable overhead into the SW execution and influence the behavior of the profiled system. The measured timing results of function/instruction executions therefore offer limited accuracy, depending on the settings, type and principle of the profiler. For the acquisition of non-time-related parameters, SW-based profilers provide better support and are limited in practice only by storage resources, which is often a disadvantage for use in custom embedded systems. Apart from the accurate simulation-based profilers, most profilers are suitable only for the analysis and tuning of SW code for optimal CPU usage.

The simplest HW oriented profiling approach is the use of a logic analyzer for instruction/data bus monitoring; its drawbacks are the high cost, ineffectiveness and limited storage capacity of logic analyzers. Some profiling systems take advantage of additional special on-chip HW for profiling support (e.g. Intel VTune [20,48]). The most basic form is the use of performance counters for a variety of events, which deliver an interrupt when the counters overflow (e.g. Intel Itanium, Intel Pentium Pro, MIPS R10000). Some approaches also sample and record detailed data on the CPU pipeline behavior and latencies (ProfileMe [8,25]), and custom programmable co-processors for run-time profiling are also used [24,28,44,53]. Profilers with additional HW offer better accuracy for time-related performance measurements and CPU execution-related parameters.

In terms of HW/SW partitioning, an efficient performance analysis and estimation of potential partitioning solutions with clock-cycle accuracy is required.


We present a principle and methodology which cover both the simulation and the implementation domain with clock-cycle accuracy. In order to exploit the advantages and avoid the disadvantages of both domains, the choice of their combination can be adapted to the characteristics of the targeted application.

3. Profiling for partitioning

Our methodology is oriented towards analyzing the performance of SW code on a targeted processor, taking into consideration the potential use of additional HW co-processing blocks for optimization. Our main concern is to apply an adequate profiling approach which allows an efficient estimation of the SW code performance without the need to design and implement the considered co-processing HW. Profiling must be non-intrusive for the analyzed parts of the SW code; it must not disturb or influence their execution. This is crucial for obtaining accurate and relevant results, which are the basis for the performance estimation of potential partitioning configurations. Profiling enables the designer to iteratively detect critical parts of the functional SW code and to estimate the exact performance of the optimized design. The optimized design is evaluated against the performance of the SW-only execution, and the partitioning choices are reconsidered. Once the performance requirements are satisfied and the components of the processor system are well defined, the design and implementation of the co-processing HW blocks and the SW code modifications can be carried out.

3.1. Principle

In general, our profiling approach covers two design aspects important for efficient partitioning: performance analysis and performance estimation. In both aspects, the main principle is to monitor the SW code execution in the CPU at the instruction level through an independent interface. The SW code regions for the analysis are inspected at the lowest level, i.e. as instruction address regions. This approach is also upwards compatible and allows the SW regions to be defined at higher levels of programming abstraction, e.g. in the ANSI C language. The SW regions of interest are defined by their starting and ending instruction addresses, and adequate SW code execution monitoring is performed on the basis of these starting/ending points. In this manner, the analysis of SW code with non-linear execution is supported, provided the SW regions are defined accordingly; in principle, the definition of multiple exit possibilities, e.g. conditional execution breaks, is allowed.

In the process of performance estimation, we rely on the clock-cycle accurate profiling results of the performance analysis of the critical parts of the SW code. Taking into account the performance of potential HW co-processing components for optimization, the execution parameters are redefined. For this purpose, at least the count of repetitions and the number of clock cycles spent for the execution of single instructions or blocks of the relevant SW code must be captured by the profiler in the analysis stage, without introducing any additional performance overhead. Considering the redefined execution parameters of the related SW code regions, the profiler then gives the estimated results of the SW code execution in the estimation stage. In the exploration of possible HW optimizations for critical SW parts, the SW code needs to be iteratively adapted and analyzed.
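The estimation step itself reduces to simple arithmetic over the captured parameters. The following minimal sketch (in C, with a hypothetical record layout; the profiler performs this recalculation internally) shows how the total of an enclosing region is re-computed when one inner region is re-timed to an assumed per-iteration HW cycle budget, using the numbers of the division case study of Section 4 (Table 1):

    #include <stdio.h>

    /* Profiled parameters of one SW region (hypothetical record layout). */
    struct region {
        const char   *label;
        unsigned long loops;        /* repetitions captured by the profiler */
        unsigned long total_cycles; /* clock cycles spent in the region     */
    };

    /* Estimate the new total of an enclosing region when one inner region
       is re-timed to hw_cycles clock cycles per iteration. */
    unsigned long estimate(const struct region *outer,
                           const struct region *inner,
                           unsigned long hw_cycles)
    {
        unsigned long saved = inner->total_cycles - inner->loops * hw_cycles;
        return outer->total_cycles - saved;
    }

    int main(void)
    {
        struct region loop1   = { "loop1",   1,   165912 };
        struct region div_cpu = { "div_cpu", 100, 33402  };

        /* Considered HW divider: 35 clock cycles per operation. */
        printf("estimated loop1 total: %lu cycles\n",
               estimate(&loop1, &div_cpu, 35));   /* 136010, cf. Table 1 */
        return 0;
    }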


When additional HW co-processors are used, the execution time is distributed between the data exchange and the HW co-processing. Therefore, to take into account the utilization of true parallelism, including the communication and data exchange overhead of a potential co-processing solution, partial HW implementations (i.e. implementing only the registers of the co-processing memory space) are required. In this way, the performance is estimated with clock-cycle accuracy without the need to design and implement the HW co-processing block itself.

3.2. Coverage

The principle of the profiling approach covers two design domains:

• the simulation domain,
• the implementation domain.

The advantage of working in the simulation domain is that no physical HW platform is required for the implementation of the analyzed processor system. For the verification of the basic functionality, a HW simulator can be used (e.g. ModelSim from Mentor Graphics [27]); afterwards, the performance analysis is carried out. The performance analysis and estimation are performed with the related profiler on the basis of the simulation output of a single simulation cycle. For video and imaging applications, usually one data frame or stream processing cycle has to be simulated. For each critical part of the functional SW code, the designer can redefine the timing execution parameters and estimate the system performance without actually implementing any additional HW components for co-processing. For this purpose, the timing parameters of available hardware intellectual property (IP) components or estimated timing parameters of potential HW solutions can be used. With the profiler, a clock-cycle accurate insight into the performance improvement is obtained, and different partitioning possibilities can be experimented with, without the need to re-simulate the design.

Partitioning in the implementation domain can be performed by implementing a basic targeted processor system configuration on any adequate development board capable of executing the SW code or its functional parts. Instead of analyzing simulation results, the SW code execution is monitored in real-time with an external HW component for profiling. Similarly to the simulation domain, performance estimation of different HW/SW partitioning options can easily be carried out on the basis of the performance analysis results of the executed code.

3.3. Accuracy

The profiling approach offers clock-cycle accuracy for the performance analysis of the execution of instructions in the CPU. Since only the execution stage of the pipeline is monitored by the profiler, the preceding stages of the first instruction and the succeeding stages of the last instruction in the monitored SW region are not taken into account. However, this is irrelevant for the partitioning process: both performance analysis and estimation utilize the same principle, so the results are directly comparable and cover both aspects with the same accuracy. As to the performance estimation, although the results are accurately calculated, they often rely on rough estimations of the timing parameters of potential IP blocks.


The timing parameters of the additional SW code related to the potential IP implementation support also have to be taken into account for adequate accuracy. With iterative fine tuning of these SW parts and with the use of partial implementations, the performance estimation results become closely comparable to the results of the actual implementation. This is demonstrated on practical examples in Section 4.

3.4. COMET Profiler tool

To support the approach, we developed the COMET Profiler tool. It was designed for the performance analysis and estimation of various potential HW/SW partitioning combinations based on the profiling principle described above. Currently, the tool supports profiling of Altera Nios soft-core processor systems in the simulation and implementation domains. The Altera Nios [29] is a highly configurable 32- or 16-bit RISC embedded soft-core processor with a five-stage pipeline and a Harvard architecture, connected to other peripherals via a master/slave Avalon bus. Based on the above principle, our tool performs point-to-point time measurements of the SW code execution in the Nios execution pipeline. The reference points for the time measurements are the first and the last instruction addresses of the blocks of the examined SW code. Since every block is monitored independently, measurements of overlapping blocks and nested loops are supported. The output results are given as clock-cycle counts and repetitions.

In the simulation domain, the SW regions for the analysis are labeled simply by inserting special COMET Profiler directives into the source code. These directives are introduced as ANSI C comments in order not to affect the SW code compilation results; their use is therefore transparent, simple, well defined and non-intrusively detected by the profiling tool. The labels are:

• //#START label_name,
• //#STOP label_name.

The COMET Profiler tool scans the SW source code for start/stop labels and determines the SW regions to be analyzed from a disassembled executable. By examining the simulation results, the executed SW regions are tracked and analyzed. Additional options enable the redefinition of the timing execution parameters of the labeled parts by appending an OPT parameter, followed by a number of execution cycles, to the start mark. The system performance with the redefined parameters can then be estimated with the tool on the same simulation results, without the need to re-simulate the design. An example of an output file with performance analysis and estimation results in the simulation domain is presented in Fig. 1.
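As an illustration of the directive syntax, consider the following hypothetical region (only the //#START and //#STOP comment forms are prescribed by the tool; the surrounding code is ours):

    /* The marks are ordinary ANSI C comments, so the compilation
       results are identical with and without them. */
    long dot_product(const long *coef, const long *sample, int n)
    {
        long acc = 0;
        int i;

        //#START filter_loop
        for (i = 0; i < n; i++)
            acc += coef[i] * sample[i];
        //#STOP filter_loop

        return acc;
    }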


Fig. 1. Example of an output file.

In the implementation domain, an additional HW component is generated according to the CPU configuration and attached to the system. The integration into the Altera Nios system is illustrated in Fig. 2. The COMET Profiler component is directly connected to the internal CPU pipeline registers which buffer the address of the instruction currently processed in the execution pipeline. Based on these values, it tracks the execution of the labeled SW code regions in real-time; in this way, point-to-point time measurements are obtained and loop counting is performed. The internal structure of the COMET Profiler component is illustrated in Fig. 3. It comprises a set of registers (input registers defining the address ranges of the analyzed SW regions and output registers holding the profiling results), a counter for cycle counting, a counter for loop counting, and control logic. The input registers are set during the initialization of the application and the results are usually read after the processing cycle is completed. Accurate profiling and performance analysis are thus performed without interfering with the CPU operation and consequently without introducing any additional overhead into the SW execution.
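The interaction with the component follows a simple set-up/read-out pattern. The register map in the sketch below is hypothetical (the actual layout is defined by the generated component and its Avalon slave interface); it only illustrates the principle that the CPU configures the region addresses once and collects the counters afterwards:

    #include <stdint.h>

    /* Hypothetical register map of one COMET Profiler region slot. */
    #define COMET_BASE 0x2000u
    #define COMET_REG(off)   (*(volatile uint32_t *)(COMET_BASE + (off)))
    #define COMET_START_ADDR COMET_REG(0x0) /* input: first instruction address */
    #define COMET_STOP_ADDR  COMET_REG(0x4) /* input: last instruction address  */
    #define COMET_CYCLES     COMET_REG(0x8) /* output: accumulated clock cycles */
    #define COMET_LOOPS      COMET_REG(0xC) /* output: repetition count         */

    extern void process_frame(void); /* the SW code under analysis */

    void profile_region(uint32_t first_insn, uint32_t last_insn,
                        uint32_t *cycles, uint32_t *loops)
    {
        /* Input registers are set during initialization... */
        COMET_START_ADDR = first_insn;
        COMET_STOP_ADDR  = last_insn;

        process_frame(); /* runs unmodified; all counting happens in HW */

        /* ...and the results are read after the processing cycle completes. */
        *cycles = COMET_CYCLES;
        *loops  = COMET_LOOPS;
    }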


Fig. 2. Nios system architecture.

Fig. 3. COMET Profiler component architecture.

3.5. Design flow

The design flow is presented in Fig. 4. Data processing algorithms used in modern embedded applications are often very complex in structure, but standardized and well defined. For this reason, based on the specification of the application, the designer usually obtains a commercially or freely available verified SW code of the required algorithm, or creates and verifies the code on an available independent development platform (e.g. a PC). In order to evaluate an embedded soft-core processor system, this functional SW code must be provided in a SW language supported by the compilation tools targeting the soft-core processor. Based on the SW code properties, an initial soft-core platform configuration needs to be set up to support the functionality of the SW code. In this step, the designer has to focus on the algorithm evaluation and not yet on the application performance.

The partitioning concept starts with the profiling and performance analysis of the main data processing algorithms in the design. The performance of the final application depends mostly on the execution of the most computationally intensive data processing algorithms, which consume most of the processing cycles. These algorithms usually present the main factor defining the necessary HW platform architectural requirements in terms of adequate processing power and data throughput. For the performance analysis, estimation and evaluation, the designer can choose between the two design directions supported by the COMET Profiler tool, the simulation and/or the implementation domain; for optimal exploitation of the partitioning process, the combined use of both directions is recommended. The performance analysis and evaluation gradually and iteratively lead the designer towards the desired system architecture and an optimal HW/SW configuration.


Fig. 4. Design flow.

The initial functional architecture may significantly differ from the ideal system configuration. When actual modifications of the initial system architecture are performed, the simulation of an algorithm processing cycle and the verification of the functionality need to be repeated. With the described performance analysis and estimation, the system modifications are delayed until the last stage of the design iteration. In the final step, the system is well determined and ready for the final implementation; since all components and their integration are defined, the designer can concentrate on the design and implementation of the co-processing HW blocks.

The disadvantages of simulation-based profiling are the time consuming HW simulation and the need for complex input test models and test-benches, which are unsuitable for complex real-time systems. The great advantage is that the design and use of physical HW are not required and the entire analysis and estimation can be performed on a single simulation cycle. Working in the simulation domain is therefore most suitable for less complex designs and for analyzing only parts of the implemented algorithms in complex systems. It is useful for initial partitioning decisions in the early stages of the design process, when a functional HW development platform is not yet available and the performance needs to be evaluated. The use of the COMET Profiler tool in the implementation domain is highly suitable for complex real-time systems, since it tracks the actual in-system SW code execution and the results are obtained in real-time. The COMET Profiler component consumes additional FPGA resources, whose amount depends on the time range and the number of tracked SW regions.


Compared to the overall soft-core system implementation in the FPGA, this resource overhead is usually insignificant as a result of the simple internal structure of the COMET Profiler component.

4. Case studies

4.1. Simulation domain—division calculation

We demonstrate our profiling approach in the simulation domain on a simple design example targeted at an Altera Nios processor system for an FPGA implementation. The application performs a division of two random numbers in a loop of 100 iterations. The initial configuration of the functional HW system was:

• a 32-bit Altera Nios CPU @ 33.333 MHz (standard configuration),
• 256 KB external SRAM,
• a UART serial port.

The SRAM was used as instruction and data memory; the UART was needed for terminal communication and data exchange. For the system set-up and synthesis, the Altera Nios development toolkit was used. The SW code for the algorithm was written in the ANSI C programming language. In the functional SW description, the division was calculated with the standard ANSI C function div(a, b). The algorithm part of the SW code is shown in Fig. 5. For profiling with the COMET Profiler tool, the SW code was marked with COMET Profiler labels as shown in Fig. 6: start/stop marks labeled loop1 were used for the performance analysis of the entire loop, and marks labeled div_cpu for the analysis of the division calculation. For the system simulation, ModelSim-Altera from Mentor Graphics was used. The results of the COMET Profiler analysis are presented in Table 1. The division calculation (div_cpu) consumed some 20% of the total time of the whole main algorithm (loop1).
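Figs. 5 and 6 survive here only as captions. A minimal reconstruction of the functional code of Fig. 5, assuming rand() as the source of the operands and a printout over the UART terminal (the exact operand generation and output of the original are not preserved):

    #include <stdio.h>
    #include <stdlib.h>

    #define ITERATIONS 100

    int main(void)
    {
        div_t res;
        int i, a, b, sum = 0;

        for (i = 0; i < ITERATIONS; i++) {  /* marked loop1 in Fig. 6 */
            a = rand();
            b = rand() % 1000 + 1;          /* assumed: avoids division by zero */
            res = div(a, b);                /* marked div_cpu in Fig. 6 */
            sum += res.quot + res.rem;
        }
        printf("checksum: %d\n", sum);
        return 0;
    }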

Fig. 5. Functional SW code.


Fig. 6. SW code labeled for the analysis.

Table 1
Performance results for division calculation

                  Functional                Estimated                 Implemented
Labels            loop1       div_cpu       loop1       div_cpu       loop1         div_custom
Sub-labels        div_cpu (20%)  /          div_cpu (2%)   /          div_custom (2%)  /
Loops             1           100           1           100           1             100
Min. cycles^a     /           250           /           35            /             35
Max. cycles^a     /           732           /           35            /             35
Total cycles^a    165,912     33,402        136,010     3,500         135,682       3,500

^a 1 clock cycle = 30 ns.

In the next step, an optional HW co-processing component, which calculates the division quotient in 35 clock cycles, was considered as a replacement for the SW function. Instead of integrating the HW component into the system and re-simulating, we applied performance estimation with the COMET Profiler tool. An implementation in terms of custom instructions was considered: it does not introduce any additional data access overhead and is highly suitable for repetitive operations over two input values. Based on the new timing parameters, the execution timing of the division calculation was redefined with a new parameter (OPT) of 35 clock cycles, as shown in Fig. 7. The estimated execution time of the division calculation was reduced by a factor of about 10 (from 33,402 to 3,500 cycles) and the overall performance was improved by some 18%, as shown in Table 1. As Table 1 also shows, the results of the actual implementation of the division component are almost identical to the results estimated with the COMET Profiler tool; the remaining difference is the result of the compiler optimization (-O2), which is hard to predict. The performance improvement and the comparison of the different stages of the design flow are presented in Fig. 8.
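Fig. 7 likewise survives only as a caption. Following the directive syntax of Section 3.4, the redefinition plausibly takes the following form (a sketch continuing the reconstruction above; only the OPT notation is prescribed by the tool):

    //#START loop1
    for (i = 0; i < ITERATIONS; i++) {
        a = rand();
        b = rand() % 1000 + 1;
        /* re-time the region to the 35 clock cycles of the considered
           custom-instruction divider */
        //#START div_cpu OPT 35
        res = div(a, b);
        //#STOP div_cpu
        sum += res.quot + res.rem;
    }
    //#STOP loop1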


Fig. 7. SW code with redefined execution parameters.

Fig. 8. Result comparison for division calculation (total clock cycles of loop1 and div_cpu in the functional, estimated and implemented stages).

4.2. Implementation domain—JPEG decoder

As a demonstration of the implementation-based design flow, the design of a JPEG decoder application [6,7,49] is illustrated. The initial requirements were the following:

• the JPEG decoder is a stand-alone application (performance of 3 frames/s or more),
• baseline JPEG decoding support [16] for the JFIF and EXIF formats with a 256 × 256 pixel resolution and 8-bit grayscale levels,
• input JPEG images are stored in FLASH memory,
• a special custom VGA video system is used for display [11].

Our application was based on the freely available JPEG coding/decoding SW code of the Independent JPEG Group [19], written in the ANSI C programming language. A simplified representation of the JPEG decoding algorithm structure is illustrated in Fig. 9.


Fig. 9. Simplified JPEG decoding algorithm presentation.
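The stages of Fig. 9 map directly onto the IJG library calls profiled in Table 2. A minimal sketch of the top-level decode loop, assuming the standard IJG API with a stdio data source and omitting error handling and the custom FLASH/VGA plumbing of our system:

    #include <stdio.h>
    #include "jpeglib.h"

    /* Decode one 256 x 256 grayscale baseline JPEG, scanline by scanline. */
    void decode_image(FILE *infile, unsigned char frame[256][256])
    {
        struct jpeg_decompress_struct cinfo;
        struct jpeg_error_mgr jerr;
        JSAMPROW row;

        cinfo.err = jpeg_std_error(&jerr);
        jpeg_create_decompress(&cinfo);
        jpeg_stdio_src(&cinfo, infile);

        jpeg_read_header(&cinfo, TRUE);   /* profiled as jpeg_read_header */
        jpeg_start_decompress(&cinfo);

        while (cinfo.output_scanline < cinfo.output_height) {
            row = frame[cinfo.output_scanline];
            /* entropy decoding (decode_mcu), de-quantization and the IDCT
               (jpeg_idct) of Table 2 all run inside this call */
            jpeg_read_scanlines(&cinfo, &row, 1); /* 256 loops in Table 2 */
        }

        jpeg_finish_decompress(&cinfo);
        jpeg_destroy_decompress(&cinfo);
    }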

In the first step, the SW code was tailored to the requirements of the specified application and verified for functionality on a PC platform. To support application development in the implementation domain, a standard Altera Nios Development Board was utilized as the functional platform. The prototype development board consisted of:

• an Altera APEX20K200E FPGA,
• 1 MB FLASH memory,
• 256 KB SRAM memory,
• 32 MB SDRAM,
• an RS-232 interface,
• external connectors and switches.

The initial configuration of the Nios soft-core processor system was set up as follows:

• Altera Nios CPU @ 33.333 MHz (standard 32-bit, 256 registers, enabled MUL accelerated multiplication),
• 1 KB internal ROM (for the boot monitor),
• 256 KB SRAM controller,
• 32 MB SDRAM controller,
• 1 MB FLASH memory controller (for JPEG image storage),
• UART interface (for terminal and debugging communication),
• video interface component (for communication with the custom VGA output video system),
• COMET component (provided by the COMET Profiler tool).

For the instruction and/or data memory, two different memories were available (the faster SRAM and the larger SDRAM). A SW routine was added to support the FLASH memory file system; JPEG test files were converted accordingly and written into the FLASH memory via the UART. Initially, the data and instruction memory were both placed in the SRAM. The compiled SW code consumed 96 KB, leaving 160 KB of free space for data in the fast SRAM. After synthesis, implementation and functional verification on the development board, the performance of the JPEG decoder was analyzed with the COMET Profiler component. The performance results for the decoding of the standard JPEG test picture "Lena.jpg" are presented hierarchically in the functional section of Table 2. In Table 2, the rows prefixed with + are the analyzed functional sub-parts of the SW function listed above them.


Table 2
JPEG decoder performance

                                 Min (clk^a)  Max (clk^a)  Loops   Total (clk^a)  Time (ms)
JPEG decoder total (functional)  /            /            /       16,979,600     509.39
jpeg_read_header                 /            /            /       482,242        14.47
jpeg_read_scanlines              1,323        959,847      256     15,923,715     477.71
+decode_mcu (entropy dec.)       463          444,874      1,024   3,050,161      91.50
+jpeg_idct                       4,395        15,789       1,024   11,820,130     354.60
+de-quantization                 200          508          8,192   2,573,058      77.19
+idct                            /            /            /       9,078,112      272.34
+idct (1st stage)                97           652          8,192   2,603,942      78.12
+idct (2nd stage)                1,850        6,552        1,024   6,474,170      194.23
display                          /            /            /       454,328        13.63
JPEG decoder total (estimated)   /            /            /       8,169,776      245.10
jpeg_idct (estimated)            /            /            1,024   3,010,306      90.31
+idct (estimated)                262          262          1,024   268,288        8.05
JPEG decoder total (partial)     /            /            /       8,288,156      248.64
jpeg_idct (partial)              /            /            1,024   3,128,686      93.86
+idct (partial)                  378          378          1,024   387,072        11.61
JPEG decoder total (final)       /            /            /       8,093,213      242.80
jpeg_idct (FPGA)                 1,697        3,568        1,024   2,483,312      74.50

^a 1 clock cycle = 30 ns.

The overall performance of the initial system was less than 2 frames/s. Based on the performance analysis results, the next step was to explore HW optimization possibilities for the 2D IDCT algorithm. Our idea was to write the whole 8 × 8 block of de-quantized DCT values directly into the HW co-processor and read back the calculated IDCT results, as shown in Fig. 10. A potential solution, which takes 16 clock cycles to calculate the 2D IDCT, was considered for implementation. The possibilities for parallelism were explored further: it was estimated that a processing window of around 24 clock cycles was available between the last write operation and the first data acquisition from the IDCT component. The estimated parameters are introduced in Fig. 10. The considered IDCT co-processor execution parameters fitted into this time frame. Therefore, the performance of the whole system was reprocessed and the estimation results are presented in the estimated section of Table 2.
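A sketch of the access pattern of Fig. 10, with a hypothetical memory map for the co-processor blocks (addresses, names and the packing of the output samples are our assumptions, loosely based on the partial implementation described below):

    #include <stdint.h>

    /* Hypothetical memory-mapped blocks of the considered IDCT co-processor:
       a 64-word input block for the de-quantized 8 x 8 DCT coefficients and
       a 16-word 32-bit output block (assumed: 64 samples packed 4 per word). */
    #define IDCT_IN  ((volatile int16_t *)0x3000)
    #define IDCT_OUT ((volatile uint32_t *)0x3100)

    void idct_8x8(const int16_t coef[64], uint32_t out[16])
    {
        int i;

        /* Write the whole de-quantized 8 x 8 block to the co-processor. */
        for (i = 0; i < 64; i++)
            IDCT_IN[i] = coef[i];

        /* The component computes the 2D IDCT in 16 clock cycles, which fits
           into the ~24-cycle window between the last write and the first
           read, so no explicit wait is inserted in this sketch. */
        for (i = 0; i < 16; i++)
            out[i] = IDCT_OUT[i];
    }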

Fig. 10. Considered IDCT implementation.


The performance of the jpeg_idct routine was improved by up to a factor of 4. The overall execution time of the JPEG decoder algorithm was reduced by some 50%, which resulted in a frame rate of 4 frames/s. Since the initial requirements were met, no additional modifications were necessary. In the next iteration, a partial implementation of the 2D IDCT co-processor was carried out for a more accurate performance estimation. Only the two memory register blocks (one 64-word 14-bit memory block for the 8 × 8 input DCT coefficients and one 16-word 32-bit memory block for the output values) were used, for the verification of the communication and data exchange overhead. The SW code for the utilization of the new HW component was also added. Based on the results in the partial section of Table 2, the use of the proposed partitioning solution was verified. Based on the partitioning, the final application consisted of:

• Altera Nios CPU (standard 32-bit, 256 registers, enabled MUL accelerated multiplication),
• 256 KB SRAM,
• 1 MB FLASH memory (for JPEG image storage),
• UART (for terminal and debugging communication),
• video interface component (for communication with the custom VGA output video system),
• IDCT component.

This prototype configuration served as a basis for the final device choices and layout decisions. The final JPEG decoder application is illustrated in Fig. 11. The performance results with the actual implementation of the defined IDCT architecture are presented in the final section of Table 2.

Fig. 11. JPEG decoder application.


Fig. 12. JPEG decoder performance comparison.

All the results are graphically illustrated and compared in Fig. 12. Discrepancies between the results of the different estimation/analysis stages are minor (only a few percent) and are mostly a result of data-dependent compiler optimizations (in our case, level -O2). Nevertheless, the performance estimation results serve as an adequately accurate and efficient basis for HW/SW partitioning design space exploration. In this design example, the cycle and loop counters of the COMET component were both configured 25 bits wide (at 30 ns per cycle, a 25-bit counter spans roughly 1 s of execution). The component consumed 6% of the overall FPGA resource usage of the JPEG decoder application (486 logic elements compared to the 8,320 used). For comparison, the gprof profiler included with the Altera Nios platform was considered, but its different principle of operation did not allow adequate use for partitioning. Also, an additional HW peripheral for profiling is provided by Altera for the second generation of Nios processors: profiling is performed by an additional HW component allowing point-to-point time measurements and repetition counting; however, the counters are invoked through macro instructions, which are inserted at the triggering points and influence the execution of the analyzed SW code. The efficiency of our profiling approach is demonstrated on practical examples and supported with quantitative results as a reference for further independent comparisons.

5. Conclusion

In this paper, an efficient approach to profiling for HW/SW partitioning is presented. The methodology and the design flow are introduced, the coverage of the simulation and implementation design domains is supported, and the relevance and accuracy of the results are established. The use of the COMET Profiler tool for profiling Altera Nios soft-core processors is described, and the benefits and disadvantages of the approach are explained. Although the profiling approach to partitioning is lower level and SW oriented, it offers an adequate analysis and estimation of the system performance with remodeled timing parameters of the SW algorithm. As a result, partitioning decisions can be carried out earlier in the design process and the exploration of various partitioning configurations can be performed at a higher level, without the need for an actual implementation. The methodology and principle are expandable, upgradeable and reusable on similar soft-core processors and on processors which allow monitoring of the internal CPU operation.


A significant enhancement to the approach would be the possibility to collect statistical data about the execution of individual instructions from the instruction set of the processor and to track the data exchange overhead in terms of cache misses and pipeline stalls. These additional data would permit an even greater insight into the nature of the critical segments of the application, thus further improving the efficiency of the HW/SW partitioning process.

Acknowledgement

The research was funded by the Ministry of Education, Science and Sport of the Republic of Slovenia through the programme P2-0246, Algorithms and optimization methods in telecommunications.

References

[1] Abenante L. Transparent, low-overhead profiling on modern processors. IEEE Trans Electron Dev 2002;49(2):329–31.
[2] Kaouane L, Akil M, et al. From algorithm graph specification to automatic synthesis of FPGA circuit: a seamless flow of graphs transformations. In: Proceedings of the 13th international conference on field-programmable logic and applications (FPL 2003); 2003. p. 934–43.
[3] Altera Corporation. Available from: http://www.altera.com.
[4] Anderson J, et al. Continuous profiling: where have all the cycles gone? ACM Trans Comput Syst 1997;15(4):357–90.
[5] Bennett JE. Two case studies in latency tolerant architectures. Technical report CSL-TR-94-639. Stanford University, Computer Systems Laboratory; 1994.
[6] Bhaskaran V, Konstantinides K. Image and video compression standards: algorithms and architectures. 2nd ed. Kluwer Academic Publishers; 2000.
[7] CCITT. Information technology: digital compression and coding of continuous-tone still images, requirements and guidelines. Recommendation T.81; 1992.
[8] Dean J, et al. ProfileMe: hardware support for instruction-level profiling on out-of-order processors. In: Proceedings of the 30th international symposium on microarchitecture; 1997. p. 292–302.
[9] Emer J, et al. Asim: a performance model framework. IEEE Comput 2002;35(2):68–76.
[10] Ernst R, Henkel J, Benner T. Hardware–software cosynthesis for microcontrollers. IEEE Des Test Comput 1993:64–75.
[11] Finc M, Trost A, Zemva A. A configurable prototype platform for real time HW/SW video and image processing. In: Proceedings of the embedded world 2003 conference; 2003. p. 673–82.
[12] Finc M, Trost A, Zemva A. HW/SW co-design and implementation of motion detection algorithms in an FPGA device. In: Proceedings of the 38th MIDEM conference; 2002. p. 189–94.
[13] Graham S, Kessler P, McKusick M. gprof: a call graph execution profiler. In: Proceedings of the ACM SIGPLAN symposium on compiler construction; 1982. p. 120–6.
[14] Gupta RK, De Micheli G. Hardware–software co-synthesis for digital systems. IEEE Des Test Comput 1993;10(3):29–41.
[15] Gurumurthi S, et al. Using complete machine simulation for software power estimation: the SoftWatt approach. In: Proceedings of the eighth international symposium on high-performance computer architecture (HPCA'02); 2002. p. 141–50.
[16] Hamilton E. JPEG file interchange format, version 1.02; 1992.
[17] Henkel J. A low power hardware/software partitioning approach for core-based embedded systems. In: Proceedings of the 36th design automation conference; 1999. p. 122–7.


[18] Henkel J, Li Y. Energy-conscious HW/SW-partitioning of embedded systems: a case study on an MPEG-2 encoder. In: Proceedings of the sixth international workshop on hardware/software codesign (CODES/CASHE '98); 1998. p. 23–7.
[19] IJG. Available from: http://www.ijg.org/.
[20] Introduction to microarchitectural optimization for Itanium 2 processors. Reference manual. Intel Corporation; 2004.
[21] Keutzer K, et al. System level design: orthogonalization of concerns and platform-based design. IEEE Trans Comput-Aided Des Circuit Syst 2003;19(12):1523–43.
[22] Kuhn P. Algorithms, complexity analysis and VLSI architectures for MPEG-4 motion estimation. Kluwer Academic Publishers; 1999.
[23] Liang S, Viswanathan D. Comprehensive profiling support in the Java Virtual Machine. In: USENIX conference on object-oriented technologies (COOTS); 1999. p. 229–40.
[24] Lysecky R, Vahid F. A configurable logic fabric for dynamic hardware/software partitioning. In: IEEE/ACM design automation and test in Europe conference (DATE); 2004. p. 10480–5.
[25] Lysecky R, Cotterell S, Vahid F. A fast on-chip profiler memory. In: Proceedings of the 39th design automation conference; 2002. p. 28–33.
[26] Magnusson PS, et al. Simics: a full system simulation platform. IEEE Comput 2002;35(2):50–8.
[27] Mentor Graphics Company. Available from: http://www.mentor.com/.
[28] Narayanasamy S, et al. Catching accurate profiles in hardware. In: Proceedings of the ninth international symposium on high-performance computer architecture (HPCA-9); 2003. p. 269–80.
[29] Nios documentation. Available from: http://www.altera.com/literature/lit-nio.html.
[30] OpenCores. Available from: http://www.opencores.org/.
[31] The Open SystemC Initiative. Available from: http://www.systemc.org.
[32] PicoChip Ltd. Available from: http://www.picochip.com.
[33] prof. Digital Unix manual pages.
[34] QuickSilver Technology Inc. Available from: http://www.qstech.com/.
[35] Rosenblum M, et al. Complete computer simulation: the SimOS approach. IEEE Parallel Distrib Technol 1995;3(4):34–43.
[36] SGI. SpeedShop user's guide, rev. 11. Document number 007-3311-011; 2003.
[37] SpecC System, Center for Embedded Computer Systems, University of California, Irvine. Available from: http://www.ics.uci.edu/~specc/.
[38] Stitt G, Vahid F. Energy advantages of microprocessor platforms with on-chip configurable logic. IEEE Des Test Comput 2002;19(6):36–43.
[39] Stitt G, Vahid F. Binary-level hardware/software partitioning of MediaBench, NetBench, and EEMBC benchmarks. Technical report UCR-CSE-03-01; 2003.
[40] Stitt G, Vahid F. Energy advantages of microprocessor platforms with on-chip configurable logic. IEEE Des Test Comput 2002;19(6):36–43.
[41] Stitt G, Vahid F. Hardware/software partitioning of software binaries. In: International conference on computer aided design (ICCAD 2002); 2002. p. 164–70.
[42] Stitt G, Lysecky R, Vahid F. Dynamic hardware/software partitioning: a first approach. In: IEEE/ACM 40th design automation conference (DAC); 2003. p. 250–5.
[43] Tensilica. Available from: http://www.tensilica.com/.
[44] Vahid F, Gordon-Ross A. A self-optimizing embedded microprocessor using a loop table for low power. In: Proceedings of the 2001 international symposium on low power electronics and design; 2001. p. 219–24.
[45] Vahid F, Givargis T. Platform tuning for embedded systems design. IEEE Comput 2001;34(3):112–4.
[46] Vahid F. The softening of hardware. IEEE Comput 2003;36(4):27–34.
[47] Vanmeerbeeck G, et al. Hardware/software partitioning of embedded systems in OCAPI-xl. In: Proceedings of the ninth international symposium on hardware/software codesign (CODES 2001); 2001. p. 30–5.
[48] VTune environment. Intel Corp. Available from: http://www.intel.com/software/products/vtune/.
[49] Wallace GK. The JPEG still picture compression standard. IEEE Trans Consumer Electron 1992;38(1):18–34.


[50] Witchel E, Rosenblum M. Embra: fast and flexible machine simulation. In: Proceedings of the 1996 ACM SIGMETRICS international conference on measurement and modeling of computer systems; 1996. p. 68–79.
[51] Wolf W. A decade of hardware/software codesign. IEEE Comput 2003;36(4):38–43.
[52] Xilinx Inc. Available from: http://www.xilinx.com.
[53] Zilles C, Sohi G. A programmable co-processor for profiling. In: Proceedings of the 7th international symposium on high-performance computer architecture (HPCA-7); 2001. p. 241–54.

Matjaz Finc received his B.Sc. and M.Sc. degrees in electrical engineering from the University of Ljubljana in 2001 and 2004, respectively. His current research interests include hardware–software co-design, embedded soft-core processors and real-time digital image processing.

Andrej Zemva received his B.Sc., M.Sc. and Ph.D. degrees in electrical engineering from the University of Ljubljana in 1989, 1993 and 1996, respectively. He is an Associate Professor at the Faculty of Electrical Engineering. His current research interests include logic synthesis and optimization, test generation and hardware–software codesign.