CONFIGURABLE HARDWARE/SOFTWARE ARCHITECTURE FOR DATA ACQUISITION: IMPLEMENTATION ON FPGA

Marc Bautista-Palacios*, Luis Baldez, Jordi Sempere-Agulló, Francisco Cardells-Tormo and Pep-Lluis Molinet

Atilà Herms-Berenguer

R&D Technology Lab, Digital ASICs Hewlett Packard Company 08174 Sant Cugat del Vallès (Barcelona) email: [email protected]

Department of Electronics University of Barcelona (UB) Marti i Franques 1, 08028 Barcelona, Spain email: [email protected]

ABSTRACT

This paper deals with the FPGA implementation of a configurable hardware/software (HW/SW) architecture for data acquisition (DAQ) systems. The novelty of this paper is a HW/SW architecture which can manage most possible applications in DAQ systems in a very flexible way while meeting the performance targets. This architecture can be dynamically reconfigured and therefore adapted depending on the performance of the I/O devices. We implemented and tested our system with three representative applications for large format printers. These three applications differ in terms of sampling frequency, amount of memory and synchronization complexity. The proposed system can also be integrated in an ASIC implementation.

1. INTRODUCTION

One of the main problems in current data acquisition (DAQ) systems is that applications have different performance constraints. In many cases, the sensor application requires a high sampling frequency, special synchronization schemes and a large amount of data. A general architecture for DAQ systems must fulfill four basic requirements:

- Highly configurable: The system must be suitable for any DAQ application. It can be configured for any serial protocol and therefore for its corresponding device. It has to be able to handle different protocols and devices simultaneously and be extended to new applications.
- High flexibility: The system must be flexible enough to handle different types of synchronization schemes and communication with the host.
- Low latency: Latency is defined as the time elapsed between the theoretical trigger of a task and the moment it really takes place. The design must propose a solution which manages all tasks with low latency (less than 1% of the sampling period).
- Low main CPU bandwidth: The DAQ system is usually a piece of a bigger system which is managed by a main CPU. The implementation must reduce the load on the main CPU (especially the real-time load). The main CPU does not have enough bandwidth to attend to all real-time tasks and hence it does not meet the required performance.

Previous implementations [1] were all hardware based. Although these solutions met the required performance, they lacked flexibility, which prevented generalizing the specific architecture into a generic one. Other investigations, [2] and [3], introduced the idea of a hardware/software architecture, but used an external DSP together with some hardware blocks implemented on an FPGA. The main disadvantage of these solutions was PCB area, because they were implemented with two separate integrated circuits. Another investigation [4] used hardware/software architectures to manage DAQ in virtual instruments. It used reconfigurable hardware to increase the versatility of the virtual instruments as well as their bandwidth. On the one hand, the hardware provides the data acquisition/generation services to the software (real-time tasks). On the other hand, the software manages the human interface of each instrument (non-time-critical tasks). A general problem with current solutions is that they are specific to a few applications. If something changes, they may not meet some of the initial requirements.

* Marc Bautista acknowledges the support of Hewlett-Packard, Inkjet Commercial Division (ICD), R&D Large-Format Technology Lab, in the preparation of his engineering degree thesis. The opinions expressed by the authors are theirs alone.

0-7803-9362-7/05/$20.00 ©2005 IEEE


[Fig. 1. High Level Block Diagram. Blocks shown: External Triggers, SYNC, CPU, SER-DES, I/O DEVICES, MEMORY, MAIN CPU.]

To address these issues, a novel alternative is presented. It is based on a “small and cheap” dedicated embedded CPU together with some hardware blocks which control the DAQ while meeting all the system requirements. We propose a highly configurable embedded system which is portable to any application. It has been implemented on a Virtex-II 6000 FPGA [5] and tested with three different representative applications from the large format printer industry. The rest of the paper is organized as follows. First, the HW/SW architecture is described. Second, we present our case study. Then, based on the case study, we give a short description of the CPU selection and we present the results and a comparative analysis. Finally, the conclusions are presented.

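[Fig. 2. Hardware Architecture Block Diagram. Blocks shown: HOST, SYSTEM BUS, MAILBOX, INTERNAL CPU, INTERRUPT CONTROLLER, SYNC, PROGRAM SRAM, DATA SRAM, MEMORY CONTROLLER, REG OCB, MEM OCB, SER-DES 0 to SER-DES n, I/O DEVICE0 to I/O DEVICEj; grey areas 1 and 2.]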

2. HARDWARE/SOFTWARE ARCHITECTURE

The cost of a mixed hardware/software system based on a standard microprocessor (including a microcontroller or a digital signal processor) depends on the hardware size and the software code size. The most effective way to reduce the hardware size is to implement a given functionality as a program on the microprocessor. However, the software implementation of the functionality sometimes fails to meet the performance requirement. One possible approach to this problem is to identify the critical portion of the program which does not satisfy the performance requirement, and then implement it with hardware components. In this approach, software performance estimation is the key to finding the critical portion of the software implementation.

Based on internal (in-house) projects and previous implementations [4], we did a task-based hardware/software partition. We differentiate two types of tasks: fast and slow. The difference between them is the maximum time allowed for completion, commonly called the ‘deadline’. Most of these tasks are periodic; therefore they have a sampling period, which corresponds to their deadline. We consider as fast the tasks with a sampling period of less than 100 µs, and as slow the tasks with a sampling period greater than 100 µs. Fast tasks, such as synchronization schemes and data serialization, are controlled by hardware implementations. Slow tasks, such as block configuration and system status update, are managed by software.

During the system design there were some HW/SW trade-offs. Some tasks can be managed either in hardware or in software, and each of these trade-offs must be studied individually. If the system reaches the required performance with a software implementation, the task is implemented in software, because software is more flexible and usually the cheaper choice. If the system does not meet the required performance, the task must be implemented in hardware.

Fig. 1 shows a high level block diagram of the implemented system. It is a microprocessor-based system where the Central Processing Unit (CPU) communicates with one synchronization (SYNC) block, which generates synchronization pulses in order to trigger some serializer/deserializer (SER-DES) blocks. The SER-DES blocks communicate with external devices (ADCs …) through serial protocols, and finally they interact with a memory block which contains the acquired data. This subsystem is usually a portion of a bigger system, and therefore it must communicate with the rest of the system. Communication is often with the main system CPU and the external memory block where the memory controller stores data. Moreover, the SYNC block can be triggered by external sources.

The complete HW architecture is depicted in Fig. 2. It contains the four main blocks of the design: the internal CPU, the synchronization block, the SER-DES blocks (grey area 2) and the memory management blocks (grey

area 1). In addition, the system communicates with the rest of the system through mailboxes, and there is an interrupt controller.

The hardware blocks communicate through two On-Chip Buses (OCBs): the register bus and the memory bus. Each bus has only one master, which eliminates the need for a bus arbiter. The CPU carries out all HW block configuration through the register bus. The memory controller carries out the data interface through the memory bus. This division provides one dedicated bus for the internal CPU and another for the data, which increases system performance.

The main function of the SER-DES blocks is to serialize/deserialize data. For data acquisition, a SER-DES block receives serial data through one serial protocol (SPI, I2C …) and makes it available to the data interface. It has two OCB interfaces: one on the register OCB, through which the internal CPU configures it, and another on the memory OCB, through which the memory controller reads the received data. The start of operation can be triggered directly by the SYNC block or by firmware. The block can generate a done or error interrupt to inform the internal CPU about the transaction status. All SER-DES blocks follow the same structure, although every SER-DES block has a different serial protocol interface. Furthermore, depending on the features of the serial protocol, each SER-DES block can manage more than one I/O device.

The memory management block consists of three different blocks: the program SRAM, the data SRAM and the memory controller. The program SRAM contains the program code to be executed by the CPU. The data SRAM is a dual-port SRAM which contains the data and can be read/written by both the CPU and the memory controller. All data is centralized in a single memory block to make the system more configurable. Both SRAMs are implemented on the Virtex-II BRAMs. Finally, the memory controller manages all data read from the SER-DES blocks.
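As an illustration of the SER-DES behavior described above, the following Python sketch models one block with its two bus-facing interfaces. It is a behavioral toy under stated assumptions, not the authors' hardware description: the class name, the `trigger_source` setting, the status values and the way serial data is supplied are all hypothetical.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    DONE = "done"
    ERROR = "error"


class SerDes:
    """Toy model of one SER-DES block (all names are illustrative assumptions)."""

    def __init__(self, device_samples):
        # Stand-in for the serial link: samples the I/O device would send.
        self._device = deque(device_samples)
        self._rx_data = None
        self.status = Status.IDLE
        self.trigger_source = "sync"  # set through the register OCB
        self.interrupts = []          # interrupts raised towards the internal CPU

    # Register OCB side (driven by the internal CPU).
    def configure(self, trigger_source):
        if trigger_source not in ("sync", "firmware"):
            raise ValueError("unknown trigger source")
        self.trigger_source = trigger_source

    # Start of operation: a SYNC pulse or a firmware request.
    def trigger(self, source):
        if source != self.trigger_source:
            return  # the block is not listening to this source
        if self._device:
            self._rx_data = self._device.popleft()
            self.status = Status.DONE
            self.interrupts.append("done")
        else:
            self._rx_data = None
            self.status = Status.ERROR
            self.interrupts.append("error")

    # Memory OCB side (driven by the memory controller).
    def read_data(self):
        data, self._rx_data = self._rx_data, None
        self.status = Status.IDLE
        return data


serdes = SerDes(device_samples=[0x12, 0x34])
serdes.configure("sync")
serdes.trigger("firmware")  # ignored: the block is configured for SYNC pulses
serdes.trigger("sync")      # acquires one sample and raises a 'done' interrupt
first = serdes.read_data()
```

Keeping configuration, triggering and data read-out as three separate entry points mirrors the split between the register bus, the SYNC pulses and the memory bus.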
The memory controller reads received data from the SER-DES blocks and stores it in the data SRAM. Consequently, the memory controller must be an OCB initiator. Furthermore, it is a target of the register OCB to allow configuration by the internal CPU. This memory controller behaves similarly to a DMA initiator. We decided to centralize this function in one dedicated hardware block to make the system more configurable and to reduce latency. Providing one DMA initiator for each SER-DES block is very expensive in terms of hardware utilization, and only a few SER-DES blocks could then be DMA initiators. Also, a dedicated hardware block with a dedicated OCB has less latency than a software solution. The memory controller polls each SER-DES block to know its status before reading new data. A programmable parameter allows inserting a wait time between two consecutive reads.

The interrupt controller receives all possible interrupts that can take place inside the system. It maps all these

[Fig. 3. Simplified Synchronization Block Diagram. Blocks shown: External Triggers, TIMER0 and TIMER1 (each with a Ticks counter and a Loops counter); outputs: Pulses (SER-DES), Int (CPU).]

interrupts to the internal CPU or the main CPU. The main CPU, together with the rest of the system, is symbolized by a box called host. The host is a bus master that can initiate transactions and receive an interrupt from the interrupt controller. In the case of virtual instrumentation, the host would be the PC and the system bus would be PCI.

The SYNC block indicates when the serial blocks have to start an action. These indications are single pulses which are controlled by timers. The number of timers depends on the application. In our implementation, the SYNC block is made of 8 timers; if needed, the user can instantiate more or fewer than 8 timers to generate all the required pulses. Fig. 3 shows a simplified block diagram of a SYNC block which instantiates two timers. Each timer consists of two counters. The ‘Ticks counter’ counts input tick pulses until it reaches a programmable value, and the ‘Loops counter’ counts the number of times the Ticks counter rolls over. The output pulses of all timers can be interconnected, so these pulses can be used as trigger signals together with external trigger sources. This SYNC structure enables a huge variety of synchronization schemes. Finally, the CPU can enable one interrupt for each counter.

The synchronization block has two different connections. The outputs connected directly to the SER-DES blocks are all single output pulses. The CPU configures each SER-DES block to select its related input pulses. These pulses trigger the SER-DES blocks without SW intervention, which reduces both the latency and the CPU bandwidth used. Furthermore, the internal CPU also receives one interrupt from the synchronization block. The CPU enables relevant information interrupts which


usually are not related to fast tasks (start application, end application …).

Finally, the communication with the remainder of the system is through mailboxes, one for commands and another for status. In the command mailbox, the host writes the actions that the DAQ must perform. In the status mailbox, the internal CPU writes the status of the DAQ system. The command mailbox has three external pins which allow starting/stopping the selected application with default values (these pins were very useful when debugging the FPGA implementation because we did not have the host available).

All these HW blocks make the system better than previous implementations in terms of the initial requirements (low latency, high flexibility, low CPU load and highly configurable). Table 1 shows how each block contributes to an architecture that meets all the requirements.

Table 1. Design Improvements Summary.
Columns: ↓ Latency, ↓ Load CPU, ↑ Flexibility, ↑ Configurable. Contributions are listed per HW block.

SYNC
- Direct connection to SER-DES.
- The CPU enables only the interrupts it wants to see.
- The interconnection between timers and counters allows a huge variety of synchronization schemes.
- The variety of synchronization schemes makes the block highly configurable for different applications.

SER-DES
- Direct connection with SYNC.
- The SER-DES structure is reused many times, maintaining a common interface for firmware.

Internal CPU
- It can perform most internal system operations. It reduces the number of main CPU interrupts.
- An internal CPU adds flexibility. Most system features can be programmed by firmware.

Memory Controller
- Complete HW solution.
- It manages all data transactions between SER-DES and data memory.
- It is independent of the number and type of applications.

Mailbox
- It allows the firmware to work at a high abstraction level.
- The meaning of the register fields is defined by firmware.
- The register fields mean different things depending on the application.

3. CASE STUDY

Once we had designed the system architecture, we tested the implementation with 3 representative applications (A, B and C) from large format printers. For confidentiality reasons, the applications are not explained and we only show their requirements. Application A has a sampling period of 500 ms and needs 1 timer and 2 bytes of memory. Application B has a sampling period of 250 µs and needs 3 timers and 32 bytes of memory. Application C has a sampling period of 25 µs and needs 3 timers and 256 KB of memory. All applications use different serial protocols; therefore we used 3 different SER-DES blocks simultaneously.

4. CPU SELECTION

The CPU selection has been carried out together with the hardware architecture definition and the performance requirements. This way we have more accurate parameters for selecting the most appropriate CPU and therefore for finding an optimum system. Table 2 summarizes the relevant features of each embedded CPU. We studied low cost embedded CPUs from different suppliers: Altera, Xilinx, ARM, MIPS and OpenCore. The comparative parameters are: portability to ASIC technology, number of LUTs and FFs, development tools available in-house, and performance parameters: Dhrystone 2.1 in DMIPS/MHz and the maximum clock frequency achieved when implementing the CPU on the FPGA.

Table 2. CPU Selection Table.

Vendor           | Altera [6] | Xilinx [7] | OpenCore [8]  | MIPS [9] | ARM [10]
CPU              | Nios II/e  | MicroBlaze | OpenRisc 1200 | M4K      | ARM7TDMI-S
Portable to ASIC | No         | No         | Yes           | Yes      | Yes
LUTs             | 1070       | 1300       | 6000          | 6950     | 7030
FFs              | 1070       | 1300       | 6000          | 2350     | 2300
Tools in-house   | No         | Yes        | No            | No       | Yes
DMIPS/MHz        | 0.8        | 0.75       | 1.33          | 1.35     | 0.9
Clock Frequency  | 150 MHz    | 150 MHz    | 33 MHz        | 30 MHz   | 33 MHz

All studied embedded CPUs must fulfill some basic parameters. We are looking for a “small and cheap” embedded CPU. Therefore the selected CPU must be low cost: royalty free, low cost license, small size, and with development tools available in-house (no extra license costs). Moreover, the CPU must be a soft core and portable to any ASIC technology, since our target system will be implemented on an ASIC after prototyping on the FPGA. Finally, it has to meet the required performance, and its data width should be compatible with the OCB (32 bits). All CPUs shown in Table 2 are low cost soft cores with a 32-bit data width. Regarding development tools, only the MicroBlaze and ARM development tools are available in-house. Neither the Altera nor the Xilinx processor is easily portable to ASIC. Therefore the selected CPU is the ARM7TDMI-S. In addition to the previous reasons, an ASIC vendor was able to provide a free evaluation version of the selected CPU.
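The selection reasoning above can be restated as a simple filter. The sketch below is only an illustration: the dictionary layout and field names are hypothetical, and the two yes/no criteria are the ones stated in the text (ASIC portability and development tools available in-house).

```python
# Candidate CPUs and the two yes/no criteria discussed in the text.
# The field names are illustrative assumptions, not taken from the paper.
CANDIDATES = {
    "Nios II/e":     {"portable_to_asic": False, "tools_in_house": False},
    "MicroBlaze":    {"portable_to_asic": False, "tools_in_house": True},
    "OpenRisc 1200": {"portable_to_asic": True,  "tools_in_house": False},
    "M4K":           {"portable_to_asic": True,  "tools_in_house": False},
    "ARM7TDMI-S":    {"portable_to_asic": True,  "tools_in_house": True},
}


def select_cpus(candidates):
    """Return the names of the CPUs that satisfy both criteria."""
    return [
        name
        for name, features in candidates.items()
        if features["portable_to_asic"] and features["tools_in_house"]
    ]


selected = select_cpus(CANDIDATES)
print(selected)
```

Size and performance figures are left out on purpose: in the text, the two criteria above are the ones that narrow the choice to a single candidate.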

Table 3. Example table.

Function | Max Time | Measured Time | Brief Description
Task 1   |          |               | The internal CPU must read from data memory each new application data.
  A      | 500 ms   | 278.5 µs      |
  B      | 500 µs   | 278.5 µs      |
  C      | 25 µs    | 278.5 µs      |
Task 2   |          |               | The internal CPU manages the SER-DES trigger and the data saving to data memory.
  A      | 500 ms   | 122.1 µs      |
  B      | 250 µs   | 71.6 µs       |
  C      | 25 µs    | 91.7 µs       |

The memory controller, in its worst case, took only 1.7 µs, while the fastest application reads one data item every 25 µs. The synthesis tool (Synplify Pro) reports that the hardware can achieve a maximum frequency of 37 MHz when implemented on a Virtex-II 6000, speed grade 6. We planned to use the system with a clock frequency of 31.25 MHz. This frequency is fast enough and is supported by the selected CPU.

The worst performance cases were found in the software tasks. We implemented in software all the functions that could be managed by software. Although our firmware can perform all tasks, and meets the performance targets for the slower applications, we verified that the CPU was not able to handle real-time applications that require immediate attention. Table 3 gives a representative summary of the firmware performance results, in terms of the time needed to execute a certain functionality. The first column indicates different functions which can be performed by the CPU. The second column indicates the maximum time allowed to perform these functions, the third column indicates the measured time, and the last column gives a brief description of the task. The firmware meets the performance targets in all functions where the maximum time is greater than the measured time. The functions that do not meet the performance targets are shaded in grey.

These results show that our initial approach of isolating fast events (period less than 100 µs) from firmware works well. Therefore, these results show that our hardware/software partition was appropriate for the designed data acquisition system.
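The pass/fail criterion used for Table 3 (a function meets its target when its maximum time is greater than its measured time) and the 100 µs fast/slow boundary can be checked with a few lines of Python. The values are those listed in Table 3; the data layout and the function names are illustrative assumptions.

```python
FAST_SLOW_BOUNDARY_US = 100.0  # tasks with a shorter period count as fast

# (function, application) -> (max time, measured time), both in microseconds
TABLE_3 = {
    ("Task 1", "A"): (500_000.0, 278.5),
    ("Task 1", "B"): (500.0, 278.5),
    ("Task 1", "C"): (25.0, 278.5),
    ("Task 2", "A"): (500_000.0, 122.1),
    ("Task 2", "B"): (250.0, 71.6),
    ("Task 2", "C"): (25.0, 91.7),
}


def meets_target(max_us, measured_us):
    """Firmware meets the target when the allowed time exceeds the measured time."""
    return max_us > measured_us


def is_fast(max_us):
    """A task is fast when its deadline is below the 100 us boundary."""
    return max_us < FAST_SLOW_BOUNDARY_US


failing = [key for key, (mx, ms) in TABLE_3.items() if not meets_target(mx, ms)]
fast = [key for key, (mx, _) in TABLE_3.items() if is_fast(mx)]
print(failing)
print(fast)
```

With these values, the only entries that miss their target are the two for application C, which are also the only entries under the 100 µs boundary.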

5. ANALYSIS OF RESULTS

The two key parameters are cost and performance, and we must analyze them for both hardware and software. Hardware parameters were measured using both the hardware simulation results (which indicate performance) and the synthesis results (which indicate cost in terms of LUTs and FFs, and performance in terms of maximum clock frequency). Software parameters were measured with the software development tools. The software cost is measured as the number of program and data bytes. We measured the software performance with a very accurate Instruction Set Simulator (ISS); its results differ by less than 1% from the co-simulation results.

The complete system verification has been done by co-simulating both hardware and software with two separate methodologies: a SystemC environment and a CPU model for HDL simulation. In the SystemC environment the embedded CPU is replaced by a Bus Functional Model (BFM). The C code is executed on the PC processor and the BFM translates C functions into bus reads/writes. This co-simulation environment allows verifying the system functionality. The CPU model structure is the most similar to the final implementation. In this environment the embedded CPU is replaced by its CPU model and the memory models are replaced by functional models (FM). This HW/SW co-simulation allows us to verify that our system meets the required performance for many different types of applications.
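To illustrate the Bus Functional Model idea (firmware-level calls translated into bus reads and writes), here is a small Python sketch. It is not the authors' SystemC environment: the class names, the register address and the dictionary-backed register file are hypothetical stand-ins.

```python
class RegisterFile:
    """Stand-in for the hardware side of the register bus."""

    def __init__(self):
        self._regs = {}

    def write(self, addr, value):
        self._regs[addr] = value & 0xFFFFFFFF  # 32-bit data width

    def read(self, addr):
        return self._regs.get(addr, 0)


class BusFunctionalModel:
    """Translates function calls into bus transactions and logs them."""

    def __init__(self, target):
        self._target = target
        self.log = []

    def bus_write(self, addr, value):
        self.log.append(("W", addr, value))
        self._target.write(addr, value)

    def bus_read(self, addr):
        value = self._target.read(addr)
        self.log.append(("R", addr, value))
        return value


TIMER0_PERIOD = 0x0010  # hypothetical register address


def firmware_set_timer_period(bfm, ticks):
    """Firmware-style routine: it only sees bus_read and bus_write."""
    bfm.bus_write(TIMER0_PERIOD, ticks)
    return bfm.bus_read(TIMER0_PERIOD) == ticks


bfm = BusFunctionalModel(RegisterFile())
ok = firmware_set_timer_period(bfm, 1000)
```

Because the firmware routine depends only on the two bus calls, the same routine could be exercised against a different target model without modification, which is the property that makes this style of functional verification convenient.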

5.2. Cost Analysis

The software code size is 18930 bytes. This number includes functions, constants, stack, global variables and internal CPU initialization. All these numbers are provided by the firmware development tool. The design was synthesized using Synplify Pro. Place and route results were obtained using Xilinx ISE 6.2. The synthesis results, in terms of cost, are shown in Table 4. It shows the number of FFs and LUTs, the equivalent ASIC logic gates (considering approximately 1 FF ≈ 10 logic gates and 1 LUT ≈ 6 logic gates) and the cost percentage

5.1. Performance Analysis

After co-simulating the complete system we confirmed, as expected, that the hardware was fast enough and met all the performance requirements: it communicated with the serial I/O devices at the programmed frequency for every application, and it generated the synchronization pulses and delivered them to the SER-DES blocks without latency (only 1 clock cycle