MICPRO 1710 15 May 2007

No. of Pages 10, Model 5+

ARTICLE IN PRESS

Disk Used


Microprocessors and Microsystems xxx (2007) xxx–xxx www.elsevier.com/locate/micpro

Implementation of secure applications in self-reconfigurable systems

I. Gonzalez, S. Lopez-Buedo, F.J. Gomez-Arribas

Escuela Politecnica Superior, Universidad Autonoma de Madrid, Spain

Abstract

In a highly connected world, network security is a must even for embedded systems. However, cryptographic algorithms are computationally intensive and the processors used in FPGA-based embedded systems are known to offer modest performance. In fact, this paper presents a study showing that unless HW acceleration is used, the throughput of secure applications on FPGA-based embedded systems is poor compared with current networking standards. But the multi-algorithm nature of most applications poses many difficulties for classic HW acceleration, particularly large area utilization and difficulty in supporting new algorithms. Fortunately, these problems can be easily solved using partial run-time reconfiguration. This paper proposes an architecture based on self-reconfiguration that allows the implementation of hardware-accelerated secure applications in FPGA-based embedded systems. Cryptographic coprocessors are efficiently deployed without incurring the problems mentioned above and, moreover, without needing any external components. To prove the feasibility of this proposal, a proof-of-concept implementation of the well-known SSH application has been developed on a low-cost commercial device running a standard operating system. © 2007 Published by Elsevier B.V.

Keywords: FPGA; Cryptography; Embedded systems; Reconfigurable computing; Self-reconfiguration

1. Introduction

One of the trends of this increasingly connected world is that every computing appliance, including embedded systems, should have a TCP/IP-enabled networking interface. The benefits of networking are well known: apart from the countless possibilities offered by the Internet, they include remote accessibility and maintainability, the use of distributed computing and storage techniques, etc. However, networking also poses evident security challenges, such as ensuring confidentiality or protecting the systems against malicious accesses. Security issues have to be addressed in particular when the devices are connected to a communication medium as public as the Internet. Fortunately, security is a well-studied problem, and there exists a comprehensive set of network protocols and programming libraries that can be used to develop secure applications. All these protocols and libraries are based on cryptography, which is the underlying technique that supplies the mechanisms necessary to provide the accountability, accuracy and confidentiality required by secure systems [1].

Most secure applications do not rely on a single encryption algorithm, but on a broad collection of ciphering standards. A good example of this is SSH [2], which will be reviewed in Section 2. SSH offers a choice of different symmetric ciphers and hashes to ensure the privacy and integrity of the data. The algorithms to be used are negotiated during the connection establishment phase, which is also encrypted using an asymmetric cipher, that is, with private and public keys. The benefits arising from the multi-algorithmic approach are clear. First, flexibility is improved so that new encryption standards can be easily deployed as soon as they become available. Moreover, if a given encryption standard is broken, then secure applications can be easily reprogrammed to avoid it and keep using the safe ones.

* Corresponding author. E-mail addresses: [email protected] (I. Gonzalez), [email protected] (S. Lopez-Buedo), [email protected] (F.J. Gomez-Arribas).

0141-9331/$ - see front matter © 2007 Published by Elsevier B.V. doi:10.1016/j.micpro.2007.04.001

Please cite this article in press as: I. Gonzalez et al., Implementation of secure applications in self-reconfigurable systems, Microprocess. Microsyst. (2007), doi:10.1016/j.micpro.2007.04.001


In recent years, the capacity of FPGAs has grown exponentially. Programmable logic devices are now able to hold multi-million gate designs, opening up the possibility of implementing a full embedded system on a single FPGA. This is the idea behind the concept of Programmable System-on-Chip (PSoC) [3], an FPGA-based embedded system which may include one or more processors, different communication busses, many peripherals and, naturally, one or more network interfaces. These systems are very competitive in terms of customizability and price for small to medium volumes of production. Nevertheless, PSoCs have some limitations, and one of their major drawbacks is the modest performance of the microprocessors running on them. This disadvantage is especially relevant in soft-core processors (implemented using the general-purpose programmable logic fabric), and less so for their hard-core counterparts (available as blocks embedded in the silicon). Lack of performance is a serious drawback for secure applications, since cryptography is known to be a computationally intensive task.

Fortunately, the performance problems can be very effectively solved in PSoCs using a HW/SW codesign approach. The programmable logic fabric of the FPGA can be used to implement a coprocessor that will accelerate in HW the critical functions causing the poor performance of the SW. Many works have been written about this approach, proving that it provides significant speedups for many different applications [4]. In fact, Sections 3 and 4 present a comprehensive study of HW/SW codesign for the ciphers and hashes commonly used in secure applications. It has been carried out on systems based on two well-known soft-core processors: the proprietary Xilinx MicroBlaze and the open-source LEON2, from Gaisler Research. The results from Section 4 show that HW acceleration is required to attain a performance on a par with current networking standards.

However, a second look at HW acceleration shows that this technique, at least in its conventional form, is not suitable to improve the performance of secure applications. The problem is that if a secure application is going to be HW-accelerated, then all the supported cryptographic algorithms must be ported to HW. The reason is that it is not known a priori which algorithm will be used after the negotiation phase. This requires a significant amount of silicon area, most probably making the approach economically unfeasible. Even if there is enough area available for all the HW accelerators, the problem arises again when a new encryption algorithm becomes available. In an ASIC it would be simply impossible to add a new coprocessor, but in an FPGA this can be done by changing its configuration. Although the reconfiguration may pose some difficulties, the major problem arises from the fact that if new accelerators are added, the area problem will appear again. It is just a matter of time. This paper proposes a solution to this problem using dynamic reconfiguration, based on the fact that secure applications only use one encryption standard at a time.

Instead of implementing all the HW accelerators in the FPGA, the architecture developed in this work reserves an area in the device equal to the one needed by just one of the coprocessors. Depending on the algorithm that is being used at any given moment, that area is reconfigured to place the corresponding accelerator in it. This approach consequently saves a lot of area and thus allows the designer to use a smaller device, reducing both the cost and the static power consumption. The architecture proposed in this paper requires that the reconfigurable area for the coprocessor is changed using partial reconfiguration, to ensure that the rest of the PSoC continues its normal operation. Regular reconfiguration would reset the FPGA to its initial state, causing the embedded processor (and all the applications running on it) to stop abruptly. However, this requirement is not a major problem, since many commercial FPGAs support this mode of operation, most notably all Xilinx devices since the introduction of the Virtex family (1998).

Actually, the architecture presented in this paper goes one step further, as it uses self-reconfiguration to modify the reconfigurable area. That is, the FPGA is partially reconfigured by the processor embedded in it. This is the most compelling approach to dynamic reconfiguration [5,6], and has the advantage of achieving a true system-on-chip solution without needing any external components to handle the reconfiguration process. Section 5 provides an introduction to self-reconfigurable systems, in particular how to build one using commercial FPGAs. An implementation of the proposed architecture in low-cost PSoCs based on Xilinx Spartan-3 FPGAs and the MicroBlaze soft-core processor is also presented in Section 5. Finally, Section 6 describes a proof-of-concept implementation of SSH that is used to validate the methodology proposed in this paper. It is built on a commercial development board from Avnet that features a Spartan-3 FPGA. The PSoC is based on the MicroBlaze soft-core processor and runs the uCLinux operating system, a port of Linux for MMU-less processors.

2. Secure applications

SSH is a protocol that enables secure login connections and file transfers over the Internet or other untrusted networks. Cryptographic algorithms are used to authenticate both ends of the connection, to encrypt all transmitted data, and to protect the data integrity. SSH may also be used to forward X11 connections and any arbitrary TCP/IP port from a remote machine in a secure way. The ultimate goal of SSH is to provide strong security in a way that is transparent to the user. Currently, there are two SSH protocols, SSH1 and SSH2. SSH2 is a complete rewrite of the SSH1 protocol. Among its main differences, SSH2 encrypts different parts of the packet and uses slightly different key exchange algorithms. The SSH2 protocol is considered more secure, as it avoids the known vulnerabilities of the SSH1 implementation.

There are three main components in the SSH protocol: algorithm negotiation, authentication, and data encryption. Algorithm negotiation is mainly responsible for determining the encryption algorithms, compression algorithms and the authentication methods to be used between the client and the server. Authentication is broken into two processes: key exchange (transport layer) and user authentication (user authentication layer). Key exchange has two purposes: it attempts to authenticate the server to the client, and it establishes a shared key which is used as a session key to encrypt all the data being transferred between the two machines, using a symmetric-key algorithm. Additionally, a hash is generated for checking the integrity of the data.

SSH1 offers four encryption algorithms: DES, 3DES, IDEA and Blowfish. SSH2 dropped support for DES (weak algorithm) and IDEA (patent issues), but added three new algorithms: AES (Rijndael), Twofish and CAST. SSH1 also utilized the RSA authentication algorithm, while SSH2 switched to the Digital Signature Algorithm (DSA) [7,8]. These changes were designed both to circumvent intellectual property issues surrounding the use of IDEA and RSA, and to increase the base level of security in SSH2 by utilizing stronger algorithms. Finally, the MD5 or SHA hash algorithms are used for data integrity protection [7,8].

3. Implementation of cryptographic algorithms on FPGA-based processors

The computational burden of cryptographic algorithms is known to be the major bottleneck in the performance of secure applications [4]. In this section, the performance of several algorithms on two different soft-core processors (SCPs) is evaluated: the proprietary MicroBlaze, from Xilinx [9], and the open-source LEON2, from Gaisler Research [10]. Unlike hard-core processors, which are embedded in the silicon of certain high-end FPGA families (like Virtex-II Pro or Virtex-4 FX), soft-core processors come as synthesizable designs that can be targeted to a broad range of programmable devices. Another advantage of SCPs is that the number of processors per FPGA is only limited by the available logic resources, instead of being fixed by the manufacturer (currently there are only devices with at most two hard-core processors). Due to these benefits, SCPs have been chosen for this study.

3.1. Embedded soft-core processors in FPGAs

3.1.1. MicroBlaze

MicroBlaze is a conventional 32-bit RISC Harvard-style SCP. It targets Xilinx FPGA devices, and it is highly parameterizable: its cache, divide unit, barrel shifter, FPU, etc. can be included or not. MicroBlaze systems are created using the Xilinx EDK (Embedded Development Kit), which offers many on-chip peripherals such as UARTs, timers, memory and network controllers, etc. The On-Chip Peripheral Bus (OPB) [9] is used to connect these peripherals. Custom hardware accelerators are usually attached to MicroBlaze using the proprietary Fast Simplex Link (FSL) [11], although the OPB bus can also be used for this purpose at the cost of poorer performance (Fig. 1).

Fig. 1. MicroBlaze processor, FSL interface and busses.

3.1.2. LEON2

LEON2 is a highly configurable VHDL model of a 32-bit SPARC processor. It features a SPARC V8 compliant integer unit with 5-stage pipeline, hardware multiply, divide and MAC units, a Memory Management Unit (MMU), an interface to connect a Floating-Point Unit (like the Meiko FPU from Sun), and a custom coprocessor interface. AMBA-2.0 AHB and APB buses [10] are used to connect the on-chip peripherals to LEON2. The development tool provides UARTs, timers, an interrupt controller, a 16-bit I/O port, etc. as well as more advanced peripherals like memory and network controllers. Fig. 2 shows the MMU-less LEON2 system implemented in this paper.

3.2. SCP systems and algorithm implementation

For each processor, a basic system consisting of the SCP and a minimal set of peripherals was first implemented. The systems were evaluated on the RC1000PP development board from Celoxica. This PCI board includes a Xilinx Virtex-E FPGA (XCV2000E), a programmable clock oscillator and 8 Mbytes of 20 ns SRAM memory. The main characteristics of the systems are reviewed in Table 1. Neither hardware multiplication nor division units were available for MicroBlaze: the former because the XCV2000E device does not include embedded multipliers, the latter because none of the commonly used cryptographic algorithms use this operation [12]. MicroBlaze and LEON2 were configured to use the same cache organization scheme (direct-mapped) with an 8-Kbyte data cache and an 8-Kbyte instruction cache. The implementation tools were the Xilinx EDK 6.2i, ISE 6.2i and Synplify 7.7.1, and the best clock speed that could be obtained in


this environment was 50 MHz for the MicroBlaze architecture and 25 MHz for LEON2. After implementing these basic systems, hardware accelerators were developed for each of the algorithms being studied: the symmetric-key ciphers IDEA [13], DES [14], 3DES [14], Blowfish [15] and AES [16], and the hash algorithms MD5 [17] and SHA-1 [18]. Table 2 shows the most relevant operations that are required by these algorithms. There are many previous works [19–29] about the implementation of these cryptographic standards. These works have been used as a basis for the development of the cores used in this study. Different approaches [12] were followed depending on the characteristics of the algorithm being implemented. For example, IDEA and BLOWFISH use an FSM (Finite-State Machine) to execute the different rounds iteratively. Other implementations, like 3DES and AES, are fully unrolled. Furthermore, some algorithms need additional operations to generate the sub-keys before starting the ciphering or deciphering. Note that the cores were not optimized for performance: the goal was to achieve the same clock frequency as the processor (50 MHz) while maintaining a reasonably low FPGA area usage and a short development time. Table 3 summarizes the characteristics of the cryptographic accelerators that have been developed. All accelerators execute the complete ciphering algorithm in hardware with the only exception of IDEA subkey generation, which is done in software.

The hardware accelerators are connected to the processors using either the MicroBlaze FSL bus or the LEON2 custom coprocessor interface. The peripheral busses (OPB and APB) were not evaluated because they are not well suited to connecting acceleration cores [12,24]. For MicroBlaze, all coprocessors have been implemented with the same interface: one master FSL bus for receiving the key (not used in the hash algorithms) and two pairs of master/slave FSL busses to send/receive data. The control bit of the FSL bus is used to select ciphering or deciphering. All FSL buses are 32 bits wide and have 16-byte FIFOs. On the other hand, LEON2 accelerators have a dedicated FSM to implement the protocol used by the custom coprocessor interface of this processor [10]. A set of 64-bit registers stores the key, the input data and the output data. They have been placed between the coprocessor and the main CPU core and they are accessible directly from the integer unit of the processor. Custom instructions for that interface have been implemented to manage the accelerators.

Fig. 2. LEON2 processor, coprocessor interface and busses.

Table 1
Different architectures proposed and device resources used

System/processor configuration                                FPGA Slices   BRAM   % occupation (XCV2000E)
MicroBlaze: barrel shifter, data cache, instruction cache,    1321          74     6
            memory controller, timer
LEON2:      data cache, instruction cache, debug unit,        3806          40     19
            memory controller, timer

Table 2
Operations used in the analyzed cryptographic algorithms

Algorithm   XOR/AND/OR   Add/Sub mod.   Fixed shift   MULT mod.   MULT GF const.   LUT       Internal block
DES/3DES    •            -              •             -           -                6-to-4    32
IDEA        •            2^16           -             2^16+1      -                -         16
BLOWFISH    •            2^32           -             -           -                8-to-32   32
AES         •            -              •             -           GF(2^8)          8-to-8    32
MD5         •            2^32           •             -           -                -         32
SHA-1       •            2^32           •             -           -                -         32

Table 3
Characteristics of each cryptographic core

Algorithm   Key/block (bits)   Key cycles    Cipher cycles   Slices   BRAM
DES         64/64              0             49              3920     0
3DES        192/64             0             147             11786    0
IDEA        128/64             By software   175             1744     12
BLOWFISH    448/64             21106         37              2496     10
AES         128/128            0             12              4497     0
MD5         0/512              0             384             2412     0
SHA-1       0/512              0             198             2134     0

4. Performance results

The reference systems previously described are used in this section to show how standard, all-software implementations of the cryptographic algorithms compare to hardware-accelerated ones. Processor-agnostic C sources compiled with the GNU tools with default optimizations are used for the all-software test: the goal is to know the minimum performance for the shortest possible development time (no time spent on optimizations or hardware acceleration). The proposed test program measures the execution time consumed by the encryption and consecutive decryption of data blocks (Table 2 shows the block size) until 4 MB of data are processed. Although this mechanism is not suitable for obtaining the best throughput, it more closely resembles the operation of data management subroutines in interactive secure communication applications like SSH. The application code and the data to cipher are stored in external SRAM memory.

For MicroBlaze, Table 4 shows the execution time of each algorithm for full-software implementations and for hardware acceleration using an FSL coprocessor. In the latter case, the speedup ranges from 6 to 80 times. The results obtained for LEON2 are shown in Table 5. The full-software values show that the SPARC architecture provides good performance in cryptographic applications, at least when compared to MicroBlaze. The results for HW acceleration show an improvement over the algorithms running without custom coprocessor, which ranges from 4 to 40 times.

4.1. Results discussion

Fig. 3 compares the results achieved for each processor running the algorithms without coprocessor. In some cases, LEON2 obtains a better throughput in spite of working at half the clock frequency of MicroBlaze. The reasons for this superior performance might be the advanced features of the SPARC architecture and the maturity of its toolchain. It should be noted that the MicroBlaze processor and its C compiler have been designed only recently. However, the performance obtained by both SCPs is not enough to execute secure applications over 10/100 Mbps networks. The same comparison is presented in Fig. 4, where cryptographic cores have been connected to the coprocessor interface of each SCP. The results show that LEON2 and MicroBlaze present similar throughputs when they have

Table 4
Comparison of the different implementations of the algorithms in MicroBlaze (Speedup = Texec_Soft/Texec_FSL)

            MicroBlaze                         MicroBlaze-FSL Coproc.
Algorithm   # of cycles   Execution time (s)   # of cycles   Execution time (s)   Speedup
DES         1787          18.734               135           1.415                13.24
3DES        19085         200.119              233           2.443                81.92
IDEA        4209          44.136               260           2.726                16.19
BLOWFISH    1648          17.277               123           1.290                13.39
AES         7076          37.101               202           1.059                35.03
MD5         5502          7.211                913           1.197                6.02
SHA-1       21978         28.807               754           0.988                29.16
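The two kinds of columns in Table 4 can be related with a short calculation. The sketch below is illustrative only and rests on assumptions that the text does not state explicitly: that "# of cycles" is the cost of encrypting and then decrypting one block, that "4 MB" means 4 × 2^20 bytes, and that the block sizes are those listed in Table 3. The names `TABLE_4` and `execution_time` are made up for this example.

```python
# Assumptions (not stated in the paper): "# of cycles" is the cost of one
# encrypt-then-decrypt pass over a single block, and "4 MB" is 4 * 2**20 bytes.
CLOCK_HZ = 50_000_000        # MicroBlaze clock frequency reported in Section 3.2
DATA_BYTES = 4 * 2**20       # amount of data processed by the test program

# algorithm: (block size in bits from Table 3, software cycles, FSL cycles)
TABLE_4 = {
    "DES":      (64, 1787, 135),
    "3DES":     (64, 19085, 233),
    "IDEA":     (64, 4209, 260),
    "BLOWFISH": (64, 1648, 123),
    "AES":      (128, 7076, 202),
    "MD5":      (512, 5502, 913),
    "SHA-1":    (512, 21978, 754),
}


def execution_time(block_bits: int, cycles_per_block: int,
                   clock_hz: int = CLOCK_HZ) -> float:
    """Seconds needed to process DATA_BYTES at the given per-block cost."""
    blocks = DATA_BYTES // (block_bits // 8)
    return blocks * cycles_per_block / clock_hz


for name, (bits, sw_cycles, hw_cycles) in TABLE_4.items():
    t_soft = execution_time(bits, sw_cycles)
    t_fsl = execution_time(bits, hw_cycles)
    print(f"{name:8s} soft {t_soft:8.3f} s   FSL {t_fsl:6.3f} s   "
          f"speedup {t_soft / t_fsl:6.2f}")
```

Under these assumptions the computed times stay within a fraction of a percent of the values reported in Table 4, which suggests that the cycle counts are per-block figures.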

Fig. 3. Performance comparison of the best architectures without specific hardware core.


Table 5
Comparison of the different implementations of the algorithms in LEON2 (Speedup = Texec_Soft/Texec_CP)

            LEON2                              LEON2-CP Coproc.
Algorithm   # of cycles   Execution time (s)   # of cycles   Execution time (s)   Speedup
DES         974           20.418               85            1.739                11.39
3DES        5062          106.150              183           3.845                27.61
IDEA        2153          45.154               211           4.432                10.19
BLOWFISH    697           14.610               51            1.070                13.65
AES         3413          35.785               86            0.897                39.89
MD5         2889          7.574                610           1.598                4.74
SHA-1       4595          12.045               416           1.091                11.04
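The remark that LEON2 is sometimes faster in software despite its lower clock can be checked directly against the all-software execution times of Tables 4 and 5. The following is a minimal sketch; the dictionary and function names are hypothetical, and the figures are simply copied from the two tables.

```python
# All-software execution times in seconds, copied from Tables 4 and 5.
# Each entry is (MicroBlaze at 50 MHz, LEON2 at 25 MHz).
SOFTWARE_TIME_S = {
    "DES":      (18.734, 20.418),
    "3DES":     (200.119, 106.150),
    "IDEA":     (44.136, 45.154),
    "BLOWFISH": (17.277, 14.610),
    "AES":      (37.101, 35.785),
    "MD5":      (7.211, 7.574),
    "SHA-1":    (28.807, 12.045),
}


def leon2_faster() -> list:
    """Algorithms for which LEON2 finishes the software test first."""
    return [name for name, (microblaze, leon2) in SOFTWARE_TIME_S.items()
            if leon2 < microblaze]


for name in leon2_faster():
    microblaze, leon2 = SOFTWARE_TIME_S[name]
    print(f"{name:8s} LEON2 is {microblaze / leon2:.2f}x faster than MicroBlaze")
```

With these figures LEON2 comes out ahead for 3DES, BLOWFISH, AES and SHA-1, while MicroBlaze is slightly faster for DES, IDEA and MD5.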

Fig. 4. Performance comparison of the best architectures with specific hardware core.


coprocessors available. All algorithms experience a significant speedup when coprocessors are used (Fig. 5), and the performance obtained with this approach is enough to run secure applications efficiently in a 10/100 Mbps local area network. Another interesting variable is the area necessary to implement the different soft-core systems. This number is especially useful when FPGAs are used to implement systems-on-chip. Fig. 6 and Table 6 compare the performance versus area of each coprocessor-based solution. In particular, MicroBlaze is the best solution if area is the most important restriction.

Fig. 5. Speedup obtained using specific hardware cores.

Fig. 6. Comparing performance/area of each coprocessor-based solution.

Table 6
Implementation results of each SCP-based system with the different ciphering cores connected as coprocessor

            MicroBlaze-FSL Coproc.            LEON2-CP Coproc.
Algorithm   Slices   BRAM   % occupation      Slices   BRAM   % occupation
DES         5547     74     29                8146     48     42
3DES        13396    74     70                16008    48     83
IDEA        3365     86     18                6048     48     32
BLOWFISH    4131     84     22                8762     48     46
AES         6576     74     34                8278     48     43
MD5         3322     74     17                6628     48     35
SHA-1       3565     74     19                6250     48     33

5. A self-reconfigurable MicroBlaze-based system on Spartan-3 FPGA

The previous section proved the benefits of HW acceleration to improve the performance of secure applications in FPGA-based embedded systems. However, these results also showed that the area required to accelerate in HW all the necessary cryptographic algorithms can be unaffordable. For example, the logic resources required to implement all FSL accelerator cores plus the MicroBlaze system are 30,729 Slices, that is, 160% of the device chosen for this study (XCV2000E). However, the MicroBlaze system plus the biggest accelerator (3DES) only occupies 13,396 Slices, 70% of the selected FPGA.

The solution proposed in this paper is to load the required coprocessor on demand, as needed by the secure application. As stated in the introduction, this approach not only solves the area problem but also provides HW acceleration with a versatility similar to that of SW. In order to evaluate this possibility, a self-reconfigurable system based on MicroBlaze has been designed. This system is capable of modifying the configuration of the very FPGA on which it is running in order to load (or unload) the coprocessor. MicroBlaze was selected because it uses a smaller area than LEON2, thus making it possible to dedicate more logic resources to the coprocessor implementation. The system was implemented on a Spartan-3 FPGA, which was chosen because of its low cost. The main goal of the work was to minimize the overall price of the system while maintaining its performance. The cost is reduced because reconfiguration allows the area to be optimized, and also because a low-cost FPGA family has been selected.

5.1. Design methodology

Partial reconfiguration is not supported by traditional design flows. However, Xilinx provides an alternative based on Modular Design together with the RECONFIG mode for the AREA_GROUP constraint [30]. Modular Design allows the user to build the final FPGA layout from separate modules (blocks), each located in a rectangular section of the device. The partial reconfiguration support added to Modular Design allows the modification of a module of the circuit while leaving the rest unchanged, and guarantees that the placement and routing of the module being modified will not overlap other modules.
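Sizing the reconfigurable module starts from the area figures quoted earlier in this section, and those figures can be reproduced from Table 6. The sketch below is only an illustration: the slice count of the XCV2000E (19,200) is an assumption that does not appear in the text, although it is consistent with every percentage in Table 6, and all identifiers are hypothetical.

```python
DEVICE_SLICES = 19_200   # assumption: slice count of the XCV2000E

# MicroBlaze system plus one FSL coprocessor, in slices (Table 6).
SYSTEM_WITH_CORE = {
    "DES": 5547, "3DES": 13396, "IDEA": 3365, "BLOWFISH": 4131,
    "AES": 6576, "MD5": 3322, "SHA-1": 3565,
}
ALL_CORES_SLICES = 30_729  # figure quoted in Section 5 for all cores at once


def occupation(slices: int) -> float:
    """Percentage of the device used by a design of the given size."""
    return 100.0 * slices / DEVICE_SLICES


# A single reconfigurable area only has to fit the largest system.
largest = max(SYSTEM_WITH_CORE, key=SYSTEM_WITH_CORE.get)
print(f"all cores at once: {occupation(ALL_CORES_SLICES):.0f}% of the device")
print(f"largest single system ({largest}): "
      f"{occupation(SYSTEM_WITH_CORE[largest]):.0f}% of the device")
```

Under this assumption the static design with every accelerator does not fit in the device, whereas a design sized for the largest coprocessor alone does, which is the motivation for loading coprocessors on demand.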


439 5.2. Reference platform

CO

F

OO

PR

RR E

All Xilinx Virtex FPGAs are partially reconfigurable at run-time [31]. This is also true for their low-cost counterparts, the Spartan families, but with some limitations. For example, the restrictions for the Spartan-3 devices used in this work are that a glitchless reconfiguration is not assured, and also that the minimal configuration unit is a whole CLB column (19 frames) instead a single frame. But these limitations are not a problem for the goal of this work, they only compel the designer to use complete CLB columns (from top to bottom of the chip) for the area occupied by the reconfigurable coprocessors. In Spartan-3 FPGAs, only the JTAG and SelectMap (parallel) configuration interfaces support partial reconfig-

UN

440 441 442 443 444 445 446 447 448 449 450 451 452

uration. These devices lack the internal configuration port specifically designed for self-reconfiguration, the ICAP [5], which is only available on higher-end Xilinx families. However, self-reconfiguration can be accomplished by means of a simple external circuit: it is only needed to add a general purpose I/O (GPIO) to the MicroBlaze embedded processor, and externally connect it to the pins of the parallel configuration interface. Although this procedure solves the run-time partial reconfiguration, the first configuration of the chip, at power-up, still needs to be done. This problem can be solved with an external configuration Flash memory and following the connections recommended by Xilinx [32]. The MicroBlaze system has been implemented on a Spartan-3 Development Board together with the Communications Module, both from Avnet. The latter was used because of 64 MB of SDRAM that it provides. The only modification done to Avnet’s board was to add the external loopback connections for self-reconfiguration, as shown in Fig. 8. Apart from the coprocessor interface, the MicroBlaze system includes 8 KB data and instruction caches, an SDRAM controller, an Ethernet MAC, one timer, one interrupt controller, one UART for console I/O and three

ED

To make external connections between modules, Xilinx provides the bus macro [30], which implements the connections using pairs of tri-state buffers (TBUFs). This component is implemented as a hard macro to prevent the routes going through module boundaries from changing when reimplementing a partially reconfigurable module. Unfortunately, Spartan-3 does not have tri-state buffers, so Xilinx’s bus macros cannot be used. However, the concept behind bus macros can be extended to other FPGA components, following the idea suggested in [31]. In this way, ad-hoc bus macros are created that emulate the buffers with LUTs. While Xilinx’s bus macro is a generic component, the macros developed here are specifically tailored for the MicroBlaze FSL bus. Using this methodology, the system being designed includes two modules, as shown in Fig. 7: The first module is composed of a MicroBlaze processor and its peripherals, and it is located in the left half of the FPGA; and the second module is the coprocessor, placed to the right of the device. The different designs were constructed using the same MicroBlaze module and changing only the coprocessors. Modular design is used for all the steps but the final module assembly, which cannot be completed due to bugs in the version of Xilinx tools being used. Instead, a Perl script working over XDL files was developed as workaround for this problem. Recently, a patch solving these reconfiguration problems has been released by Xilinx for a more recent version of the software (ISE 8.1i).


Fig. 8. Avnet’s board with the external loopback connections for self-reconfiguration.

Fig. 7. Modular Design and Reconfiguration: MicroBlaze system is the module located on the left half, the coprocessor is the one to the right.

Please cite this article in press as: I. Gonzalez et al., Implementation of secure applications in self-reconfigurable systems, Microprocess. Microsyst. (2007), doi:10.1016/j.micpro.2007.04.001


I. Gonzalez et al. / Microprocessors and Microsystems xxx (2007) xxx–xxx

6. A self-reconfigurable implementation of SSH


if (encryption with cipher algorithm)
    reset coprocessor module
    if (hardware cipher algorithm exists)
        load bitstream and dynamically reconfigure coprocessor module
    default:
        use software cipher algorithm
    release coprocessor module
else (decryption)
    hold/keep coprocessor


SSH is the secure application that has been selected to test the proposed architecture, in particular its open-source implementation OpenSSH [33], which is available in the uCLinux distribution [34,35]. Only two minor modifications have been made to its source code: substituting the ciphering/deciphering code with calls to the coprocessors, and adding the routines that manage the reconfiguration process. The two symmetric-key ciphers that are executed in hardware are the default AES128-CBC and the optional 3DES-CBC. Although two algorithms are enough for a proof-of-concept implementation, the same approach could be used for the remaining cryptographic algorithms, including the hash algorithms used for data integrity and the asymmetric-key ciphers used during the session setup.


Table 7 shows the resources used by the different modules: the MicroBlaze system and the two coprocessors. The cores are based on the ones developed in Section 3, but have been redesigned to support different keys for ciphering and deciphering, as required by SSH2, and also to adapt them to the limited area available (for example, 3DES has been implemented as one DES round executed iteratively). However, they keep the same interface described previously: one 32-bit master FSL bus for receiving the keys and two pairs of 32-bit master/slave FSL buses to send/receive data. The OpenSSL library [36] must be changed in order to use the coprocessors instead of the SW implementations of the encryption algorithms. These modifications affect the crypto library, and basically consist of substituting the original encrypt/decrypt code with calls to the functions that interact with the coprocessor. For example, the AES coprocessor encrypt/decrypt functions replace the original functions AES_encrypt() and AES_decrypt() in the function AES_cbc_encrypt() from the aes_cbc.c file. The functions that send the key to the coprocessor are inserted in the AES_set_encrypt_key() function inside the aes_core.c file. Similar changes were carried out for the 3DES algorithm. After completing these changes, the SSH application must also be modified to handle the reconfiguration of the coprocessor needed for the selected ciphering algorithm. The OpenSSH interface to the OpenSSL library includes an initialization function, cipher_init(), that sets the algorithms up. This function has been modified in order to load the bitstream which contains the coprocessor for the selected ciphering algorithm. The following pseudocode shows how reconfiguration is managed during the negotiation phase:


GPIOs: one for the board LEDs and switches, another one for self-reconfiguration and a third one for coprocessor reset. The last GPIO is needed to reset the newly inserted logic, as the global set/reset (GSR) signal cannot be used because it would also initialize the MicroBlaze processor that is executing the application. The proposed solution adds a local reset signal to the reconfigurable logic, which is controlled by the processor. It is asserted before starting the reconfiguration process and remains active until the process has finished, ensuring that the new logic starts in a known state. The final system, Fig. 9, was implemented in an XC3S2000FG676-4C running at 65 MHz.
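The ordering constraint on the local reset can be illustrated with a small software model. The sketch below is not the authors' driver code: FakeGpio, its write() method and the reconfigure() helper are hypothetical names, an active-high reset is assumed, and the real parallel configuration interface involves clock and control pins that are omitted here. It only shows that the reset is asserted first, held while every configuration byte is written, and released last.

```python
class FakeGpio:
    """Hypothetical stand-in for a memory-mapped GPIO; records every write."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def write(self, value):
        self.log.append((self.name, value))


def reconfigure(bitstream, config_gpio, reset_gpio):
    """Write a partial bitstream while the coprocessor is held in reset."""
    reset_gpio.write(1)              # assert local reset before reconfiguration starts
    for byte in bitstream:           # iterating over bytes yields integers
        config_gpio.write(byte)      # one configuration byte per write (simplified)
    reset_gpio.write(0)              # release reset only after the last byte


log = []
reconfigure(b"\xAA\x55", FakeGpio("config", log), FakeGpio("reset", log))
print(log)
```

The recorded log starts and ends with a write to the reset GPIO, so the newly loaded logic never runs before it has been completely configured.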


The first condition checks if the algorithm has already been selected for encryption, in order to avoid reconfiguring the FPGA twice with the same coprocessor. After that,

Fig. 9. MicroBlaze-uCLinux system implementation in Avnet’s board.

Table 7
Logic usage – MicroBlaze and coprocessors

Module              FPGA Slices   LUTs   FF/Latches   18 × 18 Mult   BRAM
MicroBlaze System   4198          4809   3618         3              20
AES coprocessor     4326          4632   1802         0              0
3DES coprocessor    5424          4493   5825         0              0
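As a rough illustration of the area argument behind self-reconfiguration, the slice counts of Table 7 can be combined in two ways: adding up all modules, as a static design holding both coprocessors would need, or adding the MicroBlaze system to the largest coprocessor only. This is a back-of-the-envelope sketch, not a result reported by the authors; it assumes that slice counts are simply additive and ignores bus macros and any routing or floorplanning overhead.

```python
# Slice counts copied from Table 7.
slices = {
    "MicroBlaze system": 4198,
    "AES coprocessor": 4326,
    "3DES coprocessor": 5424,
}

coprocessors = [name for name in slices if name.endswith("coprocessor")]

# Static design: every coprocessor is present at the same time.
static_total = sum(slices.values())

# Self-reconfigurable design: area is reserved for the largest coprocessor only.
reconfig_total = slices["MicroBlaze system"] + max(slices[c] for c in coprocessors)

saved = static_total - reconfig_total
print(static_total, reconfig_total, saved)
print(f"{100.0 * saved / static_total:.1f}% fewer slices under these assumptions")
```

With only two coprocessors the difference equals the size of the smaller core; each additional algorithm supported in hardware would widen the gap, since the reconfigurable region does not grow unless the new core is the largest one.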


In the first part of this paper, a comprehensive study of the implementation of encryption and hashing algorithms in FPGA-based embedded systems has been presented. The results confirm that HW acceleration is required to achieve reasonable performance in the execution of these algorithms, which are the basis of secure applications. In fact, for systems based on either proprietary (MicroBlaze) or open-source (LEON2) processors, the transfer speed of SW-only implementations is at most a few Mbps. This performance is poor when compared to current networking standards. Moreover, it does not even reach one Mbps for complex ciphers like IDEA or 3DES. HW acceleration improves this number by at least one order of magnitude, thus making it possible for most algorithms to break the 10 Mbps barrier. However, this study also shows the drawbacks of taking a classic approach to HW acceleration. The loss of versatility and the substantial area requirements make it unsuitable for low-cost solutions. Here, an architecture based on self-reconfiguration is proposed to overcome these limitations. It not only provides significant advantages from the theoretical point of view, but it can also be implemented on commercial low-cost FPGAs. This architecture is based on a very simple concept: the processor dynamically loads the coprocessor it needs at a given moment using self-reconfiguration (that is, partially changing the bitstream of the very FPGA on which it is running). As it is only necessary to reserve area for one coprocessor, the system can fit in a significantly smaller device. Moreover, new coprocessors can be added at any moment by, for example, loading the new bitstream over the network. This way the system is no longer fixed at design time, so a versatility similar to that of SW is achieved. Finally, a proof-of-concept system has been developed. It shows how the proposed architecture can be successfully used to accelerate a representative secure application (SSH) running on a standard operating system (uCLinux) in a low-cost FPGA (Spartan-3). Speedup results reaching 223% prove the validity of the proposed approach. Better results can be expected from optimized implementations where not only the encryption algorithms but also the hashing ones are accelerated. This is certainly one of the lines of future work arising from this research.


References


After finishing the modifications explained above, it was possible to establish a connection to an SSH server and open a remote shell. However, this test only checked the functionality of the application, not the throughput improvement. Therefore, a file transfer speed test was used to evaluate the speedup due to the HW accelerators. The Secure Copy application (scp) was used for this test. It provides a means of securely transferring files between two machines using the SSH protocol, and also reports the time taken and the throughput reached for each file copied. To carry out this test, the MicroBlaze system was connected to an internal 10 Mbps LAN. In order to minimize the impact of other traffic, only a few other computers, including the SSH server, were connected to this network. Table 8 shows the average results for the different combinations of SSH implementation (HW or SW) and encryption algorithm (AES or 3DES). The results show that the SW implementation of the AES algorithm is more efficient than the 3DES one. However, when hardware coprocessors are used, both algorithms achieve comparable performance, most probably because the speed is now determined by the access to the coprocessor. The speedup obtained with HW acceleration, measured as the ratio of HW to SW throughput in Table 8, ranges from 119% (AES) to 223% (3DES), but the secure file transfer is still significantly slower than the traditional file transfer protocol (FTP). This is mainly caused by the overhead of the hash algorithms used for data integrity (MD5 in this test). It is important to take into consideration that this work has only improved the ciphering process of SSH; each data packet is still processed to verify its integrity using a software implementation of the MD5 algorithm. In order to further improve the performance, it would be necessary to use another reconfigurable coprocessor to accelerate the hash algorithms (MD5 and SHA).
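The speedup figures quoted above can be checked directly against the throughput values in Table 8. The short calculation below is an illustrative sketch rather than part of the original test setup; the variable names are arbitrary, and the percentages are the ratio of HW to SW throughput rounded to the nearest integer.

```python
# Throughput in Kbytes/s, copied from Table 8 (File 1, File 2, File 3).
software = {"AES": (27.8, 48.1, 52.7), "3DES": (16.7, 26.6, 30.3)}
hardware = {"AES": (33.2, 60.4, 66.4), "3DES": (32.0, 60.3, 67.7)}


def ratio_percent(new, old):
    """Throughput of the accelerated case as a percentage of the baseline."""
    return round(100.0 * new / old)


speedup = {
    alg: [ratio_percent(hw, sw) for hw, sw in zip(hardware[alg], software[alg])]
    for alg in software
}

for alg, values in speedup.items():
    print(alg, values)
# AES  [119, 126, 126]
# 3DES [192, 227, 223]
```

The 119% figure corresponds to AES on File 1 and the 223% figure to 3DES on File 3; computed the same way, 3DES on File 2 comes out slightly higher, at about 227%.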


6.1. File transfer test

7. Conclusions


the code identifies the algorithm and reconfigures the FPGA with the suitable coprocessor. Note that although the encryption and decryption algorithms are the same most of the time, SSH2 may use a different algorithm for each operation.
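One possible reading of this decision logic is sketched below as a small software model. It is not the modified cipher_init() itself: CoprocessorManager, select_cipher() and the load_bitstream callback are hypothetical names, the bitstream contents are placeholders, and the software fallback for a decryption cipher that differs from the loaded coprocessor is an assumption, since the text does not specify how that case is handled.

```python
class CoprocessorManager:
    """Tracks which cipher core currently occupies the reconfigurable region."""

    def __init__(self, bitstreams, load_bitstream):
        self.bitstreams = bitstreams          # cipher name -> partial bitstream
        self.load_bitstream = load_bitstream  # callback that performs the reconfiguration
        self.loaded = None                    # cipher whose coprocessor is loaded

    def select_cipher(self, cipher, encrypt):
        """Return "hardware" or "software" for one cipher context."""
        # Only the encryption context may trigger a reconfiguration, and only
        # when the requested coprocessor is not already in the FPGA.
        if encrypt and cipher != self.loaded and cipher in self.bitstreams:
            self.load_bitstream(self.bitstreams[cipher])
            self.loaded = cipher
        # The decryption context keeps whatever coprocessor is loaded.
        # Assumption: any cipher without a loaded core runs in software.
        return "hardware" if cipher == self.loaded else "software"


loads = []
manager = CoprocessorManager({"aes128-cbc": b"AES", "3des-cbc": b"3DES"}, loads.append)
results = [
    manager.select_cipher("aes128-cbc", encrypt=True),    # loads the AES core
    manager.select_cipher("aes128-cbc", encrypt=False),   # keeps it for decryption
    manager.select_cipher("aes128-cbc", encrypt=True),    # no second reconfiguration
    manager.select_cipher("blowfish-cbc", encrypt=True),  # no core available
]
print(results, len(loads))
```

In this model the FPGA is reconfigured once for the whole sequence, which is the behaviour the first condition of the pseudocode is meant to guarantee.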


Table 8
File transfer throughput in Kbytes/s

Protocol       Crypto. algorithm   File 1 (335 KB)   File 2 (2410 KB)   File 3 (8266 KB)
SSH software   AES                 27.8              48.1               52.7
SSH software   3DES                16.7              26.6               30.3
SSH hardware   AES                 33.2              60.4               66.4
SSH hardware   3DES                32.0              60.3               67.7
FTP            None                877.3             696.1              683.2

[1] J. Burke, J. McDonald, T. Austin, Architectural support for fast symmetric-key cryptography, in: Proceedings of the 9th International Conference in Architectural Support for Programming Languages and Operating Systems, 2000, pp. 178–189.
[2] T. Ylonen, SSH – Secure Login Connections over the Internet, in: Proceedings of the 6th USENIX Security Symposium, 1996, pp. 37–42.
[3] J. Becker, Configurable Systems-on-Chip (CSoC), in: 15th Symposium on Integrated Circuits and System Design, 2002, pp. 379–384.
[4] T. Wollinger, J. Guajardo, C. Paar, Cryptography on FPGAs: state of the art implementations and attacks, ACM Transactions on Embedded Computing Systems 3 (3) (2003) 534–574.


Ivan Gonzalez received the Computer Engineering degree (M.Sc.) in 2000 and the Ph.D. degree in Computer Engineering in 2006, both from Universidad Autonoma de Madrid (UAM), Spain. From October 2002 to October 2006 he was Assistant Professor at the Computer Engineering Department of UAM. He is currently a Postdoctoral Researcher in the ECE Department, The George Washington University, Washington, DC. His main research interests are FPGA-based reconfigurable computing applications, with a special focus on dynamic partial reconfiguration and embedded systems. Other interests include high performance reconfigurable computing, robotics and experimental support of C.S. and E.E. education on Internet.


Sergio Lopez-Buedo received his Ph.D. in Computer Engineering in 2003 from Universidad Autonoma de Madrid (Spain), where he currently holds a lecturer position in the area of Computer Architecture. He was a visiting researcher at University of British Columbia (2005) and at The George Washington University (2006). He is also a faculty member of the NSF Center of High Performance Computing at The George Washington University. FPGA technology is his main research interest, especially low-power, high-speed design, self-reconfiguration, high-performance reconfigurable computing and communication applications. Dr. Lopez-Buedo holds more than 50 publications, including journals, conferences and books as editor.


Francisco J. Gomez-Arribas received his Ph.D. from Universidad Autonoma de Madrid (UAM), Spain, in 1996. From October 1996 until November 2000 he was Assistant Professor at the Computer Engineering Department of the UAM. He is currently Professor of Computer Architecture and Parallel Computing courses at the same university. His research interests concern reconfigurable computing applications based on FPGA circuits, with a special focus on the design of multiprocessor systems with reconfigurable architecture. Secondary fields of interest include network computing, cryptographic coprocessors, embedded system-on-a-chip and experimental support of C.S. and E.E. education on Internet.


[28] K. Nadehara, M. Ikekawa, I. Kuroda, Extended instructions for the AES cryptography and their efficient implementation, in: IEEE Workshop on Signal Processing Systems, 2004, pp. 152–157.
[29] F. Rodriguez-Henriquez, N.A. Saqib, A. Diaz-Perez, 4.2 Gbits/s single-chip FPGA implementation of AES algorithm, Electron. Lett. 39 (15) (2003) 1115–1116.
[30] Xilinx Inc., Two Flows for Partial Reconfiguration: Module Based or Difference Based, Application Note 290, 2004, .
[31] M. Dyer, C. Plessl, M. Platzner, Partially Reconfigurable Cores for Xilinx Virtex, LNCS 2438 (2002) 292–301.
[32] Xilinx Inc., Platform Flash In-System Programmable Configuration PROMs, Data Sheet DS123, 2005, .
[33] OpenSSH Project, 2005, Home page: .
[34] uCLinux – Embedded Linux/Microcontroller Project, 2005, Home page: .
[35] Microblaze uClinux Project, 2005, Home page: .
[36] OpenSSL Project, 2005, Home page: .


[5] P. Lysaght, J. Dunlop, Dynamic Reconfiguration of FPGAs, in: International Workshop on Field Programmable Logic and Applications, 1993, pp. 82–94.
[6] B. Blodget, P. James-Roxby, E. Keller, Self-reconfiguring Platform, in: Proceedings of the 13th International Workshop on Field-Programmable Logic and Applications, LNCS 2778 (2003) 565–574.
[7] B. Schneier, Applied Cryptography, 2nd ed., John Wiley and Sons, 1996.
[8] W. Stallings, Cryptography and Network Security: Principles and Practice, 3rd ed., Prentice Hall, 2002.
[9] Xilinx Inc., MicroBlaze Processor Reference Guide, Embedded Development Kit, 2004, .
[10] Gaisler Research, LEON2 Processor User’s Manual, XST edition, version 1.0.24, .
[11] Xilinx Inc., Connecting Customized IP to the MicroBlaze Soft Processor Using the Fast Simplex Link (FSL), Application Note 529, 2004.
[12] I. Gonzalez, F.J. Gomez-Arribas, Ciphering Algorithms in MicroBlaze-based Embedded Systems, IEE Proceedings – Computer and Digital Techniques 153 (2) (2006) 87–92.
[13] X. Lai, J.L. Massey, A Proposal for a New Block Encryption Standard, in: Proceedings of the Workshop on the Theory and Application of Cryptographic Techniques on Advances in Cryptology, 1990, pp. 389–404.
[14] FIPS 46: Data encryption standard, NBS, U.S. Department of Commerce, 1977.
[15] B. Schneier, Description of a New Variable-Length Key, 64-Bit Block Cipher (Blowfish), Fast Software Encryption Cambridge Security Workshop, LNCS 809 (1999) 191–204.
[16] J. Daemen, V. Rijmen, The Design of Rijndael: AES – The Advanced Encryption Standard, 1st ed., Springer-Verlag, 2002.
[17] R.L. Rivest, The MD5 Message Digest Algorithm, RFC 1321, 1992.
[18] NIST, FIPS 180-2: Secure Hash Standard (SHS), 2002. Online: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf.
[19] I. Gonzalez, S. Lopez-Buedo, F.J. Gomez, J. Martinez, Using Partial Reconfiguration in Cryptographic Applications: An implementation of the IDEA Algorithm, in: Proceedings of FPL’03, LNCS 2778 (2003) 194–203.
[20] G. Rouvroy, F.-X. Standaert, J.-J. Quisquater, J.-D. Legat, Design Strategies and Modified Descriptions to Optimize Cipher FPGA Implementations: Fast and Compact Results for DES and Triple-DES, in: Proceedings of FPL’03, LNCS 2778 (2003) 181–193.
[21] F.-X. Standaert, G. Rouvroy, J.-J. Quisquater, J.-D. Legat, Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements and Design Tradeoffs, in: Proceedings of CHES’03, LNCS 2523 (2003) 334–350.
[22] J. Deepakumara, H.M. Heys, R. Venkatesan, FPGA implementation of MD5 hash algorithm, in: Proceedings of the Canadian Conference on Electrical and Computer Engineering, vol. 2, 2001, pp. 919–924.
[23] R. Lien, T. Grembowski, K. Gaj, A 1 Gbit/s Partially Unrolled Architecture of Hash Functions SHA-1 and SHA-512, in: 13th annual RSA Conference, 2004, pp. 324–338.
[24] A. Hodjat, I. Verbauwhede, Interfacing a High Speed Crypto Accelerator to an Embedded CPU, in: Proceedings of the 38th Asilomar Conference on Signals, Systems and Computers, vol. 1, 2004, pp. 488–492.
[25] O. Mencer, M. Morf, M.J. Flynn, Hardware software tri-design of encryption for mobile communication units, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 1998.
[26] S.L.C. Salomao, J.M.S. de Alcantara, V.C. Alves, A.C.C. Vieira, SCOB, a soft-core for the Blowfish cryptographic algorithm, in: Proceedings XII Symposium on Integrated Circuits and Systems Design, 1999, pp. 220–223.
[27] J. Burke, J. McDonald, T. Austin, Architectural Support for Fast Symmetric-Key Cryptography, in: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 178–189.
