FPGA IMPLEMENTATION OF A $GF(2^{4m})$ MULTIPLIER FOR USE IN PAIRING BASED CRYPTOSYSTEMS

Maurice Keller, Tim Kerins, William Marnane
Dept. of Electrical and Electronic Engineering
University College Cork, Cork City, Ireland
email: mauricek,timk,[email protected]

ABSTRACT

In this paper an architecture for $GF(2^{4m})$ multiplication is outlined. It is illustrated how this operation is critical to efficient hardware implementation of the Tate pairing, which itself is the underlying calculation in many new pairing based cryptosystems. Tate pairing calculation times using an FPGA hardware accelerator are estimated based on results from the multiplier architecture.

This research was funded by the Embark Initiative Postgraduate Research Scholarship Scheme from the Irish Research Council for Science, Engineering and Technology (IRCSET).

0-7803-9362-7/05/$20.00 ©2005 IEEE

1. INTRODUCTION

Recent research has seen the development of a number of new pairing based cryptosystems [1]. These cryptosystems are based on the use of bilinear pairings. The most efficiently computable of these is the Tate pairing, which is calculated using the algorithm of Miller [2, 3]. This calculation requires Galois Field multiplication in the extension field $GF(p^{km})$, where $k$ is the security multiplier, whose value depends on the characteristic $p$ of the underlying Galois Field. A popular choice for implementing cryptographic schemes is $p = 2$, as it leads to more efficient implementations of the Galois Field arithmetic. In this case the security multiplier takes the value $k = 4$ [2, 3]. To provide adequate security for pairing based systems, $m$ will be a large prime with $m \approx 250$. Multiplication in the field $GF(2^{4m})$ lies on the critical path of the Tate pairing calculation, so it is essential to be able to perform it efficiently in hardware. This work investigates the feasibility of using an FPGA to implement a hardware based Tate pairing accelerator by examining the performance obtainable from a $GF(2^{4m})$ multiplication architecture on an FPGA.

2. RELATED WORK

Multiplication over $GF(2^m)$ has been well studied in the literature. Bit-parallel and bit-serial architectures are discussed in [4]. The field sizes of interest for evaluating the Tate pairing require $km \approx 1000$ in the extension field, making these architectures too large and too slow respectively. Multiplication over the extension field $GF(2^{4m})$ has been less well studied in the literature. In [5] hardware architectures based on the Karatsuba-Ofman algorithm are proposed. The architectures are targeted at smaller fields suitable for use in coding applications, where in general $km < 100$. In [6] an architecture based on an LFSR is proposed which uses bit-parallel multipliers in the subfield. It is targeted at field sizes suitable for elliptic curve cryptography, where $m \approx 250$ in general ($k = 1$).

Recent work has improved Miller's algorithm to provide faster Tate pairing computations for cryptographic applications [2, 3, 7, 8]. The BKLS/GHS algorithm proposed in [2, 3] has the advantage that a hardware implementation would also be capable of easily performing elliptic curve point manipulations. The authors are unaware of any published hardware implementations of a Tate pairing accelerator for characteristic 2.

3. MATHEMATICAL BACKGROUND

3.1. Elliptic Curves

A supersingular elliptic curve over the field $GF(2^m)$ is defined by the equation:

$$E(GF(2^m)) : y^2 + y = x^3 + x + b, \quad b \in \{0, 1\} \tag{1}$$

The curve in (1) also defines a curve over the field $GF(2^{4m})$, since $GF(2^m)$ is a subfield of $GF(2^{4m})$. The set of points on $E(GF(2^m))$ is defined as the set of all pairs $(x, y)$ with $x, y \in GF(2^m)$ which satisfy (1), together with a special point $\phi$, known as the point at infinity. Addition of two points on $E$ to compute $P_2 = P_0 + P_1$, with $P_i = (x_i, y_i)$, is defined by the chord-and-tangent process. The underlying operations in the algebraic formulae for point addition and doubling, given in (2), are Galois Field addition, multiplication, squaring and division. The equation of the line function $l_1(x, y)$ used in the point addition process is defined in (3).

$$
\begin{aligned}
\lambda &= \begin{cases} \dfrac{y_0 + y_1}{x_0 + x_1}, & P_0 \neq P_1 \\ x_0^2 + 1, & P_0 = P_1 \end{cases} \\
x_2 &= \lambda^2 + x_1 + x_0 \\
y_2 &= \lambda (x_0 + x_2) + y_0 + 1
\end{aligned} \tag{2}
$$

$$l_1(x, y) = y + y_1 + \lambda (x + x_1) \tag{3}$$

Point scalar multiplication of an elliptic curve point $R$ by a scalar $k$ is computed using repeated point additions as per (4).

$$T = [k]R = \underbrace{R + R + \ldots + R}_{k \text{ times}} \tag{4}$$

An elliptic curve point $P$ is said to be an $r$-torsion point if $r$ is the smallest number such that $[r]P = \phi$. The set of all $r$-torsion points of the curve $E(GF(2^m))$ is denoted by $E(GF(2^m))[r]$.

3.2. The Tate Pairing

Let $E$ be a supersingular elliptic curve defined over $GF(2^m)$ as given in (1). The Tate pairing maps two points, $P \in E(GF(2^m))[l]$ and $Q \in E(GF(2^{4m}))[l]$, to an element of the multiplicative group $GF(2^{4m})^*$. It is defined as:

$$\tau(P, Q) = f_P(Q) \tag{5}$$

The Tate pairing is raised to the power $\epsilon = \frac{2^{4m} - 1}{l}$ to obtain a unique value for cryptographic applications. Miller's algorithm is utilised to evaluate the Tate pairing. The BKLS/GHS algorithm presented in Algorithm 1 is an optimised version of Miller's algorithm for computing the Tate pairing where $Q = \varphi(R)$, $R \in E(GF(2^m))$, $Q \in E(GF(2^{4m}))$ for some suitable distortion map $\varphi(x, y)$ as described in [2].

Algorithm 1: BKLS/GHS Algorithm
Input:       $P \in E(GF(2^m))$, $Q \in E(GF(2^{4m}))$, $l = \sum_{i=0}^{t-1} l_i 2^i$
Output:      $\tau(P, Q)$
Initialise:  $f = 1$; $V = P$
Run:         for $i = t - 1$ downto $0$ do
                 $V = 2V$                    /* Point Double */
                 $f = f^2 \cdot l_1(Q)$      /* $GF(2^{4m})$ multiplication */
                 if ($l_i = 1$) then
                     $V = V + P$             /* Point Addition */
                     $f = f \cdot l_1(Q)$    /* $GF(2^{4m})$ multiplication */
                 end if
             end for
return:      $f^{\epsilon}$                  /* $GF(2^{4m})$ Exponentiation */

The cost of evaluating the BKLS/GHS algorithm can be determined in terms of underlying field operations. The costs of elliptic curve point addition and doubling and of the evaluation of $l_1(x, y)$ are given by (6a)-(6c).

$$T_{PDOUB} = T_{MSUB} + 2T_{SSUB} + 6T_{ASUB} \tag{6a}$$
$$T_{PADD} = T_{MSUB} + T_{SSUB} + 7T_{ASUB} + T_{DSUB} \tag{6b}$$
$$T_{l_1} = T_M^* + 3T_A^* = 4T_{MSUB} + 3T_{ASUB} \tag{6c}$$

where $T_{MSUB}$, $T_{SSUB}$, $T_{ASUB}$ and $T_{DSUB}$ refer to the time taken for $GF(2^m)$ multiplication, squaring, addition and division respectively, and $*$ indicates an operation involving an element of $GF(2^m)$ and an element of $GF(2^{4m})$. The number of times the conditional statement in the loop is entered depends on the Hamming weight of $l$, i.e. $Hw(l)$. The simplest method to compute the exponentiation $c = f^{\epsilon}$ is the binary method, with cost $4m\,T_S + Hw(\epsilon)\,T_M$, where $T_M$ and $T_S$ refer to multiplication and squaring in $GF(2^{4m})$ respectively. Neglecting the time required for field additions and squarings (both can be done in one clock cycle in hardware), the overall time taken for an evaluation of the Tate pairing using the BKLS/GHS algorithm is given by (7).

$$T_\tau \approx 5\,[t + Hw(l)]\,T_{MSUB} + Hw(l)\,T_{DSUB} + [t + Hw(l) + Hw(\epsilon)]\,T_M \tag{7}$$

It is noted that, as $l$ and hence $\epsilon$ are known for a particular value of $m$, there is no danger of side channel analysis due to the dependence of the calculation time on their Hamming weight.

3.3. Galois Field Element Representation

Elements of the field $GF(2^m)$ can be represented in a polynomial basis by polynomials of degree at most $m - 1$ in $x$. Arithmetic is performed modulo some degree $m$ irreducible polynomial $f(x)$. Elements of $GF(2^{4m})$ are represented by polynomials of degree at most 3 with coefficients in $GF(2^m)$. Arithmetic in $GF(2^{4m})$ is performed modulo $p(x) = x^4 + x + 1$.

4. HARDWARE

4.1. $GF(2^m)$ Digit-Serial Multiplier Architecture

As will be seen in Section 4.2, a $GF(2^m)$ multiplier is required in order to implement $GF(2^{4m})$ multiplication using the Karatsuba algorithm. The $GF(2^m)$ multiplication architecture is required to compute $c(x) = a(x)b(x) \bmod f(x)$ for $a, b, c \in GF(2^m)$. A digit-serial multiplier architecture as described in [9] was chosen. The $m$-bit multiplicand $b$ is processed $D$ bits at a time, so the architecture computes the product in $n = \lceil m/D \rceil$ clock cycles. Varying the digit size $D$ allows an area/speed tradeoff to be explored: a digit size of 1 results in a bit-serial architecture, whereas a digit size of $m$ results in a bit-parallel architecture. The calculation time of the architecture is $(n + 2)T_{CLK}$, where $T_{CLK}$ is the clock period.
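The digit-serial scheme of Section 4.1 can be illustrated in software. The following is a minimal sketch, not the architecture of [9]: it assumes the digits of $b$ are consumed most significant digit first, and it uses a toy field $GF(2^8)$ with the polynomial $x^8 + x^4 + x^3 + x + 1$ (an assumption chosen only because its products are easy to check; the paper's fields have $m = 241, 283$). The function names are hypothetical.

```python
from math import ceil

# Toy modulus x^8 + x^4 + x^3 + x + 1 (assumption: small example field,
# not one of the field polynomials used in the paper).
TOY_POLY = 0x11B


def times_x(c, f, m):
    """Return c(x) * x mod f(x), where f(x) has degree m."""
    c <<= 1
    if (c >> m) & 1:
        c ^= f
    return c


def mul_bit_serial(a, b, f, m):
    """Reference product a(x) * b(x) mod f(x), one bit of b per step."""
    c = 0
    for i in reversed(range(m)):
        c = times_x(c, f, m)
        if (b >> i) & 1:
            c ^= a
    return c


def mul_digit_serial(a, b, f, m, D):
    """Product a(x) * b(x) mod f(x), consuming b in D-bit digits.

    Digits are taken most significant first (an assumption of this sketch).
    Returns the product and the number of digit iterations n = ceil(m / D).
    """
    n = ceil(m / D)
    mask = (1 << D) - 1
    c = 0
    for k in reversed(range(n)):
        digit = (b >> (k * D)) & mask
        for _ in range(D):              # c = c * x^D mod f
            c = times_x(c, f, m)
        partial = 0                     # partial = a * digit mod f
        for j in reversed(range(D)):
            partial = times_x(partial, f, m)
            if (digit >> j) & 1:
                partial ^= a
        c ^= partial
    return c, n
```

In hardware the inner loops would be unrolled into combinational logic so that each digit costs one clock cycle; the sketch counts only those $n$ iterations and not the two additional cycles included in $(n + 2)T_{CLK}$. For example, `ceil(241 / 8)` gives the $n = 31$ iterations reported for $m = 241$, $D = 8$.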

4.2. $GF(2^{4m})$ Karatsuba Multiplier

To perform $GF(2^{4m})$ multiplication using the representation described in Section 3.3, it is required to multiply two degree 3 polynomials with coefficients in $GF(2^m)$ to compute $c = ab \bmod p(x) \in GF(2^{4m})$. This computation can be divided into two stages. The first stage is to compute the degree 6 product $\gamma = ab$. The naive method to compute $\gamma(x)$ is to perform polynomial multiplication, i.e. $\gamma = ab = a_3 x^3 b(x) + a_2 x^2 b(x) + a_1 x b(x) + a_0 b(x)$. This method would require 16M and 9A, where M and A refer to $GF(2^m)$ multiplication and addition respectively. It is noted at this point that $GF(2^m)$ addition is simply a bitwise XOR of the elements. It is therefore virtually free in hardware, requiring only $m$ XOR gates and having a delay equal to the propagation delay of one XOR gate. Karatsuba multiplication [10] allows multiplications to be traded for additions. Using Karatsuba to compute $\gamma(x)$, a set of partial products is first computed in 9M and 10A as per (8). The coefficients of $\gamma(x)$ are then given in 12A by (9).

$$
\begin{aligned}
q_0 &= a_0 b_0 & q_4 &= (a_0 + a_1)(b_0 + b_1) \\
q_1 &= a_1 b_1 & q_5 &= (a_0 + a_2)(b_0 + b_2) \\
q_2 &= a_2 b_2 & q_6 &= (a_1 + a_3)(b_1 + b_3) \\
q_3 &= a_3 b_3 & q_7 &= (a_2 + a_3)(b_2 + b_3) \\
    &          & q_8 &= \left(\sum_{i=0}^{3} a_i\right)\left(\sum_{i=0}^{3} b_i\right)
\end{aligned} \tag{8}
$$

$$
\begin{aligned}
\gamma_0 &= q_0 \\
\gamma_1 &= q_0 + q_1 + q_4 \\
\gamma_2 &= q_0 + q_1 + q_2 + q_5 \\
\gamma_3 &= \sum_{i=0}^{8} q_i \\
\gamma_4 &= q_1 + q_2 + q_3 + q_6 \\
\gamma_5 &= q_2 + q_3 + q_7 \\
\gamma_6 &= q_3
\end{aligned} \tag{9}
$$
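The first stage in (8) and (9) can be checked against the naive product in a few lines. The sketch below is illustrative only: it covers the product $\gamma = ab$ and not the reduction modulo $p(x)$, and it assumes a toy subfield $GF(2^3)$ with $f(x) = x^3 + x + 1$ in place of the paper's $GF(2^m)$. The names `gf_mul`, `karatsuba_gamma` and `schoolbook_gamma` are hypothetical.

```python
def gf_mul(a, b, f=0b1011, m=3):
    """Toy subfield multiplication in GF(2^3) mod x^3 + x + 1 (assumption)."""
    c = 0
    for i in reversed(range(m)):
        c <<= 1
        if (c >> m) & 1:
            c ^= f
        if (b >> i) & 1:
            c ^= a
    return c


def karatsuba_gamma(a, b, mul=gf_mul):
    """Coefficients gamma_0..gamma_6 of a(x) * b(x) using 9 multiplications.

    a and b are lists [a0, a1, a2, a3]; subfield addition is XOR.
    """
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    # Partial products, equation (8)
    q0 = mul(a0, b0)
    q1 = mul(a1, b1)
    q2 = mul(a2, b2)
    q3 = mul(a3, b3)
    q4 = mul(a0 ^ a1, b0 ^ b1)
    q5 = mul(a0 ^ a2, b0 ^ b2)
    q6 = mul(a1 ^ a3, b1 ^ b3)
    q7 = mul(a2 ^ a3, b2 ^ b3)
    q8 = mul(a0 ^ a1 ^ a2 ^ a3, b0 ^ b1 ^ b2 ^ b3)
    # Product coefficients, equation (9)
    return [
        q0,
        q0 ^ q1 ^ q4,
        q0 ^ q1 ^ q2 ^ q5,
        q0 ^ q1 ^ q2 ^ q3 ^ q4 ^ q5 ^ q6 ^ q7 ^ q8,
        q1 ^ q2 ^ q3 ^ q6,
        q2 ^ q3 ^ q7,
        q3,
    ]


def schoolbook_gamma(a, b, mul=gf_mul):
    """Naive product using 16 subfield multiplications, for comparison."""
    g = [0] * 7
    for i in range(4):
        for j in range(4):
            g[i + j] ^= mul(a[i], b[j])
    return g
```

Because the identities in (9) hold in any commutative ring of characteristic 2, the same function works unchanged if `mul` is replaced by a multiplier for a larger subfield.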

The next stage of the multiplication process is to compute $c(x) = \gamma(x) \bmod p(x)$, where $p(x) = x^4 + x + 1$ as given in Section 3.3. The reduction is performed by multiplying the coefficient vector of $\gamma(x)$ by the reduction matrix $R$.

$$
R = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 1
\end{pmatrix} \tag{10}
$$

This yields the coefficients of $c(x)$ in a further 6A, leading to a total of 9M and 28A. Combining the modular reduction and the $\gamma$ coefficient generation leads to cancellations in the equations, since $a + a = 0$ for $a \in GF(2^m)$. Therefore the final product has coefficients given by (11), with a final cost of 9M and 22A.

$$
\begin{aligned}
c_0 &= q_0 + q_1 + q_2 + q_3 + q_6 \\
c_1 &= q_0 + q_4 + q_6 + q_7 \\
c_2 &= q_0 + q_1 + q_5 + q_7 \\
c_3 &= q_0 + q_1 + q_2 + q_4 + q_5 + q_6 + q_7 + q_8
\end{aligned} \tag{11}
$$

A hardware $GF(2^{4m})$ Karatsuba multiplier architecture was realised by capturing (8) and (11) in VHDL, as illustrated in Fig. 1. In order to obtain maximum speed, all 9 $GF(2^m)$ multiplications required by the Karatsuba multiplication are performed in parallel using the digit-serial multiplication architecture of Section 4.1. All lines in Fig. 1 are $m$-bit buses. For different field sizes $m$ the architecture remains the same, with only the bus size and the underlying $GF(2^m)$ components changing. It is noted that the critical path through this architecture requires 5 $GF(2^m)$ additions and 1 $GF(2^m)$ multiplication.

[Fig. 1. $GF(2^{4m})$ Karatsuba Multiplier Architecture. Inputs $a_0, \ldots, a_3$ and $b_0, \ldots, b_3$; partial products $q_0, \ldots, q_8$; outputs $c_0, \ldots, c_3$. Hardware count: 10 + 12 = 22 $GF(2^m)$ additions and 9 $GF(2^m)$ multiplications.]

5. RESULTS

The extension field multiplier architecture of Section 4.2 was captured in VHDL at the RTL level, using VHDL generated by a C program for the $GF(2^m)$ digit-serial multiplier components. The design was synthesised for a Xilinx Virtex 2 xc2v6000-ff1152 device using Xilinx ISE 5.1.03i. Two field sizes suitable for pairing based cryptosystems are $m = 241$ and $m = 283$ (see [3]). The post-synthesis (PS) and post place and route (PPR) clock frequencies for various digit sizes are reported in Tables 1 and 2. As can be seen, increasing the digit size $D$ reduces the overall calculation time, but at the expense of consuming more FPGA resources. For both field sizes, selecting $D = 8$ appears to be a good tradeoff that offers fast $GF(2^{4m})$ multiplications with reasonable resource usage.
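The timing relations used in this section can be cross-checked numerically against the figures reported in Tables 1 to 3. The sketch below is a rough consistency check rather than the authors' procedure. It assumes that the loop length $t$ in (7) equals $m$ (an assumption of this sketch, adopted because it reproduces the tabulated estimates), that the divider runs at the multiplier's PPR clock, and that $T_{MSUB} = T_M$. The function names are hypothetical.

```python
from math import ceil


def t_mult_us(m, D, f_mhz):
    """Multiplication time in microseconds: (n + 2) clock periods, n = ceil(m / D)."""
    n = ceil(m / D)
    return (n + 2) / f_mhz


def t_div_us(m, f_mhz):
    """GF(2^m) division time in microseconds: 2m clock periods."""
    return 2 * m / f_mhz


def t_pairing_ms(t, hw_l, hw_eps, t_m, t_d):
    """Equation (7) with T_MSUB = T_M; inputs in microseconds, result in milliseconds."""
    t_us = 5 * (t + hw_l) * t_m + hw_l * t_d + (t + hw_l + hw_eps) * t_m
    return t_us / 1000.0


# m, PPR clock (MHz) at D = 8, Hw(l), Hw(epsilon), tabulated T_M and T_D (microseconds)
REPORTED = [
    (241, 71.013, 121, 482, 0.465, 6.79),
    (283, 59.712, 3, 566, 0.636, 9.48),
]

for m, f_ppr, hw_l, hw_eps, t_m, t_d in REPORTED:
    # Assumption: t = m (bit length of l taken equal to the field size).
    estimate = t_pairing_ms(m, hw_l, hw_eps, t_m, t_d)
    print(m, round(t_mult_us(m, 8, f_ppr), 3), round(t_div_us(m, f_ppr), 2), round(estimate, 2))
```

Evaluating (7) with the rounded table entries rather than the unrounded clock-derived times keeps the check aligned with how the tabulated values appear to have been combined; the two approaches differ only in the last reported digit.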

Table 1. $GF(2^{4m})$ Multiplier Results, $m = 241$

D     n      CLK freq. PS (MHz)   CLK freq. PPR (MHz)   T_M (μs)   area (%)
1     241    172.265              87.596                2.774      20
4     61     181.258              78.474                0.803      28
8     31     161.996              71.013                0.465      38
16    16     140.496              63.914                0.282      68

Table 2. $GF(2^{4m})$ Multiplier Results, $m = 283$

D     n      CLK freq. PS (MHz)   CLK freq. PPR (MHz)   T_M (μs)   area (%)
1     283    186.670              73.448                3.880      25
4     71     155.690              74.482                0.980      33
8     36     135.667              59.712                0.636      44
16    18     123.259              58.689                0.341      75

The Tate pairing evaluation time using the BKLS/GHS algorithm can be estimated using (7) once the times required for the underlying field operations are known. It is assumed that one of the subfield multipliers in the $GF(2^{4m})$ multiplier architecture will be used for $GF(2^m)$ multiplications (i.e. $T_{MSUB} = T_M$). A $GF(2^m)$ divider implemented in VHDL on the target device, using the algorithm described by Shantz in [11], performs a $GF(2^m)$ division in $2m$ clock cycles and has a post place and route clock frequency higher than that of the multiplier for both $m = 241$ and $m = 283$. Assuming no clock partitioning, the system clock speed will therefore be determined by the maximum clock speed of the $GF(2^{4m})$ multiplier.

The estimated time taken to evaluate the Tate pairing using a hardware accelerator based on the $GF(2^{4m})$ architecture presented in this paper is given in Table 3. The curves used for the estimates are given in [3]. The digit sizes were chosen so that in both cases the $GF(2^{4m})$ multiplier occupies approximately 40% of the device, leaving space for the additional components and control necessary to compute the Tate pairing.

Table 3. Estimated Tate Pairing Computation Times

m      D    Hw(l)   Hw(ε)   T_M (μs)   T_D (μs)   T_τ (ms)
241    8    121     482     0.465      6.79       2.06
283    8    3       566     0.636      9.48       1.48

In [2] a Tate pairing software evaluation time of 23 ms over $GF(2^{271})$ on a 1 GHz Pentium III is reported. The estimated results in Table 3 illustrate that an FPGA based hardware accelerator would provide superior performance to a software based system for similar security levels. The underlying reason for this speed up is the ability to parallelise the calculations on a large, modern FPGA. The results in Table 3 and (7) illustrate that $GF(2^{4m})$ multiplication is essential to the efficient computation of the Tate pairing.

6. CONCLUSION

In this paper the feasibility of using an FPGA to implement a $GF(2^{4m})$ multiplier for cryptographic field sizes was investigated. An architecture based on Karatsuba multiplication was proposed, which was found to offer very good performance when a digit-serial architecture was used for the subfield multiplier. By changing the digit size of the subfield multiplier an area/speed tradeoff was explored. It was found that, for the field sizes of interest for pairing based cryptosystems, a fast $GF(2^{4m})$ multiplier could be implemented on a modern FPGA whilst leaving sufficient space on the device to implement the other components necessary to evaluate the Tate pairing. The speed of the accelerator can be improved at the expense of area if a larger digit size is used. The next stage is to implement a full Tate pairing hardware accelerator using the BKLS/GHS algorithm.

7. REFERENCES

[1] R. Dutta, R. Barua, and P. Sarkar, “Pairing-Based Cryptographic Protocols: A Survey,” Cryptology ePrint Archive, Report 064/2004, 2004.
[2] P. Barreto, H. Kim, B. Lynn, and M. Scott, “Efficient Algorithms for Pairing-Based Cryptosystems,” Proc. CRYPTO ’02, pp. 354–368, 2002.
[3] S. D. Galbraith, K. Harrison, and D. Soldera, “Implementing the Tate Pairing,” Proc. Fifth Algorithmic Number Theory Symp. (ANTS-V), pp. 324–337, 2002.
[4] E. D. Mastrovito, “VLSI Architectures for Computation in Galois Fields,” Ph.D. dissertation, Linköping University, 1991.
[5] C. Paar, P. Fleischmann, and P. Roelse, “Efficient Multiplier Architectures for Galois Fields $GF(2^{4n})$,” IEEE Trans. Comput., vol. 47, no. 2, pp. 162–170, Feb. 1998.
[6] C. Paar and P. Fleischmann, “Fast Arithmetic for Public-Key Algorithms in Galois Fields with Composite Exponents,” IEEE Trans. Comput., vol. 48, no. 10, pp. 1025–1034, Oct. 1999.
[7] I. Duursma and H.-S. Lee, “Tate Pairing Implementation for Hyperelliptic Curves $y^2 = x^p - x + d$,” Proc. ASIACRYPT ’03, pp. 111–123, 2003.
[8] P. S. L. M. Barreto, S. Galbraith, C. O’hEigeartaigh, and M. Scott, “Efficient Pairing Computation on Supersingular Abelian Varieties,” Cryptology ePrint Archive, Report 375/2004, 2004.
[9] L. Song and K. Parhi, “Low Energy Digit-Serial/Parallel Finite Field Multipliers,” Kluwer Journal of VLSI Signal Processing Systems, vol. 19, no. 2, pp. 149–166, 1998.
[10] A. Karatsuba and Y. Ofman, “Multiplication of Many-Digital Numbers by Automatic Computers,” Translation in Physics-Doklady, vol. 7, pp. 595–596, 1963.
[11] S. C. Shantz, “From Euclid’s GCD to Montgomery Multiplication to the Great Divide,” Sun Microsystems, Tech. Rep. TR-2001-95, 2001.