[IEEE 2010 VI Southern Programmable Logic Conference (SPL) - Ipojuca, Pernambuco, Brazil (2010.03.24-2010.03.26)] 2010 VI Southern Programmable Logic Conference (SPL) - Montgomery

MONTGOMERY MODULAR MULTIPLICATION ON RECONFIGURABLE HARDWARE: FULLY SYSTOLIC ARRAY VS PARALLEL IMPLEMENTATION

Guilherme Perin, Daniel G. Mesquita, Fernando L. Herrmann and Joao Baptista Martins

Post-Graduate Program in Informatics - PPGI
Federal University of Santa Maria / Microelectronics Group

Av. Roraima, 1000 - Camobi - Santa Maria

ABSTRACT

This paper compares two FPGA architectures for Montgomery modular multiplication: a fully systolic array and a parallel implementation. Modular multiplication is the core of modular exponentiation, which is the most important operation in several public-key cryptographic algorithms, the most popular of which is the RSA encryption scheme. The proposed fully systolic array architecture is a high-radix implementation with carry propagation between the Processing Elements. The parallel implementation is composed of multiplier blocks working in parallel with the Processing Elements, and it provides a pipelined operation mode. We compare the time × area efficiency of both architectures, as well as an RSA application. The fully systolic array implementation runs the 1024-bit RSA decryption process in just 3.23 ms and the parallel architecture executes the same operation in 6 ms, which is competitive with the state of the art for both architectures.

1. INTRODUCTION

Modular multiplication is widely employed in public-key cryptography to carry out modular exponentiation. Hardware implementations of this mathematical operation increase the throughput of cryptographic algorithms when compared with software-only implementations. Field Programmable Gate Arrays (FPGAs) offer a great degree of flexibility to cryptographic hardware solutions, because different key sizes and protocols can be accommodated with relative ease.

The RSA public-key cryptosystem [1] was proposed by R. L. Rivest, A. Shamir and L. Adleman in 1978. It is a widely used system for providing privacy and ensuring authenticity of digital data, i.e., in all applications where security of digital data is a concern. The RSA key is generated from the product of two long prime numbers, from 512 up to 2048 bits. Thus the security of the RSA cryptosystem relies on the inability of a potential attacker to efficiently factor large integers. However, a direct hardware implementation of the RSA mathematical operations is impractical and compute-intensive, due to the long-integer arithmetic. The Montgomery multiplication [2] is a fast and power-efficient algorithm widely used in hardware implementations because it avoids the costly division by $N$ in a typical modular multiplication of the form $A \cdot B \bmod N$. Since it relies on multi-precision arithmetic, high-radix operations are a suitable choice for the modular multiplication architecture.

Over the last years, different architectures have been proposed for the implementation of Montgomery multiplication [4][7][9][11][12][13][14][15]. Fully systolic array architectures have been presented in order to speed up the modular multiplication. These architectures consist of an array of Processing Elements (PEs) in which each PE performs additions and multiplications in a multi-precision context [5] with carry propagation. Depending on the word size (or radix) used, the architecture can employ a large number of Processing Elements, which in an FPGA implementation means increased resource usage.

As an alternative, the modular multiplication architecture can be matched to the FPGA resources by parallelizing the additions and multiplications of the whole computation, inserting multiplier blocks in parallel with the Processing Elements. By enforcing a pipelined operation mode and using a high-radix architecture (16 or 32 bits), the multiplier blocks preserve the high speed of fully systolic array architectures with fewer arithmetic elements and less carry propagation.

This paper presents a trade-off between two proposed modular multiplication architectures: a very high-radix fully systolic array and a parallel implementation. Both architectures use radix-16 and radix-32 (16-bit and 32-bit words) to speed up the computation and to match the resources of the Xilinx Virtex-2P and Virtex-5 FPGA series [16]. The parallel implementation puts multiplier blocks in parallel with the Processing Elements in order to increase the reuse of the arithmetic elements of the architecture and, consequently, make better use of the FPGA resources. This paper is organized as follows: Section 2 presents the Montgomery modular multiplication algorithm. Section 3 discusses related state-of-the-art works. The proposed architectures are presented in Section 4. Finally, the results and the conclusion are presented in Sections 5 and 6, respectively.

978-1-4244-6311-4/10/$26.00 ©2010 IEEE

2. MONTGOMERY MODULAR MULTIPLICATION

The Montgomery multiplication algorithm is a method for performing modular multiplication without division by the modulus $N$. In cryptography, the Montgomery algorithm is well suited to hardware implementations of modular multiplication because it allows long integers to be represented as words whose precision is given by a radix (generally a power of two).

The version of the algorithm used in this work is the original one, with some pre-conditions. Algorithm 1 shows the modular multiplication with the notation proposed in [3], which is used for the remainder of this text.

Algorithm 1: Montgomery Modular Multiplication for computing $A \cdot B \cdot R^{-1} \bmod N$

$N = \sum_{i=0}^{m-1} (2^k)^i n_i$, $n_i \in \{0, 1, \ldots, 2^k - 1\}$
$B = \sum_{i=0}^{m-1} (2^k)^i b_i$, $b_i \in \{0, 1, \ldots, 2^k - 1\}$
$A = \sum_{i=0}^{m-1} (2^k)^i a_i$, $a_i \in \{0, 1, \ldots, 2^k - 1\}$, $R = (2^k)^m$
$A, B < 2N$; $N < R = 2^{km}$; $N' = -N^{-1} \bmod 2^k$

1. $S_0 = 0$
2. for $i = 0$ to $m - 1$ do
3.    $q_i = ((S_0 + a_i \times b_0) N') \bmod 2^k$
4.    $S_{i+1} = (S_i + q_i \times N + a_i \times B) / 2^k$
5. end for

In line 3, $S_0$ denotes the least significant $k$-bit word of the current intermediate result $S_i$. The value $N'$ is the negated modular inverse of $N$ with respect to the radix $2^k$, i.e., it satisfies $N \cdot N' \equiv -1 \pmod{2^k}$. The final result is held in $S$ after $m$ iterations and is equal to $A \cdot B \cdot R^{-1} \bmod N$, which must be corrected to retrieve the expected result ($A \cdot B \bmod N$). The correction is done by performing an additional Montgomery multiplication with $S$ and $R^2 \bmod N$ as parameters. This correction is inexpensive during a modular exponentiation, because it has to be made only once, after the whole exponentiation.
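As an illustration, Algorithm 1 can be modelled in software along the following lines. This is only a minimal Python sketch, not the authors' hardware description: the function names `montgomery_multiply` and `to_normal_domain` are hypothetical, Python's arbitrary-precision integers stand in for the multi-word registers, and the three-argument `pow` with exponent -1 assumes Python 3.8 or later.

```python
def montgomery_multiply(a, b, n, k, m):
    """Word-serial Montgomery product following Algorithm 1.

    Returns S with S ≡ a * b * R^-1 (mod n), where R = 2^(k*m).
    S may still need a reduction modulo n, depending on the operand bounds.
    """
    radix = 1 << k
    mask = radix - 1
    n_prime = (-pow(n, -1, radix)) % radix   # N' = -N^-1 mod 2^k (n must be odd)
    b0 = b & mask                            # least significant word of B
    s = 0                                    # line 1
    for i in range(m):                       # line 2
        a_i = (a >> (k * i)) & mask          # i-th k-bit word of A
        q_i = (((s & mask) + a_i * b0) * n_prime) & mask   # line 3
        s = (s + q_i * n + a_i * b) >> k     # line 4: the division is exact
    return s


def to_normal_domain(s, n, k, m):
    """Correct S = A*B*R^-1 to A*B mod n with one extra Montgomery product."""
    r = 1 << (k * m)
    return montgomery_multiply(s, (r * r) % n, n, k, m) % n
```

The choice of $q_i$ makes the least significant word of the sum in line 4 zero, which is why the right shift by $k$ bits loses no information.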

Since its publication in 1985 by Peter L. Montgomery [2], the Montgomery algorithm has received many modifications and improvements [10][12]. One of those is particularly interesting, because it avoids the final subtraction simply by choosing the input data appropriately. By limiting the operands $A$ and $B$ to integers less than $2N$ and by requiring $2N$ to be less than $2^{km}$, it is guaranteed that the final $S$ is less than $N$ [12]. These pre-conditions are shown in Algorithm 1 and are adopted in our architectures, as explained in Section 4.

3. RELATED WORKS

In the context of radix-2 implementations of the Montgomery algorithm, Tenca and Koc are widely referenced. These authors initially proposed architectures with improvements to the radix-2 Montgomery algorithm, as in [6]. Even though the input operands are large numbers, radix-2 modular multiplication avoids the expensive multiplications that appear in high-radix implementations (8 or more). Unlike the classic radix-2 algorithm [3], the Tenca and Koc modifications make the modular multiplication architecture scalable, i.e., the proposed multiplier is able to work with any precision of the input operands. In terms of hardware implementation, it is a systolic array architecture composed of Processing Elements (PEs) and control blocks for managing the input and output words of the operands. Each PE contains few logic elements, providing reduced area and a high clock frequency when synthesized for FPGA or ASIC.

Based on the above work, [7][8] present improvements to the Tenca and Koc architecture. The advantage of these new approaches lies in PE optimizations that reduce the latency between consecutive steps of the Montgomery modular multiplication by at least a factor of two. So, the main contributions lie in the speed improvement of the modular multiplication and in the reduced number of logic elements in the PEs. In [7], a radix-4 scalable Montgomery modular multiplication architecture is proposed in order to enhance the speed. A limiting factor of the radix-2 and radix-4 architectures is the large number of clock cycles required for a modular multiplication.

Furthermore, in the context of high-radix implementations, [4] presents a systolic architecture composed of Processing Elements able to perform modular multiplication for radices greater than 4. Besides its time and area efficiency, this architecture needs preprocessing before the modular multiplication is executed. The authors make use of the optimized Montgomery algorithm initially proposed in [10], which simplifies the computation of the quotient $q_i$, reducing the quotient determination to a simple truncation, $S \bmod 2^k$. However, as a consequence, the input operands are more constrained ($N' = -N^{-1} \bmod 2^k = 1$ and $A, B < 2\,(N' \bmod 2^k)\,N$) and the optimized Montgomery algorithm needs three additional iterations, because the input operand $B$ is left shifted by 2k and has to be corrected by these further iterations.

To avoid preprocessing in a high-radix modular multiplication, [9] presents a fully systolic array architecture composed of Processing Elements containing internal multipliers and adders. The version of the Montgomery algorithm used in this implementation is also the optimized version proposed in [10]. Being a radix-16 implementation, the modular multiplication takes only 103 clock cycles, significantly fewer than other architectures [4][6][7].


Fig. 1. First Processing Element.

4. THE PROPOSED ARCHITECTURES

This section details the proposed architectures for performing Montgomery multiplication. First, we provide a detailed description of the fully systolic array architecture, explaining the behaviour of the Processing Elements (PEs). Then, the parallelized Montgomery modular multiplication architecture is presented.

4.1. The Fully Systolic Array Architecture

According to the Montgomery algorithm (Section 2), the operands of a modular multiplication are expressed in base $2^k$ (the radix) and, therefore, are divided into a fixed number $m$ of words of $k$ bits each. In the fully systolic array architecture, each PE performs the arithmetic operations on the words having the same index as the PE (for example, $PE_2$, $N_2$, $B_2$). Except for the first PE, each PE outputs a word of $S_i$. The following subsections describe all the Processing Elements in detail and how they are connected to form the systolic array.

4.1.1. First Processing Element

The systolic array architecture is composed of $m$ identical Processing Elements plus an initial PE. The first PE differs from the others because it does not receive any carry signal as input and, in particular, because it communicates with the control unit of the architecture, serially receiving the words and starting the pipeline. Moreover, in order to carry out the division by $2^k$, according to line 4 of Algorithm 1, the first word of $S_i$ is discarded in the first PE. Figure 1 shows the internal architecture of the first processing element.

4.1.2. General Processing Element

The $m + 1$ Processing Elements, including the first PE, are arranged as a systolic array in order to compute $S_{i+1} = (S_i + q_i \times N + a_i \times B)/2^k$, and the word length ($k$) defines the number of pipeline stages. The PEs propagate carry signals from left to right; these carries come from the multiplications (carry AB and carry QN) and the additions (carry) within each processing element, as shown in Figure 2. Each PE processes a particular $k$-bit word of the operands $S_i$, $N$ and $B$, depending on the position of the PE in the array.

Fig. 2. General Processing Element.

Fig. 3. The Fully Systolic Array Architecture.

Figure 3 illustrates the fully systolic array architecture with the control unit and the quotient block. The latter is responsible for the computation of line 3 of Algorithm 1. Each processing element outputs a word of the result $S_i$.

As seen in Figure 3, each processing element receives a word of the result $S_i$ of the previous iteration in order to calculate the corresponding word of the result of the current iteration. Together with the carry signals, the words $a_i$ and $q_i$ are also propagated from left to right, which characterizes the pipelined operation of the architecture.

The number of clock cycles required for a modular multiplication in a systolic array architecture is greatly reduced. The pipelined operation of the Processing Elements drastically decreases the latency between them. Figure 4 shows a flowchart of the latency between the PEs.
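The word-level data flow of the systolic array can be illustrated with a small software model. The sketch below is an assumption-laden simplification, not the authors' VHDL: the names `to_words` and `systolic_multiply` are hypothetical, the three carry signals of each PE (carry AB, carry QN and carry) are merged into a single integer `carry`, and the pipeline timing is ignored, so only the arithmetic performed by each PE position is shown.

```python
def to_words(x, k, count):
    """Split x into `count` words of k bits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * j)) & mask for j in range(count)]


def systolic_multiply(a, b, n, k, m):
    """Montgomery product computed word by word, one list slot per PE."""
    mask = (1 << k) - 1
    n_prime = (-pow(n, -1, 1 << k)) % (1 << k)   # N' = -N^-1 mod 2^k
    # m + 1 word positions, matching the m + 1 PEs of the array
    n_w = to_words(n, k, m + 1)
    b_w = to_words(b, k, m + 1)
    s_w = [0] * (m + 1)
    for a_i in to_words(a, k, m):
        # Quotient block: line 3 of Algorithm 1
        q_i = ((s_w[0] + a_i * b_w[0]) * n_prime) & mask
        carry, out = 0, []
        for s_j, n_j, b_j in zip(s_w, n_w, b_w):
            t = s_j + q_i * n_j + a_i * b_j + carry
            out.append(t & mask)   # k-bit result word of this PE
            carry = t >> k         # merged carry passed to the next PE
        # out[0] is always zero; discarding it is the division by 2^k
        s_w = out[1:] + [carry]
    return sum(word << (k * j) for j, word in enumerate(s_w))
```

Dropping the first word in every iteration mirrors what the first PE does in hardware, while the final carry becomes the most significant word of the next intermediate result.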

4.2. The Parallel Implementation

Based on multi-precision arithmetic operations [5], we developed an architecture with multiplier blocks working in parallel with the Processing Elements in order to achieve the same performance as fully systolic array architectures [4][8][9] with better use of the FPGA resources. Again, the input parameters $A$, $B$ and $N$ are $n$-bit operands, handled as $m$ words of $k$ bits, according to the previously defined radix $2^k$. Figure 5 shows the full modular multiplication architecture with the control unit.

Fig. 4. The latency between the PEs in the fully systolic array architecture.

At the beginning of each modular multiplication, the operands $B$ and $N$ are stored in memories inside the multiplier blocks ($B_{MEM}$ and $N_{MEM}$). Each multiplier block stores a portion of $m/8$ words of $B$ and $N$. The words of the operand $A$ are provided serially, one at each iteration $i$ of Algorithm 1.

The computation of the quotient $q_i$ (line 3 of Algorithm 1) is performed by a combinational block, and the $k$-bit words $S_0$, $a_i$, $b_0$ and $N'$ are provided to this block at the start of the iteration. Figure 6a shows the internal architecture of the quotient block. The computation of the intermediate result $S_{i+1}$ (line 4 of Algorithm 1) involves two large-number multiplications, two large-number additions and a right shift, which represents the division of $(S_i + q_i \cdot N + a_i \cdot B)$ by $2^k$.

The architecture consists of $m/8$ multiplier blocks and $m$ Processing Elements. The purpose of the multiplier blocks is to allow greater reuse of the arithmetic components that were previously placed inside the PEs. So, the $k \times k$ multipliers migrate from the PEs to the multiplier blocks, reducing the number of arithmetic elements by a factor of four. These $k \times k$-bit multiplications represent the critical path of the architecture. However, the multiplier blocks must operate in a pipelined mode to avoid the latency caused by their insertion. Figure 6b shows the internal architecture of the multiplier blocks.

Three carry signals are propagated inside the architecture. Two of them, $C_1$ and $C_2$, are the $k$ most significant bits of the operations $W = S_i + q_i \cdot N$ and $U = a_i \cdot B$, and they are propagated between the multiplier blocks. Only the fourth multiplier block sends the carry signals $C_1$ and $C_2$ to $PE_m$ (the last PE). The third carry, $C_{PE}$, is the most significant bit of the addition of $W = S_i + q_i \cdot N$ and $U = a_i \cdot B$. This last addition occurs inside the Processing Elements, and $C_{PE}$ is transmitted from PE to PE.

Fig. 5. Proposed Modular Multiplication Architecture.

Fig. 6. (a) Quotient Block Architecture (b) Multiplier Blocks

At the end of iteration $m - 1$, the result $S_{i+1} = A \cdot B \cdot R^{-1} \bmod N$ is sent out through an $m{:}1$ $k$-bit multiplexer.

In terms of clock cycles for the Montgomery modular multiplication, we can define the following: initially, $m$ clock cycles are reserved for the internal storage of $B$ and $N$. Another $m$ clock cycles are needed to obtain the $m$ words of $S_{i+1}$ in the first iteration of Algorithm 1. Each of the other iterations needs only one clock cycle to determine the quotient $q_i$ and $k/4$ clock cycles to execute line 4. Thus, the number of clock cycles of a modular multiplication is given by:

$$n_{MM} = 2m + (m - 1)(1 + k/4) \tag{1}$$

4.2.1. The Processing Elements - PE

The parallelized architecture is composed of $m$ Processing Elements. Thanks to the multiplier blocks, each PE only needs to add two $k$-bit words ($W_{m-1 \ldots 0}$ and $U_{m-1 \ldots 0}$) and sends out one word of the result $S_{i+1}$ at each iteration. Also, only a one-bit carry ($C_{PE}$) is transmitted from a PE to its neighbour, which corresponds to the most significant bit of the internal addition in the PE.

The right shift, corresponding to the division of $(S_i + q_i \cdot N + a_i \cdot B)$ by $2^k$, is done simply by discarding the first word of $S$ ($S_0$), obtained in the first PE ($PE_0$) from the addition $W_0 + U_0$, and keeping only the carry $C_{PE}$. The remaining Processing Elements perform the addition of the $W$ and $U$ words, and each PE provides a $k$-bit word of the result $S$ at each iteration, also passing the carry signal along the architecture. The last PE ($PE_m$) is responsible for providing two words of the result $S$ ($S_{m-2}$ and $S_{m-1}$), given that the input words for the computation of $S_{m-1}$ are the carry signals $C_1$ and $C_2$ from the last multiplier block. Figure 7 shows the block diagram of the processing elements.

Fig. 7. General PE with the carry propagation.
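The split between multiplier blocks and Processing Elements described above can be sketched as follows for a single iteration of line 4 of Algorithm 1. This is a hedged illustration only: the function name `parallel_iteration` is hypothetical, the multiplier blocks are collapsed into two big-integer operations, and the most significant position is kept as one unbounded word instead of being mapped onto the exact word widths of the last PE.

```python
def parallel_iteration(s, n, b, a_i, q_i, k, m):
    """One pass of line 4 of Algorithm 1, split into two stages."""
    mask = (1 << k) - 1
    # Multiplier-block stage: the two wide results W and U
    w = s + q_i * n
    u = a_i * b
    w_words = [(w >> (k * j)) & mask for j in range(m)]
    u_words = [(u >> (k * j)) & mask for j in range(m)]
    c1 = w >> (k * m)   # upper part of W, handed to the last PE
    c2 = u >> (k * m)   # upper part of U, handed to the last PE
    # PE stage: k-bit additions chained by the one-bit carry C_PE
    c_pe, out = 0, []
    for w_j, u_j in zip(w_words, u_words):
        t = w_j + u_j + c_pe
        out.append(t & mask)
        c_pe = t >> k   # always 0 or 1 for two k-bit words plus a carry bit
    # The first word is discarded (division by 2^k); the top position
    # absorbs C1, C2 and the final C_PE
    words = out[1:] + [c1 + c2 + c_pe]
    return sum(word << (k * j) for j, word in enumerate(words))
```

Because each PE adds only two $k$-bit words and one carry bit, the sum never exceeds $2^{k+1} - 1$, which is why a single bit is enough for $C_{PE}$.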

5. RESULTS

Two architectures for Montgomery modular multiplication were described in VHDL and synthesized for Xilinx Virtex-2P and Virtex-5 FPGAs. These two FPGA technologies were chosen because of the different features they offer. The Virtex-2P series has 18 × 18-bit multipliers, required for the 16-bit and 32-bit multipliers, and the Virtex-5 offers DSP48E blocks for executing the multiplications. DSP48E slices, available in all Virtex-5 devices, accelerate algorithms and enable higher levels of DSP integration and lower power consumption than previous-generation Virtex devices [17]. Moreover, both FPGAs have Block RAMs, which are required to store the main operands before the modular multiplication starts. Table 1 shows the synthesis results. The parallel implementation occupies roughly four times fewer FPGA resources than the fully systolic array architecture on both Virtex-2P and Virtex-5. The fully systolic array implementation runs a modular multiplication in about 2.5 times fewer clock cycles than the parallel architecture.

Table 2 presents RSA encryption and decryption applications of the proposed architectures. Since modular exponentiation is performed by successive modular multiplications, the left-to-right binary square-and-multiply algorithm [3] is employed for the modular exponentiation. The fully systolic array architecture achieves faster RSA encryption and decryption than the parallel implementation, but the parallel architecture is the more appropriate choice for RSA on FPGAs with fewer resources, due to the small area it needs.
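For reference, the left-to-right binary square-and-multiply scheme can be sketched as below. This is a generic textbook formulation, not the authors' controller: the function name `square_and_multiply` is hypothetical, and ordinary modular multiplication is used where the hardware would issue Montgomery multiplications. The counter only makes the number of multiplications visible.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation.

    Returns (base**exponent mod modulus, number of modular multiplications).
    """
    result = 1
    count = 0
    for bit in bin(exponent)[2:]:            # scan from the most significant bit
        result = (result * result) % modulus  # square for every exponent bit
        count += 1
        if bit == "1":
            result = (result * base) % modulus  # multiply only for 1 bits
            count += 1
    return result, count
```

An $n$-bit exponent costs $n$ squarings plus one multiplication per set bit, so an exponent with about half of its bits set needs roughly $1.5n$ modular multiplications, while a sparse public exponent such as 65537 needs far fewer.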

Table 1. Proposed Architectures Synthesis.

Virtex-2P, Fully Systolic Array Architecture

n     k   Slices  MM Clock Cycles  18x18 Mults  Freq. (MHz)
512   16  5135    140              66           142.891
512   32  6487    70               134          90.744
1024  16  11175   280              130          123.986
1024  32  12960   140              262          78.264

Virtex-2P, Parallelized Architecture

n     k   Slices  MM Clock Cycles  18x18 Mults  Freq. (MHz)
512   16  1831    356              10           119.589
512   32  2145    178              22           81.040
1024  16  2744    712              18           119.589
1024  32  3674    356              38           79.729

Virtex-5, Fully Systolic Array Architecture

n     k   Slices  MM Clock Cycles  DSP48Es  Freq. (MHz)
512   16  5624    140              35       146.624
512   32  5673    70               63       106.674
1024  16  11652   280              67       140.028
1024  32  11346   140              138      92.649

Virtex-5, Parallelized Architecture

n     k   Slices  MM Clock Cycles  DSP48Es  Freq. (MHz)
512   16  1588    356              10       161.614
512   32  1471    178              22       95.913
1024  16  2242    712              18       161.614
1024  32  3044    356              38       95.913

Table 2. Radix-32 RSA application.

Fully Systolic Array Architecture

n     Freq. (MHz)  RSA Encryption  RSA Decryption
512   106.674      0.018 ms        1.75 ms
1024  92.649       0.037 ms        3.23 ms

Parallelized Architecture

n     Freq. (MHz)  RSA Encryption  RSA Decryption
512   95.913       0.040 ms        3.1 ms
1024  95.913       0.079 ms        6 ms

Table 3 shows a comparison of our results with the state of the art. All works referred to in this table used the Montgomery algorithm in their hardware modular multiplication architectures and, for a direct comparison, only the 1024-bit applications are shown. The modular multiplication times, when not given in the references, are estimated considering a modular exponentiation of $n = 1024$ bits through the square-and-multiply algorithm, running $1.5n$ modular multiplications.


Table 3. State-of-Art Implementations of Modular Multiplication Architectures.

Work            n     Radix  Technology    Clock Frequency  MM Time   RSA Decryption
Systolic Array  1024  32     Virtex-2      78.264 MHz       2.15 µs   3.52 ms
Parallelized    1024  32     Virtex-2      79.729 MHz       4.46 µs   6.85 ms
Systolic Array  1024  32     Virtex-5      92.649 MHz       1.85 µs   3.23 ms
Parallelized    1024  32     Virtex-5      95.913 MHz       3.71 µs   6 ms
[4]             1024  16     Xilinx 3090   63.7 MHz         7.77 µs   11.95 ms
[6]             1024  2      CMOS 0.5 µm   80 MHz           43 µs     66 ms
[7]             1024  4      XC2V2000      248 MHz          6.11 µs   9.4 ms
[9]             1024  8      Virtex 2P     104.7 MHz        3.73 µs   5.735 ms
[13]            1024  64     NA            90 MHz           5.36 µs   8 ms
[14]            1024  2      Virtex-4      150.5 MHz        9.07 µs   13.94 ms
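The estimate used for Table 3 can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model with hypothetical function names (`mm_time_us`, `rsa_decryption_ms`); it assumes that one modular multiplication takes the listed number of clock cycles at the listed frequency and that an exponentiation runs exactly $1.5n$ modular multiplications, so it will not necessarily reproduce every row of the table.

```python
def mm_time_us(clock_cycles, freq_mhz):
    """Time of one modular multiplication in microseconds."""
    return clock_cycles / freq_mhz


def rsa_decryption_ms(n_bits, mm_us):
    """Estimated exponentiation time in milliseconds for 1.5 * n multiplications."""
    return 1.5 * n_bits * mm_us / 1000.0
```

For the parallelized architecture with $n = 1024$ and $k = 32$ on Virtex-2P, `mm_time_us(356, 79.729)` gives about 4.46 µs and `rsa_decryption_ms(1024, 4.46)` gives about 6.85 ms, in line with the corresponding row of Table 3.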

6. CONCLUSION

This paper presented two Montgomery modular multiplication architectures and the results of their synthesis for Xilinx Virtex-2P and Virtex-5. A fully systolic array and a parallel implementation, suitable for the RSA public-key cryptosystem, were developed, and the designs were carefully matched to the features of the FPGAs, using embedded 18 × 18-bit multipliers and DSP48E slices. The fully systolic array architecture provides higher speed and the parallel implementation offers higher reuse of FPGA resources. The fully systolic array architecture runs the 1024-bit RSA decryption process in just 3.23 ms and the parallel implementation executes the same operation in 6 ms. If necessary, the parallel architecture has the advantage that the speed of the modular multiplications can be improved simply by increasing the number of multiplier blocks, i.e., the parallelism.

Therefore, the fully systolic array architecture is more suitable for applications that require high throughput, such as telecommunication networks and internet data communications. In sensor networks and resource-constrained devices, the parallel architecture is more appropriate, due to the smaller area it occupies.

7. REFERENCES

[1] R. L. Rivest, A. Shamir, L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", Commun. ACM, 1978, 21, (2), pp. 120-126.

[2] P. L. Montgomery, "Modular multiplication without trial division", Mathematics of Computation, 44(170):519-21, April 1985.

[3] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, "Handbook of Applied Cryptography", Florida, CRC Press, 1997.

[4] T. Blum and C. Paar, "Montgomery modular exponentiation on reconfigurable hardware", in Proc. 14th IEEE Symp. on Computer Arithmetic, 1999, pp. 70-77.

[5] D. Knuth, "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", Addison-Wesley, 1968.

[6] A. Tenca and C. Koc, "A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm", IEEE Transactions on Computers, 2003.

[7] Pinckney and D. Harris, "Parallelized Radix-4 Scalable Montgomery Multiplier", SBCCI, 2008.

[8] D. Harris, K. Krishnamurthy, M. Anders, S. Mathew, S. Hsu, "An Improved Unified Scalable Radix-2 Montgomery Multiplier", IEEE Symposium on Computer Arithmetic, pp. 172-178, 2005.

[9] C. McIvor, M. McLoone and J. McCanny, "High-Radix Systolic Modular Multiplication on Reconfigurable Hardware", ICFPT 2005.

[10] H. Orup, "Simplifying Quotient Determination in High-Radix Modular Multiplication", in Proceedings 12th Symposium on Computer Arithmetic, pp. 193-9, 1995.

[11] C. D. Walter, "Systolic Modular Multiplication", IEEE Transactions on Computers, Vol. 42, no. 3, March 1993, pp. 376-378.

[12] C. D. Walter, "Montgomery Exponentiation Needs no Final Subtractions", Electronics Letters, 35(21):1831-1832, October 1999.

[13] F. Bernard, "Scalable hardware implementing high-radix Montgomery multiplication algorithm", Journal of Systems Architectures, 2007.

[14] Y. Wang, D. L. Maskell and J. Leiwo, "A unified architecture for a public key cryptographic coprocessor", Journal of Systems Architectures, 2008.

[15] Alan Daly and William Marnane, "Efficient Architectures for implementing Montgomery Modular Multiplication and RSA Modular Exponentiation on Reconfigurable Logic", FPGA'02, 2002.

[16] Xilinx Inc. Foundation Series, http://www.xilinx.com.

[17] XtremeDSP 48 Slice, http://www.xilinx.com/technology/dsp/xtremedsp.htm.
