
A low-cost implementation of Trivium

Nele Mentens, Jan Genoe, Bart Preneel, Ingrid Verbauwhede*

Katholieke Universiteit Leuven ESAT/COSIC, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium

[email protected]

Katholieke Hogeschool Limburg, Agoralaan, Campus B, Bus 3, B-3590 Diepenbeek, Belgium

[email protected]

IMEC, Polymer and Molecular Electronics, Kapeldreef 75, B-3001 Heverlee, Belgium

[email protected]

Katholieke Hogeschool Limburg, Agoralaan, Campus B, Bus 3, B-3590 Diepenbeek, Belgium

[email protected]

Katholieke Universiteit Leuven ESAT/COSIC, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium

[email protected]

Katholieke Universiteit Leuven ESAT/COSIC, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium

[email protected]

Abstract. This paper describes the implementation of two Trivium cores on a single chip. The cores are realized in a 5-metal 0.35 µm AMIS technology. The chip is currently being manufactured. The first core on the chip is an automatically placed and routed standard cell core. The second one is a custom design using dynamic logic and C²MOS flipflops. The goal of this paper is to evaluate and compare the size of the cores based on the lay-out results. The lay-out of the custom design shows a significant size reduction compared to the standard cell design.

Keywords: stream cipher, custom design, Trivium, dynamic logic.

1 Introduction

The most efficient cryptographic algorithms to achieve data confidentiality are block ciphers and stream ciphers. Stream ciphers are used as alternatives to block ciphers when high throughput or low gate count are important requirements. The Profile 2 stream cipher candidates in the ECRYPT eSTREAM project were developed for hardware implementations with restricted resources [2]. In this paper we focus on the hardware implementation of the Profile 2 stream cipher Trivium. Like most stream cipher implementations, the implementation of Trivium requires a few hundred storage elements or flipflops to store the internal state, while the combinatorial logic part is rather limited. As a consequence, the die size is mainly determined by the size of the flipflops. When implemented in a straightforward manner, i.e. using automatic synthesis and standard cell placement and routing, the size of a flipflop is equivalent to about 8 standard cell NAND gates or 32 transistors. We present a custom design of Trivium that consists of dynamic logic and C²MOS flipflops. This results in a significant decrease in size, which is shown on the basis of a lay-out comparison to a standard cell design in the same technology. The cores are compared in a 5-metal 0.35 µm AMIS technology¹ [1].

* The results presented in this paper were realized by the 2007-2008 master students in electronic engineering, chip design, at the Katholieke Hogeschool Limburg: Michael Billen, Free Claessens, Dries Cuypers, Jeroen Dreessen, Frederik Gommé, Wim Heedfeld, Jens Hoffmann, Kurt Ilsen, Lukasz Jaszczuk, Jan Jooken, Peter Schreurs, Peter Timmermans and Luc Van Roey.

This paper is organized as follows. Section 2 briefly describes the implemented algorithm, Trivium. In Sect. 3, some previously reported hardware implementations of Trivium are listed. Section 4 describes the two cores that are contained in the chip that is currently being manufactured. Finally, Sect. 5 concludes the paper.

2 Trivium: The Algorithm

Trivium consists of a 288-bit state register, in which 3 bits are updated based on the result of combinatorial logic and the remaining bits perform a cyclic shift. The key stream output is the result of an XOR operation on 6 bits in the state register. The schematic representation of the Trivium algorithm is given in Fig. 1. In the initialization phase, the 80-bit key and 80-bit Initial Value (IV) are loaded into the shift register at positions 1-80 and 94-173, respectively. Then, the state register is updated 4 × 288 times according to Fig. 1 without generating key stream bits. After the initialization phase, a maximum of 2^64 key stream bits are generated according to Fig. 1.

This description matches the radix-1 version of Trivium that outputs one key stream bit in every clock cycle. Because it is our goal to design a low area Trivium core, we only consider Trivium radix-1.
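As an illustration, the radix-1 algorithm described in this section can be modelled in software. The Python code below is a minimal sketch, not a reference implementation. The function names load, step and keystream are hypothetical. The tap positions and the three constant '1' bits at positions 286-288 are taken from the Trivium specification rather than from this paper, and the ordering of the key and IV bits is an assumption that has not been checked against official test vectors.

```python
def load(key_bits, iv_bits):
    """Build the 288-bit initial state from 80 key bits and 80 IV bits."""
    assert len(key_bits) == 80 and len(iv_bits) == 80
    s = [0] * 288
    s[0:80] = key_bits        # positions 1-80
    s[93:173] = iv_bits       # positions 94-173
    s[285:288] = [1, 1, 1]    # positions 286-288, per the Trivium specification
    return s


def step(s):
    """One radix-1 clock cycle: return (key stream bit, next state)."""
    t1 = s[65] ^ s[92]        # s66 + s93
    t2 = s[161] ^ s[176]      # s162 + s177
    t3 = s[242] ^ s[287]      # s243 + s288
    z = t1 ^ t2 ^ t3          # XOR of 6 state bits
    t1 ^= (s[90] & s[91]) ^ s[170]
    t2 ^= (s[174] & s[175]) ^ s[263]
    t3 ^= (s[285] & s[286]) ^ s[68]
    # Three bits are replaced; all other bits shift by one position.
    nxt = [t3] + s[0:92] + [t1] + s[93:176] + [t2] + s[177:287]
    return z, nxt


def keystream(key_bits, iv_bits, n):
    """Run the 4 x 288 initialization cycles, then produce n key stream bits."""
    s = load(key_bits, iv_bits)
    for _ in range(4 * 288):
        _, s = step(s)        # output is discarded during initialization
    out = []
    for _ in range(n):
        z, s = step(s)
        out.append(z)
    return out
```

A call such as keystream([0] * 80, [0] * 80, 16) returns a list of 16 bits. In hardware, each call to step corresponds to one clock cycle, so initialization takes 4 × 288 = 1152 cycles.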

¹ AMIS was bought by ON Semiconductor on September 13, 2007.


Fig. 1. Schematic representation of the Trivium algorithm.

3 Previous Work

On the eSTREAM Phase 3 website, several hardware implementations of Trivium cores are presented [2]. These implementations are all standard cell designs of which the area and/or the number of equivalent NAND gates is reported. The results are given in Table 1.

Our design also contains a standard cell core. However, in addition to this reference core, we present a custom design of Trivium. The details of these cores are given in Sect. 4, which also reports on the area of the cores.

4 Two Trivium Cores

4.1 Standard Cell Core

The standard cell core was designed using automatic standard cell placement and routing in L-Edit, a tool from Tanner for physical lay-out. To limit the number of I/O pins on the chip, the loading of the key and the IV in the initialization phase is done in a bit-serial manner. To accommodate this, three multiplexors are added to the design as depicted in Fig. 2.


Table 1. Overview of hardware resources for the implementation of Trivium reported on the eSTREAM Phase 3 website [2]. The equivalent number of NAND gates for the TSMC library was calculated using [3]. A dash marks a value that is not reported.

Authors              Area (µm²)   Equivalent NAND gates   Technology            Radix
Gurkaynak et al.     144128       -                       0.25 µm CMOS UMC      64
Feldhofer            169950       3090                    0.35 µm               16
Gaj et al.           7428         3068                    90 nm TCBN90G TSMC    1
Gaj et al.           13440        5551                    90 nm TCBN90G TSMC    64
Good and Benaissa    -            2599                    0.13 µm               1
Good and Benaissa    -            2660                    0.13 µm               4
Good and Benaissa    -            2801                    0.13 µm               8
Good and Benaissa    -            3185                    0.13 µm               16
Good and Benaissa    -            3787                    0.13 µm               32
Good and Benaissa    -            4921                    0.13 µm               64

4.2 Dynamic Core using C²MOS flipflops

The second core has the same architecture as depicted in Fig. 2. However, it uses dynamic instead of static logic for the logic gates in the design. Instead of standard CMOS flipflops, the dynamic core uses C²MOS flipflops to store the internal state. The next two paragraphs describe the components of the dynamic core.

Dynamic logic. Dynamic logic gates are constructed with a single Pull-Up Network (PUN) or Pull-Down Network (PDN), whereas static CMOS gates contain both a PUN and a complementary PDN. To enable a dynamic gate to provide a logical '1' as well as a logical '0', two precharge transistors are added. These transistors are fed with a clock signal such that each period of the clock consists of a precharge phase followed by an evaluation phase. In the precharge phase the output is precharged to '0' ('1'). In the evaluation phase the output either remains '0' ('1') or is (dis)charged to '1' ('0'). Fig. 3 depicts a NAND gate in static CMOS logic and dynamic logic.
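The precharge and evaluation phases can be illustrated with a small behavioural model of a dynamic NAND gate with a PDN. The Python sketch below is only an illustration: the class name DynamicNandPDN is hypothetical, leakage on the output node is ignored, and the choice to precharge on a low clock level and evaluate on a high clock level is an assumption that the text does not fix.

```python
class DynamicNandPDN:
    """Idealized dynamic NAND gate with a pull-down network (no leakage)."""

    def __init__(self):
        self.out = 1  # assume the output node starts precharged

    def clock(self, clk, a, b):
        if clk == 0:
            # Precharge phase: the output node is charged to '1'.
            self.out = 1
        elif a and b:
            # Evaluation phase: the PDN conducts and discharges the node.
            self.out = 0
        # Otherwise the node floats and keeps its previous value.
        return self.out
```

Once the node has been discharged during evaluation, it stays at '0' until the next precharge phase, even if an input drops back to '0'. This one-way behaviour is the origin of the cascading restrictions listed further on.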

Fig. 2. Architecture of the standard cell Trivium design.

If the Boolean equation of a gate is written as an inverted function, the number of transistors in a static CMOS implementation is 2N, where N is the number of literals in the function. An advantage of dynamic logic is that the number of transistors only equals N+2. Especially for large gates, this results in a significant decrease in the number of transistors. Another advantage is that the capacitive input load of a dynamic gate is smaller than that of the corresponding static CMOS gate, resulting in a higher speed. The reason is that in dynamic logic only a single nMOS or pMOS needs to be driven per literal in the Boolean function, while for static CMOS an nMOS and a pMOS need to be driven.
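The two transistor counts can be compared for a few gate sizes with a short calculation. The Python sketch below simply evaluates the formulas 2N and N+2 from the text; the function names are hypothetical.

```python
def static_cmos_transistors(n_literals):
    """Static CMOS: a PUN and a complementary PDN, one transistor each per literal."""
    return 2 * n_literals


def dynamic_transistors(n_literals):
    """Dynamic logic: a single network plus the two clocked transistors."""
    return n_literals + 2


for n in (2, 4, 8):
    print(n, static_cmos_transistors(n), dynamic_transistors(n))
```

For N = 2 both styles need 4 transistors, so a 2-input gate gains nothing in transistor count. For N = 8 the count drops from 16 to 10, which illustrates why the saving is most pronounced for large gates.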

A disadvantage of dynamic logic is that automatic placement and routing is difficult to employ. The reason is that there are timing restrictions that prevent straightforward cascading of logic gates. The following restrictions need to be taken into account:

– Two cascaded dynamic logic gates that evaluate on the same clock level cannot both have a PUN or both have a PDN.

– When the output of a dynamic logic gate that evaluates on a high (low) clock level is connected to a dynamic logic gate that evaluates on a low (high) clock level, a transition block needs to be inserted. This transition block is depicted in Fig. 4 for a transition of a signal from a dynamic logic gate evaluating on a low clock level to a dynamic logic gate evaluating on a high clock level.

Fig. 3. Transistor schemes of a static CMOS NAND gate (left), a dynamic NAND gate with PDN (middle) and a dynamic NAND gate with PUN (right).

Fig. 4. Transistor scheme of a transition block to be inserted in between a dynamic gate with evaluation on a low clock level and a dynamic gate with evaluation on a high clock level.

Another drawback of dynamic logic is the fact that the output node of a gate is floating when there is no (dis)charge during the evaluation phase. Therefore, the clock frequency not only has an upper bound caused by the delay of the critical path, but also a lower bound that depends on the leakage of the signal on the floating output node. In practical applications, the lower bound on the clock frequency is usually assumed to be around 1 MHz.

In our custom design of Trivium, we implement the XOR gates, NAND gates and multiplexors using dynamic logic. The flipflops are described in the next paragraph.

C²MOS Flipflop. A C²MOS flipflop consists of two transition blocks that pass the signal on a different clock level. A rising edge C²MOS flipflop is depicted in Fig. 5. A falling edge C²MOS flipflop consists of the same two transition blocks in reverse order.
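The behaviour of a rising edge flipflop built from two such blocks can be sketched with a simple behavioural model. The Python code below is an illustration only. The class name C2MOSFlipFlop is hypothetical, and the model assumes that each transition block acts as a clocked inverter, with the first block transparent on a low clock level and the second on a high clock level. Leakage on the floating nodes and the finite slope of the clock are not modelled.

```python
class C2MOSFlipFlop:
    """Idealized rising edge flipflop made of two clocked inverting stages."""

    def __init__(self):
        self.x = 1  # node in between the two transition blocks
        self.q = 0  # output node

    def apply(self, clk, d):
        if clk == 0:
            # First block is transparent; the output node floats and holds.
            self.x = 1 - d
        else:
            # Second block is transparent; the intermediate node floats and holds.
            self.q = 1 - self.x
        return self.q
```

With this model, the value of d that is present while the clock is low appears on the output once the clock goes high, and later changes of d during the high phase have no effect.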

Whereas a standard cell flipflop consists of about 8 equivalent NAND gates or 32 transistors, a C²MOS flipflop consists of only 8 transistors. However, there are some drawbacks to C²MOS flipflops:

– When the clock level is high in Fig. 5, the node in between the two transition blocks is floating. This imposes a lower bound on the clock frequency in the same way as for dynamic logic gates.

– In order to prevent a race condition where the signal passes through the C²MOS flipflop at once, the clock edges need to be steep enough, i.e. the rise and fall times of the clock need to be short enough.

Fig. 5. Transistor scheme of a C²MOS flipflop.

In our custom design of Trivium, we store the internal state using 288 C²MOS flipflops. This significantly reduces the area of the design, as shown in the next section.
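The effect on the state storage alone can be estimated from the transistor counts quoted above. The Python sketch below is a back-of-the-envelope calculation with hypothetical constant names; it counts only the flipflop transistors and ignores the logic gates, the clock distribution and the routing.

```python
STATE_BITS = 288                 # one flipflop per state bit
STANDARD_CELL_FF_TRANSISTORS = 32
C2MOS_FF_TRANSISTORS = 8

standard_total = STATE_BITS * STANDARD_CELL_FF_TRANSISTORS
c2mos_total = STATE_BITS * C2MOS_FF_TRANSISTORS

print(standard_total)            # 9216
print(c2mos_total)               # 2304
print(standard_total // c2mos_total)  # 4
```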

4.3 Comparison of the Area of the Cores

Fig. 6 shows the lay-outs of the standard cell core and the dynamic core in a 0.35 µm AMIS technology. The area of the standard cell core is estimated at 108900 µm², while the custom design has a significantly smaller estimated area of 40425 µm². This comes down to an equivalent number of NAND gates equal to 2017 for the standard cell design and 749 for the custom design.

When comparing these core sizes to the results in Table 1, we notice that only the design of Gaj et al. has a smaller die area. However, this is an unfair comparison since this design was realized in a 90 nm technology while our design uses a 0.35 µm technology. A fairer comparison can be made based on the equivalent number of NAND gates. The designs that report this metric all show a larger core size, even in comparison to our standard cell design. The reason for this could be that the key and IV are loaded serially in our design, which takes away the need for multiplexors compared to a design that loads the key and IV in parallel. However, the most useful comparison is the one that evaluates the sizes of our standard cell and custom designs, which shows that the custom design decreases the area by a factor of more than 2.
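The reported figures can be cross-checked with a short calculation. In the Python sketch below, the NAND gate area of 54 µm² is an assumption inferred from the reported areas and gate counts; it is not stated in the paper. The names are hypothetical.

```python
STANDARD_CELL_AREA_UM2 = 108900
CUSTOM_AREA_UM2 = 40425
NAND_AREA_UM2 = 54.0  # assumed value, inferred from the reported numbers


def nand_equivalents(area_um2):
    """Express an area as a rounded number of equivalent NAND gates."""
    return round(area_um2 / NAND_AREA_UM2)


area_ratio = STANDARD_CELL_AREA_UM2 / CUSTOM_AREA_UM2

print(nand_equivalents(STANDARD_CELL_AREA_UM2))  # 2017
print(nand_equivalents(CUSTOM_AREA_UM2))         # 749
print(round(area_ratio, 2))                      # 2.69
```

Under this assumption both gate counts are reproduced, and the area ratio of about 2.7 agrees with the stated reduction by a factor of more than 2.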


Fig. 6. Lay-out of the standard cell core (top) and the dynamic core (bottom) in a 0.35 µm AMIS technology.

5 Conclusions and future work

This paper described the implementation of two Trivium cores in different design styles. The cores are currently being manufactured on a single chip. The sizes of the cores are compared based on their lay-outs. The custom design shows a significant decrease in die area compared to the standard cell design.

Upon production, the functionality of the chip will be tested. Moreover, the minimum and maximum operating frequencies will be determined for both cores, as well as the allowable rise and fall times of the clock.

References

1. 0.35 micron AMIS technology. http://www.amis.com/pdf/standard cell/sc3 fs.pdf, 2008.

2. eSTREAM Phase 3: Trivium. http://www.ecrypt.eu.org/stream/triviump3.html, 2008.

3. TSMC. TSMC standard cell libraries. http://www.cadence.com/datasheets/4456 TSMC SC ds.pdf, 2008.