An FPGA Implementation of 30Gbps Security Module for GPON Systems

Truong Quang Vinh1, Ju-Hyun Park1, Young-Chul Kim1, Kwang-Ok Kim2

1 Department of Electronics and Computer Engineering, Chonnam National University, 300 Yongbong-dong, Buk-gu, Gwangju 500-757, Korea

[email protected]

2 Electronics and Telecommunication Research Institute, 161 Gajeong-dong, Yuseong-gu, Daejeon, Korea

[email protected]

Abstract

GPON systems require gigabit-throughput data encryption for security and privacy. This paper presents an implementation of a very high-speed security module for GPON on a Virtex-4 FPGA. The security module supports payload encryption with constant delay by using the counter-mode AES algorithm. Our AES design has three advanced features: a composite-field-arithmetic SubByte, an efficient MixColumn transformation, and on-the-fly key scheduling. A fully pipelined architecture is employed for the AES core in order to achieve high performance for the security module. Experiments show that the proposed architecture achieves a throughput of 30 Gbits/s on a Xilinx Virtex-4 VLX100-12 device. This performance is well suited to encryption applications in GPON systems.

1. Introduction

GPONs (Gigabit-capable Passive Optical Networks) have recently become attractive for the cost-effective delivery of high-bandwidth data directly to the building, curb, and home. This creates a strong requirement for the access network to be trustworthy, secure, and reliable. Because of the multicast nature of GPONs, an encryption module is an essential part of a GPON system for protecting broadcast data from eavesdropping. The ITU-T G.984 document [1] recommends using the Advanced Encryption Standard (AES) for payload encryption in GPONs. The National Institute of Standards and Technology (NIST) defined five modes of operation for AES [2]. However, only AES in counter mode (CTR-AES) can be used for GPON payload encryption. In this paper, we present a GPON security module using the CTR-AES algorithm, implemented with a fully pipelined architecture that is optimized for area and performance.

For a hardware implementation of the security module, there are two critical constraints: performance and the amount of resources needed. To achieve high throughput for the gigabit links in GPONs, we apply pipelined architectures to all processing blocks of the security module, especially the AES core. A pipelined AES architecture improves the throughput, but it occupies a large area because the round hardware is duplicated for every round. Therefore, researchers have proposed several speed-area trade-offs for AES architectures. To reduce the resources needed for an AES implementation, researchers focus on improving individual blocks of the cipher. In [8]-[10], efficient implementations of the S-box are proposed to minimize area and delay. The proposed S-box architectures combine the SubBytes and Inverse SubBytes transformations in logic instead of using look-up tables, which require much memory. To improve the key schedule, some authors propose on-the-fly key expansion, which generates the round keys concurrently with the encryption or decryption procedure and needs no extra memory to store the round keys [8], [9].

This paper explores efficient schemes for designing the security module in order to achieve the target performance of GPON systems. Our design employs a composite-field-arithmetic architecture for the SubByte transformation. Moreover, we apply sub-pipelining to this function block to improve the throughput of the AES algorithm. The key expander is also improved. We propose an area-efficient key expander that computes the round keys on the fly. In addition, we use a sub-pipelined architecture for the key expansion block and an optimized set of registers to store the round keys. Our key expander suits a pipelined AES architecture because it can start at the same time as the data encryption.

The paper is organized as follows. Section 2 presents the architecture of the GPON security module. Section 3 describes the hardware implementation of the AES algorithm. The advanced features of the AES hardware implementation are presented in Section 4. Section 5 presents the implementation results and the performance comparisons with different architectures. Section 6 gives the conclusions.

978-1-4244-2358-3/08/$20.00 © 2008 IEEE, CIT 2008

2. The architecture of the GPON security module

The GPON security module is implemented to guarantee secure communication on the Tx/Rx links of a GPON. With this module, the confidentiality, integrity, and origin authenticity of each frame sent and received by the OLT (Optical Line Termination) / ONT (Optical Network Termination) are ensured [1]. The top-level structure of the GPON security module is shown in Fig. 1.

Fig. 1. The top structure of the GPON security module.

a. Port-ID Table: implemented as 4K 12-bit registers that store the port identifiers. Only frames with an appropriate Port-ID are encrypted by the CTR-AES core.

b. Security Decoder: generates the 128-bit crypto counter in the format (Inter Frame Count[19:0] & Intra Frame Count[15:0]) & (Inter Frame Count[29:0] & Intra Frame Count[15:0]) & (Inter Frame Count[29:0] & Intra Frame Count[15:0]), where & denotes concatenation, giving 36 + 46 + 46 = 128 bits. It also registers the 128-bit GTC payload for the Payload Bypass.

c. Key Expander: stores the initial key and generates the round keys for CTR-AES from the 128-bit key input. The round keys total 1408 = 128 × (10+1) bits. The shadow key is used when the OLT requests a key exchange. The ONT responds by generating, storing, and sending a new key. When the new key has been transferred successfully to the OLT, both the OLT and the ONU (Optical Network Unit) begin using the new key at precisely the same frame boundary.

d. CTR-AES Core: performs the same process as the AES algorithm, except that its input is the crypto counter. The crypto counter increases at every 128-bit data block. The 128-bit input blocks are transformed into 128-bit pseudorandom cipher blocks.

e. Payload Bypass: delivers the unsecured payload without encryption. It has the same delay as the encryption path so that it stays synchronized with the cipher GEM payload at the output.

f. Security Encoder: multiplexes the output GEM (G-PON Encapsulation Method) payload from the bypassed GEM payload and the encrypted GEM payload, depending on whether the security function is enabled. For frames that are to be encrypted, the encoder XORs the 128-bit pseudorandom cipher block with the delayed GEM payload to generate the cipher GEM payload.
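The counter format in item (b) and the XOR step in item (f) can be sketched in software as follows. This is only an illustrative model under stated assumptions: the truncated 36-bit group is taken to be the most significant part of the block (reading the format left to right), the counter is assumed to advance by one per 128-bit block, and `placeholder_cipher` is a hypothetical SHA-256-based stand-in for the forward AES function, not AES itself.

```python
import hashlib

INTER_BITS, INTRA_BITS = 30, 16  # field widths taken from the format in item (b)

def crypto_counter_block(inter: int, intra: int) -> int:
    """Build the 128-bit crypto counter from the two frame counters."""
    assert 0 <= inter < (1 << INTER_BITS) and 0 <= intra < (1 << INTRA_BITS)
    full = (inter << INTRA_BITS) | intra                # 46 bits: Inter[29:0] & Intra[15:0]
    short = ((inter & 0xFFFFF) << INTRA_BITS) | intra   # 36 bits: Inter[19:0] & Intra[15:0]
    # Assumed ordering: truncated copy first (most significant), 36 + 46 + 46 = 128 bits.
    return (short << 92) | (full << 46) | full

def placeholder_cipher(key: bytes, block: int) -> int:
    """Hypothetical stand-in for the forward cipher. This is NOT AES."""
    digest = hashlib.sha256(key + block.to_bytes(16, "big")).digest()
    return int.from_bytes(digest[:16], "big")

def ctr_xor(key: bytes, inter: int, intra_start: int, payload: bytes) -> bytes:
    """XOR the payload with the keystream; the same call encrypts and decrypts."""
    out = bytearray()
    for i in range(0, len(payload), 16):
        chunk = payload[i:i + 16]
        counter = crypto_counter_block(inter, intra_start + i // 16)
        keystream = placeholder_cipher(key, counter).to_bytes(16, "big")
        out += bytes(p ^ k for p, k in zip(chunk, keystream))
    return bytes(out)
```

Because only the forward function is ever evaluated, applying `ctr_xor` twice with the same key and counters returns the original payload, which is why one hardware core can serve both directions.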

The AES algorithm in the GPON security module uses counter mode to encrypt data [2]. In counter-mode encryption, the forward cipher function is invoked on each counter block, and the resulting output blocks are exclusive-ORed with the corresponding plaintext blocks to produce the ciphertext blocks. The forward cipher function is used in both CTR encryption and CTR decryption. Therefore, a single hardware implementation serves both encryption and decryption. The XOR operation is executed in the security encoder block.

3. AES core implementation

3.1 AES general architecture

The AES algorithm is a symmetric-key cipher, in which the sender and the receiver use a single key for both encryption and decryption. In AES encryption, each round except the final round consists of four steps: SubByte, ShiftRow, MixColumn, and AddRoundKey. SubByte is a nonlinear transformation that substitutes each byte of the round data according to a substitution table called the S-box. The ShiftRow step is a circular shift of the bytes in each row of the round data. The MixColumn transformation operates on the State column by column, treating each column as a four-term polynomial. AddRoundKey is performed simply by XORing the round key with the data block. The round keys differ in every round and are generated by the Key Expansion.

3.2 The full-pipelined architecture for AES algorithm

In order to achieve very high throughput, we apply pipelining both across rounds (outer-round) and within rounds (inner-round) of the AES architecture. For outer-round pipelining, the pipeline registers are placed between the data path instances of each round. For inner-round pipelining, we decompose the four processes SubByte, ShiftRow, MixColumn, and AddRoundKey into sub-pipelined stages with equivalent delay. Fig. 2 shows the fully pipelined architecture of the AES algorithm.

Among the round processes of the AES algorithm, the SubByte phase has the longest delay. Therefore, this block is given more sub-stages than the other phases. We implemented two fully pipelined architectures, with 3-stage and 5-stage sub-pipelines for each round process. In these, the SubByte block is decomposed into 2 stages and 3 stages, respectively. The 5-stage sub-pipelined AES architecture achieves the higher throughput.

Fig. 2. Full-pipelined architecture for AES algorithm.
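The effect of full pipelining on latency and throughput can be illustrated with a small cycle-level model. The sketch below is not the AES data path: the stage functions are hypothetical placeholders, and the stage count of r × 10 + 1 (r sub-stages per round, 10 rounds, plus one initial stage) is the figure the paper quotes for its implementation in Section 5.

```python
def simulate_pipeline(blocks, stage_fns):
    """Clock a register pipeline and return (cycle, value) for each output block."""
    regs = [None] * len(stage_fns)      # one pipeline register per stage
    pending = list(blocks)
    outputs = []
    cycle = 0
    while pending or any(r is not None for r in regs):
        incoming = pending.pop(0) if pending else None
        previous = [incoming] + regs[:-1]
        # On each clock edge every stage processes what the previous stage held.
        regs = [None if v is None else fn(v) for fn, v in zip(stage_fns, previous)]
        cycle += 1
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs

r, rounds = 3, 10
stages = [lambda x: x + 1] * (r * rounds + 1)   # placeholder stage logic
results = simulate_pipeline(range(5), stages)
first_cycle = results[0][0]                      # latency of the first block
```

In this model the first result leaves the pipeline after as many cycles as there are stages, and every later cycle delivers one more block, so the sustained rate is one 128-bit block per clock.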

4. Advanced features for AES Hardware implementation

This section presents the advanced features of our AES hardware implementation. Each sub-block in the encryption process is optimized for area and delay. Our improvements to the AES architecture focus on the SubByte, MixColumn, and Key Expander blocks. The detailed hardware implementations of these blocks are described below.

4.1. SubByte transformation

In the SubByte transformation (S-box), the input is considered as an element of GF(2^8). First, the multiplicative inverse in GF(2^8) is calculated. Then, an affine transformation over GF(2) is applied. An S-box can be implemented as a look-up table, but this consumes many resources. Alternatively, the S-box can be implemented using Galois field operations [10]. Field arithmetic in GF(2^4) is used instead of GF(2^8) to reduce area. In this architecture, the input value is mapped to two elements of GF(2^4). Then, the multiplicative inverse is calculated using GF(2^4) operations. Next, the two GF(2^4) elements are inverse-mapped to one element of GF(2^8). Last, the affine transformation is performed. Although the composite-field implementation of the S-box is very efficient in area, it suffers from a long critical path. To overcome this drawback, further pipelining can be used. With the 2-stage pipelined architecture, which uses three 8-bit registers (Fig. 3), the critical path is cut in half. To reduce the path delay further, the 3-stage pipelined architecture can also be applied (Fig. 4).


Fig. 3. 2-stage pipelined SBox using GF operations.

Fig. 4. 3-stage pipelined SBox using GF operations.
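The map, invert, inverse-map, affine sequence of Section 4.1 can be checked in software. The sketch below is a functional model only, and several choices in it are assumptions rather than the paper's circuit: GF(2^4) is built modulo x^4 + x + 1, the extension uses y^2 + y + λ with the first λ that makes it irreducible, and the isomorphism is found by searching for a root of the AES polynomial instead of using a fixed mapping matrix.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a):
    return next(b for b in range(1, 16) if gf16_mul(a, b) == 1)

# First lambda for which y^2 + y + lambda has no root in GF(2^4), i.e. is irreducible.
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(t, t) ^ t != lam for t in range(16)))

def comp_mul(a, b):
    """Multiply ah*y + al by bh*y + bl, reducing with y^2 = y + lambda."""
    ah, al, bh, bl = a >> 4, a & 15, b >> 4, b & 15
    hh = gf16_mul(ah, bh)
    hi = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    lo = gf16_mul(hh, LAMBDA) ^ gf16_mul(al, bl)
    return (hi << 4) | lo

def comp_inv(a):
    """Inverse in the composite field using only GF(2^4) operations; 0 maps to 0."""
    if a == 0:
        return 0
    ah, al = a >> 4, a & 15
    d = gf16_mul(gf16_mul(ah, ah), LAMBDA) ^ gf16_mul(ah, al) ^ gf16_mul(al, al)
    di = gf16_inv(d)
    return (gf16_mul(ah, di) << 4) | gf16_mul(ah ^ al, di)

def comp_pow(a, n):
    r = 1
    for _ in range(n):
        r = comp_mul(r, a)
    return r

# A root of the AES polynomial x^8 + x^4 + x^3 + x + 1 defines the field isomorphism.
BETA = next(e for e in range(2, 256)
            if comp_pow(e, 8) ^ comp_pow(e, 4) ^ comp_pow(e, 3) ^ e ^ 1 == 0)

def to_comp(b):
    """Map a GF(2^8) byte to the composite field: sum of b_i * BETA^i."""
    r, p = 0, 1
    for i in range(8):
        if (b >> i) & 1:
            r ^= p
        p = comp_mul(p, BETA)
    return r

FROM_COMP = {to_comp(b): b for b in range(256)}   # inverse mapping

def affine(x):
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return x ^ rot(x, 1) ^ rot(x, 2) ^ rot(x, 3) ^ rot(x, 4) ^ 0x63

def sbox(b):
    return affine(FROM_COMP[comp_inv(to_comp(b))])
```

In hardware the two mappings are fixed XOR networks and the GF(2^4) operations are small combinational blocks, which is where the area saving over a 256-entry table comes from.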

4.2. MixColumn

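In the MixColumn transformation, the columns of the State are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial c(x) = ‘03’ x^3 + ‘01’ x^2 + ‘01’ x + ‘02’. In direct form, the MixColumn transformation can be expressed as

$$
\begin{aligned}
s'_{0,c} &= (\{02\} \bullet s_{0,c}) \oplus (\{03\} \bullet s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= s_{0,c} \oplus (\{02\} \bullet s_{1,c}) \oplus (\{03\} \bullet s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} &= s_{0,c} \oplus s_{1,c} \oplus (\{02\} \bullet s_{2,c}) \oplus (\{03\} \bullet s_{3,c}) \\
s'_{3,c} &= (\{03\} \bullet s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \bullet s_{3,c})
\end{aligned}
\tag{1}
$$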

Several architectures have been proposed for the implementation of the MixColumn transformation. A substructure-shared architecture is applied in [4], [6], [7], [9]. In our implementation, we also use substructure sharing to implement efficient hardware for the MixColumn transformation. To apply this technique, equation (1) is rewritten as

$$
\begin{aligned}
s'_{0,c} &= \{02\} \bullet (s_{0,c} \oplus s_{1,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= \{02\} \bullet (s_{1,c} \oplus s_{2,c}) \oplus s_{2,c} \oplus s_{3,c} \oplus s_{0,c} \\
s'_{2,c} &= \{02\} \bullet (s_{2,c} \oplus s_{3,c}) \oplus s_{3,c} \oplus s_{0,c} \oplus s_{1,c} \\
s'_{3,c} &= \{02\} \bullet (s_{3,c} \oplus s_{0,c}) \oplus s_{0,c} \oplus s_{1,c} \oplus s_{2,c}
\end{aligned}
\tag{2}
$$

The equation for the MixColumn transformation is now more symmetrical, so substructure sharing can be applied to reduce the area of the hardware implementation. The {02} constant multiplication is computed by a function denoted a = xtime(b). The xtime() function can be implemented at the byte level as a left shift followed by a conditional bitwise XOR with {1b} if the most significant bit of the input byte is one (b7 = 1). The xtime() block can be implemented with three 2-input XOR gates. By using this efficient xtime() architecture and applying XOR sharing, the MixColumn transformation can be implemented as shown in Fig. 5.

Fig. 5. (a) The efficient architecture of the MixColumn.

(b) The implementation of xtime() function.
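The equivalence of the direct form (1) and the rewritten form (2) can be checked with a short software model. The particular choice of shared XOR terms below (the four pairwise sums) is an assumption made for illustration; the actual sharing used in Fig. 5 may differ.

```python
def xtime(b):
    """Multiply a byte by {02}: left shift, then XOR with {1b} if b7 was set."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b

def mix_column_direct(col):
    """One State column transformed with the direct form, equation (1)."""
    s0, s1, s2, s3 = col
    m3 = lambda v: xtime(v) ^ v          # {03}*v = {02}*v xor v
    return [xtime(s0) ^ m3(s1) ^ s2 ^ s3,
            s0 ^ xtime(s1) ^ m3(s2) ^ s3,
            s0 ^ s1 ^ xtime(s2) ^ m3(s3),
            m3(s0) ^ s1 ^ s2 ^ xtime(s3)]

def mix_column_shared(col):
    """Same column transformed with equation (2), reusing the pairwise XOR terms."""
    s0, s1, s2, s3 = col
    t01, t12, t23, t30 = s0 ^ s1, s1 ^ s2, s2 ^ s3, s3 ^ s0   # shared substructures
    return [xtime(t01) ^ s1 ^ t23,
            xtime(t12) ^ s2 ^ t30,
            xtime(t23) ^ s3 ^ t01,
            xtime(t30) ^ s0 ^ t12]
```

Each pairwise term feeds one xtime() block and is reused in another output byte, so the shared form needs four xtime() blocks per column instead of the eight implied by evaluating (1) term by term.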

The total gate count for the MixColumn transformation is 324, which corresponds to 108 2-input XOR gates (each XOR gate counts as 3 gates).

4.3. Key-Expander

The Key Expansion routine generates a total of 11 round keys from the initial key in the 128-bit AES algorithm. For a pipelined AES architecture, all round keys must be available at the same time. Therefore, some researchers implemented a key expansion routine that computes one round key and duplicated this hardware 10 times for the 10 rounds [4], [5]. These architectures can calculate all round keys at the same time, but they consume much area. Other researchers have proposed methods to reduce this cost. Xinmiao Zhang [9] proposed a key expander that operates on the fly, so that the data encryption and the key expansion can start simultaneously. Building on that architecture, we implement an area-efficient key expander that also computes the round keys on the fly. In order to operate synchronously with the sub-pipelined round process, the key expander is divided into r sub-stages. We use 11 registers to store the 11 round keys. This differs from the key expansion architecture in [9], in which the author used r sets of registers to hold all round keys and the temporary values of the sub-pipelined stages. With this scheme, our design uses less area than the previous architecture. The sub-pipelined architecture of the on-the-fly key expander with 3 sub-stages (r=3) is shown in Fig. 6.

Since the round keys are generated on the fly, the number of sub-pipelined stages for key expansion must be the same as the number of encryption sub-stages. A new round key is generated every r clock cycles, so all the round keys are available after (r×Nr) + 1 clock cycles.


Fig.6. The architecture of on-the-fly key expander.
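On-the-fly key expansion means that each round key is derived from the previous one alone, so no key memory beyond the round-key registers is needed. The sketch below models this for AES-128 in software. It is a functional illustration, not the sub-pipelined circuit of Fig. 6; the S-box is built by brute force for brevity, and the helper names are hypothetical.

```python
def xtime(b):
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b

def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a = xtime(a)
    return r

def _sbox_entry(x):
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63

SBOX = [_sbox_entry(x) for x in range(256)]

def next_round_key(prev: bytes, rcon: int) -> bytes:
    """Derive round key i+1 from round key i only (AES-128)."""
    words = [prev[i:i + 4] for i in range(0, 16, 4)]
    t = words[3][1:] + words[3][:1]                 # RotWord
    t = bytes(SBOX[b] for b in t)                   # SubWord
    t = bytes([t[0] ^ rcon]) + t[1:]                # add round constant
    out = []
    for w in words:                                 # chain the four new words
        t = bytes(a ^ b for a, b in zip(w, t))
        out.append(t)
    return b"".join(out)

def round_keys(key: bytes, nr: int = 10):
    """Yield round keys 0..nr one at a time, as an on-the-fly expander would."""
    rk, rcon = key, 0x01
    yield rk
    for _ in range(nr):
        rk = next_round_key(rk, rcon)
        rcon = xtime(rcon)
        yield rk

def all_keys_ready_cycle(r: int, nr: int = 10) -> int:
    """Cycle count after which every round key is available: (r x Nr) + 1."""
    return r * nr + 1
```

Since round key i is only needed when the data block reaches round i, producing one key every r cycles keeps the expander in step with an encryption pipeline that has r sub-stages per round.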

5. Performance results and comparisons

We implemented the GPON security module with a fully pipelined 128-bit CTR-AES architecture on a Virtex-4 VLX100-12. Xilinx ISE 8.2i was used to synthesize the design and to obtain post-placement timing results. For simulation, we used ModelSim 5.8c to verify the encryption and decryption operations. We evaluated the hardware cost in terms of BRAMs, slices, maximum frequency, and throughput.

We implemented two fully pipelined architectures of the AES core, with 3 sub-pipelined stages (r=3) and 5 sub-pipelined stages (r=5) per round. The 3-stage sub-pipelined design has 31 stages in total (r×10 + 1). Thus, after an initial latency of 31 clock cycles, one ciphertext block appears every clock cycle. With this architecture, we achieve a throughput of 26.7 Gbits/s. The 5-stage sub-pipelined design has higher performance than the 3-stage design, but it consumes more area for pipeline registers and takes more clock cycles for the round processes.

Table 1 compares existing AES implementations with ours. Since the previous architectures were implemented on VirtexE devices, we also targeted a Xilinx VirtexE-family device in addition to the Virtex-4 so that the results can be compared fairly. According to the results, the designs in [3]-[5] have lower performance because they use only outer-round pipelining. In the implementations of [7] and [9], the authors improve the throughput by applying sub-pipelined architectures; nevertheless, these designs require more slices for the extra hardware. In terms of throughput per slice, our implementation is more efficient than the published approaches. The synthesis report shows that our design with the 5-stage sub-pipelined architecture achieves a throughput of 31.6 Gbits/s.

Table 1. Comparison of FPGA implementation of the AES algorithm

| Design | Device | Frequency (MHz) | Throughput (Mbps) | Slices | BRAMs | Mbps/slice |
|---|---|---|---|---|---|---|
| Shuenn-Shyang [3] | XCV1000e-8 | 125.38 | 1604 | 1857 | 0 | 0.867 |
| Jae-Gon Lee [4] | XCV3200e-8 | 40 | 5120 | 8009 | 104 | 0.639 |
| Saqib, N.A. [5] | XCV812e-8 | 20.192 | 2584 | 2744 | 0 | 0.942 |
| Jarvinen [7] | XCV1000e-8 | 129.2 | 16500 | 11719 | 0 | 1.408 |
| Xinmiao Zhang (r=3) [9] | XCV812e-8 | 93.5 | 11965 | 9406 | 0 | 1.272 |
| Xinmiao Zhang (r=7) [9] | XCV1000e-8 | 168.4 | 21556 | 11022 | 0 | 1.956 |
| Our AES core design (r=3) | XCV1000e-8 | 91.1 | 11661 | 8914 | 0 | 1.308 |
| Our AES core design (r=5) | XCV1000e-8 | 150.25 | 19232 | 9820 | 0 | 1.958 |
| Our AES core design (r=3) | XC4VLX100-12 | 208.49 | 26686 | 9478 | 0 | 2.816 |
| Our AES core design (r=5) | XC4VLX100-12 | 247.19 | 31640 | 9904 | 0 | 3.195 |
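The throughput and efficiency figures for our designs in Table 1 follow from the clock frequency, assuming that a full pipeline delivers one 128-bit block per clock cycle. The short calculation below reproduces them; the function names are illustrative only.

```python
BLOCK_BITS = 128  # AES block size; one block leaves the pipeline per clock

def throughput_mbps(f_mhz: float) -> float:
    """Sustained throughput of a full pipeline clocked at f_mhz."""
    return BLOCK_BITS * f_mhz

def pipeline_stages(r: int, rounds: int = 10) -> int:
    """Total stage count for r sub-stages per round: r x 10 + 1."""
    return r * rounds + 1

def mbps_per_slice(mbps: float, slices: int) -> float:
    return mbps / slices

# Virtex-4 rows of Table 1: (r, frequency in MHz, slices)
for r, f_mhz, slices in [(3, 208.49, 9478), (5, 247.19, 9904)]:
    t = throughput_mbps(f_mhz)
    print(f"r={r}: {pipeline_stages(r)} stages, {t:.0f} Mbps, "
          f"{mbps_per_slice(t, slices):.3f} Mbps/slice")
```

For example, 128 bits × 247.19 MHz gives about 31640 Mbps, which is the 31.6 Gbits/s quoted for the 5-stage design.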

The whole architecture of the GPON security module, including the AES core, was synthesized on a Xilinx Virtex-4 VLX100-12. Some extra resources are needed for the security decoder, security encoder, and payload bypass. The total areas of the security module with the AES core (r=3) and the AES core (r=5) are therefore 11958 slices and 13384 slices, respectively.

6. Conclusions

We presented an FPGA implementation of a high-speed GPON security module using the counter-mode AES algorithm. Our design has three main features: a composite-field-arithmetic SubByte, an area-efficient MixColumn, and an on-the-fly sub-pipelined key expander. With these features, our design achieves high throughput with efficient use of area. For the fully pipelined architecture with 51 stages, we achieve a throughput of 30 Gbits/s on a Virtex-4 VLX100 device. Our implementation is well suited to encryption applications in GPON systems.

Acknowledgement

This research was financially supported by the Electronics and Telecommunication Research Institute (ETRI) in Korea. The CAD tools for design in this work were supported by IDEC.

References

[1] “Gigabit-capable Passive Optical Networks (G-PON): Transmission convergence layer specification”, ITU-T G.984.3 Amendment 1, July 2005.

[2] Morris Dworkin, “Recommendation for Block Cipher Modes of Operation”, NIST Special Publication, http://csrc.nist.gov/CryptoToolkit/modes/, 2001.

[3] Shuenn-Shyang Wang, Wan-Sheng Ni, “An efficient FPGA implementation of advanced encryption standard algorithm”, Proceedings of the International Symposium on Circuits and Systems, vol. 2, pp. 597-600, May 2004.

[4] Jae-Gon Lee, Woong Hwangbo, Seonpil Kim, Chong-Min Kyung, “Top-down implementation of pipelined AES cipher and its verification with FPGA-based simulation accelerator”, Proceedings of 6th International Conference on ASIC, pp. 68-72, Oct. 2005.

[5] Saqib, N.A., Rodriguez-Henriquez, F., Diaz-Perez, A., “AES algorithm implementation - an efficient approach for sequential and pipeline architectures”, Proceedings of the Fourth Mexican International Conference on Computer Science, pp. 126-130, Sept. 2003.

[6] Nedjah, N., de Macedo Mourelle, L., Cardoso, M.P., “A Compact Pipelined Hardware Implementation of the AES-128 Cipher”, Third International Conference on Information Technology: New Generations, pp. 216-221, April 2006.

[7] Yongzhi Fu, Lin Hao, Xuejie Zhang, Rujin Yang, “Design of an extremely high performance counter mode AES reconfigurable processor”, Second International Conference on Embedded Software and Systems, Dec. 2005.

[8] Hodjat, A., Verbauwhede, I., “Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors”, IEEE Transactions on Computers, vol. 55, no. 4, pp. 366-372, April 2006.

[9] Xinmiao Zhang, Parhi, K.K., “High-speed VLSI architectures for the AES algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957-967, Sept. 2004.

[10] J. Wolkerstorfer, E. Oswald, and M. Lamberger, “An ASIC Implementation of the AES Sboxes”, Proceeding of RSA Conference, pp. 29-52, Feb. 2002.
