Ruo-Ting Yang(楊若婷), Xin-Yi Xue(薛新伊), Shu-Cheng Yang(楊樹澄), Xiao-Ping Gao(高小平), Jie Ren(任潔), Wei Yan(嚴偉), and Zhen Wang(王鎮)
1State Key Laboratory of Functional Material for Informatics, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
2CAS Center for Excellence in Superconducting Electronics (CENSE), Shanghai 200050, China
3University of Chinese Academy of Sciences, Beijing 100049, China
4School of Software and Microelectronics, Peking University, Beijing 100871, China
Keywords: RSFQ, AES, S-box, hardware implementation
Rapid single flux quantum (RSFQ) circuits[1] are the first members of the SFQ family. They are superconducting digital circuits that operate in an ultra-low-temperature environment (typically 4.2 K). RSFQ circuits are known to offer significant advantages in reducing crosstalk and power consumption (on the order of nW per gate). The picosecond switching time of Josephson junctions (JJs) lets them operate much faster than common CMOS circuits; e.g., the fastest RSFQ devices have been tested at 770 GHz at 4.2 K.[2] Another unique feature of RSFQ circuits is that most logic cells (AND, OR, etc.) are driven by the clock signal, i.e., RSFQ circuits are naturally gate-level-pipelined synchronous sequential circuits. This feature gives RSFQ circuits both high throughput and a high clock frequency.
RSFQ circuits are suitable for applications that require intensive computing resources, low power consumption, or high throughput. In the past few years, they have been used in many applications, such as analog-to-digital converters (ADCs),[3] fast Fourier transform (FFT) processors,[4] and CPUs.[5] We observe that the high throughput and high speed of RSFQ circuits could benefit the hardware implementation of encryption algorithms. The advanced encryption standard (AES)[6] algorithm is one of the most popular symmetric-key encryption algorithms and is widely used in the field of information security. The AES algorithm has been studied widely in CMOS circuits.[7-10] However, we could find only one research team that has implemented a 128-bit AES algorithm in RSFQ logic,[11] and it reported logic simulation results only.
In this paper, we demonstrate the key operation of the AES-128 algorithm, i.e., the SubBytes operation, using RSFQ circuits based on the SIMIT Nb03 process and its cell library. The simulated performance of the RSFQ circuits, including throughput, clock frequency, and power consumption, is presented and compared with that of the same structure in CMOS logic. Moreover, we fabricate individual modules of the S-box on chips and test them under both high- and low-frequency clocks. The measurement results confirm that the modules function correctly. To the best of our knowledge, this is the first demonstration of working RSFQ-based AES modules.
The rest of this paper is organized as follows. Section 2 briefly introduces the basic concepts of the AES algorithm and the implementation of the S-box circuit. Section 3 proposes our AES S-box hardware implementation structure and design in RSFQ logic and compares it with a circuit of the same structure in CMOS logic. The design and test results of individual modules of the S-box are presented in Section 4. In Section 5, we present conclusions and an outlook.
The AES algorithm is a symmetric block cipher that can encrypt (encipher) and decrypt (decipher) information. It was declared the advanced encryption standard by the National Institute of Standards and Technology (NIST) in 2001. AES adopts the Rijndael algorithm[12] as its cryptographic algorithm, and it is widely used in many fields.[13-16] The essence of the AES algorithm is to convert polynomial multiplication and modular reduction into shift operations and bitwise XOR operations, which greatly simplifies the hardware implementation.
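The reduction of finite-field multiplication to shifts and XORs can be illustrated with a minimal Python sketch. It assumes the standard AES field GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1; the function names `xtime` and `gf_mul` are chosen here for illustration and do not come from the circuit design described in this paper.

```python
AES_MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial


def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by x: one left shift plus a conditional XOR."""
    a <<= 1
    if a & 0x100:  # a degree-8 term appeared, so reduce modulo the AES polynomial
        a ^= AES_MODULUS
    return a & 0xFF


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements using only shifts and XORs."""
    result = 0
    while b:
        if b & 1:  # add (XOR) the current multiple of a
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


print(hex(gf_mul(0x57, 0x83)))  # prints 0xc1
```

The loop never uses an integer multiplication: each step is a one-bit shift followed by at most two XORs, which is the property that makes the algorithm convenient for hardware.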
Fig.1. SubBytes() mapping process (left) and the S-box table (right).
The four individual transformations in the cipher are SubBytes(), ShiftRows(), MixColumns(), and AddRoundKey(). In the SubBytes() transformation, each byte of the state is independently replaced through a non-linear substitution table, known as an S-box. The ShiftRows() transformation is a shift operation in which the bytes in the last three rows of the state are cyclically shifted by different numbers of bytes, while the first row is left unchanged. The MixColumns() transformation operates on the state column by column, treating each column as a four-term polynomial. The AddRoundKey() transformation is a simple bitwise XOR operation in which a round key is added to the state.
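As a rough illustration of the two simplest transformations, the sketch below models the state as a 4x4 list of rows. The representation and the function names `shift_rows` and `add_round_key` are assumptions made for this example; the shift amounts (row r rotated left by r bytes) follow the AES standard.

```python
def shift_rows(state):
    """Cyclically shift row r of a 4x4 state left by r bytes (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def add_round_key(state, round_key):
    """XOR each state byte with the corresponding round-key byte."""
    return [[s ^ k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key)]


# Illustrative state: the byte in row r, column c is 0x{r}{c}
state = [[0x00, 0x01, 0x02, 0x03],
         [0x10, 0x11, 0x12, 0x13],
         [0x20, 0x21, 0x22, 0x23],
         [0x30, 0x31, 0x32, 0x33]]

shifted = shift_rows(state)
print([hex(b) for b in shifted[1]])  # second row rotated left by one byte
```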
The SubBytes() transformation is a mapping process, whose operation and S-box table in hexadecimal form are presented in Fig. 1. For example, if S1,1 equals {64}, the result of the transformation is S'1,1 = {43}.
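The table entries can be cross-checked against the algebraic definition of the AES S-box: a multiplicative inverse in GF(2^8) followed by a fixed affine transformation with the constant {63}. The sketch below is a plain software reference under that standard definition, not the circuit decomposition used in this paper; the helper names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); {00} maps to {00} by AES convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 = a^(-1), since the nonzero elements form a group of order 255
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(byte: int) -> int:
    """AES S-box: field inversion followed by the affine transformation."""
    inv = gf_inv(byte)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


print(hex(sbox(0x64)))  # prints 0x43
```

The brute-force inversion is chosen for readability only; it is far from how a compact hardware S-box computes the inverse.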
The design of the S-box circuit is the core of the implementation of the SubBytes() transformation, as well as a vital challenge in designing compact implementations of AES. In CMOS circuits, the look-up table (LUT) is a common method to complete the circuit design.[17] The state matrix greatly simplifies the design and is very easy to realize in CMOS circuits. It can also effectively improve the data throughput. The concept is also adopted in Ref. [11], where the S-box is realized in RSFQ logic. There are other methods to design the S-box, e.g., the order-reduced method,[16] which uses a multiplier. However, the large hardware cost of an RSFQ multiplier would reduce the advantage of a high multiplexing rate and introduce a larger delay in the critical path.
In this paper, we choose to implement the S-box of a masked and bit-sliced AES-128 algorithm.[18] The whole circuit is designed in a fully combinational way. The idea of linear and non-linear mappings is adopted in the design.[19] We choose this design method for two reasons:
(i) RSFQ circuits offer high throughput and high frequency, but calculations such as multiplication cost enormous resources. In this method, single-bit addition and multiplication are realized through XOR gates for the linear mapping and AND gates for the non-linear mapping, respectively, which is ideal for RSFQ circuits.
(ii) The fully combinational design method is the most basic design method in CMOS logic. Compared with a multiplier or LUT design in RSFQ logic, a fully combinational structure is more hardware-resource-efficient and more reliable.
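To make the split between linear (XOR) and non-linear (AND) operations concrete, the sketch below multiplies two elements of the small field GF(2^2) using only single-bit AND and XOR gates and checks the result against a polynomial reference. The choice of GF(2^2) with the modulus x^2 + x + 1 is an assumption made purely for illustration; it is not claimed to be the decomposition used in the S-box of this paper.

```python
def gf4_mul_gates(a1, a0, b1, b0):
    """Multiply (a1*x + a0) by (b1*x + b0) in GF(2^2) modulo x^2 + x + 1.

    Only single-bit AND (non-linear) and XOR (linear) operations are used.
    """
    # Non-linear layer: single-bit multiplications, i.e. AND gates
    p11 = a1 & b1
    p10 = a1 & b0
    p01 = a0 & b1
    p00 = a0 & b0
    # Linear layer: single-bit additions, i.e. XOR gates (x^2 is replaced by x + 1)
    c1 = p11 ^ p10 ^ p01
    c0 = p11 ^ p00
    return c1, c0


def gf4_mul_reference(a, b):
    """Reference: carry-less multiply, then reduce modulo x^2 + x + 1 (0b111)."""
    product = 0
    for i in range(2):
        if (b >> i) & 1:
            product ^= a << i
    if product & 0b100:  # the product has degree at most 2, so one reduction suffices
        product ^= 0b111
    return product


# Exhaustive comparison over all 16 input pairs
mismatches = 0
for a in range(4):
    for b in range(4):
        c1, c0 = gf4_mul_gates(a >> 1, a & 1, b >> 1, b & 1)
        if ((c1 << 1) | c0) != gf4_mul_reference(a, b):
            mismatches += 1
print("mismatches:", mismatches)
```

In this toy multiplier, four AND gates form the non-linear layer and three XOR gates form the linear layer, which mirrors the way the design separates the two kinds of modules.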
The whole S-box implementation circuit is shown in Fig. 2. It consists of 14 modules, named linear map number (lm), non-linear map number (nlm), or input/output map. The input map is the input module that processes the input signal. It is a linear mapping module in which the input data passes through XOR calculations and is then passed to the following modules. The non-linear map modules (number = 0-5) perform the non-linear mapping and consist of AND gates, while the linear map modules (number = 0-5) perform the linear mapping and consist of XOR gates. The output map is an output mapping module composed of NOT gates, which combines and adjusts the data from the previous modules.
Fig.2. Structure of S-box implementation in bit-sliced AES-128.
We build the same structure for pre-simulation in our simulator environment, as shown in Fig. 3. Besides the main function modules of the S-box, we use D flip-flops (DFFs) as buffers to balance the paths, as required by the gate-level pipelining of RSFQ. We ran the pre-simulation with the SIMIT-Nb03 Verilog cell library. The gate-level Verilog simulation result is shown in Fig. 4, where TI is the input clock signal, TO is the output clock signal, and aesout0 to aesout7 are the outputs of the circuit. Every square wave of picosecond width stands for an RSFQ logic signal. There are a total of 32 pipeline stages; hence, the first result comes out after the 33rd clock arrives. The simulation results are consistent with the S-box table, e.g., the 8th output is {30} with input {08}, confirming the correctness of our design.
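Path balancing with DFF buffers can be sketched as follows. The model assumes that every gate is clocked and occupies exactly one pipeline stage, so an edge that skips stages needs one DFF per skipped stage. The tiny netlist and the function name `balance_paths` are hypothetical and serve only to show the bookkeeping; they do not reproduce the actual S-box netlist or the JSIC tool flow.

```python
def balance_paths(fanins, primary_inputs):
    """Assign a pipeline stage to each clocked gate and count DFFs per edge.

    fanins maps each gate name to the list of its input nodes.
    Primary inputs sit at stage 0; a gate sits one stage after its latest input.
    """
    stage = {name: 0 for name in primary_inputs}

    def stage_of(node):
        if node not in stage:
            stage[node] = 1 + max(stage_of(src) for src in fanins[node])
        return stage[node]

    dffs = {}
    for dst, sources in fanins.items():
        for src in sources:
            # One DFF for every stage the signal would otherwise skip
            dffs[(src, dst)] = stage_of(dst) - stage_of(src) - 1
    return stage, dffs


# Hypothetical netlist: g1 = XOR(a, b), g2 = AND(g1, c)
fanins = {"g1": ["a", "b"], "g2": ["g1", "c"]}
stage, dffs = balance_paths(fanins, ["a", "b", "c"])
print(stage["g2"], dffs[("c", "g2")])  # prints 2 1
```

Here input c arrives one stage earlier than g1, so a single DFF on the edge from c to g2 aligns the two operands of the AND gate.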
Fig.3. Pre-simulation structure of S-box in RSFQ logic circuit.
Fig.4. Simulation results of S-box implementation circuit in RSFQ logic.
The layout and post-simulation of the circuit are carried out with the JSIC compiler, an EDA tool for logic synthesis, placement, and routing of single-flux-quantum (RSFQ) logic circuits developed by our EDA group. The circuit contains 42237 JJs and 117 logic cells. The maximum frequency of the S-box implementation circuit after layout (Fig. 5) is 16.28 GHz.
For the same S-box implementation circuit structure, comparative results between CMOS logic and RSFQ logic are listed in Table 1. For comparison, the results of the CMOS circuit are obtained under the TSMC 7 nm CMOS logic 1P15M process by flattening the whole circuit. Two aspects indicate the advantages of the RSFQ-based S-box circuit. Firstly, the working frequency of the RSFQ-based circuit is much higher than that of the CMOS-based circuit: the maximum working frequency of the RSFQ-based S-box is 16.28 GHz, while it is 4 GHz for the CMOS circuit of the same structure. Secondly, owing to the unique gate-level pipelining of RSFQ circuits, the higher the clock frequency, the higher the throughput. The throughput of the whole S-box is 130.24 Gbps at a main frequency of 16.28 GHz, which is much higher than that of the CMOS-based circuit. Although the limitations of our EDA tool make it hard to estimate the power consumption of the whole RSFQ-based S-box, the power consumption of a single RSFQ-based module is still at least two orders of magnitude lower than that of the CMOS-based circuit. It should be noted that the minimum line width of our SIMIT-Nb03 process is 1.4 μm. The successfully tested modules serve as a cornerstone for our follow-up work in building the whole system and offer useful guidance for testing.
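The throughput figure follows directly from the clock frequency if one 8-bit S-box result leaves the pipeline on every clock cycle once the pipeline is full, which is the assumption behind the short calculation below. The latency estimate additionally assumes evenly spaced clock pulses and uses the 32-stage depth reported for the simulated circuit; the variable names are illustrative.

```python
CLOCK_GHZ = 16.28        # maximum frequency of the S-box after layout
BITS_PER_RESULT = 8      # the S-box outputs one byte (aesout0 to aesout7)
PIPELINE_STAGES = 32     # pipeline depth of the simulated circuit

# One byte per clock cycle once the pipeline is full
throughput_gbps = CLOCK_GHZ * BITS_PER_RESULT

# The first result appears after the (stages + 1)-th clock pulse
first_result_clock = PIPELINE_STAGES + 1
latency_ns = first_result_clock / CLOCK_GHZ

print(f"throughput = {throughput_gbps:.2f} Gbps")
print(f"first result after clock {first_result_clock}, about {latency_ns:.3f} ns")
```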
Fig.5. Layout of S-box implementation circuit.
Table 1. Performance of the same S-box implementation structure in different logic circuits.
In the previous section, we designed the whole circuit and obtained correct simulation results. The next step is tape-out verification of the design. Considering the limited integration level currently achievable for RSFQ circuits, we select several modules to tape out. The selected modules are lm0, nlm4, and nlm5, covering both linear mapping and non-linear mapping modules with JJ counts ranging from hundreds to thousands. We believe that these three modules are sufficient for demonstration. We adopt concurrent-flow clocking[20] to design fully pipelined synchronous RSFQ circuits in each module.
The simulation results are obtained in pscan[21] after global optimization. The widest margin among the modules is up to 51.06%, while the narrowest one still reaches 26% for a module with more than two thousand JJs. A wider margin indicates less impact from fabrication-induced deviations, i.e., better circuit performance.
The nlm4, nlm5, and lm0 modules are taped out in different batches using our self-developed SIMIT Nb03 6-kA/cm² process.[22,23] The circuits are tested using the OCTOPUX[24] test system for superconducting integrated circuits. Figure 6 shows the microphotographs of the low-frequency and high-frequency test circuits of nlm4, and the low-frequency test circuits of lm0 and nlm5. Table 2 presents the integration scale, power consumption, area, and designed frequency of each taped-out module. The largest module is lm0 with 2271 JJs, and the smallest module is nlm4 with 128 JJs. All three modules are designed and optimized for a main frequency of 17.5 GHz.
Fig.6. Microphotographs of test modules of S-box circuits: (a) nlm4 in low frequency, (b) nlm5 in low frequency, (c) lm0 in low frequency and nlm4 in high frequency.
Table 2. The related parameters of each module.
As shown in Figs. 6(a) and 6(b), each low-frequency test circuit consists of D2Qs (digital-to-SFQ pulse converters), a CUT (circuit under test), and Q2Ds (SFQ pulse-to-digital converters). The D2Q and Q2D work as interfacing units between digital level signals and SFQ pulse signals.
The test results are shown in Fig. 7. Each square pulse in a blue line stands for an SFQ input pulse, while each transition edge of a red line stands for an SFQ output pulse. As shown in Fig. 7(a), the first 12 blue lines stand for the input signals A1-A3, B1-B3, C1-C3, and D1-D3. The bottom blue line stands for the clock signal. The output signals XO1 and XO2 (X = A, B, C, or D) are expected to be (X1*X2) and (X2*X3), respectively. The output values of each module match our expectations, and the results indicate the correctness of the design and fabrication. The widest margins of nlm4, nlm5, and lm0 are 68%, 32%, and 2%, respectively.
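The expected values for such a test can be tabulated with a few lines of Python. The sketch assumes that the operator * in (X1*X2) and (X2*X3) denotes single-bit multiplication, i.e., a logical AND, as is the case for the non-linear mapping modules; the function name `expected_outputs` is illustrative.

```python
from itertools import product


def expected_outputs(x1, x2, x3):
    """Return (XO1, XO2) = (X1*X2, X2*X3), reading * as a single-bit AND."""
    return x1 & x2, x2 & x3


# Truth table for one input group X (X = A, B, C, or D)
for x1, x2, x3 in product((0, 1), repeat=3):
    xo1, xo2 = expected_outputs(x1, x2, x3)
    print(f"X1={x1} X2={x2} X3={x3} -> XO1={xo1} XO2={xo2}")
```

Comparing each measured output pulse with this table is one simple way to judge whether a recorded waveform matches the intended function.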
Fig.7. Low-frequency test results of each module: (a) nlm5, (b) nlm4, (c) lm0.
From the test results, we find that the test margin of lm0 is significantly lower than that in simulation. The first reason is the batch-to-batch instability of our fabrication process, which is still under development. According to the process control monitor (PCM) data from the process detection chips of this batch, the critical current density JC, which is the key process parameter of the Josephson junction, differs from the designed value under two test methods: one gives a value 8.47% higher and the other 10.56% lower. The shift in JC results in a shift in the critical current IC of the Josephson junctions in the circuit, which changes the timing of the cells and makes the timing on the critical path more stringent, especially for large-scale circuits. By analyzing the test data and structure of lm0, we find that a critical path made of two complex IP blocks exists in the data path of two output ports, and the test results of these two ports show extremely small margins, which limit the margin of the whole circuit. The tested margins can be improved by optimizing the critical-path timing in the design. In addition to the process parameter shift, the deviation also comes from the gap between the simulation model and the actual device. The ideal RCSJ model commonly used in SFQ circuit simulation does not yet take into account the effects of parasitic parameters in actual devices.
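The link between the two quantities is the relation IC = JC × A for a junction of area A, so a relative shift in JC carries over unchanged to IC for a fixed layout. The sketch below evaluates this for the two measured deviations, using the nominal 6 kA/cm² of the process and a junction area of 2 μm² that is purely hypothetical and chosen only to give round numbers.

```python
NOMINAL_JC_KA_PER_CM2 = 6.0   # nominal critical current density of the process
JUNCTION_AREA_UM2 = 2.0       # hypothetical junction area, for illustration only


def critical_current_ua(jc_ka_per_cm2, area_um2):
    """Critical current in microamperes from Ic = Jc * A.

    Unit conversion: 1 kA/cm^2 = 1e3 A / 1e8 um^2 = 10 uA/um^2.
    """
    return jc_ka_per_cm2 * 10.0 * area_um2


nominal_ic = critical_current_ua(NOMINAL_JC_KA_PER_CM2, JUNCTION_AREA_UM2)
for label, shift in (("method 1", +0.0847), ("method 2", -0.1056)):
    shifted_ic = critical_current_ua(NOMINAL_JC_KA_PER_CM2 * (1 + shift),
                                     JUNCTION_AREA_UM2)
    print(f"{label}: Ic = {shifted_ic:.3f} uA "
          f"({100 * (shifted_ic / nominal_ic - 1):+.2f}% vs nominal {nominal_ic:.1f} uA)")
```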
Since our current superconducting test system can supply and receive only low-frequency signals, an on-chip high-frequency test structure is adopted. In the lower part of Fig. 6, a clock generator (CG) provides the high-frequency clock input, and a shift register (SR) stores the test patterns. The test patterns are read into the SR with a low-frequency clock, and the results are also read out from the SR with that low-frequency clock. However, when triggered by a flag signal, the test patterns pass through the CUT at the high-frequency clock.
The test results of nlm4 are shown in Fig. 8(a), and they match the expected ones. The margins of nlm4 at different frequencies are shown in Fig. 8(b). The widest margin of this circuit is 56%, which is very close to that in simulation. The maximum tested clock frequency is 28.84 GHz.
Fig.8. High-frequency test results of nlm4 (a) and margins of nlm4 at different frequencies (b).
We have demonstrated one of the most important operations of the AES-128 algorithm, i.e., the SubBytes() operation, using RSFQ circuits based on the SIMIT-Nb03 cell library, and have successfully fabricated and tested some of the modules of the S-box. The whole S-box consists of 32 pipeline stages with 42237 JJs, of which the tested nlm4, nlm5, and lm0 modules consist of 128 JJs, 354 JJs, and 2271 JJs, respectively. The maximum working frequency of the S-box is 16.28 GHz, and the maximum tested working frequency of a single S-box module is 28.84 GHz. The comparative results between the CMOS-based and RSFQ-based circuits indicate that the RSFQ-based S-box modules have advantages in frequency and throughput, while remaining competitive in power consumption despite our less advanced fabrication process.
Owing to the limited integration level currently achievable for RSFQ circuits and the immaturity of the EDA tools, we did not fabricate the whole S-box implementation circuit; however, several representative modules were tested with correct functions and margins, which indicates a degree of reliability of our design. Compared with mature commercial CMOS processes, our process is still under development, so its stability is insufficient, which also limits the test margins of large-scale circuits. We plan to find effective methods to improve the optimization efficiency of our EDA tools and to further reduce the scale of the designed circuit. With the steady development of the fabrication process and integration level, the S-box, as well as the complete AES algorithm with higher performance, can be implemented in the near future.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 92164101), the National Natural Science Foundation of China (Grant No. 62171437), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA18000000), Shanghai Science and Technology Committee (Grant No. 21DZ1101000), and the National Key R&D Program of China (Grant No. 2021YFB0300400). The authors would like to thank Ya-Jun Ha, Bin-Han Liu, and Ling Xin for the valuable discussion and help. The fabrication was performed in the Superconducting Electronics Facility (SELF) of SIMIT.