First Fully Pipelined High Throughput FPGA Implementation and GPU Optimization of Wider Variant of AES

Ahmet MALAL
Cihangir TEZCAN

Abstract

In response to the recent NIST call for a wider variant of the AES algorithm, we developed a fully pipelined, high-throughput FPGA implementation of AES with a 256-bit block size, referred to as WAES-256. The design targets both 7th generation and UltraScale+ FPGAs and focuses on maximizing throughput while using hardware resources efficiently. Our work supports AES-128, AES-256, and WAES-256, and employs composite field arithmetic in the S-box to reduce critical path delay. All AES layers are fully pipelined, enabling multiple levels of parallelism with minimal architectural changes. Our AES-128 implementations achieved the best throughput-per-slice (TPS) ratios reported in the literature when compared fairly on the same FPGA platforms. For WAES-256, our designs reached 75.73 Gbps on Spartan-7, 72.32 Gbps on Artix-7, 199.46 Gbps on Zynq UltraScale+, and 206.11 Gbps on Kintex UltraScale+. Additionally, our multi-core parallel WAES-256 designs achieved 426.66 Gbps with two cores and 742.63 Gbps with four cores on the Kintex UltraScale+ platform, demonstrating the scalability of our approach. The architectures deliver this throughput without relying on BRAM, making them well-suited for next-generation cryptographic applications. Moreover, we optimized WAES-256 on GPUs and achieved performance comparable to the best AES-256 results. For instance, we achieved 3053.5 Gbps for WAES-256 encryption in counter mode on an RTX 4090. Our results show that using FPGAs or GPUs as co-processors for WAES-256 renders encryption effectively free, and that the transition from AES-256 to WAES-256 causes no observable slowdown.
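The throughput figures above can be put in perspective with a little arithmetic. The following Python sketch is not taken from the article: it assumes that a fully pipelined core accepts one 256-bit block per clock cycle, so that throughput equals block size times clock frequency, and derives the clock rate each reported figure would imply under that assumption. It also compares the multi-core results with an ideal linear scaling of the single-core Kintex UltraScale+ figure. The function names `implied_clock_mhz` and `scaling_efficiency` are hypothetical helpers introduced here for illustration.

```python
# WAES-256 throughput figures (Gbps) as quoted in the abstract.
BLOCK_BITS = 256  # WAES-256 block size in bits

single_core_gbps = {
    "Spartan-7": 75.73,
    "Artix-7": 72.32,
    "Zynq UltraScale+": 199.46,
    "Kintex UltraScale+": 206.11,
}
multi_core_gbps = {2: 426.66, 4: 742.63}  # Kintex UltraScale+ only


def implied_clock_mhz(gbps, block_bits=BLOCK_BITS, cores=1):
    """Clock in MHz implied by a throughput figure.

    ASSUMPTION (not stated in the abstract): each core completes one
    block per clock cycle once the pipeline is full.
    """
    return gbps * 1e9 / (block_bits * cores) / 1e6


def scaling_efficiency(multi_gbps, single_gbps, cores):
    """Ratio of measured multi-core throughput to ideal linear scaling."""
    return multi_gbps / (single_gbps * cores)


if __name__ == "__main__":
    for device, gbps in single_core_gbps.items():
        print(f"{device}: {implied_clock_mhz(gbps):.1f} MHz implied")

    base = single_core_gbps["Kintex UltraScale+"]
    for cores, gbps in multi_core_gbps.items():
        eff = scaling_efficiency(gbps, base, cores)
        print(f"x{cores} cores: {eff:.1%} of linear scaling")
```

Under this one-block-per-cycle assumption, the single-core Kintex UltraScale+ figure would correspond to a clock of roughly 805 MHz. A scaling ratio slightly above 100% for the two-core design would simply indicate that it closed timing at a higher clock than the single-core build; the abstract does not report clock frequencies, so these values are estimates only.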

Version published to 10.21203/rs.3.rs-6941414/v1 on Research Square
Jul 16, 2025

Related articles

Benchmarking Design Trade-Offs in FPGA Implementations of SIMON 64/128 Cipher

This article has 1 author:
1. W.A. Susantha Wijesinghe
This article has no evaluations. Latest version: Jul 15, 2025

An Open Chisel-Based Framework for Hardware Acceleration on High-Performance FPGA Cards

This article has 2 authors:
1. Robin Gay
2. Tarek Ould-Bachir
This article has no evaluations. Latest version: Aug 13, 2025

DLPack: A DSP-Based Low-Bitwidth Packing Architecture for Efficient 2-Bit CNN Inference on FPGA-based Edge Devices

This article has 3 authors:
1. Maryam Mohabbati
2. Hakem Beitollahi
3. Somayeh Kashi
This article has no evaluations. Latest version: Aug 29, 2025
