GPU-NTT and Karatsuba Co-Optimization for High-Throughput Polynomial Multiplication Acceleration

Ruwei Huang
Xiaolong Tang
Junjie Wang
Xuezheng Qin


Abstract

Polynomial multiplication is a fundamental computational primitive in modern cryptography, including fully homomorphic encryption (FHE) and zero-knowledge proofs (ZKP), as well as in digital signal processing. Its performance optimization has become increasingly critical amid the rapid development of privacy-preserving computation and blockchain technologies. To address the limitations of traditional algorithms in meeting the demands for high throughput and low latency, this study proposes a high-performance polynomial multiplication accelerator based on the collaborative optimization of GPU-NTT and the Karatsuba algorithm. The method integrates the asymptotically optimal complexity of NTT with the constant-factor efficiency of Karatsuba at moderate scales, and exploits the parallel computing power of GPUs to construct a modular, multi-stage pipelined acceleration framework. The divide-and-conquer nature of the Karatsuba algorithm is leveraged for coarse-grained parallelism, splitting large polynomial multiplications into subproblems handled by GPU thread blocks in parallel, while each subproblem is solved with fine-grained parallelism using GPU-accelerated NTT kernels. A zero-padding strategy is introduced to enhance the generality of the NTT kernels, and shared memory caching is employed to alleviate GPU memory bandwidth bottlenecks. Experimental results on the NVIDIA RTX 4060 GPU demonstrate that the proposed method achieves a stable speedup of 1.43× to 1.49× over the baseline GPU-NTT for lower-dimensional polynomials, and outperforms the KNTT algorithm by up to 2.44× for higher dimensions (e.g., log2 n = 14), showing superior scalability and robustness. Kernel execution time analysis further confirms that the method benefits from efficient kernel fusion and balanced workload distribution, which avoids pipeline stalls and ensures high-throughput execution. This research provides a significant performance optimization solution for the practical deployment of advanced cryptographic technologies such as FHE and ZKP.
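
The two-level structure described in the abstract (a Karatsuba split on top, NTT-based multiplication with zero-padding underneath) can be illustrated with a minimal CPU-only sketch. It is not the authors' GPU implementation: the modulus 998244353, the primitive root 3, the single level of Karatsuba splitting, and all function names are assumptions chosen for illustration, and the three sub-products that would map to parallel GPU thread blocks are simply computed one after another here.

```python
# CPU-only sketch; modulus, root, and function names are illustrative
# assumptions, not taken from the preprint.
MOD = 998244353   # NTT-friendly prime: 119 * 2^23 + 1
ROOT = 3          # primitive root modulo MOD


def ntt(a, invert=False):
    """Iterative radix-2 NTT over Z_MOD; len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def ntt_multiply(a, b):
    """Multiply two polynomials mod MOD, zero-padding to a power of two."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fa = ntt(a + [0] * (size - len(a)))
    fb = ntt(b + [0] * (size - len(b)))
    prod = ntt([x * y % MOD for x, y in zip(fa, fb)], invert=True)
    return prod[:out_len]


def karatsuba_ntt_multiply(a, b):
    """One Karatsuba level over NTT; assumes len(a) == len(b) and even."""
    n = len(a)
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    # Three independent sub-products: the coarse-grained parallel units.
    z0 = ntt_multiply(a0, b0)
    z2 = ntt_multiply(a1, b1)
    sa = [(x + y) % MOD for x, y in zip(a0, a1)]
    sb = [(x + y) % MOD for x, y in zip(b0, b1)]
    z1 = ntt_multiply(sa, sb)
    # Recombine: a*b = z0 + (z1 - z0 - z2) * x^h + z2 * x^(2h).
    result = [0] * (2 * n - 1)
    for i in range(len(z0)):
        mid = (z1[i] - z0[i] - z2[i]) % MOD
        result[i] = (result[i] + z0[i]) % MOD
        result[i + h] = (result[i + h] + mid) % MOD
        result[i + 2 * h] = (result[i + 2 * h] + z2[i]) % MOD
    return result


def schoolbook_multiply(a, b):
    """Quadratic reference multiplication, used only to check results."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % MOD
    return out
```

Calling `karatsuba_ntt_multiply([1, 2, 3, 4], [5, 6, 7, 8])` gives the same coefficients as `schoolbook_multiply` on the same inputs. The split replaces one multiplication of length n with three of length n/2, so each NTT runs on a shorter, zero-padded input; whether that trade pays off on a GPU depends on kernel and memory behaviour that this sketch does not model.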

Version published to 10.21203/rs.3.rs-8537970/v1 on Research Square
Jan 19, 2026

Related articles

CSH-256: A Modular Cubing–Based Approach to Strengthening the Critical Path in Hash Functions

This article has 1 author:
1. Ibrahem Aboukila
This article has no evaluations. Latest version: Jan 7, 2026

Semiprime Factorization with the Cell Method: A QUBO-Based and a Memory-Efficient Classical Cell-Decomposition Study

This article has 9 authors:
1. Gianbiagio Curato
2. Davide Tezza
3. Daniele Galati
4. Lorenzo Ferrara
5. Manuel Ponzi
6. Edoardo Chiatello
7. Luca Asproni
8. Davide Caputo
9. Roberto Giorgetti
This article has no evaluations. Latest version: Feb 17, 2026

Error-Resilient Quantum Circuit Design of Hybrid Approximate-Exact 5:2 Compressors for Arithmetic Applications

This article has 6 authors:
1. Sreeprad V S A L Manda
2. Aravindhan Alagarsamy
3. Ernest Ravindran
4. Yu-Chen Hu
5. Gian Carlo Cardarilli
6. Seok-Bum Ko
This article has no evaluations. Latest version: Feb 25, 2026
