GPU-NTT and Karatsuba Co-Optimization forHigh-Throughput Polynomial MultiplicationAcceleration

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Polynomial multiplication serves as a fundamental computational primitivein modern cryptography—including fully homomorphic encryption and zero-knowledge proofs —as well as in digital signal processing. Its performanceoptimization has become increasingly critical amid the rapid development ofprivacy-preserving computation and blockchain technologies. To address the lim-itations of traditional algorithms in meeting the demands for high throughputand low latency, this study proposes a high-performance polynomial multiplica-tion accelerator based on the collaborative optimization of GPU-NTT and theKaratsuba algorithm. The method deeply integrates the asymptotically optimalcomplexity of NTT with the constant-factor efficiency of Karatsuba at moderatescales, and fully exploits the parallel computing power of GPUs to construct amodular, multi-stage pipelined acceleration framework. The divide-and-conquernature of the Karatsuba algorithm is leveraged for coarse-grained parallelism,splitting large polynomial multiplications into subproblems handled by GPUthread blocks in parallel, while each subproblem is solved with fine-grained paral-lelism using GPU-accelerated NTT kernels. An innovative zero-padding strategyis introduced to enhance the generality of the NTT kernels, and shared memorycaching is employed to alleviate GPU memory bandwidth bottlenecks. Experi-mental results on the NVIDIA RTX 4060 GPU demonstrate that the proposedmethod achieves a stable speedup of 1.43× to 1.49× over the baseline GPU-NTT for lower-dimensional polynomials, and outperforms the KNTT algorithmby up to 2.44× for higher dimensions (e.g., log2 n = 14), showing superior scal-ability and robustness. Kernel execution time analysis further confirms that themethod benefits from efficient kernel fusion and balanced workload distribution,which effectively avoids pipeline stalls and ensures high-throughput execution.This research provides a significant performance optimization solution for thepractical deployment of advanced cryptographic technologies such as FHE andZKP.

Article activity feed