GPU-NTT and Karatsuba Co-Optimization forHigh-Throughput Polynomial MultiplicationAcceleration
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Polynomial multiplication serves as a fundamental computational primitivein modern cryptography—including fully homomorphic encryption and zero-knowledge proofs —as well as in digital signal processing. Its performanceoptimization has become increasingly critical amid the rapid development ofprivacy-preserving computation and blockchain technologies. To address the lim-itations of traditional algorithms in meeting the demands for high throughputand low latency, this study proposes a high-performance polynomial multiplica-tion accelerator based on the collaborative optimization of GPU-NTT and theKaratsuba algorithm. The method deeply integrates the asymptotically optimalcomplexity of NTT with the constant-factor efficiency of Karatsuba at moderatescales, and fully exploits the parallel computing power of GPUs to construct amodular, multi-stage pipelined acceleration framework. The divide-and-conquernature of the Karatsuba algorithm is leveraged for coarse-grained parallelism,splitting large polynomial multiplications into subproblems handled by GPUthread blocks in parallel, while each subproblem is solved with fine-grained paral-lelism using GPU-accelerated NTT kernels. An innovative zero-padding strategyis introduced to enhance the generality of the NTT kernels, and shared memorycaching is employed to alleviate GPU memory bandwidth bottlenecks. Experi-mental results on the NVIDIA RTX 4060 GPU demonstrate that the proposedmethod achieves a stable speedup of 1.43× to 1.49× over the baseline GPU-NTT for lower-dimensional polynomials, and outperforms the KNTT algorithmby up to 2.44× for higher dimensions (e.g., log2 n = 14), showing superior scal-ability and robustness. Kernel execution time analysis further confirms that themethod benefits from efficient kernel fusion and balanced workload distribution,which effectively avoids pipeline stalls and ensures high-throughput execution.This research provides a significant performance optimization solution for thepractical deployment of advanced cryptographic technologies such as FHE andZKP.