Accelerating fine-grained parallel incomplete factorization on MIMD many-core architecture
Abstract
Preconditioning techniques can accelerate the convergence of iterative methods and improve the efficiency of the solution. Incomplete factorization, a core component of preconditioning techniques, primarily includes incomplete LU (ILU) and incomplete Cholesky (IC) factorizations. Research on these methods for MIMD-based heterogeneous architectures remains limited. This paper proposes parallel fixed-point iteration algorithms for sparse matrix ILU and IC factorizations tailored to MIMD architectures. The proposed algorithms combine several memory access optimizations with adaptive load balancing to achieve higher computational performance. Experimental results demonstrate that, compared to OpenMP-based ILU and IC implementations on dual-socket Intel Xeon 4314 processors, the proposed algorithms achieve average speedups of 23.6x and 42.6x, respectively. Compared to NVIDIA A16 GPU implementations, average speedups reach 2.3x for ILU and 2.8x for IC factorizations. Our study shows that an MIMD many-core processor with a low-energy architecture can still achieve good performance with proper algorithmic design.
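As an illustration of what a fixed-point iteration for incomplete factorization can look like, the following minimal sketch applies Jacobi-style sweeps to ILU(0). It assumes the commonly used fine-grained formulation, in which every nonzero position (i, j) of A yields one independent update per sweep; the abstract does not state the authors' exact formulation, so this should not be read as their algorithm. The function name `ilu0_fixed_point`, the dense list-of-lists storage, and the `sweeps` parameter are hypothetical choices made for readability, and none of the paper's memory access or load balancing optimizations are modeled.

```python
def ilu0_fixed_point(A, sweeps=5):
    """Jacobi-style fixed-point sweeps for ILU(0) (illustrative sketch only).

    A is a square matrix stored as a list of lists with a nonzero diagonal.
    Returns (L, U) with L unit lower triangular and U upper triangular,
    both restricted to the sparsity pattern of A.
    """
    n = len(A)
    # Sparsity pattern: positions where A is nonzero (no fill-in is allowed).
    pattern = [(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0]

    # Initial guess: U is the upper triangle of A, L is the strict lower
    # triangle of A scaled by the diagonal, plus a unit diagonal.
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i, j in pattern:
        if i > j:
            L[i][j] = A[i][j] / A[j][j]
        else:
            U[i][j] = A[i][j]
    for i in range(n):
        L[i][i] = 1.0

    for _ in range(sweeps):
        # Every update reads only the previous iterate, so all entries in a
        # sweep are independent and could be computed in parallel.
        L_old = [row[:] for row in L]
        U_old = [row[:] for row in U]
        for i, j in pattern:
            s = sum(L_old[i][k] * U_old[k][j] for k in range(min(i, j)))
            if i > j:
                L[i][j] = (A[i][j] - s) / U_old[j][j]
            else:
                U[i][j] = A[i][j] - s
    return L, U


if __name__ == "__main__":
    # Tridiagonal test matrix: exact LU creates no fill-in here, so the
    # ILU(0) fixed point coincides with the exact LU factors.
    A = [[4.0, -1.0, 0.0, 0.0],
         [-1.0, 4.0, -1.0, 0.0],
         [0.0, -1.0, 4.0, -1.0],
         [0.0, 0.0, -1.0, 4.0]]
    L, U = ilu0_fixed_point(A, sweeps=10)
    for row in U:
        print(row)
```

Because each sweep reads only the previous iterate, the per-entry updates carry no dependencies within a sweep, which is the property that makes this style of algorithm attractive for many-core hardware. The sketch trades the sequential dependencies of classical ILU for a few repeated sweeps.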