Accelerating fine-grained parallel incomplete factorization on MIMD many-core architecture

Yongzhen Shi
Qinglin Wang
Weihao Guo
Muchun Peng
Jie Liu
Lian Wang
Zhiyan Liu
Bingwei Wang
Feiming Liu
Xiangdong Pei

Abstract

Preconditioning accelerates the convergence of iterative methods and thereby improves solution efficiency. Incomplete factorization, a core preconditioning technique, mainly takes the form of incomplete LU (ILU) and incomplete Cholesky (IC) factorization. Research on these methods for MIMD-based heterogeneous architectures remains limited. This paper proposes parallel fixed-point iteration algorithms for ILU and IC factorization of sparse matrices, tailored to MIMD architectures. The proposed algorithms combine several memory-access optimizations with adaptive load balancing to improve performance. Experimental results show that, compared with OpenMP-based ILU and IC implementations on dual-socket Intel Xeon 4314 processors, the proposed algorithms achieve average speedups of 23.6x and 42.6x, respectively. Compared with NVIDIA A16 GPU implementations, the average speedups are 2.3x for ILU and 2.8x for IC. The study shows that an MIMD many-core processor with a low-energy architecture can still achieve good performance given proper algorithmic design.
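The abstract does not spell out the fixed-point iteration itself. As a rough illustration only, the sketch below assumes the commonly used Chow-Patel style formulation of fine-grained ILU(0): every nonzero position (i, j) of the matrix gets its own update equation, and one Jacobi-style sweep recomputes all entries from the previous sweep's values, so the updates within a sweep are independent and could be distributed across cores. The function name `fixed_point_ilu0`, the dense NumPy storage, and the initial guess are choices made for this example, not details taken from the paper, and none of the paper's memory-access or load-balancing optimizations are modeled.

```python
import numpy as np

def fixed_point_ilu0(A, sweeps=5):
    """Jacobi-style fixed-point sweeps for ILU(0) (illustrative sketch).

    Assumes A is square with a nonzero diagonal. Dense arrays are used
    only for readability; a real implementation would use sparse storage.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Sparsity pattern S: the positions where A is nonzero.
    pattern = [(i, j) for i in range(n) for j in range(n) if A[i, j] != 0.0]

    # Initial guess (an assumption): L = strict lower part of A with a
    # unit diagonal, U = upper part of A.
    L = np.tril(A, -1)
    np.fill_diagonal(L, 1.0)
    U = np.triu(A)

    for _ in range(sweeps):
        # Every update in a sweep reads only the previous sweep's values,
        # so the loop body below has no dependencies between iterations.
        L_old, U_old = L.copy(), U.copy()
        for i, j in pattern:
            k = min(i, j)
            s = L_old[i, :k] @ U_old[:k, j]  # sum over k < min(i, j)
            if i > j:
                L[i, j] = (A[i, j] - s) / U_old[j, j]
            else:
                U[i, j] = A[i, j] - s
    return L, U

# Usage: a small tridiagonal matrix, for which ILU(0) produces no
# dropped fill, so enough sweeps recover the exact LU factors.
A = np.array([[ 4.0, -1.0,  0.0,  0.0],
              [-1.0,  4.0, -1.0,  0.0],
              [ 0.0, -1.0,  4.0, -1.0],
              [ 0.0,  0.0, -1.0,  4.0]])
L, U = fixed_point_ilu0(A, sweeps=20)
residual = np.abs(L @ U - A).max()
```

In practice only a few sweeps are typically run, since an approximate factorization is often sufficient for a preconditioner; the 20 sweeps here are just to make the small example converge fully.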

Version published to 10.21203/rs.3.rs-6566846/v1 on Research Square
Oct 9, 2025

