Enhancing pix2pix with Swin Transformer for Cross-Modal Brain CT–MR Synthesis
Abstract
Cross-modal medical image synthesis, such as generating a brain computed tomography (CT) image from a magnetic resonance (MR) image and vice versa, plays an increasingly important role in the management of cerebral diseases. Conventional CNN-based models such as pix2pix have demonstrated utility in this domain but are limited in capturing long-range dependencies and global anatomical context, which often compromises image fidelity. This study introduces an enhanced image-to-image translation framework that replaces the standard U-Net generator in pix2pix with SwinUNETR, a Swin Transformer-based architecture. By leveraging hierarchical self-attention, the model captures both local and global features, enabling the synthesis of anatomically realistic images. The framework was evaluated on CT-to-MR (sMR) and MR-to-CT (sCT) synthesis tasks using 2,091 paired CT and T1-weighted MR scans from public datasets (OASIS-3, SynthRAD2023) and an internal cohort of patients with neurodegenerative disorders. Quantitative metrics, including Multi-Scale Structural Similarity (MS-SSIM) and Peak Signal-to-Noise Ratio (PSNR), were used to benchmark performance against a pix2pix baseline. The proposed method consistently outperformed the baseline, achieving an MS-SSIM of 0.952 and a PSNR of 26.07 dB for sCT synthesis. For sMR synthesis, it achieved an MS-SSIM of 0.948 and a PSNR of 26.07 dB while preserving gray–white matter contrast, an essential feature for the assessment of neurodegenerative diseases. These results highlight the potential of Transformer-based architectures such as SwinUNETR to advance high-fidelity cross-modal synthesis, particularly in neurological applications.
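The abstract does not include implementation details, but the described generator swap can be illustrated with a short sketch. The example below assumes PyTorch with MONAI's SwinUNETR implementation, a standard pix2pix-style PatchGAN discriminator, and the usual adversarial-plus-L1 objective; the crop size, channel counts, spatial dimensionality, and loss weighting are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch (not the authors' code): replacing the pix2pix U-Net generator
# with SwinUNETR while keeping a PatchGAN discriminator and the usual
# adversarial + L1 objective. Assumes MONAI and PyTorch are installed; the
# SwinUNETR constructor arguments may differ across MONAI versions.
import torch
import torch.nn as nn
from monai.networks.nets import SwinUNETR

# Generator: Swin Transformer encoder-decoder producing the target modality
# (e.g. synthetic CT from MR). 3D volumes and a 96^3 crop are assumptions.
generator = SwinUNETR(
    img_size=(96, 96, 96),   # training crop size (assumed; deprecated in newer MONAI)
    in_channels=1,           # source modality (e.g. T1-weighted MR)
    out_channels=1,          # target modality (e.g. CT)
    feature_size=48,
)

# A small 3D PatchGAN-style discriminator, as used in pix2pix (illustrative).
def patchgan_discriminator(in_channels: int = 2) -> nn.Module:
    layers, ch = [], 64
    layers += [nn.Conv3d(in_channels, ch, 4, stride=2, padding=1),
               nn.LeakyReLU(0.2, inplace=True)]
    for _ in range(2):
        layers += [nn.Conv3d(ch, ch * 2, 4, stride=2, padding=1),
                   nn.InstanceNorm3d(ch * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    layers += [nn.Conv3d(ch, 1, 4, stride=1, padding=1)]  # patch-wise real/fake logits
    return nn.Sequential(*layers)

discriminator = patchgan_discriminator()
adv_loss, l1_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()
lambda_l1 = 100.0  # pix2pix's default L1 weighting, assumed here as well

def generator_step(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """One generator update: fool the discriminator while staying close to the target."""
    fake = generator(source)
    pred_fake = discriminator(torch.cat([source, fake], dim=1))
    return adv_loss(pred_fake, torch.ones_like(pred_fake)) \
           + lambda_l1 * l1_loss(fake, target)
```

In this layout the only structural change from pix2pix is the generator class; the conditional discriminator and the combined loss are left as in the original framework, which is consistent with how the abstract describes the contribution.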
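Similarly, the reported MS-SSIM and PSNR comparisons can be reproduced in spirit with off-the-shelf metrics. The sketch below assumes the pytorch_msssim package for MS-SSIM and computes PSNR directly from the mean squared error, with intensities normalized to [0, 1]; it is an illustrative evaluation routine, not the authors' benchmarking code.

```python
# Minimal evaluation sketch (not the authors' script): MS-SSIM and PSNR between
# a synthesized image and its ground truth. Assumes intensities are normalized
# to [0, 1] before comparison.
import torch
from pytorch_msssim import ms_ssim

def psnr(pred: torch.Tensor, target: torch.Tensor, data_range: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB, computed from the MSE."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(data_range ** 2 / mse))

def evaluate_pair(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred/target: (N, C, H, W) or (N, C, D, H, W) tensors scaled to [0, 1]."""
    return {
        "MS-SSIM": float(ms_ssim(pred, target, data_range=1.0)),
        "PSNR_dB": psnr(pred, target),
    }

# Example with dummy data shaped like a batch of 2D slices.
fake_ct = torch.rand(1, 1, 256, 256)
real_ct = torch.rand(1, 1, 256, 256)
print(evaluate_pair(fake_ct, real_ct))
```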