Enhancing pix2pix with Swin Transformer for Cross-Modal Brain CT-MR Synthesis

Abstract

Cross-modal medical image synthesis, such as generating a brain computed tomography (CT) image from a magnetic resonance (MR) image and vice versa, plays an increasingly crucial role in the management of cerebral diseases. Conventional CNN-based models, such as pix2pix, have demonstrated utility in this domain but are limited in capturing long-range dependencies and global anatomical context, which often compromises fidelity. This study introduces an enhanced image-to-image translation framework that replaces the standard U-Net generator in pix2pix with SwinUNETR, a transformer-based architecture. By leveraging hierarchical self-attention, the model effectively captures both local and global features, enabling the synthesis of anatomically realistic images. The framework was evaluated on CT-to-MR (sMR) and MR-to-CT (sCT) synthesis tasks using 2,091 paired CT and T1-weighted MR scans from public datasets (OASIS-3, SynthRAD2023) and an internal cohort of patients with neurodegenerative disorders. Quantitative metrics, including Multi-Scale Structural Similarity (MS-SSIM) and Peak Signal-to-Noise Ratio (PSNR), were used to benchmark performance against a pix2pix baseline. The proposed method consistently outperformed the baseline, achieving an MS-SSIM of 0.952 and a PSNR of 26.07 dB in sCT. In sMR, it achieved an MS-SSIM of 0.948 and a PSNR of 26.07 dB, while preserving gray–white matter contrast—an essential feature for the assessment of neurodegenerative diseases. These results highlight the potential of transformer-based architectures such as SwinUNETR to advance high-fidelity cross-modal synthesis, particularly in neurological applications.
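To make the described setup concrete, the sketch below shows one way to pair a SwinUNETR generator with a pix2pix-style conditional GAN objective. It is a minimal illustration, not the authors' implementation: it assumes MONAI's `SwinUNETR` (whose constructor arguments vary across MONAI versions), a conventional PatchGAN discriminator, 2-D single-channel slices of size 256×256, and pix2pix's default L1 weighting; the paper does not report these hyperparameters, and the metric objects from `torchmetrics` are likewise only one possible implementation of MS-SSIM and PSNR.

```python
# Sketch: pix2pix-style translation with a SwinUNETR generator.
# Assumptions (not from the paper): MONAI SwinUNETR, 2-D 256x256 slices,
# 70x70-style PatchGAN discriminator, lambda_l1 = 100.
import torch
import torch.nn as nn
from monai.networks.nets import SwinUNETR
from torchmetrics.image import (
    MultiScaleStructuralSimilarityIndexMeasure, PeakSignalNoiseRatio)

# Generator: SwinUNETR mapping one modality (e.g. MR) to the other (CT).
generator = SwinUNETR(
    img_size=(256, 256),   # spatial size of the input (assumed)
    in_channels=1,
    out_channels=1,
    feature_size=48,       # width of the Swin backbone (assumed)
    spatial_dims=2,        # the paper may instead train on 3-D volumes
)

def conv_block(cin, cout, stride):
    """Conv -> InstanceNorm -> LeakyReLU block for the PatchGAN."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 4, stride, 1),
        nn.InstanceNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

# PatchGAN discriminator on concatenated (source, synthetic/real target)
# pairs, as in the original conditional pix2pix formulation.
discriminator = nn.Sequential(
    nn.Conv2d(2, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
    conv_block(64, 128, 2),
    conv_block(128, 256, 2),
    conv_block(256, 512, 1),
    nn.Conv2d(512, 1, 4, 1, 1),   # patch-wise real/fake logits
)

adv_loss = nn.BCEWithLogitsLoss()
l1_loss = nn.L1Loss()
lambda_l1 = 100.0                 # pix2pix default weighting (assumed)

def generator_step(src, tgt):
    """One generator update: adversarial term + L1 reconstruction term."""
    fake = generator(src)
    pred_fake = discriminator(torch.cat([src, fake], dim=1))
    loss = adv_loss(pred_fake, torch.ones_like(pred_fake)) \
        + lambda_l1 * l1_loss(fake, tgt)
    return loss, fake

# Evaluation metrics named in the abstract (data_range assumes intensities
# normalized to [0, 1]).
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)
psnr = PeakSignalNoiseRatio(data_range=1.0)
```

The key architectural change relative to the pix2pix baseline is confined to the generator: the discriminator, adversarial loss, and L1 reconstruction term can remain as in the original pix2pix recipe, while the hierarchical windowed self-attention of SwinUNETR supplies the long-range context that a convolutional U-Net lacks.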
