The Good, the Bad, and the Ugly: Segmentation-Based Quality Control of Structural Magnetic Resonance Images
Abstract
The processing and analysis of magnetic resonance images is highly dependent on the quality of the input data, and systematic differences in quality can consequently lead to loss of sensitivity or biased results. However, varying image properties due to different scanners and acquisition protocols, as well as subject-specific image interferences, such as motion artifacts, can be incorporated in the analysis. A reliable assessment of image quality is therefore essential to identify critical outliers that may bias results. Here we present a quality assessment for structural (T1-weighted) images using tissue classification. We introduce multiple useful image quality measures, standardize them into quality scales and combine them into an integrated structural image quality rating to facilitate the interpretation and fast identification of outliers with (motion) artifacts. The reliability and robustness of the measures are evaluated using synthetic and real datasets. Our study results demonstrate that the proposed measures are robust to simulated segmentation problems and variables of interest such as cortical atrophy, age, sex, brain size and severe disease-related changes, and might facilitate the separation of motion artifacts based on within-protocol deviations. The quality control framework presents a simple but powerful tool for the use in research and clinical settings.
Article activity feed
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf146), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 4: Laura Caquelin
Reproducibility report for: The Good, the Bad, and the Ugly: Segmentation-Based Quality Control of Structural Magnetic Resonance Images
Journal: GigaScience
ID number/DOI: GIGA-D-25-00085
Reviewer(s):
Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Worked on reproducing the results and wrote the report]
Tobias Wängberg, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Worked on reproducing the results]
- Summary of the Study
The study addresses how variability in magnetic resonance image quality, especially from motion artifacts or scanner differences, can affect structural image analysis. It proposes a quality assessment framework for T1-weighted images based on tissue classification and standardized image quality measures. The method is shown to be robust across datasets and conditions, helping to detect outliers and control for motion-related artifacts.
- Scope of reproducibility
According to our assessment, the primary objective is: to develop and validate a standardized framework for assessing the quality of structural (T1-weighted) MRI images, enabling the detection of artifacts on simulated data.
Outcome: Quantitative quality ratings derived from image properties such as the noise-to-contrast ratio (NCR), inhomogeneity-to-contrast ratio (ICR), resolution score (RES), edge-to-contrast ratio (ECR), and full-brain Euler characteristic (FEC), combined into a Structural Image Quality Rating (SIQR).
Analysis method outcome: Not specified in the manuscript, but from the Matlab script we identified that the quality scores were correlated using Spearman's rank correlation, with statistical significance assessed via p-values from MATLAB's built-in routines (an illustrative sketch is given at the end of this section).
Main result: Results are presented in Figure 5. "The evaluation on the BWP test dataset showed that most quality ratings have a very high correlation (rho > .950, p < .001) with their corresponding perturbation and a very low correlation (rho < |0.1|) with the other tested perturbations (see table in Figure 5A & C). This suggests considerable specificity of the proposed quality measures. The combined SIQR score also showed a very strong association with the segmentation quality kappa (rho = -.913, p < .001) and brain tissue volumes (rhoCSF/GM/WM = -.472/-.484/.736, pCSF/GM/WM < .001) (Figure 5B). […] The edge-based resolution measure ECR, on the other hand, generally performed better (rho = .828, p < .001), but was more affected by noise (rho = .306, p < .001) and inhomogeneity (rho = .223, p < .001) than other scores."
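For readers who want to check this step, a minimal sketch of how such a Spearman correlation and its p-value could be computed in MATLAB is shown below (synthetic data and illustrative variable names, not taken from the authors' script; requires the Statistics Toolbox):

```matlab
% Illustrative sketch only (synthetic data, hypothetical variable names):
% correlate a simulated perturbation level with a quality rating using
% Spearman's rank correlation from the Statistics Toolbox.
perturbation = repmat((0:9)', 20, 1);                         % e.g. noise levels
rating       = perturbation + 0.5*randn(size(perturbation));  % synthetic ratings
[rho, p] = corr(perturbation, rating, 'Type', 'Spearman');
fprintf('rho = %.3f, p = %.3g\n', rho, p);
```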
- Availability of Materials
a. Data
- Data availability: Open
- Data completeness: Complete, all data necessary to reproduce main results are available
- Access method: Private journal Dropbox, but also available in the GitHub repository
- Repository: https://github.com/ChristianGaser/cat12
- Data quality: Structured
b. Code
- Code availability: Shared in the private journal Dropbox but also open
- Programming Language(s): Matlab
- Repository link: https://github.com/ChristianGaser/cat12
- License: GPL-2.0 License
- Repository status: Public
- Documentation: Readme file
- Computational environment of reproduction analysis
- Operating system for reproduction: MacOS 15.5 (reviewer 1) and MacOS 15.1 (reviewer 2)
- Programming Language(s): Matlab
- Code implementation approach: Using shared code
- Version environment for reproduction: Matlab R2024b Update 6 (24.2.2923080) - Trial version
- Results
5.1 Original study results
- Results 1: Figure 5 C (see screenshot)
5.2 Steps for reproduction
-> Finding how to reproduce the results
- Issue 1: The methods section lacks sufficient detail regarding the statistical methodology, and the relevant information is not fully provided in the GitHub repository. -- Resolved: A message has been sent to the authors requesting further clarification on the methodology and additional resources (scripts/data) needed to reproduce the results. The script to reproduce the results is "cat_tst_qa_bwpmaintest.m".
-> Reproduce the results using the "cat_tst_qa_bwpmaintest.m" script.
- Issue 2: To run the script "cat_tst_qa_bwpmaintest.m", the "eva_vol_calcKappa" function is missing. -- Resolved: The script was shared and added to the Github repository.
- Issue 3: While running the script, the following error message was encountered:

Assigning to 0 elements using a simple assignment statement is not supported. Consider using comma-separated list assignment.
Error in cat_tst_qa_bwpmaintest (line 481)
default.QS{find(cellfun('isempty',strfind(default.QS(:,2),'FEC'))==0),4} = [100, 850];

-- Resolved: This error stops the execution of the script. After discussion with the authors, the exact cause of the error encountered at line 480 was not directly identified. We exchanged and compared our environments at the point just before the error occurred and observed notable differences between them; our environment was almost empty. The authors identified that the default variable was missing from our environment, even though it should have been created at line 437 by a call to the cat_stat_marks function. We confirmed that all required dependencies were installed (including the Statistics Toolbox, SPM and CAT12) and that we had access to all the necessary data. To ensure the issue was not due to user error, the code was independently executed by two reviewers; the error was consistently reproduced in both cases. About the setup, I specified to the authors: "To summarize my setup:
- I have installed SPM, CAT, and the Statistics Toolbox.
- I downloaded all datasets from the GigaScience server.
- I also downloaded the IXI T1 data, but I've only kept the version available on the GigaScience server in my working directory. Is the version from GigaScience sufficient? I had presumed that this dataset was pre-processed and ready to use, so I ignored the time-consuming pre-processing step. Your last email seems to confirm this point."
The authors answered that: « Yes, this is correct. However, both directories have to be combined so that the original IXI images and the processing files are included. »
In an attempt to proceed, we modified the portion of the code that triggered the error:
```matlab
% FEC
FECpos = find(cellfun('isempty',strfind(default.QS(:,2),'FEC'))==0);
try
  warning off;
  [Q.fit.FEC, Q.fit.FECstat] = robustfit(Q.FECgt(M,1),Q.FECo(M,1));
  warning on;
  if ~isempty(FECpos)
    default.QS{FECpos,4} = round([Q.fit.FEC(1) + Q.fit.FEC(2), Q.fit.FEC(1) + Q.fit.FEC(2) * 6], -1);
  end
catch
  Q.fit.FEC = [nan nan];
  Q.fit.FECstat = struct('coeffcorr',nan(2,2),'p',nan(2,2));
  if ~isempty(FECpos)
    default.QS{FECpos,4} = [100 850];
  end
end
```
Following this adjustment, the end of the script "cat_tst_qa_bwpmaintest.m" ran without issue and generated output results.
Finally, the error was identified after numerous exchanges with the authors. The function "cat_stat_marks", available in the Github repository, was not shared on the FTP server. With this function added, the script runs correctly. Please note that the link to the Github repository where the software code can be found is not specified in the manuscript.
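For readers less familiar with this MATLAB error, one mechanism consistent with the message and with the isempty guard added above can be illustrated with a minimal sketch (hypothetical data, not the authors' code): if the lookup of the 'FEC' row returns an empty index because default.QS was never populated, the cell assignment fails with exactly this error.

```matlab
% Minimal illustration of the error mechanism (hypothetical QS content):
QS  = {'A', 'NCR'; 'B', 'ICR'};                                % no 'FEC' row present
pos = find(cellfun('isempty', strfind(QS(:,2), 'FEC')) == 0);  % pos = [] (empty)
QS{pos, 4} = [100, 850];   % -> "Assigning to 0 elements using a simple
                           %     assignment statement is not supported."
```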
-> Compare the results reproduced and the original results
- Issue 4: Discrepancy between reproduced results, output results provided by the authors, and the original results shown in Figure 5C. -- Unresolved: We reproduced the figures and the corresponding output table using the modified "cat_tst_qa_bwpmaintest.m" script. We ran the script using the only QC version selected by default in the script ("cat_vol_qa201901x"). By comparing our output with the result files shared by the authors, we were able to confirm that we had executed the correct pipeline. However, we encountered a discrepancy: neither the file generated in our run (tst_cat_col_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200rptable.csv) nor the corresponding file provided by the authors (outputs from BWPmain_full_202504) matched the numerical values presented in Figure 5C of the manuscript. We contacted the authors to clarify whether the default QC version used in the script was indeed the one used to produce the figure. In response, they confirmed:
"All figures should show the results of this QC version although I had the plan to run a final check update after the reviewer comments (the figures are finally arranged in Adobe Illustrator)."
Therefore, although the correct version of the QC was used, the differences in the results shown in Figure 5C remain unexplained. This issue is still unresolved.
5.3 Statistical comparison: original vs. reproduced results
Results: Screenshot of reproduced tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv table
Comments: Several p-values in the reproduced results appear as exactly 0 (0.00000000e+00), which is unlikely from a statistical point of view. It is possible that these values are just extremely small and were rounded down. However, this could also point to a problem in the script. Further investigation would be needed to determine the cause.
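As a hedged illustration of the first possibility (synthetic data, not the study's), MATLAB's Spearman p-values can underflow to exactly zero once the correlation is very strong and the sample is large, in which case a printed 0 in the CSV would not indicate a bug:

```matlab
% Synthetic example: a near-perfect monotone relationship over many points
% yields a p-value below the smallest representable double, which MATLAB
% returns (and a CSV export prints) as 0.
n = 2000;
x = (1:n)';
y = x + randn(n, 1);                        % strongly monotone, mildly noisy
[rho, p] = corr(x, y, 'Type', 'Spearman');
fprintf('rho = %.4f, p = %.3g\n', rho, p);  % p is typically printed as 0 here
```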
Errors detected: Values in Figure 5C do not correspond to those provided by the authors in the FTP server in the files (tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv). Multiple inconsistencies were observed, suggesting potential errors in the manuscript figure or mismatches between file versions (see file Comparison_original_rptable_vs_fig5C_data.csv for comparison).
(Screenshot of Figure 5C)
(Screenshot of the original output corresponding to Figure 5C)
- Statistical Consistency: The reproduced correlation table (tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv) differs from the original in terms of r-values and p-values. Compared to Figure 5C, the reproduced r-values do not all match those shown in the figure. P-values cannot be directly compared to Figure 5C, as they are represented by a color gradient without a scale or legend, making direct comparison impossible.
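For completeness, the kind of numerical comparison performed here could be scripted roughly as follows (a sketch only; the original-file name is hypothetical, and the two tables are assumed to contain the same rows in the same order):

```matlab
% Sketch: compare the reproduced and original correlation tables column by column.
repro = readtable('tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv');
orig  = readtable('original_rptable.csv');   % hypothetical name for the authors' file
vars  = intersect(repro.Properties.VariableNames, orig.Properties.VariableNames);
for k = 1:numel(vars)
    if isnumeric(repro.(vars{k}))
        d = max(abs(repro.(vars{k}) - orig.(vars{k})));
        fprintf('%-12s max abs diff: %.6g\n', vars{k}, d);
    end
end
```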
- Conclusion
Summary of the computational reproducibility review
The computational reproducibility of the main result we identified for the study is partially achieved. After several technical issues related to missing functions, I was able to execute the script used to reproduce the values of Figure 5C ("cat_tst_qa_bwpmaintest.m") and obtain output results. However, discrepancies were observed when comparing the reproduced results (tst_cat_col_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200rptable.csv) to both:
the output file provided by the authors, and
the original results presented in Figure 5C of the manuscript. Notably, the output file provided by the authors and the results in Figure 5C do not match either, indicating potential errors or file version mismatches. Additionally, many p-values in the reproduced results are equal to 0, which suggests a formatting issue or a problem in the script. Figure 5C also lacks a scale, legend detail, or supplementary data to make it possible to verify the p-values (assuming the color gradient represents the p-values).
Recommendations for authors
We strongly recommend that the authors:
-- Ensure all essential code and functions are included in the shared repositories. Some necessary files were not included on the FTP server provided with the paper. Although the GitHub repository (https://github.com/ChristianGaser/cat12) was shared with the journal, it is not referenced in the manuscript, making it difficult for external users to locate.
-- Add detailed documentation of the statistical methods: the current manuscript lacks sufficient information regarding the statistical methodology used, at least for the purpose of the reproducibility review. Please include a detailed explanation of the statistical tests, packages and parameter settings (e.g. QC version) to improve reproducibility.
-- Clarify the versioning and outputs for the figures: there is a lack of clarity regarding which specific data outputs were used to generate Figure 5C. Providing metadata or links to the exact output file used would help to resolve this issue.
-- Provide the raw numerical data behind figures: Figure 5C seems to display p-values using a color gradient, but no scale or legend is provided. Sharing the raw data used would allow the comparison and the reproducibility of the figure.
-- Improve the clarity of execution instructions and address potential p-value issues: the issue with p-values showing up as exactly 0 in the reproduced results might be caused by differences in the environment setup, such as missing variables, different software versions, or skipped steps before running the script. Improving the instructions for setting up the environment and running the script would help prevent issues and facilitate reproducibility.
Reviewer 3: Cyril Pernet
The paper describes an alternative way to QC T1w images with 2 major innovations: a different set of metrics that does not rely on the background, and a global score that combines those metrics. In addition, all of this is integrated in a well-maintained toolbox allowing easy usage.
I only have suggestions (i.e. it does not all have to be done), as the overall paper is well written, easy to follow, and the analyses well conducted.
- P6 NCR: it would be nice to demonstrate how it performs compared to the traditional CNR (mean of the white matter intensity values minus the mean of the gray matter intensity values, divided by the standard deviation of the values outside the brain) -- it differs markedly because of the background difference for sure, and since you have plenty of test images you could show that more clearly (a rough sketch of the conventional CNR follows this list). (Later in the methods: based on what criteria/reason is 'local' defined as 555?)
- P7 ECR: this should capture something similar to the Entropy Focus Criterion; it would be nice to provide a direct comparison.
- P8: typo, you meant equation 2.
- P8 SIQR: I am guessing you have experimented with the power function -- maybe a side note to share your experience of why or how it works better than e.g. the square.
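As a point of reference for the comparison suggested above, the conventional background-based CNR could be computed roughly as follows (a sketch on synthetic data with hypothetical masks; real masks would come from a segmentation and a head/brain mask, and this is not part of the CAT12 code):

```matlab
% Sketch of the traditional CNR: (mean WM - mean GM) / std(background).
img    = zeros(64,64,64);
wmMask = false(size(img)); wmMask(20:30,20:30,20:30) = true;   % hypothetical WM voxels
gmMask = false(size(img)); gmMask(35:45,35:45,35:45) = true;   % hypothetical GM voxels
bgMask = false(size(img)); bgMask(1:5,:,:)           = true;   % hypothetical air background
img(wmMask) = 1.0  + 0.05*randn(nnz(wmMask),1);                % bright WM
img(gmMask) = 0.7  + 0.05*randn(nnz(gmMask),1);                % darker GM
img(bgMask) = 0.02 * randn(nnz(bgMask),1);                     % near-zero background noise
cnr = (mean(img(wmMask)) - mean(img(gmMask))) / std(img(bgMask));
fprintf('CNR = %.2f\n', cnr);
```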
Dr Cyril Pernet
Reviewer 2: Oscar Esteban
Technical Note GIGA-D-25-00085 introduces a segmentation-based quality control (QC) framework for T1-weighted structural MRI integrated into the CAT12 toolbox. The approach defines five interpretable image quality metrics—noise-to-contrast ratio (NCR), inhomogeneity-to-contrast ratio (ICR), resolution score (RES), edge-to-contrast ratio (ECR), and full-brain Euler characteristic (FEC)—which are combined into a composite Structural Image Quality Rating (SIQR). The tool aims to provide a standardized, interpretable scoring system for identifying poor-quality scans, with validation across simulated datasets and real-world imaging data.
Strengths
The manuscript addresses a critical need in neuroimaging by presenting an automated, interpretable, and practical framework for quality control of T1-weighted structural MRI. By integrating multiple segmentation-derived metrics into a single Structural Image Quality Rating (SIQR), the approach enables fast, standardized assessment of image quality. The tool is embedded in the widely used CAT12/SPM ecosystem, facilitating adoption, and it is validated across a range of synthetic and real-world datasets. The scoring system is designed with user accessibility in mind, offering a clear grading scale and robust detection of motion-related artifacts, making it particularly well-suited for use in large-scale research and clinical imaging settings.
Weaknesses
- Ambiguity of scope and segmentation dependency. A fundamental issue with the manuscript is its failure to clearly define the proposed QC framework's intended scope. If it is intended as a general-purpose image quality assessment tool, then several limitations become critical: its reliance on accurate tissue segmentation, its omission of background signal, its restricted validation within the CAT12 pipeline, and its lack of demonstrated interoperability with other workflows or populations. The method's reliability across different segmentation tools (e.g., FreeSurfer, FSL, SynthSeg) or in anatomically atypical populations (e.g., pediatric, lesioned brains) is untested. Conversely, if the framework is intended as a CAT12-specific internal QC tool, then the presentation is misleading. The inclusion of cross-tool benchmarks (e.g., MRIQC), the use of generalized grading schemes, and the claims of robustness give the impression of broader applicability. In this narrower interpretation, some concerns (e.g., pipeline generalization) would be less pressing, but others—such as the MRIQC comparison—become more problematic and unjustified. The manuscript would benefit greatly from explicitly stating whether the goal is a broadly applicable QC solution or a targeted add-on for CAT12 workflows.
- Lack of compliance with GigaScience reproducibility standards. The manuscript does not currently meet GigaScience's data and code availability requirements. The code used to generate results and figures is not publicly accessible—only available upon request—which directly conflicts with the journal's expectations for open, reproducible research. Similarly, while the data are drawn from public sources, the manuscript lacks direct links, accession numbers, or DOIs for the datasets used, and provides no clarity on data preprocessing or analysis scripts. There is also no reference to licensing for the CAT12 toolbox or the code used in the study, and no reproducibility capsule (e.g., containerized environment, workflow script) is offered. These omissions limit the transparency and reusability of the work and must be addressed to comply with the FAIR principles and GigaScience's editorial policies.
- Mischaracterization of background-based IQMs. In the "SIQR measure development" section, the manuscript states: "Image quality measures are commonly estimated from the image background (Mortamed et al., 2008; Esteban et al., 2017)." This statement is factually incorrect and conceptually misleading. First, the citation is incorrect—Mortamed should be Mortamet (2009). Second, it misrepresents tools like MRIQC, where most quality metrics are computed within brain tissue, including CJV, SNR, and contrast-based measures. Third, the authors entirely omit recent work (e.g., Pizarro et al., 2016; Provins et al., 2025) showing that artifacts such as ghosting, wrap-around, and motion often manifest more clearly in the background, due to the nature of Fourier reconstruction. By excluding background regions, the proposed method may miss artifacts that are visible but lie outside the segmented brain, and the trade-offs of this design decision are not discussed. The rationale based on defacing is only partial: defacing typically removes the face, not the broader background, where artifact signals often dominate. The statement as written oversimplifies QC practices and signals a bias toward justifying the framework's internal constraints rather than engaging with the full methodological landscape. References: Provins, C., … Esteban, O. (2025). Removing facial features from structural MRI images biases visual quality assessment PLOS Biology. doi:10.1371/journal.pbio.3003149 (OA). Pizarro RA, et al. (2016). Automated quality assessment of structural magnetic resonance brain images based on a supervised machine learning algorithm. Front Neuroinf. 10. doi:10.3389/fninf.2016.00052.
- Underdeveloped and opaque benchmarking against MRIQC. The benchmarking against MRIQC is reported only in the Results section, with no corresponding description in the Methods. It is surprising that MRIQC is not mentioned by name until page 14, despite the Esteban et al. (2017) reference appearing earlier in a different context. This suggests that the treatment of MRIQC—a widely adopted, general-purpose QC tool—has not been as thorough or fair as would be desirable. Key methodological details are missing: the authors do not explain how MRIQC was executed, how specific features (e.g., snr_wm, cjv) were selected, or whether a multivariate classifier was considered. Given that MRIQC's full model leverages multiple features simultaneously, limiting the comparison to univariate metrics weakens the validity of the claim that SIQR outperforms existing approaches. A more balanced, transparent benchmarking setup would strengthen the manuscript considerably. This benchmarking also mentions an "SPM12-based" QC performance but does not clarify how and why this comparison is made.
- No analysis of failure cases. The manuscript does not present examples of false positives or false negatives—cases where SIQR fails to align with visual inspection or known ground truth. Without understanding when and why the metric fails, users cannot judge the risk of misclassification or apply it conservatively in sensitive datasets.
Minor Issues
- Figure 7 could benefit from clearer annotation of thresholds and misclassified cases to help interpret the ROC curves.
- While the title "The Good, the Bad, and the Ugly" is a play on the classic western film, this informal or humorous reference may be perceived as inappropriate in a scientific context—especially for a methods paper intended to support standardization and reproducibility. The title does not convey the technical scope or scientific contribution of the work, which may undermine its visibility and perceived rigor. A more descriptive and neutral title—e.g., "Segmentation-Based Quality Control of Structural MRI using the CAT12 Toolbox"—would better reflect the content and purpose of the manuscript.
- While the authors validate their approach against synthetic degradations and segmentation-derived kappa scores, they do not sufficiently leverage human expert QC ratings. Greater engagement with visual QC standards would make the case for SIQR's practical value more compelling.
I was given access to the supporting data but chose not to proceed with reproducibility checks at this stage, as the manuscript does not currently meet GigaScience's basic standards for code and data transparency. I look forward to reviewing a revised version that clearly defines the scope of the method, improves methodological transparency, and brings the manuscript into compliance with the journal's reproducibility and FAIR data principles.
Best regards,
Oscar Esteban, Ph. D. Research and Teaching FNS Fellow Dept. of Radiology, CHUV, University of Lausanne
Reviewer 1: Chris Foulon
The article presents a valuable effort towards standardising quality control methods and their evaluation. However, too many choices seem arbitrary without sufficient justification, and too many sections are unclear. Overall, the quality of the work cannot be fully assessed in the current state of the manuscript, and major revisions are needed to correct that. There is also not enough comparison with other methods (only one), and no way of evaluating whether these measures are relevant to actual downstream imaging uses. Additionally, the article's goal is highly unclear and led me to think the segmentation measures were part of the QC pipeline until I read the discussion ... Nothing until the discussion explains that the segmentation measures are used to evaluate the single SIQR score output of the QC pipeline.
Comments: "All measures and tools are part of the Computational Anatomy Toolbox (CAT; https://neuro-jena.github.io//cat, Gaser et al., 2024) of the Statistical Parametric Mapping (SPM; http://www.fil.ion.ucl.ac.uk/spm, Ashburner et al. 2002) software and also available as a standalone version (https://neuro-jena.github.io/enigma-cat12/#standalone)." I cannot really expect everyone to avoid Matlab tools. Still, Matlab is a drag to the development of scalable tools nowadays (every system admin's nightmare is to have to try to make Matlab tools run on high-performance computing servers).
"such as noise, inhomogeneities, and resolution (Figure 1B)." At this point in the article, it's a bit unclear how that works in Figure 1B.
"It is assessed within optimized cerebrospinal fluid (CSF) and white matter (WM) regions." Then, the NCR relies on the segmentation, right? What if the segmentation fails?
Oh, most of the measures actually rely on the segmentation. Are segmentation errors accounted for in the tool? I am thinking specifically about "abnormal" brains that can be difficult for segmentation algorithms. At least at this point of the article, it's not clear.
"To accommodate various international rating systems, we have adopted a linear percentage and a corresponding (alpha-)numeric scaling." this doesn't match the complexity of the following explanation about the rather arbitrary range. I think a much more international and understandable rating would have been a 0 to 1 range. A 0.5 to 10.5 range is not helping users at all. As the rating is linear, I am struggling to see the added value of this choice.
"Although the BWP does not include the simulation of motion artifacts, these are in general comparable to an increase of noise in the BWP dataset by 2 percentage points." Maybe that should be justified with a reference? "in general" might be a bit light to justify not having a direct measure for something presented as important (motion artefacts) in the introduction and goal of the tool. I think the absence of a noise estimation in the QC ratings should be more thoroughly justified.
"To balance the sensitivity to different quality measures while ensuring that the necessary quality conditions are met, we apply an exponentially weighted averaging approach — similar to the root mean square (RMS) but using the fourth power and fourth root." Why is there no justification or references for these arbitrary choices? Why not the fifth root or tenth root? Why the square root and not an exponential or any other function?
"Sample Normalization for Outlier Detection" It is unclear whether this is systematically applied or not. Is it a separate measure, or is it aggregated into another score? That measure could be relevant in many cases but could also be really bad in some specific cases (for example, historical data where the "ideal" quality would probably be well below standards.
"raw (co-registered)" Well, it is not raw if it's co-registered. I suggest reformulation to avoid confusion with actual raw images.
The "Evaluation Concept and Data" section is very unclear. The need for a training-testing scheme is not explained, and the scheme itself is very arbitrary (choosing odd and even numbered files ordered by filenames). How does that splitting strategy help with generalisation? Why that specific split? Why not another? How do we know that split is not biased? Finally, the selection of 6 scans also seems completely arbitrary. Overall, this section does not provide enough information to justify the seemingly arbitrary choices.
"Of note, obvious subject/scan-specific motion artifacts generally increase the scans' rating for about 1 grade, which corresponds to a decrease of 10 rps (and +0.5 grade / -5 rps for light artifacts), in comparison to the typical rating achieved by the majority of scans of the same protocol." This is incredibly vague! How are readers supposed to evaluate the quality control measures with this information?
Discussion: "as this is more relevant for segmentation and surface reconstruction (Ashburner et al., 2005)." A lot of work has been done in these domains in 20 years; this reference, however solid, is not enough to justify that choice. This might not be relevant with the methods developed in the last 20 years.
"with a power of 4 rather than 2, to place greater emphasis on the more problematic aspects of image quality." Still not enough to justify that choice. The authors failed to convince me that one single score is better than reporting all the measures significantly, as different quality measures will influence different tasks. A very practical example is the fact that the vast majority of acquisitions in clinical settings, the resolution is anisotropic (though less with T1 images nowadays, historical datasets will still have it). This anisotropy is not necessarily an issue for human diagnosis, for example; however, aggregating all the scores in one might hide that a low-quality measurement might not affect the specific downstream task. Coupled with the lack of justification for the factor scalings, this choice of a single score is a significant negative point for the tool.
Data availability: Where can the sources of these specific tools be accessed?