Reproducibility of AVM grading in clinical practice: A study of interobserver variability


Abstract

Purpose: To quantify interobserver agreement in Spetzler–Martin (SM) and Spetzler–Ponce (SP) grading of pediatric brain arteriovenous malformations (AVMs), locate sources of variability, and test a composite Disagreement Index (DI).

Methods: Forty-five consecutive pediatric AVMs were independently graded by three neurosurgical residents without prior calibration. SM components (eloquence, venous drainage, nidus size) and SP class were assigned; nidus morphology (compact vs diffuse) was scored by two raters. Agreement was estimated with Fleiss’/Cohen’s κ and ICC(2,1); dispersion with across-rater standard deviation and Shannon entropy. Borderline cases were prespecified (SM 2–3; SP class transitions). DI combined entropy, SM-score dispersion, and component mismatch.

Results: SM scores showed moderate numerical agreement (ICC 0.72) but minimal categorical concordance (Fleiss’ κ 0.04). SP improved overall agreement (Fleiss’ κ 0.49), yet 17/45 (37.8%) cases crossed SP boundaries. Component reliability differed: eloquence κ 0.29, nidus size κ 0.41, venous drainage κ 0.58. Nidus morphology showed low reproducibility between two raters (≈49% agreement; Cohen’s κ −0.15). Ten cases spanned the SM 2–3 threshold. DI ranged 0.15–1.00 (median 0.46) and isolated a small subset of highly discordant cases; eloquence was the primary driver in 8/10.

Conclusions: Interobserver variability concentrates at decision thresholds and is driven chiefly by how eloquence is interpreted. Standardized definitions, reporting measured nidus dimensions with SM bins, and routine lesion-to-eloquence distance may stabilize grading. DI can flag “teaching” cases and support calibration over time.
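To make the agreement and dispersion measures named in the Methods more concrete, the following is a minimal Python sketch of Fleiss' κ across raters, plus per-case Shannon entropy and across-rater standard deviation. It is not the authors' code: the function names, the base-2 logarithm, the use of the population standard deviation, and the toy ratings are all assumptions made for illustration. The abstract does not state how the Disagreement Index weights its three ingredients, so DI itself is not reproduced here.

```python
from collections import Counter
from math import log2
from statistics import pstdev


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list of category labels.

    Every case must be graded by the same number of raters (>= 2).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    # counts[i][j] = number of raters who put case i in category j
    counts = [[Counter(case)[c] for c in categories] for case in ratings]
    # Observed agreement per case, then averaged over cases
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Chance agreement from the pooled category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # only one category ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


def shannon_entropy(case):
    """Entropy (in bits, an assumed unit) of one case's grades across raters."""
    n = len(case)
    return sum(-(k / n) * log2(k / n) for k in Counter(case).values())


def score_dispersion(case):
    """Across-rater standard deviation of numeric grades (population SD assumed)."""
    return pstdev(case)


# Hypothetical SM grades from three raters for four cases (not study data)
toy_sm = [[2, 2, 3], [3, 3, 3], [1, 2, 3], [4, 4, 4]]
kappa = fleiss_kappa(toy_sm)
per_case = [(shannon_entropy(c), score_dispersion(c)) for c in toy_sm]
```

Under these assumptions, a case on which all raters agree has zero entropy and zero dispersion, while a case graded 1, 2, and 3 by three raters reaches the maximum entropy for three raters; this is the kind of per-case signal that a composite index could combine with component mismatch.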
