Quality-Aware Evaluation for Journal Recommendation
Abstract
Objective: Journal recommendation systems analyze a manuscript’s content and suggest potential target journals. Most such systems are evaluated with simple Top-K accuracy metrics: a recommendation is counted as correct only if the exact target journal is predicted, and every other recommendation is counted as an error. This approach ignores journal quality: recommending a different journal of similar quality is very different from recommending one far below the appropriate quality level. We developed a Quality-Aware evaluation framework designed to assess whether journal recommendation tools suggest journals of an appropriate quality level, rather than simply whether they predict the exact target journal.

Methods: We developed specialty-specific journal recommendation models for medical fields using transformer-based architectures trained on nearly one million PubMed articles (2020–2024; 2020–2025 for Cardiology). Using SCImago quartiles as a proxy for journal quality, we evaluated the models with both standard metrics and novel Quality-Aware metrics (Quality Accuracy, Undersell Rate, Severe Undersell Rate). We also examined the relationship between these metric families to characterize when recommendation lists can be considered reliable.

Results: Across five specialties, mean accuracy@1 was 47.9% (range: 39.9%–55.0%), with a Mean Reciprocal Rank (MRR) of 0.60 and a Normalized Discounted Cumulative Gain (NDCG) of 0.68. However, Quality Accuracy@1 averaged 67.7%, exceeding raw accuracy by approximately 20 percentage points. This gap indicates that more than a third of all prediction “errors” under standard metrics involved journals of equivalent quality. Furthermore, Quality Consistency@3 averaged 54.0%, showing that, on average, a majority of the top-3 recommended journals aligned with the target journal’s quality tier.
Severe Undersell Rates (recommendations two or more quartiles below the ground truth) averaged 7.9%, with the best-performing models achieving rates as low as 5.2%.

Conclusion: Standard evaluation metrics for journal recommendation are insufficient because they treat all errors as equivalent. In this study, Quality Accuracy@1 substantially exceeded accuracy@1, indicating that a large proportion of apparent “errors” involved journals of equivalent quality. Quality-Aware metrics, combined with ranking-quality measures, provide a more complete assessment of whether a system produces reliable recommendations. We propose that journal recommendation systems report Quality-Aware metrics alongside traditional accuracy to better characterize real-world utility.

MSC Codes: 68T10 (Pattern recognition); 62P10 (Applications of statistics to biology and medical sciences)
JEL Codes: O32 (Management of Technological Innovation and R&D)
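The Quality-Aware metrics described above can be sketched directly from SCImago quartiles. The snippet below is a minimal illustration, not the authors' implementation: it assumes quartiles are coded with the SCImago convention (1 = Q1, the top tier, through 4 = Q4), that each article contributes one top-1 recommendation, and the function name `quality_metrics_at_1` is hypothetical.

```python
def quality_metrics_at_1(pred_quartiles, true_quartiles):
    """Sketch of Quality Accuracy@1, Undersell Rate, and Severe Undersell Rate.

    pred_quartiles: quartile (1-4) of each top-1 recommended journal
    true_quartiles: quartile (1-4) of the corresponding ground-truth journal
    """
    n = len(true_quartiles)
    # Quality Accuracy: recommended journal sits in the same quartile as the target,
    # regardless of whether it is the exact target journal.
    quality_acc = sum(p == t for p, t in zip(pred_quartiles, true_quartiles)) / n
    # Undersell: recommended journal is in a lower tier (numerically higher quartile).
    undersell = sum(p > t for p, t in zip(pred_quartiles, true_quartiles)) / n
    # Severe Undersell: two or more quartiles below the ground-truth journal.
    severe = sum(p - t >= 2 for p, t in zip(pred_quartiles, true_quartiles)) / n
    return quality_acc, undersell, severe


# Example: four articles with predicted vs. true quartiles.
qa, ur, sur = quality_metrics_at_1([1, 2, 4, 1], [1, 1, 2, 3])
print(qa, ur, sur)  # 0.25 0.5 0.25
```

Under this coding, an exact-journal match always counts toward Quality Accuracy (same journal, same quartile), which is why Quality Accuracy@1 is an upper bound on accuracy@1.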