Are Food Safety Classifiers Learning Hazards or Memorizing Firms? Entity-Level Leakage in FDA Recall Severity Prediction


Abstract

Machine learning (ML) models for predicting food recall severity could accelerate regulatory triage, yet no systematic benchmark exists on the U.S. Food and Drug Administration (FDA) open-access database. We construct the first comprehensive ML benchmark for FDA food recall severity classification (Class I / II / III) using 28,448 enforcement records spanning 2012–2025. A 1,437-dimensional feature space is engineered from TF-IDF and Sentence-BERT embeddings of recall narratives, structured categorical attributes, and temporal indicators. Five classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM, CatBoost) are trained with Optuna-tuned hyperparameters. Under standard random splitting, XGBoost achieves Macro-F1 = 0.89; however, a multi-layer leakage audit reveals that this figure is inflated by entity-level autocorrelation. When firm-aware group splitting, temporal splitting, or their combination is applied, Macro-F1 drops to approximately 0.57. A firm-mode baseline, which assigns each company's historically most frequent severity class, reaches 0.82 under random splitting, demonstrating that 92% of the apparent performance stems from firm-level memorisation. Identity-masking experiments confirm that the leakage is structural rather than attributable to explicit company-name tokens. A 2 × 2 factorial decomposition shows that firm overlap and temporal continuity are highly collinear; removing either suffices to expose the true generalisation floor. A hazard-type decomposition reveals that pathogen–severity associations transfer across firms, whereas labelling and GMP violations are highly firm-specific, explaining the disproportionate collapse of Class III prediction under group splitting. SHAP analysis, feature ablation, and a nine-year continuous-learning simulation provide additional insights into model behaviour and retraining strategies.
We recommend that food-safety ML studies adopt group-aware or temporal evaluation protocols, report entity-overlap statistics, and include entity-prior baselines to prevent overstated conclusions.
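The evaluation protocols the abstract recommends can be illustrated in a few lines. The sketch below, on synthetic data with hypothetical column names (the paper's actual feature pipeline is not reproduced here), contrasts a random split with a firm-aware group split via scikit-learn's `GroupShuffleSplit`, and implements the entity-prior "firm-mode" baseline: predicting each firm's historically most frequent severity class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.metrics import f1_score

# Toy recall table: firm identifier and severity class (1/2/3).
# Severity is made strictly firm-determined to mimic extreme entity leakage.
rng = np.random.default_rng(0)
firms = rng.integers(0, 50, size=1000)
firm_class = rng.integers(1, 4, size=50)
df = pd.DataFrame({"firm": firms, "severity": firm_class[firms]})

# Random split: the same firm can appear in both train and test.
tr, te = train_test_split(df, test_size=0.3, random_state=0)

# Firm-aware group split: no firm is shared between train and test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
tr_idx, te_idx = next(gss.split(df, groups=df["firm"]))
gtr, gte = df.iloc[tr_idx], df.iloc[te_idx]
assert set(gtr["firm"]).isdisjoint(set(gte["firm"]))

# Firm-mode baseline: predict each firm's most frequent historical class,
# falling back to the global mode for unseen firms.
mode_by_firm = tr.groupby("firm")["severity"].agg(lambda s: s.mode()[0])
fallback = tr["severity"].mode()[0]
pred = te["firm"].map(mode_by_firm).fillna(fallback).astype(int)
print("Random-split firm-mode Macro-F1:",
      round(f1_score(te["severity"], pred, average="macro"), 2))
```

Because severity here is perfectly firm-determined, the firm-mode baseline scores near the ceiling under random splitting while carrying no hazard information at all, which is the failure mode the paper's 0.82 baseline exposes. Reporting this baseline alongside the group-split score makes the memorisation gap explicit.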
