Input design for unsupervised cross-national branded food database alignment using large language models

Shinichi Nakagawa
Akira Yamamoto

Abstract

Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches, including food ontologies, domain-specific fine-tuned language models, and manual expert mapping, either require substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A–H), varying combinations of product name, nutrient profile, and semantic category label. Nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity because the labels were themselves LLM-derived. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). A model comparison across Claude Haiku, Sonnet, and Opus showed that NO_MATCH rates were consistent across model sizes (12–14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.
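
The abstract names the two metrics but does not give their formulas, so the following Python sketch is only one plausible reading and not the authors' implementation. It assumes that dominant category share is the fraction of items assigned to the most frequent target category of their source category, and that weighted centroid distance is the Euclidean distance between the nutrient centroids of each matched category pair, averaged with weights equal to the number of source items. The function names, the input layout, and the choice of Euclidean distance are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def dominant_category_share(pairs):
    """Share of items that land in their source category's most common target.

    pairs: iterable of (source_category, matched_target_category) tuples,
    one per item. This definition is an assumption, not taken from the paper.
    """
    by_source = defaultdict(Counter)
    for source, target in pairs:
        by_source[source][target] += 1
    total = sum(sum(counts.values()) for counts in by_source.values())
    dominant = sum(counts.most_common(1)[0][1] for counts in by_source.values())
    return dominant / total


def centroid(vectors):
    """Component-wise mean of equal-length nutrient vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def weighted_centroid_distance(matched):
    """Item-weighted mean Euclidean distance between matched category centroids.

    matched: list of (source_vectors, target_vectors), one entry per matched
    category pair. Weighting by source item count is an assumption.
    """
    weighted_sum = 0.0
    total_weight = 0
    for source_vectors, target_vectors in matched:
        weight = len(source_vectors)
        distance = math.dist(centroid(source_vectors), centroid(target_vectors))
        weighted_sum += weight * distance
        total_weight += weight
    return weighted_sum / total_weight
```

Under this reading, lower weighted centroid distance indicates closer nutritional agreement and higher dominant category share indicates more consistent category-level assignments. In practice the nutrient vectors would likely need to be standardized before distances are compared, which this sketch leaves to the caller.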

Version published to 10.64898/2026.05.23.26353945 on medRxiv
May 25, 2026

Related articles

Establishing a Bidirectional Correspondence Table between the Japanese Standard Tables of Food Composition 2020 (8th Edition) and the USDA FoodData Central Using Large Language Model-Based Matching

This article has 2 authors:
1. Shin-ichi Nakagawa
2. Akira Yamamoto
This article has no evaluations. Latest version: May 13, 2026

Compatibility of National Food Composition Databases with USDA FoodData Central: A Seven-Country LLM-Based Analysis

This article has 2 authors:
1. Shin-ichi Nakagawa
2. Akira Yamamoto
This article has no evaluations. Latest version: Jun 1, 2026

Nutrient Composition of Foods Represented in the U.S. Food and Nutrient Database for Dietary Studies, 2013-2023

This article has 5 authors:
1. Omar Ihab Moussa
2. Moaz Elsayed Abouelmagd
3. Belal Mohamed Hamed
4. Asmaa Zakria Alnajjar
5. Abdelrahman Shata
This article has no evaluations. Latest version: Jun 22, 2026
