Fine-Tuned LLM Workflows for Feature Extraction from App Reviews
Abstract
App reviews contain actionable signals for improvement, but colloquial language, abbreviations, and fuzzy discourse boundaries make automatic extraction difficult. We build and compare single- and multi-node (sequential and parallel) workflows using a fine-tuned LLM extractor. The multi-node designs combine non-fine-tuned and fine-tuned models, parameter variation, and domain-specific instructions at inference. On a public benchmark, the single-node LLM outperforms representative baselines, improving Weighted F1 by up to +69.7% and substantially increasing recall. In multi-node comparisons, using fine-tuned models for all nodes performs best in both sequential and parallel settings, while adding domain-specific instructions at inference harms consistency and accuracy. Overall, (i) preserving the training-time output formats and instructions at inference and (ii) avoiding model mixing and excessive prompt changes are key to maximizing the performance of LLM-based feature extraction. These findings provide practical guidance for deployment-oriented extraction workflows with LLMs.
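To make the workflow structure concrete, below is a minimal sketch of a sequential two-node pipeline of the kind the abstract describes. It is not the authors' implementation: the function names, prompts, and model identifier (`generate`, `EXTRACT_PROMPT`, `REFINE_PROMPT`, `ft-extractor`) are illustrative placeholders, and `generate` must be wired to an actual LLM endpoint.

```python
# Hypothetical sketch of a sequential two-node feature-extraction workflow.
# Per the abstract's finding, the same fine-tuned model is used at every node,
# and the inference prompts mirror the training-time instruction/output format.

from typing import List

EXTRACT_PROMPT = (
    "Extract app features mentioned in the review below. "
    "Return one feature per line.\n\nReview: {review}"
)
REFINE_PROMPT = (
    "Deduplicate and normalize the candidate features below. "
    "Return one feature per line.\n\nCandidates:\n{candidates}"
)

def generate(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a call to a fine-tuned LLM (local or hosted API)."""
    raise NotImplementedError("wire this to your LLM endpoint")

def extract_features(review: str, model: str = "ft-extractor") -> List[str]:
    # Node 1: the fine-tuned extractor proposes candidate features.
    candidates = generate(model, EXTRACT_PROMPT.format(review=review))
    # Node 2: a second node refines the candidates. Reusing the fine-tuned
    # model here (rather than mixing in a non-fine-tuned one) reflects the
    # best-performing configuration reported in the abstract.
    refined = generate(model, REFINE_PROMPT.format(candidates=candidates))
    return [line.strip() for line in refined.splitlines() if line.strip()]
```

A parallel variant would instead fan the review out to several such nodes and merge their outputs; the abstract's conclusion applies either way: keep the model and prompt format consistent across nodes rather than adding new domain-specific instructions at inference.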