Fine-Tuned LLM Workflows for Feature Extraction from App Reviews

Abstract

App reviews contain actionable signals for improvement, but colloquial language, abbreviations, and fuzzy discourse boundaries make automatic extraction difficult. We build and compare single- and multi-node (sequential and parallel) workflows using a fine-tuned LLM extractor. The multi-node designs combine non‑fine‑tuned and fine‑tuned models, parameter variation, and domain‑specific instructions at inference. On a public benchmark, the single‑node LLM outperforms representative baselines, improving Weighted F1 by up to +69.7% and substantially increasing recall. In multi‑node comparisons, using fine‑tuned models for all nodes performs best in both sequential and parallel settings, while adding domain‑specific instructions at inference harms consistency and accuracy. Overall, (i) preserving training‑time output formats and instructions at inference and (ii) avoiding model mixing and excessive prompt changes are key to maximizing the performance of LLM‑based feature extraction. These findings provide practical guidance for deployment‑oriented extraction workflows with LLMs.
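To make the single- versus multi-node distinction concrete, the sketch below shows one way such workflows could be wired up. It is a minimal, hypothetical illustration, not the authors' implementation: a "node" is a model call with a fixed prompt and parameters, a sequential workflow chains node outputs, and a parallel workflow merges outputs from independent nodes. The names (make_node, ft_extractor, the prompt templates) are assumptions for illustration, and ft_extractor is a stub standing in for a fine-tuned LLM endpoint that, per the abstract's finding, keeps the training-time output format and instructions at inference.

```python
from typing import Callable, List

# Hypothetical node factory: binds a model call to a fixed prompt template.
def make_node(model_fn: Callable[[str], str],
              prompt_template: str) -> Callable[[str], str]:
    def node(text: str) -> str:
        return model_fn(prompt_template.format(review=text))
    return node

def sequential_workflow(review: str,
                        nodes: List[Callable[[str], str]]) -> str:
    # Each node's output feeds the next node (sequential design).
    out = review
    for node in nodes:
        out = node(out)
    return out

def parallel_workflow(review: str,
                      nodes: List[Callable[[str], str]],
                      merge: Callable[[List[str]], str]) -> str:
    # All nodes see the same review; a merge step combines the outputs.
    return merge([node(review) for node in nodes])

# Stub for a fine-tuned extractor endpoint (assumption: in a real system
# this would call the fine-tuned model with unchanged output formatting).
def ft_extractor(prompt: str) -> str:
    return "battery life; dark mode"  # placeholder extraction output

if __name__ == "__main__":
    extract = make_node(ft_extractor, "Extract app features from: {review}")
    refine = make_node(ft_extractor, "Deduplicate and normalize: {review}")
    review = "Love the dark mode but battery life tanked after the update."

    print(sequential_workflow(review, [extract, refine]))

    def merge(outputs: List[str]) -> str:
        # Order-preserving union of features across parallel nodes.
        seen: List[str] = []
        for out in outputs:
            for feat in out.split("; "):
                if feat not in seen:
                    seen.append(feat)
        return "; ".join(seen)

    print(parallel_workflow(review, [extract, refine], merge))
```

Under this framing, the paper's findings translate to keeping every node backed by the same fine-tuned model and resisting the temptation to inject extra domain-specific instructions into the prompt templates at inference time.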
