Random forests in corpus research: A systematic review

Abstract

The increasing popularity of random forests (RFs) in corpus-based work calls for a critical discussion of their utility and use in our field. To provide an empirical basis for this methodological discourse, the current paper conducts a systematic review of RFs in corpus research. The survey targets 6,972 papers published between 2012 and 2024 across 20 linguistic journals, yielding a total of 125 RF models spread across 69 studies. I examine several features of these analyses, including the purpose of the analysis, the data structure, and model specification, evaluation, and interpretation. There is a clear upward trend in the use of RF models, which are routinely applied to data with a clustering structure, where observations are grouped by speaker/text or lexical item. Essential details about the analysis are frequently missing from the reports. These include information on software and model specification, predictive performance, and the type of variable importance scores used. Motivated by these insights, the current study provides a checklist for the reporting of RF models as well as an R tutorial on how to obtain key indices. It closes with concrete suggestions for future methodological work on the use of RFs in corpus research.
