FILMS: A Multilingual Word Frequency Corpus based on Film Subtitles with IPA Transcriptions

Abstract

Word frequency plays an important role in diverse areas of psychology research, such as reading, memory, and word processing, and researchers have traditionally relied on existing word-frequency corpora for their investigations. However, not all corpora are created equal: factors such as data and language domain can significantly affect research outcomes. This is particularly problematic for cross-language research, where linguistic material needs to be comparable across languages. Databases derived from written texts are limited in how well they reflect everyday language use, since people speak differently than they write. Recent research highlights the effectiveness of word-frequency data based on movie subtitles, particularly for corpora exceeding 30 million words. Responding to these findings, this paper introduces the multilingual word-frequency corpus FILMS (Word Frequency IPA MultiLingual Subtitles Corpus), which covers 52 languages and is built from open-source subtitles. The paper describes the motivation for the corpus, its construction, its data sources, a statistical overview, and future prospects. The goal of this project is to offer a corpus tailored to research in the field of psychology.
