DeepSeek and GPT Fall Behind: Claude Leads in Zero-Shot Consumer Complaints Classification

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across natural language processing (NLP) tasks, but their effectiveness in real-world consumer complaint classification without fine-tuning remains uncertain. Zero-shot classification is particularly challenging in finance, where complaint categories often overlap and telling them apart requires a deep understanding of nuanced language. In this study, we evaluate the zero-shot classification performance of five leading LLMs (DeepSeek-V3, OpenAI’s GPT-4o and GPT-4o mini, and Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku) on consumer complaints submitted to the Consumer Financial Protection Bureau (CFPB). Each model was tasked with categorizing complaints into five predefined financial classes based solely on the complaint text. Performance was measured using accuracy, precision, recall, and F1-score, and heatmaps were used to identify classification patterns. While DeepSeek-V3 and GPT-4o produced competitive results, Claude 3.5 Sonnet consistently outperformed all other models, demonstrating superior classification accuracy and efficiency. These findings highlight the relative strengths and limitations of DeepSeek-V3 and other top-tier models in financial text processing, providing valuable insights into their practical applications.
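As an illustration of the evaluation metrics named in the abstract, the following is a minimal Python sketch of how accuracy, per-class precision, recall, and F1-score can be computed from true and predicted labels. It is not the authors' code. The function name `classification_report`, the macro-averaged F1, and the placeholder class labels are assumptions made for this example, since the abstract does not name the five classes or the averaging scheme. The confusion matrix it returns is the kind of table a heatmap would typically visualize.

```python
from collections import Counter


def classification_report(y_true, y_pred, labels):
    """Compute accuracy, per-class precision/recall/F1, and a confusion matrix.

    Hypothetical helper for illustration; `labels` fixes the class order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = Counter(zip(y_true, y_pred))
    # confusion[i][j] = count of items whose true class is labels[i]
    # and whose predicted class is labels[j]. Predictions outside
    # `labels` are not shown here but still count as errors in accuracy.
    confusion = [[pairs[(t, p)] for p in labels] for t in labels]

    per_class = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column total
        actual = sum(confusion[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true) if y_true else 0.0
    # Macro average: unweighted mean of per-class F1 (an assumption here).
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)

    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "per_class": per_class,
        "confusion": confusion,
    }


# Placeholder labels; the study's five financial classes are not listed here.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
report = classification_report(y_true, y_pred, labels=["A", "B", "C"])
```

Reading the confusion matrix row by row shows which true class is being mistaken for which predicted class, which is how overlapping complaint categories would show up in a heatmap.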