Survey and Benchmarking of Large Language Models for RTL Code Generation: Techniques and Open Challenges
Abstract
Large language models (LLMs) are emerging as powerful tools for hardware design, with recent work exploring their ability to generate register-transfer level (RTL) code directly from natural-language specifications. This paper provides a survey and evaluation of LLM-based RTL generation. We review twenty-six published efforts, covering techniques such as fine-tuning, reinforcement learning, retrieval-augmented prompting, and multi-agent orchestration, and we analyze their contributions across eight methodological dimensions, including debugging support, post-RTL metrics, and benchmark development. Building on this review, we experimentally evaluate frontier commercial models (GPT-4.1, GPT-4.1-mini, and Claude Sonnet 4) on the VerilogEval and RTLLM benchmarks under both single-shot and lightweight agentic settings. Results show that these models achieve up to 89.74% on VerilogEval and 96.08% on RTLLM, matching or exceeding prior domain-specific pipelines despite using no specialized fine-tuning. A detailed failure analysis reveals systematic error modes, including FSM mis-sequencing, handshake drift, blocking versus non-blocking assignment misuse, and state-space oversimplification. Finally, we outline a forward-looking research roadmap toward natural-language-to-SoC design, emphasizing controlled specification schemas, open benchmarks and flows, PPA-in-the-loop feedback, and modular assurance frameworks. Overall, this work provides both a critical synthesis of recent advances and a baseline evaluation of frontier LLMs, highlighting opportunities and challenges in moving toward AI-native electronic design automation.
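For readers less familiar with the RTL-specific failure modes named above, the following minimal Verilog sketch (illustrative only; the module names and the scenario are our own, not drawn from the benchmarks or from the surveyed models' outputs) shows the blocking versus non-blocking assignment misuse pattern in a two-stage pipeline:

    // Buggy pattern: blocking assignments in a clocked block.
    // q1 is updated before it is read, so q2 follows d after only
    // one clock cycle instead of the intended two.
    module pipe_buggy (
        input  wire clk,
        input  wire d,
        output reg  q2
    );
        reg q1;
        always @(posedge clk) begin
            q1 = d;    // blocking: q1 takes the new value immediately
            q2 = q1;   // ...so q2 sees the new d in the same cycle
        end
    endmodule

    // Corrected pattern: non-blocking assignments update both registers
    // from the values sampled at the clock edge, preserving the
    // two-cycle delay.
    module pipe_fixed (
        input  wire clk,
        input  wire d,
        output reg  q2
    );
        reg q1;
        always @(posedge clk) begin
            q1 <= d;
            q2 <= q1;
        end
    endmodule

Both modules compile, but only the second matches the intended pipeline behavior; this is the kind of functional-but-subtle divergence that single-shot generation often misses and that the failure analysis in this paper examines.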