RAG Based Implementation for Smart Web Scraping using Craw4ai

Kunal Singh Chauhan
Sumit Srivast

Abstract

With the advent of LLMs, the way people browse the internet has changed. Instead of manually selecting information from online sources, users ask a large language model such as ChatGPT to search through the data, select the most relevant parts, and generate a response to the prompt. However, a major challenge with LLMs is that they rely on pre-trained data and may produce hallucinated results when they lack accurate context or up-to-date information. Retrieval-augmented generation (RAG) is an approach that expands the data and context on which an LLM can act: it takes data the model was not trained on, processes it, and attaches it to the user prompt as external context. This approach is used here to build a web scraping tool that leverages an LLM to answer queries using information from the web. The tool uses web scraping to supply contextually accurate, real-time information, thereby reducing hallucinations in the answers. The modular RAG architecture scrapes the web using crawl4ai to extract raw HTML data. This data is then indexed, split into chunks, and converted into vector embeddings using Sentence Transformer models.
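The chunk, embed, retrieve, and prompt steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, chunk sizes, sample text, and prompt wording are all hypothetical, the scraping step is replaced by a hard-coded string standing in for text that crawl4ai would return, and a toy bag-of-words vector stands in for a Sentence Transformer embedding so the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative).

    Assumes overlap < chunk_size.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a Sentence Transformer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Attach retrieved chunks to the user query as external context."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical placeholder for text extracted from a scraped page.
page_text = (
    "The museum opens at nine in the morning on weekdays. "
    "Tickets cost twelve euros for adults. "
    "The cafe on the ground floor serves lunch until three."
)
query = "How much do tickets cost?"
chunks = chunk_text(page_text, chunk_size=10, overlap=2)
top = retrieve(query, chunks, top_k=1)
prompt = build_prompt(query, top)
```

In a fuller pipeline the `embed` function would be swapped for a real embedding model and the chunk vectors would be stored in a vector index rather than recomputed per query; the resulting `prompt` is what would be sent to the LLM.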

Version published to 10.21203/rs.3.rs-6846502/v1 on Research Square
Jul 22, 2025

Related articles

Understanding the Impact of Dataset Characteristics on RAG based Multi-hop QA Performance

This article has 3 authors:
1. Nimet Aksoy
2. Zekeriya Anıl Güven
3. Murat Osman Ünalır
This article has no evaluations
Latest version Jul 2, 2025

Issue Detection and Future Proofing Dutch Government Apps Using Language Technologies

This article has 3 authors:
1. Anca-Mihaela Matei
2. Flor Miriam Plaza-del-Arco
3. Natalia Amat-Lefort
This article has no evaluations
Latest version Aug 21, 2025

NL4DV-Stylist: Styling Data Visualizations Using Natural Language and Example Charts

This article has 2 authors:
1. Tenghao Ji
2. Arpit Ajay Narechania
This article has no evaluations
Latest version Aug 1, 2025
