RAG-Based Implementation for Smart Web Scraping using Crawl4ai
Abstract
With the advent of large language models (LLMs), the way people browse the internet has changed. Instead of manually selecting information from online sources, users simply ask an LLM such as ChatGPT to search through the data, select the most relevant pieces, and generate a response to their prompt. However, a major challenge with LLMs is that they rely on pre-trained data and may produce hallucinated results when they lack accurate context and up-to-date information. Retrieval-augmented generation (RAG) is an approach that expands the data and context an LLM can act on: it retrieves data the model was not trained on, processes it, and attaches it to the user prompt as external context. This approach is used here to create a web scraping tool that leverages the capabilities of an LLM to answer queries using information from the web. The tool uses web scraping to provide contextually accurate, real-time information from the web, thus reducing hallucinations in the answer. The modular RAG architecture scrapes the web using crawl4ai to extract raw HTML data. This data is then indexed, split into chunks, and converted into vector embeddings using Sentence Transformer models.
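The chunk, embed, and retrieve steps described above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`), the word-based chunking with overlap, and the chunk sizes are all assumptions made for illustration. In particular, `embed` is a toy bag-of-words stand-in for a Sentence Transformer model so that the example runs with the standard library alone, and the input is assumed to be text already extracted from the scraped HTML.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for a Sentence Transformer: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Attach the retrieved chunks to the user prompt as external context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

In a real pipeline, `embed` would presumably be replaced by a call to a Sentence Transformer model and the ranked search by a vector index. The overlap between chunks is a common design choice that keeps sentences spanning a chunk boundary retrievable from either side.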