IGC: Intelligence-Gated Crawling for Distributed Web Content Acquisition and RAG-Ready Vectorization
Abstract
Large-scale Retrieval-Augmented Generation (RAG) systems require high-quality web corpora, yet conventional crawlers optimise for coverage and throughput rather than semantic content quality, leaving corpus cleaning as an expensive post-processing step. We introduce IGC (Intelligence-Gated Crawler), a distributed web acquisition framework that integrates multidimensional content quality evaluation directly into the crawl pipeline, preventing low-value content from entering the embedding layer at the source. IGC combines distributed crawling via BullMQ and Redis, structured HTML extraction through the Mozilla Readability algorithm, a six-dimensional content quality scoring model (length, density, readability, structure, uniqueness, freshness), semantic sentence-window chunking, and a pluggable embedding pipeline supporting OpenAI, Cohere, and locally hosted models. Embeddings are persisted in PostgreSQL with the pgvector extension, enabling approximate nearest-neighbour cosine similarity retrieval. A pilot evaluation on commodity hardware (Intel Core i3, 8 GB RAM) demonstrates stable crawl throughput of approximately 5.4 pages per second with a mean latency of 1,241 ms, while the intelligence-gating model filters roughly 14% of crawled pages and reduces downstream embedding noise by approximately 25%. IGC provides a scalable data-ingestion layer for retrieval-augmented generation systems, semantic search engines, and large language model dataset construction pipelines.
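The core gating idea can be illustrated with a minimal sketch. The six dimension names below come from the abstract, but the normalisation, weights, and threshold are purely illustrative assumptions (the paper's actual scoring model may differ); the point is only that a scalar score is computed per page and sub-threshold pages are dropped before chunking and embedding.

```typescript
// Hypothetical sketch of an intelligence-gated quality filter.
// Dimension names are from the IGC abstract; all weights and the
// gate threshold are assumed values for illustration only.

interface QualityScores {
  length: number;      // normalised content-length score, 0..1
  density: number;     // text-to-markup density, 0..1
  readability: number; // scaled readability score, 0..1
  structure: number;   // heading/paragraph structure score, 0..1
  uniqueness: number;  // 1 - near-duplicate similarity, 0..1
  freshness: number;   // recency score, 0..1
}

// Illustrative equal weighting; a real deployment would tune these.
const WEIGHTS: QualityScores = {
  length: 1 / 6, density: 1 / 6, readability: 1 / 6,
  structure: 1 / 6, uniqueness: 1 / 6, freshness: 1 / 6,
};

// Weighted sum over the six dimensions.
function qualityScore(s: QualityScores): number {
  return (Object.keys(WEIGHTS) as (keyof QualityScores)[])
    .reduce((acc, k) => acc + WEIGHTS[k] * s[k], 0);
}

// Assumed cut-off: pages scoring below it never reach the
// chunking or embedding stages of the pipeline.
const GATE_THRESHOLD = 0.5;

function passesGate(s: QualityScores): boolean {
  return qualityScore(s) >= GATE_THRESHOLD;
}
```

In a pipeline built on BullMQ, this check would run in the crawl worker immediately after Readability extraction, so rejected pages consume no embedding-API or vector-storage resources.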