IGC: Intelligence-Gated Crawling for Distributed Web Content Acquisition and RAG-Ready Vectorization
Abstract
Large-scale Retrieval-Augmented Generation (RAG) systems require high-quality web corpora, yet conventional crawlers optimise for coverage and throughput rather than semantic content quality, leaving corpus cleaning as an expensive post-processing step. We introduce IGC (Intelligence-Gated Crawler), a distributed web acquisition framework that integrates multidimensional content quality evaluation directly into the crawl pipeline, preventing low-value content from entering the embedding layer at the source. IGC combines distributed crawling via BullMQ and Redis, structured HTML extraction through the Mozilla Readability algorithm, a six-dimensional content quality scoring model (length, density, readability, structure, uniqueness, freshness), semantic sentence-window chunking, and a pluggable embedding pipeline supporting OpenAI, Cohere, and locally hosted models. Embeddings are persisted in PostgreSQL with the pgvector extension, enabling approximate nearest-neighbour cosine similarity retrieval. A pilot evaluation on commodity hardware (Intel Core i3, 8 GB RAM) demonstrates stable crawl throughput of approximately 5.4 pages per second with a mean latency of 1,241 ms, while the intelligence-gating model filters roughly 14% of crawled pages and reduces downstream embedding noise by approximately 25%. IGC provides a scalable data-ingestion layer for retrieval-augmented generation systems, semantic search engines, and large language model dataset construction pipelines.
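The core gating idea can be illustrated with a minimal sketch. The six dimension names below come from the abstract, but the normalisation, weights, and threshold are purely illustrative assumptions (the paper's actual scoring model may differ); the point is only that a scalar score is computed per page and sub-threshold pages are dropped before chunking and embedding.

```typescript
// Hypothetical sketch of an intelligence-gated quality filter.
// Dimension names are from the IGC abstract; all weights and the
// gate threshold are assumed values for illustration only.

interface QualityScores {
  length: number;      // normalised content-length score, 0..1
  density: number;     // text-to-markup density, 0..1
  readability: number; // scaled readability score, 0..1
  structure: number;   // heading/paragraph structure score, 0..1
  uniqueness: number;  // 1 - near-duplicate similarity, 0..1
  freshness: number;   // recency score, 0..1
}

// Illustrative equal weighting; a real deployment would tune these.
const WEIGHTS: QualityScores = {
  length: 1 / 6, density: 1 / 6, readability: 1 / 6,
  structure: 1 / 6, uniqueness: 1 / 6, freshness: 1 / 6,
};

// Weighted sum over the six dimensions.
function qualityScore(s: QualityScores): number {
  return (Object.keys(WEIGHTS) as (keyof QualityScores)[])
    .reduce((acc, k) => acc + WEIGHTS[k] * s[k], 0);
}

// Assumed cut-off: pages scoring below it never reach the
// chunking or embedding stages of the pipeline.
const GATE_THRESHOLD = 0.5;

function passesGate(s: QualityScores): boolean {
  return qualityScore(s) >= GATE_THRESHOLD;
}
```

In a pipeline built on BullMQ, this check would run in the crawl worker immediately after Readability extraction, so rejected pages consume no embedding-API or vector-storage resources.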