IGC: Intelligence-Gated Crawling for Distributed Web Content Acquisition and RAG-Ready Vectorization


Abstract

Large-scale Retrieval-Augmented Generation (RAG) systems require high-quality web corpora, yet conventional crawlers optimise for coverage and throughput rather than semantic content quality, leaving corpus cleaning as an expensive post-processing step. We introduce IGC (Intelligence-Gated Crawler), a distributed web acquisition framework that integrates multidimensional content quality evaluation directly into the crawl pipeline, preventing low-value content from entering the embedding layer at the source. IGC combines distributed crawling via BullMQ and Redis, structured HTML extraction through the Mozilla Readability algorithm, a six-dimensional content quality scoring model (length, density, readability, structure, uniqueness, freshness), semantic sentence-window chunking, and a pluggable embedding pipeline supporting OpenAI, Cohere, and locally hosted models. Embeddings are persisted in PostgreSQL with the pgvector extension, enabling approximate nearest-neighbour cosine similarity retrieval. A pilot evaluation on commodity hardware (Intel Core i3, 8 GB RAM) demonstrates stable crawl throughput of approximately 5.4 pages per second with a mean latency of 1,241 ms, while the intelligence-gating model filters roughly 14% of crawled pages and reduces downstream embedding noise by approximately 25%. IGC provides a scalable data-ingestion layer for retrieval-augmented generation systems, semantic search engines, and large language model dataset construction pipelines.
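The core idea of the six-dimensional quality gate can be sketched as a weighted scoring function with an acceptance threshold. The dimension names (length, density, readability, structure, uniqueness, freshness) come from the abstract; the specific weights, per-dimension scoring functions, and the 0.5 threshold below are illustrative assumptions, not the paper's actual parameters.

```python
# Hypothetical sketch of IGC's six-dimensional quality gate.
# Dimension names follow the abstract; weights, scoring functions,
# and the threshold are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class PageFeatures:
    text_length: int    # characters of extracted main content
    link_density: float # linked characters / total characters, in [0, 1]
    readability: float  # e.g. a normalised readability score in [0, 1]
    structure: float    # heading/paragraph balance signal in [0, 1]
    uniqueness: float   # 1 - maximum near-duplicate overlap, in [0, 1]
    freshness: float    # recency signal in [0, 1]


# Illustrative weights summing to 1.0 (not from the paper).
WEIGHTS = {
    "length": 0.15, "density": 0.20, "readability": 0.20,
    "structure": 0.15, "uniqueness": 0.20, "freshness": 0.10,
}


def quality_score(f: PageFeatures) -> float:
    """Combine the six per-dimension scores into one value in [0, 1]."""
    scores = {
        "length": min(f.text_length / 2000, 1.0),  # saturate at ~2k chars
        "density": 1.0 - f.link_density,           # link-heavy pages score low
        "readability": f.readability,
        "structure": f.structure,
        "uniqueness": f.uniqueness,
        "freshness": f.freshness,
    }
    return sum(WEIGHTS[k] * v for k, v in scores.items())


def gate(f: PageFeatures, threshold: float = 0.5) -> bool:
    """True if the page should proceed to chunking and embedding."""
    return quality_score(f) >= threshold
```

Gating at this point in the pipeline means rejected pages never reach the chunker or the embedding API, which is where the abstract's reported ~25% reduction in downstream embedding noise would be realised.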