Vector Semantics at Scale: An AI Pipeline for Financial Text Similarity
Abstract
This paper presents an end-to-end AI system that transforms unstructured corporate filings into vector representations and computes interpretable similarity signals using TF-IDF weighting and cosine similarity within a distributed pipeline. The approach combines robust preprocessing (token normalization, stemming or lemmatization) with scalable retrieval, parsing, and clustering, enabling comparison of accounting policy narratives across firms and over time. An extensive empirical evaluation quantifies how these similarity features relate to firm-level attributes and investor behavior, illustrating how classic NLP can extract actionable structure from financial disclosures at web scale. The design and experiments offer a reproducible blueprint for AI-driven text analytics in regulated domains, coupling information retrieval methods with cloud compute for high-throughput document understanding.
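To make the core similarity computation concrete, the following is a minimal single-machine sketch of TF-IDF weighting followed by cosine similarity. It is an illustration under stated assumptions, not the paper's pipeline: the function names (`tokenize`, `tfidf_vectors`, `cosine`), the smoothed IDF formula, the regex tokenizer, and the three sample policy sentences are all hypothetical, and stemming or lemmatization and the distributed execution are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed normalization: lowercase, keep alphabetic tokens only.
    # Stemming/lemmatization is deliberately left out of this sketch.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed smoothed IDF variant: log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical accounting policy snippets, invented for illustration.
policies = [
    "Revenue is recognized when control of goods transfers to the customer.",
    "Revenue is recognized when control of services transfers to the customer.",
    "Inventories are stated at the lower of cost and net realizable value.",
]
vecs = tfidf_vectors(policies)
sim_close = cosine(vecs[0], vecs[1])  # two revenue recognition policies
sim_far = cosine(vecs[0], vecs[2])    # revenue policy vs. inventory policy
```

In this toy example the two revenue recognition sentences score higher than the revenue and inventory pair, because the pairs differ in how many rarely shared terms they have in common. Sparse dictionaries are used here only to keep the sketch dependency-free; a production system would more plausibly rely on a vectorizer from an established library and sparse matrix operations.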