Vector Semantics at Scale: An AI Pipeline for Financial Text Similarity

Vipul Razdan

Abstract

This paper presents an end-to-end AI system that transforms unstructured corporate filings into vector representations and computes interpretable similarity signals using tf-idf and cosine metrics within a distributed pipeline. The approach unifies robust preprocessing (token normalization, stemming/lemmatization) with scalable retrieval, parsing, and clustering, enabling comparative analysis of accounting policy narratives across firms and time. Extensive empirical evaluation quantifies how these similarity features relate to firm-level attributes and investor behavior, illustrating how classic NLP can yield actionable structure from financial disclosures at web scale. The design and experiments offer a reproducible blueprint for AI-driven text analytics in regulated domains, coupling information retrieval methods with cloud compute for high-throughput document understanding.
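The abstract names tf-idf weighting and cosine similarity as the core of the similarity signal. The sketch below shows one way that step might look on a toy scale, using only the Python standard library. The tokenizer, the smoothed idf formula, the function names, and the three sample sentences are all illustrative assumptions, not the paper's implementation or data, and the distributed pipeline and stemming/lemmatization stages are not reproduced.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only. This is a simplified stand-in
    # (assumption) for the token normalization described in the abstract.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse tf-idf vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical policy sentences, written for this example only.
filings = [
    "Revenue is recognized when control of goods transfers to the customer.",
    "Revenue is recognized when control of the goods transfers to customers.",
    "Inventories are stated at the lower of cost and net realizable value.",
]

vectors = tfidf_vectors(filings)
sim_close = cosine(vectors[0], vectors[1])  # two revenue-recognition texts
sim_far = cosine(vectors[0], vectors[2])    # revenue text vs. inventory text
print(f"similar policies: {sim_close:.3f}, different policies: {sim_far:.3f}")
```

Because each weight is tied to a specific term, the terms that contribute most to a given score can be read directly from the vectors, which is one sense in which such similarity signals are interpretable.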

Version published to 10.21203/rs.3.rs-8696862/v1 on Research Square
Jan 29, 2026

Related articles

Unsupervised text clustering with large language models

This article has 6 authors:
1. Leonid Kuligin
2. Jacqueline Lammert
3. Florence Heinkelein
4. Keno Bressem
5. Martin Boeker
6. Maximilian Tschochohei
This article has no evaluations
Latest version Feb 23, 2026

Public Signals of Python‑Enabled AI in Finance: Disclosure Patterns and Outcome Claims in NYSE Institutions

This article has 1 author:
1. Veliota Drakopoulou
This article has no evaluations
Latest version Feb 18, 2026

Social-RAG: A Retrieval-Augmented Generation Pipeline for Computational Social Science Research on Telegram

This article has 6 authors:
1. Leonardo F. Nascimento
2. Eric Brasil
3. Ruan Arthur Lima Santos
4. Gabriel Andrade
5. Ricardo Sodré Andrade
6. Tarssio Barreto
This article has no evaluations
Latest version Feb 19, 2026
