LLM-Driven Evaluation of Text Embedding Similarities for Job Posting Deduplication

Giannis Thivaios
Panagiotis Zervas
Giannis Tzimas

Abstract

This study presents a method for detecting and removing duplicate job postings in large datasets, with emphasis on key attributes such as job title, location, company name, and job description. The approach begins with a preprocessing phase that standardizes the text data (normalizing formats, removing special characters, and resolving lexical variations) to ensure consistency and compatibility. For deduplication, we use WordLlama, a fast and lightweight NLP toolkit optimized for fuzzy deduplication and similarity detection. We also evaluate how well various Large Language Models (LLMs) identify duplicates, measuring accuracy with precision and recall. The objective is to determine which model best captures semantic similarities in job postings and achieves the highest deduplication accuracy. This comparison offers insight into the effectiveness of LLMs for large-scale, text-based deduplication of job postings.
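The pipeline described in the abstract (normalize the key fields, flag similar postings, score the result with precision and recall) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the field names, the 0.9 threshold, and all function names are assumptions, and the similarity step uses the standard library's `difflib.SequenceMatcher` as a stand-in for WordLlama so that the example runs without extra dependencies.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names; the abstract lists these attributes but not a schema.
FIELDS = ("title", "location", "company", "description")


def normalize(text):
    """Lowercase, strip accents and special characters, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def posting_key(posting):
    """Join the normalized key attributes into one comparable string."""
    return " | ".join(normalize(posting.get(field, "")) for field in FIELDS)


def predict_duplicate_pairs(postings, threshold=0.9):
    """Return index pairs (i, j), i < j, whose keys are at least `threshold` similar.

    SequenceMatcher is a character-level stand-in for an embedding-based
    similarity; the threshold value is an assumption, not taken from the paper.
    """
    keys = [posting_key(p) for p in postings]
    pairs = set()
    for i, j in combinations(range(len(keys)), 2):
        if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
            pairs.add((i, j))
    return pairs


def precision_recall(predicted, actual):
    """Precision and recall of predicted duplicate pairs against labeled pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

The pairwise loop is quadratic in the number of postings, so it only illustrates the evaluation logic; a large dataset would need a faster similarity method or a blocking step before comparison.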

Version published to 10.20944/preprints202506.1143.v1
Jun 13, 2025

