LLM-Driven Evaluation of Text Embedding Similarities for Job Posting Deduplication
Abstract
This study presents a method for detecting and removing duplicate job postings in large datasets, with emphasis on key attributes such as job title, location, company name, and job description. The approach begins with a preprocessing phase that standardizes the text data (normalizing formats, removing special characters, and resolving lexical variations) to ensure consistency and compatibility. For deduplication, we use WordLlama, a fast and lightweight NLP toolkit optimized for fuzzy deduplication and similarity detection. We also evaluate how well various Large Language Models (LLMs) identify duplicates, measuring accuracy with precision and recall. The objective is to determine which model best captures semantic similarity between job postings and achieves the highest deduplication accuracy. This comparison offers insight into the effectiveness of LLMs for large-scale, text-based deduplication of job postings.
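The pipeline described above (normalize, score pairwise similarity, then measure precision and recall against labeled duplicates) can be illustrated with a minimal sketch. This is not the study's implementation: it uses only the Python standard library, and token-set Jaccard similarity stands in for the WordLlama and LLM embedding similarities that the study actually evaluates. The function names, the field names, the 0.8 threshold, and the sample postings are all hypothetical.

```python
import re
import unicodedata
from itertools import combinations

FIELDS = ("title", "company", "location", "description")  # assumed field names


def normalize(text: str) -> str:
    """Lowercase, strip accents and special characters, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a stand-in for embedding similarity)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def predict_duplicate_pairs(postings, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    keys = [normalize(" ".join(p[f] for f in FIELDS)) for p in postings]
    return {
        (i, j)
        for i, j in combinations(range(len(keys)), 2)
        if jaccard(keys[i], keys[j]) >= threshold
    }


def precision_recall(predicted: set, actual: set):
    """Precision and recall of predicted duplicate pairs against labeled pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical sample data: postings 0 and 1 differ only in formatting.
postings = [
    {"title": "Senior Data Engineer", "company": "Acme Corp.",
     "location": "Berlin, Germany",
     "description": "Build and maintain data pipelines."},
    {"title": "Senior Data-Engineer", "company": "ACME Corp",
     "location": "Berlin (Germany)",
     "description": "Build and maintain data pipelines!"},
    {"title": "Office Manager", "company": "Globex",
     "location": "Paris, France",
     "description": "Coordinate daily office operations."},
]
labeled_duplicates = {(0, 1)}  # hypothetical ground-truth labels

predicted = predict_duplicate_pairs(postings)
precision, recall = precision_recall(predicted, labeled_duplicates)
print(predicted, precision, recall)
```

Comparing models in this framing would amount to replacing `jaccard` with each model's embedding similarity and comparing the resulting precision and recall at a given threshold. The all-pairs loop is quadratic in the number of postings, so a large dataset would need blocking or approximate nearest-neighbor search rather than this direct comparison.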