Navigating Complexity: A Tailored Question-Answering Approach for PDFs in Finance, Bio-Medicine, and Science

Teerath Kumar
Rutu Bhujbal
Kislay Raj
Arunabha M. Roy

Abstract

Understanding complex Portable Document Format (PDF) files, such as research papers, clinical reports, and scientific manuals, is often time-consuming. Although question-answering (QA) systems that yield contextually relevant responses have progressed significantly, building a comprehensive end-to-end machine learning model that can answer intricate questions remains a formidable challenge. Such systems typically need substantial labeled data to train their foundation models for specific tasks, and assembling these datasets is particularly difficult for complex documents such as annual reports from major technology companies. In this paper, we address this issue by developing a QA system specifically designed for PDF documents, focusing on the domains of finance, biomedicine, and scientific literature. We manually curated datasets from these areas for evaluation and used pre-trained Bidirectional Encoder Representations from Transformers (BERT) models from the Hugging Face library. The models were evaluated using the F1 score; the best result, an F1 score of 44%, was obtained with the BERT Large model.
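
For readers unfamiliar with how an F1 score is usually computed for extractive QA, the following is a minimal sketch of the token-overlap F1 popularized by SQuAD-style evaluation. The abstract does not state which F1 variant the authors used, so the normalization steps and the function names `normalize` and `token_f1` are assumptions made for illustration, not the paper's evaluation code.

```python
from collections import Counter
import re
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: this mirrors common SQuAD-style normalization; the paper
    may normalize answers differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall for one answer pair."""
    pred = normalize(prediction)
    ref = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score under this scheme would typically be the mean of `token_f1` over all question-answer pairs, so a reported 44% would correspond to an average per-question overlap of 0.44.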

Version published to 10.20944/preprints202410.1395.v1
Oct 17, 2024