Building a Question-Answering System to Extract Information From PDF Files Using BERT Transformers

Rutuja Bhujbal
Kislay Raj
Teerath Kumar

Abstract

Comprehending complex PDFs such as research documents, clinical reports, and scientific manuals is time-consuming. Previous studies have demonstrated significant success in building question-answering systems that provide contextually relevant answers to user queries. However, handling complex questions within a single end-to-end trained ML model remains difficult. Such systems require a large amount of labeled training data to train the base models for specific tasks, and creating such data sets is still a challenge for complicated documents like the annual reports of big tech companies. This paper addresses this challenge by constructing a question-answering system tailored for PDF files, specifically targeting the finance, biomedicine, and scientific literature domains. Curated data sets for PDFs from the chosen domains were created manually for evaluation. Pre-trained Bidirectional Encoder Representations from Transformers (BERT) models from the Hugging Face library were applied to the chosen domains and evaluated with the F1 score. BERT Large achieved a score of 44%.
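The abstract does not say how the F1 score was computed. As a rough illustration only, the sketch below assumes the token-overlap F1 commonly used for extractive question answering (in the style of the SQuAD evaluation); the function names `normalize` and `token_f1` are hypothetical and are not taken from the paper.

```python
from collections import Counter
import re
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: this mirrors the usual extractive-QA metric; the paper may
    have used a different variant.
    """
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction that contains the reference plus extra words keeps full recall but loses precision, so long answer spans pulled from a PDF are penalized even when they include the correct phrase.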

Version published to 10.20944/preprints202506.2105.v1
Jun 25, 2025

Prompt-Orchestrated Large Language Models for Clinical Information Extraction

This article has 13 authors:
1. Livia Lilli
2. Andrea Rosati
3. Giovanni Paolo Tobia
4. Massimo Criscione
5. Federica Tomassini
6. Chiara Dachena
7. Alice Luraschi
8. Chiara Cantarini
9. Carolina De Maria
10. Luigi Congedo
11. Massimo Bernaschi
12. Stefano Patarnello
13. Anna Fagotti
This article has no evaluations. Latest version: Jan 16, 2026

Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance

This article has 3 authors:
1. Nimet Aksoy
2. Zekeriya Anıl Güven
3. Murat Osman Ünalır
This article has no evaluations. Latest version: Dec 12, 2025

Screenathon 2.0: Human–AI Collaborative Screening Applied to Patient-Generated Health Data

This article has 11 authors:
1. Jonas Bergmann
2. Tiago Azzi
3. Rutger Chris Neeleman
4. Kianush Monschau
5. Elena Jalsovec
6. Emily Westerbeek
7. Felix Weijdema
8. Jonathan de Bruin
9. Qixiang Fang
10. Rens van de Schoot
11. Berke Yazan
This article has no evaluations. Latest version: Jan 9, 2026
