Building a Question-Answering System to Extract Information From PDF Files Using BERT Transformers
Abstract
Comprehending complex PDFs such as research documents, clinical reports, and scientific manuals is a time-consuming task. Previous studies have demonstrated significant success in building question-answering systems that provide contextually relevant answers to user queries. However, answering complex questions with a single end-to-end trained ML model remains a difficult task. Such systems require a large amount of labeled training data to train the base models for specific tasks, and creating such data sets is still a challenge for complicated documents like the annual reports of big tech companies. This research paper addresses this challenge by constructing a question-answering system tailored to PDF files, specifically targeting domains such as finance, bio-medicine, and scientific literature. Curated data sets were created manually from PDFs in the chosen domains for evaluation. Pre-trained Bidirectional Encoder Representations from Transformers (BERT) models from the Hugging Face library were applied to the chosen domains and evaluated using the F1 score. BERT Large achieved an F1 score of 44%.
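The abstract does not say how the F1 score was computed. As an illustration only, the sketch below shows the token-overlap F1 commonly used for extractive question answering (SQuAD-style), assuming simple lowercasing and whitespace tokenization. The function name `token_f1` and the example strings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated words. The
    paper's own normalization, if any, is not described in the abstract.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction contains the reference plus extra words,
# so recall is perfect while precision is penalized.
score = token_f1("net income of 5 billion", "5 billion")
```

A corpus-level score under this scheme would be the mean of `token_f1` over all question-answer pairs in the evaluation set.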