PruneBERT: Context-Aware Sentence Classification through Statistical Relevance Pruning
Abstract
Traditional grading mechanisms are time-consuming, prone to error, and sometimes even biased. To enable grading that is fast, largely error-free, and unbiased, AI-based Automated Grading Systems (AGSs) are the go-to technology, and they can be expected to streamline the workload of instructors in higher education. A predominant source of delay and error in both manual grading and AGSs is the presence of irrelevant information within an assignment/answer sheet. This points to a newly identified, major limitation of existing AGSs: their inability to recognize unimportant sentences when grading student responses. In this paper, we leverage contextual embeddings from Sentence-BERT (all-mpnet-base-v2) together with semantic representations from BERT to introduce PruneBERT, a novel dual-embedding framework. Our method captures both inter-sentence coherence and fine-grained semantic distinctions to classify irrelevant sentences in domain-specific texts, enhancing the precision of AGSs. Central to PruneBERT is an adaptive thresholding mechanism that dynamically adjusts similarity cutoffs based on statistical properties of the cosine similarity distribution, enabling robust irrelevance filtering across diverse textual inputs. Evaluated on a curated corpus of approximately 3,000 sentences from computer science domains, PruneBERT achieves a relative improvement of 40% in F1 score over conventional threshold-based and single-embedding baselines. The approach offers a lightweight, interpretable, and computationally efficient alternative to large language model inference, making it well-suited for scalable applications in automated grading, summarization, and domain-aware content filtering.
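As a rough illustration of the kind of statistical relevance pruning the abstract describes, the sketch below scores each sentence of a response by its mean cosine similarity to the other sentences and prunes those falling below a distribution-derived cutoff. The specific cutoff rule (mean minus k standard deviations), the prune_irrelevant helper, and the single-embedding setup are assumptions made here for illustration; the paper's actual dual-embedding procedure is not specified in the abstract.

```python
# Hypothetical sketch of adaptive, statistics-based relevance pruning.
# Assumption: the cutoff is derived from the mean and standard deviation
# of the per-sentence cosine-similarity scores; the paper may use a
# different statistic.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def prune_irrelevant(sentences, k=1.0):
    """Keep sentences whose mean similarity to the rest of the response
    clears an adaptive, distribution-derived threshold (assumed rule)."""
    emb = model.encode(sentences, normalize_embeddings=True)
    sim = emb @ emb.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, np.nan)        # ignore self-similarity
    scores = np.nanmean(sim, axis=1)     # per-sentence relevance score
    threshold = scores.mean() - k * scores.std()  # adaptive cutoff
    return [s for s, sc in zip(sentences, scores) if sc >= threshold]

answer = [
    "A binary search tree keeps keys ordered for efficient lookup.",
    "Each node's left subtree holds smaller keys, the right larger ones.",
    "By the way, I really enjoyed this course.",
]
print(prune_irrelevant(answer))  # the off-topic sentence should be dropped
```

The parameter k controls how aggressively low-similarity sentences are pruned; tuning it per corpus is one plausible way such a mechanism could adapt its cutoff across diverse textual inputs.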