Automating Software Size Measurement from Code Using Language Models
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Software size is a key input for project planning, effort estimation, and productivity analysis. While pre-trained language models have shown promise in deriving functional size from natural-language requirements, predicting size directly from source code remains under-explored. Yet, code-based size measurement is critical in modern workflows where requirement documents are often incomplete or unavailable, especially in Agile development environments. This study investigates the use of CodeBERT, a pre-trained bimodal transformer model, for predicting software size from code according to two measurement methods: COSMIC Function Points and MicroM. We construct two curated datasets from the Python subset of the CodeSearchNet corpus, and manually annotate each function with its corresponding size. Our experimental results show that CodeBERT can successfully predict COSMIC data movements with up to 91.4% accuracy and generalize to the functional, architectural, and algorithmic event types defined in MicroM, reaching up to 81.5% accuracy. These findings highlight the potential of code-based language models for automated functional size measurement when requirement artifacts are absent or unreliable.