AlchemBERT: Exploring Lightweight Language Models for Materials Informatics
Abstract
The emergence of large language models (LLMs) has spurred numerous applications across various domains, including material design. In this field, a growing number of generative models aim to directly generate materials with desired properties, moving away from the traditional approach of enumerating vast numbers of candidates and relying on computationally intensive screening algorithms. However, we assert that without accurate prediction capabilities, effective material design is unattainable: generating candidate structures is futile if their quality cannot be reliably evaluated by language models. Matbench provides an excellent foundation for predictive tasks, yet prior efforts with LLMs have primarily focused on composition-related tasks using models such as GPT or LLaMA. In this study, we revisit BERT, a relatively small language model with 110 million parameters, significantly smaller than GPT or LLaMA models containing billions of parameters. Remarkably, we demonstrate that BERT-base achieves performance comparable to these larger models on material property prediction tasks. Beyond composition tasks, we introduce BERT's application to structure-based prediction using CIF (Crystallographic Information File) data and natural language descriptions of structures, with natural language outperforming CIF by an average of 40.3% across all tasks. Our results rival state-of-the-art composition models such as CrabNet and, on several tasks across datasets ranging from a few hundred to over a hundred thousand samples, even surpass traditional structure-based and knowledge-driven models. Additionally, on the latest Matbench test task, Matbench-Discovery, our model outperforms the Voronoi-RF model and achieves MAE results comparable to other models that rely solely on energy predictions.
Our findings provide a new reference point for future LLM applications in material design, offering valuable insights for leveraging language models in this domain and emphasizing natural language descriptions over conventional model-centric design. We term this application of BERT in material design AlchemBERT, signifying its novel role in bridging natural language and structural representations.
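The abstract describes fine-tuning BERT on natural-language descriptions of structures to predict scalar material properties. A minimal sketch of that setup is shown below: a BERT encoder whose [CLS] representation feeds a linear regression head. This is an illustrative reconstruction, not the paper's implementation; the `PropertyRegressor` name and the tiny configuration (used so the sketch runs without downloading the 110M-parameter BERT-base weights) are assumptions.

```python
# Illustrative sketch (assumption): a BERT encoder with a scalar regression
# head, as an AlchemBERT-style property predictor might be structured.
# A tiny BertConfig stands in for BERT-base so the example runs offline.
import torch
from transformers import BertConfig, BertModel

config = BertConfig(
    hidden_size=64,          # BERT-base uses 768
    num_hidden_layers=2,     # BERT-base uses 12
    num_attention_heads=2,   # BERT-base uses 12
    intermediate_size=128,   # BERT-base uses 3072
)

class PropertyRegressor(torch.nn.Module):
    """BERT encoder + linear head predicting one property value per input."""

    def __init__(self, config: BertConfig):
        super().__init__()
        self.encoder = BertModel(config)
        self.head = torch.nn.Linear(config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.head(cls).squeeze(-1)   # shape: (batch_size,)

# Usage: token IDs would come from tokenizing a natural-language structure
# description (or a serialized CIF); random IDs stand in here.
model = PropertyRegressor(config)
input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)
predictions = model(input_ids, attention_mask)
```

In practice the encoder would be initialized from pretrained BERT-base weights and fine-tuned end-to-end with an MAE or MSE loss against the Matbench target values.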