Probing Hidden States for Calibrated, Alignment-Resistant Predictions in LLMs


Abstract

Scientific applications of large language models (LLMs) demand reliable, well-calibrated predictions, but standard generative approaches often fail to fully access relevant knowledge contained in their internal representations. As a result, models appear less capable than they are, with useful information remaining latent. We present PING (Probing INternal states of Generative models), an open-source framework that trains lightweight probes on frozen, HuggingFace-compatible transformers to deliver structured predictions with minimal compute overhead. Across diverse models and benchmarks, including MMLU for broad coverage and MedMCQA for clinical focus, PING matches or exceeds generative accuracy while reducing Expected Calibration Error by up to 96%. Strikingly, on an LLM explicitly safety-tuned to withhold medical information, PING recovered 87% of the lost MedMCQA performance even though generative accuracy was zero, showing that this information still exists in the model’s latent space. The accompanying pingkit package makes these methods easy to deploy and is available through PyPI.
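
For orientation, the snippet below is a minimal sketch of the general hidden-state probing idea the abstract describes, not the pingkit API: it extracts a frozen HuggingFace transformer's final-layer hidden state for each prompt and fits a lightweight linear probe on top. The model name, prompts, and labels are illustrative assumptions.

```python
# Sketch of hidden-state probing (illustrative; not the pingkit API).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # assumption: any HuggingFace-compatible model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # the base model stays frozen; only the probe is trained

def hidden_state(prompt: str) -> torch.Tensor:
    """Return the last-layer hidden state at the final token position."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1][0, -1]  # shape: (hidden_dim,)

# Toy training data: prompts paired with answer-class labels (hypothetical).
prompts = ["Question 1 ...", "Question 2 ..."]
labels = [0, 1]

features = torch.stack([hidden_state(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(features, labels)

# The probe yields class probabilities directly from internal representations,
# without any generative decoding.
print(probe.predict_proba(features))
```

In this framing, the probe is the only trained component, so the compute overhead is limited to a forward pass per example plus fitting a small classifier.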
