Probing Hidden States for Calibrated, Alignment-Resistant Predictions in LLMs
Abstract
Scientific applications of large language models (LLMs) demand reliable, well-calibrated predictions, but standard generative approaches often fail to fully access the relevant knowledge contained in their internal representations. As a result, models appear less capable than they are, with useful information remaining latent. We present PING (Probing INternal states of Generative models), an open-source framework that trains lightweight probes on frozen, HuggingFace-compatible transformers to deliver structured predictions with minimal compute overhead. Across diverse models and benchmarks, including MMLU for broad coverage and MedMCQA for clinical focus, PING matches or exceeds generative accuracy while reducing Expected Calibration Error by up to 96%. Strikingly, on an LLM that has been explicitly safety-tuned to withhold medical information, PING recovered 87% of the lost MedMCQA performance even though generative accuracy was zero, showing that this information still exists in the model’s latent space. The accompanying pingkit package makes these methods easy to deploy and is available through PyPI.
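To make the core idea concrete, below is a minimal, illustrative sketch of training a lightweight probe on the frozen hidden states of a HuggingFace-compatible model for a multiple-choice task. This is not pingkit's actual API; the model name, layer choice, single-linear-layer probe, and 4-way answer space are assumptions made purely for illustration.

```python
# Illustrative sketch only: a linear probe over frozen LLM hidden states.
# Model name, probed layer, and probe architecture are assumptions, not pingkit's API.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in for any HuggingFace-compatible transformer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()  # the base LLM stays frozen; only the probe below is trained


def last_token_hidden_state(text: str, layer: int = -1) -> torch.Tensor:
    """Return the hidden state of the final token at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients flow into the frozen LLM
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]  # shape: (hidden_size,)


# Lightweight probe: one linear layer mapping a hidden state to answer
# options (e.g. A/B/C/D for MMLU- or MedMCQA-style questions).
probe = nn.Linear(model.config.hidden_size, 4)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()


def train_step(question: str, label: int) -> float:
    """One optimization step on a single (question, answer-index) pair."""
    features = last_token_hidden_state(question)
    logits = probe(features.unsqueeze(0))
    loss = loss_fn(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the small probe receives gradients, the overhead is limited to one forward pass through the frozen model per example plus a tiny linear layer, which is what allows this style of prediction to run with minimal additional compute.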