A regulatory medical device dataset with risk labels and an image-linked subset from the NMPA registry
Abstract
We present NMPA-MedDevice, a regulatory dataset derived from the Unique Device Identification (UDI) registry of China's National Medical Products Administration (NMPA). The release comprises four components: (1) a frozen raw snapshot of the NMPA UDI registry (66,472 records, July 2024); (2) a reproducibly cleaned text-and-metadata corpus of approximately 52,000 unique device records, with risk class labels derived deterministically from the ninth character of the NMPA registration number; (3) a curated image-linked subset of 1,005 devices (39 Class I, 462 Class II, and 504 Class III) with precomputed text and image feature embeddings; and (4) an external temporal validation set of 300 devices drawn from a later registry update (October–November 2025). All textual data, derived labels, the cleaned corpus, preprocessing scripts, dataset splits, and precomputed features are publicly deposited. Raw product images are not redistributed because of copyright restrictions; precomputed embeddings and scripts for retrieving the images are provided instead.
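The label derivation described above can be illustrated with a minimal sketch. It is not the released preprocessing script: the function name `risk_class_from_registration`, the digit-to-class mapping (`1`/`2`/`3` to Class I/II/III), and the choice to return `None` for malformed or unrecognized numbers are all assumptions made for illustration, and the registration numbers in the usage lines are made-up examples rather than records from the dataset.

```python
from typing import Optional

# Assumed mapping from the ninth character to a risk class label.
# The abstract states only that the ninth character determines the class;
# the concrete digit-to-label mapping here is an illustrative assumption.
_CLASS_BY_DIGIT = {"1": "I", "2": "II", "3": "III"}


def risk_class_from_registration(reg_number: str) -> Optional[str]:
    """Return 'I', 'II', or 'III' from the ninth character of a
    registration number, or None if the number is too short or the
    ninth character is not a recognized class digit."""
    cleaned = reg_number.strip()
    if len(cleaned) < 9:
        return None
    # Index 8 is the ninth character (Python strings index by code point,
    # so each Chinese character in the prefix counts as one position).
    return _CLASS_BY_DIGIT.get(cleaned[8])


if __name__ == "__main__":
    # Hypothetical registration numbers, for illustration only.
    print(risk_class_from_registration("国械注准20173460345"))
    print(risk_class_from_registration("苏械注准20162270123"))
    print(risk_class_from_registration("too-short"))
```

Returning `None` rather than raising keeps unparseable records visible for later inspection, which is one plausible design choice when cleaning a registry snapshot; the released scripts may handle such records differently.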