Analytical Centralization of Health Expenditure at the National Administrator of Health System Resources: Architecture, Data Quality, and Operational Performance of the ADRES Health System Analytics Platform, Colombia
Abstract
Background
Between 2024 and 2025, Colombia universalized the Electronic Health Invoice with embedded Individual Health Services Delivery Records (FEV-RIPS; RIPS stands for Registro Individual de Prestación de Servicios de Salud) as the standard for financial and clinical data exchange across the health system. ADRES, the entity responsible for administering the resources of the General Social Security Health System (SGSSS), faced the challenge of processing information from multiple heterogeneous sources generated by more than 55,000 healthcare providers of varying complexity. Whereas health systems in high-income countries consolidate clinical-financial data in platforms with years of operation, Colombia started from a fragmented architecture with incompatible historical sources, no cross-database standardization, and a resource administrator that had no centralized analytical infrastructure until 2023.
Objective
We describe the design, the technical challenges of integrating heterogeneous data, and the operational performance of the analytical infrastructure built by ADRES to centralize large-scale processing of Colombian health system information, and we derive transferable lessons for health system resource administrators in Latin America facing equivalent digitalization mandates.
Methods
Technical-descriptive report based on operational metrics from the ADRES Azure/Databricks environment for January–November 2025. We report indicators of data volume managed, processing speed, deployed computational capacity, concurrent use by functional group, and the governance structure implemented. The architecture integrates secure data transfer with the Ministry of Health (MinSalud) via VPN, OneLake Fabric connectivity, automated processing of multiple formats (XML, relational tables, flat files), and a data lake organized in a medallion pattern (Bronze/Silver/Gold) fed by automated pipelines. We characterize data quality challenges along four dimensions: structural inconsistencies across source systems, coding incompatibilities (municipalities, dates, diagnoses), format heterogeneity in unstructured data, and the absence of complete technical documentation.
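The medallion pattern named above can be illustrated with a minimal sketch in plain Python. This is not the ADRES pipeline, which runs on Databricks. The function names, the `provider_id` and `amount` fields, and the validation rules are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical field names ("provider_id", "amount"); not taken from ADRES schemas.

def to_bronze(raw_rows, source):
    """Bronze: keep every raw record untouched, tagged with its source."""
    return [{"source": source, "raw": row} for row in raw_rows]

def to_silver(bronze_rows):
    """Silver: enforce types and drop records that fail basic validation."""
    silver = []
    for rec in bronze_rows:
        raw = rec["raw"]
        try:
            amount = float(raw["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric amount
        provider = str(raw.get("provider_id", "")).strip()
        if not provider:
            continue  # no provider identifier
        silver.append(
            {"source": rec["source"], "provider_id": provider, "amount": amount}
        )
    return silver

def to_gold(silver_rows):
    """Gold: business-level aggregate, here total amount per provider."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["provider_id"]] += row["amount"]
    return dict(totals)

raw = [
    {"provider_id": " P001 ", "amount": "150.5"},
    {"provider_id": "P001", "amount": "49.5"},
    {"provider_id": "P002", "amount": "n/a"},  # rejected at Silver
    {"amount": "10"},                          # rejected at Silver
]
gold = to_gold(to_silver(to_bronze(raw, "flat_file")))
print(gold)  # {'P001': 200.0}
```

Keeping the Bronze layer unmodified means that a Silver rule can be corrected later and the cleaned tables rebuilt without requesting the source files again.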
Results
The platform manages 21 catalogs, 1,183 tables, and more than 110.6 billion stored records, with cumulative production exceeding 1 trillion processed records. It executes queries over 100 billion records in ten seconds, using clusters of up to 32 TB of RAM and 4,096 vCPUs. During September–October 2025, monthly query peaks reached 78,028, distributed across eleven institutional functional groups. Integrating heterogeneous sources required developing specific technical capabilities: Python/PySpark parsers for XML with variable node depth; institutional equivalence tables to reconcile incompatible municipality codes between the BDUA (the unified affiliate database) and service delivery records; cleaning routines for extreme dates used as null representations (1900-01-01, 9999-12-31); and transformation logic to build coherent longitudinal series bridging classic RIPS and FEV-RIPS. During 2024–2025, the platform supported econometric expenditure analyses, multi-source information contrasts, responses to judicial mandates from the Constitutional Court, and the publication of interactive dashboards on the ADRES institutional site. The integration of conversational AI agents (Genie, Copilot) gives users without SQL knowledge analytical access, expanding the platform’s institutional reach.
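Three of the capabilities listed above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the ADRES code, which uses PySpark. The helper names, the sample XML tags, the 5-digit zero-padding rule, and the equivalence entry are all hypothetical. Real FEV-RIPS XML may also carry namespaces, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The two sentinel values reported in the text as stand-ins for null.
SENTINEL_DATES = {date(1900, 1, 1), date(9999, 12, 31)}

def clean_date(text):
    """Parse an ISO date; return None if missing, unparseable, or a sentinel."""
    if not text:
        return None
    try:
        parsed = date.fromisoformat(text.strip())
    except ValueError:
        return None
    return None if parsed in SENTINEL_DATES else parsed

def flatten_xml(element, prefix=""):
    """Collect leaf text into {path: value}, whatever the nesting depth.

    Repeated sibling tags overwrite each other in this simplified version.
    """
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        return {path: (element.text or "").strip()}
    flat = {}
    for child in children:
        flat.update(flatten_xml(child, path))
    return flat

def normalize_municipality(code, equivalences):
    """Zero-pad to 5 digits (assumed format), then apply an equivalence table."""
    padded = str(code).strip().zfill(5)
    return equivalences.get(padded, padded)

# Invented sample document and invented equivalence entry.
doc = "<invoice><id>A1</id><services><service><dx>J00</dx></service></services></invoice>"
flat = flatten_xml(ET.fromstring(doc))
print(flat)  # {'invoice/id': 'A1', 'invoice/services/service/dx': 'J00'}
print(clean_date("9999-12-31"))  # None
print(normalize_municipality(5001, {}))  # 05001
```

Mapping sentinel dates to an explicit null before any aggregation keeps values such as 9999-12-31 from distorting date ranges and durations in longitudinal series.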
Conclusions
ADRES built its analytical infrastructure in one year, and this report provides, to our knowledge, the first published documentation of the systemic technical challenges of integrating heterogeneous data sources in a middle-income social security health system. The case demonstrates that centralizing health system information at national scale is technically feasible under the institutional constraints of a public entity. It also shows that doing so requires solving a set of cross-source data standardization problems that the literature on health information system implementation in middle-income countries does not document with quantitative precision. The derived lessons are transferable to health system resource administrators in Latin America facing equivalent challenges of heterogeneous information integration.