A Deployment-Aware Framework for Carbon- and Water-Efficient LLM Serving

Abstract

Inference now dominates the lifecycle footprint of large language models, yet published estimates often use inconsistent boundaries and optimize carbon while ignoring water. We present a provider-agnostic framework that unifies scope-transparent measurement with time-resolved orchestration aware of service-level objectives (SLOs), jointly optimizing carbon and consumptive water. Measurement reports daily medians at a comprehensive serving boundary that includes accelerators, host CPU/DRAM, provisioned idle, and power-usage-effectiveness (PUE) uplift, and provides accelerator-only whiskers for reconciliation. Optimization uses a mixed-integer linear program solved over five-minute windows; it selects region, batch size, and phase-aware hardware for prefill and decode while enforcing p95 time-to-first-token (TTFT) and time-per-output-token (TPOT) limits as well as capacity constraints. Applied to four representative models, a single SLO-aware policy reduces comprehensive-boundary medians by 57 to 59 percent for energy, 59 to 60 percent for water, and 78 to 80 percent for location-based CO2, with SLOs met in every window. For a day with 500 million queries on GPT-4o, daily totals fall from 0.344 to 0.145 GWh of energy, from 1.196 to 0.490 megaliters of water, and from 121 to 25 t of location-based CO2. The framework offers a deployable template for carbon- and water-aware LLM serving with auditable, scope-transparent reporting.
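As a rough illustration of the comprehensive serving boundary described above, the sketch below sums accelerator, host CPU/DRAM, and provisioned-idle energy, applies the PUE uplift, and converts the result to location-based carbon and consumptive water. The numeric factors and the split of water into an on-site water-usage-effectiveness (WUE) term and a grid water-intensity term are common conventions in the literature, assumed here for illustration; they are not the paper's exact equations or values.

```python
def window_footprint(e_accel_kwh, e_host_kwh, e_idle_kwh,
                     pue=1.2,                 # PUE uplift (assumed)
                     ci_kg_per_kwh=0.35,      # grid carbon intensity (assumed)
                     wue_l_per_kwh=1.8,       # on-site WUE applied to IT energy (assumed)
                     grid_wif_l_per_kwh=1.0): # off-site grid water intensity (assumed)
    # IT-side energy at the comprehensive boundary:
    # accelerators + host CPU/DRAM + provisioned idle.
    e_it = e_accel_kwh + e_host_kwh + e_idle_kwh
    # Facility energy after the PUE uplift.
    e_total = pue * e_it
    # Location-based CO2 from total facility energy.
    carbon_kg = ci_kg_per_kwh * e_total
    # Consumptive water: on-site cooling on IT energy plus off-site grid water.
    water_l = wue_l_per_kwh * e_it + grid_wif_l_per_kwh * e_total
    return e_total, carbon_kg, water_l

# Example: 2.0 kWh of accelerator, 0.4 kWh of host, 0.6 kWh of idle energy
# in one window (all values hypothetical).
print(window_footprint(2.0, 0.4, 0.6))
```

Keeping the accelerator-only term separate from the host and idle terms is what allows the accelerator-only whiskers mentioned in the abstract to be reported alongside the comprehensive-boundary medians.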
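The orchestration step can likewise be sketched as a small per-window assignment program. The version below, written with PuLP, picks which (region, hardware, batch, phase) options serve the prefill and decode demand while minimizing a weighted carbon-plus-water objective under p95 latency and capacity constraints. All names, tables, the demand level, and the weight alpha are hypothetical placeholders standing in for the paper's measured inputs.

```python
import pulp

regions = ["r1", "r2"]
hw = ["gpuA", "gpuB"]
batches = [8, 32]
phases = ["prefill", "decode"]
opts = [(r, h, b, p) for r in regions for h in hw for b in batches for p in phases]

# Hypothetical per-option tables for one five-minute window (arbitrary units).
co2 = {o: 1.0 + 0.1 * i for i, o in enumerate(opts)}    # kg CO2 if option is active
water = {o: 0.5 + 0.05 * i for i, o in enumerate(opts)} # liters if option is active
p95 = {o: 50 + 5 * i for i, o in enumerate(opts)}       # ms, p95 TTFT or TPOT
cap = {o: 100 + 10 * i for i, o in enumerate(opts)}     # queries the option can absorb
demand = 150                                            # queries per phase this window
slo = {"prefill": 200, "decode": 80}                    # p95 limits per phase (ms)
alpha = 0.3                                             # water-vs-carbon weight (assumed)

prob = pulp.LpProblem("window", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", opts, cat="Binary")      # 1 if the option is active
q = pulp.LpVariable.dicts("q", opts, lowBound=0)        # queries routed to the option

# Joint carbon + water objective, mirroring the abstract's co-optimization.
prob += pulp.lpSum((co2[o] + alpha * water[o]) * x[o] for o in opts)

# Serve all demand in each phase; respect capacity only on active options.
for p in phases:
    prob += pulp.lpSum(q[o] for o in opts if o[3] == p) == demand
for o in opts:
    prob += q[o] <= cap[o] * x[o]

# SLO feasibility: forbid options whose p95 latency breaks the phase's limit.
for o in opts:
    if p95[o] > slo[o[3]]:
        prob += x[o] == 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("active options:", [o for o in opts if x[o].value() == 1])
```

Treating SLO violations as hard exclusions (x = 0) keeps the program linear; it stands in for the paper's p95 TTFT/TPOT constraints, and the separate phase dimension reflects the phase-aware hardware choice for prefill versus decode.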
