A Time-Resolved, SLO-Aware and Bi-Objective Framework to Measure and Minimize LLM Serving’s Carbon and Water Footprints

Julian Hoxha
Marsela Thanasi-Boçe
Tarek Khalifa

Abstract

Studies of the environmental footprint of large language model (LLM) inference often disagree because they mix incompatible system boundaries, ignore latency and throughput service level objectives (SLOs), and optimize carbon without accounting for water. We present a provider-agnostic framework that unifies scope-transparent measurement with time-resolved, bi-objective orchestration under realistic SLOs. Measurement follows production practice and reports daily medians at a comprehensive serving boundary that includes active accelerators, host CPU/DRAM, provisioned idle capacity, and facility overhead via power usage effectiveness (PUE). Consumptive water is computed as site plus source water. Carbon is location-based (LB) by default, with a market-based (MB) sensitivity analysis. Optimization is cast as a mixed-integer linear program solved over 288 five-minute windows per day. For each prompt profile, the solver selects the region, batch size, and phase-aware hardware for prefill and decode while enforcing p95 time to first token (TTFT) and time per output token (TPOT) targets along with capacity constraints. Because grid carbon intensity (CIF) and electricity water intensity (EWIF) are only weakly correlated, the policy is dual-objective by design and balances carbon and water explicitly. Applied to four representative models using public per-prompt energy tables and per-region multipliers, a single SLO-aware policy reduces comprehensive-boundary medians by 57-59% for energy, 59-60% for consumptive water, and 78-80% for LB CO₂, with SLOs met in every window. For a day with 500M queries on GPT-4o, median-scaled totals drop from 0.344 to 0.145 GWh, from 1.196 to 0.490 ML, and from 121 to 25 tCO₂ (LB). The framework also reproduces the production-observed gap between accelerator-only and comprehensive boundaries (narrow/comprehensive ≈ 0.417), enabling direct translation across studies. Pareto analyses show when routing alone, and when joint routing, batching, and token-length controls, deliver concurrent reductions in carbon and water at fixed quality of service. The combination of time-resolved control, comprehensive accounting, and dual-objective optimization yields a deployable template for decarbonization and water stewardship in LLM serving.
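As a quick consistency check on the GPT-4o figures quoted in the abstract, the short Python sketch below recomputes the relative reductions and the implied per-query footprints from the day-level totals. Only the six totals and the 500M query count come from the abstract; the variable and function names are illustrative, and the unit conversions assume GWh for energy, megaliters (ML) for water, and metric tonnes for CO₂.

```python
# Day-level totals for 500M GPT-4o queries, as quoted in the abstract.
QUERIES = 500_000_000

baseline = {"energy_GWh": 0.344, "water_ML": 1.196, "co2_t": 121.0}
optimized = {"energy_GWh": 0.145, "water_ML": 0.490, "co2_t": 25.0}

# Conversion factors to per-query units (assumed units: Wh, mL, g).
TO_SMALL_UNIT = {
    "energy_GWh": 1e9,  # 1 GWh = 1e9 Wh
    "water_ML": 1e9,    # 1 ML (megaliter) = 1e9 mL
    "co2_t": 1e6,       # 1 tonne = 1e6 g
}


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


def per_query(total: float, factor: float, queries: int = QUERIES) -> float:
    """Scale a day-level total down to a single query."""
    return total * factor / queries


reductions = {k: reduction_pct(baseline[k], optimized[k]) for k in baseline}
per_query_before = {k: per_query(baseline[k], TO_SMALL_UNIT[k]) for k in baseline}
per_query_after = {k: per_query(optimized[k], TO_SMALL_UNIT[k]) for k in optimized}

for key in baseline:
    print(
        f"{key}: -{reductions[key]:.1f}% "
        f"({per_query_before[key]:.3f} -> {per_query_after[key]:.3f} per query)"
    )
```

Under these assumptions the reductions come out to roughly 57.8% for energy, 59.0% for water, and 79.3% for LB CO₂, each inside the range the abstract reports across the four models. The implied baseline per-query footprint is about 0.688 Wh, 2.39 mL, and 0.242 g CO₂.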

Version published to 10.20944/preprints202510.0957.v1
Oct 13, 2025

