A Time-Resolved, SLO-Aware and Bi-Objective Framework to Measure and Minimize LLM Serving’s Carbon and Water Footprints

Abstract

Studies of the environmental footprint of large language model (LLM) inference often disagree because they mix incompatible system boundaries, ignore latency and throughput service level objectives (SLOs), and optimize carbon without accounting for water. We present a provider-agnostic framework that unifies scope-transparent measurement with time-resolved, bi-objective orchestration under realistic SLOs. Measurement follows production practice and reports daily medians at a comprehensive serving boundary that includes active accelerators, host CPU/DRAM, provisioned idle capacity, and facility overhead via power usage effectiveness (PUE). Consumptive water is computed as the sum of site and source water. Carbon is location-based (LB) by default, with a market-based (MB) sensitivity analysis. Optimization is cast as a mixed-integer linear program solved over 288 five-minute windows per day. For each prompt profile, the solver selects region, batch size, and phase-aware hardware for prefill and decode while enforcing p95 time to first token (TTFT) and time per output token (TPOT) constraints together with capacity constraints. Because grid carbon intensity (CIF) and electricity water intensity (EWIF) are only weakly correlated, the policy is bi-objective by design and balances carbon and water explicitly. Applied to four representative models using public per-prompt energy tables and per-region multipliers, a single SLO-aware policy reduces comprehensive-boundary medians by 57–59% for energy, 59–60% for consumptive water, and 78–80% for LB CO₂, with SLOs met in every window. For a day with 500M queries on GPT-4o, median-scaled totals drop from 0.344 to 0.145 GWh, from 1.196 to 0.490 ML, and from 121 to 25 tCO₂ (LB). The framework also reproduces the production-observed gap between accelerator-only and comprehensive boundaries (narrow/comprehensive ≈ 0.417), enabling direct translation across studies. Pareto analyses show when routing alone and when joint routing, batching, and token-length controls deliver concurrent reductions in carbon and water at fixed quality of service. The combination of time-resolved control, comprehensive accounting, and bi-objective optimization yields a deployable template for decarbonization and water stewardship in LLM serving.
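
The GPT-4o daily totals quoted above can be cross-checked against the reported percentage ranges with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: it uses only the figures stated in the abstract, and the names (`totals`, `reduction_pct`, `per_query`) are hypothetical. It assumes, as "median-scaled" suggests, that each daily total is a per-query median multiplied by the query count.

```python
# Cross-check of the abstract's GPT-4o example (illustrative sketch only).
QUERIES = 500_000_000  # queries per day, as stated in the abstract

# (baseline, optimized) daily totals quoted in the abstract
totals = {
    "energy_GWh": (0.344, 0.145),
    "water_ML": (1.196, 0.490),
    "carbon_tCO2_LB": (121.0, 25.0),
}

# Conversion to per-query units: GWh -> Wh, ML -> mL, t -> g
unit_scale = {
    "energy_GWh": 1e9,
    "water_ML": 1e9,
    "carbon_tCO2_LB": 1e6,
}


def reduction_pct(baseline: float, optimized: float) -> float:
    """Relative reduction from baseline to optimized, in percent."""
    return 100.0 * (1.0 - optimized / baseline)


# Percentage reduction implied by each pair of daily totals
reductions = {k: reduction_pct(b, o) for k, (b, o) in totals.items()}

# Implied per-query medians (Wh, mL, g CO2) before and after optimization
per_query = {
    k: (b * unit_scale[k] / QUERIES, o * unit_scale[k] / QUERIES)
    for k, (b, o) in totals.items()
}

for key in totals:
    before, after = per_query[key]
    print(f"{key}: {reductions[key]:.1f}% reduction, "
          f"per query {before:.3f} -> {after:.3f}")
```

Under these assumptions the implied reductions are roughly 57.8% for energy, 59.0% for water, and 79.3% for LB CO₂, each inside the corresponding reported range, and the baseline works out to about 0.69 Wh, 2.4 mL, and 0.24 g CO₂ per query.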
