Is a Large Context Window all you need? Exploring the Time To First Token (TTFT) vs. context size tradeoff for Autoregressive LLMs

Abstract

Recent advancements in auto-regressive large language models (henceforth referred to as LLMs) have significantly expanded context window capacities, with Meta's Llama 4 Scout achieving a 10 million token input length. This expansion is facilitated by techniques such as Rotary Position Embedding (RoPE) and YaRN (Yet Another RoPE extensioN), which encode positional information through rotational transformations, enabling models to process longer sequences effectively. This advancement opens up a host of opportunities for the now-ubiquitous LLMs. Yet attention mechanisms remain, at best, barely sub-quadratic in nature, so extending context windows introduces latency challenges, especially in scenarios where even sub-second delays can result in catastrophic failures at scale in real-life use cases, many of which can be silent. This paper examines the trade-offs between context size and latency, highlighting the need for improved context retrieval strategies that do not bloat the query passed to the large language model.
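For readers unfamiliar with RoPE, the following is a minimal NumPy sketch of the rotational transformation it applies to query/key vectors (the "rotate-half" pairwise-rotation variant used in Llama-style models; the function name, shapes, and base value are illustrative assumptions, not taken from the article):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply Rotary Position Embedding to x of shape (seq_len, dim).

    Each feature pair is rotated by an angle that grows with token
    position, so relative offsets show up as rotations in the
    query-key dot product rather than as learned absolute positions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair frequencies: theta_i = base^(-2i/dim)
    freqs = base ** (-np.arange(half) * 2.0 / dim)      # (half,)
    angles = np.outer(np.arange(seq_len), freqs)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied pairwise to (x1, x2)
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Example: rotate a toy query tensor for an 8-token, 16-dim attention head
q = np.random.randn(8, 16)
print(rope(q).shape)  # (8, 16)
```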
