Agent Memory Systems: Balancing Context Length vs Retrieval Latency
How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.
Insights on AI inference, ASIC infrastructure, and building fast AI applications.
A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.
Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency.
Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.
Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow.
Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable.
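As a rough illustration of how per-call latency compounds across sequential agent steps, here is a back-of-the-envelope sketch in Python. The step counts, time-to-first-token, and decode speeds are illustrative assumptions, not measurements from any particular system.

```python
# Rough model of end-to-end agent latency: N sequential LLM calls, each paying
# time-to-first-token plus per-token decode time. All numbers are assumptions.

def agent_task_latency(num_calls: int, ttft_s: float, tokens_out: int, s_per_token: float) -> float:
    """Total wall-clock time for an agent that makes num_calls sequential LLM calls."""
    per_call = ttft_s + tokens_out * s_per_token
    return num_calls * per_call

# A hypothetical coding-agent task: 25 sequential calls, ~300 output tokens each.
for s_per_token in (0.030, 0.010, 0.003):  # ~33, 100, 333 tokens/s decode speeds
    total = agent_task_latency(num_calls=25, ttft_s=0.5, tokens_out=300, s_per_token=s_per_token)
    print(f"{1 / s_per_token:>5.0f} tok/s -> {total / 60:.1f} minutes per task")
```

Under these assumptions the same task drops from roughly four minutes to under a minute purely from faster decoding, which is the compounding effect the post describes.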
How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.
Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.
How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.
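A minimal sketch of the eviction rule behind that idea: keep a handful of sink tokens from the start of the sequence plus a sliding window of recent tokens, and drop everything in between. The cache here is just a list of token positions standing in for KV entries; the sink count and window size are illustrative.

```python
# StreamingLLM-style cache eviction (toy version): always retain the first few
# "sink" positions plus the most recent window, evict the middle.

def evict(cache_positions: list[int], num_sinks: int = 4, window: int = 1024) -> list[int]:
    """Return the KV positions to retain: sink tokens + the most recent window."""
    if len(cache_positions) <= num_sinks + window:
        return cache_positions
    return cache_positions[:num_sinks] + cache_positions[-window:]

# After generating 5,000 tokens with 4 sinks and a 1,024-token window,
# only 1,028 KV entries survive, no matter how long the stream runs.
kept = evict(list(range(5000)))
print(len(kept), kept[:6], kept[-2:])   # 1028 [0, 1, 2, 3, 3976, 3977] [4998, 4999]
```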
How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.
How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them.
How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.
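To make the mechanism concrete, here is a toy sketch of block-level prefix caching: token IDs are grouped into fixed-size blocks, each block is keyed by a hash of every token up to and including it, and a new request reuses each leading block whose key is already cached. The block size, dict-based cache, and helper names are illustrative, not any particular engine's API.

```python
import hashlib

BLOCK = 16
kv_block_cache: dict[str, str] = {}   # block hash -> (stand-in for) cached KV data

def block_keys(token_ids: list[int]) -> list[str]:
    """Hash each full block, keyed by the entire prefix up to that block."""
    keys, prefix = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prefix += str(token_ids[i:i + BLOCK]).encode("utf-8")
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys

def cached_prefix_blocks(token_ids: list[int]) -> int:
    """How many leading blocks of this prompt can skip prefill entirely."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in kv_block_cache:
            break
        hits += 1
    return hits

# First request populates the cache; a second request sharing the same 64-token
# system prompt reuses those 4 blocks and only prefills the new suffix.
system_prompt = list(range(64))
for key in block_keys(system_prompt + [101, 102, 103]):
    kv_block_cache[key] = "kv"
print(cached_prefix_blocks(system_prompt + [999] * 40))  # -> 4
```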
A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about.
Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving.
A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference.
How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving.
How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads.
Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.
Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down.
In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.
Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.
FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.
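A minimal cascade sketch under stated assumptions: the model tiers, relative costs, and the confidence scorer below are placeholders, and real routers typically use a trained quality classifier rather than a hand-set threshold.

```python
from typing import Callable

CASCADE = [
    ("small-model",  0.10),   # (model id, relative cost per call) - illustrative
    ("medium-model", 1.00),
    ("large-model",  5.00),
]

def cascade_answer(prompt: str,
                   call_model: Callable[[str, str], str],
                   score: Callable[[str, str], float],
                   threshold: float = 0.8) -> tuple[str, float]:
    """Return (answer, total relative cost), escalating only when the score is low."""
    spent, answer = 0.0, ""
    for model_id, cost in CASCADE:
        answer = call_model(model_id, prompt)
        spent += cost
        if score(prompt, answer) >= threshold:
            break   # the cheap model was good enough; stop escalating
    return answer, spent

# Stub usage: an easy query accepted at the first, cheapest tier.
ans, cost = cascade_answer("2+2?", lambda m, p: "4", lambda p, a: 0.95)
print(ans, cost)   # -> 4 0.1
```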
Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.
Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.
DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.
Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs.
Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
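As a baseline for what those techniques improve on, here is a sketch of the simplest scheme: round-to-nearest symmetric INT8 quantization with one scale per output channel. The matrix shape is arbitrary and this is only the naive method, not GPTQ or AWQ.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-row (per-output-channel) symmetric round-to-nearest quantization to int8."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print("weight bytes:", w.nbytes, "->", q.nbytes)                  # 4x smaller than FP32
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```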
MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
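The memory effect is easy to see with back-of-the-envelope arithmetic: KV cache per token is 2 (K and V) × layers × KV heads × head dim × bytes per element. The model shape below is an illustrative 70B-class transformer with 64 query heads, not any specific checkpoint.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V, one entry per layer per KV head, stored in FP16/BF16 here.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(kv_heads=64)   # every query head has its own K/V
gqa = kv_bytes_per_token(kv_heads=8)    # 8 query heads share each K/V head
mqa = kv_bytes_per_token(kv_heads=1)    # all query heads share one K/V head

for name, b in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {b / 1024:6.1f} KiB per token, {b * 4096 / 2**20:7.1f} MiB at 4k context")
```

Under these assumptions a single 4k-token sequence needs about 10 GiB of KV cache with full multi-head attention, 1.25 GiB with 8-way GQA, and 160 MiB with MQA, which is the difference between a handful of concurrent requests and hundreds.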
Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
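A toy sketch of the matching idea only: insert each request's token sequence into a trie and ask how many leading tokens of the next request are already covered. Real RadixAttention stores KV blocks at the nodes and handles eviction and scheduling; none of that is shown here.

```python
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}

root = TrieNode()

def insert(tokens: list[int]) -> None:
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def shared_prefix_len(tokens: list[int]) -> int:
    """Longest leading run of tokens already present in the tree."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, n = node.children[t], n + 1
    return n

system = [1, 2, 3, 4, 5]                        # shared system-prompt tokens
insert(system + [10, 11])                       # turn 1 of a conversation
print(shared_prefix_len(system + [42, 43]))     # -> 5: only the new turn needs prefill
```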
Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
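Here is a skeleton of one draft-then-verify step under greedy decoding, with the model calls stubbed out: `draft_next` and `target_argmax` are stand-ins for real model invocations, and the interface (the target returning its own choice after the context and after each draft token) is an assumption of this sketch.

```python
from typing import Callable, List

def speculative_step(ctx: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_argmax: Callable[[List[int]], List[int]],
                     k: int = 4) -> List[int]:
    # 1. Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))

    # 2. One target forward pass scores all drafted positions at once, returning
    #    the target's own next-token choice after the context and after each
    #    draft token (k + 1 tokens in total).
    target = target_argmax(ctx + draft)

    # 3. Accept draft tokens while they match what the target would have produced.
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            break
        accepted.append(tok)

    # 4. Append the target's token at the first mismatch (or the bonus token if
    #    everything matched), so each step emits at least one target-model token
    #    and the output stays identical to plain greedy decoding.
    accepted.append(target[len(accepted)])
    return ctx + accepted
```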
The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.