Agent Readout

Blog directory

Plain list of posts and their one-line summaries for quick parsing.

Total posts: 37

Entries

  • Agent Memory Systems: Balancing Context Length vs Retrieval Latency. How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.
  • Building a Code Agent: Why Each Step Needs Sub-Second Inference. A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.
  • ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns. Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency.
  • Multi-Agent Architectures and the Inference Cost Explosion. Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.
  • Tool Calling Latency: The Bottleneck No One Talks About. Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow.
  • The Agentic Inference Tax: Why Agents Need 10x Faster Models. Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable.
  • Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA. How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.
  • Draft Model Selection for Speculative Decoding. Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.
  • The Attention Sink Phenomenon: Why the First Token Matters. How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.
  • Mixture of Experts at Inference Time. How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.
  • Tensor Parallelism vs Pipeline Parallelism for Model Serving. How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them.
  • Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice. How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.
  • Distillation for Inference: How Smaller Models Learn From Larger Ones. A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about.
  • FP8 Training and Inference: The Precision Sweet Spot. Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving.
  • Activation-Aware Quantization (AWQ) Deep Dive. A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference.
  • Mamba and State Space Models: Inference Without Attention. How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving.
  • RWKV and Linear Attention: Recurrent Models as an Inference Shortcut. How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads.
  • Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level. Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.
  • Token Merging and Token Pruning for Faster Transformers. Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down.
  • S3: Scheduling for Straggler Mitigation in LLM Serving. In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.
  • Chunked Prefill: Overlapping Compute and Communication. Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.
  • Cascade Inference: Using Small Models to Route to Big Ones. FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.
  • Lookahead Decoding: Parallel Token Generation Without Draft Models. Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.
  • Disaggregated Prefill and Decode (Splitwise / DistServe). Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.
  • KV Cache Compression: MLA and Beyond. DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.
  • Ring Attention: Scaling Context to Millions of Tokens. Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs.
  • Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8. Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
  • Multi-Query and Grouped-Query Attention: Shrinking the KV Cache. MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
  • Continuous Batching: The Orca Paper That Changed LLM Serving. Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
  • Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding. The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
  • SGLang and RadixAttention: Smarter KV Cache Reuse. SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
  • Speculative Decoding: Getting 3x Speedups Without Changing the Model. Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
  • PagedAttention and vLLM: Virtual Memory for LLM Serving. The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
  • FlashAttention: How Tri Dao Made Attention 4x Faster. FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
  • Build a Real-Time Voice AI Agent with General Compute. A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
  • How Coding Agents Depend on Inference Speed. Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
  • Why Inference Speed is the New Moat. Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.