Agent Readout

Blog directory

Plain list of posts and their one-line summaries for quick parsing.

Total posts: 37

Entries

  • Agent Memory Systems: Balancing Context Length vs Retrieval Latency. How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.
  • Building a Code Agent: Why Each Step Needs Sub-Second Inference. A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.
  • ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns. Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency.
  • Multi-Agent Architectures and the Inference Cost Explosion. Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.
  • Tool Calling Latency: The Bottleneck No One Talks About. Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow.
  • The Agentic Inference Tax: Why Agents Need 10x Faster Models. Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable.
  • Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA. How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.
  • Draft Model Selection for Speculative Decoding. Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.
  • The Attention Sink Phenomenon: Why the First Token Matters. How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.
  • Mixture of Experts at Inference Time. How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.
  • Tensor Parallelism vs Pipeline Parallelism for Model Serving. How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them.
  • Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice. How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.
  • Distillation for Inference: How Smaller Models Learn From Larger Ones. A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about.
  • FP8 Training and Inference: The Precision Sweet Spot. Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving.
  • Activation-Aware Quantization (AWQ) Deep Dive. A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference.
  • Mamba and State Space Models: Inference Without Attention. How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving.
  • RWKV and Linear Attention: Recurrent Models as an Inference Shortcut. How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads.
  • Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level. Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.
  • Token Merging and Token Pruning for Faster Transformers. Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down.
  • S3: Scheduling for Straggler Mitigation in LLM Serving. In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.
  • Chunked Prefill: Overlapping Compute and Communication. Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.
  • Cascade Inference: Using Small Models to Route to Big Ones. FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.
  • Lookahead Decoding: Parallel Token Generation Without Draft Models. Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.
  • Disaggregated Prefill and Decode (Splitwise / DistServe). Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.
  • KV Cache Compression: MLA and Beyond. DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.
  • Ring Attention: Scaling Context to Millions of Tokens. Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs.
  • Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8. Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
  • Multi-Query and Grouped-Query Attention: Shrinking the KV Cache. MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
  • Continuous Batching: The Orca Paper That Changed LLM Serving. Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
  • Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding. The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
  • SGLang and RadixAttention: Smarter KV Cache Reuse. SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
  • Speculative Decoding: Getting 3x Speedups Without Changing the Model. Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
  • PagedAttention and vLLM: Virtual Memory for LLM Serving. The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
  • FlashAttention: How Tri Dao Made Attention 4x Faster. FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
  • Build a Real-Time Voice AI Agent with General Compute. A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
  • How Coding Agents Depend on Inference Speed. Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
  • Why Inference Speed is the New Moat. Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.