Agent Memory Systems: Balancing Context Length vs Retrieval Latency
How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.
Insights on AI inference, ASIC infrastructure, and building fast AI applications.
A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.
Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency.
Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.
Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow.
Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable.
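As a rough illustration of how per-call latency compounds across sequential agent steps, here is a back-of-the-envelope sketch in Python. The step counts, time-to-first-token, and decode speeds are illustrative assumptions, not measurements from any particular system.

```python
# Rough model of end-to-end agent latency: N sequential LLM calls, each paying
# time-to-first-token plus per-token decode time. All numbers are assumptions.

def agent_task_latency(num_calls: int, ttft_s: float, tokens_out: int, s_per_token: float) -> float:
    """Total wall-clock time for an agent that makes num_calls sequential LLM calls."""
    per_call = ttft_s + tokens_out * s_per_token
    return num_calls * per_call

# A hypothetical coding-agent task: 25 sequential calls, ~300 output tokens each.
for s_per_token in (0.030, 0.010, 0.003):  # ~33, 100, 333 tokens/s decode speeds
    total = agent_task_latency(num_calls=25, ttft_s=0.5, tokens_out=300, s_per_token=s_per_token)
    print(f"{1 / s_per_token:>5.0f} tok/s -> {total / 60:.1f} minutes per task")
```

Under these assumptions the same task drops from roughly four minutes to under a minute purely from faster decoding, which is the compounding effect the post describes.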
How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.
Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.
How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.
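A minimal sketch of the eviction rule behind that idea: keep a handful of sink tokens from the start of the sequence plus a sliding window of recent tokens, and drop everything in between. The cache here is just a list of token positions standing in for KV entries; the sink count and window size are illustrative.

```python
# StreamingLLM-style cache eviction (toy version): always retain the first few
# "sink" positions plus the most recent window, evict the middle.

def evict(cache_positions: list[int], num_sinks: int = 4, window: int = 1024) -> list[int]:
    """Return the KV positions to retain: sink tokens + the most recent window."""
    if len(cache_positions) <= num_sinks + window:
        return cache_positions
    return cache_positions[:num_sinks] + cache_positions[-window:]

# After generating 5,000 tokens with 4 sinks and a 1,024-token window,
# only 1,028 KV entries survive, no matter how long the stream runs.
kept = evict(list(range(5000)))
print(len(kept), kept[:6], kept[-2:])   # 1028 [0, 1, 2, 3, 3976, 3977] [4998, 4999]
```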
How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.
How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them.
How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.
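To make the mechanism concrete, here is a toy sketch of block-level prefix caching: token IDs are grouped into fixed-size blocks, each block is keyed by a hash of every token up to and including it, and a new request reuses each leading block whose key is already cached. The block size, dict-based cache, and helper names are illustrative, not any particular engine's API.

```python
import hashlib

BLOCK = 16
kv_block_cache: dict[str, str] = {}   # block hash -> (stand-in for) cached KV data

def block_keys(token_ids: list[int]) -> list[str]:
    """Hash each full block, keyed by the entire prefix up to that block."""
    keys, prefix = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prefix += str(token_ids[i:i + BLOCK]).encode("utf-8")
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys

def cached_prefix_blocks(token_ids: list[int]) -> int:
    """How many leading blocks of this prompt can skip prefill entirely."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in kv_block_cache:
            break
        hits += 1
    return hits

# First request populates the cache; a second request sharing the same 64-token
# system prompt reuses those 4 blocks and only prefills the new suffix.
system_prompt = list(range(64))
for key in block_keys(system_prompt + [101, 102, 103]):
    kv_block_cache[key] = "kv"
print(cached_prefix_blocks(system_prompt + [999] * 40))  # -> 4
```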
A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about.
Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving.
A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference.
How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving.
How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads.
Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.
Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down.
In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.
Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.
FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.
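A minimal cascade sketch under stated assumptions: the model tiers, relative costs, and the confidence scorer below are placeholders, and real routers typically use a trained quality classifier rather than a hand-set threshold.

```python
from typing import Callable

CASCADE = [
    ("small-model",  0.10),   # (model id, relative cost per call) - illustrative
    ("medium-model", 1.00),
    ("large-model",  5.00),
]

def cascade_answer(prompt: str,
                   call_model: Callable[[str, str], str],
                   score: Callable[[str, str], float],
                   threshold: float = 0.8) -> tuple[str, float]:
    """Return (answer, total relative cost), escalating only when the score is low."""
    spent, answer = 0.0, ""
    for model_id, cost in CASCADE:
        answer = call_model(model_id, prompt)
        spent += cost
        if score(prompt, answer) >= threshold:
            break   # the cheap model was good enough; stop escalating
    return answer, spent

# Stub usage: an easy query accepted at the first, cheapest tier.
ans, cost = cascade_answer("2+2?", lambda m, p: "4", lambda p, a: 0.95)
print(ans, cost)   # -> 4 0.1
```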
Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.
Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.
DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.
Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs.
Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
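As a baseline for what those techniques improve on, here is a sketch of the simplest scheme: round-to-nearest symmetric INT8 quantization with one scale per output channel. The matrix shape is arbitrary and this is only the naive method, not GPTQ or AWQ.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-row (per-output-channel) symmetric round-to-nearest quantization to int8."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print("weight bytes:", w.nbytes, "->", q.nbytes)                  # 4x smaller than FP32
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```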
MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
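The memory effect is easy to see with back-of-the-envelope arithmetic: KV cache per token is 2 (K and V) × layers × KV heads × head dim × bytes per element. The model shape below is an illustrative 70B-class transformer with 64 query heads, not any specific checkpoint.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V, one entry per layer per KV head, stored in FP16/BF16 here.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(kv_heads=64)   # every query head has its own K/V
gqa = kv_bytes_per_token(kv_heads=8)    # 8 query heads share each K/V head
mqa = kv_bytes_per_token(kv_heads=1)    # all query heads share one K/V head

for name, b in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {b / 1024:6.1f} KiB per token, {b * 4096 / 2**20:7.1f} MiB at 4k context")
```

Under these assumptions a single 4k-token sequence needs about 10 GiB of KV cache with full multi-head attention, 1.25 GiB with 8-way GQA, and 160 MiB with MQA, which is the difference between a handful of concurrent requests and hundreds.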
Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
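A toy sketch of the matching idea only: insert each request's token sequence into a trie and ask how many leading tokens of the next request are already covered. Real RadixAttention stores KV blocks at the nodes and handles eviction and scheduling; none of that is shown here.

```python
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}

root = TrieNode()

def insert(tokens: list[int]) -> None:
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def shared_prefix_len(tokens: list[int]) -> int:
    """Longest leading run of tokens already present in the tree."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, n = node.children[t], n + 1
    return n

system = [1, 2, 3, 4, 5]                        # shared system-prompt tokens
insert(system + [10, 11])                       # turn 1 of a conversation
print(shared_prefix_len(system + [42, 43]))     # -> 5: only the new turn needs prefill
```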
Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
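Here is a skeleton of one draft-then-verify step under greedy decoding, with the model calls stubbed out: `draft_next` and `target_argmax` are stand-ins for real model invocations, and the interface (the target returning its own choice after the context and after each draft token) is an assumption of this sketch.

```python
from typing import Callable, List

def speculative_step(ctx: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_argmax: Callable[[List[int]], List[int]],
                     k: int = 4) -> List[int]:
    # 1. Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))

    # 2. One target forward pass scores all drafted positions at once, returning
    #    the target's own next-token choice after the context and after each
    #    draft token (k + 1 tokens in total).
    target = target_argmax(ctx + draft)

    # 3. Accept draft tokens while they match what the target would have produced.
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            break
        accepted.append(tok)

    # 4. Append the target's token at the first mismatch (or the bonus token if
    #    everything matched), so each step emits at least one target-model token
    #    and the output stays identical to plain greedy decoding.
    accepted.append(target[len(accepted)])
    return ctx + accepted
```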
The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.