3 A Computer Systems Journey Through LLMs
Training vs. Inference
- Training: forward + backward, parameter update, offline
- Inference: autoregressive decoding, online, system bottleneck
- Deployment performance is dominated by inference, not training
Attention and Complexity
- Self-attention uses Query (Q), Key (K), Value (V)
- Vanilla attention complexity: O(n²), where n is the sequence length (minimal sketch below)
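As a reference point, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The shapes and names are illustrative assumptions, not tied to any particular framework; the point is the (n, n) score matrix that makes the cost quadratic in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Vanilla scaled dot-product attention for one head.

    Q, K, V: (n, d) arrays for a sequence of n tokens.
    The (n, n) score matrix is what makes this O(n^2) in n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n) pairwise scores: quadratic in n
    weights = softmax(scores, axis=-1)  # each query attends over all keys
    return weights @ V                  # (n, d) output
```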
Autoregressive Inference
- Tokens are generated one by one
- Each new token re-runs the full layer stack (see the decode loop below):
  - Multi-Head Attention (MHA)
  - MLP
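A schematic greedy decoding loop that shows this per-token structure. `model` is a placeholder assumed to return per-position logits over the vocabulary; this is a sketch of the control flow, not a real serving API.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens):
    """Greedy autoregressive decoding (schematic).

    Without a KV cache, every iteration re-runs MHA + MLP over the
    entire prefix, so per-step cost grows with sequence length.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                    # forward pass over the full prefix
        next_id = int(np.argmax(logits[-1]))   # pick the most likely next token
        ids.append(next_id)                    # feed it back in for the next step
    return ids
```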
KV Cache
Idea
- Cache historical K and V
- For a new token, compute only its own Q, K, V; past K and V are read from the cache
Effect
- Per-token attention cost drops from O(n²) to O(n) (see the sketch at the end of this section)
- Trade space for computation
Cost
- KV Cache consumes large GPU memory
- Memory usage grows linearly with sequence length
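A minimal NumPy sketch of the idea above: keep an append-only K/V store per sequence and compute projections only for the new token. The cache layout and the `W*` projection weights are illustrative assumptions.

```python
import numpy as np

class KVCache:
    """Append-only per-sequence cache of past keys and values."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k_new, v_new):
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])

def decode_step(x_new, cache, Wq, Wk, Wv):
    """One decode step with a KV cache.

    Only the new token's projections are computed; attention reads all
    cached keys/values, so per-step cost is O(n) while K/V memory grows
    linearly with sequence length.
    """
    q = x_new @ Wq                                   # (1, d) query for the new token
    cache.append(x_new @ Wk, x_new @ Wv)             # store its key and value
    scores = q @ cache.K.T / np.sqrt(q.shape[-1])    # (1, n) scores against history
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over cached positions
    return weights @ cache.V                         # (1, d) attention output
```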
KV Cache Memory Problems
- Internal fragmentation: memory reserved for a request's maximum length sits unused
- External fragmentation: free memory is scattered in pieces too small to serve new requests
- Worsens with variable-length and dynamic requests
Paged Attention
Core Idea
- Apply virtual memory & paging concepts to KV Cache
Benefits
- Fixed-size pages
- Logical–physical separation via indirection
- Eliminates external fragmentation; internal fragmentation is bounded to the last partially filled block (toy sketch below)
- Improves memory utilization and scalability
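A toy block-table allocator illustrating the paging idea: a pool of fixed-size physical KV blocks plus a per-sequence table mapping logical to physical blocks, analogous to a page table. The class, names, and block size are illustrative; this is not the vLLM implementation.

```python
class PagedKVAllocator:
    """Toy page-table-style allocator for KV cache blocks."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size                          # tokens per block
        self.free_blocks = list(range(num_physical_blocks))   # free list over the pool
        self.block_tables = {}                                # seq_id -> [physical block ids]

    def slot_for_token(self, seq_id, token_index):
        """Return (physical_block, offset) for a sequence's next token."""
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % self.block_size == 0:                # previous block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: preempt or swap a sequence")
            table.append(self.free_blocks.pop())              # any free block will do
        return table[-1], token_index % self.block_size

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because a logical block can live in any physical block, sequences of different lengths pack tightly into the pool, which is what removes external fragmentation.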
Performance Metrics
- Throughput: tokens/s
- Time to first token (TTFT): latency from request arrival to the first output token
- Inter-token latency (ITL): time between successive output tokens
Inference systems must balance throughput and latency.
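A small helper showing how these metrics fall out of per-token timestamps; the timestamp representation is an assumption for illustration.

```python
def latency_metrics(request_start, token_times):
    """Basic serving metrics from one request's per-token completion times.

    request_start: wall-clock time the request arrived
    token_times:   wall-clock time each output token was emitted, in order
    """
    ttft = token_times[0] - request_start                      # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0                # mean inter-token latency
    throughput = len(token_times) / (token_times[-1] - request_start)  # tokens/s
    return ttft, itl, throughput
```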
Parallelism
- Pipeline Parallelism: split layers across GPUs → higher throughput
- Tensor Parallelism: split individual weight matrices/ops across GPUs → lower per-GPU compute and memory (illustrated below)
- Mixed Parallelism: combine pipeline and tensor parallelism; standard for very large models
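A minimal single-process illustration of tensor parallelism on one linear layer: the weight matrix is split column-wise so each "device" (a plain array slice here) holds and computes only its shard. In a real system the shards live on different GPUs and the final concatenation is an all-gather; everything here is an illustrative assumption.

```python
import numpy as np

def tensor_parallel_linear(x, W, num_devices):
    """Column-parallel linear layer: y = x @ W, with W split across devices.

    Each device holds one column slice W_i and computes x @ W_i, so both
    weight memory and compute per device shrink by ~num_devices.
    """
    shards = np.array_split(W, num_devices, axis=1)    # one column slice per device
    partials = [x @ W_i for W_i in shards]             # computed in parallel in practice
    return np.concatenate(partials, axis=-1)           # all-gather of partial outputs

# Sanity check: the sharded result matches the single-device computation.
x = np.random.randn(4, 8)
W = np.random.randn(8, 16)
assert np.allclose(tensor_parallel_linear(x, W, 4), x @ W)
```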
Batching
- Static batching: fixed batch composition, padding to the longest sequence, low GPU utilization
- Continuous batching (schematic loop below):
  - Dynamically add and remove requests between decode iterations
  - Improves GPU utilization and throughput
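A schematic scheduler loop for continuous batching: after every decode iteration, finished requests leave the batch and waiting requests join, instead of draining a fixed batch. `model_step`, the request fields, and the queue shape are assumptions for illustration.

```python
from collections import deque

def continuous_batching_loop(waiting: deque, model_step, max_batch_size: int):
    """Schematic continuous batching scheduler.

    waiting:    queue of pending requests (each assumed to expose a .done flag)
    model_step: callable that runs one decode iteration for all active requests
    """
    active = []
    while waiting or active:
        # Admit new requests whenever the running batch has free slots.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        model_step(active)   # one decode step for the whole batch

        # Retire finished requests immediately so their batch slot and
        # KV cache memory can be reused by the next waiting request.
        active = [r for r in active if not r.done]
```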
Systems Takeaway
LLM inference is a systems problem.
Key principles:
- Parallelism
- Pipelining
- Batching
- Indirection
- Speculation
- Locality