3 A Computer Systems Journey Through LLMs
Training vs. Inference
- Training: forward + backward, parameter update, offline
- Inference: autoregressive decoding, online, system bottleneck
- Deployment performance is dominated by inference, not training
Attention and Complexity
- Self-attention uses Query (Q), Key (K), Value (V)
- Vanilla attention complexity: O(n²), where n is the sequence length (minimal sketch below)
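As a reference point, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The shapes and names are illustrative assumptions, not tied to any particular framework; the point is the (n, n) score matrix that makes the cost quadratic in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Vanilla scaled dot-product attention for one head.

    Q, K, V: (n, d) arrays for a sequence of n tokens.
    The (n, n) score matrix is what makes this O(n^2) in n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n) pairwise scores: quadratic in n
    weights = softmax(scores, axis=-1)  # each query attends over all keys
    return weights @ V                  # (n, d) output
```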
Autoregressive Inference
- Tokens are generated one by one
- Each new token re-runs the full layer stack (see the decode loop below):
  - Multi-Head Attention (MHA)
  - MLP
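A schematic greedy decoding loop that shows this per-token structure. `model` is a placeholder assumed to return per-position logits over the vocabulary; this is a sketch of the control flow, not a real serving API.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens):
    """Greedy autoregressive decoding (schematic).

    Without a KV cache, every iteration re-runs MHA + MLP over the
    entire prefix, so per-step cost grows with sequence length.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                    # forward pass over the full prefix
        next_id = int(np.argmax(logits[-1]))   # pick the most likely next token
        ids.append(next_id)                    # feed it back in for the next step
    return ids
```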
KV Cache
Idea
- Cache historical K and V
- For a new token, compute only its own Q, K, V; past K and V are read from the cache
Effect
- Per-token attention cost drops from O(n²) to O(n) (see the sketch at the end of this section)
- Trade space for computation
Cost
- KV Cache consumes large GPU memory
- Memory usage grows linearly with sequence length
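A minimal NumPy sketch of the idea above: keep an append-only K/V store per sequence and compute projections only for the new token. The cache layout and the `W*` projection weights are illustrative assumptions.

```python
import numpy as np

class KVCache:
    """Append-only per-sequence cache of past keys and values."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k_new, v_new):
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])

def decode_step(x_new, cache, Wq, Wk, Wv):
    """One decode step with a KV cache.

    Only the new token's projections are computed; attention reads all
    cached keys/values, so per-step cost is O(n) while K/V memory grows
    linearly with sequence length.
    """
    q = x_new @ Wq                                   # (1, d) query for the new token
    cache.append(x_new @ Wk, x_new @ Wv)             # store its key and value
    scores = q @ cache.K.T / np.sqrt(q.shape[-1])    # (1, n) scores against history
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over cached positions
    return weights @ cache.V                         # (1, d) attention output
```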
KV Cache Memory Problems
- Internal fragmentation: memory reserved for a request's maximum length sits unused
- External fragmentation: free memory is scattered in pieces too small to serve new requests
- Worsens with variable-length and dynamic requests
Paged Attention
Core Idea
- Apply virtual memory & paging concepts to KV Cache
Benefits
- Fixed-size pages
- Logical–physical separation via indirection
- Eliminates external fragmentation; internal fragmentation is bounded to the last partially filled block (toy sketch below)
- Improves memory utilization and scalability
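A toy block-table allocator illustrating the paging idea: a pool of fixed-size physical KV blocks plus a per-sequence table mapping logical to physical blocks, analogous to a page table. The class, names, and block size are illustrative; this is not the vLLM implementation.

```python
class PagedKVAllocator:
    """Toy page-table-style allocator for KV cache blocks."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size                          # tokens per block
        self.free_blocks = list(range(num_physical_blocks))   # free list over the pool
        self.block_tables = {}                                # seq_id -> [physical block ids]

    def slot_for_token(self, seq_id, token_index):
        """Return (physical_block, offset) for a sequence's next token."""
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % self.block_size == 0:                # previous block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: preempt or swap a sequence")
            table.append(self.free_blocks.pop())              # any free block will do
        return table[-1], token_index % self.block_size

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because a logical block can live in any physical block, sequences of different lengths pack tightly into the pool, which is what removes external fragmentation.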
Performance Metrics
- Throughput: tokens/s
- Time to first token (TTFT): latency from request arrival to the first output token
- Inter-token latency (ITL): time between successive output tokens
Inference systems must balance throughput and latency.
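A small helper showing how these metrics fall out of per-token timestamps; the timestamp representation is an assumption for illustration.

```python
def latency_metrics(request_start, token_times):
    """Basic serving metrics from one request's per-token completion times.

    request_start: wall-clock time the request arrived
    token_times:   wall-clock time each output token was emitted, in order
    """
    ttft = token_times[0] - request_start                      # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0                # mean inter-token latency
    throughput = len(token_times) / (token_times[-1] - request_start)  # tokens/s
    return ttft, itl, throughput
```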
Parallelism
- Pipeline Parallelism: split layers across GPUs → higher throughput
- Tensor Parallelism: split individual weight matrices/ops across GPUs → lower per-GPU compute and memory (illustrated below)
- Mixed Parallelism: combine pipeline and tensor parallelism; standard for very large models
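A minimal single-process illustration of tensor parallelism on one linear layer: the weight matrix is split column-wise so each "device" (a plain array slice here) holds and computes only its shard. In a real system the shards live on different GPUs and the final concatenation is an all-gather; everything here is an illustrative assumption.

```python
import numpy as np

def tensor_parallel_linear(x, W, num_devices):
    """Column-parallel linear layer: y = x @ W, with W split across devices.

    Each device holds one column slice W_i and computes x @ W_i, so both
    weight memory and compute per device shrink by ~num_devices.
    """
    shards = np.array_split(W, num_devices, axis=1)    # one column slice per device
    partials = [x @ W_i for W_i in shards]             # computed in parallel in practice
    return np.concatenate(partials, axis=-1)           # all-gather of partial outputs

# Sanity check: the sharded result matches the single-device computation.
x = np.random.randn(4, 8)
W = np.random.randn(8, 16)
assert np.allclose(tensor_parallel_linear(x, W, 4), x @ W)
```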
Batching
- Static batching: fixed batch composition, padding to the longest sequence, low GPU utilization
- Continuous batching (schematic loop below):
  - Dynamically add and remove requests between decode iterations
  - Improves GPU utilization and throughput
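A schematic scheduler loop for continuous batching: after every decode iteration, finished requests leave the batch and waiting requests join, instead of draining a fixed batch. `model_step`, the request fields, and the queue shape are assumptions for illustration.

```python
from collections import deque

def continuous_batching_loop(waiting: deque, model_step, max_batch_size: int):
    """Schematic continuous batching scheduler.

    waiting:    queue of pending requests (each assumed to expose a .done flag)
    model_step: callable that runs one decode iteration for all active requests
    """
    active = []
    while waiting or active:
        # Admit new requests whenever the running batch has free slots.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        model_step(active)   # one decode step for the whole batch

        # Retire finished requests immediately so their batch slot and
        # KV cache memory can be reused by the next waiting request.
        active = [r for r in active if not r.done]
```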
Systems Takeaway
LLM inference is a systems problem.
Key principles:
- Parallelism
- Pipelining
- Batching
- Indirection
- Speculation
- Locality