CS336-2025-lec10
October 29, 2025
15:17
Inference
Inference: given a fixed model, generate responses to prompts
Metrics: latency and throughput (traded off against each other; see below)
Key considerations in efficiency:
KV cache: for every sequence (B), token (S), layer (L), head (K), store an H-dimensional vector
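As a rough sanity check on how big this gets, here is a back-of-the-envelope calculation (the model config below is a made-up example, not from the lecture; bf16 = 2 bytes/element):

```python
# Rough KV cache size estimate (illustrative numbers only).
def kv_cache_bytes(B, S, L, K, H, bytes_per_elem=2):
    """Batch B, sequence length S, L layers, K KV heads, head dim H.
    Factor of 2 stores both keys and values; bf16 = 2 bytes/element."""
    return 2 * B * S * L * K * H * bytes_per_elem

# Example: a hypothetical 8B-scale model config.
print(kv_cache_bytes(B=32, S=4096, L=32, K=8, H=128) / 1e9, "GB")  # ~17 GB
```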
Two stages of inference:
Two steps at inference time: first encode the prompt (prefill), then generate the response one token at a time (generation/decode).
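A minimal sketch of the two stages, assuming a HuggingFace-style causal LM interface (`.logits`, `past_key_values`) and greedy decoding for simplicity:

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens):
    # Stage 1 (prefill): encode the whole prompt in one forward pass,
    # populating the KV cache for all S prompt tokens.
    out = model(input_ids=prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = [next_id]

    # Stage 2 (decode): generate one token at a time, feeding only the
    # newest token and reusing the cached keys/values.
    for _ in range(max_new_tokens - 1):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
    return torch.cat(generated, dim=1)
```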
Let's compute the FLOPs and memory IO for both the MLP and attention layers.
S is the number of tokens we're conditioning on, T is the number of tokens we're generating.
Later, we'll specialize to prefill (T = S) and generation (T = 1).
For the MLP, inference efficiency can be improved by increasing the batch size: prompts from multiple users are batched together, and prefill/generation run on them simultaneously.
For the two stages:
For attention, increasing the batch size does not help, because each sequence has its own KV cache: a larger batch means a proportionally larger KV cache to read.
Unlike MLPs, there is no dependence on B, so batching doesn't help.
Why? Both the attention FLOPs and the KV-cache memory traffic scale with B, so the arithmetic intensity is unchanged by batching (see the sketch below).
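A back-of-the-envelope sketch of this: during one decode step, the MLP's arithmetic intensity grows with B (the weights are read once and reused across the batch), while attention's does not (every sequence brings its own KV cache to read). The constants and configs below are illustrative assumptions:

```python
def mlp_intensity(B, D, F, bytes_per_elem=2):
    # One decode step (T = 1): each token multiplies through a D x F weight matrix.
    flops = 2 * B * D * F                            # multiply-accumulates
    io = (D * F + B * D + B * F) * bytes_per_elem    # weights + input/output activations
    return flops / io

def attn_intensity(B, S, N, K, H, bytes_per_elem=2):
    # One decode step: each of N query heads attends over S cached positions.
    flops = 4 * B * N * S * H                        # q @ K^T plus attn @ V
    io = 2 * B * S * K * H * bytes_per_elem          # reading the K and V caches
    return flops / io

# MLP arithmetic intensity grows with batch size B ...
print([round(mlp_intensity(B, D=4096, F=16384)) for B in (1, 8, 64)])
# ... but attention intensity is ~N/K, independent of B (and of S).
print([round(attn_intensity(B, S=4096, N=32, K=32, H=128), 2) for B in (1, 8, 64)])
```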
Summary
Tradeoff between latency and throughput:
Easy parallelism: if you launch M copies of the model, latency is the same, throughput increases by M!
Harder parallelism: shard the model and the KV cache [Scaling book chapter on Transformers]; the KV cache must be sharded as well.
How do we speed up inference?
1. Reduce the KV cache
The main inference bottleneck is the KV cache: generation is memory-limited, so the smaller the memory footprint, the faster it runs (the cache must be streamed from HBM at every step); speed is directly tied to memory traffic.
MQA and GQA: the n query heads share a single (MQA) or m < n (GQA) key/value heads, so instead of caching keys and values for all n heads, only one (or m) heads' worth is stored.

Besides the speedup, a side benefit of shrinking the KV cache is the smaller memory footprint: the same memory can hold a larger batch, which further improves throughput.
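A minimal sketch of MQA/GQA, assuming PyTorch's `scaled_dot_product_attention` and using `repeat_interleave` to expand the shared KV heads (the shapes are made up for illustration):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads, n_kv_heads):
    """q: (B, n_heads, S, H); k, v: (B, n_kv_heads, S, H) with n_kv_heads <= n_heads.
    n_kv_heads = 1 is MQA; 1 < n_kv_heads < n_heads is GQA."""
    group = n_heads // n_kv_heads
    # Each group of `group` query heads reads the same cached K/V head,
    # so the KV cache is n_kv_heads / n_heads the size of full MHA.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, S, H, n_heads, n_kv_heads = 2, 16, 64, 8, 2
q = torch.randn(B, n_heads, S, H)
k = torch.randn(B, n_kv_heads, S, H)   # only 2 KV heads are cached, not 8
v = torch.randn(B, n_kv_heads, S, H)
out = gqa_attention(q, k, v, n_heads, n_kv_heads)  # (B, n_heads, S, H)
```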
DeepSeek MLA

Key idea: project down each key and value vector from N*H dimensions to C dimensions
DeepSeek v2: reduce N*H = 16384 to C = 512
Wrinkle: MLA is not compatible with RoPE as-is, so an additional 64 dimensions are added for RoPE: 512 + 64 = 576 cached dimensions per token in total
Latency/throughput improvements follow from the KV cache reduction, by the same argument as earlier
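A simplified sketch of the compression idea (not DeepSeek's actual implementation): only a C-dimensional latent plus the small RoPE part is cached per token, and full keys/values are reconstructed from the latent on the fly. The layer sizes below are assumptions chosen to match the 16384 -> 512 (+64) numbers above:

```python
import torch
import torch.nn as nn

class MLAKVCache(nn.Module):
    """Simplified sketch of Multi-head Latent Attention's KV compression:
    cache a C-dim latent (plus a small RoPE part) per token instead of the
    full N*H-dim keys and values."""
    def __init__(self, d_model=5120, n_heads=128, head_dim=128, c_latent=512, rope_dim=64):
        super().__init__()
        self.down = nn.Linear(d_model, c_latent, bias=False)             # compress: d_model -> C
        self.up_k = nn.Linear(c_latent, n_heads * head_dim, bias=False)  # decompress to keys
        self.up_v = nn.Linear(c_latent, n_heads * head_dim, bias=False)  # decompress to values
        self.k_rope = nn.Linear(d_model, rope_dim, bias=False)           # extra dims carrying RoPE

    def forward(self, x):
        latent = self.down(x)        # (B, S, 512)  -- this is what gets cached
        rope_part = self.k_rope(x)   # (B, S, 64)   -- cached too: 512 + 64 = 576 per token
        k = self.up_k(latent)        # (B, S, N*H = 16384), recomputed from the latent
        v = self.up_v(latent)
        return latent, rope_part, k, v
```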
Cross-layer attention (CLA)
GQA shares keys and values across heads; CLA shares keys and values across layers.

Idea: share KVs across layers (just as GQA shares KVs across heads)
Empirically improves the Pareto frontier of accuracy vs. KV cache size (and hence latency and throughput)
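A toy sketch of the sharing pattern (the group size of 2 is an illustrative assumption): only one layer per group computes and stores K/V, and the other layers in the group read the same cache entry:

```python
# Illustrative sketch: with CLA, layer i reuses the KV cache entry of the layer
# that "owns" its group, so only n_layers / share_factor KV caches are stored.
def kv_owner(layer_idx, share_factor=2):
    return (layer_idx // share_factor) * share_factor

n_layers, share_factor = 8, 2
kv_cache = {i: None for i in range(n_layers) if i % share_factor == 0}

for i in range(n_layers):
    owner = kv_owner(i, share_factor)
    if i == owner:
        kv_cache[owner] = f"K,V computed by layer {i}"  # owner layer writes the cache
    shared_kv = kv_cache[owner]                         # the other layers just read it
    # ... attention for layer i uses shared_kv ...
```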
Local attention

Idea: just look at the local context, which is most relevant for modeling
Effective context scales linearly with the number of layers
KV cache is independent of sequence length!
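A minimal sketch of the sliding-window mask (the window size is an assumption): each position attends only to the last `window` positions, so the KV cache holds at most `window` entries regardless of sequence length, and stacking L layers grows the effective receptive field roughly as L * window:

```python
import torch

def sliding_window_mask(S, window):
    """Boolean mask (S, S): position i may attend to positions j with
    i - window < j <= i. During decode, only the last `window` tokens'
    keys/values ever need to stay in the cache."""
    i = torch.arange(S).unsqueeze(1)
    j = torch.arange(S).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_mask(S=6, window=3).int())
```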
Summary:
Taking shortcuts (lossy)
1. reduce_kv_cache_size()
2. alternatives_to_the_transformer(): state-space models, diffusion models
3. quantization(): LLM.int8(), activation-aware quantization (AWQ); see the int8 sketch after this list
4. model_pruning(): Key idea: just rip out parts of an expensive model to make it cheaper
...and then fix it up.
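A minimal sketch of the quantization idea, using plain per-row absmax int8 (LLM.int8() and AWQ add outlier handling and activation-aware scaling on top, which are not shown here):

```python
import torch

def quantize_int8(w):
    # Symmetric absmax quantization: scale each row so its max |value| maps to 127.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4, 8)
q, scale = quantize_int8(w)                       # 4x less memory than fp32
print((w - dequantize(q, scale)).abs().max())     # small but nonzero error: lossy
```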
Use shortcuts but double-check (lossless)
speculative_sampling(): use a cheap draft model to propose several tokens and have the expensive target model verify them in parallel. In other words, checking is faster than generating.
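A sketch of one speculative-decoding round, in a greedy variant for brevity (the full algorithm uses an accept/reject rule that exactly preserves the target model's distribution); it assumes HuggingFace-style models exposing `.logits`:

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, ids, k=4):
    """One round: the cheap draft model proposes k tokens autoregressively;
    the expensive target model checks all k proposals in one parallel pass."""
    draft_ids = ids
    for _ in range(k):  # k cheap sequential draft steps (no KV cache, for brevity)
        logits = draft_model(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=1)
    proposed = draft_ids[:, ids.shape[1]:]                           # (1, k)

    # One expensive parallel pass scores every proposed position at once.
    target_logits = target_model(draft_ids).logits
    target_pred = target_logits[:, ids.shape[1] - 1:-1].argmax(-1)   # target's own choices

    # Accept the longest prefix where draft and target agree, then take one
    # corrective token from the target at the first disagreement
    # (the bonus token for a fully accepted draft is omitted for simplicity).
    n_accept = int((proposed[0] == target_pred[0]).long().cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    correction = target_pred[:, n_accept:n_accept + 1]
    return torch.cat([ids, accepted, correction], dim=1)
```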
Handling dynamic workloads
Batching over sequences in live traffic is tricky because:
Requests arrive at different times (waiting for batch is bad for early requests)
Sequences have shared prefixes (e.g., system prompts, generating multiple samples)
Sequences have different lengths (padding is inefficient)
continuous_batching()
paged_attention(): PagedAttention manages the KV cache during LLM inference by borrowing paging and memory-sharing ideas from operating systems. It is the core of the vLLM inference engine and greatly improves GPU memory utilization and request throughput.
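A toy sketch of the block-table bookkeeping (allocation logic only, no tensors; the block size and pool size are illustrative assumptions):

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class PagedKVCache:
    """Toy block-table bookkeeping in the spirit of PagedAttention: the KV cache
    is split into fixed-size blocks, and each sequence holds a table of pointers
    into a shared pool, like virtual-memory pages."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:                     # last block full (or first token)
            table.append(self.free_blocks.pop())    # allocate a block on demand
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        # Finished sequences return their blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                  # 40 tokens -> ceil(40 / 16) = 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])
```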
Other vLLM optimizations: