
Model Optimization in 2025–2026: A Practical Survey

An T. Le, Hanoi, Dec 2025 (revised Jan 2026)

Foundation models keep getting larger, context windows keep getting longer, and deployment constraints keep getting tighter. In practice, “model optimization” is no longer one trick: it’s a stack of techniques that touch weights, activations, KV cache, kernels, runtime scheduling, and distributed training.

This note focuses on what actually changes memory / latency / throughput in real systems, and links to working code + official docs where possible.


The optimization stack (what you actually pay for)

When you profile an LLM service, you usually see a handful of dominant cost centers:

  • Weights (VRAM footprint; bandwidth during decoding)
  • KV cache (VRAM footprint; bandwidth; fragmentation under multi-tenant serving; long-context blowups)
  • Attention and GEMM kernels (fusion, IO efficiency, precision support)
  • Scheduling (continuous batching, prefill vs decode balance, cache reuse / prefix caching)
  • Training memory (optimizer states, gradients, sharding, precision)

Everything below maps to one or more of those bottlenecks.
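
The first two cost centers are pure arithmetic, which makes a back-of-the-envelope budget worth writing down before you profile anything. A sketch with illustrative numbers (a hypothetical 7B GQA model; substitute your own config):

```python
# Back-of-the-envelope VRAM budget (illustrative 7B GQA config, not a recommendation).
n_params   = 7e9      # parameters
w_bytes    = 2        # FP16/BF16 weights
n_layers   = 32
n_kv_heads = 8        # GQA: KV heads, not query heads
head_dim   = 128
kv_bytes   = 2        # FP16 KV cache
seq_len    = 8192
batch      = 16       # concurrent sequences

weights_gb = n_params * w_bytes / 1e9
# K and V per token per layer: 2 * n_kv_heads * head_dim values
kv_gb = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * seq_len * batch / 1e9

print(f"weights:  {weights_gb:.1f} GB")   # 14.0 GB
print(f"KV cache: {kv_gb:.1f} GB")        # 17.2 GB -- already past the weights
```

At long context and moderate concurrency, the KV cache overtakes the weights; that is why Section 2 exists.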


1. Quantization (largest memory win; usually best first move)

1.1 Weight-only PTQ (keep activations high precision; quantize weights to 4-bit)

Reality check: weight-only INT4 helps most when decode is bandwidth-bound and you have good INT4 kernels (packing format matters).

Algorithms / toolkits

Docs that matter (format + runtime support)

Discussion: when weight-only wins

  • You are decode-bound (memory bandwidth limited).
  • You want a drop-in path without engineering activation quantization.
  • Your serving engine supports the exact packing + kernel combination (often the hidden make-or-break detail).
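
As one concrete path, a minimal AWQ quantization sketch using the AutoAWQ package (model and output paths are placeholders; the quant config follows the AutoAWQ README, and `version` selects the packing format your engine's kernels must support):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # placeholder model
quant_path = "llama-2-7b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights, group size 128; "GEMM" packing is what most serving
# engines expect -- the packing format is the make-or-break detail above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```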

1.2 W8A8 (INT8 weights + INT8 activations): higher peak performance, higher sensitivity

Discussion: why W8A8 is tricky

  • Activations have outliers that dominate quantization error.
  • Calibration data + per-layer behavior matters more than weight-only.
  • Speedups require excellent INT8 kernels/fusions in your stack.
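
The standard mitigation for these outliers is to migrate quantization difficulty from activations into weights with a per-channel rescaling, as in SmoothQuant. A minimal sketch of just the scale computation and folding (not a full quantizer; calibration stats are faked):

```python
import torch

def smooth_scales(act_absmax: torch.Tensor,
                  weight: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """SmoothQuant-style per-input-channel scales.

    act_absmax: per-channel max |activation| from calibration, shape (in,)
    weight:     linear weight, shape (out, in)
    s_j = act_absmax_j**alpha / w_absmax_j**(1 - alpha)
    """
    w_absmax = weight.abs().amax(dim=0)                  # (in,)
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

# Fold the scales: X' = X / s (cheap, fusable into the previous op),
# W' = W * s, so X' @ W'.T == X @ W.T but X' has tamer outliers.
W = torch.randn(4096, 4096)
a_max = torch.rand(4096) * 10 + 0.1   # pretend calibration stats
s = smooth_scales(a_max, W)
W_smoothed = W * s                    # broadcast over input channels
```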

1.3 FP8 / FP4 formats (where modern GPU stacks are going)

If you serve on recent GPUs, FP8 (and increasingly 4-bit float formats like NVFP4) are “mainline” in vendor stacks:

Rule of thumb:

  • FP8 often shines for prefill throughput (compute-heavy).
  • INT4 weight-only often shines for decode throughput (bandwidth-heavy), assuming kernels are good.
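
To get a feel for FP8 numerics, recent PyTorch builds expose float8 dtypes directly. A toy round-trip with per-tensor scaling (assumes PyTorch ≥ 2.1 with `float8_e4m3fn`; real stacks use finer-grained scaling and fused kernels):

```python
import torch

x = torch.randn(4096, dtype=torch.float32) * 3.0

# Per-tensor scale so values fit the E4M3 range (max normal ~448).
scale = x.abs().max() / 448.0
x_fp8 = (x / scale).to(torch.float8_e4m3fn)    # quantize
x_hat = x_fp8.to(torch.float32) * scale        # dequantize

rel_err = (x - x_hat).norm() / x.norm()
print(f"relative error: {rel_err:.4f}")        # typically a few percent
```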

1.4 Practical k-bit infrastructure (what people actually use)
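
bitsandbytes (via transformers) is the most common k-bit loading path. A minimal NF4 sketch (model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder
    quantization_config=bnb_config,
    device_map="auto",
)
```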


1.5 4-bit fine-tuning without full fine-tune cost (QLoRA)

Discussion: what you really get with QLoRA

  • You optimize the training iteration loop more than final inference speed.
  • Deployment can be “serve adapters” vs “merge adapters” — different ops/latency tradeoffs.
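
A minimal QLoRA-style setup, assuming peft + bitsandbytes (rank, alpha, and target modules are illustrative, not tuned):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder
    quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # illustrative
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total
```

At deploy time you either serve the adapter separately (hot-swappable, slight latency cost) or call `merge_and_unload()` to fold it into the base weights; that is where the ops/latency tradeoff above shows up.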

1.6 Ultra-low-bit PTQ (2–4 bits): pushing compression harder

Useful when INT4 is not enough, or for edge/local constraints (but kernels decide if it’s fast or just small).

Discussion: reality of 2–3 bits

  • Accuracy is workload-dependent → validate on real prompts.
  • Kernel support often decides if you get speedups or only memory savings.
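
"Validate on real prompts" can be as cheap as comparing per-token NLL between the FP16 and low-bit checkpoints on a sample of your own traffic. A minimal sketch (model loading elided; any transformers causal LM works):

```python
import torch

@torch.no_grad()
def avg_nll(model, tokenizer, texts, device="cuda"):
    """Mean per-token negative log-likelihood over a prompt sample."""
    model.eval()
    total = 0.0
    for t in texts:
        ids = tokenizer(t, return_tensors="pt").input_ids.to(device)
        total += model(ids, labels=ids).loss.item()
    return total / len(texts)

# Load the fp16 and quantized checkpoints as in the sketches above, then
# compare avg_nll on the SAME traffic sample; perplexity = exp(avg_nll).
# A ratio drifting well above the fp16 baseline at 2-3 bits is the red flag.
```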

2. KV cache optimization (the hidden bottleneck for long context + concurrency)

If you serve long prompts or many concurrent users, KV cache can dominate GPU memory and cause sharp perf cliffs via fragmentation.

2.1 “In-engine” KV cache quantization (practical, shipping)

These are usually higher-ROI than research KV compression if you’re already on vLLM / TRT-LLM.
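
For example, in vLLM FP8 KV cache is a single engine argument (availability depends on GPU generation and vLLM version; model id is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder
    kv_cache_dtype="fp8",              # roughly halves KV memory vs fp16
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
```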

2.2 Research KV cache compression / pruning methods

Discussion: when KV cache work matters most

  • Long context chat, heavy RAG prompts, tool traces, multi-turn sessions.
  • High concurrency where KV cache memory is the capacity limiter.
  • Systems needing predictable latency (cache pressure → sudden cliffs).

3. Sparsity and pruning (powerful on paper; conditional in practice)

Pruning can reduce compute and memory, but real speedups depend heavily on sparsity structure and kernel support.

3.1 One-shot pruning for LLMs

3.2 Semi-structured 2:4 sparsity (hardware-aligned path)
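
The 2:4 pattern is easy to state: in every contiguous group of four weights, at most two are nonzero, which Ampere-and-later sparse tensor cores accelerate in hardware. A magnitude-pruning sketch of the mask (PyTorch also ships `torch.sparse.to_sparse_semi_structured` for the accelerated representation on supported GPUs):

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude entries in every group of 4
    along the input dimension (requires in_features % 4 == 0)."""
    out_f, in_f = weight.shape
    groups = weight.reshape(-1, 4)
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop, False)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(8, 16)
w24 = prune_2_4(w)
assert (w24.reshape(-1, 4) != 0).sum(-1).max() <= 2   # 2:4 holds
```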

Discussion: when sparsity is worth it

  • You can enforce a hardware-friendly pattern (often 2:4).
  • You control kernels + deployment environment (not just the checkpoint).
  • You have benchmarks showing sparse kernels beat dense low-bit for your shapes + batch regime.

4. Decoding acceleration (reduce the number of expensive target-model steps)

Autoregressive decoding is sequential; methods that reduce the number of target-model forward passes can yield big speedups when integrated well into the serving stack.

4.1 Speculative decoding (draft model → verify with target)

(If you just want a minimal reference implementation: https://github.com/romsto/Speculative-Decoding)
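
The core loop is short. A greedy-decoding sketch (simplified acceptance rule: accept the longest prefix where the target's argmax agrees with the draft; the full algorithm uses probabilistic acceptance to preserve the target's sampling distribution):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One greedy speculative-decoding step.

    target, draft: HF-style causal LMs returning .logits of shape (1, len, vocab)
    ids: (1, seq) current token ids; returns the extended sequence.
    """
    seq = ids.shape[1]

    # 1) Draft proposes k tokens autoregressively (no KV cache here, for brevity).
    prop = ids
    for _ in range(k):
        nxt = draft(prop).logits[:, -1].argmax(-1, keepdim=True)
        prop = torch.cat([prop, nxt], dim=-1)
    draft_toks = prop[:, seq:]                    # (1, k)

    # 2) Target verifies all k proposals in ONE forward pass.
    logits = target(prop).logits                  # (1, seq + k, vocab)
    tgt_toks = logits[:, seq - 1:-1].argmax(-1)   # target's choice at each slot

    # 3) Accept the longest agreeing prefix; then take the target's own token
    #    at the first disagreement, or a free "bonus" token if all k agree.
    agree = (draft_toks == tgt_toks)[0].long()
    n = int(agree.cumprod(0).sum())
    if n == k:
        bonus = logits[:, -1].argmax(-1, keepdim=True)
        return torch.cat([prop, bonus], dim=-1)
    return torch.cat([ids, draft_toks[:, :n], tgt_toks[:, n:n + 1]], dim=-1)
```

Each step emits between 1 and k+1 tokens for a single target pass, which is exactly why acceptance rate sets the achievable speedup.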

4.2 Multi-token / multi-head decoding (Medusa-family)

Discussion: when decoding tricks pay off

  • You can maintain high acceptance rates (draft model quality matters).
  • Your serving engine’s integration is efficient (overhead can cancel gains).

5. Kernels and serving engines (where “real throughput” often comes from)

You can have a well-quantized model and still get poor performance if the runtime can’t batch efficiently or manage KV cache well.

5.1 Attention kernels
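
If you are on stock PyTorch, `scaled_dot_product_attention` already routes to fused Flash/memory-efficient kernels when shapes and dtypes allow. A minimal check of which backend you are actually hitting (assumes PyTorch ≥ 2.3 for `torch.nn.attention`):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = torch.randn(1, 16, 2048, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the Flash backend; this errors if it's unsupported for these
# shapes/dtypes -- a cheap way to verify you're on the fast path.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```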

5.2 Serving engines

5.3 Scheduling-level wins (often bigger than “one more quant trick”)

If your GPU isn’t saturated, these can dominate:

(If you’re CPU-heavy: OpenVINO Model Server also documents continuous batching + paged attention ideas:
https://docs.openvino.ai/2024/ovms_demos_continuous_batching.html)
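
On GPU, the same scheduling ideas surface as engine arguments rather than model changes; for instance in vLLM (flag availability varies by version):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder
    enable_prefix_caching=True,        # reuse KV across shared prompt prefixes
    max_num_seqs=256,                  # continuous-batching concurrency cap
)
```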

5.4 “Production glue” (optional but common): Triton

If you want a standard serving layer + batching + observability, Triton often sits above the engine:

5.5 Profiling tools
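
A minimal PyTorch profiling harness that answers the checklist questions at the end of this note (where time actually goes, kernel by kernel):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(),
                      nn.Linear(4096, 4096)).cuda().half()
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

# GEMM/attention kernels should dominate cuda_time_total; if memcpy or
# elementwise ops do, look at fusion and batching before quantization.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```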

5.6 Vision kernels (preprocessing)

5.7 GEMM optimization

5.8 Transformer layer optimizations

Practical serving mindset

  1. Optimize utilization first: batching, KV management, prefill vs decode scheduling.
  2. Then optimize math: quantization, kernels, decoding shortcuts.
  3. Validate quality under your exact sampler settings and prompt distribution.

6. Training-scale optimization (making big runs feasible)

Large training runs and large-scale fine-tuning are often memory-limited by optimizer states, gradients, and per-rank parameter replicas.

6.1 ZeRO-style sharding and offload
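
ZeRO's stages shard progressively more state across data-parallel ranks: stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds the parameters themselves; offload trades GPU memory for PCIe traffic. A minimal sketch (config keys from the DeepSpeed docs, values illustrative; assumes launch via the deepspeed launcher, toy model stands in for yours):

```python
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # stand-in for your real model

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params as well
        "offload_optimizer": {"device": "cpu"},  # optimizer states -> CPU RAM
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```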

6.2 PyTorch-native sharding (FSDP + FSDP2)
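
FSDP2's per-module API is compact: apply `fully_shard` to submodules, then to the root. A minimal sketch (assumes a recent PyTorch where `fully_shard` is exported from `torch.distributed.fsdp`; earlier versions ship it under a private path, and it must run under torchrun with a process group initialized):

```python
import torch
from torch.distributed.fsdp import fully_shard

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=8)

# Shard each block first, then the root: parameters become DTensors that
# are all-gathered just-in-time per layer during forward/backward.
for layer in model.layers:
    fully_shard(layer)
fully_shard(model)
```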

6.3 Parallelism reference stack (if you go deep)

6.4 FP8 building blocks


7. Diffusion and generative vision models (fewer steps beats faster steps)

For diffusion-style models, denoising step count often dominates latency.
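
Concretely, step count is just a pipeline argument in diffusers, and distilled/consistency-style checkpoints are what make very low step counts usable (sdxl-turbo is one example of a few-step-distilled model; a 4-step setting assumes a model trained for it):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16).to("cuda")

# Per-step cost is roughly constant, so 4 steps vs the classic 50 cuts
# denoising latency nearly proportionally.
image = pipe("a lighthouse at dusk", num_inference_steps=4,
             guidance_scale=0.0).images[0]
```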


8. Putting it together (high-signal stacks)

A. High-throughput LLM serving (GPU)

  • Engine: vLLM or TensorRT-LLM (or TGI if you want HF-managed production ergonomics)
  • Kernel: FlashAttention/FlashInfer where supported
  • Quant: INT4 weight-only (AWQ/GPTQModel) for decode-heavy workloads; FP8 for prefill-heavy workloads
  • KV: enable FP8/NVFP4 KV cache if long-context or high concurrency is the limiter
  • Decode accel: speculative decoding (draft model) if acceptance rates are high

B. Local / edge-ish constraints (single node, cost-sensitive)

  • Quantize to a runtime-friendly format (GGUF / llama.cpp; or a supported GPTQ/AWQ format for your engine).
  • Prefer distillation (smaller model) before extreme quant, if quality matters.

C. Training memory wall

  • Start with DeepSpeed ZeRO or PyTorch FSDP2
  • Add activation checkpointing + optimizer tricks
  • Use FP8 carefully (TransformerEngine / supported stacks)

Quick checklist (what to profile first)

  • Are you prefill-bound (compute) or decode-bound (bandwidth)?
  • Is KV cache the capacity limiter (context/concurrency)?
  • Are you actually using the best attention/GEMM kernels your stack supports?
  • Is batching (continuous/dynamic) enabled and stable under your traffic shape?