LLM compression

LLM compression is the set of techniques used to make large language models cheaper, faster, smaller, and easier to deploy while preserving as much quality as possible. It can compress the model itself, the runtime state used during inference, or the input context sent to the model.

What is LLM compression?

LLM compression means reducing one or more of the resources required to use a model:

Resource	Why it matters
Disk size	Model distribution, storage cost, startup time
RAM/VRAM	Whether the model fits and how many requests can run concurrently
Memory bandwidth	Often the dominant bottleneck for token generation
Compute	Matrix multiply cost, attention cost, and server throughput

Compression can target different parts of the system:

Target	What is compressed	Typical benefit
Model weights	Parameters stored in matrices	Smaller model files and lower VRAM
Activations	Intermediate tensors during forward pass	Lower memory and faster compute
KV cache	Attention keys/values saved during generation	Lower VRAM for long context and concurrency
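Input tokens	Prompts and retrieved context sent to the model	Lower API cost, prefill latency, and context-window pressure
Trainable parameters	Weights updated during fine-tuning, such as adapters	Lower fine-tuning cost and adapter storage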
Architecture	Layers, heads, hidden size, routing	Better quality per compute unit

Why would you compress LLMs?

Fit models on available hardware

Compression can change a deployment from “does not fit” to “fits on one machine”. It can enable:

Running local models on laptops.
Serving models on fewer GPUs.
Deploying models to edge devices.
Hosting larger models in a fixed memory budget.

For example:

Model size	FP16/BF16 rough weight memory	8-bit rough memory	4-bit rough memory
7B	14 GB	7 GB	3.5 GB
13B	26 GB	13 GB	6.5 GB
34B	68 GB	34 GB	17 GB
70B	140 GB	70 GB	35 GB
405B	810 GB	405 GB	202.5 GB

These are simplified weight-only estimates. Real deployments also need KV cache, runtime overhead, and sometimes duplicated weights across replicas or tensor-parallel shards.
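
The figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. A minimal sketch, assuming 1 GB = 10^9 bytes and ignoring quantization metadata such as scales and zero-points (the function name is illustrative):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough weight-only memory in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9


# Reproduce the rows of the table for 16-bit, 8-bit, and 4-bit weights.
for params_b in (7, 13, 34, 70, 405):
    row = [weight_memory_gb(params_b * 1e9, bits) for bits in (16, 8, 4)]
    print(f"{params_b}B: " + ", ".join(f"{gb:g} GB" for gb in row))
```

Real quantized formats usually store a little more than this because each group of weights carries its own reconstruction parameters.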

Reduce inference cost

Serving cost is often driven by GPU memory, memory bandwidth, and request concurrency. Smaller weights and caches can allow:

More requests per GPU.
Larger batch sizes.
Fewer GPUs per replica.
Lower latency under load.
Cheaper instance types.

Improve latency and throughput

Compression can improve performance by:

Moving less data from memory.
Increasing effective memory bandwidth.
Allowing larger batches.
Reducing CPU/GPU transfer.
Enabling faster low-precision matrix kernels.

However, smaller does not automatically mean faster. Speed depends on whether the hardware and inference engine have optimized kernels for the compressed format.
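
A back-of-envelope way to reason about this: when a single request is decoded and generation is memory-bound, each new token requires reading roughly all of the weights once, so memory bandwidth divided by weight bytes gives a rough ceiling on tokens per second. The sketch below assumes that simplification; the 1000 GB/s bandwidth is a hypothetical figure, and the function name is illustrative.

```python
def decode_tokens_per_second_ceiling(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound for single-request, memory-bound decoding.

    Assumes every generated token reads all weights once and ignores
    KV cache reads, kernel efficiency, and batching.
    """
    return bandwidth_gb_s / weight_gb


bandwidth = 1000.0  # GB/s, hypothetical accelerator
for label, weight_gb in (("7B FP16", 14.0), ("7B 8-bit", 7.0), ("7B 4-bit", 3.5)):
    ceiling = decode_tokens_per_second_ceiling(weight_gb, bandwidth)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

This is only a ceiling. A runtime that dequantizes with slow kernels can land far below it, which is why a smaller format is not automatically a faster one.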

Support long context

For chat, RAG, coding agents, and document workflows, the KV cache can dominate memory. KV cache memory grows with layer count, KV heads, head dimension, sequence length, batch size, and precision. Compressing weights does not solve a KV cache bottleneck by itself.
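
The growth factors listed above multiply together. A minimal sketch of the estimate, assuming a standard transformer that stores one key and one value tensor per layer; the model shape used here is hypothetical and the function name is illustrative:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_element):
    """Estimate KV cache size in bytes for a standard transformer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

one_request = kv_cache_bytes(**shape, seq_len=8192, batch_size=1, bytes_per_element=2)
sixteen_requests = kv_cache_bytes(**shape, seq_len=8192, batch_size=16, bytes_per_element=2)
one_request_8bit = kv_cache_bytes(**shape, seq_len=8192, batch_size=1, bytes_per_element=1)

print(one_request / 2**30)       # 1.0 GiB at 16-bit precision
print(sixteen_requests / 2**30)  # 16.0 GiB for 16 concurrent requests
print(one_request_8bit / 2**30)  # 0.5 GiB with an 8-bit cache
```

None of these numbers depend on how the weights are stored, which is why weight quantization alone leaves them unchanged.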

Long-context workloads can become KV-cache-bound even when weights fit comfortably. KV cache compression is therefore critical for:

Chatbots with long histories.
Retrieval-augmented generation over large context.
Coding agents with many files in context.
Document analysis.
Multi-turn customer support.
High-concurrency API serving.

Enable local, edge, or private deployment

Quantized and distilled models are easier to run on laptops, workstations, edge devices, and private infrastructure. This makes it more practical to run models on-device or on-premises, in disconnected or regulated environments where data cannot leave the network, and in low-latency applications where cloud round trips are unacceptable.

Lower energy and infrastructure footprint

At high request volume, smaller models and more efficient inference can reduce GPU-hours, power, cooling, and hardware requirements, which in turn reduces environmental impact.

How is LLM compression done?

The main options are:

Option	What it compresses	Main benefit	Main risk
Quantization	Weights, activations, or KV cache	Lower memory, lower bandwidth, often faster inference	Quality loss if bit width, calibration, or kernels are wrong
Pruning and sparsity	Weights, neurons, heads, layers, or tokens	Smaller model or less compute	Speedups require sparse-aware hardware/runtime
Knowledge distillation	The whole model, by training a smaller student to imitate a larger teacher	Real reduction in model size and latency	Requires data, training, and careful evaluation
Low-rank factorization	Weight matrices	Fewer parameters in selected layers	Can hurt quality; savings depend on rank and implementation
Weight sharing/codebooks	Weight storage	Smaller files and sometimes memory	Usually limited runtime speedup without custom kernels
Efficient architecture design	Model structure	Best quality per compute when training/building a model	Not a drop-in compression step for an existing checkpoint
KV cache compression	Runtime attention cache	More concurrency and longer context	Can damage long-context reasoning or retrieval
Prompt/context compression	Input tokens	Immediate cost and latency savings	Can remove information the model needed

Quantization

Quantization is the most widely used LLM compression method. It represents numeric values with fewer bits.

Target	Common formats	Notes
Weights only	W8A16, W4A16, GPTQ, AWQ, GGUF quants	Most common for local and open-model inference
Weights and activations	W8A8, FP8, W4A8	More speed potential, more kernel and calibration sensitivity
KV cache	FP8, INT4, INT2, NVFP4, custom vector quantization	Critical for long context and high concurrency
Fine-tuning state	4-bit base weights with adapters	Often used for memory-efficient fine-tuning rather than final serving

At a high level, quantization maps high-precision values to lower-precision buckets.

Example:

Original FP16 values:
[-1.72, -0.41, 0.03, 0.88, 1.96]

Quantized 4-bit codes:
[1, 6, 8, 11, 15]

Metadata:
scale and zero-point or other reconstruction parameters

At inference time, the runtime either:

Dequantizes values back to higher precision before computation.
Uses low-precision kernels directly.
Uses a hybrid approach where storage is low precision but accumulation is higher precision.
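
The mapping from values to codes and back can be sketched with simple min-max affine quantization. This is one illustrative scheme, not the method used by any particular format such as GPTQ, AWQ, or GGUF, and the codes it produces need not match the illustrative codes shown above. The function names are hypothetical.

```python
def quantize_affine(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a scale and zero-point."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)  # the code that represents 0.0
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_affine(codes, scale, zero_point):
    """Reconstruct approximate floats from codes and metadata."""
    return [(c - zero_point) * scale for c in codes]


values = [-1.72, -0.41, 0.03, 0.88, 1.96]
codes, scale, zero_point = quantize_affine(values, bits=4)
restored = dequantize_affine(codes, scale, zero_point)

print(codes)
print([round(r, 3) for r in restored])
print(max(abs(v - r) for v, r in zip(values, restored)))  # worst-case rounding error
```

Practical schemes usually compute a separate scale per group or channel of weights, which keeps the rounding error small when a few values are much larger than the rest.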

Pros:

Usually the fastest path to lower memory use.
Often no retraining required.
Large ecosystem support.
Can reduce cost dramatically.
Makes larger models accessible on smaller hardware.

Cons:

Can reduce quality.
Can increase hallucinations or brittle behavior if too aggressive.
Can hurt difficult tasks more than easy chat tasks.
Requires runtime-specific compatibility.
Speedups are not guaranteed without optimized kernels.
Evaluation must be task-specific.

Pruning and sparsity

Pruning removes parts of a model that appear less important. Sparsity keeps the original shape but sets some weights or structures to zero so they can be skipped by supported kernels.

Pruning type	What is removed	Example
Unstructured pruning	Individual weights	Set 50% of low-importance weights to zero
Semi-structured pruning	Fixed sparse patterns	NVIDIA-style 2:4 sparsity
Structured pruning	Channels, heads, neurons, layers	Remove attention heads or MLP neurons
Layer dropping	Entire transformer blocks	Distill or fine-tune after removing layers
Token pruning	Less important tokens during inference	Reduce attention workload
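
To make the first two rows concrete, the sketch below applies magnitude-based unstructured pruning and a 2:4 pattern to a small list of weights. It uses weight magnitude as the importance score, which is a common but simplistic assumption, and the function names are illustrative.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_2_of_4(weights):
    """Semi-structured pruning: keep the 2 largest magnitudes in each group of 4."""
    out = list(weights)
    # Trailing elements that do not fill a complete group of 4 are left untouched.
    for start in range(0, len(out) - len(out) % 4, 4):
        group = range(start, start + 4)
        keep = sorted(group, key=lambda i: abs(out[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out


weights = [0.9, -0.8, 0.7, 0.6, 0.1, -0.2, 0.05, 0.01]
print(magnitude_prune(weights))  # [0.9, -0.8, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0]
print(prune_2_of_4(weights))     # [0.9, -0.8, 0.0, 0.0, 0.1, -0.2, 0.0, 0.0]
```

Both results are 50% sparse, but the 2:4 version has to drop larger weights in the first group to keep the fixed pattern. That regularity is what lets supported hardware skip the zeros.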

Pros:

Can reduce both parameters and computation.
Structured pruning can produce a genuinely smaller dense model.
Can combine with quantization.
Useful for research and specialized hardware.

Cons:

Not always faster in practice.
Sparse formats and kernels are runtime-dependent.
May require fine-tuning or distillation to recover quality.
Aggressive pruning can break reasoning and instruction following.
Can remove rare but important capabilities.
More difficult to deploy than quantization.

Knowledge distillation

Distillation trains a smaller student model to imitate a larger teacher model. The teacher may provide labels, logits, chain-of-thought rationales, tool-call traces, critique data, preference data, or synthetic instruction data.

The output is not a compressed copy of the same model. It is a smaller model trained to reproduce selected behaviors of the larger model.

Type	Description	Common use
Response distillation	Student imitates teacher answers	Instruction tuning
Logit distillation	Student matches teacher probability distribution	More information-rich but requires teacher logits
Step/rationale distillation	Student learns reasoning traces	Math, code, planning
Tool-use distillation	Student learns when/how to call tools	Agentic workflows
Preference distillation	Student learns teacher or human preferences	Alignment
Domain distillation	Student learns specialized domain behavior	Enterprise assistants
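
For logit distillation, a common training signal is the KL divergence between temperature-softened teacher and student distributions over the next token. The sketch below computes that quantity for one position in plain Python. The logits are made up, the function names are illustrative, and a real training loop would use a tensor library and typically mix this term with a standard cross-entropy loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.5, 0.2, -1.0]  # made-up logits over a 4-token vocabulary
student = [2.5, 2.0, 0.1, -0.5]

print(distillation_kl(teacher, student))  # positive: the student disagrees with the teacher
print(distillation_kl(teacher, teacher))  # 0.0: identical distributions
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to non-top tokens.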

Pros:

Produces a genuinely smaller model.
Can reduce latency and cost substantially.
Can specialize a model for a domain, product, or workflow.
Can be combined with quantization after training.
Often better than simply pruning an existing model.

Cons:

Requires training data and training infrastructure.
Student inherits teacher errors and blind spots.
Broad generality is hard to preserve.
Distillation can violate model-provider terms if not permitted.
Evaluation must cover reliability, safety, and edge cases, not just average accuracy.

Low-rank factorization

Many neural network layers contain large dense matrices. Low-rank factorization approximates a large matrix as the product of smaller matrices.
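
The parameter saving is easy to compute: an m x n matrix has m * n parameters, while a rank-r factorization into an m x r and an r x n matrix has r * (m + n). A small sketch with a hypothetical 4096 x 4096 layer (function names are illustrative):

```python
def dense_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, rank: int) -> int:
    """Parameters in the factors A (m x rank) and B (rank x n)."""
    return rank * (m + n)


m, n = 4096, 4096  # hypothetical layer shape
for rank in (64, 256, 1024, 2048):
    ratio = dense_params(m, n) / low_rank_params(m, n, rank)
    print(f"rank {rank}: {low_rank_params(m, n, rank):,} params, {ratio:g}x smaller")
```

The factorization only saves parameters when the rank is below m * n / (m + n), which is 2048 for this shape. How much quality survives at a given rank is a separate question that needs evaluation.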

Pros:

Mathematically clean.
Can combine with quantization.
Can reduce parameter count in selected layers.
Can be layered with fine-tuning.

Cons:

Drop-in compression may require fine-tuning.
Speedup depends on matrix sizes and kernels.
Not always competitive with quantization for simple deployment.
Choosing ranks layer-by-layer is nontrivial.

Weight sharing, clustering, and entropy coding

These methods reduce storage by making many weights share values or by encoding weights more compactly.
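
A minimal sketch of the shared-value idea: each weight is replaced by the index of its nearest entry in a small codebook, so storage becomes a few bits per weight plus the codebook itself. The codebook here is fixed by hand for illustration; practical methods would typically learn it, for example by clustering. The function names are hypothetical.

```python
import math


def assign_to_codebook(weights, codebook):
    """Return, for each weight, the index of the nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(w - codebook[k])) for w in weights]


def reconstruct(indices, codebook):
    """Look up the shared value for each stored index."""
    return [codebook[i] for i in indices]


weights = [-0.52, -0.48, 0.01, 0.49, 0.51, -0.02]
codebook = [-0.5, 0.0, 0.5]  # hand-picked for illustration

indices = assign_to_codebook(weights, codebook)
bits_per_index = math.ceil(math.log2(len(codebook)))

print(indices)                         # [0, 0, 1, 2, 2, 1]
print(reconstruct(indices, codebook))  # [-0.5, -0.5, 0.0, 0.5, 0.5, 0.0]
print(bits_per_index)                  # 2 bits per weight, plus the codebook
```

The lookup at reconstruction time is the extra indirection that can cost speed unless the serving kernels are written for it.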

Pros:

Useful for distribution and storage.
Can be combined with quantization.
Good for many model variants with small differences.
Reduces parameter count.
Can be effective when designed into the model.
May preserve depth while reducing storage.

Cons:

Often helps disk size more than runtime latency.
Needs custom kernels or decompression strategy for serving speed.
Added indirection can hurt performance.
Hard to apply after training without degradation.
May reduce representational capacity.
Usually requires training or fine-tuning.
Less common as a practical post-training LLM compression method.

Efficient architecture design

Sometimes the best compression is not compressing a finished model but designing or selecting a better-sized model. Important architecture-level efficiency techniques include:

Grouped-query attention and multi-query attention
Mixture-of-experts
Sliding-window and sparse attention
Smaller dense models trained better

Pros:

Can outperform post-hoc compression.
Better control over latency, memory, and deployment shape.
Can avoid quality loss from compressing the wrong model.

Cons:

Requires training or selecting a new model.
Not a direct transformation of an existing checkpoint.
More evaluation and migration work.

KV cache compression

During autoregressive generation, a transformer stores key and value tensors for previous tokens so it does not have to recompute them. This is the KV cache. The KV cache grows linearly with sequence length and batch size. For long-context and high-concurrency serving, KV cache memory can dominate. Weight compression helps fit the model. KV cache compression helps serve real workloads.

Method	Description	Benefit	Risk
KV quantization	Store cache in INT8, INT4, FP8, etc.	Lower VRAM	Quality loss if too aggressive
Token eviction	Drop less important cached tokens	Lower memory	Irreversible forgetting
Token merging	Combine similar tokens or cache entries	Lower memory	Loss of detail
Attention sinks	Preserve special high-importance early tokens	Better stability	Heuristic-dependent
Sliding cache	Keep recent window plus selected global tokens	Long-context efficiency	Misses distant details
CPU offload	Move cache from GPU to CPU memory	Larger effective context	Transfer latency
Hierarchical cache	Hot cache on GPU, cold cache elsewhere	Better memory use	System complexity
Low-rank KV	Approximate cache tensors with lower-rank forms	Memory reduction	Reconstruction error
Prefix caching	Reuse cache for shared prompts	Throughput improvement	Less useful for unique prompts
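
As an illustration of the sliding cache and attention sink rows, the sketch below computes which token positions would stay in the cache under a simple policy: keep the first few tokens plus a recent window, and evict everything in between. The policy and parameter names are illustrative assumptions, not the behavior of any specific inference engine.

```python
def kept_positions(seq_len: int, num_sink_tokens: int, window: int) -> list:
    """Positions retained by a sink-plus-recent-window eviction policy."""
    sinks = range(min(num_sink_tokens, seq_len))
    recent = range(max(seq_len - window, 0), seq_len)
    return sorted(set(sinks) | set(recent))


print(kept_positions(seq_len=10, num_sink_tokens=2, window=3))  # [0, 1, 7, 8, 9]
print(kept_positions(seq_len=4, num_sink_tokens=2, window=3))   # [0, 1, 2, 3]
```

Cache size is capped at num_sink_tokens + window entries regardless of sequence length, and anything said in the evicted middle positions is no longer visible to attention.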

Pros:

Enables longer context.
Improves concurrency.
Reduces serving cost.
Helps avoid GPU memory fragmentation and out-of-memory errors.
Can be combined with quantized weights.

Cons:

Quality degradation can be subtle and task-dependent.
Retrieval tests may not reveal reasoning damage.
Some methods help memory but add latency.
System implementations are more complex than weight-only quantization.

Prompt and context compression

Prompt compression reduces the number of input tokens before the model sees them. This is not model compression, but it often has the fastest business impact because API and prefill costs scale with input tokens.

Method	Description
Summarization	Replace long text with a shorter summary
Extractive selection	Keep only relevant passages
Reranking	Retrieve many chunks, send only top-ranked chunks
Deduplication	Remove repeated or near-duplicate content
Structured extraction	Convert documents into concise facts or fields
Conversation memory	Summarize old turns and keep recent turns verbatim
Query-focused compression	Compress context specifically for the current question
Embedding-based filtering	Drop chunks with low semantic relevance
Prompt template minimization	Remove unnecessary instruction verbosity
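
Several of these methods compose naturally. The sketch below combines deduplication with relevance-ranked selection under a token budget. It assumes relevance scores are already available, for example from a reranker, and it counts tokens by splitting on whitespace, which is only a crude stand-in for a real tokenizer. All names and example chunks are hypothetical.

```python
def select_chunks(scored_chunks, token_budget):
    """Pick the highest-scoring unique chunks that fit within a token budget.

    scored_chunks: list of (relevance_score, text) pairs.
    """
    seen = set()
    selected = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # drop exact duplicates after normalization
        cost = len(text.split())  # crude whitespace token estimate
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget
        seen.add(key)
        selected.append(text)
        used += cost
    return selected


chunks = [
    (0.9, "Refunds are issued within 14 days."),
    (0.8, "refunds are issued  within 14 days."),
    (0.7, "Contact support to start a refund."),
    (0.4, "Our office is closed on public holidays."),
]
print(select_chunks(chunks, token_budget=12))
```

With a budget of 12, the duplicate and the low-relevance chunk are dropped and the two remaining chunks are sent to the model.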

Pros:

Works with closed and open models.
No model modification required.
Reduces cost and prefill latency immediately.
Can improve answer quality by removing irrelevant context.

Cons:

Compression may remove details needed later.
Summaries can introduce errors.
Relevance filtering can miss subtle dependencies.
Requires careful evaluation for legal, medical, financial, and code tasks.
Over-compression can make the model confidently wrong.
Compression logic becomes part of the product behavior.

Combining compression methods

Compression methods are often combined.

Combination	Why use it
Quantization + KV cache quantization	Reduce both model and runtime memory
Distillation + quantization	Build a small student, then compress further
Pruning + quantization	Reduce parameters and precision
Context compression + KV cache compression	Reduce both input length and long-context memory
Efficient architecture + quantization	Best practical deployment footprint

Evaluation

Compression should be evaluated across four dimensions:

Quality
Performance
Cost
Reliability

Quality is assessed with:

Human preference ratings.
LLM-as-judge with calibration.
Domain expert review.
Tool-call correctness.
Safety and refusal behavior.
Regression tests for known prompts.

Performance metrics measure:

Time to first token.
Tokens per second per request.
Maximum context length.
Maximum concurrent users.
GPU utilization.
Memory bandwidth utilization.

Cost metrics measure:

Cost per 1M input tokens.
Cost per 1M output tokens.
Cost per successful task.
GPU-hours.
Power consumption.
Required hardware tier.
Engineer time and operational complexity.
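
Cost per successful task ties cost back to quality: a cheaper configuration that fails more often can cost more per useful result. A small sketch with entirely hypothetical prices and success rates (the function name is illustrative):

```python
def cost_per_successful_task(cost_per_request: float, success_rate: float) -> float:
    """Average spend needed to obtain one successful result."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_request / success_rate


# Hypothetical numbers for three configurations of the same service.
baseline = cost_per_successful_task(cost_per_request=0.010, success_rate=0.95)
moderate = cost_per_successful_task(cost_per_request=0.006, success_rate=0.90)
aggressive = cost_per_successful_task(cost_per_request=0.004, success_rate=0.35)

for name, cost in (("baseline", baseline), ("moderate", moderate), ("aggressive", aggressive)):
    print(f"{name}: ${cost:.4f} per successful task")
```

In this made-up scenario the moderately compressed configuration is the cheapest per success, while the most aggressive one costs more than the uncompressed baseline despite its lower per-request price.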

Reliability metrics measure:

Out-of-memory frequency.
Error rates.
Latency spikes.
Behavior on unusual prompts.
Long-running serving stability.

Security, privacy, and governance

Compression changes model behavior and deployment patterns. It should be treated as a model change, not only an infrastructure optimization.

Consider:

Whether compressed models still follow safety policies.
Whether refusals and guardrails survive compression.
Whether domain-specific factuality changes.
Whether private training data was used for distillation.
Whether teacher-model terms permit distillation.
Whether quantized community checkpoints are trustworthy.
Whether model files came from a verified source.
Whether compression tools execute untrusted code.
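
One basic check for the source question is comparing a downloaded file's SHA-256 digest against a checksum published by the model's distributor. The sketch below uses only the Python standard library; the function names are illustrative, and the expected digest would come from wherever the publisher lists it.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Return True if the file's SHA-256 digest equals the published value."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

A matching checksum only shows that the file is the one the publisher listed. It says nothing about whether the publisher is trustworthy or whether the file format can execute code when loaded.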
