LLM compression is the set of techniques used to make large language models cheaper, faster, smaller, and easier to deploy while preserving as much quality as possible. It can compress the model itself, the runtime state used during inference, or the input context sent to the model.

What is LLM compression?

LLM compression means reducing one or more of the resources required to use a model:

Resource | Why it matters
Disk size | Model distribution, storage cost, startup time
RAM/VRAM | Whether the model fits and how many requests can run concurrently
Memory bandwidth | Often the dominant bottleneck for token generation
Compute | Matrix multiply cost, attention cost, and server throughput

Compression can target different parts of the system:

Target | What is compressed | Typical benefit
Model weights | Parameters stored in matrices | Smaller model files and lower VRAM
Activations | Intermediate tensors during forward pass | Lower memory and faster compute
KV cache | Attention keys/values saved during generation | Lower VRAM for long context and concurrency
Input tokens | Prompts and retrieved context sent to the model | Lower API cost, prefill latency, and context-window pressure
Trainable parameters | Weights updated during fine-tuning (for example, adapters) | Lower fine-tuning cost and adapter storage
Architecture | Layers, heads, hidden size, routing | Better quality per compute unit

Why would you compress LLMs?

Fit models on available hardware

Compression can change a deployment from “does not fit” to “fits on one machine”. Compression can enable:

  • Running local models on laptops.
  • Serving models on fewer GPUs.
  • Deploying models to edge devices.
  • Hosting larger models in a fixed memory budget.

For example:

Model size | FP16/BF16 rough weight memory | 8-bit rough memory | 4-bit rough memory
7B | 14 GB | 7 GB | 3.5 GB
13B | 26 GB | 13 GB | 6.5 GB
34B | 68 GB | 34 GB | 17 GB
70B | 140 GB | 70 GB | 35 GB
405B | 810 GB | 405 GB | 202.5 GB

These are simplified weight-only estimates. Real deployments also need KV cache, runtime overhead, and sometimes duplicated weights across replicas or tensor-parallel shards.
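
The figures in the table follow from multiplying parameter count by bits per weight. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and ignoring the per-group scales and zero-points that real quantized formats also store (the function name is illustrative):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory in GB, taking 1 GB as 1e9 bytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Recompute the table rows for 16-bit, 8-bit, and 4-bit weights.
for size in (7, 13, 34, 70, 405):
    row = [weight_memory_gb(size, bits) for bits in (16, 8, 4)]
    print(f"{size}B: " + ", ".join(f"{gb:g} GB" for gb in row))
```

Because stored metadata adds a little to every quantized format, actual file sizes are usually somewhat larger than these idealized numbers.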

Reduce inference cost

Serving cost is often driven by GPU memory, memory bandwidth, and request concurrency. Smaller weights and caches can allow:

  • More requests per GPU.
  • Larger batch sizes.
  • Fewer GPUs per replica.
  • Lower latency under load.
  • Cheaper instance types.

Improve latency and throughput

Compression can improve performance by:

  • Moving less data from memory.
  • Increasing effective memory bandwidth.
  • Allowing larger batches.
  • Reducing CPU/GPU transfer.
  • Enabling faster low-precision matrix kernels.

However, smaller does not automatically mean faster. Speed depends on whether the hardware and inference engine have optimized kernels for the compressed format.

Support long context

For chat, RAG, coding agents, and document workflows, the KV cache can dominate memory. KV cache memory grows with layer count, KV heads, head dimension, sequence length, batch size, and precision. Compressing weights does not solve a KV cache bottleneck by itself.
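
The factors listed above multiply together, which is why the cache grows so quickly. A rough sketch of the standard estimate, assuming keys and values are both cached for every layer; the model dimensions in the example are hypothetical and not taken from any specific model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value is 2 for FP16/BF16 and 1 for an 8-bit cache.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
one_sequence = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                              seq_len=8192, batch_size=1)
sixteen_sequences = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                                   seq_len=8192, batch_size=16)
print(one_sequence / 2**30, "GiB for one sequence")
print(sixteen_sequences / 2**30, "GiB for sixteen sequences")
```

Under these assumed dimensions, one 8,192-token sequence needs 1 GiB of FP16 cache and sixteen concurrent sequences need 16 GiB, regardless of how the weights are stored.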

Long-context workloads can become KV-cache-bound even when weights fit comfortably. KV cache compression is therefore critical for:

  • Chatbots with long histories.
  • Retrieval-augmented generation over large context.
  • Coding agents with many files in context.
  • Document analysis.
  • Multi-turn customer support.
  • High-concurrency API serving.

Enable local, edge, or private deployment

Quantized and distilled models are easier to run on laptops, workstations, edge devices, and private infrastructure. This makes it more practical to run models on-device or on-premises, in disconnected or regulated environments where data cannot leave the network, and in low-latency applications where cloud round trips are unacceptable.

Lower energy and infrastructure footprint

At high request volume, smaller models and more efficient inference can reduce GPU-hours, power, cooling, and hardware requirements, which in turn lowers environmental impact.

How is LLM compression done?

The main options are:

Option | What it compresses | Main benefit | Main risk
Quantization | Weights, activations, or KV cache | Lower memory, lower bandwidth, often faster inference | Quality loss if bit width, calibration, or kernels are wrong
Pruning and sparsity | Weights, neurons, heads, layers, or tokens | Smaller model or less compute | Speedups require sparse-aware hardware/runtime
Knowledge distillation | Train a smaller student model to imitate a larger teacher model | Real reduction in model size and latency | Requires data, training, and careful evaluation
Low-rank factorization | Weight matrices | Fewer parameters in selected layers | Can hurt quality; savings depend on rank and implementation
Weight sharing/codebooks | Weight storage | Smaller files and sometimes memory | Usually limited runtime speedup without custom kernels
Efficient architecture design | Model structure | Best quality per compute when training/building a model | Not a drop-in compression step for an existing checkpoint
KV cache compression | Runtime attention cache | More concurrency and longer context | Can damage long-context reasoning or retrieval
Prompt/context compression | Input tokens | Immediate cost and latency savings | Can remove information the model needed

Quantization

Quantization is the most widely used LLM compression method. It represents numeric values with fewer bits.

Target | Common formats | Notes
Weights only | W8A16, W4A16, GPTQ, AWQ, GGUF quants | Most common for local and open-model inference
Weights and activations | W8A8, FP8, W4A8 | More speed potential, more kernel and calibration sensitivity
KV cache | FP8, INT4, INT2, NVFP4, custom vector quantization | Critical for long context and high concurrency
Fine-tuning state | 4-bit base weights with adapters | Often used for memory-efficient fine-tuning rather than final serving

At a high level, quantization maps high-precision values to lower-precision buckets.

Example:

Original FP16 values:
[-1.72, -0.41, 0.03, 0.88, 1.96]

Quantized 4-bit codes:
[1, 6, 8, 11, 15]

Metadata:
scale and zero-point or other reconstruction parameters
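
The mapping in this example can be sketched as affine quantization. The text does not state which scale and zero-point were used; the values below (scale 0.26, zero-point 8) are assumptions chosen only because they reproduce the illustrative codes, and the function names are hypothetical:

```python
def quantize(values, scale, zero_point, bits=4):
    """Map floats to unsigned integer codes in [0, 2**bits - 1]."""
    qmax = 2**bits - 1
    codes = []
    for v in values:
        q = round(v / scale) + zero_point
        codes.append(max(0, min(qmax, q)))  # clip codes that fall outside the range
    return codes


def dequantize(codes, scale, zero_point):
    """Reconstruct approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in codes]


values = [-1.72, -0.41, 0.03, 0.88, 1.96]
codes = quantize(values, scale=0.26, zero_point=8)   # assumed parameters
approx = dequantize(codes, scale=0.26, zero_point=8)
print(codes)
print([round(a, 2) for a in approx])
```

The reconstructed values differ slightly from the originals. That gap is the quantization error, and it grows as the bit width shrinks or as outliers force a larger scale.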

At inference time, the runtime either:

  • Dequantizes values back to higher precision before computation.
  • Uses low-precision kernels directly.
  • Uses a hybrid approach where storage is low precision but accumulation is higher precision.

Pros:

  • Usually the fastest path to lower memory use.
  • Often no retraining required.
  • Large ecosystem support.
  • Can reduce cost dramatically.
  • Makes larger models accessible on smaller hardware.

Cons:

  • Can reduce quality.
  • Can increase hallucinations or brittle behavior if too aggressive.
  • Can hurt difficult tasks more than easy chat tasks.
  • Requires runtime-specific compatibility.
  • Speedups are not guaranteed without optimized kernels.
  • Evaluation must be task-specific.

Pruning and sparsity

Pruning removes parts of a model that appear less important. Sparsity keeps the original shape but sets some weights or structures to zero so they can be skipped by supported kernels.

Pruning type | What is removed | Example
Unstructured pruning | Individual weights | Set 50% of low-importance weights to zero
Semi-structured pruning | Fixed sparse patterns | NVIDIA-style 2:4 sparsity
Structured pruning | Channels, heads, neurons, layers | Remove attention heads or MLP neurons
Layer dropping | Entire transformer blocks | Distill or fine-tune after removing layers
Token pruning | Less important tokens during inference | Reduce attention workload
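
To make the first two rows concrete, here is a minimal sketch that uses weight magnitude as the importance score. Magnitude is one common heuristic and an assumption here, not the only way to rank weights; the function names are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the given fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_2_of_4(weights):
    """Semi-structured 2:4 pruning: keep the 2 largest-magnitude weights per group of 4."""
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        ranked = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


row = [0.9, -0.8, 0.7, -0.6, 0.1, 0.05, -0.02, 0.3]
print(magnitude_prune(row, 0.5))
print(prune_2_of_4(row))
```

Both results are 50% sparse, but the zeros land in different places. The unstructured version keeps the four largest weights wherever they are, while the 2:4 version must keep exactly two per group, which is what lets supporting hardware skip the zeros.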

Pros:

  • Can reduce both parameters and computation.
  • Structured pruning can produce a genuinely smaller dense model.
  • Can combine with quantization.
  • Useful for research and specialized hardware.

Cons:

  • Not always faster in practice.
  • Sparse formats and kernels are runtime-dependent.
  • May require fine-tuning or distillation to recover quality.
  • Aggressive pruning can break reasoning and instruction following.
  • Can remove rare but important capabilities.
  • More difficult to deploy than quantization.

Knowledge distillation

Distillation trains a smaller student model to imitate a larger teacher model. The teacher may provide labels, logits, chain-of-thought rationales, tool-call traces, critique data, preference data, or synthetic instruction data.

The output is not a compressed copy of the same model. It is a smaller model trained to reproduce selected behaviors of the larger model.

Type | Description | Common use
Response distillation | Student imitates teacher answers | Instruction tuning
Logit distillation | Student matches teacher probability distribution | More information-rich but requires teacher logits
Step/rationale distillation | Student learns reasoning traces | Math, code, planning
Tool-use distillation | Student learns when/how to call tools | Agentic workflows
Preference distillation | Student learns teacher or human preferences | Alignment
Domain distillation | Student learns specialized domain behavior | Enterprise assistants
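
For logit distillation, a common formulation is the KL divergence between temperature-softened teacher and student distributions, scaled by the square of the temperature. The sketch below shows that loss for a single token position in plain Python. The temperature value and example logits are assumptions, and real training code would compute this over batches with a tensor library.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times temperature squared."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature**2


teacher = [2.0, 1.0, 0.1]   # made-up logits over a 3-token vocabulary
student = [1.5, 1.2, 0.3]
print(distillation_loss(student, teacher))
print(distillation_loss(teacher, teacher))  # identical distributions give zero loss
```

A higher temperature spreads probability across more tokens, so the student also learns from the teacher's relative ranking of unlikely tokens rather than only the top answer.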

Pros:

  • Produces a genuinely smaller model.
  • Can reduce latency and cost substantially.
  • Can specialize a model for a domain, product, or workflow.
  • Can be combined with quantization after training.
  • Often better than simply pruning an existing model.

Cons:

  • Requires training data and training infrastructure.
  • Student inherits teacher errors and blind spots.
  • Broad generality is hard to preserve.
  • Distillation can violate model-provider terms if not permitted.
  • Evaluation must cover reliability, safety, and edge cases, not just average accuracy.

Low-rank factorization

Many neural network layers contain large dense matrices. Low-rank factorization approximates a large matrix as the product of smaller matrices.
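
As a rough illustration, an m x n matrix W is replaced by A (m x r) times B (r x n), so storage drops from m*n values to r*(m + n). The sketch below counts parameters and rebuilds a tiny rank-1 matrix; the 4096 x 4096 layer size and rank 256 are hypothetical choices, not figures from any particular model.

```python
def dense_params(m, n):
    """Parameter count of a dense m x n matrix."""
    return m * n


def low_rank_params(m, n, rank):
    """Parameter count of the factors A (m x rank) and B (rank x n)."""
    return rank * (m + n)


def matmul(a, b):
    """Plain-Python matrix product, only for the tiny example below."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# A rank-1 matrix is reproduced exactly from 3 + 2 stored numbers instead of 6.
A = [[1.0], [2.0], [3.0]]
B = [[4.0, 5.0]]
W = matmul(A, B)
print(W)

# Hypothetical layer: 4096 x 4096 approximated at rank 256.
m, n, rank = 4096, 4096, 256
print(dense_params(m, n), low_rank_params(m, n, rank))
```

The factorization only saves parameters when the rank is below m*n / (m + n), which is 2048 for the hypothetical square layer above. Real weight matrices are rarely exactly low rank, so the approximation error at a given rank is what determines quality.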

Pros:

  • Mathematically clean.
  • Can combine with quantization.
  • Can reduce parameter count in selected layers.
  • Can be layered with fine-tuning.

Cons:

  • Drop-in compression may require fine-tuning.
  • Speedup depends on matrix sizes and kernels.
  • Not always competitive with quantization for simple deployment.
  • Choosing ranks layer-by-layer is nontrivial.

Weight sharing, clustering, and entropy coding

These methods reduce storage by making many weights share values or by encoding weights more compactly.
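
A small sketch of the codebook idea: each weight is stored as an index into a short list of shared values. The codebook here is hand-picked for illustration; in practice it would typically be learned, for example by clustering the weights. All names are hypothetical.

```python
import math


def assign_to_codebook(weights, codebook):
    """Replace each weight with the index of its nearest codebook value."""
    return [min(range(len(codebook)), key=lambda k: abs(w - codebook[k]))
            for w in weights]


def reconstruct(indices, codebook):
    """Look the shared values back up from the stored indices."""
    return [codebook[i] for i in indices]


def storage_bits(n_weights, codebook_size, value_bits=16):
    """Bits for the indices plus bits for the codebook entries themselves."""
    index_bits = math.ceil(math.log2(codebook_size))
    return n_weights * index_bits + codebook_size * value_bits


weights = [0.11, -0.52, 0.48, 0.09, -0.49, 0.51]
codebook = [-0.5, 0.1, 0.5]           # assumed shared values
indices = assign_to_codebook(weights, codebook)
print(indices)
print(reconstruct(indices, codebook))
print(storage_bits(1_000_000, 16), "bits vs", 1_000_000 * 16, "bits dense")
```

The lookup step in `reconstruct` is the added indirection: storage shrinks, but every use of a weight needs a table read unless the kernel is written for the format.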

Pros:

  • Useful for distribution and storage.
  • Can be combined with quantization.
  • Good for many model variants with small differences.
  • Reduces parameter count.
  • Can be effective when designed into the model.
  • May preserve depth while reducing storage.

Cons:

  • Often helps disk size more than runtime latency.
  • Needs custom kernels or decompression strategy for serving speed.
  • Added indirection can hurt performance.
  • Hard to apply after training without degradation.
  • May reduce representational capacity.
  • Usually requires training or fine-tuning.
  • Less common as a practical post-training LLM compression method.

Efficient architecture design

Sometimes the best compression is not compressing a finished model but designing or selecting a better-sized model. Important architecture-level efficiency techniques include:

  • Grouped-query attention and multi-query attention
  • Mixture-of-experts
  • Sliding-window and sparse attention
  • Smaller dense models trained better

Pros:

  • Can outperform post-hoc compression.
  • Better control over latency, memory, and deployment shape.
  • Can avoid quality loss from compressing the wrong model.

Cons:

  • Requires training or selecting a new model.
  • Not a direct transformation of an existing checkpoint.
  • More evaluation and migration work.

KV cache compression

During autoregressive generation, a transformer stores key and value tensors for previous tokens so it does not have to recompute them. This is the KV cache. The KV cache grows linearly with sequence length and batch size. For long-context and high-concurrency serving, KV cache memory can dominate. Weight compression helps fit the model. KV cache compression helps serve real workloads.

Method | Description | Benefit | Risk
KV quantization | Store cache in INT8, INT4, FP8, etc. | Lower VRAM | Quality loss if too aggressive
Token eviction | Drop less important cached tokens | Lower memory | Irreversible forgetting
Token merging | Combine similar tokens or cache entries | Lower memory | Loss of detail
Attention sinks | Preserve special high-importance early tokens | Better stability | Heuristic-dependent
Sliding cache | Keep recent window plus selected global tokens | Long-context efficiency | Misses distant details
CPU offload | Move cache from GPU to CPU memory | Larger effective context | Transfer latency
Hierarchical cache | Hot cache on GPU, cold cache elsewhere | Better memory use | System complexity
Low-rank KV | Approximate cache tensors with lower-rank forms | Memory reduction | Reconstruction error
Prefix caching | Reuse cache for shared prompts | Throughput improvement | Less useful for unique prompts
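
The "Attention sinks" and "Sliding cache" rows are often combined: keep a few early tokens plus a recent window and evict everything in between. A minimal sketch of that retention rule follows, working on token positions only. Real implementations apply it to the cached key/value tensors, and the parameter names and defaults here are assumptions.

```python
def kept_positions(seq_len, num_sink=4, window=1024):
    """Return the token positions whose KV entries are retained.

    Keeps the first `num_sink` positions (attention sinks) and the most
    recent `window` positions; everything in between is evicted.
    """
    if seq_len <= num_sink + window:
        return list(range(seq_len))          # nothing to evict yet
    recent_start = seq_len - window
    return list(range(num_sink)) + list(range(recent_start, seq_len))


print(kept_positions(10, num_sink=2, window=3))
print(len(kept_positions(100_000)))
```

Cache size is capped at `num_sink + window` entries per sequence no matter how long generation runs. The cost is the one named in the table: anything in the evicted middle is gone for good.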

Pros:

  • Enables longer context.
  • Improves concurrency.
  • Reduces serving cost.
  • Helps avoid GPU memory fragmentation and out-of-memory errors.
  • Can be combined with quantized weights.

Cons:

  • Quality degradation can be subtle and task-dependent.
  • Retrieval tests may not reveal reasoning damage.
  • Some methods help memory but add latency.
  • System implementations are more complex than weight-only quantization.

Prompt and context compression

Prompt compression reduces the number of input tokens before the model sees them. This is not model compression, but it often has the fastest business impact because API and prefill costs scale with input tokens.

Method | Description
Summarization | Replace long text with a shorter summary
Extractive selection | Keep only relevant passages
Reranking | Retrieve many chunks, send only top-ranked chunks
Deduplication | Remove repeated or near-duplicate content
Structured extraction | Convert documents into concise facts or fields
Conversation memory | Summarize old turns and keep recent turns verbatim
Query-focused compression | Compress context specifically for the current question
Embedding-based filtering | Drop chunks with low semantic relevance
Prompt template minimization | Remove unnecessary instruction verbosity
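
Several of these methods chain naturally. The sketch below combines deduplication, query-focused scoring, and selection under a token budget. It makes two loud simplifications: relevance is plain word overlap, where a real system would more likely use embeddings or a reranker, and tokens are approximated by word count. All function names are illustrative.

```python
import re


def words(text):
    """Lowercased alphanumeric words; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compress_context(query, chunks, token_budget):
    # Deduplication: drop chunks that repeat after case/punctuation normalization.
    seen, unique = set(), []
    for chunk in chunks:
        key = " ".join(words(chunk))
        if key not in seen:
            seen.add(key)
            unique.append(chunk)

    # Query-focused scoring: number of distinct query words found in the chunk.
    query_words = set(words(query))

    def score(chunk):
        return len(query_words & set(words(chunk)))

    # Greedy selection: highest score first, skipping anything over budget.
    selected, used = [], 0
    for chunk in sorted(unique, key=score, reverse=True):
        cost = len(words(chunk))
        if score(chunk) > 0 and used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    "The KV cache grows with sequence length.",
    "the kv cache grows with sequence   length.",
    "Quantization lowers weight memory.",
    "Bananas are yellow.",
]
print(compress_context("How much memory does the KV cache use?", chunks, token_budget=12))
```

With a smaller budget the second relevant chunk is dropped as well, which is the failure mode to test for: the filter has no way of knowing whether the model needed that passage.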

Pros:

  • Works with closed and open models.
  • No model modification required.
  • Reduces cost and prefill latency immediately.
  • Can improve answer quality by removing irrelevant context.

Cons:

  • Compression may remove details needed later.
  • Summaries can introduce errors.
  • Relevance filtering can miss subtle dependencies.
  • Requires careful evaluation for legal, medical, financial, and code tasks.
  • Over-compression can make the model confidently wrong.
  • Compression logic becomes part of the product behavior.

Combining compression methods

Compression methods are often combined.

Combination | Why use it
Quantization + KV cache quantization | Reduce both model and runtime memory
Distillation + quantization | Build a small student, then compress further
Pruning + quantization | Reduce parameters and precision
Context compression + KV cache compression | Reduce both input length and long-context memory
Efficient architecture + quantization | Best practical deployment footprint

Evaluation

Compression should be evaluated across four dimensions:

  • Quality
  • Performance
  • Cost
  • Reliability

Quality is assessed through:

  • Human preference ratings.
  • LLM-as-judge with calibration.
  • Domain expert review.
  • Tool-call correctness.
  • Safety and refusal behavior.
  • Regression tests for known prompts.

Performance metrics include:

  • Time to first token.
  • Tokens per second per request.
  • Maximum context length.
  • Maximum concurrent users.
  • GPU utilization.
  • Memory bandwidth utilization.

Cost metrics include:

  • Cost per 1M input tokens.
  • Cost per 1M output tokens.
  • Cost per successful task.
  • GPU-hours.
  • Power consumption.
  • Required hardware tier.
  • Engineer time and operational complexity.
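
Cost per successful task is worth computing explicitly, because a cheaper configuration with a lower success rate can cost more per useful result. A small sketch with made-up numbers (the costs and success rates below are hypothetical, not measurements):

```python
def cost_per_successful_task(total_cost, tasks_attempted, success_rate):
    """Total spend divided by the number of tasks that actually succeeded."""
    successes = tasks_attempted * success_rate
    if successes <= 0:
        raise ValueError("no successful tasks; cost per success is undefined")
    return total_cost / successes


# Hypothetical comparison over 1,000 attempted tasks.
baseline = cost_per_successful_task(total_cost=100.0, tasks_attempted=1000, success_rate=0.95)
compressed = cost_per_successful_task(total_cost=60.0, tasks_attempted=1000, success_rate=0.50)
print(f"baseline:   {baseline:.4f} per success")
print(f"compressed: {compressed:.4f} per success")
```

In this invented case the compressed setup is 40% cheaper to run yet more expensive per successful task, which is why quality and cost need to be evaluated together.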

Reliability metrics include:

  • Out-of-memory frequency.
  • Error rates.
  • Latency spikes.
  • Behavior on unusual prompts.
  • Long-running serving stability.

Security, privacy, and governance

Compression changes model behavior and deployment patterns. It should be treated as a model change, not only an infrastructure optimization.

Consider:

  • Whether compressed models still follow safety policies.
  • Whether refusals and guardrails survive compression.
  • Whether domain-specific factuality changes.
  • Whether private training data was used for distillation.
  • Whether teacher-model terms permit distillation.
  • Whether quantized community checkpoints are trustworthy.
  • Whether model files came from a verified source.
  • Whether compression tools execute untrusted code.
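
One basic check for the model-file questions above is to compare a downloaded file's SHA-256 digest with the value published by the source. A minimal sketch using Python's standard `hashlib`; the file name in the usage comment is a placeholder, and a matching digest only shows the file is the one the publisher listed, not that its contents are safe.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Return True when the file's digest matches the published value."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Usage (placeholder file name and digest):
# ok = verify_checkpoint("model-q4.gguf", "<digest published by the source>")
```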