LLM compression
LLM compression is the set of techniques used to make large language models cheaper, faster, smaller, and easier to deploy while preserving as much quality as possible. It can compress the model itself, the runtime state used during inference, or the input context sent to the model.

What is LLM compression?

LLM compression means reducing one or more of the resources required to use a model:

Resource            Why it matters
Disk size           Model distribution, storage cost, startup time
RAM/VRAM            Whether the model fits and how many requests can run concurrently
Memory bandwidth    Often the dominant bottleneck for token generation
Compute             Matrix multiply cost, attention cost, and server throughput

Compression can target different parts of the system: ...