Sometime in April this year, Docker added a new feature called Docker Model Runner. It’s meant to streamline the process of pulling, running, and serving large language models (LLMs) and other AI models directly from Docker Hub or OCI-compliant registries.
It integrates with Docker Desktop and Docker Engine and lets you serve models via OpenAI-compatible APIs, package GGUF files as OCI artifacts, and interact with models from the command line.
Features
- Pull and push models
- Serve models via OpenAI-compatible APIs
- Package and publish GGUF files as OCI artifacts
- Run AI models directly from the command line
- Manage local models and display logs
Requirements
Docker Model Runner is supported on the following platforms:
- Windows (amd64) - NVIDIA GPUs
- Windows (arm64) - Qualcomm Adreno GPU
- macOS (Apple Silicon)
- Linux - NVIDIA GPUs
How it works
You pull a model, run it, and interact with it from the command line, or you can use the OpenAI-compatible endpoints:
- GET /engines/llama.cpp/v1/models
- GET /engines/llama.cpp/v1/models/{namespace}/{name}
- POST /engines/llama.cpp/v1/chat/completions
- POST /engines/llama.cpp/v1/completions
- POST /engines/llama.cpp/v1/embeddings
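For example, once Model Runner is enabled you can list the available models from the host with a plain HTTP call (this assumes host-side TCP access on port 12434, the same port used in the curl examples further down):
curl http://localhost:12434/engines/llama.cpp/v1/models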
Note that models are loaded into memory only at runtime, when a request is made, and unloaded when not in use. Also, keep in mind that pulling models can take some time, since their OCI artifacts tend to be large.
Enable Docker Model Runner
Docker Desktop
To enable it in Docker Desktop, go to the Beta features settings and tick the Docker Model Runner checkbox. Make sure you are on version 4.40 or later.
Docker Engine
- Install Docker Model Runner as a package. For example, on Ubuntu run:
$ sudo apt-get update
$ sudo apt-get install docker-model-plugin
- Verify the installation:
$ docker model version
Pull a model
Run:
docker model pull ai/smollm3
# with specific tag
docker model pull ai/smollm2:360M-Q4_K_M
# pulling from Hugging Face
docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
The model version is part of the name (smollm2 vs. smollm3), while the tag encodes the parameter count and quantization (for example 360M-Q4_K_M).
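To check what you have pulled so far, list the local models (this is part of the local model management mentioned in the features above):
docker model list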
Run a model
Run:
docker model run ai/smollm2
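Running it without a prompt starts an interactive chat session. You can also pass a prompt directly for a one-shot answer; the prompt text here is just an example:
docker model run ai/smollm2 "Write a haiku about containers"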
If you run into issues while running a model, Model Runner exposes logs, which you can view with:
docker model logs
Calling the OpenAI-compatible endpoints
# from within a container
curl -X POST -H "Content-Type: application/json" -d @data.json "http://model-runner.docker.internal/engines/llama.cpp/v1/chat/completions"
# from the host
curl -X POST -H "Content-Type: application/json" -d @data.json "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
An example payload in the data.json file:
{
  "model": "ai/smollm2",
  "messages": [
    {
      "role": "user",
      "content": "Hello there"
    }
  ]
}
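The response follows the standard OpenAI chat-completions shape, so you can extract just the reply text with jq, for example:
curl -s -X POST -H "Content-Type: application/json" -d @data.json "http://localhost:12434/engines/llama.cpp/v1/chat/completions" | jq -r '.choices[0].message.content'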
Compose
Docker Compose also supports models. You can now declare that a service depends on a model, and Docker will make sure the service can access the OpenAI-compatible endpoints that Model Runner exposes.
Here’s a simple example:
services:
  app:
    image: app:latest
    models:
      - llm

models:
  llm:
    model: ai/smollm2
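Compose hands the connection details to the service through environment variables. A minimal sketch of checking them from inside the app container, assuming Compose’s default naming of LLM_URL and LLM_MODEL for a model keyed llm (verify against the Compose docs for your version):
# inside the running app container
echo "$LLM_URL"    # OpenAI-compatible endpoint exposed by Model Runner
echo "$LLM_MODEL"  # resolved model name, here ai/smollm2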
More
If you’re interested in learning more, check the official docs or take a look at my example application here.