Posts

LLM compression

LLM compression is the set of techniques used to make large language models cheaper, faster, smaller, and easier to deploy while preserving as much quality as possible. It can compress the model itself, the runtime state used during inference, or the input context sent to the model.

What is LLM compression?

LLM compression means reducing one or more of the resources required to use a model:

Resource          Why it matters
Disk size         Model distribution, storage cost, startup time
RAM/VRAM          Whether the model fits and how many requests can run concurrently
Memory bandwidth  Often the dominant bottleneck for token generation
Compute           Matrix multiply cost, attention cost, and server throughput

Compression can target different parts of the system: ...
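To make the disk and RAM/VRAM rows above concrete, here is a rough back-of-the-envelope sketch. It assumes the weights dominate the footprint and that every parameter is stored at the same bit width; the 7-billion-parameter model and the helper name `weight_size_gb` are hypothetical, chosen only for illustration.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model stored at several precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_size_gb(7e9, bits):.1f} GB")
```

The estimate ignores quantization metadata such as scales and zero points, and any runtime state, so real files and memory use will be somewhat larger.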

OpenTelemetry for GitHub Copilot CLI

GitHub Copilot CLI can export OpenTelemetry data that helps you inspect model calls, tool invocations, MCP activity, token usage, and latency. This article relates to OpenTelemetry for Codex CLI and focuses on the Copilot-specific setup and telemetry behavior.

Use the Compose example from the Codex article to bring up a local OTel stack. The important Copilot-specific differences are:

Copilot CLI exports OTLP over HTTP to http://127.0.0.1:4318.
Codex CLI exports OTLP over gRPC to http://127.0.0.1:4317.

Traces

Copilot CLI telemetry is most useful when inspected as traces. From local captures, the most useful span categories are: ...
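Before pointing either CLI at the local stack, it can help to confirm that the two OTLP ports mentioned in the Copilot CLI excerpt above are accepting connections. The sketch below uses only the Python standard library; the helper name `port_is_open` and the display labels are illustrative, not part of either tool. It checks only that something is listening on the port, not that it speaks OTLP.

```python
import socket

# Host and ports are taken from the excerpt; the labels are only for display.
OTLP_ENDPOINTS = {
    "Copilot CLI (OTLP over HTTP)": ("127.0.0.1", 4318),
    "Codex CLI (OTLP over gRPC)": ("127.0.0.1", 4317),
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for label, (host, port) in OTLP_ENDPOINTS.items():
        status = "reachable" if port_is_open(host, port) else "not reachable"
        print(f"{label}: {host}:{port} is {status}")
```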

OpenTelemetry for Codex CLI

Codex CLI can export traces, metrics, and OTel log records with event names and attributes. With that telemetry, you can track API requests, tool invocations, token usage, MCP calls, and run latency.

This article walks through a local OpenTelemetry stack and the Codex CLI configuration needed to send telemetry to it. The setup should help you explore the traces, logs, and metrics, build dashboards, and answer questions such as:

Which models were used?
How many tool calls were made?
Was MCP used?
How long did the run take?

Traces

While Codex CLI can export traces, its trace spans are not fully documented, and there is no official list of span names. ...

Claude Code with Ollama

During 2025, the AI race shifted from creating the best models to creating the best agents. By the end of the year, most developers argue that Claude Code has taken the lead. Standard Claude Code usage typically requires an expensive Anthropic subscription. But beyond the cost savings, using Ollama with Claude Code offers significant advantages for data privacy, confidentiality, and service autonomy. By redirecting the Claude Code interface to a local instance, you effectively bypass several of Anthropic’s cloud-based data collection and usage policies. These include model training opt-out, technical information collection, usage tracking, regional constraints, intellectual property ownership, and more. ...

Docker model runner

Sometime in April this year, Docker added a new feature called Docker Model Runner. It is meant to streamline the process of pulling, running, and serving large language models (LLMs) and other AI models directly from Docker Hub or OCI-compliant registries. It integrates with Docker Desktop and Docker Engine, and lets you serve models via OpenAI-compatible APIs, package GGUF files as OCI artifacts, and interact with models from the command line.

Features

Pull and push models
Serve models on OpenAI-compatible APIs
Package and publish GGUF files as OCI artifacts
Run AI models directly from the command line
Manage local models and display logs

Requirements

Docker Model Runner is supported on the following platforms: ...