Using Kimi K2.5 with Claude Code via Ollama Cloud (Tutorial 2026)

Claude Code is Anthropic's AI coding assistant that runs in your terminal. By default it uses Claude models. But what if you want to try something else? In this tutorial I will show you how to connect Kimi K2.5 (Moonshot AI's open-source 1-trillion-parameter model) to Claude Code using Ollama Cloud.

I will also be honest about what works and what does not, because there are limitations you should know about before you try this yourself.

What Is Kimi K2.5

Kimi K2.5 is an open-source model from Moonshot AI, released in January 2026. The headline specs:

1 trillion total parameters
32 billion active per request (mixture-of-experts)
256K token context window
Open weights, available on Hugging Face
Strong on coding benchmarks, competitive with GPT-class models

The catch: 1 trillion parameters is a lot of GPU memory. You are not running this on a MacBook. That is where Ollama Cloud comes in.

Why Ollama Cloud

Ollama Cloud hosts the heavyweight models so you do not need a beefy GPU rig. You pay per request or per usage, the model runs on their hardware, and you connect over an API that looks similar to standard Ollama local mode. This is the only practical way to use Kimi K2.5 unless you own (or rent) a multi-GPU machine.

Step 1: Get an Ollama Cloud Account

Step 2: Install Ollama Locally

Even when targeting Ollama Cloud, the local Ollama CLI is what Claude Code communicates with. It acts as a router. On Mac:

brew install ollama

Or download the installer from ollama.com if you are not on Homebrew.

Step 3: Configure Ollama to Use the Cloud Endpoint

Point Ollama at the cloud host using an environment variable:

export OLLAMA_HOST=https://api.ollama.com
export OLLAMA_API_KEY=your_api_key_here

Add those lines to your shell rc file (~/.zshrc or ~/.bashrc) so they persist across sessions.

Step 4: Pull Kimi K2.5

ollama pull kimi-k2.5

With Ollama Cloud, "pull" registers the model with your account. You are not downloading 600GB to your laptop. The model lives on Ollama's GPUs; pulling sets you up to call it.

Step 5: Point Claude Code at Kimi via Router

Claude Code can be routed at any OpenAI-compatible endpoint using claude-code-router. Install it:

npm install -g @musistudio/claude-code-router

Create or edit ~/.claude-code-router/config.json:

{
  "Providers": [
    {
      "name": "ollama-cloud-kimi",
      "api_base_url": "https://api.ollama.com/v1/chat/completions",
      "api_key": "your_api_key_here",
      "models": ["kimi-k2.5"]
    }
  ],
  "Router": {
    "default": "ollama-cloud-kimi,kimi-k2.5"
  }
}

Now run Claude Code through the router:

ccr code

That launches Claude Code, but every request gets routed to Kimi K2.5 on Ollama Cloud instead of Anthropic's models.

What Works

Code generation: Kimi K2.5 produces clean code for common languages (JavaScript, Python, Go, Rust)
Refactors: Multi-file edits work, with good awareness of project context
Long-context recall: The 256K window means it can hold a lot of your codebase at once
Speed: Faster than full Opus for routine work

What Does Not (Yet)

Here is the honest part. Kimi K2.5 through the router has rough edges I hit:

Tool calling is inconsistent. Claude Code expects certain JSON shapes for tool calls. Kimi sometimes returns malformed responses, especially for chained tool use. This shows up as silent failures or wrong file edits.
Plan mode is less reliable. Kimi will draft a plan but does not always follow it during execution.
Streaming differs from Claude's. You see chunkier output, fewer typing animations.
No vision yet on this endpoint. Drag-drop screenshots are not supported in this setup.
Latency varies wildly. Some requests are sub-second. Others stall for 10+ seconds. Cloud GPU contention.

Should You Use This in Production

No. Not yet. This setup is for experimentation. If you depend on Claude Code for real work, stay on Claude (Sonnet or Opus) for now. The router approach is great for trying open models and seeing how they perform on your codebase, but it is not production-ready for serious projects.

Use this when you want to:

Compare model quality on your real code
Test prompt strategies on a different model architecture
Reduce cost for high-volume non-critical work
Experiment with open-source alternatives