
Kimi K2.5 Technical Report [pdf]

176 points | 7 hours ago | github.com
zeroxfe 5 hours ago

I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good as Opus.

armcat 5 hours ago

Out of curiosity, what kind of specs do you have (GPU/RAM)? I saw the requirements and they're beyond my budget, so I am "stuck" with smaller Qwen coders.

zeroxfe 4 hours ago

I'm not running it locally (it's gigantic!). I'm using the API at https://platform.moonshot.ai

BeetleB 4 hours ago

Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).

https://z.ai/subscribe

InsideOutSanta 4 hours ago

There's no comparison. GLM 4.7 is fine and reasonably competent at writing code, but K2.5 is right up there with something like Sonnet 4.5. It's the first time I can use an open-source model and not immediately tell the difference between it and top-end models from Anthropic and OpenAI.

zeroxfe 4 hours ago

It's waaay better than GLM 4.7 (which was the open model I was using earlier)! Kimi was able to quickly and smoothly finish some very complex tasks that GLM completely choked on.

segmondy 3 hours ago

The old Kimi K2 is better than GLM 4.7.

rc1 3 hours ago

How long until this can be run on consumer-grade hardware, or on a domestic electricity supply, I wonder.

Anyone have a projection?

heliumtera 3 hours ago

You need ~600GB of VRAM + RAM (+ disk) to fit the full model, or ~240GB for the 1-bit quantized version. Of course this will be slow.

Through the Moonshot API it is pretty fast (much, much faster than Gemini 3 Pro and Claude Sonnet, probably faster than Gemini Flash), though. To get a similar experience locally, they say you need at least 4xH200.

If you don't mind running it super slow, you still need around 600GB of combined VRAM + fast RAM.

It's already possible to run 4xH200 in a domestic environment (it would be near-instantaneous for most tasks, unbelievable speed). It's just very, very expensive and probably challenging for most users, though manageable/easy for the average Hacker News crowd.

Expensive AND high-end GPUs are hard to source. If you manage to source them at the old prices, figure around $200k to get maximum speed, I guess; you could probably run it decently (but slowly) on a bunch of high-end machines for, let's say, $40k.
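
As a rough sanity check on those GPU counts, here's a back-of-the-envelope sketch (the 630GB figure is from the release notes quoted elsewhere in the thread; the 20% KV-cache/activation overhead is just my assumption, tune it for your context lengths):

    import math

    # Smallest GPU count whose combined memory fits weights plus overhead.
    # 630 GB = full Kimi K2.5 checkpoint; the overhead factor is a guess
    # covering KV cache and activations.
    def gpus_needed(model_gb: float, gpu_gb: float, overhead: float = 0.20) -> int:
        return math.ceil(model_gb * (1 + overhead) / gpu_gb)

    print(gpus_needed(630, 141))  # H200, 141 GB each -> 6
    print(gpus_needed(630, 96))   # RTX 6000 Blackwell, 96 GB each -> 8

Note this lands a bit above the oft-quoted "4x H200", which presumably assumes a lighter quant or tighter overhead.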

Carrok 5 hours ago

Not OP, but OpenCode and DeepInfra seem like an easy way.

tgrowazay 4 hours ago

Just pick up any >240GB VRAM GPU off the shelf at your local Best Buy to run a quantized version.

> The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.

CamperBob2 2 hours ago

You could run the full, unquantized model at high speed with 8 RTX 6000 Blackwell boards.

I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.

thesurlydev 5 hours ago

Can you share how you're running it?

eknkc 4 hours ago

I've been using it with OpenCode. You can use your Kimi Code subscription (flat fee), a moonshot.ai API key (per token), or OpenRouter to access it. OpenCode works beautifully with the model.

Edit: as a side note, I only installed OpenCode to try this model, and I gotta say it's pretty good. Did not think it'd be as good as Claude Code, but it's just fine. Been using it with Codex too.
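
For anyone who wants to poke at the per-token API before wiring up a harness: Moonshot's endpoint is OpenAI-compatible. A minimal sketch; the base URL and "kimi-k2.5" model id are from memory of their docs, so verify both on platform.moonshot.ai:

    import os
    from openai import OpenAI

    # Moonshot exposes an OpenAI-compatible chat completions API.
    client = OpenAI(
        api_key=os.environ["MOONSHOT_API_KEY"],
        base_url="https://api.moonshot.ai/v1",  # assumption -- check the docs
    )

    resp = client.chat.completions.create(
        model="kimi-k2.5",  # assumption -- list models to get the exact id
        messages=[{"role": "user", "content": "Summarize this repo's README."}],
    )
    print(resp.choices[0].message.content)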

Imustaskforhelp 4 hours ago

I tried to use OpenCode for Kimi K2.5 too, but recently they changed their pricing from 200 tool requests per 5 hours to token-based pricing.

I can only speak to the tool-request-based pricing, but anecdotally OpenCode took like 10 requests in 3-4 minutes where Kimi CLI took 2-3.

So I personally like/stick with the Kimi CLI for Kimi coding. I haven't tested it out again with the new token-based pricing, but I do think OpenCode might burn more tokens.

Kimi CLI's pretty good too, imo. You should check it out!

https://github.com/MoonshotAI/kimi-cli

nl 52 minutes ago

I like Kimi CLI, but it does leak memory.

I was using it for multi-hour tasks scripted via a self-written orchestrator on a small VM and ended up switching away from it, because it would run slower and slower over time.

zeroxfe 4 hours ago

Running it via https://platform.moonshot.ai -- using OpenCode. They have super cheap monthly plans at kimi.com too, but I'm not using them because I already have Codex and Claude monthly plans.

esafak 3 hours ago

Where? https://www.kimi.com/code starts at $19/month, which is the same as the big boys.

UncleOxidant 4 hours ago

So there's a free plan at moonshot.ai that gives you some number of tokens without paying?

explorigin 5 hours ago
KolmogorovComp 4 hours ago

To save everyone a click

> The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs. If the model fits, you will get >40 tokens/s when using a B200. To run the model in near full precision, you can use the 4-bit or 5-bit quants, or anything higher just to be safe. For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you're below that, it'll work but speed will drop (llama.cpp can still run via mmap/disk offload) and may fall from ~10 tokens/s to <2 tokens/s. We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it'll still work, just slower due to offloading.
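
That last rule of thumb is easy to check for your own box. A small sketch using the sizes quoted above (the ~240GB figure for the 1.8-bit quant is inferred from their unified-memory guidance, and the speed labels just echo their ~10 vs <2 tokens/s bands, not measurements):

    # "RAM + VRAM ~= quant size" fit check, sizes from the quote above.
    QUANTS_GB = {"UD-TQ1_0 (~1.8-bit)": 240, "UD-Q2_K_XL": 375, "full": 630}

    def fit_report(ram_gb: float, vram_gb: float) -> None:
        budget = ram_gb + vram_gb
        for name, size in QUANTS_GB.items():
            ok = budget >= size
            verdict = "fits -> ~10+ tok/s" if ok else "offloads -> may drop <2 tok/s"
            print(f"{name:20s} {size:3d} GB: {verdict}")

    fit_report(ram_gb=256, vram_gb=24)  # the single-24GB-GPU case above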

gigatexal 5 hours ago

Yeah, I too am curious. Claude Code is so good, and the ecosystem so "it just works", that I'm willing to pay them.

epolanski 4 hours ago

You can plug another model in place of the Anthropic ones in Claude Code.

miroljub 2 hours ago

If you don't use Anthropic models, there's no reason to use Claude Code at all. OpenCode gives you so much more choice.

Imustaskforhelp 4 hours ago

I tried Kimi K2.5, and at first I didn't really like it. I was critical of it, but then I started liking it. The model has also kind of replaced how I use ChatGPT, and I really love Kimi K2.5 the most right now (although Gemini models come close too).

To be honest, I do feel like Kimi K2.5 is the best open-source model. It's not the best model overall right now, but it's really price-performant, and for many use cases it might be the right choice.

It might not be the complete SOTA that people say, but it comes pretty close, and it's open source. I trust the open-source part because other providers can also run it, among other things (also considering that, iirc, ChatGPT recently slashed some old models).

I really appreciate Kimi for still open-sourcing their complete SOTA and releasing research papers on top of it, unlike Qwen, which has closed-sourced its complete SOTA.

Thank you Kimi!

storus 35 minutes ago

Do I need two M3 Ultra 512GB Mac Studios to run this?

Imanari 43 minutes ago

I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.

zzleeper 1 hour ago

Do any of these models do well with information retrieval and reasoning from text?

I'm reading newspaper articles through a MoE of gemini3flash and gpt5mini, and what made it hard to use open models (at the time) was a lack of support for pydantic.
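
If "support for pydantic" means structured outputs, the portable pattern is to ask for JSON and validate client-side, which works with any OpenAI-compatible endpoint regardless of native support. A sketch; the Article schema and model id here are made up for illustration:

    from openai import OpenAI
    from pydantic import BaseModel

    # Illustrative schema for newspaper-article extraction.
    class Article(BaseModel):
        headline: str
        people: list[str]
        sentiment: str  # e.g. "positive" / "negative" / "neutral"

    client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="...")
    resp = client.chat.completions.create(
        model="kimi-k2.5",  # assumption -- use whatever id the provider lists
        response_format={"type": "json_object"},  # if the provider supports JSON mode
        messages=[
            {"role": "system",
             "content": f"Return JSON matching this schema: {Article.model_json_schema()}"},
            {"role": "user", "content": "<article text here>"},
        ],
    )
    # Validation happens client-side, so pydantic works even without
    # native structured-output support on the server.
    article = Article.model_validate_json(resp.choices[0].message.content)
    print(article.people)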

jychang 1 hour ago

That roughly correlates with tool calling capabilities. Kimi K2.5 is a lot better than previous open source models in that regard.

You should try out K2.5 for your use case, it might actually succeed where previous generation open source models failed.

derac 4 hours ago

I really like the agent swarm thing. Is it possible to use that functionality with OpenCode, or is it a Kimi CLI-specific thing? Does the agent need to be aware of the capability?

zeroxfe 4 hours ago

It seems to work with OpenCode, but I can't tell exactly what's going on -- I was super impressed when OpenCode presented me with a UI to switch the view between different sub-agents. I don't know if OpenCode is aware of the capability, or if the model is just really good at telling the harness how to spawn sub-agents or execute parallel tool calls.

esafak 3 hours ago

Has anyone tried it and decided it's worth the cost? I've heard it's even more profligate with tokens.

syndacks 46 minutes ago

How do people evaluate creative writing and emotional intelligence in LLMs? Most benchmarks seem to focus on reasoning or correctness, which feels orthogonal. I've been playing with Kimi K2.5 and it feels much stronger on voice and emotional grounding, but I don't know how to measure that beyond human judgment.

miroljub 4 hours ago

I've been quite satisfied lately with MiniMax M-2.1 in OpenCode.

How does Kimi K2.5 compare to it in real-world scenarios?

viraptor 4 hours ago

A lot better, in my experience. M2.1 to me feels between Haiku and Sonnet; K2.5 feels close to Opus. That's based on my testing of removing some code and getting the model to reimplement it based on the tests. The design/spec writing feels great too. You can still test K2.5 for free in OpenCode today.

miroljub 4 hours ago

Well, MiniMax was the equivalent of Sonnet in my testing. If Kimi approaches Opus, that would be great.

samtheprogram 3 hours ago

Kimi K2.5 approaches Sonnet as well from what I can tell; it's just slower to get to the result.

epolanski 4 hours ago

It's interesting to note that OpenAI is valued almost 400 times more than Moonshot AI, despite their models being surprisingly close.

famouswaffles 2 hours ago

OpenAI is a household name with nearly a billion weekly active users. Not sure there's any reality where they wouldn't be valued much more than Kimi regardless of how close the models may be.

m3kw9 56 minutes ago

Unless they can beat their capabilities by a clear, magical step up and have the infrastructure to capture the users.

moffkalast 4 hours ago

Well, to play devil's advocate: one is a household name that holds most of the world's silicon wafers for ransom, and the other sounds like a crypto scam. Also, estimating the valuation of Chinese companies is sort of nonsense when they're all effectively state-owned.

epolanski 2 hours ago

There isn't a single percent of Moonshot AI that is state-owned.

And don't get me started with the "yeah, but if the PRC..." line, because it's gross when the US can de facto ban and impose conditions even on European companies, let alone the control it has over US ones.

firasd 3 hours ago

I'm not very familiar with this model series, but I tried it when prototyping a potential 'Vibesbench' (my eval) conversation. Here's Kimi on a RoboCop scene:

kimi k2 thinking turbo

"He’s not being poetic; he’s weaponizing nostalgia. Quoting Casablanca—the ultimate “beautiful sacrifice” romance—to taunt Morton about corporate backstabbing is pure executive sleaze. It’s the same move as his mahogany desk and flag pin: draping sentimental Americana over naked power. He’s essentially saying, “Isn’t it romantic, Bob? You fought the good fight, now you die.” "

grok 4.1

"So when Dick Jones smirks and says, “Time goes by, it’s the same old story… the fight for love and glory, huh, Bob?” that’s pure Ed Neumeier flexing his love of old Hollywood while simultaneously mocking how these psychopathic ’80s execs think they’re in some grand cinematic drama instead of just murdering a guy for market share."

chatgpt-4o

"The reason the Casablanca reference matters is RoboCop itself is partially a noir—and also a tragedy built inside a genre engine. [… ] So yeah, calling back to Casablanca is not just some meta-joke—it’s a knowing wink that RoboCop is a noir trapped in cybernetic skin. // Even Clarence Boddicker, for all his cartoonish sadism, plays like a twisted stand-in for a noir heavy—talking smooth, dropping nicknames, throwing out casual violence like it’s a punchline."

opus 4.5

"Jones is performing avuncular menace. He's the kind of executive who'd quote old movies while having you murdered, because it's all performance to him, all style. The substance is just power."

margorczynski 5 hours ago

I wonder how K2.5 + OpenCode compares to Opus with CC. If it's close, I would let go of my subscription, as would a lot of people, probably.

eknkc 4 hours ago

It is not Opus. It is good, works really fast, and is surprisingly thorough about its decisions. However, I've seen it hallucinate things.

Just today I asked for a code review and it flagged a method that could be `static`. The problem is, it was already static. That kind of stuff never happens with Opus 4.5, as far as I can tell.

Also, in OpenCode's Plan mode (read-only), it generated a plan and, instead of presenting it and stopping, decided to implement it. It could not use the edit and write tools because the harness was in read-only mode, but it had bash and started using bash to edit stuff. It wouldn't just fucking stop, even though the error messages it received from OpenCode stated why. Its plan and the resulting code were OK, so I let it go crazy though...

esafak 3 hours ago

Some models have a mind of their own. I keep them on a leash with `permission` blocks in OC -- especially for rm/mv/git.

naragon 4 hours ago

I've been using K2.5 with OpenCode to do code assessments/fixes and Opus 4.5 with CC to check the work, and so far, so good. Very impressed with it, but I don't feel comfortable canceling my Claude subscription just yet. Haven't tried it on large feature implementations.

ithkuil 4 hours ago

I also wonder if CC can be used with K2.5 via an appropriate API adapter.

cmrdporcupine 1 hour ago

DeepSeek is likely to release a new model soon, and judging from the past, it's likely to be more cost-effective and just as powerful as Kimi K2.5, if not more.

DeepSeek 3.2 was already quite compelling. I expect its successor will be competitive.

llmslave 4 hours ago

The benchmarks on all these models are meaningless

alchemist1e9 4 hours ago

Why, and what would a good benchmark look like?

moffkalast 3 hours ago

Thirty people trying out all the models on the list for their own use cases for a week, then checking what they're still using a month later.

gedy 3 hours ago

Sorry if this is an easily-answered question, but by "open", do you mean we can download this and use it totally offline, now or in the future, if we have capable hardware? Seems like a great thing to archive if the world falls apart (said half-jokingly).

Tepix 53 minutes ago

You could buy five Strix Halo systems at $2000 each, network them, and run it.

Rough estimate: 12.5:2.2, so you should get around 5.5 tokens/s.
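
For anyone wondering where estimates like that come from: decode on a memory-bound rig is roughly bandwidth divided by bytes streamed per token. A sketch with my own assumptions (Strix Halo at ~256 GB/s, ~32B active parameters carried over from K2's architecture, ~2.2 bits/param for a low quant):

    # Bandwidth-bound decode estimate: each generated token streams the
    # active expert weights through memory at least once.
    def decode_tok_s(bw_gb_s: float, active_params_b: float,
                     bits_per_param: float, efficiency: float = 0.5) -> float:
        gb_per_token = active_params_b * bits_per_param / 8
        return efficiency * bw_gb_s / gb_per_token

    # ~14.5 tok/s on one box under these assumptions; cross-node traffic
    # in a five-box cluster would pull this down toward the quoted ~5.5.
    print(f"{decode_tok_s(256, 32, 2.2):.1f} tok/s")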

j-bos 44 minutes ago

Are the software/drivers for networking LLMs across Strix Halo boxes there yet? I was under the impression a few weeks ago that they're at a veeeery early stage and terribly slow.

fragmede 40 minutes ago

Yes, but the hardware is >$100k, so hopefully your bunkermates are rich.

Carrok 2 hours ago

Yes.

cmrdporcupine 2 hours ago

Yes, but you'll need some pretty massive hardware.

behnamoh 4 hours ago

It's a decent model, but it works best with the Kimi CLI, not CC or others.

alansaber 4 hours ago

Why do you think that is?

chillacy 4 hours ago

I heard it's because the labs fine-tune their models for their own harness. Same reason Claude does better in Claude Code than in Cursor.

segmondy 3 hours ago

Read the tech report.