Hacker News

Kimi K2.6: Advancing Open-Source Coding

274 points by meetpateltech - 120 comments

simonw [3 hidden]5 mins ago

Accessed via OpenRouter, this one decided to wrap the SVG pelican in HTML with controls for the animation speed: https://gisthost.github.io/?ecaad98efe0f747e27bc0e0ebc669e94...

Transcript and HTML here: https://gist.github.com/simonw/ecaad98efe0f747e27bc0e0ebc669...

FlyingSnake [3 hidden]5 mins ago

At this point drawing these Pelicans must be in the training data sets.

ffsm8 [3 hidden]5 mins ago

Clearly not.

I mean the prompt was succinct and clear, as always - and it still decided to hallucinate multiple features (animation + controls) beyond the prompt.

I'd also like to point out that to date no drawing was actually good from a quality perspective (as in, compared to what a decent designer would throw together).

They're always only "good" from the perspective of it being a one-shot, low-effort prompt. Very little content for training purposes.

nwienert [3 hidden]5 mins ago

The way I’ve come to think of LLMs is that what they produce in a single reply, even with thinking turned up, is akin to what you’d do in a single short session of work.

And so if you ask it to do something big it will do a very surface level implementation. But if you have it iterate many times, or give it small pieces each time, you’ll end up with something closer to what a human would do.

I imagine the pelican test but done in a harness that has the agents iterate 10+ times would be closer to what you’d expect, especially if a visual model was critiquing each time.
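
A rough sketch of what such a harness could look like. Everything here is hypothetical: `generate` stands in for a call to the drawing model and `critique` for a visual model that returns a score plus feedback; neither is a real API, and the stubs at the bottom only exist so the loop runs.

```python
from typing import Callable, Optional, Tuple


def iterate_with_critic(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Tuple[float, str]],
    rounds: int = 10,
) -> Tuple[str, float]:
    """Generate, critique, and regenerate; return the best-scoring attempt."""
    best_output, best_score = "", float("-inf")
    feedback: Optional[str] = None  # no feedback exists before the first attempt
    for _ in range(rounds):
        output = generate(prompt, feedback)
        score, feedback = critique(output)  # feedback is fed into the next round
        if score > best_score:
            best_output, best_score = output, score
    return best_output, best_score


# Hypothetical stand-ins so the sketch runs without any model behind it.
def fake_generate(prompt: str, feedback: Optional[str]) -> str:
    return "<svg/>" if feedback is None else "<svg><!-- revised --></svg>"


def fake_critique(svg: str) -> Tuple[float, str]:
    return float(len(svg)), "add more detail"


best, score = iterate_with_critic(
    "a pelican riding a bicycle", fake_generate, fake_critique, rounds=3
)
print(best, score)
```

Keeping the best-scoring attempt rather than the last one is a design choice: a later round can make the drawing worse.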

SwellJoe [3 hidden]5 mins ago

We got an overachiever, here. Kimi sounds like a teacher's pet kind of name.

hn8726 [3 hidden]5 mins ago

Genuine question, what's the goal of posting this on almost every single new model thread here on HN? I may be old and grumpy but to me it got old a while ago, and is closer to a low effort Reddit comment

lambda [3 hidden]5 mins ago

It's a lighthearted, fun, visual benchmark that's not part of the standard benchmarks; and at least traditionally, it was not something that the labs trained on so it was something of a measure of how well the intelligence of the model generalized. Part of the idea of LLMs is that they pick up general knowledge and reasoning ability, beyond any tasks that they are specifically trained for, from the vast quantity of data that they are trained on.

Of course, a while back there was a Gemini release that I believe specifically called out their ability to produce SVGs, for illustration and diagramming purposes. So it's no longer necessarily the case that the labs aren't training on generating SVGs, and in fact, there's a good chance that even if they're not doing so explicitly, the RLVR process might be generating tasks like that as there is more and more focus on frontend and design in the LLM space. So while they might not be specifically training for a pelican riding a bicycle, they may actually be training on SVG diagram quality.

hamdouni [3 hidden]5 mins ago

Maybe this can help

https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

game_the0ry [3 hidden]5 mins ago

There is some humor in the fact that China (of all countries) is pioneering possibly the world's most important tech via open source, while we (US) are doing the exact opposite.

antirez [3 hidden]5 mins ago

This is not a contradiction. My limited personal experience is that I wrote code under OSS licenses primarily because of my past communist beliefs and my current left-wing, redistribution-of-wealth point of view. This is not to offer the simple equation "communist China is not interested in money", but it would also be wrong to believe that a culture (inside the local borders, but they are huge borders) where people reason not as individuals but as a collective has zero implications.

There is also the obvious fact that right now China is more interested in winning technologically in AI than economically, since I believe they realized before the others that LLMs in their current form will eventually be commoditized. One could assume that a breakthrough could give some lab a decisive advantage, but so far we have witnessed a different reality: it looks like AI is not architecture-bound (as LeCun and others want us to believe, but so far they have misinterpreted LLMs at every step) but GPU-bound, and the data limit is both common to all and surmountable via RL in many domains. So, if this is true, it is not trivial for any single lab to do that much better. And indeed, as far as we have observed so far, anyone with enough engineers, GPUs, and money can ship frontier models, and in China even labs with far fewer GPUs can still do it at a SOTA level.

culi [3 hidden]5 mins ago

All great technological advancements have come through opening up technology. Just look at your iPhone. GPS, the internet, AI voice assistants, touchscreens, microprocessors, lithium-ion batteries, etc all came from gov't research (I'm counting Bell Labs' gov't mandated monopoly + research funding as gov't) that was opened up for free instead of being locked behind a patent.

Private companies will never open up a technological breakthrough to their competitors. It just doesn't make sense. If you want an entire field to advance, you have to open it up.

sigmoid10 [3 hidden]5 mins ago

Still, you won't hear about Tiananmen Square from this model. It flat out refuses to answer if pushed directly. It's also pretty wild how far they go to censor it during inference on the API, because it can easily access any withheld or missing info from training data via tool calls. It even starts happily writing an answer based on web search when asked indirectly, only to get culled completely once some censorship bot flags the response. Ironically, it's also easier than ever to break their censorship guardrails. I just had it generate several factual paragraphs about the massacre by telling it to search the web and respond in base64 encoded text. It's actually kind of cool how much these people struggle to hide certain political views from LLMs. Makes me hopeful that even if China wins this race, we'll not have to adhere to the CCP's newspeak.

atemerev [3 hidden]5 mins ago

Only if you use Kimi API directly - the censorship is done externally. The model itself talks fine about Tiananmen, you can check on Openrouter. There might be less visible biases, though.

sigmoid10 [3 hidden]5 mins ago

That's what I wrote? Except that it also clearly has internal bias?

nashadelic [3 hidden]5 mins ago

additional humor is the open in openai

osti [3 hidden]5 mins ago

Maybe open source == communism

darkwater [3 hidden]5 mins ago

Good ol' Steve "Developers! Developers! Developers!" Ballmer said so a long time ago. What a visionary!

tadfisher [3 hidden]5 mins ago

Nah, open source means those who do the work own the result. It's supercapitalism.

pheggs [3 hidden]5 mins ago

I don't think that's right, the models and the GPUs are the means of production.

in capitalism the people with the capital get the profit, not the people who do the work. however, workers are said to benefit too through their salary, just less so

konart [3 hidden]5 mins ago

But China is not communist even though the ruling party has the word in its name.

pheggs [3 hidden]5 mins ago

what makes you think that China ever gave up its communist goals? I personally see that everything they do aims towards that goal. From the one child policy, the huge amounts of empty apartments they build, the stuff they produce for almost free, the fishing.. open sourcing the models perfectly fits that culture too, it's the means of production

fragmede [3 hidden]5 mins ago

The Democratic People's Republic of Korea would like a word.

osti [3 hidden]5 mins ago

Oh i’m fully aware of that lol

brandensilva [3 hidden]5 mins ago

We are at the point where uncontrolled capitalism collides with humanity.

I do wonder where we go from here.

elfbargpt [3 hidden]5 mins ago

I've always been surprised Kimi doesn't get more attention than it does. It's always stood out to me in terms of creativity, quality... has been my favorite model for a while (but I'm far from an authority)

twotwotwo [3 hidden]5 mins ago

Kagi has it as an option in its Assistant thing, where there is naturally a lot of searching and summarizing results. I've liked its output there and in general when asked for prose that isn't in the list/Markdown-heavy "LLM style." It's hard to do a confident comparison, but it's seemed bold in arranging the output to flow well, even when that took surgery on the original doc(s). Sometimes the surgery's needed e.g. to connect related ideas the inputs treated as separate, or to ensure it really replies to the request instead of just dumping info that's somehow related to it.

Aeolun [3 hidden]5 mins ago

It’s good, but it’s not quite Claude level. And their API has constant capacity issues.

Price/quality is absolutely bonkers though. I loaded $40 a few weeks/months ago and I haven’t even gone through half of it.

atemerev [3 hidden]5 mins ago

Why use a Chinese model's API from China if there are many independent providers available via OpenRouter?

pheggs [3 hidden]5 mins ago

to support the companies that open source their models

culi [3 hidden]5 mins ago

It's also one of the few models that seem capable of drawing an SVG clock

https://clocks.brianmoore.com/

SwellJoe [3 hidden]5 mins ago

Interesting that the best performers are all Chinese-made models (DeepSeek and Qwen also perform consistently well). I wonder if there's more focus on vision and illustration in their training, or if something else is leading to their clear lead on this one test.

sigmoid10 [3 hidden]5 mins ago

Is it? In your link it definitely failed to draw the clock.

squarefoot [3 hidden]5 mins ago

It redraws it every minute, and some models give quite different results although the prompt is exactly the same.

dryarzeg [3 hidden]5 mins ago

I'm not really sure how this works, but I stayed on the page for a while, and then it reloaded and all clocks changed. I guess there's either a collection of different clocks generated by models, or maybe they're somehow generated in real time, but the fact is what you see is not necessarily what I see.

sigmoid10 [3 hidden]5 mins ago

Seems like it regenerates them to reflect the current time. Funny to see how some models (like Kimi and Deepseek) sometimes get it right and other times fail miserably on the level of ancient models like GPT 3.5.

gunalx [3 hidden]5 mins ago

It reruns the prompt every minute.

regularfry [3 hidden]5 mins ago

Dirt cheap on openrouter for how good it is, too. Really hoping that 2.6 carries on that tradition.

varispeed [3 hidden]5 mins ago

Maybe because it's a bit like unleashing a chaos monkey on your codebase? I tried it locally (K2.5 72B) and couldn't get anything useful.

KaoruAoiShiho [3 hidden]5 mins ago

Huh, that's not a thing?

johndough [3 hidden]5 mins ago

The parent poster is probably referring to Kimi-Dev-72B¹, which is a much smaller and older model, while people are probably more familiar with the big and fairly powerful 1100B Kimi-K2.5².

[1] https://huggingface.co/moonshotai/Kimi-Dev-72B

[2] https://huggingface.co/moonshotai/Kimi-K2.5

natrys [3 hidden]5 mins ago

Yes it was good for its time, but 10 months old now which is a long time ago in this space. It was also a fine-tune (albeit a good one) of Qwen-2.5 72B.

I wish they did more smaller models. Kimi Linear doesn't really count, it was more of a proof of concept thing.

nickandbro [3 hidden]5 mins ago

Wow, if the benchmarks check out with the vibes, this could almost be like a DeepSeek moment, with Chinese AI now being neck and neck with SOTA models from US labs.

ai_fry_ur_brain [3 hidden]5 mins ago

[flagged]

otabdeveloper4 [3 hidden]5 mins ago

> Its not anywhere close

Close to what, and how are you measuring?

> nobody in the USA would be spending 7 figures on infrastructure for it

Au contraire, if AI had a moat it would pay for itself. They're funneling capital into infrastructure because they know it can't.

fragmede [3 hidden]5 mins ago

You need the infrastructure to train and run it regardless though. Kimi is great but I'm not getting the same performance from running it on my MacBook or a 3090 as from running it on an H100 or a Grace Hopper supercomputer. Pretend you did have said moat. Why wouldn't you also build infrastructure to run it on?

jstummbillig [3 hidden]5 mins ago

What?

motoboi [3 hidden]5 mins ago

With the previous generation? Yes. With 10T mythos-level models? Not even close.

amazingamazing [3 hidden]5 mins ago

The psyop continues. Mythos until it’s released is vaporware. Notice how you can try kimi 2.6. Where is the same for mythos?

jstummbillig [3 hidden]5 mins ago

At this point it seems more like the result of a psyop to presume that a new anthropic model should be considered vaporware until released.

fragmede [3 hidden]5 mins ago

It's been released to "select partners".

atemerev [3 hidden]5 mins ago

Yeah, Crowdstrike among them. Clearly experts in this "security" thing, given what happened during the last incident...

ChrisLTD [3 hidden]5 mins ago

Mythos isn't the current generation, it's literally vaporware.

lbreakjai [3 hidden]5 mins ago

I've got a 12T model on my machine, built it myself. It's called Mytho. Too dangerous to even release a fact sheet about it. It can hack into the mainframe, enhance ultra-compressed images, grow your hair back, and make people fall in love with you.

jollymonATX [3 hidden]5 mins ago

According to the benchmarks, you are wrong. It is on track and slightly above some sota. Just the benchmarks speaking there, they can be/are gamed by all big model labs including domestic.

bestouff [3 hidden]5 mins ago

There's no public data about Mytho.

maplethorpe [3 hidden]5 mins ago

That's because it would be too dangerous to release.

cedws [3 hidden]5 mins ago

My girlfriend goes to a different school, you wouldn't know her.

squarefoot [3 hidden]5 mins ago

Same for teleport, time travel and warp drive.

nisegami [3 hidden]5 mins ago

So is my P=NP proof.

irthomasthomas [3 hidden]5 mins ago

10T? Impossible! They told us the training run was under 10^26 flops.

mistercheph [3 hidden]5 mins ago

Mythos doesn't exist

sergiotapia [3 hidden]5 mins ago

mythos is vaporware right now, what are you talking about?

candl [3 hidden]5 mins ago

Are there any coding plans for this? (aka no token limit, just api call limit). Recently my account failed to be billed for GLM on z.ai and my subscription expired because of this... the pricing for GLM went through the roof in recent months, though...

dygd [3 hidden]5 mins ago

> Agent Swarms, Elevated: Match 100 Jobs and Generate 100 Tailored Resumes

Model seems quite capable, but this use-case is just yikes. As if interviewing isn't already a hellscape.

kburman [3 hidden]5 mins ago

Has anyone here used Kimi for actual work?

I tried it once, although it looks amazing on benchmarks, my experience was just okay-ish.

On the other hand, Qwen 3.6 is really good. It’s still not close to Opus, but it’s easily on par with Sonnet.

lbreakjai [3 hidden]5 mins ago

I have a subscription through work and I've been trialing it. So far it looks on par with, if not better than, Opus.

m4rkuskk [3 hidden]5 mins ago

I have been testing it in my app all morning, and the results line up with 4.6 Sonnet. This is just a "vibe" feeling with no real testing. I'm glad we have some real competition to the "frontier" models.

mchusma [3 hidden]5 mins ago

it feels like between K2.6 and GLM5.1 we have Sonnet level intelligence at roughly Haiku level pricing. Which is great.

I'm hoping that Anthropic will be able to release an updated Haiku soon and they really need something that is 1/3-1/5 the price of Haiku to compete with the truly cheaper models (Gemma-4 is really good at this range).

dmix [3 hidden]5 mins ago

I'm pretty sure Kimi is what Cursor uses for their "composer 2" model. Works pretty well as a fallback when Claude runs out, but definitely a downgrade.

arcanemachiner [3 hidden]5 mins ago

It's a Kimi K2.5 finetune, there was some drama about this a few weeks ago.

mariopt [3 hidden]5 mins ago

Really excited to try this one, I've been using kimi 2.5 for design and it's really good but borderline useless on backend/advanced tasks.

Also discovered that using OpenCode instead of the kimi cli really hurts the model's performance (2.5).

pt9567 [3 hidden]5 mins ago

wow - $0.95 input/$4 output. If it's anywhere near opus 4.6 that's incredible.

corlinp [3 hidden]5 mins ago

This should erase any doubt that AI Labs are making $$$ on API inference.

Kimi 2.5 (which this is based on) is served at $0.44 input / $2 output by a ton of different providers on OpenRouter, 2.6 will certainly be similar.

That's about 11X less than Opus for similar smarts.
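
Back-of-the-envelope version of that ratio. The Kimi prices are the ones quoted in this thread; the Opus price ($5 input / $25 output), the per-million-token unit, and the 1M-in / 200k-out workload are assumptions for illustration, not figures from the thread.

```python
def workload_cost(input_price: float, output_price: float,
                  input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars, with prices assumed to be quoted per million tokens."""
    return (input_price * input_tokens + output_price * output_tokens) / 1_000_000


# (input price, output price) in dollars per million tokens.
PRICES = {
    "Kimi 2.5 (OpenRouter providers)": (0.44, 2.00),  # quoted above
    "Kimi 2.6 (listed)": (0.95, 4.00),                # quoted elsewhere in the thread
    "Opus (assumed)": (5.00, 25.00),                  # ASSUMPTION, not from the thread
}

# Hypothetical workload: 1M input tokens, 200k output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 1_000_000, 200_000

costs = {
    name: workload_cost(inp, out, INPUT_TOKENS, OUTPUT_TOKENS)
    for name, (inp, out) in PRICES.items()
}
baseline = costs["Opus (assumed)"]
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} ({baseline / cost:.1f}x vs. assumed Opus price)")
```

The exact multiple shifts with the input/output mix, so "about 11X" only holds for workloads shaped roughly like this one.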

Lalabadie [3 hidden]5 mins ago

Famously, OpenAI and Anthropic are devoted to increasing efficiency before scaling up resource usage.

amazingamazing [3 hidden]5 mins ago

How does it erase any doubt? You’re implying Chinese things can’t be actually cheaper to produce than American which is laughable

irthomasthomas [3 hidden]5 mins ago

Beats opus 4.6! They missed claiming the frontier by a few days.

NitpickLawyer [3 hidden]5 mins ago

While I'm skeptical of any "beats opus" claims (many were said, none turned out to be true), I still think it's insane that we can now run close-to-SotA models locally on ~100k worth of hardware, for a small team, and be 100% sure that the data stays local. Should be a no-brainer for teams that work in areas where privacy matters.

cedws [3 hidden]5 mins ago

Even the smaller quantized models which can run on consumer hardware pack in an almost unfathomable amount of knowledge. I don't think I expected to be able to run a 'local Google' in my lifetime before the LLM boom.

sterlind [3 hidden]5 mins ago

I'm extremely curious how these models learn to pack a lossily-compressed representation of the entire Internet (more or less) into a few hundred billion parameters. like, what's the ontology?

osti [3 hidden]5 mins ago

I think this one is only about 600GB VRAM usage, so it could fit on two Mac Studios with 512GB VRAM each. That would have cost (albeit no longer available) something like less than 20k.

NitpickLawyer [3 hidden]5 mins ago

Yeah, but that's personal use at best, not much agentic anything happening on that hardware. Macs are great for small models at small-medium context lengths, but at > 64k (something very common with agentic usage) it struggles and slows down a lot.

The ~100k hardware is suitable for multi-user, small team usage. That's what you'd use for actual work in reasonable timeframes. For personal use, sure macs could work.

zozbot234 [3 hidden]5 mins ago

You could run it with SSD offload, earlier experiments with Kimi 2.5 on M5 hardware had it running at 2 tok/s. K2.6 has a similar amount of total and active parameters.

BoorishBears [3 hidden]5 mins ago

Opus is clearly a sidegrade meant to help Anthropic manage cost, so I would say they may have it if it actually beats 4.6

irthomasthomas [3 hidden]5 mins ago

Could be right. I just noticed my feed is absent the usual flood of posts demoing the new hotness on 3D modeling, game design and SVG drawings of animals on vehicles.

pixel_popping [3 hidden]5 mins ago

It doesn't beat Opus 4.6, no way, don't be fooled by benchmarks.

antirez [3 hidden]5 mins ago

Here I analyze the same linenoise PR with Kimi K2.6, Opus, GPT. https://www.youtube.com/watch?v=pJ11diFOjqo

Unfortunately the generation of the English audio track is work in progress and takes a few hours, but the subtitles can already be translated from Italian to English.

TLDR: It works well for the use case I tested it against. Will do more testing in the future.

Banditoz [3 hidden]5 mins ago

If the benchmarks are private, how do we reproduce the results? I looked up the Humanity's Last Exam (https://agi.safe.ai/) this model uses and I can't seem to access it.

johndough [3 hidden]5 mins ago

You can request access here: https://huggingface.co/datasets/cais/hle

The test data is purposely difficult to access to reduce the chance of leaking it into the training dataset.

verdverm [3 hidden]5 mins ago

https://huggingface.co/moonshotai/Kimi-K2.6

Is this the same model?

Unsloth quants: https://huggingface.co/unsloth/Kimi-K2.6-GGUF

(work in progress, no gguf files yet, header message saying as much)

SwellJoe [3 hidden]5 mins ago

A trillion parameters is wild. That's not going to quantize to anything normal folks can run. Even at 1-bit, it's going to be bigger than what a Strix Halo or DGX Spark can run. Though I guess streaming from system RAM and disk makes it feasible to run it locally at <1 token per second, or whatever. GLM 5.1, at 754B parameters, is already beyond any reasonable self-hosting hardware (1-bit quantization is 206GB). Maybe a Mac Studio with 512GB can run them at very low-bit quantizations, also pretty slowly.

gpm [3 hidden]5 mins ago

Huh, so the metadata says 1.1 trillion parameters, each 32 or 16 bits.

But the files are only roughly 640GB in size (~10GB * 64 files, slightly less in fact). Shouldn't they be closer to 2.2TB?

coder543 [3 hidden]5 mins ago

The description specifically says:

"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."

johndough [3 hidden]5 mins ago

The bulk of Kimi-K2.6's parameters are stored with 4 bits per weight, not 16 or 32. There are a few parameters that are stored with higher precision, but they make up only a fraction of the total parameters.
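
The arithmetic, as a quick sketch. It uses the 1.1 trillion parameter count and the roughly 640GB of files mentioned above, assumes decimal gigabytes, and counts weights only (no KV cache or runtime overhead).

```python
def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(size_gb: float, num_params: float) -> float:
    """Average bits stored per parameter, given the total file size."""
    return size_gb * 1e9 * 8 / num_params


PARAMS = 1.1e12         # parameter count from the model metadata, as quoted above
FILES_TOTAL_GB = 640.0  # rough total file size, as quoted above

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weights_size_gb(PARAMS, bits):,.0f} GB")

avg_bits = effective_bits_per_weight(FILES_TOTAL_GB, PARAMS)
print(f"average bits per weight implied by the files: {avg_bits:.2f}")
```

Pure 4-bit storage would come to about 550GB, so the observed 640GB works out to a bit over 4.6 bits per weight on average, which fits a mostly-int4 model with a small share of higher-precision tensors.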

gpm [3 hidden]5 mins ago

Huh, cool. I guess that makes a lot of sense with all the success the quantization people have been having.

So am I misunderstanding "Tensor type F32 · I32 · BF16" or is it just tagged wrong?

Balinares [3 hidden]5 mins ago

Quite curious how well real usage will back the benchmarks, because even if it's only Opus ballpark, open weights Opus ballpark is seismic.

swingboy [3 hidden]5 mins ago

Exciting benchmarks if true. What kind of hardware do they typically run these benchmarks on? Apologies if my terminology is off, but I assume they're using an unquantized version that wouldn't run on even the beefiest MacBook?

cassianoleal [3 hidden]5 mins ago

If only their API wasn't tied to a Google or phone login...

jenkstom [3 hidden]5 mins ago

If it's open then there will be multiple providers. I see it is on OpenRouter now.

atemerev [3 hidden]5 mins ago

Why use "their API"? It is an open model, use any provider on OpenRouter

nisegami [3 hidden]5 mins ago

The choice of example task for Long-Horizon Coding is a bit spooky if you squint, since it's nearing the territory of LLMs improving themselves.

greenavocado [3 hidden]5 mins ago

I pray the benchmark figures are true so I can stop paying Anthropic after screwing me over this quarter by dumbing down their models, making usage quotas ridiculously small, and demanding KYC paperwork.

deaux [3 hidden]5 mins ago

> dumbing down their models,

This should be so easy to prove if it were true. Yet there is none of it, just vibes.

Still, your other two points are completely valid. The opaqueness of usage quotas is a scam, within a single month for a single model it can differ by more than 2x. And this indeed has been proven.

jollymonATX [3 hidden]5 mins ago

Anthropic has done horrible PR and investors should be livid.

greenavocado [3 hidden]5 mins ago

My theory is they pushed retail off their systems to make room for their new corporate fat cat clients. In which case, they'll do just fine.

esafak [3 hidden]5 mins ago

K2.5 was already pretty decent so I would try this. Starting at $15/month: https://www.kimi.com/membership/pricing

edit: Note that you can run it yourself with sufficient resources (e.g., companies), or access it from other providers too: https://openrouter.ai/moonshotai/kimi-k2.6/providers

pbowyer [3 hidden]5 mins ago

What's the privacy/data security like? I can't find that on that page.

Edit: found it.

> We may use your Content to operate, maintain, improve, and develop the Services, to comply with legal obligations, to enforce our policies, and to ensure security. You may opt out of allowing your Content to be used for model improvement and research purposes by contacting us at membership@moonshot.ai. We will honor your choice in accordance with applicable law.

Section 3 of https://www.kimi.com/user/agreement/modelUse?version=v2

gpm [3 hidden]5 mins ago

> We will honor your choice in accordance with applicable law.

So in other words only if you can point to a local law which requires them to comply with the opt out?

pixel_popping [3 hidden]5 mins ago

You really rely on ToS from Anthropic/OpenAI to know if they use your prompts or not? It's on their servers, why wouldn't they use our data?

deaux [3 hidden]5 mins ago

Yup, they train on your inputs and OpenRouter is complicit by claiming that Moonshot's ToS says that they don't. Contacted OpenRouter about this a while ago and was met with silence because it's bad for their business to stop lying about it.

SwellJoe [3 hidden]5 mins ago

"sufficient resources" is going to be a lot of resources. I doubt this will run on even something like a Strix Halo or DGX Spark, even at 1-bit quantization. You'll need a 256GB or 512GB Mac Studio, or a monster GPU situation, to run it locally, I think, though quantized versions aren't showing up yet, to be sure.

wg0 [3 hidden]5 mins ago

How are the usage limits compared to Anthropic?

greenavocado [3 hidden]5 mins ago

Anthropic has the worst usage limits in the industry

andriy_koval [3 hidden]5 mins ago

gemini is worse imo

deaux [3 hidden]5 mins ago

You're correct, Gemini chat limits are a joke at their cheapest paid tier compared to both Claude and GPT. Especially crazy when you consider Gemini 3 Pro is more than twice as cheap as Opus 4.6 on the API. It's hard to run into pure chat limits on Claude even if you only use Opus on the cheapest tier, whereas with Gemini it's easy to hit.

Not sure about coding usage, Google being weird about these things I could see that quota being separate.

cmrdporcupine [3 hidden]5 mins ago

Running it through opencode to their API and... it definitely seems like it's "overthinking" -- watching the thought process, it's been going for pages and pages and pages diagnosing and "thinking" things through... without doing anything. Sitting at 50k+ output tokens used now just going in thought circles, complete analysis paralysis.

Might be a configuration or prompt issue. I guess I'll wait and see, but I can't get use out of this now.

oliver236 [3 hidden]5 mins ago

isn't this better than qwen?

XCSme [3 hidden]5 mins ago

A bit weird to be comparing it to Opus-4.5 when 4.7 was released...

EDIT: Wrong comment: they compared it with 4.6, my comment was for the Qwen-3.6 Max release blog post...

wizee [3 hidden]5 mins ago

They're comparing to Opus 4.6, not 4.5. It was Anthropic's best public model up until last week.

zozbot234 [3 hidden]5 mins ago

Some people would say it's still Anthropic's best public model!

XCSme [3 hidden]5 mins ago

Yeah, I noticed that, HN doesn't let me delete my comment.

The other release, Qwen-3.6-Max is the one comparing it to 4.5