MAI-Code-1-Flash
https://microsoft.ai/models/mai-code-1-flash/
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...
511 points by EvanZhouDev - 240 comments
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.
On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2% which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, including models like Qwen 3.6 and Gemma 4.
Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/
[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...
Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.
Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
I don’t think that’s right, this flash model is 5B active params. Qwen3.6-35B-A3B is 3B so 40% smaller.
Instead, they only do cherry-picked comparisons against Anthropic's small models, and not the full spectrum of competitors.
Without evidence to the contrary, I'll interpret this as just what happens when you're late to the party and insist on doing everything from scratch.
Maybe coaxing reasoning behavior out of their base model without kickstarting it by distilling from existing models provided them with valuable experience that will help improve their future models, or maybe it was an unnecessary waste of time.
[0] Not even here: https://playground.microsoft.ai/
Yeah, not a 5B param model as the earlier title implied!
I was implementing a re-print functionality in my warehouse management system.
It took Opus 4.8 high 24m1s and 87k tokens. Took Haiku 6m30s and 41k tokens.
After that time I had to provide (minor) adjustments to both. But Haiku allowed me to iterate faster. Code quality for that somewhat trivial use case was similar.
Actually, I would even say that Opus provided a subpar solution: instead of fixing an issue where the carrier label PDF wasn't saved as the state machine progressed to the latest step, it went for a much more complex solution of regenerating the labels from scratch. Which is also wrong, as it was de facto booking the carriers twice for the same order.
Haiku simply added another field on the terminal state that carried the already generated urls.
I don't think it's a good idea to default to highest effort/bigger model without taking into account the time it takes and the task complexity.
Imho we should experiment rather than assume that what the rest of the community does is the best practice.
Yesterday Codex was making a big issue out of a new module that was upgraded in our cluster and because of which the same SSH key would be "regenerated" by Terraform. No big deal, it just truncates a newline at the end of the SSH key and it works all the same. But not being aware that this, as an example, is unimportant can cost a lot more time than using the big models saves.
https://news.ycombinator.com/formatdoc
The next question is where did the "ASCII-art" graph and table come from? Are there sites to generate these?
Just built a tool for that: https://krysoph.github.io/UnicodeData/
It is a single HTML file with no dependencies; it takes JSON data and turns it into Unicode charts.
Source: https://github.com/Krysoph/UnicodeData
And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.
GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot
I have since changed to DeepSeek Flash on high, which is Sonnet+ level for almost free.
If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.
As such, Haiku isn’t a waste of my time, it saves enormous amounts of time for me. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly, I found my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end - the dynamics of multi-agent workflows of varying capability are not a lot different from the dynamics of a 1000 engineer organization.
In my tests, openweight Qwens and GLM are way better than it.
And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.
There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).
Haiku costs $1/$5. DeepSeek V4 Flash, a stronger model, is only $0.0028/$0.14/$0.28. That first number is the cached input, and DeepSeek caching is crazy efficient. So, using DeepSeek V4 Flash costs about an order of magnitude less than Haiku and performs better.
I have a Claude subscription because I'm willing to pay a premium for the best model for coding, one that doesn't waste as much of my time doing dumb stuff. But, if I need something other than Claude Code, I'm using something other than Claude models. Why burn money for no benefit?
Oh, also, Haiku chews tokens like crazy. In my benchmarks it used three times more tokens than the next highest model. Of course, security bug hunting is not in its wheelhouse, so it's not fair to judge it based on that one thing, but if it's more expensive per token and burns a lot more tokens, it ends up being a lot more expensive.
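To make that last point concrete, here is a rough back-of-the-envelope sketch in Python. The prices are the per-million-token figures quoted above (DeepSeek's uncached input price is used, so caching would only widen the gap). The workload size and the flat 3x token multiplier are assumptions for illustration only: the 3x figure came from one security benchmark, not from a direct Haiku vs. DeepSeek comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, as quoted in the thread."""
    input: float
    output: float


def job_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job, ignoring any cache discount."""
    return (input_tokens * pricing.input + output_tokens * pricing.output) / 1_000_000


HAIKU = Pricing(input=1.00, output=5.00)
DS_FLASH = Pricing(input=0.14, output=0.28)  # uncached input price

# Hypothetical workload; these token counts are made up for the example.
BASE_INPUT_TOKENS = 400_000
BASE_OUTPUT_TOKENS = 40_000
TOKEN_MULTIPLIER = 3  # assumption: the chattier model uses 3x the tokens

# Same token counts for both models: only the per-token price differs.
haiku_same_tokens = job_cost(HAIKU, BASE_INPUT_TOKENS, BASE_OUTPUT_TOKENS)
ds_cost = job_cost(DS_FLASH, BASE_INPUT_TOKENS, BASE_OUTPUT_TOKENS)

# Higher price per token AND more tokens: the two effects multiply.
haiku_chatty = job_cost(
    HAIKU,
    BASE_INPUT_TOKENS * TOKEN_MULTIPLIER,
    BASE_OUTPUT_TOKENS * TOKEN_MULTIPLIER,
)

print(f"DeepSeek V4 Flash:       ${ds_cost:.4f}")
print(f"Haiku, same tokens:      ${haiku_same_tokens:.4f} ({haiku_same_tokens / ds_cost:.1f}x)")
print(f"Haiku, 3x the tokens:    ${haiku_chatty:.4f} ({haiku_chatty / ds_cost:.1f}x)")
```

Under these assumptions the price gap alone is a bit under 10x, and the token multiplier scales it linearly on top of that.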
Sonnet might have more knowledge and is maybe good for making excel sheets, but it does not write good code and does not follow instructions well.
But 27b Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option many people can use and DS4F is so cheap right now, if you are open to externally hosted models.
But, from what I can tell DeepSeek is better than Sonnet, though I agree it is not at the level of current Opus or GPT 5.5 (but I think it probably beats Gemini Pro 3.1). I use the best model I can for code, because the cost of weaker performance is more than the $100/month I pay for Claude Opus, but it's worth knowing there are very cheap, very good, models for stuff I want to do that isn't Claude Code.
But all in all, I don't think we disagree.
I’ve actually had luck taking the analysis from GHCP and pasting it into our M365 Copilot and getting a useful poc to stick into my bug reports.
We moved to OpenCode Go ($10/mo), so we could switch between DeepSeek v4, GLM 5.1, and Qwen 3.7 models run by providers in EU, US, & Singapore that OpenCode FAQ claims don't use retained data for training.
I find their rather verbose privacy policy is not making far-reaching guarantees about any of this though: https://opencode.ai/legal/privacy-policy
But with Copilot now just charging per-token prices I don't see how this is competitive with Chinese models.
It is probably telling you can't find the costs in the announcement. Because Input $0.75 Cached input $0.075 Output $4.50 might be competitive with Haiku, but nobody in their right mind uses Haiku and Anthropic has abandoned it chasing the tokenmaxers who aren't thinking about budgets.
So I guess they are aiming for corporate customers that are bound to Microsoft through compliance approval, that will soon start seeing their budgets explode, and that have to find some corporate compromise.
I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.
The supermajority of respondents did report that they do engage in some coding outside of working hours, for one reason or another. I'm impressed; I'm basically a zombie after hours, rarely in any shape to touch anything technical. Good for them.
But then only 19.3% of respondents ticked that they code for freelancing reasons, and only 15% said they're doing it in an attempt to bootstrap a business. These groups were the only types that suggested revenue generating after-hours activity, and they even overlap to a non-obvious-to-me extent. But even if we pretended they didn't, that adds up to like a third at best.
So when you say:
> I don’t understand how’s that not the default option for all professional developers.
that's in contradiction with this data (and imo common sense), which suggests that the supermajority of professional developers simply do not perform revenue generating software development activity outside of work hours, period. Therefore, for them, the ROI on any potential AI subscription is a flat and constant zero.
Unless you envision people working at "bring your own license" type shops, I don't know how this is supposed to make sense. These are work tools, corporate should be providing them already. But then I'm clearly not from a "wealthy" country either, so YMMV.
Since I use LLMs basically only for analysis and as a signal in bug discovery, debugging, research and general search, I don't need a very powerful model and I don't need high token counts. A $100 subscription would be entirely way too much for useful usage for me, and would border on just using tokens for the sake of using them.
Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?
I assume I'm misunderstanding you (likely my fault), because the way I read that is that you're saying nobody should currently be using models owned & hosted by companies like OpenAI and Anthropic, while clearly a huge number of people are using those in 2026 despite not owning them.
Cursor is potentially about to be acquired by X.ai (i.e. SpaceX), unless this is just some IPO game being played by Musk. They are certainly not just a token reseller since they have their own models in addition to their own vector database approach for code matching.
I personally wouldn't use models of that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.
I'm old enough to remember when IDEs could do this without needing a couple gigabytes of matrices to do it
(LLMs are great for anything even slightly more complicated ofc)
And it did just fine.
So no matter what you think about vibe coding, using AI for these slightly more complicated use cases is genuinely useful.
Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.
OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go
I am about 85% through my quota with 9 days left before refresh and have just used over 1B tokens, mostly DeepSeek V4 Pro, but also a little mimo 2.5 pro and kimi k2.6
AI is expensive and it has been heavily subsidized. If you think $20/mo flat for Codex/Claude will hold up against a more usage-based model, you're in for a shock. Especially once these companies go public and have to meet investor expectations.
90% of corporate job tasks are trivial enough that Haiku can handle them.
Just this morning I have been implementing a reprint functionality in our warehouse management system, which needed to print again carrier labels and delivery notes for a specific order.
It essentially had to do the same workflow of print, but instead of generating and uploading the pdfs, it only had to fetch and print them.
Took Opus 4.8 high 24m1 seconds and 87k tokens. Took Haiku 6m30 seconds and half the tokens.
So I'm not really sure what you mean by "wasting your expensive time" here. I think you really don't experiment with these tools and assume higher effort, bigger model => time saved, but that's true only when tasks are much bigger and complex enough that a smaller/less precise model would fail or land work of much lower quality.
TLDR:
https://artificialanalysis.ai/models?models=gemini-3-5-flash...
and: https://i.imgur.com/nTu3VCZ.png
For starters, I did experiment a heck of a lot with models since GitHub Copilot gave me access to OpenAI, Gemini and Anthropic models. So I probably experimented more than the average LLMer. When GitHub Copilot had a generous quota I quite often ran the same tasks with many models to compare them (and pursue the best solution among them).
Now about my experience with Haiku, I think it was free for some time in GitHub Copilot, then it was 0.33x quota usage (when Sonnet was 1x and Opus was 3x, good times). I tried to use it for light coding for about a week.
In my tests I concluded that there was zero reason to use 0.33x priced Haiku in my coding workload because it constantly generated subpar solutions. Even when they worked, Sonnet at 1x and Opus at 3x quota usage had a lot less tech debt on average and my plan permitted continuous Sonnet/Opus usage for my workload, otherwise I would use Gemini Flash (the old one, not this 3.5 one) which was better than Haiku by a mile.
Then GPT 5.4 came at 1x quota usage and it was competitive with Opus at 3x quota usage. So I stopped using Opus in favor of GPT and by this time there was even less reason to use Haiku on my $39/mo GitHub Copilot plan.
And now we have DeepSeek v4 which is Sonnet+ levels in my tests because it has an actual 1 million token context window and their crazy alien caching tech (https://huggingface.co/blog/deepseekv4).
I urge you to throw $5 at OpenCode Go plan for 30 days and toy around with DeepSeek Flash on high setting (not max).
Or MiMo 2.5 Pro on the same OpenCode Go plan. 2 amazing models.
In your experience, is max worse or you suggest it for less token use?
> MiMo 2.5 Pro on the same OpenCode Go
Xiaomi dropped MiMo 2.5 rates by 70%+ [0] & now it is cost competitive with DeepSeek v4 Pro. I haven't used MiMo, but since you have, do you find it to be better than DeepSeek v4? If so, for what tasks? How do you decide when to use which, if you have an intuition for it? Thanks.
[0] https://news.ycombinator.com/item?id=48282814
Yes. DS4 Flash max is incredibly chatty for minimal gain over DS4 high.
I asked the same question a month ago: https://news.ycombinator.com/item?id=47978820 and confirmed in my tests.
> ...MiMo, but since you have, do you find it to be better than DeepSeek v4?
I didn't test MiMo 2.5 enough to form a verdict, but from initial tests it is equivalent to DS4. But MiMo 2.5 (non Pro) has the advantage of having vision capability, and MiMo is now priced the same as DeepSeek v4 in the $10/mo OpenCode Go, after the discount you mentioned; see the yellow bars at https://opencode.ai/go
I'll start testing MiMo seriously next week.
With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
I’ve used GPT mini quite a bit and it’s decent.
always has been
claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.
https://code.claude.com/docs/en/model-config#opusplan-model-...
edit: you can make it work with sonnet for planning, and haiku for execution, or any other combination you fancy to work with.
https://code.claude.com/docs/en/model-config#control-the-mod...
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.
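The loop described above can be sketched roughly as follows. This is only an illustration in Python with stub agents: the `Agent` interface, the prompts, the fixed token costs, and the in-memory backlog (standing in for the improvements file) are all hypothetical, and "highest impact" is simplified to "first item recorded".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    tokens_used: int
    improvement_notes: List[str] = field(default_factory=list)


# Hypothetical agent interface: prompt in, result out.
Agent = Callable[[str], StepResult]


def run_session(worker: Agent, reviewer: Agent, improver: Agent,
                session_budget: int = 1_000_000) -> Tuple[int, int, List[str]]:
    """Run execute -> review -> self-improve cycles until the budget is spent.

    Returns (cycles, tokens_spent, unresolved_improvement_notes).
    The budget is checked once per cycle, so the last cycle may overshoot it.
    """
    spent = 0
    cycles = 0
    backlog: List[str] = []  # stands in for the improvements file

    while spent < session_budget:
        # 1. Step execution by the cheaper model.
        work = worker("Carry out the next step of the plan.")
        spent += work.tokens_used

        # 2. Review by the stronger model; it records improvement ideas.
        review = reviewer("Review the previous step and list improvements.")
        spent += review.tokens_used
        backlog.extend(review.improvement_notes)

        # 3. Self-improvement: take the first recorded item, if any.
        if backlog:
            item = backlog.pop(0)
            spent += improver(f"Implement this improvement: {item}").tokens_used

        cycles += 1

    return cycles, spent, backlog


# Stub agents with fixed token costs so the sketch runs offline.
def stub_worker(prompt: str) -> StepResult:
    return StepResult(tokens_used=100_000)


def stub_reviewer(prompt: str) -> StepResult:
    return StepResult(tokens_used=20_000, improvement_notes=["tighten tool output"])


def stub_improver(prompt: str) -> StepResult:
    return StepResult(tokens_used=30_000)


cycles, spent, backlog = run_session(stub_worker, stub_reviewer, stub_improver)
print(cycles, spent, backlog)
```

In a real setup the three callables would wrap actual model calls, and the worker would be cut off at its own per-step token or time limit rather than reporting a fixed cost.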
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
For simple features I don't have a full plan worked out. I write a bit of code then tell the model in a short one-line prompt what it should do. Sometimes I put temporary comments in the code to give it guidance. Generally if the code change is within a file or package, Haiku is good enough to follow what you ask and not mess up too much. I also have skills created over time to give it guidance. There were some months when I used GitHub Copilot where I had excess credits available at the end of the month that I frantically tried to use up.
Even the AI code completions can be pretty good on their own. Sometimes I write some temporary comments describing what the code should do and just press Tab-Tab-Tab and the entire function is done.
I think there is a tendency for people to go for the advanced models thinking they will screw up less, but if you really understand the code it's easier to do it interactively with a lesser model.
I also work on a consumer AI application https://apps.apple.com/us/app/slidebits-studio/id1138731130
For comparison someone showed me an internal company tool he was working on. He had Claude agents dangerously skipping permissions and firing up github branches through a vm sandbox just to make a single feature change. One agent to code and the other to review.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
I use quite plain prompts, nothing fancy:
> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.
> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.
> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.
As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think Chinese models would work too, but we can't use those atm.
Generally there's a coordinator running Opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator Opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
So by using Opus you are using "smaller" model. Well, not really smaller, just worse. The actual smaller models can at least be faster.
Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used; that's an implementation detail to me. I care about price / performance / latency.
Model Input Cached input Output
MAI-Code-1-Flash $0.75 $0.075 $4.50
That's what I'm betting on anyway.
As models get better and smaller, I expect that we will rapidly (within a year?) get to the point where SOTA models are not needed for the vast majority of coding tasks, and even today it seems many people are just using them for the planning phase.
How many people drive Ferraris vs Fords? How many people driving a Ford would, on a utilitarian basis, be any better off driving a Ferrari?
So far there seem to be mainly two high-volume use cases that have been found for LLMs - coding and business flow automation - and it seems neither of these needs SOTA models. I wonder if there will continue to be enough market demand for massive expensive SOTA models to make them worthwhile developing?
The eye-opener is clean licensed data with filters for AI content (not sure how you do that).
If MSFT builds up using an ethical approach, there is a large anti-AI audience that might take note.
https://microsoft.ai/news/introducingmai-code-1-flash/
and the model card
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Model Input Cached input Output
MAI-Code-1-Flash $0.75 $0.075 $4.50
Comparing to
Claude Haiku 4.5 $1.00 $0.10 $5.00
looks fine.
But they also forgot to include the benchmarks comparing to
GPT-5.4 mini $0.75 $0.075 $4.50
Those would have been helpful.
Terminal-Bench 2.0 60.0% 41.6% 54.8%
Source: https://openai.com/index/introducing-gpt-5-4-mini-and-nano/
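One way to put the price rows and the benchmark numbers on the same axis is cost per solved task. The sketch below uses the prices quoted above and the SWE-bench Pro scores reported earlier in the thread for the VS Code harness (51.2% for MAI-Code-1-Flash, 35.2% for Haiku). The per-attempt token counts and the cache split are made-up assumptions, as is the simplification that both models use the same number of tokens per attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_price: float    # USD per million fresh input tokens
    cached_price: float   # USD per million cached input tokens
    output_price: float   # USD per million output tokens
    pass_rate: float      # fraction of tasks solved (SWE-bench Pro, per the thread)


def attempt_cost(m: Model, fresh_in: int, cached_in: int, out: int) -> float:
    """USD cost of a single attempt at a task."""
    total = fresh_in * m.input_price + cached_in * m.cached_price + out * m.output_price
    return total / 1_000_000


def cost_per_solved(m: Model, fresh_in: int, cached_in: int, out: int) -> float:
    """Average spend per solved task: attempt cost divided by pass rate."""
    return attempt_cost(m, fresh_in, cached_in, out) / m.pass_rate


MAI = Model("MAI-Code-1-Flash", 0.75, 0.075, 4.50, pass_rate=0.512)
HAIKU = Model("Claude Haiku 4.5", 1.00, 0.10, 5.00, pass_rate=0.352)

# Hypothetical agentic attempt: mostly cached context, modest output.
FRESH_IN, CACHED_IN, OUT = 200_000, 800_000, 50_000

mai_solved = cost_per_solved(MAI, FRESH_IN, CACHED_IN, OUT)
haiku_solved = cost_per_solved(HAIKU, FRESH_IN, CACHED_IN, OUT)

for model, value in ((MAI, mai_solved), (HAIKU, haiku_solved)):
    print(f"{model.name}: ${value:.2f} per solved task")
```

The same function could be applied to the GPT-5.4 mini row if a comparable SWE-bench Pro score in the same harness were available; none is given in the thread, so it is left out here.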
Why not showcase it against something in a similar domain like qwen3.6 or gemma 4?
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
Let me slide as fast and unrestricted as I want. I do not want to "transition" to the next paragraph.
This trend needs to stop.
https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
The early testers have confirmed that it is much better than all earlier US open weights models, but it is not as good as the best Chinese open weights models.
While Nemotron 3 Ultra is not the smartest open weights LLM, it is well optimized for fast inference, so it is much faster than the other LLMs of the same size.
In any case I believe that it is very good to have an additional option in big open weights LLMs, because until now all existing models have shown that even if some model is definitely better on average than another, the weaker model can still be better in some particular applications.
With open weights models, you can afford to try multiple LLMs for the more important tasks and then choose the best solution.
The pure-AI companies like OpenAI and Anthropic are hoping to sell you API access to cloud-based AI, perhaps running on NVIDIA chips, but it seems NVIDIA's plan may be for you to run local AI, maybe from NVIDIA, running on local NVIDIA chips.
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"
Seems like the work from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
When I need a light model, I reach for Sonnet. It is nearly free on the max plans, and quite fast. I don't see a place for Haiku in regular coding.
Haiku I guess is when you need summarization/categorization at scale.
Microsoft setting Haiku as the benchmark is a low bar.
is a funny oxymoron
"MAI-Code-1-Flash outperforms Claude Haiku 4.5"
Well still no list nor publication of the training data.
That sounds like something you say when you don't benchmark well
While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.
MAI-Thinking-1 - https://news.ycombinator.com/item?id=48374362 - June 2026 (64 comments)
Here Microsoft is comparing against Claude Haiku, the smallest and least capable model from Anthropic.
[1] https://ai.meta.com/blog/introducing-muse-spark-msl/
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
They also did some more interesting work like showing very small models can be coherent as long as you have very simple children's book style training data (TinyStories is pretty famous).
Lots of these ideas are still used. Learning facts at scale with active reading is an ICLR 2026 paper from Meta AI that does a lot of similar work.
If you watch the Build keynote with Satya, you'll notice that the design of the slides changed to Serif typography and warmer colors when Mustafa/Microsoft AI segment came on which was completely different from the rest of the keynote. Now it makes sense!
Where does the Pascal case inspired variant come from? Is it a reference to something? Is it like "M$" was used back in the days?
This model might have a perfect speed:
(gestures wildly while changing lanes in his Fiat 500)
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
Why not assign them to make windows good :D
These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)