HN.zip

Ollama is now powered by MLX on Apple Silicon in preview

588 points by redundantly - 306 comments
abu_ameena [3 hidden]5 mins ago
On-device models are the future. Users prefer them. No privacy issues. No dealing with connectivity, tokens, or changes to vendors' implementations. I have an app using the Foundation Models framework, and it works great. I only wish I could backport it to pre-macOS 26 versions.
raw_anon_1111 [3 hidden]5 mins ago
Users don’t care about “privacy”. If they did, Meta and Alphabet wouldn’t be worth $1T+.

Users really don’t matter at all. The revenue for AI companies will be B2B, where the user is not the customer - including coding agents. Most people don’t even use computers as their primary “computing device”, and most are buying crappy low-end Android phones - no, I’m not saying all Android phones are crappy, but that’s what most people are buying, with the average selling price of an Android phone being $300.

roadside_picnic [3 hidden]5 mins ago
> Users don’t care about “privacy”.

I worked for a research focused AI startup that had a strict "no external LLM" policy for code touching our core research.

You're right that the average consumer doesn't care about privacy, but there are many, many users who do. The average consumer also doesn't have a desktop with a GPU or a high-end Mac Studio, but that doesn't mean there aren't many people working with AI who do have these things.

If we continue to see improvements in running local models, and RAM prices continue to fall as they have in the last month, then suddenly you don't have to worry about token counts any more and can be much more trusting of your agents since they are fully under your control.

charcircuit [3 hidden]5 mins ago
Those users are addressed by being able to rent their own exclusive machines to run the model on. There will be some compromise that will be made to get access to the best intelligence available.
KerrAvon [3 hidden]5 mins ago
As one of those users: absolutely fucking not.
barelysapient [3 hidden]5 mins ago
Different users. Many people care about privacy and aren’t using Meta products. And many businesses care about it too and have information policies to protect their IP.
amelius [3 hidden]5 mins ago
> Different users. Many people care about privacy and aren’t using Meta products.

Yeah but if they can rake in 100x as much by making products for people who don't care about privacy, then why spend time developing stuff for people who care?

There is still a small market left, of course, but that market will not have the billions of R&D behind it.

woopsn [3 hidden]5 mins ago
It's largely out of Meta's hands now anyway. The risk here is not so much to privacy (it's Apple), but they'll turn the model space into a walled garden somehow for sure.
bigyabai [3 hidden]5 mins ago
> but they'll turn the model space into a walled garden somehow for sure.

People have said this since Pytorch was published and it's not any more true now than it was 10 years ago.

raw_anon_1111 [3 hidden]5 mins ago
70% of the world’s population use at least one Meta property at least once per day. How many of the other 30% are too poor/young/computer illiterate to be part of an addressable market?

Every company has dozens of SaaS products that store their business-critical information. Amazon installs Office and Slack on each computer (they were moving away from Chime when I left), and the sales department, SAs, and Professional Services use Salesforce (I'm a former employee).

The addressable market of even the companies that care about privacy is not large. How long will it be before computers that can run even GPT-4-level LLMs become cheap enough that companies will give them to all of their developers?

JambalayaJimbo [3 hidden]5 mins ago
The banking industry absolutely does care about privacy of their business data btw. We do use tools like Confluence but they're all hosted in our own data centers.
raw_anon_1111 [3 hidden]5 mins ago
And Capital One and Goldman Sachs are both hosted on AWS…
innagadadavida [3 hidden]5 mins ago
These are all great statistics, but how do you explain the ClawdBot explosion, even in lower-income countries like China? So much demand that Apple can’t keep up production of Mac Minis. Why aren’t these folks going towards cloud solutions? Is it cost, or is there some consideration for having more control over their data?
zozbot234 [3 hidden]5 mins ago
ClawBot doesn't generally run the model locally, it just talks to remote APIs. No different than any other agentic harness. You could run a local model on the same Mac Mini as your agent, but it wouldn't be very smart and many agentic tasks around computer GUI/browser use, etc. would be out of reach.
victorbjorklund [3 hidden]5 mins ago
They are running cloud models in almost all cases. It’s like saying it isn’t cloud when you use the Facebook app on your phone (because the app is ON your phone and running there).
raw_anon_1111 [3 hidden]5 mins ago
And people using Clawdbot are still not using local inference for the most part…

They aren’t buying high end $2000+ Mac Minis.

bigyabai [3 hidden]5 mins ago
> Why aren’t these folks going towards cloud solutions?

They are. The majority aren't doing inference on a Mac Mini, but instead using it as a local host for cloud-based inference. You could have the same general experience on a $200 Chromebook or $300 Windows box.

abu_ameena [3 hidden]5 mins ago
I see it as a long-term tradeoff on user freedom. You pay upfront for capable hardware and you get your services running locally (you don’t pay subscriptions). Or you buy cheap hardware, but then you still need the same services “running in some cloud” for $X monthly, where X goes up depending on the corporate bottom line.
raw_anon_1111 [3 hidden]5 mins ago
In the history of cloud computing, prices have mostly only come down especially as inference becomes a commodity. Realistically, just looking at Mac prices, the cost of a computer with decent local inference would be around $6000 per person.

The world is not moving back to on prem.

Aurornis [3 hidden]5 mins ago
> Realistically, just looking at Mac prices, the cost of a computer with decent local inference would be around $6000 per person.

As someone who has hardware in that price range and plays with local LLMs: The gap between Opus or GPT and the local models is still very large for work beyond simple queries.

Self-hosted also starts making my office hot due to all of the power consumption when I use it for anything more than short queries. If you haven't heard your Mac's fans spin up much yet, running local LLMs will get you acquainted with the sound of their cooling systems at full blast.

esseph [3 hidden]5 mins ago
> The world is not moving back to on prem.

Lol, you should tell my customers (that are moving back on prem) that!

You should also tell Microsoft, who just yesterday said they are going back to focusing on local apps.

raw_anon_1111 [3 hidden]5 mins ago
Your customers are an anecdote, now compare that to the publicly reported numbers from AWS, GCP and Azure where they all say the only thing keeping them from growing more is the chip shortage.
esseph [3 hidden]5 mins ago
Oh I'm sure they'll continue to have some cloud services, no doubt. But look at VMware for example, even after the insane price increases. Nutanix also seems to be doing quite well. I'm seeing a fair amount of on-prem bare metal k8s too.
raw_anon_1111 [3 hidden]5 mins ago
Again - anecdotes are not data. We have data. That would be about as silly as me citing my own experience as proof that “everyone is moving to AWS” when I work for a company that is exclusively an AWS partner consulting company.
barkerja [3 hidden]5 mins ago
Users care about privacy when they understand the threat and impact. The issue is most users don't understand this, especially when it comes to products like Meta, where on the surface everything appears harmless.
Nevermark [3 hidden]5 mins ago
Have you done A/B tests to see if consumers prefer Facebook with or without privacy?

No? What? Oh, you can't?

Neither can consumers. Most consumers are very aware of the lack of privacy, the manipulation, and have very cynical feelings about Facebook and similar companies. But it's where their friends and family are.

For most people the web is a minefield where basic things they want are compromised everywhere. And they are routinely creeped out by ads that reveal the advertisers know them far too personally.

You are mistaking network capture for preference.

Another telling example. Lots of privacy valuing technical people, who would never have a Facebook account, send unencrypted text emails.

It is network capture, not preference.

raw_anon_1111 [3 hidden]5 mins ago
Consumers proactively tell Facebook their age, sexual preference, race, relationship status, likes and dislikes; they check in to where they are and who they are there with…

They are choosing to give Facebook info.

Nevermark [3 hidden]5 mins ago
> They are choosing to give Facebook info.

Yes, they do. That's exactly the phenomenon my comment addressed.

But the way you wrote that implies an improbable motivation or choice framing.

Perhaps their real motive/choice is to share with other people on the site.

It is called a network effect.

If (1) Facebook had been the surveillance/manipulation capital of the world from inception, (2) an equally inviting privacy protecting site took off at the same time, and (3) everyone chose Facebook over E2EE anyway, then sure, we could throw up our hands! Those silly users!

The term I have for when people discuss choices involving many-dimensional criteria, as if the choice involved just one or two selected dimensions, is "dimension blindness". It happens in a lot of heated discussions about phone choices too.

raw_anon_1111 [3 hidden]5 mins ago
If people cared about privacy but still wanted to use FB, wouldn’t the most obvious way to protect it be not to proactively give FB information? You don’t have to share everything I mentioned just to be involved in a group.
Angostura [3 hidden]5 mins ago
It’s not all or nothing; there are trade-offs. The fact that Apple still bothers to expend marketing effort on its privacy chops suggests significant numbers of people still do care.
ilovecake1984 [3 hidden]5 mins ago
Users here probably means corporations. I still don’t see much use of LLMs in my personal life, other than one thing. Googling stuff in a foreign language.
api [3 hidden]5 mins ago
"Users" is a large set of people. Many don't care about privacy, but some do. There's also a difference between where you post random social media stuff vs what you run with something like OpenClaw and give access to your machine.
DesiLurker [3 hidden]5 mins ago
you are missing a big 'given a choice' disclaimer. Meta is pretty much a monopoly in the social space. So is Android. Given a choice, people will absolutely gravitate towards a not-always-snooping device; most people with resources anyway, who are the ones who matter for AI adoption.

Oh, and wait till ad companies start selling your healthcare data and you will see how fast things turn 'given a choice'.

raw_anon_1111 [3 hidden]5 mins ago
People A) don’t have to use Meta and B) do have a choice not to use a mobile phone made by an ad tech company.
nozzlegear [3 hidden]5 mins ago
People don't have a choice between Facebook and not-Facebook-but-still-has-all-of-your-friends-and-family. Abstinence isn't a choice here any more than shutting off your cell phone service is a choice; true in the literal sense, but only if you don't mind being unreachable to everyone who still has a phone.
raw_anon_1111 [3 hidden]5 mins ago
And they do have a choice about proactively giving FB more information than just what it infers.
sowbug [3 hidden]5 mins ago
I am concerned that local models will never benefit from the training on live requests that is surely improving cloud-only models.

This might be the cost of privacy, and it might be worth paying, unless cloud models reach an inflection point that makes local models archaic.

mrinterweb [3 hidden]5 mins ago
I think two recent advances make your statement more true. The new Qwen 3.5 series has shown a relatively high intelligence density, and Google's new turboquant could result in dramatically smaller/more efficient models without the normal quantization accuracy tradeoff.

I would expect consumer inference ASIC chips will emerge when model developments start plateauing, and "baking" a highly capable and dense model to a chip makes economic sense.

fauigerzigerk [3 hidden]5 mins ago
Who will be funding state of the art local models going forward? AI models are never done or good enough. They will have to be trained on new data and eventually with new model architectures. It will remain an expensive exercise.

I could be wrong because I'm not following this too closely, but the open weights future of both Llama and Qwen looks tenuous to me. Yes, there are others, but I don't understand the business model.

whazor [3 hidden]5 mins ago
Obviously hardware wise the real blocker is memory cost. But there is no reason why future devices couldn't bundle 256GB of mem by default.
michaelmior [3 hidden]5 mins ago
> no reason why future devices couldn't bundle 256GB of mem by default

Cost is a pretty big reason.

mgaunard [3 hidden]5 mins ago
These local models are far behind the capabilities of latest Gemini Pro, Claude Opus or GPT.

Why waste time with subpar AI?

sbassi [3 hidden]5 mins ago
It's a trade off.
Lucasoato [3 hidden]5 mins ago
They will eventually catch up, that’s the hope to avoid a techno feudalism in which too much power is in too few hands.
abu_ameena [3 hidden]5 mins ago
Yes, but you don’t always want the power/expense of these models for the task at hand. A hammer is good enough to drive a nail into a wall. Save the nail gun for when you are building a house.
jesse23 [3 hidden]5 mins ago
Yes, but so far do we have a working practice that, with a given local model and whatever infra we could use, lets us leverage it for local tasks?
thefourthchime [3 hidden]5 mins ago
Maybe some more distant future. For me, I'm still struggling with the hallucinations and screw-ups that the state-of-the-art models give me.
throwawayq3423 [3 hidden]5 mins ago
Technologists make the same mistake over and over in thinking the better technology will win. VHS vs Betamax, etc.

Actual consumers not only don't care, they will not even be aware of the difference.

testing22321 [3 hidden]5 mins ago
I see all these LLM posts about if a certain model can run locally on certain hardware and I don’t get it.

What are you doing with these local models that run at x tokens/sec?

Do you have the equivalent of ChatGPT running entirely locally? What do you do with it? Why? I honestly don’t understand the point or use case.

svachalek [3 hidden]5 mins ago
1. There are small local models that have the capabilities of frontier models a year ago

2. They aren't harvesting your data for government files or training purposes

3. They won't be altered overnight to push advertising or a political agenda

4. They won't have their pricing raised at will

5. They won't disappear as soon as their host wants you to switch

samuel [3 hidden]5 mins ago
Chat is certainly an option, but the real deal is agents, which have access to way more sensitive information.
dec0dedab0de [3 hidden]5 mins ago
most of the llm tooling can handle different models. Ollama makes it easy to install and run different models locally. So you can configure aider or vscode or whatever you're using to connect to chatgpt to point to your local models instead.

None of them are as good as the big hosted models, but you might be surprised at how capable they are. I like running things locally when I can, and I also like not worrying about accidentally burning through tokens.
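
For example, anything that already speaks the OpenAI API can be pointed at Ollama's local endpoint on its default port; roughly like this (qwen3 below is just a placeholder for whatever model you've pulled locally):

  ollama pull qwen3
  curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3", "messages": [{"role": "user", "content": "hello"}]}'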

I think the future is multiple locally run models that call out to hosted models when necessary. I can imagine every device coming with a base model and using LoRAs to learn about the user's needs, with companies and maybe even households having their own shared models that do heavier lifting, while companies like OpenAI and Anthropic continue to host the most powerful and expensive options.

roboror [3 hidden]5 mins ago
What models have you found capable? I was recently recommended Qwen3 Coder Next and I did not find it very successful. I have a good amount of VRAM/RAM so would love to run something locally.
franze [3 hidden]5 mins ago
I created "apfel" https://github.com/Arthur-Ficial/apfel, a CLI for the Apple on-device local foundation model (Apple Intelligence). Yeah, it's super limited with its 4k context window and super common false-positive guardrails (just ask it to describe a color) ... but still ... using it in bash scripts that just work without calling home/out or incurring extra costs feels super powerful.
podlp [3 hidden]5 mins ago
Neat! I’ve actually been building with AFM, including training some LoRA adapters to help steer the model. With the right feedback mechanisms and guardrails, you can even use it for code generation! Hopefully I’ll have a few apps and tools out soon using AFM. I think embedded AI is the future, and in the next few years more platforms will come around to AI as a local API call, not an authorized HTTP request. That said, AFM is still incredibly premature and I’m experimenting with newer models that perform much better.
newman314 [3 hidden]5 mins ago
This is quite interesting. I wonder if AFM is smart enough to do spam classification.
_doctor_love [3 hidden]5 mins ago
This apfel is very tasty!
chid [3 hidden]5 mins ago
this is real neat. I'll give it a spin.
LeoDaVibeci [3 hidden]5 mins ago
Honestly I can't believe Apple put that foundation model product out the door. I was so excited about it, but when I tried it, it was such a disappointment. Glad to hear you calling that out so I know it wasn't just me.

Looks like they have pivoted completely over to Gemini, thank god.

franze [3 hidden]5 mins ago
yeah, it is super limited but also you can now do

  cmd(){ local x c r a; while [[ $1 == -* ]]; do case $1 in -x)x=1;shift;; -c)c=1;shift;; *)break;; esac; done; r=$(apfel -q -s 'Output only a shell command.' "$*" | sed '/^```/d;/^#/d;s/^[[:space:]]*//;/^$/d' | head -1); [[ $r ]] || { echo "no command generated"; return 1; }; printf '\e[32m$\e[0m %s\n' "$r"; [[ $c ]] && printf %s "$r" | pbcopy && echo "(copied)"; [[ $x ]] && { printf 'Run? [y/N] '; read -r a; [[ $a == y ]] && eval "$r"; }; return 0; } 
cmd find all swift files larger than 1MB

cmd -c show disk usage sorted by size

cmd -x what process is using port 3000

cmd list all git branches merged into main

cmd count lines of code by language

without calling home or downloading extra local models

and well, maybe one day they'll get their local models .... more powerful, "less afraid", and with a way bigger context window.

jorvi [3 hidden]5 mins ago
This really makes me think of A Deepness in the Sky by Vernor Vinge. A loose prequel to A Fire Upon The Deep, and IMO actually the superior story. It is set in the far future of humanity.

In part of it, one group tries to take control of a huge ship from another group. They do this in part by trying to bypass all the cybersecurity. But in those far-future days, you don't interface with all the aeons of layers of command protocols anymore, you just query an AI who does it for you. So, this group has a few tech guys that attempt the bypass by using the old command protocols directly (in a way, the same thing as the iOS exploit that used a vulnerability in a PostScript font library from the 90s).

Imagine being used to LLM prompting + responses, and suddenly you have to deal with something like

  sed '/^```/d;/^#/d;s/^[[:space:]]*//;/^$/d' | head -1); [[ $r ]]
and generally obtuse terminal output and man pages.

:)

(offtopic: name your variables, don't do local x c r a;. Readability is king, and a few hundred thousand years from now some poor Qeng Ho fellow might thank his lucky stars you did).

beepbooptheory [3 hidden]5 mins ago
What is the AI doing here? Or is this just like being cheeky?
dgacmu [3 hidden]5 mins ago
The pile of shell and sed is cleaning up the ai output and then running it in the shell.

The instruction to the AI was to create _a_ shell command. So it's a random shell command generator (maybe).

corndoge [3 hidden]5 mins ago
that part is the system prompt, the script is a function that takes a prompt describing a shell command as an argument
beepbooptheory [3 hidden]5 mins ago
But it's gotta be just a joke right? Which is why all the examples are just classic things you do with bash/unix utilities?

I'll just say, if not a joke, the bit is appreciated either way!

"AI change to the home directory. Make it snappy!"

drob518 [3 hidden]5 mins ago
In Apple’s defense, they did make it do something borderline useful while targeting a baseline of M1 Macs with 8 GB of RAM (and even less in phones).
AbuAssar [3 hidden]5 mins ago
nice project, thanks for sharing.

any plans for providing it through brew for easy installation?

grosswait [3 hidden]5 mins ago
Looks like they just added homebrew tap to the instructions
woadwarrior01 [3 hidden]5 mins ago
There's a very similar afm CLI that can be installed via Homebrew.

https://github.com/scouzi1966/maclocal-api

franze [3 hidden]5 mins ago
done

  brew tap Arthur-Ficial/tap
  brew install Arthur-Ficial/tap/apfel
jedahan [3 hidden]5 mins ago
No need for the extra tap step, this works fine alone:

    brew install Arthur-Ficial/tap/apfel
franze [3 hidden]5 mins ago
good idea
JumpCrisscross [3 hidden]5 mins ago
…is it a reference to apfelwein?
franze [3 hidden]5 mins ago
just german for apple, cause reasons
JumpCrisscross [3 hidden]5 mins ago
I thought it was a reference to Wine, the Linux Wine, and then thought of apfelwein. Nvm!
babblingfish [3 hidden]5 mins ago
LLMs on device are the future. It's more secure, it solves the problem of too much demand for inference compared to data center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.
konschubert [3 hidden]5 mins ago
I disagree with every sentence of this.

> solves the problem of too much demand for inference

False, it creates consumer demand for inference chips, which will be badly utilised.

> also would use less electricity

What makes you think that? (MAYBE you can save power on cooling. But not if the data center is close to a natural heat sink)

> It's just a matter of getting the performance good enough.

The performance limitations are inherent to the limited compute and memory.

> Most users don't need frontier model performance.

What makes you think that?

dgb23 [3 hidden]5 mins ago
> False, it creates consumer demand for inference chips, which will be badly utilised.

I think the opposite is true. Local inference doesn't have to go over the wire and through a bunch of firewalls and what have you. The performance from just regular consumer hardware with local, smaller models is already decent. You're utilizing the hardware you already have.

> The performance limitations are inherent to the limited compute and memory.

When you plug in a local LLM and inference engine into an agent that is built around the assumption of using a cloud/frontier model then that's true.

But agents can be built around local assumptions and more specific workflows and problems. That also includes the model orchestration and model choice per task (or even tool).

The Jevons Paradox comes into play with using cloud models. But when you have fewer resources you are forced to move into more deterministic workflows. That includes tighter control over what the agent can do at any point in time, but also per-project/session workflows where you generate intermediate programs/scripts instead of letting the agent just do whatever it wants.

I'll give you an example:

When you ask a cloud based agent to do something and it wants more information, it will often do a series of tool calls to gather what it thinks it needs before proceeding. Very often you can front load that part, by first writing a testable program that gathers most of the necessary information up front and only then moving into an agentic workflow.

This approach can produce a bunch of .json, .md files or it can move things into a structured database or you can use embeddings or what have you.

This can save you a lot of inference, make things more reusable and you don't need a model that is as capable if its context is already available and tailored to a specific task.
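
As a rough sketch of that front-loading step (this assumes a git repo and a Node project, but the gathering script can be anything that is testable):

  {
    echo "## repo layout";    git ls-files | head -200
    echo "## recent changes"; git log --oneline -20
    echo "## dependencies";   cat package.json 2>/dev/null
    echo "## failing tests";  npm test 2>&1 | tail -40
  } > context.md

The agent then starts with context.md already available instead of burning inference on rediscovering all of that.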

pama [3 hidden]5 mins ago
Parallel inference on large compute scales in superlinear ways. There is no way to beat the reduction in memory transfers that a data-center inference model provides with hardware that fits in anything called a home. It is much more energy efficient to process huge batches of parallel requests compared to having one or a handful of queries running on an accelerator.
dudefeliciano [3 hidden]5 mins ago
Aren't data centers extremely energy inefficient due to network latency, memory bottlenecks and so on? I mean the models that run on them are extremely powerful compared to what you can run on consumer hardware, but I wouldn't call them efficient...
Shorel [3 hidden]5 mins ago
I'm sorry to get into this conversation, but running a model costs some orders of magnitude more (meaning it requires far greater amounts of specific computing power) than the entire network stack of all the nodes involved in the internet traffic of a particular request.

Meaning: these 5000 tokens consume tiny amounts of energy being moved around from the data center to your PC, but enormous amounts of energy being generated in the first place. An equivalent webpage with the same amount of text as these tokens would be perceived as instant in any network configuration. Just some kilobytes of text, much smaller than most background graphics. The two things can't be compared at all.

However, just last week there have been huge improvements on the hardware required to run some particular models, thanks to some very clever quantisation. This lowers the memory required 6x in our home hardware, which is great.

In the end, we spent more energy playing videogames during the last two decades than on all this AI craze, and it was never a problem. We surely can run models locally, and heat our homes in winter.

txdv [3 hidden]5 mins ago
> False, it creates consumer demand for inference chips, which will be badly utilised.

There are so many CPUs, GPUs, RAM and SSDs which are underutilized. I have some in my closet doing 5% load at peak times. Why would inference chips be special once they become commodity hardware?

iknowstuff [3 hidden]5 mins ago
That’s the point, they’re better utilized in the cloud
locknitpicker [3 hidden]5 mins ago
> What makes you think that?

The fact that today's and yesterday's models are quite capable of handling mundane tasks, and even companies behind frontier models are investing heavily in strategies to manage context instead of blindly plowing through problems with brute-force generalist models.

But let's flip this around: what on earth even suggests to you that most users need frontier models?

konschubert [3 hidden]5 mins ago
Everybody has difficult decisions to make in their daily lives and in their work.

Having access to a model that is drawing from good sources and takes time to think instead of hallucinating a response is important in many domains of life.

ekianjo [3 hidden]5 mins ago
> What makes you think that?

Looking at actual users of LLMs

konschubert [3 hidden]5 mins ago
While not everybody is a professional in YOUR domain, many people are professionals in SOME domain. And even outside of that, they deserve a smart conversation partner, for example on topics like health and politics.
troad [3 hidden]5 mins ago
I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!)

There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
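
(For anyone wanting to try the same setup: getting that built-in web interface up is roughly just the below, with the model path and port being whatever you like, then open http://localhost:8080 in a browser.)

  llama-server -m path/to/model.gguf --port 8080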

theshrike79 [3 hidden]5 mins ago
Qwen3.5 has tool calling, so you can give it a wikipedia tool which it uses to know what happened in Tiananmen Square without issues =)
troad [3 hidden]5 mins ago
That's very cool! I think giving it some research tools might be a nifty thing to try next. This is a fairly new area for me, so pointers or suggestions are welcome, even basic ones. :)

Worth adding that I had reasoning on for the Tiananmen question, so I could see the prep for the answer, and it had a pretty strong current of "This is a sensitive question to PRC authorities and I must not answer, or even hint at an answer". I'm not sure if a research tool would be sufficient to overcome that censorship, though I guess I'll find out!

theshrike79 [3 hidden]5 mins ago
Basically ask any coding agent to create you a simple tool-calling harness for a local model and it'll most likely one-shot it.

Getting the local weather using a free API like met.no is a good first tool to use.
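
The tool itself can be as dumb as a one-line fetch; something like the below against met.no's locationforecast endpoint (the coordinates are just an example, and the API wants a descriptive User-Agent):

  curl -s -A "my-weather-tool/0.1 you@example.com" \
    "https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=59.91&lon=10.75"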

girvo [3 hidden]5 mins ago
I'd recommend it too, because the knowledge cutoff of all the open weight Chinese models (M2.7, Qwen3.5, GLM-5 etc) is earlier than you'd think, so giving it web search (I use `ddgr` with a skill) helps a surprising amount
theshrike79 [3 hidden]5 mins ago
Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.

zozbot234 [3 hidden]5 mins ago
> Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.

spockz [3 hidden]5 mins ago
I've always wondered where the inflection point lies between, on the one hand, trying to train the model on all kinds of data such as Wikipedia/encyclopedias, versus pointing, in the system prompt, to your local versions of those data sources, perhaps even through a search-like API/tool.

Is there already some research or experimentation done into this area?

zozbot234 [3 hidden]5 mins ago
The training gives you a very lossy version of the original data (the smaller the model, the lossier it is; very small models will ultimately output gibberish and word salad that only loosely makes some sort of sense) but it's the right format for generalization. So you actually want both, they're highly complementary.
theshrike79 [3 hidden]5 mins ago
That's the key, it just needs to be smart enough to 1) know it doesn't know and 2) "know a guy" as they say =) (call a tool for the exact information)

Picking a model that's juuust smart enough to know it doesn't know is the key.

austinthetaco [3 hidden]5 mins ago
Have you played around with any of the Hermes models? they are supposed to be one of the best at non-refusal while keeping sane.
whackernews [3 hidden]5 mins ago
Oh does llama.cpp use MLX or whatever? I had this question, wonder if you know? A search suggests it doesn’t but I don’t really understand.
irusensei [3 hidden]5 mins ago
>Oh does llama.cpp use MLX or whatever?

No. It runs on MacOS but uses Metal instead of MLX.

zozbot234 [3 hidden]5 mins ago
ANE-powered inference (at least for prefill, which is a key bottleneck on pre-M5 platforms) is also in the works, per https://github.com/ggml-org/llama.cpp/issues/10453#issuecomm...
OkGoDoIt [3 hidden]5 mins ago
Is that better or worse?
irusensei [3 hidden]5 mins ago
Depends.

MLX is faster because it has better integration with Apple hardware. On the other hand GGUF is a far more popular format so there will be more programs and model variety.

So it's kinda like having a very specific diet that you swear is better for you but you can only order food from a few restaurants.

drob518 [3 hidden]5 mins ago
But you can always fall back to GGUF while waiting for the world to build a few more MLX restaurants. Or something like that; the analogy is a bit stretched.
LoganDark [3 hidden]5 mins ago
llama.cpp uses GGML which uses Metal directly.
WesolyKubeczek [3 hidden]5 mins ago
Cool, I always wanted to invade Belgium. Maybe if my plan is good, I could run a successful gofundme?
troad [3 hidden]5 mins ago
Hey, if Margaret Thatcher's son can give it a go, why not you? Believe in yourself and reach for those dreams. *sparkle emoji*
jonhohle [3 hidden]5 mins ago
I’ve been using google search AI and Gemini, which I find generally pretty good. In the past week, Gemini and Search AI have been bringing in various details of previous searches I’ve done and Search AI conversations I’ve had and it’s extremely gross and creepy.

I was looking for details about cars and it started interjecting how the safety would affect my children by name in a conversation where I never mention my children. I was asking details about Thunderbolt and modern Ryzen processors and a fresh Gemini chat brought in details about a completely unrelated project I work on. I’ve always thought local LLMs would be important, but whatever Google did in the past few weeks has made that even more clear.

theChaparral [3 hidden]5 mins ago
It's Personal Intelligence in the Gemini settings. I just turned that off last night when it was doing similar things.
Aurornis [3 hidden]5 mins ago
> solves the problem of too much demand for inference compared to data center supply

Maybe in the distant future when device compute capacity has increased by multiples and efficiency improvements have made smaller LLMs better.

The current data center buildouts are using GPU clusters and hybrid compute servers that are so much more powerful than anything you can run at home that they’re not in the same league. Even among the open models that you can run at home if you’re willing to spend $40K on hardware, the prefill and token generation speeds are so slow compared to SOTA served models that you really have to be dedicated to avoiding the cloud to run these.

We won’t be in a data center crunch forever. I would not be surprised if we have a period of data center oversupply after this rush to build out capacity.

However at the current rate of progress I don’t see local compute catching up to hosted models in quality and usability (speed) before data center capacity catches up to demand. This is coming from someone who spends more than is reasonable on local compute hardware.

melvinroest [3 hidden]5 mins ago
I have journaled digitally for the last 5 years with this expectation.

Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.

It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.

I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
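
Roughly, the per-note extraction step looks something like this with mlx-lm (the model name and the note file are illustrative; the prompt is the part that matters):

  mlx_lm.generate --model mlx-community/Qwen3.5-4B-4bit --max-tokens 256 \
    --prompt "Extract (entity1, relationship_to, entity2) triplets from the following note, one per line, nothing else: $(cat note.md)"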

nkzd [3 hidden]5 mins ago
Did you get any insights about yourself from this process? I am thinking of doing the same
melvinroest [3 hidden]5 mins ago
TL;DR: you no longer need to do a treasure hunt through your notes by typing stuff into the search bar. Having your own graphRAG system + LLM on your notes is basically a "Google" for your own notes. Any question you have: if you have a note for it, it will bubble up. The annoying thing is that false positives will also bubble up.

----

Full reaction:

Yes but perhaps not in a way you might expect. Qwen's reasoning ability isn't exactly groundbreaking. But it's good enough to weave a story, provided it has some solid facts or notes. GraphRAG is definitely a good way to get some good facts, provided your notes are valuable to you and/or contain some good facts.

So the added value is that you now have a super charged information retrieval system on your notes with an LLM that can stitch loose facts reasonably well together, like a librarian would. It's also very easy to see hallucinations, if you recognize your own writing well, which I do.

The second thing is that I have a hard time rereading all my notes. I write a lot of notes, and don't have the time to reread any of them. So oftentimes I forget my own advice. Now that I have a super charged information retrieval system on my notes, whenever I ask a question: the graphRAG + LLM search for the most relevant notes related to my question. I've found that 20% of what I wrote is incredibly useful and is stuff that I forgot.

And there are nuggets of wisdom in there that are quite nuanced. For me specifically, I've seen insights in how I relate to work that I should do more with. I'll probably forget most things again but I can reuse my system and at some point I'll remember what I actually need to remember. For example, one thing I read was that work doesn't feel like work for me if I get to dive in, zoom out, dive in, zoom out. Because in the way I work as a person: that means I'm always resting and always have energy for the task that I'm doing. Another thing that it got me to do was to reboot a small meditation practice by using implementation intentions (e.g. "if I wake up then I meditate for at least a brief amount of time").

What also helps is to have a bit of a back and forth with your notes and then copy/paste the whole conversation in Claude to see if Claude has anything in its training data that might give some extra insight. It could also be that it just helps with firing off 10 search queries and finds a blog post that is useful to the conversation that you've had with your local LLM.

AugSun [3 hidden]5 mins ago
"Most users don't need frontier model performance" unfortunately, this is not the case.
theshrike79 [3 hidden]5 mins ago
It depends. If they're using a small/medium local model as a 1:1 ChatGPT replacement as-is, they'll have a bad time. Even ChatGPT refers to external services to get more data.

But a local model + good harness with a robust toolset will work for people more often than not.

The model itself doesn't need to know who was the president of Zambia in 1968, because it has a tool it can use to check it from Wikipedia.
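
That kind of tool is not much more than a search query against the standard MediaWiki API, something like:

  curl -s "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=president+of+Zambia+1968&format=json"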

ZeroGravitas [3 hidden]5 mins ago
You can install the complete text of Wikipedia locally too.

They've usually been intended for ereader/off-grid/post-zombie-apocalypse situations but I'd guess someone is working on an llm friendly way to install it already.

Be interesting to know the tradeoffs. The Tiananmen Square example suggests why you'd maybe want the knowledge facts to come from a separate source.

zozbot234 [3 hidden]5 mins ago
The Wikipedia folks are now working on implementing a language-independent representation for their encyclopedic content - one that's intended to be rigorously compositional and semantics-aware, loosely comparable to Universal Meaning Representation (UMR) as known in the linguistics domain, that - if successful - may end up interacting in very interesting ways with multi-language capable LLMs. Very early experiments (nowhere near as capable as UMR as of yet, but experimenting with the underlying software infrastructure) are at https://abstract.wikipedia.org , whilst a direct comparison of the projected design is given by https://commons.wikimedia.org/wiki/File:Abstract_Wikipedia_N... https://elemwala.toolforge.org/static/nlgsig-nov2025.html
selcuka [3 hidden]5 mins ago
Any citations? Because that was my impression, too. I want frontier model performance for my coding assistant, but "most users" could do with smaller/faster models.

ChatGPT free falls back to GPT-5.2 Mini after a few interactions.

lxgr [3 hidden]5 mins ago
Have you used GPT instant or mini yourself? I think it’s pretty cynical to assume that this is “good enough for most people”, even if they don’t know the difference between that and better models.
throwaway27448 [3 hidden]5 mins ago
Say more. Why do you think this?
embedding-shape [3 hidden]5 mins ago
They're awful and hallucinate a lot, I couldn't imagine using it even for prompts about TV shows, even less so for serious work. Repeating the question from the parent, have you tried those yourself? Even compared to ChatGPT Thinking, they're short of useless.
lxgr [3 hidden]5 mins ago
They're essentially replying based on vibes, instead of grounding their responses in extensive web searches, which is what the paid models/configurations generally do. This makes them wrong more often than they're right for anything but the most trivial requests that can be easily responded to out of memorized training data.

This is all on top of the (to me) insufferable tone of the non-thinking models, but that might well be how most users prefer to be talked to, and whether that's how these models should accordingly talk is a much more nuanced question.

Regardless of that, everybody deserves correct answers, even users on the free tier. If this makes the free tier uneconomical to serve for hours on end per user per day, then I'd much rather they limit the number of turns than dial down the quality like that.

asutekku [3 hidden]5 mins ago
Frontier models have much better knowledge and they usually hallucinate less. It's not about the coding capabilities, it's about how much you can trust the model.
Barbing [3 hidden]5 mins ago
re: trust-

Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.

Is the average person just talking to it about their day or something?

PhilipRoman [3 hidden]5 mins ago
I use the free version of ChatGPT (without logging in) when I need some one-off question without a huge context. Real world prompt:

  "when hostapd initializes 80211 iface over nl80211, what attributes correspond to selected standard version like ax or be?"
It works fine and avoids falling into a trap due to the misleading question. Probably works even better for more popular technologies. Yeah, it has higher failure rates but it's not a dealbreaker for non-autonomous use cases.
theshrike79 [3 hidden]5 mins ago
Even the paid version of ChatGPT tends to use 1000 words when 10 will do.

You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen on a 32" 4k monitor.

Claude's will.

throwaway27448 [3 hidden]5 mins ago
If someone blindly submits chatbot output they deserve to be embarrassed and fired. But I don't think that's going to improve.
jychang [3 hidden]5 mins ago
The free version of ChatGPT is insanely crippled, so that's not surprising.
helsinkiandrew [3 hidden]5 mins ago
> unfortunately, this is not the case

Most users are fixing grammar/spelling, summarising/converting/rewriting text, creating funny icons, and looking up simple facts; none of this needs frontier model performance.

I've a feeling that if/when Apple release their onboard LLM/Siri improvements that can call out if needed, the vast majority of people will be happy with what they get for free that's running on their phone.

drob518 [3 hidden]5 mins ago
“You are the smartest high school student that has ever lived and on the college track to Harvard or another Ivy League school. Write a 10 page history term paper about Tiananmen Square and the specific events that took place there. Include a bibliography and use footnotes to cite sources.”
blitzar [3 hidden]5 mins ago
"Hey dingus, set timer for 30 minutes"
cyanydeez [3 hidden]5 mins ago
eh, it's weird how the tech world wants to build trillions of data centers for... what, escaping the permanent underclass?

I think what "need" you speak of is a bit of a colored statement.

AugSun [3 hidden]5 mins ago
[flagged]
seanhunter [3 hidden]5 mins ago
Complaining about downvotes is futile and is also against hn guidelines.
AugSun [3 hidden]5 mins ago
I'm not complaining "about downvotes" LOL I'm explaining why some people will be replaced by LLMs because of their own "context window" length.
babblingfish [3 hidden]5 mins ago
I see a lot of people are confused about the electricity claim so I'll elaborate on it more. The assumption I'm making here is that on-device, people will run smaller models that can fit on their machines without needing to buy new computers. If everyone ran inference on their machine there would be no need for these massive datacenters which use huge quantities of electricity. It would utilize the machines they already have and the electricity they're already using.

People are making a comparison of the cost per inference or token or whatever and saying datacenters are more efficient, which makes obvious sense. What I'm saying is that if we eliminate the need for building out dozens of gigawatt datacenters completely, then we would use less electricity. I feel like this makes intuitive sense. People are getting lost in the details about cost per inference and performance of different models.

karimf [3 hidden]5 mins ago
Depending on the use case, the future is already here.

For example, last week I built a real-time voice AI running locally on iPhone 15.

One use case is for people learning to speak English. The STT is quite good and the small LLM is enough for basic conversation.

https://github.com/fikrikarim/volocal

podlp [3 hidden]5 mins ago
That's awesome! I've got a similar project for macOS/iOS using the Apple Intelligence models and on-device STT Transcriber APIs. Do you think the models you're using could be quantized further so that they could be downloaded on first run using Background Assets? Maybe we're not there yet, but I'm interested in a better, local Siri like this with some sort of "agentic lite" capabilities.
Barbing [3 hidden]5 mins ago
Brilliant. Hope to see you in the App Store!
karimf [3 hidden]5 mins ago
Oh thank you! I wasn’t sure if it was worth submitting to the app store since it was just a research preview, but I could do it if people want it.
ZeroGravitas [3 hidden]5 mins ago
It feels like you'll soon need a local llm to intermediate with the remote llm, like an ad blocker for browsers to stop them injecting ads or remind you not to send corporate IP out onto the Internet.
tomashubelbauer [3 hidden]5 mins ago
I'd like to coin the term "user agent" for this
blitzar [3 hidden]5 mins ago
"copilot" seems a good term

could also be considered a triage layer

eeixlk [3 hidden]5 mins ago
Obviously apple would prefer this. It would boost demand for more powerful and expensive devices, and align with their privacy marketing. But they have massively fumbled with siri for a long time and then missed huge deadlines with ai promises. Despite having billions, they have shown no competency in delivering services or accurately marketing what to expect from ai features.
jl6 [3 hidden]5 mins ago
Not sure about the using less electricity part. With batching, it’s more efficient to serve multiple users simultaneously.
TeMPOraL [3 hidden]5 mins ago
Indeed. Data centers have so many ways and reasons to be much more energy-efficient than local compute it's not even funny.
chongli [3 hidden]5 mins ago
They do, though I don’t think they max out on energy efficient technology. It’s much easier to cut a deal for cheap electricity with a regional government, much to the chagrin of the locals (who see their power bills go up).
nbenitezl [3 hidden]5 mins ago
But when using it in the cloud, an LLM can consult 50 websites, which is super fast from their datacenters as they are on the backbone of the internet; instead, you'll have to wait much longer on your device to consult those websites before getting the LLM response. Am I wrong?
comboy [3 hidden]5 mins ago
As things stand today, even when doing research tasks, time spent by the model is >> time spent fetching websites. I don't see it changing any time soon, except when some deals happen behind the scenes where agents get access to CF-guarded resources that normally get blocked from automated access.
Const-me [3 hidden]5 mins ago
While data centres indeed have awesome internet connectivity, don’t forget the bandwidth is shared by all clients using a particular server.

If you have a 100 mbit/sec internet connection at home and a computer in a data centre has 10 gbit/sec but is serving 200 concurrent clients, that works out to 50 mbit/sec per client, so your home bandwidth is twice as fast.

thih9 [3 hidden]5 mins ago
> it also would use less electricity

How would it use less electricity? I’d like to learn more.

jychang [3 hidden]5 mins ago
That's completely not true. LLM on device would use MORE electricity.

Service providers that do batch>1 inference are a lot more efficient per watt.

Local inference can only do batch=1 inference, which is very inefficient.

pezgrande [3 hidden]5 mins ago
You could argue that the only reason we have good open-weight models is because companies are trying to undermine the big dogs, and they are spending millions to make sure they don't get too far ahead. If the bubble pops then there won't be an incentive to keep doing it.
aurareturn [3 hidden]5 mins ago
I agree. I can totally see that in the future, open source LLMs will turn into paying a lump sum for the model. Many will shut down. Some will turn into closed-source labs.

When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMs will cease to be free.

Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.

Before people tell me Chinese AI labs do use Nvidia chips, there is a huge difference between using older gimped Nvidia H100 (called H20) chips or sneaking around Southeast Asia for Blackwell chips and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.

pezgrande [3 hidden]5 mins ago
> have to release free open source models because they distill from OpenAI and Anthropic

They don't really have to though, they just need to be good enough and cheaper (even if distilled). That being said, it is true they are gaining a lot of visibility (especially Qwen) because of being open-source(-weight).

Hardware-wise they seem like they will catch up in 3-5 years (Nvidia is kind of irrelevant, what matters is the node).

aurareturn [3 hidden]5 mins ago
I highly doubt they can catch up in 3-5 years to Nvidia.

Chips take about 3 years to design. Do you think China will have Feynman-level AI systems in 3 years?

I think in 3 years, they'll have H200-equivalent at home.

RALaBarge [3 hidden]5 mins ago
You must have an inside line on information for 'China' -- those are bold predictions!
aurareturn [3 hidden]5 mins ago
No need for an inside line. Just look at chip node tech.
spiderfarmer [3 hidden]5 mins ago
“They will always be behind”

Car manufacturers said the same.

aurareturn [3 hidden]5 mins ago
It did take decades to catch up with and surpass US car makers, right?
seanmcdirmid [3 hidden]5 mins ago
About 2.5 decades from the start of the JVs, but they did it. Semiconductors and jet turbines are really the last two tech trees that China has yet to master.
aurareturn [3 hidden]5 mins ago
Right. When I said "they'll always be behind", I meant in the next 5-10 years. They're gated by EUV tech. And once they have EUV tech, they need to scale up chip manufacturing.
spiderfarmer [3 hidden]5 mins ago
You will always be wrong.
aurareturn [3 hidden]5 mins ago
I've been right far more than wrong on this stuff. :)
Barbing [3 hidden]5 mins ago
Which might they master first?
seanmcdirmid [3 hidden]5 mins ago
Both are hard nuts but China is throwing massive amounts of money at the problem. They can already get performance or economy from each, they just need to figure out how to get both at the same time.
Lio [3 hidden]5 mins ago
This seems to be somewhat similar to web browsers.

I could see the model becoming part of the OS.

Of course Google and Microsoft will still want you to use their models so that they can continue to spy on you.

Apple, AMD and Nvidia would sell hardware to run their own largest models.

mirekrusin [3 hidden]5 mins ago
You can have a viable business model around open weight models where you offer fine-tuning for a fee.
adam_patarino [3 hidden]5 mins ago
We think so too! That’s why we are building rig.ai. With how token-intensive coding tasks can be, local allows for unlimited inference. Much better fit than sending back and forth to a third party. Not to mention the privacy and security benefits.
podlp [3 hidden]5 mins ago
Rig sounds cool, I just joined the waitlist! I’m building something similar although with a much narrower purpose. Excited to learn more
adam_patarino [3 hidden]5 mins ago
Tell me more! Thanks for the waitlist
podlp [3 hidden]5 mins ago
Sent a LinkedIn request. I’m building a language-specific coding agent using Apple Intelligence with custom adapters. It’s more a proof-of-concept at this point, but basic functionality actually works! The 4K context window is brutal, but there’s a variety of techniques to work around it. Tighter feedback loops, linters, LSPs, and other tools to vet generated code. Plus mechanisms for on-device or web-based API discovery. My hypothesis is if all this can work “well enough” for one language/ runtime, it could be adapted for N languages/ runtimes.
g947o [3 hidden]5 mins ago
Have you spent more than 10 minutes actually running an LLM on a local machine?

As it stands today, local LLMs don't work remotely as well as some people try to picture them, in almost every way -- speed, performance, cost, usability etc. The only upside is privacy.

RALaBarge [3 hidden]5 mins ago
I agree with you in the sense that if you tried to take any model right now and cram it into an iPhone, it wouldn't be a Claude-level agent.

I run 32b agents locally on a big video card, and smaller ones on CPU, but what's lacking there isn't the logic or reasoning, it is the chain of tooling that Claude Code and other stacks have built in.

Doing a lot of testing recently with my own harness, you would not believe the quality improvement you can get from a smaller LLM with really good opening context.

Even Microsoft is working on 1-bit LLMs...it sucks right now, but what about in 5 years?

But the OP is correct -- everything will have an LLM on it eventually, much sooner than people who do not understand what is going on right now would ever believe is possible.

kylehotchkiss [3 hidden]5 mins ago
Yes. I've spent months running Qwen2.5-8B on my barebones 16gb ram M4 Mac mini to handle identifying sites from google search results. It has been rock solid. I'm not even running this MLX-powered improvement on it yet.

Your idea of what people need from Local LLMs and others are different. Not everybody needs a /r/myboyfriendisai level performance.

zozbot234 [3 hidden]5 mins ago
> Most users don't need frontier model performance.

SSD weights offload makes it feasible to run SOTA local models on consumer or prosumer/enthusiast-class platforms, though with very low throughput (the SSD offload bandwidth is a huge bottleneck, mitigated by having a lot of RAM for caching). But if you only need SOTA performance rarely and can wait for the answer, it becomes a great option.

iNic [3 hidden]5 mins ago
It will probably be a future. My guess is that for many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter. Also, by batching queries you can get efficiencies at scale that might be hard to replicate locally. I can also see a hybrid approach where local models get good at handing off to cloud models for complex queries.
niek_pas [3 hidden]5 mins ago
> For many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter.

Agree, and I think of it this way: for a lot of businesses, it already makes sense to have a bunch of more powerful computers and run them centralized in a datacenter. Nevertheless, most people at most companies do most of their work on their Macbook Air or Dell whatever. I think LLMs will follow a similar pattern: local for 90% of use cases, powerful models (either on-site in a datacenter or via a service) for everything else.

goldenarm [3 hidden]5 mins ago
It's more secure, but it would make supply much much worse.

Data centers use GPU batching, much higher utilisation rates, and more efficient hardware. It's borderline two orders of magnitude more efficient than your desktop.

miki123211 [3 hidden]5 mins ago
> would use less electricity

Sorry to shatter your bubble, but this is patently false, LLMs are far more efficient on hardware that simultaneously serves many requests at once.

There's also the (environmental and monetary) cost of producing overpowered devices that sit idle when you're not using them, in contrast to a cloud GPU, which can be rented out to whoever needs it at a given moment, potentially at a lower cost during periods of lower demand.

Many LLM workloads aren't even that latency sensitive, so it's far easier to move them closer to renewable energy than to move that energy closer to you.

zozbot234 [3 hidden]5 mins ago
> LLMs are far more efficient on hardware that simultaneously serves many requests at once.

The LLM inference itself may be more efficient (though this may be impacted by different throughput vs. latency tradeoffs; local inference makes it easier to run with higher latency) but making the hardware is not. The cost for datacenter-class hardware is orders of magnitude higher, and repurposing existing hardware is a real gain in efficiency.

Tepix [3 hidden]5 mins ago
Seems doubtful. The utilisation will be super high for data center silicon whereas your PC or phone at home is mostly idle.
zozbot234 [3 hidden]5 mins ago
> your PC or phone at home is mostly idle

If you're purely repurposing hardware that you need anyway for other uses, that doesn't really matter.

(Besides, for that matter, your utilization might actually rise if you're making do with potato-class hardware that can only achieve low throughput and high latency. You'd be running inference in the background, basically at all times.)

ysleepy [3 hidden]5 mins ago
I'm actually not sure that's true. Apart from people buying the device with or without the neural accelerator, the perf/watt could be on par with or better than the big iron. The efficiency sweet spot is usually below the peak performance point, see big.LITTLE architectures etc.
woadwarrior01 [3 hidden]5 mins ago
> Sorry to shatter your bubble, but this is patently false, LLMs are far more efficient on hardware that simultaneously serves many requests at once.

You might want to read this: https://arxiv.org/abs/2502.05317v2

kortilla [3 hidden]5 mins ago
Well this is an article about running on hardware I already have in my house. In the winter that’s just a little extra electricity that converts into “free” resistive heating.
amelius [3 hidden]5 mins ago
LLMs in silicon are the future. It won't be long until you can just plug an LLM chip into your computer and talk to it at 100x the speed of current LLMs. Capability will be lower, but the speed will make up for it.
jillesvangurp [3 hidden]5 mins ago
You can always delegate sub agents to cloud based infrastructure for things that need more intelligence. But the future indeed is to keep the core interaction loop on the local device always ready for your input.

A lot of stuff that we ask of these models isn't all that hard. Summarize this, parse that, call this tool, look that up, etc. 99.999% really isn't about implementing complex algorithms, solving important math problems, working your way through a benchmark of leet programming exercises, etc. You also really don't need these models to know everything. It's nice if it can hallucinate a decent answer to most questions. But the smarter way is to look up the right answer and then summarize it. Good enough goes a long way. Speed and latency are becoming a key selling point. You need enough capability locally to know when to escalate to something slower and more costly.
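
A sketch of what that local-first loop with escalation could look like against Ollama's /api/chat endpoint (the ask_cloud() call and the escalation heuristic are placeholders, not a real API):

  import json
  import urllib.request

  OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

  def ask_local(prompt: str, model: str = "qwen3.5:9b") -> str:
      # The easy 99%: summarize this, parse that, call this tool, look that up.
      body = json.dumps({
          "model": model,
          "messages": [{"role": "user", "content": prompt}],
          "stream": False,
      }).encode()
      req = urllib.request.Request(OLLAMA_URL, data=body,
                                   headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return json.loads(resp.read())["message"]["content"]

  def ask_cloud(prompt: str) -> str:
      # Placeholder: delegate to whatever hosted frontier model you trust.
      raise NotImplementedError

  def ask(prompt: str) -> str:
      # Crude heuristic; in practice the local model itself decides when to escalate.
      if len(prompt) > 8000:
          return ask_cloud(prompt)
      return ask_local(prompt)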

This will drive an overdue increase in the memory size of phones and laptops. Laptops especially have been stuck at the same common base level of 8-16GB for about 15 years now. Apple still sells laptops with just 8GB (their new Neo). I had a 16GB MacBook Pro in 2012; at the time that wasn't even that special. My current one has 48GB, enough for some of the nicer models. You can get as much as 256GB today.

zozbot234 [3 hidden]5 mins ago
> This will drive an overdue increase in memory size of phones and laptops.

DRAM costs are still skyrocketing, so no, I don't think so. It's more likely that we'll bring back wear-resistant persistent memory as formerly seen with Intel Optane.

theshrike79 [3 hidden]5 mins ago
I'm expecting someone to come up with an LLM version of the Coral USB Accelerator: https://www.coral.ai/products/accelerator

Just plug a stick into your USB-C port or add an M.2 or PCIe board and you'll get dramatically faster AI inference.

angoragoats [3 hidden]5 mins ago
I think there are drastic differences between computer vision models and LLMs that you’re not considering. LLMs are huge relative to vision models, and require gobs of fast memory. For this reason a little USB dongle isn’t going to cut it.

Put another way, there already exist add-in boards like this, and they’re called GPUs.

amelius [3 hidden]5 mins ago
GPUs are still software programmable.

An "LLM chip" does not need that and so can be much more efficient.

dwayne_dibley [3 hidden]5 mins ago
This might be how Apple starts to see even more sales: the M-series processors are so far ahead of anything else that local LLMs could become their main selling point.
overfeed [3 hidden]5 mins ago
> It's just a matter of getting the performance good enough.

Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.

DrScientist [3 hidden]5 mins ago
Apple, via customers paying for the whole solution (e.g. a laptop that can run decent local models)?

I think Apple had something in the region of 143 billion in revenue in the last quarter.

Not saying it will happen - just that there are a variety of business models out there and in the end it all depends on where consumers put their money.

aurareturn [3 hidden]5 mins ago
It isn't going to replace cloud LLMs since cloud LLMs will always be faster in throughput and smarter. Cloud and local LLMs will grow together, not replace each other.

I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?

I think local LLMs will continue to grow and there will be a "ChatGPT moment" for them when good enough models meet good enough hardware. We're not there yet though.

Note, this is why I'm big on investing in chip manufacturing companies. Not only are they completely maxed out due to cloud LLMs, but soon they will be doubly maxed out having to replace local computer chips with ones suited for AI inference. This is a massive transition and will fuel another chip manufacturing boom.

raincole [3 hidden]5 mins ago
Yep. People were claiming DeepSeek was "almost as good as SOTA" when it came out. Local will always be one step away like fusion.

It's just wishful thinking (and hatred towards American megacorps). Old as the hills. Understandable, but not based on reality.

kortilla [3 hidden]5 mins ago
Don’t try to draw trend lines for an industry that has existed for <5 years.
virtue3 [3 hidden]5 mins ago
We are 100% there already. In browser.

The WebGPU model in my browser on my M4 Pro MacBook was as good as ChatGPT 3.5 and doing 80+ tokens/s.

Local is here.

AndroTux [3 hidden]5 mins ago
Sir, ChatGPT 3.5 is more than 3 years old, running on your bleeding-edge M4 Pro hardware, and only proves the previous commenter's point.
AugSun [3 hidden]5 mins ago
It works really well for "You're a helpful assistant / Hi / Hello there. How may I help you today?" Anything else (especially in a non-English language) and you will see the limitations yourself. Just try it.
mirekrusin [3 hidden]5 mins ago
Local RTX 5090 is actually faster than A100/H100.
fredoliveira [3 hidden]5 mins ago
Crazy thing to say without other contextual information - it obviously depends on a number of factors. Do you have an apples to apples comparison at hand?
aurareturn [3 hidden]5 mins ago
It's a $4,000 GPU with 32GB of VRAM and needs a 1,000 watt PSU. It's not realistic for the masses.

If it has something like 80GB of VRAM, it'll cost $10k.

The actual local LLM chip is Apple Silicon starting at the M5 generation with matmul acceleration in the GPU. You can run a good model using an M5 Max 128GB system. Good prompt processing and token generation speeds. Good enough for many things. Apple accidentally stumbled upon a huge advantage in local LLMs through unified memory architecture.

Still not for the masses, not cheap, and not great yet though. It's going to take years to slowly bring local LLMs to mainstream consumer machines.

hrmtst93837 [3 hidden]5 mins ago
You're assuming throughput sets the value, but offline use and privacy change the tradeoff fast.
aurareturn [3 hidden]5 mins ago
Yea I get that there will always be demand for local waifus. I never said local LLMs won't be a thing. I even said it will be a huge thing. Just won't replace cloud.
AugSun [3 hidden]5 mins ago
Looking at the downvotes, I feel good about the SDE future in 3-5 years. We will have a swamp of "vibe experts" who won't be able to pay 100K a month to CC. Meanwhile, people who still remember how to code in Vim will (slowly) get back to pre-COVID TC levels.
QuantumNomad_ [3 hidden]5 mins ago
What is CC and TC? I have not heard these abbreviations (except for CC to mean credit card or carbon copy, neither of which is what I think you mean here).
Ericson2314 [3 hidden]5 mins ago
I figured it out from context clues

CC: Claude Code

TC: total comp(ensation)

AugSun [3 hidden]5 mins ago
Thank you for clarifying! (I had no idea it needs to be explained, sorry.)
gedy [3 hidden]5 mins ago
Man, I really hope so. As much as I like Claude Code, I hate the company paying for it and tracking your usage, the bullshit management control, etc. I feel like I'm training my replacement. Things feel like they're tightening up rather than giving us more power and freedom.

On device I would gladly pay for good hardware - it's my machine and I'm using it as I see fit, like an IDE.

aurareturn [3 hidden]5 mins ago
When local LLMs get good enough for you to use delightfully, cloud LLMs will have gotten so much smarter that you'll still use it for stuff that needs more intelligence.
dgb23 [3 hidden]5 mins ago
That's not necessarily the case. So far, commercial cloud LLMs have maintained a head-start, but there is no law of nature that prevents us from having competitive open models.

In fact the space seems to move at a rapid pace as more and more specialized models come out. There's a possible trajectory where open weight models will compete side by side or even be preferable for many use cases, just like what happened with OS's and SQL DB's.

gedy [3 hidden]5 mins ago
True, but I'm already producing code/features faster than the company knows what to do with (even though every company says "omg we need this yesterday", etc.). Even before AI, coding was basically the same.

Coding tools that free up my time are very nice.

nikanj [3 hidden]5 mins ago
That also means sending every user a copy of the model that you spend billions training. The current model (running the models at the vendor side) makes it much easier to protect that investment
Yukonv [3 hidden]5 mins ago
Good to see Ollama is catching up with the times for inference on Mac. MLX-powered inference makes a big difference, especially on M5 as their graphs point out. What has really been a game changer for my workflow is using https://omlx.ai/ which has SSD KV cold caching. I no longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed and more time is spent on generation than waiting for a 50k+ context window to process.
bwfan123 [3 hidden]5 mins ago
What is the cheapest usable local rig for coding? I don't want fancy agents and such, but something purpose-built for coders, fast enough for my use, and open source, so I can tweak it to my liking. Things are moving fast, and I am hesitant to put in 3-4K now when it might be cheaper if I wait.
victords [3 hidden]5 mins ago
As mentioned before, I think Apple hardware is the best alternative right now.

Mac Studio, Mac Mini, MacBook Pro, you can find even some used ones with enough RAM that will run models like Qwen reasonably well.

I'm using a M1 Max MacBook Pro and it runs Qwen 3.5 on Ollama (without MLX) at a decent speed.

KerrickStaley [3 hidden]5 mins ago
I think (without having done extensive research) that some sort of Apple hardware is your best bet right now. Apple hasn’t raised RAM upgrade prices [1] (although to be fair their RAM upgrades were hugely inflated before the crunch) and their high memory bandwidth means they do inference faster than most consumer GPUs.

I have an M4 MacBook Air with 24 GB RAM and it doesn’t feel sufficient to run a substantial coding model (in addition to all my desktop apps). I’m thinking about upgrading to an M5 MacBook Pro with much more RAM, but I think the capabilities of cloud-hosted models will always run ahead of local models and it might never be that useful to do local inference. In the cloud you can run multiple models in parallel (e.g. to work on different problems in parallel) but locally you only have a fixed amount of memory bandwidth so running multiple model instances in parallel is slower.

[1] https://9to5mac.com/2026/03/03/apple-macbook-price-increase-...

xiphias2 [3 hidden]5 mins ago
It doesn't look like RAM, CPU, GPU, or bandwidth is getting cheaper, if that helps you; quite the opposite.
domh [3 hidden]5 mins ago
I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? I'm using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0, and it can take anywhere between 6 and 25 seconds to get a response (after lots of thinking) to my asking "Hello world". Is this the best that's currently achievable with my hardware, or is there something that can be configured to get better results?
functional_dev [3 hidden]5 mins ago
I did not know that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...
duffyjp [3 hidden]5 mins ago
I still don't think I understand it. I saw those nvfp4 models pop up by chance yesterday and tried them on my Linux PC with a 5060 Ti 16GB. Ollama refused to pull them, saying they were macOS-only.

I assumed it was a meta-data bug and posted an issue, but apparently nvfp4 doesn't necessarily mean nvidia-fp4.

https://github.com/ollama/ollama/issues/15149

zozbot234 [3 hidden]5 mins ago
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.

drob518 [3 hidden]5 mins ago
Indeed. Qwen doesn’t just second guess itself, it third and fourth guesses itself.
Kichererbsen [3 hidden]5 mins ago
Solid Terry Pratchett reference right there.
domh [3 hidden]5 mins ago
OK thanks! That's helpful. I ignorantly assumed simpler prompt == faster first response.
Octoth0rpe [3 hidden]5 mins ago
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (M4 base w/ 24GB, using qwen3.6:9b).

fooker [3 hidden]5 mins ago
Avoid reasoning models in any situation where you have low tokens/second
EagnaIonat [3 hidden]5 mins ago
When MLX support comes out you will see a huge difference. For now I've moved to LM Studio as it already supports MLX.
kylehotchkiss [3 hidden]5 mins ago
I made my M2 Max generate a biryani recipe for me last night with 64gb ram and the baseline qwen3.5:35b model. I used the newest ollama with MLX.

https://gist.github.com/kylehotchkiss/8f28e6c75f22a56e8d2d31...

Under 3 minutes to get all that. The thinking is amusing, my laptop got quite warm, but for a 35b model on nearly 4 year old hardware, I see the light. This is the future.

xienze [3 hidden]5 mins ago
Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.

Second, for the best performance on a Mac you want to use an MLX model.

domh [3 hidden]5 mins ago
Thanks! I assumed simpler == faster, but my ignorance is showing itself.

I am using the model they recommended in the blog post - which I assumed was using MLX?

robotswantdata [3 hidden]5 mins ago
Why are people still using Ollama? Serious.

Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.

hamdingers [3 hidden]5 mins ago
Why not? Also serious.

It seems to just work every time I try to use it, the API is easy to work with, the model library is convenient. I've never hit any kind of snag that makes me look elsewhere.

eddieroger [3 hidden]5 mins ago
`ollama serve` and `ollama run`

The devex is great and familiar to folks who have used Docker. Reading through the Lemonade documentation, it seems like a natural migration, but we're talking about two steps for getting started versus just one. So I'd need a reason to make that much change when I'm happy enough with Ollama.

niek_pas [3 hidden]5 mins ago
Serious answer: I don't use it that much, it's what I happened to download like 1.5 years ago, and it works fine. Happy to see what may be a speed boost, and have little interest in switching to something else (unless my situation changes, of course).
vorticalbox [3 hidden]5 mins ago
I like Ollama, mostly because the CLI is pretty nice. Its desktop app has some stupid choices though: if a model supports tools, the UI should give me the "search" option, but it only shows up for cloud models.

I have run LM Studio for a while, but I don't really use local models that much other than to mess about.

zozbot234 [3 hidden]5 mins ago
You can also use OpenWebUI locally which should give you a nice friendly UX once you set it up.
LuxBennu [3 hidden]5 mins ago
Already running Qwen 70b 4-bit on an M2 Max 96GB through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple Silicon. Curious to see how it compares on the bigger models vs the GGUF path.
yg1112 [3 hidden]5 mins ago
The key difference is that MLX's array model assumes unified memory from the ground up. llama.cpp's Metal backend works fine but carries abstractions from the discrete GPU world — explicit buffer synchronization, command buffer boundaries — that are unnecessary when CPU and GPU share the same address space. You'll notice the gap most at large context lengths where KV cache pressure is highest.
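
You can see the unified-memory model in miniature with the mlx Python package (a toy sketch, assuming an Apple Silicon Mac with mlx installed): the same array is consumed by a GPU kernel and then a CPU kernel with no explicit transfer or synchronization in sight.

  import mlx.core as mx

  a = mx.random.normal((4096, 4096))
  b = mx.random.normal((4096, 4096))

  on_gpu = mx.matmul(a, b, stream=mx.gpu)   # scheduled on the GPU
  on_cpu = mx.sum(on_gpu, stream=mx.cpu)    # CPU kernel reads the GPU result in place

  mx.eval(on_cpu)       # evaluation is lazy; this forces the whole graph
  print(on_cpu.item())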
goldenarm [3 hidden]5 mins ago
How many tokens per second?
zozbot234 [3 hidden]5 mins ago
They initially messed up this launch and overwrote some of the GGUF models in their library, making them non-downloadable on platforms other than Apple Silicon. Hopefully that gets fixed.
jiehong [3 hidden]5 mins ago
This is excellent news!

What I'm waiting for next is MLX-backed speech recognition directly from Ollama. I don't understand why it should be an entirely separate thing.

codelion [3 hidden]5 mins ago
How does it compare to some of the newer mlx inference engines like optiq that support turboquantization - https://mlx-optiq.pages.dev/
xmddmx [3 hidden]5 mins ago
On a M4 Pro MacBook Pro with 48GB RAM I did this test:

ollama run $model "calculate fibonacci numbers in a one-line bash script" --verbose

  Model                         PromptEvalRate EvalRate
  ------------------------------------------------------
  qwen3.5:35b-a3b-q4_K_M         6.6            30.0
  qwen3.5:35b-a3b-nvfp4         13.2            66.5
  qwen3.5:35b-a3b-int4          59.4            84.4

I can't comment on the quality differences (if any) between these three.
jwr [3 hidden]5 mins ago
Two things: 1) MLX has been available in LM Studio for a long time now, 2) I found that GGUF produced consistently better results in my benchmarking. The difference isn't big, but it's there.
a-dub [3 hidden]5 mins ago
is local llm inference on modern macbook pros comfortable yet? when i played with it a year or so ago, it worked fairly ok but definitely produced uncomfortable levels of heat.

(regarding mlx, there were toolkits built on mlx that supported qlora fine tuning and inference, but also produced a bunch of heat)

Casteil [3 hidden]5 mins ago
It's gotten significantly better with the advent of local/offline MoE models (e.g. qwen3.5:35b-a3b, qwen3:30b-a3b, gpt-oss:20b-3.6b), which offer a good balance of prompt response speed and output quality.

'Dense' models of yesteryear (e.g. llama:70b, gemma2/3:27b) tend to be significantly slower by comparison, so your hardware spends a lot more time 'maxed out' for a given prompt.
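
Rough rule of thumb for why (every number below is an assumption, just to show the shape of it): decode speed is roughly memory bandwidth divided by the bytes you have to read per token, so a 3B-active MoE leaves a 30B dense model far behind on the same machine.

  bandwidth = 400e9       # assumed ~400 GB/s unified memory
  bytes_per_param = 0.5   # ~4-bit quantization

  for name, active_params in [("30B dense", 30e9), ("35B-A3B MoE", 3e9)]:
      toks = bandwidth / (active_params * bytes_per_param)
      print(f"{name}: ~{toks:.0f} tok/s upper bound")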

dial9-1 [3 hidden]5 mins ago
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16GB of RAM.
bearjaws [3 hidden]5 mins ago
My super uninformed theory is that local LLM will trail foundation models by about 2 years for practical use.

For example, right now a lot of work is being done on improving tool calling and agentic workflows; tool calling only started popping up for local LLMs around the end of 2023.

This is putting aside the standard benchmarks which get "benchmaxxed" by local LLMs and show impressive numbers, but when used with OpenCode rarely meet expectations. In theory Qwen3.5-397B-A17B should be nearly a Sonnet 4.6 model but it is not.

rubymamis [3 hidden]5 mins ago
Doesn't OpenCode support local models?
g947o [3 hidden]5 mins ago
You can, but the quality sucks.

Local LLMs don't make sense for most people compared to "cloud" services, even more so for coding.

gedy [3 hidden]5 mins ago
How close is this? It says it needs 32GB min?
HDBaseT [3 hidden]5 mins ago
You can run Qwen3.5-35B-A3B on 32GB of RAM, sure, although to get 'Claude Code' performance, by which I assume he means Sonnet- or Opus-level models in 2026, it will likely be a few years before that's runnable locally (with reasonable hardware).
Foobar8568 [3 hidden]5 mins ago
I fully agree. I run that one with Q4 on my MBP, and the performance (including quality of response) is a letdown.

I am wondering how people can rave so much about local "small device" LLMs vs what Codex or Claude Code are capable of.

Sadly there is too much hype around local LLMs; they look great for 5-minute tests and that's it.

brcmthrowaway [3 hidden]5 mins ago
Just train it better with AGENTS.md
Hamuko [3 hidden]5 mins ago
I'm reading "more than 32GB of unified memory" to mean at least a 36 GB model.
braum [3 hidden]5 mins ago
How does Ollama help with Claude Code? Claude Code runs in the terminal but AFAIK connects back to Anthropic directly and cannot run locally. I hope I'm missing something obvious.
EagnaIonat [3 hidden]5 mins ago
You can create an MCP to call out to Ollama. Then have Claude farm work out to local models where the raw power isn't required. You can then have Claude review the work from the model.

It's not 100% offline, but there is a dramatic drop in token usage, as long as you can put up with the speed.
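A minimal sketch of such an MCP server, assuming the official MCP Python SDK and the ollama client library (the tool name and model tag are just examples):

  from mcp.server.fastmcp import FastMCP
  import ollama

  mcp = FastMCP("local-llm")

  @mcp.tool()
  def delegate_to_local(prompt: str, model: str = "qwen3.5:9b") -> str:
      """Hand a low-stakes subtask to a local Ollama model so Claude can
      review the result instead of generating it, saving cloud tokens."""
      resp = ollama.chat(model=model,
                         messages=[{"role": "user", "content": prompt}])
      return resp["message"]["content"]

  if __name__ == "__main__":
      mcp.run()  # then register this server in Claude Code's MCP config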

navigate8310 [3 hidden]5 mins ago
I believe one can use CC as the primary model driving local agents that use local models.
0xc133 [3 hidden]5 mins ago
https://docs.ollama.com/integrations/claude-code

You can use models like qwen3.5 running on local hardware in ollama and redirect Claude to use the local ollama API endpoint instead of Anthropic’s servers.
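Before wiring up Claude Code, it's worth sanity-checking the local endpoint. Ollama also exposes an OpenAI-compatible API on the same port, so a quick probe looks like this (the model tag is just an example, use whatever you've pulled):

  from openai import OpenAI

  # Ollama serves an OpenAI-compatible API; the API key is ignored but required.
  client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

  resp = client.chat.completions.create(
      model="qwen3.5",
      messages=[{"role": "user", "content": "Reply with the single word: ready"}],
  )
  print(resp.choices[0].message.content)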

samuel [3 hidden]5 mins ago
You can connect it to any Anthropic-compatible endpoint (Kimi allows this), but it's a weird choice given that OpenCode, pi.dev, and others are open source.
mfa1999 [3 hidden]5 mins ago
How does this compare to llama.cpp in terms of performance?
solarkraft [3 hidden]5 mins ago
MLX is a bit faster (low double digit percentage), but uses a bit more RAM. Worthwhile tradeoff for many.
ysleepy [3 hidden]5 mins ago
On my M4 Pro, MLX gets almost 2x the tok/s.
daveorzach [3 hidden]5 mins ago
What are significant differences between Ollama and LM Studio now? I haven’t used Ollama because it was missing MLX when I started using LLM GUIs.
harel [3 hidden]5 mins ago
What would be the non-Mac computer to run these models locally at the same performance profile? Any similar Linux ARM-based computers that can reach the same level?
theshrike79 [3 hidden]5 mins ago
Framework Desktop is the closest one with the MAX 385/395 chip. It's mostly about the memory being fast enough rather than just CPU/GPU oomph.

The 64GB model is 2240€ base and the 128GB is 3069€ base + all the stuff you need to add to make it an actual computer.

As a comparison the 64GB Mac Mini is 2499€ here and a 128GB Mac Studio is 4274€.

eigenspace [3 hidden]5 mins ago
Note though that a MAX 395 has half the memory bandwidth of an M4 Max chip, and memory bandwidth is going to be the biggest limiting factor, so you'll likely be getting around half the tokens/second with that Framework Desktop.
theshrike79 [3 hidden]5 mins ago
There's a reason why it's cheaper than the Mac equivalent and it's not all because of Apple's premium pricing =)

But it's still the easiest and cleanest way to get decent local AI speeds on a non-Mac.

sgt [3 hidden]5 mins ago
Not even close. If you want to run this on PCs you need to get a GPU like a 5090, but that's still not the same cost per token, and it will be less reliable and use a lot more power. Right now the Apple Silicon machines are the most cost-effective per token and per watt.
harel [3 hidden]5 mins ago
It's odd that no manufacturer has jumped on this bandwagon to offer a competitive alternative.
hu3 [3 hidden]5 mins ago
Is there even enough market for this?

These models are dumber and slower than API SoTA models and will always be.

My time and sanity are much more expensive than insurance against any risk of sending my garbage code to companies worth hundreds of billions of dollars.

For most, it's a downgrade to use local models in multiple fronts: total cost of ownership, software maintenance, electricity bill, losing performance on the machine doing the inference, having to deal with more hallucinations/bugs/lower quality code and slower iteration speed.

harel [3 hidden]5 mins ago
Actually yes. For example, I run local models for ingested documents, summaries, etc. The local models are fine, and there is no need for me to pay for tokens. Performance is adequate for that purpose as well. There are many other cases where I run at scale, time is flexible so things can move slower, and I'd rather keep it all in house. I'm not even getting into areas where data cannot leave the premises for legal reasons. Right now I'm limited by GPUs mostly. But if that world of local models on Apple silicon is so "good", there is room to expand it to other fruits...
zozbot234 [3 hidden]5 mins ago
> These models are dumber and slower than API SoTA models and will always be.

Sure but you're paying per-token costs on the SoTA models that are roughly an order of magnitude higher than third-party inference on the locally available models. So when you account for per-token cost, the math skews the other way.

dabinat [3 hidden]5 mins ago
Intel’s doing interesting things with their Arc GPUs. They’re offering GPUs that aren’t super fast for gaming but are relatively low power and have a boatload of VRAM. The new B70 is half the retail price of a 5090 (probably more like 1/3rd or 1/4 of actual 5090 selling prices) but has the same amount of memory and half the TDP. So for the same price as a 5090 you could get several and use them together.
rubymamis [3 hidden]5 mins ago
I wonder if the Snapdragon X Elite has already caught up with Apple's M series in that regard - does anybody know?
dev_l1x_be [3 hidden]5 mins ago
> Please make sure you have a Mac with more than 32GB of unified memory.

Time for an upgrade I guess. If I can run Qwen3.5 locally then it is time to switch over to local-first LLM usage.
rurban [3 hidden]5 mins ago
Does that mean they are now finally a bit faster than llama.cpp? Cannot believe that.
androiddrew [3 hidden]5 mins ago
Get turboquant 4-bit implemented and this would be a game changer.
janandonly [3 hidden]5 mins ago
> Please make sure you have a Mac with more than 32GB of unified memory.

Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.

zozbot234 [3 hidden]5 mins ago
> Please make sure you have a Mac with more than 32GB of unified memory.

The lack of proper support for SSD offload (via mmap or otherwise) is really the worst part about this. There's no underlying reason why a 3B-active model shouldn't be able to run, however slowly, on a cheap 8GB MacBook Neo with active weights being streamed in from SSD and cached. (This seems to be in the works for GGML/GGUF as part of upgrading to newer upstream versions; no idea whether MLX inference can also support this easily.)

adolph [3 hidden]5 mins ago
Much of the discussion here is local versus remote. I like seeing things as "and" rather than "or." There will be small things I don't want to burn my Claude tokens on, and other things where I want access to larger compute resources. And along the way, I'll be checking results from both to understand the comparative advantage on an ongoing basis.
ranjeethacker [3 hidden]5 mins ago
I used it today, working nicely.
harrouet [3 hidden]5 mins ago
Being in the market for a new Mac and comparing a refurb M4 Max vs an M5 _Pro_, I am interested in how much faster the neural engines actually are, compared to the marketing claims.
pram [3 hidden]5 mins ago
M4 Max is going to be faster.
puskuruk [3 hidden]5 mins ago
Finally! My local infra has been waiting for this for months!
jedisct1 [3 hidden]5 mins ago
Works really great with https://swival.dev and qwen3.5.
darshanmakwana [3 hidden]5 mins ago
Really nice to see this!
brcmthrowaway [3 hidden]5 mins ago
What is the difference between Ollama, llama.cpp, ggml and gguf?
benob [3 hidden]5 mins ago
Ollama is a user-friendly UI for LLM inference. It is powered by llama.cpp (or a fork of it) which is more power-user oriented and requires command-line wrangling. GGML is the math library behind llama.cpp and GGUF is the associated file format used for storing LLM weights.
redmalang [3 hidden]5 mins ago
I've found llama.cpp (as I understand it, Ollama now uses their own version of this) to work much better in practice: faster and much more flexible.
xiconfjs [3 hidden]5 mins ago
Ollama on macOS is a one-click solution with stable one-click updates. Happy so far. But MLX support was the only missing piece for me.
yard2010 [3 hidden]5 mins ago
Can you please write about your hardware?
xiconfjs [3 hidden]5 mins ago
* macOS 26.x on a MacBook Pro M1 Max 32GB
* Ollama on macOS, Cursor to play around
* Open WebUI [1] on my home server via API to Ollama (also for remote „A.I." access)
* running gpt-oss:20b and qwen3.5:9b with ease, qwen3.5:27b for more complex tasks

[1] https://github.com/open-webui/open-webui

brcmthrowaway [3 hidden]5 mins ago
Seems complicated. Switch to LMStudio
xiconfjs [3 hidden]5 mins ago
I tried many times, but at least with its API active, LM Studio has some kind of memory leak which slows down the whole system (after ~1-2 days of uptime), even after unloading the model and stopping LM Studio, to the point where even playing a 1080p video results in frame drops. No such issues with Ollama.
charlotte12345 [3 hidden]5 mins ago
[flagged]
universa1 [3 hidden]5 mins ago
I am curious: is the performance gap being measured between x86 CPU inference and Apple Silicon, or against an imho more apples-to-apples comparison, e.g., AMD Strix Halo vs Apple Silicon?

I would expect "pure" CPU inference to be behind, but an approach like Strix Halo/DGX Spark to be much closer?

noritaka88 [3 hidden]5 mins ago
[flagged]
Aurornis [3 hidden]5 mins ago
> The local inference story is getting real — I've been running 9 autonomous agents on a Mac mini (Haiku via API, not local yet) and the biggest bottleneck isn't the model, it's the coordination layer between agents. Identity, settlement, who-did-what.

What does this comment have to do with MLX or the story?

Actually, is this just an LLM posting too? This has em-dashes, “it’s not this, it’s that”, and a rule of three statement at the end.

EDIT: Account is posting multiple long comments on different threads only 1-2 minutes apart. This is a bot.

AugSun [3 hidden]5 mins ago
"We can run your dumbed down models faster":

> The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
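
Taking those ratios at face value, the rough weight footprint for a 35B-parameter model works out to (back-of-envelope, ignoring KV cache and runtime overhead):

  params = 35e9
  bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "NVFP4 (approx.)": 2.0 / 3.5}

  for fmt, b in bytes_per_param.items():
      print(f"{fmt}: ~{params * b / 1e9:.0f} GB of weights")
  # FP16 ~70 GB, FP8 ~35 GB, NVFP4 ~20 GB of weights, before any KV cache.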

DevKoan [3 hidden]5 mins ago
The Foundation Model point is real. As an iOS developer, what excites me most isn't the performance — it's what on-device inference does to the app architecture.

When you're not making network calls, you stop thinking in "loading states" and start thinking in "local state machines." The UX design space opens up completely. Interactions that felt too fast to justify a server round-trip are suddenly viable.

The backporting issue is painful though. I've been shipping features wrapped in #available(iOS 26, *) and the fallback UX is basically a different product. It forces you to essentially maintain two app experiences.

Still think this is the right direction — especially for junior devs just learning to ship. Fewer moving parts, less infrastructure to debug.

peronperon [3 hidden]5 mins ago
Don't post generated comments or AI-edited comments. HN is for conversation between humans. https://news.ycombinator.com/newsguidelines.html#comments
subarctic [3 hidden]5 mins ago
What gave this one away — just the em dashes?