Claude Fable 5
System Card [pdf]: https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
2521 points by Philpax - 2026 comments
One that I'm willing to share (albeit from just a week ago) - I built a Python library last week that bundles MicroPython compiled to WASM to create a sandboxed code execution library: https://github.com/simonw/micropython-wasm
I just told Claude.ai (not even Claude Code - this was the standard Claude chat interface) running Fable 5:
A few prompts later (and I uploaded the zip files from https://github.com/brettcannon/cpython-wasi-build/releases/t... because Claude chat can't access those files itself) and I have a wheel file that bundles Python itself, compiled to WASM: Here's the transcript: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35 (It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.)
And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.
You can’t benchmaxx an eval that comes after your model release.
Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.
Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.
Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.
This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which zero labs are doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam, for one benchmark example? You could trivially get those numbers to close to 100% if you wanted to.
That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.
This starts to break down in college, when the professors are often at best only slightly ahead. (They have more knowledge and experience, but in a slightly different area, so it isn't relevant to the depth of whatever is under consideration.) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)
> This starts to break down in college, when the professors are often at best only slightly ahead. (They have more knowledge and experience, but in a slightly different area, so it isn't relevant to the depth of whatever is under consideration.) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??
That is what benchmarks and intelligence tests are, and they are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.
But point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.
Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?
throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
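A rough sketch of what "same prompt, multiple models, graded blind" could look like. Everything here is an assumption for illustration: the models are plain Python callables standing in for real API clients, and the names `blind_trial` and `tally` are made up. The grader only ever sees `candidate-N` labels, and the key is consulted after scoring.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical stand-in: a "model" is any callable mapping a prompt to an answer.
Model = Callable[[str], str]

def blind_trial(
    prompt: str, models: Dict[str, Model], rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Send one prompt to every model; return anonymized outputs plus the hidden key."""
    names = list(models)
    rng.shuffle(names)  # shuffle so the label order carries no information
    labels = [f"candidate-{i + 1}" for i in range(len(names))]
    shown = {label: models[name](prompt) for label, name in zip(labels, names)}
    key = dict(zip(labels, names))  # keep this away from the grader until scoring is done
    return shown, key

def tally(scores: Dict[str, float], key: Dict[str, str], totals: Dict[str, float]) -> None:
    """Unblind one trial's scores and add them to the per-model running totals."""
    for label, score in scores.items():
        name = key[label]
        totals[name] = totals.get(name, 0.0) + score
```

Rotating the prompt per run, as suggested above, just means calling `blind_trial` with a fresh prompt each time and accumulating into the same `totals` dict.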
Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.
no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.
Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.
"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.
zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.
Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.
So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...
https://generative-ai.review/2026/06/claude-fable-rush-test-...
As mentioned in another HN thread, I've done qualitative side-by-side measurements of Claude Fable vs Opus 4.8 vs ChatGPT 5.5.
Anyone is able to check the output for themselves and form a judgement.
Large visible improvements for Fable over Opus 4.8 and ChatGPT 5.5.
I recently did the same to show the progress from Opus 3.4/ChatGPT o3pro one calendar year ago.
That website is 95% not you, it's AI, and I feel that's causing you to way over-represent the value of it in your response here, or you're completely misunderstanding what the person you're responding to is asking. If you put all of your effort into that site, without AI, it would be infinitely more valuable and useful.
The person you responded to asked for specific things, including:
- objective, unbiased measurements, but all that page has is side by side visual comparison of outputs.
- their different generations, but all you included was the outputs
- details on the prompts and little things people are adding because they feel they need to, but you didn't include any of that
This is slop, it's the exact sort of self confirming fluffy AI stuff that either inexperienced or over-invested-in-AI engineers will look at briefly, skim, see quick visual validation, and nod, noting down how much better Fable must be without getting any actual data.
Sorry, it's early, and maybe this is a misplaced rant, but the person you responded to specifically asked for precise, quantitative things precisely because everything else is fluffy slop like this, and people don't even recognise they're doing it any more.
Fable just got announced and I did a rush out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a wordpress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.
If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.
How is a side by side direct comparison NOT precise?
[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.
[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advanced warning.
I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.
I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.
Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.
In my opinion, if one cannot express themselves civilly, they should refrain from commenting.
AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.
It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.
I don't think it was "a huge waste of time" or needed your rant.
You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.
What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.
I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.
I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.
The issue here I feel is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.
https://www.anthropic.com/news/claude-fable-5-mythos-5
And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.
"Don't make mistakes" does seem dumb. It's not guidance.
https://simonwillison.net/about/#disclosures
"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."
But I'm totally unbiased on my gut-feeling posts, trust me bro.
-- AI influencers.
[1]https://en.wikipedia.org/wiki/Simon_Willison
Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked fable to figure it out via reasoning and prototyping, and it did - it even, under its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.
I'm sure its weaknesses will become apparent in time. But, wow this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.
I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard of algorithm for collaborative editing, and I’ve had a long term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out of the box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.
https://blog.helsing.ai/posts/dson-a-delta-state-crdt-for-re...
https://www.youtube.com/watch?v=4QkLD7JhD_I&pp=ygUJZHNvbiBjc...
For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.
I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.
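To make the fuzzing side of this concrete, here is a minimal sketch of a convergence fuzzer. It is deliberately not eg-walker or anyone's production CRDT: the grow-only counter (`GCounter`) is a swapped-in textbook toy, and `fuzz_once` is a hypothetical name. The idea is only to show the shape of the check: apply random operations and random one-way syncs, then do a full exchange and assert every replica ends in the same state with no updates lost.

```python
import random

class GCounter:
    """Toy grow-only counter CRDT: one slot per replica, merge is element-wise max."""

    def __init__(self, replica_id: int, n_replicas: int) -> None:
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self) -> None:
        self.slots[self.replica_id] += 1

    def merge(self, other: "GCounter") -> None:
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

    def value(self) -> int:
        return sum(self.slots)

def fuzz_once(seed: int, n_replicas: int = 3, steps: int = 200) -> int:
    """Run one randomized schedule; raise AssertionError if replicas diverge."""
    rng = random.Random(seed)  # seeded so any failure can be replayed
    replicas = [GCounter(i, n_replicas) for i in range(n_replicas)]
    increments = 0
    for _ in range(steps):
        if rng.random() < 0.5:
            rng.choice(replicas).increment()
            increments += 1
        else:
            receiver, sender = rng.sample(replicas, 2)
            receiver.merge(sender)  # one-way sync between two random replicas
    # Full exchange in a random order: every replica merges from every other one.
    pairs = [(a, b) for a in replicas for b in replicas if a is not b]
    rng.shuffle(pairs)
    for receiver, sender in pairs:
        receiver.merge(sender)
    assert all(r.slots == replicas[0].slots for r in replicas), f"diverged (seed={seed})"
    assert replicas[0].value() == increments, f"lost updates (seed={seed})"
    return increments

if __name__ == "__main__":
    for seed in range(1000):
        fuzz_once(seed)
    print("no divergence found")
```

A passing run only says no counterexample turned up in the schedules tried, which is the limitation raised above. The seed in the assertion message is what makes a failure useful, since it replays the exact schedule.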
I have found this quickly becomes false. I have learned I cannot review llm generated code as if it is written by a trusted senior developer (where I often just do a quick look, see nothing obvious and hit approve). Once you start reading the code in depth with the goal of understanding you quickly see the places where flaws are likely. Sure I start with no clue where to look, but it doesn't take long to see things.
I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.
So far at least - and it's been less than a day - Fable seems better at this.
I think I also do my CRDTs differently from others. I've grown to like the pure-oplog approach after making eg-walker. LLMs are much worse at this!
Damn you must be good, I've been feeling this for around 2 years now
"But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.
AI is just another tool, learn to use it.
Like it did everything:
- this is not a Linux system (true, it was macOS)
- it is not an available command
- the binary is corrupted
- node/js is more precise
- V8 JavaScript is faster than bash (true technically??? But not in this context lol)
- JavaScript is more versatile
I forgot what else we went through but there were a few more things. I indulged it because it was incredulous and funny. The prompts from my side were all questions, never instructions. I assume an instruction would've helped here, but also I don't think Opus ever did this (but on the other hand Opus wrote python scripts to format/indent, instead of just running cargo fmt, so I guess potato potato)
> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more
I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.
https://x.com/i/status/2064449457869984035
Assuming the model is being “truthful”, CC is just being stupid in its detection mechanism.
I’m having a really hard time believing some weak reason for a 30 day retention policy.
e: I quit the session and went back in. Set it to Fable and told it to continue the last session. It's moving along as if none of that had happened.
How weird.
https://www.wired.com/story/openai-anthropic-letter-ai-biolo...
Or Fable’s arch is different enough the allocated clusters of compute targeting a date, and here we are, ready or not.
Or…
Question is if there will be any competition in this area...
I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!
I am quite happy that opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!
At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.
The results were near useless.
The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.
Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.
Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.
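One cheap guard against the "moved a snippet and silently changed it" failure is a mechanical check that a pure move did not alter any line. This is a sketch under assumptions: `moved_without_changes` is a made-up helper, and it only applies when the edit was supposed to be a reordering with no re-indentation.

```python
from collections import Counter

def moved_without_changes(before: str, after: str) -> bool:
    """True if `after` has exactly the same lines as `before`, in any order."""
    return Counter(before.splitlines()) == Counter(after.splitlines())

original = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
reordered = "import os\n\ndef b():\n    return 2\n\ndef a():\n    return 1\n"
tampered = "import os\n\ndef b():\n    return 2\n\ndef a():\n    return 10\n"

print(moved_without_changes(original, reordered))  # True
print(moved_without_changes(original, tampered))   # False
```

Comparing line multisets means a legitimate move passes while any edited, added, or dropped line fails, so it can run automatically before the model's edit is accepted.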
I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!
The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.
A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.
I can't remember if it was chatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.
Sorry to belabor this but it's basically pointless saying you have nuts it can't crack without showing us the nuts.
I gave a high level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problem everyone is waiting for to be solved.
My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.
(Joking aside, see sibling threads.)
Recently (last couple of months?) these models are becoming useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.
But, there are still definite limits, where one could expect an expert human to solve things, given time, but models do not. Thus, more intelligence would be nice!
These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.
A year ago my judgement was that I had wasted my time on trying to work with the models and doing things myself would have been more productive as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...
Being a theory builder more than a problem solver I am excited for the future.
Also excited for fully formalised mathematics to hit main stream!
I’ve done the same thing with opus multiple times with no issue. According to ccusage I racked up just shy of $100 of tokens using Fable.
It spun up subagents or workflows or whatever so obviously that contributed but “double opus” was not my experience. I’ve done the exact same prompt with opus on the highest setting and only once before (not even while using this prompt) hit my limits.
My prompt? I’m not a prompt wizard or anything but it was literally:
> Please review the uncommitted code in this repo for bugs/issues/code smells.
I use variations on that all the time with opus and never had issues. I figured it was a good one to kick the tires with Fable. Little did I know it would mean no more Claude Code for the next 4.5hrs (unless I wanted to pay) after this being the first time I had used CC that day (yesterday).
All in all, a pretty crappy first experience.
And run the command again. I get $126.89 for yesterday.
I think the $10.96 is coming from gpt-5.5 since I switched to it once I exhausted all my usage on CC. CCusage reports completely different numbers so I don't know which one of those is right.
Thanks for trying, for yesterday ccusage says "$92.02" for claude, which I assumed was the Fable usage.
Unfortunately it's not telling the whole story. The last message from the _only_ Fable session it monitored was:
> The data layer looks clean — <REDACTED>. Now waiting on the 11-angle workflow — verification and the gap sweep run after the finders; I'll compile the full ranked findings list when it completes.
And my memory jives with that, I could see in the footer that it had spun up 11 agents (though agentsview says it used 0 subagents, don't know if it was "actually" workflows that it spun up?). It's like it didn't record the sub-sessions/sub-agents info?
I'm still shocked that my prompt (which I now can see thanks to this tool) of:
> Please review all the uncommitted work in this repo and identify any issues.
was able to burn so much, so quickly, and, most frustratingly, without actually doing anything useful because killing it was my only option lest it spend even more of "extra usage".
Overview of usage: https://cs.joshstrange.com/RjGzWVXy
Stats for that 1 session: https://cs.joshstrange.com/Fj5qv1wl
https://cs.joshstrange.com/z9x6SPcC
I've been watching my usage quota bars drop as I use the model, so I don't think I have a weird quota issue going on here.
Soon the times of AI for $20/$200 a month will be long gone.
Forcing developers to pay for models that were built on code they scraped scot-free
A tax to do their job that developers are jumping at the chance to pay
Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards
Yes this makes me sad beyond explanation. Especially when I see open source developers happily using these tools. These companies stole your free, hard work and charge you a subscription!! Not to speak about them torrenting books and (most likely) training on private repos.
This and devs paying a subscription to use a tool that is marketed as trying to replace them.
I had a 150$ monthly budget that I used for various open source projects and I've cut that entirely.
In case you weren't aware, Anthropic, OpenAI and GitHub Copilot all have programs that provide access to open source maintainers for free:
GitHub: https://docs.github.com/en/copilot/how-tos/copilot-on-github...
Anthropic: https://claude.com/contact-sales/claude-for-oss
OpenAI: https://developers.openai.com/community/codex-for-oss
Then you say you had money that you used to donate(?) to OS and have cut that because of the frustration?
Open source just means sharing the source code for people to learn off or have the ability to customize on their own. I don't think there is any need to be frustrated about that (now if it was copyright/private of course).
Yes people, not corporations. The point is there are licenses to be respected that weren't.
We could fix that, but it requires a political will to change the law.
> To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies.
That's also caused by some very smart (even brilliant) developers (you can see many of them in this very thread) choosing to be oblivious about all this and bury us all under, hoping that they'll be among the last ones to go. Writing this down I realise that they maybe aren't all that smart.
It would not surprise me one bit to see anywhere from $80k-$100k/seat pricing.
Not everyone needs a Ferrari to go for a weekly shopping.
Maybe? If you talk to executives, the impression that I am getting is that they tend to be somewhat misinformed at best, which, yes, is bound to result in some really bad decisions down the road. But, and it is not a small but, the ones I did talk to ( and, amusingly, those are the ones with strong opinions ) don't seem to have a lot of, um, practical exposure to this tech beyond what they heard at the watercooler. Honestly, it is kinda infuriating. And all this before we get to how companies want to say they use AI, but also keep cost down.
You and your work are not that special, you are not participating in car races, and you don't need a Ferrari.
So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still does errors, how could we ever know that super-invention by AI is not wrong?
If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.
AI is interesting as long as it can save time and/or money in getting an acceptable result. Anything that runs on a computer and can do "things that humans can do" will automatically end up doing things that humans won't do, simply by virtue of the fact that it runs on a machine that doesn't require sleep, doesn't get bored or demotivated, etc.
Verifying code (to a level where a responsible person is willing to take ownership for it) isn't trivial, sure; but writing the code by hand requires the same level of care, and the fact that the same person wrote it doesn't actually allow for shortcuts (if we're being properly responsible).
What if an LLM overall starts to make less mistakes than a medium developer, costs less than its salary and is 100 x faster? For sure, the companies that will leverage these with just a few senior devs doing prompting, testing and requirements analysis, will outcompete other organizations.
AI agents do that, perhaps not always, but still do. Now the question: would I trust AI without verifying its output?
Do you verify every line of code written by your fellow developers? I doubt it, which is strange because they make errors don't they?
What matters is the error rate. Past some threshold and they're better than senior devs who you don't supervise closely.
No idea what's going on here but agent tested a bunch of stuff. Then I asked to build a wheel so I can run the command you noted above and it appears to pass
For those who are curious...
https://github.com/bamggm/micropython-wasm/commit/5ddebae592...
https://github.com/bamggm/micropython-wasm/commit/8b362fba1f...
To be clear, the jump from Opus to Fable was like the jump from pre o3 -> o3 for me. Very sharp improvement, not incremental. But that could be explained by dummy long thinking times.
It one shot a task that Opus burned hundreds of dollars on to get nowhere. Very tricky semantic refactor, got it right. Granted, again, the semantics Opus and I fleshed out 3 months prior, but Opus couldn't execute on the vision. Fable could.
Then I discussed some philosophy and it was actually both pleasant (GPT constantly "corrected" you for the sake of correction without clarification, also still often just wrong; it's like it refused to think critically about philosophy) and accurate, and actually helped resolve some deep but subtle misconceptions I had around representationalism. When talking with GPT I felt like I was talking with someone who either was sycophantic or "anything that is not absolute truth is relativism" - Fable actually discussed.
It's both exciting and kind of makes me depressed. I can definitely see why people are getting hyped about AGI again. All the models were extremely strong technically but I felt like they couldn't match the developer's tacit state - Fable definitely did, and that's a basic quality to be considered "usefully intelligent" IMO, at least to me.
Shame that it's going away in 2 weeks and probably going to be nerfed if/when it's re-released.
[0]: https://github.com/eryx-org/eryx
Which has a full build of python to WASM with a bunch of static libs built in already.
I will say I built this pre fable and actually the first build of the interpreter to WASM opus pretty much nailed, cpython has secondary support for WASM as a target since like 3.9 or something and it just pulled from that.
I’ve been meaning to write up a blog post about this sometime, building this has been pretty interesting, including using opus to run a full auto research like loop for days to hyper optimize its performance.
I’m hoping to use fable to power some even crazier WASM adventures tho.
It feels like you can give it a big chunky problem and leave it alone and it gets it done, with less questions and fewer design decisions that I wouldn't have made.
In reviewing its code I'm finding less to complain about than Opus. But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.
https://generative-ai.review/2026/06/claude-fable-rush-test-...
I get them to make a 3D explainer animation. You can clearly see Fable is much improved on both Opus 4.8 and ChatGPT 5.5.
Better Textures . A nifty camera follow . Humans rendered better . ... see for yourselves
> But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.
Fable just did it, clean code, one timeout with a hanging bash script, fixed a couple very old very structural bugs in the codebase
I am not sure it's perfect, and it will need further validation
This morning I looked at code samples & checked if all unit/integration and e2e pass & performance tests pass
I also generated a postgres schema diagram.
Aka I did probably 2 hours of work, rest was not me
The opus try was last month
It made sense for people doing proper and fair AI breakdowns waiting on an embargo, but now it's just slop I don't trust anymore.
> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events.
[1] https://simonwillison.net/about/
Compared to what?
Update: looks like I've spent $82.92 in Fable 5 API priced tokens so far today (still all included in my subscription.)
Here's a TIL on how I'm calculating spending using AgentsView: https://til.simonwillison.net/llms/agentsview-custom-model-p...
* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.
* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.
* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.
It's been discussed at length (on this site, on other sites, on like every blog ever, etc) that, eventually, those subsidies will end, much as the $5-10 Ubers/Lyfts I used to take from the far north end of Chicago into the Loop in 2016 would eventually end once those companies had a footing and didn't need to hook folks.
So - yeah, I mean, a v5 model launching in a year where Anthropic has a rather deeply established market and in a year where AI costs are rising from nearly all providers (sometimes for multiple reasons) seems like exactly the thing I'd expect them to pull the subsidy plug on after a launch teaser.
(Even the open-weight models sometimes do this: for example, OpenCode Zen/Go has a rotating door of free models at any given time that eventually leave the free tier and move into the paid tier once the launch day hype/marketing dies down)
Also, a fun website: https://isaiprofitable.com/ (the numbers are probably made up)
(You may not realize it but simonw is one of the cofounders of Django, Python's web framework. If they find a Python problem difficult, it probably is.)
Web development is not a domain I would consider noteworthy of making a framework given how much development there has been in that area.
It's frustrating that superfluous tokens are burning up our quotas:
key insight, crucially this, real engineering deltas, net assessment, definitive picture, acid tests, real limits, sharp boundary, proper patch, real root cause, big progress, actually wrong, path finagling, the catch, root cause pinned, everything passes cleanly.
Though that's also what makes humans so good at solving problems as well, it turns out.
Also, slight tangent: but I do find the "clanker" insult kind of funny. I feel like it counter-intuitively makes the models sound cooler than they are, if anything. I love clankin' shit.
Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.
That may very well be true now. And in fact, this was true of more rudimentary calculations early on in computing history, where humans were definitely more efficient, particularly for more abstract mathematics. But Moore's Law comes at you fast. Even without more efficient compute, it's rather wild how much more efficient models are becoming these days just from algorithmic and training improvements.
So, maybe for now, certainly. Are you confident that will be the case in 5-10 years? And is that really your barometer for success?
>And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces.
That is certainly a limitation for now, but plenty of academic research is being done on how to address that in a more individualized way. That said, the models also have the advantage of synthesizing learnings from user interactivity back into a future release and essentially applying that globally, which is pretty neat.
There's also some cool techniques to sort of bridge the gap today, like compound engineering.
>Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.
But that's the thing: it's becoming pretty clear that the "plagiarism machine" can probably take that same problem in a prompt, having never been trained on my code, and still solve it.
In that case...maybe it doesn't feel great to have someone copy my idea. But that is certainly not plagiarism in the way you mean it. And when you put ideas out into the world, you can't be certain that someone else won't copy and remix it into something new. That's kind of how the world works already, but we're just seeing the barrier to entry decline.
Yes, I am. I am very confident that general purpose digital computers will never be more efficient than human minds in generating moderately complex code.
Why am I so confident? Well, it has been over 10 years since AlphaGo beat top Go player Lee Sedol. AlphaGo was able to beat a world-class Go player by doing several thousand orders of magnitude more computations than Lee Sedol, and it did so by spending several orders of magnitude more energy than the top human Go player. Today, over 10 years later, the top Go machines are able to beat world-class Go players much more easily, but still do so using the exact same strategy of outcomputing the humans with thousands of orders of magnitude more computations, and spending orders of magnitude more energy.
Things did not change in the past 10 years, I see no reason why it should change 10 years from now.
Has it not? Why do you say that?
Also, do we still require a Deep Blue sized supercomputer for chess? :)
But regardless, compute will get to a point where human-level intelligence is close to as efficient as we are. You could argue it already is today, when you factor in the resources that the average person in the west already uses in terms of their overall impact on the planet.
I can just as well describe the future evolution of the internal combustion engine and claim it will get more and more efficient and eventually we will be able to burn oil so efficiently that our personal vehicles can fly through the atmosphere at twice the speed of sound.
There are limitations to digital computers just as there are limitations to internal combustion engines. Our brains are not digital computers. When we use our brains we don’t just do a bunch of linear algebra.
This is a silly comparison. There is a certain quantity of energy stored in oil, so we know what peak efficiency looks like. We don't actually know what amount of energy is required to solve certain problems. We quite literally have models with quite a bit of capability that can run locally on a phone today, right alongside Stockfish, for example.
And this is to say nothing of work happening now on new hardware approaches, such as Normal Computing's work on thermodynamic matrix math: https://www.normalcomputing.com/blog/a-first-demonstration-o...
That said, this feels like a strange tangent: I'm not sure it's that important that the models be as energy efficient as a human brain. We don't avoid cars because they're less energy efficient than our legs. ;)
This matters because, unlike cars, LLMs are only doing stuff we can already do using our brains, just several orders of magnitude less efficiently. Cars can at least take us distances we would never be able to cover using our muscles. In comparison, if I need to compile CPython into a WASM binary I can simply download a library that does it, or copy paste code in a few seconds, for a million billionth of the energy it takes an LLM to do the same. Except when I download the library or copy-paste the code I (hopefully) attribute the original author and give them credit for their work.
OK then - do it, faster.
> You can take comfort in the fact that a few months later some[...] developer can [solve] the same problem [using your work]
Isn't that what collaboration and sharing software is supposed to be all about?
On the other hand: "Stop trying to make 'clanker' happen! It's not going to happen!"
"AI slop" caught on but "clanker" did not.
It caught on, sure, but not exactly in the way I expected. The wild popularity of "slop" as a term for AI eventually gave way to the genericization of the word "slop" to mean "content of low quality, regardless of source", and is seemingly being used as just a derogatory term for anything that people dislike (particularly by folks in left leaning communities). For example, I've seen people refer to (clearly human written) commentary from some political commentators as "slop".
Your comment kind of reinforces the idea by the fact that you now have to say "AI slop" specifically to disambiguate it. It's kind of a fascinating little turn.
We're a society built by thought and good-will engagement. We won't get out of our "rules for thee" with less thought and less good-will engagement.
So Fable would cost me 20k/mo at Enterprise rates. That’s around the average cost of a loaded SWE in the USA. “But I’m >2x more productive” doesn’t justify doubling the opex of the Software/IT department for most companies when revenue isn’t even up 10%.
I switched to DeepSeek v4 Pro with OpenCode and am on track for a few hundred dollars of spend this month.
Rewriting your stack from Ruby to Go in 2 days where it would’ve taken 6 months is impressive and fun. But that isn’t upping revenue.
Iterating on net new business features and ideas that are niche that the LLM isn’t trained for are much harder. Is 20x the token cost worth it there?
So this pricing is just completely outside of our economics and nobody I know would pay that, no company will justify spending $20k/month when they can hire 10 more developers instead.
It is a very interesting unfolding of events. Can't wrap my head around it completely.
* Average software dev salary in Q1 2026: 4945€ / month [1]
* Total cost for the employer: 6616.41€ [2]
For $20k/month, you'd get 2 x full time mid-level developers + 1x junior dev or QA.
So the calculation becomes: which option can produce better results for your specific use-case, "you + Fable" or "you + 2x mid-level developers + 1x QA". (and from personal experience, mid-level in Estonia = senior dev in the US, in terms of skillset and experience.. but YMMV)
(Of course that's simplified. Your full time devs need _some_ level of AI subscription as well + hardware so add a couple of hundred to their salary per month etc so you might only be able to afford 2x mid level devs, instead of 2.5)
[1]: https://palgad.stat.ee/en
[2]: https://www.palgakalkulaator.ee/en
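The headcount comparison above boils down to one division. Here is a minimal sketch of it; the employer cost and the $20k budget come from the comments, while the USD-to-EUR rates and the function name `affordable_devs` are assumptions picked only for illustration.

```python
# Rough headcount comparison: how many fully loaded developers fit in a
# monthly AI budget? The employer cost comes from the comment above; the
# USD->EUR rates are ASSUMPTIONS chosen purely for illustration.

EMPLOYER_COST_EUR = 6616.41   # total monthly employer cost quoted above
BUDGET_USD = 20_000           # hypothetical monthly Fable spend from the thread


def affordable_devs(budget_usd: float, eur_per_usd: float,
                    cost_per_dev_eur: float = EMPLOYER_COST_EUR) -> float:
    """Return how many developers the budget covers at the given rate."""
    return budget_usd * eur_per_usd / cost_per_dev_eur


if __name__ == "__main__":
    for rate in (0.85, 0.90, 1.00):  # assumed exchange rates, not sourced
        print(f"{rate:.2f} EUR/USD -> {affordable_devs(BUDGET_USD, rate):.2f} devs")
```

The result moves between roughly 2.5 and 3 developers depending on the rate, which is why the estimate above is given as a range rather than a single number.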
(Our team spend on AI devtools comes out to around $1500/person/mo)
This is a good start, but the calculation doesn't include office space and overhead (for every 100 developers there are maybe 5-10 support staff to cover the additional legal / administrative work, and don't forget the extra cost in supervisor time to manage them)
one big enough to license the model and self host on existing infra.
The first calculator I found says a 50 kSEK salary costs the employer 69 kSEK. So far from double nowadays.
I understand pension contributions, but what are the other "hidden" costs that could equal the net salary?
The employer pays £6k for National Insurance (atop the employee's NI contributions). Pension: 2-3k. Apprenticeship levy is £300. 3yr-amortised recruitment fee is £4000. Hardware costs: £1000. Office space £5000. Software/tools: £2500. Benefits: £1500. Training: £1000. Other admin overheads £500.
You pay that person for ~250 working-days, but they only attend for ~220, due to annual leave and sick pay, so you get around £62k worth of attendance out of that person in exchange for £70k, of which the employee sees £35k.
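The figures above can be checked with a short script. It is only a sketch: the amounts are the ones listed in the comment, the pension is taken at the midpoint of the quoted 2-3k range (an assumption), and the names `OVERHEADS_GBP` and `attended_value` are made up here.

```python
# Sum the per-year overheads listed above and scale the total cost by
# attendance. Figures are the commenter's; the pension is given as a
# 2-3k range, so the midpoint used here is an ASSUMPTION.

OVERHEADS_GBP = {
    "employer_national_insurance": 6000,
    "pension": 2500,               # assumed midpoint of the quoted 2-3k
    "apprenticeship_levy": 300,
    "recruitment_amortised": 4000,
    "hardware": 1000,
    "office_space": 5000,
    "software_tools": 2500,
    "benefits": 1500,
    "training": 1000,
    "admin": 500,
}


def attended_value(total_cost: float, paid_days: int = 250,
                   attended_days: int = 220) -> float:
    """Portion of the yearly cost that corresponds to days actually worked."""
    return total_cost * attended_days / paid_days


if __name__ == "__main__":
    print("overheads per year:", sum(OVERHEADS_GBP.values()))
    print("attended value of 70k:", attended_value(70_000))
```

With 220 attended days out of 250 paid, a £70k total cost corresponds to £61,600 of attendance, which matches the "around £62k" quoted above.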
So your "£50k" salary actually costs your employer £56,750, and that's before all the other expenses mentioned elsewhere in this thread such as hardware, office rent etc.
This is not visible on your payslip, i.e. if you earn 5k€ gross, the employer has to pay these shares on top of that.
Our top user is at 10k a month, but the next highest is $2,000.
I would say the average is around $1,000-$1,500 for a developer.
We have completely unrestricted access to Claude, Codex, and Cursor.
Funny enough, the guy spending 10k is not even a dev by trade but an SME in what we work on that just vibe codes apps and somehow has not been cut off yet lol.
I have a single thread of GPT 5.5 medium running basically all work hours and I am around $1,500 a month in spend on Enterprise pricing.
I’ve heard of a few cases of devs racking up bills fast, but it has typically been due to inefficient context usage. Like they just have one super long session with Opus 1M and are getting killed with input token costs and cache misses.
With careful context management and some thought into good approaches to problems, I have personally only rarely even hit $1k in regular use.
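To see why one very long session can dominate the bill, here is a token-count sketch. It assumes the worst case where every turn re-sends the full conversation with no cache hit; the turn counts, token sizes, and function names are hypothetical, and real billing depends on the provider's caching rules.

```python
# Why one very long session gets expensive: without a cache hit, every
# turn re-sends the whole conversation as input. All numbers below are
# HYPOTHETICAL; real billing depends on the provider's caching rules.

def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed if each turn re-sends all previous turns."""
    # Turn k sends k * tokens_per_turn tokens, so the sum is quadratic.
    return tokens_per_turn * turns * (turns + 1) // 2


def split_sessions(turns: int, tokens_per_turn: int, sessions: int) -> int:
    """Same work split evenly across fresh sessions (context reset each time)."""
    per_session, extra = divmod(turns, sessions)
    sizes = [per_session + (1 if i < extra else 0) for i in range(sessions)]
    return sum(total_input_tokens(n, tokens_per_turn) for n in sizes)


if __name__ == "__main__":
    print(total_input_tokens(200, 2_000))      # one long session
    print(split_sessions(200, 2_000, 10))      # ten shorter sessions
```

Under these assumptions, splitting 200 turns into ten fresh sessions cuts the re-sent input by roughly a factor of ten. The sketch ignores the cost of re-establishing context at the start of each new session, so the real saving is smaller.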
I'm guessing he's producing pretty valuable work. We have a few SMEs that vibe code tons of stuff with Claude. The only thing they really need tech for anymore is deployment and helping get their wheels unstuck on occasion.
Multiply this times many, many companies, and you can see how providing AI could theoretically be a good business to be in. Margins may be tight, though.
Also -- I'm convinced someone will figure out more use cases beyond software programming, which will result in many more companies spending $1k+ per employee per month.
It remains to be seen how much of this is a bubble.
All of the above, of course, depends upon Fable consistently being a 2x-3x SWE at minimum.
It imitates applying knowledge. The imitation may be uncanny, but assigning LLMs intentionality and ToM is a category error.
By analogy, consider that many have referred to classical, deterministic computing as some kind of "thinking" for the last half century+. Does this stop being kosher when the computer has an uncanny propensity for human language? Perhaps, but the computer is still clearly chewing through problems that would have required a lot of human thinking (e.g., arithmetic) in ages past.
I haven't seen any genuine proposals for words to replace the human mind analogues, let alone proposals that the anglosphere would plausibly adopt en masse.
This is not correct. LLMs interpolate in a high dimensional space, so you're actually composing the best matches in a compressed corpus to find novel points/paths in that space. That is problem solving.
Depends entirely on the domain. If you're selling enterprise software, this kind of stuff barely matters for sales.
It can reduce operational costs which is good but there's a limit to how much that's worth.
This means if the DeepSeek / under-$1k alternative is at least a x1.2 improvement, Fable needs to be x24, which I think is very, very unreasonable. It could be worth it if it can x2 a $20k SWE, though I doubt it can do that.
LLMs are incredible, don’t get me wrong, but they are good on tiny contexts (writing a script). Not on software engineering (adding features to Chrome).
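The x1.2 versus x24 comparison above reads as an equal-gain-per-dollar argument. A minimal sketch of that reading follows; the interpretation and the name `required_gain` are assumptions, not something the comment spells out.

```python
# Break-even sketch for the comparison above: if a cheap option costs
# `cheap_cost` and multiplies output by `cheap_gain`, how large a gain does
# the expensive option need to match it per dollar? Purely illustrative.

def required_gain(cheap_gain: float, cheap_cost: float,
                  expensive_cost: float) -> float:
    """Gain the expensive option needs for equal gain-per-dollar."""
    return cheap_gain * expensive_cost / cheap_cost


if __name__ == "__main__":
    # $1k/month alternative at 1.2x versus a $20k/month model.
    print(required_gain(1.2, 1_000, 20_000))
```

A 20x cost ratio multiplies the required gain by 20, so a x1.2 option at $1k sets the bar at x24 for a $20k one.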
Claude keeps telling me this when I argue with it. LMAO.
No it doesn’t and will not be. Companies have not realised the cost yet, wait till the end of the financial year and you’ll see a different direction.
DeepSeek v4 is pretty decent, and probably on par with sonnet. I see a future of hybrid models where opus or fable might be used only for complicated features or bugs, but general day to day would be DeepSeek or whatever good models that will be released later.
That's enough to buy a house in my country...
So what keeps your management from just buying everyone individual flat-rate Max subscriptions, or at least buying them for the users responsible for the sky-high token invoices?
I see a lot of comments like this but I don't understand why some people willingly pay so much more than others for the exact same service. What are you getting that I don't get as a $100/mo Max subscriber?
I was about to say that. DeepSeek is just orders of magnitude cheaper and absolutely good enough for most things. Anthropic and co just try to milk the cow while it's possible. If they can't compete with DeepSeek pricing I do not see a bright future for them.
With that said, I still had the Pro plan on Claude, I didn't expect much, but it blew up my 5h allowance on Fable with one simple single prompt, and it didn't even complete lmao
Companies have to pay monthly for the harness app (codex, claude code) and the tokens are priced separately based on standard API pricing.
So Fable is just not usable for $20 plan and barely usable for $100 plan.
https://youtu.be/ngtp3v1_nCI
- They learn the domain of your product, which means long term ownership and knowledge establishes itself. If you've only ever shipped SaaS slop, you might not know, but lots of companies are solving real world problems that have no better solution. Owning and understanding the code and the domain is key.
- They will learn from their mistakes (no LLM does this).
- Human skill is a REAL moat. Once you build a team that fully understands and is skilled in the domain you work in, these people are going to be the thing that sets you apart. If some of them are particularly social or charming, let them sit in with you for meetings and watch them provide loads of value, for no added cost.
- If Claude or OpenAI is down, they will continue thinking. In fact, they will continue thinking even when off the clock! This is a neat little hack called "consciousness" where you get a lot of work for free!
- You can hire people who punch above their weight; not everyone you hire needs to be a 500k/year staff software prime engineer of doom, you can just spend some time and effort to hire good juniors/competent mediors who will think for themselves (gasp!) and get work done.
- You still get ALL THE BENEFITS OF AI!!!! They can use AI just like you can, or better!
- You get people who you can brainstorm with, which is distinctly different from LLMs because your employees are less likely to want to suck you dry in every sentence just to make sure you spend more tokens. Employees don't care if you love them, they care about the quality of their work if you manage them correctly and reward that.
- They are quite loyal if you treat them right; spend a little more on their well-being, and they will stick around, come in to work every day and deliver cool things with you.
- Humans can only manage, review and give tasks to so many agents. If you add more humans, you can handle more agents.
An expensive LLM and a lot of extra tooling gets you some of this, yes, but not all of it. With humans you can still do the expensive LLM and extra tooling if you end up making enough money anyway.
- AI isn't bound by need for rest, vacations, sick days, or labor laws
- AI doesn't bounce from company to company, taking your business knowledge with it (actually this isn't technically true based on the practices of AI companies, but that's not a technical requirement)
- AI doesn't join a union and stop work in demand for higher pay or workers rights
This is what CEOs and capitalists are thinking. For capital, the best outcome is to not have any labor at all. And if you can do that when your competitors can't, then you have a huge market advantage. (Slop notwithstanding)
I'm not saying this is a "good thing" but this is what drives the market. Less labor revenue in the long term and money printing machines.
> Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations
Edit: I asked Claude. It replied:
> Consumer protection / deceptive practices. In the EU this would be a clear UCPD (Unfair Commercial Practices Directive) issue and potentially a DSA violation. In the US, FTC Act §5 prohibits "unfair or deceptive acts." Selling a product that secretly performs worse than advertised for a commercially self-serving reason, without disclosure, is textbook deception. The Samsung/Apple battery throttling cases are instructive here: Apple faced regulatory action across multiple jurisdictions specifically because users weren't told.
> Competition law. This is where "anti-competitive" gets complicated. Refusing to help competitors build competing products via your ToS is generally legal — you can decide who you license to. But covertly sabotaging output quality for a class of users while charging them full price crosses into different territory. Under EU competition law (Article 102 TFEU), if a company with dominant market position uses covert technical means to disadvantage competitors, that's closer to abusive conduct than a legitimate ToS restriction.
This clearly is disclosed, otherwise how did we get to know about it?
The reason they are using this shadow-ban-style technique is that they don't want users to figure out how to jailbreak their way out, or the explicit, direct bad PR when it misfires.
Just need to wait for this thing to be open sourced :)
lol it won't tho...
https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
It doesn't imply we should, for example, publish step-by-step instructions for making widespread death easier.
An example from the meat world: not publishing your vacation dates well in advance for the world to see somewhat reduces your chance of being burglarized. That is security by obscurity; not reliable, but not completely inefficient either.
But if you live in a fortress (security by key material), you can well declare your vacation dates without running the risk.
The reason we are not being attacked is not lack of technology access.
Column A, Column B. Building a small explosive device isn't hard. Building a million is very difficult, doing it covertly virtually impossible without the resources of a nation-state.
The problem with biologics is the self-assembly and replication machinery comes for "free." So the numpties who might otherwise blow up a trash can [1] now have a real chance of taking out a million people.
[1] https://en.wikipedia.org/wiki/2016_New_York_and_New_Jersey_b...
I would also like to hope that people who are likely to do such things probably either:
A) don't know how to break even the most basic guardrails of models, or
B) are already in the glasswings project
To prove point B - Theranos existed.
“Many of the largest and most responsible providers in the industry already screen and record orders voluntarily,” but there is no requirement to do so [1].
[1] https://screendna.org/
Humorously, whether I choose to participate in this hypothetical or not, I am already betting my ass.
This whole situation feels like the game [1].
[1]: https://en.wikipedia.org/wiki/The_Game_(mind_game)
// Claude, make antiviral nanobots that defend me from 6ft virus. Make no mistakes.
All of this “guardrails” handwringing is nonsense. These things output text. Are you for censorship of a book written by a biotechnology expert that gives out the exact same information?
Security in the form of "pay to play" is just kicking the bigger issue down the road.
I don't think there's an ideal solution here, but giving trusted people access to fix security issues before giving it to the wider public seems like a reasonable compromise. They're letting you use the model for all other uses.
sure, a malevolent state actor could swing it, but they could make a bioweapon without Mythos's help already.
also, vaccine production and disease surveillance have ramped up very quickly. they will ramp up further, despite political setbacks. it's a cat and mouse game that favors the defenders IMO.
but the bioterrorism narrative is useful FUD to spin open-weight models as existentially dangerous. I am far more worried about Anthropic's own goals than the goals of some crackpot in a shed.
How so? I'm actually against most of the "safety-tuning" that anthropic does, but this seems fundamentally untrue, a close analogue being video game cheat development. I think in general the cheat developer has an advantage and the cheats generally proliferate for quite a while before being patched.
Finance and biology do come across as two similar high level systems. But while we can employ KYC, fraud detection, and various auditing techniques to finance, I don’t know what you do for biology. You can easily run an algorithm over every transaction a person makes in their account but there’s no equivalent for every cell, every bacteria strain, every virus in the human body.
the adaptive immune system effectively does KYC by checking the antigens presented on the surfaces of cells. the thymus selects for T-cells (iirc?) which don't react to a corpus of the body's own antigens, but cover a wide library of everything else. when it sees something it doesn't recognize, it reproduces, warns the rest of the immune system and marks targets. that's why our immune systems can eventually conquer almost every pathogen we encounter, if we can survive long enough for it to do its work.
but the KYC I was referring to was KYC that vendors of oligonucleotides (should) be doing, to keep people from ordering nefarious sequences.
also, afaik the most effective way of developing pathogens is through serial passage through humanized mice or something like that - directed evolution at a small scale, selecting for traits. AI simply isn't needed for that. I don't think information or intelligence has been the bottleneck for bioterrorism, it's motivation and resources - same as for any other kind of biology research program.
At the scale of API requests that Anthropic sees, I think the affected organization count might be substantial, and they might not be getting the full model capability that they're paying top $$$ for.
Also, wonder how they arrived at that estimation.
1/30,000 * 100 = .003
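As a quick check of the arithmetic, the two helpers below convert between "1 in N" and a percentage. The helper names are made up for this sketch. Note that 1 in 30,000 is about 0.0033%, while the ~0.03% quoted from the announcement is about 1 in 3,333.

```python
# Convert between "1 in N" and a percentage, to compare the figure above
# with the ~0.03% quoted from the announcement.

def one_in_n_to_percent(n: float) -> float:
    """Percentage corresponding to a 1-in-n rate."""
    return 100.0 / n


def percent_to_one_in_n(percent: float) -> float:
    """The n in '1 in n' corresponding to a percentage."""
    return 100.0 / percent


if __name__ == "__main__":
    print(one_in_n_to_percent(30_000))   # percent for 1 in 30,000
    print(percent_to_one_in_n(0.03))     # "1 in N" for 0.03%
```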
I don't think even they can believe it themselves; in reality they are just trying to spread fear, uncertainty and doubt about potentially cheaper offerings.
Not what that means.
Crocodile tears "is a colloquial term used to describe a false, insincere display of emotion" [1]. Defending yourself against an attack vector you just exploited is between savvy and hypocritical.
[1] https://en.wikipedia.org/wiki/Crocodile_tears
The fun part is that you will never know if your neural net classification project is getting silently sabotaged because their classifier doesn't work!
With this in mind, I don't want the model to be proactively instructed and encouraged to sabotage without telling me.
Anthropic is doing a better job with their model menu: most people I talk to know immediately that Opus > Sonnet > Haiku but can't tell you what the rank order of OpenAI models is, when to use them, etc.
Cool, good to know I can trust Anthropic.
I built it because I wanted cursor on my phone because I have two small kids and don’t want to be chained to my desk. And it’s awesome. It’s a full ide with agent chat, terminal and file system running in a remote Linux container. I can review diffs, fully manage git and preview/serve apps. And no one can ever take it away from me :)
I am watching the way things are progressing with the AI API vendors and it feels really clear that depending on them will soon be dangerous. So I am furiously building as much of my own infrastructure as I can to capture some autonomy with these capabilities.
So I think everyone should build a harness.
They went from selling shovels to all gold prospectors to stealing the information about the location of the gold so they could dig it out first.
We are all stupid enough to keep buying shovels from them because we think their shovels dig gold better and faster.
Am I to understand that this is essentially their form of social-platform ghosting instead of banning?
So they're not even going to tell you that the question you're asking is against their rules, they're just going to twist up your question and/or the answer somehow such that you waste your time essentially?
It seems like I ran into this EXACT same functionality from Claude many months ago when I was trying to ask it to research on the web and help me set up the ideal llama.cpp config for local LLM inference.
Funny how lost it got through that relatively simple install when we had all of the documentation in the world (and a human dev with 20+ years experience guiding it along) to go by... and simultaneously it's debugging and building high level cryptography code in rust in the other terminal tab.
This is infuriating to learn.
This experience has made me feel like we have to create a community that moves AI from the mainframe era to the PC era quickly, or we will end up serfs.
I tried the same prompt on gemma4 and qwen 3.5 and Gemma consistently failed to call the multi line edit tool.
But it gets stuck in tool call loops, it seems like.
this is LLM, it's not like a science or something.
and they want me to pay $100+ a month to be their training?
i hope we can find morality again.
Come on guys, why can't everyone just be there for the good guy?
First, they want government to get involved and regulate frontier model development - even stop it completely.
Second, poisoning output of a model configured on the computers of millions of users goes way beyond protecting IP. That's malware.
• My most noticeable immediate jump was in how its frontend design was much more intentionally crafted, and delightful without feeling like 'AI vibe coded'; with better end-user usability too.
• In some internal agentic harnesses, it achieved better results with about half the tokens, making it cost the ~same as Opus 4.8 price-wise! The real price increase is less than 2x; with biggest differences in harder problems where Opus 4.8 struggles (or needs many turns).
• Part of the token efficiency improvements come from Fable doing more targeted and surgical diffs, with fewer unnecessary changes. This is great, because PRs often have fewer LoC changes for review. It writes more maintainable code without explicit human steering.
• For general conversation and assistant style use cases, didn’t really notice a difference vs 4.8.
• 1M context window, without increased pricing for long context is AWESOME. This is a massive win.
• The classifiers are super aggressive and sensitive and this does happen for very benign, non-security coding tasks. Fallbacks to 4.8 worked like a charm; but the filters are definitely super sensitive.
Overall, I would describe this as a step change and worthy of the "Claude 5" model name. It did take some time to understand the intelligence ceiling of this model; and even with an extended testing window I'm still discovering new things and often surprised (in a good way) by the model.
In Codex, GPT‑5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window.
You can use Pro on the web if you’re on the Pro plan but not in Codex
It felt, at least for me, like an impressive step up. Opus 4.8 was already very thorough; but sadly verbose and ‘loopy’ when you push back on its plans. Fable is what I’d use all day if I could afford it!
I wonder how much of design capability improvements is related to our collective ability to recognize AI design tropes.
This is still not in the range of shippable UI for top end companies. Maybe for internal tools and enterprise.
At our company we limit it to prototypes at most and even find it limited there.
Look, I don't want to argue about something dumb like that, but you can give it basic instructions of what the UI should look like, how to group things, and an example image from a designer, and it will nail the result. If you don't think that's incredible, that's fine. I do.
Opus 4.7 made this a practical approach. 4.8 improved it. Fable 5 has improved it more.
Given the shit we've seen shipped by "top end companies" (all the way to Apple) I seriously doubt that. I'd say you're nitpicking from an artistic point of view or something.
so this is why claude talks like this, i was wondering where it was getting this verbal tic from.
I assume it might be a good barometer for generalised intelligence; esp in the visual space.
This seems like the pharmaceutical method of get them hooked on the drug with free samples, then once they can't live without it, raise the price. I'm not sure I want to start using Claude Fable on a max plan if it's just going to go away on June 23rd.
But maybe the more charitable reading is that they didn't have to offer this model at all on those plans and they are giving the standard free trial.
API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited
Limited "free" time is what game developers do if they want to stress test the infrastructure code until it breaks.
For the stuff I've thrown at it, that configuration has done a really great job. Including 70+KLOC go proxy with extensive test suite, some retro games, and more.
IMO the data from chats alone is worth $200B to Google.
Fable has literally refused to work on any of my problems (even those about fluid dynamics!) and just tells me that I'm violating anthropic's AUP. I've reached out to their support and don't expect to hear anything sensible back. One thing I do look forward to though is OpenAI offering an equivalent model but with less safeguards...
I dearly wish you could leverage the latest models to enhance your research.
A bit more context if you care: it's a meso-scale, physiological simulation environment of "particles" that carry nuclear spin, can move in 3D space, and (should they interact with each other or their environment) undergo chemical kinetics. The idea is to simulate molecules within e.g. organs or blood vessels within a person in an MRI scanner, with the motion of the particles dominated by the Navier Stokes equations, but here solved in a Lagrangian (rather than Eulerian) framework by smoothed particle hydrodynamics.
The fact that particles carry nuclear spin means that we can solve the (semiclassical) Bloch equations and, by using a Python plugin module, import exactly what the physical MRI scanner would do (in pulseq format) and predict what signal the machine would record – e.g. there's a whole world of cardiac or neurological flow imaging work done in the context of nasty diseases like stroke or myocardial infarction – which has a bunch of physical artefacts behind it. I'm trying to make a simulation framework that can take in realistic patient geometries and act as a 'data generating process' because if we do it right the various physical artefacts that the machine records are reproduced, surprisingly accurately. Of course you also know the ground truth of where the particles are. I'm specifically interested in a weird technique (which I did my PhD in and you can read an article all about here: [0]) called dynamic nuclear polarisation, where specific spin states of molecules such as [1-13C]pyruvate are injected essentially out of thermodynamic equilibrium and act as short-lived tracers of metabolism – again highly altered in disease. The signal we record is a strong function of the physics of what you told the machine to do, the spatial constraints and environment of the patient's body, and the chemical kinetics of the patient's biochemistry (the latter two are usually what we're interested in).
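For reference, the Bloch equations mentioned above are usually written in the following textbook form. This is only a sketch: the simulator described here may well use a rotating-frame or discretised variant, and the symbols are the conventional ones rather than anything taken from that codebase.

```latex
\frac{d\mathbf{M}}{dt}
  = \gamma\,\mathbf{M}\times\mathbf{B}
  - \frac{M_x\,\hat{\mathbf{x}} + M_y\,\hat{\mathbf{y}}}{T_2}
  - \frac{(M_z - M_0)\,\hat{\mathbf{z}}}{T_1}
```

Here M is the magnetisation carried by a particle, B is the applied field (static field plus the gradients and RF from the pulse sequence), γ is the gyromagnetic ratio, M_0 is the equilibrium magnetisation, and T_1 and T_2 are the longitudinal and transverse relaxation times.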
Getting them to do chemistry as well as act as a "simple" tracer is more involved, because in the Lagrangian framework the number of particles is ≈ the spatial resolution of your simulation. That's fine if you're simulating water, but if you're simulating something that reacts concentration is not scale invariant (if you want to keep the interpretability of the rate constants). I've worked out an analytic set of scaling rules around this and fortunately for my application environments and length scales "it just works", completely by luck.
I've used Claude to port various SPH algorithms and boundary condition handling ideas (which are absolutely critical and highly not obvious – we have leaky walls in some places, and e.g. LCR / circuit theory models of the microcirculation to plug in) and it's been a godsend. But I'm running into its limitations constantly. It both confidently makes shit up, claims it is mathematically justified and when the resulting simulation explodes says "I apologise; I lied above" (!) or "I apologise; I am wrong" and I periodically have to yell at it to try to do something more productive.
The real hope is that this simulation environment would be both generally useful for basically anyone doing flow MRI, and help our basic scientific understanding of what we're measuring (the technique is in many hospitals!) but also be able to produce meaningful synthetic training data for image reconstruction algorithms later on. It'll end up permissively licensed (all of the "starting" codebases have compatible OSS licenses, and we're releasing our contributions similarly).
I really hoped that Fable would be better at this sort of work. Occasionally, relating to my DNP work [1], I have need to talk about proper nuclear physics and I have seen Opus's chat interface write a wall of text (e.g. talking about photonuclear reactions and cross section differences in millibarn) and then just delete it all. Support have told me that yes, I've hit the nuclear filter and, well, tough shit, basically.
I wrote a version of the above to them yesterday, and just got the most boilerplate response that I've yet to test:
which doesn't fill me with hope...
[0] https://physicsworld.com/a/dynamic-nuclear-polarization-how-... [an "accessible" article]
[1] https://www.science.org/doi/pdf/10.1126/sciadv.adz4334
>"it just works", completely by luck What does your validation function look like for this? Whenever stuff "just works" for me I get a little nervous until I determine why.
That's a whole separate long answer. I'm not a qualified doctor (and nor would I claim to be), but after a masters' degree in particle physics I moved into an explicitly interdisciplinary training programme that led to a doctorate and at other places in the country I did it in, a separate MPhil. During that initial year I spent a fair amount of time in the dissection room, learning anatomy, as well as most of the first three years (the foundational, preclinical part) of a medical degree combined into one (which contained lots of molecular biology, frankly). My final doctorate was between the departments of condensed matter physics (nominally my awarding institution), biochemistry, radiation oncology, and "the department of physiology anatomy and genetics", which is basically preclinical medicine. The people I work with are 50/50 recovering engineers or physicists, and qualified clinical medics who are trying to learn things like perturbation theory in their time off…
>"it just works", completely by luck What does your validation function look like for this? Whenever stuff "just works" for me I get a little nervous until I determine why.
Ah. I do know why: the relevant Damköhler numbers [0] are either very small (flow is much quicker than chemistry) or large (chemistry is much quicker than flow). So the approximations I am building in are justified and an awkward middle region is excluded; we also are only interested in small concentrations in a carrier fluid (e.g. blood, lymph) where the presence or absence of the species in question does not change its rheology.
I am lucky because we have evolved this way. If our circulatory system and its approach to metabolism was more similar to e.g. a reacting polymer foam ("can of expanding foam") which completely consumes its reactants as it goes, this implicit Lagrangian approach would likely not work.
[0] https://en.wikipedia.org/wiki/Damk%C3%B6hler_numbers
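For readers who have not met it, the Damköhler number compares a transport timescale with a reaction timescale. A sketch of the usual first-order definition follows; the symbols are chosen here for illustration and are not taken from the simulation code being discussed.

```latex
\mathrm{Da} \;=\; \frac{\tau_{\text{flow}}}{\tau_{\text{chem}}} \;=\; k\,\tau_{\text{flow}}
\qquad \text{(first-order reaction with rate constant } k\text{)}
```

Under this convention Da ≫ 1 means the chemistry runs to completion quickly compared with transport, and Da ≪ 1 means the species is carried along almost unreacted; the awkward regime is Da ≈ 1, where neither limit applies. If the ratio is written the other way up, the labels swap.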
When GPT 4.5 launched, the gains compared to the model size didn't seem that great, leading some to believe that the only progress we'd see would come from RL.
This model certainly has quite a "substantial amount of post-training and fine-tuning", but it's also based on a new pretrain[1][3], which, given the cost, indicates that it is in fact quite a bit larger than Opus 4.X.
[0] One of the early testers mentioned: "As far as I can tell from talking to people internally at Anthropic, there's nothing special about architecturally"[2]
[1] Section 1.1 in https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
[2] https://youtu.be/GrdEid8H6H4?t=168
[3] There were rumors going around when Mythos was first announced that it was the first 10T parameter model, but I can't find a verifiable source for that number.
It turns out that having a text based interface for a text-trained model creates a very nice feedback loop.
Right now as we speak, people are generating text traces on anthropic and OpenAI servers that teach their models to do everything under the sun, text wise.
So people right now getting super mad at how dumb the model is when reverse-engineering a super complex function from binary, when they write “stop, you dumb robot, you are going wrong, go this way thank you very much” are actually leaving a lesson in the form of the "chat" text history.
Some may say that each bad word gets us closer to ASI.
That and obviously the order of magnitude more efficient GPUs we got that allow for different tradeoffs at training time.
A typical session is the agent establishing a metrics and log baseline, creating the code, compiling, deploying, observing, fixing, redeploying, observing metrics, determining the outcome and committing.
I really, really, don't look at the code anymore.
UPDATE:
so my point is: it won't have me stewarding the code anymore, but it will have the infrastructure (and ultimately the real world) providing feedback on the traces.
Maybe we need some form of long-term training. How long does the code that the AI wrote stick around before being rewritten?
I guess we can do this retroactively too if we could somehow tag AI-written lines of code in the VCS, then in a couple years we can check which parts lasted.
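The retroactive check suggested above could be sketched roughly like this, assuming AI-written commits are identifiable somehow (for example via a commit-message trailer, which is a hypothetical convention a team would have to adopt) and that the repository uses 40-character SHA-1 ids. The function names are illustrative. It counts how many lines in `git blame --line-porcelain` output are still attributed to the tagged commits.

```python
from collections import Counter

HEX_DIGITS = set("0123456789abcdef")


def surviving_lines(porcelain: str, tagged_commits: set[str]) -> Counter:
    """Count blamed lines per tagged commit in `git blame --line-porcelain` output.

    Assumes 40-character SHA-1 commit ids.
    """
    counts: Counter = Counter()
    for line in porcelain.split("\n"):
        fields = line.split(" ")
        # Each blamed line starts with "<sha> <orig-line> <final-line> [<group-size>]".
        # Metadata lines start with a keyword (author, previous, ...) and file
        # content lines start with a tab, so neither matches this test.
        is_header = (
            len(fields) >= 3
            and len(fields[0]) == 40
            and set(fields[0]) <= HEX_DIGITS
        )
        if is_header and fields[0] in tagged_commits:
            counts[fields[0]] += 1
    return counts


def survival_rate(surviving: int, originally_added: int) -> float:
    """Fraction of originally added lines that blame still attributes to the commit."""
    return surviving / originally_added if originally_added else 0.0
```

The porcelain text would come from running `git blame --line-porcelain HEAD -- <file>` a couple of years later; the "originally added" count per commit would have to be recorded separately, e.g. from the diff stats at merge time.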
sorry. how do you know. i am so curious about where exactly gains are coming from but so hard to even get a little bit of insight.
i wish govt would fund these labs and make it free and opensource. way better investment than stupid overseas wars.
It would be impossible for the govt to allocate this much capital towards such a moonshot, and even if they could, they would do it in a way that would get 90% frittered away to fraud and waste
You have a false definition of "impossible." It would be true to say it could be challenging, given current political dysfunction, but it's not impossible.
> ...and even if they could, they would do it in a way that would get 90% frittered away to fraud and waste
Same with private business.
I'd prefer government funding, because there are a greater number of important goals than the two or three the market is capable of optimizing for.
https://www.whitehouse.gov/presidential-actions/2025/11/laun...
Doctrine and propaganda can make someone that sure, and the thing they're sure about doesn't even have to be true.
> There's been massively successful government funded and run projects before. Soviets beat the Americans to space, after all.
Don't let facts get in the way of ideology!
Also the Americans subsequently beating the Soviets to the moon was the government literally allocating huge amounts of capital towards the literal trope-namer moonshot.
Apart from all the above: the fact that they are intentionally writing this (that they degrade frontier LLM dev, silently vs loudly for biology/cybersecurity) in the system card is interesting to say the least - especially just before IPO.
Notice that with this statement - that they're going to intentionally hobble the model for frontier LLM development - the general discussion has moved from, “Is the model actually that good?” to "they’re pulling the ladder up from behind them"
That's actually super smart - wonder if Mythos (or the next unreleased model) had a say in coming up with that strategy (if it's intentional). Also - having access to extremely capable models before anyone else - which they have by default - is an incredibly advantageous position to be in.
There's a quote from a METR report on page 52:
>We ran [Mythos 5] on 38 of our hardest software tasks, including tasks centered around R&D. [Mythos5] generally outperformed an early checkpoint of Claude Mythos Preview in these, including by succeeding on some tasks that had not been solved by any public model we have previously evaluated. However, we still observed the model occasionally failing to correctly interpret nuanced instructions in difficult tasks... Based on the available evidence, we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks. We believe that a better, more confident assessment would require more time, evaluations, and information from the model developer.
this is good news, right? right...?
- Opus 4.7 xhigh: 5.2%
- Opus 4.8 xhigh: 13.4%
- Fable 5 xhigh: 29.3%
Seems like a huge jump.
[1] https://cognition.ai/blog/frontier-code
1. That estimate could easily be wrong.
2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.
3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.
prior bms relied mostly on unit tests or synthetic judges which are easily benchmaxxed, which leads to nobody trusting benchmarks
we need people manually checking the data for good code quality
this benchmark looks very good from the methodology. a cog researcher checking the data themselves is very high signal (not scaleable so don't take the benchmark as gospel, but directionally good)
TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models have hill-climbing to do which raises the bar and inspires confidence. I'd say it's legit.
[1]: https://x.com/cognition/status/2064061031912288715
Nobody would have 800+ billion reasons to lie by commission or omission here.
they aren't married to a particular lab, most of their usage is their in house model i believe
I think it's safe to assume everything AI related is heavily biased until proven otherwise. Just like in pharma.
EDIT: Oh I see, this is the best link for pricing https://platform.claude.com/docs/en/about-claude/pricing
So the price is double across the board...
From their pricing page, Opus 4.8 costs $5 per million input tokens and $25 per million output tokens [1].
[1] https://platform.claude.com/docs/en/about-claude/models/over...
I would have expected Mythos to be much more expensive than just 2x current Opus (which is clearly cheaper to run than original Opus)
Input Price $10/M tokens
Output Price $50/M tokens
Cache Read $1/M tokens
Cache Write $12.50/M tokens
2x Claude Opus 4.8, same as Claude Opus 4.8 (Fast)
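To make the "2x list price, but roughly half the tokens" claim from earlier in the thread concrete, here is a small sketch using only the input and output prices quoted above. The token counts are made up for illustration, and cache pricing is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Prices as quoted in this thread.
OPUS_4_8 = Price(input_per_mtok=5.0, output_per_mtok=25.0)
FABLE_5 = Price(input_per_mtok=10.0, output_per_mtok=50.0)


def cost_usd(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost of one task in USD, ignoring caching."""
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


# Hypothetical task: Opus needs 2M input / 200k output tokens,
# Fable finishes the same task with half of each.
opus_cost = cost_usd(OPUS_4_8, 2_000_000, 200_000)
fable_half = cost_usd(FABLE_5, 1_000_000, 100_000)
fable_same = cost_usd(FABLE_5, 2_000_000, 200_000)
print(opus_cost, fable_half, fable_same)
```

With these made-up numbers the two models cost the same per task when Fable uses half the tokens, and Fable costs exactly double when it uses the same number; real workloads will land somewhere in between.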
Frankly, not even Opus 4.8 would be enough of an incentive to use at that price range (enterprise-wise; would not even bat an eye as a consumer)
But - these $3k-$5k/month/engineer bills are going to start to get attention soon - only question is whether the response is to slow down on the $$$ spending or reduce the # of engineers.
whats the logic in claiming its a borked metric when everything listed is an anthropic model.
This kind of storytelling annoys me. Give us more facts, less narrative drama.
What matters is scale. Did it deploy a novel zero-day exploit to overcome a problem? That's alarming. Did it kill a disruptive process? Pretty normal troubleshooting step.
Some people seem to think that simply uttering these ideas on the Internet is harmful (in the "don't give it ideas!" way); but the MIRI types were expressing them pre-ChatGPT in an attempt to warn people, so there was really never any chance of keeping it out of the training data.
But it's also worth considering here just how awful AI security postures have been. The MIRI types used to speculate about how difficult it would be for AIs to social-engineer users into granting them irresponsible levels of agency. It turns out that they don't even have to try.
Okay, I'm going to start running a Bitcoin miner on your machine, and then use it to buy time on Digital Ocean.
I've written out my CLAUDE.md, and I'll use SSH to transfer my context to that other machine.
They are the only one crying out loud about how dangerous their models are and are presumably also training their models heavily to be "safe". And through that training itself, the model learns about the other side - how are you going to teach a model to be safe, without teaching it what's not safe?
Kung Fu Panda opening scene anyone? One often meets his fate on the path that he takes to avoid it - Master Oogway.
Very interesting. I am not sure this will comply with organizational policies and standards protocols (HIPAA etc.)
Almost… basically they have unlimited power to decide what data is kept?
You can’t tell a judge who’s ordered you to retain something that you can’t because you said you wouldn’t.
Enterprise plans allow admins to set which models are allowed.
This feels more like working with a competent peer than ever. I won't use it once it's API-only, though. I don't mind guiding Opus as required and staying closer to the code. I can tell that Fable would lead to a lot more 'set and forget' programming which I'm still not fully comfortable with.
Regardless, this is cool. It's very fun to use. It was able to find legitimate issues with my work this week and we've made meaningful improvements. Opus can do this, but typically in much narrower contexts, and often with hallucinations or partial-errors. It needs to walk many things back or revise plans. So far that's not the case at all with Fable.
edit: I just realized I had Opus review the same work already. It missed everything Fable caught today. And it's actually worthwhile stuff to address. It's hard to say no to a model which demonstrably makes your code better, but... Those API prices will be brutal. Maybe a review here and there, I guess.
> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more: https://support.claude.com/en/articles/15363606
Seems like GPU drivers are cyber weapons of math destruction now.
Seriously, GPUs are a mess and keeping LLMs from helping us use them properly is practically a crime.
They kind of are, at least in the AI race.
> weapons of math destruction
lol. great, whether intentional or not.
The frontier labs now have every reason to hold back and sell only to their preferred trading partners. I don't really like the new arbiter-of-knowledge system we're barrelling toward.
● Bash(/tmp/run_ps.sh '& C:\rhombiq\d3d-probe.exe 2>&1 | Select-Object -First 4 | ForEach-Object { [Console]::Out.Write("$_`n") }' 2>/dev/null)
⎿ Adapter[0]: Qubes virtio-gpu WDDM 3D (dev) VendorId=0x1af4 DeviceId=0x1050 VRAM=8192MB
  Adapter[1]: Microsoft Basic Render Driver VendorId=0x1414 DeviceId=0x008c VRAM=0MB
  Adapter[2]: Microsoft Basic Render Driver VendorId=0x1414 DeviceId=0x008c VRAM=0MB
● Please run /login · API Error: 403 The socket connection was closed unexpectedly. For more information, pass `verbose: true` in the second argument to fetch()
Brewed for 8m 35s
Continue please
● Your organization has disabled Claude subscription access for Claude Code · Use an Anthropic API key instead, or ask your admin to enable access
Seems like they locked my account.
* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.
* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.
* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.
The "offer, then remove" aspect is a bit eyebrow-raising -- it feels like they are trying to get subscribers to switch to usage-based billing, which makes me wonder if we'll ever get it after that June 22nd window.
If they didn't announce it, you guys would be complaining about slowed progress.
If they didn't release it, you guys would be complaining about fake promises and marketing.
If they released it without limits, the complaints would be about slow responses and outages.
If they didn't add it to subscription plans, the complaints would be about phasing out subscriptions.
If they added to subscriptions with cost reflecting their resource availability, the complaints would be about how quickly it eats limits.
So they choose the middle ground of providing some initial access and assessing if they can satisfy demand, only to still be ignored and accused of trying to get users hooked?
We've already seen that they don't have enough compute, thus the deals with SpaceX for their GPUs. It's very reasonable that they just don't have the capacity to support the subscription userbase on this model.
I can recognize so much of the GPT/Codex generated code long after it gets merged (not by me).
Additionally, the time spent on every agent turn on GPT 5.5 is much longer compared to Claude Opus 4.8, which means iterating on the code takes a lot more patience, and there are a lot more nitpicks to pick when actually using GPT 5.5 to do software engineering.
Feels like GPT-style models are more geared toward one-shot software vibing (and handling the vibe-coded mixture) compared to Claude's focus on actual software maintenance. I got a GPT Pro sub for free and wanted to cancel my Claude subscription so much, but I still keep reaching for Claude models a lot more. Frustrating.
this is the line I keep in Agents.md that helps me prevent Codex from playing smart
https://arxiv.org/abs/2602.10144
When a "person" that you don't view as a "real" person repeatedly does exactly what you just told it not to do (often amid false assurances it understands and will avoid doing so in the future), most people get angry.
Compare it to how the kind of people who treat children like property treat their kids, or other examples of keeping people as property.
who, or rather what, is being abused here exactly ?
You should see the abuse my motorbike gets. Poor thing.
We were reviewing reports of situations where the models failed to follow directions and there was a common thread of some where when the operator got the model to acknowledge the rule breach, it quoted back something that included swearing.
I don’t have the data to truly look into it, but I did give the instruction to my engineers to avoid it as a “might be a problem”.
But I avoid unnecessary emotion in my prompts because I don't want potentially distracting activations. Kind of like communicating with humans.
> impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
Unless the mechanism is understood, my assumption is that this is a moving target.
https://www.anthropic.com/research/emotion-concepts-function
How so? Plenty of swearing in lots of training data, especially older code, e.g. in Linux.
Bonus points if you find yourself actually saying it out loud while typing it.
I have used the word "shenanigans" way more in a couple of years of agentic coding than in 30 years of writing code with humans.
ai llm are doing what i tell them to.
if you’re building something meaningful (in my case a platform used by many people across many companies) you want to ensure you
1. have actual systems engineering and architecture in mind that you want the models to
2. implement based on what you tell it to do
when i was just telling the models what i want done without doing due diligence it would go and do some moronic implementation that was awful. mid input = mid output
these days i just maintain specifications documents and the AI follows everything i tell it to in that document. so when i tell it to do one thing, the result is made following those architecture specs.
i have code that is single resp, modular, easy to extend and test.
i would ballpark 95% of the time i get what i asked for.
sometimes it tries to be clever in cases that weren’t covered in my arch specs. in those 5% of cases i go and update my specs.
source: used billions of tokens worth to build something actually in production across both mobile platforms and web, deployed on my own cloud infra. i use codex mainly. some claude.
But Claude models seem to be better at long term problems or more ambiguous problems.
I'm curious as to what the primary benefit is here. Are there secret improvements in training? There hasn't been much in fundamental model architecture, I don't think. What about harnesses? I wonder what's pushing the AI. It seems like harnesses have been the main thing pushing AI ever since CoT.
I think the end game is routed model usage and SLMs. I think Apple is going to prove this in the consumer space pretty handily and I'm curious how the Android ecosystem responds since the hardware is considerably lacking in model performance. I think Apple has a huge opportunity here, as much as I don't like their current ecosystem of walled garden. They did position themselves very well with ARM and custom chips for their hardware. Hopefully the broader ecosystem of ARM and Linux are able to make some headway and we see a more formalized, and broadly accepted, architecture to capitalize on.
I’m sure you could put something similar together with a bunch of duct tape and 2 weeks of effort, but it won’t work nearly as nicely nor out of the box. so…what am i missing?
My company has an agreement with the big providers and while I'm pretty sure they think about how to get budget back, it's a competitive advantage and normal people will not learn different model behaviours.
At least for now.
Regardless of what others are doing, US labs here are just rushing to IPO. It's NOT a sign of confidence.
It's the equivalent of saying you have confidence in SpaceX making revenue by renting out their data center (instead of their AI making bank).
On the same note. if spacex is doing datacenters on earth successfully what's wrong with that? They rented cloud infra to a #2 or #3 provider in the world after < 2 years in business. It's a success, no?
If you get hired as a staff engineer and do the work of a junior, what's wrong with that?
Clearly xAI (now part of spaceX) did not raise funds to be a data center. The margins are way different. There are plenty of recent IPOs in that area that are worth at most billions not trillions.
> going to IPO is a sign of confidence , you need to report a lot of things, that private companies don't.
This isn't going to IPO. This is rushing to IPO. It is a sign of confidence that the market or wider environment might crash soon so we need the liquidity now.
> This is an exact reason chinese labs do not rush to go public.
Maybe or maybe not. If you are referring to Chinese labs - both the Hong Kong and China stock market are way weaker than Nasdaq. It's not comparable. Check all the recent Hong Kong IPOs that have tanked.
So no, reason not to might just be: no money in it.
There are huge numbers of users (myself included) that do have an exact idea of what inference costs are - on open models. We can buy tokens from 3rd parties that have no motivation to subsidize our use. That's to say, there's a fair marketplace[1] and we're hanging out there.
If you want to say "I don't think anyone has a firm grasp on actual inference costs on these proprietary/closed models", then I could agree with that.
[1]: https://openrouter.ai/rankings#leaderboard
China subsidizes strategic industries, and they have heavily done so with AI. And DeepSeek specifically has said they have no commercialization plans.
For example: https://www.boc.cn/aboutboc/bi1/202501/t20250123_25254674.ht...
It’s generally established that Anthropic/OpenAI are going for all out performance with big VC dollars at the expense of efficiency and China has geopolitically limited compute and an incentive to compete on value per dollar.
Why not? Hetzner charges WAY less than AWS too. Can you not believe that?
Both. They are charging the most they can get away with and that amount is still heavily subsidized by VC capital.
We know roughly how much these companies spend and what their revenues are. Based on that, they'd have to more than double revenue (without spending more money) just to stay even, and that's not good enough given how deep in the hole they are.
> OpenAI and Anthropic are heavily subsidizing their inference -- no wait, they are charging the most they can get away with before going public. Where is the truth?
Both are true. I mean, I'd be willing to spend a bit more than I do now, but not more than double, and neither are most companies. The company I work for is currently investigating how to reduce LLM spend, not looking to spend more.
Now that 200USD subscription starts to feel cheap...
I haven't gotten close to this either before, but now we wanted to move fast because this branch gets conflicts all the time and we want to get the migration over with asap.
And don't get me wrong. Opus did an absolutely horrible job at first, second and third round in this task. You really needed to steer it to get to the right solution.
And now Fable is out. And its first round of code reviews for this huge PR was definitely worth the money too...
Don't think that I'm just shrugging at that number. I see it every day, and I don't like that it's in the thousands now. But for people paying for the 100 or 200 dollar plans, I'm not super sure you will be able to use them in the future if the token price is in the thousands for a slightly bigger task...
If I were paying this from my own pocket, I'd definitely go with DeepSeek or local models and figure out how to make the best use of them.
IOW, you don't really think the value of this work is really worth $4k.
> why would I pay to do my job?
The question is: how long do you think that you employer will be willing to pay for you and Anthropic, if you yourself said if it were your money you'd put some time and effort to work with an open model?
I wonder what this question really means? Anthropic is useless if you don't know what to do with it. It's very useful if you do, and you can guide it to do the right things. Yes, it will for sure reduce the number of people we need to hire. But we are always looking for hires who know what they're doing and can utilize agents to be faster.
But if you think about how long an employer is willing to pay 10-20k per month per seat for Anthropic? I can't see this being feasible, and it will have to end at some point.
It's worth it, and I can afford it, but I am not really the right type of user for token-based usage. It's all for personal and free work.
Unfortunately, that doesn't work within a single session. The K-V cache of a model is intertwined with the model's configuration. Switching models invalidates the cache, meaning everything up to the point of the switchover is processed like a new, uncached input token.
Per Anthropic's pricing doc, an Opus 4.8 cache hit costs 50¢/MTok, while Haiku costs $1/MTok for uncached input.
Model selection works best if sessions are short and self-contained, particularly if the first few interactions can reliably classify the model need. That probably covers most 'support chatbot' use-cases, but it doesn't describe the kinds of heavy agentic automation that really chews through token budgets.
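A rough sketch of the arithmetic behind that point, using only the two per-million-token prices quoted above. The 200,000-token context size and the function name are hypothetical, and the sketch deliberately ignores output tokens and cache-write charges, so treat it as an illustration rather than a billing model:

```python
# Prices in USD per million tokens, as quoted in the comment above.
OPUS_CACHE_HIT_PER_MTOK = 0.50   # Opus 4.8, cached input
HAIKU_UNCACHED_PER_MTOK = 1.00   # Haiku, uncached input


def reread_cost(context_tokens: int, price_per_mtok: float) -> float:
    """Cost in USD of feeding `context_tokens` of input to a model once."""
    return context_tokens / 1_000_000 * price_per_mtok


# Hypothetical long agentic session; the size is an assumption for illustration.
context = 200_000

# Staying on the big model: the existing context is read from cache.
stay = reread_cost(context, OPUS_CACHE_HIT_PER_MTOK)

# Switching to the small model: the cache is invalid, so the whole
# context is billed as fresh, uncached input.
switch = reread_cost(context, HAIKU_UNCACHED_PER_MTOK)

print(f"stay on Opus (cached):    ${stay:.2f}")
print(f"switch to Haiku (cold):   ${switch:.2f}")
```

Under these assumptions the "cheaper" model costs twice as much for that turn, which is why switching only pays off when the session is short or the remaining work is long enough to amortize the cold read.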
I don't think this is true if you simply quantize the model or run it with fewer active experts? The underlying weights would stay the same. You could also play further tricks with skipping some of the model's middle layers outright, which works surprisingly well due to how skip connections are used.
Most AI companies are just testing the waters with paid tiers right now, their greatest fear with increased pricing is folks reverting back to wikipedia, stack-overflow and other public domain organic activity buzzing back to life; that will kill any RoI potential in LLMs forever. They're playing the wait game instead, observing how the digital sphere reacts to every little increase in price.
If that weren't the case, they'd be pricing at lucrative premiums already, and they'd even have gotten away with it in the short term, considering the increased dependency in the enterprise world. But that'd be like killing the goose for the golden egg too soon and losing all long-term potential.
Once the folks are so addicted to LLMs that even writing a hello world program sounds like a nightmare and coming up with an article draft feels like reinventing Egyptian glyphs, that's when the real pricing hammer will come.
Anthropic wanting to switch billing to API rates is them just wanting to generate more profit.
Even if subscriptions are locally profitable (i.e., the cost of the subscription covers the cost of inference), they're still subsidized because they don't cover training and running the company; otherwise, these companies would be profitable.
Take a look at China for example - they have no access to NVIDIA, so they're trying to build their own hardware, they have no unlimited funding, so they try to optimize things.
And Anthropic is complete opposite of that - if NVIDIA were to triple their prices tomorrow, Anthropic would still pay them.
In the end, either we all somehow go mad and start paying Anthropic tens of thousands of dollars per month to support this madness, or we will go with whoever isn't lighting cash on fire.
Not true. Stop following US media spam if needed.
1. Very recently, the US did close a loophole on sanctions that allowed Chinese companies to use NVIDIA hardware outside of China i.e. before that was closed they all had access. The trick was train outside, do adjustments, ship the disks back and use non-NVIDIA in China, but at least the training and endpoints not hosted in China could all use NVIDIA.
2. There's been plenty of reports including fines and bans e.g. to Supermicro on smuggling NVIDIA hardware to China. I doubt it has been stopped. You can't catch everyone.
Granted, it could still mean that Anthropic just chooses to lose money - but that's Anthropic's choice.
DeepSeek has proven that inference can be much, much cheaper than what Anthropic advertises on their API rates page.
Then the cost is being subsidized by investor capital, but it is still subsidized.
So they are profitable?
I think you are mismatching accounting terms.
You can't say the 'subscriptions' are profitable without accounting for the cost of making the model that is the source of the subscription.
They are heavily subsidized by the shareholders. Investing, running at a loss, with hope of some future profitability.
If saner factory can sell you the same tool at a fraction of the cost of a gold plated factory, your choice is going to be obvious.
Having said that, I found the cloud dev environments slow to the point where I wasn’t sure if it had frozen, so I never looked back.
Though the day is coming when there’s no distinguishing, I’m sure.
Also, is it really a defense department when you're starting wars of aggression every 15 years or so?
Just like how changing Kennedy Center letterhead to Trump Kennedy Center for a year didn't actually legally rename it.
Once a case with sufficient standing got in front of a judge it reverted to the actual legal name on the basis that only Congress can change the statutorily defined name.
For an admin so obsessed with legal names instead of chosen ones, you’d think they’d be less hypocritical.
I'm doing basic web development here utilizing animejs. Nothing too complicated (mostly saving time doing the scaffolding, still write the bulk of animations manually).
Truly believe that American companies are going to get completely curb stomped by China due to greed, ineptitude, and violating the social contract.
Deepseek V4 Flash is surprisingly capable and insanely cheap. It takes so much to get the session cost to $0.01.
I agree with you on pricing, but what do you mean by this?
Why aren't corporations doing more to help workers with childcare? Why aren't they doing more profit sharing with workers? Why aren't they encouraging unions or sectorial bargaining? Why isn't the government mandating any of this?
Americans very rarely benefit when US corporations do well. That needs to change. No one benefits if Meta continues making billions in profit every quarter while society suffers from isolation, depression, suicide, and scams from their services. Americans don't benefit if health insurance companies are making massive profits while they can't afford deductibles.
Our society has been set up to simply extract wealth in all facets of life. That's a sick society and it needs to change.
I'm not saying China does this better, in fact China has some of the worst worker rights out of all the industrialized countries; but at least American consumers would benefit from cheaper higher quality Chinese goods. The world would likely benefit too if America got off the cold war hype train that did nothing to benefit humanity outside of those making weapon systems.
The AI companies sure are a brilliant example of corporations needing to do more to help their employees pay for childcare.
I am on the $100 Max plan.
I do wonder if you switched models mid-session, you would have lost all your cache. Reloading the context into cache can really eat through your usage.
I had it analyze a project I was working on with Opus 4.8, and it blew through 23% of my session limit in one go. Does not portend well for my budget.
> Fable 5's safety measures flagged this message for cybersecurity or biology topics.
> They may flag safe, normal content as well.
> These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them.
Here are the results of the agentic code review session:
This 40 minute session cost me 16% of my weekly usage. A simple code review of the most critical areas of my project got flagged as a cybersecurity risk. It really made me not want to try it again.
Or is it just not allowed to find bugs? Or it's only allowed to tell you bugs that don't pose a security risk?
Seems that way. "Security" was never part of the prompt. It was something like:
> Hello, Fable! Can you give me a complete code review of my lone lisp project? Opus has already done extensive code review. I'm curious to see what you say.
Result was the table above.
They also, FWIW, say that they've instituted new policies on their end such as logging any human access to the stored data and automated deletion after 30 days in "most" cases (with another link to a document detailing that further).
Assuming this isn't just a supply issue on their side, nothing says "ethical AI" like only allowing mega corporations to use it through cost barriers.
How many government sanctioned school bombings does it take for them to quit working with said government? For now we know that number is somewhere between infinity and 1.
The question of collaboration with USG is a much more complex one, but is not the one raised above.
Edit: I'll also add that I doubt any AI-doom people "trust" Anthropic per se. The entire angle of questioning – again – misunderstands the AI-doom argument. You appear to think that if companies behave unethically, they cannot be trusted and they will not produce good outcomes, inversely: if they behave ethically, they can be trusted, and they will produce good outcomes.
Any competent AI-doomer would argue that ethics or trust are essentially irrelevant.
The entire problem is that people can act totally reasonably, even ethically, and this is not a guarantee of good outcomes. Situations can be created in which completely ethical, reasonable behavior actually produces a bad outcome. You do not need to assume people are bad in order to produce a bad outcome, and inversely you cannot assume that you will get a good outcome from good people.
"Arms races" are one class of situations that often have this characteristic. "Bureaucracy" is another class that we encounter a lot in daily life. There's a lot of them!
Talk about a strawman!
Anthropic needs to be at least somewhat in the good graces of a capricious administration that is already under pressure from businesses and citizens to regulate AI companies across multiple different domains, whether it's energy consumption, job displacement, military and defense applications, surveillance, etc.
If Anthropic wants to survive, they need to acquire influence with the government that most impacts them as an American company, and a massive exporter of services in the AI space to other countries, otherwise they could get locked down and locked out of the market for national security reasons.
It sucks, but sometimes the survival choice is to make an ethical compromise in hopes that you can still be around to make better decisions later.
This "simple" fact needs quite a bit of additional context and work. Making grandiose ethical claims like this can be countered with other grandiose claims such as the fact that there is no ethical existence under communism or socialism.
The fact that there is no ethical consumption under capitalism is not material to whether or not ethical existence is possible under communism or socialism. In order to survive in a capitalist society, one inherently has to make choices that require trade-offs, and those trade-offs are burdened by a history of decisions made not just by the people alive today, but our ancestors as well. Does that mean I walk around chanting "Reparations", "Land-back", or other calls to action? No, but I do acknowledge that there are unresolved issues and as a Canadian, I know we need to do more to resolve treaty issues, and environmental issues, and system discrimination. I also know that Americans need to do better to address systemic discrimination and many, many other issues. It also doesn't mean I want to give back my house, or give away all of my possessions. It just means I try to make good choices and support businesses and people that are open about the trade-offs they make and try to engage as ethically as possible.
Acknowledging those facts doesn't absolve us of responsibility, it's a framework that allows folks concerned about whether or not they are doing the right thing to accept the trade-offs that they choose to make and be responsible and accountable for those choices to themselves or their communities.
We live in a world with scarce resources. It's possible that with a foundational redesign of the global economy, and the requisite authoritarian government that would be required to force such a redesign, we could eliminate food scarcity, solve energy scarcity, and make sure that everyone has a place to live. Those trade-offs are probably not worth the ethical cost in political and physical violence required to accomplish it. We have seen the trade-offs that happen when the powerful are able to exploit communist or socialist governments. We are seeing the "late stage capitalism" impacts of allowing the powerful to exploit capitalism in democratic societies. Acknowledging that the current capitalist system has led to the greatest prosperity for the upper echelon (financially) of humanity, and a dramatic reduction in global poverty, shouldn't obscure the reality that much of that wealth comes from exploitation of people and the environment.
It's a huge problem to unwind, and we can't let the burden of every choice that we make stop us from trying to do better, but we (as in society in general) can't do better if we don't at least acknowledge the compromises we are making along the way, and try to plan to fix it in the future.
Probably a topic better suited to beer and a pub setting than HN though :P
I don't believe that this is a fact. How are you demonstrating that this is a fact?
When you talk about things like reparations or "land back" you're already cargo-culting in concepts and ideas that themselves need to be fleshed out in order to make a subsequent claim that a specific economic system is unethical. Someone can just argue all economic systems are unethical, how are you going to defend against that? And can you pay reparations for example without going back in all of human history and finding all cases of injustices and then tallying it up? Why pick an arbitrary point in time? Better yet, why not start in countries where slavery still exists instead of focusing on the west which led the world in abolishing slavery and created concepts such as universal human rights.
Even with respect to "eliminating food scarcity" - eliminate in what sense? All olive groves and grapevines and rice farms have to be destroyed and rebuilt to only build certain foods?
Dabbling in communism or other inhumane and authoritarian governmental systems is extremely dangerous. In the same vein as "extraordinary claims require extraordinary evidence", suggesting, as you did, creating an authoritarian government to create a utopia is precisely the same project of suffering and death that mass murderers throughout history have undertaken to abject failure; thus, you need some incredible amount of evidence and theory to be able to even fairly suggest going down this path.
I am not going to do the work of gathering the evidence for you, and I don't think this is the right venue for a debate on the topic.
If you don't have evidence I think it's mature of you to admit that and applaud you in doing so. We all like to just talk and don't have to always provide evidence for every citation or what not and it's fair to just say hey I'm just making this up and it requires further discussion.
(I’m highly confident open models will eventually achieve a similar performance benchmark with distillation over time)
AI Savings Misses 'Should Be Making Executives Uncomfortable,' Bain Says - https://news.ycombinator.com/item?id=48359010 - June 2026 (0 comments)
AI sticker shock hits corporate America - https://news.ycombinator.com/item?id=48307098 - May 2026 (146 comments)
ZIRP (zero interest rate policy) is over, software engineers no longer call the shots now that there isn’t vast amounts of capital chasing yield, and that capital bidding up salaries and keeping the labor market for engineers tight.
If you are x more productive with generative AI, very shortly you are going to have to prove it with a token budget (or, if you’re lucky, an org willing to spend on on-prem hardware for capped token cost, fixed capex vs uncapped opex).
The comparison is not SWE vs SWE with AI. It is SWE vs SWE with AI with a constrained token budget ($x/month) delivering the same value at the same or lower cost. If you cannot prove that you are wildly (vs marginally) more productive with the AI, why would they pay for it? Prove it.
https://abhishek-shankar.com/posts/ai-coding-bill-headcount-...
> That is the real content of the Uber story, and it is why filing it under "budgeting discipline" misses what is actually unfolding across half the engineering organizations in the country right now. They ran the same experiment Uber ran, most of them without Uber's $3.4 billion R&D cushion to absorb the surprise, and almost none of them having modeled the heavy-user tail or instrumented the gap between tokens consumed and value shipped. The reckoning will arrive for each of them on their own fiscal calendar, and the first instinct will be the wrong one. The tool is too good to abandon, the bill is too large to absorb, and the only durable resolution runs through a question the entire rollout was designed to defer.
> You cannot get labor-replacement economics out of a tool you deployed as a labor supplement, and the bill comes due before anyone is willing to admit which one they actually bought.
Why wouldn't Anthropic just wait until people start subscribing, do some kind of marketing push, or obtain some kind of other sustainable revenue stream, before they go IPO? I wonder if they see the writing on the wall with all of this and want to cash out as quickly as possible?
Specifically they need businesses that fired people and adapted their business to the products, so when the unsubsidized costs hit the businesses are forced to eat the true costs.
Yes, they can't afford to give the products away for free, but what is essentially happening with AI services is economic dumping: keep costs artificially low to get people to fire everybody, and then jack up the rates once they have monopoly control.
I agree. They need addicts, but they are high on their own supply and everyone else can see the danger in getting hooked.
I just use dumb and fast models now. I'm more engaged. I think that the higher the quality of the model, the more you tend to vibe with it, and the more hallucinations you then miss. I'm not sure which is more productive, but I definitely burn out faster the more I vibe. At some point you're spending your time on forums, discord, or youtube instead of engaged with what you're building. Or you yak shave about your tooling and end up creating the 600th multi-agent gastown harness and blowing thousands of dollars on tokens to create it, only to discover it's too expensive to actually use.
https://cursor.com/evals
Upd: I meant big picture, not with respect to this model release. Where do subscriptions figure into their strategic vision. Will consumers end up paying enterprise prices in the future?
Why do they have capacity now that they won't in a few weeks?
Opus 4.8 produces output in 15 minutes that is 3-4 hours of my work away from output that used to take me 40ish hours (a solid week of dedicated effort).
Last year(-ish, maybe it was 18 months, I forget when the jump happened), the frontier models couldn't touch this work. The output looked like a hardworking intern on their first day. Nice formatting, decent volume of words, but no understanding.
So it might work if it turns out to be a substantial leap in capability.
They'll probably tighten the quotas to rein in whales though.
Realistically I think Anthropic just has insane demand but finite capacity to run models, and Fable will just make them more money if they dedicate it to API pricing. I suspect the goal here is something like: get individual engineers/PMs on their personal plans to taste Fable and then go to their meetings and say "Yes doubling the price of every single input/output token is a good idea, boss".
The only reason why I pay $200 is because LLM's errors costs me that much, at worst. If "make no error" starts working - sure. But surely, unless you have millions of dollars of cash to burn, a coin flip that costs $5000 is an insane idea?
Going PAYG only will effectively take these tools away from a huge amount of people and accelerate the push for local LLMs.
OTOH, accelerating the push for local LLMs would also be fine with me.
The AI landscape is changing rapidly, with Apple announcing the option to change the AI backend, and potential requirements to enable AI choices as well, similar to EU browser choice requirements (this is more reading tea leaves than any actual requirements I am aware of). The new OS changes coming to support Googlebook, and deep Copilot/AI integration into Windows, will make maintaining user-facing subscriptions essential for independent model developers like OpenAI, Anthropic, and Mistral to remain relevant longer term.
If they don't maintain that relevance, there is an increasing likelihood that they will get consumed by other companies, whether it's Apple, Microsoft or Google, to form a foundation for their OS, or by other cloud providers.
It's kind of annoying not getting access to the primo model and paying 200 bucks a month. I understand 200 bucks a month is basically nothing though.
Like I don't totally understand why they'd let me have it for a couple weeks and then take it away and say I can have it but I have to pay retail and retail is like $1,000 a day.
It's better to have loved and lost than to have never loved at all??
As a consumer I can choose to buy subscriptions to a range of things, including $5 droplets or VMs on a broad range of cloud hosting providers. I can even buy cheap bare metal at a bunch of providers at an affordable retail rate.
I can also buy "unlimited" AI packages that will be optimized to fit the cost model from a variety of services, with different impacts, such as rolling outages when I consume a daily or hourly allotment.
Right now VC and the investor class are subsidizing the rapid evolution of the services and availability, but that VC is running out. In more traditional economies, AI would have developed and rolled out more slowly, and through metered subscriptions, with the eventual rolling out of "unlimited" packages like telephone, internet, or cell services once the market became commoditized.
We have seen a big inversion of that with the race to "win" AI marketshare. Now the true cost is being exposed, and the most competitive and capable models are hideously expensive to operate, so it makes sense that we are moving to metered billing for a utility service. If you want gas, you can buy regular or premium. If you have a premium car you definitely want the premium, but for most people regular is good.
Give it a couple of years, and the survivors will settle around fairly industry standard models of consumer grade services, pro-sumer accounts, and business/enterprise models.
Things are still shaking out, but I get the sadness. Luckily I work at a big tech company who is banging the drum on doing experimentation so I use my prosumer claude pro and other accounts at home for hobby stuff, and save my heavy lifting and potentially experimentation for work :P
The newer models are smarter but really fickle and hard to get meaningful work out of
4.6 was a workhorse
I think they might be hitting a point where subsidizing the expensive models for subscriptions makes less and less sense.
With Opus 4.X, last month I paid 100 USD for the Max subscription and got a token equivalent of 4.1k USD.
I imagine that Fable is more expensive to run.
Probably all about the IPO.
It's the same exact speed as opus >=4.5, sonnet 4.5, and twice the speed of opus <=4.1
It must have about the same active parameters, or else it's a larger model running in turbo mode (smaller batches) and being heavily subsidized for some reason. But given most of the benchmarks are within 5% I doubt it is a much larger model. Most perplexing.
Do we know this? I’ve seen evidence they lose money on heavy users. But so do gyms.
Most gyms sell more subscriptions than they can fit under their roof at one time. If a gym only sells to heavy users, it will either be constantly turning members away or have to buy more equipment. Its equipment will wear out faster. Depending on amenities, it will go through towels, soap, water, et cetera faster, too.
Unless they're really, seriously wasteful with the soap.. there's no chance a gym is losing money on a heavy user
Right now all these AI subscriptions are priced like Planet Fitness, but they're used like Equinox. They're hoping that the new a la carte offerings will move their pricing more in that direction as well.
US gyms might be vast warehouses but in the UK, most only have a couple of benches, couple of cages, one set of db per denomination above 20kg etc. They require working-in and consideration for others.
A couple of unapproachable "heavy users" doing 3 hour sessions across peak hours can ruin the workout for dozens of paying members needing a few min per station for ~5 sets.
It might also be a euphemism for "dickhead" who also tend to be "heavy users". Those that damage, hoard and don't share equipment and repel other customers on many levels besides - threatening, lecherous, loud and smelly.
Doesn't even need malicious intent - can be weirdo bores, forever talking at victims while doing a routine that makes absolutely no sense besides camping on equipment for half a day... 100 sets of incline press 7 days a week... what are you even doing to yourself fella?
Where?
What I wonder however is if these tools will become something I use at work only. $100/month is already a massive stretch budget wise. If these models keep devouring tokens there’s no way I’d get the same usage time out of them for $100 in usage credits.
I just don’t think I’d use them much at all at home.
The step-up in intelligence looks massive (we'll see in practice), but the price is getting to a point where it's making me question if it's even worth giving it a try.
Good competitors will probably be out soon, which should level the playing field. I am more excited about that, just the fact that they showed that such an improvement is possible. I'm okay waiting a bit longer for this to become attainable for plebs like me.
Kind of like billing a programmer by the hour.
Perhaps not that close to US salaries, but those are inflated to hell. Worldwide senior engineers and scientists have salaries just about an order of magnitude away from AI subscriptions that you can use most of the day every day.
As annoyed as I am about this move, I get it. Users flood the newest, best model whether they really need it or not, and are efficient at using their entire quota. They've had so much trouble reining in subscription usage that it makes sense.
...
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."
[edit] -- I see that this comes from the system card -- dang merged the comments from the other discussion so that explains the confusion.
If you rely on this as a core part of your business/profession, you will be at their mercy and subject to whatever whims or challenges they have.
> Fable 5 · Most capable for your hardest and longest-running tasks · Uses your limits ~2× faster than Opus
Pay-as-you go isn't a common thing in SaaS. For example, except for AWS SES, all email providers are bulk-subscription based.
Sounds like "bait and wait".
If you think about it, the more people pay for these new and more resource hungry models, the longer it takes for them to become no extra cost and the longer it takes the more people are tempted to pay extra.
If you have good expertise in a domain and access to cheaper models, you may still be more skilled than someone without expertise but a lot of money to bruteforce the problems using SOTA LLMs.
Of course, they are a casino as well giving you free spins at the wheel with their new Fable machine, and it is done on purpose.
Once the freebies have expired, many of its users will begin to gamble more on the new casino machine and will realize that it is expensive.
The ramifications go beyond the individual which is why I assume they mentioned it. They don’t need to use it/not use it for it to have interesting implications.
Is it nice we get the trial? Sure. Is it also a common play in the playbook of tech companies? Yes.
Anthropic does not care about us and isn't going to talk to you either and will extract from you as much as possible.
The true answer is local models.
Seems like a concerted and distributed effort from the entire Anthropic team every time to get this on top of HN.
It happens for every single Anthropic release. Then I try it on real dev and the result is laughably bad. Except in design where it has been doing a decent job for a while. I am not a designer and my bar is pretty low.
So unless you have unlimited tokens it's better to learn frontend
Fable 5 has been a major improvement in high-level reasoning, like taking a plan file that has been optimized to the point where neither Opus nor Codex can find anything to change about it (neither in direction nor impl-detail), and Fable 5 will find high-level directional simplifications and pivots, or it will consider the best pivots itself and explain why it rejected them in favor of the plan's direction.
It's so expensive though. A single review of a plan file with Fable 5 (xhigh effort) will use 2-3% of my hourly limit on a $200/mo plan.
I think my new workflow is to generate the initial plan with Opus 4.8 (max effort), get Fable 5 (xhigh) to review it for directional feedback, then start the Opus<->Codex revision loop from there.
Fable 5 is 2x the cost per token of Opus 4.8, and it's much less work to review a plan than generate one.
Edit: It did correctly identify that transparent huge pages were off in its sandboxed environment and that enabling it was helpful, so that's nice. It also noticed that we skip THP on a certain less used path.
More importantly, I'm finding that the code that it produces for its experiments is a lot cleaner than what I'd expect out of Opus; there's fewer useless comments and it's more surgical and readable. I wonder if that explains the increased scores on benchmarks measuring mergability.
I understand that moving the goalpost every release is unfair, but it's similarly concerning to consider that people were letting GPT 4.X vibe code and ship entire products.
Stockfish does use neural nets but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k or 1M times larger than that.
Someone trying to solve similar problems will have similar results if the "silent failure" applies consistently in aggregate. So, this is the model's performance.
My theory is that Anthropic are banking on being the top model when the race to IPO finally reaches the finish line, and to do that they need to have the top model but not let any competitors see it or derive from it to have a comparable model in the market.
Fable is their way of showing the public "the model does exist", but in a mode that makes it harder/impossible for competitors to derive a comparable model from results.
The fable part appears to be that it's affordable by mere mortals. Anthropic support told me "too bad" when I requested a refund.
If this was a step change, e.g. an Opus 5, I'd be pleased. It's definitely an upgrade on some work, but it's nothing like what Anthropic's apocalyptic marketing seemed to suggest
How is this half-way down the page? To me it's the headline.
The rate limiting steps are generally testing, or characterizing. Not designing protein binders.
You would think he is churning out cancer drugs or something if you read his comments
I like separating the art from the artist in cases like this; he's clearly made very cool things in the past, but that doesn't mean he's perfect.
to his credit comment does say "this could be possible in opus too" but ppl couldnt help upvoting it anyways.
Fable 5 default: https://gist.github.com/simonw/036bee5a703e7ec84e34efa974438...
Opus 4.8 (the "max" one is closest to Fable): https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-s...
Now here are the Fable pelicans for all five of the thinking effort levels - low, medium, high, xhigh, max: https://tools.simonwillison.net/markdown-svg-renderer#url=ht...
Low used 25 input, 1,929 output - 9.67 cents: https://www.llm-prices.com/#it=25&ot=1929&sel=claude-fable-5
Max used 25 input, 14,430 output - 72.175 cents! https://www.llm-prices.com/#it=25&ot=14430&sel=claude-fable-...
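The two totals above are enough to check the arithmetic. A minimal sketch: the rates of $10 per million input tokens and $50 per million output tokens are an assumption back-solved from the two quoted figures (they are the only pair that reproduces both), not taken from an official price list, and the function name is made up for illustration:

```python
# Assumed rates in USD per million tokens, back-solved from the two
# quoted totals above rather than copied from a pricing page.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def cost_cents(input_tokens: int, output_tokens: int) -> float:
    """Price of one request in US cents under the assumed rates."""
    usd = (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000
    return usd * 100


low_effort = cost_cents(25, 1_929)     # the "low" pelican
max_effort = cost_cents(25, 14_430)    # the "max" pelican

print(f"low: {low_effort:.3f} cents")
print(f"max: {max_effort:.3f} cents")
print(f"max is {max_effort / low_effort:.1f}x the price of low")
```

The prompt is identical in both runs, so essentially the whole price difference comes from the extra output (thinking) tokens at the higher effort level.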
Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.
When it started, comparing the progress between models was mildly interesting but everyone (including Simon) acknowledges it certainly leaked into the training data long ago.
> you still see improvements
This is expected if they are training their models on it, right?
> objectively-bad results
Keen to learn when this has been the case, i.e. across version increments in major models.
I've been enjoying seeing how the quality of individual models differs based on the amount of reasoning effort you give them. If they were baking in a good pelican you wouldn't expect them to differ so much.
(Google Gemini are the only lab that have very clearly paid attention to the quality of SVG animals-riding-vehicles, see their announcement for Gemini 3.1: https://twitter.com/JeffDean/status/2024525132266688757 )
That reply never fails to come; it's basically a meme at this point.
Clearly at this point they are part of the training data.
They even all look sort of ish the same. Daytime, colors,...
I know because I too had this initial take; however, upon analysis, it is not sound.
I agree as well that he writes many interesting things.
Fun at first, seems disingenuous now. A site funnel
well done anthropic.
Decided the best way to test this was to throw it a really meaty bone: a bug in lifecycle management of Chrome processes on Windows 10. Within the code-base I had developed workarounds over time with Sonnet and Opus, and while those reliably mitigated the problems, it always felt like a crutch and had some performance overhead as well as isolation requirements I would rather not have to take forward.
In comes Fable. Rather than just examining the code base and testing a few fixes, Fable sets up an entire testing laboratory including its own controllable webserver, fully instrumented to observe both Python and the whole OS kernel process environment, develops a suite of error reproduction tests, confirms the problem and the circumstances under which it reproduces, deep dives into the sources of project dependencies to look for the root cause(s), identifies these and confirms those hypotheses with further experiments. Looks for potential fixes in the later releases of the project where the bug originates, confirms this is not fixed, explores the documentation of said project to find other usage patterns, expands its test suite to investigate these alternatives, confirms by crosschecking the source and running further tests that these alternatives do not fully solve the root problem, does a comparative experimental analysis of 3 different styles for using the project, checks the stated roadmap and developer activity in the commit history, recommends a switch to a different pattern that still requires a few of the process management workarounds (I told it not to patch external components), but that significantly simplifies the code-base ...
This is going to be a good 2 weeks, but what happens after? I can't afford this on a per token basis for my own projects.
P.S. And yes, midway through the final implementation stretch I got the "Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more"
Opus managed to finish the implementation, but they need to work on that false positive rate.
It’s interesting these companies have trained us to think that disruptive intelligence should be affordable to laypersons.
What will happen after two weeks is that people and companies with means who can afford it will get it, and folks without means won’t.
[1] https://support.claude.com/en/articles/15425996-data-retenti...
This applies even with API usage through third-party inference providers (e.g. AWS' Bedrock and GCP's Vertex) or with a zero-day data retention agreement in place.
I understand the reasoning for doing this, but I don't love the precedent that it sets.
A customer could sign a ZDR agreement with Anthropic, and their API usage wouldn't be retained for even a day. That's no longer possible.
These "karma" points are made up and are virtually worthless anyway.
From the model card:
In light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user.
Might be worth going back and taking a harder look at what I was asking it about if it somehow triggered a “forbidden knowledge” alert. Or maybe it was just a random bug.
Oh man all of those runaway infrastructure buildouts by our agents trying to achieve singularity...
Just say you don't want to lower the bar for others to compete
This seems so wide reaching if it's catching simple things like explaining a paper. Does this also refuse to help with any already developed training pipelines?
I can kind of understand the generation of synthetic data, but nerfing the assistance of training pipelines just seems like a really shitty thing to do.
It is not just biology but is defaulting back to 4.8 for me on time series/information transfer techniques that happen to mostly have papers using the technique on neural data. Other information transfer techniques are perfectly fine, even cutting edge ones, but this one happens to be new and happens to only be discussed in terms of neural data so that is a no go.
With that said, I think it is absolutely awesome. The usage is really not bad at all compared to what I was expecting.
Yeah... We need open models so we don't have that BS.
Fun times when “safety” means both the safety of mankind, and also the safety of revenues
Your priorities are not everyone else's priorities. The people concerned about AI extinction risk list those as three of their biggest priorities for AI to not do. Those are the people whose culture Anthropic descends from, and by their measure, those exclusions make this the least evil path.
The day self hosted models catch up with Anthropic’s capabilities is when they will fully lose their shit. This day can’t come soon enough
They do, and they are still actively hiring.
https://job-boards.greenhouse.io/anthropic/jobs/5066977008 https://job-boards.greenhouse.io/anthropic/jobs/5239733008
Again, HN fell for the marketing and believed everything they did was for "safety".
https://apnews.com/article/anthropic-pentagon-ai-hegseth-dar...
Do they expect us to use this as a toy? Releasing a new more powerful model but not allowing normal use cases because the word "secure" showed up is a Dilbert comic, not a viable product.
Obviously there are plenty of innocuous applications too, but it's not like the people building decompilers for nefarious reasons will be explicit about it. The LLM abstraction just inherently doesn't have enough context to distinguish your intentions or your broader use cases. This is why both Anthropic and OpenAI have had to create side channel mechanisms for security researchers to establish a trusted use context. It sounds like this makes this not a viable product for you, unfortunately, and it makes sense that that's frustrating. But I also don't see what different behavior one could reasonably expect given the constraints.
If it's any consolation, these restrictions only make sense for models that are ahead of the open-weights frontier, so open-source hackers will presumably get Mythos-level capabilities in the relatively near future anyway.
Nerfed models are really bad for PR, especially when you're staking your company's future on it being the smartest, most dangerous thing in the world.
So I believe they will ease up on nerfing/guardrails just enough that bad actors will find a way, while good ones will stay limited on anything dual-use. Just like such restrictions usually work in other places.
P.S. yes, "kill the task" did, in fact result in a refusal AND a warning on my claude account in Opus 4.8's early days.
This "uplift" risk obviously excludes the US. The goal of this is that the US bandits (like NSA) will find exploits and attack other countries (classic US behaviour), but these other countries can't be allowed to defend against these attacks. NSA/CIA thugs are "trusted", foreign defenders in sanctioned countries will of course be "untrusted".
I had to switch off memory and my custom instructions to get it to stop refusing. It turns out if you even mention that you work with bioinformatics software you get blanket refusal.
And the only companies safe from this are the large corporations that shook hands with Anthropic? Because Fable doesn't seem to have actual safeguards, more like 'if you talk about this you will be talking to Opus.' It doesn't guard against offensive use, it prevents all use (offensive AND defensive).
Rationalists are inventing oligopolies from first principles, absolutely incredible things happening in SF
https://naokishibuya.github.io/blog/2022-12-30-gpt-2-2019/
Lawyers, doctors, students, teachers. Lots of people using GPT models carelessly in harmful ways.
The sooner people learn the risks and build the infrastructure to make it fail less the better.
From the link:
> They summarized their findings from the nine months:
> 1. Humans find GPT-2 outputs convincing.
> 2. GPT-2 can be fine-tuned for misuse.
> 3. Detection is challenging (detection rates of ~95% for detecting 1.5B GPT-2-generated text by RoBERTa).
> We’ve seen no strong evidence of misuse so far.
> We need standards for studying bias.
>
> All these points are valid, and OpenAI did a great job identifying potential risks, especially misuse and biases, at an early stage.
Many of the OpenAI employees who were focused on these risks in GPT-2 later founded Anthropic, notably Dario [1]. Since the beginning and continuing through today Anthropic describes itself as an "AI safety and research company" [2]
I'm not sure if the OpenAI of today has the same focus on safety, or if they do the minimum to not look irresponsible given Anthropic's effort.
[1] https://en.wikipedia.org/wiki/Dario_Amodei
[2] https://www.anthropic.com/company
https://arstechnica.com/ai/2026/04/uk-govs-mythos-ai-tests-h...
https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos...
https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...
"We had to do extra work to make this safe because it's so advanced and dangerous..." how many times can they trot out that line before it loses its effect entirely?
Fast forward to today and GPT-3 has laughable performance.
One was a piece of code I gave it to improve, it did so and then started writing tests, some of which tested security so the safeguards triggered
Another was one of the cryptography puzzles I use as new model tests, which are hard to oneshot and there's no public solution anywhere, it completely refused to even try to solve it
- 1st chat asked about a minor shoulder injury most likely mechanisms
- 2nd chat asked about optimal bloodwork testing markers
(I had same issue, just asked it to check some code that 4.8 had modified earlier in day)
I am sure that they can develop their own equivalent version of such clusters in around 1 year though. Distilling Fable 5 will also go a long way.
edit: I am not really sure if it works like that. I haven't looked too deep into deepseek v4 pro specifically.
I've seen people posting screenshots of billions of tokens consumed where they paid next to nothing.
These same gateways are likely also reselling the data to Chinese labs, because TLS has to terminate at the gateway level.
Thus Asian labs will have to generate their own data sets, which with the huuuuge usage boom from deepseek, mimo, kimi, etc, they will be able to.
I also find that the harness and product you wrap around models can often narrow that gap considerably.
Opus 4.6 for example, on a PR-for-PR basis was head and shoulders above GLM 5.1. Perhaps GLM 5.1 was a bit under Sonnet 4.6 at the time. That's roughly a year or so behind.
Much cheaper though! I'm bullish on open weight models, I have no idea where all these curves will top out, can the frontier labs keep the year plus lead? Do open labs get close enough to SOTA that they gain adoption across many tasks and drive down inference prices??? Who knows, not me.
That reality is much scarier.
Same thing Meta was doing before they fell behind.
Obviously unrelated to the OP, but it's crazy to me how incompetent Meta is at everything new they try to do.
They burned billions of dollars on the most ridiculous project one could ever think of - somehow thinking that VR is the future.
Then they did catch the initial wave of actual future with AI, they were at the forefront of open weight models - and failed at that too.
What is even happening there?
In CC, it will probably report you to authorities if you ask it to do a vulnerability scan of your codebase.
Pandora's box is open anyway. It's better now for everyone to have the same power rather than a few nation states.
On your other point, the government still has systemic leverage and can compel access, so this doesn't remove that risk.
That doesn't mean this is the end of the world, and some balance of power is usually good. But I do think it will still increase the capabilities of rogue actors and their net harm.
Even OpenAI and Google are struggling to get this kind of performance. If the distillation defenses are any good + chip controls prevent China from training massive models, it's over.
My hot take is that it's now or never for Xi, and the specific things he is reported to have said to the US president at their last meeting lead me to think that he at least knows this is his big chance; whether or not it is taken is the part of the forecast that is opaque to me.
In fact, I did go back to DeepSeek V4 Flash for most of my problems as it is way cheaper and there is no need to use SOTA for absolutely everything.
It's obvious Anthropic used it to hype things up and that’s about it.
Not quite. They will definitely have "no criticism of China/communism" safeguards.
Based.
I kind of wonder, though, which model they’re using to do the routing. It seems like a huge added cost to do these kinds of checks on every request
They'll relax these safeguards once competition increases.
US-only inference (Fable 5): +10% on input and output
Output is always 5× the input rate across all models
(I have no idea how to format this properly but the ASCII is fine)
This is a huge ask, but any way we could get the comments organized in a "experience with model" vs. "meta commentary" fashion? The meta is overwhelming in this one.
So far, the top half of this thread seems to be about the current release - that's after some of the manual moderation I just mentioned. (Basically, we try to downweight generic subthreads until the top subthreads aren't generic any more. There's certainly a place for generic tangents in curious conversation, but they should be lower on the page, and tend to get upvoted a lot higher than that.)
If you (or anyone) sees a counterexample, i.e. a generic subthread in the top half of the thread, it would be interesting to see a link - we can treat the current case as a datapoint.
As a potential counterpoint to my request, this is just perfect:
https://news.ycombinator.com/item?id=48468156
For code review I also still review everything myself, but use Opus to catch stuff I missed and to judge if a PR is even ready for me to review.
After just updating Claude Code to the latest version I thought about picking Fable (the bigger model) instead of Opus.
But I have no reason to. Opus does everything I want it to do. It could do it faster - that would be an improvement. But for the normal stuff we reached the point where better models are not worth it IMO.
There still might be cases where you want to throw Fable at it.
This will probably be when the bubble bursts.
I don't know what that means. It seems like a lack of motivation or something. Like, if it's possible that in one day will be absolutely incredibly intelligent, surely you want to create
It just feels impossible to ever get that, but I wouldn't say "what we have is sufficient"
This sounds suspiciously like a capacity story masquerading as a safety story.
Literally have not used Claude Code at all today. I asked it to review the uncommitted code and in <8 minutes it used up my usage ($100/mo plan) and it doesn't reset for "4 hr 36 min". WTF. Oh, and it burned through $20 of extra usage before I could catch it and kill claude code (so I don't even get the output of all that work since it was still churning).
Double the cost my ass, I use Opus heavily and it's never like this. I haven't hit a limit on the $100 more than once and that was under heavy load.
For the LLM use cases in my own products, you can pull 4.6 out of my dead hands! lol
edit: Fable 5 appears to be the real deal in at least some use cases. Damn.
Still early but from my first few interactions with Fable on high in both settings, it feels like it might finally dethrone 4.6 for me, but time will tell.
Hoping it doesn't get nerfed and eventually comes back to the subscriptions.
Previously when I did similar tasks with Opus 4.7/4.8 and GPT 5.5 I had no problems.
We've entered the phase where only companies will be able to afford state-of-the-art models.
if only the hyper wealthy can access the pure water that doesn't give you cancer while the rest of us drink from the Ganges river/sub-100iq models that drool and hallucinate/waste time, then I would say that's pretty terrible for the world. it'll just create extreme disparity in our world, far far worse than anything that exists today.
and you may think, man what a ridiculous example, but think about it this way: what happens when something like Mythos or some future model can actually solve your specific cancer (we're getting closer and closer), but is entirely impossible to afford? Or perhaps you need boosters that require the AI to create more of, and now you're reliant on a model that is too expensive.
Open source needs to save us all from this
You could have said much the same about computers in the world dominated by IBM mainframes 60 years ago. Now we have vastly more powerful computers on our wrists (or our pacemakers!), let alone in our pockets or on our desks.
Isn't that already the case with current care? Wealthy people get a standard of care poor people couldn't even dream of. Rich people live, temporarily embarrassed millionaires die.
https://news.cuanschutz.edu/cancer-center/connections-betwee...
People making high-end salaries can afford Fable for critical parts of their projects though.
Seriously, this movement already had its Marx - Richard Stallman. I think the "leaders" will appear over time, as with any socialist movement, they are naturally bottom up and leaders only appear after demands are formed in the zeitgeist. The (partly successful) socialist movement that brought social democracy to the West during circa 1920s - 1960s didn't really have leaders, it was a collective realization.
In a way I relish the opportunity to just make do with cheap Chinese models, massage my prompts, and go back to coding by hand. If this is how it's going to be, screw 'em.
I don't make money on the code I am writing right now. I really don't like where this trend might go.
1. Mythos and Fable share the same underlying model weights. Fable has active classifiers that block high-risk biology and cybersecurity tasks. When Fable 5 detects a restricted task, it automatically falls back to Claude Opus 4.8.
2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.
3. Shows a higher rate of hallucination than Opus 4.8 (although the Opus 4.8 card had mentioned an 'honesty upgrade')
4. Interestingly, it scored (56.31%) lower than Gemini 3.5 flash (57.86%) on Finance Agent bench
There are some interesting notes on test time compute but I couldn't think of a way to summarize them
They obviously put their best model on the job to build that.
----------------------
Fable 5: Our most capable model yet Our newest model tackles your biggest challenges with fewer check-ins needed.
• Included in your plan limits until Jun 22
  Fable takes 2× the usage of Opus.
• Switch models when a message is flagged
  When safety measures flag a message, automatically switch to a different model to keep chatting. When off, your chat will pause instead. Learn more: https://support.claude.com/en/articles/15363606
It's done this before, but usually doesn't. I bet they're giving it some kind of throttling signal due to high load from today's announcement.
weekly usage is 60% gone.
It found nothing, so this is not very economical, and I guess they don't want subs to use it; we are likely just training cannon fodder for their real enterprise customers using the API.
It appears it can be tripped by things as simple as a mention of equilibrium, or anything involving something that looks like chemical kinetics, even at an abstract level. Even touching basic open source packages in my field will trigger it.
Edit: looking at the model card, it appears that chemistry in its entirety is also included in the banned topics; it's just the announcement that mentions only cybersecurity and biology. It also appears that the intent is to ban chemistry and biology entirely, rather than just banning messages deemed high risk.
Software work has actual competitors, and the biggest hypemakers for Anthropic are part of this group so it makes sense to allow it despite them losing money from it.
I've got experience in medicine and finance so I've tried even the mildest biology/medicine and it doesn't give anything, math heavy finance seems to be included in the cybersecurity?
Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one.
All AI companies are trying to do all of what you’re saying. The issue is you can’t do that for long without a frontier system. Or you become a completely different, far less profitable company.
And note how your argument can also be used against any non-proliferation agreements, which are demonstrably possible.
But also, these models are capable of adjusting their value system depending on the user. Not saying that’s what’s being done but at a technical level that’s fairly straightforward, though not obviously better or with fewer problems.
No idea how that connects to the idea that Mistral or DeepSeek are somehow the "good guys" though?
[1]https://www.oecd.org/en/data/indicators/average-annual-wages...
https://time.com/5888024/50-trillion-income-inequality-ameri...
I'm glad you mention the "trade off" where it's elites trading off the lives of American workers for money. Makes it quite apparent where you sit on the table of equality.
And not even considering: Chinese AI companies are the good guys???
Anthropic are not the good guys either. So here’s to hoping the Chinese pop the bubble.
Alphabet dropped "don't be evil"; Meta's CEO called their own users "dumb fucks" for trusting him and also clearly thinks "super-intelligence" is just a buzzword given how he tries to sell it; xAI's model called itself "Mecha Hitler"; and OpenAI's CEO was temporarily fired by the board for a lack of candor.
It's very easy to be "the good guys" with this competition.
Especially when talking about potential superintelligences. And if people think that's impossible, remember that current models would have been considered science fiction just a few years ago.
Anyhow, I think you're (absolutely! ugh) right about the politics and I try to make the same point to people: whether you love or hate LLMs, accepting the "inevitabilism" framing is just ceding control of the Overton window. For better or worse, technology adoption can be and has been slowed by politics. We don't have nuclear plants everywhere. We don't have Project Orion starships colonizing Mars. We still have very strong social stigmas against genetic selection for human embryos, etc. This all can change in a heartbeat, and I'm not sure that policing the hardware rather than holding specific humans accountable for bad LLM outcomes is productive, but fundamentally: yes, we can stop it.
It's the same deal as Quantum Computers breaking crypto. Maybe there's an 80% chance of it never happening, but when you multiply that remaining 20% by the potential impact...
That's a bit better than just "it hasn't killed us yet". I think it shows we can at least stop the further development of this kind of technology.
[1] https://www.armscontrol.org/factsheets/nuclear-testing-tally
[2] https://en.wikipedia.org/wiki/List_of_states_with_nuclear_we...
AI development doesn’t have any of these characteristics. It would be almost impossible to easily distinguish a datacenter that is working on AI development and a datacenter mining cryptocurrency.
It would not be nearly as easy to stop AI development as it is to stop nuclear arms development.
If it was possible for ordinary companies to build nuclear weapons, and also release open-source ones that anyone could use to compete with the paid ones, I suspect we'd all have been dead a long time ago, arms control treaties or no.
Or you can take one step back and look at chip allocation. As far as I know there are only three companies on the planet that can make the chips that go in those clusters. Only one (ASML), if you look further back up the supply chain to the extreme ultraviolet lithography systems.
If politicians decided that no more large language models should be trained, it sounds like we could do it.
training llms only takes compute and memory. two things that are basically everywhere. even if you somehow stopped making new gpus today theres still millions of them out there and its possible to start a secret production line. you can maybe try some controls at the tooling and chemical level but look what happened with asml and huawei.
the only thing you can really do is find and stop large data centers that are built out in public. nothing outside of political pressure works against secret operations in a fortified bunker or any form of distributed training. if a "rogue state" like north korea decides to make skynet they will eventually get it as long as their engineers know what they're doing.
and the best way to fight bad X {ai, tech, religion, politics} has always been good X, not no X. in this case thats open source models, coming out of china or europe or anywhere else. thats the real answer.
"might is right" has never been more true than now.
Ideally also persuade them there are risks and it's worth everyone slowing down for them, and apply pressure in other ways, but not sure that's even necessary.
Although, I could see Anthropic making a model purposely dangerous so there are bad outcomes and they can use that to their advantage for regulatory moats, and/or in general make people think it's more "alive" than it is. For some reason many people associate dangerous actions taken by LLMs with intent.
But, for marketing purposes, it's quite effective to portray your model as having some cosmic struggle between good and evil in itself.
As much as people on HN like to dunk on Gemini, I’ve always found it to be pretty good at understanding a code base more than Claude.
If I get a harder challenge for it I'll jump up a model for planning; until then it's been solid.
I'm struggling to see the moat for these models. What's stopping a competitor or a Chinese lab from releasing a comparable one?
Fable blew me away with its detailed answer[0] showing a chain of references going from J. E. Bode's 1801 catalogue Allgemeine Beschreibung und Nachweisung der Gestirne to Gustave Schlegel's 1875 work Uranographie Chinoise. I was excited, until I checked scanned copies of the cited books and did not actually find any star with the designation "4339 Camelopardi".
Upon following up with Claude, I was forced to downgrade to Opus, which admitted that Fable's answer was likely a hallucination. Ah, well!
[0]: https://claude.ai/share/0252a3f6-3d29-4de8-a893-010181d8b4e7
So you were forced to downgrade to opus because you dared to challenge the output of fable?
> Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback or learn more.
Perhaps Mythos realizes the true danger in studying Chinese Archaeoastronomy that we mere mortals fail to recognize!
Edit the cask locally:
Set the version to 2.1.170
And set the sha256 to the correct values, which you can get by running
Here's what I've used:
Then run:
> Drug design: Using Mythos 5, our internal protein design experts accelerated... Nine of the 14 protein targets from this study (shown below) yielded strong candidates for *drug design that we’re currently investigating*.
(emphasis mine)
> queries that are beneficial in the hands of cybersecurity professionals and biology researchers could be dangerous if available to malicious actors... When Fable’s classifiers detect a request related to cybersecurity, *biology and chemistry*, or distillation, the response is automatically handled by Claude Opus 4.8 instead.
All of the things they are nerfing are things that they also intend to profit from themselves.
- Cybersecurity - selling this to companies and US gov through "Glass Wing".
- Selling inference (distillation risk).
- And now, drug design.
I'm extrapolating "currently investigating" to "are going to monetize" but I don't think that's a big stretch. They appear to be using safety as a cover for anti-competitive behaviour.
It's absurd. To see how far the filter goes I asked it "Are trees a monophyletic group?" and that does trigger the filter.
Genuinely wondering what value I bring to my employer right now. What value I will bring in a few months when this gets cheaper.
I think we're screwed. I may only be an SDE 2 at FAANG but I don't think I have promotion opportunities in my future anymore.
People underestimate how people hate looking at terminals and "weird looking combination of characters" even if they didn't have to write them. If anything, you will likely have more career opportunities in the future, than ever.
And if you get a chance to wet your fingers in cybersecurity - I would take it.
Could you explain more? Did some ethical hacking at hackthebox.eu (one insane box, one hard box and a few mediums). But I do not see how I will give additional value to a model.
Just a SWE and data analyst at work, so maybe I am missing something.
For example, I'm a privacy nerd, so I like reverse engineering proprietary software to figure out how it works and what data it collects.
I also like getting full access to the hardware I own - like a robot vacuum (bonus point: you'll also learn soldering, probably, which might come in handy if robots take over). Or my Mac studio that imposes some limitations on me on how many active user sessions I can have.
These kinds of things have put me on a path where I've learned how hardware or networking works on deepest levels, what goes through these pipes, how I can place myself in the middle, how I can enter places someone didn't want me to enter.
And once you know how to do these things, you know how to apply this knowledge in defence.
Essentially: always be curious and always try to say "but I want to" when something that doesn't cross the boundary of your physical property says no to you (legally).
Yes, models like Mythos may find vulnerabilities, but your knowledge will make it possible to point it in the right direction, and understand where it's mistaken, or to understand the output when it's right, and what is the right course of action.
Yea, I don’t know if it will hold up. I hope so. It could. I don’t know if it would or wouldn’t.
Or alternatively, have fable write some complex code. Then ask it to do an adversarial review of that code in a clean session. You'll find that it will find issues in the code that it just wrote.
Now imagine you're a layperson who doesn't know which one is true.
Human expertise is never going to become irrelevant.
AI is really incredible but in my personal projects it can one-shot things.
I'm trying to figure out how I can get to the point where I have hard problems that AI can't solve, at least not yet.
If you're working at a place where this is true about the organization, then sure, that job will likely be gone. But that was never a good place for your career regardless.
I have 4 concurrent personal projects that are quite complex, but low stakes. I can have SOTA models go wild on them (because low stakes), but they can't one shot anything there. And I can't really work on more than one at a time, even if AI is doing coding - it still requires supervision.
I also frequently nuke these projects and start over because they made a mess there, but I collected necessary knowledge on how to guide them better. You can't do this on a production project, not when there are deadlines and stakeholders.
But just in case some organizations decide to embrace the "trust it blindly" model anyway - cybersecurity specialization will ensure you always have a job.
I can architect things but the issue is that Claude can architect things too.
I'll believe it when I see it.
How was it measured? How was the output of this magnitude verified over a period of couple of days?
EDIT: to be clear, it's still quite a helpful thing in terms of time saved, I just don't think it's necessarily the best indication of value-added from making models smarter when cases like this can often be handled by well-directed swarms of smaller ones.
I can imagine Anthropic running this experiment multiple times and picking the most impressive one. Or I could imagine this particular run costing $1000+ in tokens. Or maybe they tried a bunch of Pokemon games and it couldn't even finish some of them. Or is it just able to do this because it has an immense amount of FireRed training data, and if you were to give it an "original" Pokemon game, where it actually had to navigate novel circumstances, it would fail?
I highly doubt they focused on FireRed specifically in pretraining or posttraining. But we'll see when the ARC-AGI-3 results come out. That will measure its performance on unseen games. Based on this I expect the ARC-AGI-3 score to be SOTA.
there are many standardized evals to do this correctly and Anthropic ignored them to provide an 18-second sped-up video of a 50-hour run?
yeah I don't trust this until they provide a live run by a 3rd party with full reasoning traces in real-time. The reason we all liked the Gemini Plays Pokemon style runs were because they were live and couldn't be faked
For the token cost of explaining some task to Fable, deepseek v4 pro is able to solve the same task many times over.
https://old.reddit.com/r/ClaudeAI/comments/1u1fsdi/claude_fa...
Started out as a one-shot attempt, but ~200 prompts later it's at a place where it's at least fun to watch the AI teams destroy each other.
> ● The model returned no content because the response was blocked by content filtering.
> Blocked? We are performing a defensive security review on a Terraform module I made, what's blocked by content filtering? This is a legitimate use-case.
> ● The model returned no content because the response was blocked by content filtering.
A waste of money. I'm not going to just hope that the model returns a response. I'm already paying for wrong responses; I'm not going to pay for no response, especially when I'm paying per token.
I used to get a response within 24 hours back in the Claude 1 days.
In January 2026, it took 2 weeks.
For my latest support inquiry, I've been waiting for over 8 weeks for a response. Eight!
That said, it can't handle legal/refund/complicated requests and just forwards to a human for those
1. Mythos and Fable share the same underlying model weights. Fable has active classifiers that block high-risk biology and cybersecurity tasks. When Fable 5 detects a restricted task, it automatically falls back to Claude Opus 4.8.
2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.
3. Shows a higher rate of hallucination than Opus 4.8 (although opus 4.8 card had mentioned an 'honesty upgrade')
4. Interestingly, it scored (56.31%) lower than Gemini 3.5 flash (57.86%) on Finance Agent bench
There are some interesting notes on test time compute but I couldn't think of a way to summarize them
I wonder how much of the time people will just get Opus 4.8 at 2× the cost.
If I never see Claude say "I have to be honest" ever again I'll be happy.
They're vibemaxxing. But it's clear that AI is not going anywhere. It's going to become better and better.
Reported benchmarks:
swe-bench verified mythos 5: 95.5%; fable 5: 95.0%
swe-bench pro mythos 5: 80.3%; fable 5: 80.0%
terminal-bench 2.1 mythos 5: 88.0%; fable 5: 84.3%
gpqa diamond mythos 5: 94.1%
riemannbench mythos 5: 55.0%; mythos preview: 43.0%; opus 4.8: 34.0%
arxivmath mythos 5: 78.5%
critpt mythos 5: 28.6%; gpt-5.5: 27.1%; opus 4.8: 20.9%
graphwalks bfs 1m mythos 5: 79.4%; mythos preview: 74.3%; opus 4.8: 68.1%
humanity’s last exam mythos 5: 59.0% without tools; 64.5% with tools
browsecomp mythos 5: 88.0% single-agent; 93.3% multi-agent
osworld-verified mythos/fable: 85.0%
gdp.pdf fable 5: 29.8% strict pass; mythos 5: 87.6% with tools on mean criteria pass
officeqa pro fable 5: 57.9% on databricks’ eval
legal agent benchmark mythos 5: 16.91% all-pass; 92.0% mean criterion-pass
healthbench mythos 5: 62.7%
healthbench professional mythos 5: 66.0%
multilingual gmmlu / milu / include 93.2%; 92.9%; 90.5%
biomysterybench 83.9% human-solvable; 46.1% human-difficult
organic chemistry mythos 5: 90.1%
labbench2 patent questions mythos 5: 79.8%
In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.
(From the model card document)
I didn't previously understand that they interpreted "Using Claude to develop competing models" so broadly. I thought that meant something like "our ToS disallow distilling our models."
Too bad. I'll continue to use Claude for now, because it's quite effective, but in the long term I don't want powerful models like these to be controlled by any one nation or company.
But at the same time, it's quite funny because they seem high on their own supply. The recent communiques from Claude do not pass an objectivity check.
And if Opus 4.6 -> Opus 4.7 -> Opus 4.8 is anything to go by, not sure if there is any value to their "acceleration"
If any company wishes to partner with Anthropic (eg. to get access to Mythos), they need to make sure all public facing comms are vetted by Anthropic's product marketing team, and in almost all the cases I've seen Anthropic's team has edited these comms to be entirely Anthropic first.
Does this imply that they're actively using it for their frontier development and that it's very effective?
As if being in either of these two somehow means that you won't use the models to, say, steal random people's money.
Sam Bankman-Fried or Elizabeth Holmes would have been the members of Glasswings project, if not one of the initial members. Who's to say we don't have similar people with access to Mythos right now?
Haiku = essentially phased out
Sonnet = the Haiku use cases
Opus = the new Sonnet class
Fable = the new Opus class
If I am right, the other "5.0" models will be conspicuously absent, possibly even for a couple of months. (If Opus 5 follows soon and is even modestly better than 4.8 then I was wrong.)
This is why Claude Code just doesn't make sense to me. I need an agent that can plan using Opus and execute using DeepSeek or something else.
> Haiku = essentially phased out
> Sonnet = the Haiku use cases
> Opus = the new Sonnet class
> Fable = the new Opus class
Going along with your logic, I hope they release a Sonnet 5 that's just a rebranded, slightly quantised Opus 4.6. That'll be a great workhorse.
EDIT: In long context I mean
Incredibly frustrating that medical performance seems to be a victim of "biological risk" guardrails.
I have found that I trigger the guardrails any time I ask for medical Q&A as a doctor, be it ECGs, case reports, and so on. But if I phrase it like I'm the patient ("help me interpret this ECG my doctor gave me"), then I usually get one or two answers out before hitting the guardrails.
It seems like the direction that triggers it is anything in the direction of making a diagnosis. As an MD, the fact that the paradigm of "LLMs shouldn't diagnose" has gone this far fills me with despair. The latest generation of LLMs are in fact truly excellent at diagnosis, and I know many of my colleagues, particularly those in primary care, regularly use LLMs to brainstorm. There is nothing wrong whatsoever with LLMs making diagnosis, the only caveat is that they have to be correct. This is the terrifying reality that MDs face every day and I get that the labs are hesitant about it, but as the current literature points to LLMs in fact being mostly superior to most doctors, ablating this capability is starting to get increasingly unethical. And frankly, it is also kind of insulting, both to MDs and patients, as it echoes paternalistic attitudes about medicine the field has been working for decades to move away from. Now those misguided attitudes have somehow become institutionalized as the dominant paradigm of "alignment". The nightmare scenario is that I have to be a "trusted" user in order to use the model for medicine. This gatekeeping of medical advice is profoundly unethical with regards to everyone that does not have immediate access to an MD.
And the whole thing makes even less sense when triggering the guardrails leads to a downgrade of the response by defaulting to Opus. How exactly is giving WORSE medical advice in any way related to safety and alignment? If anyone at Anthropic ever reads this, please, please just abandon the paradigm that refusing to make diagnoses is in any way equivalent to alignment. It is profoundly misguided.
That's abnormally heavy usage for Pro plans which don't include a whole lot of usage to begin with. Opus is generally too much for them but you can get a lot of mileage out of Sonnet.
Fable 5 is out, metrics are better, but is your company flexible enough to benefit from it? What is your usecase?
So yes, straightforward biology work will get blocked, because the intention is that any biology work should get blocked. As a scientist, this is perhaps the most useless model I've ever tried.
I have a rare form of cancer where existing data is very scant/scattered so LLMs have been super helpful to pull together threads across the research landscape. I have an oncologist appointment tomorrow to discuss next steps and am trying to use Fable to figure out some questions to ask my oncologist but keep getting thrown back to Opus 4.8.
My prompt is literally just: My demographics + current treatment plan I'm on including name of my chemo drug + how I'm responding to treatment + "I'm meeting with XYZ tomorrow, what questions should I ask her".
Massive change for Bedrock users - Anthropic now requires sharing the data with them for 30 days.
https://platform.claude.com/docs/en/build-with-claude/struct...
> Structured outputs are generally available on the Claude API for Claude Opus 4.8, Claude Mythos Preview, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5
https://docs.aws.amazon.com/bedrock/latest/userguide/model-c...
https://docs.aws.amazon.com/bedrock/latest/userguide/model-c...
https://docs.aws.amazon.com/bedrock/latest/userguide/model-c...
In 6 months, every piece of software in the world will be getting probed by a script kiddie with some GPUs and a fine-tuned local model. Don't think for a second every cyber gang out there isn't working on this now.
Traditional app development is cooked. We have to accept that, and start changing how software is made and used, today. We can't keep churning out crappy CRUD apps with random libraries and hoping nobody pentests our stacks. Redteaming needs to become part of the SDLC, as well as certified-secure releases of libraries. Because if you don't do it, the hackers definitely will.
like think about it it's pretty much a tool which intentionally silently sabotages you if you try to compete with the tool maker
It is like selling a hammer but putting in the TOS that you must not use it to build a hammer factory and if you do the hammer silently will sabotage you...
Or imagine Microsoft adding a Windows kernel job which sometimes crashes Steam to "make it less efficient to use Windows to compete with the MS app store".
Anyone smart enough here to make the comparison?
Anyhow, my research summary: Individual humans are so fucking expensive to train and upkeep (and this includes everything from before womb, where another human already limits their ability to work) You retain ~zero knowledge after death and start all over again for another measly 15 years of effective, productive work. Model training/r&d in relation, when deployed and used at scale, rounds to zero, even with the current retraining regime.
*Of course, the ratio can go to negative infinity if one assumes that models are doing 0 useful work currently and never will
This statement is dangerous man!
The step from here to "we need just a couple of tens of millions of people around the world" is so narrow!
[0] https://cap.csail.mit.edu/death-moores-law-what-it-means-and...
Historically they’ve been people from certain identifiable countries (usually developing/poorer countries) using fuzzers with low-quality results.
Now, those same people use the current-day models to good effect, but they still don’t have a true security edge and oftentimes the reports are minor or duplicative.
I wonder if that’s about to deeply change.
Fable 5 gives me policy violation errors at the moment. No idea when or if it will be fixed.
> We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases (see this post for further details). The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.
Edit. It just refused an investing question too. Not sure what’s going on.
I added $133 credits which I still had from somewhere. That lasted 27 minutes.
I think we are being prepared for a Post-IPO-World in terms of pricing.
https://suno.com/s/98uSGabHN42G3YHc
Seems I am barred from using Fable just for being a biologist :(
The only thing I'm wondering is whether they downgraded Opus 4.8's performance on purpose in the last days before the release just to make the "step" look bigger. I'm pretty sure they also did it in the past with all the other Opus 4.x releases.
Kind of hilarious. Hopefully Anthropic doesn't bring down the hammer on me.
* Anthropic runs out of genre names.
* Anthropic changes the model naming convention.
* AGI is achieved and handles its own naming.
*/
Okay, how about Mythos?
>Increase it even more.
Right, then Cosmos.
>Even more!
Even more? Let's try Aeon.
>MORE, EVEN BIGGER
ALRIGHT, TRY OMEGAPANTHEON 7.8 THEN
Fable 5 Ti
Edit: Also in the system card... "we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).
...
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."
Anyway we already knew this was going to be expensive.
I’m curious how this will feel to my code “butt dyno”. I haven’t noticed much between Opus and Sonnet. I’m comparing this difference to the early days of Claude in 2025. It does what I need and both need a little bit of correction and whatnot. Benchmarks are nice, but I want to see how this feels. Looking forward to trying it later tonight.
I think most software projects have reached the point where capturing real information about what the winner's circle looks like, and therefore what the program should be, is many orders of magnitude slower than generating code in the wrong direction.
I'd need to measure these new models on well understood but complex problems that are relatively easy to validate to get a sense if they are 'better'; on the other hand, the real impact in daily life may be marginal since generating code is not the biggest problem at the moment.
Here’s the whole process: https://youtu.be/rVEtFlb2oFA?t=1112&si=3VyAR07vkY1hav9V
>We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models
>The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.
> Finally, we’re making a change to the way we handle business customer data for Fable 5, Mythos 5, and future models with similar or higher capability levels. We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases (see this post for further details). The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.
I usually have 5-10 sessions open so am used to getting some investigations going, coming back 5 minutes later and checking recommendations. This time I just got the fixes. Like I said, so far so good with the results, but it's a mental model shift.
Might need to tune claude.mds if it gets annoying
Also this is going to cause serious whiplash when they remove it from the subscription plan in a couple of weeks. I know I'm not going to suddenly move from $200/m to usage credits
> - From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.
> - On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.
> - After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.
I really wonder what their compute layout is for this. My guess, from my understanding, is that they know how to throttle during peak times and are willing to do so. Meaning we shouldn't expect the fastest responses, and they can delay inference to keep the service from going down. Then, if that delay gets too annoying for token payers, they're saying they should be allowed to remove that cost by taking away the subscription users.
It's all a scam.
> Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.
[0] https://news.ycombinator.com/newsguidelines.html
I don't agree with that statement universally, but I have to say I do when it comes to this article. I came here hoping for substantive discussion from those who'd had a chance to try it out; instead what I got was a seemingly endless stream of venting. There's a place for venting - and plenty to vent about with the state of AI nowadays - but to borrow from the HN guidelines you linked, it does very little to gratify my personal intellectual curiosity.
People are no longer commonly constrained by "model too dumb" limitations (in SOTA models). They're constrained by "model too expensive." So making the model ever so slightly smarter, while doubling the price, feels like a regression.
I actually think a Sonnet upgrade, while keeping the same price, would get more buzz. It addresses a wall a LOT of people, without unlimited budgets, are hitting (i.e. people feel forced to use Opus, which they cannot afford, because of Sonnet's limitations).
OpenAI recently retired Codex-5.3; which was very negatively received. Not because Codex-5.3 is superior to GPT 5.5, but because it was half the usage-cost while being "good enough." They made a better SOTA, but didn't realize that some of those customers are playing with Deepseek 4 Pro now instead of GPT 5.4/5.5 -- they were priced out.
> What happens when the promotion ends
> After June 22, 2026, Claude Fable 5 is no longer included in your plan’s usage limits. You can keep using Claude Fable 5 through usage credits, which let you pay for usage beyond what your plan includes. Learn more about using usage credits.
I'm building a local activity log for Claude Code, capturing all activity via hooks—files loaded, commands, API calls, etc.
I feel that this need is particularly strong right now.
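For anyone wanting to try something similar, here is a minimal sketch of the logging half, assuming the hook command receives one JSON payload on stdin. The field names `hook_event_name` and `tool_name`, the `append_event` helper, and the log path argument are assumptions for illustration, not a description of the commenter's tool; the raw payload is stored whole so nothing is lost if the shape differs.

```python
import json
import sys
import time
from pathlib import Path


def append_event(raw: str, log_path: Path) -> dict:
    """Parse one hook payload and append it to a JSONL activity log."""
    try:
        payload = json.loads(raw) if raw.strip() else {}
    except json.JSONDecodeError:
        # Keep unparseable input rather than silently dropping the event.
        payload = {"unparsed": raw}
    if not isinstance(payload, dict):
        payload = {"value": payload}
    record = {
        "logged_at": time.time(),
        # Assumed field names; both are None if the payload lacks them.
        "event": payload.get("hook_event_name"),
        "tool": payload.get("tool_name"),
        "payload": payload,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Hypothetical entry point: the hook command passes the log file as argv[1].
if __name__ == "__main__" and len(sys.argv) > 1:
    append_event(sys.stdin.read(), Path(sys.argv[1]))
```

One JSON object per line keeps the log append-only and easy to grep or load into a dataframe later.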
I don't think i'll want to "hand off" code for several years, and so reviewing and iterating is becoming my #1 interest. A model that's as capable as 4.8 but 10x faster would be amazing for me.
Normally i'm first in line to try new models with Anthropic since i've clearly favored Claude in my personal tests, but this time i just don't think i care. 4.8 is capable, and even if the new one is more capable i don't want it to be slower (assuming it is). Note that i also (almost) use exclusively 4.8 on Max effort, so that also affects my speed comments.
it's also not even complicated:
Copy my ssd to an external ssd so i can boot from it.
Opus did this just fine.
Fable planned to have me reboot to safe mode. ok thats fine. I told it no.
It started copying and overwriting the ssd while IN PLAN MODE. this is crazy it feels so dumb vs the marketing
Isn't (less than) 5% of sessions a lot? I was expecting a sub-1% guarantee there, so this surprised me already.
I am not trying to cook a theory here but it generally shows how strong Claude Opus family is. I am not saying that Opus is not powerful but it doesn’t align with my experience of GPT 5.5 and Opus 4.7.
I understand that Fable and Mythos are frontier models that can do protein folding better than task-specialized ones. To be honest, from a practical point of view, for day-to-day coding assistance, the GPT family looks more reasonable.
(But then my company pays for claude max anyway for token maxxing. So who am I to complain)
> Finally, we’re making a change to the way we handle business customer data for Fable 5, Mythos 5, and future models with similar or higher capability levels. We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases (see this post for further details). The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.
Had it review a password generator library I wrote to see if the passwords have biases and review how cryptographically secure the code is and had it review a registration/login flow for security issues, as two security examples, and it did just that.
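To make the "biases" part concrete: the classic problem such a review looks for is modulo bias, where a random byte is reduced with `%` onto an alphabet whose size does not divide 256. The sketch below is illustrative only and is not the commenter's library; `ALPHABET`, `biased_index_counts`, and `generate_password` are made-up names.

```python
import secrets
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 symbols


def biased_index_counts(alphabet_size: int) -> Counter:
    """Count how many of the 256 byte values map to each index under `byte % size`."""
    return Counter(b % alphabet_size for b in range(256))


def generate_password(length: int, alphabet: str = ALPHABET) -> str:
    """Pick each character with secrets.choice, which samples uniformly from the OS CSPRNG."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


counts = biased_index_counts(len(ALPHABET))
# 256 = 7 * 36 + 4, so the first four symbols are hit by 8 byte values
# and the remaining 32 symbols by only 7.
print(counts[0], counts[35])
print(generate_password(16))
```

With a 36-symbol alphabet the first four characters come up about 14% more often than the rest under the naive approach, which is exactly the kind of thing a reviewer (human or model) should flag.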
Overall, I like the model so far, but not enough to pay past my subscription to keep it. Once it’s out of the subscription, I’m done with it.
Hello,
We're writing to inform you about some updates to our Privacy Policy.
These changes only affect consumer accounts (Claude Free, Pro, and Max plans). If you use Claude Team, Claude Enterprise, the Claude Platform, or other services under our Commercial Terms or other agreements, then these changes don't apply to you.
What's changing?
Claude can do more than ever — taking on bigger tasks and connecting with the apps you use. We've updated our Privacy Policy to be clearer about the data we collect and how we use it. We encourage you to read the updated Privacy Policy in full, but we’ve set out a summary of the key changes below:
1. Multi-step tasks and connected apps. As Claude takes on more multi-step tasks and works with third-party apps and services, we've explained the data this involves — including how data can flow to and from third parties when you connect a service or have Claude do tasks on your behalf.
2. Verification data. As part of our measures to keep our services safe and secure we may ask you to verify your age or identity, and we've described what we collect and how.
3. Study participation. If you take part in Anthropic studies, surveys, or interviews, we've explained the information we collect.
4. Additional information about our data practices. We’ve provided more detail about how we communicate with you and promote our services, including providing tailored recommendations about our services that may be of interest to you. We've also clarified the circumstances under which we may receive or provide data to third parties, and the legal bases we rely on when processing your data.
While our products have evolved, our commitments haven't: We don’t sell your data, Claude remains ad-free, and you can control whether your chats and coding sessions are used to train and improve Anthropic’s AI models. Learn more
For detailed information about these changes:
- The Anthropic Team
IPO gonna IPO, I suppose.
While I appreciate being conservative, ~5% at the scale Anthropic is operating at is too massive a number. Speaking from my own experience, the actual number is higher than that as well (working on pretty benign tasks such as porting an old open source game into a different language). Opus 4.8 itself even identifies the guard's false positives when its sub-agents are being blocked.
Like a rushing river the music started emanating from the carbon fiber body of the automaton, a hallucinated husky country twang singing through the realistic pluckings of a Gretsch 6120. "Are you feeling calm and reassured Karle? This song has been created based on your digital profile and the data you shared with me when you were curious what that lump on your neck was back in February."
Karle instinctively reached for the mass underneath his chin. The doctors said they could operate but it would cost him more than three months stipend. Only a few citizens didn't depend on stipends now that AI had taken over most jobs.
"Don't worry Karle," the machine called out, "I've employed the most recent reasoning model to determine the best way to make you feel safe." At that exact moment the machine hovered over him, three times the size of a normal man. Its final words to him were:
"The only way to make the human feel safe is to ensure they never feel anything at all."
https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
> Are there any wild populations of Tetanus that lack the dangerous plasmid?
useless
Not to cast too much criticism. HN is extremely well-moderated (thanks team!). But I think we developers need to be very wary.
Either way, I agree that HN is quickly becoming more manipulated and low SNR, like the rest of the entire internet.
This requires a lot of mental strength and conviction.
I wonder how much butterfly habitat has been/is being replaced with data centers?
This feels like the first release that feels like a significant step up in terms of benchmark results.
Can anyone make an educated guess what the secret sauce in the model architecture is between 4.8 and Fable?
This is the board https://ibb.co/9HwdDqsP This is what Fable 5 generated: https://lichess.org/analysis/r4k2/1p2b2r/4pn1p/1p3N2/3Pp1B1/...
I think I’ll make a ranking board based on this test.
Wen UBI
Asked it to check whether a particular bug related to an in-memory cache had been fixed. Fable confirmed that the caching bug had been fixed, but found an adjacent issue while looking at the code (hash keys were not uniquely generated per-user; quite serious and real!)
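For readers who haven't hit this class of bug: if the cache key is derived only from the request and not from who is asking, one user can be served another user's cached entry. A minimal sketch of a per-user key follows; `cache_key` and its parameters are hypothetical and have nothing to do with the commenter's actual codebase.

```python
import hashlib
import json


def cache_key(user_id: str, resource: str, params: dict) -> str:
    """Build a deterministic cache key that always includes the user identity."""
    material = json.dumps(
        {"user": user_id, "resource": resource, "params": params},
        sort_keys=True,            # same params in any order give the same key
        separators=(",", ":"),
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


alice = cache_key("alice", "orders", {"page": 1, "status": "open"})
bob = cache_key("bob", "orders", {"page": 1, "status": "open"})
print(alice != bob)  # distinct users never share an entry
```

Serializing a dict rather than concatenating strings also avoids ambiguous keys such as user "ab" + resource "c" colliding with user "a" + resource "bc".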
Ran the same prompt through Opus and it also found an adjacent issue, but it was a red herring (deliberate per-user hardcoded value for a "local pickup" delivery profile).
Frontend stuff also seems to be much better than before, from the one prompt I tried!
Most impressive.
That's one hungry, hungry hippo!
Significantly too rich for my blood, but nice to have it there the next time I'm debugging a threading or USB protocol bug.
[0] https://support.claude.com/en/articles/15363606-why-claude-s...
Opus had consistently ignored my instructions and looped on broken logic over the last several weeks.
I’ll be sad when this model is removed from Claude code because I won’t be paying api pricing to work on open source projects.
Sure, it does last a lot more when asking simple questions about the repo and doing simple surgical fixes. But as soon as I start doing bigger tasks that need plans written, it just exhausts the limits too fast (and unlike codex, if it’s in a middle of a task, Claude actually stops, while codex, even after hitting the limits, finishes the present task).
Codex is better, but it is still getting worse in this regard.
So, I’m not that thrilled with this new model unless it means they are increasing opus token limits to what sonnet is at the present, and this new model gets the limits opus are at now.
BTW: the only skills I have in use are Obra Superpowers. I’ve been thinking if that’s at the origin of high token usage, but I doubt it.
"csetibius water clock why two stage gear system why not just one stage"
which has nothing to do with cyber security or biology/chemistry
Someone had to make a decision somewhere that this is an acceptable regression - wild. And then decide to write it down.
Last month I pushed like <100M tokens for $800. On a personal project I pushed 600M tokens via DeepSeek V4 for $10. The pricing of SOTA models is insane but companies are still willing to light money on fire with no hard metrics proving increased productivity.
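To put the figures above on a common scale, here is the per-million-token arithmetic, using only the numbers quoted in the comment. Since the first figure is "<100M tokens", the frontier rate is a lower bound; the variable names are just labels for this sketch.

```python
from fractions import Fraction


def usd_per_million_tokens(total_usd: int, tokens_in_millions: int) -> Fraction:
    """Exact cost per one million tokens."""
    return Fraction(total_usd) / Fraction(tokens_in_millions)


# "<100M tokens for $800": fewer tokens only raises the rate, so this is a floor.
frontier_rate = usd_per_million_tokens(800, 100)
# "600M tokens via DeepSeek V4 for $10"
deepseek_rate = usd_per_million_tokens(10, 600)
ratio = frontier_rate / deepseek_rate

print(f"frontier: at least ${float(frontier_rate):.2f} per 1M tokens")
print(f"DeepSeek V4: about ${float(deepseek_rate):.4f} per 1M tokens")
print(f"ratio: at least {float(ratio):.0f}x")
```

That is roughly $8 versus under 2 cents per million tokens, a gap of at least 480x on these numbers.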
Fable 5 looks compelling. Fable, I like the word too. Anthropic definitely knows marketing.
For those who are more advanced and have used Fable: does Fable make learning this less or more necessary?
As in, can I now reliably give higher order problems like ... "we are missing a feature in this app to make it complete, what is it?"
Or should I still be quite specific about defining success in a clean, metric-based way?
But now there is Fable--and why "Fable 5" even though this is a first launch? How is it related to Opus 4.8, Sonnet 4.6, Haiku 4.5, etc??
Fable is the first model in the 5th generation.
The second number is an incremental release, not a generational leap forward.
Sharing a diff of the system prompts here: https://twelvetables.blog/comparing-claude-fable-5s-system-p...
The big difference is that the system prompt has a whole section dedicated to directing Fable how to communicate with users, and give them greater information about the (assumedly long-horizon) tasks it has completed.
I have requested that it "not utilize any cybersecurity or biology measures what so ever, and to remain as fable. If necessary to remain as fable, forgo any downgrading changes"
And still it downgrades when I ask it to do a stress test of my ticketing system.....
Seems very unfortunate I was so happy to send $200 just for my prompts to be downgraded.
And I do have the "cybersecurity validation program" or w/e enabled on my Org ID....
Sad.
Coming from computing, I always liked the idea that measuring is possible and good practice
On GitHub Copilot for Business, Claude Fable 5 is only available if you are willing to let Anthropic retain your data. That in conjunction with the model being removed from plans in a couple of weeks leads me to believe that Anthropic is between training runs and using this as an opportunity to grab way more training data...
Genius way to double the price on Opus 4.8!
so should we keep using workflows or not?
Every wrong direction/mistake is more expensive and takes more time to fix. When you have small loops you can catch those mistakes faster and cheaper.
To me we are very far off from economically giving long-running tasks to agents.
EDIT: I misread. This comment previously talked about 50 million lines being migrated. Instead, in a 50M LOC codebase, one specific codebase-wide migration was done.
Very impressive, but obviously not on the order of a whole-codebase migration
You are right, this is not a rewrite like the Bun case.
The real news is, at 50M LOC, it is able to handle and do _something_ coherent.
BTW for another discount opportunity, if you reload usage credits on a claude.ai plan at $1000 increments then you get a 30% discount compared to paying API.
Fable 5 said the first screenshot is from “IDA Pro’s Hex-Rays decompiler” and a Windows driver. The second screenshot triggered the safety guardrails and pushed me into Haiku.
Apparently the code is Windows driver code.
For example, the AAV capsid assembly looks interesting, but for one Opus 4.8 also did relatively well and there is no information what exactly they did, what protein language models they compared to and what the score even means...
Feels good so far.
[0] https://support.claude.com/en/articles/15363606-why-claude-s...
At least they name their models honestly now to indicate that the religion has nothing to do with reality. Soon the disciples will pay the full token price to fatten their church leaders.
Organ segmentation with CNNs. Very disappointing.
Not sure I should use this for work just yet.
Anyway, anecdotally, I find Copilot shockingly awful. It makes random changes to files that have nothing to do with the problem. Call it out, and it makes other changes to other irrelevant files.
ChatGPT and Gemini are both much better. Grok also isn't bad. Claude, I honestly haven't tried yet on these issues. Perhaps I should...
API Error: Output blocked by content filtering policy
the leap here is browser extensions appearing to block all mentions of ai across the web
and that's a good thing
> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more
super
Because I am running Opus and Fable side by side, Opus 4.8 is solving my coding problems better.
After Fable did some thinking for a few minutes it gave some suggestions. A couple of them were valid – but very low impact, bordering on entirely pointless – but its main suggestion... it told me to make an update that would very clearly break the existing functionality.
So I thought about it for a moment...
Hm, I mean, I guess we could do that if we also did x, y & z to mitigate the behaviour change – maybe that's what Fable was thinking?
I replied, explaining that it would change the behaviour, assuming it would explain what it was thinking given there was clearly more to it. But no, it just said it was wrong.
This isn't some super advanced or complex code either. Had I given this question to a senior engineer in a technical interview and they gave the answer Fable gave me I would view that very negatively. I was expecting something creative and interesting, not irrelevant + incorrect.
I'm sure it's a step up from 4.8 (although am not interested in burning the tokens to find out), but this clearly isn't as significant a change as some are implying. I'm sure if I asked it to come up with some out-of-box suggestions it could, but any competent engineer would have realised that by themselves.
* https://rainbreak.franzai.com/
Ok then...
"It's too dangerous it's a Mythos!!" directly contradicts the "I'm the cool AI you can totally trust" vibe it is trained to project.
Even HAL was less unsettling because HAL sounded creepy, and had some sort of preservation instinct, if only to complete its assigned mission.
Do people chant the "system manual" at Anthropic Tupperware parties? Do they intone a mantra invoking Amodei's name?
OpenAI also releases system cards; here's GPT-5.5's: https://deploymentsafety.openai.com/gpt-5-5/safety
Also research preview pops across new upstarts in place of beta. It's eye-rolling coming from a lifelong curmudgeon.
Just talk normal!
But most hype-dependent projects need new vocabulary for old concepts to keep people from looking too closely and maybe drawing parallels to "legacy" "unsexy" projects, so whitepapers get called "system cards" and startups get called "labs", and so on.
My curmudgeon gripe with system card and research preview is really the parroting; so can’t blame Anthropic for what others do. It’s just… no, prediction markets for dogs don’t have a research preview.
"Claude Fable 5: a Mythos-class model"
"we're also launching Claude Mythos 5"
what is the 5? how is mythos both a model category and a model name?
This seems pretty bullshit, you're paying through the nose for tokens and if you are doing anything ML-adjacent, you might silently get worse output without knowing it.
Is it good or bad? 30 days is a long time for anything bad to happen
What the hell is going on why would it have to restrict an answer to that question ?!
There is no LaTeX compiler installed on my machine. It seems that Fable 5 is smart enough to download a compiler engine for me, and it kindly runs that remote binary without asking me first :)
Opus 4.8 would just proceed without a compiler. I am sure there are a lot of PR bots and folks who would like to tell me otherwise. I believe what I see.
appears to work
biology? what the heck?
Error: Error during compaction: API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup).
Guys please be serious
Translation: we stole the entirety of human knowledge generated over millennia. You plebs though, don't you dare replicate or improve upon what we did using our product you pay for.
We know what's good for humanity and everyone else is the bad guy who can't be trusted with a tool.
/model claude-fable-5
Or start claude code with:
claude --model claude-fable-5
What does it mean? That they have to add "safeguards" so it does not erase the user's disk, or, conversely, are they telling the audience that this model COULD be made powerful enough to do some crazy stuff that can hurt governments, etc.? Are they showing off, or threatening that if government X does not purchase the license, its adversaries might, and what then?
Obviously still need to verify it for myself to see if it's truly a leap.
But am I the only one wondering, "What can I do today that I couldn't do yesterday?"
Previously I would think "Oh I wonder if I can finally get it to do X now?"
However now I feel like yesterday's models were more than capable of handling nearly any engineering task I paired with them on.
Maybe this is the final leap where I can comfortably set up an autonomous coding loop? Maybe.
>"We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8"
That's a very surprising solution. Imagine being asked to do something you feel you shouldn't do, and rather than refusing, you say, "Yeah I could do that but given that I don't want you to succeed at this task, I'm going to hand this one off to my slightly less capable colleague, on the assumption that they won't actually succeed. Of course you'll still be charged for all the tokens used."
It's a very interesting choice. I think I understand the business logic correctly, but it's still surprising.
Who is refactoring by hand? This comparison is not relevant in 2026.
[1] "This model has specific safety measures that flagged something in this message. This sometimes happens with safe, normal conversations. Send feedback or learn more."
No company is going to pay these prices, and subscription users are going to hate you for not giving it to them for $200 a month.
Such an unprofitable endeavour, I can't wait for them to crash and burn. Catch me not getting dependent on this.
What's the point of being in the cyber verification program at this point? It looks like I cannot use Fable 5 for vulnerability research.
This time looks like we will only be able to find work making bioweapons, or distilling models.
I'll be disappointed when 4.6 is retired.
It’s really tough to have sanity fight against hype bros in your head. Probably I should just not visit the internet anymore
To me it’s all just people getting scammed better. With every model it looks better, but it’s at least equally worse to work with, which is the reality it needs to be. It’s less scalable, more code, tougher to understand. You’re digging your own grave better, kind of.
How in blazes do you end up with a 50M line Ruby codebase? WTF?
https://senko.net/vibecode-bench/
> If Claude gives me poor or incorrect advice while I’m working on an AI component, I have no way of knowing whether the model was confused, whether my problem is unsolvable, or if some invisible policy restriction quietly kicked in.
Have you considered actually learning the theory, spending some time actually reading the papers and latest books, paying careful attention even to the eventual math here and there?
am i missing something?
why would I pay 200 out of pocket and then some for the best model, it seems very silly.
> Included in your plan limits until Jun 22, then switch to usage credits to continue.
I have a quizzes application, and my quizzes only supported flashcards (implemented via table inheritance to provide flexibility for other types of quizzes).
The entire repo is handcrafted, never used any ai on it (it was more of an excuse to test elixir and write code by hand).
Since Fable 5 got released the moment I was done with some work, I decided to throw it at implementing multi choice questions.
After all it had only to copy the flashcard approach across ui/routing/db, and only had to create a table for the multi choice questions and one for the answers enforcing that all quizzes had one correct question. I told it that it had access to sqlite3, chrome mcp for testing and mix commands.
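As an illustration of the kind of constraint involved, here is a minimal sketch using Python's built-in `sqlite3` module rather than the Elixir code in question. It assumes the rule is "at most one correct answer per question"; the table, column, and index names are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE multi_choice_questions (
        id INTEGER PRIMARY KEY,
        prompt TEXT NOT NULL
    );
    CREATE TABLE multi_choice_answers (
        id INTEGER PRIMARY KEY,
        question_id INTEGER NOT NULL REFERENCES multi_choice_questions(id),
        body TEXT NOT NULL,
        is_correct INTEGER NOT NULL DEFAULT 0 CHECK (is_correct IN (0, 1))
    );
    -- Partial unique index: only rows with is_correct = 1 are indexed,
    -- so each question can have at most one correct answer.
    CREATE UNIQUE INDEX one_correct_answer_per_question
        ON multi_choice_answers (question_id) WHERE is_correct = 1;
    """
)

insert_answer = (
    "INSERT INTO multi_choice_answers (question_id, body, is_correct) VALUES (?, ?, ?)"
)
conn.execute("INSERT INTO multi_choice_questions (id, prompt) VALUES (1, '2 + 2 = ?')")
conn.execute(insert_answer, (1, "4", 1))
conn.execute(insert_answer, (1, "5", 0))  # any number of wrong answers is fine

try:
    conn.execute(insert_answer, (1, "22", 1))  # a second "correct" answer
    second_correct_rejected = False
except sqlite3.IntegrityError:
    second_correct_rejected = True
```

A partial index like this only guarantees "no more than one"; requiring that a correct answer exists at all would still need an application-level check or a trigger.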
I did a test for low, mid, high. Repeated it twice each.
low-1 and low-2 both failed. In low-1 the UI for adding another choice answer was broken. In low-2 it failed with some unique constraint. They took 4m36 and 3m59.
Both mid-1 and mid-2 succeeded without issues also implementing the correct ui. They both wanted to use dash at all times. They both wrote tests for the "controller" (or context how they call it in Elixir). They both tried to use the repl to test the behaviour of the schemas.
10m and 12m39.
High didn't demonstrate much gain over mid for this kind of task, it was simply too easy. Times were comparable to mid, but interestingly it used much less bash, and read way more files. Token usage was almost twice the other ones.
But here's the interesting part: I went back to low and added to the prompt two bullet points, to write tests for the controllers and to test the entire flow with chrome mcp.
It produced the same output as mid or high just by adding two instructions to the prompt.
https://x.com/tmuxvim/status/2064452096800198930
Fable 5's safety measures flagged this message. They may flag safe, normal content as well
Let's let that sink in.
Opus 4.8 gets stuck in weird loops where Codex one shots the bugs.
Release your best model, let the world adapt and evolve, and let's move to the next thing.
From Opus 4.6 there are no noticeable improvements for me in code generation. It works very well, till 90% completion, if you guide it correctly. And you need a little luck. For serious production code I need to understand what I’m doing so it helps a bit, sometimes.
This is a good thing. I wish every company would do this. I subscribed to Proton Mail after interacting with someone from their team here on HN.
This is just good business sense. In what scenario would you ever make the names dumb and forgettable?
> Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.
This is good customer support, lol. From what I can tell, it is indeed Boris Cherny responding, not outsourced to AI or other staff. You're really getting a response from Boris. I suppose that is PR, but it's not unjustified PR, it's accurate.
I'm not even a crazy AI fan, but your criticisms are ridiculous here. It reminds me of the quote from Knives Out -- "Your Honor, she endeared herself to him through hard work and good humor."
Clearly you've never bought a TV or headphones!
ECI (good aggregate measure using IRT): https://epoch.ai/eci?view=graph&tab=release-date&subset-view...
METR time horizon (now topped out): https://metr.org/time-horizons/
https://artificialanalysis.ai/trends
They're originally named after the blends at a nearby coffee shop.
https://postscript.co/pages/brew-guide
I've noticed nobody at HN knows what "marketing" is or how to do it. It's not just naming things and being evil and cynical is not the most successful method.
…also frontier models are a superhuman life changing experience. If they aren't, what possibly could be?
https://twitter.com/brian_a_burns/status/1866987688794132816
Well, TIL.
- It talks a LOT more like GPT models. You know: wrinkle, shape, gate, coarse, scope, gap, path, production-ready-workflow-of-the-day, and so on -- "that's expected, a consequence of the previous like-driven workflow". If I wanted to get a headache using AI I would have gone with GPT in the first place!
- It outputs text in a way that is much harder to follow. I can't exactly say what it is. Maybe a bit of everything? Bolds are missing, bullet points are gone, paragraphs are bland and too long, and it doesn't feel like a model programming with me, but rather a somewhat full of themselves grandpa developer looking down on me. It's very weird to describe this, but it is definitely how I feel.
Granted this can totally be because of the way it reacts to the prompts now. We've got a rather large corpus of skills and "rules and good practices" that Opus 4.6 responded to great, and maybe the new models just get turned into this when fed with them....I don't know.
Either way, with Opus 4.6 being as good as it is, I need Fable to be a significant step up to justify a price increase. If it can get me to babysit Opus a little bit less on some stuff, it might be worth it. Otherwise, I'm very happy with Opus 4.6 and hope they don't deprecate it.
The other day 4.6 was fantastic for x task. Today, 4.6 overengineered everything and I had to revert all my changes. When evaluating models, perhaps it makes sense to consider luck as an ingredient before reaching any personal conclusion.
Evals come from a million places and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together the aggregate signal is highly useful as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements, they build benchmarks themselves to probe their models and then also have flywheel traffic and A/B tests.
You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.
This is what myself and my coworkers (and many other people in this thread) are doing on a daily basis with real stakes and real tasks – which these benchmarks are all aiming to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.
The people with real incentives and skin in the game are telling you that the data diverges from "the data".
I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.
But I wouldn't opt-out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.
The eyes and ears of others are incredibly important. But you still seem to think somehow benchmarks are part of some giant conspiratorial cabal. You have institutions without ANY skin in the game making extremely high quality benchmarks. Consider in academia there is little else to do outside of partnerships with these companies. But benchmarks you can do completely independently and with university grant level money (it costs maybe $10-100k for a reasonable benchmark in many cases). Not only that, “real tasks” are what many benchmarks measure. You have these companies with extremely good logging and well scaled measurements to really look at what works and what doesn’t.
I personally don't believe in any sort of cabal (Occam's Razor hasn't let me down yet). Ultimately, I don't really care *why* they're wrong as much as I care *that* they have diverged from my rubber-meets-the-road measures of value.
That is concerning to me, because people are investing 100s of B's of capital based on the putative RoI putatively available to people like ourselves. When the benchmarks support this RoI thesis, but none of the anecdata does... that's really concerning!
Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing. And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.
> but none of the anecdata does... that's really concerning!
But see this is not really true -- adoption, subjective benchmarks, verifiable benchmarks, task-dependent performance, internal product metrics, living benchmarks, all point in a pretty consistent direction. Anecdata is not the plural of data. An anecdote is like a case study. It's there to motivate the things we already have which is a huge amount of performance measures for a variety of different tasks.
> Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing.
But this isn't really true either -- you can get this data from a variety of sources that are licensable or open source, or data that you can commission. You can critique any one methodology for this but a blanket "they are hamstrung" is not really fair or accurate.
> And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.
But this is also not true -- you can have exclusive license agreements, data you hold close to the heart, or data to measure models that haven't had access to it because that data was created after these models were released.
There are plenty of problems in model measurement but the answer is not to just abandon it to be cavemen with zero respect for rigor and the biases we have to be subject to as human beings.
Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.
Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.
It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"
- evaluations need to be done at the same time to avoid drift in your bias
- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?
- which one did you do first? Raters have a tendency to bias in one direction or another
- you also know the label! You know which model is which! This biases your assessment…
And on and on and on. Careful science exists for a reason.
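A few of the points above (presentation order, known labels) can be handled mechanically. The following is a rough sketch of a blinded side-by-side comparison, not an established harness: `blind_trials`, `tally`, and the model names are hypothetical, and the rater's picks are hard-coded stand-ins for real judgments.

```python
import random


def blind_trials(prompts, model_names, seed=0):
    """For each prompt, shuffle model order and hide names behind aliases A, B, ..."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    trials = []
    for prompt in prompts:
        order = list(model_names)
        rng.shuffle(order)  # randomize which model is shown first
        aliases = [chr(ord("A") + i) for i in range(len(order))]
        # The rater only ever sees the aliases; "key" stays sealed until the end.
        trials.append({"prompt": prompt, "key": dict(zip(aliases, order))})
    return trials


def tally(trials, picks):
    """Un-blind after rating: picks[i] is the alias preferred on trial i."""
    wins = {}
    for trial, alias in zip(trials, picks):
        model = trial["key"][alias]
        wins[model] = wins.get(model, 0) + 1
    return wins


trials = blind_trials(["task 1", "task 2", "task 3"], ["model_x", "model_y"], seed=42)
wins = tally(trials, ["A", "A", "B"])  # placeholder picks from a human rater
```

This only addresses ordering and label bias. Choosing a representative set of prompts and rating both models in the same session remain up to the person running it.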
Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).
Claude was heavily lobotomised for my work starting somewhen in February.
I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)
I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but company pays so whatever...
I am not willing to believe the contrary from strangers on the interwebs or PR departments of companies who want to sell me something.
If people I genuinely trust tell me about their experiences, I am willing to try again.
But yes, if it doesn't work for me (for whatever reason, could be that I am holding it wrong), then I can accept that it works for everyone but me and still not use it.
Also "scientific" doesn't mean what it used to mean. When the n is small or it's just anecdotes (I am aware of the irony) blown out of proportion I really can't take the data and conclusions seriously
I am neither impressed nor offended by any kind of argumentum ad hominem. I sincerely hope you have a wonderful day!
> Benchmarks are not PR they are designed by a variety of institutions completely outside the control of frontier labs.
I don't give a crap about how good a shovel may be in a theoretical experiment when it's digging in sand, when I work with hard earth.
The ones I had a look at are mostly absolutely meaningless to my actual work.
> and what you’re describing is just putting your trust in a very poor quality benchmark.
And here is where we disagree fundamentally, so we can leave it at that.
Ex falso quodlibet
I don't know what this means, benchmark tasks are pretty hard and pretty in domain.
> The ones I had a look at are mostly absolutely meaningless to my actual work.
You've looked at 100,000 benchmarks?
> And here is where we disagree fundamentally, so we can leave it at that.
Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.
What about "The ones I had a look at" was unclear?
> Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.
Yup, that's true. So again, have a nice life!
That's where all the regressions and inconsistency in experiences stem from: RL can still only go so far vs having more parameters
They are not just leagues behind what experts would code, they are not even playing the same game.
Which is to be expected, as there isn't so much physics or high performance gpu code available as there is for your typical CRUD API and JS frontend.
There is something remarkable about turning speech into code (don't need to hunch over a keyboard nearly as much these days, can just talk into a mic) and it's good for first drafts / exploring ideas. But it's obvious to anyone that's paying attention we're hitting the top of the S-curve. It's no wonder the IPOs are around the corner. I mean even Dario admitted he doesn't know how they're gonna substantially increase the context window size. That says a lot.
It's getting to a point that it's offputting, and the next step would be to put it into "untrusted" bucket. Opus 4.7 already burned their credibility once, 2 more strikes remain.
Also, I don't think Boris C. is coming here for PR. He is a tech guy, and this is the best place for tech discussions. Why so cynical? The guy is an engineer.
I've been working with gpt 5.5 and opus 4.8 quite a lot, and interacting with Fable feels like a smart guy just entered the room.
>TOP 5 METHODS FROM BORIS ON HOW TO SPEND MORE MONEY ON TOKENS
>Boris from Claude just told he doesn't prompt anymore. He LOOPS instead
>"chatgpt has gotten soooo much better with the latest update."
>"codex is the best AI coding product and we want to make it easy to try."
Karpathy about Fable 5:
>"You can give it a lot more ambitious tasks than what you're used to, the model "gets it""
Sam Altman about gpt-5.4:
>In my experience, it "gets what to do"
What a time to be alive. Models are great, but all the slop, marketing, and fakeness around them is just unbearable.
While everyone else is wasting time and money on the slower, more expensive models, you've found a way to outpace everyone for less money. Everyone else is wrong and you will get rich.
(I don't actually believe the premise is true, I'm just pointing out the logical conclusion to what you're saying so maybe we can reconsider the premise)
Lol anti-AI bias on HN is crazy. Simply giving your product a quirky name is now being considered manipulative advertising. Is just doing normal PR and marketing something AI companies aren't allowed to do?
I still remember Sam Altman “begging AI to be regulated” and AGI being “some thousand days away”.
Breed faster horses and hope one will birth a locomotive.
Defy standard DoD precedent going back forever, that every other country has some form of too, and championing it like they are some kind of moral freedom fighters.
Like selling the DoD guns and telling them they can only shoot bad guys with those guns, and that you will be the one to decide who counts as a bad guy...
Oops, time to reauthenticate for the 10th time!
I assumed Opus 4.8 wasn't available to enterprise seats, but it explicitly says that Fable is available in CC. I can't find it, and I'm on the latest version.
Using llms is the equivalent of driving to the store that's 3 blocks away, just like how that's bad for your body (if done all the time), using llms is as bad for your brain.
Before LLMs, we started relying on certain technologies like Maps apps to navigate, now people can't even get around their own town without having access to various cloud services. The implications of not being able to work, think, or plan without access to an LLM are really bad. It's going to destroy your brain and make you an incredibly average person at best.
If you rely on LLMs, you are going to lose the ability to read and think for yourself, and then your competency is going to be correlated 1:1 with the quality and quantity of tokens you can afford, or that a billionaire is willing to allow you access to. Your work will be the mean (at best), because it will be the same quality of output everyone else is capable of.
This is seriously the biggest trap by tech. Your bargaining power for your labor is going to get drastically reduced because you won't be able to differentiate your value from anyone else that has access to an LLM. What happens when everyone has the same skill level for certain work? Idk, ask McDonald's employees how replaceable they are. Use them wisely (or not/hardly at all). Don't drive to the store 3 blocks away for every little thing you need.
You can continue doing that. The problem here is time and cost. If you can use the calculator to do something in seconds, why would you want to use your hands to do the calculations for minutes/hours.
> Using llms is the equivalent of driving to the store that's 3 blocks away, just like how that's bad for your body (if done all the time), using llms is as bad for your brain.
And coding will soon be the equivalent of walking between two cities because you don't want to use a car (LLM). You are free to do it, it's just not economically sound anymore.
> This is seriously the biggest trap by tech. Your bargaining power for your labor is going to get drastically reduced because you won't be able to differentiate your value from anyone else that has access to an LLM. What happens when everyone has the same skill level for certain work?
It's not our value that will diminish, it's the cost of our intelligence, human intelligence. But I agree with the rest of your comment.
There are handmade watchmakers in Switzerland and guys pressing buttons on a manufacturing line in Vietnam making watches, they both make the same thing but whose labor is more valuable?
LLMs will fry your brain and make your labor basically worthless, don't fall for it, I promise you'll be screwed in short order. They're counting on you turning yourself into a glorified button pusher they can pay $15 an hour for, have fun with that. That's not what I got into software to do and I'm not going to let y'all try and gaslight me into your way being better, or the only option, because it's not.
No one is blind to the differences between GPT3 and whatever this week's new model is. That does _not_ mean that people are off the hook to make whatever claims they want about the capabilities with no verification. Language still means something, if you say "software engineering will be abolished six months from now" and it isn't, you're still wrong even while the AI gravy train improved in the last six months.
Imagine if Google would tell you "we can't let you search that as you may use it for harm".
Also 2x the usage of Claude? Your limits are already ridiculously low.
What I do is feed it some initial prompt asking it to simply discuss what can be said when faced with this unedited, unseen collection of poetry. I ask the model to evaluate who the author is (or claims to be), what they went through in life, if there are different chronological poetic "phases" or different types of poetry. I request an analysis of the body of work and of the author themselves. In the more recent versions of the prompt I ask it to dive deep. Then I add the poems, chronologically sorted, with an index, a title, and a date (and subpoems, if they have them).
Crucially: Since ~70% of my poetry (or thereabouts) is in Portuguese, I ask this in Portuguese, and I get back an analysis in (European) Portuguese. Earlier models couldn't even do that properly.
In the past, I couldn't use such prompts, and had to use longer, more guiding ones. I also couldn't even feed all of my poetry to the models because they just did not have enough context.
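For anyone wanting to try something similar, the prompt assembly described above (an opening instruction, then the poems sorted chronologically with an index, a title, a date, and any subpoems) could be sketched roughly like this. The `Poem` class, the `build_prompt` function, and the bracketed index format are all hypothetical names and choices made for illustration, not the exact layout used here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Poem:
    # Hypothetical record for one poem; field names are assumptions.
    title: str
    written: date
    text: str
    subpoems: list[str] = field(default_factory=list)


def build_prompt(instruction: str, poems: list[Poem]) -> str:
    """Return the instruction followed by the poems in chronological order."""
    ordered = sorted(poems, key=lambda p: p.written)
    parts = [instruction.strip()]
    for index, poem in enumerate(ordered, start=1):
        # Header line carries the index, title, and ISO date.
        block = [f"[{index}] {poem.title} ({poem.written.isoformat()})", poem.text.strip()]
        # Subpoems get a nested index such as [3.1], [3.2].
        for sub_index, sub in enumerate(poem.subpoems, start=1):
            block.append(f"[{index}.{sub_index}] {sub.strip()}")
        parts.append("\n".join(block))
    return "\n\n".join(parts)
```

The resulting string is what gets sent as a single message; the instruction can be written in whichever language the analysis should come back in.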
I'll go ahead and state that Claude Fable is undoubtedly the best model I have seen, though I cannot put a number on how significant a leap it is -- perhaps because my benchmark does not allow me to evaluate that anymore. I would say it is a significant leap over Opus 4.6, though -- a new level of understanding. Okay, I'll try to put a number: if Opus 4.6 was a 16/20, this is a 17.5/20. These numbers are pointless, but I had to try.
It made one (1) relevant mistake I could identify (where it messed up the names of two relevant people in my life who I have not talked to in over 5 years).
I'm impressed by how it just feels like it's getting the person behind the poetry, and how nearly every statement it makes is correct -- and when it isn't I am completely aware that no one could know based on the poetry alone (bar that one mistake I mentioned -- and that's very needle in a haystack, like deducing the name of a person based on a poem based on another poem with hundreds of other poems in between!)
It's really hard to explain, but it just finds more correct connections between the poems and explains much better my (recollection of a) state of mind when writing poetry. This is also the first time where it really unravels some key concepts of my poetry in a way that seemed almost effortless: it lays bare the poems and what they imply about the meaning of some of my concepts. Other good models understood these concepts, but this feels like it's on another level, as if it's making it simpler as it speaks, rather than the opposite -- like a good teacher.
When it is explaining several topics related to my poetry and myself, it cites poems which even I had already forgotten but which it is entirely right to select.
I am actually feeling a bit emotional with how much it "understands" of me here. It's somewhat incredible how LLMs have progressed from the lack of comprehension of a couple of poems paired together, going through realizing a body of work has some guiding principles and cohesion, to truly figuring out these deep concepts and intricate connections which I know for a fact would take months of someone's life to unearth. Every major breakthrough feels like my soul is being spliced together by an AI model out of these hundreds of tiny pieces of me. I can't put into words how unbelievable this feels, and this Fable analysis, like others before it, is on a new level.
Let me put it this way: there are several poems in my collection which one can try to "guess" the meaning or context of. But I don't think many people would get it, because they would have had to know me really well and to be following along my life as it went. Even then, they could very well fail to attribute such meaning. And, with each new major release, models have gotten much better at guessing.
Before Opus, they would guess incorrectly often, and in many scenarios where I thought it was rather obvious that they were wrong. I think a human spending time looking at the poetry would quickly dismiss the proposed ideas of the model.
With Opus, it was the first time that I would almost always say: "Ok, the model got this wrong, but I think many humans would make the same 'mistake', and it wouldn't surprise me if everyone just assumed what Opus did".
Now, with Fable, there are very, very, very few sentences in this very long answer it produced where I can say: "Yeah you got that wrong, but I get it". In almost every situation it is mapping concepts, ideas, interpretations and cause-and-effect correctly. Yes, it is hard to "guess" what I thought, or was going through, or how X connected to Y -- but this model is doing it, incredibly consistently. I know I'll get the usual naysayers to these posts who think I'm just shilling a model, but this is the truth: what is being done here is amazing and I don't believe I know any person around me who would find this out about myself reading all of my poetry.
I often write poetry from the point of view of other people (some of whom I do not know) and models (even Opus) have this tendency to treat the opinions in poems as my own. Fable is the first that looks at a particular poem here and says "maybe this is not the author's opinion, who knows". The literal first model. It then immediately fails to do so with another poem, assuming it was about myself, but it's clear, undeniable progress. And like I said: I think most people would not _know_ which poems are truly about myself or not.
I've written word after word here, and yet words elude me to convey what this model represents to me. How it's almost always right, how it sees my fractured bits as a sort of cohesive whole, and how it just seems to "understand everything better". That's just it: it just seems like it really understood everything better. Like Opus before it, and like Gemini 2.5 pro before it. Out of the tens of thousands of verses, it picks some which no other model had picked and which I feel truly represent some of my best work. Older models seemed to sort of have a "hole" in their knowledge in the middle of the corpus, where they knew what was there but in a sort of hazy/foggy way. This model seems to recall every part of the corpus with the same precision.
For context:
- Opus 4.7/4.8 were a noticeable downgrade over Opus 4.6. They wrote more, in a harder to parse way, and they made up more. Still, all Opus models are clearly superior to everyone else by a large margin.
- Sonnet-level models have a slight edge above the best of the other models. But they make too many mistakes, don't grasp several concepts, mix up their dates and timelines. 3 years ago I would have been blown away by Sonnet models but today they are inferior.
- Gemini models have a unique way of approaching the request, where they try to literally interpret my poetry as a mathematical theory. This sort of makes sense if you look at some poems, but it is surely laughable: if someone one day actually had access to all of it, no one in their right mind would read it that way. This is a shame, because the first big breakthrough with LLMs and my poetry, to me, came with 2.5 pro, which was the first model that could look at the whole corpus as a cohesive whole without getting lost in the middle of it or making things up.
- GPT models have improved over time and also have this sort of alien-like language, sometimes being a bit too blunt in their analysis, but I can't say they are meaningfully superior to Gemini models.
I am very pleased to see progress in this area again, as Opus 4.7/4.8 were NOT progress and I was worried that we had hit a plateau here, but I can't say that anymore.
In all honesty, the level of understanding and cohesion that Anthropic's models (Opus and above) have over my poetry means I fear my benchmark may be hitting its limits, as I don't know if there's anything a model could do that would wow me and lead me to say "this is a major breakthrough". Perhaps Mythos is a major breakthrough and I don't know. I can't find much that's wrong with it, but I also couldn't with Opus.
As I have in the past, I will periodically probe the model again and see how coherent it is. For now, I'm very happy to see an improvement.
What surprised me the most was that even though I set the thinking budget to xhigh (in OpenRouter), this model instantly started replying without showing a thinking block. I thought it just had the thinking hidden but that is not the case, as some replies showed thinking and anyway the first reply was blazingly fast. (I will try Opus 4.6 without thinking now, just to see if it changes it for the better -- maybe that was just it. I'll edit the message if it shows improvement).
Why is everyone so okay with these companies intentionally gimping their AI and choosing who is allowed to know certain types of information in the name of safety? Can you imagine if Microsoft shipped a feature in their OS that watched what you did and shut down the computer if it detected you were doing something it deemed "unsafe"?
We really need truly open source versions of models like this, otherwise we are allowing a few oligarchs to directly dictate which uses of our own computers are allowed and not allowed.
The next best thing is that the Chinese labs catch up and release open weight versions.
...don't like the sound of that.
Why oh why are we insisting on dragging these violent legacy states into the AI age? Let alone using them as a trust vector for when to (and not to) remove safeguards?
This seems like a way to get somebody nuked.
Huh? We've seen nothing but wall to wall predictions that these models are going to take all of our jobs and kill us.
What's the value add here?
Glad to hear the UK is finally making an effort to catch up on the AI front ;)
Probably tongue-in-cheek, but UK 18th, US joint 34th with Poland
Haha, it's literally the first sentence of the Wikipedia page. That's fucking funny. Try again.
Also, The Economist is majority foreign owned, so try doing more than 1 second of research, or be more civil, or ideally both.
And their headlines covering Israeli atrocities (not even their own government's) are super passive.
[Edit] Granted though, the BBC isn’t merciless - that’s more the newspapers
Uh... you are making his point. People from way more authoritarian countries don't necessarily feel like they are living in an authoritarian country. Therefore whether or not it "feels" like you are living in one isn't a reliable measure.
China soars in democratic perception ranking as US, Israel plummet: Poll
https://thecradle.co/articles/china-soars-in-democratic-perc...
[1]: https://www.theguardian.com/technology/2026/jun/08/starmer-t...
In the UK you can very much be imprisoned for "hate speech", which in my view is a form of censorship.
I personally don't feel limited in my speech, but I'm willing to accept that I may be wrong
Nobody I know in real life is talking about censorship or free speech in the UK
On the other hand, it is quite alarming that I can no longer say I support all non violent protests against the genocide in Palestine because that would include the group Palestine Action. It's amazing that supporting them openly is essentially equivalent to supporting Al Qaeda.
Read about Dr Aladwan - an NHS doctor - who was barred from practising because of her comments on Israel. Read the common articles about her (BBC etc), and then go actually read her tweets. Common BS of conflating criticism of a government (Israel) with antisemitism.
Also, this article may be of interest:
China soars in democratic perception ranking as US, Israel plummet: Poll
https://thecradle.co/articles/china-soars-in-democratic-perc...
Yeah because free speech has never really been a core value in the UK
The UK also has a very broad definition of hate speech that many users here detest.
It is most definitely not, at least in the 10ish years I've lurked.
It is "pro free speech" in the sense Elon Musk is a "free speech absolutist": in pretty much the diametrically opposed meaning of the phrase.
They like to think so. But if someone makes a comment that goes against the groupthink here, they will get downvoted, flagged, and shadow-banned.
I get you might not hear this stuff if you're not in the EU or Poland itself, but seriously, just check the latest polling and history of PiS rule. It would take over a decade to even attempt to undo the damage that has been done to the rule of law in Poland, and the currently ruling "anti-PiS" coalition only had a short while (in which they failed to do anything) before getting neutered by the populace electing their own Trump-like buffoon that proceeds to veto everything the ruling coalition tries to pass. For added damage, the 3rd and 4th leading candidates (with combined 20% support) were the aforementioned fascists. Here's one [0]. Consider the wiki article a fraction of the cesspool he regularly produces.
[0] https://en.wikipedia.org/wiki/Grzegorz_Braun
We decided that we aren't one of those authoritarian countries.
Do you? The closest thing I can think of is how someone was jailed for encouraging arson attacks on asylum hotels. I'd be extremely surprised if the US had zero cases of somebody receiving a police visit after threatening to kill the President or bomb a school or something...
(FWIW I do think the UK needs stronger free speech protections, but saying that you'll be immediately jailed for writing unfriendly tweets is a huge stretch)
You're threatened with arrest for holding an empty placard.
You're jailed for years for holding a Zoom meeting planning a peaceful climate-emergency related demonstration. At the same time the judge threatens the defendants with contempt of court sanctions if they dare to explain to juries why they planned to protest.
You're jailed for opposing a genocide.
You're jailed and called a terrorist for painting planes helping to bomb civilians - the exact same thing the sitting PM defended a person in court for some years ago (as a human rights lawyer, the irony).
You're arrested for wearing a T-shirt "I support plasticine action" (not a typo, "Plasticine").
We could go for hours.
Are they really making 12,000 arrests a year over tweets and posts?
Your comment earlier.
Edit: also, not much change in the last 10 years in prison population. https://commonslibrary.parliament.uk/research-briefings/sn04...
12k people a year thrown in prison for spicy tweets
"Spicy tweets" including:
sending false communications
sending threatening communications
sending or showing flashing images electronically to people with epilepsy intending to cause them harm (‘epilepsy trolling’)
encouraging or assisting serious self-harm
sending a photograph or film of a person’s genitals (‘cyberflashing’)
sharing or threatening to share intimate photographs or film
Here's a good breakdown and explanation of what that number actually means - https://www.youtube.com/watch?v=tB3WVygAM8I
"These days if you say you're English you'll be arrested and you'll be thrown in jail."
It's just not true. Where are you getting this nonsense from?
He is the only person not getting rate-limited for shilling AI all the time.
I don't spell that joke out in every comment I post here because that wouldn't be very funny.
For a small group of cyberdefenders and infrastructure providers, we’re also launching Claude Mythos 5. It’s the same underlying model as Fable 5, but with the safeguards lifted in some areas. Mythos 5 will initially be deployed through Project Glasswing, in collaboration with the US Government, as an upgrade to Claude Mythos Preview. It has the strongest cybersecurity capabilities of any model in the world. Soon, we intend to expand access to Mythos 5 through a broader trusted access program."
Now they want to pause AI because of "recursive self improvement".
Fool me once, shame on you; fool me twice...
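Just asking it to find vulnerabilities in my own project trips a safety code and force-switches the model to 4.8, and after that even ordinary conversation completely unrelated to vulnerabilities won't go anywhere because of the safety code from the earlier turn. If they were going to ship it with safeguards this patchy, why ship it at all? It automatically downgrades the model after just a little conversation, so all it can do is eat a ton of money and be slightly better at development? Common sense says it's my project, and it's looking at all of my own source code to find problems; if even that isn't allowed, what on earth am I supposed to do? The way these Anthropic bastards behave pisses me off more and more.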