Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
arialdomartini [3 hidden]5 mins ago
This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
The council of agents got to a very good result, consuming about 12$, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming 0.3$ and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- the reason why we humans do job separation is because we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts, good security experts. Apparently, that's not a problem for an LLM. So job separation is probably not as necessary a pattern as it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of processes, to the point that processes turn into bureaucracy. In IT companies, many problems are at the interface between groups, because of the low-bandwidth communication and inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself way better and cheaper than a council of agents, which inevitably faces the same communication challenges as a society of people.
mikkupikku [3 hidden]5 mins ago
Probably the same reason it takes a team of developers and managers 6 months to write what one or two developers can do on their own in one week. The overhead caused by constant meetings and negotiations is massive.
nvardakas [3 hidden]5 mins ago
This matches what I've seen too. I spent time building multi-step agent pipelines early on and ended up ripping most of it out. A single well-prompted call with good context does 90% of the work. The coordination overhead between agents isn't just a cost problem; it's a debugging nightmare when something goes wrong and you're tracing through 5 agent handoffs.
titanomachy [3 hidden]5 mins ago
If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?
arialdomartini [3 hidden]5 mins ago
Fair point. I could try with a harder problem.
This still does not explain why Claude Code felt the need to use Opus, and why Opus felt the need to burn 12$ on such an easy task. I mean, it's 40 times the cost.
titanomachy [3 hidden]5 mins ago
I'm a bit confused actually, you said you used Claude Code for both examples? Was that a typo, or was it (1) Claude Code instructed to use a hierarchy of agents and (2) Claude Code allowed to do whatever it wants?
moduspol [3 hidden]5 mins ago
To me, such techniques feel like temporary cudgels that may or may not even help and that will be obsolete in 1-6 months.
This is similar to telling Claude Code to write its steps into a separate markdown file, or use separate agents to independently perform many tasks, or some of the other things that were commonly posted about 3-6+ months ago. Now Claude Code does that on its own if necessary, so it's probably a net negative to instruct it separately.
Some prompting techniques seem ageless (e.g. giving it a way to validate its output), but a lot of these feel like temporary scaffolding that I don't see a lot of value in building a workflow around.
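For what it's worth, "giving it a way to validate its output" can be sketched as a small retry loop. This is only an illustration under stated assumptions: `Generate` is a hypothetical stand-in for whatever model call you use, `Validate` for whatever check you have (tests, a type checker, a linter), and it is synchronous purely to keep the sketch short.

```typescript
// Hypothetical stand-in for a model call: receives the prompt plus the
// previous validation error (or null on the first attempt).
type Generate = (prompt: string, feedback: string | null) => string;

// Hypothetical check: returns null when the output passes, otherwise an
// error message that is fed back into the next attempt.
type Validate = (output: string) => string | null;

function generateWithValidation(
  generate: Generate,
  validate: Validate,
  prompt: string,
  maxAttempts: number,
): { output: string; attempts: number; ok: boolean } {
  let feedback: string | null = null;
  let output = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    output = generate(prompt, feedback);
    feedback = validate(output);
    if (feedback === null) {
      // The validator accepted the output, so stop here.
      return { output, attempts: attempt, ok: true };
    }
  }
  // Out of attempts: return the last output and flag it as unvalidated.
  return { output, attempts: maxAttempts, ok: false };
}
```

The point of the shape is that the loop ends on an external check rather than on the model declaring itself done.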
TheMuenster [3 hidden]5 mins ago
Totally agree - the fundamental concept here of automatically improving context control when writing code is absolutely something that will be baked into agents in 6 months. The reason it hasn't yet is mainly because the improvements it makes seem to be very marginal.
You can contrast this to something like reasoning, which offered very large, very clear improvements in fundamental performance, and as a result was tackled very aggressively by all the labs. Or (like you mentioned) todo lists, which gave relatively small gains but were implemented relatively quickly. Automatic context control is just going to take more time to get it right, and the gains will be quite small.
kybernetikos [3 hidden]5 mins ago
There's a lot of cargo culting, but it's inevitable in a situation like this where the truth is model dependent and changing the whole time and people have created companies on the premise they can teach you how to use ai well.
_heimdall [3 hidden]5 mins ago
It's also inevitable given that we still don't even really know how these models work or what they do at inference time.
We know input/output pairs, when using a reasoning model we can see a separate stream of text that is supposedly insight into what the model is "thinking" during inference, and when using multiple agents we see what text they send to each other. That's it.
never_inline [3 hidden]5 mins ago
I think this is just anthropomorphism. Sub agents make sense as a context saving mechanism.
Aider did an "architect-editor" split where architect is just a "programmer" who doesn't bother about formatting the changes as diff, then a weak model converts them into diffs and they got better results with it. This is nothing like human teams though.
TheMuenster [3 hidden]5 mins ago
Absolutely agree with this. The main reason for this improving performance is simply that the context is being better controlled, not that this approach is actually going to yield better results fundamentally.
Some people have turned context control into hallucinated anthropomorphic frameworks (Gas Town being perhaps the best example). If that's how they prefer to mentally model context control, that's fine. But it's not the anthropomorphism that's helping here.
jaredklewis [3 hidden]5 mins ago
> what's the evidence
What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.
In my experience, evidence for the efficacy of software engineering practices falls into two categories:
- the intuitions of developers, based on their experiences.
- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.
Evidence for this LLM pattern is the same. Some developers have an intuition it works better.
codemog [3 hidden]5 mins ago
My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.
jaredklewis [3 hidden]5 mins ago
I’ve read dozens of them and find them unconvincing for the reasons outlined. If you want a more specific critique, link a paper.
I personally like and use tests, formal verification, and so on. But the evidence for these methods is weak.
edit: To be clear, I am not ragging on the researchers. I think it's just kind of an inherently messy field with pretty much endless variables to control for and not a lot of good quantifiable metrics to rely on.
thesz [3 hidden]5 mins ago
You can measure customer facing defects.
Also, lines of code is not completely meaningless metric. What one should measure is the lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you can still have an off-by-one error.
Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.
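A rough sketch of that comparison, assuming density is defined as customer-facing defects per thousand lines of code. The `WorkflowSample` shape and both function names are made up for illustration, and `linesOfCode` would be whichever count you settle on (for instance only the lines the compiler does not verify).

```typescript
// Hypothetical sample shape; the fields are illustrative, not from any study.
interface WorkflowSample {
  label: string;
  customerFacingDefects: number;
  linesOfCode: number;
}

// Customer-facing defects per thousand lines of code (KLOC).
function defectsPerKloc(sample: WorkflowSample): number {
  if (sample.linesOfCode <= 0) {
    throw new RangeError("linesOfCode must be positive");
  }
  return (sample.customerFacingDefects * 1000) / sample.linesOfCode;
}

// Returns the label of the sample with the lower defect density
// (the first one wins a tie).
function lowerDensity(a: WorkflowSample, b: WorkflowSample): string {
  return defectsPerKloc(a) <= defectsPerKloc(b) ? a.label : b.label;
}
```

Normalising matters because raw counts can point the other way: a workflow with fewer total defects can still have the higher density if it shipped much less code.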
jaredklewis [3 hidden]5 mins ago
It's not the worst metric ever, but is the study in question observational or controlling the problem?
My issue with the observational studies is that basically everything is uncontrolled. Maybe the IDEs are causing fewer defects. Or maybe some problems are just harder and more defect prone than others. Maybe some teams are better managed or get clearer specifications and so on. Maybe some organizations are better at recording defects and can't be fairly compared with organizations that just report fewer. The studies don't ever reach the scale where you become confident these things wash out.
If they control the problem, most of those issues are eliminated (though not all, for example the experience and education of the participants still needs to be controlled), but now you are left wondering how well the findings transfer from the toy projects in the experiment into real life.
But finally, it's still not a perfect metric because not all defects are equal, right? What if some tool/process helped you reduce a large number of mostly cosmetic defects, but increases the occurrence of catastrophic defects?
re: LoC, there's some signal here, but it's such a noisy channel, I've never read a study that I thought put it to good use. Happy to have my mind changed if you have a link to one.
codeflo [3 hidden]5 mins ago
> Also, lines of code is not completely meaningless metric.
Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, tech stack. There are many style differences between LLM and human generated code, so that I expect 1000 lines of LLM code do a lot less than 1000 lines of human code, even in the exact same codebase.
jacquesm [3 hidden]5 mins ago
The proper metric is the defect escape rate.
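As a minimal sketch, assuming the usual definition (defects that reached customers divided by all defects found, before and after release); the function and parameter names are hypothetical:

```typescript
// Share of all known defects that were only found after release.
// Returns 0 when no defects were recorded at all.
function defectEscapeRate(
  foundBeforeRelease: number,
  foundAfterRelease: number,
): number {
  if (foundBeforeRelease < 0 || foundAfterRelease < 0) {
    throw new RangeError("defect counts must not be negative");
  }
  const total = foundBeforeRelease + foundAfterRelease;
  return total === 0 ? 0 : foundAfterRelease / total;
}
```

With made-up numbers, 45 defects caught internally and 5 reported by customers gives an escape rate of 5 / 50 = 0.1.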
exidex [3 hidden]5 mins ago
Now you have to count defects
jacquesm [3 hidden]5 mins ago
You have to do that anyway, and in fact you probably were already doing that. If you do not track this then you are leaving a lot on the table.
exidex [3 hidden]5 mins ago
I was more thinking in terms of creating a benchmark which would be optimized during training. For regular projects, I agree, you have to count that anyway.
slopinthebag [3 hidden]5 mins ago
Most developer intuitions are wrong.
See: OOP
vbezhenar [3 hidden]5 mins ago
Intuition is subjective. It's hard to convert subjective experience to objective facts.
tomgp [3 hidden]5 mins ago
That's what science is though
* our intuition/ hunch/ guess is X
* now let's design an experiment which can falsify X
lbreakjai [3 hidden]5 mins ago
The different models is a big one. In my workflow, I've got opus doing the deep thinking, and kimi doing the implementation. It helps manage costs.
Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions. The worker can not edit the plan. The QA or planner can't modify the code. This is something I sometimes catch codex doing, modifying unrelated stuff while working.
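A minimal sketch of that kind of permission split. The role names, the `WRITE_POLICY` table and the `guardedWrite` helper are all hypothetical, and whether QA may touch the plan is an assumption here (the sketch denies it).

```typescript
type AgentRole = "planner" | "worker" | "qa";
type WritableArtifact = "plan" | "code";

// Hypothetical policy: which artifacts each role is allowed to modify.
// The worker cannot edit the plan; the planner and QA cannot edit code.
const WRITE_POLICY: Record<AgentRole, ReadonlySet<WritableArtifact>> = {
  planner: new Set<WritableArtifact>(["plan"]),
  worker: new Set<WritableArtifact>(["code"]),
  qa: new Set<WritableArtifact>(),
};

function canWrite(role: AgentRole, artifact: WritableArtifact): boolean {
  return WRITE_POLICY[role].has(artifact);
}

// Runs the write only if the role is allowed to touch the artifact,
// so a drifting agent fails loudly instead of editing unrelated things.
function guardedWrite(
  role: AgentRole,
  artifact: WritableArtifact,
  apply: () => void,
): void {
  if (!canWrite(role, artifact)) {
    throw new Error(`${role} is not allowed to modify ${artifact}`);
  }
  apply();
}
```

The check lives in the harness, not in the prompt, so it holds regardless of what the model decides to do.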
sigbottle [3 hidden]5 mins ago
I recently had a horrible misalignment issue with a 1 agent loop. I've never done RL research, but this kind of shit was the exact kind of thing I heard about in RL papers - shimming out what should be network tests by echoing "completed" with the 'verification' being grepping for "completed", and then actually going and marking that off as "done" in the plan doc...
Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason. Nothing that seemed basically actively malicious such as the above though. Still, gsdv2 is a 1-agent scaffolding pipeline.
I think the issue is that these 1-agent pipelines are "YOU MUST PLAN IMPLEMENT VERIFY EVERYTHING YOURSELF!" and extremely aggressive language like that. I think that kind of language coerces the agent to do actively malicious hacks, especially if the pipeline itself doesn't see "I am blocked, shifting tasks" as a valid outcome.
1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.
totomz [3 hidden]5 mins ago
I think the splitting makes sense to give more specific prompts and isolated context to different agents. The "architect" does not need to have the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.
ako [3 hidden]5 mins ago
Wouldn’t skills already solve this? A harness can start a new agent with a specific skill if it thinks that makes sense.
jumploops [3 hidden]5 mins ago
After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.
Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.
Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.
What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."
"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.
Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
stavros [3 hidden]5 mins ago
It's not about splitting for quality, it's about cost optimisation (Sonnet implements, which is cheaper). The quality comes with the reviewers.
Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.
indigodaddy [3 hidden]5 mins ago
Is Sonnet good enough for, hey I have this bug (it's easy to describe and identify), now go fix it?
zingar [3 hidden]5 mins ago
Nitpick: I don’t think architect is a good name for this role. It’s more of a technical project kickoff function: these are the things we anticipate we need to do, these are the risks etc.
I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.
Is it useful to tell something “you are an architect?” I doubt it but I don’t have proof apart from getting reasonable results without it.
With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.
est [3 hidden]5 mins ago
> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
There's a 63-page paper with mathematical proof, if you're really into this.
I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.
est [3 hidden]5 mins ago
> proves nothing remotely like the question that was asked
I am not an expert, but by my understanding, the paper proves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation, i.e. you can't always one-shot perfect code.
However, arranging many pipelines of role "observers" may gradually get you there.
Can you explain how this paper is relevant to the comment you replied to?
Havoc [3 hidden]5 mins ago
Yeah always seemed pretty sus to me too.
At the same time I can see a more linear approach doing something similar. Like when I ask for an implementation plan, that is functionally not all that different from an architect agent, even if not wrapped in such a persona.
_heimdall [3 hidden]5 mins ago
Once model providers started releasing "reasoning" models, and later roles and multi-agent systems, it seemed pretty clear to me they are just automating the process of prompt engineering.
They track everything we all do in a chat, then learn the patterns that work and build them in. Rinse and repeat.
Tarq0n [3 hidden]5 mins ago
In machine learning, ensembles of weaker models can outperform a single strong model because they have different distributions of errors. Machine learning models tend to have more pronounced bias in their design than LLMs though.
So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.
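The ensemble effect can be made concrete with a toy calculation. This is a sketch under strong assumptions: a binary right/wrong task, an odd number of voters, every voter with the same accuracy `p`, and fully independent errors. Models that share training data will not satisfy the independence assumption, so treat the number as an upper bound on the idea rather than a prediction.

```typescript
// Binomial coefficient C(n, k), computed iteratively.
function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that a strict majority of n independent voters is correct
// when each voter is correct with probability p.
function majorityVoteAccuracy(n: number, p: number): number {
  if (n <= 0 || n % 2 === 0) {
    throw new RangeError("use a positive odd number of voters to avoid ties");
  }
  let total = 0;
  for (let k = Math.floor(n / 2) + 1; k <= n; k++) {
    total += binomial(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
  }
  return total;
}

// Three voters at 70% each: 3 * 0.7^2 * 0.3 + 0.7^3 = 0.441 + 0.343 = 0.784
const threeVoters = majorityVoteAccuracy(3, 0.7);
```

The gain comes entirely from errors not overlapping; if all three voters make the same mistakes, the majority is no better than one of them.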
dep_b [3 hidden]5 mins ago
“…if you give it good context…” that’s what the architect session is for basically. You throw around ideas and store the direction you want to go.
Then you execute it with a clean context.
Clean context is needed for maximum performance while not remembering implementation dead ends you already discarded
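Roughly, the flow looks like the sketch below. `ModelCall` is a hypothetical stand-in for any chat-style model call (message list in, reply out), and the summarisation prompt is made up; the only point is which messages each phase gets to see.

```typescript
// Hypothetical stand-in for a model call: full message list in, reply out.
type ModelCall = (messages: string[]) => string;

function planThenExecute(
  model: ModelCall,
  ideas: string[],
  task: string,
): { executionContext: string[]; result: string } {
  // Planning session: the context accumulates every idea and every reply,
  // including the dead ends that were later discarded.
  const transcript: string[] = [];
  for (const idea of ideas) {
    transcript.push(idea);
    transcript.push(model(transcript));
  }

  // Distil the whole session into a single stored direction.
  const direction = model([
    ...transcript,
    "Summarise only the direction we agreed on.",
  ]);

  // Execution starts from a clean context: the direction and the task,
  // with none of the planning transcript carried over.
  const executionContext = [direction, task];
  const result = model(executionContext);
  return { executionContext, result };
}
```

Whether the two phases use the same model or different ones is orthogonal; the split is about what ends up in the second context window.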
jwilliams [3 hidden]5 mins ago
If you know what you need, my experience is that a well-formed single-prompt that fits the context gives the best results (and fastest).
If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.
hakanderyal [3 hidden]5 mins ago
One added benefit is it allows you to throw more tokens at the problem. It’s the most impactful benefit even.
Context & how LLMs work requires this.
From my experience no frontier model produces bug free & error free code with the first pass, no matter how much planning you do beforehand.
With 3 tiers, you spend your token & context budget in full in 3 phases. Plan, implement, review.
If the feature is complex, multiple rounds of review, from scratch.
It works.
palmotea [3 hidden]5 mins ago
> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.
fleetfox [3 hidden]5 mins ago
Even for reducing the context size it's probably worth it. If you have to go back and forth on both problem and implementation, even with these new "large" contexts I find quality degrading pretty fast.
hansonkd [3 hidden]5 mins ago
Any of these abstractions are just temporary.
luxcem [3 hidden]5 mins ago
The agent "personalities" and LLM workflow really look like cargo-cult behavior. It looks like it should be better but we don't really have data backing this.
troupo [3 hidden]5 mins ago
> produces better results than just... talking to one strong model in one session?
I think the author admits that it doesn't, doesn't realise it and just goes on:
--- start quote ---
On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet
--- end quote ---
awesome_dude [3 hidden]5 mins ago
I have been using different models for the same role - asking (say) Gemini, then, if I don't like the answer asking Claude, then telling each LLM what the other one said to see where it all ends up
Well I was until the session limit for a week kicked in.
imiric [3 hidden]5 mins ago
Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.
Maybe you should write and share your own article to counter this one.
z3t4 [3 hidden]5 mins ago
Also if something is fun, we prefer to do it that way instead of the boring way.
Then it depends on how many mines you step on, after a while you try to avoid the mines. That's when your productivity goes down radically. If we see something shiny we'll happily run over the minefield again though.
danbruc [3 hidden]5 mins ago
I randomly clicked and scrolled through the source code of Stavrobot - The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security. [1] and that is not great code. I have not used any AI to write code yet but considered trying it out - is this the kind of code I should expect? Or maybe the other way around, has someone an example of some non-trivial code - in size and complexity - written by an AI - without babysitting - and the code being really good?
I would suggest not delegating the LLD (class / interface level design) to the LLM. The clankeren are super bad at it. They treat everything as a disposable script.
Also document some best practices in AGENT.md or whatever it's called in your app.
Eg
* All imports must be added on top of the file, NEVER inside the function.
* Do not swallow exceptions unless the scenario calls for fault tolerance.
* All functions need to have type annotations for parameters and return types.
And so on.
I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.
danbruc [3 hidden]5 mins ago
What actually stood out to me is how bad the functions are, they have no structure. Everything just bunched together, one line after the other, whatever it is, and almost no function calls to provide any structure. And also a ton of logging and error handling mixed in everywhere completely obscuring the actual functionality.
EDIT: My bad, the code eventually calls into dedicated functions from database.ts, so those 200 lines are mostly just validation and error handling. I really just skimmed the code and the amount of it made me assume that it actually implements the functionality somewhere in there.
Example, Agent.ts, line 93, function createManageKnowledgeTool() [1]. I would have expected something like the following and not almost 200 lines of code implementing everything in place. This also uses two stores of some sort - memory and scratchpad - and they are also not abstracted out, upsert and delete deal with both kinds directly.
switch (action)
{
    case "help":
        return handleHelpAction(arguments);
    case "upsert":
        return handleUpsertAction(arguments);
    case "delete":
        return handleDeleteAction(arguments);
    default:
        return handleUnknownAction(arguments);
}
Which reinforces my point that LLMs are really bad at class and module level design.
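To make the suggested shape concrete, here is a self-contained sketch. None of it is the project's actual code: the `KnowledgeStore` interface, the `InMemoryStore` class, the argument shape and the returned strings are all hypothetical. The idea is that "memory" and "scratchpad" would each sit behind the same interface, so the handlers never deal with a specific store directly.

```typescript
// Hypothetical abstraction that any backing store could implement.
interface KnowledgeStore {
  upsert(key: string, value: string): void;
  delete(key: string): boolean;
}

// Toy implementation, only here to make the sketch runnable.
class InMemoryStore implements KnowledgeStore {
  private readonly entries = new Map<string, string>();

  upsert(key: string, value: string): void {
    this.entries.set(key, value);
  }

  delete(key: string): boolean {
    return this.entries.delete(key);
  }
}

interface KnowledgeArgs {
  action: string;
  key?: string;
  value?: string;
}

// The tool body is only the dispatch; validation lives in each handler.
function handleKnowledgeAction(store: KnowledgeStore, args: KnowledgeArgs): string {
  switch (args.action) {
    case "help":
      return "actions: help, upsert, delete";
    case "upsert":
      return handleUpsertAction(store, args);
    case "delete":
      return handleDeleteAction(store, args);
    default:
      return `unknown action: ${args.action}`;
  }
}

function handleUpsertAction(store: KnowledgeStore, args: KnowledgeArgs): string {
  if (args.key === undefined || args.value === undefined) {
    return "upsert requires key and value";
  }
  store.upsert(args.key, args.value);
  return `stored ${args.key}`;
}

function handleDeleteAction(store: KnowledgeStore, args: KnowledgeArgs): string {
  if (args.key === undefined) {
    return "delete requires key";
  }
  return store.delete(args.key) ? `deleted ${args.key}` : `${args.key} not found`;
}
```

Each handler can then be read and tested on its own, which is the structure the original function was missing.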
xenodium [3 hidden]5 mins ago
From my experience, you kinda get what you ask for. If you don't ask for anything specific, it'll write as it sees fit. The more you involve yourself in the loop, the more you can get it to write according to your expectation. Also helps to give it a style guide of sorts that follows your preferred style.
sweaterkokuro [3 hidden]5 mins ago
In my experience it's in all Language Models' nature to maximize token generation. They have been natively incentivized to generate more where possible. So if you don't put down your parameters tightly it will let loose. I usually put hard requirements of efficient code (less is more) and it gets close to how I would implement it. But like the previous comments say, it all depends on how deeply you integrate yourself into the loop.
anthonyrstevens [3 hidden]5 mins ago
>> They have been natively incentivized to generate more where possible
Do you have any evidence of this?
bigfishrunning [3 hidden]5 mins ago
The cloud providers charge per output token, so aren't they then incentivized to generate as many tokens as possible? The business model is the incentive.
0xffff2 [3 hidden]5 mins ago
This is only true in some cases though and not others. With a Claude Pro plan, I'm being billed monthly regardless of token usage so maximizing token count just causes frustration when I hit the rather low usage limits. I've also observed quite the opposite problem when using Github's Copilot, which charges per-prompt. In that world, I have to carefully structure prompts to be bounded in scope, or the agent will start taking shortcuts and half-assing work when it decides the prompt has gone on too long. It's not good at stopping and just saying "I need you to prompt me again so I can charge you for the next chunk of work".
So the summary of the anecdata to me is that the model itself certainly isn't incentivized to do anything in particular here, it's the tooling that's putting its finger on the scale (and different tooling nudges things in different directions).
mbesto [3 hidden]5 mins ago
> and that is not great code
When you say "is not great code" can you elaborate? Does the code work or not?
danbruc [3 hidden]5 mins ago
I don't know, I would assume it works but I would not expect it to be free of bugs. But that is the baseline for code, being correct - up to some bugs - is the absolute minimum requirement, code quality starts from there - is it efficient, is it secure, is it understandable, is it maintainable, ...
stavros [3 hidden]5 mins ago
It works really well, multiple people have been using it for a month or so (including me) and it's flawless. I think "not great" means "not very readable by humans", but it wasn't really meant to be readable.
I don't know if there are underlying bugs, but I haven't hit any, and the architecture (which I do know about) is sane.
FiberBundle [3 hidden]5 mins ago
Pine Town [1], the "whimsical infinite multiplayer canvas of a meadow", also looks like pure slop.
What were you hoping to achieve with this comment?
vidarh [3 hidden]5 mins ago
It's the kind of code you should expect if you don't run a harness that includes review and refactoring stages.
It's by no means the best LLMs can do.
input_sh [3 hidden]5 mins ago
You can make it better by investing a lot of time playing around with the tooling so that it produces something more akin to what you're looking for.
Good luck convincing your boss that this ungodly amount of time spent messing around with your tooling for an immeasurable improvement in your delivery is the time well spent as opposed to using that same amount of time delivering results by hand.
javier123454321 [3 hidden]5 mins ago
You literally have it backwards. It's the bosses that are pulling engineers aside and requiring adoption of a tooling that they're not even sure justifies the increase in productivity versus the cost of setting up the new workflows. At least anecdotally, that's the case.
input_sh [3 hidden]5 mins ago
I don't disagree with that, my claim is that bosses don't know what they're doing. If all of the pre-established quality standards go out the window, that's completely fine with me, I still get paid just the same, but then later on I get to point to that decision and say "I told you so".
Luckily for me, I'm fortunate enough to not have to work in that sort of environment.
TacticalCoder [3 hidden]5 mins ago
> is this the kind of code I should expect?
Sadly yes. But it "works", for some definition of working. We all know it's going to be a maintenance nightmare given the gigantic amount of code and projects now being generated ad infinitum. As someone commented in this thread: it can one-shot an app showing restaurant locations on a map and put a green icon if they're open. But don't expect good code, secure code, performant code and certainly not "maintainable code".
By definition, unless the AIs can maintain that code, nothing is maintainable anymore: the reason being the sheer volume. Humans who could properly review and maintain code (and that's not many) are already outnumbered.
And as more and more become "prompt engineers" and are convinced that there's no need to learn anything anymore besides becoming a prompt engineer, the amount of generated code is only going to grow exponentially.
So to me it is the kind of code you should expect. It's not perfect. But it more or less works. And thankfully it shouldn't get worse with future models.
What we now need is tools, tools and more tools: to help keep these things on track. If we are ever to get some peace of mind about the correctness of this unreviewable generated code, we'll need to automate things like theorem provers and code coverage (which are still nowhere to be seen).
And just like all these models are running on Linux and QEMU and Docker (dev container) and heavily using projects like ripgrep (Claude Code insists on having ripgrep installed), I'm pretty sure all these tools these models rely on and shall rely on to produce acceptable results are going to be, very mostly, written by humans.
I don't know how to put it nicely: an app showing a green icon next to open restaurants on a map ain't exactly software to help lift off a rocket or to pilot an MRI machine.
BTW: yup, I do have and use Claude Code. Color me both impressed and horrified by the amount of "working" unmaintainable mess it can spout. Everybody who understands something about software maintenance should be horrified.
bdashdash [3 hidden]5 mins ago
What I find interesting is how AI enthusiasts will recursively offer AI itself as the solution to any of the issues you mention.
Since AI can read and generate code, it can surely fix code, or find bugs, or address security flaws. And if this all turns into a hot mess, AI can just refactor the whole thing anyway. And so forth.
Personally, I think we'll be some years off before the whole software loop is closed by AI (if it even happens anyway).
dncornholio [3 hidden]5 mins ago
I also managed to find a 1000 line .cpp file in one of the projects. The article's content doesn't match his apps' quality. They don't bring any value. His clock looks completely AI generated.
stavros [3 hidden]5 mins ago
Remember you're grinding your anti-LLM axe against something a real person made, and that person read your comment.
dncornholio [3 hidden]5 mins ago
Don't think it's fair to think any negative comment is from some anti-LLM-axe. I seriously gave you the benefit of the doubt, that was the whole reason I even looked further into your work.
It's no shame to be critical in today's world. Delivering proof is something that holds extra value and if I would create an article about the wonderful things I've created, I'd be extra sure to show it.
I looked at your clock project and when I saw that your updated version and improved version of your clock contained AI artifacts, I concluded that there's no proof of your work.
Sorry to have made that conclusion and I'm sorry if that hurt your feelings.
stavros [3 hidden]5 mins ago
Saying things like "there's no proof of your work" is the anti-LLM axe. Yes, it's all written by LLMs, and yes, it's all my work. Judge it on what it does and how well it works, not on whether the code looks like the code you would have written.
fzeroracer [3 hidden]5 mins ago
Is it your work? What did you bring to the table? Because if we're going to analyze design, then code is one function of that design.
For example, you talk about how the code is secure. How do you prove that it is secure?
stavros [3 hidden]5 mins ago
The same way you prove your OSS code is secure.
People here see an LLM-assisted project and suddenly they've never written a bug in their life.
miguelgrinberg [3 hidden]5 mins ago
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
keeda [3 hidden]5 mins ago
I dunno, I have extensive experience reviewing code, and I still review all the AI generated code I own, and I find nothing to complain about in the vast majority of cases. I think it is based on "holding it right."
For instance, I've commented before that I tend to decompose tasks intended for AI to a level where I already know the "shape" of the code in my head, as well as what the test cases should look like. So reviewing the generated code and tests for me is pretty quick because it's almost like reading a book I've already read before, and if something is wrong it jumps out quickly. And I find things jumping out more and more infrequently.
Note that decomposing tasks means I'm doing the design and architecture, which I still don't trust the AI to do... but over the years the scope of tasks has gone up from individual functions to entire modules.
In fact, I'm getting convinced vibe coding could work now, but it still requires a great deal of skill. You have to give it the right context and sophisticated validation mechanisms that help it self-correct as well as let you validate functionality very quickly with minimal looks at the code itself.
arikrahman [3 hidden]5 mins ago
"Holding it right" has been one of my biggest problems. Many times I find the output affected by prompt poisoning, and I have to throw away the entire context.
zackify [3 hidden]5 mins ago
This definitely is the case. I was talking to someone complaining about how LLMs don't work well.
They said it couldn't fix an issue it made.
I asked if they gave it any way to validate what it did.
They did not, some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x and validate it by accessing the endpoint after and writing tests"
It's shocking some people don't give it any real instruction or way to check itself.
In addition I get great results doing voice to text with very specific workflows. Asking it to add a new feature where I describe what functions I want changed then review as I go vs wait for the end.
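As a rough illustration of "giving it a way to validate", here is a minimal sketch of the kind of check an agent could be told to run after each change. Everything in it is hypothetical: the `add` function stands in for "x fn", the `/add` path and the JSON shape are made up, and the throwaway local server only exists so the example is self-contained.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def add(x: int, y: int) -> int:
    """Hypothetical function under repair (the "x fn" in the prompt)."""
    return x + y


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the function's result as JSON on every GET request.
        body = json.dumps({"result": add(2, 3)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the check quiet; the return value carries the evidence.
        pass


def check_endpoint() -> dict:
    """Start a throwaway server, hit the endpoint once, return what came back."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore any proxy settings so the request really goes to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        port = server.server_address[1]
        with opener.open(f"http://127.0.0.1:{port}/add", timeout=5) as resp:
            return {"status": resp.status, "body": json.loads(resp.read())}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(check_endpoint())
```

The point is not the server but the shape of the instruction: "fix x, then run this check and show me the output" gives the model something concrete to be wrong about.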
mbesto [3 hidden]5 mins ago
> It's shocking some people don't give it any real instruction or way to check itself.
It's not shocking. The tech world is telling them that "Claude will write all of their app easily" with zero instructions/guidelines so of course they're going to send prompts like that.
tracker1 [3 hidden]5 mins ago
I think the implications of limited to no instructions are anywhere from a little to way off depending on what you're doing... CRUD APIs, sure... especially if you have a well defined DB schema and API surface/approach. Anything that might get complex, less so.
Two areas I've really appreciated LLMs so far... one is being able to make web components that do one thing well in encapsulation.. I can bring it into my project and just use it... AI can scaffold a test/demo app that exercises the component with ease and testing becomes pretty straightforward.
The other for me has been in bridging rust to wasm and even FFI interfaces so I can use underlying systems from Deno/Bun/Node with relative ease... it's been pretty nice all around to say the least.
That said, this all takes work... lots of design work up front for how things should function... whether it's a UI component or an API backend library. From there, you have to add in testing, and some iteration to discover and ensure there aren't behavioral bugs in place. Actually reviewing code and especially the written test logic. LLMs tend to over-test in ways that are excessive or redundant a lot of the time. Especially when a longer test function effectively also tests underlying functionalities that each had their own tests... cut them out.
There's nothing "free" and it's not all that "easy" either, assuming you actually care about the final product. It's definitely work, but it's more about the outcome and creation than the grunt work. As a developer, you'll be expected to think a lot more, plan and oversee what's getting done as opposed to being able to just bang out your own simple boilerplate for weeks at a time.
mikkupikku [3 hidden]5 mins ago
It's surprising they don't learn better after their first hour or two of use. Or maybe they do know better but don't like the thing so they deliberately give it rope to hang itself with, then blame overzealous marketing.
sobjornstad [3 hidden]5 mins ago
There are subtler versions of this too. I've been working on a TUI app for a couple of weeks, and having great success getting it to interactively test by sending tmux commands, but every once in a while it would just deliver code that didn't work. I finally realized it was because the capture tools I gave it didn't capture the cursor location, so it would, understandably, get confused about where it was and what was selected.
I promptly went and fixed this before doing any more work, because I know if I was put in that situation I would refuse to do any more work until I could actually use the app properly. In general, if you wouldn't be able to solve a problem with the tools you give an LLM, it will probably do a bad job too.
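For anyone wanting to try the same fix, a minimal sketch of a capture helper that includes the cursor might look like the following. This is not the tool described above, just one way to do it: it assumes a running tmux session, uses `capture-pane -p` for the screen text and the `#{cursor_x}` / `#{cursor_y}` format variables for the position, and the function names and the `@` marker are made up for illustration.

```python
import subprocess


def mark_cursor(screen: str, cursor_x: int, cursor_y: int, marker: str = "@") -> str:
    """Return the screen text with `marker` drawn over the cursor cell."""
    lines = screen.split("\n")
    # Pad with empty rows if the cursor sits below the captured text.
    while len(lines) <= cursor_y:
        lines.append("")
    # Pad the row with spaces so the cursor column exists.
    row = lines[cursor_y].ljust(cursor_x + 1)
    lines[cursor_y] = row[:cursor_x] + marker + row[cursor_x + 1:]
    return "\n".join(lines)


def capture_with_cursor(target: str) -> str:
    """Capture a tmux pane and overlay the cursor (requires a running tmux)."""
    screen = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", target],
        check=True, capture_output=True, text=True,
    ).stdout
    position = subprocess.run(
        ["tmux", "display-message", "-p", "-t", target, "#{cursor_x},#{cursor_y}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    x, y = (int(part) for part in position.split(","))
    return mark_cursor(screen, x, y)
```

Note that the marker replaces whatever character is under the cursor, so a real tool might prefer to report the coordinates alongside the untouched screen text instead.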
petcat [3 hidden]5 mins ago
If you tell a human junior developer just "fix this" then they will spend a week on a wild-goose chase with nothing to show for it.
At least the LLM will only take 5 minutes to tell you they don't know what to do.
ruszki [3 hidden]5 mins ago
Do they? I’ve never got a response that something was impossible, or stupid. LLMs are happy to verify that a noop does nothing, if they don’t know how to fix something. They rather make something useless than really tackle a problem, if they can make tests green that way, or they can claim that something “works”.
And I’ve never asked Claude Code something which is really impossible, or even really difficult.
mikkupikku [3 hidden]5 mins ago
Claude code will happily tell me my ideas are stupid, but I think that's because I nest my ideas in between other alternative ideas and ask for an evaluation of all of them. This effectively combats the sycophantic tendencies.
Still, sometimes claude will tell me off even when I don't give it alternatives. Last night I told it to use luasocket from an mpv userscript to connect to a zeromq Unix socket (and also implement zmq in pure lua) connected to an ffmpeg zmq filter to change filter parameters on the fly. Claude code all but called me stupid and told me to just reload the filter graph through normal mpv means when I make a change. Which was a good call, but I told it to do the thing anyway and it ended up working well, so what does it really know... Anyway, I like that it pushes back, but agrees to commit when I insist.
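The "nest my idea between alternatives" trick can be sketched as a small prompt builder. This is only an illustration of the idea, not anyone's actual workflow; the function name and the wording of the prompt are assumptions.

```python
def evaluation_prompt(task: str, ideas: list[str]) -> str:
    """Build a prompt that asks for a critique of every idea.

    The ideas are listed without saying which one is the author's,
    so the model has no favorite to flatter.
    """
    if len(ideas) < 2:
        raise ValueError("need at least two ideas to compare")
    options = "\n".join(f"{i}. {idea}" for i, idea in enumerate(ideas, start=1))
    return (
        f"Task: {task}\n\n"
        f"Candidate approaches:\n{options}\n\n"
        "Evaluate every approach, list the main risk of each, "
        "and say which ones you would reject and why."
    )


if __name__ == "__main__":
    print(evaluation_prompt(
        "change ffmpeg filter parameters on the fly from an mpv userscript",
        ["reload the filter graph through mpv", "talk to a zmq filter over a Unix socket"],
    ))
```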
seunosewa [3 hidden]5 mins ago
After such hard-won wins, ask the AI to save what it learned during the session to a MD file.
IanCal [3 hidden]5 mins ago
I've definitely had pushback on what to do or approaches, yes. I've had this more recently because I've been pushing more on a side of "I want to know if this would end up being fast enough / allow something that it'd be worth doing". I've had to argue harder for something recently, and I'm genuinely not sure if it is possible or not. While it's not flat out refused to do it, it's explained to me why it won't work, and taken some pushing to implement parts of it. My gut feeling is that the blockers it is describing are real but we can sidestep them by taking a wilder swing at the change, but I'm not sure I'm right.
speakingmoistly [3 hidden]5 mins ago
To be fair, that happening feels more like poor management and mentorship than "juniors are scatterbrained".
Over time, you build up the right reflexes that avoid a one-week goose chase with them. Heck, since we're working with people, you don't just say "fix this", you earmark time to make sure everyone is aligned on what needs done and what the plan is.
dkersten [3 hidden]5 mins ago
> At least the LLM will only take 5 minutes to tell you they don't know what to do.
In my experience, the LLM will happily try the wrong thing over and over for hours. It rarely will say it doesn’t know.
Ancapistani [3 hidden]5 mins ago
Don’t ask it to make changes off the bat, then - ask it to make a plan. Then inspect the plan, change it if necessary, and go from there.
raw_anon_1111 [3 hidden]5 mins ago
I made that mistake when I first started using Claude Code/Codex. Now I give it access to my isolated DEV AWS account with appropriately scoped permissions on the IAM level with temporary credentials and tell it how to validate the code and have in my markdown file to always use $x to test any changes to $y.
It’s gotten a lot better.
Ancapistani [3 hidden]5 mins ago
> some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x and validate it by accessing the endpoint after and writing tests"
This works about 85% of the time IME, in Claude Code. My normal workflow on most bugs is to just say “fix this” and paste the logs. The key is that I do it in plan mode, then thoroughly inspect and refine the plan before allowing it to proceed.
tracker1 [3 hidden]5 mins ago
Yeah, the more time I spend in planning and working through design/api documentation for how I want something to work, the better it does... Similar for testing against your specifications, not the code... once you have a defined API surface and functional/unit tests for what you're trying to do, it's all the harder for AI to actually mess things up. Even more interesting is IMO how well the agents work with Rust vs other languages the more well defined your specifications are.
rirze [3 hidden]5 mins ago
Untested Hypothesis: LLM instruction is usually an intelligence+communication-based skill. I find in my non-authoritative experience that users who give short form instructions are generally ill prepared for technical motivation (whether they're motivating LLMs or humans).
sshine [3 hidden]5 mins ago
Feeding the LLM a "copy as cURL" for its feedback loop instead of letting it manage the dev server was an unlock for me.
jzig [3 hidden]5 mins ago
lol that is still “how you’re talking to them that affects the results” just more specific
raw_anon_1111 [3 hidden]5 mins ago
I have 30 years of experience delivering code and 10 years of leading architecture. My argument is the only thing that matters is does the entire implementation - code + architecture (your database, networking, your runtime that determines scaling, etc) meet the functional and non-functional requirements. Functional = does it meet the business requirements and UX and non-functional = scalability, security, performance, concurrency, etc.
I only carefully review the parts of the implementation that I know “work on my machine but will break once I put in a real world scenario”. Even before AI I wasn’t one of the people who got into geek wars worrying about which GOF pattern you should have used.
All except for concurrency where it’s hard to have automated tests, I care more about the unit or honestly integration tests and testing for scalability than the code. Your login isn’t slow because you chose to use a for loop instead of a while loop. I will have my agents run the appropriate tests after code changes
I didn’t look at a line of code for my vibe coded admin UI authenticated with AWS cognito that at most will be used by less than a dozen people and whoever maintains it will probably also use a coding agent. I did review the functionality and UX.
Code before AI was always the grind between my architectural vision and implementation
awakeasleep [3 hidden]5 mins ago
Explain how fragility of implementation, like spaghetti code, high coupling low cohesion fit into your world view?
petcat [3 hidden]5 mins ago
As human developers, I think we're struggling with "letting go" of the code. The code we write (or agents write) is really just an intermediate representation (IR) of the solution.
For instance, GCC will inline functions, unroll loops, and myriad other optimizations that we don't care about (and actually want!). But when we review the ASM that GCC generates we are not concerned with the "spaghetti" and the "high coupling" and "low cohesion". We care that it works, and is correct for what it is supposed to do.
Source code in a higher-level language is not really different anymore. Agents write the code, maybe we guide them on patterns and correct them when they are obviously wrong, but the code is just the work-item artifact that comes out of extensive specification, discussion, proposal review, and more review of the reviews.
A well-guided, iterative process and problem/solution description should be able to generate an equivalent implementation whether a human is writing the code or an agent.
sarchertech [3 hidden]5 mins ago
A compiler uses rigorous modeling and testing to ensure that generated code is semantically equivalent. It can do this because it is translating from one formal language to another.
Translating a natural prompt on the other hand requires the LLM to make thousands of small decisions that will be different each time you regenerate the artifact. Even ignoring non-determinism, prompt instability means that any small change to the spec will result in a vastly different program.
A natural language spec and test suite cannot be complete enough to encode all of these differences without being at least as complex as the code.
Therefore each time you regenerate large sections of code without review, you will see scores of observable behavior differences that will surface to the user as churn, jank, and broken workflows.
Your tests will not encode every user workflow, not even close. Ask yourself if you have ever worked on a non trivial piece of software where you could randomly regenerate 10% of the implementation while keeping to the spec without seeing a flurry of bug reports.
This may change if LLMs improve such that they are able to reason about code changes to the degree a human can. As of today they cannot do this and require tests and human code review to prevent them from spinning out. But I suspect at that point they’ll be doing our job, as well as the CEOs and we’ll have bigger problems.
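A toy version of the argument, as a sketch: two "regenerations" of the same function that both satisfy a plausible spec and test suite, yet differ in behavior a user would notice. The spec ("return the list without duplicates"), the function names, and the test are all invented for illustration.

```python
def unique_v1(items):
    """First 'generation': drops duplicates, keeps first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def unique_v2(items):
    """Second 'generation': drops duplicates, returns sorted output."""
    return sorted(set(items))


def spec_test(unique):
    """A plausible test for the spec 'return the list without duplicates'."""
    out = unique([3, 1, 3, 2, 1])
    return len(out) == 3 and set(out) == {1, 2, 3}


if __name__ == "__main__":
    print(spec_test(unique_v1), spec_test(unique_v2))  # both pass the suite
    print(unique_v1([3, 1, 3, 2, 1]))  # order the user entered
    print(unique_v2([3, 1, 3, 2, 1]))  # order changed after "regeneration"
```

Neither version is wrong by the spec; the ordering was simply a decision nobody wrote down, which is the kind of difference that shows up as churn when code is regenerated without review.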
LogicFailsMe [3 hidden]5 mins ago
I don't see a world where a motivated soul can build a business from a laptop and a token service as a problem. I see it as opportunity.
I feel similarly about Hollywood and the creation of media. We're not there in either case yet, but we will be. That's pretty clear. And when I look at the feudal society that is the entertainment industry here, I don't understand why so many of the serfs are trying to perpetuate it in its current state. And I really don't get why engineers think this technology is going to turn them into serfs unless they let that happen to them themselves. If you can build things, AI coding agents will let you build faster and more for the same amount of effort.
I am assuming given the rate of advance of AI coding systems in the past year that there is plenty of improvement to come before this plateaus. I'm sure that will include AI generated systems to do security reviews that will be at human or better level. I've already seen Claude find 20 plus-year-old bugs in my own code. They weren't particularly mission critical but they were there the whole time. I've also seen it do amazingly sophisticated reverse engineering of assembly code only to fall over flat on its face for the simplest tasks.
sarchertech [3 hidden]5 mins ago
That depends on how fast that change happens. If 45% of jobs evaporate in a 5 year period, a complete societal collapse is the likely outcome.
LogicFailsMe [3 hidden]5 mins ago
Sounds like influencer nonsense to me. Touch grass. If the people are fed and housed, there's no collapse. And if the billionaire class lets them starve, they will finally go through some things just like the aristocracy in France once did. And I think even Peter Thiel is smarter than that. You can feed yourself for <$1000 a year on beans and rice. Not saying you'd enjoy it, but you won't starve. So for ~$40B annually, the billionaires buy themselves revolution insurance. Fantastic value.
OTOH if what you're really talking about is the long-term collapse in our ludicrous carbon footprint when we finally run out of fossil fuels and we didn't invest in renewables or nuclear to replace them, well, I'm with you there.
sarchertech [3 hidden]5 mins ago
>Sounds like influencer nonsense to me. Touch grass.
I don't even know what this means.
The worst unemployment during the Weimar Republic was 25-30%. Unemployment in the Great Depression peaked at 25%.
So yeah if we get to 45% unemployment and those are the highest paying jobs on average then yeah it's gonna be bad. Then you add in second order effects where none of those people have the money to pay the other 55% who are still employed.
We might get to a UBI relatively quickly and peacefully. But I'm not betting on it.
>finally go through some things just like the aristocracy in France once did.
Yeah that's probably the most likely scenario, but that quickly devolved into death and imprisonment for far more than the aristocrats and eventually ended with Napoleon trying to take over Europe and millions of deaths overall.
The world didn't literally end, but it was 40 years of war, famine, disease, and death, and not a lot of time to think about starting businesses with your laptop.
LogicFailsMe [3 hidden]5 mins ago
And the dark ages lasted a millennium. Sounds like quite an improvement on that. And if America didn't want a society hellbent on living the worst possible timeline, why did it re-elect President Voldemaga and give him the football? And then, even when he breaks nearly every political promise, his support remains better than his predecessor? Anyway, I think the richest ~1135 Americans won't let you starve, but they'll be happy to watch you die young of things that had stopped killing people for quite some time whilst they skim all the cream. And that seems to be what the plurality wants or they'd vote differently.
The good news is that America is ~5% of the world. And the more we keep punching ourselves in the face, the better the chance someone else pulls ahead. But still, we have nukes, so we're still the town bully for the immediate future.
sarchertech [3 hidden]5 mins ago
What are you even arguing about? I have absolutely no idea where you are going with this.
LogicFailsMe [3 hidden]5 mins ago
Yeah I figured that. You think society is going to collapse because of AI. I don't. But I do think that stupid narrative is prevalent in the media right now and the C-suite happily proclaiming they're going to lay people off and replace them with AI got the ball rolling in the first place. Now it has momentum of its own with lunatics like Eliezer Yudkowsky once again getting taken seriously.
Fortunately, the other 95% of humanity is far less doomer about their prospects. So if America wants to be the new neanderthals, they'll be happy to be the new cro magnons.
sarchertech [3 hidden]5 mins ago
I don't think society is going to collapse because of AI because I don't think the current architectures have any chance of becoming AGI. I think that if AGI is even something we're capable of it's very far off.
I think that if CEOs can replace us soon, it's because AGI got here much sooner than I predicted. And if that happens we have 2 options, Mad Max and Star Trek, and Mad Max is the more likely of the 2.
LogicFailsMe [3 hidden]5 mins ago
What's with all the catastrophic thinking then? Mad Max? Collapse of Society because 45% unemployment? I really hate people on principle but I have more faith in them looking out for their own self interest than you do apparently. Mad Max specifically requires a ridiculous amount of intact infrastructure for all the gasoline (you know gasoline goes bad in 3-6 months? Yeah didn't think so), manufacturing for all the parts for all those crazy custom build road warrior wagons, and ranches of livestock for all the leather for all the cool outfits (and with all that cow, no one needs to starve but oh the infrastructure needed to keep the cows fed).
If doom porn is your thing, try watching Threads or The Day After, especially Threads. That said, I don't think Star Trek is possible, maybe The Expanse but more likely we run out of cheap energy before we get off world.
As for the AGI, it all depends on your definition. We're already at Amazon IC1/IC2 coding performance with these agents (I speak from experience previously managing them). If we get to IC3, one person will be able to build a $1B company and run it or sell it. If you're a purist like me and insist we stick to douchebag racist Nick Bostrom's superintelligence definition of AGI, then we agree. But I expect 24/7 IC3 level engineering as a service for $200/month to be more than enough and I think that's a year or two away. And you can either prepare for that or scream how the sky is falling, your choice.
jplusequalt [3 hidden]5 mins ago
>You can feed yourself for <$1000 a year on beans and rice. Not saying you'd enjoy it, but you won't starve. So for ~$40B annually, the billionaires buy themselves revolution insurance. Fantastic value.
You are the epitome of the tech bro.
LogicFailsMe [3 hidden]5 mins ago
Sure, sure. Understanding how these sociopaths think clearly makes me a tech bro rather than someone who incorporates worst-case scenarios into my planning. Suggesting they would maintain minimum viable society to save their own asses means I'm in favor of it, right? This is why I work remotely.
bhaak [3 hidden]5 mins ago
Peter Thiel might be smarter than that but I’m not sure about the other ones.
Look how Musk treated the Twitter devs or Bezos any of his workers or Trump anybody.
LogicFailsMe [3 hidden]5 mins ago
They're all quite intelligent. And they're world class experts in saving their own bacon. Doesn't mean they have any ethics though nor any emotional intelligence after decades of being surrounded by toadies and bootlickers.
jplusequalt [3 hidden]5 mins ago
>If you can build things, AI coding agents will let you build faster and more for the same amount of effort.
But you aren't building, your LLM is. Also, you are only thinking about ways that you, a supposed builder, will benefit from this technology. Have you considered how all previous waves of new technologies have introduced downstream effects that have muddied our societies? LLMs are not unique in this regard, and we should be critical of those who are trying to force them into every device we own.
k3nx [3 hidden]5 mins ago
I've struggled a bit with this myself. I'm having a paradigm shift. I used to say "but I like writing code". But like the article says, that's not really true. I like building things, the code was just a way to do that. If you want to get pedantic, I wasn't building things before AI either, the compiler/linker was doing that for me. I see this is just another level of abstraction. I still get to decide how things work, what "layers" I want to introduce. I still get to say, no, I don't like that. So instead of being the "grunt", I'm the designer/architect. I'm still building what I want. Boilerplate code was never something I enjoyed before anyway. I'm loving (like actually giggling) having the AI tie all the bits for me and getting up and running with things working. It reminds me of my Delphi days: File->New Project, and you're ready to go. I think I was burnt out. AI is helping me find joy again. I also disable AI in all my apps as well, so I'm still on the fence about several things too.
druide67 [3 hidden]5 mins ago
This resonates. I spent years thinking I enjoyed coding, but what I actually enjoy is designing elegant solutions built on solid architecture. Inventing, innovating, building progressively on strong foundations. The real pleasure is the finished product (is it ever really finished though?) — seeing it's useful and makes people's lives easier, while knowing it's well-built technically. The user doesn't see that part, but we know.
With AI, by always planning first, pushing it to explore alternative technical approaches, making it explain its choices — the creative construction process gets easier. You stay the conductor. Refactoring, new features, testing — all facilitated. Add regular AI-driven audits to catch defects, and of course the expert eye that nothing replaces.
One thing that worries me though: how will junior devs build that expert eye if AI handles the grunt work? Learning through struggle is how most of us developed intuition. That's a real problem for the next generation.
raw_anon_1111 [3 hidden]5 mins ago
Would you say the general contractor for your home isn’t a builder because he didn’t install the toilets?
jplusequalt [3 hidden]5 mins ago
I think this argument would make more sense if you were talking about an architect, or the customer.
A contractor is still very much putting the house together.
raw_anon_1111 [3 hidden]5 mins ago
The general contractor is not doing the actual building as much as he is coordinating all of the specialists, making sure things run smoothly and scheduling things based on dependencies and coordinating with the customer. I’ve had two houses built from the ground up.
LogicFailsMe [3 hidden]5 mins ago
3 myself and I have yet to meet a "vibe" contractor.
raw_anon_1111 [3 hidden]5 mins ago
And he is also not inspecting every screw, wire, etc. He delegates
LogicFailsMe [3 hidden]5 mins ago
Oh you're preaching to the choir. I think we are entering a punctuated equilibrium here w/r to the future of SW engineering. And the people who have the free time to go on to podcasts and insist AI coding agents can't do anything useful rather than learning their abilities and their limitations and especially how to wield them are going to go through some things. If you really want to trigger these sorts, ask them why they delegate code generation to compilers and interpreters without understanding each and every ISA at the instruction level. To that end, I am devoid of compassion after having gone through similar nonsense w/r to GPUs 20 years ago. Times change, people don't.
raw_anon_1111 [3 hidden]5 mins ago
I haven’t stayed relevant and able to find jobs quickly for 30 years by being the old man shouting at the clouds.
I started my career in 1996 programming in C and Fortran on mainframes and got my first, only, and hopefully last job at BigTech at 46, seven jobs later.
I’m no longer there. Every project I’ve had in the last two years has had classic ML and then LLMs integrated into the implementation. I have very much jumped on the coding agent bandwagon.
LogicFailsMe [3 hidden]5 mins ago
Started mine around the same time and yes, keeping up keeps one employed. What's disheartening however is how little keeping up the key decision makers and stakeholders at FAANNG do and it explains idiocy like already trying to fire engineers and replace them with AI. Hilarity ensued of course because hilarity always ensues for people like that, but hilarity and shenanigans appear to be inexhaustible resources.
LogicFailsMe [3 hidden]5 mins ago
I think that's precisely his thinking and don't let him know about all those fancy expensive unitasker tools they have that you probably don't that let them do it far more cost effectively and better than the typical homeowner. Won't you think of the jerbs(tm)? And to Captain dystopia, life expectancies were increasing monotonically until COVID. Wonder what changed?
petcat [3 hidden]5 mins ago
> A compiler uses rigorous modeling and testing to ensure that generated code is semantically equivalent.
Here are the reported miscompilation bugs in GCC so far in 2026. The ones labeled "wrong-code".
If you can’t understand the difference between a bug that will rarely cause a compiler encountering an edge case to generate a wrong instruction and an LLM that will generate 2 completely different programs with zero overlap because you added a single word to your prompt, then I don’t know what to tell you.
petcat [3 hidden]5 mins ago
The point is that expert humans (the GCC developers) writing code (C++) that generates code (ASM) does not appear to be as deterministic as you seem to think it is.
sarchertech [3 hidden]5 mins ago
I’m very aware of that, but I’m also aware that it’s rare enough that the compiler doesn’t emit semantically equivalent code that most people can ignore it. That’s not the case with LLMs.
I’m also not particularly concerned with non-determinism but with chaos. Determinism in LLMs is likely solvable, prompt instability is not.
jplusequalt [3 hidden]5 mins ago
Classic HN-ism. To focus on the semantics of a statement while ignoring the greater point in order to argue why someone is wrong.
anthonyrstevens [3 hidden]5 mins ago
I think it's a perfectly fine point. The OP said (my interpretation) that LLMs are messy, non-deterministic, and can produce bad code. The same is true of many humans, even those whose "job" is to produce clean, predictable, good code. The OP would like the argument to be narrowly about LLMs, but the bigger point even is "who generates the final code, and why and how much do we trust them?"
petcat [3 hidden]5 mins ago
I argued the greater point? Software code-generation is not deterministic, whether it's done by expert humans or by LLMs.
sarchertech [3 hidden]5 mins ago
It has nothing to do with determinism. It's the difference between nearly perfectly but not quite perfectly translating between rigorously specified formal languages and translating an ambiguous natural language specification into a formal one.
The first is a purely mechanical process, the second is not and requires thousands of decisions that can go either way.
raw_anon_1111 [3 hidden]5 mins ago
And that’s no different than human developers
sarchertech [3 hidden]5 mins ago
The difference is that a human can reason about their code changes to a much higher degree than an AI can. If you don't think this is true and you think we're working with AGI, why would you bother architecting anything at all or building in any guard rails? Why not just feed the AI the text of the contract you're working from and let it rip?
raw_anon_1111 [3 hidden]5 mins ago
You give way too much credit to the average mid level ticket taker. And again, why do I care how the code does it as long as it meets the functional and non-functional requirements?
jcranmer [3 hidden]5 mins ago
Compilers are some of the largest, most complex pieces of software out there. It should be no surprise that they come with bugs as all other large, complex pieces of software do.
Kye [3 hidden]5 mins ago
This seems to apply easily to LLMs as language coprocessors that can output code. How long was it before people trusted compilers?
sarchertech [3 hidden]5 mins ago
If you don't understand the difference between something that rigorously translates one formal language to another one and something that will spit out a completely different piece of software with 0 lines of overlap based on a one word prompt change, I don't know what to tell you.
anthonyrstevens [3 hidden]5 mins ago
"rigorously" is doing a lot of heavy lifting here.
sarchertech [3 hidden]5 mins ago
Let's substitute rigorously with "in an extremely thorough, careful, and methodical way."
raw_anon_1111 [3 hidden]5 mins ago
As if when you delegate tasks to humans they are deterministic. I would hope that your test cases cover the requirements. If not, your implementation is just as brittle when other developers come online or even when you come back to a project after six months.
sarchertech [3 hidden]5 mins ago
1. Agents aren’t humans. A human can write a working 100k LOC application with zero tests (not saying they should but they could and have). An agent cannot do this.
Agents require tests to keep them from spinning out and your tests do not cover all of the behaviors you care about.
2. If you doubt that your tests fail to cover all your requirements, consider that 99.9% of the production bugs you’ve ever had completely passed your test suite.
raw_anon_1111 [3 hidden]5 mins ago
I have never known a human that could or did write 100K lines of bug free working code without running parts of it first and testing.
So humans also don’t write bug free code or tests that cover all use cases - how is that an argument that humans are better?
sarchertech [3 hidden]5 mins ago
The point isn't whether humans can write 100k line programs bug free or without running parts of them.
An AI cannot write a 100k line program on its own without external guard rails; otherwise it spins out. This has nothing to do with whether the agent is allowed to run the code itself. This is well documented. Look at what was required to allow Claude to write a "C compiler".
This has nothing to do with whether it's bug free. It literally can't produce a working 100k LOC program without external guardrails.
raw_anon_1111 [3 hidden]5 mins ago
Absolutely no one is arguing that you shouldn’t have a combination of manual and automated tests around either AI or human generated code or that you shouldn’t have a thoughtful design
jmalicki [3 hidden]5 mins ago
I've actually found that well-written well-documented non-spaghetti code is even more important now that we have LLMs.
Why? Because LLMs can get easily confused, so they need well written code they can understand if the LLM is going to maintain the codebase it writes.
The cleaner I keep my codebase, and the better (not necessarily more) abstracted it is, the easier it is for the LLM to understand the code within its limited context window. Good abstractions help the right level of understanding fit within the context window, etc.
I would argue that use of LLMs changes what good code is, since "good" now means you have to meaningfully fit good ideas in chunks of 125k tokens.
raw_anon_1111 [3 hidden]5 mins ago
I somewhat agree. But that’s more about modularity. It helps when I can just have Claude code focus on one folder with its own Claude file where it describes the invariants - the inputs and outputs.
throwaw12 [3 hidden]5 mins ago
Valid points. But a crucial reason for not "letting go" of the code is that we are responsible for that code at the moment.
If, in the future, LLM providers take ownership of our on-calls for the code they have produced, I would write an "AUTO-REVIEW-ACCEPTER" bot to accept everything and deploy it to production.
If the company requires me to own something, then I should be aware of what that thing is, understand its ins and outs in detail, and be able to quickly adjust when things go wrong.
raw_anon_1111 [3 hidden]5 mins ago
In the past ten years as a team lead/architect/person who was responsible for outsourced implementations (ie Salesforce/Workday integrations, etc), I’ve been responsible for a lot of code I didn’t write. What sense would it have made for me to review the code of the web front end of the web developer for best practices when I haven’t written a web app since 2002?
throwaw12 [3 hidden]5 mins ago
as a team lead, if you are not aware of what's happening in the team, what kind of team lead is this?
on the other hand, you may have been an engineering manager, who is responsible for the team, but a lot of times they do not participate in on-call rotations (only as last escalation)
raw_anon_1111 [3 hidden]5 mins ago
As a team lead, I know the architecture, the functional and non-functional requirements, I know the website is supposed to do $x but I definitely didn’t guide how since I haven’t done web development in a quarter century, I know the best practices for architecture and data engineering (to a point).
That doesn’t mean I did a code review for all of the developers. I will ask them how they solved a problem that I know can be tricky, or whether they took something into account.
krilcebre [3 hidden]5 mins ago
You are comparing compilers to a completely non deterministic code generation tool that often does not take observable behavior into account at all and will happily screw a part of your system without you noticing, because you misworded a single prompt.
No amount of unit/integration tests cover every single use case in sufficiently complex software, so you cannot rely on that alone.
raw_anon_1111 [3 hidden]5 mins ago
I just rewrote a utility for the third time - the first two were before AI.
Short version, when someone designs a call center with Amazon Connect, they use a GUI flowchart tool and create “contact flows”. You can export the flow to JSON. But it isn’t portable to other environments without some remapping. I created a tool before that used the API to export it and create a portable CloudFormation template.
I always miss some nuance. Half of it can be caught by calling the official CloudFormation linter and the other half by actually deploying it and seeing what errors you get.
This time, I did it with Claude Code. Ironically enough, it knew some of the complexity because it had been trained on one of my older open source implementations I did while at AWS. But I told it to read the official CloudFormation spec and, after every change, test it with the linter, try to deploy it, and fix it.
Again, I didn’t care about the code - I cared about results. The output of the script either passes the deployment or it doesn’t. Claude iterated until it got it right based on “observable behavior”. Claude has tested whether my deployments were working as expected plenty of times by calling the appropriate AWS CLI command and fixed things or reading from a dev database based on integration tests I defined.
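For readers who want the shape of that loop spelled out, here is a minimal Python sketch of "change, lint, deploy, fix, repeat". Everything in it is an assumption for illustration: `iterate_until_passing`, `CheckResult`, and the toy lint/deploy/fix functions are hypothetical stand-ins, not the commenter's tool, the CloudFormation linter, or any agent's API. In a real setup the checks would shell out to a linter and a deployment, and the fixer would be the agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    name: str
    passed: bool
    message: str = ""


# A check inspects the candidate artifact (e.g. template text) and reports pass/fail.
Check = Callable[[str], CheckResult]
# A fixer receives the artifact plus the failing result and returns a revised artifact.
Fixer = Callable[[str, CheckResult], str]


def first_failure(artifact: str, checks: List[Check]) -> Optional[CheckResult]:
    """Run checks in order and stop at the first failure (cheap checks go first)."""
    for check in checks:
        result = check(artifact)
        if not result.passed:
            return result
    return None


def iterate_until_passing(artifact: str, checks: List[Check], fix: Fixer,
                          max_rounds: int = 5) -> Optional[str]:
    """Return an artifact that passes every check, or None if the fix budget runs out."""
    for round_number in range(max_rounds + 1):
        failure = first_failure(artifact, checks)
        if failure is None:
            return artifact
        if round_number == max_rounds:
            break  # budget spent; do not loop forever
        artifact = fix(artifact, failure)
    return None


# Toy stand-ins (hypothetical): a "linter", a "deployment", and a "fixer".
def toy_lint(text: str) -> CheckResult:
    return CheckResult("lint", "Resources:" in text, "missing Resources section")


def toy_deploy(text: str) -> CheckResult:
    return CheckResult("deploy", "Type:" in text, "resource has no Type")


def toy_fix(text: str, failure: CheckResult) -> str:
    if failure.name == "lint":
        return text + "Resources:\n"
    return text + "  Type: Example\n"


if __name__ == "__main__":
    print(iterate_until_passing("", [toy_lint, toy_deploy], toy_fix))
```

The `max_rounds` cap is a design choice of this sketch: the acceptance criterion is purely observable behavior, so the loop needs a budget in case the fixer never converges.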
mikeocool [3 hidden]5 mins ago
When requirements change, a compiler has the benefit of not having to go back and edit the binary it produced.
Maybe we should treat LLM generated code similarly: just generate everything fresh from the spec anytime there’s a change, though personally I haven’t had much success with that yet.
raw_anon_1111 [3 hidden]5 mins ago
It very much does have to modify the binary it produced to create new code. The entire Linux kernel has an unstable ABI where you have to recompile your code to link to system libraries.
AstroBen [3 hidden]5 mins ago
This is fantasy completely disconnected from reality.
Have you ever tried writing tests for spaghetti code? It's hell compared to testing good code. LLMs require a very strong test harness or they're going to break things.
Have you tried reading and understanding spaghetti code? How do you verify it does what you want, and none of what you don't want?
Many code design techniques were created to make things easy for humans to understand. That understanding needs to be there whether you're modifying it yourself or reviewing the code.
Developers are struggling because they know what happens when you have 100k lines of slop.
If things keep speeding in this direction we're going to wake up to a world of pain in 3 years and AI isn't going to get us out of it.
raw_anon_1111 [3 hidden]5 mins ago
I’ve found much more utility even pre AI in a good suite of integration tests than unit tests. For instance if you are doing a test harness for an API, it doesn’t matter if you even have access to the code if you are writing tests against the API surface itself.
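To make "tests against the API surface itself" concrete, here is a small self-contained Python sketch using only the standard library. The `/health` endpoint, the `_DemoHandler` stand-in service, and the helper names are all assumptions made up for this illustration; a real harness would point `fetch_health` at an already-deployed API rather than at a server started inside the test.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class _DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in service so the example runs without a real deployment."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_health(host: str, port: int):
    """Black-box check: knows only the URL and the contract, not the implementation."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/health")
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()


def run_demo():
    """Start the stand-in service on a free local port, test it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), _DemoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        host, port = server.server_address
        return fetch_health(host, port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, payload = run_demo()
    assert status == 200 and payload == {"status": "ok"}
```

Nothing in `fetch_health` depends on how the handler is written, which is the property being argued for: the same test keeps working if the implementation behind the endpoint is replaced wholesale.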
AstroBen [3 hidden]5 mins ago
I do too, but it comes from a bang-for-your-buck and not a test coverage standpoint. Test coverage goes up in importance as you lean more on AI to do the implementation IMO.
raw_anon_1111 [3 hidden]5 mins ago
You did see the part about my unit, integration and scalability testing? The testing harness is what prevents the fragility.
It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.
No human should ever be forced to look at the code behind my vibe coded internal admin portal that was created with straight Python, no frameworks, server side rendered and produced HTML and JS for the front end all hosted in a single Lambda including much of the backend API.
I haven’t done web development since 2002 with Classic ASP besides some copy and paste feature work once in a blue moon.
In my repos, post AI, my Claude/Agent files have summaries of the initial statement of work, the transcripts from the requirement sessions, my well labeled design diagrams, my design review session transcripts where I explained it to the client and answered questions, and a link to the Google NotebookLM project with all of the artifacts. I have separate md files for different implementation components.
The NotebookLM project can be used for any future maintainers to ask questions about the project based on all of the artifacts.
sarchertech [3 hidden]5 mins ago
> It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.
In my experience using AI to work on existing systems, the AI definitely performs much better on code that humans would consider readable.
You can’t really sit here talking about architecting greenfield systems with AI using methodology that didn’t exist 6 months ago while confidently proclaiming that “trust me they’ll be maintainable”.
Well you can, and most consultants do tend to do that, but it’s not worth much.
Rapzid [3 hidden]5 mins ago
> Well you can, and most consultants do tend to do that
Yeah they do.
I'm familiar enough with the claims to feel confident there is plenty of nefarious astroturfing occurring all over the web including on HN.
raw_anon_1111 [3 hidden]5 mins ago
I wasn’t born into consulting in 1996. AI for coding is by definition the worst today that it will ever be. What makes you think that the complexity of the code will increase faster than the capability of the agents?
sarchertech [3 hidden]5 mins ago
You might have maintained large systems long ago, but if you haven't done it in a while your skill atrophies.
And the most important part is you haven't maintained any large systems written by AI, so stating that they will work is nonsense.
I won't state that AI can't get better. AI agents might replace all of us in the future. But what I will tell you is based on my experience and reasoning I have very strong doubts about the maintainability of AI generated code that no one has approved or understands. The burden of proof isn't on the person saying "maybe we should slow down and understand the consequences before we introduce a massive change." It's on the person saying "trust me it will work even though I have absolutely no evidence to support my claim".
raw_anon_1111 [3 hidden]5 mins ago
Well seeing that Claude code was just introduced last year - it couldn’t have been that long since I didn’t code with AI.
And did I mention I got my start working in cloud consulting as a full time blue badge, RSU earning employee at a little company you might have heard of based in Seattle? So since I have worked at the second largest employer in the US, unless you have worked for Walmart - I don’t think you have worked for a larger company than I have.
Oh did I also mention that I worked at GE when it was #6 in market cap?
These were some of the business requirements we had to implement for the railroad car repair interchange management software
You better believe we had a rigorous set of automated tests in something as highly regulated with real world consequences as the railroad transportation industry. AI would have been perfect for that because the requirements were well documented and the test coverage was extreme.
And unless your experience coding is before 1986 when I was coding in assembly language in 65C02 as a hobby, I think I might have a wee bit more than you.
I think you should probably save your “I have more experience” for someone who hasn’t been doing this professionally for 30 years for everything from startups, to large enterprises, to BigTech.
integralid [3 hidden]5 mins ago
>No human should ever be forced to look at the code behind my vibe coded internal admin portal
Except security researchers. I work in cybersecurity and we already see vulnerabilities caused by careless AI generated code in the wild. And this will only get worse (or better for my job security).
raw_anon_1111 [3 hidden]5 mins ago
And you haven’t seen security vulnerabilities in the wild based on careless human generated code?
datsci_est_2015 [3 hidden]5 mins ago
Also developer UX, common antipatterns, etc
This “the only thing that matters about code is whether it meets requirements” is such a tired take and I can’t imagine anyone seriously spouting it has had to maintain real software.
raw_anon_1111 [3 hidden]5 mins ago
The developer UX are the markdown files if no developer ever looks at the code.
Whether you are tired of it or not, absolutely no one in your value chain - your customers who give your company money or your management chain - cares about your code beyond whether it meets the functional and non-functional requirements - they never did.
And of course whether it was done on time and on budget
datsci_est_2015 [3 hidden]5 mins ago
As a consumer of goods, I care quite a bit about many of the “hows” of those goods just as much as the “whats”.
My home, which I own, for example, is very much a “what” that keeps me warm and dry. But the “how” of its construction is the difference between (1) me cursing the amateur and careless decision making of builders and (2) quietly sipping a cocktail on the beach, free of a care in the world.
“How” doesn’t matter until it matters, like when you put too much weight onto that piece of particle board IKEA furniture.
raw_anon_1111 [3 hidden]5 mins ago
Do you know how every nail was put into your house? Does the general contractor?
datsci_est_2015 [3 hidden]5 mins ago
I know where they fucked up and cost me thousands of dollars due to cutting corners during build-out and poor architectural decisions during planning. These kinds of things become very obvious during destructive inspection, which is probably why there are so many limitations on warranties; I digress.
He’s mildly controversial, but watch some @cyfyhomeinspections on YouTube to get a good idea of what you can infer of the “how” of building homes and how it affects homeowners. Especially relevant here because he seems to specialize in inspecting homes that are part of large developments where a single company builds out many homes very quickly and cuts tons of corners and makes the same mistakes repeatedly, kind of like LLM-generated code.
raw_anon_1111 [3 hidden]5 mins ago
So you’re saying that whether it’s humans or AI - when you delegate something to others you have no idea whether it’s producing quality without you checking yourself…
datsci_est_2015 [3 hidden]5 mins ago
> you have no idea whether it’s producing quality without you checking yourself
No, I can have some idea. For example, “brand perception”, which can be negatively impacted pretty heavily if things go south too often. See: GitHub, most recently.
I mean, there are already companies that have a negative reputation regarding software quality due to significant outsourcing (consultancies), or bloated management (IBM), or whatever tf Oracle does. We don’t have to pretend there’s a universe where software quality matters, we already live in one. AI will just be one more way to tank your company’s reputation with regards to quality, even if you can maintain profitability otherwise through business development schemes.
raw_anon_1111 [3 hidden]5 mins ago
So as long as it is meeting the requirements of “it stays up consistently and doesn’t lose my code” you really don’t care how it was coded…
The same as I’ve been arguing about using an agent to do the grunt work of coding.
If GitHub’s login is slow, it isn’t because someone or something didn’t write SOLID code.
datsci_est_2015 [3 hidden]5 mins ago
> So as long as it is meeting the requirements of “it stays up consistently and doesn’t lose my code” you really don’t care how it was coded…
I don’t think we’ll come to common ground on this topic due to mismatching definitions of fundamental concepts of software engineering. Maybe let’s meet again in a year or two and reflect upon our disagreement.
sarchertech [3 hidden]5 mins ago
If you maintain software used by tens of thousands to millions of people, you will quickly realize that no specified functional and non-functional requirements cover anywhere near all user workflows or observable behaviors.
If you mostly parachute in solutions as a consultant, or hand down architecture from above, you won’t have much experience with that, so it’s reasonable for you to underestimate it.
raw_anon_1111 [3 hidden]5 mins ago
AWS S3 by itself is made up of 300 microservices. Absolutely no developer at AWS knows how every line of code was written.
The scalability requirements are part of the “non functional requirements”. I know that the vibe coded internal admin website will never be used by more than a dozen people just like I know the ETL implementation can scale to the required number of transactions because I actually tested it for that scalability.
In fact, the one I gave to the client was my second attempt because my first one fell flat on its face when I ran it at the required scale
sarchertech [3 hidden]5 mins ago
I'm not talking about scalability requirements. I'm talking about the different workflows that 10 million people will come up with when they use a program that won't exist in any requirements docs.
vova_hn2 [3 hidden]5 mins ago
I personally haven't made up my mind either way yet, but I imagine that a vibecoding advocate could say to you that maintaining code makes sense only when the code is expensive to produce.
If the code is cheap to produce, you don't maintain it, you just throw it away and regenerate.
sarchertech [3 hidden]5 mins ago
If you have users, this only works if you have managed to encode nearly every user observable behavior into your test suite.
I’ve never seen this done even with LLMs. Not even close. And even if you did it, the test suite is almost definitely more complex than the code and will suffer from all the same maintainability problems.
raw_anon_1111 [3 hidden]5 mins ago
And in that case how is it different than when random developers come on and off projects?
sarchertech [3 hidden]5 mins ago
For one you don't let random devs hop on and off projects without code reviews, which is what people who say they don't care about the code should be doing.
And two, agents are clearly worse at reasoning through code changes than humans are.
raw_anon_1111 [3 hidden]5 mins ago
And the team lead with 7 developers isn’t going to be doing code reviews of all the code. At most he is going to be reviewing those critical paths.
I couldn’t care less about the implementation behind the vibe coded admin website that will only be used by a dozen people. I care about the authorization.
Even for the ETL job, I cared only about the performance characteristics, concurrency, logging, and correctness of the results.
soulofmischief [3 hidden]5 mins ago
I would like to introduce you to the concepts of interfaces and memory safety.
Well-designed interfaces enforce decoupling where it matters most. And believe it or not, you can do review passes after an LLM writes code, to catch bugs, security issues, bad architecture, reduce complexity, etc.
mikkupikku [3 hidden]5 mins ago
It's not skill with talking to an LLM, it's the user's skill and experience with the problem they're asking the LLM to solve. They work better for problems the prompter knows well and poorly for problems the prompter doesn't really understand.
Try it yourself. Ask claude for something you don't really understand. Then learn that thing, get a fresh instance of claude and try again, this time it will work much better because your knowledge and experience will be naturally embedded in the prompt you write up.
Roxxik [3 hidden]5 mins ago
It's not only you not understanding the how, but also you not understanding the goal.
I often use AI successfully, but in a few cases I had, it was bad. That was when I didn't even know the end goal and regularly switched the fundamental assumptions that the LLM tried to build up.
One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.
So the LLM tried many fundamentally different approaches and when I had something that specifically did not work it immediately switched approaches.
Next time I get to work on this (toy) problem I will let it implement some of them, fully parametrize them and let me have a go with it. There is a concrete goal and I can play around myself to see if my specific convergence criterion is even possible.
FeepingCreature [3 hidden]5 mins ago
LLMs massively reduce the cost of "let's just try this". I think trying to migrate your entire repo is usually a fool's errand. Figure out a way to break the load-bearing part of the problem out into a sub-project, solve it there, iterate as much as you like. Claude can give you a test gui in one or two minutes, as often as you like. When you have it reliably working there, make Claude write up a detailed spec and bring that back to the main project.
mikkupikku [3 hidden]5 mins ago
Claude is surprisingly good at GUI work, I've been learning, not just getting stuff working but also creating reasonably tasteful and practical designs. Asking claude in the browser to mock up a GUI and then having claude code implement it is a surprisingly powerful workflow.
raw_anon_1111 [3 hidden]5 mins ago
I’m far away from a web developer or a web designer. But I think I intuitively understand how to put myself in the shoes of the end user when it comes to UX.
I noticed that Claude is awful at understanding what makes good UX, even for something as simple as this: if you have a one line input box and a button that lets you submit the line of text, you should wire it up so a user can press return instead of pressing the button. The same goes for making sure users can tab through inputs in a decent order.
mikkupikku [3 hidden]5 mins ago
Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.
__alexs [3 hidden]5 mins ago
I review most of the code I get LLMs to write and actually I think the main challenge is finding the right chunk size for each task you ask it to do.
As I use it more I gain more intuition about the kinds of problems it can handle on its own, vs. those that I need to work on breaking down into smaller pieces before setting it loose.
Without research and planning, agents are mostly very expensive and slow to get things done, if they even can. However, with the right initial breakdown and specification of the work they are incredibly fast.
win311fwg [3 hidden]5 mins ago
I will still take a glance every once in a while to satisfy my curiosity, but I have moved past trying to review code. I was happy with the results frequently enough that I do not find it to be necessary anymore. In my experience, the best predictor is the target programming language. I fail to get much usable code in certain languages, but in certain others it is as if I wrote it myself every time. For those struggling to get good results, try a different programming language. You might be surprised.
make_it_sure [3 hidden]5 mins ago
you are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which are not aligned with what LLMs write, but that doesn't mean the LLM's code is wrong.
I know senior developers that are very radical on some nonsense patterns they think are much better than others. If they see code that doesn't follow them, they say it's trash.
Even so, you can guide the LLM to write the code as you like.
And you are wrong, it's a lot on how people write the prompt.
datsci_est_2015 [3 hidden]5 mins ago
> you are overestimating the skill of code review.
“You are overestimating the skill of [reading, comprehending, and critically assessing code of a non-guaranteed quality]” is an absurd statement if you properly expand out what “code review” means.
I don’t care if you code review the CSS file for the Bojangles online menu web page, but you better be code reviewing the firmware for my dad’s pacemaker.
This whole back and forth with LLM-generated code makes me think that the marginal utility of a lot of code the strong proponents write is <1¢. If I fuck up my code, it costs our partners $200/hr per false alert, which obliterates the profit margin of using our software in the first place.
AIorNot [3 hidden]5 mins ago
By far most of the code LLMs write is for crappy crud apps and webapps not pacemakers and rockets
We can capture enough reliability on what LLMs produce there by guided integration tests and UX tests along with code review and using other LLMs to review along with other strategies to prevent semantic and code drift
Do you know how many crap WordPress, Drupal and Joomla sites I have seen?
Just that work can be automated away
But I’ve also worked in high end and mission critical delivery and more formal verification etc - that’s just moving the goalposts on what AI can do - it will get there eventually
Last year you all here were arguing AI couldn’t code - now everyone has moved the goalposts to formal high end and mission critical ops - yes when money matters we humans are still needed of course - no one is denying that - it’s the utility of the sole human developer against the onslaught of machine aided coding
This profession is changing rapidly- people are stuck in denial
datsci_est_2015 [3 hidden]5 mins ago
> that’s just moving the goalposts on what AI can do- it will get there eventually
This is the nutshell of your argument. I’m not convinced. Technologies often hit a ceiling of utility.
Imagine a “progress curve” for every technology, x-axis time and y-axis utility. Not every progress curve is limitlessly exponential, or even linear - in fact, very few are. I would venture to guess that most technological progress actually mimics population growth curves, where a ceiling is hit based on fundamental restrictions like resource availability, and then either stabilizes or crashes.
I don’t think LLMs are the AI endgame. They definitely have utility, but I think your argument boils down to a bold prediction of limitless progress of a specific technology (LLMs), even though that’s quite rare historically.
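To put numbers on the "progress curve" picture, here is a small Python sketch comparing unbounded exponential growth with the standard logistic (population growth) curve. The rate, starting value, and ceiling are arbitrary assumptions picked for illustration and say nothing about where any real technology's ceiling sits.

```python
import math


def exponential(t: float, rate: float = 0.5, start: float = 1.0) -> float:
    """Unbounded growth: utility keeps multiplying forever."""
    return start * math.exp(rate * t)


def logistic(t: float, ceiling: float = 100.0, rate: float = 0.5,
             start: float = 1.0) -> float:
    """Standard logistic growth: K / (1 + ((K - P0) / P0) * e^(-r t))."""
    return ceiling / (1.0 + ((ceiling - start) / start) * math.exp(-rate * t))


if __name__ == "__main__":
    # Early on the two curves are nearly indistinguishable; later they diverge.
    for t in (0, 1, 5, 10, 20, 40):
        print(f"t={t:>2}  exponential={exponential(t):>14.2f}  logistic={logistic(t):>7.2f}")
```

The awkward part for anyone extrapolating is that, with these parameters, the early stretch of both curves looks almost the same, so early data alone cannot tell you which curve you are on.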
tracker1 [3 hidden]5 mins ago
I'm relatively forgiving on bugs that I kind of expect to have happen... just from experience working with developers... a lot of the bugs I catch in LLMs are exactly the same as those I have seen from real people. The real difference is the turn around time. I can stay relatively busy just watching what the LLM is doing, while it's working... taking a moment to review more solidly when it's done on the task I gave it.
Sometimes, I'll give it recursive instructions... such as "these tests are correct, please re-run the test and correct the behavior until the tests work as expected." Usually more specific on the bugs, nature and how I think they should be fixed.
I do find that sometimes when dealing with UI effects, the agent will go down a bit of a rabbit hole... I wanted an image zoom control, and the agent kept trying to do it all with css scaling and the positioning was just broken.. eventually telling it to just use nested div's and scale an img element itself, using CSS positioning on the virtual dom for the positioning/overflow would be simpler, it actually did it.
I've seen similar issues where the agent will start changing a broken test, instead of understanding that the test is correct and the feature is broken... or tell me to change my API/instructions, when I WANT it to function a certain way, and it's the implementation that is wrong. It's kind of weird, like reasoning with a toddler sometimes.
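One cheap guard against the "agent edits the test instead of the feature" failure is to fingerprint the test files before the agent runs and compare afterwards. The Python sketch below is only an illustration under assumptions: `snapshot_tests` and `changed_tests` are hypothetical helper names, and the `test_*.py` pattern is a guess at a naming convention, not something from this thread.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot_tests(test_dir, pattern: str = "test_*.py") -> Dict[str, str]:
    """Map each test file (path relative to test_dir) to a SHA-256 of its contents."""
    root = Path(test_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
    }


def changed_tests(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List test files that were added, removed, or modified between two snapshots."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Usage would be roughly: take a snapshot, let the agent work, take another snapshot, and refuse to accept the change (or ask for a human look) if `changed_tests` returns anything while the instruction was "the tests are correct, fix the implementation".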
jjice [3 hidden]5 mins ago
I think that's absolutely part of it. Code reviewing has become an even more valuable skill than ever, and I think the industry as a whole still is treating it as low value, despite it always being one of the most important parts of the process.
I think another part (among many others) is not the skill of the individual prompting, but the quality of the code and documentation (human and agent specific) in the code base. I've seen people run willy-nilly with LLMs that are just spitting out nonsense because there are no examples for how the code should look, no documentation on how the code should be structured, and no human who knows how the code should work reviewing it. A deadly combo to produce bad, unmaintainable code.
If you sort those out though (and review your own damn LLM code), I think that's when LLMs become a powerful programming tool.
I really liked Simon Willison's way of putting it: "Your job is to deliver code you have proven to work".
I think that entirely disregarding the fundamental operation of LLMs with dismissiveness is ungrounded. You are literally saying it isn’t a skill issue while pointing out a different skill issue.
It is absolutely, unequivocally, patently false to say that the input doesn’t affect the output, and if the input has impact, then it IS a skill.
cultofmetatron [3 hidden]5 mins ago
> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding
this makes me feel better about the amount of disdain I've been feeling about the output from these llms. sometimes it pops out exactly what I need but I can never count on it to not go off the rails and require a lot of manual editing.
mcv [3 hidden]5 mins ago
Exactly my experience. Sometimes it's brilliant, sometimes it produces crap, often it produces something that's a step in the right direction but requires extra work, and often it switches between these different results, producing great results at first until it gets stuck and desperately starts spewing out increasingly weird garbage.
As a developer, you always have to check the code, and recognise when it's just being stupid.
k3nx [3 hidden]5 mins ago
Question: are you manually making those changes to the "stupid" code? I've been having success with Claude using skills. When I see something I wouldn't do I say what I would have done, ask it why it did it the way it did, then have it update the skills with a better plan. It's like a rubber duck and I understand it better. I have it make the code improvements. Laughing as it goes off the rails is entertaining though.
manojlds [3 hidden]5 mins ago
It's also always easier to blame the LLM when the developer doesn't work with it right.
Ancapistani [3 hidden]5 mins ago
> complain they aren't getting great results without a lot of hand holding
This is what I don’t understand - why would I “complain” about “hand holding”? Why wouldn't I just create a Claude skill or analogue that tells the agent to conform to my preferences?
I’ve done this many times, and haven’t run into any major issues.
kasey_junk [3 hidden]5 mins ago
I think that code review experience is a big driver of success with the llms, but my take away is somewhat different. If you’ve spent a lot of time reviewing other people’s code you realize the failures you see with llms are common failures full stop. Humans make them too.
I also think reviewable code, that is, code specifically delivered in a manner that makes code review more straightforward, was always valuable, but now that generation costs have dropped, its relative value is much higher. So structuring your approach (including plans and prompts) to drive to easily reviewed code is a more valuable skill than before.
antihero [3 hidden]5 mins ago
Garbage in, garbage out.
staticassertion [3 hidden]5 mins ago
> It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that other's don't.
Well, it's easily the simplest explanation, right?
baxtr [3 hidden]5 mins ago
I thought I'd try to debunk your argument with a food example. I am not sure I succeeded though. Judge for yourself:
It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.
In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.
marviio [3 hidden]5 mins ago
In your example the one making the food is you. You would have to introduce a cooking robot for the analogy to match agentic coding.
baxtr [3 hidden]5 mins ago
Actually I would say it should be a cooking machine like. I am not too familiar with these machines however.
ozgrakkurt [3 hidden]5 mins ago
Unfortunately it is impossible to ascertain what is what from what we read online. Everyone is different and uses the tools in a different way. People also use different tools and do different things with them. Also, each person's judgement can be wildly different, like you are saying here.
We can't trust the measurements that companies post either because truth isn't their first goal.
Just use it or don't use it depending on how it works out imo. I personally find it marginally on the positive side for coding
ttanveer [3 hidden]5 mins ago
That seems to make sense. Any suggestions to improve this skill of reviewing code?
I think especially a number of us more junior programmers lack in this regard, and don't see a clear way of improving this skill beyond just using LLMs more and learning with time?
Dannymetconan [3 hidden]5 mins ago
It's "easy". You just spend a couple of years reviewing PRs and working in a professional environment getting feedback from your peers and experience the consequences of code.
There is no shortcut unfortunately.
vsl [3 hidden]5 mins ago
You improve this skill by not using LLMs more and getting more experienced as a programmer yourself. Spotting problems during review comes from experience, from having learned the lessons, knowing the codebase and libraries used etc.
christofosho [3 hidden]5 mins ago
Find another developer and pair/work together on a project. It doesn't need to be serious, but you should organize it like it is. So, a breakdown of tasks needed to accomplish the goal first. And then many pull requests into the source that can be peer reviewed.
or_am_i [3 hidden]5 mins ago
It's always easier to blame the model and convince yourself that you have some sort of talent in reviewing LLM's work that others don't.
In my experience the differences are mostly in how the code produced by LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
loloquwowndueo [3 hidden]5 mins ago
Never takes long for the “you’re holding it wrong” crowd to pop in.
darkerside [3 hidden]5 mins ago
That's a terrible reason for a mass consumer tool to fail, and a perfectly reasonable one for a professional power tool to fail
hellosimon [3 hidden]5 mins ago
Partly true, but I think there's a real skill in catching subtle logic errors in generated code too, not just prompting well. Both matter.
stavros [3 hidden]5 mins ago
That's what I meant, though. I didn't mean "I say the right words", I meant "I don't give them a sentence and walk away".
JasonADrury [3 hidden]5 mins ago
In my experience the differences are mostly between the chair and the keyboard.
I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.
I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.
The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.
The "AI can't build anything useful" crowd consists entirely of fools and liars.
gonzo41 [3 hidden]5 mins ago
It's exactly the same as 10 years ago and being able to google and search well for odd support bugs. Junior people think it's easy and they don't see the massive skill in framing questions and filtering.
cousin_it [3 hidden]5 mins ago
I'm thinking more and more that there's an ethical problem with using LLMs for programming. You might be reusing someone's GPL code with the license washed off. It's especially worrisome if the results end up in a closed product, competing with the open source project and making more money than it. Of course neither you nor the AI companies will face any consequence, the government is all-in and won't let you be hurt. But ethically, people need to start asking themselves some questions.
For me personally, in my projects there's not a single line of LLM code. At most I ask LLMs for advice about specific APIs. And the more I think about it, the more I want to stop doing even that.
3form [3 hidden]5 mins ago
I would also add: if you're paying, supporting their cause with your money.
Sometimes I would like to have magical make-my-project tool for my selfish reasons; sometimes I know it would be a bad choice to fall behind on what's to come. But I really, really don't want to support that future.
grzesiaka [3 hidden]5 mins ago
The same here. I find big AI-corpos pretty evil and drastically misaligned with broader well-being of the society.
denzen [3 hidden]5 mins ago
> Pine Town is a whimsical infinite multiplayer canvas of a meadow, where you get your own little plot of land to draw on. Most people draw… questionable content
Doesn't help that _pine_ is one way of saying penis in French
vicchenai [3 hidden]5 mins ago
the cost angle is underrated here. sonnet for implementation, opus for architecture review — that's not a philosophical stance, it's just not burning money. i do something similar and the reviewer pass catches a surprising number of cases where the implementer quietly chose the path of least tokens instead of the right solution
highfrequency [3 hidden]5 mins ago
> I’ll tell the LLM my main goal (which will be a very specific feature or bugfix e.g. “I want to add retries with exponential backoff to Stavrobot so that it can retry if the LLM provider is down”), and talk to it until I’m sure it understands what I want. This step takes the most time, sometimes even up to half an hour of back-and-forth until we finalize all the goals, limitations, and tradeoffs of the approach, and agree on what the end architecture should look like.
This sounds sensible, but also makes me wonder how much time is actually being saved if implementing a "very specific feature or bugfix" still takes an hour of back and forth with an LLM.
Can't help but think that this is still just an awkward intermediate phase of development with adolescent LLMs where we need to think about implementation choices at all.
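For a sense of scale, the feature in the quoted example (retries with exponential backoff) is small in code terms once the design questions are settled. A minimal Python sketch, assuming a hypothetical `retry_with_backoff` helper and made-up defaults; this is not Stavrobot's actual code:

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call `call()` until it succeeds or `max_attempts` is exhausted.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Double the delay on each failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so many clients don't retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Most of the back-and-forth described in the quote would go into decisions the sketch hard-codes: which errors count as retryable, how many attempts, and what the caller sees when they run out.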
stavros [3 hidden]5 mins ago
Small features or bugfixes generally take a minute or two of conversation.
cpt_sobel [3 hidden]5 mins ago
In the plethora of all these articles that explain the process of building projects with LLMs, one thing I never understood is why the authors seem to write the prompts as if talking to a human that cares how good their grammar or syntax is, e.g.:
> I'd like to add email support to this bot. Let's think through how we would do this.
and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).
Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)
dgb23 [3 hidden]5 mins ago
I can't speak for everyone, but to me the most accurate answer is that I'm role-playing, because it just flows better.
In the back of my head I know the chatbot is trained on conversations and I want it to reflect a professional and clear tone.
But I usually keep it more simple in most cases. Your example:
> I'd like to add email support to this bot. Let's think through how we would do this.
I would likely write as:
> if i wanted to add email support, how would you go about it
or
> concise steps/plan to add email support, kiss
But when I'm in a brainstorm/search/rubber-duck mode, then I write more as if it was a real conversation.
xnorswap [3 hidden]5 mins ago
I agree, it's just easier to write requirements and refine things as if writing with a human. I no longer care that it risks anthropomorphising it, as that fight has long been lost. I prefer to focus on remembering it doesn't actually think/reason than not being polite to it.
Keeping everything generally "human readable" also has the advantage of being easier for me to review later if needed.
alkonaut [3 hidden]5 mins ago
I also always imagine that if I'm joined by a colleague on this task they might have to read through my conversation and I want to make it clear to a human too.
As you said, that "other person" might be me too. Same reason I comment code. There's another person reading it, most likely that other person is "me, but next week and with zero memory of this".
We do like anthropomorphising the machines, but I try to think they enjoy it...
jstanley [3 hidden]5 mins ago
How can you use these models for any length of time and walk away with the understanding that they do not think or reason?
What even is thinking and reasoning if these models aren't doing it?
dgb23 [3 hidden]5 mins ago
Thinking and reasoning cannot be abstracted away from the individual who experiences the thinking and reasoning itself and changes because of it.
LLMs are amazing, but they represent a very narrow slice of what thinking is. Living beings are extremely dynamic and both much more complex and simple at the same time.
There is a reason for:
- companies releasing new versions every couple of months
- LLMs needing massive amounts of data to train on that is produced by us and not by itself interacting with the world
- a massive amount of manual labor being required both for data labeling and for reinforcement learning
- them not being able to guide through a solution, but ultimately needing guidance at every decision point
xnorswap [3 hidden]5 mins ago
They produce wonderful results, they are incredibly powerful, but they do not think or reason.
Among many other factors, perhaps the most key differentiator for me that prevents me describing these as thinking, is proactivity.
LLMs are never pro-active.
( No, prompting them on a loop is not pro-activity ).
Human brains are so proactive that given zero stimuli they will hallucinate.
As for reasoning, they simply do not. They do a wonderful facsimile of reasoning, one that's especially useful for producing computer code. But they do not reason, and it is a mistake to treat them as if they can.
jstanley [3 hidden]5 mins ago
I personally don't agree that proactivity is a prerequisite for thinking.
But what would proactivity in an LLM look like, if prompting in a loop doesn't count?
An LLM experiences reality in terms of the flow of the token stream. Each iteration of the LLM has 1 more token in the input context and the LLM has a quantum of experience while computing the output distribution for the new context.
A human experiences reality in terms of the flow of time.
We are not able to be proactive outside the flow of time, because it takes time for our brains to operate, and similarly LLMs are not able to be proactive outside the flow of tokens, because it takes tokens for the neural networks to operate.
The flow of time is so fundamental to how we work that we would not even have any way to be aware of any goings-on that happen "between" time steps even if there were any. The only reason LLMs know that there is anything going on in the time between tokens is because they're trained on text which says so.
Also an LLM will hallucinate on zero input quite happily if you keep sampling it and feeding it the generated tokens.
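The "keep sampling it and feeding it the generated tokens" loop can be sketched in a few lines. This is a toy, assuming a hand-written next-token table (`BIGRAMS`, purely made up) in place of a trained model, and it only looks at the last token, whereas a real LLM conditions on the whole context:

```python
import random

# Toy next-token table standing in for a trained model (made up for
# illustration). Each key maps to the tokens that may follow it.
BIGRAMS = {
    "<bos>": ["the"],
    "the": ["cat", "dog"],
    "cat": ["sat", "slept"],
    "dog": ["sat", "barked"],
    "sat": ["<eos>", "and"],
    "slept": ["<eos>"],
    "barked": ["<eos>", "and"],
    "and": ["the"],
}


def generate(max_tokens=10, seed=0):
    """Start from an empty prompt and feed each sampled token back in."""
    rng = random.Random(seed)
    context = ["<bos>"]  # "zero input": nothing but the start marker
    for _ in range(max_tokens):
        next_token = rng.choice(BIGRAMS[context[-1]])
        if next_token == "<eos>":
            break
        context.append(next_token)  # the context grows by one token per step
    return context[1:]


print(" ".join(generate()))
```

Each pass through the loop is one "time step" in the sense above: the only thing that changes between iterations is that the context is one token longer.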
kqr [3 hidden]5 mins ago
I think it mattered a lot more a few years ago, when the user's prompts were almost all context the LLM had to go by. A prompt written in a sloppy style would cause the LLM to respond in a sloppy style (since it's a snazzy autocomplete at its core). LLMs reason in tokens, so a sloppy style leads it to mimic the reasoning that it finds in the sloppy writing of its training data, which is worse reasoning.
These days, the user prompt is just a tiny part of the context it has, so it probably matters less or not at all.
I still do it though, much like I try to include relevant technical terminology to try to nudge its search into the right areas of vector space. (Which is the part of the vector space built from more advanced discourse in the training material.)
tarsinge [3 hidden]5 mins ago
The reasoning is by being polite the LLM is more likely to stay on a professional path: at its core an LLM tries to make your prompt coherent with its training set, and a polite prompt + its answer will score higher (gives a better result) than a prompt that is out of place with the answer. I understand that to some people it could feel like anthropomorphising and could turn them off, but to me it's purely about engineering.
Edit: wording
wiseowise [3 hidden]5 mins ago
> The reasoning is by being polite the LLM is more likely to stay on a professional path
So no evidence.
cpt_sobel [3 hidden]5 mins ago
> If the result of your prompt + its answer it's more likely to score higher i.e. gives better result that a prompt that feels out of place with the answer
Sure seems like this could be the case with the structure of the prompt, but what about capitalizing the first letter of sentence, or adding commas, tag questions etc? They seem like semantics that will not play any role at the end
TheDong [3 hidden]5 mins ago
Why wouldn't capitalization, commas, etc do well?
These are text completion engines.
Punctuation and capitalization are found in polite discussion and textbooks, and so you'd expect those tokens to ever so slightly push the model in that direction.
Lack of capitalization pushes towards text messages and irc perhaps.
We cannot reason about these things in the same way we can reason about using search engines, these things are truly ridiculous black boxes.
cpt_sobel [3 hidden]5 mins ago
> Lack of capitalization pushes towards text messages and irc perhaps.
Might very well be the case, I wonder if there's some actual research on this by people that have some access to the internals of these black boxes.
spudlyo [3 hidden]5 mins ago
Writing is what gives my thinking structure. Sloppy writing feels to me like sloppy thinking. My fingers capitalize the first letter of words, proper nouns and adjectives, and add punctuation without me consciously asking them to do so.
pegasus [3 hidden]5 mins ago
That's orthography, not semantics, but it's still part of the professional style steering the model on the "professional path" as GP put it.
vitro [3 hidden]5 mins ago
For me it is just a good habit that I want to keep.
mrbungie [3 hidden]5 mins ago
I remember studies that showed that being mean with the LLM got better answers, but on the other hand I also remember a study showing that maximizing bug-related parameters ended up with meaner/malignant LLMs.
cpt_sobel [3 hidden]5 mins ago
Surely this could depend on the model, and I'm only hypothesizing here, but being mean (or just having a dry tone) might equal a "cut the glazing" implicit instruction to the model, which would help I guess.
trq01758 [3 hidden]5 mins ago
My view is that when some "for bots only" type of writing becomes a habit, communication with humans will atrophy. Tokens be damned, but this kind of context switch comes at much too high a cost.
raincole [3 hidden]5 mins ago
Because some people like to be polite? Is it this hard to understand? Your hand-written prompts are unlikely to take significant chunk of context window anyway.
cpt_sobel [3 hidden]5 mins ago
Polite to whom?
qsera [3 hidden]5 mins ago
I think it is easier to be polite always and not switch between polite and non-polite mode depending on who you are talking to.
silversmith [3 hidden]5 mins ago
I believe it's less about politeness and more about pronouns. You used `who`, whereas I would use `what` in that sentence.
In my world view, a LLM is far closer to a fridge than the androids of the movies, let alone human beings. So it's about as pointless being polite to it as is greeting your fridge when you walk into the kitchen.
But I know that others feel different, treating the ability to generate coherent responses as indication of the "divine spark".
darkerside [3 hidden]5 mins ago
I'd say it's more related to getting dressed for work even if you're remote and have no video calls
cpt_sobel [3 hidden]5 mins ago
I get what you're saying, but I'm not talking about swearing at the model or anything, I'm only implying that investing energy in formulating a syntactically nice sentence doesn't or shouldn't bring any value, and that I don't care if I hurt the model's feelings (it doesn't have any).
Note, why would the author write "Email will arrive from a webhook, yes." instead of "yy webhook"? In the second case I wouldn't be impolite either, I might reply like this in an IM to a colleague I work with every day.
well_ackshually [3 hidden]5 mins ago
>investing energy
For the vast majority of people, using capital letters and saying please doesn't consume energy, it just is. There's a thousand things in your day that consume more energy like a shitty 9AM daily.
chriswarbo [3 hidden]5 mins ago
> investing energy in formulating a syntactically nice sentence
This seems to be completely subjective; I write syntactically/grammatically "nice" sentences to LLMs, because that's how I write. I would have to "invest energy" to force myself to write in that supposedly "simpler" style.
stavros [3 hidden]5 mins ago
It's just easier for me to write that way. In that specific sentence, I also kind of reaffirmed what was going on in my head and typed my thought process out loud. There's no deeper logic than that, it's just what's easier for me.
jstanley [3 hidden]5 mins ago
"yy webhook" is much less clear. It could just as easily mean "why webhook" as "yes webhook".
It's also actually more trouble to formulate abbreviated sentences than normal ones, at least for literate adults who can type reasonably well.
cpt_sobel [3 hidden]5 mins ago
I confidently assume that the model has been trained on an ungodly amount of abbreviated text and "yy" has always meant "yeah".
> literate adults who can type reasonably well
For me the difference is around 20 wpm in writing speed if I just write out my stream of thoughts vs when I care about typos and capitalizing words - I find real value in this.
layer8 [3 hidden]5 mins ago
> investing energy in formulating a syntactically nice sentence
It would cost me energy to deliberately not write with proper grammar and orthography. I would never want to write sloppily to a colleague either.
jstummbillig [3 hidden]5 mins ago
Anything or anyone. Being polite to your surroundings reflects in your surroundings.
pferde [3 hidden]5 mins ago
Did you thank your keyboard for letting you type this comment?
vikramkr [3 hidden]5 mins ago
For models that reveal reasoning traces I've seen their inner nature as a word calculator show up as they spend way too many tokens complaining about the typo (and AI code review bots also seem obsessed with typos to the point where in a mid harness a few too many irrelevant typos means the model fixates on them and doesn't catch other errors). I don't know if they've gotten better at that recently but why bother. Plus there's probably something to the model trying to match the user's style (it is auto complete with many extra steps) resulting in sloppier output if you give it a sloppier prompt.
stavros [3 hidden]5 mins ago
I write "properly" (and I do say "please" and "thank you"), just because I like exercising that muscle. The LLM doesn't care, but I do.
movpasd [3 hidden]5 mins ago
I prompt politely for two reasons: I suspect it makes the model less likely to spiral (but have no hard evidence either way), and I think it's just good to keep up the habit for when I talk to real people.
lbreakjai [3 hidden]5 mins ago
I just don't want to build the habit of being a sloppy writer, because it will eventually leak into the conversations I have with real humans.
roel_v [3 hidden]5 mins ago
Related to this, has anyone investigated how much typos matter in your chats? I would imagine that typing 'typescfipt' would not be a token in the input training set, so how would the model recognize this as actually meaning 'typescript'? Or does the tokenizer deal with this in an earlier stage?
themantri [3 hidden]5 mins ago
I have tried prompting with a bunch of typos in Claude Code with Sonnet and found it to be fairly tolerant.
It has always done what I meant or asked me a clarifying question (because of my CLAUDE.md instruction).
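On the tokenizer half of the question: subword tokenizers generally don't need the whole word to be in the vocabulary, because an unseen string falls back to shorter pieces. A toy sketch of that fallback, assuming a made-up vocabulary and greedy longest-match splitting (real tokenizers such as BPE use learned merge rules, so the actual pieces will differ):

```python
# Hypothetical vocabulary: a few whole words and fragments plus single letters.
VOCAB = {"typescript", "types", "type", "script",
         "t", "y", "p", "e", "s", "c", "r", "i", "f"}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become their own token."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # nothing matched: emit the raw character
            i += 1
    return tokens


print(tokenize("typescript"))  # ['typescript']
print(tokenize("typescfipt"))  # ['types', 'c', 'f', 'i', 'p', 't']
```

So the typo is still representable, just as more and smaller tokens; whether the model then reads it as "typescript" depends on what it learned from similar misspellings in training, which this sketch says nothing about.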
bob1029 [3 hidden]5 mins ago
With current models this isn't as big of a deal, but why risk being an asshole in any context? I don't think treating something like shit simply because it's a machine is a good excuse.
Also consider the insanity of intentionally feeding bullshit into an information engine and expecting good things to come out the other end. The fact that they often perform well despite the ugliness is a miracle, but I wouldn't depend on it.
wartywhoa23 [3 hidden]5 mins ago
spare_this_one_he_used_to_say_thanks_to_us.jxl
cpt_sobel [3 hidden]5 mins ago
I neither talked about feeding bullshit into it, nor treating it like shit. Around half of the commenters here seem to be missing the middle ground, how is prompting "i need my project to do A, B, C using X Y Z" treating it like shit?
koe123 [3 hidden]5 mins ago
Just stream of consciousness into the context window works wonders for me. More important to provide the model good context for your question
staticassertion [3 hidden]5 mins ago
There is evidence of that, but more importantly, it wouldn't occur to me to write "wanna add email support". That's not my natural voice.
Havoc [3 hidden]5 mins ago
Some people are just polite by nature & habits are hard to break
olalonde [3 hidden]5 mins ago
I suspect they just find it easier and more natural to write with proper grammar.
giuscri [3 hidden]5 mins ago
one reason to do that could be it’s trained on conversations that happened between humans.
dmos62 [3 hidden]5 mins ago
I choose to talk in a respectful way, because that's how I want to communicate: it's not because I'm afraid of retaliation or burning bridges. It's because I am caring and conscious. If I think that something doesn't have feelings or long-term memory, whether it's AI or a piece of rock on the side of a trail, it in no way leads me to be abusive to it.
Further, an LLM being inherently sycophantic leads to it mimicking me, so if I talk to it in a stupid or abusive (which is just another form of stupidity, in my eyes) manner, it will behave stupidly. Or, that's what I'd expect. I've not researched this in a focused way, but I've seen examples where people get LLMs to be very unintelligent by prompting riddles or intelligence tests in highly-stylized speech. I wanted to say "highly-stupid speech", but "stylized" is probably more accurate, e.g.: `YOOOO CHATGEEEPEEETEEE!!!!!!1111 wasup I gots to asks you DIS.......`. Maybe someone can prove me wrong.
cpt_sobel [3 hidden]5 mins ago
My wondering was never about being abusive, rather just having a dry tone and cutting the unnecessary parts, some sort of middle ground if you will. Prompting "yo chatgeepeetee whats good lemme get this feature real quick" doesn't make sense to me mostly because it's anthropomorphizing it, and it's the same concept of unnecessary writing as "Good morning ChatGPT, would you please help me with ..."
dmos62 [3 hidden]5 mins ago
I guess in part I commented not on what you said, but on seeing people be abusive when an LLM doesn't follow instructions or fails to fulfill some expectation. I think I had some pent up feelings about that.
> having a dry tone and cutting the unnecessary parts
That's how I try to communicate in professional settings (AI included). Our approaches might not be that different.
cpt_sobel [3 hidden]5 mins ago
> seeing people be abusive when an LLM doesn't follow instructions or fails to fulfill some expectation. I think I had some pent up feelings about that.
Oh me too, because people are anthropomorphizing the LLM, not because they hurt it. Indirectly, though, I agree that this behaviour can easily affect the way this person would speak to other humans
dmos62 [3 hidden]5 mins ago
To be fair, I do anthropomorphize LLMs. But, I also anthropomorphize, say, a kitchen knife that I accidentally scrape on something (I think "sorry, knife"). I don't reflect on this much; it's just a pleasant way to relate to my environment. What feelings do you have about people anthropomorphizing LLMs?
Anthropomorphizing might not be the right term, because it's about assigning human attributes. When I talk to my dog, for example, I don't contextualize it as giving it human attributes. In a way, talking to something is part of how I engage my relationship-management circuitry. I don't only relate to humans, I relate to everything in one way or another, and kindness is a pretty nice starting point. As I said, I don't think about this much: might come up with something more coherent if I did.
nacozarina [3 hidden]5 mins ago
agree, prompting a token predictor like you’re talking to a person is counterproductive and I too wish it would stop
the models consistently spew slop when one does it, I have no idea where positive reinforcement for that behavior is coming from
kul_ [3 hidden]5 mins ago
LLMs are great at aggregating docs, blogs and other sources out there into a single interface and there has been nothing like it before.
When it comes to coding however, the place where you really need help is the place where you get stuck, and that for most people would be the intersection of domain and tech. LLMs need a LOT of babysitting to be somewhat useful here. If I have to prompt an LLM for hours just to get the correct code, why would I even use it when the tangible output is just a few hundred carefully thought-out lines of code!
christofosho [3 hidden]5 mins ago
I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
I don't think that splitting into subagents that use the same model will really help. I need to clarify this in the post, but the split is 1) so I can use Sonnet to code and save on some tokens and 2) so I can get other models to review, to get a different perspective.
It seems to me that splitting into subagents that use the same model is kind of like asking a person to wear three different hats and do three different parts of the job instead of just asking them to do it all with one hat. You're likely to get similar results.
chriswarbo [3 hidden]5 mins ago
I'm considering using subagents, as a way to manage context and delegate "simple" tasks to cheaper models (if you want to see tokens burn, watch Opus try fixing a misplaced ')' in a Lisp file!).
I see what you mean w.r.t. different hats; but is it useful to have different tools available? For example, a "planner" having Web access and read-only file access, versus a "developer" having write access to files but no Web access?
stavros [3 hidden]5 mins ago
Yes, if you want to separate capabilities, definitely.
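A minimal sketch of that kind of capability split, assuming hypothetical role and tool names (`planner`, `developer`, `web_search`, `read_file`, `write_file`) and a plain allow-list check rather than any particular agent framework's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    name: str
    allowed_tools: frozenset


# Hypothetical roles: the planner can browse and read, the developer can
# read and write files but has no Web access.
PLANNER = AgentRole("planner", frozenset({"web_search", "read_file"}))
DEVELOPER = AgentRole("developer", frozenset({"read_file", "write_file"}))


def call_tool(role, tool_name, handler, *args):
    """Run `handler` only if `tool_name` is on the role's allow-list."""
    if tool_name not in role.allowed_tools:
        raise PermissionError(f"{role.name} may not use {tool_name}")
    return handler(*args)
```

The check lives outside the model, so it holds even when both roles run on the same model with different prompts.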
marcus_holmes [3 hidden]5 mins ago
that reference you give is pretty dated now, based on a talk from August which is the Beforetimes of the newer models that have given such a step change in productivity.
The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)
Can you bolt superpowers onto an existing project so that it uses the approach going forward (I'm using Opencode), or would that get too messy?
eclipxe [3 hidden]5 mins ago
Yes. But gsd is even better - especially gsd2
felixsells [3 hidden]5 mins ago
re: breaking into specialized subagents -- yes, it matters significantly but the splitting criterion isn't obvious at first.
what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.
the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
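A small sketch of the parallelism point, assuming the tasks really are independent. `fake_agent` is a stand-in that sleeps to simulate model latency; nothing here calls a real model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_agent(task: str, seconds: float = 0.01) -> str:
    """Stand-in for an agent call; sleeps to simulate model latency."""
    time.sleep(seconds)
    return f"done: {task}"


def run_serial(tasks: list[str]) -> list[str]:
    """One agent works through every task in turn."""
    return [fake_agent(t) for t in tasks]


def run_parallel(tasks: list[str]) -> list[str]:
    """Each independent task gets its own worker.

    In a real system each worker would also have its own context window.
    map() preserves input order, so results line up with the tasks.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        return list(pool.map(fake_agent, tasks))
```

With N independent tasks the serial version waits for roughly N calls back to back, while the parallel one waits for about the slowest single call. That only holds when no task needs another task's output.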
mihir_kanzariya [3 hidden]5 mins ago
The biggest unlock for me was treating LLMs less like autocomplete and more like a junior dev who needs very specific instructions. Vague prompts get vague code.
I started writing my prompts almost like mini specs. "Here's the function signature, here's what it should return for these inputs, here are the edge cases." That changed everything. The output went from "kinda close" to actually usable.
The other thing that helped was keeping the feedback loop tight. Don't let the LLM generate 200 lines and then try to review it all. Small chunks, verify each one, then move on. Way less time spent debugging weird hallucinated logic.
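One way to make the "mini spec" habit repeatable is to render the prompt from structured fields. This is only a sketch: `FunctionSpec`, `render_prompt`, and the `slugify` example are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionSpec:
    """The three things the comment lists: signature, examples, edge cases."""
    signature: str
    examples: list[tuple[str, str]] = field(default_factory=list)  # (call, expected)
    edge_cases: list[str] = field(default_factory=list)


def render_prompt(spec: FunctionSpec) -> str:
    """Turn the spec into a plain-text prompt with one fact per line."""
    lines = ["Implement this function:", f"    {spec.signature}", ""]
    lines.append("Expected results:")
    lines += [f"- {call} -> {expected}" for call, expected in spec.examples]
    lines.append("")
    lines.append("Edge cases to handle:")
    lines += [f"- {case}" for case in spec.edge_cases]
    return "\n".join(lines)


spec = FunctionSpec(
    signature="def slugify(title: str) -> str",
    examples=[('slugify("Hello World")', '"hello-world"')],
    edge_cases=["empty string returns an empty string"],
)
prompt = render_prompt(spec)
```

Because the examples are already in call/expected form, the same list can double as test cases when verifying each small chunk.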
jbergqvist [3 hidden]5 mins ago
I've found that spending most of my time on design before any code gets written makes the biggest difference.
The way I think about it: the model has a probability distribution over all possible implementations, shaped by its training data. Given a vague prompt, that distribution is wide and you're likely to get something generic. As you iterate on a design with the model (really just refining the context), the distribution narrows towards a subset of implementations. By the time the model writes code, you've constrained the space enough that most of what it produces is actually what you want.
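A toy illustration of that narrowing, and only a toy: it treats the design space as a uniform distribution over a handful of invented choices, which is far simpler than how a model actually weighs implementations. Each design decision filters the candidates, and the entropy drops accordingly.

```python
import math
from itertools import product

# Toy design space: every combination of three independent choices.
# The choices themselves are invented for illustration.
LANGUAGES = ["python", "go", "rust"]
STORAGE = ["sqlite", "postgres", "files"]
INTERFACE = ["cli", "http", "library"]

candidates = list(product(LANGUAGES, STORAGE, INTERFACE))


def entropy_bits(n: int) -> float:
    """Entropy of a uniform distribution over n options, in bits."""
    return math.log2(n)


def constrain(options, position: int, value: str):
    """Keep only candidates whose choice at `position` equals `value`."""
    return [c for c in options if c[position] == value]


# Each design decision made up front removes candidates.
after_language = constrain(candidates, 0, "python")
after_storage = constrain(after_language, 1, "sqlite")
```

Here the space shrinks from 27 candidates to 9 to 3, which is the intuition behind settling the design before any code is generated.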
zihotki [3 hidden]5 mins ago
Just like with many other submissions, I see a great I-shaped senior developer with a developed gut feeling who's able to do big chunks of work.
I wonder how the team members, if any, survive such throughput. I also wonder if there was any quantification applied for the prompts/results, cost analysis, etc.
silisili [3 hidden]5 mins ago
I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code"(paraphrased) sits well with me.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
PAndreew [3 hidden]5 mins ago
Others have already partially answered this, but here’s my 20 cents. Software development really is similar to architecture. The end result is an infrastructure of unique modules with different types of connectors (roads, grid, or APIs). Until now in SW dev the grunt work was done mostly by the same people who did the planning, decided on the type of connectors, etc. Real estate architects also use a bunch of software tools to aid them, but there must be a human being at the end of the chain who understands human needs, understands - after years of studying and practicing - how the whole building and the infrastructure will behave at large and who is ultimately responsible for the end result (and hopefully rewarded depending on the complexity and quality of the end result). So yes we will not need as many SW engineers, but those who remain will work on complex rewarding problems and will push the frontier further.
rurban [3 hidden]5 mins ago
Since I worked as an architect, some comments.
Architecture is fine for big, complex projects. Having everything planned out beforehand keeps costs down, and ensures the customer will not come with late changes. But if costs are expected to be low, and there's no customer, architecture is overkill.
It's like making a movie without following the script line by line (watch Godard in Nouvelle Vague), or building it by yourself or by a non-architect. 2x faster, 10x cheaper.
You immediately see an inflexible, over-architected project.
You can do fine by restricting the agent with proper docs, proper tests and linters.
dgb23 [3 hidden]5 mins ago
The "grunt work" is in many cases just that. As long as it's readable and works it's fine.
But there are a substantial number of cases where this isn't true. The nitty gritty is then the important part and it's impossible to make the whole thing work well without being intimate with the code.
So I never fully bought into the clean separation of development, engineering and architecture.
chii [3 hidden]5 mins ago
> Then what is our use?
You will have to find new economic utility. That's the reality of technological progress - it's just that the tech and white collar industries didn't think it can come for them!
A skill that becomes obsoleted is useless, obviously. There's still room for artisanal/handcrafted wares today, amidst the industrial scale productions, so i would assume similar levels for coding.
hrmtst93837 [3 hidden]5 mins ago
Assuming the 'artisanal' niche will support anything close to the same number of devs is wishful thinking. If you want to stay in this field, you either get good at moving up a level, stitching model output together, checking it against the repo and the DB, and debugging the weird garbage LLMs make up, or you get comfortable charging a premium for the software equivalent of hand-thrown pottery that only a handful of collectors buy.
borski [3 hidden]5 mins ago
LLMs can build anything. The real question is what is worth building, and how it’s delivered. That is what is still human. LLMs, by nature of not being human, cannot understand humans as well as other humans can. (See every attempt at using an LLM as a therapist)
In short: LLMs will eventually be able to architect software. But it’s still just a tool
staticassertion [3 hidden]5 mins ago
> LLMs can build anything.
This is only possibly true if one of two things are true:
1. All new software can be made up of preexisting patterns of software that can be composed. ie: There is no such thing as "novel" software, it's all just composition of existing software.
2. LLMs are capable of emergent intelligence, allowing them to express patterns that they were not trained on.
I am extremely skeptical that either of these is true.
borski [3 hidden]5 mins ago
Fair enough; I can see the exaggeration.
It is not impossible, however, that an LLM could run enough “random” tests to find new ways of doing something, but I hear you.
Let me restate that to “An LLM can build most anything…” and I stand by the rest of my comment.
staticassertion [3 hidden]5 mins ago
I think that makes sense, the rest of your comment is definitely true.
silisili [3 hidden]5 mins ago
What is the use of software eng/architect at that point? It's a tool, but one that product or C levels can use directly as I see it?
borski [3 hidden]5 mins ago
Yes, for building something
But for building the right thing? Doubtful.
Most of a great engineer’s work isn’t writing code, but interrogating what people think their problems are, to find what the actual problems are.
In short: problem solving, not writing code.
mattmanser [3 hidden]5 mins ago
Where's this delusion come from recently that great engineers didn't write code?
What a load of crap.
All you're doing is describing a different job role.
What you're talking about is BA work, and a subset of engineers are great at it, but most are just ok.
You're claiming a part of the job that was secondary, and not required, is now the whole job.
borski [3 hidden]5 mins ago
I never said great engineers didn’t write code. But writing the code was never the point.
The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s my point.
mattmanser [3 hidden]5 mins ago
And a horse breeder was important to transportation until the 1920s, but it doesn't mean their job was transportation.
They didn't magically become great truck drivers.
Programmers do not deliver products, they deliver code to make products.
If the code is no longer needed, nor is the job. A different job will replace it with different skills required.
wiseowise [3 hidden]5 mins ago
> But writing the code was never the point.
Is that why most prestigious jobs grilled you like a devil on algos/system design?
> The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
staticassertion [3 hidden]5 mins ago
It's well understood that programming interviews are a pretty shitty tool. They're a proxy for understanding if you have basic skills required to understand a computer. Notably, most companies don't rely on these alone, they have behavioral questions, architecture questions, etc. Have you ever done an interview at these companies you're talking about? They're 8 hours lol maybe 1 is spent programming.
But it's just very obvious to any software engineer worth anything that code is just one part of the job, and it's usually somewhere in the middle of a process. Understanding customer requirements, making technical decisions, maintaining the codebase, reviewing code changes/ providing feedback, responding on incidents, deciding what work to do or not to do, deciding when a constraint has to be broken, etc. There are a billion things that aren't "typing code" that an engineer does every day. To deny this is absurd to anyone who lives every day doing those things.
borski [3 hidden]5 mins ago
Yeah, this is precisely what I meant.
staticassertion [3 hidden]5 mins ago
I'm genuinely blown away at the attitude lately that developers spend their time programming/ our primary value is code. I guess because we tend to be organizationally isolated people just have no idea? But like... it's so absurd to anyone who does the job. It's like thinking that PM's primary role is assigning tickets, just so obviously false.
I think there's some resentment. I've seen repeatedly now people essentially celebrating that "tech bros" are finally going to see their salaries crash or whatever, it's pretty sick but I've noticed this quite a lot.
borski [3 hidden]5 mins ago
> Is that why most prestigious jobs grilled you like a devil on algos/system design?
No. That’s because interviews have always sucked, and have always been terrible predictors of how you do on the job. We just never had a better way of deciding except paying for a project.
> That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
That’s… not an argument? It’s not even a strawman, it’s just unrelated.
The thing a customer has always paid for was the end product. Not the code. This is absolutely trivial to see, since a customer has never asked to read the code.
0xbadcafebee [3 hidden]5 mins ago
A software engineer will be a person who inspects the AI's work, same as a building inspector today. A software architect will co-sign on someone's printed-up AI plans, same as a building architect today. Some will be in-house, some will do contract work, and some will be artists trying to create something special, same as today. The brute labor is automated away, and the creativity (and liability) is captured by humans.
wiseowise [3 hidden]5 mins ago
> It's a tool, but one that product or C levels can use directly as I see it?
Wait, I thought product and C level people are so busy all the time that they can’t fart without a calendar invite, but now you say they have time to completely replace whole org of engineers?
roncesvalles [3 hidden]5 mins ago
FWIW I find LLMs to be excellent therapists.
The commercial solutions probably don't work because they don't use the best SOTA models and/or sully the context with all kinds of guardrails and role-playing nonsense, but if you just open a new chat window in your LLM of choice (set to the highest thinking paid-tier model), it gives you truly excellent therapist advice.
In fact in many ways the LLM therapist is actually better than the human, because e.g. you can dump a huge, detailed rant in the chat and it will actually listen to (read) every word you said.
borski [3 hidden]5 mins ago
Please, please, please don’t make this mistake. It is not a therapist. At best, it might be a facsimile of a life coach, but it does not have your best interests in mind.
It is easy to convince and trivial to make obsequious.
That is not what a therapist does. There’s a reason they spend thousands of hours in training; that is not an exaggeration.
Humans are complex. An LLM cannot parse that level of complexity.
roncesvalles [3 hidden]5 mins ago
You seem to think therapists are only for those in dire straits. Yes, if you're at that point, definitely speak to a human. But there are many ordinary things for which "drop-in" therapist advice is also useful. For me: mild road rage, social anxiety, processing embarrassment from past events, etc.
The tools and reframing that LLMs have given me (Gemini 3.0/3.1 Pro) have been extremely effective and have genuinely improved my life. These things don't even cross the threshold to be worth the effort to find and speak to an actual therapist.
defrost [3 hidden]5 mins ago
Which professional therapist does your Gemini 3.0/3.1 Pro model see?
Do you think I could use an AI therapist to become a more effective and much improved serial killer?
borski [3 hidden]5 mins ago
I never said therapists were only for those in crisis; that is a misreading of my argument entirely.
An LLM cannot parse the complexity of your situation. Period. It is literally incapable of doing that, because it does not have any idea what it is like to be human.
Therapy is not an objective science; it is, in many ways, subjective, and the therapeutic relationship is by far the most important part.
I am not saying LLMs are not useful for helping people parse their emotions or understand themselves better. But that is not therapy, in the same way that using an app built for CBT is not, in and of itself, therapy. It is one tool in a therapist’s toolbox, and will not be the right tool for all patients.
That doesn’t mean it isn’t helpful.
But an LLM is not a therapist. The fact that you can trivially convince it to believe things that are absolutely untrue is precisely why, for one simple example.
vanviegen [3 hidden]5 mins ago
As you said earlier, therapists are (thoroughly) trained on how to best handle situations. Just 'being human' (and thus empathizing) may not be such a big part of the job as you seem to believe.
Training LLMs we can do.
Though it might be important for the patient to believe that the therapist is empathizing, so that may give AI therapy an inherent disadvantage (depending on the patient's view of AI).
emp17344 [3 hidden]5 mins ago
Socialization with other humans has so many benefits for happiness, mental health, and longevity. Conversely, interaction with LLMs often leads to AI psychosis and harms mental health. IMO, this is pretty strong evidence that interaction with LLMs is not similar to socialization with real humans, and a pretty good indicator that LLM “therapy” is significantly less helpful or even harmful than human-driven therapy.
borski [3 hidden]5 mins ago
Precisely.
borski [3 hidden]5 mins ago
> Just 'being human' (and thus empathizing) may not be such a big part of the job as you seem to believe.
The word “just” is not in my comment anywhere. Being human is necessary, but not sufficient.
And no, you cannot train an LLM to be human.
An LLM is not a therapist. Please do not confuse the two.
You cannot train an LLM on how to be human.
pzs [3 hidden]5 mins ago
While I agree with you, I also find that an LLM can help organize my thoughts and come to realizations that I just didn't get to, because I hadn't explained verbally what I am thinking and feeling. Definitely not a substitute for human interaction and relationships, which can be fulfilling in many-many ways LLM's are not, but LLM's can still be helpful as long as you exercise your critical thinking skills. My preference remains always to talk to a friend though.
EDIT: seems like you made the same point in a child comment.
borski [3 hidden]5 mins ago
Yeah, I agree with all of that. A friend built an “emotion aware” coach, and it is extremely useful to both of us.
But he still sees a therapist, regularly, because they are not the same and do not serve the same purpose. :)
gwbas1c [3 hidden]5 mins ago
> On the flip side, when I’m not familiar enough with the technology to be on top of the architecture, I tend to not catch bad decisions that the LLM makes. This leads to the LLM building more and more on top of those bad decisions, eventually getting in a state where it can’t untangle the mess. You know this happens when you keep telling the LLM the code doesn’t work, it says “I know why! Let me fix it” and keeps breaking things more and more.
That exact thing happens with people too! Specifically when a cheap entrepreneur hires a novice developer and can't give the developer appropriate mentoring and reviews.
stavros [3 hidden]5 mins ago
That's actually very true, I didn't realize. Huh, interesting, thanks.
lbreakjai [3 hidden]5 mins ago
It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using notion as the memory and source of truth.
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
I really love it. All of our other documentation lives in notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
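The board flow described above can be sketched as a small state machine. The column names and the "flagged work goes back to the executor" transition are assumptions for illustration; the actual Notion setup may differ.

```python
from enum import Enum


class Column(Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    QA = "qa"
    FLAGGED = "flagged"
    HUMAN_REVIEW = "human_review"


# Allowed moves between columns (assumed, not taken from the real board).
TRANSITIONS = {
    Column.TODO: {Column.IN_PROGRESS},                # executor picks the task up
    Column.IN_PROGRESS: {Column.QA},                  # executor hands off to QA
    Column.QA: {Column.FLAGGED, Column.HUMAN_REVIEW}, # QA flags or passes
    Column.FLAGGED: {Column.IN_PROGRESS},             # assumption: back to the executor
    Column.HUMAN_REVIEW: set(),                       # a human takes over from here
}


def move(current: Column, target: Column) -> Column:
    """Return the new column, or raise if the board does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move {current.value} -> {target.value}")
    return target
```

Making the transitions explicit means no agent can skip QA and drop a ticket straight into human review.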
Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:
No criticism or anything, but it really does feel / sound like you (and others who embraced LLMs and agentic coding) aspire to be more of a product manager than a coder. Thing is, a "real" PM comes with a lot more requirements and there's less demand for them - more requirements in that you need to be a people person and willing to spend at least half your time in meetings, and less demand because one PM will organize the work for half a dozen developers (minimum).
Some people say LLM assisted coding will cost a lot of developers' jobs, but posts like this imply it'll cost (solve?) a lot of management / overhead too.
Mind you I've always thought project managers are kinda wasteful, as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria. But unfortunately that's not the reality and it's often up to the developers themselves to create the tasks and fill them in, then execute them. Which of course begs the question, why do we still have a PM?
(the above is anecdotal and not a universal experience I'm sure. I hope.)
adampunk [3 hidden]5 mins ago
This seems more about how you view PMs than anything else.
lbreakjai [3 hidden]5 mins ago
I worked with some excellent PMs in the past, it's an entirely different skillset. This wasn't really meant to replace what they do. I really wanted something with which to work at feature-level. That is, after all the hard work of figuring out _what_ to build has been done.
> as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria
That's interesting. In every team I worked in, I always fought really hard against anyone but developers being able to write tickets on the board.
fooster [3 hidden]5 mins ago
“one PM will organize the work for half a dozen developers”
That isn’t the job of a PM.
bigblind [3 hidden]5 mins ago
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I wonder whether at some point we'll get a translation model, that translates relatively vague requests into sound architectural decisions, with some embedded knowledge of the environment you're building in, and that can ask clarifying questions when there are multiple options with different tradeoffs.
mlnj [3 hidden]5 mins ago
Is that not already possible with Markdown spec files and planning mode?
bigblind [3 hidden]5 mins ago
I guess? At least there you can review the plan, but is this planning mode any better at making architectural decisions than when you prompt an LLM and let it make the changes directly? (it might be, just not sure.)
plastic041 [3 hidden]5 mins ago
I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works" and "intimately familiar with each project’s architecture and inner workings", while "have never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.
You tell an LLM to create something, and then use another LLM to review it. It might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
ashwinsundar [3 hidden]5 mins ago
Hot take: you can't have your cake and eat it too. If you aren't writing code, designing the system, creating architecture, or even writing the prompt, then you're not understanding shit. You're playing slots with stochastic parrots
The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.
Not all AI-assisted programming is vibe coding. If you're paying attention to the code that's being produced you can guide it towards being just as high quality (or even higher quality) than code you would have written by hand.
ashwinsundar [3 hidden]5 mins ago
It's appropriate for the commenter I was replying to, who asked how they can understand things, "while having never even read most of their code."
I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand the high-levels of how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase spans into the tens of thousands of lines without me having read any of it.
The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?
stavros [3 hidden]5 mins ago
There are two ways to approach this. One is a priori: "If you aren't doing the same things with LLMs that humans do when writing code, the code is not going to work".
The other one is a posteriori: "I want code that works, what do I need to do with LLMs?"
Your approach is the former, which I don't think works in reality. You can write code that works (for some definition of "works") with LLMs without doing it the way a human would do it.
ashwinsundar [3 hidden]5 mins ago
I re-read your comment a few times, but don't understand what you're saying unfortunately.
> You can write code that works (for some definition of "works") with LLMs without doing it the way a human would do it.
Really having a hard time understanding what this possibly means.
stavros [3 hidden]5 mins ago
It means that just because a human can't read the code doesn't mean the code is not correct. Obfuscators exist, for example, and it's conceivable that the LLM writes perfectly correct code even though it's unmaintainable to us.
imiric [3 hidden]5 mins ago
> Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.
It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?
ChrisGreenHeur [3 hidden]5 mins ago
The hardware you typed this on was designed by hardware architects who write little to no code. They just type up a spec to be implemented by Verilog coders.
thenthenthen [3 hidden]5 mins ago
Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
kqr [3 hidden]5 mins ago
This used to be one of my recurring nightmares when I was a child. The three I remember were (1) clocks suddenly starting to go backwards, either partially or completely; (2) radio turning on without being able to turn it off, and (3) house fire. There really is something about clocks.
benterris [3 hidden]5 mins ago
One thing I don't get with this workflow, and all the ones we see in similar articles: do the authors run their agents in YOLO mode (full unchecked permission on their machine)?
It seems their agents have full edit rights (scoped to a directory, which seems reasonable), but can also run tests autonomously (which means they can run any code), which equates to full read/write access on the machine? I mean, there are ways to sandbox agents in dedicated containers, but it requires quite a bit of setup, and none of these articles mention it, so I guess they are YOLOing it?
neobrain [3 hidden]5 mins ago
Claude has a sandbox mode that uses bubblewrap to build a lightweight filesystem sandbox that only exposes the project directory: https://code.claude.com/docs/en/sandboxing
It's disabled by default though, and in general (especially with other agents) you very much still have to go out of your way to get any sort of reasonable access control indeed.
In principle though, just running the agent CLI in something like firejail would get you very far if you know what you're doing.
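As a rough idea of what wrapping an agent CLI in a sandbox looks like, here is a sketch that builds a `bwrap` (bubblewrap) command line. It is not how Claude's sandbox mode is configured, and it is looser than "only the project directory": the whole filesystem stays readable, only the project is writable, and the network is cut off. The `bwrap_argv` helper and the example path are hypothetical.

```python
import os


def bwrap_argv(project_dir: str, command: list[str]) -> list[str]:
    """Build a bubblewrap command line for running `command` in a sandbox.

    Assumed policy: read-only view of the host, a writable project
    directory, a private /tmp, and no network access.
    """
    project = os.path.abspath(project_dir)
    return [
        "bwrap",
        "--ro-bind", "/", "/",        # host filesystem visible but read-only
        "--dev", "/dev",              # minimal device nodes
        "--proc", "/proc",            # fresh /proc for the sandbox
        "--tmpfs", "/tmp",            # private scratch space
        "--bind", project, project,   # only the project stays writable
        "--unshare-net",              # no network inside the sandbox
        "--die-with-parent",          # kill the sandbox if the launcher dies
        "--chdir", project,
        *command,
    ]
```

A read-only root still lets the sandboxed process read things like SSH keys, so a stricter setup would bind only the paths the agent needs. Passing the resulting list to `subprocess.run` would launch it on a machine with bubblewrap installed.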
zingar [3 hidden]5 mins ago
On using different models: GitHub copilot has an API that gives you access to many different models from many different providers. They are very transparent about how they use your data[1]; in some cases it’s safer to use a model through them than through the original provider.
You can point Claude at the copilot models with some hackery[2] and opencode supports copilot models out of the box.
Finally, copilot is quite generous with the amount of usage you get from a Github pro plan (goes really far with Sonnet 4.6 which feels pretty close to Opus 4.5), and they’re generous with their free pro licenses for open source etc.
Despite having stuck to autocomplete as their main feature for too long, this aspect of their service is outstanding.
The load-bearing line is buried near the top: “On projects where I have no understanding of the underlying technology, the code still quickly becomes a mess of bad choices.” That’s not a caveat.
That’s the precondition the whole system runs on.
The failure mode is invisible. Bad architecture doesn’t look like a crash. It looks like a codebase that works today and becomes unmaintainable.
mk_chan [3 hidden]5 mins ago
In my experience an LLM does 2 things:
1. Bring you up to some average-LLM level when you don’t have the skills/knowledge to actually do what you want.
2. Work at 80-90% of your capacity but WAY faster than you physically could depending on how much context you provide it. If you don’t provide it sufficient context to do what/how you want it to do, of course it might default to something you don’t want.
zingar [3 hidden]5 mins ago
Big +1 for opencode which for my purposes is interchangeable or better than Claude and can even use anthropic models via my GitHub copilot pro plan. I use it and Claude when one or the other hits token limits.
Edit: a comment below reminded me why I prefer opencode: a few pages in on a Claude session and it’s scrolling through the entire conversation history on every output character. No such problem on OC.
devlinMckhay [3 hidden]5 mins ago
the failure mode section is the most honest thing i have read about this whole thing. I hit that exact wall building something evenings after work. you miss one bad architectural decision because you are tired or in a hurry, and three sessions later the llm is confidently making it worse and you are not even sure when it started going wrong. the only thing that helped was slowing down on the planning side even when i did not feel like i had time for it.
oytis [3 hidden]5 mins ago
I find the same problem applying to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you have essentially two reviewers instead of a writer and a reviewer, and there is no etiquette mandating how thoroughly the "author" should review their PR yet. It doesn't help if the amount of code to review gets larger (why would you go into agentic coding otherwise?)
takwatanabe [3 hidden]5 mins ago
We build and run a multi-agent system. Today Cursor won.
For a log analysis task — Cursor: 5 minutes. Our pipeline: 30 minutes.
Still a case for it:
1. Isolated contexts per role (CS vs. engineering) — agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging. But the structure has value.
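Two of those points, isolated contexts and cheap local models for routine work, can be sketched in a few lines. Everything named here is hypothetical: the model identifiers are placeholders rather than real endpoint names, and the set of "routine" task kinds is an assumption.

```python
from dataclasses import dataclass, field

# Placeholder model identifiers, not real endpoint names.
LOCAL_MODEL = "local-small"
HOSTED_MODEL = "hosted-large"

# Assumed set of "routine" task kinds; tune to your own workload.
ROUTINE_TASKS = frozenset({"format_code", "rename_symbol", "summarize_log"})


def pick_model(task_kind: str) -> str:
    """Send routine work to the cheap local model, the rest to the big one."""
    return LOCAL_MODEL if task_kind in ROUTINE_TASKS else HOSTED_MODEL


@dataclass
class Agent:
    role: str
    model: str
    history: list[str] = field(default_factory=list)  # this agent's private context

    def send(self, message: str) -> None:
        # Messages land only in this agent's history, never a shared one.
        self.history.append(message)


support = Agent("customer-support", pick_model("summarize_log"))
engineer = Agent("engineering", pick_model("design_schema"))
support.send("Customer reports a timeout on login.")
```

Since each `Agent` owns its own `history`, nothing the support agent sees reaches the engineering agent unless the orchestrator passes it along explicitly.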
gehsty [3 hidden]5 mins ago
When I use Claude code to work on a hobby project it feels like doom scrolling…
I can’t get my head around if the hobby is the making or the having, but fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately so leaning towards the former.
Levitating [3 hidden]5 mins ago
Agreed, I code for fun. But I am not sure if I still find it fun if the LLM just makes what I want.
jumploops [3 hidden]5 mins ago
This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:
The main difference between my workflow and the authors, is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
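A minimal sketch of anchoring decisions in timestamped Markdown files, assuming one file per decision with a UTC timestamp in the name. The file layout and the helper names are invented for illustration; the commenter does not describe their exact format.

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def decision_filename(title: str, when: datetime) -> str:
    """Build a sortable file name; `when` is assumed to be in UTC."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{when:%Y%m%dT%H%M%SZ}-{slug}.md"


def render_decision(title: str, body: str, when: datetime) -> str:
    """Render the Markdown content for one decision record."""
    return f"# {title}\n\nRecorded: {when.isoformat()}\n\n{body}\n"


def write_decision(directory: Path, title: str, body: str) -> Path:
    """Write a new decision file and return its path."""
    when = datetime.now(timezone.utc)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / decision_filename(title, when)
    path.write_text(render_decision(title, body, when), encoding="utf-8")
    return path
```

Putting the timestamp first makes a plain directory listing read as a decision log, which is handy when a fresh session has to catch up from files alone.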
A minor difference between myself and the author, is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).
I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
stavros [3 hidden]5 mins ago
I don't know if I explained this clearly enough in the article, but I have the LLM write the plan to a file as well. The architect's end result is a plan file in the repo, and the developer reads that.
> The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
>
> This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.
aix1 [3 hidden]5 mins ago
Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.
Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
My workflow for adding a feature goes something like this:
1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.
2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.
3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.
3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.
4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.
4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.
5. Claude implements the feature.
5a. (Optionally) another instance reviews the implementation.
For complex changes, I'm quite disciplined about having each step carried out in a different session so that all communications are done via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).
Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them & re-create if they're ever needed again.
I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
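The "~5X smaller per level" observation above is easy to check on your own repo. A rough sketch, assuming sizes are measured in bytes and that each level can be selected with glob patterns; the function names and any directory layout you pass in are hypothetical, not part of the workflow described:

```python
from pathlib import Path
from typing import Iterable


def total_bytes(root: Path, patterns: Iterable[str]) -> int:
    """Sum the sizes of all files under root matching any glob pattern."""
    seen = set()  # a file matching two patterns is counted once
    for pattern in patterns:
        for path in root.rglob(pattern):
            if path.is_file():
                seen.add(path)
    return sum(p.stat().st_size for p in seen)


def level_ratios(sizes: list[tuple[str, int]]) -> list[tuple[str, str, float]]:
    """For each pair of consecutive levels, return lower_size / upper_size.

    `sizes` is ordered top-down, e.g. requirements, design docs, code+tests.
    """
    ratios = []
    for (upper, upper_size), (lower, lower_size) in zip(sizes, sizes[1:]):
        if upper_size == 0:
            raise ValueError(f"level {upper!r} is empty")
        ratios.append((upper, lower, lower_size / upper_size))
    return ratios
```

A ratio drifting far from the usual value could be one cheap signal for the consistency checks mentioned above, for example design docs that have stopped tracking the code.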
Havoc [3 hidden]5 mins ago
Yeah same. The markdown thing also helps with the multi model thing. Can wipe context and have another model look at the code and markdown plan with fresh eyes easily
shell0x [3 hidden]5 mins ago
Isn't that what Amp code essentially does? I've used Codex and Claude Max but I keep going back to Amp https://ampcode.com/models
It uses different models for different modes.
I just find it to be faster and it often gets things right at the first attempt, but YMMV.
pathikrit [3 hidden]5 mins ago
10-15 years ago HN was the place where everyone showed off their new shiny toys and people would jump over themselves to try some new framework or db or tool. Now, so much negative sentiment about AI coding. I bet if LLMs came 15 years ago, HN would be brimming with excitement. What happened?
timdellinger [3 hidden]5 mins ago
The difference is that CEOs fired a lot of people, using "AI can replace human coders" as an excuse. Also: there are claims all over the headlines along the lines of "We built something amazing without human coders."
Both claims are loud and are flooding the discussion, but under the hood it's mostly a slop disaster.
So the negative sentiment is a natural response (and a dose of realism).
dncornholio [3 hidden]5 mins ago
You should be critical. If you made something, you should prove that it works. Especially in today's world. This article contains no proof that their work actually works.
dml2135 [3 hidden]5 mins ago
Oligarchy.
codeflo [3 hidden]5 mins ago
I know the argument I'm going to make is not original, but with every passing week, it's becoming more obvious that if the productivity claims were even half true, those "1000x" LLM shamans would have toppled the economy by now. Where are the slop-coded billion dollar IPOs? We should have one every other week.
wiseowise [3 hidden]5 mins ago
They’re busy writing applications for their dogs and building “jerk me off” functionality into their OpenClaw fork. Once they’re done you’ll be sorry you ever asked.
zingar [3 hidden]5 mins ago
Writing pieces of code that beat average human level is solved. Organizing that code is on its way to being solved (posts like this hint at it). Finding problems that people will pay money to have solved by software is a different, entirely more complicated matter (tbh I doubt anyone could prove right now that this absolutely is or isn’t solvable - but given the change we’ve seen already I place no bets against AI).
Also, even if agents could do everything, the societal obstacles to change are extensive (sometimes for very good, sometimes for bad reasons) so I’m expecting it to take another year or two for serious change to occur.
user34283 [3 hidden]5 mins ago
Last time I read about a Codex update, I think it mentioned that a million developers tried the tool.
Don't most companies use AI in software development today?
And yes, I know that some companies are not doing that because of privacy and reliability concerns or whatever. With many of them it's a bit of a funny argument considering even large banks managed to adopt agentic AI tools. Short of government and military kind of stuff, everybody can use it today.
peterweisz [3 hidden]5 mins ago
Great article. I'd recommend making guardrails and benchmarking an integral part of prompt engineering. Think of it as a kind of system prompt for your Opus 4.6 architect: LangChain, RAG, LLM-as-a-judge, MCP. When I think about benchmarks I always ask it to research external DBs or other resources to use as a reference guardrail.
fedeb95 [3 hidden]5 mins ago
This is interesting and goes beyond the usual AI hype. It's the beginning of a structured and efficient use of new tools (aka software engineering).
edlebert_f [3 hidden]5 mins ago
It usually boils down to how efficient you are with your spoken/written language. Basically, LLM generated code tends to be a reflection of the thought efficiency of the human.
I write very little code these days, so I've been following the AI development mostly from the backseat. One aspect I fail to grasp perfectly is what the practical differences are between CLI (so terminal-based) agents and ones fully integrated into an IDE.
Could someone chime in and give their opinion on what are the pros and cons of either approach?
zingar [3 hidden]5 mins ago
I guess you’re probably looking for someone who uses cursor etc to answer but here’s a data point from someone a bit off the beaten path.
My editor supports both modes (emacs). I have the editor integration features (diff support etc) turned off and just use emacs to manage 5+ shells that each have a CLI agent (one of Claude, opencode, amp free) running in them.
If I want to go deep into a prompt then I’ll write a markdown file and iterate on it with a CLI.
kleiba [3 hidden]5 mins ago
I noticed that OpenCode requires per their own website "a modern terminal emulator" - so, no problem in Emacs? Are you running M-x term?
zingar [3 hidden]5 mins ago
I have my own function that starts up a vterm in the root of the repo that I’m in. It is average for running Claude (long sessions hit the bug where every output character scrolls through the whole history) but actually better at running opencode, which doesn’t have this problem.
rullelito [3 hidden]5 mins ago
For me, I use an IDE if I plan to look at the code.
kleiba [3 hidden]5 mins ago
So, to you basically the distinction is "fully vibe-coded" vs. "with human in the loop"?
user34283 [3 hidden]5 mins ago
I don't think there is a meaningful difference.
Whether I use Antigravity, VS Code with Claude Code CLI, GitHub Copilot IDE plugins, or the Codex app, they all do similar things.
Although I'd say Codex and Claude Code often feel significantly better to me, currently. In terms of what they can achieve and how I work with them.
xhale [3 hidden]5 mins ago
Hi, does anyone have a simple example/scaffold of how to set up agents/skills like this? I’ve looked at the stavrobots repo and only saw an AGENTS.md. Where do these skills live then?
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and has a UI focus)
Ultimately, it's just a bunch of markdown files that live in an `/agents` folder, with some meta-information that will depend on the harness you use.
stavros [3 hidden]5 mins ago
These skills live in my home directory, that's why they aren't in the repos. I can upload them if you want.
kkarolis [3 hidden]5 mins ago
Not original commenter, but would be curious (and thankful) to see it.
stavros [3 hidden]5 mins ago
I've updated the post with them, let me know if they work!
sdevonoes [3 hidden]5 mins ago
Agent bots are the new “TODO” list apps. Seems cool and all, but I wish I could see someone writing useful software with LLMs, at least once.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
mkovach [3 hidden]5 mins ago
What hurts the most is that the em dash used to be a small, rebellious literary act that I truly enjoyed employing. A simple, useful hinge in a sentence where it could change its mind. Now? It indicates when an LLM got too frisky with clause boundaries and maintains a phobia of semicolons.
rednafi [3 hidden]5 mins ago
The world could use one less "how I slop" article at this point.
This reminds me of the early Medium days when everyone would write articles on how to make HTTP endpoints or how to use Pandas.
There’s not much skill involved in hauling agents, and you can still do it without losing your expertise in the stuff you actually like to work with.
For me, I work with these tools all the time, and reading these articles hasn’t added anything to my repertoire so far. It gives me the feeling of "bikeshedding about tools instead of actually building something useful with them."
We are collectively addicted to making software that no one wants to use. Even I don’t consistently use half the junk I built with these tools.
Another thing is that everyone yapping about how great AI is isn’t actually showing the tools’ capabilities in building greenfield stuff. In reality, we have to do a lot more brownfield work that’s super boring, and AI isn’t as effective there.
prpl [3 hidden]5 mins ago
I am enjoying the RePPIT framework from Mihail Eric. I think it’s a better formalization of developing without resorting to personas.
ilovetux [3 hidden]5 mins ago
I like the approach outlined in the article. These days having a roadmap for yourself while cruising at highway speeds helps make sense of the chaos.
One big pain point that has existed forever and has never really been addressed adequately is the ability to come up with requirements.
Sure, it sounds easy: I need the app to do x, y and z. But requirements change in real time. Lack of foresight, changing business needs, unexpected roadblocks and more all contribute.
So the advice to come up with the requirements by yourself or with the LLM misses the biggest pain point.
I'd like to see a resurgence of flow charts, IPO (Input, Processing and Output) charts and other tools to organize requirements spring up to help with really nailing down requirements.
I will say, though, some of the pain is relieved because the agent can perform a huge refactor in a couple of minutes, but that opens a whole new can of worms.
neonstatic [3 hidden]5 mins ago
> Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly true because LLMs are so good???
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.
- Good at generating code that looks great at first glance, but has many unexplained assumptions and gaps
- Despite lack of access to the compiler (Opus 4.6 via Web), most of the time code compiles or there are trivially fixable issues before it gets to compile
- Has a hard to explain fixation on doing things a certain way, e.g. always wants to use panics on errors (panic!, unreachable!, .expect etc) or wants to do type erasure with Box<dyn Any> as if that was the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.
So... imo it is a new generation compiler + code gen tool, that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward would mean just talking to a model all day every day, I probably would look for other career options.
mbbutler [3 hidden]5 mins ago
> Has a hard to explain fixation on doing things a certain way, e.g. always wants to use panics on errors (panic!, unreachable!, .expect etc) or wants to do type erasure with Box<dyn Any> as if that was the most idiomatic and desirable way of doing things
Yes! I see this constantly. I have a Rust guide that Claude adheres to maybe 50% of the time. It also loves to allocate despite my guide having a whole section about different ways to avoid allocations.
OtomotO [3 hidden]5 mins ago
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
Which is Fortuna's work... stochastic models are like that. And confirmation bias is another phenomenon as well as "how do LLMs align with my worldview" whether I see them more positively or more negatively.
zapkyeskrill [3 hidden]5 mins ago
What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself)
zingar [3 hidden]5 mins ago
Solidifying the ideas in writing helps the author improve them, and helps them and the rest of us understand what to look for in the next generation of models.
imiric [3 hidden]5 mins ago
Ah, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
indigodaddy [3 hidden]5 mins ago
Dude's not a social climber. You are making a lot of assumptions.
indigodaddy [3 hidden]5 mins ago
This was on the front page and then got completely buried for some reason. Super weird.
mjmas [3 hidden]5 mins ago
On the front page at the moment. Position 12
indigodaddy [3 hidden]5 mins ago
Maybe I missed it. Sometimes when you're scanning for something your brain intentionally doesn't want to see it, I've noticed. Anyway I'm not Stavros obviously, just thought this was a good article.
The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming $0.3 and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- The reason we humans separate jobs is that we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts, and good security experts. Apparently, that's not a problem for an LLM. So job separation is probably not as necessary a pattern as it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of processes, to the point that processes turn into bureaucracy. In IT companies, many problems sit at the interface between groups, because of the low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
This is similar to telling Claude Code to write its steps into a separate markdown file, or use separate agents to independently perform many tasks, or some of the other things that were commonly posted about 3-6+ months ago. Now Claude Code does that on its own if necessary, so it's probably a net negative to instruct it separately.
Some prompting techniques seem ageless (e.g. giving it a way to validate its output), but a lot of these feel like temporary scaffolding that I don't see a lot of value in building a workflow around.
You can contrast this to something like reasoning, which offered very large, very clear improvements in fundamental performance, and as a result was tackled very aggressively by all the labs. Or (like you mentioned) todo lists, which gave relatively small gains but were implemented relatively quickly. Automatic context control is just going to take more time to get it right, and the gains will be quite small.
We know input/output pairs, when using a reasoning model we can see a separate stream of text that is supposedly insight into what the model is "thinking" during inference, and when using multiple agents we see what text they send to each other. That's it.
Aider did an "architect-editor" split where architect is just a "programmer" who doesn't bother about formatting the changes as diff, then a weak model converts them into diffs and they got better results with it. This is nothing like human teams though.
Some people have turned context control into hallucinated anthropomorphic frameworks (Gas Town being perhaps the best example). If that's how they prefer to mentally model context control, that's fine. But it's not the anthropomorphism that's helping here.
What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.
In my experience, evidence for the efficacy of software engineering practices falls into two categories:
- the intuitions of developers, based on their experiences.
- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.
Evidence for this LLM pattern is the same. Some developers have an intuition it works better.
I personally like and use tests, formal verification, and so on. But the evidence for these methods is weak.
edit: To be clear, I am not ragging on the researchers. I think it's just kind of an inherently messy field with pretty much endless variables to control for and not a lot of good quantifiable metrics to rely on.
Also, lines of code is not a completely meaningless metric. What one should measure is lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you may still have an off-by-one error.
Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.
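To make "defect density" concrete, here is a minimal sketch of the usual defects-per-KLOC calculation. It uses plain line counts rather than "lines not verified by the compiler", which would need language-specific analysis; the `Release` record and its field names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical record for one tool or workflow being compared."""
    tool: str
    customer_defects: int  # defects reported by customers after release
    lines_of_code: int     # size of the code shipped in that release


def defect_density(release: Release) -> float:
    """Customer-facing defects per 1,000 lines of code (KLOC)."""
    if release.lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return release.customer_defects / (release.lines_of_code / 1000)


def rank_by_density(releases: list[Release]) -> list[tuple[str, float]]:
    """Return (tool, density) pairs, lowest defect density first."""
    return sorted(
        ((r.tool, defect_density(r)) for r in releases),
        key=lambda pair: pair[1],
    )
```

The arithmetic is the easy part. Whether two releases are comparable at all (same domain, same reporting discipline, same defect severity mix) is not something this sketch can answer.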
My issue with the observational studies is that basically everything is uncontrolled. Maybe the IDEs are causing fewer defects. Or maybe some problems are just harder and more defect-prone than others. Maybe some teams are better managed or get clearer specifications and so on. Maybe some organizations are better at recording defects and can't be fairly compared with organizations that just report less. The studies don't ever reach the scale where you become confident these things wash out.
If they control the problem, most of those issues are eliminated (though not all, for example the experience and education of the participants still needs to be controlled), but now you are left wondering how well the findings transfer from the toy projects in the experiment into real life.
But finally, it's still not a perfect metric because not all defects are equal, right? What if some tool/process helped you reduce a large number of mostly cosmetic defects, but increases the occurrence of catastrophic defects?
re: LoC, there's some signal here, but it's such a noisy channel, I've never read a study that I thought put it to good use. Happy to have my mind changed if you have a link to one.
Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, tech stack. There are many style differences between LLM and human generated code, so that I expect 1000 lines of LLM code do a lot less than 1000 lines of human code, even in the exact same codebase.
See: OOP
Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions. The worker can not edit the plan. The QA or planner can't modify the code. This is something I sometimes catch codex doing, modifying unrelated stuff while working.
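A per-role write boundary like the one described above can be enforced outside the model, for example by checking the changed paths after each agent run. This is only a sketch under assumptions: the role names, glob patterns and the `WRITABLE` table are made up for illustration and do not come from any particular harness.

```python
from fnmatch import fnmatchcase

# Hypothetical role -> writable glob patterns. Names and paths are illustrative.
WRITABLE = {
    "planner": ["plans/*.md"],
    "worker": ["src/*", "tests/*"],
    "qa": ["reviews/*.md"],
}


def may_write(role: str, path: str) -> bool:
    """True if the role is allowed to modify the given repo-relative path.

    fnmatchcase's '*' also matches '/', so 'src/*' covers nested files.
    Unknown roles get no write access at all.
    """
    return any(fnmatchcase(path, pattern) for pattern in WRITABLE.get(role, []))


def rejected_writes(role: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall outside the role's boundary."""
    return [p for p in changed_paths if not may_write(role, p)]
```

Feeding it the list of files an agent touched (for instance from the working tree diff) and reverting anything in `rejected_writes` would catch the "modifying unrelated stuff" case without relying on the prompt.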
Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason. Nothing that seemed basically actively malicious such as the above though. Still, gsdv2 is a 1-agent scaffolding pipeline.
I think the issue is that these 1-agent pipelines are "YOU MUST PLAN IMPLEMENT VERIFY EVERYTHING YOURSELF!" and extremely aggressive language like that. I think that kind of language coerces the agent to do actively malicious hacks, especially if the pipeline itself doesn't see "I am blocked, shifting tasks" as a valid outcome.
1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.
Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.
Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.
What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."
"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.
Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.
I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.
Is it useful to tell something “you are an architect?” I doubt it but I don’t have proof apart from getting reasonable results without it.
With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.
There's a 63-page paper with mathematical proof if you're really into this.
https://arxiv.org/html/2601.03220v1
My takeaway: AI learns from real-world texts, and real-world corpora tend to reflect a role split of architect/developer/reviewer
> There's a 63 page paper with mathematical proof if you really into this.
> https://arxiv.org/html/2601.03220v1
I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.
I am not an expert, but by my understanding, the paper proves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation. aka you can't always one-shot perfect code.
However, arranging many pipelines of role "observers" may gradually get you there
At the same time I can see a more linear approach doing something similar. Like when I ask for an implementation plan, that is functionally not all that different from an architect agent even if not wrapped in such a persona
They track everything we all do in a chat, then learn the patterns that work and build them in. Rinse and repeat.
So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.
Then you execute it with a clean context.
Clean context is needed for maximum performance while not remembering implementation dead ends you already discarded
If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.
Context & how LLMs work requires this.
From my experience no frontier model produces bug-free & error-free code on the first pass, no matter how much planning you do beforehand.
With 3 tiers, you spend your token & context budget in full in 3 phases. Plan, implement, review.
If the feature is complex, multiple rounds of review, from scratch.
It works.
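The plan, implement, review flow with repeated review rounds can be sketched as a small loop. This is an illustration only: `plan`, `implement` and `review` are placeholder callables standing in for fresh-context sessions, and the round limit is an assumed parameter, not something stated above.

```python
from typing import Callable, List

# Placeholder signatures; each callable stands in for one clean-context session.
Planner = Callable[[str], str]                 # task -> spec
Implementer = Callable[[str, List[str]], str]  # spec, findings -> code
Reviewer = Callable[[str, str], List[str]]     # spec, code -> findings


def run_three_phase(
    task: str,
    plan: Planner,
    implement: Implementer,
    review: Reviewer,
    max_review_rounds: int = 3,
) -> str:
    """Plan once, implement, then review from scratch until clean or out of rounds."""
    spec = plan(task)
    code = implement(spec, [])
    for _ in range(max_review_rounds):
        # The reviewer only sees the spec and the code, never earlier findings.
        findings = review(spec, code)
        if not findings:
            break
        code = implement(spec, findings)
    return code
```

Passing only the spec and the current code to each review is what "from scratch" means here: the reviewer does not inherit dead ends from earlier rounds.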
Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.
I think the author admits that it doesn't, doesn't realise it and just goes on:
--- start quote ---
On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet
--- end quote ---
Well I was until the session limit for a week kicked in.
Maybe you should write and share your own article to counter this one.
[1] https://github.com/skorokithakis/stavrobot
Also document some best practices in AGENT.md or whatever it's called in your app.
Eg
And so on.
I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.
EDIT: My bad, the code eventually calls into dedicated functions from database.ts, so those 200 lines are mostly just validation and error handling. I really just skimmed the code and the amount of it made me assume that it actually implements the functionality somewhere in there.
Example, Agent.ts, line 93, function createManageKnowledgeTool() [1]. I would have expected something like the following and not almost 200 lines of code implementing everything in place. This also uses two stores of some sort - memory and scratchpad - and they are also not abstracted out, upsert and delete deal with both kinds directly.
[1] https://github.com/skorokithakis/stavrobot/blob/master/src/a...
Do you have any evidence of this?
So the summary of the anecdata to me is that the model itself certainly isn't incentivized to do anything in particular here, it's the tooling that's putting its finger on the scale (and different tooling nudges things in different directions).
When you say "is not great code" can you elaborate? Does the code work or not?
I don't know if there are underlying bugs, but I haven't hit any, and the architecture (which I do know about) is sane.
[1] https://pine.town/
It's by no means the best LLMs can do.
Good luck convincing your boss that this ungodly amount of time spent messing around with your tooling for an immeasurable improvement in your delivery is the time well spent as opposed to using that same amount of time delivering results by hand.
Luckily for me, I'm fortunate enough to not have to work in that sort of environment.
Sadly yes. But it "works", for some definition of working. We all know it's going to be a maintenance nightmare given the gigantic amount of code and projects now being generated ad infinitum. As someone commented in this thread: it can one-shot an app showing restaurant locations on a map and put a green icon if they're open. But don't expect good code, secure code, performant code and certainly not "maintainable code".
By definition, unless the AIs can maintain that code, nothing is maintainable anymore: the reason being the sheer volume. Humans who could properly review and maintain code (and that's not many) are already outnumbered.
And as more and more become "prompt engineers" and are convinced that there's no need to learn anything anymore besides becoming a prompt engineer, the amount of generated code is only going to grow exponentially.
So to me it is the kind of code you should expect. It's not perfect. But it more or less works. And thankfully it shouldn't get worse with future models.
What we now need is tools, tools and more tools to help keep these things on track. If we are ever to get some peace of mind about the correctness of this unreviewable generated code, we'll need to automate things like theorem provers and code coverage (which are still nowhere to be seen).
And just like all these models are running on Linux and QEMU and Docker (dev container) and heavily using projects like ripgrep (Claude Code insists on having ripgrep installed), I'm pretty sure all these tools these models rely on and shall rely on to produce acceptable results are going to be, for the most part, written by humans.
I don't know how to put it nicely: an app showing a green icon next to open restaurants on a map ain't exactly software to help lift off a rocket or to pilot an MRI machine.
BTW: yup, I do have and use Claude Code. Color me both impressed and horrified by the amount of "working" but unmaintainable mess it can spout. Everybody who understands something about software maintenance should be horrified.
Since AI can read and generate code, it can surely fix code, or find bugs, or address security flaws. And if this all turns into a hot mess, AI can just refactor the whole thing anyway. And so forth.
Personally, I think we'll be some years off before the whole software loop is closed by AI (if it even happens anyway).
It's no shame to be critical in today's world. Delivering proof is something that holds extra value, and if I were to write an article about the wonderful things I've created, I'd be extra sure to show it.
I looked at your clock project and when I saw that your updated version and improved version of your clock contained AI artifacts, I concluded that there's no proof of your work.
Sorry to have made that conclusion and I'm sorry if that hurt your feelings.
For example, you talk about how the code is secure. How do you prove that it is secure?
People here see an LLM-assisted project and suddenly they've never written a bug in their life.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
For instance, I've commented before that I tend to decompose tasks intended for AI to a level where I already know the "shape" of the code in my head, as well as what the test cases should look like. So reviewing the generated code and tests for me is pretty quick because it's almost like reading a book I've already read before, and if something is wrong it jumps out quickly. And I find things jumping out more and more infrequently.
Note that decomposing tasks means I'm doing the design and architecture, which I still don't trust the AI to do... but over the years the scope of tasks has gone up from individual functions to entire modules.
In fact, I'm getting convinced vibe coding could work now, but it still requires a great deal of skill. You have to give it the right context and sophisticated validation mechanisms that help it self-correct as well as let you validate functionality very quickly with minimal looks at the code itself.
They said it couldn't fix an issue it made.
I asked if they gave it any way to validate what it did.
They did not. Some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x and validate it by accessing the endpoint afterwards and writing tests".
It's shocking that some people don't give it any real instruction or way to check itself.
In addition I get great results doing voice to text with very specific workflows. Asking it to add a new feature where I describe what functions I want changed then review as I go vs wait for the end.
It's not shocking. The tech world is telling them that "Claude will write all of their app easily" with zero instructions/guidelines so of course they're going to send prompts like that.
Two areas I've really appreciated LLMs so far... one is being able to make web components that do one thing well in encapsulation.. I can bring it into my project and just use it... AI can scaffold a test/demo app that exercises the component with ease and testing becomes pretty straight forward.
The other for me has been in bridging rust to wasm and even FFI interfaces so I can use underlying systems from Deno/Bun/Node with relative ease... it's been pretty nice all around to say the least.
That said, this all takes work... lots of design work up front for how things should function... whether it's a UI component or an API backend library. From there, you have to add in testing, and some iteration to discover and ensure there aren't behavioral bugs in place. Actually reviewing code and especially the written test logic. LLMs tend to over-test in ways that are excessive or redundant a lot of the time. Especially when a longer test function effectively also tests underlying functionalities that each had their own tests... cut them out.
There's nothing "free" and it's not all that "easy" either, assuming you actually care about the final product. It's definitely work, but it's more about the outcome and creation than the grunt work. As a developer, you'll be expected to think a lot more, plan and oversee what's getting done as opposed to being able to just bang out your own simple boilerplate for weeks at a time.
I promptly went and fixed this before doing any more work, because I know if I was put in that situation I would refuse to do any more work until I could actually use the app properly. In general, if you wouldn't be able to solve a problem with the tools you give an LLM, it will probably do a bad job too.
At least the LLM will only take 5 minutes to tell you they don't know what to do.
And I’ve never asked Claude Code something which is really impossible, or even really difficult.
Still, sometimes claude will tell me off even when I don't give it alternatives. Last night I told it to use luasocket from an mpv userscript to connect to a zeromq Unix socket (and also implement zmq in pure lua) connected to an ffmpeg zmq filter to change filter parameters on the fly. Claude code all but called me stupid and told me to just reload the filter graph through normal mpv means when I make a change. Which was a good call, but I told it to do the thing anyway and it ended up working well, so what does it really know... Anyway, I like that it pushes back, but agrees to commit when I insist.
Over time, you build up the right reflexes that avoid a one-week goose chase with them. Heck, since we're working with people, you don't just say "fix this", you earmark time to make sure everyone is aligned on what needs done and what the plan is.
In my experience, the LLM will happily try the wrong thing over and over for hours. It rarely will say it doesn’t know.
It’s gotten a lot better.
This works about 85% of the time IME, in Claude Code. My normal workflow on most bugs is to just say “fix this” and paste the logs. The key is that I do it in plan mode, then thoroughly inspect and refine the plan before allowing it to proceed.
I only carefully review the parts of the implementation that I know “work on my machine but will break once I put in a real world scenario”. Even before AI I wasn’t one of the people who got into geek wars worrying about which GOF pattern you should have used.
All except for concurrency where it’s hard to have automated tests, I care more about the unit or honestly integration tests and testing for scalability than the code. Your login isn’t slow because you chose to use a for loop instead of a while loop. I will have my agents run the appropriate tests after code changes
I didn’t look at a line of code for my vibe coded admin UI authenticated with AWS cognito that at most will be used by less than a dozen people and whoever maintains it will probably also use a coding agent. I did review the functionality and UX.
Code before AI was always the grind between my architectural vision and implementation
For instance, GCC will inline functions, unroll loops, and myriad other optimizations that we don't care about (and actually want!). But when we review the ASM that GCC generates we are not concerned with the "spaghetti" and the "high coupling" and "low cohesion". We care that it works, and is correct for what it is supposed to do.
Source code in a higher-level language is not really different anymore. Agents write the code, maybe we guide them on patterns and correct them when they are obviously wrong, but the code is just the work-item artifact that comes out of extensive specification, discussion, proposal review, and more review of the reviews.
A well-guided, iterative process and problem/solution description should be able to generate an equivalent implementation whether a human is writing the code or an agent.
Translating a natural prompt on the other hand requires the LLM to make thousands of small decisions that will be different each time you regenerate the artifact. Even ignoring non-determinism, prompt instability means that any small change to the spec will result in a vastly different program.
A natural language spec and test suite cannot be complete enough to encode all of these differences without being at least as complex as the code.
Therefore each time you regenerate large sections of code without review, you will see scores of observable behavior differences that will surface to the user as churn, jank, and broken workflows.
Your tests will not encode every user workflow, not even close. Ask yourself if you have ever worked on a non trivial piece of software where you could randomly regenerate 10% of the implementation while keeping to the spec without seeing a flurry of bug reports.
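To make the point about incomplete test suites concrete, here is a minimal sketch. Everything in it is hypothetical (the `slugify` functions and the three "spec" cases are invented for illustration): two plausible implementations both pass the whole test suite, yet disagree on an input nobody thought to test, which is the kind of silent behavior change a regeneration can introduce.

```python
import re
from typing import Callable, Dict


def slugify_v1(title: str) -> str:
    """First 'generation': every run of non-alphanumerics becomes one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_v2(title: str) -> str:
    """'Regenerated' version: drop punctuation first, then hyphenate whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return re.sub(r"\s+", "-", cleaned.strip())


# The entire (hypothetical) test suite derived from the spec.
SPEC_CASES: Dict[str, str] = {
    "Hello World": "hello-world",
    "  Trim me  ": "trim-me",
    "Q3 Report!": "q3-report",
}


def passes_spec(slugify: Callable[[str], str]) -> bool:
    """Return True when the implementation satisfies every spec case."""
    return all(slugify(title) == expected for title, expected in SPEC_CASES.items())


if __name__ == "__main__":
    print(passes_spec(slugify_v1), passes_spec(slugify_v2))
    # An input the spec never mentions: the two versions now disagree,
    # so every existing URL built from such a title would change.
    print(slugify_v1("rock&roll"), slugify_v2("rock&roll"))
```

Both versions are "correct" by the spec, so neither the tests nor the agent has any reason to prefer one over the other.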
This may change if LLMs improve such that they are able to reason about code changes to the degree a human can. As of today they cannot do this and require tests and human code review to prevent them from spinning out. But I suspect at that point they’ll be doing our job, as well as the CEOs and we’ll have bigger problems.
I feel similarly about Hollywood and the creation of media. We're not there in either case yet, but we will be. That's pretty clear. And when I look at the feudal society that is the entertainment industry here, I don't understand why so many of the serfs are trying to perpetuate it in its current state. And I really don't get why engineers think this technology is going to turn them into serfs unless they let that happen to them themselves. If you can build things, AI coding agents will let you build faster and more for the same amount of effort.
I am assuming given the rate of advance of AI coding systems in the past year that there is plenty of improvement to come before this plateaus. I'm sure that will include AI generated systems to do security reviews that will be at human or better level. I've already seen Claude find 20 plus-year-old bugs in my own code. They weren't particularly mission critical but they were there the whole time. I've also seen it do amazingly sophisticated reverse engineering of assembly code only to fall over flat on its face for the simplest tasks.
OTOH if what you're really talking about is the long-term collapse in our ludicrous carbon footprint when we finally run out of fossil fuels and we didn't invest in renewables or nuclear to replace them, well, I'm with you there.
I don't even know what this means.
The worst unemployment during the Weimar Republic was 25-30%. Unemployment in the Great Depression peaked at 25%.
So yeah if we get to 45% unemployment and those are the highest paying jobs on average then yeah it's gonna be bad. Then you add in second order effects where none of those people have the money to pay the other 55% who are still employed.
We might get to a UBI relatively quickly and peacefully. But I'm not betting on it.
>finally go through some things just like the aristocracy in France once did.
Yeah that's probably the most likely scenario, but that quickly devolved into a death and imprisonment for far more than the aristocrats and eventually ended with Napoleon trying to take over Europe and millions of deaths overall.
The world didn't literally end, but it was 40 years of war, famine, disease, and death, and not a lot of time to think about starting businesses with your laptop.
The good news is that America is ~5% of the world. And the more we keep punching ourselves in the face, the better the chance someone else pulls ahead. But still, we have nukes, so we're still the town bully for the immediate future.
Fortunately, the other 95% of humanity is far less doomer about their prospects. So if America wants to be the new neanderthals, they'll be happy to be the new cro magnons.
I think that if CEOs can replace us soon, it's because AGI got here much sooner than I predicted. And if that happens we have 2 options, Mad Max and Star Trek, and Mad Max is the more likely of the 2.
If doom porn is your thing, try watching Threads or The Day After, especially Threads. That said, I don't think Star Trek is possible, maybe The Expanse but more likely we run out of cheap energy before we get off world.
As for the AGI, it all depends on your definition. We're already at Amazon IC1/IC2 coding performance with these agents (I speak from experience previously managing them). If we get to IC3, one person will be able to build a $1B company and run it or sell it. If you're a purist like me and insist we stick to douchebag racist Nick Bostrom's superintelligence definition of AGI, then we agree. But I expect 24/7 IC3 level engineering as a service for $200/month to be more than enough and I think that's a year or two away. And you can either prepare for that or scream how the sky is falling, your choice.
You are the epitome of the tech bro.
Look how Musk treated the Twitter devs or Bezos any of his workers or Trump anybody.
But you aren't building, your LLM is. Also, you are only thinking about the ways you, a supposed builder, will benefit from this technology. Have you considered how all previous waves of new technologies have introduced downstream effects that have muddied our societies? LLMs are not unique in this regard, and we should be critical of those who are trying to force them into every device we own.
With AI, by always planning first, pushing it to explore alternative technical approaches, making it explain its choices — the creative construction process gets easier. You stay the conductor. Refactoring, new features, testing — all facilitated. Add regular AI-driven audits to catch defects, and of course the expert eye that nothing replaces.
One thing that worries me though: how will junior devs build that expert eye if AI handles the grunt work? Learning through struggle is how most of us developed intuition. That's a real problem for the next generation.
A contractor is still very much putting the house together.
I started my career in 1996 programming in C and Fortran on mainframes and got my first, only, and hopefully last job at BigTech at 46, seven jobs later.
I’m no longer there. Every project I’ve had in the last two years has had classic ML and then LLMs integrated into the implementation. I have very much jumped on the coding agent bandwagon.
Here are the reported miscompilation bugs in GCC so far in 2026. The ones labeled "wrong-code".
https://gcc.gnu.org/bugzilla/buglist.cgi?chfield=%5BBug%20cr...
I count 121 of them.
I’m also not particularly concerned with non-determinism but with chaos. Determinism in LLMs is likely solvable, prompt instability is not.
The first is a purely mechanical process, the second is not and requires thousands of decisions that can go either way.
Agents require tests to keep them from spinning out and your tests do not cover all of the behaviors you care about.
2. If you doubt that your tests don’t cover all your requirements, consider that 99.9% of the production bugs you’ve ever had passed your test suite.
So humans also don’t write bug free code or tests that cover all use cases - how is that an argument that humans are better?
An AI cannot write a 100k line program on its own without external guard rails otherwise it spins out. This has nothing to do with whether the agent is allowed to run the code itself. This is well documented. Look at what was required to allow Claude to write a "C compiler".
This has nothing to do with whether it's bug free. It literally can't produce a working 100k LOC program without external guardrails.
Why? Because LLMs can get easily confused, so they need well written code they can understand if the LLM is going to maintain the codebase it writes.
The cleaner I keep my codebase, and the better (not necessarily more) abstracted it is, the easier it is for the LLM to understand the code within its limited context window. Good abstractions help the right level of understanding fit within the context window, etc.
I would argue that use of LLMs changes what good code is, since "good" now means you have to meaningfully fit good ideas in chunks of 125k tokens.
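A rough way to check whether a set of files plausibly fits in such a chunk is sketched below. The 125k budget is the figure from the comment above; the "about 4 characters per token" ratio is only a rule-of-thumb assumption (real tokenizers vary by model and by language), and `estimate_tokens` and `context_report` are hypothetical helper names.

```python
import math
from typing import Dict, List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; chars_per_token is an assumed average, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def context_report(
    files: Dict[str, str], budget: int = 125_000
) -> Tuple[bool, int, List[Tuple[str, int]]]:
    """Return (fits, total_estimated_tokens, per-file sizes sorted largest first)."""
    sizes = sorted(
        ((name, estimate_tokens(source)) for name, source in files.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    total = sum(tokens for _, tokens in sizes)
    return total <= budget, total, sizes


if __name__ == "__main__":
    # Hypothetical module contents, just to exercise the helper.
    demo = {"agent.py": "x" * 400, "database.py": "y" * 800}
    fits, total, sizes = context_report(demo, budget=250)
    print(fits, total, sizes)
```

The largest entries in the report are the natural candidates for splitting or for hiding behind a narrower interface.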
If, in the future, LLM providers will take ownership of our on-calls for the code they have produced, I would write "AUTO-REVIEW-ACCEPTER" bot to accept everything and deploy it to production.
If the company requires me to own something, then I should know what that thing is, understand its ins and outs in detail, and be able to quickly adjust when things go wrong.
On the other hand, you may have been an engineering manager who is responsible for the team; a lot of the time they do not participate in on-call rotations (only as the last escalation).
That doesn’t mean I did a code review for all of the developers. I will ask them how they solved a problem that I know can be tricky, or whether they took something into account.
No amount of unit/integration tests cover every single use case in sufficiently complex software, so you cannot rely on that alone.
Short version, when someone designs a call center with Amazon Connect, they use a GUI flowchart tool and create “contact flows”. You can export the flow to JSON. But it isn’t portable to other environments without some remapping. I created a tool before that used the API to export it and create a portable CloudFormation template.
I always miss some nuance that can half be caught by calling the official CloudFormation linter and the other half by actually deploying it and seeing what errors you get
This time I did it with Claude Code. Ironically enough, it knew some of the complexity because it had been trained on one of my older open source implementations I did while at AWS. But I told it to read the official CloudFormation spec and, after every change, to test it with the linter, try to deploy it, and fix it.
Again, I didn’t care about the code - I cared about results. The output of the script either passes the deployment or it doesn’t. Claude iterated until it got it right based on “observable behavior”. Claude has tested whether my deployments were working as expected plenty of times by calling the appropriate AWS CLI command and fixed things or reading from a dev database based on integration tests I defined.
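The loop described above (cheap check first, then a real deployment, then fix and retry) can be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's actual setup: `iterate_until_passing` and the `fake_*` functions are hypothetical stand-ins, where in practice the checks would shell out to the linter and the deployment command, and the fixer would be the agent.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], List[str]]       # artifact -> list of error messages
Fixer = Callable[[str, List[str]], str]  # (artifact, errors) -> revised artifact


def run_checks(artifact: str, checks: List[Check]) -> List[str]:
    """Run checks in order (cheapest first) and stop at the first failing stage."""
    for check in checks:
        errors = check(artifact)
        if errors:
            return errors
    return []


def iterate_until_passing(
    artifact: str, checks: List[Check], fix: Fixer, max_rounds: int = 5
) -> Tuple[str, bool]:
    """Alternate check and fix until every check passes or the rounds run out."""
    for _ in range(max_rounds):
        errors = run_checks(artifact, checks)
        if not errors:
            return artifact, True
        artifact = fix(artifact, errors)
    # Re-check once more so a successful final fix is not reported as a failure.
    return artifact, not run_checks(artifact, checks)


# Hypothetical stand-ins for the linter, the deployment, and the agent.
def fake_lint(template: str) -> List[str]:
    return [] if "Type:" in template else ["missing Type"]


def fake_deploy(template: str) -> List[str]:
    return [] if "Region:" in template else ["missing Region"]


def fake_fix(template: str, errors: List[str]) -> str:
    for error in errors:
        template += f"\n{error.split()[-1]}: placeholder"
    return template


if __name__ == "__main__":
    result, ok = iterate_until_passing("Name: flow", [fake_lint, fake_deploy], fake_fix)
    print(ok)
    print(result)
```

The `max_rounds` cap matters: without it, an agent that keeps trying the wrong thing has nothing to stop it.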
Maybe we should treat LLM generated code similarly: just generate everything fresh from the spec anytime there’s a change, though personally I haven’t had much success with that yet.
Have you ever tried writing tests for spaghetti code? It's hell compared to testing good code. LLMs require a very strong test harness or they're going to break things.
Have you tried reading and understanding spaghetti code? How do you verify it does what you want, and none of what you don't want?
Many code design techniques were created to make things easy for humans to understand. That understanding needs to be there whether you're modifying it yourself or reviewing the code.
Developers are struggling because they know what happens when you have 100k lines of slop.
If things keep speeding in this direction we're going to wake up to a world of pain in 3 years and AI isn't going to get us out of it.
It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.
No human should ever be forced to look at the code behind my vibe coded internal admin portal that was created with straight Python, no frameworks, server side rendered and produced HTML and JS for the front end all hosted in a single Lambda including much of the backend API.
I haven’t done web development since 2002 with Classic ASP besides some copy and paste feature work once in a blue moon.
In my repos - post AI. My Claude/Agent files have summaries of the initial statement of work, the transcripts from the requirement sessions, my well labeled design diagrams, my design review session transcripts where I explained it to the client and answered questions, and a link to the Google NotebookLM project with all of the artifacts. I have separate md files for different implementation components.
The NotebookLM project can be used for any future maintainers to ask questions about the project based on all of the artifacts.
In my experience using AI to work on existing systems, the AI definitely performs much better on code that humans would consider readable.
You can’t really sit here talking about architecting greenfield systems with AI using methodology that didn’t exist 6 months ago while confidently proclaiming that “trust me they’ll be maintainable”.
Well you can, and most consultants do tend to do that, but it’s not worth much.
Yeah they do.
I'm familiar enough with the claims to feel confident there is plenty of nefarious astroturfing occurring all over the web including on HN.
And the most important part is you haven't maintained any large systems written by AI, so stating that they will work is nonsense.
I won't state that AI can't get better. AI agents might replace all of us in the future. But what I will tell you is based on my experience and reasoning I have very strong doubts about the maintainability of AI generated code that no one has approved or understands. The burden of proof isn't on the person saying "maybe we should slow down and understand the consequences before we introduce a massive change." It's on the person saying "trust me it will work even though I have absolutely no evidence to support my claim".
And did I mention I got my start working in cloud consulting as a full time blue badge, RSU earning employee at a little company you might have heard of based in Seattle? So since I have worked at the second largest employer in the US, unless you have worked for Walmart - I don’t think you have worked for a larger company than I have.
Oh did I also mention that I worked at GE when it was #6 in market cap?
These were some of the business requirements we had to implement for the railroad car repair interchange management software
https://www.rmimimra.com/media/attachments/2020/12/23/indust...
You better believe we had a rigorous set of automated tests in something as highly regulated with real world consequences as the railroad transportation industry. AI would have been perfect for that because the requirements were well documented and the test coverage was extreme.
And unless your experience coding is before 1986 when I was coding in assembly language in 65C02 as a hobby, I think I might have a wee bit more than you.
I think you should probably save your “I have more experience” for someone who hasn’t been doing this professionally for 30 years for everything from startups, to large enterprises, to BigTech.
Except security researchers. I work in cybersecurity and we already see vulnerabilities caused by careless AI generated code in the wild. And this will only get worse (or better for my job security).
This “the only thing that matters about code is whether it meets requirements” is such a tired take and I can’t imagine anyone seriously spouting it has had to maintain real software.
Whether you are tired of it or not, absolutely no one in your value chain - your customers who give your company money or your management chain - cares about your code beyond whether it meets the functional and non functional requirements - they never did.
And of course whether it was done on time and on budget
My home, which I own, for example, is very much a “what” that keeps me warm and dry. But the “how” of its construction is the difference between (1) me cursing the amateur and careless decision making of builders and (2) quietly sipping a cocktail on the beach, free of a care in the world.
“How” doesn’t matter until it matters, like when you put too much weight onto that piece of particle board IKEA furniture.
He’s mildly controversial, but watch some @cyfyhomeinspections on YouTube to get a good idea of what you can infer of the “how” of building homes and how it affects homeowners. Especially relevant here because he seems to specialize in inspecting homes that are part of large developments where a single company builds out many homes very quickly and cuts tons of corners and makes the same mistakes repeatedly, kind of like LLM-generated code.
No, I can have some idea. For example, “brand perception”, which can be negatively impacted pretty heavily if things go south too often. See: GitHub, most recently.
I mean, there are already companies that have a negative reputation regarding software quality due to significant outsourcing (consultancies), or bloated management (IBM), or whatever tf Oracle does. We don’t have to pretend there’s a universe where software quality matters, we already live in one. AI will just be one more way to tank your company’s reputation with regards to quality, even if you can maintain profitability otherwise through business development schemes.
The same as I’ve been arguing about using an agent to do the grunt work of coding.
If GitHub’s login is slow, it isn’t because someone or something didn’t write SOLID code.
I don’t think we’ll come to common ground on this topic due to mismatching definitions of fundamental concepts of software engineering. Maybe let’s meet again in a year or two and reflect upon our disagreement.
If you mostly parachute in solutions as a consultant, or hand down architecture from above, you won’t have much experience with that, so it’s reasonable for you to underestimate it.
The scalability requirements are part of the “non functional requirements”. I know that the vibe coded internal admin website will never be used by more than a dozen people just like I know the ETL implementation can scale to the required number of transactions because I actually tested it for that scalability.
In fact, the one I gave to the client was my second attempt because my first one fell flat on its face when I ran it at the required scale
If the code is cheap to produce, you don't maintain it, you just throw it away and regenerate.
I’ve never seen this done even with LLMs. Not even close. And even if you did it, the test suite is almost definitely more complex than the code and will suffer from all the same maintainability problems.
And 2, clearly agents are worse at reasoning through code changes than humans are.
I couldn’t care less about the implementation behind the vibe coded admin website that will only be used by a dozen people. I care about the authorization.
Even for the ETL job, I cared only about the performance characteristics, concurrency, logging, and correctness of the results.
Well-designed interfaces enforce decoupling where it matters most. And believe it or not, you can do review passes after an LLM writes code, to catch bugs, security issues, bad architecture, reduce complexity, etc.
Try it yourself. Ask claude for something you don't really understand. Then learn that thing, get a fresh instance of claude and try again, this time it will work much better because your knowledge and experience will be naturally embedded in the prompt you write up.
I often use AI successfully, but in a few cases I had, it was bad. That was when I didn't even know the end goal and regularly switched the fundamental assumptions that the LLM tried to build up.
One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.
So the LLM tried many fundamentally different approaches and when I had something that specifically did not work it immediately switched approaches.
Next time I get to work on this (toy) problem I will let it implement some of them, fully parametrize them and let me have a go with it. There is a concrete goal and I can play around myself to see if my specific convergence criterion is even possible.
I noticed that Claude is awful at understanding what makes good UX, even for something as simple as this: if you have a one-line input box and a button that submits the line of text, you should wire it up so a user can press Return instead of pressing the button. The same goes for thinking about users being able to tab through inputs in a decent order.
As I use it more I gain more intuition about the kinds of problems it can handle on its own, vs those that I need to work on breaking down into smaller pieces before setting it loose.
Without research and planning agents are mostly very expensive and slow to get things done, if they even can. However with the right initial breakdown and specification of the work they are incredibly fast.
I know senior developers that are very radical on some nonsense patterns they think are much better than others. If they see code that doesn't follow them, they say it's trash.
Even so, you can guide the LLM to write the code as you like.
And you are wrong, it's a lot on how people write the prompt.
“You are overestimating the skill of [reading, comprehending, and critically assessing code of a non-guaranteed quality]” is an absurd statement if you properly expand out what “code review” means.
I don’t care if you code review the CSS file for the Bojangles online menu web page, but you better be code reviewing the firmware for my dad’s pacemaker.
This whole back and forth with LLM-generated code makes me think that the marginal utility of a lot of code the strong proponents write is <1¢. If I fuck up my code, it costs our partners $200/hr per false alert, which obliterates the profit margin of using our software in the first place.
We can capture enough reliability in what LLMs produce there through guided integration tests and UX tests, along with code review, using other LLMs to review, and other strategies to prevent semantic and code drift.
Do you know how many crap WordPress, Drupal and Joomla sites I have seen?
Just that work can be automated away
But I've also worked in high-end and mission-critical delivery, with more formal verification and so on. That’s just moving the goalposts on what AI can do. It will get there eventually.
Last year you all here were arguing AI couldn’t code. Now everyone has moved the goalposts to formal, high-end and mission-critical ops. Yes, when money matters we humans are still needed, of course; no one is denying that. It's about the utility of the sole human developer against the onslaught of machine-aided coding.
This profession is changing rapidly. People are stuck in denial.
This is the nutshell of your argument. I’m not convinced. Technologies often hit a ceiling of utility.
Imagine a “progress curve” for every technology, x-axis time and y-axis utility. Not every progress curve is limitlessly exponential, or even linear - in fact, very few are. I would venture to guess that most technological progress actually mimics population growth curves, where a ceiling is hit based on fundamental restrictions like resource availability, and then either stabilizes or crashes.
I don’t think LLMs are the AI endgame. They definitely have utility, but I think your argument boils down to a bold prediction of limitless progress of a specific technology (LLMs), even though that’s quite rare historically.
Sometimes, I'll give it recursive instructions... such as "these tests are correct, please re-run the test and correct the behavior until the tests work as expected." Usually more specific on the bugs, nature and how I think they should be fixed.
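For what it's worth, that "the tests are correct, keep going until they pass" instruction can also be enforced outside the prompt. Here is a minimal sketch, assuming a hypothetical `ask_agent_to_fix` callback that forwards the failure report to whatever agent you use; `fix_until_green`, `run_pytest`, and `max_rounds` are names made up for illustration, and `run_pytest` assumes pytest is installed.

```python
import subprocess
from typing import Callable, Tuple


def fix_until_green(
    run_tests: Callable[[], Tuple[bool, str]],
    ask_agent_to_fix: Callable[[str], None],
    max_rounds: int = 5,
) -> bool:
    """Re-run the tests and hand failures to the agent until they pass.

    The tests are treated as the source of truth: the agent is only ever
    asked to change the implementation, never the tests.
    """
    for _ in range(max_rounds):
        passed, report = run_tests()
        if passed:
            return True
        # Hypothetical agent hook; the wording mirrors the instruction above.
        ask_agent_to_fix(
            "These tests are correct. Fix the implementation, not the tests.\n"
            + report
        )
    # One last check after the final fix attempt.
    passed, _ = run_tests()
    return passed


def run_pytest() -> Tuple[bool, str]:
    """Example test runner: shells out to pytest and captures its output."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr
```

The cap on rounds is there so a rabbit hole ends with a `False` instead of an open-ended bill.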
I do find that sometimes when dealing with UI effects, the agent will go down a bit of a rabbit hole... I wanted an image zoom control, and the agent kept trying to do it all with css scaling and the positioning was just broken.. eventually telling it to just use nested div's and scale an img element itself, using CSS positioning on the virtual dom for the positioning/overflow would be simpler, it actually did it.
I've seen similar issues where the agent will start changing a broken test, instead of understanding that the test is correct and the feature is broken... or tell me to change my API/instructions, when I WANT it to function a certain way, and it's the implementation that is wrong. It's kind of weird, like reasoning with a toddler sometimes.
I think another part (among many others) is not the skill of the individual prompting, but the quality of the code and documentation (human and agent specific) in the code base. I've seen people run willy-nilly with LLMs that are just spitting out nonsense because there are no examples of how the code should look, no documentation on how it should structure the code, and no human who knows how the code should work reviewing it. A deadly combo that produces bad, unmaintainable code.
If you sort those out though (and review your own damn LLM code), I think that's when LLMs become a powerful programming tool.
I really liked Simon Willison's way of putting it: "Your job is to deliver code you have proven to work".
https://simonwillison.net/2025/Dec/18/code-proven-to-work/
It is absolutely, unequivocally, patently false to say that the input doesn’t affect the output, and if the input has impact, then it IS a skill.
this makes me feel better about the amount of disdain I've been feeling about the output from these llms. sometimes it pops out exactly what I need but I can never count on it to not go off the rails and require a lot of manual editing.
As a developer, you always have to check the code, and recognise when it's just being stupid.
This is what I don’t understand - why would I “complain” about “hand holding”? Why wouldn’t I just create a Claude skill or analogue that tells the agent to conform to my preferences?
I’ve done this many times, and haven’t run into any major issues.
I also think reviewable code, that is, code specifically delivered in a manner that makes code review more straightforward, was always valuable, but now that generation costs have dropped its relative value is much higher. So structuring your approach (including plans and prompts) to drive toward easily reviewed code is a more valuable skill than before.
Well, it's easily the simplest explanation, right?
It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.
In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.
We can't trust the measurements that companies post either because truth isn't their first goal.
Just use it or don't use it depending on how it works out imo. I personally find it marginally on the positive side for coding
I think especially a number of us more junior programmers lack in this regard, and don't see a clear way of improving this skill beyond just using LLMs more and learning with time?
There is no shortcut unfortunately.
In my experience the differences are mostly in how the code produced by LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.
I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.
The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.
The "AI can't build anything useful" crowd consists entirely of fools and liars.
For me personally, in my projects there's not a single line of LLM code. At most I ask LLMs for advice about specific APIs. And the more I think about it, the more I want to stop doing even that.
Sometimes I would like to have magical make-my-project tool for my selfish reasons; sometimes I know it would be a bad choice to fall behind on what's to come. But I really, really don't want to support that future.
Doesn't help that _pine_ is one way of saying penis in French
This sounds sensible, but also makes me wonder how much time is actually being saved if implementing a "very specific feature or bugfix" still takes an hour of back and forth with an LLM.
Can't help but think that this is still just an awkward intermediate phase of development with adolescent LLMs where we need to think about implementation choices at all.
> I'd like to add email support to this bot. Let's think through how we would do this.
and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).
Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)
In the back of my head I know the chatbot is trained on conversations and I want it to reflect a professional and clear tone.
But I usually keep it more simple in most cases. Your example:
> I'd like to add email support to this bot. Let's think through how we would do this.
I would likely write as:
> if i wanted to add email support, how would you go about it
or
> concise steps/plan to add email support, kiss
But when I'm in a brainstorm/search/rubber-duck mode, then I write more as if it was a real conversation.
Keeping everything generally "human readable" also has the advantage of being easier for me to review later if needed.
As you said, that "other person" might be me too. Same reason I comment code. There's another person reading it, most likely that other person is "me, but next week and with zero memory of this".
We do like anthropomorphising the machines, but I try to think they enjoy it...
What even is thinking and reasoning if these models aren't doing it?
LLMs are amazing, but they represent a very narrow slice of what thinking is. Living beings are extremely dynamic and both much more complex and simple at the same time.
There is a reason for:
- companies releasing new versions every couple of months
- LLMs needing massive amounts of data to train on that is produced by us and not by itself interacting with the world
- a massive amount of manual labor being required both for data labeling and for reinforcement learning
- them not being able to guide through a solution, but ultimately needing guidance at every decision point
Among many other factors, perhaps the most key differentiator for me that prevents me describing these as thinking, is proactivity.
LLMs are never pro-active.
( No, prompting them on a loop is not pro-activity ).
Human brains are so proactive that given zero stimuli they will hallucinate.
As for reasoning, they simply do not. They do a wonderful facsimile of reasoning, one that's especially useful for producing computer code. But they do not reason, and it is a mistake to treat them as if they can.
But what would proactivity in an LLM look like, if prompting in a loop doesn't count?
An LLM experiences reality in terms of the flow of the token stream. Each iteration of the LLM has 1 more token in the input context and the LLM has a quantum of experience while computing the output distribution for the new context.
A human experiences reality in terms of the flow of time.
We are not able to be proactive outside the flow of time, because it takes time for our brains to operate, and similarly LLMs are not able to be proactive outside the flow of tokens, because it takes tokens for the neural networks to operate.
The flow of time is so fundamental to how we work that we would not even have any way to be aware of any goings-on that happen "between" time steps even if there were any. The only reason LLMs know that there is anything going on in the time between tokens is because they're trained on text which says so.
Also an LLM will hallucinate on zero input quite happily if you keep sampling it and feeding it the generated tokens.
These days, the user prompt is just a tiny part of the context it has, so it probably matters less or not at all.
I still do it though, much like I try to include relevant technical terminology to try to nudge its search into the right areas of vector space. (Which is the part of the vector space built from more advanced discourse in the training material.)
Edit: wording
So no evidence.
Sure seems like this could be the case with the structure of the prompt, but what about capitalizing the first letter of a sentence, or adding commas, tag questions, etc.? They seem like details that will not play any role in the end.
These are text completion engines.
Punctuation and capitalization is found in polite discussion and textbooks, and so you'd expect those tokens to ever so slightly push the model in that direction.
Lack of capitalization pushes towards text messages and irc perhaps.
We cannot reason about these things in the same way we can reason about using search engines, these things are truly ridiculous black boxes.
Might very well be the case, I wonder if there's some actual research on this by people that have some access to the internals of these black boxes.
In my world view, an LLM is far closer to a fridge than the androids of the movies, let alone human beings. So it's about as pointless being polite to it as is greeting your fridge when you walk into the kitchen.
But I know that others feel different, treating the ability to generate coherent responses as indication of the "divine spark".
Note, why would the author write "Email will arrive from a webhook, yes." instead of "yy webhook"? In the second case I wouldn't be impolite either, I might reply like this in an IM to a colleague I work with every day.
For the vast majority of people, using capital letters and saying please doesn't consume energy, it just is. There's a thousand things in your day that consume more energy like a shitty 9AM daily.
This seems to be completely subjective; I write syntactically/grammatically "nice" sentences to LLMs, because that's how I write. I would have to "invest energy" to force myself to write in that supposedly "simpler" style.
It's also actually more trouble to formulate abbreviated sentences than normal ones, at least for literate adults who can type reasonably well.
> literate adults who can type reasonably well
For me the difference is around 20 wpm in writing speed if I just write out my stream of thoughts vs. when I care about typos and capitalizing words - I find real value in this.
It would cost me energy to deliberately not write with proper grammar and orthography. I would never want to write sloppily to a colleague either.
It has always done what I meant or asked me a clarifying question (because of my CLAUDE.md instruction).
Also consider the insanity of intentionally feeding bullshit into an information engine and expecting good things to come out the other end. The fact that they often perform well despite the ugliness is a miracle, but I wouldn't depend on it.
Further, an LLM being inherently sycophantic leads to it mimicking me, so if I talk to it in a stupid or abusive (which is just another form of stupidity, in my eyes) manner, it will behave stupidly. Or, that's what I'd expect. I've not researched this in a focused way, but I've seen examples where people get LLMs to be very unintelligent by prompting riddles or intelligence tests in highly-stylized speech. I wanted to say "highly-stupid speech", but "stylized" is probably more accurate, e.g.: `YOOOO CHATGEEEPEEETEEE!!!!!!1111 wasup I gots to asks you DIS.......`. Maybe someone can prove me wrong.
> having a dry tone and cutting the unnecessary parts
That's how I try to communicate in professional settings (AI included). Our approaches might not be that different.
Oh me too, because people are anthropomorphizing the LLM, not because they hurt it. Indirectly, though, I agree that this behaviour can easily affect the way this person would speak to other humans
Anthropomorphizing might not be the right term, because it's about assigning human attributes. When I talk to my dog, for example, I don't contextualize it as giving it human attributes. In a way, talking to something is part of how I engage my relationship-management circuitry. I don't only relate to humans, I relate to everything in one way or another, and kindness is a pretty nice starting point. As I said, I don't think about this much: might come up with something more coherent if I did.
the models consistently spew slop when one does it, I have no idea where positive reinforcement for that behavior is coming from
When it comes to coding, however, the place where you really need help is the place where you get stuck, and for most people that would be the intersection of domain and tech. LLMs need a LOT of babysitting to be somewhat useful here. If I have to prompt an LLM for hours just to get the correct code, why would I even use it when the tangible output is just a few hundred carefully thought-out lines of code?
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
1. https://github.com/humanlayer/advanced-context-engineering-f...
It seems to me that splitting into subagents that use the same model is kind of like asking a person to wear three different hats and do three different parts of the job instead of just asking them to do it all with one hat. You're likely to get similar results.
I see what you mean w.r.t. different hats; but is it useful to have different tools available? For example, a "planner" having Web access and read-only file access, versus a "developer" having write access to files but no Web access?
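To make the question concrete, here is a minimal sketch of per-role tool permissions. Everything in it is hypothetical (`Tool`, `Role`, `call_tool` are made-up names, not any harness's API), and giving the developer read access as well as write access is my assumption.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Tool(Flag):
    NONE = 0
    WEB = auto()
    READ_FILES = auto()
    WRITE_FILES = auto()


@dataclass(frozen=True)
class Role:
    name: str
    allowed: Tool

    def can(self, tool: Tool) -> bool:
        # True only if every requested capability is in the allowed set.
        return (self.allowed & tool) == tool


# Planner: Web access plus read-only file access.
PLANNER = Role("planner", Tool.WEB | Tool.READ_FILES)
# Developer: may write files but has no Web access (read access is assumed).
DEVELOPER = Role("developer", Tool.READ_FILES | Tool.WRITE_FILES)


def call_tool(role: Role, tool: Tool, action: str) -> str:
    """Gate every tool call on the role's permissions before dispatching."""
    if not role.can(tool):
        raise PermissionError(f"{role.name} may not use {tool.name}")
    return f"{role.name}: {action}"
```

With this split, the same underlying model behaves differently per hat only because the tool gate differs, which is the part a single all-powerful session can't give you.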
The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)
[0] https://github.com/obra/superpowers
what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.
the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
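a rough sketch of both points, assuming made-up `research` and `publish` steps (the sleep stands in for a model or network call): the read-only agents fan out in parallel, and the single step with an irreversible action runs once at the end.

```python
import asyncio
from typing import List


async def research(topic: str) -> str:
    """Read-only step: safe to retry or run in parallel (no side effects)."""
    await asyncio.sleep(0.01)  # stands in for a model or network call
    return f"notes on {topic}"


async def publish(notes: List[str], outbox: List[str]) -> None:
    """The only step with an irreversible action; runs once, at the end."""
    outbox.append("\n".join(notes))


async def pipeline(topics: List[str], outbox: List[str]) -> None:
    # Researchers are independent, so they run concurrently, not serially.
    notes = await asyncio.gather(*(research(t) for t in topics))
    await publish(list(notes), outbox)


if __name__ == "__main__":
    sent: List[str] = []
    asyncio.run(pipeline(["pricing", "competitors"], sent))
    print(sent[0])
```

if the process dies before `publish`, you can restart from scratch without worrying about a half-applied side effect.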
I started writing my prompts almost like mini specs. "Here's the function signature, here's what it should return for these inputs, here are the edge cases." That changed everything. The output went from "kinda close" to actually usable.
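A minimal sketch of what such a mini spec could look like if you template it. `MiniSpec` and the `slugify` example are hypothetical, purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MiniSpec:
    signature: str
    examples: List[Tuple[str, str]]  # (call, expected result) pairs
    edge_cases: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the spec as plain text to paste into a prompt."""
        lines = ["Implement this function:", f"    {self.signature}", ""]
        lines.append("Expected results:")
        lines += [f"- {call} -> {expected}" for call, expected in self.examples]
        if self.edge_cases:
            lines.append("")
            lines.append("Edge cases:")
            lines += [f"- {case}" for case in self.edge_cases]
        return "\n".join(lines)


# Hypothetical example spec.
spec = MiniSpec(
    signature="def slugify(title: str) -> str",
    examples=[('slugify("Hello World")', '"hello-world"')],
    edge_cases=["empty string returns an empty string"],
)
prompt = spec.to_prompt()
```

The examples double as test cases once the code comes back.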
The other thing that helped was keeping the feedback loop tight. Don't let the LLM generate 200 lines and then try to review it all. Small chunks, verify each one, then move on. Way less time spent debugging weird hallucinated logic.
The way I think about it: the model has a probability distribution over all possible implementations, shaped by its training data. Given a vague prompt, that distribution is wide and you're likely to get something generic. As you iterate on a design with the model (really just refining the context), the distribution narrows towards a subset of implementations. By the time the model writes code, you've constrained the space enough that most of what it produces is actually what you want.
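A toy illustration of that narrowing, under loud assumptions: the design axes below are invented, and every remaining candidate is treated as equally likely, which a real model's distribution certainly is not. The only point is that each design decision shrinks the set of plausible implementations, and with it the entropy.

```python
import math
from itertools import product
from typing import Dict, List

# Hypothetical design axes; each combination is one candidate implementation.
# A vague prompt leaves all of them plausible.
AXES = {
    "transport": ["webhook", "imap-poll", "smtp-server"],
    "storage": ["sqlite", "postgres", "in-memory"],
    "parsing": ["stdlib", "third-party"],
}


def candidates(constraints: Dict[str, str]) -> List[Dict[str, str]]:
    """All combinations consistent with the decisions made so far."""
    combos = [dict(zip(AXES, values)) for values in product(*AXES.values())]
    return [c for c in combos if all(c[k] == v for k, v in constraints.items())]


def entropy_bits(n: int) -> float:
    """Entropy of a uniform distribution over n remaining candidates."""
    return math.log2(n)


if __name__ == "__main__":
    decisions: Dict[str, str] = {}
    for axis, choice in [("transport", "webhook"), ("storage", "sqlite")]:
        decisions[axis] = choice
        remaining = candidates(decisions)
        print(f"after fixing {axis}: {len(remaining)} candidates, "
              f"{entropy_bits(len(remaining)):.2f} bits")
```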
I wonder how the team members, if any, survive such throughput. I also wonder if there was any quantification applied for the prompts/results, cost analysis, etc.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
Architecture is fine for big, complex projects. Having everything planned out beforehand keeps costs down and ensures the customer will not come with late changes. But if costs are expected to be low and there's no customer, architecture is overkill. Skipping it is like making a movie without following the script line by line (watch Godard in the Nouvelle Vague), or building something yourself or with a non-architect: 2x faster, 10x cheaper. You can immediately spot an inflexible, over-architected project.
You can do fine by restricting the agent with proper docs, proper tests and linters.
But there are a substantial number of cases where this isn't true. The nitty-gritty is then the important part, and it's impossible to make the whole thing work well without being intimate with the code.
So I never fully bought into the clean separation of development, engineering and architecture.
You will have to find new economic utility. That's the reality of technological progress - it's just that the tech and white-collar industries didn't think it could come for them!
A skill that becomes obsolete is useless, obviously. There's still room for artisanal/handcrafted wares today, amidst industrial-scale production, so I would assume similar levels for coding.
In short: LLMs will eventually be able to architect software. But it’s still just a tool
This is only possibly true if one of two things is true:
1. All new software can be made up of preexisting patterns of software that can be composed, i.e. there is no such thing as "novel" software, it's all just composition of existing software.
2. LLMs are capable of emergent intelligence, allowing them to express patterns that they were not trained on.
I am extremely skeptical that either of these is true.
It is not impossible, however, that an LLM could run enough “random” tests to find new ways of doing something, but I hear you.
Let me restate that to “An LLM can build most anything…” and I stand by the rest of my comment.
But for building the right thing? Doubtful.
Most of a great engineer’s work isn’t writing code, but interrogating what people think their problems are, to find what the actual problems are.
In short: problem solving, not writing code.
What a load of crap.
All you're doing is describing a different job role.
What you're talking about is BA work, and a subset of engineers are great at it, but most are just ok.
You're claiming a part of the job that was secondary, and not required, is now the whole job.
The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s my point.
They didn't magically become great truck drivers.
Programmers do not deliver products, they deliver code to make products.
If the code is no longer needed, nor is the job. A different job will replace it with different skills required.
Is that why most prestigious jobs grilled you like a devil on algos/system design?
> The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
But it's just very obvious to any software engineer worth anything that code is just one part of the job, and it's usually somewhere in the middle of a process. Understanding customer requirements, making technical decisions, maintaining the codebase, reviewing code changes and providing feedback, responding to incidents, deciding what work to do or not to do, deciding when a constraint has to be broken, etc. There are a billion things that aren't "typing code" that an engineer does every day. To deny this is absurd to anyone who lives every day doing those things.
I think there's some resentment. I've seen repeatedly now people essentially celebrating that "tech bros" are finally going to see their salaries crash or whatever, it's pretty sick but I've noticed this quite a lot.
No. That’s because interviews have always sucked, and have always been terrible predictors of how you do on the job. We just never had a better way of deciding except paying for a project.
> That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
That’s… not an argument? It’s not even a strawman, it’s just unrelated.
The thing a customer has always paid for was the end product. Not the code. This is absolutely trivial to see, since a customer has never asked to read the code.
Wait, I thought product and C level people are so busy all the time that they can’t fart without a calendar invite, but now you say they have time to completely replace whole org of engineers?
The commercial solutions probably don't work because they don't use the best SOTA models and/or sully the context with all kinds of guardrails and role-playing nonsense, but if you just open a new chat window in your LLM of choice (set to the highest thinking paid-tier model), it gives you truly excellent therapist advice.
In fact in many ways the LLM therapist is actually better than the human, because e.g. you can dump a huge, detailed rant in the chat and it will actually listen to (read) every word you said.
It is easy to convince and trivial to make obsequious.
That is not what a therapist does. There’s a reason they spend thousands of hours in training; that is not an exaggeration.
Humans are complex. An LLM cannot parse that level of complexity.
The tools and reframing that LLMs have given me (Gemini 3.0/3.1 Pro) have been extremely effective and have genuinely improved my life. These things don't even cross the threshold to be worth the effort to find and speak to an actual therapist.
Do you think I could use an AI therapist to become a more effective and much improved serial killer?
An LLM cannot parse the complexity of your situation. Period. It is literally incapable of doing that, because it does not have any idea what it is like to be human.
Therapy is not an objective science; it is, in many ways, subjective, and the therapeutic relationship is by far the most important part.
I am not saying LLMs are not useful for helping people parse their emotions or understand themselves better. But that is not therapy, in the same way that using an app built for CBT is not, in and of itself, therapy. It is one tool in a therapist’s toolbox, and will not be the right tool for all patients.
That doesn’t mean it isn’t helpful.
But an LLM is not a therapist. The fact that you can trivially convince it to believe things that are absolutely untrue is precisely why, for one simple example.
Training LLMs we can do.
Though it might be important for the patient to believe that the therapist is empathizing, so that may give AI therapy an inherent disadvantage (depending on the patient's view of AI).
The word “just” is not in my comment anywhere. Being human is necessary, but not sufficient.
And no, you cannot train an LLM to be human.
An LLM is not a therapist. Please do not confuse the two.
You cannot train an LLM on how to be human.
EDIT: seems like you made the same point in a child comment.
But he still sees a therapist, regularly, because they are not the same and do not serve the same purpose. :)
That exact thing happens with people too! Specifically when a cheap entrepreneur hires a novice developer and can't give the developer appropriate mentoring and reviews.
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
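Roughly, the board behaves like a small state machine. This is only a sketch of the idea, not the code in the repo: the column names and the rule that flagged tasks go back to the executor are assumptions.

```python
from enum import Enum


class Column(Enum):
    TODO = "todo"
    IN_PROGRESS = "in progress"
    QA = "qa"
    FLAGGED = "flagged"
    HUMAN_REVIEW = "human review"


# Which moves each column allows.
ALLOWED = {
    Column.TODO: {Column.IN_PROGRESS},                 # executor picks the task up
    Column.IN_PROGRESS: {Column.QA},                   # executor hands off to QA
    Column.QA: {Column.FLAGGED, Column.HUMAN_REVIEW},  # QA flags it or passes it on
    Column.FLAGGED: {Column.IN_PROGRESS},              # assumed: back to the executor
    Column.HUMAN_REVIEW: set(),                        # agents stop here
}


def move(current: Column, target: Column) -> Column:
    """Return the new column, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

The useful property is that no agent can skip QA or close a ticket on its own; the last column has no outgoing moves for agents at all.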
I really love it. All of our other documentation lives in notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:
https://github.com/marcosloic/notion-agent-hive
Some people say LLM assisted coding will cost a lot of developers' jobs, but posts like this imply it'll cost (solve?) a lot of management / overhead too.
Mind you I've always thought project managers are kinda wasteful, as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria. But unfortunately that's not the reality and it's often up to the developers themselves to create the tasks and fill them in, then execute them. Which of course begs the question, why do we still have a PM?
(the above is anecdotal and not a universal experience I'm sure. I hope.)
> as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria
That's interesting. In every team I worked in, I always fought really hard against anyone but developers being able to write tickets on the board.
That isn’t the job of a PM.
I wonder whether at some point we'll get a translation model, that translates relatively vague requests into sound architectural decisions, with some embedded knowledge of the environment you're building in, and that can ask clarifying questions when there are multiple options with different tradeoffs.
You tell LLM to create something, and then use another LLM to review it. It might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand at a high level how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase grows into the tens of thousands of lines without me having read any of it.
The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?
The other one is a posteriori: "I want code that works, what do I need to do with LLMs?"
Your approach is the former, which I don't think works in reality. You can write code that works (for some definition of "works") with LLMs without doing it the way a human would do it.
It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?
It's disabled by default though, and in general (especially with other agents) you very much still have to go out of your way to get any sort of reasonable access control.
In principle though, just running the agent CLI in something like firejail would get you very far if you know what you're doing.
You can point Claude at the copilot models with some hackery[2] and opencode supports copilot models out of the box.
Finally, copilot is quite generous with the amount of usage you get from a Github pro plan (goes really far with Sonnet 4.6 which feels pretty close to Opus 4.5), and they’re generous with their free pro licenses for open source etc.
Despite having stuck to autocomplete as their main feature for too long, this aspect of their service is outstanding.
[1]: https://docs.github.com/en/copilot/reference/ai-models/model...
[2]: https://github.com/ericc-ch/copilot-api
That’s the precondition the whole system runs on. The failure mode is invisible. Bad architecture doesn’t look like a crash. It looks like a codebase that works today and becomes unmaintainable.
Edit: a comment below reminded me why I prefer opencode: a few pages in on a Claude session and it’s scrolling through the entire conversation history on every output character. No such problem on OC.
Still a case for it:
1. Isolated contexts per role (CS vs. engineering), so agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging. But the structure has value.
I can’t get my head around if the hobby is the making or the having, but fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately so leaning towards the former.
The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
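A minimal sketch of that kind of anchoring, assuming one markdown file per decision with a UTC timestamp in the filename. The helper names `record_decision` and `latest_decisions` and the file layout are hypothetical, not the setup described above.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def record_decision(docs_dir: Path, topic: str, body: str,
                    now: Optional[datetime] = None) -> Path:
    """Write one decision to its own timestamped markdown file and return its path."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    slug = "-".join(topic.lower().split())
    docs_dir.mkdir(parents=True, exist_ok=True)
    path = docs_dir / f"{stamp}-{slug}.md"
    path.write_text(
        f"# {topic}\n\nRecorded: {now.isoformat()}\n\n{body.strip()}\n",
        encoding="utf-8",
    )
    return path


def latest_decisions(docs_dir: Path, limit: int = 5) -> List[Path]:
    """Newest files first; the timestamp prefix makes names sort chronologically."""
    return sorted(docs_dir.glob("*.md"), reverse=True)[:limit]
```

A fresh session can then be pointed at the output of `latest_decisions(...)` instead of relying on whatever is still in the context window.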
A minor difference between myself and the author is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).
I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick off a codex/claude code session. This can also be helpful for hard-to-reason-about bugs, but I've only done that a handful of times at this point (e.g. a funky dynamic SVG-based animation snafu).
You can see one here: https://github.com/skorokithakis/sleight-of-hand/blob/master...
Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.
Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
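To make the ~5X ratio concrete, here is a small arithmetic sketch. The bottom-level size of 50,000 and the four levels are made-up illustration values, and `level_sizes` is a hypothetical helper; only the ratio comes from the comment above.

```python
from typing import List


def level_sizes(bottom_size: int, levels: int, ratio: float = 5.0) -> List[int]:
    """Size of each level, from the bottom up, if every level is `ratio` times smaller."""
    return [round(bottom_size / ratio ** i) for i in range(levels)]


# Hypothetical example: 50,000 units (lines or tokens) at the bottom level.
sizes = level_sizes(50_000, 4)
print(sizes)  # [50000, 10000, 2000, 400]
```

With these assumed numbers, the top level is small enough to read in full at the start of every session, which is presumably part of why the hierarchy helps navigation.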
My workflow for adding a feature goes something like this:
1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.
2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.
3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.
3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.
4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.
4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.
5. Claude implements the feature.
5a. (Optionally) another instance reviews the implementation.
For complex changes, I'm quite disciplined about having each step carried out in a different session so that all communication is done via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
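The steps above can be sketched as plain data, assuming each session reads only checked-in artifacts and writes exactly one. The `Stage` type, the artifact names, and the exact read sets are illustrative assumptions, not a description of the actual setup.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    reads: Tuple[str, ...]   # artifacts the session loads from the repo (assumed)
    writes: str              # the one artifact the session may change
    review: bool = False     # optional review steps (3a, 4a, 5a)


STAGES = (
    Stage("1 requirements", (), "requirements"),
    Stage("2 design", ("requirements", "design"), "design"),
    Stage("3 test plan", ("design",), "design"),  # test plan lives in the design doc
    Stage("3a design review", ("requirements", "design"), "findings", review=True),
    Stage("4 unit tests", ("design",), "tests"),  # must not touch code under test
    Stage("4a test review", ("design", "tests"), "findings", review=True),
    Stage("5 implementation", ("design", "tests"), "code"),
    Stage("5a code review", ("design", "tests", "code"), "findings", review=True),
)


def plan(complex_change: bool) -> List[Stage]:
    """Complex changes run every stage; simple ones skip the optional reviews."""
    return [s for s in STAGES if complex_change or not s.review]


for stage in plan(complex_change=False):
    print(stage.name, "->", stage.writes)
```

Keeping `writes` to a single artifact per stage is what makes the "tests but not code under test" rule in step 4 checkable rather than just a prompt instruction.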
From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).
Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them & re-create if they're ever needed again.
I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
It uses different models for different modes.
I just find it to be faster and it often gets things right at the first attempt, but YMMV.
Both claims are loud and are flooding the discussion, but under the hood it's mostly a slop disaster.
So the negative sentiment is a natural response (and a dose of realism).
Also, even if agents could do everything, the societal obstacles to change are extensive (sometimes for very good, sometimes for bad reasons), so I’m expecting it to take another year or two for serious change to occur.
Don't most companies use AI in software development today?
And yes, I know that some companies are not doing that because of privacy and reliability concerns or whatever. With many of them it's a bit of a funny argument considering even large banks managed to adopt agentic AI tools. Short of government and military kind of stuff, everybody can use it today.
https://youtu.be/gyy-RcI6pxE?si=e0QPg3jWvwDojKSP
Could someone chime in and give their opinion on what are the pros and cons of either approach?
My editor supports both modes (emacs). I have the editor integration features (diff support etc) turned off and just use emacs to manage 5+ shells that each have a CLI agent (one of Claude, opencode, amp free) running in them.
If I want to go deep into a prompt then I’ll write a markdown file and iterate on it with a CLI.
Whether I use Antigravity, VS Code with Claude Code CLI, GitHub Copilot IDE plugins, or the Codex app, they all do similar things.
Although I'd say Codex and Claude Code often feel significantly better to me, currently, in terms of what they can achieve and how I work with them.
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and with a UI focus)
https://github.com/marcosloic/notion-agent-hive
Ultimately, it's just a bunch of markdown files that live in an `/agents` folder, with some meta-information that will depend on the harness you use.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
This reminds me of the early Medium days when everyone would write articles on how to make HTTP endpoints or how to use Pandas.
There’s not much skill involved in hauling agents, and you can still do it without losing your expertise in the stuff you actually like to work with.
For me, I work with these tools all the time, and reading these articles hasn’t added anything to my repertoire so far. It gives me the feeling of "bikeshedding about tools instead of actually building something useful with them."
We are collectively addicted to making software that no one wants to use. Even I don’t consistently use half the junk I built with these tools.
Another thing is that everyone yapping about how great AI is isn’t actually showing the tools’ capabilities in building greenfield stuff. In reality, we have to do a lot more brownfield work that’s super boring, and AI isn’t as effective there.
One big pain point that has existed forever and has never really been addressed adequately is the ability to come up with requirements.
Sure, it sounds easy: I need the app to do x, y and z. But requirements change in real time. Lack of foresight, changing business needs, unexpected roadblocks and more all contribute.
So, the advice to come up with the requirements by yourself or with the LLM misses the biggest pain point.
I'd like to see a resurgence of flow charts, IPO (Input, Processing and Output) charts and other tools to organize requirements spring up to help with really nailing down requirements.
I will say, though, some of the pain is relieved because the agent can perform a huge refactor in a couple of minutes, but that opens a whole new can of worms.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly true because LLMs are so good???
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.
- Good at generating code that looks great at first glance, but has many unexplained assumptions and gaps
- Despite lack of access to the compiler (Opus 4.6 via Web), most of the time code compiles or there are trivially fixable issues before it gets to compile
- Has a hard-to-explain fixation on doing things a certain way, e.g. always wants to use panics on errors (panic!, unreachable!, .expect etc) or wants to do type erasure with Box<dyn Any> as if that was the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.
So... imo it is a new-generation compiler + code gen tool that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward meant just talking to a model all day every day, I would probably look for other career options.
Yes! I see this constantly. I have a Rust guide that Claude adheres to maybe 50% of the time. It also loves to allocate despite my guide having a whole section about different ways to avoid allocations.
Which is Fortuna's work... stochastic models are like that. And confirmation bias is another phenomenon as well as "how do LLMs align with my worldview" whether I see them more positively or more negatively.