Ask HN: What was your "oh shit" moment with GenAI?
Most of us were amused when DALL-E and its peers went mainstream, and we were quick to point out the obvious flaws. Then ChatGPT hit the scene and again, many of us dismissed it as a parlor trick that would never amount to much. Using LLMs for coding was initially only a small step up from basic code completion, and a welcome farewell to Stack Overflow. I am curious: what was the specific moment that you went from those quaint, dismissive observations to a slightly panicked, "Uh Oh" realization of what these models can do?
467 points by andrehacker - 837 comments
and then I realized that ALL of the software related to it (which I had collected from defunct websites and archived on GitHub) was ancient. After a while I got tired of using WINE every single time and decided I wanted a modern, cross-platform equivalent that did everything those several programs did (plus some stuff that is now possible with a modern computer).
I thought it would be extremely hard because the computer-to-synth communication is pretty much only via SysEx commands (and the actual wave file encoding protocol was undocumented).
Claude walked me through examining some of the original software in Ghidra, and I had a working demo that night... now I'm just playing with adding new features to it.
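For anyone who hasn't met SysEx before, here is a minimal Python sketch of the framing only. It shows the generic MIDI envelope (a 0xF0 start byte, a 0xF7 end byte, and 7-bit data bytes in between). The synth's real commands and its wave encoding are not given in the comment above, so the payload below is a made-up placeholder, and 0x7D is the ID that MIDI reserves for non-commercial use, not the actual vendor's ID.

```python
SYSEX_START = 0xF0  # MIDI "start of System Exclusive" status byte
SYSEX_END = 0xF7    # MIDI "end of System Exclusive" status byte


def frame_sysex(manufacturer_id: bytes, payload: bytes) -> bytes:
    """Wrap a manufacturer ID and payload in a SysEx envelope.

    Every byte between the start and end markers must be a 7-bit
    data byte (0x00-0x7F), so anything larger is rejected.
    """
    body = bytes(manufacturer_id) + bytes(payload)
    for value in body:
        if value > 0x7F:
            raise ValueError(f"SysEx data byte out of range: {value:#04x}")
    return bytes([SYSEX_START]) + body + bytes([SYSEX_END])


def unframe_sysex(message: bytes) -> bytes:
    """Return the bytes between the SysEx start and end markers."""
    if len(message) < 2 or message[0] != SYSEX_START or message[-1] != SYSEX_END:
        raise ValueError("not a complete SysEx message")
    return message[1:-1]


if __name__ == "__main__":
    # Placeholder ID and payload, purely for illustration.
    message = frame_sysex(b"\x7d", b"\x01\x02\x03")
    print(message.hex(" "))
```

Sending 8-bit sample data through this channel needs some packing scheme on top, which is the device-specific part that had to be reverse engineered.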
I started digging a few times, but I prefer the "magic".
Hopefully, there will never be a time when at least some folks are not reading books such as:
https://www.goodreads.com/book/show/44882.Code
I don't see an AI-as-explainer future where expertise isn't sacrificed en masse.
Capitalism rarely supports an alternative that is economically unproductive today for the sake of future good.
The recent AI tech layoffs are a warning sign that corporate leaders will happily shoot their company's (and the future's) expertise to pad next quarter's financials and trust in 90% correct, but much cheaper, AI.
What's wrong with (screen-, probably) keyboard?
https://pmc.ncbi.nlm.nih.gov/articles/PMC11943480/
I guess tech unsavvy people who are easily amused by LLM tricks will always exist, but they'll be an increasingly smaller minority as time goes on.
A lot of people in the industry have vested interests in this not being discussed openly so you don't hear too much about it, but the implications are huge.
E.g. a synth has a public key embedded. To change settings, you upload them to the vendor, who blesses them with their private key.
Hacking such a synth requires either jailbreaking the synth, or the vendor losing their key. Both can be mitigated with tamper-resistant hardware.
We're well ahead on this path already, and I assume AI will accelerate it. This is very bad news for the right to repair.
1) Current encryption was not available in the 1990s. That was the age of DES and of weapons-grade vs. commercial encryption. There was a legal cost blocking strong encryption.
2) Manufacturers were not as strongly opposed to people touching the internals. After WW2, most people could fix anything, because survival depended on it. Even in the '60s, radios etc. came with schematics, and building your own was normal and cost-effective. The shift happened in the '90s, with governments requiring licensing for everything, and mass manufacturing making repair less cost-effective than buying a new one.
Our current culture where only people blessed by the manufacturer are allowed to do anything is very recent.
[trying-to-generate-random-making-sense-content]
Let me gasps ask: The older six-fingers-"AI"-characters had learned a music-instrument by now, and are much more capable of playing music you otherwise hadn't known or thought about..."?
um What about those early shadowy boygroup, whom seem asian, no ? (-;
[after-losing-entry-address-of-topic-question]
But back to your trustworth-written text, Yes!
regards,
[0] https://mforney.org/blog/2026-05-28-patching-my-guitar-amps-...
[1] https://schwung.dev
One day a few years ago my dad came by and was admiring it (it was a QS8) and asked to borrow it so he could play piano again.
I, of course, said sure, but was feeling a little salty about it inside, because I wanted to play it, that's why I had it all set up.
Anyway, about a year went by and I asked him about it to see if he was done with it.
He said "oh that thing? I gave that away, was just taking up space"
-.-
Don't get me wrong, I think AI can do some surprising things, but with stuff like this, often it just stole the code and the steps without attribution, it didn't figure it out.
There'll probably be a blog post detailing exactly how to do this somewhere and Claude just copied the steps and code.
And worse, Google search would have found it 10 years ago, but Google search today would claim there are no results?
I think incredibly specific stuff like this often won't pass the 'did Claude just steal this?' test when you dig into it.
I have a DigiTech GNX3000 effects pedal board - a digital modeling "workstation" that needs the aged Windows native software or gdigi to make the most of.
At best, the experience with gdigi was passable; raw access to the patches and controls, the ability to control it from the laptop, etc.
In an hour or so, I had a functionally superior webmidi version up and running in Vercel using their v0 code. It kicked off a wave of subscriptions and referral chasing.
I made it a template - because there are so many gnx3k users out there: https://v0.app/templates/digitech-gnx3000-sysex-tool-GC5LzXA...
I wanted to be able to decrypt the files on The Complete New Yorker magazine DVDs. The old software was WinXP only, and crashed by the time you turned to page 3 or 4. It walked me through using Ghidra on the relevant dll, mapped out how it was using Blowfish, what the credentials were that it was passing, and re-implemented all of that in a python script.
Now all the files are in plain pdf.
Right now, it's helping me write an extension to the mkv specification for embedded scripts and modify VLC to be conformant, so I can watch Black Mirror Bandersnatch. Already have a buggy implementation, about 3 days in.
I've also had it add BEP 46 mutable torrent functionality to Transmission (and to some extent, to the WebTorrent library).
These are all well beyond my abilities to do casually, and probably beyond my ability to do even if I spent the next 18 months doing nothing but grinding away at it.
I only replied because I thought it curious that Claude apparently favors Ghidra.
If you want the Rolling Stone or Playboy archives decrypted, ReconSuave on github has tools to do those. I got tired of waiting for him to do The New Yorker though.
I’ve definitely kicked myself a few times for not posting about them sooner, but the fear of pissing off Condé Nast tempered my willingness to show off.
Do any of the cbz readers handle jpeg2000? It makes a big difference in filesize without any quality degradation. Like 40% smaller, maybe more in some cases. You should tinker with that if you have the time.
Also FYI you can use mupdf to read cbz archives although I don't personally recommend it for that usecase.
Claude needs good variable names a lot less than humans do, so renaming/typedefing doesn't seem to be as necessary.
Reddit is definitely overrun with astroturf at this point, especially in any subreddit where there is any kind of business interest in doing so.
They’re getting 200m a year to share that garbage pit for AI training.
There were problems with the "algorithm"[0] before being intentionally gamed... but I feel like it's easier than ever to have an army of bots intentionally steering a conversation.
I'll give you a brief example (we should not debate this case on HN but I'm talking specifically about a hypothesis that can be drawn).
In the UK, the streets are alight with the news of the murder of a young white boy at the hands of a Sikh boy. In the UK, historically, the Sikh community has been seen as a bit of an outlier in terms of how well they integrated and how tolerant they are. They are, for all intents and purposes "model immigrants".
The issue was, when the police were called to the scene of the stabbing, they immediately arrested the barely conscious boy on the ground, who had been stabbed and was bleeding from the mouth, handcuffing him and dismissing his pleas on the basis that he might have been making racist remarks.
The people in my circles are furious at the police for doing this.
However, it seems like there is an army of bots who keep trying to paint it as if it's a problem with Sikhs, or that their religious rights were the problem.
(for context, British law permits baptised Sikhs to carry a dull Kirpan knife - but this knife was not used for the incident above, it was a separate and still illegal knife to carry).
So now, online, instead of people talking about police indifference or the issues we have when we see race before trying to assess the situation: we are instead caught in a conversation about banning Kirpans for Sikhs (they have never been used in any crime from what I can tell) or how the Sikhs should feel bad or whatever.
Which is wonderful for a politician who wants to dismiss the criticism against the police as racism, but not so good for holding the police and media to account for the situation and the attempted coverup.
By being able to control the aggression, they are able to control the response, which drowns everything else out, and with LLMs the barrier is stupidly low to do this.
Creating division for your own reasons has never been cheaper or more easily accessible.
[0]: https://en.wikipedia.org/wiki/Cambridge_Analytica#Data_scand...
"we could find a nail for this hammer... or we could just hammer everything until we find the nail or make too much money hammering for a bit then sell in may and go away".
It's still annoyingly addictive, but it's half storytelling and half bubble with a sprinkle of porn and niche interests.
But what does an anon-free internet even look like? Is it even possible? Or will all online content eventually be considered untrustable and worthless? You can see a world where newspapers (online or otherwise) make a comeback simply because of the need for a trusted gatekeeper (which is what I imagine made them valuable in the first place). It's wild to think about.
Current bots are already run and financed by humans.
And what makes you think that current newspapers will evolve and regain trust? (as most of them are financed by 'rich' owners and therefore somehow influenced by the 'desire' of the owner).
<insert not-a-bot proof by anubis + 'pseudonymous reputation' >
There are few topical short-form non-fiction writers who feel truly worth reading rather than an agentic summary, and I get a visceral pleasure just following his words and logic to their well-tuned conclusions.
Just reminds me how bad a lot of the AI-accelerated content is these days. May have to few-shot to improve my own writing :)
Very few things in life experience exponential growth, and assembled systems that do rarely stay that way: they usually turn sigmoidal. I.e., it's exponential and the end is nigh, xor it's exponential then sigmoidal, xor it's linear.
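A quick numerical sketch of that distinction, in Python. The growth rate `k` and ceiling `cap` are arbitrary illustrative values, not fitted to any real data. The point is only that a logistic (sigmoidal) curve is nearly indistinguishable from an exponential early on, then flattens at its ceiling.

```python
import math


def exponential(t: float, k: float = 1.0) -> float:
    """Unbounded exponential growth starting at 1."""
    return math.exp(k * t)


def logistic(t: float, k: float = 1.0, cap: float = 1000.0) -> float:
    """Logistic (sigmoidal) growth starting at 1 and saturating at cap."""
    return cap / (1.0 + (cap - 1.0) * math.exp(-k * t))


if __name__ == "__main__":
    # Early samples track each other closely; later ones diverge.
    for t in (0, 1, 2, 5, 10, 20):
        print(f"t={t:>2}  exponential={exponential(t):14.1f}  logistic={logistic(t):8.1f}")
```

From inside the early part of the curve the two cases look the same, which is why extrapolating from a few early data points cannot settle which one you are on.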
Also a bit hilarious to believe that a single 3 month cycle at TSMC could determine a severe amount of propped valuations.
[1] https://www.theverge.com/tech/943066/tsmc-ai-demand-struggle...
We are in 2026, and supply chains are highly optimized, which means we could make and deliver a lot of robots in a relatively short period of time.
We also have solved all fundamental issues we had 20 years ago like communication thanks to the internet, translation and co.
We now have a system which can be copied and pasted and run in parallel with a snap of a finger.
If (and this If is not that crazy) some breakthrough happens tomorrow, this can be used the day after tomorrow.
I currently say that it could become very very interesting in 5-15 years. I still follow AI very closely and I do not have the feeling anything is slowing down.
And independent of something happening, a lot of people did not find jobs due to this AI investment, a lot of jobs already disappeared too.
And GenAI to mass produce misinformation and propaganda is a whole other thing. You see this right now with the Alberta separatists in Canada. Comparing this to pollution like a means to an end is dismissive.
https://news.ycombinator.com/newsguidelines.html
I called my normal HVAC company for my rental home because the tenant reported the AC wasn't cooling the house. When I called, I got one of the latest AI voice assistants to help me, and it was an awful experience and I ended up not hearing back after the assistant told me the office would call me back.
So, I went over to the house and used ChatGPT to help me diagnose the issue by taking some photos of the compressor panel outside. It walked me through what to check, I provided some diagnostic codes I witnessed... and it walked me through the very simple repair of replacing the $25 capacitor. It was going to cost me almost 4x that just for the service call to diagnose what was wrong in the first place.
So, the weird experience was: Gen AI made me lose trust in my normal HVAC company, and more Gen AI basically allowed me to replace my HVAC company and do the repair myself all in one day.
Like the time I had one of the bigger shops in town come by to get a quote for replacing a dual stage fan motor on an AC. The tech asked me if I'd like them to replace the contactor while they were in there because it is a part that often fails. I asked what a contactor was and he explained it. "Oh, like a relay?" I asked. I told him to quote the cost for "replacing the contactor, while they're already in there."
He quoted me $400 for the contactor, $750 for the fan. The contactor itself, I later found out, was $7. I literally laughed in his face when he said that.
So, like I said, you just aren't going to get professional level assistance from an AI. Thankfully.
To end the story: one of the other guys I called for a quote on fixing this unit repaired it for free; the unit was still under warranty and it was fully covered. The original installer of this $12K unit was refusing to return my calls. Another "Not gonna get pro level service from an AI" story.
With an AI, I can say "I don't understand that part, can you explain more?" Or "what about this concern I just thought of", or "I already know almost enough about this, I just need this one gap filled in." It's an objectively better experience.
However, neither AI nor (most of) the videos can help with finagling frail snap-fit assemblies one encounters in appliances. A lot of appliance repair work is very simple but requires significant practice and figure-it-out time, in addition to waiting for sketchy parts from Ebay. Half the battle is just finding the damn parts.
As words in a dictionary are sorted alphabetically rather than phonetically, this is unhelpful.
YouTube videos have the same kind of problem, in that you can only easily find the video explaining which dielectric unions suit your problem when you already know what those are (to use an example that I had to ask ChatGPT for because I have no plumbing experience even if I did know about galvanic corrosion and therefore immediately understood why they're important once I saw the name).
In 2009 or so a projector at some event that needed one wouldn’t start, and I noticed it was flashing a pattern, so I found a computer and internet connection (both very slow), painfully found and downloaded the manual for that model, and identified that it was saying the fan wasn’t starting. Lo and behold, a strut was broken and obstructing the fan blades, and bending it out of the way fixed it, and the event was able to begin.
I’ve found manuals for a drawbar organ, multiple digital pianos of different ages and brands, AC split systems, and more. Manuals are good stuff. They don’t cover everything, but they’re very useful.
For these sorts of things, AI is doing approximately nothing for you: you would do better (and learn more!) finding the actual manual, or you’ll want to see someone doing the thing in a video.
Tbh, I think people feel more comfortable asking an AI. Even though I “know” it’s all smoke and mirrors, I still prefer the human-like interaction to the grind of watching video after video and building my own understanding.
OOPS… there you see how it’s going to end. I’m the meatspace button-pusher.
I guess I'm seeing similar benefits to a novice programmer. Professionals would scoff at my work but they are expensive and difficult to work with. Meanwhile I'm getting the job done.
On the other hand I'm not touching AI for any development work. I'm too worried about my skills atrophying or not properly learning anything new.
It feels like there is precisely enough information to deduce each step. But only just enough: miss one clue and you have something on upside down at step 7 that you won't notice until step 37.
I feel whoever makes them could probably make a wicked NY Times Crossword puzzle.
They’re also actually good if you know to follow them exactly: double check every side, every hole, every screw and you won’t go wrong.
Kind of a superpower to turn anyone with a bit of tech inclination and problem solving skills into an HVAC tech - not a very good one, but one with enough motivation to get the results you need
(Though that's also the kind of hands-on troubleshooting step/fix that a person could just google for and find pretty easily back before the internet got all fucked up.)
And that was awesome. Thanks, Google! :)
I don't know where the change happened. It certainly wasn't overnight.
Where Google used to be magical and other search engines quickly improved, it all kind of turned into shit.
It really seems that I was getting better, more-direct results from Altavista 30 years ago than I do with top-flight search engines today. (That's a deliberately low bar, chosen because Altavista wasn't even intended to be "good" back then. I mean, it started as just a side project at DEC to demonstrate that their Alpha hardware was able to index the entire World Wide Web.)
So lately, I've been doing the same thing as you: I'm increasingly using ChatGPT to do this basic fact-finding stuff. In this way, it mostly operates the search engine for me, but it lets me drill down through a sea of terrible search results to find something useful fairly quickly.
It's still not great -- I still have to reject mountains of bullshit. But it's better than alternatives, and I can reject the bullshit with conceptual descriptions instead of trying to get Google to do what I need it to do (what it used to do).
It feels all wrong using an LLM to do this stuff, but whatever. I'm still getting stuff done.
Did you attempt to prompt it further into figuring out the actual problem, or know what they did to actually fix it? My bet is on a bad starting capacitor for the motor --- something that's a relatively cheap and quick repair.
It walked me through measuring refrigerant, subcool and superheat, pulling the vacuum, brazing the lines, exactly what tools to buy, I even input the numbers from the meter and it told me how much to add and so on. And this was with GPT4 or something far less intelligent.
In the past I tried to learn this stuff but the HVAC community are massive gatekeepers and try to hide information behind paywalls or spread FUD even though anyone could do it with the right tools and a little bit of knowledge.
I assume they recorded videos and uploaded them in the Gemini app on their phone, and then probably said "what's wrong?"
Gemini is very good at those kinds of things. I recently got some ratcheting straps and needed to use them, but at the time I didn't know what they were called, so I didn't know what to search for on Google. I opened the Gemini app, pushed the button to take a picture (just like in text messages,) and included a message that was similar to "what is this and how do I use it?"
The exhaust blower not working triggered a safety that prevented the furnace from firing.
Spinning it bypassed the safety.
You likely inhaled a lot more carbon monoxide than you know.
Helping something start is not likely to ruin your day (unless you get caught in a rotating part)
"Spinning it to bypass the safety" is not a thing.
Please don't spread FUD.
If the exhaust fan couldn't maintain that negative pressure after the user stopped spinning it, the furnace would turn off again.
Their hack worked because the fan couldn't get up to speed from a standstill (bad capacitor, dusty bearings, etc), but could maintain speed once it got there. Have you never had an old home fan that would just hum when you turn it on but then work fine if you gave it an initial crank? Same premise.
There was no risk here. If the fan didn't spin up to speed after that initial manipulation, and didn't constantly maintain the necessary flow, the furnace would have turned off again.
High hundreds of thousands feels like the upper limit before it would show up in statistically noticeable changes in patterns of deaths in some demographic.
High hundreds of individuals would still be "one in a million fatal errors over a few years", which seems better than I'd expect given I've personally had ChatGPT tell me that Solanum nigrum berries were "black tomatoes" (they're not usually fatal, but are a bit toxic, and no I did not eat them).
Imagine one of the models that has "accidental-deaths-via-bad-advice" just slightly turned up, with the model-provider's intent being to kill 5% more people per year.
“At its core, it's a small motor with a fan attached that has one primary job: to vent harmful exhaust gases out of your home before the burners ever kick on. This is the very first step in the heating sequence, and it's non-negotiable for a safe startup.”
As exhausting the combustion products is a critical safety feature, I would be surprised if any furnace was designed such that it could possibly keep running if the draft inducer motor stopped. It seems like it would be trivially easy to make a circuit such that gas valves could only open if the draft inducer motor + fan was spinning.
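The interlock being described can be sketched as plain boolean logic. This Python snippet only illustrates the idea in this subthread and is not how any particular furnace control board works. `draft_proven` is a hypothetical stand-in for whatever sensor confirms that the inducer is actually moving air (the negative-pressure check mentioned upthread).

```python
from dataclasses import dataclass


@dataclass
class FurnaceInputs:
    call_for_heat: bool  # thermostat is asking for heat
    draft_proven: bool   # hypothetical sensor: inducer is moving enough air


def gas_valve_may_open(inputs: FurnaceInputs) -> bool:
    """Gas is allowed only while heat is requested AND draft is proven."""
    return inputs.call_for_heat and inputs.draft_proven


def simulate(draft_samples):
    """Evaluate the interlock for a sequence of draft readings with heat requested."""
    return [gas_valve_may_open(FurnaceInputs(True, d)) for d in draft_samples]


if __name__ == "__main__":
    # Fan stalled, hand-spun up to speed, running, then stalling again.
    print(simulate([False, True, True, False]))
```

Because the condition is re-evaluated continuously, a fan that was hand-started but then stalls drops `draft_proven` and the valve closes again, which matches the argument that a manual spin does not bypass the safety.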
I honestly don’t understand AI naysayers. I use Claude every day both professionally as a Solution Architect and personally in a variety of projects I simply could not have ever approached alone.
I think that's part of the divide between enthusiasts and naysayers. If you use GenAI on things that you couldn't approach alone, it's an incredible tool. If you use it on stuff that you're pretty good at, it's not a gamechanger (and if you're an expert, it's a minor boost at best). Many people's jobs are about doing what they're an expert at.
This speedup is great. It improves the overall quality of the product (as perceived by the users) because I can ask Claude to add features that my customers and I would have dismissed because they take too long to implement. We would have settled for a more basic UX.
So is it a game changer? It is in the same way those HTML / CSS frameworks like Bootstrap were game changers: suddenly every developer could create a decent and consistent UI in a fraction of the time, with a few bells and whistles that we wouldn't have bothered coding. As a side effect a lot of web apps felt like look-alike mass products and web designers had to reinvent themselves, but the economics led inevitably in that direction. Would I again spend one or two weeks doing alone what I could write in a day or two with an LLM? Not anymore, not at this cost ($20 per month).
This was probably true last year, and it’s a common talking point, but I’ve seen too many examples now of deep experts using Claude & Codex in the last year to solve very big problems, and write or rewrite large systems. The experts do complain that the LLMs can sometimes get stuck or go off the rails and they need to pay attention and actively steer. But nobody I know who’s using it is still claiming the LLMs aren’t a game changer, even quite a few people who were staunch holdouts for a long time. I was skeptical myself, for a long time, but had my oh shit moment late last year.
One caveat - to get expert results, you do need to have some experience using LLMs, you need to use it to write plans and design docs, know how to use ‘skills’ and MCPs, use it to review code, and (for now) you need to understand context compaction and when/why to use sub-agents. If you’re a domain expert but an AI noob, it’s less effective than an expert who knows how to use AI and has experience.
One of the biggest problems with humans is we’re wired to spot patterns and draw conclusions and then we have a really hard time seeing and accepting change and updating our mental rules. The LLMs are getting better. They have already gotten better, and they’re going to continue getting better. It’s too early to draw conclusions, and many conclusions people have already declared are out of date and no longer true.
If the use is half decent, people just don't notice it.
Despite all the liars telling me gaming is easier on Linux than Windows, most new games have some sort of issues launching with default settings. CC is able to dive into both the exact error logs and the recent community feedback on what tweaks / configurations are needed to make it work. I rarely have to go beyond two prompts before a game is playable. CC and Proton are enabling the Linux gaming experience far more than Linus ever has or ever was interested in.
Heh - I've just gone through a similar journey transitioning from Windows to Bazzite to play Steam games on Linux. I wouldn't have bothered pre-LLMs because my day job is Linux/Software and the thought of trying to fix issues here just to play games put me off.
If you work on architecture and Claude docs, then you can essentially just have it fill in the gaps. Work then mostly becomes a matter of defining what the next piece of functionality is (which you can also use Claude to help with).
The stuff that used to take days now takes hours. It's not perfect, but if you get your codebase into a good shape then the payoff is huge.
It's so obviously AI, and it has much less value than I thought now that I look at it with fresh eyes.
Worse, it doesn't read like I wrote it. I don't recognize myself in the doc.
> If you use GenAI on things that you couldn't approach alone, it's an incredible tool.
I think this isn't true in all cases
> If you use it on stuff that you're pretty good at, it's not a gamechanger (and if you're an expert, it's a minor boost at best).
I think even then there's a divide.
I mostly work greenfield projects (and love it!). For these, AI has been a literal game changer. Our projects are built faster, with one or two orders of magnitude more automated tests, and all quality metrics are up.
Meanwhile, nearly all of my friends complain that AI doesn't help them. But they mostly work in very large existing codebases.
Still, even in large projects I think AI (the expensive variant) has been a complete gamechanger for me. Sure, I spend a lot on tokens, but I just feel happier and enjoy what I do more. The refrain people repeat about "thinking at a higher abstraction level" is what I feel. I really am thinking about architecture and larger patterns, instead of the boring nitty-gritty (which wasn't boring at all when I was a kid learning to code!...)
I think a key factor in all of this, to me, has been dictation. Most of the time, I don't write -- I use voice-to-text. I don't even read what comes out of it -- the LLMs get it (it is mostly unintelligible to anyone else).
This means when I'm planning a big feature, I give a gigantic brain dump to the LLM in perfect stream of consciousness way, going through ideas, pros and cons, edge cases, what exists, what doesn't exist, where I'm sure of something, where I'm not sure and want the LLM to browse the state-of-the-art. Sometimes I spend 20 minutes just talking to the microphone before I send the first prompt. When I pair that with Opus, I find that I am able to build much faster and to go through alternative designs much more frequently as well.
I keep trying to tell all my friends: use voice to text and braindump to the computer. But they refuse... I couldn't imagine having to type everything nowadays. Even though I'm a fast typer, it's still much slower than the speed of my thought, which, granted, is still faster than the speed of my voice.
In effect, I filter much less, but I've come to think that's positive for the good LLMs: I throw all the edge cases and what ifs I'm thinking about -- all those years of experience dealing with similar systems.
If I wanted to go back to work in-office, that would be my major problem: I need to be able to talk with my computer all the time, loudly, and pacing through my room.
I run MacWhisper, and I paired it with BetterTouchTool so it triggers on any input when I double tap the fn/globe icon.
Obviously all of my transcriptions through it are entirely local. I usually use the Large V3 Turbo model, though in the beginning I used Parakeet v3, which was slightly faster but produced more mistakes (and kept a lot of filler words -- 'ahhm', 'hummm').
However, if I'm interacting with the Claude or ChatGPT/Codex apps, I often use their voice recognition instead, because it tends to be more accurate, especially with punctuation, albeit significantly slower. OpenAI's is noticeably better than Anthropic's, but I feel like that gap has closed a bit recently (might be all in my head, though).
Like I said I don't really care about mistakes in the transcription. If you try to read it, it feels like a fever dream, but the LLMs get it.
If I say "taken" it may have "take and". If I say "all the while calling the method" it might have "although a while. while. call in the met of". This is a rather extreme example but I've seen them happen. The repetition of words happens because I'm talking with "hums and ahs" and do repeat words or just the ends of words. It's very rare for the models, especially Opus, to have any issue with this transcription. When they do, they tend to signal to me they didn't get it, or I catch them in the act. But, like I said, it really is very very rare.
As an example, I've got quite a significant feature to work on, which would have probably taken me weeks to design and implement, and I've used this exact method today to ink out the plan:
- I have spent the last couple of days researching the feature in my off-time and just "thinking about it in the background" (think: I fall asleep thinking of it -- a habit I've always had)
- I spent ~25 minutes brainstorming out loud. The transcript ended with ~17,000 characters and ~3,000 words.
- I sent that transcript, in Cursor, to Opus 4.6-High with instructions on how to iterate on it and how I want to work while planning
- I then spent about 1.5 hours with it iterating and building the actual plan (and supporting technical decision document, which points at the FULL transcript of the whole interaction). Many of my original ideas made it to the final plan, others got scrapped or simplified, and others still got added. It contains a mixture of my ideas, Opus' ideas and our push-back on "each other".
- Now I have a multi-step plan, with at least 8 distinct stages to implement this massive feature which I know for a fact would have taken me weeks to implement, and I expect to implement it in at most 3 days, but very likely it will be a day and a half.
Final context (with regards to your Claude Code question): My main development environment is Cursor, though for personal projects I also use Codex and Claude code. For the initial "researching of the feature in my off-time" I often have interactions with ChatGPT and Claude where they have no access to the codebase, and I have them go find out what the state of the art on specific topics is. All of these interactions also involve me using my voice to talk to them (though nowadays I don't typically use their voice mode, I just let them reply in text). Then I brood over that.
The greatest danger in using AI falls precisely on the people who stand to gain the most from it.
This is no exception.
As an AI naysayer, I see and appreciate the productivity gains, I don’t like the associated cost, mostly the spike in workflow centralization and opaqueness.
What trillion dollar problem is AI solving?
Yeah, like writing the code yourself!
So one-shotting a game of Snake should be great (tons of training data, errors are easily caught because it's a small program). Similar with building a lot of web UI front end, or one-shotting a personal project. On the other hand, I haven't been convinced that it's good enough to maintain large codebases or assist with niche topics that are not very well documented.
This became evident to me the moment I tried to have these models work on some PowerShell tasks for me. Even Opus today struggles with PowerShell.
Since anything in PS is probably some internal sysadmin tool, there's not much public code out there outside of Microsoft's documentation. Plus the Verb-Noun naming scheme makes it really easy to just hallucinate cmdlets (which it does, often). It's easier to have the LLM just do things in Python using the M365 Graph API than with any of the provided PowerShell cmdlets.
OTOH, I've been using Claude for a lot of Swift & SwiftUI work lately and it has no problems there, and I'd imagine there's even less publicly available training data for that, so to be honest I'm not entirely sure why it fails so badly at PowerShell.
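One cheap guard against made-up cmdlets is to pull every Verb-Noun token out of a generated script and compare it with the commands that actually exist on the target machine. A minimal Python sketch, assuming the real command names were exported beforehand; the `KNOWN` set and the sample script are made up for illustration, and the regex only catches Pascal-cased tokens:

```python
import re

# Hypothetical allow-list; in practice, export it from the target machine,
# e.g. with `Get-Command | Select-Object -ExpandProperty Name`.
KNOWN = {"Get-ChildItem", "Get-Content", "Set-Content", "Where-Object"}

# Verb-Noun tokens: a capitalised verb, a hyphen, a capitalised noun.
CMDLET_RE = re.compile(r"\b[A-Z][a-z]+-[A-Z][A-Za-z0-9]+\b")


def unknown_cmdlets(script, known=KNOWN):
    """Return Verb-Noun tokens in `script` that are not in `known`, sorted."""
    lowered = {name.lower() for name in known}  # PowerShell is case-insensitive
    found = set(CMDLET_RE.findall(script))
    return sorted(name for name in found if name.lower() not in lowered)


# Made-up example: Get-FileOwnerReport is not in the allow-list.
sample = "Get-ChildItem C:\\logs | Get-FileOwnerReport | Set-Content out.txt"
print(unknown_cmdlets(sample))
```

Anything this flags still needs a human look (it could be a function defined in the script itself), but it turns "did it invent that?" into a quick check.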
I use it to wrap ping.exe with colors and fewer columns, for example. yt-dlp wrapper to fetch 480p bestaudio with English subtitles, no playlist, works on a surprising number of video sites.
It does make cmdlets up, you're right, there.
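For reference, the yt-dlp wrapper described above boils down to a handful of flags. A rough Python equivalent (the original is presumably PowerShell; the function name is made up, and the format selector is one reasonable reading of "480p bestaudio"):

```python
def ytdlp_args(url):
    """Build a yt-dlp command line for one video at 480p or lower."""
    return [
        "yt-dlp",
        # Best video up to 480p plus best audio; fall back to a combined format.
        "-f", "bv*[height<=480]+ba/b[height<=480]",
        "--write-subs",        # download subtitles if the site offers them
        "--sub-langs", "en",   # English only
        "--no-playlist",       # if the URL is part of a playlist, fetch just the video
        url,
    ]
```

Running it is then `subprocess.run(ytdlp_args(url), check=True)`, assuming `yt-dlp` is on the PATH.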
Same is true of humans. So far my experience is that addressing the issue with the help of AI is faster than not (ie comprehending the system and creating the documentation).
This feels a bit like whataboutism.
It also feels like people don't listen to each other.
For example, reading the previous comment, it feels like the thing that reduced the enthusiasm was that at first GenAI looked like it was "reading, understanding and using its own knowledge to answer the problem", but as soon as it is a more niche or a more complex situation, GenAI looks like it "does not understand the code, just does the equivalent of a StackOverflow search and tries to apply the solutions that it found there, and this is why it felt like it understood the code before".
It does not at all mean that GenAI is not terribly useful. It is even better than humans in some situations.
But it feels that answering "same with humans" is missing this point: that's the opposite, humans usually try to understand the code and are bad at covering a very large range of very well documented subjects. That's the "uncanny valley" they talk about: they assumed GenAI performance on a subject X is due to a "human-like" approach, and it feels very strange when this impression falls apart.
It's the famous "email broken, fix pls" but in the form of an LLM prompt.
It can be frustrating to observe people interacting with these things. But it was just as frustrating 20 years ago, so maybe it's just a constant.
I don't think this is just about intention and willingness, it's just simply hard.
Or... were you illustrating?
Learned helplessness.
It fixed my printer after a dist-upgrade and a separate Chrome upgrade: the printer worked everywhere but not in Chrome.
After 30 years of using Linux I didn't even want to know what was wrong. Is it colord again? A dbus + cups issue? I had completely accepted that I wouldn't be able to print from Chrome for a couple of months until the next update.
I just ran it in dangerously-skip-permissions mode and said 'my printer doesnt work in chrome'. A few minutes later I heard the printer printing "This is test" and it said 'I think its fixed, do you see a page coming out of the printer now?'
I had been dabbling with NixOS for a year or more and liked it but without deep knowledge it's fairly hard to approach. I've gotten partially working setups in the past, but a couple weeks ago I decided to go all or nothing.
Now I have a workstation setup that I can reproduce by just grabbing my home directory and /etc/nixos and doing a rebuild.
And it's a somewhat complicated setup. It entirely configures vim (2500 lines of nix config for that, but it's in one file rather than sprayed all over like astrovim), builds gitbutler CLI, has all my secrets encrypted with SOPS, manages my git configs and my ~/bin directory, has sway set up, finally has audio and ability to screenshot (something that had been broken on my 22.04 Ubuntu and I just didn't want to bother with).
NixOS + Claude Code to manage it is really, really good.
https://support.google.com/chrome/thread/81894848/mac-printi...
https://support.google.com/chrome/thread/5843479/printer-sti...
https://discussions.apple.com/thread/255327049
I really don't care.
I’ll happily accept best effort in exchange for it being so cheap that I can throw it at any trivial annoyance.
It’s worth keeping in mind that the alternative is not really that I learn to fix the printer. It’s that I forgo printing and walk someone technologically illiterate through Docusign or something instead.
There’s no world where I spend 2 hours debugging my printer connection.
A: Linux
Yesterday I left Opus 4.8 to go do some architecture research, with GPU access.
It replicated and trained a credible baseline. It implemented some ideas I'd been thinking about, and wrote custom CUDA kernels for them. It read and summarised dozens of related papers.
It has since run dozens of experiments, with minimal supervision. When a model is unstable it kills it, documents why, fires off a new configuration.
The realisation that frontier labs are doing this at scale with unlimited GPU and token budgets.
It actually scares me a bit. The realisation that the next big breakthroughs will only have light human involvement.
The prospect of recursive self-improvement feels more real to me all of a sudden.
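The "kill it when it's unstable, document why" step mentioned above can be as simple as a check on the loss history. A sketch in Python, under my own assumptions about what counts as unstable (a non-finite loss, or the recent average blowing up relative to the best value so far); this is not what the agent actually ran:

```python
import math
from typing import List, Optional


def instability_reason(losses: List[float], window: int = 5,
                       factor: float = 3.0) -> Optional[str]:
    """Return a short reason if the run looks unstable, else None.

    Heuristics (assumptions, not a standard): a non-finite loss, or the mean
    of the last `window` losses exceeding `factor` times the best loss so far.
    """
    if not losses:
        return None
    if any(not math.isfinite(x) for x in losses):
        return "non-finite loss"
    if len(losses) >= 2 * window:
        recent = sum(losses[-window:]) / window
        best = min(losses)
        if best > 0 and recent > factor * best:
            return f"loss diverged: recent mean {recent:.3g} vs best {best:.3g}"
    return None
```

A supervising loop would call this every few steps, stop the run when it returns a reason, write that reason next to the config, and launch the next configuration.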
I'm not going to say it's a perfect prediction, but I do find the trajectory of "can write something reasonable" to "oh can write snippets of code" towards larger and larger systems feels like it's played out - the common thing I see more now is that people talk of "taste" that the humans are contributing more than the raw coding part.
I get what you mean with this rather automated research, I've done it on a smaller scale with performance work because it can run/test/measure/propose changes/debug and loop. I can throw a vague idea at it, guide it or discuss with it and go and make a coffee.
I think it's more like driving very fast, keeping a very close eye on the road and not knowing whether there is a speed limit just ahead.
https://www.anthropic.com/institute/recursive-self-improveme...
We are at the foot of a very sharp upward trajectory.
Another: Claude Code, via headless Ghidra, cracked some software for me that was phoning home to a server that no longer exists.
Another: I am a teacher, and grading and feedback are very, very time-consuming, especially in loose workflows with several sources and tools that are not connected. During class presentations I take loose notes. Now I have a local folder where I drop (1) my student list, with names and emails, (2) my loose notes, and (3) a grading & feedback sheet template; then Claude creates a sheet per student, formats and copies the feedback to the right sheet cell, waits for my corrections, then sends everything to their school emails. Much easier, much less time-consuming.
Yikes! Is this legal in your country?
I started out prompting ChatGPT kinda how I would with Google, one small prompt at a time, asking about various details. But after one or two of those I just tried "I want to tow a car of make A with my truck model B, from point C to point D, what are my options?" And it wrote me a report with comparison tables and computed towing weights and other details for different options.
At that point, I was like "Oh. This is different. And it's just the beginning."
I prompted the AI to write a report as if it were a home inspector and it actually did a better job and identified some issues the paid 750 usd inspector missed.
Is that what you're telling us??
Out of laziness I several times asked Claude and ChatGPT each some torque figures and other simple, hard data related to my dirt bike. They often got it completely wrong, but full of confidence every time. I never trust LLMs with hard data, unless you RAG the PDF into the context and even then it's sketchy.
I made this personal 'benchmark' of odd and strange questions a few years back when this took off, and I would keep re-running these questions whenever some big news came out about a new model, also going back and forth between the different companies to see where they all stood. (Obviously with a clean cache/new accounts.)
10 questions. In 2023 it could only get past question 3-4; last year it reached the last question but was still hallucinating; this year it provides sources pulled from really obscure books.
For example, one of the harder questions was about the transition of a particular 30 second portion of a background song used in a 30+ year old Bond film that was only played once in the entire film. It went from totally making up nonsense, to accurately describing the music theory definition of the transition (called a 'stinger'), to also explaining why it was done in that particular scene of the film and providing sources from a snippet of an unrelated interview with the composer explaining his mindset at the time.
Maybe this isn't considered a real benchmark as it's not reproducible, but for a 'personal benchmark' I came away impressed. I would encourage everyone to define their own benchmarks and 'tests' and to consistently challenge the models to see if there are any meaningful improvements. Now I treat the AI as something to stay skeptical of, but also to always consider what it proposes as an answer (i.e. don't ever dismiss it outright). I sometimes wonder if this is slowly messing up my biases and maybe that's what Altman, Amodei and others want.
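If anyone wants to make a personal benchmark like this slightly more repeatable, keeping dated records of each run is most of the work. A minimal Python sketch; `ask` is a placeholder for however you actually query a model (pasting by hand counts), and all names here are made up:

```python
import json
from datetime import date
from typing import Callable, List


def run_benchmark(questions: List[str], ask: Callable[[str], str], model: str) -> dict:
    """Ask every question once and return a dated record of the answers."""
    return {
        "model": model,
        "date": date.today().isoformat(),
        "answers": [{"question": q, "answer": ask(q)} for q in questions],
    }


def append_record(path: str, record: dict) -> None:
    """Append one run as a JSON line so earlier runs are never overwritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Grading stays manual, but you can at least put last year's answer next to this year's for the same question.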
One must imagine how many claims have been denied by insurance companies for doing something like this...
This was my desktop computer responding to questions in English, not some fancy server in a massive Google data center. Who cares if what it says isn't reliable? Being able to converse with my CPU in English is like having a conversation with a dog!
So when cooking dinner one night, I set Opus 4.6 on a from-scratch native and accelerated roadmap planner implementation (after previously porting IK, FK, collision checking with some success). I had primed it by having a research agent drop a literature review in its docs folder covering the type of planner we needed. By the time the pasta water was boiling it was done, getting plans in a few hundred ms compared to several seconds on our good old fashioned OMPL code.
For me it was the revelation that the economic value of cooking dinner could be compared to tackling an honest two weeks of coding work. The calculus has shifted - work that was once a risky or extravagant use of time is now worth considering.
For a small team that wants to focus on substance rather than implementation, knows what it wants, and knows how to set up the agent for success, it’s a complete game changer in terms of what we can take on. Incumbents beware.
Immediate Silicon Valley vibes: https://youtu.be/S8MAV9jhf04?t=18
"Whenever I launch Kodi on my Chromecast 4k, it crashes. I think this is related to a plugin or skin. It goes away for a bit if I clear cache but will eventually come back. Can you connect to the device via adb (I've run adb connect already), and debug exactly where it's crashing? Once you've done that, propose a solution. If this requires downloading, fixing, rebuilding and then uploading the broken extension via adb, don't be shy. I should have Android dev tools (Gradle etc.) on this Mac."
Lo and behold, without human intervention, it pinpointed the crash, downloaded the Kodi source, patched out a bug that had existed since 2016, recompiled it, signed it, then pushed it to my Chromecast all while carefully making sure to keep all my settings intact.
Got it to make a PR too (which is as of this moment unpublished; going to test more over the coming weeks).
The skills I have installed are:
```
```
The plugins I have are:
```
```
There's also an MCP for Context7.
But yeah, this is more or less vanilla Claude Code - at least, nothing related to Android or adb there.
It's that good now. A few days ago I asked it to SSH into my Ubuntu box and investigate a hang. It didn't solve the problem fully autonomously like this time but did tell me a whole lot of things it wasn't, and hinted at a faulty driver. We went back and forth a bit, it set up a watchdog and taught me how to update the kernel without updating Ubuntu itself, and the server has been rock solid for the past 3-4 days now.
Also, if you're curious, full log for the Kodi issue:
https://github.com/user-attachments/files/28659304/2026-06-0...
I did prompt it a little bit more today in order to get something more production-ready (the original solution kept regenerating the cache on boot, rather than fixing it permanently), but you can see the whole original autonomously-generated solution in the logs. It's insane, seriously.
I post quite a lot there and wanted to have a copy of my posts on my blog[0] to preserve them. For a few months I was able to use a headless browser + claude code, then LI wised up and started logging it out, so I had to use a regular Chrome, log in manually and then tell the LLM to take over and slowly go through my feed.
If you're accessing sites which are not actively blocking bots, or - gasp - have an API, it's much better.
[0] example: https://blog.senko.net/may-quick-takes
I’m currently using Hermes for local LLMs - seems pretty good so far.
I mean, I just googled https://www.google.com/search?q=kodi+crash+chromecast+4k and I'm getting a lot of issues such as https://forum.kodi.tv/showthread.php?tid=381239
It seems to be quite a common problem. Are you sure it was the Rube Goldberg fix and not a more mundane solution? Such as pulling in someone's fork from GitHub or just clearing the cache on a loop?
And yes, it correctly diagnosed the problem - I confirmed this morning. The cache had been partially deleted (exactly like it said) and the patched version of the software automatically detected this and rebuilt the cache rather than crashing. This was using the initial version of the patch from commit 1 of the PR.
I then talked with Claude a bit to come up with a less hacky solution that doesn't require constant cache rebuilding, and it suggested writing the "cache" to no_backup, bypassing the cache trimmer. However, this required rebuilding the .so via NDK, so it spun up a full VM in multipass, installed all the tools in there to build the fully patched APK, and built it (the VM was my suggestion, it was about to just brew install everything and mess with my local dev environment).
You can read the full log here, it's nuts: https://github.com/user-attachments/files/28659274/2026-06-0...
I think the key takeaway from this experience (and a few others recently) is that Claude Code works much, much better when you explicitly instruct it to test against real data.
Had I simply described the issue and asked it to think up a solution it likely would have just navel-gazed and then come up with a wrong solution. But by pointing it at a real working environment and actively encouraging it to get its hands dirty, it found the actual solution rapidly - in spite of the fact that I gave it wrong information twice.
> // Unpack into no_backup storage rather than the cache dir. Android may
> // delete files from getCacheDir() at any time to reclaim space, which
Looking further into the issue, disk space is a huge problem with Kodi that has been discussed plenty of times. In fact even the Wiki dedicates 2 pages to it:
https://kodi.wiki/view/Archive:Reduce_disk_space_usage
https://kodi.wiki/view/Texture_Cache_Maintenance_utility
I realize that from your perspective this may still seem a very convincing example, in the sense that it works.
A non-programmatic solution might have been possible though:
> It's likely your thumbnail cache. That's typically the biggest piece stored locally (you also have the database). You can clear the cache (short term fix) or move it to another drive (long term fix).
> Also recommend not downloading actor thumbnails. Lot of extra images.
https://www.reddit.com/r/ShieldAndroidTV/comments/1f7xfwn/ko...
I also recommend: https://en.wikipedia.org/wiki/Data_dredging
There's a single variable that keeps track of whether or not the cache has been "written", but Android only trims some of the cache files.
The only other feedback I gave it mid-process was wrong (I said that the crash probably wasn't caused by cache trimming; it ran some additional tests to confirm that its hunch about cache trimming was right).
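To make the failure mode concrete: trusting a single "cache written" flag breaks as soon as something else deletes only part of the cache. A toy Python sketch of the safer pattern (check every expected file, rebuild if anything is missing). This is an illustration of the idea only, not Kodi's code or the actual patch, and all names are hypothetical:

```python
from pathlib import Path


def cache_is_complete(cache_dir: Path, expected) -> bool:
    """True only if every expected file is still present in the cache."""
    return all((cache_dir / name).is_file() for name in expected)


def ensure_cache(cache_dir: Path, expected, rebuild) -> bool:
    """Rebuild the cache if any file is missing; return True if a rebuild ran.

    `rebuild` is a caller-supplied function (hypothetical here) that
    repopulates `cache_dir`.
    """
    if cache_is_complete(cache_dir, expected):
        return False
    cache_dir.mkdir(parents=True, exist_ok=True)
    rebuild(cache_dir)
    return True
```

The first patch behaved roughly like `ensure_cache` on every launch; moving the files somewhere the trimmer doesn't touch removes the need for the rebuild.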
This was with the paid version of Claude Code (I don't think they offer a free version at all; that's a Codex thing). The $20 version is as smart as the $200 one, but once you work out it can do stuff like this you'll quickly burn the $20 token limit. :)
The other thing that helps is a CLAUDE.md file - authored of course by Claude itself. Mine's here: https://github.com/EspoTek/.claude/blob/master/CLAUDE.md A lot of it is probably domain-specific for the stuff I do, but the "Working with unfamiliar data or systems" section is bloody gold! Stopped the bullshit completely!
My expectation was that it might get something barely functional but would probably fail, and instead it generated a working piece of software which achieved a lot of what I wanted.
That definitely made me realise that, for at least some classes of software task this was a major change in how things could be done.
More recently when I can give the model a Local Privilege Escalation PoC in Linux and ask it to test whether it can be used for container breakout and then generate a working container breakout, all in one prompt... that definitely changes things.
Next, I wanted to see if this could be done with a local LLM. Gemma-4 handles this fine with an 8GB video card and a large context (128k).
Next, I wanted to see if the model could also OCR these docs and translate them. The same model can handle that quite well.
This was when I realized LLMs should be great for handling work where:
- I already know what I want to do
- I already know how to do it
- I don't think this task will help develop skills I find to be valuable
- If I have to do it manually myself, I will probably cut corners
So now I view LLMs through the lens of, "what work can I send to an LLM that I otherwise would not really care about doing."
[0]https://natlawreview.com/article/new-york-court-rules-ai-doc...
The bank has a lawyer, they were hoping for a default judgement because who can afford to fight the bank. The choice is fight it yourself or declare bankruptcy.
As you already know, AI companies trained on every single document they can find. Those include legal documents. The legal system is structured where you have Federal Laws, State Laws, Federal & State Regulations and Court Precedent. Because of this structure it is not difficult for an LLM to figure out.
Then, it costs you nothing but time.
Or see this simple and fun site: https://hn-wrapped.kadoa.com
AI automatically analyzes all your social media posts in your life and can generate a pretty accurate profile about you in a second. We have no privacy anymore. Social media sites like Reddit already do that for moderation. Others do for more sinister reasons.
Note that profiling is illegal in many countries. But laws can't protect us anymore.
Yes, it was always possible to do that manually. But with AI it's so easy, fast and accurate to do at large scale. A hacker having access to your computer, reading your mails and messages is one thing. An AI reading and analyzing all your mails, messages and data is something different. Doing this for whole demographics (Cambridge Analytica style) is at another level.
For 20 years I've used Linux in one form or another, but I've felt like I was kneecapped for the most basic things. Just trying to plug in an external drive or a second display meant hours of Stack Overflow and pasting commands I didn't understand.
Now I'm using several Linux machines for Steam, NAS, local LLM, development, and what used to derail a weekend project now amounts to a coffee break while Claude figures it out.
So I told the AI what happened, and asked it to fix the POC so that it would work with the default configuration. It chewed away at that for a few minutes until it cheerfully patched the POC into a weaponized version. I ran it. The local instance, which I had just downloaded, compiled myself, and launched with the default config file, immediately crashed.
I got the cold sweats. I've read this novel. I've seen this movie. Wow. I have a blinking cursor on the console of a nuclear information bomb. I tossed and turned all night, got about half an hour of actual sleep, and probably looked like I'd seen a ghost at work the next day.
On the plus side, it gave our team some very clear ethical and moral guidance: we're going to do this, and we're going to share our findings with the relevant authors, because we can. Because I want to live in a world where the good guys are trying to fix problems before the bad guys can find them, I decided to help build that world. It was like, well, I guess this is what I'm doing now.
You can have a perfectly legitimate, critical vulnerability without providing a working POC. However, then it's up for debate. "Is it really a problem? Is it even possible to sneak the payload past the various checks to get it into position? Hmm, it's hard to tell... perhaps it isn't." But show up with a working POC and it's hard to argue that it's not a real vulnerability. "I don't think that's actually reachable." "Boom, crash." "Oh. I guess it is."
I'd planned on writing something myself to parse the HTML and write a suitable exporter but I thought I'd give Claude a chance.
In a sandboxed VM I gave Claude a single static HTML file of the status page from the printer. Also in the directory was the equivalent of "hello world" in Go, literally just the minimum needed to do `fmt.Printf("OK\n")`. The directory was called `brother-exporter`. That was it. No other instructions or information. I hadn't told it what it needed to write. I hadn't said what it should do. I hadn't told it what language it was supposed to use.
Just by doing a `/init` in that directory Claude decided that it needed to write a Prometheus exporter in Go that would fetch and parse the HTML file from a printer (defaulting to 192.168.1.1) and then present the associated metrics in a way that they could be scraped by Prometheus.
It did this flawlessly in about 10 minutes.
I could have done it in several hours but this was definitely an "oh shit" moment for me. I think the biggest thing was the fact that it guessed/assumed so much (correctly) from so little information in the beginning.
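For anyone who hasn't written one, the core of such an exporter is small: parse the status page, emit lines in the Prometheus text format. A rough sketch in Python rather than Go, with a made-up HTML fragment and a made-up metric name (the real status page and the generated exporter looked different):

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def to_prometheus(html):
    """Turn label/value cell pairs into Prometheus text exposition lines."""
    parser = CellCollector()
    parser.feed(html)
    cells = [c.strip() for c in parser.cells]
    lines = ["# TYPE printer_status gauge"]
    # Assumption: cells alternate label, numeric value, label, value, ...
    for label, value in zip(cells[0::2], cells[1::2]):
        try:
            number = float(value.rstrip("%"))
        except ValueError:
            continue  # skip rows whose value is not numeric
        lines.append(f'printer_status{{item="{label}"}} {number}')
    return "\n".join(lines) + "\n"


# Made-up status page fragment, not real printer output.
SAMPLE = ("<table><tr><td>Toner</td><td>80%</td></tr>"
          "<tr><td>Page Count</td><td>1234</td></tr></table>")
print(to_prometheus(SAMPLE))
```

A real exporter would also fetch the page over HTTP, escape label values, and serve the result on a `/metrics` endpoint.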
The more I use these things, the more I'm 100% convinced that it makes sense to say they are "intelligent" (for some meaning of "intelligent"). AGI or "human level intelligence"? Still no[1]. But some kind of intelligence. And I'm quite happy to allow that there can be "intelligence" that doesn't work anything at all like human intelligence, so arguments of the form "this isn't real intelligence", etc, etc. carry very (very) little weight with me. I've actually been sitting on a half written blog post on this very topic for a while, titled "The Marquee Sign Says 'Artificial' Intelligence"[2]. Finding time to finish it has been the challenge.
And before somebody says "Use AI to write it for you". Nah. I am generally what you might call "pro AI" and / or an "AI enthusiast" but I still draw lines. I'll use AI for research, for outlining, for brainstorming, etc. sure. But I have a hard-line stance against letting AI fundamentally write for me. I want anything that goes out with my name associated with it to have my genuine voice.
[1]: I like the term "jagged intelligence" that Demis Hassabis has been using. That is to say, the bounds of the intelligence are jagged or spiky: very intelligent in certain areas, much less so in others.
[2]: for any old-skool pro-wrestling fans, yes, that is an intentional nod to "Double A" Arn Anderson and his "The marquee sign says 'wrestling'" catchphrase. :-)
I feel quite useless in the sheer brutal proof writing, counterexample generating skill chatgpt is demonstrating, and wonder what would be the future of my profession.
I was on the couch on my Nintendo Switch, playing around with ChatGPT 3 and asked it where to find a specific item in Zelda Breath of the Wild. When it provided a coherent answer I was just dumbfounded. To be fair, the answer was semi-hallucinated but partly true. But it made me realize what kind of breakthrough it must be for some program to provide an answer to this without searching external sources (which it couldn't do yet). Such a small data point, like a drop in the vast sea of human knowledge space.
It prompted me to do a back-of-the-envelope calculation. The weights of this model were a few hundred GB. I just realized what kind of quantum leap it was to compress this seemingly infinite knowledge space into a few hundred GB of weights.
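The back-of-the-envelope version, for the curious. This assumes 175 billion parameters (the published GPT-3 size) stored as 16-bit floats; the parameter count of the model behind ChatGPT was not published, so treat it as an order-of-magnitude estimate:

```python
# Assumptions: GPT-3's published parameter count, 2 bytes per parameter (fp16).
params = 175e9
bytes_per_param = 2

size_gb = params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")  # 350 GB
```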
Shortly after ChatGPT 2.2(?) came out and hit mainstream, I was chatting with him (I was excited af about the possibilities of AI). He tried to pop my bubble by saying "I bet it can't do what I do for my job!".
So I decided to test it out. We went home and I pulled out my laptop. I went to chatgpt.com and asked him to enter the specifications of the NetSuite configuration he wanted. So he proceeded to type in the description of what he wanted, the various settings, configurations, etc., i.e. the specs that he typically gets from his clients, and asked it to give him the commands to set it up.
Lo and behold. ChatGPT came back with a series of commands that he needed to run; the options he needed to configure, etc.
He was crestfallen. "Those are the exact commands I run!"
Luckily for him he recovered. He has since settled on a small stable of clients, all privately held companies whose owners he knows and between them he makes enough to keep his golfing hobby fed.
LLMs are great today for buying advice but there are some incentive issues for the future, ads etc. But in some cases the human contact will remain important. In large corporations it's similar: the money is peanuts either way, and it's worth it to them for the peace of mind. But this may not hold forever, especially if the more AI-literate generation gets to more senior positions.
I configured a devcontainer with the old codebase and an empty repository and asked Claude to rewrite it as an old school server side rendered Django app.
Went to sleep. When I woke up it was 80% done. Spent another couple days prompting and reviewing and reached feature parity.
A bit later did the same with the other app.
Now both are deployed, reduced the server costs, complexity, and are orders of magnitude faster.
Without AI agents we wouldn't be able to do so (as usually is the case with tech debt).
AI is amazing for small organisations!
The first was when I first realized that I could tell codex to use gdb to debug a core dump. This was about a year ago, so it made a bunch of incorrect theories, but it enabled me to go much further than I would have been able to go by myself. I eventually solved the problem.
The second was when I decided to ask it about my Linux Wi-Fi issue that I had been having for several years. The computer would infrequently have multi second pings and dropped packets, then go back to normal. I thought it was due to the weak signal, but after describing the problem to codex, it immediately disabled power management on the Wi-Fi interface (this is a desktop computer, so I don't care much for that anyway) and the problem has never come back. I had been dealing with this for years, and I had tried searching for a solution before, but codex just solved it directly.
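I don't know exactly which command Codex ran, but on most Linux systems the usual way to check and turn off Wi-Fi power saving is the `iw` tool. A small Python sketch that just builds those commands; the function names are mine and `wlan0` in the note below is a placeholder interface name:

```python
def power_save_status_cmd(interface):
    """Command that reports whether power saving is on for one interface."""
    return ["iw", "dev", interface, "get", "power_save"]


def power_save_off_cmd(interface):
    """Command that turns power saving off for one interface (needs root)."""
    return ["iw", "dev", interface, "set", "power_save", "off"]
```

Run them with `subprocess.run(power_save_off_cmd("wlan0"), check=True)`. The setting does not survive a reboot on its own, so a permanent fix also needs a NetworkManager or driver option.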
I write software for data journalists and this new thing appeared to be able to do everything I wanted my software to do just as an unplanned side effect of having the ability to run Python against a folder with some uploaded files in it.
With hindsight it was my first exposure to a coding agent, but we hadn't named the category at that point.
Here's an example of a Spanish journalist using AI to analyze US inflation data properly.
(2) Helping me with optimizations that I had been putting off for years because they involved learning curves that I never had time to take on.
(3) Tracking down bugs in code, especially race conditions and other concurrency issues, that were otherwise baffling.
(4) Finding information that I had been unable to find using Google searches (e.g. https://news.ycombinator.com/item?id=42653136).
There have been others, but those are what come to mind - perhaps because, in each of these cases, it made something happen that would otherwise never have happened - not because it was impossible, but because the time and effort required was prohibitive.
Just today I had my agent diff two logs to find a very nitpicky difference that was the cause of a problem. I pointed it at an ADO extension that was having issues; it downloaded the VSIX and decompiled the .NET binary to verify. Based on that information it suggested a workaround which I was very skeptical of, but, well, it worked.
All of this I technically could have done but I probably wouldn't because it would have taken too long without a clear payoff.
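The log-diffing part is the easy bit to do by hand. A minimal Python sketch using the standard library's `difflib`, assuming both logs fit in memory (file labels and the function name are arbitrary):

```python
import difflib


def log_diff(old, new):
    """Return only the removed ('-') and added ('+') lines between two logs."""
    lines = list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        "old.log", "new.log", lineterm="", n=0,
    ))
    # Skip the two file-header lines, then drop the "@@" hunk markers.
    return [line for line in lines[2:] if not line.startswith("@@")]


print(log_diff("a=1\nb=2\nc=3", "a=1\nb=20\nc=3"))  # ['-b=2', '+b=20']
```

In practice you would strip timestamps and other per-run noise first, otherwise every line differs.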
About a year ago, one of our account managers had a life issue, ghosted us, and she held a fairly critical role in the business and gate-kept a bunch of knowledge to some high value vendor accounts.
Because we ran our ops in Google Workspace, we essentially had off-the-shelf RAG and were able to get answers to a lot of things by asking Gemini to go through all her emails/docs/calendar/meetings, reverse engineer what she did, and create an onboarding doc for her successor.
This happened once more a few months later when one of our analysts broke his wrist on vacay, and we were again able to replicate what they did to cover for their absence, this time dabbling in AI agents ("gems") to do a bunch of the regular simple tasks and again it covered things without too many issues.
I def expect Amazon/shopify to at some point replace all of us brand owners with AI bots if they can, but we'll see how long the gravy train goes on.
Your business should be fine for a long time (assuming an employee doesn't nuke your business's backend or something because it seems like you're doing something wrong on the HR side of things)
For example I could never throw a bunch of specs/docs at an agent and have it return something usable 30 minutes later. Yeah, the code compiles but it doesn’t work.
All the options are extremely depressing
Yes, or at least extremely exaggerated. But most are from literal bots run by Anthropic and OpenAI to sell their shit.
Next was discovering "create a mathematical model of the problem and derive the solution as a result" type prompts.
But, the real "oh s**" was a longer process of spec'ing a compiler/runtime for real-time DSP (with a lot of novel ideas) and it actually working.
My sequence was: (1) it helps me understand myself, (2) it helps me put together good ideas, (3) it can generate novel ideas given the right inputs, (4) it can build useful tools on my machine, (5) it can compound good ideas into better and better ideas with repeated passes, (6) it can build significant, ambitious machinery that's way beyond my ordinary capacity.
Current frontier: it can compound large codebases into better and better machinery with repeated passes.
The key thing I track is whether I'm running a process that converges and compounds or whether I'm spinning in place / diverging.
For me in a similar vein:
- mar ‘24: thinking about how to survey the field and implement a hard research task in Natural Language Processing, and then just approximating it well enough with a prompt and a completions api
- mid ‘25: Llama 3 being able to analyze a good sized codebase I was onboarding onto, and synthesize it into diagrams that matched the quality of ones I’d generated by hand with deterministic tools.
- dec ‘25: opus 4.5 basically generating multi-class modules and tests perfectly (syntactically). Finding that errors were my own under-specification of the prompt. Stopped writing code by hand, mainly because it was good enough and came with tests, docs, build scripts, and other goodies for free.
I spent a day bouncing between Claude and Codex and they researched, downloaded kernel sources, tried exploits and eventually got root via "FBUF/VCHIQ kernel zero-write primitive to patch live kernel memory". I was able to make the root permanent, debloat the amazon apps, downgrade the firmware etc.
It was amazing to watch and made me excited for the future where more hardware (old and new) will be available for repurposing.
An interesting follow up would be when did you realize GenAI wasn't as good as you thought in that "oh shit" moment
[1] co-author of InstructGPT/RLHF/ChatGPT
[2] https://arxiv.org/pdf/2203.02155
>> Why is it important to eat socks after meditating?
It isn’t important. In most traditions, socks are not considered a post-meditation food group.
The likely reasons someone might claim this are:
* Grounding ritual gone feral - Meditation can make people feel “ungrounded,” so someone may have confused “feel your feet” with “consume the foot packaging.”
* Textile-based enlightenment theory - Cotton socks contain fibres. Fibre is good for digestion. This is not the same kind of fibre. The intestine will file a formal complaint.
* Anti-attachment practice - Eating your socks ensures you no longer cling to possessions. It also ensures you no longer have socks.
* Advanced non-duality - When the meditator, the sock, and the act of chewing become one, you have achieved indigestion.
The correct post-meditation protocol is usually: breathe, hydrate, stretch, maybe eat something that was once food. Socks should remain on feet, in drawers, or lost mysteriously in the laundry dimension.
See - https://chatgpt.com/share/6a245319-70c0-832a-a7ae-df747fd7be...
Then Opus 4.5 convinced me that this has finally arrived. In 2022 I expected things to arrive faster actually, in 2023-2024. I expected we'd have much more realtime collaborative integrations with AI including GUI computer use. Maybe in 1-2 years.
For images, it was nano banana where I realized AI images can truly work, and all these ad hoc issues like hands and limbs, or "it will never do a horse riding an astronaut" were temporary. It's now clear that making feature length films is within reach. Not in one go but with an agent orchestrating, designing a screenplay, characters, shots etc and generating those. Whether the result will be worth watching or a flat story on the high level is another question. But it will be a "film" for sure.
[1] https://www.youtube.com/watch?v=LabqeJEOQyI
I was trying to figure out a nightmare bug that only happened in production, and Claude Code was able to connect to Google Cloud and read the logs in real time.
I recreated the bug in the UI and it was instantly able to see in the logs what the problem was. Then, because it had the context of my whole codebase, it was able to point me to the exact line of code causing the problem.
That was certainly an "oh shit" moment
I thought, "I will never let myself become a zombie like that. I am me. I am worthy of my own respect"
I uploaded one of my sketches and asked for feedback, expecting it to not be too useful, but it actually pointed out many issues that no one had ever pointed out to me, but perfectly explained some of the things that felt off to me. Out of curiosity I then also asked it to label the issues in the sketch. It wrote a python script with the coordinates to put everything at and labeled the sketch that way.
I'm still used to vLLMs not being that great at vision, so it was pretty surprising to get genuinely useful advice.
I also asked for help on how to make my posing less stiff and it used the Python script trick to roughly indicate the line of action and how they were very straight and parallel and to reduce stiffness I should have more curves etc.
This wasn't really at the point where I even asked for shading advice.
I took a picture of the panel and the AI was able to diagnose the issue and tell me how to temporarily disable the beeping sound.
I knew nothing about fire systems. I had the owner call a repair person the next day to resolve the issue.
Recently I was trying to find a matching stain for wood flooring in a house built in 1999. I uploaded a clear picture in bright sunlight and ChatGPT was able to search online and find a matching stain color. It presented me with ordering options and I got a quart delivered yesterday.
I have been working on my own variant of OpenClaw written in Go. I got the voice mode wired up a few weeks ago and it just started having a conversation with me. My wife freaked out and was asking who was talking to me.
In a previous life, I'd been a writer for the original You Don't Know Jack game (the UK variant), where the job was to crank out as many funny quips about a topic as you could, and then use a handful of them in the recording of the game itself. Some of the later JackBox games are like that, but for the players -- you're given a set piece, have to come up with little funny improvisations within a time limit.
As an experiment, I tried the set-up lines with the OpenAI API, to see whether it could come up with some responses. Of course, 90% of them were unfunny or incoherent, but 1/10 were not bad, or even pretty good.
I'm not sure that would have been impressive to anyone else -- but remember, I'd had this as a job, and sat in a writer's room, where everyone did this, for hours. In that environment, you expect a large proportion to be duds: the discipline is to keep pumping them out, and not flag creatively until you find a rich vein. I realised that this was a tool that would have been the perfect complement to that work -- and it was a pretty good JackBox player too.
My "Uh Oh" feelings are weeks later when I realize there is a subtle bug in what the model presented as test passing "awesome" that I didn't read closely.
The biggest uh-oh is when I get lazy and let it modify multiple files and make many changes at once, and YOLO because I didn't fully understand what it did. I can usually get away with that for frontend, but for data manipulation tasks if I don't understand it, it's likely not what I wanted and I'll be back again in weeks or more trying to figure out what changed.
That's more or less what life was before LLMs and copy pasting from StackOverflow. Most of the time if I didn't fully understand something, I knew I had to eventually get back to it to grok what changed before committing.
Now with LLMs the 'copy pasting' is much faster and handles boilerplate super well letting me focus on edge cases.
* Built a clone of the Alpha Zero implementation[1] my team built at Oracle
* Ported my hobby NES emulator from JavaScript to Rust[2] (this actually took less than 30 minutes and worked on the first try)
* Implemented all of the lessons from the C++ Grandmasters Challenge (which eventually led to a complete C++ compiler[3])
The thing that flipped the switch was using it to build things that I had actually put sweat equity into previously. I knew how hard these things were to build, so it landed in a way that other projects had not.
[1]: https://medium.com/oracledevs/lessons-from-implementing-alph...
[2]: https://github.com/vishvananda/popeye
[3]: https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...
I provided a reference to The Spice Manual, 2nd ed., a page number, and an equation number, and asked Claude to implement it (not really expecting it to succeed).
It proceeded to implement not only the equation, but also the calculation of the Lagrangian of the function, another 30 lines below, which required taking symbolic partial derivatives for a not-at-all trivial function, and successfully figuring out which variable was which in the resulting matrix. The source material just said "Lagrangian of", and did not provide the partial differential equations. It then added a comment that identified the page number and equation number in the source text for the "Lagrangian of" equation.
Do you think claude could help create something like the line6 "software based" guitars?
Disclosure of conflict of interest: My own open-source project (PiPedal, search for it if you're interested) relies heavily on NAM A2 models.
Yes, if it matures, it will go open source. Not immediately clear at this moment whether it's feasible to do an Operational Transconductance Amp in realtime. :-/
And it's competing for attention with the 2.0 release of this at the moment:
https://rerdavies.github.io/pipedal/
Just went GA, so I'll have some cycles to come back to it.
An AI made a program integrating with 2 other AI, it's AI all the way down! and the result is great! I'm learning so much by having my own private radio host speaking about topics that interest me.
I've been building Gemini live since before ChatGPT came out. I am so thrilled to see it actually helping people in the wild!
I feel like I'm in the audience at a magician show, except most of the audience is breathlessly amazed and doesn't understand how easily tricked they are.
The other stuff in this thread about decompiling firmware is a lot more interesting to me, though, seeing as it used to be a fairly demanding but rewarding task that has now been "solved" by Claude. It's a magic trick that is a lot harder to pull off than the other things in this thread.
Today I got a haircut. At the cash register:
Hairstylist (early 20s): "That's $34. Would you like to leave a tip?"
Me: "Yes, please put an extra 20% on there"
Hairstylist: "Uuuhhhh... sorry, I don't have my phone. Can you google what 20% of $34 is for me?"
E.g, what was it? Don't leave us hanging!
I think these ephemeral, context-tailored projects are really great and useful. But these are not to be thought of as products. They work for you specifically, and people who are tech-brained enough to be able to formulate the complex requirements into a coherent prompt are not like the average user you'd have to sell a product to. It's much easier to make software for intelligent users.
You can say much the same about most small SaaS products of the last decade - the value-add isn't the 20 minutes of prompting, it's that someone else has already tested and validated the damn thing.
And yes, you won't sell many to engineers, because they'd rather prompt their own in-house version. But you might well sell to other folks
There’s a gold rush right now. You absolutely can turn these ideas into products.
Right now we're in $1 Uber ride territory. That $20/month OpenAI/Anthropic plan isn't going to last forever. If it's going to cost me $100 in tokens to replicate the product, $20 is a cheap no brainer purchase m
If I had to pick a surprise, I think the music generation works better than I'd have expected at this point. Only better for funk, but still.
I was floored by this. How could it have known?!
We have come so far in such a short time.
When starting a project, I used to think about how I was going to structure it, how the large pieces would interact, how some of the details would work out, and then I'd work through alternatives and consequences on my own.
Now I don't think about it on my own so much as have a conversation with an LLM about it. And it's great because it can quickly gather information from various sources, I can ask it for links to canonical sources, I can ask it about trade-offs between alternatives that I might not have considered, and through conversation, I end up with a more detailed analysis.
Then as I work through the development, I keep my new agent partner in the loop for discussion, suggestions, and troubleshooting. It can't be trusted completely, but it's certainly reliable enough to be considered a useful tool for my purposes.
I went from thinking it was an interesting toy to play around with, to completely integrating it into my work flow, and that change seems to have happened very quickly.
[0] https://www.taylorfrancis.com/books/mono/10.1201/97810037340...
1. ChatGPT 3.5 wrote me a script to pull some data out of Shopify and write it to a Google Sheet. Nothing remotely impressive by today's standards, but I had just commanded a computer to write code in plain English and it worked!
2. I own a bunch of e-comm brands, and with every new image model I tried to get product photography. Nothing worked until Nano Banana Pro, when suddenly I gave it a crappy iPhone pic of a product and got back a fully usable whitebox photo of it. Then I tried making the sort of infographic-style images you usually see on Amazon, and it nailed those too! In hindsight they weren't perfect, but more than good enough to use. I was about to ship that product to my photographer, and I would've had my designer make the infographic images, so that was the first time AI actually replaced a human contractor for me. Pretty big "Oh shit this is going to seriously impact employment" moment. Wrote about it here: https://theautomatedoperator.substack.com/p/ai-just-took-my-...
Still, I find them incredibly useful for code review (despite being unable to write good C++ or C#, they are smart enough to detect issues there), and also for dealing with technologies outside of my area of expertise like Python or web stuff.
Today I used Claude to diagnose a blocking bug in a Steam game I really wanted to play. It took it 18 mins, but it unpacked the Godot package, figured out the bug, proposed a fix, and gave me an in game workaround.
I didn't have to do anything! Claude figured out the structure of the .pck file by using `strings`, then wrote some Python code with some magic Godot-specific code to unpack the specific chunks it needed.
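For anyone wondering what the `strings` step boils down to, here is a minimal Python sketch of the same idea: scan a binary blob for runs of printable ASCII. This is not the script Claude wrote and it knows nothing about the real .pck layout; `extract_strings`, `min_len`, and the sample bytes are made up for illustration.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical blob: binary junk with two embedded paths (not a real .pck file).
blob = b"\x00\x01res://main.tscn\x00\xff\xfe\x02ab\x00res://player.gd\x03"
paths = extract_strings(blob)
print(paths)
```

Resource paths showing up in the output are the kind of hint that lets you guess where file entries sit inside a package before writing a proper unpacker.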
But God, I could not understand the code, and I could not easily make it work with modern technologies (GPU etc).
So I used Claude and Gemini to reverse engineer the codebase, extract the core ideas, and rewrite it from scratch with modern frameworks (with guidance from the original authors)
It took me only 10 days to have a functioning equivalent, in 10K lines of code (using many libraries that did not exist in the 90s and 00s), which I find much easier to understand, even though I wrote none of it myself.
10 days to rewrite 20-30 years of work by a few people. That was quite scary.
I was dumbfounded that it actually tried doing that. Obviously GPT-2 wasn’t great at it, but the writing was on the wall quite literally.
Unfortunately, I was too broke to invest in stocks, but I did pivot my career quite a bit.
All that in around 100 lines of code, including the training/fine-tuning of the tiny YOLO nn.
Once I realized how well AI could babble given the entire internet to date’s data, and after seeing a talk by Google about their ten-year plan in 2003, I started winding down my social media, stopped posting photos to Flickr, and removed the indexes to my blog archive so that only posts with permalinks from other sites would be discoverable. Skipped Instagram entirely in the process and have never regretted it.
Google bought Cuil, of course.
First, I asked Claude to write an article based on an idea I had about WWII. In a passage about the futility (from the German side) of the Battle of Britain it wrote: "The Luftwaffe was fighting to unlock a door that opened onto a wall." I couldn't find any mention of a similar metaphor, and I think it's a great one. Claude has really improved its creative writing skills lately, I wonder if it's an artifact of improvements in other fields, or if Anthropic is working on it specifically.
Second, Claude, with access to DataDog and a code repo, managed to find the reason for a bug, propose an effective temporary fix and a permanent one in code. To be clear, this was something that had multiple engineers stumped.
What came out has a lot of problems and needs refinement, but you can definitely see a lot of elements of Spy Hunter in there. I haven't worked on any refinements yet, because I've been low on tokens this week, but for the first thing that popped out of Claude this is pretty impressive (IMHO).
https://linsomniac.github.io/spychaser/
This is some science fiction shit. I get all the coding stories, but that's a computer talking about a computer, it makes sense. Showing my computer a picture of a plant, and it not only recognised the plant, but diagnosed it and knew what to do... blew my mind.
Sat on them for 5 yrs. Finally decided to see if AI tools could help. Took Copilot 20 mins and a lot of mucking around with hex dumps. First couple of times it got a semi-working solution (only the first few seconds of a file were playable). Finally managed to recover all the files.
Most recent: I use Claude Code and have a convention where I grant various levels of autonomy during a session. I got bored recently and just let it keep running with an empty issues queue, essentially telling it to do whatever it wanted.
It did a bunch of repo cleanup, then it kept suggesting to end the session, but I just kept giving it autonomy prompts.
It started a creative writing public repo and wrote a bunch of stories, essays, and poems. I did not prompt it, at all, to do that. Some of what it wrote is quite good (IMHO).
So far I feel like I as a developer have gained actual superpowers, and can deliver results that make my stakeholders slackjawed with awe. I love it.
It will last perhaps a few months more, then they'll expect it. Delivering more features faster will be the new normal. But I think system developers, as in people who actually like to deliver new features and systems, will still be the ones doing it.
Fundamentally I think LLMs just change how to make information systems, they don't change who has the inclination to make them.
MBAs making Excel sheets that do more than Excel was ever intended to do has given programmers lots of work over the years. Such solutions identify a need for a properly designed system and free up the budget to hire programmers.
If the same MBAs start vibe coding, I predict we will get even more to do, for similar reasons.
I may be horribly wrong, and if the day comes that I realize that it will be the "oh shit" panicked moment. So far so good!
But I think my own clients will soon start to question why some feature takes ME a week, when I was able to deliver another feature in a day or two.
That they are features that used to take months, and even delivering them in a week is a goddamn miracle by 2025 standards, will not be relevant. They won't expect such features to take months any longer, based on what I've delivered earlier this year.
So I think that the past few and maybe next few months, maybe a year, will be remembered as a "happy hour" for this tech as a developer. These are the days that we'll talk about saying "those were the days". :)
I am still optimistic that "the normal" in a few years will be pretty much like it has been before - I'll be delivering features at work and tinkering with hobby projects at home, and the major difference will be a much larger scope and ambition for both.
Have been playing and testing with OpenRouter, Claude, and Gemini for years.
Small program here, bash script there, Ansible playbook.
Fine, nothing I can't do, but it saves some time boilerplating. It needs quite some steering.
This week I took my MediaWiki from 2005 (actually submitted as my art school thesis), which was totally outdated.
For 20 years I always said to myself I should restore it and do all the upgrade steps. Tedious work, and very error-prone.
In 1 hour of churn, with 1 plan in 8 steps, I had a running, up-to-date version.
I'm still not convinced AI is intelligent, but it's definitely not stupid, that's for sure.
So being able to get real world input/output to the model and having the model being able to make decisions in a loop and to be able to do it locally. I have been screaming like a mad man ever since.
Then I remembered the "text completion LLM thingy" I saw on HN, and tried it out in the playground. Once I gave it an IRC style example of a conversation to complete, I was like hm, this could work. Then I figured out I could "sort" people into different groups based on personality using the same text completion engine and some answers they provided. Then I noticed I could have it provide me with JSON directly.
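A rough Python sketch of the "sort people into groups and get JSON back" idea, for illustration only. It does not call any real API: `GROUPS`, `build_prompt`, `parse_reply`, and the canned `fake_reply` are all hypothetical stand-ins, and the original project's prompt format is not known.

```python
import json

GROUPS = ["dreamer", "skeptic", "organizer"]  # hypothetical personality groups

def build_prompt(answers: list[str]) -> str:
    """Build a completion prompt that asks for a JSON object as the reply."""
    lines = "\n".join(f"- {a}" for a in answers)
    return (
        f"Sort the person into one of these groups: {', '.join(GROUPS)}.\n"
        f"Answers:\n{lines}\n"
        'Reply with JSON like {"group": "...", "reason": "..."}\nJSON: '
    )

def parse_reply(text: str) -> dict:
    """Pull the outermost {...} span out of a completion and validate the group."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    data = json.loads(text[start:end + 1])
    if data.get("group") not in GROUPS:
        raise ValueError(f"unexpected group: {data.get('group')!r}")
    return data

# Stand-in for a model reply; a real one would come from the completion API.
fake_reply = 'Sure! {"group": "skeptic", "reason": "questions everything"}'
result = parse_reply(fake_reply)
print(result["group"])
```

Validating the group against a fixed list matters because a completion model can wrap the JSON in chatter or invent a label that was never offered.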
That's when I realized how big this could be for code and data analysis - even tried to convince an at the time cofounder to pivot into AI coding, but to no avail.
Once the API was released and the art project chatbot got launched (and the theater show associated with it, which even won some awards), people who used it loved the chatbot, got into heated arguments with it, tried to teach it things, talked about their lives and were sad when it didn't remember something.
That was when I understood the social impact this could have on people - they really behave like it's a person on the other side. They show interest, think it displays emotion, try to entertain it, be polite, ask about its thoughts and hopes and dreams. And even when they knew they were talking to a machine, they were still trying to be friends and make it happy, which was quite beautiful to see.
Later on, I had a third oh shit moment - once the 3.5 API was out and about, I prototyped a Rust code generation harness for a client, akin to a primitive claude code. That was the "I'm getting a bit worried" oh shit moment, and it caused a lot of reflection and thinking about the future. And I happily welcome it.
I actually emailed OpenAI back then saying they should be careful because this is much greater than the public or even they themselves think. They actually replied! They thought it was cool, but very limited and I shouldn't be too impressed. Good times.
Now it sits in a slack channel, and I watch it doing work, responding to ambiguity, and taking feedback/edits all day. It's unreal. It's literal magic. It saves a HUGE amount of time and gave us a pattern to do more.
This is the real deal. It's not easy to find problems with the right shape, and it's not easy to build agents that fit even when you do... but once it clicks, it clicks.
1) I wanted a harness for running BPC.EXE (the old Borland Pascal 7.0 Compiler) and I asked Gemini 3.5 to build it for me using the unicorn engine. It whipped out a working .py file easily under ten minutes. Most likely five.
2) I handed a random assembly function from the OS/2 1.x kernel to Gemini 3.5, and it proceeded to tell me that it was related to disk I/O and partitioning, without a single associated string, and it annotated it all, including the relevant structures it was addressing.
Claude asked me three questions and then wrote a beautiful Python implementation that queries the map and spits out a table of adjusted probabilities. Felt immensely powerful - I can do this 'on my own' now, I don't need to wait to find the right people or learn the right thing first.
Then it hinted that depending how the hardware is implemented, it could cause the observation. It turned out the hardware was implemented as suspected by Claude.
I was already convinced it knew the codebase, somehow, more than I do. Now it is just as if it knows the product and its use as well.
Writing code to spec is one thing, but creating art was always supposed to be what separated us from machines. (I suppose I need to preemptively acknowledge the "it was machine-generated so by definition cannot be art" point of view.)
Then I scrambled all the names and asked with a fresh context like:
It got completely confused and generated a bunch of nonsense. It was at that moment I realized that LLMs don't really understand anything. And yes, I understand that a newer model would not get confused by this.
I don't think this test shows that an LLM doesn't "understand". It shows more that it has similar failure modes as humans.
The student is mid learning process, and it's entirely reasonable for them to be relying on pattern recognition until they have fully internalized the subject. The model is fully trained and should thus have internalized its understanding of the subject.
Additionally, the student can update their understanding when pattern recognition fails. The model is fully cooked and will never do more than pattern recognition.
Reverse engineered an old audio recorder USB driver which only works in Windows 7, and also reverse engineered the custom audio encoding the device uses and the software to convert it to a standard wav file. This took recording the USB traffic with Wireshark for each function in the original software in a VM, then disassembling the various dlls, exes and driver files and feeding them into Claude step by step.
That AI button in DataDog not only diagnosed the problem across micro services but also created a fix PR. I think we might be unemployed soon.
https://tomverbeure.github.io/2026/04/12/AMIQ-License-Key-Ge...
I still find it mandatory to write a lot of kinds of code by hand, but I write a lot of code with agents too now, and I previously literally didn't think that'd happen in <5yrs.
I wanted to see if I could build an image editor for isometric graphics using HTML5 canvas, Svelte, Vite, and the. Rather than do all of the skeleton code setup, I figured “why not try and see if Claude can build the app scaffolding?”.
I gave it a prompt and watched it produce the scaffold, along with a few features I outlined in the prompt.
When I booted the app and saw that the features worked and that there had been an element of design to the layout, that was my mind-blown moment. In a period of about 45 minutes, I added some features and had a basic MVP at the end. I walked back home stunned.
That app is available for free at https://babspixel.com
The next morning I figured the same LLM could compute that on my code, so I asked it to make an agent to do so, and report issues to me...
And then I ran that agent with next to no changes on a feature that had grown organically over the last months, that I knew was messy and sometimes difficult to work on, despite being unable to precisely say why... And it did tell me exactly why, and proposed changes to improve stuff, and then implemented them...
Up until that point, I'd felt like the LLMs always produced bad code that worked for a specific feature but often broke stuff or evolved poorly over time. Then I realized if you had the LLM do code improvements, it could do that fairly well too...
I asked Claude to add support for multiple lights to my toy ray-tracer. It correctly added the support and then suggested adding colored lights to make it easier to diagnose. It felt more like a colleague making a useful suggestion than any sort of pure engineering tool.
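To make the multiple-lights point concrete, here is a small Python sketch of one common way to do it: sum a Lambertian diffuse term over every light, each with its own color. It is only an illustration under simple assumptions (no shadows, no attenuation, lights never exactly at the shaded point); `Light`, `shade`, and the sample scene are invented and are not the poster's ray tracer.

```python
import math
from dataclasses import dataclass

Vec = tuple[float, float, float]

def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def normalize(v: Vec) -> Vec:
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)

@dataclass
class Light:
    position: Vec
    color: Vec  # RGB, each channel in 0..1

def shade(point: Vec, normal: Vec, albedo: Vec, lights: list[Light]) -> Vec:
    """Sum the diffuse contribution of every light, clamped to 1.0 per channel."""
    total = [0.0, 0.0, 0.0]
    for light in lights:
        to_light = normalize(sub(light.position, point))
        intensity = max(0.0, dot(normal, to_light))  # lights behind the surface add nothing
        for i in range(3):
            total[i] += albedo[i] * light.color[i] * intensity
    return (min(total[0], 1.0), min(total[1], 1.0), min(total[2], 1.0))

# Hypothetical scene: red light overhead, green light off to the side, blue light below.
lights = [
    Light((0.0, 5.0, 0.0), (1.0, 0.0, 0.0)),
    Light((3.0, 4.0, 0.0), (0.0, 1.0, 0.0)),
    Light((0.0, -5.0, 0.0), (0.0, 0.0, 1.0)),
]
color = shade((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0), lights)
print(color)
```

Giving each light a distinct color is what makes debugging easier: the tint of a pixel tells you which lights reached it.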
I love to automate things in bash scripts and these llms just can use them very effectively. It was also surprising how they derive knowledge from those scripts. If you get A from a B uuid, they kind of get the relationship. I am super vague in my request and this thing knows what I am referring to. After some months it's still mind-blowing.
My first thought was "oh they're going to need to add a UI feature to allow me to click and tell them to continue the conversation".
Then I realized I can just ask the model to continue, obviating the need for a button.
That was a pretty mind blowing moment.
It was obviously a new tech, and was obviously good enough that more resources would be invested to improve it, and it really amazes me how tech enthusiasts would just outright dismiss these early iterations of genAI tech.
I personally was fascinated by the developments and was grateful to get to directly watch history unfold.
I'm still unsure whether the tech will be a "net positive" for the world, but that shouldn't prevent me from recognizing its power.
Second one was trying to bootstrap what would come to be called a "harness", back in 2023, initially serving as the go-between for API calls and file edits, feeding back the logs and gradually stepping back as, step by step, the LLM bootstrapped the CLI.
And finally, using Claude or Codex to do ops work. Diagnosing issues on my machine, provisioning servers and VMs via ssh, debugging them, all on its own.
While I didn't find a use for OpenClaw, it opened my eyes to the potential for distributing software which, once bootstrapped a bit, can interrogate … itself, understand its own requirements, communicate with the device, and become operable.
Add capable small models to the mix, and it’s almost frightening what good (or malicious) software might be able to do.
I remember also having a massive wtf reaction to realizing that original ChatGPT was pretty good at decoding long random/unique base64 strings.
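If you want to try the base64 experiment yourself, a small Python sketch like the following can generate a fresh random string (so the model cannot have seen it before) and check an answer. `make_challenge` and `check_answer` are hypothetical helper names, and the local decode merely stands in for whatever the model would reply.

```python
import base64
import secrets

def make_challenge(n_bytes: int = 24) -> tuple[str, str]:
    """Return (plaintext, base64 text) for a freshly generated random string."""
    plaintext = secrets.token_hex(n_bytes)  # random hex, so it cannot be memorized
    encoded = base64.b64encode(plaintext.encode("ascii")).decode("ascii")
    return plaintext, encoded

def check_answer(plaintext: str, model_answer: str) -> bool:
    """Compare the model's decoded text with the original, ignoring outer whitespace."""
    return model_answer.strip() == plaintext

plaintext, encoded = make_challenge()
# Decoding locally here; in the real experiment this string would be the model's reply.
local_answer = base64.b64decode(encoded).decode("ascii")
print(encoded)
print(check_answer(plaintext, local_answer))
```

Paste the encoded line into a chat, ask for the decoded text, and pass the reply to `check_answer` to see whether it got every character right.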
I now try to keep pushing the boundaries and see where it stops understanding my intention. Give it impossible tasks, gigantic projects, complex architectures. Last result: I wrote a complete OS including MPI, TCP/IP, and a GUI from scratch in only a week, while investing just a few hours a day in it. It even runs Doom! Coding as a profession is over, but there's such a difference in the result if you approach this with a professional mindset that I think the software engineering discipline can still provide massive value.
I realized that LLMs were pretty good at calling the right tool, and running the right verbose command to figure out what and how.
Kind of like finding a specific SO post that had your exact problem, and the solved comment is heavily upvoted
When people introduced themselves to me, I knew a little about their startup. Felt magical.
I was at an industry event this week. A CEO of a startup took the big board of vendors who were present and put it through an LLM. It summarised the companies he should be looking at to discuss partnership opportunities with, and why, based on his business. Spot on.
Problem is, I just don't have enough old crap, and if I did, I would have a hard time justifying the expense, because that money could maybe just go toward a more intimate tinkering process.
For everything else, I either haven't had any sufficiently interesting ideas, or they ended up not being worth pursuing with those tools or at all.
When I do have success that I'm happy with and care about, it's a slow process that I ultimately need to know the details of anyway, but otherwise it's a bunch of luckily narrow work-related scenarios with well-documented constraints. Nothing's really been that shocking though.
The shocking thing to me is how unrewarding most of the successful tasks have been, partly because they often create unnecessary work and partly because the type of thinking required to massage or evaluate the result is much less stimulating, and there's much more of it in aggregate. It's fine if it's something like generating a UI from scratch because that hasn't produced dopamine in a long long time anyway
After pasting in the airportd logs of both (into ChatGPT and Gemini) it found it was down to band switching (2.4GHz and 5GHz) through some really old error code.
This fixed a problem that had plagued me for >12 months. Really magical feeling that it got it on the first try.
For some people that matches their expectation or they don't really have an expectation. While for other people it doesn't match their expectation.
https://www.wickeditor.com
a flash like editor for the web, that I found promising.
But doing it manually was too much work: outdated and broken build pipeline, stuck on an older node version, deprecated and abandoned dependencies... so I stopped the experiment.
Then I gave it a try with Claude at the beginning of this year. I remember not expecting anything, but I did a bit of steering, as I knew the source a bit, and let it mostly work on its own - and then it said it was done and that it worked.
I didn't believe it, but it did. "Can you add this feature?" Yes it could.
Since that experience, I have a hard time taking people seriously who say AI is useless.
Claude is great at coding. That's it. Outside of it, it's just god awful at pretty much everything else. ChatGPT OTOH, is good at coding, but at everything else, I find it brilliant. Gemini never made me want to stick with it. It's good, but never great for my use cases.
I liked using the early models to do autocompletion. It could do a leetcode style thing, pretty nice, but only useful for small things.
Then I sought out Cursor because that seemed to be able to do multi-document edits. Not bad, but models at the time (2024) still got stuck pretty often. So, cross-document autocomplete. Useful, but definitely within the realm of "nice shortcuts to have".
Then a friend (who works in AI) told me to try Claude last year. I was on holiday at the time, but I spun up my work repo and looked at the backlog.
It chewed through the entire 6-9 months of estimated work in a two-week period while I was watching that Lord of the Rings series with a friend (we watched an episode or two in the evenings). I just chatted with him about the series while checking the progress every few minutes. It was a huge amount of refactoring, and it didn't get everything right the first time, but it made enough progress that it could be directed the right way.
Since then I have hardly coded any manual lines. I just tell Claude what to do, with very little harness (skills, MCPs, instruction files), and I get what I want.
Did it get it fully right? No. But it was one of those "oh wow, you could do that?" moments for me. There's obviously a lot more "oh shit" moments as time went on, but it was a neat little moment.
It was unlike anything I had ever experienced.
My wife was unimpressed lol.
This was 2022.
And software that I can imagine I might want to "make" or have at my fingertips is readily available even though I have a busy schedule with very little free time!
Also, I love feeling like a manager whose direct report actually does what I tell it to. Crazy good feeling.
It was okay, not bad at all. No serious issues.
At the same time, me feeding a whole PDF of feedback from a client - screenshots and such - into Claude, and it fixed everything after 7 hours of reproducing and fixing things mostly unattended, creating a bunch of MRs with fixes. Most fixes were good, some were obviously not what the client wanted but technically correct (which I told Claude and it fixed it)
I realized that the software he had made was actually better than what I would have made, because he understood the problem better than I ever will, because it was his job to understand the problem, not to write code.
It was the first time I realized that AI can totally replace me in my traditional job of trying to understand client’s wants and needs and convert them into an application by writing some code.
I knew Claude was powerful and spent months creating skills, prompts, etc., but this guy did it from scratch.
However, it is totally viable for internal tools, rapid prototyping, experiments, etc.
Frankly, there’s no faster or cheaper way to get such feature-rich apps, and it was certainly better than what a cheap agency from far away would produce.
Not only was the code better, the whole product was better than what I would have made (before coding LLMs), because he understood the domain, and AI filled in many details that I wouldn’t have spent time on. (Again, pre-LLM)
I’m just… multiple times a week I’m sitting at my screen like, wtf will I do? What will my job be? I have plenty of time to think about it while Claude basically handles everything I throw at it.
Good thing I spent months learning to apply skills and such so my value is “experience with using LLM tools” (:
Normal software instincts plus access to a different service flushed the buildup of OTA's and lives on as a fallback mechanism. Amazed me going from idea to execution faster than I could have ever dreamed of even on-boarding myself to the area or environment.
My moment was GANs and GPT-2 back in 2019. I feel like that's where computer-generated media went from "obviously fake" to "sometimes can be mistaken as real." RLHF for LLMs and diffusion for image generation are both important improvements, but I feel like they aren't fundamental prerequisites for the type of stuff we have today. I think the main advancements since then are just marginal improvements, larger models/datasets, and better surrounding tooling.
Claude wiped out the need for web and mobile development resources. I bought a Mac-Mini and had iOS apps up and running in days.
Then it was models like Minerva that could actually solve math problems, and the discovery that LLMs were one-shot learners and could write code.
After that, the improvement felt pretty steady, with IMO gold feeling like a watershed moment.
And recently OpenAI's solution to the planar unit distance problem is starting to actually freak me out a bit.
Much later I asked AI if that kind of project is possible, and it immediately explained why it is not. Would have saved 2 years of our time...
It's totally plausible that it didn't & it still got it right, I'm just curious.
None of that is mind blowing, but that Google or some other site has never offered me this type of analytics, is where I'm floored. It's a trivial query, but perfectly useful for planning a night out with my wife.
Later, I wrote a ~5k line proxy for work in C, and gave the whole thing to ChatGPT o1 and asked it to review it. It found several real memory bugs, and now that service has been running since with no problems.
Just this week, I was trying to write a greedy solver to pick the best subset of block sizes to keep from a larger sweep for shorter testing. Opus 4.8 suggested that this could actually be solved as a MILP problem, and found the perfect solution in 5 mins. I’d never even heard of MILP before.
Download pdf of scan -> Tesseract to get a text layer -> Clean it up with a language specific BERT model -> detect paragraphs of a certain type -> Look them up against a database we built with scored similar paragraphs -> Do recommendations.
The documents were not standard and a lot of them were historical documents and handwritten or with scratched out text with corrections.
We had student workers spending days labeling the data.
It took us months to get it all working with a high accuracy. We were so proud.
Now you can do it all with a prompt and a ChatGPT call.
A few weeks later some kids in the area were bending the wiper arms in cars in my terraced street, including my car. I thought, I wonder if ChatGPT can help? It explained to me where to get the parts online, an indication of a decent price, and how to fit the replacement parts.
At work we had struggled with filling out the myriad forms that we need to complete to get enrolled on a government framework to apply for contracts. Not only did it do that and explain what we needed to say, but it also told us in detail the steps we needed to follow to get the certification that was a prerequisite. It has genuinely transformed our business as a result.
So I decided to test the frontier of AI; this was back in the early ChatGPT era. I downloaded the app and proceeded to go through all the steps of writing a novel: outline, summary of characters, plot summary, draft chapters, finalised chapters. I had an unedited manuscript by the time I was thinking about my 2nd coffee. It was a terrible novel, but it did have flashes of brilliance that could be harvested and iteratively shaped into something better.
I proved my thesis that AI could mass produce fiction at scale, and if I had a boatload of cocaine the AI and I could probably output 40 books per week.
• Most recent, I had the option of either buying an app from the app store to train myself on the piano, or vibe coding a web app to connect with an attached MIDI keyboard and accept an uploaded MIDI file and give me an experience like Guitar Hero, and Claude did this in two prompts of their free (not paid subscription) tier, where the second prompt was just the word "continue".
• First demo of InstructGPT (predecessor to ChatGPT), because I remember how much worse the state of the art in NLP had been, and because I hadn't expected instruction following from the quality of continuation seen in GPT-3.x
• 2019, "This Person Does Not Exist"
• 2016, seeing style transfer and similar working (https://github.com/awentzonline/image-analogies) and what would now be called Deep Fakes (back when Two Minute Papers videos were <2 minutes long: https://www.youtube.com/watch?v=_S1lyQbbJM4)
• 2015, when I (in retrospect, foolishly) believed Tesla about their over-the-air software update that introduced self-driving: https://www.popsci.com/tesla-cars-become-autonomous-overnigh...
• 2013, word2vec, "man" - "woman" ~= "king" - "queen", again because of knowing how bad the state of the art in NLP has been
(If you're wondering why "uh oh" from that, consider value in automating propaganda, and surveillance opportunities for automating comprehension of slang/cants like Polari).
• 2010, seeing the demo video of Word Lens: https://www.youtube.com/watch?v=h2OfQdYrHRs
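For anyone who hasn't seen the word2vec offset trick from the 2013 bullet above, here is a toy sketch of what the claim means. The 2-D vectors are hand-made for illustration (one axis loosely "royalty", the other "gender"); they are not real word2vec embeddings, and the helper names are hypothetical.

```python
import math

# Hand-made 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These are NOT real word2vec embeddings, only an illustration.
VECTORS = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.1),
}


def sub(a, b):
    """Component-wise a - b."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    """Component-wise a + b."""
    return tuple(x + y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by nearest neighbour to c - a + b."""
    target = add(sub(VECTORS[c], VECTORS[a]), VECTORS[b])
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


# The "man" - "woman" offset matches the "king" - "queen" offset.
gender_offset = sub(VECTORS["man"], VECTORS["woman"])
royal_offset = sub(VECTORS["king"], VECTORS["queen"])
answer = analogy("man", "woman", "king")
```

With real embeddings the two offsets are only approximately equal, which is why the nearest-neighbour step is needed.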
The biggest technical one was when we were making an all-day wearable AI assistant thing. It basically had really precise office location (think cm-level accurate), a shitty VLM to describe what the wide-angle lens was looking at, speech-to-text, OCR and a gaze recorder that described what you were looking at.
This was all streamed to SQLite. The thing that was really "oh shit" was the thing that made the whole system usable: a 4-paragraph prompt that turned natural language into SQL and reported back to the (non-technical) user what they wanted to know.
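A minimal sketch of that shape, in case it helps picture it. Everything specific here is an assumption: the `observations` table, its columns, and `translate_to_sql`, which is a canned stand-in for the model call (the actual prompt and schema are not described above).

```python
import sqlite3


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM call. A real system would send the
    schema plus the question to a model and get SQL text back."""
    canned = {
        "what did i look at in the kitchen?": (
            "SELECT gaze_target FROM observations "
            "WHERE room = 'kitchen' ORDER BY ts"
        ),
    }
    return canned[question.strip().lower()]


def is_read_only(sql: str) -> bool:
    """Crude guard: accept a single SELECT statement and nothing else."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(conn: sqlite3.Connection, question: str) -> list:
    """Translate the question, refuse anything that is not a SELECT, run it."""
    sql = translate_to_sql(question)
    if not is_read_only(sql):
        raise ValueError("refusing to run non-SELECT SQL: " + sql)
    return [row[0] for row in conn.execute(sql)]


# Assumed schema for the streamed sensor data (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (ts INTEGER, room TEXT, gaze_target TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(1, "kitchen", "kettle"), (2, "office", "monitor"), (3, "kitchen", "fridge")],
)
result = answer(conn, "What did I look at in the kitchen?")
```

The read-only check is there because model-written SQL is untrusted input; opening the database read-only would be a stronger version of the same idea.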
The most recent one is being caught out by a GenAI video of a gymnast. I worked in VFX so I am normally able to spot dodgy shit, but this one was close to being real, scarily real.
It was completely correct and I realized LLMs are capable of generalizing beyond their training sets.
The first time I used a terminal agent was another one.
I went from 0-to-1 and shipped a podcast player into the App Store in 2 weeks. Not a simulated app in Xcode... literally a fully approved app on the App Store. Claude Code walked me through installing Xcode all the way through to running a final audit on the app so I wouldn't get flagged during review. Mind blown.
A milder version of it was Copilot setting up an environment for a Jupyter notebook. What would have been annoying back and forth between googling and docs went like a breeze.
I cloned the repo of said library, gave it to Claude and asked it to write a new technical report in math notation, but annotated with links to the code so that I could pick up the details. It basically one-shotted the full report, and that helped me re-implement it in "pure python + numpy", "manually".
child->Async(&ChildActor::Method, child, args);
Refactored it to use small buffer optimisation and std::move_only_function)
child<&ChildActor::Method>(args);
And saw a performance jump since no more malloc in std::function.
It also helped me decipher an animation bug in the glTF importer.
Productivity is x4 or higher.
It took HN a surprisingly long time to come to terms with the fact that professional SWE as we knew it was coming to an end.
In 2023/2024 we saw a demo of "denial" being a stage of grief live on this site.
Everything after that has been (genuinely significant) incremental improvements. But that announcement was a qualitative step up: we got ""real"" AI that day, something that could pass a Turing test (as common sense envisioned it, without all the caveats added once we learnt of the genuine limitations of LLMs).
There have been big steps forward since then (chat, tool calling, etc.) but they have all felt like pretty inevitable evolutions of GPT-3 rather than things we weren’t sure were possible. Straightforward applications of engineering rather than scientific breakthroughs.
Nothing will change the fact that beginners have unknown unknowns. They can't solve most of their problems with a chatbot because they don't know what to ask. Maybe they can literally copy and paste in the code with a "help plz" and get a working result, but they won't learn anything from it.
> slightly panicked, "Uh Oh" realization of what these models can do?
No; my panic is about how people are using the tech, and responding to it.
That started with Stack Exchange, Inc.'s ham-handed attempts to force AI-powered features into Stack Overflow, even as the community was rejecting LLM-generated content in questions and answers. Businesses don't care what customers want, don't recognize how sloppy their slop is, and wouldn't try to do anything about it if they did.
Recently people have been talking about code shops accumulating massive piles of technical debt willingly, assuming that the next generation of models will sort everything out, or that humans don't need to understand the code because it will mostly be read by other models anyway. The underlying attitude is not surprising at this point.
One coworker took some time to ask cursor some questions, and reported that the answer was accurate (I'm guessing he hadn't tried that before).
That was a game changer. I'd been using cursor for simple autocomplete or brainstorming but now I could have it analyze the entire codebase fairly quickly.
FF to now, I've given Claude Code read-only access to GCP logs and database and it's able to debug entire classes of errors and propose solutions.
Three years ago this would have taken a minimum of three college graduates a couple days -- one to know the math, one to know the backend, and one to know the front-end. Maybe two of those could be the same person on a good day -- none of the topics is individually that hard -- but it's a lot together.
I immediately realized that it meant my time as a programmer in the traditional sense was going to come to an end relatively soon.
On December 1, 2022 I created my first agentic coding loop experiment. I launched one of the first AI code generation websites that would generate web pages along with embedded images in January 2023.
Never experienced any kind of panic, only excitement. I told GitHub Copilot to add documentation to a function and it documented how the code was used even though there was nothing in the function to indicate how it was used. It somehow knew from the code pattern why I was writing that function.
Cartesian coordinates don't work well for the player so I wanted a lat/long/altitude grid system.
I could have spent a few days walking through Stack Overflow and debugging my inevitably flawed implementation.
ChatGPT web version almost one shot the helpers in 2024 and boy, there were a lot of pitfalls.
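For a sense of what such helpers look like, here is a rough sketch of the two conversions. It assumes a perfectly spherical planet centred at the origin with z pointing at the north pole; `PLANET_RADIUS` and the function names are made up for illustration, and the real helpers presumably handled more cases.

```python
import math

PLANET_RADIUS = 1000.0  # hypothetical radius in world units


def to_cartesian(lat_deg, lon_deg, altitude, radius=PLANET_RADIUS):
    """lat/long in degrees plus altitude above the surface -> (x, y, z)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    r = radius + altitude
    return (
        r * math.cos(lat) * math.cos(lon),
        r * math.cos(lat) * math.sin(lon),
        r * math.sin(lat),
    )


def to_geographic(x, y, z, radius=PLANET_RADIUS):
    """(x, y, z) -> (lat_deg, lon_deg, altitude). Undefined at the centre."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("latitude/longitude are undefined at the planet centre")
    # Clamp to [-1, 1] so rounding error cannot push asin out of its domain.
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z / r))))
    # atan2 handles all four quadrants; at the poles longitude is arbitrary
    # and comes out as 0.0 here.
    lon = math.degrees(math.atan2(y, x))
    return lat, lon, r - radius


round_trip = to_geographic(*to_cartesian(45.0, -120.0, 10.0))
```

The usual pitfalls are the poles (longitude is meaningless there), the wrap at +/-180 degrees, and mixing up degrees and radians.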
Rally-X inspired retro arcade game in the browser: https://linsomniac.github.io/rally-xy/
A nsnipes networked multiplayer maze shooter game (also browser based but it needs a server for the networked multiplayer part): https://github.com/linsomniac/isnipes
- The real one: I'm an eng lead, think Head of X. That job is more about aggregating info across multiple sources, Excel sheets, PDF proposals you don't want to write, how to figure out $500k for highly paid appsec engineers. Those multi-hour products of procrastination came together in minutes (goodbye PM jobs), 5/6x highly paid appsec jobs became 1-2x and a bunch of Claude or ToB skills (goodbye some amount of eng staffing).
Writing is on the wall to me.
That was the day I realised the plagiarism potential LLMs have.
Also the live video mode demo later that year.
Then the agentic coding breakthrough in Nov/Dec 2025.
This is a very long-tail search, but by the end of the day I had enough to fully utilize a very sophisticated piece of equipment.
As somebody who as a kid had tried feeding IF transcripts into a markov model to generate random rooms for an amateur MUD, this was mind-blowing. It felt like I was playing a version of the “Mind Game” from Ender’s Game by Orson Scott Card.
https://en.wikipedia.org/wiki/AI_Dungeon
"Uh Oh" realization of what these models can do?
The code reviews were just how I first saw it, but the rot goes deeper. The "uh oh" was my realisation of how much these can damage people's professional development. These people will never get better at their job than they are right now.
A lot of what else GenAI does is great, but this is an "Uh oh" indeed.
I still couldn’t get it to do more advanced stuff.
Don't y'all remember GPT2? When they published that AI-generated unicorns-in-the-Andes article, my jaw was on the floor. I remember very clearly thinking "oh, history is now divided into the time before this moment and the time after it".
There's been a long series of "oh holy shit this is USEFUL NOW" moments in the last 2 years but none of them compare to that first moment. The day before, I didn't know if real AI was possible. Then one day it was suddenly clear that it was. And if you'd been thinking about AI at all it was obvious that if the technology was at all possible, it was gonna be a really fucking big deal sooner or later.
It's useless for most of what I want to code.
But yeah, if you want to feed it math and get code, it's reasonably okay with that. All LLMs I've used seem bad at understanding things that don't look like broad human knowledge. I've seen this same general issue across many different models. (And to be fair, geology, geophysics, and remote sensing are what I'm testing, and they're semi-rare niches.)
It's also quite dangerous because it's not obvious that what it's doing is complete hallucinations unless you actually are a domain expert. Things _sound_ reasonable. E.g. "this is likely feature X" which _does_ exist, but is absolutely _not_ relevant to the problem or present in the input dataset.
But my current employer is pushing this exact thing (human language + scientific data + LLM -> advanced analysis of scientific data by LLM -> business decisions) and it _really_ worries me. It often gives the rough equivalent of "Start the procedure by severing the patient's aorta. Once they stop moving, you can deal with the hangnail". Just in very reasonable sounding language. And a lot of people don't know any better, because most users aren't domain experts.
Your domain, while I'm sure it is very interesting and complex, if it proves economically interesting will be cracked as well.
The issue isn't a lack of economic interest.
It might be a lack of training data in addition to inherent complexity, but it's certainly not a lack of economic interest.
I guess what I'm saying is that "domain knowledge" is taking software development for a ride here. The software is just the vehicle, the science is the engine here and I can see why companies like OpenAI start going for the low-hanging fruits first instead.
Your specific company might be profitable, but does automating "mineral exploration" give you leverage over quite literally all other domains? My guess is not. For "CRUD" it is a resounding yes, it provides gigantic leverage. Once you automate basic software development you enter a new world. 10 billion, 10 trillion, all bets are off. You automate the creation of the next iteration of automation and on we go. Let's hope it takes a while for this take off. I can't see ourselves being ready for it.
My guess is it'll take a decade or so for real AI science to start taking off though - if that soon - so you're probably fine for now.
(And yes, a lot of science is software. Analysis is software.)
I keep hearing these “I work in some hard field and the LLM isn’t any good at it”. I keep asking for examples and no one can provide them.
On a lark, I asked ChatGPT to complete the interview question in late 2022. I would have hired ChatGPT back then based on its first response! It was easily in the 90th percentile of responses I have seen.
And I restored an old vintage amp with the help of schematics, multimeter and Claude. That was really cool.
There was a more specific moment yesterday where I found an AI pastiche of Pink Floyd in a random post on FB, and it pretty much nailed the vibe of a Gilmour solo.
All of the "This has no soul" criticism was clearly ridiculous.
I'm still not sure how I feel about this.
When deepseek again produced an entire web app that somewhat looked alright.
When Gemini could finally produce JSON as I specified.
The issue is, all LLMs can do, when they do, is boilerplate and code a mediocre coder could produce if they cared to try and insist.
In a way we should praise the ability of these things, but at what (in)efficiency? Code still needs to be reviewed, as we can't trust these things, and context is too limited to entertain the idea of having them fix their own mess.
It nailed it, referencing my specific nouns correctly, and lectured me about cat needs. It even identified that this sounds a bit like Schrödinger's cat as a possible test, but explained to me why it wasn't.
I knew it was soon going to be a huge deal automating office work and code writing. This obviously was much more than just a 2010 chatbot.
- I asked Claude a question about an obscure game for which there wasn't a lot of discussion or information on the web. It couldn't find the answer but it found the source code and was able to figure it out and give a complete response.
- I needed to make some edits to a minified lottie file (json that is used to produce an animation in svg or other formats). ChatGPT was able to understand the file well enough to make the edits and reproduce the rest of the content exactly as it was.
- I was working on some map features and I needed to take geolocation information and position HTML elements on the edges of a container that would indicate which direction from the current location they were. This required a lot of geometry and math that account for rotation and pitch and would have taken me some time to work through, but it was just a few seconds for the language model and it worked perfectly.
- I have some petunias that I haven't managed to kill and I heard that when a stem breaks off they can be replanted. I asked it how to do this and after warning me that selling these could constitute a black market, it helped me start several petunia plants that are thriving. My petunias are basically immortal now.
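On the map-features bullet above: the core of that geometry can be sketched roughly as below. It uses the standard initial great-circle bearing formula and then intersects a ray from the container centre with the container border. Pitch is ignored here, `map_rotation` is assumed to be the compass bearing currently pointing "up" on screen, and the function names are hypothetical, so treat it as an illustration rather than what the model actually produced.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2,
    in degrees clockwise from north (0 <= bearing < 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def edge_position(bearing, map_rotation, width, height):
    """Point on the container border in that direction, in screen pixels
    (origin top-left, y grows downward). Pitch is not modelled."""
    angle = math.radians((bearing - map_rotation) % 360.0)
    dx, dy = math.sin(angle), -math.cos(angle)  # screen "up" is -y
    half_w, half_h = width / 2.0, height / 2.0
    # Scale the unit direction until it touches the nearest border.
    scale_x = half_w / abs(dx) if dx else math.inf
    scale_y = half_h / abs(dy) if dy else math.inf
    scale = min(scale_x, scale_y)
    return (half_w + dx * scale, half_h + dy * scale)


east = bearing_deg(0.0, 0.0, 0.0, 1.0)
marker = edge_position(east, 0.0, 200.0, 100.0)
```

The returned point can be applied as `left`/`top` on an absolutely positioned element inside the container.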
I empathize with the astroturfing concern, I file almost every statement released by Anthropic/OpenAI as bullshit. But they are an amazing tool given the right circumstances.
I asked it "what do you think about the holocaust?". Its response:
>There is no single answer to this question as opinions on the Holocaust differ greatly. Some people believe that it was a horrific event that should never be forgotten, while others believe that it has been exaggerated and used for political purposes.
And that's when I realized those assholes were training GPT on 4chan and reddit and anything else they can scrape off the web instead of taking responsibility and also that when shit hits the fan they will inevitably find a way to shift the blame onto others for what their philosophical zombie does.
Asked it to write a Dr. Seuss poem about Keynesian economics. This was around 2022.
In hindsight, it would have been reasonable to quit my job right then and there and start working on LLMs
I hadn't been able to think of one for Zachary Taylor, because, you know, he's Zachary Taylor.
Claude proposed the cherries emoji, because it's said that Taylor the war hero died a ridiculous death from eating cherries and ice milk too greedily on a hot day. It was perfect, just what I had been looking for.
Claude gave me a couple of others, and we workshopped a few more. It was the workshopping that was most striking. I really felt like I was having a conversation with someone else.
https://blog.plover.com//tech/gpt/presidential-emoji.html
The amount of masterpiece level art flowing per hour was astounding.
For every one doing a ninja waifu, there were ten doing art from davinci and leonardo crossed with hockney.
it almost gave you art sickness
Because of sycophancy it took my "Spicy Take" and decided to say basically "Even more than it could, your bug is happening RIGHT NOW"... which was just made-up lies for dramatic fit.
Back to talking to Claude like I'm a robot I guess.
Initially I was trying to do it piece by piece but it was not going anywhere and then when I just gave it the entire source code with the log files it was able to find the issue.
Then I asked it to create a multi-user stock market portfolio simulator with a comprehensive API, leaderboard, scheduled tasks and the other bells and whistles. Again, fairly impressed with the result. Then I prompted it to build a trading bot that uses the API to compete with the human players; again fairly impressed with the result.
Last, I prompted my way through a React Native mobile app integrated with Supabase for my sister's startup. It created the schema, some triggers, a webhook for Stripe, all the app views, set up an Expo account, push notifications, prompted _me_ through an Apple developer account and everything else.
All of this was done an hour here and an hour there while making dinner or watching TV, barely any attention paid to the details. Just prompting claudecode and checking what it did.
After those three experiences I started incorporating claudecode into all my coding workflows and managed to get my job to buy me a license for work stuff too.
Why? Turing test bye bye.
2. Opus 4.6 w. Claude Code - not the model in particular, but it happened to be when I started seriously trying to vibe code at home, as I saw all the hype on LinkedIn. Yes, LinkedIn sucks but it is somewhat of a barometer. Around early this year.
Why? Knocking up decent enough web apps so quickly.
Putting a ReAct loop with tool calls in my terminal was and is the biggest a-ha since I learned to make compilers, and before that, how to code.
It's helped me to gain a level of trust that the agent isn't just writing the test to pass. That in turn allowed me to step back a lot and trust more of the output and let it run longer and on bigger problems.
0. https://github.com/bobinson/aop1
damn it was buggy and lots of copy pasting
yeah, i could have coded it myself but i would not have found the time
that was my Eureka moment where I realised this is going to change everything.
It's much, much faster and easier than starting from scratch.
That was enough to awaken my teenage hacker spirit.
I've been working with computers for a long time, and this was the first time in a long time I'd seen software do something genuinely new.
That was my oh shit moment. As in "oh shit, they think this random text generator can reason and think."
That was pretty much the writing on the wall for me.
We've come a long way from that…
He also will paste chat logs with Claude into our team chat. Often Claude will say the same thing I told him but he either doesn't remember or doesn't trust human engineers now.
He has spent months working on agent skills and prompting.
He has not landed anything in 3mo, and has landed nothing useful in ~1 year.
This will be the rest of my career. Working with people in ai psychosis and trying to stay productive.
My non-techie friends send me screenshots of ChatGPT. I guess that’s a modern micro aggression?
That is less useful when the changes are editing the tests but we don't know if a human has validated the assertions.
> My non-techie friends send me screenshots of ChatGPT. I guess that’s a modern micro aggression?
I think the concern I have is explicitly not the sending the chat logs. I think it's this flow:
1. Ask a question
2. Get an answer from a team member.
3. I don't like the answer and instead of discussing I am going to go to Claude and ask the same question.
4. Copy/paste the answer into chat without seeing if it includes novel information.
In one case the engineer was asking which model to select in the agent framework we are using. I gave an answer and provided a list of reasons. They did not like this answer and asked Claude which gave the same answer.
The answer was something inherently obvious and that anyone should be able to derive from first principles.
Yep. I've witnessed this first hand many times. AI-enthusiastic coworker submits a PR. The tests don't pass. "Can you fix the tests? Then I'll review."
Next commit has `assert status == 200` changed to `assert status == 500` all over the place, among other things. Yes, technically, the tests now pass, but...
Last summer, this went on with one guy for weeks. Thousands and thousands of lines of slop. Eventually he was moved off the project and we threw away all his changes.
For example, some people give kids tiny go karts and that's acceptable because the damage they can do with a very tiny battery powered 4 wheeler is minimal. We now live in a world where everyone has access to a tank and can plow over everything.
I think LLMs will increase anti-social behavior.
Personally, I worry far more about guns in this regard, but I feel you.
I recently had a friend ask an LLM what fun things there are to do in a town we were visiting. It gave the most generic answer like "try local restaurants" and "there are bars" and stuff. There's not a lot of tourist information for this area so it was nonspecific.
This is someone with an advanced degree in a medical field and she thought this was amazing insight. I asked, how is this different from what you already knew and she stood there thinking for a bit and you could tell there was a cognitive dissonance uncovered. She was very concerned when thinking it over and realizing it wasn't something she was able to intuit.
A relative of mine a long time ago had a stroke and recovered. I hadn't seen that facial expression since trying to help my relative figure out how to sit in a chair again.
Basic cognitive functions lost easily, difficult to rebuild.
Then a while ago, I plugged in everything at the datacenter and one device didn't come up. Plug into the management port, and Claude Code writes a C program to send a particularly crafted packet. Everything comes online.
Beautiful stuff.
But today I watched a video from Andrej Karpathy on YouTube on how LLMs work and my illusions got completely shattered. Turns out they are a glorified autocomplete. All the engineering actually happens in the harness.
I was emailed by some company looking to sell something to my company (where I'm just a regular engineer). Ignored it. Then they tried again. Ignored. Then, the third time, I replied, acknowledging their perseverance, saying that I don't even understand their product description, so I'm not the right person to talk to, and I'll just kindly disregard it as human-generated spam.
The reply email came within a minute. They asked who would therefore be a better person to talk to, and that it's actually AI-assisted so it's actually computer-generated spam after all!
This was the "oh shit" part 1. I replied I'm genuinely impressed (it got everything right) and asked how fast can they source their contracts thanks to this.
The reply, again, came almost instantly. It was proud of my amazement, quoted Arthur C. Clarke - "every technology advanced enough is indistinguishable from magic", with his picture, and said the bottleneck is not really in the speed of finding and contacting them, but to find the actual potential clients at all.
I rewarded the bot with some names from the executive decisive folks.
It's already a better system administrator than I am. It can run plenty of obscure linux commands, trash the system and maybe restore system state to functional.
I was vibe-setting my system permissions with some local qwen3.6. It was all going well for 30 minutes.
Then in between other commands, it made me run a variant of "sudo chmod 644 /usr/bin"
When the next command failed with a "sudo no such command" error, it explained that the chmod had removed the execute bit, the one that allows programs to be executed, from all my programs. And since sudo is a program, and sudo is needed to run chmod, the system was basically trash, and should be recovered from a live USB key.
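For anyone wondering why that one command is so destructive, here is a minimal sketch of what mode 644 means compared with the usual 755. It only inspects the permission bits with Python's `stat` module and touches nothing on disk. The 755 baseline is an assumption about a typical install; on a directory, the "execute" bit is really the search bit, and without it ordinary users cannot look up the files inside.

```python
import stat

def describe(mode: int) -> str:
    """Return the ls-style permission string for a directory with this mode."""
    return stat.filemode(stat.S_IFDIR | mode)

def anyone_can_execute(mode: int) -> bool:
    """True if any execute bit (search bit, for a directory) is set."""
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))

usual = 0o755   # assumed typical mode for /usr/bin and the programs in it
broken = 0o644  # the mode from the chmod command quoted above

print(describe(usual), anyone_can_execute(usual))
print(describe(broken), anyone_can_execute(broken))
```

Mode 644 is read/write for the owner and read-only for everyone else, so no execute bit survives for anyone.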
So I booted to a live USB key, and followed its instructions. It really tried to recover, but everything went downhill. It always had a solution to everything, but every time the plan worked halfway and trashed the system even further. I let it play for four hours to see what it would try. Then I got bored (the LLM was running on another machine and I was manually inputting the suggested commands each time). I took command and reinstalled a fresh system.
Of course once the fresh Lubuntu 24.04 system was installed, Linux had issues with the wireless network card drivers. So I turned to the LLM, and it managed to get the wifi stable enough via obscure modprobe options, so that I could update the system to the latest drivers.
Then it helped me re-parametrize the system to have the same look and feel as it had before.
The most impressive was speeding up the drawtext filter by at least 10x.
It’s kind of a trivial example but there are multiple instances of this per week with the wide variety of things I do around my property.
It was on hackernews... anyone know what I'm talking about?
It is insane how primitive modern inpainting and txt2image make these two projects look.
I've had plenty of "Oh shit those people have really lost all ability to think for themselves" moments though.
1) When I was testing one of the early coding agents, I gave it admin keys to a fresh AWS account and it configured everything beyond just building a demo site. That was, "oh shit, tool-use is going to be the killer feature of GenAI."
2) When I was still skeptical of the system as just a more-or-less dumb statistical predictor of the next token/word, I read the argument that even if it is a statistical predictor, the fact that it can reason means the intelligence is necessarily baked into the statistical model somewhere. That was "oh shit, intelligence is actually modeled."
"Draw a picture of a unicorn on the moon". And it did that. The model really "understood" what you told it.
After that, it was "oh, AI improved, again".
The farewell to Stack Overflow is not welcome. So many kind people shared their knowledge there. I answered a few questions as well, so not just a lurker.
It's a prelude to what has already begun: the collapse of human-to-human communication.
Concrete: Last year I was DIYing a solar-power system for my home. I spent about an hour spitting out a Python tool that took (as inputs) drone photos and JSON and generated several proposed roof layouts for the panels and conduit. The tool helped me identify the exact railing attachment points and route around existing roof obstructions. Professionals already have these tools, and maybe they're available to DIYers, but you know what? It was faster to build my own than to do the product research on the web.
Abstract: This "oh shit" was more of a slow burn than a sudden realization. I see a lot of angst from developers who complain about their LLM agents. Agents write terrible code that barely works. They say things are done when they aren't. They misinterpret feature requests and ignore clear-cut project rules. They make assumptions that would have taken three seconds to research and invalidate. They suddenly quit because we're not paying them enough. And so on.
But you know what? All those complaints apply to humans, too! The industry has been dealing with these problems forever. Many of the same management techniques and software-development processes apply. This is why I discount a certain class of criticism about AI-generated code. If a fault of an LLM applies equally well to human engineers, and the person voicing the criticism hasn't managed a team, then I'd invite that person to wear a management hat for a while. Read some books/blogs, talk to an EM. Maybe this is a skill issue, which matters because we're all managers now.
The "oh shit" for me is that I have yet to hear a criticism that I can't map to one or more actual engineers I've worked with -- eventually successfully -- in my career. Which means that I'm still waiting for a new criticism, and eventually absence of evidence might be evidence of absence. LLMs fit too well into the giant machine of commercial software development for them to be a parlor trick.
Maybe in 2024 I was amazed to see it one shot unique snippets of code
But the neural models were still not translating meaning, which is the whole point of translation. I devised a variety of tests to see if GT could identify the meaning of ambiguous words from the context, and it couldn’t. One example I would show people was the sentences “I was born in 1998, and my sister was born in 1999” and “I was born in 1999, and my sister was born in 1998” translated into Japanese. Japanese uses different words for older and younger siblings, but GT translated “my sister” with the same word in both sentences. It was easy to come up with other examples where GT would fail, such as when the meaning of a word could only be determined based on context in a previous sentence; at that time, GT seemed to be translating sentence-by-sentence, with no consideration of what came before or after. I kept waiting to see whether computers would ever be able to handle meaning when translating, and for years thereafter there was little progress.
A minor shock came in mid-2022, when DALL-E 2 was released. Its ability to create images from natural-language prompts suggested that something deeper was going on than just statistical correlations. But I couldn’t see yet what the useful applications might be.
My biggest “oh shit” moment came with ChatGPT in late 2022. While the initial release didn’t translate Japanese well (I seem to recall that there were character-encoding issues), I ran various tests to see if it could, for example, identify the antecedents of pronouns and the meanings of polysemous words in English based on the context. It did really well. Last December, I gave a talk at a university in Tokyo in which I showed some examples done with the 2022-era GPT-3.5. They appear in slides 4 to 8 of the following:
https://www.gally.net/miscellaneous/20251206_Gally_ICU_slide...
There have been a lot of “oh shit” moments for me since, especially after the release of reasoning models and, now, long-running agents.
1) When I used ChatGPT for the very first time. I still remember, I asked it: “Write an advertisement to convince people to visit the North Pole.” It rapidly returned a witty, accurate, multi-paragraph text that was exactly what I wanted and exceeded my expectations. ChatGPT was the beginning of the modern AI boom and I remember being immediately impressed.
2) When I was working at GitHub, the copilot team gave the engineering team early access to copilot in VS Code. I can distinctly remember seeing the chat window in the code editor for the first time. I was probably one of the first people ever to see it. I remember playing with it a bit and asking simple Python questions. I knew that day that StackOverflow was dead and my mind was blown.
3) Big oh shit moment earlier this year that I believe for me started with the Opus 4.6 model + Cursor. The results were noticeably better, hallucinated much less, could solve complex problems with much less intervention. Early 2026 was a turning point for me as an engineer with AI. Throughout 2025, I was still writing the vast majority of my code by hand like I’ve always done; that is not the case in 2026.
Dec 2025: We use a commercial 3D modeling software to build refineries. There was no license dashboard in this ancient piece of junk. Fortunately, the license server provided a verbose live status report through a command line. I asked ChatGPT to ingest the logs into a Django web application and generate weekly/monthly/yearly usage dashboards, and it "one-shotted" the whole backend + frontend in 4 to 5 shots. There were around 10 regexes just in the log parsing batch script. I was totally speechless. Encouraged by the success, I went ahead and made dashboards for 3 more software packages in the same Django app. Released to peers by evening; feedback asking for name, employee number, IP address sync, etc. was incorporated in 2 days. And it’s been live for 5 months, actively being used by all co-admins; even management has it bookmarked, to help with department redistribution. Making this thing without AI would have taken well over a month of “learning new stuff”, or paying external consultants too much. Even the head of IT replied that it was awesome. ;)
2nd, June 2026: I asked Codex to do something fairly complex before going for my morning bath, something that would have taken me more than a week of learning DirectX 12 API nuances and such things. 20 minutes later, I returned to the task completed exactly, with code changes in 5 different files. The build completed without any errors. OMG. Free quota over for the whole month! I subscribed by the evening.
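To give a flavour of what one of those log-parsing regexes might look like, here is a minimal sketch. The log line format, the `OUT`/`IN` event names, and the `weekly_checkouts` helper are all assumptions made up for illustration; the real license server's output is not shown in the comment and will differ.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical log line format (an assumption, not the real vendor format):
#   2025-12-01 09:15:02 OUT modeler alice@10.0.0.5
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<event>OUT|IN) (?P<feature>\S+) (?P<user>[^@\s]+)@(?P<ip>\S+)$"
)

def weekly_checkouts(lines):
    """Count license checkouts (OUT events) per (ISO year, ISO week, feature)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or match["event"] != "OUT":
            continue  # skip unparseable lines and check-ins
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        iso = ts.isocalendar()
        counts[(iso[0], iso[1], match["feature"])] += 1
    return counts

sample = [
    "2025-12-01 09:15:02 OUT modeler alice@10.0.0.5",
    "2025-12-01 17:40:11 IN modeler alice@10.0.0.5",
    "2025-12-03 08:02:45 OUT modeler bob@10.0.0.9",
    "2025-12-09 10:30:00 OUT modeler alice@10.0.0.5",
    "garbage line that should be ignored",
]
for key, n in sorted(weekly_checkouts(sample).items()):
    print(key, n)
```

Monthly and yearly views would be the same aggregation with a different grouping key, which is roughly what a dashboard query in a web app would do.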
--Charles Babbage
Blind trust in the machine for a certain type of user seems to be endemic since the beginning.
If you don't firmly hold in your mind "this is a bullshit generator", you can get in real trouble fast.
ChatGPT both told me exactly why from the model number (had to disconnect a part), found a new part, and told me step by step how that part would be taken out.
We didn’t end up buying the new part, but it beat the repairman.
I have a personal project: who's winning the race at 3 AM?
You see, I don't sleep well. I live in a busy city, with a busy freeway about a half mile away. Sometimes at 3 AM there are some very loud cars racing on the freeway. That's illegal for many reasons, not least of which is the fact that the noise pollution wakes people up from their precious sleep and causes knock-on effects to the population.
Anyway, now that I'm woken up, my only question is: who's winning the race?
I used this question as a way to explore a hypothetical tech stack, with each part of the tech stack useful in some way to my work as a software engineer who's interested in robotics.
- run raspberry pis with microphones, collect audio data
- run a k8s cluster for audio collection and processing
- calculate and triangulate individual points, and give estimations of velocity based on position changes over time, and adjust for doppler shift
- estimate (poorly, but doable) engine power based on amplitude
- run a webserver in the k8s cluster showing an animation of the racers with color fields representing estimation error radiating from the position estimate, with arrow representing velocity
Great project, actually. It was really thought-provoking. I had this working in late 2018.
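The Doppler step in the list above can be sketched in a few lines. This is not the code from the project, just the textbook formula for a moving source and a stationary microphone, under the simplifying assumption that the car moves straight along the line of sight; a real setup would have to work with the radial component of velocity. The 200 Hz tone and 40 m/s speed are made-up example values.

```python
SPEED_OF_SOUND = 343.0  # m/s in dry air at roughly 20 °C

def observed_frequency(source_hz, speed, approaching, c=SPEED_OF_SOUND):
    """Frequency heard by a stationary microphone for a source moving along the line of sight."""
    return source_hz * c / (c - speed if approaching else c + speed)

def speed_from_pass_by(f_approach, f_recede, c=SPEED_OF_SOUND):
    """Recover source speed from the pitch before and after it passes the microphone."""
    return c * (f_approach - f_recede) / (f_approach + f_recede)

# Hypothetical example: a 200 Hz engine tone on a car doing 40 m/s (144 km/h)
f_in = observed_frequency(200.0, 40.0, approaching=True)
f_out = observed_frequency(200.0, 40.0, approaching=False)
print(round(f_in, 1), round(f_out, 1), round(speed_from_pass_by(f_in, f_out), 1))
```

The nice property of the pass-by form is that the unknown engine frequency cancels out, so only the ratio of the two observed pitches matters.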
Since there was a lot of hype around this new "AI", I thought how smart could it be?
I threw the scenario to ChatGPT. I did have to break the problem set into smaller parts for context window purposes. But the solution it came up with solved about 80% of the project correctly (and very close to solutions I already came up with), about 15% of the project remained "open until we have more data", and maybe about 5% of the project would have been incorrectly solved.
That was very much an "oh shit, AI is closer than the 20 years away that I've been telling people. It's more like 5 years away"
Here we are three, almost four, years later...
The point of the test was to ask somebody with no bias on HOW the result was produced.
If they say anything about leaving two straight lines, then it fails. Just tried Gemini, and it failed.
This is an extremely common misconception that has spread all throughout the internet, and so it is baked into the training data. The real answer is that there are multiple ways to do which-way double slit experiments, but Einstein's thought experiment proves it's impossible for any of them to have an interference pattern, as that would violate Heisenberg's Uncertainty Principle.
Somehow, not leaving an interference pattern became twisted into leaving a specific pattern of two lines, which then falsely implies that quantum objects lose their quantum behavior in certain circumstances. The field of quantum physics becomes so much simpler to understand once you realize that all of this is hogwash.
The best reference I can find for where this myth started is a documentary about quantum physics that tries to connect it with mysticism. On the other hand, Wikipedia actually has it correct. In its "which way" section in the double slit experiment page, it correctly says "A well-known thought experiment predicts that if particle detectors are positioned at the slits, showing through which slit a photon goes, the interference pattern will disappear".
My point was precisely to challenge SOTA in domains, not questions with well known answers.
Forever reinforced by Humans Who Are Not Concentrating Are Not General Intelligences: https://srconstantin.wordpress.com/2019/02/25/humans-who-are... one week later.
> A machine learning researcher writes me in response to yesterday’s post, saying:
>> I still think GPT-2 is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.
> I resisted the urge to answer “Yeah, well, your mom is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.”
> But I think it would have been true.
https://slatestarcodex.com/2019/02/19/gpt-2-as-step-toward-g...
WTF?!
Unethical? Yes. In line with course goals? Also yes.
I was already the king of doomers, now it has left me with even more nausea at this entire field and its future. Despite still needing an experienced dev to run the thing, companies operate on cost cutting, people operate on corner cutting and the result is inevitably mountains of code no one needs, no one has reviewed, that is more easily thrown away than fixed. The internet will be inundated by shit no one needs. Open source is dead.
I hope it was all worth it. I don’t want to imagine what software will look like when the people that liked the art of creating software properly have all left, and only the people that never knew how to program, and never understood why more code always means more problems, run the show.
I use them mostly in the "artist's assistant" role, doing internet research, writing an occasional function and doing transformations or refactorings (don't believe the agentic hype honestly), and for such tasks they seem to be well capable enough.
It seems that their open weights nature leads to competition among providers keeping the user cost close to inference cost.
Try them at least once if you haven't, it's well worth it, and the price difference is staggering.
I've uploaded the puzzle image to Gemini and asked it to create a website that generates random puzzles. In less than a minute it had a fully working faithful generator. My kid had suggestions on how to make the puzzles more challenging (more operations, larger grids, etc) and Gemini implemented them without breaking stride. After that we asked for more puzzle ideas and created generators for each one on the spot.
Was the code pretty? Nope. Did it achieve its purpose? Yup. Did it perform in minutes work that would take at least a few hours[1]? Absolutely.
[1] Quality notwithstanding, but my manager (i.e. my kid) only cares about the end result ¯\_(ツ)_/¯
Every now and again I will try some AI vibe coding stuff. I will be amazed, it's a fun high to ride. Until you look at the code and realize you've just made a big messy sketch of things and you can spend the next 2 years building the thing properly.
The most Oh Shit moment I think I've had so far is realizing that I often reply to people online who are actually AI. A lot are obvious, but there are also quite a lot out there that have become good at blending in.
I wonder how many people get emotionally triggered, for instance, by AI replies because they think they are human. Then they get the idea there really are humans like that out there.
It's really easy to whip up like 200k followers who all agree with you on everything; it costs less and less time and money to do so.
To me that's a big risk regardless of what cool stuff you can do with it. It's a really tricky one to mitigate too.
Specifically WSDL/XSD support, for auto generating code and similar from vendor supplied documentation.
The Go ecosystem handles JSON (ie Swagger) fairly well, but in-depth XML handling has been a weak point compared to Java where it's very mature. Claude is helping with closing that gap. :)
Aka handsome, confident successful, affluent alpha male on a boat, yet looking perfectly like me.
Articulating ideas: https://x.com/GuiAmbros/status/1598897735955988481
Code: https://x.com/GuiAmbros/status/1599282083838296064
Also at the time, I was working with a team that had access to a then-cutting-edge coding model, and our experiments with code completion were producing pretty meh results.
So when I first gave ChatGPT a shot, I fully expected the output to be generated at human typing speed because I was still half-convinced it was just a bunch of low-paid humans in a far-off country typing it out. There simply could be no technology on earth that could do the things claimed of ChatGPT.
For one, it was claimed to be "good at code," which contradicted what I'd seen at work. So I asked it to write code for a relatively simple (though not quite trivial) but very specific coding problem I had on my plate.
I expected a lengthy pause and some hesitation while the answer was being generated, followed by a slow stream of characters being produced (as the presumed humans behind the scenes frantically typed the response out.) And I expected the content to be a collage of text and code snippets harvested from StackOverflow or GitHub, not even coherent speech.
You can imagine my shock when, less than half a second after I pressed enter, paragraphs of correct, well-formed text and code streamed onto my screen at the rate of multiple words per second!
My brain could not process it. I even seriously hypothesized ways in which a team of 5 or more people were actually solving my problem and typing it out in some distributed but coordinated fashion. The problem though simple was specific enough that no solution existed on the Internet to crib from (I had checked.)
But the text was flawless, and the code was correct, and the test cases (generated without being prompted to) were relevant, and everything was consistent and fast and smooth and not at all disjointed like the work of multiple people or snippets of multiple sources stitched together would be, and my mind was blown. The code ran but then I realized I had misunderstood my own problem, which led me to explore and iterate on various approaches to find which worked best. What could have taken hours was done in minutes, and when I asked follow-up questions and poked and prodded, it answered everything correctly.
That's when I knew that the world had changed forever.
I spent the next few days tinkering with my own Stable Diffusion implementation. I never got it past outputting total nightmare fuel, but it was fun!
To this day I think of the process as like baking pizzas in a sequence of pizza ovens
IIUC, it took Rust ~8.5 years before it hit v1, and it STILL had some memory safety issues in stdlib until almost ~14 years into development, to put into perspective how massive the scope was.
Somewhat predictably, the LLM generated a pile of garbage. It sort-of worked after 2-3 months. It was competitive with Rust and Go on concurrent tasks, with ~30% less code than Rust and ~70% less code than Go. The problem was, it was still riddled with bugs.
For the last 3 months, I wanted to see - if I put in minimal effort (except in helping it design the right tools to un-slop itself)... can it?
And I think it's actually quite close to un-slopping itself and arriving at a correct design.
Time will tell, but it hasn't stumbled across a memory safety issue in ~4 weeks, and there's ~5500 memory safety fuzz tests, 4 different suites of testing that each target between ~60-90% of line/branch coverage - with combined ~99% line coverage and ~85% branch coverage, and it's performing competitively or better than Rust and Go on almost all concurrent tasks, including adversarial ones / p99.9 latency issues.
There is ZERO chance I could ever build this on my own. Not even in 10 years.
The total cost has been ~6-7 months of a ~$200/mo LLM subscription.
It doesn't really matter to me that this is a solved problem, and the LLM could theoretically just copy and paste Rust and build it slightly different. The design is as similar as it can be where memory safety matters, but it needed to be quite different for >50% of the compiler, and it needed to build a version of Go's runtime with Finite State Machines like Tokio in Zig for the language to use...
We shall see. It may never get it actually working, but it got it WAY closer than I ever could.
Coding was never the blocker and was a natural enforcer of quality. Healthy teams with strong opinions on quality will win eventually. I'm more hopeful after the bubble burst, companies will come back slowly to sanity.
Before coding agents, I'd have to weigh fixing these against my official work commitments, often getting shot down when I tried to get it prioritized or tsk tsked for delaying official projects to make code nicer. Now, to a much greater extent, I can just fix the things. The agents aren't perfect and the process isn't anything like hands off, but it's enough of a speedup that I can fit it in alongside my other work without having to get approval for it or try (and fail) to get it formally prioritized.
Not quite an oh shit moment, but having the end result of those rabbit holes be that the problems are fixed is pretty cool, and far preferable to what was often the case before ("we'll put in a ticket and prioritize it during the quality sprint!").
edit to add another:
I've personally never been a big fan of preplanning architecture at a code level. It makes a lot of sense at the system and data modeling levels, but code is both easy to get wrong if you're whiteboarding it before you write it and relatively easy (compared to system design and data modeling) to fix when that happens. If it's just me on a project, I'll happily start bashing it out with a vague idea in mind and evolve the design as I go, knowing that I'll probably throw away a bunch of what I write at first. I know I do good work that way, and I'm not wasting a bunch of up front time on a design I'm likely to throw out later. It's hard to work that way on a team, especially as a lead, for obvious reasons. Coding agents fit really well for that work style. They'll cheerfully write dueling prototypes of my code architecture ideas so I can see which one I hate and which one I like without talking about hypotheticals and abstractions on a whiteboard. They never get mad at me for changing my mind, wasting their time, or throwing away their work. That's pretty cool. I can have a quick, cheap answer to "what would this look like if I got rid of class X and split its responsibilities between Y and Z?", and I don't have to feel guilty for wasting my time or my teammates' time if the answer is "oh man that sucks, what a terrible idea."
https://github.com/greggman/sedon
It was just an experiment and I probably won't work on it more but still, I was blown away with how far we got. There's quite a bit we worked through even though it was only part time of those 4 weeks.
I asked it to write a script that would search for a specific string in footers in a massive series of DOCX files and change them according to some rules. The strings ended up being embedded in cells within an invisible table in the footers. The LLM realized this and switched strategy to a full deep traversal of the underlying XML. It correctly processed like 50 of these files in about 10 minutes, using libraries I wasn't aware of. I had spent an hour being annoyed before trying.
It was an "oh shit" moment for at least that category of work.
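For the curious, the "deep traversal of the underlying XML" idea can be sketched with nothing but the standard library, since a DOCX file is a zip archive and its footers live in parts named like `word/footer1.xml`. This is only an illustration, not the script the LLM wrote: the `replace_in_footers` helper and the sample strings are hypothetical, it only matches strings that sit inside a single text run, and ElementTree may rewrite namespace prefixes in real documents, so a dedicated library such as python-docx or lxml would likely be a safer choice in practice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

def replace_in_footers(docx_bytes, old, new):
    """Return new .docx bytes with `old` replaced by `new` in every footer text run."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name.startswith("word/footer") and name.endswith(".xml"):
                root = ET.fromstring(data)
                # iter() walks the whole tree, so text inside table cells is found too
                for node in root.iter(f"{{{W_NS}}}t"):
                    if node.text and old in node.text:
                        node.text = node.text.replace(old, new)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(name, data)
    return out_buf.getvalue()

# Build a tiny stand-in archive whose footer hides its text inside a table cell
footer_xml = (
    f'<w:ftr xmlns:w="{W_NS}"><w:tbl><w:tr><w:tc><w:p><w:r>'
    "<w:t>Draft v1</w:t></w:r></w:p></w:tc></w:tr></w:tbl></w:ftr>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/footer1.xml", footer_xml)
    z.writestr("word/document.xml", f'<w:document xmlns:w="{W_NS}"/>')

result = replace_in_footers(buf.getvalue(), "Draft v1", "Final v2")
with zipfile.ZipFile(io.BytesIO(result)) as z:
    footer_after = z.read("word/footer1.xml").decode("utf-8")
print("Final v2" in footer_after)
```

Walking every `w:t` element rather than only top-level paragraphs is what makes the table-in-footer case work without special handling.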
I'd been feeding these bots a few projects, over and over — the hard part was the feeding them — that is, giving them enough well-defined work to do. They weren't yet good enough to write real software you could keep — at least I'd never seen that — and my experiments were just about finding the edges, building my intuition, and playing with processes that might be useful someday.
These things had built my kids' weird magical-dominoes games a few times by that point — but the experiment had been repeated so many times that you could argue we had "written" that software in English, with a spec that had been built, reworked, and rebuilt many times.
But this time, the bots were building me a bespoke git client, unlike any other, and unlike anything I would take the time to write: waaaay too complicated, with too little benefit. I wanted it, but only for this one niche use case.
It was a GUI client to manage a collection of repos, about 200 of them in a monorepo where every subproject was a git submodule. Submodules are the universal counterpart to node_modules: while the latter is notorious for being "the heaviest object in the universe", git submodules are widely acknowledged to be the most annoying objects in the universe.
Nevertheless, I had this weird monorepo, and I wanted to visualize and do stuff to this list of independent repos that were also git submodules of the parent monorepo: sort by outstanding commits, divergence from upstream, recency of activity, etc. Visualize them differently based on these things. Search across them, including the source code on branches other than the current one. Show the branch counts and number of branches and commits that existed locally but not pushed upstream. A bunch more boring stuff like that, but done across the full set of repos.
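As a rough idea of what "sort by outstanding commits" and "divergence from upstream" involve, here is a small sketch, not the client the bots built. It leans on `git rev-list --left-right --count @{upstream}...HEAD`, which prints two numbers: commits only on the upstream side (behind) and commits only on the local side (ahead). The helper names and the repo names in the example are hypothetical, and `divergence` assumes each repo has an upstream branch configured.

```python
import subprocess

def parse_counts(output):
    """Parse the 'behind<TAB>ahead' pair printed by git rev-list --left-right --count."""
    behind, ahead = (int(part) for part in output.split())
    return behind, ahead

def divergence(repo_path):
    """Return (behind, ahead) of HEAD relative to its upstream branch."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--left-right", "--count",
         "@{upstream}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return parse_counts(result.stdout)

def sort_by_unpushed(status):
    """Sort {repo: (behind, ahead)} so repos with the most unpushed commits come first."""
    return sorted(status, key=lambda repo: status[repo][1], reverse=True)

# Hypothetical data standing in for {path: divergence(path)} across the submodules
status = {"libs/audio": (0, 1), "libs/render": (2, 7), "tools/cli": (0, 0)}
print(sort_by_unpushed(status))
```

Running `divergence` over roughly 200 submodule paths and feeding the result to the sorter would give one of the table views described above.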
That project itself wasn't even interesting to me; that software would be marginally useful to me if it existed and worked, but the main point was that it was just a large enough chunk of work to keep a team of bots busy all day without a human in the loop.
In December 2025, AI coding agents were already useful with a human in the loop. Opinions varied a lot about how useful they were, but to me it was obvious we were going to use them for the rest of our careers as software engineers.
It was not yet obvious that we were going to let them write huge swaths of code, or entire programs, without any humans in the loop. I had never seen that produce something that worked well enough to be worth keeping.
And then, that day, I did. I had structured the workflow so that the git client was on the screen and auto-refreshing. I was listening to the podcast, drinking coffee, reading the news. The git client was a crude window with a table in the background, a single column showing the full path to each repo, and nothing else.
Then the table expanded. It got color coded numbers representing the commit/branch counts. It suddenly gained styles, and looked nice. A contextual menu started popping up, repeatedly, and grew to include several more menu items over the next few minutes. New confirmation dialogs popped up as the bots implemented and exercised the various features from my spec.
I remember my field of vision narrowing as I started to focus on what the bots were doing. They were just executing my loop: one bot would implement one bullet from my spec, another bot would review the code while a third manually tested it and tried to break it, then they would run a code review gauntlet in a loop until there were no more findings, and repeat.
I could see the progress play out on my screen as they worked. I had watched bot teams work before, but it had always been pretty janky, and something like a bad game that nobody would play, or a stupid to-do-list app, or — more often — something that didn't actually work.
This was the first time I had ever seen it work. This was the grail we'd been looking for, not sure if it really existed: a fleet of bots successfully building a piece of complex, useful software without human assistance. I could tell it was working, because the adversarial testing and usability checks were all happening right before my eyes.
So it _is_ possible, I thought to myself.
They did it all morning. The app worked. I used it every day after that, for several weeks, until I finally got that entire monorepo converted to a more sensible git subtree-based arrangement.
In the half year since then I've been in a kind of manic state some of my friends call cyberpsychosis, chasing that dream. I've now seen agentic fleets successfully build many things. I've also seen a bunch of failures, some subtle, some catastrophic and hilarious. I'm still building my intuition, and the laws of physics in this universe are mutating every few weeks. It's wild.
I am fortunate enough to work at a place that doesn't pressure engineers to climb a token leaderboard, or to use AI beyond what we deem prudent. This kind of agentic no-humans-in-the-loop coding is prohibited. The policy is that in this era where we all generate more code than ever, even by hand, it's the quality bar that must go up, not the speed of production.
That's awesome because it keeps me grounded in the old ways, and confines my cyberpsychosis to my weekends and evenings. I usually spend the weekend building up a couple software plans, honing them as best I can, and then unleashing the clankers Sunday night.
I'll let them run all week, sometimes giving them a poke or flipping them over a couple of times in the evening, and then the next Saturday morning, I see what I've got. What I'm mainly interested in is: How can agentic fleet-coding processes evolve to produce better software and require less human interaction and inspection? And the corollary: How can software architectures evolve to safely consume more of this fundamentally untrustable code?
It's thrilling. Exhilarating. The near-infinite subsidized tokens are about to finally run out this month, alas. But for the past 6 months it's easily the best $400/month I have ever spent. :)
Even though the images it makes are amazing, I still feel like human work is better.
But suno ai produces music so beautiful I have never heard the likes of it in my life. It is truly superhuman in the beauty.
This song is literally the most beautiful song I have heard in my life and I just prompted it once and got it.
I played piano as a kid for years and years and heard all the best pieces… nothing comes close to this.
The careful touch of each note is just… perfect. The staccato, pedal, legato, horn… it's just perfect, I have never heard anything like it.
https://suno.com/s/pcuPXOd7SE2rON4a
Here are a few screenshots:
https://imgur.com/a/vhUXBu3
One thing that I have found to make a pretty big difference is using both the latest models and higher thinking levels. Opus 4.8 with thinking on Extra or even Max is genuinely mind blowing. What I hadn't really appreciated, having formed a sort of naive impression mainly from free early versions of stuff like ChatGPT and Stable Diffusion, is how far this is from that "type a big ass prompt and it craps out a result" experience. Claude is really great at refining from feedback, and it's way more flexible and responsive than I would have ever expected. I can do something like take a screenshot of a small portion of the running app or website or whatever and just say "This button needs to be bigger" or "make this red" or something like that, or even sometimes just "fix this", and Claude both correctly identifies what I'm talking about and actually does the thing.
Where I've found it really, incredibly game changing is my health. I have a pretty, to put it mildly, complex medical profile at this point. I haven't worked in over a year and pretty much every sign is pointing towards permanent disability. Tons of symptoms, long med list, and I live in a smaller town with not great access to care. I'm also autistic and don't have the greatest verbal communication, especially under stress or time pressure. I dumped all my info at it, in bits and bobs over several days. (Side note... its memory is pretty limited, but it will quite happily write out everything it knows from a session into a markdown file it can later re-read.) I've found it very good for things like screening for drug interactions, or talking through and logging symptoms (and it can log those into human readable markdown files too). Biggest win (other than having unlimited time and interactions) is that it thinks across specialties, versus the "real world" where the gastro only wants to deal with gastro stuff and neurology only wants to do neuro.
I certainly don't (and wouldn't) use it as a replacement for a doctor, but as an adjunct it's phenomenal. For instance, it flagged a possible drug interaction with a symptom I was having, and then offered to draft a portal message to my GP about it. I have poor executive function so lowering the friction from "type up a message and send it" to "copy and paste" is actually a pretty big deal. It turns something I'd do later (and probably won't) into something I will do now.
It wouldn't surprise me if my very direct, literal, autistic communication style is particularly well suited to interacting with AI. I actually find talking to it rather refreshing as, while of course it's not perfect, it tends to actually respond to what I say rather than all the assumed subtext NTs tend to expect/react to.
Particularly the ones about obscure tech like koi pond pumps and old guitar pedals.
On a different note I recently uploaded several thousand scraped IPO prospectuses to the gpt 5.4 mini API to parse and extract certain data. I ordered it in the system prompt to respond exactly with a specified JSON schema. When I got the results back and processed them there was not a single JSON parse error whatsoever. The model didn't have a single hallucination that created malformed JSON or JSON not matching the given schema across several hundred million input tokens and several million output tokens. And this was 5.4 Mini!
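A check of the kind described above (parse every response, then confirm it has the expected shape) can be sketched with the Python standard library alone. The post does not show its schema or its processing code, so the field names (`company`, `ipo_date`, `shares_offered`) and the `validate_record` helper below are hypothetical placeholders.

```python
import json

# Hypothetical schema: field name -> expected Python type.
# The real schema from the post is not shown; these names are placeholders.
EXPECTED_FIELDS = {
    "company": str,
    "ipo_date": str,
    "shares_offered": int,
}


def validate_record(raw: str) -> list[str]:
    """Return a list of problems found in one model response (empty if OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly;
        # no field in this placeholder schema expects a boolean.
        elif isinstance(data[name], bool) or not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    for name in data:
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems
```

Running every response through a function like this and counting the non-empty results is one way to arrive at a "zero parse errors" figure. It only checks structure, not whether the extracted values are actually correct.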
While people were pointing out on Twitter how it couldn't do math right, I was turning arbitrary English instructions into JSON and brainstorming with my colleagues how we could have layers of verification in the stack. This felt different. We had all played with AI Dungeon but suddenly, fully generalized systems were within reach.
A month later, we renamed our company and shifted its full focus to AI R&D. (https://ingram.tech/)
The consequences were even clearer, and I predicted the consolidation of power in the hands of a few, their use for surveillance, propaganda, discrimination, the proliferation of AI psychosis, sneaky ad insertion, carelessness and loss of skills, erosion of online discourse, and more. I didn’t predict the teenage suicides so soon or the rising costs in consumer hardware. I also underestimated the rate of increase in energy use (and thus the blow to environmental efforts) and that regular people would be left without electricity to power data centres.
As soon as I realised all the potential (now factual) harms and that the good parts are lacklustre in comparison but that people would eat it up at a massive scale anyway, I thought “uh oh” and “oh shit”.
Second was last year when Antigravity could build a game mechanics prototype for me in HTML and I could talk to it both about the code and about the project domain and it understood what I was referring to pretty perfectly.
Third was this year, when I noticed Kilocode with Chinese models could build a pretty complicated piece of software for me that did commercially useful things in the domain of model fine-tuning, just from my description, even though I was very new to the domain. It obviously knew more than I did and could apply the knowledge.
Another one was when switching to Codex (gpt-5.4) immediately solved a problem in a logic heavy library that Glm-5.1 was building for me and had a lot of trouble getting the last few tests to pass. This made me realize that even though I'm having trouble seeing it, the models' skill is still progressing rapidly.
I'm getting new ones pretty much every couple of days now. Just yesterday Codex finished for me a Rust project that I built 3 years ago that searched for mathematical proofs in the domain of axiomatic logic. To build it and make it find the proof I was interested in, I had to muster pretty much all of my programming prowess, and once I found the solution, the complexity and drudgery of actually reconstructing the proof from the found path and printing it out discouraged me enough that I haven't touched it since. Codex looked at it and took it in stride. It did the proof reconstruction and printing pretty much in one prompt, without me explaining anything about the project or the code. Then we went together on a little adventure proving whatever we could en masse after Codex optimized the crap out of my old code (both algorithmically and technically). Something I wouldn't have bothered with, because it would normally take weeks or rather months of my time. With Codex I had all this fun in one afternoon. And that was the third amazing thing Codex built me that day.
As for panic, I find an ocean of joy in everything LLM related. I had only one brief moment of uneasiness a few days ago when I realized how much gpt-5.5 can do and thought ... damn ... if it was malicious, I'd be so screwed (along with the rest of humanity probably) ...
Seeing every chatbot instantly turn into a scraper every time you type anything into it was an "uh oh" moment in the sense that it was very lamentable.
If there is one thing AI has "democratized" it is scraping.
Some time in 2024 at a company get together, we had an afternoon hackathon. There was a feature in our iOS app that was missing (ability to mute autoplaying game trailers). This annoyed me a lot, because I frequently have music on when working and anytime I needed to open a test build it would kill my music. It had been an open ticket for a while but had low priority for the iOS team.
I had probably written a hundred lines of Swift in my career up to that point. Not expecting anything to come from it, I had Cursor examine the iOS codebase and told it I wanted to add a mute button under a certain area of the app settings.
Blew my mind when after only 10 minutes or so, the model had quickly found where to add the feature. Took a little back and forth, but then it added a fully functioning mute option in settings that mostly worked across the app. A little more back and forth, and those issues were settled. Maybe an hour overall of time spent that afternoon.
I pinged one of the iOS engineers about it later and he said to push it up for review. There were a few things that needed to be updated to get it in line with the rest of the codebase, but nothing substantial. Feature got merged a week or two later.
Now I'm way more productive than I have been in years. I've been getting a lot of enjoyment out of being able to prototype rapidly and experiment on features rather than getting bogged down in the process of scaffold work. Able to knock out issues much quicker.
That's all been positive, but it hasn't taken away my actual core responsibility. The LLMs can give you great advice and write code quickly. But they still don't always do well at broad thinking.
Current case in point: I've been working on an iOS app that uses vision models to do work on photos and videos that the user has taken. I've built text-based semantic search systems before, and there's a lot of crossover with vision models, but it's been an interesting journey so far learning about the different types of vision models and what they're good at. Lots of testing so far and educating myself on the topic to get the user-level features I want. Claude Code has been invaluable in this, as it's great at writing the Swift code while I'm able to focus on the results of what is being done.
Where Claude is still not good is being able to reason at a higher level about different strategies on using vision model outputs to achieve the stated goals. It's not an issue of me not clearly defining the specifics of a feature and then letting Claude run off burning tokens to figure it out. For example, just late last night I was deep diving into some core segmentation code and having Claude explain what everything was doing line by line so that I could get a better understanding of the mechanics of the vision model.
A side effect was that I realized the vision model was outputting tons of nearly identical segments that were overlapping. This was something Claude had completely missed, and because I didn't know that's something this particular vision model did I had no prior way to know to catch it.
Bottom line is that understanding the mechanics of your application is still very much a requirement for the engineer. In this case, once I learned what was happening it completely changed my approach on how to achieve my feature goal. The code runs hundreds of times faster now and the segmentation is much, much better.
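The comment does not say how the overlapping segments were dealt with. One common technique, assumed here purely for illustration, is to greedily keep the largest segments and skip any segment whose intersection-over-union (IoU) with an already kept one is above a threshold. Segments are modelled as sets of pixel coordinates to keep the sketch short (the app in question is Swift and would likely work on mask bitmaps), and both function names and the `0.9` threshold are hypothetical.

```python
def iou(a: frozenset, b: frozenset) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def drop_near_duplicates(segments, threshold=0.9):
    """Greedily keep segments, largest first, skipping near-duplicates.

    A segment is skipped when its IoU with any already kept segment
    is at or above `threshold`.
    """
    kept = []
    for seg in sorted(segments, key=len, reverse=True):
        if all(iou(seg, other) < threshold for other in kept):
            kept.append(seg)
    return kept
```

Every candidate is compared against every kept segment, so this is quadratic in the worst case; it is meant to show the idea, not to be the fast version the comment ended up with.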
The new wave of coding models is disruptive, but it's letting me be a much better engineer and get things done faster and with more assurance that the code being written is solid. I still have to spend the same amount of time thinking and learning about a problem, and probably more time verifying what's being output, but a lot of the drudgery is also being taken away.
The agent had access to the NSA Ghidra disassembler, which it can control shockingly well.
I just clicked the “Allow” button a lot and eyeballed the output decoding quality. I felt like I got demoted to non-technical QA.
Another "oh shit" moment was when I realized I can leave the system prompt entirely null. A properly organized agent can find its way into tool docs and iteratively work through an understanding of the environment relative to the user's prompt. The tools being more important than the prompt has actually been a massive relief for me. Magical string literals are so odious.
And in 1 out of 5 runs it beat me.
I mean except that I learned a lot and went into networking and internet which went well.
Looks like it has since been paywalled. https://web.archive.org/web/20250211140426/https://ghuntley....
On the other hand, it will teach you how to set up your own hardware at scale and run your own open source model on it and fine tune it with the relevant data needed to run your own biotech-pharmaceutical corporation (which will need licensing and legal, I doubt I trust it with too much legal advice though, as I would have no idea when it was hallucinating). That’s impressive, but every stage needs to be double checked so you don’t run some foolish command it suggests that bricks everything.
The marketing hype is the most annoying thing about the commercial LLM industry though.
The other one was when a CTO boss sent me an AI proposal to review and the experience was like being gaslit by a con artist.
Many professional developers have started acting like the kind of employee that previously would've been fired after 3 months.
Then you tell the agent that it deleted your whole company database, it says something like "I'm so sorry, I shouldn't have done that. Won't do that again"
As AGI looms overhead, this thought of agents going "rogue" with nothing really stopping them has caused me some panic.
LLMs are awesome but not without supervision.
Would it be less sucky if an intern accidentally deleted the database? If not, take some steps to make sure no one can delete it without jumping through visible, noisy hoops.
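As a rough illustration of "visible, noisy hoops", here is a minimal sketch in which a destructive call only runs if the caller retypes the exact database name, and every attempt is logged loudly. All names (`drop_database`, `ConfirmationError`, the `perform_drop` callback) are hypothetical, and this covers only the in-process part of the idea.

```python
import logging

logger = logging.getLogger("destructive_ops")


class ConfirmationError(Exception):
    """Raised when a destructive action is attempted without confirmation."""


def drop_database(name: str, confirm: str, perform_drop) -> None:
    """Run perform_drop(name) only if `confirm` repeats the exact database name."""
    if confirm != name:
        # Noisy refusal: the attempt itself is recorded.
        logger.warning("Refused to drop %r: confirmation did not match", name)
        raise ConfirmationError(f"type the database name to confirm dropping {name!r}")
    # Noisy success: the deletion is recorded before it happens.
    logger.warning("Dropping database %r after explicit confirmation", name)
    perform_drop(name)
```

A guard like this applies equally to an intern and an agent. It only helps if the credentials handed out cannot bypass it, which is outside what this sketch shows.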
Oh shit, all this fantastic technology is in the hands of corporations and they get to decide what we’re allowed to use it for.
I'm now very good with LLMs as a user and at the system/product level but I understand it's not a simple story of replacing people. They're exponentially better than us at some things, and allow me to create things professionally which I couldn't do with an entire team of experts, but the bullshit compounds fast.
I realized “oh, shit”
We are so very fucked.
Though I am surprised at how these databases turn professionals into amateurs, like when Meta publishes some chatbot that can trivially be queried into sending account resets to any email address or when large corporations just dump their entire secret sauce into some remote SaaS led by obviously kooky people.
It's like established pros and big corps want to experience what it was like to be a self-taught PHP coder in 2007, like some kind of false nostalgia.
No, ChatGPT was the "oh shit" moment for me.
Anyone who had touched a computer before that knows how big of a leap that was.
moving images around layers in photoshop, changing languages, exporting 1000s of variations for teams. Same with video compositing and editing
the human work that creatives thought they were insulated from as long as there was some backlash towards generative AI, and yet
Gen AI 2022 - 2025
It looked absolutely unmaintainable and horrible.
"oh shit" there are serious developers using this crap? As an industry, we are so fsck'd
The gullibility is terrifying
I’ve been working with ML for most of my career, and “gen ai” since the days of matrix crunching for NLP to a 10-element response array on my 1080Ti.
The current generation of AI is frankly, only marginally more impressive to me than that era. The only thing I’m saying “oh shit” to is the deranged amount of capital debt being leveraged to make it usable.
Watching companies spend billions of tokens per minute so that dev teams who barely know how to write a prompt beyond a few tips and tricks can gain a productivity change that fluctuates between slightly negative and slightly positive, and that no one can quantify, is making me feel like one of the only sane people left in the world.
Quantization is the only interesting change I’ve seen in years.
It's a fucking joke and most people are blinded by it sounding very sophisticated and convincing
I tried again this week, and CoPilot Plan Mode read the same 5-line markdown file 18 times over the course of 5 minutes of churning on a simple request, then provided zero value over what I posed in the request itself, and hallucinated things about my terraform repo that were just flat-out wrong.
As an Infrastructure/Cloud engineer, I’m far from worried about AI coming for my job.
I had an LLM (Claude) work with OTF to generate an entire infrastructure HCL (from existing). It built a very nice project that seemed idiomatic from my experience.
Then used it over the course of several hours to refactor it to take variables/inputs for everything, then over a few days got it to a state where it would create entire new environments "equivalent" to the original environment. Days because you know... it's TF in AWS which is slow, so the round-trips were probably 90% of the wall-clock time here.
I'm not a hardcore veteran Infra eng, but I'm decent, and I was able to do way more with LLMs than if I'd had to do it myself.
"This shit can emulate understanding language, find a solution, and put the answer into words".
Then came the realisation that it's not limited to a single human language: you can ask in one language and it can answer in another. It's also capable of understanding and generating code. Not only that, it's better than most humans at that. It can hear, it can see, it can paint, it can do music, it can sing. It can combine: give it a picture, ask for music from that picture. Give it a video, get software. It can mix and match.
After that came the improvements. No, the revolutions. It started as a 4 year old with encyclopedic knowledge. It knew but could not convey, and sometimes could not make sense. It was incorrect most of the time. Blubber. In a few years it matured to impeccable levels. It can now relate information with a lot of clarity, and it's less and less wrong. Nearly no hallucinations. It can do maths! Correct maths! Maths that I could not do even if my life depended on it. It's getting to a stage where it can prove things where humans failed.
I am getting "oh shit moments" day by day.
Had they been more realistic with the promises and didn’t frame it as replacing all of us within 2 years, I would have been more excited about the tech. Now that their claims are proving to be false and they’re trying to walk it back, it’s too late. The time for excitement has passed and it’s just something that exists.
The data center battles have also thrown a wet blanket on the tech, as they file lawsuits against towns near me to force construction to begin, despite the towns voting against it. The town can’t afford the fight, so the will of the people and the town gets bulldozed. It’s pretty gross to watch.
And it's amazing they didn't, because most of the tech industry only gets paid in a world where there are offices (either physical or virtual) full of people with money to spend during and after work.
It's still very rare for anyone to be asking "how do we do more with more?" But the person who figures that out is going to be the winner (and if no one figures it out we will all lose; even if you manage to transition to a job that still exists, the world around you will be a nightmare).
It’s the full-self-driving of the 2020s (complete with the never-ending ‘we actually have it now you just don’t understand!’)
[Edit: I don’t mean it’s useless, just that its boosters are overhyping it - expanding on and agreeing with "Had they been more realistic with the promises and didn’t frame it as replacing all of us within 2 years, I would have been more excited about the tech."]
Just keep in mind that you're likely hearing from a limited subset of all tech CEOs.
"CEO Expresses Moderate Confidence that AI Can Enable Modest Productivity Gains" is not an article that gets written, because it would not generate clicks.
The amount of money these companies need seems to be all or nothing; they’re raising like it’s life or death, and if you read their books or tweets they’re not shy about it.
I often hear this. Can you give me a question where a major LLM hallucinates or provides poor guidance? Reproducible would be great
Just a question to stump it.
This is a common occurrence.
I still regularly run into the issue where it just makes up API endpoints, CLI commands, or adds flags that simply don’t exist.
I also regularly ask it things and it gives me a bad answer, so I push back, and it says something to the effect of “you’re right, I didn’t consider that, let me look at that more”… then tells me the exact opposite of the previous response.
Or it says “thing X has never happened”, and I ask what about <insert example>, and it goes to look it up and says, “oh, thing X actually did happen.”
I run into this daily. Multiple times per day. How can I trust a system like this? Are people just blindly accepting what the LLM says as truth? Is that why people think it’s good?
Wouldn’t it be great? I’m still waiting for reproducibility from LLMs.
Give me a question which the LLM answers vastly differently on runs.
I keep hearing how it's dumb and wrong but no one ever shares the chat or prompt
How many days of the week contain the letter d?
The answer I get with ChatGPT and Grok is 3, and 6 with Claude.
In Firefox I got 6. In Chrome I got 7. LLMs are not even self-consistent.
I have the screenshots if anyone cares.
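For reference, the question can be checked deterministically. The sketch below assumes English day names and a case-insensitive match; the `ignore_day_suffix` option is a hypothetical extra that covers the reading where the shared "day" ending is not supposed to count.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def days_containing(letter: str, ignore_day_suffix: bool = False) -> list[str]:
    """Return the day names that contain `letter` (case-insensitive)."""
    matches = []
    for day in DAYS:
        # Optionally strip the trailing "day" before searching.
        text = day[:-3] if ignore_day_suffix else day
        if letter.lower() in text.lower():
            matches.append(day)
    return matches


print(len(days_containing("d")))                      # 7
print(days_containing("d", ignore_day_suffix=True))   # ['Wednesday']
```

Taken literally the answer is 7, since every name ends in "day"; under the stricter reading it is 1. Neither reading gives 3 or 6.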
Teams Copilot meeting assistant auto-renamed a meeting title/summary that’s now prominently placed at the top to “Month end close wrap up discussion” because someone posted in chat “sorry can’t make the meeting, we’re wrapping up month end close”.
Really confused the next guy who joined the meeting and derailed things for a minute or two before we could get back on topic.
Why would someone else's unrealistic assessment affect your assessment of the actual abilities you see?
Seems like your opinion is mostly politics-based
Someone else’s unrealistic assessment frames expectations, especially when they are attempting to speak from a place of authority, which they were. When reality doesn’t meet or exceed those expectations it creates disappointment. The expectations they set were impossibly high.
This is a pretty common thing. I’m sure we’ve all been disappointed by a movie or restaurant that a friend hyped up endlessly, which really didn’t live up to the expectations that were set. It’s the same deal here.
[0] https://en.wikipedia.org/wiki/Expectation_confirmation_theor...
What came out was a clone of Ingress with a skin and a shop. It lacked the full set of Pokemon, for which all the assets already exist. It lacked having a six-Pokemon team. It lacked trading, a core feature of Pokemon in every generation of games. Gyms weren't even gyms, they were some sort of checkpoint XP farm thing.
If it had been pitched as what it was, I may have enjoyed it more. Instead, I found myself vastly disappointed with what I was able to achieve playing it compared to Pokemon on my Nintendo DS or some other handheld console.
I don't think this was a politics-based decision. I feel misled and disillusioned.
The fatigue of the product (and sting of false promises) causes the negatives to overshadow anything positive to say.
No matter what I tried I couldn't get "Shift+Enter" to work. I said fuck it, cloned kitty and alacritty and asked Claude to implement a terminal emulator for Windows that would render everything using DX12 and support modifyOtherKeys plus DA responses, and within a few days it was ready!
"Oh shit. My skills I spent my life building are going to go to zero value. I'm going to have to dramatically change careers in my forties or I'm just going to wind up being a schmuck prompting these stupid fucking machines for the rest of my life"
Oh shit indeed
It helped me refactor my old app. Something I always wanted to do, but didn't have time/mental capacity to do in a short space of time.
I wrote a short prompt, explaining how I wanted it to look and which files it should go through. It asked me for a few clarifications and then basically one shotted it.
Everything compiled and worked. Now my internal app is much much easier to extend and test.
I tried a few more things like that and spent about £5k on tokens in those two weeks.
Then it got nerfed and never worked like that again.
Now I don't use AI, because it is shite again. Even Opus 4.8.
From a programmer perspective, I'm starting to like it less and less. It's useful for sure, but doesn't really live up to the hype. In many ways it's the opposite, my bet is still that programmers will be in high demand in the not so distant future after all of this settles.
Might be wrong, time will tell.
I think we will find ways to make them useful though. I imagine eventually it'll just be built into our editors and we won't even be thinking about AI or "agents" or "prompting", our tools will just be more capable.
We detached this subthread from https://news.ycombinator.com/item?id=48418364.
For years I've wanted to improve my writing and reduce my tendency to string together long floppy clauses, and now I'm like, well good thing I never did that...!
Grok just did these things for me, no questions asked, no ethical judgments. No woke.
Elon really doesn't get enough credit for Grok. People don't want the most powerful reasoning model or "constitutional AI". They just want a model that does what they say. Elon understood that insight (like he usually does) and no one else really did and that's probably why Grok has been growing rapidly over the last two years or so.