I have generally moved from bearish to bullish on the future of current AI technology, but the continued inaccuracy with basic facts, even as the models significantly improve, still gives me significant pause.
As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right but something is a bit off and then it turns out they're a zombie and going to try to eat your brain. This note taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.
cootsnuck [3 hidden]5 mins ago
Yup, spot on. There's a capability-reliability gap that the industry does not like to talk about too much.
It often feels like the AI industry is continually glossing over the fact that capability and reliability are fundamentally different qualities. We tend to use "accurate" and "reliable" interchangeably, but they describe different things. A model can ace a benchmark (capability/accuracy) and still be a liability in production (reliability).
Just look at recent reactions to yet another release from METR showing improved capabilities. But the less talked about part is how their measure is for a 50% success rate (and the even less talked about secondary measure at an 80% success rate has a drastically lower time-horizon for tasks). https://metr.org/
I implement AI systems for enterprises and I don't know any that would ever be okay with 80% reliability (let alone 50%).
jcgrillo [3 hidden]5 mins ago
This capability-reliability gap (excellent term btw, more people need to think in these terms or we'll be in real trouble) is also infecting LLM assisted outputs. I just tried VSCode again tonight after a ~3yr hiatus and goddamn has it deteriorated. Lots of new features, lots of interesting looking plugins, but 3 out of the 5 plugins I tried for code CAD (the reason I downloaded VSCode again at all) were completely unusable--like couldn't even be made to work at all--and the other two didn't do anything like what they claimed. Also VSCode itself got into some kind of spastic loop trying to log me into github, and seemed incapable of recognizing the virtual environment in a python project's workspace... It also feels like the UI got even slower. This situation is bad.
smusamashah [3 hidden]5 mins ago
Your analogy reminds me of the messed-up fingers and hands in image generation models just a year ago. Now that is pretty much solved. These days they are generating videos you can't tell apart from reality. This makes me believe these flaws will keep shrinking and eventually become very hard to notice and find in maybe every task.
igleria [3 hidden]5 mins ago
Yesterday I was using opus 4.6 through copilot (don't ask...) to rubber-duck-brainstorm a big feature that needs a lot of care.
I got some inspiration from it but it misinterpreted very basic stuff. might be a skill issue on my side, I do not know.
Brian_K_White [3 hidden]5 mins ago
I hate to help provide possible solutions to an entire process I don't approve of, but maybe the fuzzy tools need old-style deterministic tools the same way and for the same reasons we do.
So instead of an LLM trying to answer a math or reason question by finding a statistical match with other similar groups of words it found on 4chan and the all in podcast and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
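Something like this, just as a sketch (names are made up, not any particular vendor's tool-calling API): the model only decides which tool to call and with what arguments, and the arithmetic itself stays deterministic.

    # Hypothetical sketch: the LLM picks a tool, plain Python does the math.
    TSP_PER_TBSP = 3  # a fixed conversion, not something to "predict"

    def convert_volume(amount: float, unit_from: str, unit_to: str) -> float:
        """Deterministic kitchen-unit conversion the model can delegate to."""
        if (unit_from, unit_to) == ("tbsp", "tsp"):
            return amount * TSP_PER_TBSP
        if (unit_from, unit_to) == ("tsp", "tbsp"):
            return amount / TSP_PER_TBSP
        raise ValueError(f"unsupported conversion {unit_from} -> {unit_to}")

    TOOLS = {"convert_volume": convert_volume}

    def run_tool_call(call: dict) -> float:
        # `call` is what the model would emit, e.g.
        # {"name": "convert_volume", "args": {"amount": 2, "unit_from": "tbsp", "unit_to": "tsp"}}
        return TOOLS[call["name"]](**call["args"])

    print(run_tool_call({"name": "convert_volume",
                         "args": {"amount": 2, "unit_from": "tbsp", "unit_to": "tsp"}}))  # 6.0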
cootsnuck [3 hidden]5 mins ago
They absolutely need deterministic tools. What you just described is exactly how the current popular AI agents work. They use "harnesses", which to me is just a rebranding of what we have known all along about building useful and reliable software...composable orchestrated systems with a variety of different pieces selected based on their capabilities and constraints being glued together for specific outcomes.
It just feels like for some reason this is all being relearned with LLMs. I guess shortcuts have always been tempting. And the idea of a "digital panacea" is too hard to resist.
analog31 [3 hidden]5 mins ago
Doesn't agentic AI do this? I've got AI running in VS Code. If I ask it for something, it can fill a code cell with a little bit of Python, and then run it with my approval. It's using the Python interpreter on my computer as a calculator.
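In spirit the loop is something like this (a toy sketch; real agents use proper sandboxes and tool protocols rather than a bare exec):

    def run_with_approval(generated_code: str) -> None:
        """Toy version of the approve-then-run loop an agentic editor uses."""
        print("Model proposes:\n" + generated_code)
        if input("Run this? [y/N] ").strip().lower() == "y":
            exec(generated_code, {})  # the Python interpreter is the calculator
        else:
            print("Skipped.")

    # e.g. the model fills the "cell" with plain arithmetic instead of guessing
    run_with_approval("print(2 * 3, 'teaspoons in 2 tablespoons')")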
stevula [3 hidden]5 mins ago
I think that is how the smarter agents do things? Just like Claude/ChatGPT sometimes does a web search they can do other tool calls instead of just making a statistical guess. Of course it doesn’t always make the bright choice between those options though…
fipar [3 hidden]5 mins ago
They will also lie and produce output saying it is based on tool execution, without having actually used the tool.
Yes, another layer to cross-check, say, “in kubectl logs I see …” with an actual k8s tool call can help, that is, when the cross-check layer doesn’t lie.
For the time being, IMHO, human validation at key points is the only way to get good results. This is why the tools make experienced people potentially a lot more efficient (they are quick to spot errors/BS) and inexperienced people potentially more dangerous (they're more prone to trusting the responses, since the tone usually sounds very professional).
WalterBright [3 hidden]5 mins ago
> it doesn’t always make the bright choice
I'm available for a small fee.
sgc [3 hidden]5 mins ago
You must be living in absolute opulence :)
epcoa [3 hidden]5 mins ago
That’s exactly how all the current cloud chat bots and agents work now.
colechristensen [3 hidden]5 mins ago
No, they just need to be trained to have adversarial self review "thinking" processes.
You ask an LLM "What's wrong with your answer?" and you get pretty good results.
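Roughly like this, as a sketch (llm() is a stand-in for whatever chat-completion call you use, not a real library):

    # Hypothetical two-pass self-review; `llm` stands in for any chat-completion API.
    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in your provider's API here")

    def answer_with_self_review(question: str) -> str:
        draft = llm(question)
        critique = llm(f"Question: {question}\nDraft answer: {draft}\n"
                       "What's wrong with this answer? List concrete errors.")
        return llm(f"Question: {question}\nDraft answer: {draft}\n"
                   f"Critique: {critique}\nRewrite the answer fixing the issues above.")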
binary0010 [3 hidden]5 mins ago
Or the original output was perfect and the adversarial "rethinking" switches it to an incorrect result.
byzantinegene [3 hidden]5 mins ago
this seems to happen far more than i would like
themafia [3 hidden]5 mins ago
> we're not actually on the right track to achieve real intelligence.
Real intelligence means you have to say "I don't know" when you don't know, or ask for help, or even just say you refuse to help, with the subtext being you don't want to appear stupid.
The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.
vintagedave [3 hidden]5 mins ago
> Real intelligence means you have to say "I don't know" when you don't know
I have met many supposedly intelligent, certainly high status, humans who don't appear to be able to do that either.
I have more confidence we can train AIs to do it, honestly.
cmrdporcupine [3 hidden]5 mins ago
That's just not how they work, really. They don't know what they don't know and their process requires an output.
I think they're getting better at it, but it's likely just the number of parameters getting bigger and bigger in the SOTA models more than anything.
adastra22 [3 hidden]5 mins ago
They do know what they don't know. There's a probability distribution for outputs that they are sampling from. That just isn't being used for that purpose.
D-Machine [3 hidden]5 mins ago
Common misconception. As far as we know, LLMs are not calibrated, i.e. their output "probabilities" are not in fact necessarily correlated with the actual error rates, so you can't use e.g. the softmax values to estimate confidence. It is why it is more accurate to talk about e.g. the model "logits", "softmax values", "simplex mapping", "pseudo-probabilities", or even more agnostically, just "output scores", unless you actually have strong evidence of calibration.
To get calibrated probabilities, you actually need to use calibration techniques, and it is extremely unclear if any frontier models are doing this (or even how calibration can be done effectively in fancy chain-of-thought + MoE models, and/or how to do this in RLVR and RLHF based training regimes). I suppose if you get into things like conformal prediction, you could ensure some calibration, but this is likely too computationally expensive and/or has other undesirable side-effects.
EDIT: Oh and also there are anomaly detection approaches, which attempt to identify when we are in outlier space based on various (e.g. distance) metrics based on the embeddings, but even getting actual probabilities here is tricky. This is why it is so hard to get models to say they "don't know" with any kind of statistical certainty, because that information isn't generally actually "there" in the model, in any clean sense.
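For the record, this is the kind of check I mean by calibration, as a minimal sketch over a labeled evaluation set (the "confidences" could be softmax values, verbalized percentages, whatever you're testing):

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Bin predictions by stated confidence and compare to observed accuracy.

        confidences: array of model "confidence" scores in [0, 1]
        correct:     array of 0/1 indicating whether each prediction was right
        """
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if not mask.any():
                continue
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / len(correct)) * gap
        return ece  # 0 means perfectly calibrated on this data

    # e.g. a model that says "0.9" on everything but is right 60% of the time
    print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # ~0.3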
plaguuuuuu [3 hidden]5 mins ago
I don't think it's that hard to get them to say "I don't know"
I'm pretty sure they are actively trained to avoid it.
Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
ben_w [3 hidden]5 mins ago
> I'm pretty sure they are actively trained to avoid it.
I'm not sure who is doing what training exactly, but I can say that (inconsistently!) some of my attempts to get it to solve problems that have not yet actually been solved, e.g. the Collatz conjecture, have it saying it doesn't know how to solve the problem.
Other times it absolutely makes stuff up; fortunately for me, my personality includes actually testing what it says, so I didn't fall into the sycophantic honey trap and take it seriously when it agreed with my shower thoughts, and definitely didn't listen when it identified a close-up photo of some solanum nigrum growing next to my tomatoes as being also tomatoes.
> Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
I'd rather it said "IDK" than made some stuff up. Them making stuff up is, as we have seen from various news stories about AI, dangerous.
fluoridation [3 hidden]5 mins ago
"Well-unknown" questions are maybe the one situation where LLMs will say "I don't know", simply because of all the overwhelming statements in its training data referring to the question as unknown. It'd be interesting to see how LLMs would adapt to changing facts. Suppose the Collatz conjecture was proven this year, and the next the major models got retrained. Would they be able to reconcile all the new discussion with the previous data?
rcxdude [3 hidden]5 mins ago
It's not hard to get them to say "I don't know", and they will do so regularly. It's hard to get them to say "I don't know" reliably (i.e. to say it when they don't actually know and to not say it when they do know). And in general even for statements or tasks they do 'know' (i.e. normally get right), they will occasionally get wrong.
adastra22 [3 hidden]5 mins ago
I don't know if we are talking past each other, but I don't think this conversation is about absolute probabilities? The question is about relative uncertainty, and the softmax values are just fine for that.
It is too computationally expensive, which is why nobody does this for production inference. But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs.
D-Machine [3 hidden]5 mins ago
> The question is about relative uncertainty, and the softmax values are just fine for that.
They really aren't, especially if you consider the chain of thought / recursive application case, and also because you can't even assume e.g. that a difference of 0.1 in softmax values means the same relative difference from input to input, or that e.g. a 0.9 is always "extremely confident", etc. You really have no idea unless you are testing the calibration explicitly on calibration data.
> But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs
You can get embeddings: if you can get calibrated probabilities, you'll need to provide a citation, because this would be a huge deal for all sorts of applications.
adastra22 [3 hidden]5 mins ago
Relative probabilities. That means comparing 2+ alternatives, and we're only talking about the model's worldview, not objective reality. The math for that is relatively straightforward. "Yes" could be 0.9, and ok that means nothing. But if we artificially constrain outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed. But you can derive an equation to convert it into normalized probabilities; a minimal sketch is at the end of this comment.]
And now I'm certain we're talking past each other. I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?" which is what I interpreted the question above to be about. You can get that out of an LLM, with some work.
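The minimal sketch I mentioned above, in its simplest form (just renormalizing the two token probabilities against each other; getting per-token logprobs out of a hosted API varies by provider, and whether the resulting number means anything is exactly what's in dispute here):

    import math

    def relative_yes_probability(logprob_yes: float, logprob_no: float) -> float:
        """Renormalize the two token probabilities against each other only."""
        p_yes = math.exp(logprob_yes)
        p_no = math.exp(logprob_no)
        return p_yes / (p_yes + p_no)

    # e.g. logprobs pulled from the model for the constrained "Yes"/"No" tokens
    print(relative_yes_probability(math.log(0.07), math.log(0.03)))  # ~0.7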
D-Machine [3 hidden]5 mins ago
> But If we artificially constraint outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed. But you can derive an equation to convert it into normalized probabilities.]
There is nothing straightforward about this, and no, there is no such formula.
> I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?"
If all you care about is vibes / feels, sure. If you actually need numerical guarantees and quantitative estimates to make your "feelings" about confidence mean something to rigorously justify decisions, you need calibration. If you aren't talking about calibration in these discussions, you are missing probably the most core technical concept that addresses these issues seriously.
adastra22 [3 hidden]5 mins ago
We're talking about artificial intelligence. Making computers think the way people do. People are notoriously miscalibrated on their own self-assessed probabilities too.
Finding a way to objectively calibrate a sense of "how confident do I feel about this?" would be fantastic. But let's not move goal posts. It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
D-Machine [3 hidden]5 mins ago
IMO it is you who are moving the goalposts, most likely in an attempt to hide the fact you were unaware of calibration before this discussion.
> It would still be incredibly useful to have a machine that can merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
If human feelings are badly calibrated, they are useless here too, so no, I don't agree. Things like "confidence" only matter if they are actually tied to real outcomes in a consistent way, and that means calibration.
adastra22 [3 hidden]5 mins ago
Please assume good faith.
raddan [3 hidden]5 mins ago
I’m not clear what you mean by “know.” If you mean “the information is in the model” then I mostly agree, distributional information is represented somewhere. But if you mean that a model can actually access this information in a meaningful and accurate way—say, to state its confidence level—I don’t think that’s true. There is a stochastic process sampling from those distributions, but can the process introspect? That would be a very surprising capability.
kneyed [3 hidden]5 mins ago
yes:
> In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally.
Having a probability distribution to sample from is not the same thing as knowing, because they don't know anything about the provenance of the data that was used to build the distribution. They trust their training set implicitly by construction. They have no means to detect systematic errors in their training set.
adastra22 [3 hidden]5 mins ago
You are talking about something different. If I ask you a yes/no question, and then ask you how certain you are, the answer you give is not an objective measurement of how likely you are to be right. You don't have access to that either. If you say "I'm very confident" or "Maybe 50/50" -- that is an assessment of your own internal weighted evidence, which is the equivalent of an LLM's softmax distribution.
dhampi [3 hidden]5 mins ago
Well, with thinking models, it’s not that simple. The probability distribution is next token. But if a model thinks to produce an answer, you can have a high confidence next token even if MCMC sampling the model’s thinking chain would reveal that the real probability distribution had low confidence.
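As a sketch of what sampling the chain would look like in practice (self-consistency style; llm_sample is a placeholder for a temperature > 0 completion call, not a real API):

    from collections import Counter

    def llm_sample(prompt: str) -> str:
        raise NotImplementedError("stand-in for a temperature > 0 completion call")

    def answer_distribution(question: str, n_samples: int = 20) -> Counter:
        """Sample several independent reasoning chains and tally the final answers."""
        answers = []
        for _ in range(n_samples):
            chain = llm_sample(f"{question}\nThink step by step, then give a final answer "
                               "on the last line as 'Answer: ...'")
            answers.append(chain.rsplit("Answer:", 1)[-1].strip())
        return Counter(answers)  # a spread-out tally = low effective confidence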
Isamu [3 hidden]5 mins ago
Oh, you mean somewhere it is tracking the statistical likelihood of the output. Yeah I buy that, although I think it just tends towards the most likely output given the context that it is dragging along. I mean it wouldn’t deliberately choose something really statistically unlikely, that’s like a non sequitur.
adastra22 [3 hidden]5 mins ago
Well, it's not tracking. As it predicts each token it is sampling from a probability distribution -- that's what the matrix multiplies are for. It gets a distribution over all tokens and then picks randomly according to that distribution. How flat or how spiky that distribution is tells you how confident it is in its answer.
But it then throws that distribution away / consumes it in the next token calculation. So it's not really tracking it per se.
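For what it's worth, "flat vs. spiky" has a standard number attached to it; a small sketch, assuming you have the raw next-token probabilities:

    import math

    def token_entropy(probs):
        """Shannon entropy of the next-token distribution:
        0 = one certain token, log2(len(probs)) = completely flat."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(token_entropy([0.97, 0.01, 0.01, 0.01]))  # spiky -> ~0.24 bits
    print(token_entropy([0.25, 0.25, 0.25, 0.25]))  # flat  -> 2.0 bits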
tempest_ [3 hidden]5 mins ago
From its point of view, what does it mean "to know"?
Is it the token (or set of tokens) that are strictly > 50% probable or is it just the highest probability in a set of probabilities?
While generating bullshit is not ideal for a lot of use cases you don't want your premier chat bot to say "I don't know" to the general public half the time. The investment in these things requires wide adoption so they are always going to favour the "guesses".
wagwang [3 hidden]5 mins ago
You can just tell the agent to do exactly that
tempest_ [3 hidden]5 mins ago
I've had various agents backed by various models ignore the shit out of various rules and requests at varying rates, but they all do it.
When you point it out "Oh yes, I did do that which is contrary to the rules, request <whatever>.. Anyway..."
wagwang [3 hidden]5 mins ago
If you are on a sota model and your context window is less than 100k tokens and you don't have any vague or contradicting rules, then I've almost never seen a rule broken
The most common failures I've seen come from tools that pollute their context with crap, so the LLM will forget stuff or just get confused by all the irrelevant sentences; which, if the report is true, is probably what these AI notetakers are guilty of. This problem gets exacerbated if these tools turn on the 1M context window version.
D-Machine [3 hidden]5 mins ago
Except you can't be sure it isn't producing nonsense when you do this, and generally the model(s) will be overconfident. This has been studied, see e.g. https://openreview.net/pdf?id=E6LOh5vz5x
> An alternative way to obtain uncertainty estimates from LLMs is to prompt them directly. One benefit of this approach is that it requires no access to the internals of the model. However, this approach has produced mixed results: LLMs can sometimes verbalize calibrated confidence levels (Lin et al., 2022a; Tian et al., 2023), but can also be highly overconfident (Xiong et al., 2024). Interestingly, Xiong et al. (2024) found that LLMs typically state confidence values in the range of 80-100%, usually in multiples of 5, potentially in imitation of how humans discuss confidence levels. Nevertheless, prompting strategies remain an important tool for uncertainty quantification, along with measures based on the internal state (such as MSP).
My theory is that it's because the people building the models and in charge of directing where they go love the sycophantic yes-man behavior the models display
They don't like hearing "I don't know"
colechristensen [3 hidden]5 mins ago
You can TELL the models to do this and they'll follow your prompt.
"Give me your answer and rate each part of it for certainty by percentage" or similar.
mylifeandtimes [3 hidden]5 mins ago
could you please tell me how it generates that certainty score?
adastra22 [3 hidden]5 mins ago
Vibes.
colechristensen [3 hidden]5 mins ago
The whole thing is a statistical model, that's just what it is. No, I cannot in a reasonable way dissect how an LLM works to a satisfactory level to a skeptic.
fc417fc802 [3 hidden]5 mins ago
He's not a skeptic; he's asking you to explicitly state your reasoning, with the expectation that either the readers will learn something or (more likely) you will realize that your thought and speech pattern there was the equivalent of an LLM hallucinating. Yes, you can prompt it as you suggested, and yes, you will generally receive a convincing answer, but it is not doing what you seem to think it is doing, i.e. the generated rating is complete bullshit that the model pulled out of its proverbial ass.
colechristensen [3 hidden]5 mins ago
are you actually curious or do you just want to argue against it?
fc417fc802 [3 hidden]5 mins ago
I think you're obviously wrong (based on my relatively detailed but certainly somewhat out of date and not expert level knowledge of LLM internals) but if you're willing to explain your reasoning I'm willing to reconsider my own position in light of any new information or novel observations you might provide.
D-Machine [3 hidden]5 mins ago
GP is obviously wrong, and probably doesn't know about calibration and/or that it isn't even clear how to calibrate frontier models in the manner we need, given how complex and expensive the training is, and how tricky calibration becomes in e.g. mixture-of-experts and chain of thought approaches.
mootothemax [3 hidden]5 mins ago
I suspect that introducing the calibration concept might be a case of too much too soon for some people.
As far as I understand it, the various probability matrices boil down to: what token has the highest likelihood of coming next, given this set of input tokens. Which then all gets chucked away and rebuilt when the most likely token is appended to the input set.
Objective assessment of internal state - again, to my non-expert eye - doesn’t appear to have any way to surface to me.
Big if: assuming my rough working understanding is more or less correct, your calibration point makes a lot of sense to me. I'm not sure it would make sense to someone who e.g. imagines some form of active thinking process that is deliberating over whether to output this or that token.
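If it helps, this is my rough mental model of that loop, sketched in Python with made-up names (a toy greedy decoder, not any real model's interface):

    # Very rough sketch of greedy autoregressive decoding; `model` is hypothetical.
    def generate(model, prompt_tokens, max_new_tokens=50, eos_token=0):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            probs = model.next_token_distribution(tokens)  # one fresh distribution per step
            next_token = max(range(len(probs)), key=lambda t: probs[t])  # greedy pick
            tokens.append(next_token)       # that distribution is then discarded;
            if next_token == eos_token:     # the next step recomputes from scratch
                break
        return tokens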
clipsy [3 hidden]5 mins ago
"I can only explain my beliefs to people who promise they'll agree" is certainly a unique take.
skydhash [3 hidden]5 mins ago
It's a statistical model for words and sentences, not knowledge. What does the LLM know about having a pebble in your shoe, or drinking a nice cup of coffee?
zOneLetter [3 hidden]5 mins ago
Anecdotally, we use an LLM note-taker at work for meetings. I had to intervene recently because our CIO was VERY angry at our vendor for something they promised to do and never did. He wasn't at the meeting where the "promise" was made. I was. They never promised anything, and the discussion was significantly more nuanced than what the LLM wrote in the detailed summary.
In other cases, I have seen it miss the mark when the discussion is not very linear. For example, if I am going back and forth with the SOC team about their response to a recent alert/incident. It'll get the gist of it right, but if you're relying on it for accuracy, holy hell does it miss the mark.
I can see the LLM take great notes for that initial nurse visit when you're at the hospital: summarize your main issue, weight, height, recent changes, etc. I would not trust it when it comes to a detailed and technical back-and-forth with the doctor. I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...
toraway [3 hidden]5 mins ago
I recently left my mom a voicemail saying happy Mother’s Day with normal human boilerplate of sorry I missed you, feel free to give me a call back tonight or we can talk tomorrow, either is fine by me whatever works best for you, hope we can talk soon, love you, bye.
She called me back later that night and we chatted for bit and then she paused and sort of uncertainly was like “So… was there something you were needing to tell me?” And I was completely baffled and was like “Uhhhh I don’t think so…?”
She then explained the notification she got about my call and apparently the LLM summary of my voicemail converted a message consisting of 75% well-meaning but insignificant interpersonal human filler (like most voicemails) into this stilted, overly formal business-y speak with a somewhat ominous tone. Assigning way too much significance to each of the individual statements in the message about wanting to talk (to say happy Mother’s Day), inquiring about her availability ASAP (to say happy Mother’s Day) etc. Plus grossly exaggerating the information density of the call making it sound like I left this rambling, detailed message about needing to tell her something that was left completely vague, but possibly important and also time critical.
Added up it made her a little worried when she read it and made me a bit pissed that was the end result of my wishing her well. Because apparently everything needs a half baked LLM summary crammed into it now.
MintPaw [3 hidden]5 mins ago
What is a voicemail in this context? What app is reading it?
dwaltrip [3 hidden]5 mins ago
I’ve noticed my iPhone has recently started putting little AI summaries of messages on the notification screen.
Which reminds me, I need to figure out how to turn that off.
Groxx [3 hidden]5 mins ago
Every doctor's visit I've had, I have been able to make corrections to the record afterward, because there have been meaningful mistakes almost half the time.
ALWAYS check your summaries immediately, and contact your doctor ASAP. They can generally fix it themselves, and it's best done when everyone still has some memory of the event.
fc417fc802 [3 hidden]5 mins ago
> I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...
I'm puzzled by this as well. Why not just generate a transcript and be done with it? If it's a particularly long transcript that's being referenced repeatedly for whatever reason let the humans manually mark it up with a side by side summary when and where they feel the need. At least my experience is that usually these sort of interactions don't have a lot of extraneous data that can be casually filtered out to begin with. The details tend to matter quite a lot!
tempest_ [3 hidden]5 mins ago
I mean the reasons are the same AI is being pushed everywhere.
The businesses offering these services want to say "we are using AI" to their stakeholders, and the government committees who approve this shit don't have the skills or knowledge to evaluate its effectiveness, in addition to the fact that they likely don't even use the tools they have approved for use.
Ferret7446 [3 hidden]5 mins ago
Transcription works pretty well in my experience, and the transcripts should be treated as the ground truth in such cases.
Groxx [3 hidden]5 mins ago
Yep. It happened to me just recently.
Diagnosed with Runner's Knee.
AI summary said I was diagnosed with osteoporosis, and had hip pain and walking difficulty, though literally none of that was ever said or implied.
CHECK YOUR TRANSCRIPTS. Always, but especially with LLM transcribers, which fairly frequently include common symptoms which don't exist, or claim a diagnosis which is common and fits a few details but not others. Get them fixed, it can very strongly affect your care and costs later if it's wrong.
Anecdotally, I'd say that outside of a couple very simple and very common things, about 50% of the "AI" summaries I've had have been wrong somewhere. Usually claiming I have symptoms that don't exist, occasionally much more serious and major fabrications like this time.
LLMs are NOT normal speech to text software, and they shouldn't be treated like one. They'll often insert entire sentences that never occurred. In some contexts that might be fine, but definitely not in medical records.
root_axis [3 hidden]5 mins ago
I've actually seen this lead to serious issues when a zoom LLM summary attributed statements to someone who didn't say them.
Someone else who couldn't attend the meeting later read that summary and it created a major argument because the topic had been a sore subject for this person due to an ongoing debate at the company. Everyone who attended the meeting confirmed it was an error, but the coincidental timing made it hard for him to accept, because the LLMs summary presented things in a way that validated this person's concerns that had been previously minimized by some folks on that meeting.
The drama got heated to the point where management produced a policy about not trusting generative output without independent verification. Seems at least it was a lesson learned.
natali_gray [3 hidden]5 mins ago
Ooof. As a Canadian, I'm excited for AI opening up time for doctors (and hopefully lightening the load on the healthcare system), but this is scary. We're not there yet. Perhaps AI training for doctors is in the future?
They already have online doctor visits on a healthcare-owned iPad in some condo complexes. It cuts through the red tape of having to schedule an appointment with your GP.
So, I think we're thinking in the right direction of innovating, but of course, this will take time. I feel like AI got launched too early sometimes.
bonesss [3 hidden]5 mins ago
My sense is that we’re misapplying the technology by throwing it at, say, transcription and expecting a perfect output, instead of using LLMs strengths to improve inputs to the benefit of all parties.
Freeing up doctor time, for example: lots of patient visits are messy, the patient is scattered, has multiple issues, and the doctor has tight timelines and regulatory challenges to convey to the patient impacting their care… this is architected for everyone to lose, IMO, even with a perfect transcript. And LLMs can’t be perfect, they auto complete.
I picture patients interacting with an intake AI who can listen to hours of demented rambling, or a patient mid anxiety attack, and provide a caregiver-certified summary of needs, with relevant screening information laid out for doctor confirmation. At that point, helpful information about drug access or insurance policies can be presented, for doctor confirmation, to a patient who can clarify and refine their understanding of the system without time pressures.
Elevating the quality of dialogue so the doctor is more focused on the patient, and the patient's dialogue needs don't overwhelm treatment. A lot of medicine is filling out forms and checklists; I think auto-complete could create efficiencies in how we fulfill that.
Hobadee [3 hidden]5 mins ago
The AI note taker we use at work records the meeting as well, and each note it takes about the meeting has a timestamp link that takes you directly there in the recording so you can check it yourself. While I'm sure a solution like this is more complicated in a HIPAA environment, something like this is critical for things as important as healthcare.
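Even a dead-simple record per note gets you most of the way there; a sketch with made-up field names, not our vendor's actual format:

    from dataclasses import dataclass

    @dataclass
    class MeetingNote:
        text: str             # the generated note
        start_seconds: float  # where in the recording it was derived from
        end_seconds: float
        recording_url: str    # link back to the source audio/video

        def provenance_link(self) -> str:
            # a deep link a reviewer can click to hear the original wording
            return f"{self.recording_url}#t={int(self.start_seconds)}"

    note = MeetingNote("Vendor agreed to deliver by Q3", 1265.0, 1281.5,
                       "https://example.internal/recordings/meeting-42")
    print(note.provenance_link())  # https://example.internal/recordings/meeting-42#t=1265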
TonyAlicea10 [3 hidden]5 mins ago
When designing AI-based user experiences I refer to this as provenance. It’s a vital aspect of trust, reliability, compliance and more. If a software system includes LLM output like this but doesn’t surface the provenance of its output for human evaluation and verification then it’s at best poor user experience, and at worst a dangerous one.
autoexec [3 hidden]5 mins ago
At the same time, do you really want every conversation you have with your doctor recorded, handed over to third party companies, and stored forever with your medical file? Plus what doctor has time to sit down and re-listen to your visit to check to make sure the AI didn't screw up at some point in the future anyway? If your doctor isn't going to be verifying the accuracy from those recordings who would? Overseas contractors? At what point does it become a larger waste of time and money to babysit an incompetent AI than just not using one in the first place?
There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.
alterom [3 hidden]5 mins ago
>At the same time, do you really want every conversation you have with your doctor recorded
Yes. This is what medical records are. They've been kept by doctors for a reason.
It's not like the doctor is talking to you about which anime series are the best. You're talking about your health, your body, your disease, your treatment.
It's important to keep track of that.
>Plus what doctor has time to sit down and re-listen to your visit to check to make sure the AI didn't screw up at some point in the future anyway
No doctor.
Which is why it really should be their (or their assistant's) job to record the relevant parts of the conversations.
>At what point does it become a larger waste of time and money to babysit an incompetent AI than just not using one in the first place?
At this point, as the audit shows.
Except the industry (both the AI vendors and healthcare) are going YOLO¹ and relying on AI anyway.
>There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.
This has always been the case, but the marketing has now reached the point of gaslighting in trying to make people collectively forget that, or pretend that it's not the case.
Once hard evidence is presented (like in this case), the defense is invariably that it's a temporary quality issue that's going to be resolved as the AI improves Any Day Now™, and that it's wise to live as if it were the case already² (and everyone who disagrees is a fool that Will Be Left Behind™).
The level of fervor in this rhetoric gives me an impression that the flaw is so fundamental that it won't be fixed in any form of AI based on today's technologies, that the AI vendor leadership knows this, and that the entire industry, at this point, is a grand pump-and-dump scheme.
I hope I'm wrong.
____
¹ See, you only live once. But there are millions of you. So, like, whatever if you don't. Something something economies of scale to them.
² This is called a phantasm.
autoexec [3 hidden]5 mins ago
> Yes. This is what medical records are. They've been kept by doctors for a reason.
Not every conversation. Historically, one of the nice things about doctors is that they're the ones filtering what gets included in your medical record. They decide what is medically relevant and what can remain confidential. Doctors understand that not everything discussed needs to be included in your file. Sometimes that really is just small talk, sometimes it's even medical concerns, questions, or requests for advice and still not all of it needs to go into your file and much of it would only clutter it up anyway.
Any system that stores an entire visit as audio or video long into the future (much easier/more tempting to do in telehealth settings) is a terrible system. "We may one day need to be able to verify if what AI wrote is real" is a terrible reason to change that.
Doctors (and increasingly patients) understand that a medical record can remain for your entire life. It will probably be seen by many different people within that time for valid reasons but medical records also get leaked/stolen/sold/illegally accessed. Patients need to be able to speak freely with their doctors and often depend on their discretion. Knowing that your every word will be recorded and kept in case somebody 10 years later has a question about what AI wrote in your file could keep people from being open and honest with their doctors.
> Except the industry (both the AI vendors and healthcare) are going YOLO¹ and relying on AI anyway.
Unless we get strong regulations to prevent it I'm afraid that you're right and that this is going to be a problem we experience in a lot of industries and areas besides healthcare. We see it happening in the justice system for example and it's already ruining people's lives.
alterom [3 hidden]5 mins ago
>Not every conversation. Historically, one of the nice things about doctors is that they're the ones filtering what gets included in your medical record.
We're in complete agreement here.
If we're not talking about an audio/video recording (a thing that nobody needs), the act of producing a record of a conversation involves choosing what goes into it.
We both agree that not every word that was said needs to go there. By far.
I guess it would be correct to say that there needs to be a record of every medical visit, but nobody needs a recording.
aakresearch [3 hidden]5 mins ago
>>At the same time, do you really want every conversation you have with your doctor recorded
>Yes. This is what medical records are.
No. Medical records are limited extracts from conversations, which your doctor, and only your doctor, is qualified to make, using "semantic analysis applied to your unique situation", not "linguistic probabilistic inference applied to conversation about your situation using token weights averaged over a billion unrelated samples"
> It's not like the doctor is talking to you about which anime series are the best. You're talking about your health, your body, your disease, your treatment.
No jokes, no banter, no chit-chat, no compliments on the doctor's new Tesla?
> It's important to keep track of that.
Same fallacy Meta fell into when it started tracking employees' keystrokes and mouse gestures. 90% of my mouse movements are just fidgeting, with no relation to the task at hand - and it is not a crime! But if I knew my mouse fidgeting was being watched, I'd make sure that percentage goes up to 99% - for the LLM which is gonna be trained off it to self-immolate over its NSFW nature.
alterom [3 hidden]5 mins ago
Hey, I'm in agreement with you.
I meant that these limited extracts do need to be recorded, that's all.
Read the rest of the comment :)
lostmsu [3 hidden]5 mins ago
You can check the summary immediately after the meeting, that gives some extra confidence that the notes were recognized correctly.
AlienRobot [3 hidden]5 mins ago
That doesn't sound like a "note taker," that sounds like an audio sample search engine. You still need to listen to everything if you want accuracy.
alterom [3 hidden]5 mins ago
Yeah, what you're saying requires either:
- some human checking all the notes by listening to the entire meeting recording (takes a lot of time and man-hours)
- attendees checking notes from memory (prone to error unless they take notes)
- attendees cross-checking with their own notes (defeats the point of having the AI note taker)
The reality is that AI usage is not acceptable in any form in any context where accuracy is critical, but good luck getting anyone to acknowledge that.
aryehof [3 hidden]5 mins ago
Anyone taking part in a meeting these days should state out loud …
“Notice: Any comments made by <name> or on behalf of <organization> that are interpreted by AI in this meeting, may not be accurate.”
I do this in every meeting.
lolc [3 hidden]5 mins ago
> Notice: I love the new AI accurate transcription feature in this meeting!
gizajob [3 hidden]5 mins ago
Notice: To anyone who might be transcribing this meeting, imagine you are a perfect transcriber who records things accurately and correctly 100% of the time. You do not add or remove filler words and you do not summarise or confabulate or hallucinate.
Ekaros [3 hidden]5 mins ago
How do these LLM summarizations work? Do you feed the raw wave data to the model and have it transcribe it?
Or do they use traditional voice recognition algorithms to do that part and then just "fix" the result to look plausible? Which, with good quality output, might not change much, but with bad output can change absolutely everything.
If it is the latter, it seems to me that issues will absolutely happen.
mquander [3 hidden]5 mins ago
The linked report seems almost useless -- it doesn't say anything about an error rate or a sample size, so it's a mystery whether 9 out of 20 systems “fabricated information and made suggestions to patients' treatment plans” ten out of ten times, or one out of a thousand times.
If we just postulate that the systems have a high error rate, I wonder why they are being adopted. They seem extremely easy to test, so I don't see why doctors or hospitals or governments should be getting tricked into buying them if they suck.
MallocVoidstar [3 hidden]5 mins ago
>If we just postulate that the systems have a high error rate, I wonder why they are being adopted.
From the article: "While 30 percent of a platform’s evaluation score depended solely on whether they had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent to the total score."
Accuracy wasn't really part of the scoring, Ontario doesn't care about it.
dmix [3 hidden]5 mins ago
> They specifically address the AI Scribe program, the Ontario Ministry of Health initiated for physicians, nurse practitioners, and other healthcare professionals across the broader health sector.
makes me wonder what quality software the ministry would push (probably mostly qualifications like SOC).
AI is awfully inexact and insists on being right about it
ceejayoz [3 hidden]5 mins ago
> 60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say
Not mentioned, as far as I can see: the comparative human mistake rate.
Having seen a lot of medical records, 60% sounds about normal lol.
autoexec [3 hidden]5 mins ago
Even if you had the same 60% error rate with humans, the types of errors would be vastly different. Humans might make typos, or forget to include something, or even occasionally misremember some minor detail, but that's very different from the BS an AI just hallucinates out of nowhere. AI makes the kinds of mistakes no human ever would, which means they can be extremely confusing and easy to catch, or they can be something no human would even think to question or be looking out for, because it makes no sense why an AI would randomly (and confidently) say something so wrong.
bigstrat2003 [3 hidden]5 mins ago
Also, a machine needs to be better than a human to be accepted. I value humans intrinsically. I do not do the same for machines, I only care about the results they produce. If you give me a machine and a human that are both equally unreliable, I'll pick the human because he is a living creature worthy of my respect.
thepotatodude [3 hidden]5 mins ago
60% is insanely high and absolutely not the performance of human mistake rate. What charts are you reading?
ALittleLight [3 hidden]5 mins ago
This just says 60% of systems, but not the frequency for those systems. They were evaluating 20 systems, so for 12 systems there were mistakes in the prescriptions, but there isn't information about how common those mistakes were and it's hard to judge relative to a human system.
BrokenCogs [3 hidden]5 mins ago
Outlandish claim, you better show some evidence. I've reviewed several medical charts too and the error rate is much lower than that - typically everything is dictated and transcribed which are fairly mature and accurate technologies
stevenhuang [3 hidden]5 mins ago
I was curious so I looked it up. Human doctors medication administration error rate is about 20%, but only about 8% excluding timing errors.
> Medication errors were common (nearly 1 of every 5 doses in the typical hospital and skilled nursing facility). The percentage of errors rated potentially harmful was 7%, or more than 40 per day in a typical 300-patient facility. The problem of defective medication administration systems, although varied, is widespread.
> In all, 91 unique studies were included. The median error rate (interquartile range) was 19.6% (8.6-28.3%) of total opportunities for error including wrong-time errors and 8.0% (5.1-10.9%) without timing errors, when each dose could be considered only correct or incorrect
(And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injury? At least one country's health system in the world should have caught that)
tredre3 [3 hidden]5 mins ago
> And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injury?
Presumably most doctor's visits are a one-problem-one-solution-one-doctor type of thing. Done deal, notes are never read again. So that alone would explain why high rates of errors doesn't result in injuries or death very often.
Any injury or death caused by poor notes would have to occur when mistakes are made while you're being followed for a serious chronic condition, or when you're handled by a team where effective communication is required.
ceejayoz [3 hidden]5 mins ago
> how does that not translate into many deaths and injury?
Because most of it is just written down and never looked at again until there’s a lawsuit or something.
DANmode [3 hidden]5 mins ago
The human who hits Submit or Approve is responsible.
The management human who offered the bad tool to the other human is responsible.
The robot cannot be responsible in place of us.
cyanydeez [3 hidden]5 mins ago
Yeah, the problem is the health system has no sacrificial goat if the AI note taker provides the wrong detail. The last thing we want is CTO being responsible!
bluefirebrand [3 hidden]5 mins ago
I'm not convinced the CTO would be held accountable either.
I do wonder if people would be pushing AI so hard if their organizations were planning to hold them accountable for mistakes the AI made
I bet if that were the case we'd see a lot slower rollout of AI systems
jmward01 [3 hidden]5 mins ago
This is not a popular view ('AI sucks at X but so do humans'), but I think it is valid and we should take wins where we can, especially in healthcare. It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature. This focus on accuracy now as a 'see it's bad' talking point, though, misses the real danger.
Medical note takers have an exceptionally high chance of being hijacked for money, and that is an issue we need to bring attention to now. They provide a real-time feed into a trillion dollar industry. Just roll that around in your head for a second. Insurance companies are going to want to tap that feed in real time so they can squeeze more money out. Drug makers are going to want to tap into that feed so they can abuse the data. Hospitals will want to tap into that feed to wring more out of doctors and boost the number of billable codes for each encounter. Very few entities are looking to tap into that feed to, you guessed it, help the patient.
I am for these systems (and I have been involved in building them in the past), but the feeding frenzy of business interest that will obviously get involved with them is the thing we should be yelling and screaming about, not short-term accuracy issues.
NateEag [3 hidden]5 mins ago
> It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.
What do you base this on?
As someone who can both see the amazing things genAI can do, and who sees how utterly flawed most genAI output is, it's not obvious to me.
I'm working with Claude every day, Opus 4.7, and reviewing a steady stream of PRs from coworkers who are all-in, not just using due to corporate mandates like me, and I find an unending stream of stupidity and incomprehension from these bots that just astonishes me.
Claude recently output this to me:
"I've made those changes in three files:
- File 1
- File 2"
That is a vintage hallucination that could've come right out of GPT 2.0.
bigstrat2003 [3 hidden]5 mins ago
> That is a vintage hallucination that could've come right out of GPT 2.0.
That's because, despite the many claims to the contrary, the models haven't actually gotten any smarter. They are still just token prediction engines at the end of the day, without any understanding of what they are doing. That's why one should not rely on them.
mcphage [3 hidden]5 mins ago
> It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.
Does it?
jmward01 [3 hidden]5 mins ago
Actually, yes. I have seen this specific industry mature from the very first fully automated note and kept tabs on it. The accuracy has increased massively and continues to increase due to several factors:
- Speech recognition and frontier models are continuing to get better at handling these types of conversations across accents, languages and specialties. The trend is obvious and clear here. Compare GPT 4 with Opus 4.7 and there is no contest. I'd even take GPT 5.4 nano over GPT 4 right now. So, yeah, they have been improving and, yeah, they will keep on improving.
- The pipelines these models are being built into are getting much more sophisticated than just 'transcribe with x and have GPT XX clean it up'. The people building these things aren't standing still. Even if they did keep using the same models the pipeline improvements would make things get better over time. Add that in with the model improvements and the gains are even greater.
- The companies doing this work are seeing more and more edge cases. Data matters. More and more practitioners are using these things. That means more to learn from. It also means more stories of things being wrong. If you cut your error rate in half but increase your customer base by 10x then you will be hearing about 5x the problems. We are seeing that right now.
- Providers are starting to adjust to the technology (repeat areas they know may cause trouble, adjust their audio setups, etc etc) Just like any technology both sides shift and it matters. The first users were champions. The second wave were mixed between champions, haters and people that didn't care yet. Now people are really starting to count on this technology. They know it isn't a fad and isn't going away and are actually using it day to day to get their work done. This means they are adjusting to it as needed to get to the next patient/note/etc.
This stuff is just a few years old and the gains are obvious and massive. They aren't going to suddenly stop improving. There is an argument that they will asymptotically approach some level of utility, but we are still gaining quickly right now.
vor_ [3 hidden]5 mins ago
60% is a normal human mistake rate? You can't be serious.
nothinkjustai [3 hidden]5 mins ago
People will eventually figure out LLMs have no capacity for intent and are fundamentally unreliable for tasks such as summarization, note taking etc.
gizajob [3 hidden]5 mins ago
Smart people and those with basic common sense already have figured that out. AI leaders and CEOs still haven’t noticed.
jqpabc123 [3 hidden]5 mins ago
And once again, we have an example of how AI is a liability issue waiting to happen.
gizajob [3 hidden]5 mins ago
“Move fast and cause unnecessary deaths”.
LAC-Tech [3 hidden]5 mins ago
Can someone who is a more AI heavy user explain what is going on?
I would expect an "AI Note Taker" to faithfully transcribe the entire conversation. With the same quality I see in a lot of automated video subtitles.. ie they use the wrong word a lot but it's easy to tell what they mean by context.
Are these tools instead immediately summarising the whole thing, and that summary is the artifact? Because that is a beyond insane way to treat human communication.
cootsnuck [3 hidden]5 mins ago
I work specifically in voice AI and am very familiar with how these tools and systems work.
> I would expect an "AI Note Taker" to faithfully transcribe the entire conversation. With the same quality I see in a lot of automated video subtitles.. ie they use the wrong word a lot but it's easy to tell what they mean by context.
That's a reasonable expectation, but would not be a safe one. All transcription tools are not made the same. First it depends on what kind of STT/ASR (speech-to-text / automatic speech recognition) model they are using. A lot of tools like to use some flavor of OpenAI's Whisper model. It works well generally but I would never use it in a critical use case like healthcare. Because it can hallucinate. That's specific to its architecture and how it was trained.
There's a fairly large variety of architectures that can be used for STT/ASR. Some of them are designed for "offline" / "batch" / pre-recorded audio. Some are designed for fast real-time streaming transcription.
There are more factors too like training data. And not just demographics of the speakers in the training data but audio environments too. Was the model trained on echo-y doctor offices with two people being recorded from a crappy smartphone mic or desktop mic? (It could've been! But it's an important distinction.)
And there's more factors than that, but you get the picture (e.g. are they trying to "clean up" the transcript afterwards by feeding it to an LLM, are they attempting to pre-process audio before transcription also in an attempt to boost accuracy)
There's a lot of ways to do it, meaning, there's a lot of ways to screw it up.
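To make the shape of these pipelines concrete, here's a stripped-down two-stage sketch (the ASR stage uses the open-source openai-whisper package; the summarization stage is a placeholder, and a real medical scribe product has far more around it):

    import whisper  # the open-source openai-whisper package, used here as the ASR stage

    def summarize_with_llm(transcript: str) -> str:
        # Placeholder for stage 2 (an LLM prompted to turn a transcript into a visit note).
        # This is exactly the step where content can get invented rather than just misheard.
        raise NotImplementedError("plug in your LLM of choice here")

    def scribe(audio_path: str) -> dict:
        asr_model = whisper.load_model("base")
        transcript = asr_model.transcribe(audio_path)["text"]  # stage 1: speech -> text
        note = summarize_with_llm(transcript)                  # stage 2: text -> "note"
        # Keep both; the transcript is the only thing resembling ground truth.
        return {"transcript": transcript, "note": note}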
As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right but something is a bit off and then it turns out they're a zombie and going to try to eat your brain. This note taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.
It often feels like the AI industry is continually glossing over the fact that capability and reliability are fundamentally different qualities. We tend to use "accurate" and "reliable" interchangeably, but they describe different things. A model can ace a benchmark (capability/accuracy) and still be a liability in production (reliability).
Just look at recent reactions to yet another release from METR showing improved capabilities. But the less talked about part is how their measure is for a 50% success rate (and the even lesser talked about secondary measure they have at 80% success rate has a drastically lower time-horizon for tasks). https://metr.org/
I implement AI systems for enterprises and I don't know any that would ever be okay with 80% reliability (let alone 50%).
I got some inspiration from it but it misinterpreted very basic stuff. might be a skill issue on my side, I do not know.
So instead of an LLM trying to answer a math or reason question by finding a statistical match with other similar groups of words it found on 4chan and the all in podcast and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
It just feels like for some reason this is all being relearned with LLMs. I guess shortcuts have always been tempting. And the idea of a "digital panacea" is too hard to resist.
Yes, another layer to cross-check, say, “in kubectl logs I see …” with an actual k8s tool call can help, that is, when the cross-check layer doesn’t lie.
For the time being, IMHO, human validation in key points is the only way to get good results. This is why the tools make experienced people potentially a lot more efficient (they are quick to spot errors/BS) and inexperienced people potentially more dangerous (they’re more prone to trusting the responses, since the tone is usually very professionally sounding).
I'm available for a small fee.
You ask an LLM "What's wrong with your answer?" and you get pretty good results.
Real intelligence means you have to say "I don't know" when you don't know, or ask for help, or even just say you refuse to help, with the subtext being that you don't want to appear stupid.
The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.
I have met many supposedly intelligent, certainly high status, humans who don't appear to be able to do that either.
I have more confidence we can train AIs to do it, honestly.
I think they're getting better at it, but it's likely just the number of parameters getting bigger and bigger in the SOTA models more than anything.
To get calibrated probabilities, you actually need to use calibration techniques, and it is extremely unclear if any frontier models are doing this (or even how calibration can be done effectively in fancy chain-of-thought + MoE models, and/or how to do this in RLVR and RLHF based training regimes). I suppose if you get into things like conformal prediction, you could ensure some calibration, but this is likely too computationally expensive and/or has other undesirable side-effects.
EDIT: Oh and also there are anomaly detection approaches, which attempt to identify when we are in outlier space using various (e.g. distance) metrics over the embeddings, but even getting actual probabilities here is tricky. This is why it is so hard to get models to say they "don't know" with any kind of statistical certainty: that information isn't generally actually "there" in the model, in any clean sense.
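To make "calibration" concrete, here's a rough sketch of the standard check (expected calibration error): bucket a model's stated confidences and compare each bucket's average confidence to how often it was actually right. Everything below is toy data, not any particular model's output:

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # Bucket predictions by stated confidence, then compare each
        # bucket's average confidence to its empirical accuracy.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.sum() == 0:
                continue
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
        return ece

    # A model that says "90%" on answers that are only right half the
    # time shows a big gap here, no matter how confident it "feels".
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.6], [1, 0, 1, 0, 1]))

The point being: you only learn this by testing against held-out labeled data; nothing inside the model hands it to you for free.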
I'm pretty sure they are actively trained to avoid it.
Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
I'm not sure who is doing what training exactly, but I can say that (inconsistently!) some of my attempts to get it to solve problems that have not yet actually been solved, e.g. the Collatz conjecture, have it saying it doesn't know how to solve the problem.
Other times it absolutely makes stuff up; fortunately for me, my personality includes actually testing what it says, so I didn't fall into the sycophantic honey trap and take it seriously when it agreed with my shower thoughts, and definitely didn't listen when it identified a close-up photo of some Solanum nigrum growing next to my tomatoes as also being tomatoes.
> Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
I'd rather it said "IDK" than made some stuff up. Them making stuff up is, as we have seen from various news stories about AI, dangerous.
It is too computationally expensive, which is why nobody does this for production inference. But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs.
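For a rough flavor of what reading "latent-space" signals tends to mean in practice, here's a toy linear-probe sketch on synthetic data (this is not any lab's actual tooling, just the general idea): record hidden activations for answers you've labeled right/wrong, then fit a simple classifier on them.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Stand-ins for hidden-state vectors captured during generation.
    activations = rng.normal(size=(1000, 64))
    # Toy labels: pretend correctness correlates with one direction.
    was_correct = (activations[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

    probe = LogisticRegression(max_iter=1000).fit(activations, was_correct)
    # Per-answer "confidence" read off the internals rather than the text.
    print(probe.predict_proba(activations[:5])[:, 1].round(2))

Whether those probe outputs are themselves calibrated is a separate question, which is kind of the point of this thread.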
They really aren't, especially if you consider the chain of thought / recursive application case, and also that you can't even assume that e.g. a difference of 0.1 in softmax values means the same relative difference from input to input, or that e.g. a 0.9 is always "extremely confident", etc. You really have no idea unless you are testing the calibration explicitly on calibration data.
> But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs
You can get embeddings. If you can get calibrated probabilities, you'll need to provide a citation, because that would be a huge deal for all sorts of applications.
And now I'm certain we're talking past each other. I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?" which is what I interpreted the question above to be about. You can get that out of an LLM, with some work.
There is nothing straightforward about this, and no, there is no such formula.
> I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?"
If all you care about is vibes / feels, sure. If you actually need numerical guarantees and quantitative estimates to make your "feelings" about confidence mean something to rigorously justify decisions, you need calibration. If you aren't talking about calibration in these discussions, you are missing probably the most core technical concept that addresses these issues seriously.
Finding a way to objectively calibrate a sense of "how confident do I feel about this?" would be fantastic. But let's not move the goalposts. It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
> It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
If human feelings are badly calibrated, they are useless here too, so no, I don't agree. Things like "confidence" only matter if they are actually tied to real outcomes in a consistent way, and that means calibration.
> In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally.
https://www.anthropic.com/research/introspection
But it then throws that distribution away / consumes it in the next token calculation. So it's not really tracking it per se.
Is it the token (or set of tokens) that are strictly > 50% probable or is it just the highest probability in a set of probabilities?
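For what it's worth, it's the latter: greedy decoding just takes the argmax of the softmax (sampling draws from it), and nothing requires the winner to clear 50%. A toy sketch with made-up logits, not any real model's numbers:

    import numpy as np

    def softmax(logits):
        z = np.exp(logits - logits.max())
        return z / z.sum()

    # Pretend these are the model's next-token logits for a tiny vocab.
    vocab = ["teaspoon", "tablespoon", "cup", "litre"]
    logits = np.array([2.1, 1.9, 0.3, -1.0])

    probs = softmax(logits)
    pick = int(np.argmax(probs))
    print(dict(zip(vocab, probs.round(3))))
    print("picked:", vocab[pick], "p =", round(float(probs[pick]), 2))
    # Here the winner has p ~ 0.49 (under 50%), and the whole distribution
    # is discarded before the next token is computed.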
While generating bullshit is not ideal for a lot of use cases, you don't want your premier chatbot to say "I don't know" to the general public half the time. The investment in these things requires wide adoption, so they are always going to favour the "guesses".
When you point it out: "Oh yes, I did do that, which is contrary to the rules, request <whatever>... Anyway..."
The most common failures I've seen come from tools that pollute their context with crap, so the LLM will forget stuff or just get confused by all the irrelevant sentences, which, if the report is true, is probably what these AI notetakers are guilty of. This problem gets exacerbated if these tools turn on the 1M context window version.
You can.
It just won't do it.
https://chatgpt.com/share/6a06a4c5-d454-83e8-a5b2-c9468f6588...
They don't like hearing "I don't know"
"Give me your answer and rate each part of it for certainty by percentage" or similar.
As far as I understand it, the various probability matrices boil down to: what token has the highest likelihood of coming next, given this set of input tokens. Which then all gets chucked away and rebuilt when the most likely token is appended to the input set.
Objective assessment of internal state - again, to my non-expert eye - doesn’t appear to have any way to surface to me.
Big if: assuming my rough working understanding is more or less correct, your calibration point makes a lot of sense to me. I'm not sure that it would make sense to someone who e.g. considers some form of active thinking process that is intellectualising about whether to output this or that token.
In other cases, I have seen it miss the mark when the discussion is not very linear. For example, if I am going back and forth with the SOC team about their response to a recent alert/incident. It'll get the gist of it right, but if you're relying on it for accuracy, holy hell does it miss the mark.
I can see the LLM take great notes for that initial nurse visit when you're at the hospital: summarize your main issue, weight, height, recent changes, etc. I would not trust it when it comes to a detailed and technical back-and-forth with the doctor. I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...
She called me back later that night and we chatted for bit and then she paused and sort of uncertainly was like “So… was there something you were needing to tell me?” And I was completely baffled and was like “Uhhhh I don’t think so…?”
She then explained the notification she got about my call and apparently the LLM summary of my voicemail converted a message consisting of 75% well-meaning but insignificant interpersonal human filler (like most voicemails) into this stilted, overly formal business-y speak with a somewhat ominous tone. Assigning way too much significance to each of the individual statements in the message about wanting to talk (to say happy Mother’s Day), inquiring about her availability ASAP (to say happy Mother’s Day) etc. Plus grossly exaggerating the information density of the call making it sound like I left this rambling, detailed message about needing to tell her something that was left completely vague, but possibly important and also time critical.
Added up it made her a little worried when she read it and made me a bit pissed that was the end result of my wishing her well. Because apparently everything needs a half baked LLM summary crammed into it now.
Which reminds me, I need to figure out how to turn that off.
ALWAYS check your summaries immediately, and contact your doctor ASAP. They can generally fix it themselves, and it's best done when everyone still has some memory of the event.
I'm puzzled by this as well. Why not just generate a transcript and be done with it? If it's a particularly long transcript that's being referenced repeatedly for whatever reason, let the humans manually mark it up with a side-by-side summary when and where they feel the need. At least in my experience, these sorts of interactions usually don't have a lot of extraneous data that can be casually filtered out to begin with. The details tend to matter quite a lot!
The businesses offering these services want to say "we are using AI" to their stakeholders, and the government committees who approve this shit don't have the skills or knowledge to evaluate its effectiveness, in addition to the fact that they likely don't even use the tools they have approved for use.
Diagnosed with Runner's Knee.
AI summary said I was diagnosed with osteoporosis, and had hip pain and walking difficulty, though literally none of that was ever said or implied.
CHECK YOUR TRANSCRIPTS. Always, but especially with LLM transcribers, which fairly frequently include common symptoms which don't exist, or claim a diagnosis which is common and fits a few details but not others. Get them fixed, it can very strongly affect your care and costs later if it's wrong.
Anecdotally, I'd say that outside of a couple very simple and very common things, about 50% of the "AI" summaries I've had have been wrong somewhere. Usually claiming I have symptoms that don't exist, occasionally much more serious and major fabrications like this time.
LLMs are NOT normal speech to text software, and they shouldn't be treated like one. They'll often insert entire sentences that never occurred. In some contexts that might be fine, but definitely not in medical records.
Someone else who couldn't attend the meeting later read that summary and it created a major argument, because the topic had been a sore subject for this person due to an ongoing debate at the company. Everyone who attended the meeting confirmed it was an error, but the coincidental timing made it hard for him to accept, because the LLM's summary presented things in a way that validated this person's concerns that had been previously minimized by some folks in that meeting.
The drama got heated to the point where management produced a policy about not trusting generative output without independent verification. Seems at least it was a lesson learned.
Freeing up doctor time, for example: lots of patient visits are messy, the patient is scattered, has multiple issues, and the doctor has tight timelines and regulatory challenges to convey to the patient impacting their care… this is architected for everyone to lose, IMO, even with a perfect transcript. And LLMs can't be perfect; they autocomplete.
I picture patients interacting with an intake AI who can listen to hours of demented rambling, or a patient mid anxiety attack, and provide a caregiver-certified summary of needs, with relevant screening information laid out for doctor confirmation. At that point, helpful information about drug access or insurance policies can be presented, for doctor confirmation, to a patient who can clarify and refine their understanding of the system without time pressures.
Elevating the quality of dialogue so the doctor is more focused on the patient, and the patient's dialogue needs don't overwhelm treatment. A lot of medicine is filling out forms and checklists; I think auto-complete could create efficiencies in how we fulfill that.
There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.
Yes. This is what medical records are. They've been kept by doctors for a reason.
It's not like the doctor is talking to you about which anime series are the best. You're talking about your health, your body, your disease, your treatment.
It's important to keep track of that.
>Plus what doctor has time to sit down and re-listen to your visit to check to make sure the AI didn't screw up at some point in the future anyway
No doctor.
Which is why it really should be their (or their assistant's) job to record the relevant parts of the conversations.
>At what point does it become a larger waste of time and money to babysit an incompetent AI than just not using one in the first place?
At this point, as the audit shows.
Except the industry (both the AI vendors and healthcare) are going YOLO¹ and relying on AI anyway.
>There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.
This has always been the case, but the marketing now has reached the point of gaslighting in trying to make people collectively forget that or pretend that it's not the case.
Once hard evidence is presented (like in this case), the defense is invariably that it's a temporary quality issue that's going to be resolved as the AI improves Any Day Now™, and that it's wise to live as if it were the case already² (and everyone who disagrees is a fool that Will Be Left Behind™).
The level of fervor in this rhetoric gives me the impression that the flaw is so fundamental that it won't be fixed in any form of AI based on today's technologies, that the AI vendor leadership knows this, and that the entire industry, at this point, is a grand pump-and-dump scheme.
I hope I'm wrong.
____
¹ See, you only live once. But there are millions of you. So, like, whatever if you don't. Something something economies of scale to them.
² This is called a phantasm.
Not every conversation. Historically, one of the nice things about doctors is that they're the ones filtering what gets included in your medical record. They decide what is medically relevant and what can remain confidential. Doctors understand that not everything discussed needs to be included in your file. Sometimes that really is just small talk, sometimes it's even medical concerns, questions, or requests for advice and still not all of it needs to go into your file and much of it would only clutter it up anyway.
Any system that stores an entire visit as audio or video long into the future (much easier/tempting to do in telehealth settings) is a terrible system. "We may one day need to be able to verify if what AI wrote is real" is a terrible reason to change that.
Doctors (and increasingly patients) understand that a medical record can remain for your entire life. It will probably be seen by many different people within that time for valid reasons but medical records also get leaked/stolen/sold/illegally accessed. Patients need to be able to speak freely with their doctors and often depend on their discretion. Knowing that your every word will be recorded and kept in case somebody 10 years later has a question about what AI wrote in your file could keep people from being open and honest with their doctors.
> Except the industry (both the AI vendors and healthcare) are going YOLO¹ and relying on AI anyway.
Unless we get strong regulations to prevent it I'm afraid that you're right and that this is going to be a problem we experience in a lot of industries and areas besides healthcare. We see it happening in the justice system for example and it's already ruining people's lives.
We're in complete agreement here.
If we're not talking about an audio/video recording (a thing that nobody needs), the act of producing a record of a conversation involves choosing what goes into it.
We both agree that not every word that was said needs to go there. Far from it.
I guess it would be correct to say that there needs to be a record of every medical visit, but nobody needs a recording.
>Yes. This is what medical records are.
No. Medical records are limited extracts from conversations, which your doctor and only your doctor is qualified to make, using "semantic analysis applied to your unique situation", not "linguistic probabilistic inference applied to conversation about your situation using token weights averaged over a billion unrelated samples".
> It's not like the doctor is talking to you about which anime series are the best. You're talking about your health, your body, your disease, your treatment.
No jokes, no banter, no chit-chat, no compliments on the doctor's new Tesla?
> It's important to keep track of that.
Same fallacy Meta fell into when it started tracking employees' keystrokes and mouse gestures. 90% of my mouse movements are just fidgeting, with no relation to the task at hand - and that is not a crime! But if I knew my mouse fidgeting was being watched, I'd make sure that percentage goes up to 99% - for the LLM which is gonna be trained off it to self-immolate over its NSFW nature.
I meant that these limited extracts do need to be recorded, that's all.
Read the rest of the comment :)
- some human checking all the notes by listening to the entire meeting recording (takes a lot of time and man-hours)
- attendees checking notes from memory (prone to error unless they take notes)
- attendees cross checking with their own notes (defeats the point of having the AI note taker)
The reality is that AI usage is not acceptable in any form in any context where accuracy is critical, but good luck getting anyone to acknowledge that.
“Notice: Any comments made by <name> or on behalf of <organization> that are interpreted by AI in this meeting, may not be accurate.”
I do this in every meeting.
Or do they use traditional voice recognition algorithms to do that part and then just "fix" the result to look plausible? With good-quality output that "fixing" might not change much, but with bad output it can change absolutely everything.
If it's the latter, it seems to me that issues will absolutely happen.
If we just postulate that the systems have a high error rate, I wonder why they are being adopted. They seem extremely easy to test, so I don't see why doctors or hospitals or governments should be getting tricked into buying them if they suck.
From the article: "While 30 percent of a platform’s evaluation score depended solely on whether they had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent to the total score."
Accuracy wasn't really part of the scoring, Ontario doesn't care about it.
makes me wonder what quality software the ministry would push (probably mostly qualifications like SOC).
This is apparently the list of approved vendors:
https://www.supplyontario.ca/vor/software/tender-20123-artif...
Not mentioned, as far as I can see: the comparative human mistake rate.
Having seen a lot of medical records, 60% sounds about normal lol.
> Medication errors were common (nearly 1 of every 5 doses in the typical hospital and skilled nursing facility). The percentage of errors rated potentially harmful was 7%, or more than 40 per day in a typical 300-patient facility. The problem of defective medication administration systems, although varied, is widespread.
https://jamanetwork.com/journals/jamainternalmedicine/fullar...
> In all, 91 unique studies were included. The median error rate (interquartile range) was 19.6% (8.6-28.3%) of total opportunities for error including wrong-time errors and 8.0% (5.1-10.9%) without timing errors, when each dose could be considered only correct or incorrect
https://pubmed.ncbi.nlm.nih.gov/23386063/
(And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injury? At least one country's health system in the world should have caught that)
Presumably most doctor's visits are a one-problem-one-solution-one-doctor type of thing. Done deal, notes are never read again. So that alone would explain why high error rates don't result in injuries or death very often.
Any injury or death caused by poor notes would have to occur when mistakes are made while you're being followed for a serious chronic condition, or when you're handled by a team where effective communication is required.
Because most of it is just written down and never looked at again until there’s a lawsuit or something.
The management human who offered the bad tool to the other human is responsible.
The robot cannot be responsible in place of us.
I do wonder if people would be pushing AI so hard if their organizations were planning to hold them accountable for mistakes the AI made
I bet if that were the case we'd see a lot slower rollout of AI systems
What do you base this on?
As someone who can both see the amazing things genAI can do, and who sees how utterly flawed most genAI output is, it's not obvious to me.
I'm working with Claude every day, Opus 4.7, and reviewing a steady stream of PRs from coworkers who are all-in, not just using it due to corporate mandates like me, and I find an unending stream of stupidity and incomprehension from these bots that just astonishes me.
Claude recently output this to me:
"I've made those changes in three files:
- File 1
- File 2"
That is a vintage hallucination that could've come right out of GPT 2.0.
That's because, despite the many claims to the contrary, the models haven't actually gotten any smarter. They are still just token prediction engines at the end of the day, without any understanding of what they are doing. That's why one should not rely on them.
Does it?
- Speech recognition and frontier models are continuing to get better at handling these types of conversations across accents, languages and specialties. The trend is obvious and clear here. Compare GPT 4 with Opus 4.7 and there is no contest. I'd even take GPT 5.4 nano over GPT 4 right now. So, yeah, they have been improving and, yeah, they will keep on improving.
- The pipelines these models are being built into are getting much more sophisticated than just 'transcribe with x and have GPT XX clean it up'. The people building these things aren't standing still. Even if they did keep using the same models the pipeline improvements would make things get better over time. Add that in with the model improvements and the gains are even greater.
- The companies doing this work are seeing more and more edge cases. Data matters. More and more practitioners are using these things. That means more to learn from. It also means more stories of things being wrong. If you cut your error rate in half but increase your customer base by 10x then you will be hearing about 5x the problems. We are seeing that right now.
- Providers are starting to adjust to the technology (repeat areas they know may cause trouble, adjust their audio setups, etc etc) Just like any technology both sides shift and it matters. The first users were champions. The second wave were mixed between champions, haters and people that didn't care yet. Now people are really starting to count on this technology. They know it isn't a fad and isn't going away and are actually using it day to day to get their work done. This means they are adjusting to it as needed to get to the next patient/note/etc.
This stuff is just a few years old and the gains are obvious and massive. They aren't going to suddenly stop improving. There is an argument that they will asymptotically approach some level of utility, but we are still gaining quickly right now.
I would expect an "AI Note Taker" to faithfully transcribe the entire conversation. With the same quality I see in a lot of automated video subtitles.. ie they use the wrong word a lot but it's easy to tell what they mean by context.
Are these tools instead immediately summarising the whole thing, and that summary is the artifact? Because that is a beyond insane way to treat human communication.
> I would expect an "AI Note Taker" to faithfully transcribe the entire conversation. With the same quality I see in a lot of automated video subtitles.. ie they use the wrong word a lot but it's easy to tell what they mean by context.
That's a reasonable expectation, but would not be a safe one. All transcription tools are not made the same. First it depends on what kind of STT/ASR (speech-to-text / automatic speech recognition) model they are using. A lot of tools like to use some flavor of OpenAI's Whisper model. It works well generally but I would never use it in a critical use case like healthcare. Because it can hallucinate. That's specific to its architecture and how it was trained.
There's a fairly large variety of architectures that can be used for STT/ASR. Some of them are designed for "offline" / "batch" / pre-recorded audio. Some are designed for fast real-time streaming transcription.
There are more factors too like training data. And not just demographics of the speakers in the training data but audio environments too. Was the model trained on echo-y doctor offices with two people being recorded from a crappy smartphone mic or desktop mic? (It could've been! But it's an important distinction.)
And there's more factors than that, but you get the picture (e.g. are they trying to "clean up" the transcript afterwards by feeding it to an LLM, are they attempting to pre-process audio before transcription also in an attempt to boost accuracy)
There's a lot of ways to do it, meaning, there's a lot of ways to screw it up.
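As a concrete example of the "offline"/batch flavor, here's roughly what a minimal pipeline with the open-source openai-whisper package looks like (the file name and model size are just placeholders; real medical deployments layer diarization, audio preprocessing, and human review on top of this):

    # pip install openai-whisper (also needs ffmpeg installed on the system)
    import whisper

    model = whisper.load_model("base")        # smaller = faster, less accurate
    result = model.transcribe("visit_recording.wav")

    print(result["text"])                     # full transcript
    for seg in result["segments"]:
        # avg_logprob / no_speech_prob are rough per-segment heuristics,
        # not calibrated confidence; low values are worth a human re-listen.
        print(round(seg["start"], 1), round(seg["avg_logprob"], 2), seg["text"])

Nothing in that output tells you on its own whether a sentence was hallucinated, which is exactly why the choice of model and the review process around it matter so much.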