HN.zip

LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?

98 points by realberkeaslan - 30 comments
dnhkng [3 hidden]5 mins ago
Author here. Another thing I want to highlight: the language-agnostic "thinking space" finding came from Evan Maunder, who read Part 1 and ran an elegant experiment — same sentence in English, Mandarin, and Base64, cosine similarity at every layer. The representations converge by the early layers, stay nearly identical through the mid-stack, then diverge again at the end as the model commits to an output format.

I extended this to a 2×2 design (two languages × two content types) and the result is even starker: by layer 10, cross-language same-content pairs are more similar than same-language different-content pairs. The model cares about what you're saying, not what language you're saying it in.
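The per-layer comparison can be sketched like this (hidden states assumed already extracted, e.g. with `output_hidden_states=True` in transformers; the helper name is just for illustration):

```python
import torch
import torch.nn.functional as F

def layerwise_cosine(hidden_a, hidden_b):
    """Cosine similarity between two prompts at every layer.

    hidden_a / hidden_b: sequences of [seq_len, d_model] tensors, one per
    layer (e.g. the hidden_states tuple from a forward pass with
    output_hidden_states=True, squeezed to drop the batch dim). Tokens are
    mean-pooled so prompts of different lengths (English vs. Mandarin vs.
    Base64) stay comparable.
    """
    sims = []
    for ha, hb in zip(hidden_a, hidden_b):
        va, vb = ha.mean(dim=0), hb.mean(dim=0)   # pool over tokens
        sims.append(F.cosine_similarity(va, vb, dim=0).item())
    return sims
```

Plotting these values per layer is what produces the converge/plateau/diverge shape: low at the first layers, near 1.0 through the mid-stack for same-content pairs, dropping again near the output.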

This is also what makes layer duplication work — those mid-stack layers operate in a space where input and output distributions match, so you can loop through them without breaking anything. The encoding and decoding boundaries are where the blue walls show up in the heatmaps.

hmokiguess [3 hidden]5 mins ago
I'm trying to understand what you said; please correct me if I'm wrong here.

Would this be sort of like saying the way embeddings of different primitives across languages end up distributed in a vector space all follow the same principles and "laws"?

For example, if I train on a large corpus of English and, separately, on a large corpus of Spanish, will language constructs that are equivalent across both end up represented using the same vector-space patterns?

1bpp [3 hidden]5 mins ago
A fun thing to do is convince a model to fluidly switch between character sets to express ideas as 'efficiently' as possible. It likes to use Chinese hanzi a lot for abstract concepts. I've also seen Gemini use them unprompted in the middle of an English sentence.
mikkupikku [3 hidden]5 mins ago
AIs code-switching between human languages is cyberpunk AF.
theredsix [3 hidden]5 mins ago
Extrapolating from the benchmarks, this would imply the best RYS 27B is capable of outperforming the 397B MoE?
vessenes [3 hidden]5 mins ago
David,

Thanks for this research. I remember being stunned when Goliath showed up and... worked; this feels like under-explored research right now.

I've been thinking about implications of this for local generation -- what's really nice about a repeated layer is it takes up no extra memory -- and therefore works well on the edge.

Can you suggest some exploration angles on the edge side? I've recently started looking at fixing expert layers for an entire generation run as interesting - basically you pay the memory cost once for loading in selected experts - and I think RYS type thinking is a natural extension of this. If you've got some ideas, I'm all ears.

driese [3 hidden]5 mins ago
Ever since I read about this, I have been thinking about the next logical step: train a NN to route the internal loops dynamically after each layer. Instead of just choosing a given set of layers that are repeated, let the new classifier decide whether it wants to loop, where it wants to loop, whether to loop multiple times, to loop a big part, or to just jump to the final layers straight away. Each token could loop more or less based on its relevance.

It has some similarities to an MoE architecture, but instead of choosing experts, it chooses layer routes. Training this router together with the LLM could drastically reduce the number of layers required for a given level of intelligence, if it works. If anyone wants to work on this, feel free to send me a message.
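A toy sketch of what such a router could look like (entirely hypothetical: the class name, the three-action scheme, and the loop cap are mine, and a real version would need a differentiable routing decision, e.g. Gumbel-softmax, to train jointly with the LLM):

```python
import torch
import torch.nn as nn

class RoutedStack(nn.Module):
    """Hypothetical dynamically-routed layer stack: after each block, a
    small linear router picks one of three actions for the current state:
    0 = continue to the next block, 1 = loop this block again,
    2 = exit straight to the output."""

    def __init__(self, d_model=16, n_layers=4, max_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.routers = nn.ModuleList(
            nn.Linear(d_model, 3) for _ in range(n_layers))
        self.max_loops = max_loops

    def forward(self, h):
        # h: [d_model] state for a single token
        trace = []                     # which block index ran at each step
        i = 0
        while i < len(self.blocks):
            loops = 0
            while True:
                h = torch.tanh(self.blocks[i](h))
                trace.append(i)
                action = self.routers[i](h).argmax().item()
                if action == 1 and loops < self.max_loops:
                    loops += 1         # run the same block again
                    continue
                break
            if action == 2:            # early exit: skip remaining blocks
                break
            i += 1
        return h, trace
```

Each token then spends a variable amount of compute: the returned trace shows which blocks were run, in order, with repeats where the router chose to loop.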

dnhkng [3 hidden]5 mins ago
Thanks!

I have pushed basic code to GitHub (https://github.com/dnhkng/RYS)

Some interesting areas to explore might be a combination of deleting some layers and duplicating others, i.e. reduce VRAM by dropping some layers (this works and is well documented) and recover performance by duplicating others (which adds no extra VRAM). I am not pursuing this, but it seems interesting!
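The drop-then-duplicate idea could be sketched something like this (the helper name is mine; indices into `repeat_block` refer to the post-drop list, and a real implementation would operate on a model's actual block list, e.g. `model.model.layers`):

```python
import torch.nn as nn

def rebuild_layers(layers, drop, repeat_block):
    """Hypothetical helper: drop the blocks whose (original) indices are
    in `drop`, then run one contiguous block of the surviving stack twice.
    `repeat_block` = (start, end), inclusive, indexed into the post-drop
    list. The repeated entries are the *same* modules (shared weights),
    so the duplication adds no parameter memory."""
    kept = [l for i, l in enumerate(layers) if i not in drop]
    start, end = repeat_block
    return nn.ModuleList(
        kept[:end + 1] + kept[start:end + 1] + kept[end + 1:])
```

Because the duplicated entries are references to the same modules, the VRAM saved by dropping layers is kept while the extra forward passes try to recover the lost quality.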

vessenes [3 hidden]5 mins ago
Thanks -- interesting. I like the idea of ablating layers. I guess you could get a differentiable stack that has a layer skip and layer copy/loop and a total memory use loss function; that would let someone ship either a big (usually ablate) or little (usually copy) model. The expert routing for longer sequences interests me a lot because the edge inference issue is always memory bandwidth.
yodon [3 hidden]5 mins ago
If you look at convolutional neural nets used in image processing, it's super common for the first layer or so to learn a family of wavelet basis functions. Later layers then do recognition in wavelet space, without that space ever being explained or communicated to the training algorithm.

This work here is obviously more complex than that, but suggests something similar is going on with early layers transforming to some sort of generalized basis functions defining a universal language representation.

saidnooneever [3 hidden]5 mins ago
It sometimes makes me think of a video I saw of a guy (Daniel Tammet) who has a brain difference that makes him extremely fast at language learning. He said all languages carry the same patterns for him, which he perceives through synesthesia or whatever.

He learned Icelandic in a week and had a fluent conversation on their national TV to prove it. (This is nuts; that language is extremely difficult to pick up, with nasal sounds etc.)

Of course, such abilities are nowhere near average for a human, but I wonder if at some point LLMs and AI models might shed light on these kinds of abstractions (like some mentioned in the comments about image recognition algos) in a way that helps humans actually learn these things, train on them, and perhaps even be taught them as a skill.

dnhkng [3 hidden]5 mins ago
Author here. The result that surprised me most: after evaluating 3,024 beam search candidates, training a surrogate model on ~4,600 measurements, and scoring 2 million configurations — the Pareto-optimal configs were all simple contiguous blocks. No exotic multi-block compositions, no sparse repeats. Just "repeat layers 31–33" and you're on the efficiency frontier.

I think this says something interesting about how transformers organise computation internally. The mid-stack reasoning circuits are coherent enough that you can loop through them twice without distribution mismatch. The encoding/decoding boundaries are not.
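To make "repeat layers 31-33" concrete: it just means one contiguous range of block indices appears twice in the forward pass. A minimal sketch (function name mine):

```python
def execution_order(n_layers, start, end):
    """Layer indices visited in a forward pass when one contiguous block
    [start, end] (inclusive) is run twice, as in 'repeat layers 31-33'."""
    base = list(range(n_layers))
    return base[:end + 1] + base[start:end + 1] + base[end + 1:]
```

For example, `execution_order(40, 31, 33)` visits layers 0-33 normally, then 31-33 a second time, then 34-39, which is exactly the kind of simple contiguous-block config the search kept landing on.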

big_toast [3 hidden]5 mins ago
This was a little dense for me to grok. Are these well known results or is there an abstract-like summary?

The RYS (repeat yourself) hypothesis is that duplicating (the right) layers is enough to improve performance (sorry for not reading closely enough; it's really just stacking the relevant layers?).

The ERD (encoding, reasoning, decoding) layer structure is a relatively robust observation? That the middle layers of the NN reason in a universal space, which is kinda evidenced by the cosine similarities of the hidden states at each layer given similar or dissimilar inputs? And that similar inputs converge by layer 5, and you can kinda watch that happen in the cosine similarities?

This post is incredible and I'm afraid it'll drop off the front page before people engage deeply with it. (The methodology was interesting, maybe there's other big ideas I'm missing.)

wongarsu [3 hidden]5 mins ago
I find the RYS result far more surprising than the ERD result. Encode-Reasoning-Decode is after all a very popular way to design neural networks (even an autoencoder is just that without the reasoning step), the same structure emerging from optimization isn't that surprising.

But the methodology to measure it and put numbers on which layers are most involved in encoding/decoding and where the reasoning takes place is very valuable.

The finding that the phases are more cleanly separated in large-ish models is interesting. I wonder what this could mean for embedding models? Usually we take small LLMs and chop off the last couple of layers to get an embedding model. But I wonder if you could get better embedding models using something like the first five layers of Qwen3.5-27B, or the first X layers of Kimi K2.5? The methodology in the article seems to give a straightforward way to find the optimal cutting point.
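The "cut after layer k" idea could look roughly like this (a sketch under the assumption that you already have the `hidden_states` tuple a Hugging Face model returns with `output_hidden_states=True`; the function name is mine):

```python
import torch

def truncated_embedding(hidden_states, cut_layer):
    """Sketch: mean-pool the hidden state after `cut_layer` into an
    L2-normalised sentence embedding. In the Hugging Face convention,
    hidden_states[0] is the token embeddings and hidden_states[k] is the
    output of transformer block k."""
    h = hidden_states[cut_layer]              # [batch, seq_len, d_model]
    emb = h.mean(dim=1)                       # pool over tokens
    return emb / emb.norm(dim=-1, keepdim=True)
```

Sweeping `cut_layer` over the encoding/reasoning boundary found by the article's method would then tell you where the representation is most language-agnostic.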

vibe42 [3 hidden]5 mins ago
Perhaps not widely known, but certainly known in LLM research. There were a bunch of these experiments done two years ago, and what's interesting is that it still seems to work on the latest models.

Though beware that the increased scores on math and EQ could lead to other areas scoring less well; I would love to see how these models score on all open benchmarks.

v9v [3 hidden]5 mins ago
The author claimed that the models he modified with this layer repetition method topped the huggingface open llm leaderboard in his first post: https://dnhkng.github.io/posts/rys/

Do you remember the names of the previous experiments done on this? Would love to take a look.

vibe42 [3 hidden]5 mins ago
Just learned about it the other day from this thread from Feb, 2024: https://old.reddit.com/r/LocalLLaMA/comments/1aqrd7t/i_made_...

Has some interesting github links.

notnullorvoid [3 hidden]5 mins ago
Incredible research. I wonder how close we are to outputting the universal language into its own reasoning context (which skips the encoding layers), then using the later decoding layers to lazily inspect the reasoning context.
vibe42 [3 hidden]5 mins ago
This is orthogonal to quantisation. It could have a big impact on smaller models in the 4B-14B range, where people often try specific quants and context sizes to fit into the VRAM of a laptop/desktop GPU.
yodon [3 hidden]5 mins ago
Apologies if I missed this in the article (or in the first article in the series) - what happens if you add two copies of the layer set? Does performance improve over adding one copy of the layer set?
dnhkng [3 hidden]5 mins ago
Author here: That was done in this blog post, in the beam search. I started with the best re-layer configs and iteratively added more blocks, including the same block multiple times, during a long beam search.

It turns out this does not help (somewhat surprisingly).

skyde [3 hidden]5 mins ago
Actually not surprised. I guess this works for the same reason "say it twice" [1] works: because LLMs are trained as causal language models, past tokens cannot attend to future tokens. One extra copy of the layer set solves this.

[1] https://arxiv.org/html/2512.14982v1
JPLeRouzic [3 hidden]5 mins ago
Has anyone started to implement this technique in Llama.cpp or similar inference tool?
dnhkng [3 hidden]5 mins ago
There was some work done on this a while back, during the FrankenMerge craze of '23.

I am working with TurboDerp to integrate this into the Exllama v3 format.

sigbottle [3 hidden]5 mins ago
Wow, super interesting keywords. Are you a ML researcher? What kind of experiments do you do?
lostmsu [3 hidden]5 mins ago
How's the reproducibility of the results? Like avg score of 10 runs vs original.
dnhkng [3 hidden]5 mins ago
Author here: The code is up on GitHub.

The probes I used seem to help identify good configurations, but they are quite noisy. A small probe set was initially used to make the scan tractable, and then the higher-ranked models were retested on a set ~10x larger.

_lex [3 hidden]5 mins ago
We've discovered the language. It changes the economics of computing.

As in, this entire cloud buildout is unnecessary because it becomes like using a calculator.

Reach out to chat.

cjameskeller [3 hidden]5 mins ago
Would you be willing to elaborate? I would be curious to hear more.
_lex [3 hidden]5 mins ago
Shoot me an email and let's jump on a call. I'll blow your mind.