"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1, it technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Why is this a "pseudo-probability distribution?"
qurren
One thing very much worth noting that the article does not mention:
The reason it's called "temperature" is that softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the probability distribution over energy states of an ensemble of particles in equilibrium. In terminology more familiar to ML folks, the particles' energies are distributed as the softmax of their negative energies divided by the temperature (in Kelvin), with units scaled by the Boltzmann constant (k_B).
Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced into their lowest energy state; in LLMs, the model is forced to deterministically predict the single most likely logit/token.
Now to draw another analogy for what happens at high temperatures: the reason a heating element glows red when it is hot is that the expectation value (mean) of the energies under this softmax distribution rises with temperature, and once the energy gets high enough, the particles start shedding energy as photons energetic enough to fall in the visible spectrum. Incandescent bulbs with tungsten filaments are even hotter than that heating element and glow white: at an even higher temperature T, the softmax distribution's mean energy moves higher and the distribution flattens out, covering the whole visible spectrum somewhat more uniformly. In the case of the bulb, photons of all sorts of wavelengths are being spewed out; that's white light. Likewise, if you set an LLM's temperature to an absurdly high number, it spews out a very wide spectrum of mostly nonsense tokens.
[1] https://en.wikipedia.org/wiki/Boltzmann_distribution
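To make the temperature analogy concrete, here's a rough numerical sketch (toy logits of my own, not from the article): dividing the logits by T before softmax collapses toward the argmax as T → 0 and flattens toward uniform as T grows.

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.0])
    for T in (0.1, 1.0, 100.0):
        print(T, softmax(logits / T))
    # T = 0.1   -> ~[1.000, 0.000, 0.000]  near-deterministic, the "absolute zero" limit
    # T = 1.0   -> ~[0.665, 0.245, 0.090]  ordinary softmax
    # T = 100.0 -> ~[0.337, 0.333, 0.330]  nearly uniform, the "white light" regime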
On a tangential note, I keep noticing phrases like "why x matters" and "it's crucial here" that just remind me of Claude. Recently Claude has been gaslighting me on complex problems with such statements, and seeing them in an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems, where it sometimes gets the answer right but completely misses the point and introduces huge complex blocks of code and logic, complete with "why it matters" and "this is crucial here".
snovv_crash
I've seen many posts on Reddit about this AI-induced 'psychosis', where people end up believing the words that get generated for them without applying sufficient critical thought.
This sycophancy is a serious problem and exploits a weakness in the human psyche (flattery) that may be easier for RLHF to find reward in than genuinely correct responses.
xchip
"This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1"
Not really: softmax transforms logits (logarithms of probabilities) into probabilities.
Probabilities → logits → back again.
Start with p = [0.6, 0.3, 0.1].
Logits = log(p) = [-0.51, -1.20, -2.30].
Softmax(logits) = original p.
NNs prefer to output logits because they are linear and range over all of -inf to +inf.
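A quick check of that round trip (NumPy; my own verification, not from the comment):

    import numpy as np

    p = np.array([0.6, 0.3, 0.1])
    logits = np.log(p)            # [-0.51, -1.20, -2.30]
    e = np.exp(logits)
    print(e / e.sum())            # [0.6, 0.3, 0.1] -- the original p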
ForceBru
It's still true that softmax transforms arbitrary vectors into probability vectors.
In your example you'll also get the original `p` with just `exp(logits)`. Softmax normalizes the output to sum to one, so it can output a probability vector even if the input is _not_ simply `log(p)`.
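To illustrate that point with a toy check of my own: exp alone recovers p here only because this particular p already sums to 1. Shift the logits by any constant (still valid logits for the same distribution) and exp no longer gives probabilities, while softmax's normalization still recovers p.

    import numpy as np

    p = np.array([0.6, 0.3, 0.1])
    logits = np.log(p)

    print(np.exp(logits))         # [0.6, 0.3, 0.1] -- only works because p already sums to 1

    shifted = logits + 5.0        # shifted logits describe the same distribution
    print(np.exp(shifted))        # ~[89.0, 44.5, 14.8] -- not a probability vector
    e = np.exp(shifted)
    print(e / e.sum())            # [0.6, 0.3, 0.1] -- softmax's normalization recovers p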
"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1, it technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Why is this a "pseudo-probability distribution?"
The reason "temperature" is called such is because softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the probability distribution of energy states of an ensemble of particles in equilibrium. In terminology more well understood by ML folks, the particles' energies will be distributed as the softmax of their negative energies divided by their temperatures (in Kelvin). Units are scaled by the Boltzmann constant (k_B).
Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced to their lowest energy state, in LLMs, the model is forced to deterministically predict the single most likely logit/token.
Now to drow another analogy for what happens at high temperatures: the reason a heating element glows red when it is hot is because if you take the expectation value (mean) of energies under this softmax distribution, that mean goes up with temperature, and when the energy gets high enough, the particles start shaking off energy in the form of photons that are now high energy enough to be in the visible spectrum. Incandescent bulbs with tungsten filaments are even hotter than that heating element, and glow white because as temperature T is even higher, the softmax distribution's mean energy moves higher and flattens out, and it roughly covers the whole visible spectrum somewhat more uniformly. In the case of the bulb, photons of all sorts of wavelengths are being spewed out, that's white light. Likewise, if you set an LLM's temperature to an absurdly high number, it spews out a very wide spectrum of mostly nonsense tokens.
[1] https://en.wikipedia.org/wiki/Boltzmann_distribution
On a tangential note, I keep noticing "why x matters", "it's crucial here" that just remind me of Claude. Recently Claude has been gaslighting me in complex problems with such statements and seeing them on an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems where it sometimes gets the answer right but completely misses the point and introduces huge complex blocks of code and logic with precisely "why it matters", "this is crucial here".
This sycophancy is a serious problem and exploits a weakness in the human psyche (flattery) which may be easier for the RLHF to find reward in than genuinely correct responses.
Not really, softmax transforms logits (logariths of probabilities) into probabilities.
Probabilities → logits → back again.
Start with p = [0.6, 0.3, 0.1]. Logits = log(p) = [-0.51, -1.20, -2.30]. Softmax(logits) = original p.
NN prefer to output logits because they are linear and go from -inf to +inf.
In your example you'll also get the original `p` with just `exp(logits)`. Softmax normalizes the output to sum to one, so it can output a probability vector even if the input is _not_ simply `log(p)`.