A related viewpoint is that overparametrization is good because the optimizer only gets stranded when the Hessian has all positive/zero eigenvalues, i.e. at a local minimum. If we treat the sign of each Hessian eigenvalue as an independent Bernoulli trial, the chance of all eigenvalues being positive/zero decreases exponentially as the parameter count increases.
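To make the Bernoulli argument concrete, here is a minimal sketch. It assumes each eigenvalue is nonnegative independently with the same probability p; that independence is the simplification from the comment, not a property of real Hessians, whose eigenvalues are correlated. The function names and p = 0.5 are made up for illustration.

```python
import random

def prob_all_nonnegative(p, n):
    """Closed form: chance that n independent signs are all nonnegative."""
    return p ** n

def monte_carlo(p, n, trials=100_000, seed=0):
    """Estimate the same probability by sampling n Bernoulli(p) signs per trial."""
    rng = random.Random(seed)
    hits = sum(all(rng.random() < p for _ in range(n)) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    # Compare the closed form with the sampled estimate as n grows.
    for n in (1, 5, 10, 20):
        print(n, prob_all_nonnegative(0.5, n), monte_carlo(0.5, n))
```

Under this toy model the probability is p^n, so each extra parameter multiplies the chance of a "stuck" point by p.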
You don't need billions of parameters for that, precisely because the risk of being stuck at a local minimum decreases exponentially with the number of parameters. Right?
vatsachak [3 hidden]5 mins ago
Isn't this trivial?
What's more interesting is why double descent happens.
Scene_Cast2 [3 hidden]5 mins ago
IIRC the original author of the Lottery Ticket Hypothesis now disavows that idea.
One intuitive way of looking at it is like so - let's say that you have a gaussian-looking plot. You want to fit a gaussian. You have a stupid simple model where you can slide your gaussian left and right.
If your initial starting point happens to be roughly within range, great, your optimizer will take care of it for you and slide it into the correct place. If you're too far, too bad, no meaningful gradient.
Instead, neural nets give you the option to spawn a gaussian anywhere you please. In this case, no sliding is necessary, but it comes at a heavy parametrization cost.
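A rough sketch of the "too far, no meaningful gradient" point, under made-up assumptions: unit-width Gaussian bumps, a target centered at 0, and a mean squared error loss averaged over a fixed grid. None of these specifics come from the comment above; they are only there to make the effect visible.

```python
import math

def bump(x, mu):
    """Unit-width Gaussian bump centered at mu."""
    return math.exp(-0.5 * (x - mu) ** 2)

def loss_grad(mu, target_mu=0.0, lo=-40.0, hi=40.0, steps=4001):
    """Gradient w.r.t. mu of the mean squared error between
    bump(x, mu) and bump(x, target_mu), averaged over a grid of x values."""
    total = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        residual = bump(x, mu) - bump(x, target_mu)
        # d/dmu bump(x, mu) = bump(x, mu) * (x - mu)
        total += 2.0 * residual * bump(x, mu) * (x - mu)
    return total / steps

if __name__ == "__main__":
    print("gradient starting near the target:", loss_grad(1.0))
    print("gradient starting far away:      ", loss_grad(20.0))
```

Starting at mu = 1 the gradient is clearly nonzero and points back toward the target. Starting at mu = 20 the two bumps barely overlap, so the gradient is many orders of magnitude smaller (down at floating-point noise) and the optimizer has nothing to follow.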
getnormality [3 hidden]5 mins ago
A while ago a lot of the discussion about overparameterization was about explaining "double descent", the observation that test error doesn't descend monotonically and actually hits a local maximum around the point where the model has just enough parameters to interpolate the data. My favorite article about double descent looks at this in terms of splines [1]. If I can try to summarize that article: when you are designing a parametrized model to fit to data, you have a choice. You can either:
1. Avoid overparameterization by design. Manually create or choose a space of functions that has limited degrees of freedom by construction.
2. Accept overparameterization and regularize.
The latter tends to be more robust, because of the bitter lesson. It's not practical to manually design an ideal, on-demand, just-right limited-parameter model for every dataset we are presented with. The best way to approach that ideal, it turns out, is really to just let the computer figure it out via regularized optimization over an overparameterized space.
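As a toy illustration of option 2, here is a sketch of ridge regression over more basis functions than data points. Everything specific here is an assumption for the example: the Gaussian-bump features, the 8 sample points, the 21 centers, and the two penalty strengths. It is not the method from the linked article.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def features(x, centers, width=0.5):
    """One Gaussian bump per center: far more knobs than the data needs."""
    return [math.exp(-0.5 * ((x - c) / width) ** 2) for c in centers]

def ridge_fit(xs, ys, centers, lam):
    """Minimize squared error + lam * ||w||^2 via the normal equations."""
    phi = [features(x, centers) for x in xs]
    k = len(centers)
    gram = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(k)]
    return solve(gram, rhs)

def train_sse(xs, ys, centers, w):
    """Sum of squared training errors for weights w."""
    return sum((sum(wi * f for wi, f in zip(w, features(x, centers))) - y) ** 2
               for x, y in zip(xs, ys))

def norm(w):
    return math.sqrt(sum(v * v for v in w))

xs = [-3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]   # 8 data points
ys = [math.exp(-0.5 * x * x) for x in xs]           # Gaussian-looking target
centers = [-4.0 + 0.4 * i for i in range(21)]       # 21 parameters
w_weak = ridge_fit(xs, ys, centers, lam=1e-3)
w_strong = ridge_fit(xs, ys, centers, lam=1.0)

if __name__ == "__main__":
    print("weak penalty:   norm", norm(w_weak), "train SSE", train_sse(xs, ys, centers, w_weak))
    print("strong penalty: norm", norm(w_strong), "train SSE", train_sse(xs, ys, centers, w_strong))
```

With 21 weights and 8 points the unpenalized problem has no unique solution. The penalty picks one, and turning it up trades training fit for smaller weights, which is the knob you tune instead of hand-designing the function space.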
Statisticians started moving in favor of overparameterization long before deep learning got off the ground. This trend dates back at least to the machine learning bible, Elements of Statistical Learning (2001).
> This trend dates back at least to the machine learning bible, Elements of Statistical Learning (2001).
Could you elaborate on this?
porridgeraisin [3 hidden]5 mins ago
Hi, I work on RL, or as it is known today, "classical" RL. I'm interested in knowing the latest work that explains double descent and in general optimisation behaviour of overparameterized neural networks. Do you have a survey paper or blog post or anything else to recommend?
WithinReason [3 hidden]5 mins ago
How is this view inconsistent with the lottery ticket hypothesis?
[1] https://arxiv.org/abs/1406.2572
[1] https://mlu-explain.github.io/double-descent/