r/MachineLearning

[D] Why are transformers not trained layer-wise?
Discussion

It seems to me that, thanks to the residual path, the gradient that flows into each layer is the same regardless of the layer's depth in the transformer. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) ...)

Since the input to ProjectionAndCost is just the sum of the outputs of all layers plus the initial embeddings, the gradient that reaches L1 is the same as the gradient that reaches L2 or L3.
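
For what it's worth, the equal-gradient part of this claim is easy to check with autograd. Here is a tiny sketch, assuming PyTorch, where each layer's output is modeled as an independent tensor added to the stream; the shapes and the quadratic stand-in for ProjectionAndCost are arbitrary placeholders:

```python
import torch

x = torch.randn(4, 8)
l1 = torch.randn(4, 8, requires_grad=True)  # stand-ins for the layer outputs
l2 = torch.randn(4, 8, requires_grad=True)
l3 = torch.randn(4, 8, requires_grad=True)

s = x + l1 + l2 + l3            # the residual-stream sum fed to the head
cost = (s ** 2).sum()           # arbitrary scalar stand-in for ProjectionAndCost
cost.backward()

# Because s is a plain sum, d(cost)/d(l_i) is identical for every summand.
assert torch.allclose(l1.grad, l2.grad) and torch.allclose(l2.grad, l3.grad)
```

(In a real transformer the layer outputs are not independent tensors, but the gradient arriving at each layer's output through the residual sum itself is the same.)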

So we could:

  • first train only L1: ProjectionAndCost(X + L1(X))

  • freeze L1, include L2 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)))

  • freeze L1 and L2, include L3 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))))

  • .. and so on

We can't train L2 first and then L1, because the input to L2 depends on L1, but we could train the lower layers first and then gradually add and train the deeper ones, as sketched below. Is there any problem with that approach?
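
For concreteness, here is a minimal sketch of that schedule, assuming PyTorch and a toy residual stack. ResidualBlock, get_batch, and the classification head are hypothetical placeholders standing in for real transformer blocks, a data loader, and ProjectionAndCost; note that the head is retrained at every stage:

```python
import torch
import torch.nn as nn

d_model = 64

class ResidualBlock(nn.Module):
    """Stand-in for one transformer block; only the residual wiring matters here."""
    def __init__(self, d_model):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))
    def forward(self, x):
        return x + self.ff(x)              # X + L_i(X): the residual path

blocks = nn.ModuleList(ResidualBlock(d_model) for _ in range(3))
head = nn.Linear(d_model, 10)              # "ProjectionAndCost" projection
loss_fn = nn.CrossEntropyLoss()

def get_batch():
    # Hypothetical data source: random inputs/labels just to make this runnable.
    return torch.randn(32, d_model), torch.randint(0, 10, (32,))

# Stage k trains only block k (plus the head); earlier blocks are frozen,
# deeper blocks are not part of the forward pass yet.
for k in range(len(blocks)):
    for i, blk in enumerate(blocks):
        blk.requires_grad_(i == k)         # freeze everything except block k
    opt = torch.optim.Adam(
        list(blocks[k].parameters()) + list(head.parameters()), lr=1e-3)
    for step in range(100):
        x, y = get_batch()
        h = x
        for blk in blocks[:k + 1]:         # only the first k+1 blocks so far
            h = blk(h)
        loss = loss_fn(head(h), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```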


[D] Old Paper - Troubling Trends in Machine Learning Scholarship
Discussion

I just wanted to remind people of this paper, or introduce it to newcomers. I think this discussion should be reopened, since many people here actually do influence the trends of the field.

https://arxiv.org/pdf/1807.03341


On a personal note (feel free to skip):

Specifically, I want to point out the issue of "mathiness", since this problem seems to have gotten way out of hand: many of the best papers at conferences suffer from it (one of the most important ML papers tried to be mathy and introduced a big mistake; I believe other papers have bigger issues, but no one bothers to check).

So here are my personal points to academics and researchers:

  1. We practitioners (I think most will relate) do not need equations to know what recall is, and we clearly don't want to read a difficult-to-understand restatement of what linear regression is; it just makes your paper less useful. If you don't want to waste our time, please put that material in the appendix or remove it entirely.

  2. Reviewers, please don't be impressed by unnecessary math. If it's complicated and adds nothing useful, who cares? It might also be flawed, and you will probably not catch it.