r/MachineLearning

[D] Why are transformers not trained layer-wise?
Discussion

It seems to me that, thanks to the residual path, the gradient that flows into each layer is the same regardless of the layer's depth in the transformer. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) ...)

Since the input to ProjectionAndCost is just the sum of the outputs of all layers plus the initial embeddings, the gradient that reaches L1 is the same as the gradient that reaches L2 or L3.
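
For what it's worth, the equal-gradient part of this claim is easy to check with autograd. Here is a tiny sketch, assuming PyTorch, where each layer's output is modeled as an independent tensor added to the stream; the shapes and the quadratic stand-in for ProjectionAndCost are arbitrary placeholders:

```python
import torch

x = torch.randn(4, 8)
l1 = torch.randn(4, 8, requires_grad=True)  # stand-ins for the layer outputs
l2 = torch.randn(4, 8, requires_grad=True)
l3 = torch.randn(4, 8, requires_grad=True)

s = x + l1 + l2 + l3            # the residual-stream sum fed to the head
cost = (s ** 2).sum()           # arbitrary scalar stand-in for ProjectionAndCost
cost.backward()

# Because s is a plain sum, d(cost)/d(l_i) is identical for every summand.
assert torch.allclose(l1.grad, l2.grad) and torch.allclose(l2.grad, l3.grad)
```

(In a real transformer the layer outputs are not independent tensors, but the gradient arriving at each layer's output through the residual sum itself is the same.)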

So we could:

  • first train only L1: ProjectionAndCost(X + L1(X))

  • freeze L1, include L2 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)))

  • freeze L1 and L2, include L3 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))))

  • .. and so on

We can't train L2 first and then L1, because the input to L2 depends on L1, but we could train the lower layers first and then gradually add and train the deeper ones, as sketched below. Is there any problem with that approach?
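
For concreteness, here is a minimal sketch of that schedule, assuming PyTorch and a toy residual stack. ResidualBlock, get_batch, and the classification head are hypothetical placeholders standing in for real transformer blocks, a data loader, and ProjectionAndCost; note that the head is retrained at every stage:

```python
import torch
import torch.nn as nn

d_model = 64

class ResidualBlock(nn.Module):
    """Stand-in for one transformer block; only the residual wiring matters here."""
    def __init__(self, d_model):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))
    def forward(self, x):
        return x + self.ff(x)              # X + L_i(X): the residual path

blocks = nn.ModuleList(ResidualBlock(d_model) for _ in range(3))
head = nn.Linear(d_model, 10)              # "ProjectionAndCost" projection
loss_fn = nn.CrossEntropyLoss()

def get_batch():
    # Hypothetical data source: random inputs/labels just to make this runnable.
    return torch.randn(32, d_model), torch.randint(0, 10, (32,))

# Stage k trains only block k (plus the head); earlier blocks are frozen,
# deeper blocks are not part of the forward pass yet.
for k in range(len(blocks)):
    for i, blk in enumerate(blocks):
        blk.requires_grad_(i == k)         # freeze everything except block k
    opt = torch.optim.Adam(
        list(blocks[k].parameters()) + list(head.parameters()), lr=1e-3)
    for step in range(100):
        x, y = get_batch()
        h = x
        for blk in blocks[:k + 1]:         # only the first k+1 blocks so far
            h = blk(h)
        loss = loss_fn(head(h), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```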


[D] Old Paper - Troubling Trends in Machine Learning Scholarship
Discussion

I just wanted to remind people of this paper, or introduce it to newcomers. I think this discussion should be reopened, since many people here actually do influence the trends of the field.

https://arxiv.org/pdf/1807.03341


On a personal note (feel free to skip):

Specifically, I want to point out the issue of "mathiness", since this problem seems to have gotten way out of hand: many of the best papers at conferences suffer from it (one of the most important ML papers tried to be mathy and introduced a big mistake; I believe other papers have bigger issues, but no one bothers to check).

So here are my personal points to academics and researchers:

  1. We practitioners (I think most will relate) do not need equations to know what recall is, and we clearly don't want to read a difficult-to-understand restatement of what linear regression is; it just makes your paper less useful. If you don't want to waste our time, please put that material in the appendix or remove it entirely.

  2. Reviewers, please don't be impressed by unnecessary math. If it's complicated and adds nothing useful, who cares? It might also be flawed, and you will probably not catch it.