r/MachineLearning


[D] Why aren't transformers trained layer-wise?


It seems to me that, thanks to the residual path, the gradient that flows into each layer is the same regardless of which transformer layer/block it is. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) ...)

Since the input to ProjectionAndCost is just the sum of the outputs of all layers plus the initial embeddings, the gradient that arrives at L1 is the same as the gradient that arrives at L2 or L3.
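
For concreteness, here is a minimal PyTorch sketch of the expansion above. Plain linear layers stand in for full attention/FFN blocks and LayerNorm is ignored; the names L1, L2, L3 and the sizes are just illustrative. It only shows that the usual stacked computation h = h + Li(h) is literally the sum written above:

```python
import torch
import torch.nn as nn

# Toy residual blocks (real transformer blocks would be attention + FFN).
d = 8
X = torch.randn(4, d)
L1, L2, L3 = (nn.Linear(d, d) for _ in range(3))

# Standard stacked form: h = h + Li(h) at every block.
h = X
for block in (L1, L2, L3):
    h = h + block(h)

# Unrolled form from the post:
# X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X)))
h1 = X + L1(X)
h2 = h1 + L2(h1)
h3 = h2 + L3(h2)

print(torch.allclose(h, h3))  # True: the two forms are the same computation
```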

So we could:

  • first train only L1: ProjectionAndCost(X + L1(X))

  • freeze L1, include L2 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)))

  • freeze L1 and L2, include L3 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))))

  • ... and so on

We can't train L2 first and then L1, because the input to L2 depends on L1, but we could train the lower layers first and then gradually add and train the deeper layers. Is there any problem with that approach?
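
Here is a rough sketch of the schedule I mean, again with toy linear residual blocks instead of real transformer blocks and a random regression loss standing in for ProjectionAndCost; all names and hyperparameters are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: residual linear blocks stand in for transformer blocks.
d, steps = 8, 100
blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(3)])
projection = nn.Linear(d, d)  # the "ProjectionAndCost" head, loss applied below

def forward(X, active_blocks):
    h = X
    for block in active_blocks:
        h = h + block(h)          # residual path, as in the expansion above
    return projection(h)

X = torch.randn(64, d)
Y = torch.randn(64, d)

# Progressive schedule: train block k with blocks 0..k-1 frozen.
for k in range(len(blocks)):
    for prev in blocks[:k]:       # freeze everything already trained
        prev.requires_grad_(False)
    params = list(blocks[k].parameters()) + list(projection.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(forward(X, blocks[:k + 1]), Y)
        loss.backward()
        opt.step()
    print(f"block {k} trained, final loss {loss.item():.4f}")
```

(The projection head keeps training at every stage here; whether to also freeze it is part of the question.)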