r/MachineLearning


[D] Why aren't transformers trained layer-wise?


It seems to me that, thanks to the residual path, the gradient that flows into each layer is the same regardless of which transformer layer/block it is. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) ...)

Since the input to ProjectionAndCost is just the sum of the outputs of all layers plus the initial embeddings, the gradient that arrives at L1 is the same as the gradient that arrives at L2 or L3.
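
For concreteness, here is a minimal PyTorch sketch of the expansion above. Plain linear layers stand in for full attention/FFN blocks and LayerNorm is ignored; the names L1, L2, L3 and the sizes are just illustrative. It only shows that the usual stacked computation h = h + Li(h) is literally the sum written above:

```python
import torch
import torch.nn as nn

# Toy residual blocks (real transformer blocks would be attention + FFN).
d = 8
X = torch.randn(4, d)
L1, L2, L3 = (nn.Linear(d, d) for _ in range(3))

# Standard stacked form: h = h + Li(h) at every block.
h = X
for block in (L1, L2, L3):
    h = h + block(h)

# Unrolled form from the post:
# X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X)))
h1 = X + L1(X)
h2 = h1 + L2(h1)
h3 = h2 + L3(h2)

print(torch.allclose(h, h3))  # True: the two forms are the same computation
```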

So we could:

  • first train only L1: ProjectionAndCost(X + L1(X))

  • freeze L1, include L2 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)))

  • freeze L1 and L2, include L3 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))))

  • ... and so on

We can't train L2 first and then L1, because the input to L2 depends on L1, but we could train the lower layers first and then gradually add and train the deeper layers. Is there any problem with that approach?
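
Here is a rough sketch of the schedule I mean, again with toy linear residual blocks instead of real transformer blocks and a random regression loss standing in for ProjectionAndCost; all names and hyperparameters are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: residual linear blocks stand in for transformer blocks.
d, steps = 8, 100
blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(3)])
projection = nn.Linear(d, d)  # the "ProjectionAndCost" head, loss applied below

def forward(X, active_blocks):
    h = X
    for block in active_blocks:
        h = h + block(h)          # residual path, as in the expansion above
    return projection(h)

X = torch.randn(64, d)
Y = torch.randn(64, d)

# Progressive schedule: train block k with blocks 0..k-1 frozen.
for k in range(len(blocks)):
    for prev in blocks[:k]:       # freeze everything already trained
        prev.requires_grad_(False)
    params = list(blocks[k].parameters()) + list(projection.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(forward(X, blocks[:k + 1]), Y)
        loss.backward()
        opt.step()
    print(f"block {k} trained, final loss {loss.item():.4f}")
```

(The projection head keeps training at every stage here; whether to also freeze it is part of the question.)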