Freeze layers for transfer learning #3247

Open · wants to merge 2 commits into master from freezed_transfer
Conversation

@DanBmh (Contributor) commented Aug 13, 2020

Currently, when doing transfer learning we reinitialize the uppermost layers randomly and then continue training normally. The problem is that gradients are also propagated through the lower layers we would like to keep, so those layers get changed too.

These changes allow training only the reinitialized uppermost layers. Afterwards you can start a new training that further optimizes the complete network.
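Mechanically, freezing comes down to leaving the frozen layers' variables out of the var_list handed to the optimizer. A minimal TF1-style sketch, assuming DeepSpeech-like variable name prefixes (not the exact code in this PR):

```python
import tensorflow.compat.v1 as tfv1

# Prefixes of the layers to keep frozen; "layer_1" etc. follow the naming used
# in the discussion below and may need adjusting to the real variable scopes.
FROZEN_PREFIXES = ("layer_1", "layer_2", "layer_3", "lstm", "layer_5")

def build_train_op(loss, learning_rate=1e-4):
    # Train only the variables that do not belong to a frozen layer.
    train_vars = [
        v for v in tfv1.trainable_variables()
        if not v.name.startswith(FROZEN_PREFIXES)
    ]
    optimizer = tfv1.train.AdamOptimizer(learning_rate=learning_rate)
    # Gradients are only computed for (and applied to) the unfrozen variables.
    gradients = optimizer.compute_gradients(loss, var_list=train_vars)
    return optimizer.apply_gradients(gradients)
```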

@community-tc-integration

No Taskcluster jobs started for this pull request
The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the
default branch) does not allow starting tasks for this pull request.

@lissyx lissyx requested a review from JRMeyer August 13, 2020 13:50
@ftyers (Collaborator) commented Aug 13, 2020

@DanBmh do you have any indication that this works better than tuning the whole network? In our experiments (see earlier versions of the Common Voice paper and Josh's thesis, Chapter 8) we found that tuning all layers works quite a bit better than freezing any of them; see e.g. this figure:
[screenshot of results figure omitted]

@DanBmh (Contributor, Author) commented Aug 13, 2020

@ftyers I'm working on it right now :)

But the approach I'm suggesting is a bit different from yours: it uses both steps.

My transfer-learning workflow would look like this (sketched on the command line below):

  1. training with frozen layers
  2. training with all layers (you have to start a new training for this)

Just a side note on the topic: I'm not reinitializing the last layer if possible, because in my experiments with German this gave better results than random initialization of the last layer.
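For reference, the two-step workflow above would look roughly like this on the command line. `--freeze_source_layers` and `--load_frozen_graph` are the flags added and discussed in this PR; the other flags are standard DeepSpeech training flags, and all paths are placeholders:

```bash
# Step 1: drop and reinitialize the output layer, train only that layer
# (all other layers frozen).
python3 train.py \
    --alphabet_config_path data/alphabet_de.txt \
    --train_files de/train.csv \
    --dev_files de/dev.csv \
    --load_checkpoint_dir checkpoints/deepspeech-english \
    --save_checkpoint_dir checkpoints/de-step1 \
    --drop_source_layers 1 \
    --freeze_source_layers 1

# Step 2: start a new training from the step-1 checkpoint and fine-tune
# the complete network (no dropping, no freezing).
python3 train.py \
    --alphabet_config_path data/alphabet_de.txt \
    --train_files de/train.csv \
    --dev_files de/dev.csv \
    --load_checkpoint_dir checkpoints/de-step1 \
    --save_checkpoint_dir checkpoints/de-step2 \
    --load_frozen_graph=true
```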

@ftyers (Collaborator) commented Aug 13, 2020

@DanBmh So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?

I'd be interested in seeing the results when you have them!

@DanBmh (Contributor, Author) commented Aug 13, 2020

> So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?

Exactly


The first test was not as good as planned (using my other PR #3245; es_epochs=7; reduce_lr_plateau_epochs=3; es_min_delta=0.9; no augmentation):

| Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
| --- | --- | --- | --- | --- |
| DE | Voxforge | frozen transfer-learning, then training all layers | Test: 37.707958, Validation: 41.832220 | 12+3; Time: 42min |
| DE | Voxforge | without frozen transfer-learning | Test: 36.630890, Validation: 41.208125 | 7; Time: 28min |

Not sure why; maybe some training randomness, because I don't think this approach should lead to worse results.

@DanBmh (Contributor, Author) commented Aug 14, 2020

I ran another test; this time I tried your approach of dropping and reinitializing the last layer.
(As noted above, I normally don't drop the layer when training German, I just train over the English weights.)

| Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
| --- | --- | --- | --- | --- |
| DE | Voxforge | dropped last layer | Test: 42.516270, Validation: 47.105518 | 8; Time: 28min |
| DE | Voxforge | with frozen transfer-learning in two steps | Test: 36.600590, Validation: 40.640134 | 14+8; Time: 42min |

Here you can see an improvement when using the frozen transfer-learning approach.
(A note on the dataset: Voxforge has 31h; I'm using about 5h each for dev and test and the rest for training, so it's quite small.)


So I would say that if the network architecture did not change, it's faster to train directly with the English weights (no dropping, no freezing), but if the network had to be changed (different alphabet), it's better to train in two steps with the frozen network.

Of course we would need to do some more tests before we can say this for sure.

@lissyx (Collaborator) commented Aug 14, 2020

Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.

@JRMeyer (Contributor) commented Aug 14, 2020

@DanBmh -- even though my previous research with deepspeech seems to point to frozen-transfer not working, I still think this feature should be integrated. Your two-step approach makes perfect intuitive sense, and there's work from computer vision and NLP that shows frozen transfer works very well.

So, I think the feature would be useful, but @lissyx is right, this is a post-1.0 feature.

@JRMeyer (Contributor) commented Aug 14, 2020

@DanBmh -- I don't see why you need a new flag for load_frozen_graph. For reference, this is how I implemented transfer-learning + freezing layers before: https://github.com/mozilla/STT/blob/transfer-learning2/DeepSpeech.py#L264-L278

@DanBmh (Contributor, Author) commented Aug 14, 2020

> I don't see why you need a new flag for `load_frozen_graph`

The problem is that when loading the checkpoint for the second training, the frozen layers have no Adam variables: they were not created/saved in the first training because those layers were not being optimized. So we have to reinitialize them.
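For context, a rough sketch of what that reinitialization could look like when loading such a checkpoint (hypothetical helper, not the PR's exact code):

```python
import tensorflow.compat.v1 as tfv1

def init_missing_adam_vars(session, checkpoint_path):
    """Sketch: initialize Adam slot variables that a checkpoint does not contain.

    This can happen when the checkpoint was written by a training run in which
    some layers were frozen, so their Adam moments were never created or saved.
    """
    reader = tfv1.train.load_checkpoint(checkpoint_path)
    vars_in_ckpt = set(reader.get_variable_to_shape_map().keys())

    missing = [
        v for v in tfv1.global_variables()
        if "Adam" in v.op.name and v.op.name not in vars_in_ckpt
    ]
    if missing:
        # Mirrors the message that appears later in the review thread.
        tfv1.logging.info(
            "Some Adam tensors are missing, they will be initialized automatically."
        )
        session.run(tfv1.variables_initializer(missing))
```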

@JRMeyer (Contributor) left a comment

@DanBmh -- Firstly, this is a great addition and I'd like to see it merged!

Here are some issues I have with the current implementation:

(1) I don't foresee any scenario where the frozen layers are not identical to the transferred layers. For example, if I'm transferring four layers from model A and re-initializing the last two layers for model B, then if I freeze any layers from model A, I would freeze all four of them, not just a subset.

In this case, we can change the freeze_source_layers flag to a boolean and just use the drop_source_layers flag to deduce which layers to freeze. Otherwise we are basically duplicating the drop_source_layers flag when we don't have to (a rough sketch of this follows after this comment).

(2) I also really don't like the load_frozen_graph flag, because it requires you to know the training history of the checkpoints. Is there a way to get around this? Can we maybe re-initialize the Adam tensors before the model is saved to disk, after a training run? I think this flag will trip a lot of people up. If someone gives me a checkpoint to use as a starting point for further transfer-learning, I shouldn't have to know anything about how that checkpoint was trained.
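For illustration, suggestion (1) could be implemented roughly like this, assuming the layer naming discussed in this thread (a sketch, not the PR's code):

```python
# Ordered list of source-model layers, lowest to highest; names are assumed
# from the discussion in this PR and may differ from the actual variable scopes.
SOURCE_LAYERS = ["layer_1", "layer_2", "layer_3", "lstm", "layer_5", "layer_6"]

def layers_to_freeze(drop_source_layers, freeze_source_layers):
    """With freeze_source_layers as a boolean, freeze exactly the transferred
    layers, i.e. everything except the last `drop_source_layers` layers that
    get reinitialized for the new model."""
    if not freeze_source_layers:
        return []
    kept = len(SOURCE_LAYERS) - drop_source_layers
    return SOURCE_LAYERS[:kept]
```

For example, `layers_to_freeze(2, True)` returns the four lower layers, so only the two reinitialized layers would be trained in step one.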

@DanBmh (Contributor, Author) commented Aug 18, 2020

  1. Currently I use this for my German training (0 dropped, 1 frozen). I think this makes sense for all languages with the same alphabet as English and similar pronunciation. It could also be used for transfer learning of dialects: instead of randomly initialized weights, you start the training with somewhat matching weights.

  2. You're right about this one, and I think reinitialization after training is a good idea. I hope I can find some time in the next days to test it.

@lissyx (Collaborator) commented Aug 28, 2020

@DanBmh Please don't let this sink, if you have some time to rebase :)

@lissyx (Collaborator) commented Aug 28, 2020

> Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.

We might change our opinion here, right @reuben?

@DanBmh (Contributor, Author) commented Aug 29, 2020

@JRMeyer what do you think about reinitializing tensors named "Adam" by default if they are missing, with an additional message telling users that they are reinitialized because they are not in the checkpoint?

I can't think of an elegant way to reinitialize the Adam tensors before checkpoint saving at the moment, because we would need to reinitialize them every time before we save a checkpoint (some people might want to load intermediate checkpoints for some reason).

@reuben (Contributor) commented Aug 29, 2020

> Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
>
> We might change our opinion here, right @reuben?

Yeah, I think it should be fine to land this once the comments here have been addressed. It's also a good opportunity to make the training functionality more fine-grained, instead of the huge train.py which can do a billion different things depending on the combination of flags that's passed. It's really hard to reason about what a training call is going to do unless you're deeply familiar with the code.

@DanBmh DanBmh force-pushed the freezed_transfer branch 2 times, most recently from 67082eb to 4b20d71 on October 1, 2020 at 13:27
@DanBmh DanBmh requested a review from JRMeyer October 1, 2020 13:31
@reuben (Contributor) commented Oct 19, 2020

@JRMeyer ping

@ftyers (Collaborator) left a comment
I've gone over and asked some questions. Some of these might be things that really need to be changed, and others are just suggestions. Some of them might also not be valid; e.g. there might be reasons why you've done it the way you have that I am not aware of.

 import tensorflow.compat.v1 as tfv1

 from .flags import FLAGS
-from .logging import log_info, log_error, log_warn
+from .logging import log_error, log_info, log_warn
Collaborator: Why change the order here?

Contributor Author: Autosort?

Collaborator: I don't know what the DS policy is for that, I'd have to ask.

+print_msg = True
+if print_msg:
+    msg = "Some Adam tensors are missing, they will be initialized automatically."
+    log_info(msg)
Collaborator: Can you not just do log_info("...")?

Collaborator: Maybe something like:

missing = []
if ...:
    missing.append(v)

if missing:
    for v in missing:
        log_info("Missing... {}".format(v))

Contributor Author: The split into two parts was an autoformatting issue; black would have split this into three lines. I didn't print every missing layer because there are messages later that state exactly which layers were reinitialized.

@@ -93,6 +93,7 @@ def create_flags():
     # Transfer Learning
 
     f.DEFINE_integer('drop_source_layers', 0, 'single integer for how many layers to drop from source model (to drop just output == 1, drop penultimate and output ==2, etc)')
+    f.DEFINE_integer('freeze_source_layers', 0, 'use same value as above to freeze the other layers')
Collaborator: This description is a bit vague, what is it trying to say?

Contributor Author: The intended use is together with the drop_source_layers flag: if we want to drop the last layer, we can freeze the rest of the net by setting freeze_source_layers to the same value of 1.

Collaborator: Perhaps you could be more explicit in the help text, e.g. say that it has to be used with --drop_source_layers. Maybe something like "freeze N remaining layers after using --drop_source_layers"? But even that is a bit unclear. Try to describe what the flag does, and then maybe give an example of how it can be used together with --drop_source_layers.

Contributor Author: Better now?
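For illustration, the help text could be phrased along these lines (a suggestion only, not necessarily the wording the PR ended up with):

```python
f.DEFINE_integer(
    'freeze_source_layers', 0,
    'freeze all transferred layers except the last N, so that only those last N '
    'layers are trained; typically used with the same value as --drop_source_layers, '
    'e.g. --drop_source_layers 1 --freeze_source_layers 1 reinitializes the output '
    'layer and trains only that layer')
```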

+# Filter out layers if we want to freeze some
+if FLAGS.freeze_source_layers > 0:
+    filter_vars = drop_freeze_number_to_layers(FLAGS.freeze_source_layers, "freeze")
+    new_train_vars = list(train_vars)
Collaborator: Is there a reason not to build up new_train_vars from empty, something like:

new_train_vars = []
for tv in train_vars:
    if tv.name not in filter_vars:
        new_train_vars.append(tv)
train_vars = new_train_vars

Contributor Author: It seemed more intuitive: we want to train all layers except the filtered ones.
Your example doesn't work, by the way, because filter_vars contains names like layer_1 while train_vars has the full layer name layer_1:dense:0.

Collaborator: Hence the "something like" :) You could of course use find() or something like that. I have no particularly strong feeling about it, but err on the side of simplicity.
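For example, the prefix mismatch mentioned above could be handled with a prefix test instead of exact matching (illustrative sketch only; the helper name is made up):

```python
def remove_frozen_layers(train_vars, filter_vars):
    # filter_vars holds short names such as "layer_1", while the entries of
    # train_vars carry full variable names (e.g. the layer_1:dense:0 above),
    # so match by prefix rather than by equality.
    return [
        tv for tv in train_vars
        if not any(tv.name.startswith(prefix) for prefix in filter_vars)
    ]
```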

 # Compute gradients for model parameters using tower's mini-batch
-gradients = optimizer.compute_gradients(avg_loss)
+gradients = optimizer.compute_gradients(avg_loss, var_list=train_vars)
Collaborator: I'd like someone else to take a look at this.

@@ -19,32 +19,33 @@ def _load_checkpoint(session, checkpoint_path, allow_drop_layers, allow_lr_init=
     # compatibility with older checkpoints.
     lr_var = set(v for v in load_vars if v.op.name == 'learning_rate')
     if lr_var and ('learning_rate' not in vars_in_ckpt or
-                   (FLAGS.force_initialize_learning_rate and allow_lr_init)):
+            (FLAGS.force_initialize_learning_rate and allow_lr_init)):
Collaborator: Spacing-only change...

Contributor Author: Fixing PEP 8 E127: continuation line over-indented for visual indent.

@ahmed82 left a comment

Are we replacing TensorFlow?
