Gradient Clipping #596

Open · jafioti opened this issue Mar 21, 2023 · 7 comments · May be fixed by #902

Comments

@jafioti (Contributor) commented Mar 21, 2023

When training large or deep models, exploding gradients are frequent and cause instability. Clipping them to a certain small value is an effective way of stabilizing training.

To implement this, I believe a method on the Gradients struct would be needed (correct me if I'm wrong).

@coreylowman (Owner)

I know there are multiple ways to clip gradients (e.g. pytorch has clip_grad_norm_ and clip_grad_value_).

Do we know if one of these is more widely used than the other?

@jafioti (Contributor, Author) commented Mar 21, 2023

I think clip_grad_norm_ is more widely used; however, it is also more complex, since it takes the norm of all the gradients first. clip_grad_value_ is used less, but it is far more straightforward to implement, so I think it makes sense to add that first.
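For concreteness, clip_grad_value_ amounts to an element-wise clamp. Here is a minimal standalone Rust sketch of just that math, written over plain f32 buffers rather than dfdx's Gradients type (which is assumed away here), so it is only an illustration of the technique:

// Element-wise value clipping: clamp every gradient entry into
// [-clip_value, clip_value]. Standalone illustration, not the dfdx API.
fn clip_grad_value(grads: &mut [Vec<f32>], clip_value: f32) {
    for g in grads.iter_mut() {
        for x in g.iter_mut() {
            *x = x.clamp(-clip_value, clip_value);
        }
    }
}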

@nkoppel (Contributor) commented Mar 21, 2023

It should be possible to implement a general Gradients::map function that takes a FnMut(&mut Tensor<(usize,), E, D>) -> Result<(), D::Err> and applies it to each D::Vec after wrapping it in a tensor.
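A rough standalone sketch of that shape, using a toy gradients holder over plain Vec<f32> buffers instead of dfdx's actual Tensor/D::Vec types (so the name ToyGradients and the slice-based signature are only illustrative assumptions, not the proposed API itself):

// Toy stand-in for dfdx's Gradients: one flat f32 buffer per parameter.
struct ToyGradients {
    buffers: Vec<Vec<f32>>,
}

impl ToyGradients {
    // Apply a fallible closure to each gradient buffer in turn, mirroring
    // the idea of a Gradients::map over tensors wrapping each D::Vec.
    fn map<F, E>(&mut self, mut f: F) -> Result<(), E>
    where
        F: FnMut(&mut [f32]) -> Result<(), E>,
    {
        for buf in self.buffers.iter_mut() {
            f(buf.as_mut_slice())?;
        }
        Ok(())
    }
}

With something like this, clip_grad_value_ becomes a one-line closure passed to map, and clip_grad_norm_ could reuse the same hook for its final scaling pass.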

@jafioti (Contributor, Author) commented Mar 21, 2023

That seems like all that would be needed for clip_grad_value_.

@coreylowman (Owner) commented Mar 22, 2023

The pytorch implementations of the above are pretty straightforward: https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/clip_grad.py

I would say clip_grad_norm would be required to go through the TensorCollection API so that:

  1. only the norm of the model's gradients is considered
  2. we can get access to each gradient's tensor
model.clip_grad_norm(&mut grads, 0.5);
model.clip_grad_value(&mut grads, 0.5);

Then we could implement clip_grad_norm in two passes using RecursiveWalker:

  1. Accumulate each gradient's norm. For each tensor & gradient:
    1. Create a tensor out of the gradient using Gradients::get
    2. Compute norm of tensor with g.square().sum().sqrt()
    3. Append this 0d norm tensor to a Vec along the walker
  2. Call stack on the Vec of 0d norms
  3. Call stacked.square().sum().sqrt() to compute total norm
  4. Multiply each gradient by max_norm / total_norm as done in pytorch code

If we wanted this all to be in-place:

  • For clip_grad_norm, we'd need a way to in-place multiply a D::Vec<E>.
  • For clip_grad_value, we'd need a way to in-place clamp a D::Vec<E>.

Separately, the .square().sum().sqrt() way of taking the norm may be expensive, since .square() will allocate another tensor the same size as the gradient. I think that can be addressed on its own, though.
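To pin down the arithmetic those steps describe, here is a standalone sketch of global-norm clipping over plain f32 buffers. The real version would go through TensorCollection/RecursiveWalker as outlined above; the epsilon and the skip-when-already-small check mirror the pytorch code, and everything else is an assumption made for illustration:

// Two-pass global-norm clipping, as sketched in the steps above.
// Pass 1 accumulates the total L2 norm across all gradients;
// pass 2 rescales every entry when that norm exceeds max_norm.
fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // total_norm = sqrt(sum over all gradients of sum(g_i^2))
    let total_norm = grads
        .iter()
        .map(|g| g.iter().map(|x| x * x).sum::<f32>())
        .sum::<f32>()
        .sqrt();
    // Scale by max_norm / total_norm only when the gradients are too large;
    // the small epsilon guards against division by zero.
    let clip_coef = max_norm / (total_norm + 1e-6);
    if clip_coef < 1.0 {
        for g in grads.iter_mut() {
            for x in g.iter_mut() {
                *x *= clip_coef;
            }
        }
    }
    total_norm
}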

@opfromthestart (Contributor)

Has any work been done on this?

@swfsql linked a pull request on Dec 14, 2023 that will close this issue
@swfsql (Contributor) commented Dec 14, 2023

I've submitted a draft PR, and once the examples are added I'll mark it as ready for review. So far I think it's working correctly; I've been able to avoid exploding gradients.
