optimize clip_by_norm #183

zhangting2020 · 2023-09-22T10:57:31Z

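Optimizations:

mul + set_value: set_value triggers a large number of memcpy calls, so it is replaced with an in-place operation.
Redundant clip_norm computation: added need_grad_norm so that the value is computed only when it needs to be observed in TensorBoard; otherwise it introduces a large number of norm operators.
Redundant cast: in the PR code, paddle_dtype prints as "float32", but tensor.dtype actually returns paddle.float32, so the comparison fails and a pointless cast is introduced. In addition, gradients are fp16 under O2, and the original code converted clip_coef_clamped to fp16 for every gradient. It only needs to be converted once and can then be reused for all other gradients, hence the new clip_coef_clamped_low_precison variable.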
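
The three points above can be sketched roughly as follows. This is not the PR's code: it uses NumPy arrays as a stand-in for Paddle tensors, and the function name `clip_grads_inplace`, the `eps` parameter, and the reading of `need_grad_norm` as "also return per-gradient norms for logging" are assumptions made for illustration.

```python
import numpy as np

def clip_grads_inplace(grads, max_norm, need_grad_norm=False, eps=1e-6):
    """Clip gradients by global L2 norm, scaling each array in place.

    Hypothetical sketch: NumPy arrays stand in for Paddle tensors.
    Returns per-gradient norms only when need_grad_norm is True.
    """
    # Squared L2 norm of each gradient, accumulated in fp32.
    sq_sums = [float(np.sum(np.square(g.astype(np.float32)))) for g in grads]
    total_norm = np.sqrt(sum(sq_sums))

    # Clipping coefficient, clamped so gradients are never scaled up.
    clip_coef_clamped = np.float32(min(max_norm / (total_norm + eps), 1.0))
    # Cast to fp16 once, outside the loop, and reuse it for every fp16 gradient.
    clip_coef_clamped_low_precision = clip_coef_clamped.astype(np.float16)

    for g in grads:
        # Compare dtype objects, not their string form.
        if g.dtype == np.float16:
            coef = clip_coef_clamped_low_precision
        else:
            coef = clip_coef_clamped
        # In-place multiply: no temporary result copied back into g.
        np.multiply(g, coef, out=g)

    # Extra per-gradient norms are produced only on request (e.g. for logging).
    if need_grad_norm:
        return [float(np.sqrt(s)) for s in sq_sums]
    return None
```

For example, calling `clip_grads_inplace([g32, g16], max_norm=1.0)` rescales both arrays in place and returns `None`, while passing `need_grad_norm=True` additionally returns one norm per gradient.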

Results:

O1: 0.739 steps/s -> 2.194 steps/s (about 2.97x)
O2: 1.067 steps/s -> 2.655 steps/s (about 2.49x)

CLAassistant · 2023-09-22T10:57:37Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

zhangting_2017@163.com seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

opt clip_by_norm

f408e09
