
Optimize XLnet performance #307

Open
zhaoyuchen2018 opened this issue Dec 19, 2019 · 7 comments

@zhaoyuchen2018
Contributor

zhaoyuchen2018 commented Dec 19, 2019

Test report provided with the model:
image

Self-measured values on a single machine with a single V100 GPU:
Paddle version: develop
Speed: 0.961218 steps/s
TensorFlow 1.15
Speed: 1.61 steps/s
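
To put the two throughput figures above on a common scale, here is a minimal sketch of the arithmetic. The variable and function names are illustrative only and are not part of any Paddle or TensorFlow tooling.

```python
# Throughput values copied from the self-measured report above (steps per second).
paddle_steps_per_s = 0.961218  # Paddle develop
tf_steps_per_s = 1.61          # TensorFlow 1.15


def relative_throughput(candidate, reference):
    """Return candidate throughput as a fraction of the reference throughput."""
    return candidate / reference


ratio = relative_throughput(paddle_steps_per_s, tf_steps_per_s)
gap = relative_throughput(tf_steps_per_s, paddle_steps_per_s)

print(f"Paddle reaches {ratio:.1%} of the TF throughput")
print(f"TF is {gap:.2f}x faster than Paddle")
```

In other words, Paddle starts at roughly 60% of the TensorFlow throughput on this setup.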

@zhaoyuchen2018
Contributor Author

image

The profile results show that the stack and stack_grad ops spend too much time on the CPU.

@zhaoyuchen2018
Contributor Author

image

The tracing file shows that stack_grad takes a long time, most likely because it is waiting for GPU operations to finish.
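
Why a CPU-side op can look expensive while it is really waiting on the GPU can be illustrated with a small simulation. This is a sketch in pure Python, not Paddle code: the worker thread stands in for a GPU stream, `work.join()` stands in for a device synchronization, and `run_trace` and the timing keys are hypothetical names.

```python
import threading
import time
from queue import Queue


def run_trace(kernel_seconds=0.05, num_kernels=3):
    """Simulate a host that enqueues async device work, then runs a blocking op."""
    work = Queue()

    def device():
        # Stand-in for a GPU stream: executes queued "kernels" one at a time.
        while True:
            duration = work.get()
            if duration is None:
                work.task_done()
                break
            time.sleep(duration)
            work.task_done()

    worker = threading.Thread(target=device)
    worker.start()

    timings = {}

    # Asynchronous launches return to the host almost immediately.
    start = time.perf_counter()
    for _ in range(num_kernels):
        work.put(kernel_seconds)
    timings["async_launches"] = time.perf_counter() - start

    # An op that has to synchronize first absorbs all the queued device time,
    # so a CPU profile attributes that time to this op.
    start = time.perf_counter()
    work.join()  # stand-in for a device synchronization
    timings["blocking_op"] = time.perf_counter() - start

    work.put(None)  # tell the simulated device to stop
    worker.join()
    return timings


timings = run_trace()
print(timings)
```

The launches themselves cost almost nothing on the host, while the blocking op is charged for the kernels queued before it. A long stack_grad bar in the trace is consistent with this pattern, although the trace alone does not prove it.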

@zhaoyuchen2018 zhaoyuchen2018 self-assigned this Dec 19, 2019
@zhaoyuchen2018
Contributor Author

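| Paddle op | Calls | Time cost (ms) | TF op | Calls | Time cost (ms) |
|---|---|---|---|---|---|
| stack_grad | 12 | 719.9 | | | |
| stack | 15 | 347.6 | | | |
| elementwise_mul | 305 | 96 | Mul | 1203 | 132.4 |
| matmul_grad | 186 | 191.4 | BatchMatMulV2 | 370 | 315.8 |
| transpose2_grad | 383 | 179.8 | | | |
| transpose2 | 406 | 197.3 | Transpose | 1236 | 260.6 |
| matmul | 193 | 74 | MatMul | 373 | 154.8 |
| total | | 1806 | | | 863.6 |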

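The op comparison is shown in the table above. The full op list is not included; the ops that account for the largest share of time will be optimized first.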
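
To see which ops dominate, the Paddle columns of the table above can be ranked by share of time and by cost per call. This is a minimal sketch: the numbers are copied from the table, `summarize` is an illustrative helper, and the shares are relative to the listed ops only, since the full op list was not posted.

```python
# (op name, calls, total time in ms), copied from the Paddle columns above.
paddle_ops = [
    ("stack_grad", 12, 719.9),
    ("stack", 15, 347.6),
    ("elementwise_mul", 305, 96.0),
    ("matmul_grad", 186, 191.4),
    ("transpose2_grad", 383, 179.8),
    ("transpose2", 406, 197.3),
    ("matmul", 193, 74.0),
]


def summarize(ops):
    """Return the total time and (name, share of total, ms per call), slowest first."""
    total = sum(time_ms for _, _, time_ms in ops)
    ranked = sorted(ops, key=lambda row: row[2], reverse=True)
    rows = [(name, time_ms / total, time_ms / calls) for name, calls, time_ms in ranked]
    return total, rows


total_ms, rows = summarize(paddle_ops)
print(f"total: {total_ms:.1f} ms")
for name, share, per_call in rows:
    print(f"{name:16s} {share:6.1%} {per_call:8.2f} ms/call")
```

stack_grad and stack together make up more than half of the listed time despite having only 27 calls between them, which is why they are the first targets.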

@zhaoyuchen2018
Contributor Author

zhaoyuchen2018 commented Dec 26, 2019

Optimized the stack op: PaddlePaddle/Paddle#21940
After the optimization, xlnet-ernie reaches 1.005, an improvement of ~4%.

image

@zhaoyuchen2018
Contributor Author

After optimizing transpose: 1.337516 steps/s

image

image
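
The progress so far can be tracked against the baseline and the TensorFlow reference with a little arithmetic. This sketch makes two assumptions: the 1.005 figure reported after the stack optimization is in steps/s, and the transpose measurement was taken with the stack optimization already applied. The names below are illustrative.

```python
# Throughput in steps/s, copied from the comments in this thread.
baseline = 0.961218         # Paddle develop before any optimization
after_stack = 1.005         # after the stack op optimization (unit assumed)
after_transpose = 1.337516  # after the transpose optimization
tf_reference = 1.61         # TensorFlow 1.15


def gain(new, old):
    """Relative improvement of new over old, e.g. 0.05 means 5% faster."""
    return new / old - 1.0


stack_gain = gain(after_stack, baseline)
total_gain = gain(after_transpose, baseline)
fraction_of_tf = after_transpose / tf_reference

print(f"stack optimization:   {stack_gain:+.1%} over baseline")
print(f"stack + transpose:    {total_gain:+.1%} over baseline")
print(f"fraction of TF speed: {fraction_of_tf:.1%}")
```

Under those assumptions, the two optimizations together lift Paddle from about 60% to about 83% of the TensorFlow throughput.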

@zhaoyuchen2018
Contributor Author

image
Before the element_wise computation, a large amount of time is wasted on synchronization between the CPU and the GPU.

@zhaoyuchen2018
Contributor Author

image
After optimizing the data transform, performance improves by ~8%.
