I find that the quantisation losses are higher for GPT-J than for LLaMA, which stays fairly low.
This is with a calibration set of 2048 * 2 samples from the C4 dataset. The loss improves as I make the calibration set larger.
I also notice that the loss grows as quantisation proceeds through the layers, so it is highest around the last layer (roughly layer 28/28).
Why is the avg loss so much higher for GPT-J compared to LLaMA? With LLaMA I can also use a much smaller calibration set, around 1024 samples, and still reach a lower loss.
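For reference, this is roughly how I am producing those numbers. A minimal sketch assuming the AutoGPTQ library; the exact class names and `quantize()` signature may differ between versions, and `"allenai/c4"` is assumed to be the current dataset id on the Hub:

```python
# Sketch: quantise GPT-J with GPTQ on a C4 calibration set and read the
# per-layer reconstruction error ("avg loss") from the quantisation logs.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "EleutherAI/gpt-j-6b"   # swap for a LLaMA checkpoint to compare
n_samples = 2048 * 2               # calibration size mentioned above
seq_len = 2048

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Build the C4 calibration set, keeping only full-length sequences.
raw = load_dataset("allenai/c4", "en", split="train", streaming=True)
examples = []
for sample in raw:
    enc = tokenizer(sample["text"], truncation=True, max_length=seq_len,
                    return_tensors="pt")
    if enc["input_ids"].shape[1] < seq_len:
        continue
    examples.append({"input_ids": enc["input_ids"],
                     "attention_mask": enc["attention_mask"]})
    if len(examples) >= n_samples:
        break

quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quant_config)

# quantize() reports the GPTQ error layer by layer; this is where the
# per-layer loss I am describing comes from.
model.quantize(examples)
model.save_quantized("gptj-4bit-gptq")
```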