
Unexpected inference results from Flan-T5 XXL converted to ctranslate2 with version 4.2.1 and 4.1.1 (using tensor parallel) #1691

Open · gk-kd opened this issue May 2, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

gk-kd commented May 2, 2024

I'm using the off-the-shelf Flan-T5 XXL model in our project, and for deployment we converted it to the CTranslate2 format with the following command:
ct2-transformers-converter --model ~/input_folder/ --output_dir ~/flant5_ct2/

I'm hosting the model as a gRPC server, loading it in tensor parallel mode like this:
ctranslate2.Translator(checkpoint_path, device="cuda", tensor_parallel=True)

I start the server with mpirun with 2 processes so that tensor parallelism kicks in. This works well, and the model is loaded evenly across the 2 GPUs:
mpirun -n 2 python model_server.py
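
For reference, the relevant part of model_server.py looks roughly like this (a simplified sketch; the real server wraps this in gRPC handlers, and the path is the converter output from above):

import os
import ctranslate2

# Path to the converted model (the output_dir from ct2-transformers-converter)
checkpoint_path = os.path.expanduser("~/flant5_ct2/")

# Every MPI rank runs this same script; with tensor_parallel=True,
# CTranslate2 shards the weights across the ranks launched by mpirun.
translator = ctranslate2.Translator(
    checkpoint_path,
    device="cuda",
    tensor_parallel=True,
)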

When I run inference, it returns the following response to my prompt ("Who is president of united states?"):
<pad><pad><pad><pad><pad><pad>

This strange behaviour happens only with ctranslate2==4.2.1.

Any suggestions on how to fix this would be really helpful.

minhthuc2502 (Collaborator) commented

Do you have the same behavior with ctranslate2 4.1.1?


gk-kd commented May 2, 2024

> Do you have the same behavior with ctranslate2 4.1.1?

No, it works fine with 4.1.1, but the results differ between "with tensor parallel" and "without tensor parallel". I saw that some tensor-parallel bugs were fixed in 4.2.0, so I tried upgrading, but ran into this different issue.

By the way, the response looks like this:

<pad><pad><pad><pad><pad><pad>

I tried different quantization types like bfloat16, float16, etc., but nothing seems to work; see the load-time sketch below.
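
(To be concrete, here is a sketch of selecting the precision at load time; compute_type is the documented Translator option, and the converter's --quantization flag is the conversion-time equivalent:)

translator = ctranslate2.Translator(
    checkpoint_path,
    device="cuda",
    tensor_parallel=True,
    compute_type="bfloat16",  # also tried "float16", etc.
)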

anterart (Contributor) commented May 6, 2024

I also experienced an issue with the 4.2.1 Translator. Inference with 4.2.1 produced poor results; I didn't inspect the output itself, but my metrics dropped to zero. This didn't happen with 4.1.1 or 3.24.0.

I thought about reconverting my models with the 4.2.1 converter (the Translators I'm using were generated with the 3.24.0 converter), but haven't had time to do that yet.

kkoehncke commented May 6, 2024

I am also seeing this regression for all variants of Flan-T5 (base, large, XL): the model just outputs <pad> repeatedly. We convert with bfloat16, since it's a known issue that T5 produces bad outputs at other precisions. We reverted back to 3.24.1. We perform inference without tensor parallelism, on a single GPU; our conversion command is roughly as shown below.
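
For reference, the conversion is along these lines (model name and output path here are placeholders):

ct2-transformers-converter --model google/flan-t5-large --output_dir flan-t5-large-ct2 --quantization bfloat16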

minhthuc2502 added the bug label on May 7, 2024