[Feature] throw Turbomind error to python #1539
Hi @lijing1996, could you provide detailed information about the error, how it was triggered, and a minimal reproducible example? When the TurboMind engine reports an error, there are usually two situations. One is an unrecoverable error, such as an OOM, in which case the right behavior is simply to let the process crash. The other is an error that only affects a specific request, in which case letting that request fail and having the client retry is sufficient.
Regarding the first case: could the error be caught so that the model can be re-imported and re-loaded? I sometimes hit an OOM with a large batch size; with a small batch size the OOM goes away, but throughput is low.
In this situation, catching the error is pointless because it is a fatal error; the process should simply be allowed to crash so the problem is exposed. I also believe this is a bug that should be fixed. Could you provide detailed steps to reproduce it, including the model, request parameters, and the specific request content? For a program that runs on the server side for a long time, stability is very important, especially for Internet services.
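The per-request failure path suggested above (let the request fail and have the client retry) could look like this minimal sketch. Everything here is illustrative: `with_retries` and `request_fn` are hypothetical helpers, not part of lmdeploy's API.

```python
import time

def with_retries(request_fn, max_attempts=3, backoff_s=0.0):
    """Call request_fn, retrying on failure up to max_attempts times.

    request_fn stands in for a single inference request; any exception
    it raises is treated as a per-request failure worth retrying.
    """
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as exc:  # a real client would narrow this type
            last_exc = exc
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise last_exc

# Example: a flaky call that fails twice, then succeeds on the third try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient engine error")
    return "ok"

print(with_retries(flaky))  # → ok
```

With `max_attempts` exhausted, the last exception is re-raised, so a persistent failure still surfaces to the caller instead of being silently swallowed.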
It is just an OOM error. I use a VLM to caption lots of images, so I need to restart after a crash.
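For this restart-after-crash workflow, one common pattern is to run the captioning job in a child process and relaunch it when it dies, rather than trying to catch a fatal engine error in-process. A minimal supervisor sketch, assuming the worker is an ordinary script launched by command line (`worker_argv` and `run_with_restarts` are hypothetical names; the worker itself would be your lmdeploy captioning script, ideally resuming from a checkpoint of already-captioned images):

```python
import subprocess
import sys

def run_with_restarts(worker_argv, max_restarts=3):
    """Run a worker process, relaunching it whenever it exits non-zero.

    worker_argv is the command line of the captioning worker. A crashed
    (e.g. OOM-killed) worker is simply started again; the worker is
    responsible for skipping work it has already completed.
    """
    for restart in range(max_restarts + 1):
        result = subprocess.run(worker_argv)
        if result.returncode == 0:
            return restart  # number of restarts that were needed
    raise RuntimeError(f"worker kept failing after {max_restarts} restarts")

# Example usage (the script path is a placeholder):
# run_with_restarts([sys.executable, "caption_images.py"])
```

Pairing this with a smaller batch size after each restart would also bound the OOM risk at the cost of some throughput.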
Motivation
When Turbomind throws an error, it cannot be caught from Python, so the caller cannot recover and continue running.
Related resources
No response
Additional context
No response