
[CI] Add more unit tests to ensure the outputs are reasonable #704

Open
AsakusaRinne opened this issue Apr 28, 2024 · 3 comments

@AsakusaRinne (Collaborator)

Description

Our unit tests ensure that loading the model and running inference succeed, but they cannot tell whether the output is reasonable or just garbage. Currently we need to run the examples manually when making major features or fixes, which is a bit annoying.

To address this, I think we could send the output to the OpenAI ChatGPT API to check whether it's reasonable. I will cover the cost of the tokens, but will use github.triggering_actor so that only developers with write access can trigger the corresponding workflow.
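As a rough illustration only (not a finalized design), the check could be a small helper that sends the prompt and the generated text to the standard OpenAI chat completions endpoint and asks for a YES/NO verdict. The model name, prompt wording and pass convention below are placeholders, and the write-access gating would still live in the workflow via github.triggering_actor.

```csharp
// Sketch of an OpenAI-based sanity check; model name, prompt wording and the
// YES/NO convention are placeholders, not an agreed design.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class OutputReasonablenessChecker
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<bool> LooksReasonableAsync(string prompt, string modelOutput)
    {
        var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");

        var body = new
        {
            model = "gpt-3.5-turbo",
            temperature = 0,
            messages = new[]
            {
                new { role = "system", content = "You judge whether a model output is a coherent answer to the prompt. Reply with YES or NO only." },
                new { role = "user", content = $"Prompt:\n{prompt}\n\nOutput:\n{modelOutput}" }
            }
        };

        using var request = new HttpRequestMessage(HttpMethod.Post, "https://api.openai.com/v1/chat/completions")
        {
            Content = new StringContent(JsonSerializer.Serialize(body), Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        using var response = await Http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // Extract the judge's reply from the chat completions response.
        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        var verdict = doc.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString() ?? "";

        return verdict.TrimStart().StartsWith("YES", StringComparison.OrdinalIgnoreCase);
    }
}
```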

AsakusaRinne self-assigned this Apr 28, 2024
@martindevans (Collaborator)

We could also hardcode the expected responses in the unit tests. For example, in this test it generates two completions of "Question. what is a cat?\nAnswer:" and asserts that they are the same. We could assert the exact response too.

Of course this would only work with temp=0 and a specific model (even a specific quantisation), but it might save a few OpenAI calls!
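A rough xUnit sketch of what that could look like; GenerateAsync is a stand-in for whatever helper the existing test already uses to run the executor, and the expected string is invented and would need to be captured once from the pinned model and quantisation at temp=0.

```csharp
// Sketch only: exact-match assertion on top of the existing determinism check.
using System.Threading.Tasks;
using Xunit;

public class StatelessExecutorExactOutputTests
{
    [Fact]
    public async Task ProducesExpectedAnswerForPinnedModel()
    {
        const string prompt = "Question. what is a cat?\nAnswer:";

        // Deterministic sampling: temperature 0, same pinned model file.
        var first = await GenerateAsync(prompt, temperature: 0f);
        var second = await GenerateAsync(prompt, temperature: 0f);

        // Existing check: both runs must agree.
        Assert.Equal(first, second);

        // New check: the output must match the response recorded for this exact
        // model file and quantisation (placeholder text, not a real capture).
        const string expected = " A cat is a small domesticated carnivorous mammal.";
        Assert.Equal(expected, first);
    }

    // Placeholder signature only; the real test would call the stateless executor.
    private Task<string> GenerateAsync(string prompt, float temperature) =>
        throw new System.NotImplementedException();
}
```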

@SignalRT (Collaborator)

In my opinion, the alternative that Martin proposes would be easier. We cannot run all the tests in CI, and we should verify all the tests locally.

@AsakusaRinne (Collaborator, Author)

We could also hardcode the expected responses in the unit tests.

Yes, I'd also like to save tokens wherever this approach works! I'll only consider using the OpenAI API when necessary.

We cannot run all the tests in CI, and we should verify all the tests locally.

I tend to view things a bit differently. The workflows and unit tests are responsible for reducing the risk when we merge PRs. As long as the workflows pass, that should be equivalent to saying that terrible behaviors won't appear if we merge the PR.

However, due to the GPU backends, it's indeed hard for us to cover all the cases in the workflows. I can provide a machine with an Nvidia GPU running Linux to run the workflows, but I have no solution for Windows yet. :)
