Support inference URLs for models used by scanners #101

Open
adrien-lesur opened this issue Feb 22, 2024 · 2 comments

adrien-lesur commented Feb 22, 2024

Is your feature request related to a problem? Please describe.
My understanding of the documentation and the code is that llm-guard lazy-loads the models required by the chosen scanners from Hugging Face. I apologize if this is incorrect.

This is not ideal for consumers like Kubernetes workloads because:

  • When llm-guard is used as a library:
    • each pod downloads the same models, wasting resources;
    • Kubernetes workloads are usually given low resource allocations to allow efficient horizontal scaling.
  • When llm-guard is used as an API, i.e. a dedicated llm-guard-api deployment with more resources:
    • you might still want the llm-guard-api deployment to scale, and you run into the same resource optimization issue.

A third option is that the models are already deployed somewhere in a central place, so that the only information the scanners need is the inference URL and the authentication.

Describe the solution you'd like
Users who use a platform to host and run models in a central place should be able to provide inference URLs and authentication to the scanners, instead of having the scanners lazy-load the models.
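
A minimal sketch of what a scanner could do internally if it accepted an inference URL and credentials, instead of downloading the model itself; the endpoint URL, environment variable, and response shape below are illustrative assumptions, not part of the current llm-guard API:

```python
# Sketch: score a prompt with a centrally hosted classifier over HTTP instead
# of lazy-loading the model locally. The URL, token variable, and response
# shape are assumptions for illustration.
import os

import requests

INFERENCE_URL = "https://models.internal.example.com/prompt-injection"  # assumed central endpoint
API_TOKEN = os.environ["MODEL_API_TOKEN"]                               # assumed auth token


def remote_scan(text: str) -> float:
    """Return the risk score computed by the remotely hosted classifier."""
    response = requests.post(
        INFERENCE_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"inputs": text},
        timeout=10,
    )
    response.raise_for_status()
    # Assumed response shape: [{"label": "INJECTION", "score": 0.97}, ...]
    scores = {item["label"]: item["score"] for item in response.json()}
    return scores.get("INJECTION", 0.0)
```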

Describe alternatives you've considered
The existing usage modes described in the documentation (as a library or as an API).

asofter (Collaborator) commented Feb 23, 2024

Hey @adrien-lesur, at some point we considered supporting HuggingFace Inference Endpoints, but we learned that they are not widely used.

How would you usually deploy those models? I assume https://github.com/neuralmagic/deepsparse or something.

adrien-lesur (Author) commented

Hi @asofter,
The models would usually be deployed via vLLM, as documented here for Mistral.
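
For context, a model served centrally by vLLM exposes an OpenAI-compatible HTTP API, so a scanner would only need a base URL, a token, and the model name to query it. A minimal sketch, assuming such a deployment (the host, token, and model name are placeholders):

```python
# Sketch: query a model hosted centrally by vLLM through its OpenAI-compatible
# API. Base URL, API key, and model name are placeholders for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://vllm.internal.example.com/v1",  # central vLLM deployment
    api_key="YOUR_TOKEN",                             # authentication
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",       # model served by vLLM
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)
```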
