LoRAX server with 2 GPUs and multiple adapters becomes permanently faster in swapping ONLY after parallel execution of requests. #395

Open
1 of 4 tasks
lighteternal opened this issue Apr 8, 2024 · 1 comment

@lighteternal

System Info

I just noticed some strange behaviour that probably has no severe implications but is nevertheless interesting to explore:

I am hosting a Mixtral model on a server with 2 A100s. My product makes 3 LLM calls: 2 of them each use a different adapter, and the last one uses the base model. Whenever a new LoRAX release comes out, I download the latest image and run it.
Once the server has warmed up, I usually run a small validation script to ensure that all requests are served successfully.

I noticed that the server would initially be very slow in swapping the adapters, regardless of the arguments passed to the docker run command. Eventually, though, swapping would become almost instantaneous, so I didn't pay too much attention until today, when I noticed the following:

  • If only sequential calls are sent to the server (regardless of whether they use the adapters), the whole pipeline takes ~15 sec.
  • If I use a script to parallelize the calls (say via concurrent.futures), they naturally finish faster.
  • But the speed improvement from this parallel execution somehow persists even when I go back to my sequential script: the same sequential operation that took ~15 sec now finishes in less than 3 sec!
  • It appears that executing requests for different adapters in parallel causes the adapters to be loaded into the cache, which did not happen before, despite the relevant parameters being set in the docker config.
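For anyone trying to reproduce this, a minimal sketch of the two timing runs described above might look like the following. It is not my actual validation script: `fake_send` is a hypothetical stand-in that only sleeps, and the prompts and adapter names are placeholders. To test against a real server, `fake_send` would be swapped for a function that sends one request to the LoRAX endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_sequential(send, requests):
    """Send requests one after another; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [send(r) for r in requests]
    return results, time.perf_counter() - start


def run_parallel(send, requests):
    """Send all requests concurrently; return (results, elapsed seconds)."""
    start = time.perf_counter()
    # One worker per request so that all calls are in flight at once.
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        results = list(pool.map(send, requests))  # map preserves input order
    return results, time.perf_counter() - start


def fake_send(request):
    """Hypothetical stand-in for a real call to the server (assumption)."""
    time.sleep(0.05)  # simulated latency, not a measured value
    return f"response for adapter={request['adapter_id']}"


# Placeholder requests: two adapter calls and one base-model call.
REQUESTS = [
    {"prompt": "step 1", "adapter_id": "adapter-a"},
    {"prompt": "step 2", "adapter_id": "adapter-b"},
    {"prompt": "step 3", "adapter_id": None},
]

if __name__ == "__main__":
    _, t_before = run_sequential(fake_send, REQUESTS)
    _, t_parallel = run_parallel(fake_send, REQUESTS)
    _, t_after = run_sequential(fake_send, REQUESTS)
    print(f"sequential (before parallel run): {t_before:.2f}s")
    print(f"parallel:                         {t_parallel:.2f}s")
    print(f"sequential (after parallel run):  {t_after:.2f}s")
```

With the stub, the two sequential timings are of course the same; against the real server, the interesting comparison is the first sequential timing versus the last one.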

Curious if anyone else has encountered this. In any case, I am still in awe of the speed at which new features and improvements are integrated. I don't know whether this is expected, so I thought I'd flag it. 💪

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Expected behavior

@lighteternal changed the title from "LoRAX server with 2 GPUs and multiple adapters becomes faster in swapping ONLY after parallel execution of requests." to "LoRAX server with 2 GPUs and multiple adapters becomes permanently faster in swapping ONLY after parallel execution of requests." on Apr 8, 2024
@magdyksaleh
Collaborator

This is really weird. Let me try to repro it with some adapters and see why that might be happening.

@magdyksaleh self-assigned this on Apr 11, 2024