
Fleshing out API mode #39

Closed
DifferentialityDevelopment opened this issue May 7, 2024 · 13 comments

@DifferentialityDevelopment
Contributor

Hi there

Amazing project, by the way. It has given me hope of being able to run really big models; specifically, I'm very excited about the upcoming 400B Llama model coming in the following months, and I've been planning how to run it locally.
Anyway, I'm going off topic here. I'd really like to help out where I can, and while it's undocumented in the readme, I see there is a simple server capability built into the distributed-llama main program.

I was wondering if I can assist in fleshing it out. There are a few things I can think of off the top of my head that could be needed in an API for distributed-llama, in no particular order:

  • ability to stop on encountering specific strings (known issue with Llama 3 where the eos token is <|end_of_text|> in the tokenizer config but the chat template uses <|eot_id|>); see the sketch at the end of this comment
  • chat template integration, and for that matter an OpenAI chat completion API endpoint (makes integrating it into chat web UIs much simpler)
  • statistics (get throughput stats from the workers so you can isolate where bottlenecks are)

I'm fairly new to C++, but I'd be happy to work with you on this.
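For the stop-string idea above, here's a minimal sketch of what the check could look like; the function name and buffering approach are just illustrative, not taken from the distributed-llama codebase:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: check whether the text generated so far ends with any of
// the configured stop strings (e.g. "<|eot_id|>"). The generation loop could call
// this after appending each decoded token and break out when it returns true.
static bool endsWithStopString(const std::string& generated,
                               const std::vector<std::string>& stopStrings) {
    for (const std::string& stop : stopStrings) {
        if (generated.size() >= stop.size() &&
            generated.compare(generated.size() - stop.size(), stop.size(), stop) == 0) {
            return true;
        }
    }
    return false;
}
```

Since a stop string like <|eot_id|> rarely maps to a single token, the check would have to run against the accumulated output buffer rather than against individual tokens.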

@b4rtaz
Owner

b4rtaz commented May 8, 2024

Hello @DifferentialityDevelopment,

yeah, more hands are welcome.

> ability to stop on encountering specific strings

This could be easy to implement. Maybe you could add an extra CLI parameter like --eos-id 123 that overrides the value from the tokenizer file.
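The --eos-id flag is only a suggestion at this point; a rough sketch of how such an override could be parsed, assuming a simple flag/value argument style (this is not the actual distributed-llama argument-parsing code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main(int argc, char** argv) {
    int eosId = -1; // -1 means "use the value from the tokenizer file"
    for (int i = 1; i < argc - 1; i++) {
        if (std::strcmp(argv[i], "--eos-id") == 0) {
            eosId = std::atoi(argv[i + 1]); // override the tokenizer's eos id
        }
    }
    if (eosId >= 0)
        std::printf("Overriding eos token id: %d\n", eosId);
    return 0;
}
```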

> chat template integration, and for that matter an OpenAI chat completion API endpoint

This would be very good. But Distributed Llama still doesn't support kv cache recalculation after the inference has reached seqLen tokens. That cannot be implemented right now, because there are some more important changes to do first. I think in the next few days I'll add a roadmap for this project.

@DifferentialityDevelopment
Contributor Author

I've already made some good progress on an API endpoint. I was thinking it might be a good idea, instead of building it into main, to rather have it separate as server.cpp.
Sometime in the next week I'm hoping to have it ready. I'm aiming for the chat completions endpoint to work in this format:
https://platform.openai.com/docs/api-reference/chat/create
I'm almost certain I'll need your help on refactoring it though; I've still got tons to learn about C++, as I mostly program in C#.
The only external dependency I'm looking at using for it right now is https://github.com/nlohmann/json
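As a rough sketch (not the actual server.cpp), parsing an OpenAI-style chat/completions request body with the single-header nlohmann/json could look something like this; the ChatMessage struct and parseChatRequest helper are hypothetical, and the field names follow the OpenAI API reference linked above:

```cpp
#include <string>
#include <vector>
#include "json.hpp" // nlohmann/json, single header

using json = nlohmann::json;

struct ChatMessage {
    std::string role;
    std::string content;
};

// Parse the body of a POST /v1/chat/completions request.
// Field names ("messages", "temperature", "seed") follow the OpenAI API reference.
static void parseChatRequest(const std::string& body,
                             std::vector<ChatMessage>& messages,
                             float& temperature, unsigned long long& seed) {
    json req = json::parse(body);
    for (const auto& m : req["messages"]) {
        messages.push_back({ m["role"].get<std::string>(),
                             m["content"].get<std::string>() });
    }
    temperature = req.value("temperature", 0.8f);
    seed = req.value("seed", 0ull);
}
```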

@b4rtaz
Owner

b4rtaz commented May 11, 2024

Great! I would prefer to not use external dependencies; you can attach the json library to the source code as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile (make xyz).

Also, I think this server should be excluded from the main.cpp file. You could put the main code in the src/apps/api-server.cpp file.

@DifferentialityDevelopment
Contributor Author

> Great! I would prefer to not use external dependencies; you can attach the json library to the source code as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile (make xyz).

It's only a header file, i.e. json.hpp, which makes it super easy to integrate and allows for working with JSON data without bloating the source code with custom parsing code. It was necessary for the API endpoint, as most communication to/from the server is via JSON.

> Also, I think this server should be excluded from the main.cpp file. You could put the main code in the src/apps/api-server.cpp file.

Yes, this is what I've done, though it's at src/server.cpp.

I've already got it mostly working. The non-streaming endpoint is basically done; then I just have to finish the streaming endpoint and it should be good to go. I've only made a few small changes to the other files, for instance to the sampler, so the temperature or seed can be set dynamically based on the API request being processed.
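To illustrate the per-request sampler configuration mentioned above, a toy sketch assuming a sampler that exposes temperature and seed setters; the class and method names here are hypothetical and not the real distributed-llama sampler interface:

```cpp
#include <random>

// Illustrative sampler that can be re-configured per API request.
// (The real distributed-llama sampler has a different interface; this only
// shows the idea of setting temperature and seed dynamically.)
class Sampler {
public:
    void configure(float temperature, unsigned long long seed) {
        temp = temperature;
        rng.seed(seed); // deterministic sampling when the client passes a seed
    }
    float temperature() const { return temp; }
private:
    float temp = 0.8f;
    std::mt19937_64 rng{12345};
};

// Per request, something like: sampler.configure(req.temperature, req.seed);
```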

@DifferentialityDevelopment
Contributor Author

By the way, I've published a distributed version of Llama 3 8B Instruct to Hugging Face:
https://huggingface.co/Azamorn/Meta-Llama-3-8B-Instruct-Distributed

That model is ideal for testing the chat completion endpoint.

@DifferentialityDevelopment
Contributor Author

I've got it responding now; super stoked I finally got it working!

[screenshot]

And another one; I've got stop strings integrated too:

[screenshot]

You can check out my changes here:
https://github.com/DifferentialityDevelopment/distributed-llama

Still busy with it, though.

@DifferentialityDevelopment
Contributor Author

The API is basically done. Streaming and non-streaming modes both work; I've tested both and they seem to work fine.

[screenshot]
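For reference, OpenAI-style streaming frames each generated token as a server-sent-events chunk of the form data: {json}\n\n, with a final data: [DONE] line. A minimal sketch of building one such chunk with nlohmann/json (illustrative only, not the actual server code):

```cpp
#include <string>
#include "json.hpp" // nlohmann/json, single header

using json = nlohmann::json;

// Format one streamed token as an OpenAI-style SSE chunk.
static std::string makeStreamChunk(const std::string& token) {
    json delta = { {"content", token} };
    json choice = { {"index", 0}, {"delta", delta} };
    json chunk = {
        {"object", "chat.completion.chunk"},
        {"choices", json::array({ choice })}
    };
    return "data: " + chunk.dump() + "\n\n";
}
```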

Do you think it's ready for a pull request?

@b4rtaz
Owner

b4rtaz commented May 11, 2024

Nice job! I'll review your changes after you create a new PR.

@b4rtaz
Owner

b4rtaz commented May 11, 2024

Btw: what is the behaviour when the generating position reaches the max kv cache position?

@DifferentialityDevelopment
Contributor Author

> Btw: what is the behaviour when the generating position reaches the max kv cache position?

To be honest, I have not tested that. Do you mean when it reaches the max sequence length?
From what I can tell, it should just force-end the chat completion. Normally the chat completion ends when a stop word is detected or an eos token is generated, but if neither happens it will run until it reaches the maximum sequence length specified in TransformerSpec->seqLen.
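A minimal sketch of the termination logic described above (stop string, eos token, or the seqLen limit); the function and variable names are illustrative, not the actual generation loop:

```cpp
// Illustrative termination check for the generation loop (not the actual code):
// generation stops when a stop string is hit, the eos token is produced,
// or the position reaches the model's maximum sequence length (spec->seqLen).
bool shouldStop(int pos, int maxSeqLen, int token, int eosTokenId,
                bool stopStringHit) {
    if (stopStringHit) return true;          // matched a configured stop string
    if (token == eosTokenId) return true;    // model emitted end-of-sequence
    if (pos + 1 >= maxSeqLen) return true;   // hit TransformerSpec->seqLen limit
    return false;
}
```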

@b4rtaz
Owner

b4rtaz commented May 11, 2024

I mean that if the chat completion supported keeping a session (id), the kv cache limit could become a problem very quickly, because after a short conversation the limit would be reached.

Does the current version support keeping a session, or does it only generate answers for new prompts? If not, I still think this is a good step forward.

The rolling kv cache is a planned feature.

@DifferentialityDevelopment
Contributor Author

I haven't implemented session keeping yet, so each request recomputes the prompt for the whole conversation. But I think I get where you're coming from: before any tokens get returned, it processes each token in the prompt before it begins generating new tokens.

Ideally, after each new chat message it wouldn't need to reprocess all the chat messages from previous requests, assuming the chat history is not altered.
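A sketch of that prefix-reuse idea (not implemented in the code under discussion): compare the new prompt's tokens against the previous request's tokens and only evaluate the suffix, assuming the kv cache for the shared prefix is still valid:

```cpp
#include <cstddef>
#include <vector>

// Sketch of prompt prefix reuse: find how many leading tokens are identical to
// the previous request, keep the kv cache for that prefix, and only evaluate
// the new suffix tokens.
static size_t sharedPrefixLength(const std::vector<int>& previousTokens,
                                 const std::vector<int>& newTokens) {
    size_t n = 0;
    while (n < previousTokens.size() && n < newTokens.size() &&
           previousTokens[n] == newTokens[n]) {
        n++;
    }
    return n; // tokens [0, n) could be served from the existing kv cache
}
```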

@b4rtaz
Owner

b4rtaz commented May 19, 2024

The 0.6.0 version is released with the API. I'm closing this issue now. Any further improvements/fixes should be discussed in a new thread.

@b4rtaz b4rtaz closed this as completed May 19, 2024