Fleshing out API mode #39
Hi there

Amazing project by the way; it has given me hope of being able to run really big models. Specifically, I'm very excited about the upcoming 400B Llama model coming in the following months, and I've been planning how to run it locally.

Anyway, I'm going off topic here. I'd really like to help out where I can, and while it's undocumented in the readme, I see there is a simple server capability built into the distributed-llama main program.

I was wondering if I can assist in fleshing it out. There are a few things I can think of off the top of my head that could be needed in an API for distributed-llama.

I'm fairly new to C++ but I'd be happy to work with you on this.

Comments
Hello @DifferentialityDevelopment, yeah, more hands are welcome.
This could be easy to implement. Maybe you could add an extra CLI parameter for it.
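(A minimal sketch of how such a flag could be parsed; the `--server` and `--port` names and the `ServerArgs` struct are made up for illustration and are not Distributed Llama's actual options:)

```cpp
#include <cstring>
#include <string>

// Hypothetical CLI handling: scan argv for a server-mode flag and a port.
// Flag names are illustrative only; the project's real options may differ.
struct ServerArgs {
    bool serverMode = false;
    int port = 8080; // assumed default for illustration
};

ServerArgs parseServerArgs(int argc, char** argv) {
    ServerArgs args;
    for (int i = 1; i < argc; i++) {
        if (std::strcmp(argv[i], "--server") == 0) {
            args.serverMode = true;
        } else if (std::strcmp(argv[i], "--port") == 0 && i + 1 < argc) {
            args.port = std::stoi(argv[++i]);
        }
    }
    return args;
}
```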
This would be very good. But Distributed Llama still doesn't support KV cache recalculation after the inference reaches the KV cache limit.
I've already made some good progress on an API endpoint. I was thinking it might be a good idea, instead of building it into main, to rather keep it separate as server.cpp.
Great! I would prefer not to use external dependencies; you can attach the JSON library to the source code as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile. (Also, I think this server should be excluded from the default build.)
It's only a header file, i.e. json.hpp, which makes it super easy to integrate and allows working with JSON data without bloating the source code with custom JSON-parsing code. It was necessary for the API endpoint, as most communication to/from the server is via JSON.
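(For context, a minimal sketch of how the single-header json.hpp (nlohmann/json) can parse a chat-completion-style request body; the field names mirror the OpenAI-style API and are illustrative:)

```cpp
#include <iostream>
#include <string>
#include "json.hpp" // single-header nlohmann/json, vendored into the repo

using json = nlohmann::json;

int main() {
    // Example request body; field names follow the OpenAI-style chat API.
    std::string body = R"({
        "temperature": 0.7,
        "seed": 42,
        "messages": [{"role": "user", "content": "Hello!"}]
    })";

    json req = json::parse(body);
    float temperature = req.value("temperature", 0.8f); // fall back to a default
    unsigned long long seed = req.value("seed", 0ULL);
    for (const auto& msg : req["messages"]) {
        std::cout << msg["role"].get<std::string>() << ": "
                  << msg["content"].get<std::string>() << "\n";
    }
    std::cout << "temperature=" << temperature << " seed=" << seed << "\n";
    return 0;
}
```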
Yes, this is what I've done, though it's at src/server.cpp. I've already got it mostly working: the non-streaming endpoint is basically done, then I just have to finish the streaming endpoint and it should be good to go. I've only made a few small changes to the other files, for instance allowing the sampler to dynamically set the temperature or seed based on the API request being processed.
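(A rough sketch of what per-request sampler configuration might look like; the `Sampler` type and its fields here are hypothetical stand-ins, not Distributed Llama's actual sampler:)

```cpp
#include <random>

// Hypothetical sampler whose parameters can be updated per request,
// rather than being fixed once at startup.
struct Sampler {
    float temperature;
    std::mt19937_64 rng;

    Sampler(float temp, unsigned long long seed)
        : temperature(temp), rng(seed) {}

    // Called when an API request specifies its own settings.
    void configure(float temp, unsigned long long seed) {
        temperature = temp;
        rng.seed(seed);
    }
};
```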
By the way, I've published a distributed version of Llama 3 8B Instruct to Hugging Face. That model is ideal for testing the chat completion endpoint.
I've got it responding back now; super stoked I finally got it working! I've got stop strings integrated too. You can check out my changes here. Still busy with it though.
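(One common way stop strings are handled during generation is to scan the accumulated output after each new token; a generic sketch, not necessarily how this PR implements it:)

```cpp
#include <string>
#include <vector>

// Returns true if the generated text ends with any of the stop strings,
// meaning generation should halt before emitting further tokens.
bool hitStopString(const std::string& generated,
                   const std::vector<std::string>& stopStrings) {
    for (const std::string& stop : stopStrings) {
        if (generated.size() >= stop.size() &&
            generated.compare(generated.size() - stop.size(),
                              stop.size(), stop) == 0) {
            return true;
        }
    }
    return false;
}
```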
Nice job! I'll review your changes after you create a new PR.
Btw: what is the behaviour when the generating position reaches the KV cache max position?
To be honest I have not tested that. Do you mean when it reaches the max sequence length?
I mean whether the chat completion would support keeping a session. Does the current version support keeping a session, or does it only generate answers for new prompts? If not, I still think this is a good step forward. The rolling KV cache is a planned feature.
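(For reference, a rolling KV cache typically evicts the oldest positions once the cache is full so generation can continue past the context limit; a simplified sketch of the bookkeeping, not Distributed Llama's planned implementation:)

```cpp
#include <deque>

// Simplified rolling-cache bookkeeping: once the cache holds maxPositions
// entries, the oldest position is dropped to make room for a new one.
// A real implementation would also shift the cached keys/values and
// adjust positional encodings accordingly.
struct RollingKvCache {
    std::deque<int> positions; // token positions currently cached
    size_t maxPositions;

    explicit RollingKvCache(size_t max) : maxPositions(max) {}

    void append(int pos) {
        if (positions.size() == maxPositions)
            positions.pop_front(); // evict the oldest entry
        positions.push_back(pos);
    }
};
```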
I haven't implemented session keeping yet, so each request recomputes the prompt for the whole conversation. But I think I get where you're coming from: before tokens get returned, it processes each token in the prompt before it begins generating new tokens. Ideally, after each next chat message it wouldn't need to reprocess all the previous chat messages from previous requests, assuming the chat history is not altered.
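(The usual trick for this is prefix matching: compare the new prompt's tokens against the tokens already in the cache and only evaluate the suffix that differs; a generic sketch, assuming token IDs are ints:)

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns how many leading tokens of the new prompt already match the
// cached session, so only the remaining suffix needs to be evaluated.
size_t commonPrefixLength(const std::vector<int>& cachedTokens,
                          const std::vector<int>& promptTokens) {
    size_t n = 0;
    size_t limit = std::min(cachedTokens.size(), promptTokens.size());
    while (n < limit && cachedTokens[n] == promptTokens[n])
        n++;
    return n;
}
```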
Version 0.6.0 is released with the API. I'm closing this issue now. Any further improvement/fix should be discussed in a new thread.