Plugin interface for backends #570

Open · iboB wants to merge 20 commits into master
Conversation

iboB (Contributor) commented Oct 11, 2023

This implements a proposal to add a plugin interface to backends.

There is a demo project which uses the changes from this PR here: https://github.com/iboB/pytorch-ggml-plugin

Notable changes

add set_tensor_external_data to the backend interface

This is as simple as tensor->data = external_data_ptr on CPU and Metal, but requires more steps on CUDA and potentially other backends (creating tensor extra data)
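
For illustration, a minimal sketch of what the CPU implementation could look like (a sketch against the interface proposed in this PR, not the merged code):

static void ggml_backend_cpu_set_tensor_external_data(ggml_backend_t backend, struct ggml_tensor * tensor, void * data) {
    (void)backend; // the CPU backend needs no extra per-tensor state
    // on the CPU backend the external pointer can be used directly
    tensor->data = data;
}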

add ggml_backend_cuda_init_plugin

This allows us to initialize the CUDA backend with an external device id, cuBLAS instance, and CUDA stream

add GGML_PLUGIN config to CMakeLists.txt

This is deliberately not a CMake option; it should be set to ON from the outside when one calls add_subdirectory on ggml for a plugin. It forces a static library build and adds -fPIC.
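
For reference, a sketch of how a plugin's CMakeLists.txt might consume this (the GGML_PLUGIN variable is from this PR; the target names are illustrative, not taken from the demo project):

set(GGML_PLUGIN ON)    # force static lib + -fPIC, per this PR
add_subdirectory(ggml)

add_library(my_ggml_plugin SHARED plugin.c)
target_link_libraries(my_ggml_plugin PRIVATE ggml)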

Questions

Tensors whose pointer is set externally with set_tensor_external_data have no buffer. Thus ggml_get_backend and the associated tensor get/set ops will simply crash when called on them.

We could leave it as is, documenting that these ops are not supported for external tensors and that they have null buffers, or we could add a dummy buffer with the correct backend to them (which may be tricky if one wants to allocate memory with it).

What about something else?

slaren (Collaborator) commented Oct 11, 2023

Setting the tensor buffer to NULL is not good; tensors need to be associated with a backend. The only exception is backwards compatibility with the CPU backend.

I think this can be handled by using functions to create custom buffers, such as ggml_backend_cpu_buffer_from_ptr, and adding a function to allocate a tensor at a specific offset from a buffer. This is how I expect to add support for mmap.
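
A rough sketch of that flow (ggml_backend_alloc_tensor_at is a hypothetical name for the alloc-at-offset function mentioned above, and the ggml_backend_cpu_buffer_from_ptr signature is assumed):

// wrap externally owned memory in a CPU backend buffer
ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr(cpu_backend, external_ptr, external_size);

// hypothetical: place an existing tensor at a given offset inside that buffer
ggml_backend_alloc_tensor_at(buf, tensor, /*offset =*/ 0);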

iboB (Contributor, Author) commented Oct 11, 2023

> I think this can be handled by using functions to create custom buffers, such as ggml_backend_cpu_buffer_from_ptr, and adding a function to allocate a tensor at a specific offset from a buffer. This is how I expect to add support for mmap.

So, wouldn't this mean a separate buffer (and an allocation for the buffer data) for each external tensor?

Note that the use case is changing the tensor's external data pointer often: input/output tensors need to be reset for every inference call.

iboB (Contributor, Author) commented Oct 11, 2023

Why not create a dummy buffer with size zero, which can be used for all external tensors for a given backend?

Thus the backend will be accessible, but allocating data on this buffer will not be possible. Moreover, a query of whether a tensor is external also becomes possible (tensor->buffer == &g_dummy_buffer_for_external_tensors)

slaren (Collaborator) commented Oct 11, 2023

> Why not create a dummy buffer with size zero,

Yes, I think this is fine. Just make a custom type of buffer that returns NULL as the base address and sets the size to zero, then allocate tensors with this buffer at any address that you want.
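
A minimal sketch of such a buffer, assuming a ggml-backend buffer interface along the lines of the one in the tree at the time (the callback names should be checked against ggml-backend-impl):

// get_base returns NULL and the size is zero, so nothing can actually be allocated here
static void * dummy_get_base(ggml_backend_buffer_t buffer) {
    (void)buffer;
    return NULL;
}

static struct ggml_backend_buffer_i dummy_buffer_interface = {
    /* .free_buffer    = */ NULL,
    /* .get_base       = */ dummy_get_base,
    /* .get_alloc_size = */ NULL, // default
    /* .init_tensor    = */ NULL,
    /* .free_tensor    = */ NULL,
};

// created once, with size 0; external tensors then get their data pointer set directly
ggml_backend_buffer_t dummy = ggml_backend_buffer_init(backend, dummy_buffer_interface, NULL, 0);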

iboB (Contributor, Author) commented Oct 11, 2023

@slaren, how's this?

slaren (Collaborator) commented Oct 11, 2023

We cannot just add a dummy buffer to the backends; this may not work in every backend. ggml_backend_cuda_set_tensor_external_data is unnecessary and duplicates the code from init_tensor.

For the CUDA backend, it should work like this:

// create a zero-size buffer whose only job is to identify the backend
ggml_backend_buffer_t buffer = ggml_backend_cuda_buffer_from_ptr(NULL, 0);

// associate the tensor with that buffer and the external data pointer
ggml_backend_init_tensor(buffer, tensor, data);

void ggml_backend_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, void * data) {
    tensor->data   = data;
    tensor->buffer = buffer;
    ggml_backend_buffer_init_tensor(buffer, tensor);
}

After this, modify ggml-alloc.c to call this function to initialize tensors instead of doing it manually.

iboB (Contributor, Author) commented Oct 11, 2023

> We cannot just add a dummy buffer to the backends; this may not work in every backend.

Why not? It's a dummy buffer anyway. If for some reason a backend does not support external pointers, then it should have its set_tensor_external_data set to null.

This is much simpler than duplicating the dummy-buffer code in each backend separately.

iboB (Contributor, Author) commented Oct 11, 2023

> ggml_backend_cuda_set_tensor_external_data is unnecessary and duplicates the code from init_tensor.

The key part here is that the extra is reused if available, while ggml_backend_cuda_buffer_init_tensor sets a new one every time. As I said, the use case is that setting the external pointer may happen for the same tensor multiple times. We will run out of extras if we set a new one for every call.
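
In code, the distinction is roughly this (a sketch; the extra type and field names follow the CUDA backend of the time from memory, and cuda_alloc_extra is a hypothetical helper):

// reuse the tensor's extra across repeated external-pointer updates
if (tensor->extra == NULL) {
    tensor->extra = cuda_alloc_extra(ctx); // hypothetical: taken from a fixed-size pool
}
struct ggml_tensor_extra_gpu * extra = (struct ggml_tensor_extra_gpu *) tensor->extra;
extra->data_device[device_id] = data;      // only the device pointer changes on later calls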

slaren (Collaborator) commented Oct 11, 2023

Some backends have buffer objects and cannot represent a memory address with just a pointer. This is not so uncommon either: Metal, OpenCL and Vulkan work in this way. So a general purpose dummy buffer is completely useless because it needs to be associated with the backend buffer object. It is not good to have a dummy object there that may be useless depending on the backend. Additionally, the code of set_tensor_external_data is essentially a duplicate of init_tensor.

This needs to be handled on a per backend basis, and the buffer interface is flexible enough to allow this.

The CUDA backend needs a different extra for each tensor; the CUDA buffer implementation keeps its own ring buffer of extras. It is not ok to reuse the global ring buffer either: that and other globals in the CUDA backend will be removed in the future.

iboB (Contributor, Author) commented Oct 11, 2023

I think there is some misunderstanding here.

The dummy buffer is just a dummy. It should never be used to actually allocate. It's there just to signify that the tensor is external and to allow get_backend to work. It should not be used for anything else.

Moreover, set_tensor_external_data is not a copy: it critically reuses the extra if one is set.

The workflow for a plugin should be like this:

  • build model, init weights, etc
  • every time inference is requested from the outside, reset input and output data pointers to the external ones, and run model

This means that set_tensor_external_data is not initialization (though it may act as such the first time it's called); every subsequent call reuses existing data. It may be called millions of times for the same tensor (the input and output ones). We cannot guarantee that the outside world will provide the inputs and request the outputs at the same address. In fact, with PyTorch we almost have a guarantee that it won't.

We don't want to rebuild the graph with new tensors for every inference step, so we just reset their data pointers.
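
As a sketch of that loop (graph and tensor names are illustrative; set_tensor_external_data is the function proposed in this PR):

// build the model graph once
struct ggml_cgraph * gf = build_model_graph(ctx);

// per inference request: rebind the external input/output pointers, then compute
void plugin_infer(void * input_ptr, void * output_ptr) {
    ggml_backend_set_tensor_external_data(backend, input_tensor, input_ptr);
    ggml_backend_set_tensor_external_data(backend, output_tensor, output_ptr);
    ggml_backend_graph_compute(backend, gf);
}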

slaren (Collaborator) commented Oct 11, 2023

The problem is that some backends cannot use dummy buffers, they need a buffer object in addition to an offset. For these tensors, the dummy buffer will never be useful. If you wanted to use an external buffer with these backends, you would need a new function to create a ggml_backend_buffer from the native buffer object.

You can modify the implementation of init_tensor so that it reuses the tensor extra if it is already set.

iboB (Contributor, Author) commented Oct 11, 2023

Wouldn't potential extra data, like buffer objects, be stored in that backend's per-tensor extra member?

I still don't quite get what you mean.

Say I have an OpenGL backend:

GLuint vbo = ...;
ggml_backend_set_tensor_external_data(glbackend, mytensor, (void*)(uintptr_t)vbo);
...

static void ggml_backend_opengl_set_tensor_external_data(ggml_backend_t backend, struct ggml_tensor * tensor, void * data) {
    ...
    // mark the tensor as external; the dummy buffer only identifies the backend
    tensor->buffer = &backend->dummy_external_tensor_buffer;
    // the actual storage is the GL buffer object, passed in as an opaque handle
    extra->vbo    = (GLuint)(uintptr_t)data;
    extra->offset = 0;
}

The dummy buffer is still usable. Its only point is to allow ggml_backend_is_tensor_external and the other helpers which internally call ggml_get_backend.

iboB (Contributor, Author) commented Oct 11, 2023

Though it does make sense to have offset as a future-proof argument to set_external_data, for cases where the data is totally opaque. I'm adding it now.

iboB (Contributor, Author) commented Oct 11, 2023

I found a problem with the dummy buffer for cuda. Since it is impossible to associate something with the lifetime of a tensor, it is impossible to manage the extras properly.

I'll try to think of a solution

iboB marked this pull request as draft October 11, 2023 13:33
iboB (Contributor, Author) commented Oct 12, 2023

OK, so the main problem is views. Views get their data pointers when the compute graph is built. Obviously, if there are views of external tensors which are then redirected, this will lead to bugs and crashes.

While I have some ideas of how to make this work, I won't be adding them to this PR. Instead I'll go for the simpler solution: use an allocator to set external pointers, which means (see the sketch after this list):

  • External pointers will be assigned with an allocr function
  • Only newly created tensors can be assigned external data, and once assigned, it cannot be changed
  • This means that the requirement to avoid rebuilding the compute graph is dropped. You will have to rebuild the graph if you reassign external pointers
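
A sketch of the resulting usage (ggml_allocr_set_tensor_external_data is a hypothetical name for the allocr entry point; check the PR diff for the actual one):

// the graph is (re)built for each new set of external pointers
struct ggml_tensor * input = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_inputs);

// hypothetical: bind external data once, at tensor creation time
ggml_allocr_set_tensor_external_data(alloc, input, external_ptr, /*offset =*/ 0);
// from here on the pointer cannot change; rebinding requires rebuilding the graph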

After this makes it into ggml, I will open an issue to discuss potential improvements.

Essentially a rewrite is incoming :)

iboB (Contributor, Author) commented Oct 12, 2023

@slaren so this is the current version.

The limitations are as mentioned above.

Example usage is added.

I talked with @ggerganov and there is no Metal support yet. It's not that easy, though we have some ideas of how to handle it. Metal support can be addressed in the future if there's interest. The odds of there being demand for Metal-based plugins are minimal anyway.

iboB marked this pull request as ready for review October 12, 2023 08:56
ggerganov self-requested a review October 12, 2023 10:50
ggerganov (Owner) left a comment


The changes in ggml are quite minimal and isolated, so I don't think there would be an issue to merge this. Adding a CI to run the plugin example would be useful for long-term support of this functionality.

ggerganov requested a review from slaren October 12, 2023 13:34
slaren (Collaborator) commented Oct 12, 2023

I have already explained how I intend to implement support for external tensors. ggml-backend is a work in progress and frankly, it is quite annoying to be ignored about changes to a design that I am still working on.

So my conclusion: it is too early to merge changes to ggml-backend. If you want to implement this change, please do it on the old ggml-cuda interface.

ggerganov (Owner) commented Oct 12, 2023

@slaren That's a fair point. You have the final call for this PR and if you think it interferes with the design of the interface we will not merge it at this point. The change is trivial, so I don't think it would be a problem for @iboB to maintain it in a separate branch, until and if support for this sort of functionality becomes available in the future.

iboB (Contributor, Author) commented Oct 12, 2023

It's trivial now, but it won't be trivial with any of these features:

  • Support for resetting external data of existing tensors (and thus no requirement to rebuild the cgraph). The main complication here is views, but also extra buffers
  • Metal support (ideally this would require extra data per tensor, or at least Metal buffers per backend buffer as opposed to a global one)

I still think that the ability to create plugins for existing inference apps (built with PyTorch, TensorFlow, cuDNN, etc.) is very valuable.

iboB (Contributor, Author) commented Oct 12, 2023

BTW, @slaren how do you intend to implement external data for tensors?

slaren (Collaborator) commented Oct 12, 2023

Probably in the way that I suggested earlier, but nothing is set in stone. I will get back to this while implementing support for mmap for llama.cpp. As for the views, we can add a function to ggml-backend to re-initialize a view, but I don't think that this can be automated, since there isn't any consistent way to enumerate all views of a tensor in ggml at the moment. The CUDA extras will also very likely be removed except for multi GPU in split tensors, so that may simplify some things a bit.

iboB (Contributor, Author) commented Oct 13, 2023

If split tensor ops are eliminated, the CUDA extras can easily be eliminated as well. Otherwise it doesn't seem possible.

An alternative approach for views would be to reinitialize them when computing the graph, as opposed to when building the graph. This is cheap and the additional overhead would be negligible.
