Plugin interface for backends #570

Open · iboB wants to merge 20 commits into master
Conversation

iboB (Contributor) commented Oct 11, 2023

This implements a proposal to add a plugin interface to backends.

There is a demo project which uses the changes from this PR here: https://github.com/iboB/pytorch-ggml-plugin

Notable changes

add set_tensor_external_data to the backend interface

This is as simple as tensor->data = external_data_ptr on CPU and Metal, but requires more steps on CUDA and potentially other backends (creating tensor extra data)
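
For illustration, a minimal sketch of what the CPU implementation could look like (a sketch against the interface proposed in this PR, not the merged code):

static void ggml_backend_cpu_set_tensor_external_data(ggml_backend_t backend, struct ggml_tensor * tensor, void * data) {
    (void)backend; // the CPU backend needs no extra per-tensor state
    // on the CPU backend the external pointer can be used directly
    tensor->data = data;
}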

add ggml_backend_cuda_init_plugin

This allows us to initialize the CUDA backend with an external device id, cuBLAS instance, and CUDA stream

add GGML_PLUGIN config to CMakeLists.txt

This is deliberately not a CMake option; it should be set to ON from the outside when one calls add_subdirectory on ggml for a plugin. It forces a static library build and adds -fPIC.
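
For reference, a sketch of how a plugin's CMakeLists.txt might consume this (the GGML_PLUGIN variable is from this PR; the target names are illustrative, not taken from the demo project):

set(GGML_PLUGIN ON)    # force static lib + -fPIC, per this PR
add_subdirectory(ggml)

add_library(my_ggml_plugin SHARED plugin.c)
target_link_libraries(my_ggml_plugin PRIVATE ggml)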

Questions

Tensors whose pointer is set externally with set_tensor_external_data have no buffer. Thus ggml_get_backend and the associated tensor get/set ops will simply crash when called on them.

We could leave it as is, documenting that these ops are not supported for external tensors and that they have null buffers, or we could add a dummy buffer with the correct backend to them (which may be tricky if one wants to allocate memory with it).

What about something else?

slaren (Collaborator) commented Oct 11, 2023

Setting the tensor buffer to NULL is not good; tensors need to be associated with a backend. The only exception is backwards compatibility with the CPU backend.

I think this can be handled by using functions to create custom buffers, such as ggml_backend_cpu_buffer_from_ptr, and adding a function to allocate a tensor at a specific offset from a buffer. This is how I expect to add support for mmap.
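
A rough sketch of that flow (ggml_backend_alloc_tensor_at is a hypothetical name for the alloc-at-offset function mentioned above, and the ggml_backend_cpu_buffer_from_ptr signature is assumed):

// wrap externally owned memory in a CPU backend buffer
ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr(cpu_backend, external_ptr, external_size);

// hypothetical: place an existing tensor at a given offset inside that buffer
ggml_backend_alloc_tensor_at(buf, tensor, /*offset =*/ 0);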

iboB (Contributor, Author) commented Oct 11, 2023

> I think this can be handled by using functions to create custom buffers, such as ggml_backend_cpu_buffer_from_ptr, and adding a function to allocate a tensor at a specific offset from a buffer. This is how I expect to add support for mmap.

So, wouldn't this mean a separate buffer (and an allocation for the buffer data) for each external tensor?

Note that the use case is changing the tensor's external data pointer often: input/output tensors need to be reset for every inference call.

iboB (Contributor, Author) commented Oct 11, 2023

Why not create a dummy buffer with size zero, which can be used for all external tensors for a given backend?

Thus the backend will be accessible, but allocating data on this buffer will not be possible. Moreover, a query of whether a tensor is external also becomes possible (tensor->buffer == &g_dummy_buffer_for_external_tensors)

slaren (Collaborator) commented Oct 11, 2023

> Why not create a dummy buffer with size zero,

Yes, I think this is fine. Just make a custom type of buffer that returns NULL as the base address and sets the size to zero, then allocate tensors with this buffer at any address that you want.
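
A minimal sketch of such a buffer, assuming a ggml-backend buffer interface along the lines of the one in the tree at the time (the callback names should be checked against ggml-backend-impl):

// get_base returns NULL and the size is zero, so nothing can actually be allocated here
static void * dummy_get_base(ggml_backend_buffer_t buffer) {
    (void)buffer;
    return NULL;
}

static struct ggml_backend_buffer_i dummy_buffer_interface = {
    /* .free_buffer    = */ NULL,
    /* .get_base       = */ dummy_get_base,
    /* .get_alloc_size = */ NULL, // default
    /* .init_tensor    = */ NULL,
    /* .free_tensor    = */ NULL,
};

// created once, with size 0; external tensors then get their data pointer set directly
ggml_backend_buffer_t dummy = ggml_backend_buffer_init(backend, dummy_buffer_interface, NULL, 0);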

iboB (Contributor, Author) commented Oct 11, 2023

@slaren, how's this?

slaren (Collaborator) commented Oct 11, 2023

We cannot just add a dummy buffer to the backends; this may not work in every backend. ggml_backend_cuda_set_tensor_external_data is unnecessary and duplicates the code from init_tensor.

For the CUDA backend, it should work like this:

// create a zero-size buffer whose only job is to identify the backend
ggml_backend_buffer_t buffer = ggml_backend_cuda_buffer_from_ptr(NULL, 0);

// associate the tensor with that buffer and the external data pointer
ggml_backend_init_tensor(buffer, tensor, data);

void ggml_backend_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, void * data) {
    tensor->data   = data;
    tensor->buffer = buffer;
    ggml_backend_buffer_init_tensor(buffer, tensor);
}

After this, modify ggml-alloc.c to call this function to initialize tensors instead of doing it manually.

iboB (Contributor, Author) commented Oct 11, 2023

> We cannot just add a dummy buffer to the backends; this may not work in every backend.

Why not? It's a dummy buffer anyway. If for some reason a backend does not support external pointers, then it should have its set_tensor_external_data set to null.

This is much simpler than duplicating the dummy-buffer code in each backend separately.

iboB (Contributor, Author) commented Oct 11, 2023

> ggml_backend_cuda_set_tensor_external_data is unnecessary and duplicates the code from init_tensor.

The key part here is that the extra is reused if available, while ggml_backend_cuda_buffer_init_tensor sets a new one every time. As I said, the use case is that setting the external pointer may happen for the same tensor multiple times. We will run out of extras if we set a new one for every call.
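
In code, the distinction is roughly this (a sketch; the extra type and field names follow the CUDA backend of the time from memory, and cuda_alloc_extra is a hypothetical helper):

// reuse the tensor's extra across repeated external-pointer updates
if (tensor->extra == NULL) {
    tensor->extra = cuda_alloc_extra(ctx); // hypothetical: taken from a fixed-size pool
}
struct ggml_tensor_extra_gpu * extra = (struct ggml_tensor_extra_gpu *) tensor->extra;
extra->data_device[device_id] = data;      // only the device pointer changes on later calls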

slaren (Collaborator) commented Oct 11, 2023

Some backends have buffer objects and cannot represent a memory address with just a pointer. This is not so uncommon either: Metal, OpenCL and Vulkan work in this way. So a general purpose dummy buffer is completely useless because it needs to be associated with the backend buffer object. It is not good to have a dummy object there that may be useless depending on the backend. Additionally, the code of set_tensor_external_data is essentially a duplicate of init_tensor.

This needs to be handled on a per backend basis, and the buffer interface is flexible enough to allow this.

The CUDA backend needs a different extra for each tensor; the CUDA buffer implementation keeps its own ring buffer of extras. It is not ok to reuse the global ring buffer either: that and other globals in the CUDA backend will be removed in the future.

iboB (Contributor, Author) commented Oct 11, 2023

I think there is some misunderstanding here.

The dummy buffer is just a dummy. It should never be used to actually allocate. It's there just to signify that the tensor is external and to allow get_backend to work. It should not be used for anything else.

Moreover, set_tensor_external_data is not a copy: it critically reuses the extra if one is set.

The workflow for a plugin should be like this:

  • build model, init weights, etc
  • every time inference is requested from the outside, reset input and output data pointers to the external ones, and run model

This means that set_tensor_external_data is not initialization (though it may act as such the first time it's called); every subsequent call reuses existing data. It may be called millions of times for the same tensor (the input and output ones). We cannot guarantee that the outside world will provide the inputs and request the outputs at the same address. In fact, with PyTorch we almost have a guarantee that it won't.

We don't want to rebuild the graph with new tensors for every inference step, so we just reset their data pointers.
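
As a sketch of that loop (graph and tensor names are illustrative; set_tensor_external_data is the function proposed in this PR):

// build the model graph once
struct ggml_cgraph * gf = build_model_graph(ctx);

// per inference request: rebind the external input/output pointers, then compute
void plugin_infer(void * input_ptr, void * output_ptr) {
    ggml_backend_set_tensor_external_data(backend, input_tensor, input_ptr);
    ggml_backend_set_tensor_external_data(backend, output_tensor, output_ptr);
    ggml_backend_graph_compute(backend, gf);
}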

slaren (Collaborator) commented Oct 11, 2023

The problem is that some backends cannot use dummy buffers, they need a buffer object in addition to an offset. For these tensors, the dummy buffer will never be useful. If you wanted to use an external buffer with these backends, you would need a new function to create a ggml_backend_buffer from the native buffer object.

You can modify the implementation of init_tensor so that it reuses the tensor extra if it is already set.

iboB (Contributor, Author) commented Oct 11, 2023

Wouldn't potential extra data, like buffer objects, be stored in that backend's per-tensor extra member?

I still don't quite get what you mean.

Say I have an OpenGL backend:

GLuint vbo = ...;
ggml_backend_set_tensor_external_data(glbackend, mytensor, (void*)(uintptr_t)vbo);
...

static void ggml_backend_opengl_set_tensor_external_data(ggml_backend_t backend, struct ggml_tensor * tensor, void * data) {
    ...
    // mark the tensor as external; the dummy buffer only identifies the backend
    tensor->buffer = &backend->dummy_external_tensor_buffer;
    // the actual storage is the GL buffer object, passed in as an opaque handle
    extra->vbo    = (GLuint)(uintptr_t)data;
    extra->offset = 0;
}

The dummy buffer is still usable. Its only point is to allow ggml_backend_is_tensor_external and the other helpers which internally call ggml_get_backend.

iboB (Contributor, Author) commented Oct 11, 2023

Though it does make sense to have offset as a future-proof argument to set_external_data, for cases where the data is totally opaque. I'm adding it now.

iboB (Contributor, Author) commented Oct 11, 2023

I found a problem with the dummy buffer for cuda. Since it is impossible to associate something with the lifetime of a tensor, it is impossible to manage the extras properly.

I'll try to think of a solution

iboB marked this pull request as draft October 11, 2023 13:33
iboB (Contributor, Author) commented Oct 12, 2023

OK, so the main problem is views. Views get their data pointers when the compute graph is built. Obviously, if there are views of external tensors which are then redirected, this will lead to bugs and crashes.

While I have some ideas of how to make this work, I won't be adding them to this PR. Instead I'll go for the simpler solution: use an allocator to set external pointers, which means (see the sketch after this list):

  • External pointers will be assigned with an allocr function
  • Only newly created tensors can be assigned external data, and once assigned, it cannot be changed
  • This means that the requirement to avoid rebuilding the compute graph is dropped. You will have to rebuild the graph if you reassign external pointers
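
A sketch of the resulting usage (ggml_allocr_set_tensor_external_data is a hypothetical name for the allocr entry point; check the PR diff for the actual one):

// the graph is (re)built for each new set of external pointers
struct ggml_tensor * input = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_inputs);

// hypothetical: bind external data once, at tensor creation time
ggml_allocr_set_tensor_external_data(alloc, input, external_ptr, /*offset =*/ 0);
// from here on the pointer cannot change; rebinding requires rebuilding the graph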

After this makes it into ggml, I will open an issue to discuss potential improvements.

Essentially a rewrite is incoming :)

iboB (Contributor, Author) commented Oct 12, 2023

@slaren so this is the current version.

The limitations are as mentioned above.

Example usage is added.

I talked with @ggerganov and there is no Metal support yet. It's not that easy, though we have some ideas of how to handle it. Metal support can be addressed in the future if there's interest. The odds of there being demand for Metal-based plugins are minimal anyway.

iboB marked this pull request as ready for review October 12, 2023 08:56
ggerganov self-requested a review October 12, 2023 10:50
ggerganov (Owner) left a comment


The changes in ggml are quite minimal and isolated, so I don't think there would be an issue to merge this. Adding a CI to run the plugin example would be useful for long-term support of this functionality.

ggerganov requested a review from slaren October 12, 2023 13:34
slaren (Collaborator) commented Oct 12, 2023

I have already explained how I intend to implement support for external tensors. ggml-backend is a work in progress and frankly, it is quite annoying to be ignored about changes to a design that I am still working on.

So my conclusion: it is too early to merge changes to ggml-backend. If you want to implement this change, please do it on the old ggml-cuda interface.

ggerganov (Owner) commented Oct 12, 2023

@slaren That's a fair point. You have the final call for this PR and if you think it interferes with the design of the interface we will not merge it at this point. The change is trivial, so I don't think it would be a problem for @iboB to maintain it in a separate branch, until and if support for this sort of functionality becomes available in the future.

iboB (Contributor, Author) commented Oct 12, 2023

It's trivial now, but it won't be trivial with any of these features:

  • Support for resetting external data of existing tensors (and thus no requirement to rebuild the cgraph). The main complication here is views, but also extra buffers
  • Metal support (ideally this would require extra data per tensor, or at least Metal buffers per backend buffer as opposed to a global one)

I still think that the ability to create plugins for existing inference apps (built with PyTorch, TensorFlow, cuDNN, etc.) is very valuable.

iboB (Contributor, Author) commented Oct 12, 2023

BTW, @slaren how do you intend to implement external data for tensors?

slaren (Collaborator) commented Oct 12, 2023

Probably in the way that I suggested earlier, but nothing is set in stone. I will get back to this while implementing support for mmap for llama.cpp. As for the views, we can add a function to ggml-backend to re-initialize a view, but I don't think that this can be automated, since there isn't any consistent way to enumerate all views of a tensor in ggml at the moment. The CUDA extras will also very likely be removed except for multi GPU in split tensors, so that may simplify some things a bit.

iboB (Contributor, Author) commented Oct 13, 2023

If split tensor ops are eliminated, the CUDA extras can easily be eliminated as well. Otherwise it doesn't seem possible.

An alternative approach for views would be to reinitialize them when computing the graph, as opposed to when building the graph. This is cheap and the additional overhead would be negligible.
