Memory Planning Issue/Optimization (for "concat" style operations) #3210
Comments
@metascroy recently did something similar with view_copy elimination. I think this can definitely work, though we are being pretty careful about introducing views and aliasing right now by default. For this scenario there's some risk that introducing the aliasing is not sound. It's almost certainly fine in practice, since we have no mutation in the graph today outside of a niche copy_ scenario and the out args of out-variant ops. The reason this risk exists is that torch.cat in PyTorch isn't a view; AFAIK it copies the memory under the hood. You could probably write two passes here to handle this.
I'd have to think more about whether there's a better way we could be expressing special nodes that we don't actually want in the runtime, though. Maybe the changes @metascroy is making for no-op views could be extended to allow users to hook in. CC @larryliu0820 @angelayi for thoughts.
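The eager-mode version of the aliasing idea can be sketched in a few lines: instead of materializing `sin` and `cos` separately and letting `cat` copy them, each result is written directly into a slice of one preallocated buffer. This is only an illustration of the memory layout being discussed, not the ExecuTorch pass itself; the sizes are assumptions.

```python
import torch

# Sketch of the aliasing idea: write sin(x) and cos(x) straight into
# adjacent slices of one preallocated buffer, so no cat copy is needed.
x = torch.randn(64)        # 64 float32 values = 256 bytes
out = torch.empty(128)     # 128 float32 values = 512 bytes

torch.sin(x, out=out[:64])  # lands in bytes [0, 256)
torch.cos(x, out=out[64:])  # lands in bytes [256, 512)

# out now matches the cat result without an explicit copy node
reference = torch.cat([torch.sin(x), torch.cos(x)])
print(torch.allclose(out, reference))  # True
```

The soundness risk mentioned above is exactly that `out` aliases the storage of what used to be two independent tensors, which is only safe if nothing mutates them afterwards.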
@ydwu4 can you take a look at this?
This seems like a bug, honestly. I would suspect that here: https://github.com/pytorch/executorch/blob/main/exir/memory_planning.py#L593-L598 we are annotating the size wrong. Need to debug further, though.
Did some checking. The input shapes are correct, and the issue is due to the non-optimality of the greedy algorithm. We need to memory-plan for 5 tensors: input, sin, cos, concat, softmax. As shown below, each pair is (tensor_allocated_memory, shared obj info):

(256, SharedObject(idx=0, offset=0, size=512, last_used_index=8))
(256, SharedObject(idx=1, offset=512, size=512, last_used_index=9))
(256, SharedObject(idx=2, offset=1024, size=256, last_used_index=6))
(512, SharedObject(idx=0, offset=0, size=512, last_used_index=8))
(512, SharedObject(idx=1, offset=512, size=512, last_used_index=9))

The algorithm first allocates three 256-byte buffers, then expands the first two of them to 512 bytes (due to the greediness of selecting the best fit so far with only local knowledge). So we're getting 512 + 512 + 256 = 1280 bytes of total memory, while the optimal memory size would be 512 (for storing the output of cat) + 512 (for storing the output of softmax) = 1024 bytes. Note that we're not actually very far from the optimum, given that we don't leverage the information that the cat operation can actually be combined into one allocation.
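The greedy behavior described above can be reproduced with a toy best-fit planner. This is a sketch only: the lifetimes below are assumptions chosen to match the five tensors in this thread, and the code does not reflect the actual EXIR implementation.

```python
from dataclasses import dataclass

@dataclass
class SharedObject:
    size: int
    last_used: int  # last step at which the current occupant is alive

def greedy_plan(tensors):
    """tensors: (size_bytes, first_use, last_use) in allocation order.

    Toy best-fit planner: reuse a dead buffer when possible, growing it
    if the new tensor is larger; otherwise allocate a fresh buffer.
    """
    objs = []
    for size, first, last in tensors:
        # buffers whose previous occupant died before this tensor is live
        free = [o for o in objs if o.last_used < first]
        if free:
            best = min(free, key=lambda o: abs(o.size - size))  # best fit
            best.size = max(best.size, size)  # grow the buffer if needed
            best.last_used = last
        else:
            objs.append(SharedObject(size, last))
    return sum(o.size for o in objs)

# Assumed sizes and lifetimes for the five tensors in this thread:
tensors = [
    (256, 0, 4),  # input x
    (256, 1, 5),  # aten.sin (alive until cat reads it)
    (256, 2, 5),  # aten.cos (alive until cat reads it)
    (512, 5, 6),  # aten.cat
    (512, 6, 7),  # softmax
]
print(greedy_plan(tensors))  # 1280 under these assumptions; optimum is 1024
```

Under these assumed lifetimes the planner creates three 256-byte buffers for x/sin/cos, then grows two of them to 512 bytes for cat and softmax, reproducing the 512 + 512 + 256 = 1280 total described above.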
Thanks for the suggestions. I am not yet sufficiently familiar with the EXIR memory planner, but I was planning to get familiar with it and do exactly what you suggested (i.e., subclass the base memory planner and see if I can add hooks for users to handle special situations). I will try this when I get some time, but it would be nice if this were a built-in feature (provided all the right conditions are met for such an optimization).
Would it be possible to enhance the existing memory planner to cover situations where intelligent use of memory offsets can eliminate unnecessary copies? I am posting a minimal example below:
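The example code did not survive extraction. Based on the five tensors discussed elsewhere in the thread (input, sin, cos, concat, softmax, with 256-byte inputs), it was presumably a module along these lines; this is a reconstruction under those assumptions, not the author's exact code.

```python
import torch

class Example(torch.nn.Module):
    def forward(self, x):
        # x: 64 float32 values = 256 bytes
        s = torch.sin(x)                 # 256 bytes
        c = torch.cos(x)                 # 256 bytes
        y = torch.cat([s, c], dim=-1)    # 512 bytes
        return torch.softmax(y, dim=-1)  # 512 bytes

m = Example()
print(m(torch.randn(64)).shape)  # torch.Size([128])
```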
This produces a graph like below:
And a corresponding memory allocation like below:
There are two issues I see with this example:

1. The allocation for `aten.sin` should have been 256 instead of 512. I am not sure why we have a hole of 256 bytes between `x` and `aten.sin`. The alignment (defaults to 16) is already met even if we use 256 for `x`.
2. The `aten.cat` operation can be completely eliminated if the output of `aten.cos` were directed to offset 256 and the output of `aten.sin` were directed to an offset of 512 (which puts them back to back in memory without requiring an explicit copy done by `aten.cat`).

In the list above, (1) seems like a bug while (2) is an enhancement request. Please let me know in case I can provide more information.
Version
Collecting environment information...
PyTorch version: 2.4.0.dev20240415+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.29.2
Libc version: glibc-2.31
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: GRID A100D-10C
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] executorch==0.2.1.dev0+unknown
[pip3] numpy==1.26.4
[pip3] onnx==1.13.1
[pip3] torch==2.4.0.dev20240415+cpu
[pip3] torchaudio==2.2.0.dev20240415+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.19.0.dev20240415+cpu
[conda] blas 1.0 mkl
[conda] cpuonly 2.0 0 pytorch-nightly
[conda] executorch 0.2.1.dev0+unknown pypi_0 pypi
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch-nightly
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] pytorch 2.4.0.dev20240417 py3.10_cpu_0 pytorch-nightly
[conda] pytorch-mutex 1.0 cpu pytorch-nightly
[conda] torch 2.4.0.dev20240415+cpu pypi_0 pypi
[conda] torchaudio 2.2.0.dev20240415+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.19.0.dev20240415+cpu pypi_0 pypi