Memory Planning Issue/Optimization (for "concat" style operations) #3210
Comments
@metascroy recently did something similar with view_copy elimination. I think this can definitely work, though we are being pretty careful about introducing views and aliasing right now by default. For this scenario there's some risk that introducing the aliasing is not sound. It's almost certainly fine in practice, since we have no mutation in the graph today outside of a niche copy_ scenario and the out args of out-variant ops. The reason this risk exists is that torch.cat in PyTorch isn't a view; AFAIK it copies the memory under the hood. You could probably write two passes here to handle this.
I'd have to think more about whether there's a better way we could be expressing special nodes that we don't actually want in the runtime, though. Maybe the changes @metascroy is making for no-op views could be extended to allow users to hook in. CC @larryliu0820 @angelayi for thoughts.
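The eager-mode version of the aliasing idea can be sketched in a few lines: instead of materializing `sin` and `cos` separately and letting `cat` copy them, each result is written directly into a slice of one preallocated buffer. This is only an illustration of the memory layout being discussed, not the ExecuTorch pass itself; the sizes are assumptions.

```python
import torch

# Sketch of the aliasing idea: write sin(x) and cos(x) straight into
# adjacent slices of one preallocated buffer, so no cat copy is needed.
x = torch.randn(64)        # 64 float32 values = 256 bytes
out = torch.empty(128)     # 128 float32 values = 512 bytes

torch.sin(x, out=out[:64])  # lands in bytes [0, 256)
torch.cos(x, out=out[64:])  # lands in bytes [256, 512)

# out now matches the cat result without an explicit copy node
reference = torch.cat([torch.sin(x), torch.cos(x)])
print(torch.allclose(out, reference))  # True
```

The soundness risk mentioned above is exactly that `out` aliases the storage of what used to be two independent tensors, which is only safe if nothing mutates them afterwards.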
@ydwu4 can you take a look at this?
This seems like a bug, honestly. I would suspect that here: https://github.com/pytorch/executorch/blob/main/exir/memory_planning.py#L593-L598 we are annotating the size wrong. Need to debug further, though.
Did some checking. The input shapes are correct, and the issue is due to the non-optimality of the greedy algorithm. We need to memory-plan for 5 tensors: input, sin, cos, concat, softmax. As shown below, each pair is (tensor_allocated_memory, shared obj info):

(256, SharedObject(idx=0, offset=0, size=512, last_used_index=8))
(256, SharedObject(idx=1, offset=512, size=512, last_used_index=9))
(256, SharedObject(idx=2, offset=1024, size=256, last_used_index=6))
(512, SharedObject(idx=0, offset=0, size=512, last_used_index=8))
(512, SharedObject(idx=1, offset=512, size=512, last_used_index=9))

The algorithm first allocates three 256-byte buffers, then expands the first two of them to 512 bytes (due to the greediness of selecting the best fit so far with only local knowledge). So we're getting 512 + 512 + 256 = 1280 bytes of total memory, while the optimal memory size would be 512 (for storing the output of cat) + 512 (for storing the output of softmax) = 1024 bytes. Note that we're not actually very far from the optimum, given that we don't leverage the information that the cat operation can actually be combined into one allocation.
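The greedy behavior described above can be reproduced with a toy best-fit planner. This is a sketch only: the lifetimes below are assumptions chosen to match the five tensors in this thread, and the code does not reflect the actual EXIR implementation.

```python
from dataclasses import dataclass

@dataclass
class SharedObject:
    size: int
    last_used: int  # last step at which the current occupant is alive

def greedy_plan(tensors):
    """tensors: (size_bytes, first_use, last_use) in allocation order.

    Toy best-fit planner: reuse a dead buffer when possible, growing it
    if the new tensor is larger; otherwise allocate a fresh buffer.
    """
    objs = []
    for size, first, last in tensors:
        # buffers whose previous occupant died before this tensor is live
        free = [o for o in objs if o.last_used < first]
        if free:
            best = min(free, key=lambda o: abs(o.size - size))  # best fit
            best.size = max(best.size, size)  # grow the buffer if needed
            best.last_used = last
        else:
            objs.append(SharedObject(size, last))
    return sum(o.size for o in objs)

# Assumed sizes and lifetimes for the five tensors in this thread:
tensors = [
    (256, 0, 4),  # input x
    (256, 1, 5),  # aten.sin (alive until cat reads it)
    (256, 2, 5),  # aten.cos (alive until cat reads it)
    (512, 5, 6),  # aten.cat
    (512, 6, 7),  # softmax
]
print(greedy_plan(tensors))  # 1280 under these assumptions; optimum is 1024
```

Under these assumed lifetimes the planner creates three 256-byte buffers for x/sin/cos, then grows two of them to 512 bytes for cat and softmax, reproducing the 512 + 512 + 256 = 1280 total described above.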
Thanks for the suggestions. I am not yet sufficiently familiar with the EXIR memory planner, but I was planning to get familiar with it and do exactly what you suggested (i.e., subclass the base memory planner and see if I can add hooks for users to handle special situations). I will try this when I get some time, but it would be nice if this were a built-in feature (provided all the right conditions are met for such an optimization).
Would it be possible to enhance the existing memory planner to cover situations where intelligent use of memory offsets can eliminate unnecessary copies? I am posting a minimal example below:
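The example code did not survive extraction. Based on the five tensors discussed elsewhere in the thread (input, sin, cos, concat, softmax, with 256-byte inputs), it was presumably a module along these lines; this is a reconstruction under those assumptions, not the author's exact code.

```python
import torch

class Example(torch.nn.Module):
    def forward(self, x):
        # x: 64 float32 values = 256 bytes
        s = torch.sin(x)                 # 256 bytes
        c = torch.cos(x)                 # 256 bytes
        y = torch.cat([s, c], dim=-1)    # 512 bytes
        return torch.softmax(y, dim=-1)  # 512 bytes

m = Example()
print(m(torch.randn(64)).shape)  # torch.Size([128])
```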
This produces a graph like below:
And a corresponding memory allocation like below:
There are two issues I see with this example:

1. The allocation for `aten.sin` should have been 256 instead of 512. I am not sure why we have a hole of 256 bytes between `x` and `aten.sin`. The alignment (defaults to 16) is already met even if we use 256 for `x`.
2. The `aten.cat` operation can be completely eliminated if the output of `aten.cos` were directed to offset 256 and the output of `aten.sin` were directed to an offset of 512 (which puts them back to back in memory without requiring an explicit copy done by `aten.cat`).

In the list above, (1) seems like a bug while (2) is an enhancement request. Please let me know in case I can provide more information.
Version
Collecting environment information...
PyTorch version: 2.4.0.dev20240415+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.29.2
Libc version: glibc-2.31
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: GRID A100D-10C
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] executorch==0.2.1.dev0+unknown
[pip3] numpy==1.26.4
[pip3] onnx==1.13.1
[pip3] torch==2.4.0.dev20240415+cpu
[pip3] torchaudio==2.2.0.dev20240415+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.19.0.dev20240415+cpu
[conda] blas 1.0 mkl
[conda] cpuonly 2.0 0 pytorch-nightly
[conda] executorch 0.2.1.dev0+unknown pypi_0 pypi
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch-nightly
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] pytorch 2.4.0.dev20240417 py3.10_cpu_0 pytorch-nightly
[conda] pytorch-mutex 1.0 cpu pytorch-nightly
[conda] torch 2.4.0.dev20240415+cpu pypi_0 pypi
[conda] torchaudio 2.2.0.dev20240415+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.19.0.dev20240415+cpu pypi_0 pypi