
[QST] Additional GPU mem reservation when creating a Dataset causes OOM when allocating all GPU mem to the LocalCUDACluster #1863

Open
piojanu opened this issue Sep 20, 2023 · 3 comments
Labels
question Further information is requested

Comments

@piojanu

piojanu commented Sep 20, 2023

Hi!

I've run into the following problem:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    n_workers=1,                 # Number of GPU workers
    device_memory_limit="12GB",  # GPU->CPU spill threshold (~75% of GPU memory)
    rmm_pool_size="16GB",        # Memory pool size on each worker
)
client = Client(cluster)

# NOTE: Importing Merlin before cluster creation will ALSO create this additional reservation on GPU
from merlin.core.utils import set_dask_client

set_dask_client(client)

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 100, 1000),
    "b": np.random.randint(0, 100, 1000),
    "c": np.random.randint(0, 100, 1000),
})

import nvtabular as nvt

# Whether or not I pass `client` here, an additional reservation is created on the GPU,
# which causes "cudaErrorMemoryAllocation out of memory" at this point.
ds = nvt.Dataset(df, client=client)

I run this code in JupyterLab on a GCP VM with an NVIDIA V100 16GB GPU.
I've also tried nvtabular.utils.set_dask_client and it didn't solve the problem.

Questions:

  • Is this expected behavior and am I misunderstanding something?
  • Can't I simply allocate all GPU memory to the cluster and have NVTabular use it?
  • How should the LocalCUDACluster be configured then? (See the sketch below.)
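
For context on the arithmetic: rmm_pool_size="16GB" claims essentially the whole 16 GB V100 up front, so the additional reservation made when nvt.Dataset is constructed has nowhere to go. Below is a minimal sketch of a configuration that leaves headroom for it; the specific sizes are my assumption, not values confirmed in this thread.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    n_workers=1,                 # Number of GPU workers
    device_memory_limit="10GB",  # GPU->CPU spill threshold, kept below the pool size
    rmm_pool_size="14GB",        # Assumed size: leaves ~2GB of the 16GB GPU outside the RMM pool
)
client = Client(cluster)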
@piojanu piojanu added the question Further information is requested label Sep 20, 2023
@rnyak
Contributor

rnyak commented Sep 20, 2023

@piojanu What helps with OOM issues in NVT is tuning the part_size and the row-group memory size of your parquet file(s). You can also repartition your dataset and save it back to disk, which might help with the OOM; a sketch of what that could look like is below. For the LocalCUDACluster args, you can read here: https://docs.rapids.ai/api/dask-cuda/nightly/api/

If you have a single GPU, you can try setting the row-group size of your files, which can help even without a LocalCUDACluster.

There is a LocalCUDACluster example here:
https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/quick_start/scripts/preproc/preprocessing.py
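
Here is a sketch of what the repartition-and-save suggestion could look like; the paths, partition size, and row-group size are placeholders I'm assuming, and you should check that your cudf/dask_cudf version forwards the row-group argument to the parquet writer.

import dask_cudf
import nvtabular as nvt

# Read the original parquet files and rewrite them with smaller, uniform partitions.
ddf = dask_cudf.read_parquet("input_data/")
ddf = ddf.repartition(partition_size="256MB")
ddf.to_parquet(
    "repartitioned_data/",
    row_group_size_bytes=128_000_000,  # keep row groups well below the partition size
)

# Point NVTabular at the rewritten files with a matching part_size.
ds = nvt.Dataset("repartitioned_data/", engine="parquet", part_size="256MB")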

@piojanu
Author

piojanu commented Sep 20, 2023

Hi!

I have follow-up questions:

  • What is the rule of thumb for setting the part_size and the row-group memory size? Should they be smaller or bigger? How does one influence the other?
  • What do you mean by “you can also repartition your dataset and save it back to disk”? Can you show me a code snippet?

Thanks for the help :)

@piojanu
Author

piojanu commented Nov 6, 2023

By accident, I've found out that merlin.io.dataset.Dataset.shuffle_by_keys is the root cause of this OOM.

  • It makes ops.Categorify OOM even when following the troubleshooting guide:
    • Setting up the LocalCUDACluster doesn't help.
    • Saving with a consistent row-group size doesn't help either (consistent up to the maximum: a row group can end up smaller because a Dask partition, which is saved as one file, usually has a remainder that doesn't fill a full row group).
  • I've also verified that there is no need to shuffle when doing a GroupBy on the Dask DataFrame, as long as the data was already sorted by session_id in BigQuery (see the sketch after this list):
    • It doesn't matter whether you pass the index and calculate_divisions arguments to dd.read_parquet.
    • However, nvt.ops.GroupBy on the same data doesn't return the expected number of sessions.
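
A sketch of the check described in the last bullet; the path and the aggregated column are placeholders, and the data is assumed to have been exported from BigQuery already sorted by session_id.

import dask.dataframe as dd

# Read the pre-sorted files; calculate_divisions=True is optional here
# (per the comment above, the result was the same with or without it).
ddf = dd.read_parquet(
    "sessions_sorted/",
    calculate_divisions=True,
)

# A plain Dask groupby returns the expected number of sessions without
# calling Dataset.shuffle_by_keys first.
n_sessions = ddf["session_id"].nunique().compute()
per_session = ddf.groupby("session_id").agg({"timestamp": "count"}).compute()
assert len(per_session) == n_sessions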
