
Super slow iteration with trivial custom transform #6833

Open
xslittlegrass opened this issue Apr 23, 2024 · 2 comments

@xslittlegrass

Describe the bug

Dataset iteration is roughly 10× slower when a trivial custom transform is applied:

import numpy as np
from datasets import Dataset, Features, Array2D

# 1000 frames of 800x800 zeros; dtype matches the declared uint8 feature
a = np.zeros((800, 800), dtype=np.uint8)
a = np.stack([a] * 1000)
features = Features({"a": Array2D(shape=(800, 800), dtype="uint8")})

ds1 = Dataset.from_dict({"a": a}, features=features).with_format('numpy')

# identity transform: returns the batch unchanged
def transform(batch):
    return batch

ds2 = ds1.with_transform(transform)

%time sum(1 for _ in ds1)
%time sum(1 for _ in ds2)

Output:

CPU times: user 472 ms, sys: 319 ms, total: 791 ms
Wall time: 794 ms
CPU times: user 9.32 s, sys: 443 ms, total: 9.76 s
Wall time: 9.78 s

In my real code I'm using set_transform to apply some on-the-fly post-processing to the 2D arrays, but it significantly slows down iteration even when the transform itself is trivial, as in the sketch below.
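
For reference, a minimal sketch of that pattern, reusing ds1 from the snippet above; postprocess is a hypothetical stand-in for the real post-processing:

import numpy as np

def postprocess(batch):
    # hypothetical on-the-fly post-processing; even a near-trivial
    # version like this triggers the slowdown shown above
    batch["a"] = [np.asarray(x) for x in batch["a"]]
    return batch

ds3 = ds1.with_transform(postprocess)  # ds1.set_transform(postprocess) applies it in place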

Related issue: #5841

Steps to reproduce the bug

Use code in the description to reproduce.

Expected behavior

The trivial custom transform in the example should not slow down dataset iteration.

Environment info

  • datasets version: 2.18.0
  • Platform: Linux-5.15.0-79-generic-x86_64-with-glibc2.35
  • Python version: 3.11.4
  • huggingface_hub version: 0.20.2
  • PyArrow version: 15.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.12.2
@rangehow

rangehow commented Apr 27, 2024

Similar issue with text processing:

from functools import partial
import datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir[args.model])
train_dataset = datasets.load_from_disk(dataset_dir[args.dataset], keep_in_memory=True)['train']
train_dataset = train_dataset.map(
    partial(dname2func[args.dataset], tokenizer=tokenizer),
    batched=True, num_proc=50, desc='tokenize',
    remove_columns=train_dataset.features.keys(), keep_in_memory=True,
)

After this, train_dataset looks like:

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 51760
})

Here input_ids and labels are both List[int]. However, each iteration over the dataset takes 7.41 s:

from tqdm import tqdm

for j in tqdm(range(len(train_dataset)), desc='first stage'):
    input_id, label = train_dataset['input_ids'][j], train_dataset['labels'][j]
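
Note that part of this cost is independent of any transform: train_dataset['input_ids'] extracts the entire column on every loop iteration. A sketch of row-wise access that avoids the repeated column extraction, assuming the same train_dataset as above:

from tqdm import tqdm

for j in tqdm(range(len(train_dataset)), desc='first stage'):
    row = train_dataset[j]  # fetch one row instead of two full columns
    input_id, label = row['input_ids'], row['labels']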

@lhoestq
Member

lhoestq commented May 4, 2024

The transform currently replaces the numpy formatting.

So you're back to copying data into long Python lists, which is super slow.

It would be cool for the transform to not remove the formatting in this case, but this requires a few changes in the lib.
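
In the meantime, one possible workaround, sketched from the explanation above rather than taken from the library docs: keep the numpy formatting on the dataset and apply the post-processing in the consumer loop, so the fast Arrow-to-numpy path is preserved. postprocess is again a hypothetical placeholder:

# reuse ds1 (numpy-formatted) from the original snippet
def postprocess(arr):
    return arr  # hypothetical stand-in for the real post-processing

for row in ds1:
    a = postprocess(row["a"])  # row["a"] arrives as a numpy array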
