Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
2.4
Custom code
Yes
OS platform and distribution
No response
Mobile device
No response
Python version
No response
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
Similar to issue #53169, I have observed that the "filter before batch" approach is significantly slower. Filtering the dataset alone takes 430 ms, whereas the "batch+map" method requires only 20 ms.
In theory, the computation performed by filter and map should be similar, yet "filter before batch" consumes far more time, presumably because the predicate is invoked once per element rather than once per batch.
I attempted to filter after batching instead, but the filter predicate must return a scalar boolean, so it cannot be applied to batched elements directly.
My question is:
Is there a potential optimization for this performance issue? I aim to develop a custom operation that can filter batched elements (accepting [M,]-shaped tensors as input and producing [N,]-shaped tensors as output). Is there a more efficient approach available?
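One common workaround (a sketch, not an official filter-on-batches API) is to batch first and then apply a vectorized mask inside `map` using `tf.boolean_mask`, which takes an [M,]-shaped batch in and produces an [N,]-shaped tensor out:

```python
import tensorflow as tf

# Toy dataset: integers 0..9999.
ds = tf.data.Dataset.range(10_000)

# Instead of ds.filter(pred).batch(100) -- which calls the predicate
# once per element -- batch first and filter whole batches at once.
ds = ds.batch(100)

def filter_batch(batch):
    # Vectorized predicate over the [M,]-shaped batch; boolean_mask
    # keeps only the matching elements, yielding an [N,]-shaped tensor.
    mask = tf.equal(batch % 2, 0)
    return tf.boolean_mask(batch, mask)

ds = ds.map(filter_batch, num_parallel_calls=tf.data.AUTOTUNE)

for b in ds.take(1):
    print(b.shape)  # (50,) -- half of the first 100-element batch survives
```

Note that the resulting batches have variable size N, since each batch keeps a different number of elements.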
Test map+batch Execution time(ms): 585.0009880959988
Test batch+map Execution time(ms): 23.16068299114704
Test map+batch+prefetch Execution time(ms): 503.9997957646847
Test batch+map+prefetch Execution time(ms): 19.63987946510315
Test prefetch+batch+map Execution time(ms): 54.23441715538502
Test batch+prefetch+map Execution time(ms): 16.469698399305344
Test filter+batch Execution time(ms): 282.77427703142166
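Since the built-in predicate must return a scalar boolean per element, filtering after `batch()` via a mask produces ragged batch sizes; if downstream code needs uniform batches, they can be restored with `unbatch().batch()`. A minimal sketch:

```python
import tensorflow as tf

# Batch, filter each batch with a vectorized mask ([M,] -> [N,]),
# then flatten and re-batch to restore a uniform batch size.
ds = tf.data.Dataset.range(10_000).batch(1_000)
ds = ds.map(lambda b: tf.boolean_mask(b, tf.equal(b % 2, 0)))
ds = ds.unbatch().batch(100)  # uniform batches of 100 again

for b in ds.take(1):
    print(b.shape)  # (100,)
```

The extra `unbatch()`/`batch()` pass adds some overhead, but it typically remains far cheaper than the per-element `filter` path measured above.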
Issue type
Performance