
wrong calculations with TPU distribution strategy #67301

Open
mohammad0081 opened this issue May 10, 2024 · 4 comments
Assignees
Labels
comp:tpus tpu, tpuestimator TF 2.16 type:bug Bug

Comments


mohammad0081 commented May 10, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.16.1

Custom code

Yes

OS platform and distribution

Google Colab

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

TPU

Current behavior?

We expected training to be faster on multiple TPU cores than on a single T4 GPU, with the numbers returned by the fit() method staying the same. Instead, we got the faster computation but entirely different numbers: accuracy and loss are significantly worse after 5 epochs. On the GPU we reach 90+% accuracy on the test set after 5 epochs, but with the TPU strategy we get about 24% test accuracy, and it converges in that range.

Standalone code to reproduce the issue

# The whole project is private. We have a dataset of medical images, which we split into two
# directories, then build a Keras data generator to create train_datagen and test_datagen.
# We load a model from keras.applications and add three Dense layers for the classification task,
# then fine-tune the model with the Adam optimizer (lr = 0.0001).
# The creation and compilation of the model happen inside strategy.scope(), where the strategy
# is created exactly as in the tensorflow.org "Use TPUs" and mirrored-strategy docs.
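Since the project itself is private, here is a minimal sketch of the setup described above, following the tensorflow.org "Use TPUs" guide. The base model (MobileNetV2), input shape, layer widths, and class count are assumptions for illustration only; the snippet falls back to the default strategy when no TPU is available so it stays runnable anywhere.

```python
import tensorflow as tf

# Try to attach to a TPU as in the tensorflow.org "Use TPUs" guide;
# fall back to the default (single-device) strategy otherwise.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except Exception:  # no TPU attached to this runtime
    strategy = tf.distribute.get_strategy()

# Model creation and compilation go inside strategy.scope(), as the
# report describes. MobileNetV2 and the layer sizes are hypothetical;
# weights=None avoids downloading pretrained weights in this sketch.
with strategy.scope():
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False,
        weights=None, pooling='avg')
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(2, activation='softmax'),  # 2 classes assumed
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
```

With a directory-based generator, `model.fit(train_datagen, ...)` would then be called as usual; only variable creation needs to happen under the scope.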

Relevant log output

No response

@tilakrayal
Contributor

@mohammad0081,
Could you please share a Colab link or simple standalone code with supporting files to reproduce the issue in our environment? It helps us localize the issue faster. Thank you!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label May 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label May 19, 2024
@mohammad0081
Author

Yes, this issue will occur even with the MNIST dataset using the TPU strategy; accuracy will be random, about 10%.

I will provide the code soon.

@google-ml-butler google-ml-butler bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels May 19, 2024
@mohammad0081
Author

import tensorflow as tf

(x_train, y_train), (_, __) = tf.keras.datasets.mnist.load_data()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
# Note: `strategy` was not defined in the snippet as posted; TPUStrategy
# is presumably what was intended, since `strategy.scope()` is used below.
strategy = tf.distribute.TPUStrategy(resolver)

print("All devices: ", tf.config.list_logical_devices('TPU'))

def create_model():
    import tensorflow.keras.layers as l
    model = tf.keras.models.Sequential()
    model.add(l.Input(shape=(28, 28, 1)))
    model.add(l.Conv2D(64, (3, 3), activation='relu'))
    model.add(l.Conv2D(32, (3, 3), activation='relu'))
    model.add(l.Flatten())
    model.add(l.Dense(512, activation='relu'))
    model.add(l.Dense(64, activation='relu'))
    model.add(l.Dense(10, activation='softmax'))
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

with strategy.scope():
    model = create_model()
    model.fit(x_train, y_train, epochs=20)
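One detail worth checking when comparing single-GPU and multi-replica runs (not confirmed as the cause here, just a general property of tf.distribute): the batch_size passed to fit() is the global batch size, which is split evenly across replicas, so each TPU core sees a smaller per-replica batch than the GPU did. A pure-Python illustration of the split (the 8-replica figure assumes a v3-8 style TPU):

```python
# Under tf.distribute, `batch_size` given to fit() is the GLOBAL batch
# size; each replica processes batch_size / num_replicas examples per step.
def per_replica_batch_size(global_batch_size, num_replicas):
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size must divide evenly across replicas")
    return global_batch_size // num_replicas

# Keras' default batch_size is 32; an 8-core TPU yields 8 replicas.
print(per_replica_batch_size(32, 8))  # -> 4 examples per core per step
```

A per-core batch of 4 with plain SGD is a very different optimization setup than a batch of 32 on one device, which can affect convergence independently of any correctness bug.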
