
wrong calculations with TPU distribution strategy #67301

Open
mohammad0081 opened this issue May 10, 2024 · 4 comments
Assignees
Labels
comp:tpus tpu, tpuestimator TF 2.16 type:bug Bug

Comments


mohammad0081 commented May 10, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.16.1

Custom code

Yes

OS platform and distribution

Google Colab

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

TPU

Current behavior?

We expected training to be faster on multiple TPU cores than on a single T4 GPU, with the numbers returned by the fit() method staying the same. Instead, we got the faster computation but entirely different numbers: accuracy and loss are significantly worse after 5 epochs. On the GPU we reach 90+% accuracy on the test set after 5 epochs, but with the TPU strategy we get about 24% test accuracy, and it converges in that range.

Standalone code to reproduce the issue

# The whole project is private. We have a dataset of medical images, which we split into two
# directories, then build a Keras data generator to create train_datagen and test_datagen.
# We load a model from keras.applications and add three Dense layers for the classification task,
# then fine-tune the model with the Adam optimizer (lr = 0.0001).
# The creation and compilation of the model happen inside strategy.scope(), where the strategy
# is created exactly as in the tensorflow.org "Use TPUs" and mirrored-strategy docs.
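Since the project itself is private, here is a minimal sketch of the setup described above, following the tensorflow.org "Use TPUs" guide. The base model (MobileNetV2), input shape, layer widths, and class count are assumptions for illustration only; the snippet falls back to the default strategy when no TPU is available so it stays runnable anywhere.

```python
import tensorflow as tf

# Try to attach to a TPU as in the tensorflow.org "Use TPUs" guide;
# fall back to the default (single-device) strategy otherwise.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except Exception:  # no TPU attached to this runtime
    strategy = tf.distribute.get_strategy()

# Model creation and compilation go inside strategy.scope(), as the
# report describes. MobileNetV2 and the layer sizes are hypothetical;
# weights=None avoids downloading pretrained weights in this sketch.
with strategy.scope():
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False,
        weights=None, pooling='avg')
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(2, activation='softmax'),  # 2 classes assumed
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
```

With a directory-based generator, `model.fit(train_datagen, ...)` would then be called as usual; only variable creation needs to happen under the scope.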

Relevant log output

No response

@tilakrayal
Contributor

@mohammad0081,
Could you please share a Colab link or simple standalone code with supporting files to reproduce the issue in our environment? It helps us localize the issue faster. Thank you!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label May 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label May 19, 2024
@mohammad0081
Author

Yes, this issue will occur even with the MNIST dataset using the TPU strategy; accuracy will be random, about 10%.

I will provide the code soon.

@google-ml-butler google-ml-butler bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels May 19, 2024
@mohammad0081
Author

import tensorflow as tf

(x_train, y_train), (_, __) = tf.keras.datasets.mnist.load_data()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
# Note: `strategy` was not defined in the snippet as posted; TPUStrategy
# is presumably what was intended, since `strategy.scope()` is used below.
strategy = tf.distribute.TPUStrategy(resolver)

print("All devices: ", tf.config.list_logical_devices('TPU'))

def create_model():
    import tensorflow.keras.layers as l
    model = tf.keras.models.Sequential()
    model.add(l.Input(shape=(28, 28, 1)))
    model.add(l.Conv2D(64, (3, 3), activation='relu'))
    model.add(l.Conv2D(32, (3, 3), activation='relu'))
    model.add(l.Flatten())
    model.add(l.Dense(512, activation='relu'))
    model.add(l.Dense(64, activation='relu'))
    model.add(l.Dense(10, activation='softmax'))
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

with strategy.scope():
    model = create_model()
    model.fit(x_train, y_train, epochs=20)
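One detail worth checking when comparing single-GPU and multi-replica runs (not confirmed as the cause here, just a general property of tf.distribute): the batch_size passed to fit() is the global batch size, which is split evenly across replicas, so each TPU core sees a smaller per-replica batch than the GPU did. A pure-Python illustration of the split (the 8-replica figure assumes a v3-8 style TPU):

```python
# Under tf.distribute, `batch_size` given to fit() is the GLOBAL batch
# size; each replica processes batch_size / num_replicas examples per step.
def per_replica_batch_size(global_batch_size, num_replicas):
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size must divide evenly across replicas")
    return global_batch_size // num_replicas

# Keras' default batch_size is 32; an 8-core TPU yields 8 replicas.
print(per_replica_batch_size(32, 8))  # -> 4 examples per core per step
```

A per-core batch of 4 with plain SGD is a very different optimization setup than a batch of 32 on one device, which can affect convergence independently of any correctness bug.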
