
[FEATURE] Support repeatability experiments #1939

Open
wangat opened this issue Aug 31, 2023 · 3 comments

Labels
enhancement New feature or request

Comments

@wangat

wangat commented Aug 31, 2023

Thank you for your code and engineering work. I use the code below to fix the seed so that comparison experiments can be fully reproduced. (I have found that a randomly chosen seed, or a seed that is not fully fixed, can shift the evaluation metric by plus or minus 2%, which is unacceptable in comparative experiments.)

I did some testing experiments and obtained the following results:
1. After a shutdown and restart, keeping the same parameters completely reproduces the previous experiment.
2. On the same server with the same type of graphics card, single-card training or multi-card training with the same number of cards gives identical results.
3. With the same seed and hyperparameters but a different graphics card, the final results differ.
4. Different hardware with the same graphics card model also produces different results.
5. Resuming training after an interruption gives different results from a run that trained without interruption (I suspect this is related to the epoch and learning-rate schedule; sorry, I have not finished studying the relevant code).

I have tested multiple model families, including resnet, mobilenet, efficientnet, efficientformer, vit, levit, and xcit. However, I found that the efficientformerv2_s1 model was not completely fixed; something else in the code prevents full reproducibility. Testing on the same graphics card on the same server, I saw a slight difference in results during the first epoch; in addition, across multiple experiments on the same graphics card, a gap appeared in the second epoch. I am running experiments and searching other articles to find the cause of the problem, but I have not determined it yet. Could you please help me find it?

I modified random.py in utils using the following code:

import os
import random

import numpy as np
import torch

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

# Using these two makes training slower
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

@wangat wangat added the enhancement New feature or request label Aug 31, 2023
@rwightman
Collaborator

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

that's redundant, if you are using typical distributed training with one GPU per process, just one call to torch.manual_seed() is needed; it seeds CUDA too if it's available
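For illustration, a minimal sketch of that simplified seeding (the random_seed name and per-rank offset are an assumption about how such a helper might look, not necessarily the exact utils code):

import random

import numpy as np
import torch


def random_seed(seed=42, rank=0):
    # A single torch.manual_seed() call also seeds CUDA when a GPU is present;
    # offsetting by rank keeps per-process random streams independent.
    torch.manual_seed(seed + rank)
    np.random.seed(seed + rank)
    random.seed(seed + rank)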

efficientformer is not going to be deterministic, it's been explored before, there is at least one op in that net which appears to be non-deterministic, probably the interpolation #1770

@rwightman
Collaborator

See also, #853

I don't believe full determinism is possible in all cases for all models, because some ops just don't have deterministic support (at least the last time it was looked at). We could still add an optional flag to set deterministic= and disable benchmark, but it's certainly not worth having as a default and likely won't work in all cases.
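As a rough sketch of what such an opt-in flag could look like (the --deterministic argument name and wiring are hypothetical, not an existing option):

import argparse

import torch

parser = argparse.ArgumentParser()
# Hypothetical opt-in flag; off by default because deterministic kernels
# slow training and some ops have no deterministic implementation at all.
parser.add_argument('--deterministic', action='store_true',
                    help='request deterministic cuDNN kernels and disable benchmark mode')
args = parser.parse_args()

if args.deterministic:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False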

@wangat
Author

wangat commented Sep 1, 2023

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

that's redundant, if you are using typical distributed training with one GPU per process, just one call to torch.manual_seed() is needed; it seeds CUDA too if it's available

efficientformer is not going to be deterministic, it's been explored before, there is at least one op in that net which appears to be non-deterministic, probably the interpolation #1770

Thank you for your reply and the relevant information. I had already fixed the resize in my previous training. Because I need to deploy the model on an NVIDIA Jetson NX, I found that the interpolation results differ between PIL and OpenCV, sometimes inflating the evaluation error by 2-6x. (I also found the model's robustness is poor: for crops produced by the target-detection model and resized to a fixed size, a slight change in a crop of the same object can even flip the predicted category with high confidence.) So I replaced the resize step with a fixed OpenCV interpolation method, which improves the problem above, as sketched below.
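A small sketch of the mismatch I mean (the crop contents and sizes are placeholders); PIL and OpenCV bilinear resizes generally do not produce identical pixels, so the same implementation must be used in training and deployment:

import cv2
import numpy as np
from PIL import Image

# Placeholder input: a random array standing in for a detection crop.
crop = (np.random.rand(180, 240, 3) * 255).astype(np.uint8)
size = (224, 224)  # (width, height)

# Bilinear resize via OpenCV and via PIL.
resized_cv = cv2.resize(crop, size, interpolation=cv2.INTER_LINEAR)
resized_pil = np.asarray(Image.fromarray(crop).resize(size, resample=Image.BILINEAR))

# The two implementations weight/round pixels differently, so the outputs
# typically differ; sticking to one end-to-end avoids the train/deploy gap.
print(np.abs(resized_cv.astype(np.int16) - resized_pil.astype(np.int16)).max())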

On this basis, I tested about 10 model families, and only efficientformer showed this behavior, so it is probably the efficientformer code that introduces the extra non-determinism. I tried to locate the problem with torch.use_deterministic_algorithms(True) and set CUBLAS_WORKSPACE_CONFIG=:4096:8 (also tried :16:8). But running the code raises: RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option...
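For reference, a minimal sketch of the warn_only route mentioned in the error (assuming CUBLAS_WORKSPACE_CONFIG is set before any CUDA work starts):

import os
import torch

# Must be set before the first cuBLAS call for deterministic matmuls.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

# With warn_only=True, ops without a deterministic implementation
# (e.g. scatter_add_cuda_kernel) emit a warning instead of raising,
# which helps pinpoint the offending op while training continues.
torch.use_deterministic_algorithms(True, warn_only=True)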

I have searched for related material and found a similar problem in pyg-team/pytorch_geometric#3175, but it is not yet clear to me how to locate and work around the op. I plan to look up more information and keep trying to deal with this problem. Thank you again.
