
[FEATURE] Support repeatability experiments #1939

Open
wangat opened this issue Aug 31, 2023 · 3 comments

Labels
enhancement New feature or request

Comments

@wangat

wangat commented Aug 31, 2023

Thank you for your code and engineering work. I use the code below to fix the seed so that comparison experiments can be fully reproduced. (I have found that a randomly chosen seed, or a seed that is not fully fixed, can shift the evaluation metric by plus or minus 2%, which is unacceptable in comparative experiments.)

I did some testing experiments and obtained the following results:
1. After a shutdown and restart, keeping the same parameters completely reproduces the previous experiment.
2. On the same server with the same type of graphics card, single-card training or multi-card training with the same number of cards gives identical results.
3. With the same seed and hyperparameters but a different graphics card, the final results differ.
4. Different hardware with the same graphics card model also produces different results.
5. Resuming training after an interruption gives different results from a run that trained without interruption (I suspect this is related to the epoch and learning-rate schedule; sorry, I have not finished studying the relevant code).

I have tested multiple model families, including resnet, mobilenet, efficientnet, efficientformer, vit, levit, and xcit. However, I found that the efficientformerv2_s1 model was not completely fixed; something else in the code prevents full reproducibility. Testing on the same graphics card on the same server, I saw a slight difference in results during the first epoch; in addition, across multiple experiments on the same graphics card, a gap appeared in the second epoch. I am running experiments and searching other articles to find the cause of the problem, but I have not determined it yet. Could you please help me find it?

I modified random.py in utils using the following code:

import os
import random

import numpy as np
import torch

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

# Using these two makes training slower
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

@wangat wangat added the enhancement New feature or request label Aug 31, 2023
@rwightman
Collaborator

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

that's redundant, if you are using typical distributed training with one GPU per process, just one call to torch.manual_seed() is needed; it seeds CUDA too if it's available
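For illustration, a minimal sketch of that simplified seeding (the random_seed name and per-rank offset are an assumption about how such a helper might look, not necessarily the exact utils code):

import random

import numpy as np
import torch


def random_seed(seed=42, rank=0):
    # A single torch.manual_seed() call also seeds CUDA when a GPU is present;
    # offsetting by rank keeps per-process random streams independent.
    torch.manual_seed(seed + rank)
    np.random.seed(seed + rank)
    random.seed(seed + rank)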

efficientformer is not going to be deterministic, it's been explored before, there is at least one op in that net which appears to be non-deterministic, probably the interpolation #1770

@rwightman
Collaborator

See also, #853

I don't believe full determinism is possible in all cases for all models, because some ops just don't have deterministic support (at least the last time it was looked at). We could still add an optional flag to set deterministic= and disable benchmark, but it's certainly not worth having as a default and likely won't work in all cases.
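As a rough sketch of what such an opt-in flag could look like (the --deterministic argument name and wiring are hypothetical, not an existing option):

import argparse

import torch

parser = argparse.ArgumentParser()
# Hypothetical opt-in flag; off by default because deterministic kernels
# slow training and some ops have no deterministic implementation at all.
parser.add_argument('--deterministic', action='store_true',
                    help='request deterministic cuDNN kernels and disable benchmark mode')
args = parser.parse_args()

if args.deterministic:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False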

@wangat
Author

wangat commented Sep 1, 2023

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

that's redundant, if you are using typical distributed training with one GPU per process, just one call to torch.manual_seed() is needed; it seeds CUDA too if it's available

efficientformer is not going to be deterministic, it's been explored before, there is at least one op in that net which appears to be non-deterministic, probably the interpolation #1770

Thank you for your reply and the relevant information. I had already fixed the resize in my previous training. Because I need to deploy the model on an NVIDIA Jetson NX, I found that the interpolation results differ between PIL and OpenCV, sometimes inflating the evaluation error by 2-6x. (I also found the model's robustness is poor: for crops produced by the target-detection model and resized to a fixed size, a slight change in a crop of the same object can even flip the predicted category with high confidence.) So I replaced the resize step with a fixed OpenCV interpolation method, which improves the problem above, as sketched below.
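A small sketch of the mismatch I mean (the crop contents and sizes are placeholders); PIL and OpenCV bilinear resizes generally do not produce identical pixels, so the same implementation must be used in training and deployment:

import cv2
import numpy as np
from PIL import Image

# Placeholder input: a random array standing in for a detection crop.
crop = (np.random.rand(180, 240, 3) * 255).astype(np.uint8)
size = (224, 224)  # (width, height)

# Bilinear resize via OpenCV and via PIL.
resized_cv = cv2.resize(crop, size, interpolation=cv2.INTER_LINEAR)
resized_pil = np.asarray(Image.fromarray(crop).resize(size, resample=Image.BILINEAR))

# The two implementations weight/round pixels differently, so the outputs
# typically differ; sticking to one end-to-end avoids the train/deploy gap.
print(np.abs(resized_cv.astype(np.int16) - resized_pil.astype(np.int16)).max())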

On this basis, I tested about 10 model families, and only efficientformer showed this behavior, so it is probably the efficientformer code that introduces the extra non-determinism. I tried to locate the problem with torch.use_deterministic_algorithms(True) and set CUBLAS_WORKSPACE_CONFIG=:4096:8 (also tried :16:8). But running the code raises: RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option...
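For reference, a minimal sketch of the warn_only route mentioned in the error (assuming CUBLAS_WORKSPACE_CONFIG is set before any CUDA work starts):

import os
import torch

# Must be set before the first cuBLAS call for deterministic matmuls.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

# With warn_only=True, ops without a deterministic implementation
# (e.g. scatter_add_cuda_kernel) emit a warning instead of raising,
# which helps pinpoint the offending op while training continues.
torch.use_deterministic_algorithms(True, warn_only=True)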

I have searched for related material and found a similar problem in pyg-team/pytorch_geometric#3175, but it is not yet clear to me how to locate and work around the op. I plan to look up more information and keep trying to deal with this problem. Thank you again.
