option to make training more deterministic #143

Open
wants to merge 1 commit into main

Conversation

@elliottzheng (Contributor) commented Feb 14, 2023

Related: #114 #140

I have been trying to make training more deterministic; here I share some of my experiences.

  1. Replace these lines with:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # this should be False instead of True (current code)
torch.use_deterministic_algorithms(True)  # will raise an error when nondeterministic functions are used

Check https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking for more details.

  2. You may need to run the program with the flag CUBLAS_WORKSPACE_CONFIG=:4096:8; if an error is raised after doing step 1, check here for details. A combined sketch of steps 1 and 2 appears right after the Interpolate class below.

  3. Replace the bilinear F.interpolate here with the implementation below, as it is nondeterministic; check here for details.

import functools
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class Interpolate(nn.Module):
    """Deterministic replacement for bilinear F.interpolate: nearest-neighbour
    upsampling followed by a fixed box-filter (depthwise) convolution."""

    def __init__(self, channel: int, scale_factor: int):
        super().__init__()
        assert isinstance(scale_factor, int) and scale_factor > 1 and scale_factor % 2 == 0
        self.scale_factor = scale_factor
        kernel_size = scale_factor + 1  # keep the kernel size odd
        # One fixed averaging kernel shared across all channels; fill it before
        # expanding, since in-place fills on an expanded view are not allowed.
        weight = torch.full((1, 1, kernel_size, kernel_size), 1.0 / (kernel_size * kernel_size), dtype=torch.float32)
        self.weight = nn.Parameter(weight.expand(channel, -1, -1, -1), requires_grad=False)
        # Depthwise convolution (groups=channel) with "same" padding.
        self.conv = functools.partial(
            F.conv2d, weight=self.weight, bias=None, padding=scale_factor // 2, groups=channel
        )

    def forward(self, t: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
        if t is None:
            return t
        # Nearest-neighbour upsampling (deterministic) + box-filter smoothing.
        return self.conv(F.interpolate(t, scale_factor=self.scale_factor, mode='nearest'))

    @staticmethod
    def naive(t: torch.Tensor, size: Tuple[int, int], **kwargs):
        if t is None or t.shape[2:] == size:
            return t
        else:
            assert 'mode' not in kwargs and 'align_corners' not in kwargs
            return F.interpolate(t, size, mode='nearest', **kwargs)
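
Putting steps 1 and 2 above together, here is a minimal sketch of one place to set the seed, the cuDNN flags, and the cuBLAS workspace variable. The function name seed_everything_deterministic is illustrative, not code from this repository; in practice the environment variable is safest to set in the shell before launching Python.

# Minimal sketch combining steps 1 and 2 (illustrative, not from the repo).
import os
import random

import numpy as np
import torch


def seed_everything_deterministic(seed: int = 0) -> None:
    # Step 2: cuBLAS needs this workspace config for deterministic behaviour.
    # It must be set before the first CUDA/cuBLAS call, so exporting it in the
    # shell (CUBLAS_WORKSPACE_CONFIG=:4096:8 python main.py ...) is safer.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Step 1: disable cuDNN autotuning and force deterministic kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Raise an error whenever a nondeterministic op is hit.
    torch.use_deterministic_algorithms(True)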

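And a quick sanity check for the Interpolate module itself (shapes are illustrative; it approximates rather than exactly reproduces bilinear interpolation, since it is nearest-neighbour upsampling followed by a box filter):

# Sanity check, assuming the Interpolate class above is in scope.
import torch
import torch.nn.functional as F

up = Interpolate(channel=32, scale_factor=2)
x = torch.randn(4, 32, 16, 16)

y = up(x)
print(y.shape)  # torch.Size([4, 32, 32, 32])

# Rough comparison against the bilinear path it replaces; expect a small but
# nonzero difference, since this is nearest + box filtering, not true bilinear.
y_bilinear = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
print((y - y_bilinear).abs().mean())
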
However, the training is still non-deterministic.

The raymarching_train function here is also non-deterministic. I am not familiar with CUDA extensions, so I don't know how to solve it; you might want to look at it. The rays output by the function are non-deterministic.
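
To narrow down where two runs diverge, one option is to dump intermediate tensors from both runs and compare them offline. The helper below is only an illustrative sketch; the tag/step naming is not from the repo.

# Illustrative debugging helper: save a tensor at a given training step so two
# runs can be compared offline.
import torch


def dump_tensor(tag: str, step: int, t: torch.Tensor, run_id: int) -> None:
    torch.save(t.detach().cpu(), f"debug_run{run_id}_{tag}_step{step}.pt")


# After both runs finish:
# a = torch.load("debug_run0_xyzs_step0.pt")
# b = torch.load("debug_run1_xyzs_step0.pt")
# print(a.shape == b.shape and torch.equal(a, b))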

Here I provide a bash script, test.sh, to run deterministic experiments for debugging:

#!/bin/bash
gpu_id=$1
echo "Running on GPU $gpu_id"
# Clean up artifacts from any previous run.
rm -rf "results/squirrel_seed0_size64_deterministic_run$gpu_id"
rm -f "deterministic_run_$gpu_id.txt"

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=$gpu_id python main.py \
--text "A DSLR photo of a squirrel" \
--cuda_ray \
--fp16 \
--dir_text \
--sd_version "2.0" \
--eval_interval 1 \
--seed 0 \
--deterministic \
--iters 20 \
--workspace "results/squirrel_seed0_size64_deterministic_run$gpu_id" > deterministic_run_$gpu_id.txt

Run bash test.sh 0 and bash test.sh 1 to run on GPU 0 and GPU 1, then compare the outputs in deterministic_run_0.txt and deterministic_run_1.txt.
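
For a quick check, a small script like the following (file names are the ones written by test.sh above) reports the first line where the two logs diverge:

# Compare the two logs line by line and report the first divergence.
with open("deterministic_run_0.txt") as f0, open("deterministic_run_1.txt") as f1:
    for i, (a, b) in enumerate(zip(f0, f1), start=1):
        if a != b:
            print(f"first divergence at line {i}:")
            print(f"  run 0: {a.rstrip()}")
            print(f"  run 1: {b.rstrip()}")
            break
    else:
        print("logs are identical (up to the length of the shorter file)")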

@ashawkey (Owner) commented:

Thanks for your efforts! Making it deterministic must be quite hard...
For the ray marching extension, there are race conditions in how each ray is written to the output, but the overall outcome should be the same, i.e., the point-wise results differ, but the ray-wise results are the same:

# run 1
xyzs: [ray1's points] [ray2's points] [ray3's points] (point-wise, order of rays may vary)
colors: [ray1's color] [ray2's color] [ray3's color] (ray-wise, always ordered)
# run 2
xyzs: [ray3's points] [ray1's points] [ray2's points]
colors: [ray1's color] [ray2's color] [ray3's color]

So I guess there may be some other reasons. Also, you could first check whether the non-cuda-ray mode can be made deterministic.
Currently I don't have enough resources to test, but I may help check it later.
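
One way to test this hypothesis would be a permutation-invariant comparison between two runs. The sketch below only follows the illustrative xyzs/colors layout from the comment above; it is not based on the actual outputs of the raymarching extension.

# Sketch only: variable names follow the illustration above, not the real API.
import torch


def raywise_equal(colors_a: torch.Tensor, colors_b: torch.Tensor) -> bool:
    # Per-ray results should match exactly if only the point order varies.
    return colors_a.shape == colors_b.shape and torch.equal(colors_a, colors_b)


def pointwise_order_invariant_close(xyzs_a: torch.Tensor, xyzs_b: torch.Tensor, atol: float = 1e-5) -> bool:
    # Point order may differ between runs, so compare an order-independent
    # summary; the tolerance absorbs float non-associativity in the sum.
    return xyzs_a.shape == xyzs_b.shape and torch.allclose(
        xyzs_a.sum(dim=0), xyzs_b.sum(dim=0), atol=atol
    )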
