Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove tmp image if the download operation fails #1140

Closed
wants to merge 5 commits into from
Closed

Conversation

enescakir
Copy link
Member

@enescakir enescakir commented Jan 24, 2024

Exit early if the image already exists

It's just a structural refactor. No need extra indentation.

Finalize the image download with File.rename

At present, we download the image to a temporary path and then move it to the final location using the qemu-img convert command. However, this process is not atomic and can be time-consuming. It would be more efficient to use the File.rename command to move the image directly to the final location, as this is an atomic operation. This method also allows other virtual machines to use the old image right up until the last moment.

Add force flag to download image even if exists

If the boot image already exists, the virtual machine uses it. However, in certain situations like updating GitHub runner images, we need to download the latest version. If the force flag is set to true, the VM host will download the image regardless of its existence.

Allow only one concurrent image download with flock

Right now, we use the "File::CREAT | File::EXCL" flags to open a temporary file for image downloads, which should stop multiple downloads happening at once. These flags only let one process create the file, and if the file is already there, it fails. However, if a download gets interrupted, the file sticks around and the next download attempt fails. It's tricky to figure out if the file is incomplete, so we have to delete it manually. Using flock to lock a lock file is a more effective way to stop multiple downloads idempotently at the same time.

Use force flag in the boot image download prog

The previous version of the download_boot_image method fails when the file already exists. I've incorporated a force flag into the method to overwrite any pre-existing file. This way, the program doesn't need to delete the existing file before downloading the boot image.

Fixes #911

@fdr
Copy link
Collaborator

fdr commented Jan 25, 2024

What do you think is the safety profile of always deleting the file before starting? Is there a chance of premature promotion if there's bad concurrency? Do we stop that in the control plane? Do we stop it via how systemd units are named for background jobs? can we stop it with flock? Can we stop it one or more of those approaches for bug detection and defense in depth?

The reason to not use a rescue like this is it relies on the program to run to get the result you want.

@enescakir enescakir marked this pull request as draft January 30, 2024 14:25
@enescakir
Copy link
Member Author

What do you think is the safety profile of always deleting the file before starting? Is there a chance of premature promotion if there's bad concurrency? Do we stop that in the control plane? Do we stop it via how systemd units are named for background jobs? can we stop it with flock? Can we stop it one or more of those approaches for bug detection and defense in depth?

The reason to not use a rescue like this is it relies on the program to run to get the result you want.

I discussed this topic with Burak and will continue to work on it. Meanwhile, my other image download PRs are not specifically dependent on this one.

@enescakir enescakir changed the base branch from main to minio-download January 31, 2024 11:54
Base automatically changed from minio-download to main February 2, 2024 10:14
@enescakir enescakir force-pushed the tmp-cleanup branch 2 times, most recently from 91d1e80 to 826a82a Compare February 21, 2024 15:07
It's just a structural refactor. No need extra indentation.
At present, we download the image to a temporary path and then move it
to the final location using the `qemu-img convert` command. However,
this process is not atomic and can be time-consuming.  It would be more
efficient to use the `File.rename` command to move the image directly to
the final location, as this is an atomic operation.  This method also
allows other virtual machines to use the old image right up until the
last moment.
If the boot image already exists, the virtual machine uses it. However,
in certain situations like updating GitHub runner images, we need to
download the latest version. If the force flag is set to true, the VM
host will download the image regardless of its existence.
Right now, we use the "File::CREAT | File::EXCL" flags to open a
temporary file for image downloads, which should stop multiple downloads
happening at once. These flags only let one process create the file, and
if the file is already there, it fails. However, if a download gets
interrupted, the file sticks around and the next download attempt fails.
It's tricky to figure out if the file is incomplete, so we have to
delete it manually. Using flock to lock a lock file is a more effective
way to stop multiple downloads idempotently at the same time.
The previous version of the `download_boot_image` method fails when the
file already exists. I've incorporated a `force` flag into the method to
overwrite any pre-existing file.  This way, the program doesn't need to
delete the existing file before downloading the boot image.
@enescakir enescakir marked this pull request as ready for review February 21, 2024 18:33
@enescakir enescakir requested review from fdr and pykello and removed request for fdr, pykello, byucesoy and furkansahin February 21, 2024 18:34
@enescakir
Copy link
Member Author

@fdr I refactored this PR a lot. It uses flock.

@fdr
Copy link
Collaborator

fdr commented Feb 23, 2024

@fdr I refactored this PR a lot. It uses flock.

ack, digging in

@@ -470,43 +470,43 @@ def download_boot_image(boot_image, custom_url: nil, ca_path: nil)

download = urls.fetch(boot_image) || custom_url
image_path = vp.image_path(boot_image)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for the commit message and breaking this out; reviewing it with ?w=1 in github was great.

@enescakir enescakir closed this Apr 30, 2024
@enescakir enescakir deleted the tmp-cleanup branch April 30, 2024 06:42
@github-actions github-actions bot locked and limited conversation to collaborators Apr 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Image download is not idempotent
2 participants