Protect against malicious nested archives (aka zip bombs) #210

Open
qkaiser opened this issue Jan 27, 2022 · 7 comments
Labels: enhancement, format:archive

Comments

@qkaiser
Contributor

qkaiser commented Jan 27, 2022

We should check the extracted file size during chunk parsing to avoid filling up the disk when extracting malicious nested archives.

Samples can be found here: https://www.bamsoftware.com/hacks/zipbomb/

For zip bombs, a check can be implemented in is_valid:

def is_valid(self, file: io.BufferedIOBase) -> bool:
    has_encrypted_files = False
    try:
        with zipfile.ZipFile(file) as zip:  # type: ignore
            for zipinfo in zip.infolist():
                if zipinfo.flag_bits & ENCRYPTED_FLAG:
                    has_encrypted_files = True
        if has_encrypted_files:
            logger.warning("There are encrypted files in the ZIP")
        return True
    except (zipfile.BadZipFile, UnicodeDecodeError, ValueError):
        return False

Something similar to this could work:

with zipfile.ZipFile(file) as zip:  # type: ignore
    extracted_size = sum(e.file_size for e in zip.infolist())
    if extracted_size > SOME_CONSTANT:
        return False  # bail out, likely a zip bomb

I'll check if similar behavior (i.e. "let's fill the whole disk") can be triggered with other formats.

@qkaiser qkaiser added enhancement New feature or request format:archive labels Jan 27, 2022
@qkaiser qkaiser self-assigned this Jan 27, 2022
@qkaiser
Contributor Author

qkaiser commented Jan 27, 2022

Please note that compression bombs only work with implementations that recursively extract the archive content, which technically we don't do. I still think this is a good enhancement / hardening of unblob to protect against these files. I'll do some tests locally and get back with some notes :)

@flukavsky
Contributor

Instead of checking the extracted file size for the content of a single compressed archive, we should keep an eye on the overall extracted size.

With the proposed approach, we could run into issues if a compressed archive has nested archives that are each below the threshold - e.g., 42.zip only extracts 530 KB in the first pass, but if we continue recursively, we will eventually extract a ton of 4.3 GB files, which by themselves may fly below the radar but still fill up the disk.

I'd suggest adding an option (e.g., --max-extraction-size) that would override a reasonable default value (e.g., 100 GB). Alternatively, we could also monitor available disk space and abort once we have used, e.g., 90% of the available free disk space.
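
A minimal sketch of what such a global budget could look like. The ExtractionBudget class, the default limit, and the free-disk-space check below are illustrative assumptions for this discussion, not existing unblob code:

# Hypothetical sketch: a process-wide extraction budget combining a
# --max-extraction-size limit with a free-disk-space check.
import shutil
from pathlib import Path

DEFAULT_MAX_EXTRACTION_SIZE = 100 * 1024**3  # e.g. 100 GB default
MIN_FREE_DISK_RATIO = 0.1  # abort once ~90% of the initial free space is used


class ExtractionBudget:
    def __init__(self, extract_root: Path, max_size: int = DEFAULT_MAX_EXTRACTION_SIZE):
        self.extract_root = extract_root
        self.max_size = max_size
        self.extracted_bytes = 0
        self.initial_free = shutil.disk_usage(extract_root).free

    def charge(self, new_bytes: int) -> None:
        """Account for newly extracted bytes; raise once any limit is exceeded."""
        self.extracted_bytes += new_bytes
        if self.extracted_bytes > self.max_size:
            raise RuntimeError("extraction size limit exceeded, possible decompression bomb")
        if shutil.disk_usage(self.extract_root).free < self.initial_free * MIN_FREE_DISK_RATIO:
            raise RuntimeError("running out of disk space, aborting extraction")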

@qkaiser qkaiser added this to the v3.0 milestone Feb 8, 2022
@vlaci
Contributor

vlaci commented Feb 22, 2022

It is very hard to come up with a general solution for this. I'd suggest extracting to a separate volume or using quotas to mitigate this issue.

@qkaiser qkaiser removed their assignment May 6, 2022
@qkaiser
Contributor Author

qkaiser commented Feb 10, 2023

I agree with the recommendation to use a separate volume and quotas. I'm sure we can provide sound recommendations on that subject in our documentation.

however

I was toying with the idea of implementing a "best effort" protection against zip bombs by collecting size information from StatReport.

A basic implementation is available on this branch: main...zip-bomb

I'm saying "best effort" because the smallest unit we can work on is a Task, so the increase in size is computed only after a file has been processed and decompressed/extracted.

I got good results against zip bombs with this; the only thing missing is proper cancellation/cleanup of running processes (marked as TODO in the code).

There's no urgency to it, but I would be glad if anyone from the team could have a look and share their thoughts.
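
For reference, the idea boils down to something like the sketch below: after each task completes, sum the sizes of the StatReports it produced and stop scheduling new tasks once a limit is crossed. The import path, the task_result shape, and the helper name are assumptions based on the description above, not the code on the linked branch:

# Illustrative sketch only; the import path and task_result fields are assumptions.
from unblob.report import StatReport  # assumed import path

MAX_EXTRACTION_SIZE = 10 * 1024**3  # example limit
total_extracted = 0


def accumulate_task_size(task_result) -> None:
    """Sum sizes from the task's StatReports and bail out once over budget."""
    global total_extracted
    for report in getattr(task_result, "reports", []):
        if isinstance(report, StatReport) and not report.is_dir:
            total_extracted += report.size
    if total_extracted > MAX_EXTRACTION_SIZE:
        raise RuntimeError("extraction budget exhausted, stopping further tasks")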

@vlaci
Contributor

vlaci commented Feb 10, 2023

I also had a weird idea: writing a wrapper that can be LD_PRELOAD-ed, captures write calls, tracks total disk consumption, and aborts when a limit is reached.

This could also be used to wire in files created by extractors to our reports in real-time.
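
If such a shim existed, wiring it into extractor subprocess invocations would be straightforward on the Python side. The shared object path and the WRITE_LIMIT_BYTES variable below are purely hypothetical:

# Hypothetical wiring for an LD_PRELOAD write-limiting shim; neither the
# shared object nor the environment variable exists in unblob today.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/unblob/libwritelimit.so"  # hypothetical shim
env["WRITE_LIMIT_BYTES"] = str(100 * 1024**3)  # hypothetical limit

# Example external extractor invocation running under the shim.
subprocess.run(["7z", "x", "archive.zip"], env=env, check=True)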

@qkaiser
Contributor Author

qkaiser commented Feb 10, 2023

This would go against our attempt at supporting OSX (or at least Apple M1/M2).

Another idea I had was to launch a "watcher process" registering inotify callbacks on the extracted directories recursively, killing watched processes once the size limit is reached. However, inotify only exists on Linux, and the vast majority of inotify libraries in Python suck (although it wouldn't be super complex to write a modern one).
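
As a rough illustration of the watcher idea (using simple polling here instead of inotify, purely to keep the sketch short and portable; all names are illustrative):

# Sketch of a watcher that kills the extractor once the extraction
# directory grows past a size limit. Polling stands in for inotify here.
import os
import signal
import time
from pathlib import Path


def dir_size(path: Path) -> int:
    total = 0
    for p in path.rglob("*"):
        try:
            if p.is_file():
                total += p.stat().st_size
        except OSError:
            pass  # files may disappear while we walk
    return total


def watch_extraction(extract_dir: Path, pid: int, max_bytes: int, interval: float = 1.0) -> None:
    while True:
        if dir_size(extract_dir) > max_bytes:
            os.kill(pid, signal.SIGKILL)  # stop the watched extractor
            break
        time.sleep(interval)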

@vlaci
Contributor

vlaci commented Feb 10, 2023

> This would go against our attempt at supporting OSX (or at least Apple M1/M2).

There is an equivalent of LD_PRELOAD on macOS (DYLD_INSERT_LIBRARIES), but it would certainly complicate things.
