PermissionError #1527

Open
quant-exchange opened this issue Feb 3, 2024 · 7 comments


quant-exchange commented Feb 3, 2024

Hey,

The following error occurred while running in a docker container:

File "/usr/local/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 171, in _write
with io.open(filename, mode, encoding=encoding) as f:
PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/version_220/events.out.tfevents.1706995191.5f065f752aca.5108.0'
Epoch 94: 14%|█▍ | 94/600 [00:00<00:04, 142.97it/s, loss=0.000312, v_num=220, MAE=0.220, RMSE=0.267, Loss=0.000393, RegLoss=0.000]

NOTE: writes to that path had worked without issues until I recently saw this error (no permission changes were made in the container on my side).

I reran the same training and it fully succeeded. I'm not sure why this would happen, but thought you should all see it.
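
Since a plain rerun succeeded, one option is to retry automatically when this specific error appears. The following is a minimal sketch, not part of NeuralProphet: `run_with_retry` and `train_fn` are hypothetical names, where `train_fn` stands for any zero-argument callable that wraps your training call, and the attempt count and delay are illustrative defaults.

```python
import time


def run_with_retry(train_fn, max_attempts=3, delay_seconds=1.0):
    """Call train_fn(), retrying only when a PermissionError is raised.

    train_fn is a hypothetical zero-argument callable wrapping the
    training call; max_attempts and delay_seconds are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn()
        except PermissionError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            print(f"Attempt {attempt} failed with {exc!r}; retrying")
            time.sleep(delay_seconds)
```

This only masks the symptom; it does not explain why the write fails intermittently.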

Thanks,

Q.E.

@quant-exchange
Author

Got it again: PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/version_241/events.out.tfevents.1707013124.5f065f752aca.6320.0'
return fn(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning_fabric/loggers/tensorboard.py", line 272, in save
self.experiment.flush()
File "/usr/local/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 1200, in flush

@quant-exchange
Author

PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/version_290/events.out.tfevents.1707080038.5f065f752aca.10470.0'
self._run()
File "/usr/local/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", line 275, in _run
self._record_writer.write(data)
File "/usr/local/lib/python3.9/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
self._writer.write(header + header_crc + data + footer_crc)
File "/usr/local/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 773, in write

@quant-exchange
Author

quant-exchange commented Feb 4, 2024

I'm going to test whether the following prevents the error in Docker: chmod -R 777 /app/lightning_logs. I think this is relevant to the project, since Docker is a common way to run Python and packages like this one. I will update this ticket if I don't see the error occur in the next 24/7 period.

@quant-exchange
Author

quant-exchange commented Feb 6, 2024

Update for folks using this in a Docker container: chmod -R 777 /app/lightning_logs appears to have fixed the issue. I have not seen the permission errors in the logs since running it. We can close this and note that this is something you might have to do in a Python image.
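
To catch a misconfigured log directory before training starts, the check can also be done from Python. This is a small sketch under assumptions: `log_dir_is_writable` is a hypothetical helper, and it only verifies that the current user can create a file in the directory at that moment, so it would not predict an intermittent failure like the one reported here.

```python
import os
import tempfile


def log_dir_is_writable(log_dir):
    """Return True if a file can be created and removed inside log_dir."""
    try:
        os.makedirs(log_dir, exist_ok=True)
        # Creating a real temporary file is a more direct test than
        # inspecting permission bits; the file is deleted on close.
        with tempfile.NamedTemporaryFile(dir=log_dir):
            pass
        return True
    except OSError:  # PermissionError is a subclass of OSError
        return False
```

For example, calling `log_dir_is_writable("/app/lightning_logs")` at container start-up lets the job fail early with a clear message instead of partway through an epoch.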

@ourownstory
Owner

Thank you for sharing your solution to the issue you observed @quant-exchange

quant-exchange reopened this Apr 6, 2024
@quant-exchange
Author

quant-exchange commented Apr 6, 2024

Hey NP Team,

We still have NP 0.8.0 in production in a Docker container. It had been a while since we'd seen the error, so we figured our chmod updates might have mitigated it, but we have encountered the same PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/.

A few things under the hood in TensorBoard, or something system-related, could be causing this. Since we don't use the lightning_logs version_x data outside of training, I've added this code to run after training finishes:

        import os
        import shutil  # add both with your imports

        log_dir = "/app/lightning_logs/"

        # Ensure the directory exists and the safety string is in the path
        if os.path.exists(log_dir) and 'lightning_logs' in log_dir:
            # List everything in the directory
            for entry in os.listdir(log_dir):
                # Construct full path
                entry_path = os.path.join(log_dir, entry)
                # Check if it's a directory and delete it
                if os.path.isdir(entry_path):
                    shutil.rmtree(entry_path)
                    print(f"Deleted directory: {entry_path}")
            print("All folders within log directory deleted.")
        else:
            print("Log directory not found or safety check failed.")

This is a temporary workaround until more is understood about this issue. Updating TensorBoard to the most recent version in your DevOps dependencies may also be advisable. NOTE: we usually see the error once the logs have stacked up to the version_175 to version_250 range, and the epoch usually stops at around 2%.
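
If you want to keep the most recent runs rather than delete everything, a variant of the cleanup can prune only the oldest folders. This is a sketch, not a confirmed fix: `prune_old_versions` and the `keep=20` default are assumptions, chosen only to stay well below the version_175 range mentioned above, and it relies on the `version_N` folder naming seen in the tracebacks.

```python
import os
import re
import shutil


def prune_old_versions(log_dir, keep=20):
    """Delete all but the `keep` highest-numbered version_N folders.

    Returns the list of paths that were removed.
    """
    pattern = re.compile(r"^version_(\d+)$")
    versions = []
    for entry in os.listdir(log_dir):
        match = pattern.match(entry)
        path = os.path.join(log_dir, entry)
        # Only touch directories that follow the version_N naming
        if match and os.path.isdir(path):
            versions.append((int(match.group(1)), path))
    versions.sort()  # ascending by numeric version, so version_9 < version_10
    to_delete = versions[:-keep] if keep > 0 else versions
    removed = []
    for _, path in to_delete:
        shutil.rmtree(path)
        removed.append(path)
    return removed
```

Sorting by the parsed number rather than by name matters here, because a plain string sort would place version_10 before version_9.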

Thanks,

Q.E.

@quant-exchange
Author

quant-exchange commented Apr 25, 2024

Current analysis points towards memory and/or reference issues.

Recommendation: use the above script as a production batch job to clear out the logs folder. To keep and access your full log history, you can periodically copy that data from the native lightning_logs folder into, for example, Azure Blob Storage or AWS S3 buckets (just partition the data by date in your storage container), and ship it to a log service of your choice. These are just ideas.

This is about making the process production ready. Since implementing the above strategy for our project, I have NOT seen the error occur again.

@ourownstory, if you agree with these items, they would be worth considering for the documentation for anyone running this in production on a schedule; we've probably run NeuralProphet trainings thousands of times on a schedule at this point.
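
The date-partitioned copy could look roughly like the sketch below. It copies to a local or mounted folder only; `archive_logs`, `archive_root`, and the YYYY/MM/DD layout are assumptions for illustration, and an upload to Azure Blob Storage or S3 would replace the `shutil.copytree` step using that service's own client.

```python
import datetime
import os
import shutil


def archive_logs(log_dir, archive_root, run_date=None):
    """Copy log_dir into archive_root/YYYY/MM/DD/<log_dir name>.

    Returns the destination path. run_date defaults to today.
    """
    run_date = run_date or datetime.date.today()
    dest = os.path.join(
        archive_root,
        f"{run_date:%Y}",
        f"{run_date:%m}",
        f"{run_date:%d}",
        os.path.basename(os.path.normpath(log_dir)),
    )
    # dirs_exist_ok needs Python 3.8+, so repeated runs on the same day
    # merge into the same partition instead of raising.
    shutil.copytree(log_dir, dest, dirs_exist_ok=True)
    return dest
```

Running this before the cleanup step keeps the full history available while the live lightning_logs folder stays small.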
