PermissionError #1527
Comments
Got it again: PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/version_241/events.out.tfevents.1707013124.5f065f752aca.6320.0'
PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/version_290/events.out.tfevents.1707080038.5f065f752aca.10470.0'
I'm going to test the following to see whether it prevents the error in Docker (I think this is relevant to the project, since Docker is a common way to run Python and packages like this one): chmod -R 777 /app/lightning_logs. I'll update this ticket if the error does not recur over the next stretch of 24/7 operation.
Update for folks using this in a Docker container: chmod -R 777 /app/lightning_logs appears to have fixed the issue. I have not seen the permission errors in the logs since running it. We can close this, and note that this is something you might have to do in a Python image.
Thank you for sharing your solution to the issue you observed, @quant-exchange.
Hey NP Team, we still have NP 0.8.0 in production in a Docker container. It had been a while since we last saw the error, so we figured our chmod updates had mitigated it, but we have now encountered the same PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/. A few things under the hood in TensorBoard could be causing this, or it could be system related. Since we don't use the lightning_logs version_x data outside of training, I've included this code at the bottom, to run after the training is finished:
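A minimal sketch of what such a post-training cleanup could look like, assuming the logs live in /app/lightning_logs as in the tracebacks and that nothing else needs the version_* directories. This is not the original snippet; the function name `clear_lightning_logs` is hypothetical and not part of NeuralProphet.

```python
import shutil
from pathlib import Path


def clear_lightning_logs(log_dir="/app/lightning_logs"):
    """Delete every version_* run directory under log_dir.

    Returns the number of directories that were actually removed.
    The default path is an assumption taken from the tracebacks in this thread.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return 0  # nothing to clean up
    removed = 0
    for run_dir in root.glob("version_*"):
        if not run_dir.is_dir():
            continue
        # ignore_errors keeps the job alive if a file cannot be deleted
        shutil.rmtree(run_dir, ignore_errors=True)
        if not run_dir.exists():
            removed += 1
    return removed
```

Calling `clear_lightning_logs()` once training has returned keeps the version_x count from growing between scheduled runs.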
This is a temporary workaround until more is understood about this issue. Updating the DevOps dependencies to the most recent version of TensorBoard may also be advisable. NOTE: we usually see this once the logs have stacked up to the version_175-version_250 range, and the epoch usually stops around 2%. Thanks, Q.E.
Current analysis points towards memory and/or reference issues. Recommendation: use the above script as a production batch job to clear out the logs folder. You can also have all that data periodically copied from the native lightning_logs folder into Azure Blob Storage or AWS S3 buckets, for example, so that you keep and can access your full log history (just partition the data by date in your storage container for the logs) and ship it to a log service of your choice; just ideas. This is about making the process production ready: since I implemented the above strategy for our project, I have NOT seen the error occur again. @ourownstory, if you agree with these items, they would be worth considering for the documentation on running this in production on a schedule, as we've probably run NeuralProphet trainings thousands of times on a schedule at this point.
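The copy-out step described above could be sketched roughly as follows. Here a local directory stands in for the storage container, and the upload to Azure Blob Storage or S3 is left out. The function name `archive_lightning_logs` and the YYYY/MM/DD partition layout are assumptions for illustration, not something from NeuralProphet or this thread.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_lightning_logs(log_dir, archive_root, day=None):
    """Copy log_dir into archive_root/YYYY/MM/DD/<log_dir name>.

    Returns the destination path. Only copies; it does not delete the source.
    """
    day = day or date.today()
    dest = Path(archive_root) / f"{day:%Y/%m/%d}" / Path(log_dir).name
    # dirs_exist_ok lets several runs on the same day land in one partition
    # (requires Python 3.8+)
    shutil.copytree(log_dir, dest, dirs_exist_ok=True)
    return dest
```

Running this before the logs folder is cleared keeps the full history available by date, and a sync tool of your choice can then ship the archive directory to the real storage container.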
Hey,
The following error occurred while running in a docker container:
File "/usr/local/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 171, in _write
with io.open(filename, mode, encoding=encoding) as f:
PermissionError: [Errno 13] Permission denied: b'/app/lightning_logs/version_220/events.out.tfevents.1706995191.5f065f752aca.5108.0'
Epoch 94: 14%|█▍ | 94/600 [00:00<00:04, 142.97it/s, loss=0.000312, v_num=220, MAE=0.220, RMSE=0.267, Loss=0.000393, RegLoss=0.000]
NOTE: writes to that path had worked with no issues until I recently saw this error (I made no permission changes in the container, etc.).
I reran the same training and it succeeded fully. I'm not sure why this would happen, but thought you should all see it.
Thanks,
Q.E.