Training job failed "Permission denied" - with custom docker image #4678

Closed
FaustineBt opened this issue May 14, 2024 · 1 comment
Labels
bug component: training Relates to the SageMaker Training Platform

@FaustineBt

Dear all,

I am facing a problem when running a training job with SageMaker.
First, I created my own Docker image with the model I wanted to use (VAR modeling from the statsmodels.tsa.api package).
Following the documentation, I have a folder organized as follows:
train-image/
|-- var/
|------- nginx.conf
|------- serve
|------- train
|------- wsgi.py
|-- Dockerfile
|-- requirements.txt

My Dockerfile:
[image]

I pushed my custom image to ECR with the following commands:
[image]

The error occurs during the .fit() call:
[image]

I get this error :

INFO:sagemaker:Creating training-job with name: var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331
2024-05-14 12:56:20 Starting - Starting the training job...
2024-05-14 12:56:38 Starting - Preparing the instances for training...
2024-05-14 12:57:17 Downloading - Downloading the training image
2024-05-14 12:57:17 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: Permission denied
2024-05-14 12:57:45 Uploading - Uploading generated training model
2024-05-14 12:57:45 Failed - Training job failed

I tried adding some logs/prints to my train file (in the train-image/var/ folder), but I can't see any of them. I think the job fails before reaching them.

I checked the permissions of the role running the training job and gave it full access to SageMaker and S3.

I also checked the S3 bucket, and the training dataset was saved there successfully. I tried downloading it with a debug script and it worked, so I can access this file.

I read many issues but never found anything related to my error message:
[FATAL tini (7)] exec train failed: Permission denied

Traceback (most recent call last):
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 16, in var_train_jobs
    raise exp
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 13, in var_train_jobs
    varTrainer.var_train(local_run=False)
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/trainer.py", line 116, in var_train
    var_model = varEstimator.model_data
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331: Failed. Reason: AlgorithmError: , exit code: 126
python-BaseException
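
For context on the `exit code: 126` reported above: POSIX shells conventionally return 126 when a command is found but cannot be executed, which is consistent with the `exec train failed: Permission denied` line from tini. The following is a minimal local sketch of that behavior, assuming a POSIX system with `sh` available. The temporary `train` script and the helper name `exit_code_for_script` are hypothetical and are not taken from the image in this issue.

```python
import os
import subprocess
import tempfile


def exit_code_for_script(executable: bool) -> int:
    """Create a tiny script named `train` and return the exit code
    that `sh` reports when asked to run it directly."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train")  # hypothetical stand-in for the real entrypoint
        with open(path, "w") as f:
            f.write("#!/bin/sh\nexit 0\n")
        # 0o755 sets the execute bits; 0o644 leaves them off.
        os.chmod(path, 0o755 if executable else 0o644)
        # "$1" is the script path; sh tries to exec it like tini would.
        result = subprocess.run(
            ["sh", "-c", '"$1"', "sh", path],
            stderr=subprocess.DEVNULL,
        )
        return result.returncode


if __name__ == "__main__":
    print("without execute bit:", exit_code_for_script(False))
    print("with execute bit:   ", exit_code_for_script(True))
```

If the first call reports 126 on your machine, the failure mode matches what the training job shows, which suggests checking the permissions of the `train` file inside the image rather than IAM or S3 permissions.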


System information

  • SageMaker Python SDK version: sagemaker==2.177.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): VAR (from statsmodels.tsa.api)
  • Framework version: statsmodels==0.14.0
  • Python version: 3.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Yes

Thank you in advance for your help!
Sorry, I cannot provide a reproducible example.

FaustineBt added the bug label May 14, 2024

@mufaddal-rohawala (Member)

@FaustineBt Thanks for reaching out to SageMaker! To use a custom Docker image with the SageMaker library's Estimator for training, the image's Dockerfile needs to be configured in a specific way for compatibility. I think you are missing a few steps here, such as installing sagemaker-training and setting the entry point via SAGEMAKER_PROGRAM.

Here is an E2E example of how to set this up: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/build_your_own_container_with_debugger/debugger_byoc.html

Please do open another issue if more issues are encountered.
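
Separately from the steps mentioned above, the `exec train failed: Permission denied` message may also point at the `train` file itself: when a container runs `train` directly, the file generally needs an execute bit and a shebang line. This is an assumption about the cause, not something confirmed in this thread. A minimal sketch of a local pre-build check follows; the helper names `entrypoint_problems` and `make_executable` are hypothetical, and the path in the usage note is taken from the folder layout in the original post.

```python
import os
import stat


def entrypoint_problems(path: str) -> list[str]:
    """Return a list of reasons why `path` may fail to exec directly."""
    problems = []
    mode = os.stat(path).st_mode
    # At least one execute bit (user, group, or other) must be set.
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("no execute bit set")
    # A script run without an explicit interpreter needs a shebang.
    with open(path, "rb") as f:
        if f.read(2) != b"#!":
            problems.append("missing shebang line")
    return problems


def make_executable(path: str) -> None:
    """Add execute permission for user, group, and other (like `chmod +x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

For example, calling `entrypoint_problems("train-image/var/train")` before `docker build` would list any issues, and `make_executable` would fix the permission one. Whether the permission survives into the image depends on how the Dockerfile copies the file, so setting it with a `chmod +x` step inside the Dockerfile may be the more robust option.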

mufaddal-rohawala added the "component: training" label (Relates to the SageMaker Training Platform) May 17, 2024