Training job failed "Permission denied" - with custom docker image #4678

FaustineBt · 2024-05-14T13:28:19Z

Dear all,

I am facing a problem when running a training job with SageMaker.
First, I created my own Docker image with the model I wanted to use (VAR modeling from the statsmodels.tsa.api package).
Following the documentation, I have a folder organized as follows:
train-image/
|-- var/
|------- nginx.conf
|------- serve
|------- train
|------- wsgi.py
|-- Dockerfile
|-- requirements.txt

My Dockerfile:

I pushed my custom image to ECR with the following commands:

The error occurs during the .fit() call:

I get this error:

INFO:sagemaker:Creating training-job with name: var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331
2024-05-14 12:56:20 Starting - Starting the training job...
2024-05-14 12:56:38 Starting - Preparing the instances for training...
2024-05-14 12:57:17 Downloading - Downloading the training image
2024-05-14 12:57:17 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: Permission denied
2024-05-14 12:57:45 Uploading - Uploading generated training model
2024-05-14 12:57:45 Failed - Training job failed

I tried adding logs/prints to my train file (in the train-image/var/ folder), but I can't see any of them. I think the job fails before reaching them.

I looked at the permissions of the role running the training job and gave it full access to SageMaker and S3.

I also checked the S3 bucket, and the training dataset was saved there successfully. I tried downloading it with a debug script and it worked, so I can access this file.

I read tons of issues but never found anything related to my error message:
[FATAL tini (7)] exec train failed: Permission denied

Traceback (most recent call last):
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 16, in var_train_jobs
    raise exp
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 13, in var_train_jobs
    varTrainer.var_train(local_run=False)
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/trainer.py", line 116, in var_train
    var_model = varEstimator.model_data
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331: Failed. Reason: AlgorithmError: , exit code: 126
python-BaseException
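
Exit code 126 conventionally means a command was found but could not be executed, which matches tini's `exec train failed: Permission denied`: SageMaker starts the container with `train` as the command, so a `train` file without the execute bit would fail this way before any of its own code runs. The following is a minimal sketch, not a confirmed diagnosis, for checking this locally before building the image. The path `train-image/var` is taken from the folder layout in this post; the helper names are hypothetical.

```python
import stat
from pathlib import Path


def is_executable(path: Path) -> bool:
    """Return True if any execute bit (user, group, or other) is set."""
    mode = path.stat().st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def has_shebang(path: Path) -> bool:
    """Return True if the file starts with '#!' (needed to exec a script directly)."""
    with path.open("rb") as fh:
        return fh.read(2) == b"#!"


def make_executable(path: Path) -> None:
    """Add execute permission for user, group, and other (like `chmod a+x`)."""
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


if __name__ == "__main__":
    # Assumed location, based on the folder layout described in this post.
    for name in ("train", "serve"):
        script = Path("train-image/var") / name
        if not script.exists():
            print(f"{script}: not found")
            continue
        print(f"{script}: executable={is_executable(script)}, shebang={has_shebang(script)}")
        if not is_executable(script):
            make_executable(script)
            print(f"{script}: execute bit added")
```

If the bit is missing, setting it before `docker build` (or running `chmod +x` on the copied file inside the Dockerfile) and pushing the image again would be the thing to try; whether this is the actual cause here cannot be confirmed without the Dockerfile.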

Screenshots or logs

System information

SageMaker Python SDK version: sagemaker==2.177.1
Framework name (eg. PyTorch) or algorithm (eg. KMeans): VAR (from statsmodels.tsa.api)
Framework version: statsmodels==0.14.0
Python version: 3.10
CPU or GPU: CPU
Custom Docker image (Y/N): Yes

Thank you in advance for your help!
Sorry, I cannot make my issue reproducible for you.


mufaddal-rohawala · 2024-05-17T01:10:24Z

@FaustineBt Thanks for reaching out to SageMaker! To use a custom Docker image with the sagemaker library's Estimator for training, the image's Dockerfile needs to be configured in a certain way for compatibility. I think you are missing a few steps here, such as installing sagemaker-training and setting the entry point via SAGEMAKER_PROGRAM.

Here is an E2E example to set this up: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/build_your_own_container_with_debugger/debugger_byoc.html

Please do open another issue if more issues are encountered.
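
The two steps named in the reply above (installing `sagemaker-training` and setting `SAGEMAKER_PROGRAM`) can be sketched as a small Python helper that renders a Dockerfile following the pattern in the SageMaker Training Toolkit README. This is an illustrative sketch only: the function name, the base image, and the script name `train.py` are assumptions, not taken from the original Dockerfile, which is not shown in this thread.

```python
def render_dockerfile(base_image: str, entry_script: str) -> str:
    """Render a minimal Dockerfile that uses the SageMaker Training Toolkit.

    base_image and entry_script are placeholders chosen by the caller.
    """
    lines = [
        f"FROM {base_image}",
        # Install the training toolkit, which provides the `train` entry point.
        "RUN pip3 install sagemaker-training",
        # The toolkit looks for user code under /opt/ml/code.
        f"COPY {entry_script} /opt/ml/code/{entry_script}",
        # Tell the toolkit which script to run as the training program.
        f"ENV SAGEMAKER_PROGRAM {entry_script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values for illustration.
    print(render_dockerfile("python:3.10-slim", "train.py"))
```

With this approach the toolkit supplies the `train` command itself, so the image no longer depends on a hand-written `train` file being executable.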

FaustineBt added the bug label May 14, 2024

mufaddal-rohawala closed this as completed May 17, 2024

mufaddal-rohawala added the "component: training" label (Relates to the SageMaker Training Platform) May 17, 2024
