I am facing a problem when running a training job with SageMaker.
First, I created my own Docker image with the model I wanted to use (VAR modeling from the statsmodels.tsa.api package).
Following the documentation, my folder is organized as follows:
train-image/
|-- var/
|------- nginx.conf
|------- serve
|------- train
|------- wsgi.py
|-- Dockerfile
|-- requirements.txt
My Dockerfile:
I pushed my custom image to ECR with the following commands:
The error occurs during the .fit() call:
I get this error:
INFO:sagemaker:Creating training-job with name: var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331
2024-05-14 12:56:20 Starting - Starting the training job...
2024-05-14 12:56:38 Starting - Preparing the instances for training...
2024-05-14 12:57:17 Downloading - Downloading the training image
2024-05-14 12:57:17 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: Permission denied
2024-05-14 12:57:45 Uploading - Uploading generated training model
2024-05-14 12:57:45 Failed - Training job failed
I tried adding some logs/prints to my train file (in the train-image/var/ folder), but I can't see any of them, so I think it fails before the script starts running.
I checked the permissions of the role running the training job and gave it full access to SageMaker and S3.
I also checked the S3 bucket, and the training dataset was saved there successfully. I tried downloading it with a debug script and it worked, so I can access this file.
I have read many issues but never found anything related to my error message:
[FATAL tini (7)] exec train failed: Permission denied
Traceback (most recent call last):
File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 16, in var_train_jobs
raise exp
File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 13, in var_train_jobs
varTrainer.var_train(local_run=False)
File "/home/faustine/Documents/mantorai/train/resources/streamflow/trainer.py", line 116, in var_train
var_model = varEstimator.model_data
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
return run_func(*args, **kwargs)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
self.latest_training_job.wait(logs=logs)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
_logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
_check_job_status(job_name, description, "TrainingJobStatus")
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331: Failed. Reason: AlgorithmError: , exit code: 126
python-BaseException
System information
SageMaker Python SDK version: sagemaker==2.177.1
Framework name (eg. PyTorch) or algorithm (eg. KMeans): VAR (from statsmodels.tsa.api)
Framework version: statsmodels==0.14.0
Python version: 3.10
CPU or GPU: CPU
Custom Docker image (Y/N): Yes
Thank you in advance for your help!
Sorry, I cannot provide a reproducible example of my issue.
@FaustineBt Thanks for reaching out to SageMaker! To use a custom Docker image with the sagemaker library's Estimator for training, the image's Dockerfile needs to be configured in a certain way for compatibility. I think you are missing a few steps here, like installing sagemaker-training and setting the entry point with SAGEMAKER_PROGRAM.
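One more thing that may be worth checking: the message `exec train failed: Permission denied` together with exit code 126 conventionally means the command was found but could not be executed, which often happens when the `train` file lacks the execute bit. The Dockerfile is not shown here, so this is only a guess. The sketch below is a small helper to run before `docker build`; the function names are made up for illustration, and the script names come from the folder layout in the question.

```python
import stat
from pathlib import Path

# Execute permission for user, group and others (what `chmod +x` adds).
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def missing_exec_bit(path: Path) -> bool:
    """Return True if the file has no execute permission bit set at all."""
    return not (path.stat().st_mode & EXEC_BITS)


def fix_entry_scripts(image_dir: Path, names=("train", "serve")) -> list[str]:
    """Make the named scripts executable; return the names that were changed.

    `names` is an assumption based on the question's folder layout.
    Scripts that do not exist in `image_dir` are skipped silently.
    """
    changed = []
    for name in names:
        script = image_dir / name
        if script.is_file() and missing_exec_bit(script):
            # Keep the existing mode and only add the execute bits.
            script.chmod(script.stat().st_mode | EXEC_BITS)
            changed.append(name)
    return changed
```

Calling `fix_entry_scripts(Path("train-image/var"))` from the parent directory and then rebuilding and pushing the image should rule this cause out. Running `chmod +x` on the files by hand, or adding a `RUN chmod +x` step for the copied scripts in the Dockerfile, would have the same effect.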