SageMaker Hyperpod "Target not connected" #280

Open
sean-smith opened this issue Apr 22, 2024 · 0 comments
Labels
Troubleshooting Tips These are informational to make it easier to troubleshoot common issues.

Comments

@sean-smith
Contributor

If you're trying to connect to your SageMaker HyperPod cluster and you see the error "An error occurred (TargetNotConnected)", as in the output below, there are a couple of common causes:

An error occurred (TargetNotConnected) when calling the StartSession operation: sagemaker-cluster:..._controller-machine-i-... is not connected.
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

To troubleshoot, check the following:

  1. Check that your AWS credentials are configured for the right account:
aws sts get-caller-identity --query Account --output text
  2. Check that the region is correct:
aws configure get region
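
The two checks above can be wrapped in one small function. This is a minimal sketch: `check_aws_context` is a hypothetical helper name, and the expected account ID and region are values you supply yourself.

```shell
# Compare the active AWS CLI account and region with the values you expect.
# check_aws_context is a hypothetical helper, not part of the AWS CLI.
check_aws_context() {
    local expected_account="$1" expected_region="$2"
    local account region

    # Account ID that the current credentials resolve to.
    account="$(aws sts get-caller-identity --query Account --output text)" || return 2
    # Region stored in the active CLI profile (empty if none is set).
    region="$(aws configure get region)" || region=""

    if [ "$account" != "$expected_account" ]; then
        echo "account mismatch: got ${account}, expected ${expected_account}" >&2
        return 1
    fi
    if [ "$region" != "$expected_region" ]; then
        echo "region mismatch: got ${region:-<unset>}, expected ${expected_region}" >&2
        return 1
    fi
    echo "ok: account ${account}, region ${region}"
}

# Usage (placeholders are yours to fill in):
#   check_aws_context <account-id> <region>
```

Note that `aws configure get region` reports the profile setting only, so if you export a region through an environment variable, it is worth checking that value as well.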

If those checks don't resolve the issue, try to start an SSM session on a compute node instead. You'll need the cluster ID, the worker group name, and the instance ID, which you can get from the aws sagemaker list-cluster-nodes --cluster-name <cluster-name> CLI call.

aws ssm start-session \
    --target sagemaker-cluster:<cluster-id>_worker-group-1-<instance-id>
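
The target string is easy to mistype, so it may help to build it from its parts. The sketch below uses two hypothetical helpers, `cluster_id_from_arn` and `build_ssm_target`, and assumes the cluster ID is the last path segment of the cluster ARN; verify that against your own cluster before relying on it.

```shell
# Hypothetical helpers for assembling the SSM target string.
# Assumption: the cluster ID is the last path segment of the cluster ARN.
cluster_id_from_arn() {
    # Strip everything up to and including the final "/".
    printf '%s\n' "${1##*/}"
}

# Join cluster ID, instance group name, and instance ID in the format
# sagemaker-cluster:<cluster-id>_<group-name>-<instance-id>.
build_ssm_target() {
    local cluster_id="$1" group_name="$2" instance_id="$3"
    printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group_name" "$instance_id"
}

# Example (requires AWS credentials; names in <> are placeholders):
#   arn="$(aws sagemaker describe-cluster --cluster-name <cluster-name> \
#       --query ClusterArn --output text)"
#   aws sagemaker list-cluster-nodes --cluster-name <cluster-name> \
#       --query 'ClusterNodeSummaries[].[InstanceGroupName,InstanceId]' --output text
#   aws ssm start-session \
#       --target "$(build_ssm_target "$(cluster_id_from_arn "$arn")" worker-group-1 <instance-id>)"
```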

Once you're connected, you can get the IP address of the controller node by running:

sudo cat /opt/ml/config/resource_config.json | jq . | grep -5 controller-machine

That'll show:

      "Name": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "Instances": [
        {
          "InstanceName": "controller-machine-1",
          "AgentIpAddress": "172.16.90.220",
          "CustomerIpAddress": "10.1.39.83",
          "InstanceId": "i-0defeb24a1f5dfe85"
        }
      ]

Use the CustomerIpAddress (10.1.39.83 in this example) to SSH into the head node from that compute node:

ssh 10.1.39.83
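
If you'd rather not read the address out of the grep output by eye, jq can extract it directly. This is a rough sketch: `group_customer_ips` is a hypothetical helper, and it searches the whole document for an object whose Name matches, so it makes no assumption about where the group entries sit in the file beyond the fields shown above.

```shell
# Hypothetical helper: read resource_config.json on stdin and print the
# CustomerIpAddress of every instance in the named group.
# The recursive search (..) visits every value in the document, so the
# filter does not depend on the name of the enclosing key.
group_customer_ips() {
    jq -r --arg name "$1" \
        '.. | objects | select(.Name? == $name) | .Instances[]? | .CustomerIpAddress'
}

# Usage on a compute node (sudo is needed to read the file):
#   sudo cat /opt/ml/config/resource_config.json | group_customer_ips controller-machine
#   ssh "$(sudo cat /opt/ml/config/resource_config.json | group_customer_ips controller-machine | head -n 1)"
```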
@sean-smith sean-smith added the Troubleshooting Tips These are informational to make it easier to troubleshoot common issues. label Apr 25, 2024