InferenceService Model Transition in Pending/InProgress forever while inference service is operational #3686

Open
CanmingCobble opened this issue May 13, 2024 · 3 comments

@CanmingCobble

/kind bug

What steps did you take and what happened:
Deployed the InferenceService iris-classifier-deployment:

% kubectl get inferenceservices
NAME                         URL                                                             READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
iris-classifier-deployment   http://iris-classifier-deployment.default.35.226.25.89.nip.io   True           100                              iris-classifier-deployment-predictor-00001   128m

After running for some time, it got into the "InProgress" transition status and could not get out of it.

% kubectl describe inferenceservices iris-classifier-deployment
Name:         iris-classifier-deployment
Namespace:    default

...

Status:
  Address:
    URL:  http://iris-classifier-deployment.default.svc.cluster.local
  Components:
    Predictor:
      Address:
        URL:                      http://iris-classifier-deployment-predictor.default.svc.cluster.local
      Latest Created Revision:    iris-classifier-deployment-predictor-00001
      Latest Ready Revision:      iris-classifier-deployment-predictor-00001
      Latest Rolledout Revision:  iris-classifier-deployment-predictor-00001
      Traffic:
        Latest Revision:  true
        Percent:          100
        Revision Name:    iris-classifier-deployment-predictor-00001
      URL:                http://iris-classifier-deployment-predictor.default.35.226.25.89.nip.io
  Conditions:
    Last Transition Time:  2024-05-13T17:23:21Z
    Status:                True
    Type:                  IngressReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  LatestDeploymentReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  PredictorConfigurationReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Status:                True
    Type:                  PredictorReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  PredictorRouteReady
    Last Transition Time:  2024-05-13T17:23:21Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  RoutesReady
  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   1
    States:
      Active Model State:  Loaded
      Target Model State:  Pending
    Transition Status:     InProgress
  Observed Generation:     1
  URL:                     http://iris-classifier-deployment.default.35.226.25.89.nip.io
Events:                    <none>

At the same time, the deployed inference service is functional:

curl -s --show-error --fail-with-body -H Host:iris-classifier-deployment.default.35.226.25.89.nip.io -H "Authorization: Bearer " -H Content-Type:application/json http://localhost:8100/classify -d ' [[5.9, 3, 5.1, 1.8]]'
Handling connection for 8100
Prediction: {"prediction":2}

What did you expect to happen:

Expected the model transition status to be UpToDate:

  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   1
    States:
      Active Model State:  Loaded
      Target Model State:  Loaded
    Transition Status:     UpToDate
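
As a side note, the transition status field alone can be polled with a jsonpath query (field path taken from the status shown above):

% kubectl get isvc iris-classifier-deployment -n default -o jsonpath='{.status.modelStatus.transitionStatus}{"\n"}'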

What's the InferenceService yaml:
[To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]

% kubectl get isvc iris-classifier-deployment  -n default -oyaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    aioliModelSubDir: iris-model-v1
    enable-metric-aggregation: "true"
    prometheus.io/probe: "true"
    prometheus.io/scrape: "true"
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "3000"
    serving.knative.dev/progress-deadline: 1500s
    serving.kserve.io/enable-prometheus-scraping: "true"
  creationTimestamp: "2024-05-13T17:21:55Z"
  finalizers:
  - inferenceservice.finalizers
  generation: 1
  labels:
    inference/deployment-id: b6396fb0-6ed1-4869-8d89-05a8f06de9f8
    inference/packaged-model: iris-model.v1
    inference/packaged-model-id: 2368184e-44b5-4727-bd89-6f995ba9ac2e
  name: iris-classifier-deployment
  namespace: default
  resourceVersion: "134688"
  uid: c58bf65a-4f70-4c8a-87bb-a15f6973d4b7
spec:
  predictor:
    canaryTrafficPercent: 100
    containers:
    - env:
      - name: MODEL_NAME_AND_VERSION
        value: iris-model-v1
      - name: AIOLI_LOGGER_PORT
        value: "49160"
      - name: AIOLI_MODEL_SUBDIR
        value: iris-model-v1
      image: jjharrow/iris:latest
      name: kserve-container
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        limits:
          cpu: 500m
          memory: 2560Mi
        requests:
          cpu: 500m
          memory: 2560Mi
    - env:
      - name: MODEL_NAME_AND_VERSION
        value: iris-model-v1
      - name: AIOLI_LOGGER_PORT
        value: "49160"
      - name: AIOLI_MODEL_SUBDIR
        value: iris-model-v1
      image: determinedai/aioli-logger
      name: aioli-logger
      resources: {}
    logger:
      mode: all
      url: http://localhost:49160
    maxReplicas: 1
    minReplicas: 0
    scaleMetric: rps
    scaleTarget: 2
status:
  address:
    url: http://iris-classifier-deployment.default.svc.cluster.local
  components:
    predictor:
      address:
        url: http://iris-classifier-deployment-predictor.default.svc.cluster.local
      latestCreatedRevision: iris-classifier-deployment-predictor-00001
      latestReadyRevision: iris-classifier-deployment-predictor-00001
      latestRolledoutRevision: iris-classifier-deployment-predictor-00001
      traffic:
      - latestRevision: true
        percent: 100
        revisionName: iris-classifier-deployment-predictor-00001
      url: http://iris-classifier-deployment-predictor.default.35.226.25.89.nip.io
  conditions:
  - lastTransitionTime: "2024-05-13T17:23:21Z"
    status: "True"
    type: IngressReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: LatestDeploymentReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: PredictorConfigurationReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    status: "True"
    type: PredictorReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: PredictorRouteReady
  - lastTransitionTime: "2024-05-13T17:23:21Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: RoutesReady
  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Pending
    transitionStatus: InProgress
  observedGeneration: 1
  url: http://iris-classifier-deployment.default.35.226.25.89.nip.io

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

This could be related to using preemptible VMs to run the Kubernetes cluster, but we have also seen it happen when a new pod is created for a canary rollout.
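
As a rough check of the preemption theory (assuming preemptions and evictions surface as cluster events), recent disruption events can be listed and compared against the time the transition got stuck:

% kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'preempt|evict|notready'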

Environment:

  • Istio Version:
  • Knative Version:
  • KServe Version:
  • Kubeflow version:
  • Cloud Environment: [k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm]
  • Minikube/Kind version:
  • Kubernetes version: (use kubectl version):
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7-gke.1026001
  • OS (e.g. from /etc/os-release):

serdarildercaglar commented May 14, 2024

I am facing the same issue when I increase the number of workers.
#3662

@serdarildercaglar

Could you please attend to this issue?
@yuzisun

@Syntax-Error-1337

Check for Resource Limits and Quotas:
Ensure that your Kubernetes cluster has sufficient resources (CPU, memory) available and that no resource quotas are being exceeded. Preemptible VMs can also be a source of instability if they are being reclaimed by the cloud provider.
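For example, current quota usage and node headroom can be checked with the following (assuming the default namespace and an installed metrics-server for kubectl top):

kubectl describe resourcequota -n default
kubectl top pods -n default
kubectl describe nodes | grep -A 5 "Allocated resources"
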
Review Logs:
Check the logs of the InferenceService pods (the kserve-container and aioli-logger containers) for any errors or warnings. This can provide insight into what is keeping the transition status stuck at InProgress.

kubectl logs <pod-name> -c kserve-container
kubectl logs <pod-name> -c aioli-logger
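
If the pod name is not known, the predictor pods can usually be listed by label, and the KServe controller logs may also be worth a look since it is the component that writes modelStatus (assuming the standard serving.kserve.io/inferenceservice pod label and a default kserve controller namespace/container):

kubectl get pods -n default -l serving.kserve.io/inferenceservice=iris-classifier-deployment
kubectl logs -n kserve deployment/kserve-controller-manager -c manager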
