InferenceService Model Transition in Pending/InProgress forever while inference service is operational #3686

Open
CanmingCobble opened this issue May 13, 2024 · 3 comments

@CanmingCobble

/kind bug

What steps did you take and what happened:
Deployed the InferenceService iris-classifier-deployment:

% kubectl get inferenceservices
NAME                         URL                                                             READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
iris-classifier-deployment   http://iris-classifier-deployment.default.35.226.25.89.nip.io   True           100                              iris-classifier-deployment-predictor-00001   128m

After running for some time, it got into the "InProgress" transition status and could not get out of it.

% kubectl describe inferenceservices iris-classifier-deployment
Name:         iris-classifier-deployment
Namespace:    default

...

Status:
  Address:
    URL:  http://iris-classifier-deployment.default.svc.cluster.local
  Components:
    Predictor:
      Address:
        URL:                      http://iris-classifier-deployment-predictor.default.svc.cluster.local
      Latest Created Revision:    iris-classifier-deployment-predictor-00001
      Latest Ready Revision:      iris-classifier-deployment-predictor-00001
      Latest Rolledout Revision:  iris-classifier-deployment-predictor-00001
      Traffic:
        Latest Revision:  true
        Percent:          100
        Revision Name:    iris-classifier-deployment-predictor-00001
      URL:                http://iris-classifier-deployment-predictor.default.35.226.25.89.nip.io
  Conditions:
    Last Transition Time:  2024-05-13T17:23:21Z
    Status:                True
    Type:                  IngressReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  LatestDeploymentReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  PredictorConfigurationReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Status:                True
    Type:                  PredictorReady
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  PredictorRouteReady
    Last Transition Time:  2024-05-13T17:23:21Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2024-05-13T17:23:20Z
    Severity:              Info
    Status:                True
    Type:                  RoutesReady
  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   1
    States:
      Active Model State:  Loaded
      Target Model State:  Pending
    Transition Status:     InProgress
  Observed Generation:     1
  URL:                     http://iris-classifier-deployment.default.35.226.25.89.nip.io
Events:                    <none>

At the same time, the deployed inference service is functional:

curl -s --show-error --fail-with-body -H Host:iris-classifier-deployment.default.35.226.25.89.nip.io -H "Authorization: Bearer " -H Content-Type:application/json http://localhost:8100/classify -d ' [[5.9, 3, 5.1, 1.8]]'
Handling connection for 8100
Prediction: {"prediction":2}

What did you expect to happen:

Expected the model transition status to be UpToDate:

  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   1
    States:
      Active Model State:  Loaded
      Target Model State:  Loaded
    Transition Status:     UpToDate
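
As a side note, the transition status field alone can be polled with a jsonpath query (field path taken from the status shown above):

% kubectl get isvc iris-classifier-deployment -n default -o jsonpath='{.status.modelStatus.transitionStatus}{"\n"}'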

What's the InferenceService yaml:
[To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]

% kubectl get isvc iris-classifier-deployment  -n default -oyaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    aioliModelSubDir: iris-model-v1
    enable-metric-aggregation: "true"
    prometheus.io/probe: "true"
    prometheus.io/scrape: "true"
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "3000"
    serving.knative.dev/progress-deadline: 1500s
    serving.kserve.io/enable-prometheus-scraping: "true"
  creationTimestamp: "2024-05-13T17:21:55Z"
  finalizers:
  - inferenceservice.finalizers
  generation: 1
  labels:
    inference/deployment-id: b6396fb0-6ed1-4869-8d89-05a8f06de9f8
    inference/packaged-model: iris-model.v1
    inference/packaged-model-id: 2368184e-44b5-4727-bd89-6f995ba9ac2e
  name: iris-classifier-deployment
  namespace: default
  resourceVersion: "134688"
  uid: c58bf65a-4f70-4c8a-87bb-a15f6973d4b7
spec:
  predictor:
    canaryTrafficPercent: 100
    containers:
    - env:
      - name: MODEL_NAME_AND_VERSION
        value: iris-model-v1
      - name: AIOLI_LOGGER_PORT
        value: "49160"
      - name: AIOLI_MODEL_SUBDIR
        value: iris-model-v1
      image: jjharrow/iris:latest
      name: kserve-container
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        limits:
          cpu: 500m
          memory: 2560Mi
        requests:
          cpu: 500m
          memory: 2560Mi
    - env:
      - name: MODEL_NAME_AND_VERSION
        value: iris-model-v1
      - name: AIOLI_LOGGER_PORT
        value: "49160"
      - name: AIOLI_MODEL_SUBDIR
        value: iris-model-v1
      image: determinedai/aioli-logger
      name: aioli-logger
      resources: {}
    logger:
      mode: all
      url: http://localhost:49160
    maxReplicas: 1
    minReplicas: 0
    scaleMetric: rps
    scaleTarget: 2
status:
  address:
    url: http://iris-classifier-deployment.default.svc.cluster.local
  components:
    predictor:
      address:
        url: http://iris-classifier-deployment-predictor.default.svc.cluster.local
      latestCreatedRevision: iris-classifier-deployment-predictor-00001
      latestReadyRevision: iris-classifier-deployment-predictor-00001
      latestRolledoutRevision: iris-classifier-deployment-predictor-00001
      traffic:
      - latestRevision: true
        percent: 100
        revisionName: iris-classifier-deployment-predictor-00001
      url: http://iris-classifier-deployment-predictor.default.35.226.25.89.nip.io
  conditions:
  - lastTransitionTime: "2024-05-13T17:23:21Z"
    status: "True"
    type: IngressReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: LatestDeploymentReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: PredictorConfigurationReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    status: "True"
    type: PredictorReady
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: PredictorRouteReady
  - lastTransitionTime: "2024-05-13T17:23:21Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-05-13T17:23:20Z"
    severity: Info
    status: "True"
    type: RoutesReady
  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Pending
    transitionStatus: InProgress
  observedGeneration: 1
  url: http://iris-classifier-deployment.default.35.226.25.89.nip.io

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

This could be related to using preemptible VMs to run the Kubernetes cluster, but we have also seen it happen when a new pod is created for a canary rollout.
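
As a rough check of the preemption theory (assuming preemptions and evictions surface as cluster events), recent disruption events can be listed and compared against the time the transition got stuck:

% kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'preempt|evict|notready'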

Environment:

  • Istio Version:
  • Knative Version:
  • KServe Version:
  • Kubeflow version:
  • Cloud Environment: [k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm]
  • Minikube/Kind version:
  • Kubernetes version: (use kubectl version):
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7-gke.1026001
  • OS (e.g. from /etc/os-release):

serdarildercaglar commented May 14, 2024

I am facing the same issue when I increase the number of workers.
#3662

@serdarildercaglar

Could you please attend to this issue?
@yuzisun

@Syntax-Error-1337

Check for Resource Limits and Quotas:
Ensure that your Kubernetes cluster has sufficient resources (CPU, memory) available and that no resource quotas are being exceeded. Preemptible VMs can also be a source of instability if they are being reclaimed by the cloud provider.
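For example, current quota usage and node headroom can be checked with the following (assuming the default namespace and an installed metrics-server for kubectl top):

kubectl describe resourcequota -n default
kubectl top pods -n default
kubectl describe nodes | grep -A 5 "Allocated resources"
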
Review Logs:
Check the logs of the InferenceService pods (the kserve-container and aioli-logger containers) for any errors or warnings. This can provide insight into what is keeping the transition status stuck at InProgress.

kubectl logs <pod-name> -c kserve-container
kubectl logs <pod-name> -c aioli-logger
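
If the pod name is not known, the predictor pods can usually be listed by label, and the KServe controller logs may also be worth a look since it is the component that writes modelStatus (assuming the standard serving.kserve.io/inferenceservice pod label and a default kserve controller namespace/container):

kubectl get pods -n default -l serving.kserve.io/inferenceservice=iris-classifier-deployment
kubectl logs -n kserve deployment/kserve-controller-manager -c manager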
