Daprd deleting actors unexpectedly #7734

grizzlybearg · 2024-05-15T11:17:25Z

In what area(s)?

runtime
daprd

What version of Dapr?

edge: daprio/daprd:edge (latest release)

Expected Behavior

My app uses the actor component. The actor in our code has a timer that is triggered at regular intervals. I expect the timer to keep firing without daprd deleting the actors.
I also expect the daprd container to be able to run a healthz check against the actor app without failure.

Actual Behavior

The daprd logs show the following log:

2024-05-15 13:00:15 time="2024-05-15T10:00:15.742793315Z" level=error msg="Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:20 time="2024-05-15T10:00:20.743193975Z" level=error msg="Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:20 time="2024-05-15T10:00:20.743313885Z" level=warning msg="Actor health check failed 4 times, marking unhealthy" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:21 time="2024-05-15T10:00:21.056283208Z" level=debug msg="Disconnecting from placement service by the unhealthy app" app_id=envrunneractor instance=bfd5899a36a7 scope=dapr.runtime.actors.placement type=log ver=edge
2024-05-15 13:00:21 time="2024-05-15T10:00:21.057393103Z" level=debug msg="Halting actor 'envrunneractor||TESTER'" app_id=envrunneractor instance=bfd5899a36a7 scope=dapr.runtime.actor type=log ver=edge

These messages tend to appear when an actor is created for the first time or when a timer callback is invoked. Immediately after they appear, all actors are deleted (deactivated). With thousands of actors, recovering them is compute intensive because a lot of data is associated with each actor. It would therefore be ideal to stop this seemingly random deactivation of actors. I have been unable to disable the actor healthz check.

I've confirmed that our internal code works without any issues, even without dapr (in a single-process runtime), so it should not be a reason for the daprd runtime to delete actors.

Notes:

I do know that dapr creates the healthz endpoint for the actor component automatically.

I've confirmed that the healthz url is working.

Steps to Reproduce the Problem

Our internal code isn't public, but I can share the docker compose file that we use from dev to CI/CD:

name: app
services:
  envrunneractor:
    container_name: main
    image: "local:latest"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/workspace/sdk:/workspaces/sdk:cached

    command: >
      /bin/bash -c "
        uvicorn actor_service:app --port 8884 --host 10.5.0.6"
    ports:
      - "50002:50002"
      - name: main_app
        target: 8884
        host_ip: 0.0.0.0
        published: "8884"
        protocol: tcp
        app_protocol: http

    depends_on:
      - redis
      - placement
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.6

    environment:
      DAPR_API_METHOD_INVOCATION_PROTOCOL: "http"
      DAPR_GRPC_ENDPOINT: "10.5.0.6:50002?tls=true"
      DAPR_HTTP_ENDPOINT: "http://10.5.0.6:3500"
      DAPR_HTTP_PORT: "3500"
      DAPR_GRPC_PORT: "50002"
      APP_ID: "envrunneractor"
      DAPR_HEALTH_TIMEOUT: "3000"

  runner-dapr:
    image: "daprio/daprd:edge"
    container_name: dapr
    environment:
      DAPR_HOST_IP: "10.5.0.6"
      APP_PORT: "8884"

    command: "./daprd \
        --app-id envrunneractor \
        --app-port 8884 \
        --dapr-grpc-port 50002 \
        --dapr-http-port 3500 \
        --resources-path /components \
        --log-level debug \
        --mode standalone \
        --actors-service placement:10.5.0.7:50004,10.5.0.6:50002 \
        --app-protocol http \
        --app-channel-address 10.5.0.6"

    volumes:
      - "./components/:/components"
    depends_on:
      - envrunneractor
    network_mode: "service:envrunneractor"

  ############################
  # Dapr placement service
  ############################
  placement:
    container_name: placement
    image: "daprio/dapr:edge"
    command: "./placement \
        --port 50004 \
        --log-level debug"
    ports:
      - "50004:50004"
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.7

  ############################
  # Redis state store
  ############################
  redis:
    container_name: redis
    image: "redis:6"
    ports:
      - "6379:6379"
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.8

networks:
  dev-cloud:
    external: true

Our internal code is inspired by the dapr example found at: https://github.com/dapr/python-sdk/tree/release-1.0/examples/demo_actor

ItalyPaleAle · 2024-05-15T14:46:23Z

2 things:

1. Timers are only invoked if the actor is already active. If the actor gets deactivated for any reason, including rebalancing (which can happen randomly if you scale Dapr), then the timers won't fire. If you want a "persistent" timer, you should use a reminder.
2. From the logs, it appears that Dapr can't invoke /healthz on your app. Implementing a /healthz endpoint in your app is required for using actors, and it must respond with a 2xx status code. Seems that your app may have temporarily stopped responding?
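
A sketch of one way an app could "temporarily stop responding", under an assumption that is not confirmed anywhere in this thread: the timer callback does synchronous work on the same asyncio event loop that serves /healthz. The names `healthz`, `blocking_timer_callback`, `offloaded_timer_callback` and `health_latency` are hypothetical stand-ins, not Dapr SDK functions, and the snippet uses only the Python standard library (3.9+ for `asyncio.to_thread`).

```python
import asyncio
import time


async def healthz() -> str:
    # Stand-in for the app's /healthz handler: trivially fast on its own.
    return "ok"


async def blocking_timer_callback(seconds: float) -> None:
    # Hypothetical timer callback doing synchronous work.
    # time.sleep() holds the event loop, so no other coroutine can run meanwhile.
    time.sleep(seconds)


async def offloaded_timer_callback(seconds: float) -> None:
    # Same work, pushed to a worker thread so the event loop stays free.
    await asyncio.to_thread(time.sleep, seconds)


async def health_latency(callback, work_seconds: float) -> float:
    """Schedule the callback, then a health check, and time the health check."""
    started = time.perf_counter()
    work = asyncio.create_task(callback(work_seconds))
    probe = asyncio.create_task(healthz())
    await probe
    latency = time.perf_counter() - started
    await work  # let the callback finish before returning
    return latency


if __name__ == "__main__":
    blocked = asyncio.run(health_latency(blocking_timer_callback, 0.3))
    offloaded = asyncio.run(health_latency(offloaded_timer_callback, 0.3))
    print(f"health check latency with blocking callback:  {blocked:.3f}s")
    print(f"health check latency with offloaded callback: {offloaded:.3f}s")
```

If the blocking variant matches what the real callback does, the health check cannot be answered until the callback returns, which would look like a timeout from the sidecar's side even though the endpoint works when called at other times.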

grizzlybearg · 2024-05-15T16:48:54Z

@ItalyPaleAle

I've noticed that if the actor is deactivated, the reminders will not fire either.
The dapr actor sdk for fastapi automatically sets up a /healthz endpoint. Are you suggesting I find a way to replace the existing endpoint? Second, the existing healthz endpoint (set up automatically by the actor sdk) does work when invoked manually, both from the host and from within the daprd container. As you can see from the image, it does respond with a 2xx status code.

Is there a reason why daprd tries to send a http://10.5.0.6:8884/healthz request only when the timers and reminders fire? Is there a way to disable this? The docs suggest that the health check is disabled by default. I don't understand why daprd still tries to invoke the health endpoint with the health check disabled.
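
A manual request only shows that the endpoint answers at that moment. A minimal sketch of a probe that applies its own deadline and can be left running while timers fire is given here. `probe_healthz` and `watch` are hypothetical helper names, and any timeout passed in is an arbitrary choice, not daprd's actual deadline.

```python
import time
import urllib.request


def probe_healthz(url: str, timeout_s: float) -> tuple[bool, float]:
    """Return (ok, elapsed_seconds) for one GET against a health endpoint."""
    started = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (non-2xx), timeouts and refused connections
        # are all OSError subclasses; treat every one as a failed probe.
        ok = False
    return ok, time.perf_counter() - started


def watch(url: str, timeout_s: float, interval_s: float, count: int) -> list[tuple[bool, float]]:
    """Probe the endpoint `count` times, pausing `interval_s` between probes."""
    results = []
    for _ in range(count):
        results.append(probe_healthz(url, timeout_s))
        time.sleep(interval_s)
    return results
```

For example, `watch("http://10.5.0.6:8884/healthz", timeout_s=1.0, interval_s=0.5, count=60)` run from inside the daprd container while a timer is firing would show whether any probe fails or slows down at those moments.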

elena-kolevska · 2024-05-21T23:39:05Z

@grizzlybearg The reason we have to have a healthz check is so that dapr can know if the application can still serve those actor types. Otherwise, we could end up in a scenario where an app comes online, registers actor types A, B and C, and crashes, but the placement service and dapr sidecar still try to invoke actors on it.
When the placement service is aware that a host is down, it will properly rebalance the actors and it will forward the requests to other hosts that host the same actor type.
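
The sequence in the logs (repeated probe failures, then "failed 4 times, marking unhealthy", then disconnecting from placement and halting actors) can be modelled with a small failure counter. This is an illustrative sketch, not Dapr's implementation: `HealthTracker` and `replay` are hypothetical names, the threshold of 4 is read off the log line, and resetting the counter on a successful probe is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    failure_threshold: int = 4      # from the "failed 4 times" log line
    consecutive_failures: int = 0
    healthy: bool = True

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            # Assumption: a single success clears the failure streak.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # At this point the logs show the sidecar leaving placement
                # and halting the actors hosted by this app.
                self.healthy = False
        return self.healthy


def replay(probes: list[bool]) -> list[bool]:
    """Feed a sequence of probe results through a fresh tracker."""
    tracker = HealthTracker()
    return [tracker.record(ok) for ok in probes]
```

Under these assumptions, isolated slow responses are tolerated, but a run of four failed probes in a row is enough to take every actor on the host offline.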

grizzlybearg added the kind/bug Something isn't working label May 15, 2024
