Daprd deleting actors unexpectedly #7734

Open
grizzlybearg opened this issue May 15, 2024 · 3 comments
Labels
kind/bug Something isn't working

Comments

@grizzlybearg

In what area(s)?

runtime
daprd

What version of Dapr?

edge: daprio/daprd:edge (latest release)

Expected Behavior

My app uses Dapr actors. The actor in our code has a timer that fires at regular intervals. I expect the timer to keep firing without daprd deactivating the actors.
I also expect daprd to be able to run its healthz check against the actor app without failures.

Actual Behavior

The daprd logs show the following:

2024-05-15 13:00:15 time="2024-05-15T10:00:15.742793315Z" level=error msg="Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:20 time="2024-05-15T10:00:20.743193975Z" level=error msg="Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:20 time="2024-05-15T10:00:20.743313885Z" level=warning msg="Actor health check failed 4 times, marking unhealthy" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:21 time="2024-05-15T10:00:21.056283208Z" level=debug msg="Disconnecting from placement service by the unhealthy app" app_id=envrunneractor instance=bfd5899a36a7 scope=dapr.runtime.actors.placement type=log ver=edge
2024-05-15 13:00:21 time="2024-05-15T10:00:21.057393103Z" level=debug msg="Halting actor 'envrunneractor||TESTER'" app_id=envrunneractor instance=bfd5899a36a7 scope=dapr.runtime.actor type=log ver=edge

These messages tend to appear when an actor is created for the first time or when a timer callback is invoked. Immediately afterwards, all actors are deactivated. With thousands of actors, recovering them is compute-intensive because each actor has a lot of data associated with it, so we would like to stop these unexpected deactivations. I have been unable to disable the actor healthz check.
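
For triage, daprd log lines like the ones in the excerpt can be tallied to see how many health probe errors precede the "marking unhealthy" warning. This is a minimal sketch, not a Dapr tool: it assumes the key=value layout shown in the excerpt, and the helper names (`parse_line`, `failures_before_unhealthy`) are made up for illustration.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (with backslash escapes) or a bare token without spaces.
PAIR = re.compile(r'(\w+)=("(?:\\.|[^"\\])*"|\S+)')


def parse_line(line: str) -> dict:
    """Split one daprd log line into a dict of its key=value fields."""
    fields = {}
    for key, value in PAIR.findall(line):
        if value.startswith('"'):
            # Drop the surrounding quotes and unescape \" inside the message.
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def failures_before_unhealthy(lines) -> int:
    """Count actorshealth errors seen up to the first 'marking unhealthy' warning."""
    errors = 0
    for line in lines:
        fields = parse_line(line)
        if fields.get("scope") != "actorshealth":
            continue
        if fields.get("level") == "error":
            errors += 1
        elif "marking unhealthy" in fields.get("msg", ""):
            break
    return errors
```

Feeding the function the lines of a saved daprd log (for example `open("daprd.log")`, a hypothetical file name) shows whether the failures cluster around actor creation or timer callbacks.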

I've confirmed that our internal code works without any issues, even when run without Dapr as a single-process runtime, so the app itself should not be giving daprd a reason to deactivate actors.

Notes:

I do know that Dapr creates the healthz endpoint for the actor component automatically:
(screenshot)

I've confirmed that the healthz URL is working:
(screenshot)
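
A single successful request does not rule out intermittent stalls, and the daprd log reports `context deadline exceeded`, which is a timeout rather than a bad status code. A minimal sketch for polling the endpoint with a short timeout and recording failed probes follows. The function names, and any timeout or interval you pass in, are assumptions for illustration and not Dapr's actual probe settings.

```python
import time
import urllib.error
import urllib.request


def probe_once(url: str, timeout: float) -> tuple[bool, float]:
    """Return (ok, elapsed_seconds) for one GET; ok means a 2xx reply arrived in time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection errors, HTTP error statuses, and timeouts.
        ok = False
    return ok, time.monotonic() - start


def probe_loop(url, attempts, interval, timeout, probe=probe_once, sleep=time.sleep):
    """Probe `attempts` times and return (attempt_index, elapsed) for each failure."""
    failures = []
    for i in range(attempts):
        ok, elapsed = probe(url, timeout)
        if not ok:
            failures.append((i, elapsed))
        sleep(interval)
    return failures
```

Running something like `probe_loop("http://10.5.0.6:8884/healthz", attempts=60, interval=1.0, timeout=1.0)` while timers are firing would show whether the endpoint ever stalls. The `probe` and `sleep` parameters exist only so the loop can be exercised without a network.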

Steps to Reproduce the Problem

Our internal code isn't public, but I can share the docker compose file that we use from dev through CI/CD:

name: app
services:
  envrunneractor:
    container_name: main
    image: "local:latest"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/workspace/sdk:/workspaces/sdk:cached

    command: >
      /bin/bash -c "
        uvicorn actor_service:app --port 8884 --host 10.5.0.6"
    ports:
      - "50002:50002"
      - name: main_app
        target: 8884
        host_ip: 0.0.0.0
        published: "8884"
        protocol: tcp
        app_protocol: http

    depends_on:
      - redis
      - placement
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.6

    environment:
      DAPR_API_METHOD_INVOCATION_PROTOCOL: "http"
      DAPR_GRPC_ENDPOINT: "10.5.0.6:50002?tls=true"
      DAPR_HTTP_ENDPOINT: "http://10.5.0.6:3500"
      DAPR_HTTP_PORT: "3500"
      DAPR_GRPC_PORT: "50002"
      APP_ID: "envrunneractor"
      DAPR_HEALTH_TIMEOUT: "3000"

  runner-dapr:
    image: "daprio/daprd:edge"
    container_name: dapr
    environment:
      DAPR_HOST_IP: "10.5.0.6"
      APP_PORT: "8884"

    command: "./daprd \
        --app-id envrunneractor \
        --app-port 8884 \
        --dapr-grpc-port 50002 \
        --dapr-http-port 3500 \
        --resources-path /components \
        --log-level debug \
        --mode standalone \
        --actors-service placement:10.5.0.7:50004,10.5.0.6:50002 \
        --app-protocol http \
        --app-channel-address 10.5.0.6"

    volumes:
      - "./components/:/components"
    depends_on:
      - envrunneractor
    network_mode: "service:envrunneractor"

  ############################
  # Dapr placement service
  ############################
  placement:
    container_name: placement
    image: "daprio/dapr:edge"
    command: "./placement \
        --port 50004 \
        --log-level debug"
    ports:
      - "50004:50004"
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.7

  ############################
  # Redis state store
  ############################
  redis:
    container_name: redis
    image: "redis:6"
    ports:
      - "6379:6379"
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.8

networks:
  dev-cloud:
    external: true

Our internal code is inspired by the dapr example found at: https://github.com/dapr/python-sdk/tree/release-1.0/examples/demo_actor

grizzlybearg added the kind/bug label May 15, 2024
@ItalyPaleAle
Contributor

2 things:

  1. Timers are only invoked if the actor is already active. If the actor gets deactivated for any reason, including rebalancing (which can happen randomly if you scale Dapr), then the timers won't fire. If you want a "persistent" timer, you should use a reminder.
  2. From the logs, it appears that Dapr can't invoke /healthz on your app. Implementing a /healthz endpoint in your app is required for using actors, and it must respond with a 2xx status code. It seems that your app may have temporarily stopped responding?
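
One way an app served by uvicorn can temporarily stop responding is a blocking call inside an async handler: while it runs, the event loop cannot answer /healthz. Whether that is what happens in this issue is not established in the thread. The sketch below only illustrates the mechanism with plain asyncio; `heavy_work`, `health_latency` and the timings are hypothetical stand-ins for a timer callback and a health probe.

```python
import asyncio
import time


def heavy_work(seconds: float) -> None:
    """Stand-in for a synchronous, long-running timer callback body."""
    time.sleep(seconds)


async def blocking_callback(seconds: float) -> None:
    # Runs the blocking call on the event loop thread: nothing else can run meanwhile.
    heavy_work(seconds)


async def offloaded_callback(seconds: float) -> None:
    # Moves the blocking call to a worker thread so the loop stays responsive.
    await asyncio.to_thread(heavy_work, seconds)


async def health_latency(callback, seconds: float) -> float:
    """Measure how long a trivial 'health check' takes while `callback` runs."""
    task = asyncio.create_task(callback(seconds))
    start = time.monotonic()
    await asyncio.sleep(0.01)  # the "health check": nominally about 10 ms
    elapsed = time.monotonic() - start
    await task
    return elapsed
```

With `blocking_callback`, the measured latency is roughly the full duration of `heavy_work`; with `offloaded_callback`, it stays close to the nominal 10 ms. If a timer callback does heavy synchronous work, offloading it in this way is one option to keep the health endpoint reachable.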

@grizzlybearg
Author

@ItalyPaleAle

  1. I've noticed that if the actor is deactivated, the reminders do not fire either.
  2. The Dapr actor SDK for FastAPI automatically sets up a /healthz endpoint. Are you suggesting I find a way to replace the existing endpoint? Also, the existing endpoint (set up automatically by the actor SDK) does work when invoked manually, both from the host and from within the daprd container (screenshot). As the screenshot shows, it responds with a 2xx status code.

Is there a reason why daprd sends the http://10.5.0.6:8884/healthz request only when the timers and reminders fire? Is there a way to disable this? The docs suggest that the health check is disabled by default, so I don't understand why daprd still invokes the health endpoint with the health check disabled.

@elena-kolevska
Contributor

@grizzlybearg The reason we need a healthz check is so that Dapr knows whether the application can still serve those actor types. Otherwise, we could end up in a scenario where an app comes online, registers actor types A, B and C, and crashes, but the placement service and Dapr sidecar still try to invoke actors on it.
When the placement service is aware that a host is down, it rebalances the actors and forwards requests to other hosts that serve the same actor type.
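
The scenario described above can be pictured with a toy model: hosts register the actor types they serve, a failed health check removes the host, and lookups fall through to the remaining hosts. This is a deliberately simplified sketch and not Dapr's placement implementation; the class and method names are invented for illustration, and the host-selection rule (a stable hash over a sorted host list) is an assumption.

```python
import hashlib
from collections import defaultdict


class ToyPlacement:
    """Tracks which healthy hosts serve each actor type."""

    def __init__(self):
        self._hosts_by_type = defaultdict(set)

    def register(self, host: str, actor_types: list[str]) -> None:
        """Record that `host` can serve every type in `actor_types`."""
        for actor_type in actor_types:
            self._hosts_by_type[actor_type].add(host)

    def mark_unhealthy(self, host: str) -> None:
        # Remove the host everywhere so no new invocations are routed to it.
        for hosts in self._hosts_by_type.values():
            hosts.discard(host)

    def lookup(self, actor_type: str, actor_id: str):
        """Pick a host for the actor, or None if no healthy host serves the type."""
        hosts = sorted(self._hosts_by_type.get(actor_type, ()))
        if not hosts:
            return None
        digest = hashlib.sha256(f"{actor_type}||{actor_id}".encode()).digest()
        return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]
```

In this model, an actor type served by a single host has nowhere to go once that host is marked unhealthy, which matches the single-container setup in this issue: every actor is halted until the host is considered healthy again.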


3 participants