
PostgreSQL - connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections #5071

Open
agusmakmun opened this issue May 15, 2024 · 29 comments

@agusmakmun

agusmakmun commented May 15, 2024

What happened?

I'm getting this PostgreSQL error after upgrading the Django cookiecutter for my project.

connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections

I'm using many Celery tasks and websocket connections, and I deploy with k8s in my project.

Similar issues are mentioned on Stack Overflow.

One of the threads mentions this package, but I'm not sure whether it works:
https://github.com/altairbow/django-db-connection-pool

Details

  • Python version, run python3 -V: 3.12
  • Docker version (if using Docker), run docker --version: 3
  • docker compose version (if using Docker), run docker compose --version: 3
  • Cookie cutter Django: 2024.04.22
(env-my-labs) ➜  env-my-labs cookiecutter https://github.com/cookiecutter/cookiecutter-django
  [1/27] project_name (My Awesome Project): My Project Labs
  [2/27] project_slug (practical_devsecops_cloud_labs):
  [3/27] description (Behold My Awesome Project!): My Project Labs
  [4/27] author_name (Daniel Roy Greenfeld): My Project Developer
  [5/27] domain_name (example.com): my-domain.com
  [6/27] email (practical-devsecops-developer@my-domain.com): info@my-domain.com
  [7/27] version (0.1.0): 3.7.0
  [8/27] Select open_source_license
    1 - MIT
    2 - BSD
    3 - GPLv3
    4 - Apache Software License 2.0
    5 - Not open source
    Choose from [1/2/3/4/5] (1): 5
  [9/27] Select username_type
    1 - username
    2 - email
    Choose from [1/2] (1): 1
  [10/27] timezone (UTC):
  [11/27] windows (n): n
  [12/27] Select editor
    1 - None
    2 - PyCharm
    3 - VS Code
    Choose from [1/2/3] (1): 3
  [13/27] use_docker (n): y
  [14/27] Select postgresql_version
    1 - 16
    2 - 15
    3 - 14
    4 - 13
    5 - 12
    Choose from [1/2/3/4/5] (1): 5
  [15/27] Select cloud_provider
    1 - AWS
    2 - GCP
    3 - Azure
    4 - None
    Choose from [1/2/3/4] (1): 1
  [16/27] Select mail_service
    1 - Mailgun
    2 - Amazon SES
    3 - Mailjet
    4 - Mandrill
    5 - Postmark
    6 - Sendgrid
    7 - SendinBlue
    8 - SparkPost
    9 - Other SMTP
    Choose from [1/2/3/4/5/6/7/8/9] (1): 5
  [17/27] use_async (n): y
  [18/27] use_drf (n): y
  [19/27] Select frontend_pipeline
    1 - None
    2 - Django Compressor
    3 - Gulp
    4 - Webpack
    Choose from [1/2/3/4] (1): 1
  [20/27] use_celery (n): y
  [21/27] use_mailpit (n): y
  [22/27] use_sentry (n): y
  [23/27] use_whitenoise (n): n
  [24/27] use_heroku (n): n
  [25/27] Select ci_tool
    1 - None
    2 - Travis
    3 - Gitlab
    4 - Github
    5 - Drone
    Choose from [1/2/3/4/5] (1): 3
  [26/27] keep_local_envs_in_vcs (y): y
  [27/27] debug (n): n
 [SUCCESS]: Project initialized, keep up the good work!
@agusmakmun agusmakmun added the bug label May 15, 2024
@foarsitter
Collaborator

This is a postgresql configuration issue and not related to this project.

@foarsitter foarsitter added the wontfix (Wait 10 days and automatically close) label May 15, 2024
@agusmakmun
Author

But the problem is, this doesn't happen on Django 3.2 and only appeared after migrating to Django 4.
Is there any difference between the versions? I checked CONN_MAX_AGE, which uses the same 60s value.

@github-actions github-actions bot removed the wontfix (Wait 10 days and automatically close) label May 15, 2024
@foarsitter
Collaborator

If that is the problem, why don't you report it as such?

  • What else did you change during the migration?
  • What lib are you using for websockets, since this template doesn't include them?
  • What is your max_connections? Run the show max_connections; query to find out.
  • Did you use async before Django 4?
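For reference, the show max_connections check can also be scripted. Below is a minimal sketch, assuming psycopg 3 is installed; the DSN shown is a placeholder, not a real connection string from this project:

```python
# Sketch: read PostgreSQL's max_connections setting with psycopg 3.

def parse_max_connections(row):
    """SHOW returns a single-column text row; convert it to an int."""
    return int(row[0])

def fetch_max_connections(dsn):
    import psycopg  # third-party: pip install psycopg

    with psycopg.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SHOW max_connections;")
            return parse_max_connections(cur.fetchone())

# Usage against a live server (placeholder DSN):
#   fetch_max_connections("postgresql://user:pass@localhost:5432/mydb")
# Postgres 12 ships with a default of 100.
```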

@agusmakmun
Author

agusmakmun commented May 15, 2024

I'm using django channels for the websocket connections, and I upgraded to these packages.

django==4.2.11  # pyup: < 5.0  # https://www.djangoproject.com/

# Django Async
channels==4.1.0  # https://github.com/django/channels/
channels-redis==4.2.0  # https://github.com/django/channels_redis/

# asgi_redis (async to sync)
asgiref==3.8.1  # https://github.com/django/asgiref/

max_connections is 100, the default for Postgres 12.

Did you use async before Django 4?

I'm only using async for the websocket things, plus Celery tasks to clean up the sandbox machines.

@foarsitter
Collaborator

  • max_connections isn't related to your project, so do you use your database for more projects?
  • How many gunicorn workers do you use?
  • How many celery workers do you use?
  • If you use the async/uvicorn option you are using the uvicorn worker, correct?

Increasing max_connections to 400 would be a simple solution.

@agusmakmun
Author

  1. Nope, I'm using a single application, but set up using k8s.
  2. 12 workers, as I manually configured in compose/production/django/start:
exec /usr/local/bin/gunicorn config.asgi --bind 0.0.0.0:5000 --chdir=/app -k uvicorn.workers.UvicornWorker --workers=${UVICORN_TOTAL_OF_WORKERS:=12}
  3. I'm not sure about that, but we're using a single celeryworker pod.
kubectl --kubeconfig=.vscode/development/kubeconfig get pods
NAME                           READY   STATUS    RESTARTS   AGE
celerybeat-7dccfd8965-mbjt9    2/2     Running   0          4h31m
celeryworker-fc7d59c8c-9bgcv   2/2     Running   0          4h31m
django-5d54d9ddbf-dmpk6        2/2     Running   0          4h31m
flower-78c4b8f74f-8h8hx        2/2     Running   0          4h31m
next-849ff6c88b-95tgn          1/1     Running   0          4h31m
websocket-768b6cd964-gj9j4     2/2     Running   0          4h31m
  4. Yup, we're using uvicorn workers as shown above.

@agusmakmun
Author

Let's say many users are using the websocket at the same time. Will increasing max_connections to 400 solve the problem?

@foarsitter
Collaborator

foarsitter commented May 15, 2024

How many workers does celery use? 12?

Do the math: you are using two Django pods, resulting in at least 24 connections. I assume the same for celery, which makes at least 48 in total. You are running beat twice, which will result in double scheduling. Why are you running flower twice? Do next & websocket read the database?

However, my guess is that the catch is in worker-connections. Each worker holds many HTTP connections, and each HTTP connection has its own database connection. So with your 24 workers and the default worker-connections of 1000, you need up to 24000 database connections for Django alone. Here is some more info: https://stackoverflow.com/questions/63471960/gunicorn-uvicorn-worker-py-how-to-honor-limit-concurrency-setting
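That worst-case arithmetic can be sketched directly (illustrative numbers from this thread, not a definitive model of connection usage):

```python
# Worst case: every concurrent HTTP connection held by an async worker
# opens its own database connection.

def worst_case_db_connections(pods, workers_per_pod, worker_connections=1000):
    return pods * workers_per_pod * worker_connections

# 2 pods x 12 uvicorn workers x gunicorn's default worker-connections of 1000:
print(worst_case_db_connections(2, 12))  # 24000
```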

@agusmakmun
Author

agusmakmun commented May 15, 2024

How many workers does celery use? 12?

How do I check how many celery workers there are?

This is what I get so far:

kubectl --kubeconfig=kubeconfig exec --stdin --tty celeryworker-6dc48c545d-p879n -- /bin/bash

Defaulted container "celeryworker" out of: celeryworker, vault-agent, vault-agent-init (init)
django@celeryworker-6dc48c545d-p879n:/app$ source /entrypoint
PostgreSQL is available
django@celeryworker-6dc48c545d-p879n:/app$
django@celeryworker-6dc48c545d-p879n:/app$
django@celeryworker-6dc48c545d-p879n:/app$ celery -A config.celery_app status
->  celery@celeryworker-6dc48c545d-p879n: OK

1 node online.

and for more:

django@celeryworker-6dc48c545d-p879n:/app$ celery -A config.celery_app inspect stats
->  celery@celeryworker-6dc48c545d-p879n: OK
    {
        "broker": {
            "alternates": [],
            "connect_timeout": 4,
            "failover_strategy": "round-robin",
            "heartbeat": 120.0,
            "hostname": "redis-master.redis.svc.cluster.local.",
            "insist": false,
            "login_method": null,
            "port": 6379,
            "ssl": false,
            "transport": "redis",
            "transport_options": {},
            "uri_prefix": null,
            "userid": null,
            "virtual_host": "0"
        },
        "clock": "107892",
        "pid": 8,
        "pool": {
            "implementation": "celery.concurrency.prefork:TaskPool",
            "max-concurrency": 8,
            "max-tasks-per-child": "N/A",
            "processes": [
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23
            ],
            "put-guarded-by-semaphore": false,
            "timeouts": [
                300,
                600
            ],
            "writes": {
                "all": "0.67, 0.17, 0.17",
                "avg": "0.33",
                "inqueues": {
                    "active": 0,
                    "total": 8
                },
                "raw": "4, 1, 1",
                "strategy": "fair",
                "total": 6
            }
        },
        "prefetch_count": 32,
        "rusage": {
            "idrss": 0,
            "inblock": 40,
            "isrss": 0,
            "ixrss": 0,
            "majflt": 3,
            "maxrss": 166536,
            "minflt": 187490,
            "msgrcv": 0,
            "msgsnd": 0,
            "nivcsw": 833,
            "nsignals": 0,
            "nswap": 0,
            "nvcsw": 1597,
            "oublock": 56,
            "stime": 1.8547509999999998,
            "utime": 10.366434
        },
        "total": {
            "my_project.base.tasks.send_announcement_broadcast": 1,
            "my_project.webssh.tasks.cleanup_expired_durations": 1,
            "my_project.webssh.tasks.cleanup_machines": 3,
            "my_project.webssh.tasks.sync_cloud_drive_account": 1
        },
        "uptime": 533
    }

1 node online.

@foarsitter
Collaborator

"max-concurrency": 8
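With prefork, each worker process typically holds its own Django database connection while tasks run, so a rough (hedged) estimate for the celery side is pods times max-concurrency:

```python
# Rough estimate: one prefork worker process ~= one Django DB connection in use.

def celery_db_connections(pods, max_concurrency):
    return pods * max_concurrency

# One celeryworker pod with the reported max-concurrency of 8:
print(celery_db_connections(1, 8))  # 8
```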

@agusmakmun
Author

Is there anything I can do apart from increasing max_connections?

@foarsitter
Collaborator

You have at least these 4 options, as far as I can tell:

@agusmakmun
Author

Thank you so much @foarsitter, we will try your suggestions 👍

@foarsitter
Collaborator

If you resolve your issue by implementing a custom UvicornWorker we would like to receive a PR :)

@agusmakmun
Author

Sure, I'll be happy to do that. As a quick solution, we will first try increasing the Postgres max_connections to 1000.

@browniebroke browniebroke added the solved Wait a bit and close the issue if no response assuming it was solved. label May 15, 2024
@agusmakmun
Author

agusmakmun commented May 16, 2024

Finally my issue was solved. It happened because I was using both shared_task() and @celery_app.task(), which led to duplicate database connections. After rolling everything back to use @celery_app.task() only, the problem was resolved.

Note: I had changed it to @shared_task() while trying to follow users/tasks.py for non-executable tasks for the cronjob.

from celery import shared_task

from .models import User


@shared_task()
def get_users_count():
    """A pointless Celery task to demonstrate usage."""
    return User.objects.count()

Before

  1. myapp/tasks.py

from celery import shared_task
from config import celery_app


@shared_task()
def task_1():
    ...


@celery_app.task()
def task_2_that_called_for_cronjob():
    ...

After

from config import celery_app


@celery_app.task()
def task_1():
    ...


@celery_app.task()
def task_2_that_called_for_cronjob():
    ...

@github-actions github-actions bot removed the solved Wait a bit and close the issue if no response assuming it was solved. label May 16, 2024
@agusmakmun
Author

The issue appeared again.

@agusmakmun agusmakmun reopened this May 16, 2024
@foarsitter
Collaborator

shared_task is just another way of registering tasks for situations when there is no app available yet. So that can't be the solution.

@agusmakmun
Author

agusmakmun commented May 16, 2024

I found that some related issues are caused by upgrading to Django 4 combined with using sync_to_async, which I'm currently using to handle the async processing for the websocket.

  1. https://stackoverflow.com/questions/75848263/too-many-db-connections-django-4-x-asgi
  2. https://code.djangoproject.com/ticket/33497
  3. https://code.djangoproject.com/ticket/32889
  4. django/django@36fa071

This may be due to Django 4.0 having per-request contexts for the thread sensitivity of sync_to_async() — See #32889.
If so, that's kind-of a good thing, in that too many open resources is what you'd expect in async code, and up to now, we've not been hitting that, as we've essentially been running serially. - https://code.djangoproject.com/ticket/33497#comment:4

@foarsitter
Collaborator

So you put CONN_MAX_AGE to 0 and solved the issue?

@agusmakmun
Author

Nope. I changed it to 0, and the error still occurs.

@agusmakmun
Author

agusmakmun commented May 17, 2024

The issue was finally resolved after implementing django-db-connection-pool + setting CONN_MAX_AGE=0

  1. settings.py
DATABASES = {"default": env.db("DATABASE_URL")}
DATABASES["default"]["ATOMIC_REQUESTS"] = True
# https://github.com/altairbow/django-db-connection-pool?tab=readme-ov-file#postgresql
DATABASES["default"]["ENGINE"] = "dj_db_conn_pool.backends.postgresql"
DATABASES["default"]["POOL_OPTIONS"] = {
    "POOL_SIZE": 10,
    "MAX_OVERFLOW": 10,
    "RECYCLE": 24 * 60 * 60,
}

# Set to 0 to disable long(er) living connections
# https://docs.djangoproject.com/en/dev/ref/settings/#std-setting-CONN_MAX_AGE
# https://github.com/cookiecutter/cookiecutter-django/issues/5071
# "connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections"  # noqa: ERA001
DATABASES["default"]["CONN_MAX_AGE"] = env.int("CONN_MAX_AGE", default=0)
  2. requirements.txt
psycopg[pool]==3.1.18  # https://github.com/psycopg/psycopg
django-db-connection-pool[psycopg3]==1.2.5  # https://github.com/altairbow/django-db-connection-pool

Screenshot 2024-05-17 at 20 05 22

@browniebroke browniebroke added the solved Wait a bit and close the issue if no response assuming it was solved. label May 17, 2024
@agusmakmun
Author

The next error that occurs is ProgrammingError: the last operation didn't produce a result, even after changing the ORM call from .exists() to .count() > 0.

Related issue: psycopg/psycopg#417

@github-actions github-actions bot removed the solved Wait a bit and close the issue if no response assuming it was solved. label May 19, 2024
@browniebroke browniebroke added the solved Wait a bit and close the issue if no response assuming it was solved. label May 19, 2024
@foarsitter
Collaborator

@agusmakmun is CONN_MAX_AGE required when a pool is involved? It shouldn't have to be, I suppose.

@github-actions github-actions bot removed the solved Wait a bit and close the issue if no response assuming it was solved. label May 21, 2024
@agusmakmun
Author

For this case, it was resolved after downgrading Python from docker.io/python:3.12.3-slim-bookworm to python:3.10.10-slim-bullseye.

I also consulted ChatGPT with this query:

For a database server with 2 cores and 2GB of memory, how many POOL_SIZE, MAX_OVERFLOW and RECYCLE. I need numbers only.

DATABASES["default"]["ENGINE"] = "dj_db_conn_pool.backends.postgresql"
DATABASES["default"]["POOL_OPTIONS"] = {
    "POOL_SIZE": 3,
    "MAX_OVERFLOW": 3,
    "RECYCLE": 1800,  # 30 minutes
}

DATABASES["default"]["CONN_MAX_AGE"] = env.int("CONN_MAX_AGE", default=0)

No issues so far since deploying to production. I'm not sure which fix is the correct one: decreasing pool_size to a proper value, or downgrading Python.
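One way to sanity-check these POOL_OPTIONS: with dj-db-conn-pool, each OS process keeps its own pool, so a hedged upper bound on server-side connections is processes times (POOL_SIZE + MAX_OVERFLOW). A sketch with numbers assumed from this thread:

```python
# Upper bound on pooled DB connections across a deployment.
# Each process (uvicorn worker, celery prefork process, ...) owns its own pool.

def pool_ceiling(processes, pool_size, max_overflow):
    return processes * (pool_size + max_overflow)

# e.g. 24 web worker processes + 8 celery processes, POOL_SIZE=3, MAX_OVERFLOW=3:
print(pool_ceiling(24 + 8, 3, 3))  # 192 -- still above Postgres 12's default of 100
```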

@foarsitter
Collaborator

foarsitter commented May 21, 2024

You can certainly remove CONN_MAX_AGE since it defaults to 0.

@Saurav-Paul

The issue finally resolved after implementing the django-db-connection-pool + set the CONN_MAX_AGE=0

Thank you so much. I was suffering from this issue.

@agusmakmun
Author

Guys, FYI, I'm still somehow facing this issue.

@agusmakmun agusmakmun reopened this May 28, 2024
@foarsitter
Collaborator

And did you increase max_connections to 1000?
