[BUG] Can't delete runs with mlflow gc due to api timeout #12005

darrenjkt · 2024-05-15T03:55:54Z

Issues Policy acknowledgement

I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

Client: 2.12.2
Tracking server: 2.12.2

System information

Ubuntu 22.04 LTS
Python 3.10.12

Describe the problem

I am trying to permanently delete a single run with the gc command.

I am hosting the mlflow server with the command:

mlflow server --backend-store-uri postgresql://<user>:<pw>@<db-url>:5432 --artifacts-destination s3://some-bucket --serve-artifacts --host 0.0.0.0 --port 4242 --gunicorn-opts "--timeout 1800"

I have set up the environment variable: export MLFLOW_TRACKING_URI=http://0.0.0.0:4242

Tracking information

No response

Code to reproduce issue

mlflow gc --backend-store-uri postgresql://<user>:<pw>@<db-url>:5432 --run-ids 839eb62c783f45c893650fa5c44840cc

Stack trace

urllib3.exceptions.ResponseError: too many 500 error responses

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gr/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 946, in urlopen
    return self.urlopen(
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 946, in urlopen
    return self.urlopen(
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 946, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 936, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=4242): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 128, in http_request
    return _get_http_response_with_retries(
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/utils/request_utils.py", line 228, in _get_http_response_with_retries
    return session.request(method, url, allow_redirects=allow_redirects, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='0.0.0.0', port=4242): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gr/.local/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/cli.py", line 607, in gc
    artifact_repo.delete_artifacts()
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 113, in delete_artifacts
    resp = http_request(self._host_creds, endpoint, "DELETE", stream=True)
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 151, in http_request
    raise MlflowException(f"API request to {url} failed with exception {e}")
mlflow.exceptions.MlflowException: API request to http://0.0.0.0:4242/api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ failed with exception HTTPConnectionPool(host='0.0.0.0', port=4242): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ (Caused by ResponseError('too many 500 error responses'))

Other info / logs

No response

What component(s) does this bug affect?

What interface(s) does this bug affect?

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

What language(s) does this bug affect?

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

What integration(s) does this bug affect?

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations


harupy · 2024-05-15T06:06:56Z

@darrenjkt Can you run curl http://0.0.0.0:4242?

darrenjkt · 2024-05-15T06:36:44Z

Running curl http://0.0.0.0:4242 gives the following:

<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no"/><link rel="shortcut icon" href="./static-files/favicon.ico"/><meta name="theme-color" content="#000000"/><link rel="manifest" href="./static-files/manifest.json" crossorigin="use-credentials"/><title>MLflow</title><script defer="defer" src="static-files/static/js/main.648bec9b.js"></script><link href="static-files/static/css/main.183e956f.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div><div id="modal"></div></body></html>%

harupy · 2024-05-15T10:17:39Z

@darrenjkt To confirm, you're running both mlflow server and mlflow gc on the same machine, right?

darrenjkt · 2024-05-15T10:18:55Z

Yes that's correct

harupy · 2024-05-15T10:20:16Z

@darrenjkt can you log artifacts?

darrenjkt · 2024-05-15T10:39:56Z

Yes, I can. I usually log my artifacts from a separate mlflow client to the mlflow server, though. For the client, I usually specify the mlflow_tracking_uri in Python before I start a run and log artifacts/models.

harupy · 2024-05-15T11:54:08Z

@darrenjkt What happens if you set MLFLOW_TRACKING_URI=postgresql://<user>:<pw>@<db-url>:5432 and run mlflow gc?

darrenjkt · 2024-05-15T22:35:35Z

When I do that, I get this error:
mlflow.exceptions.MlflowException: The configured tracking uri scheme: 'postgresql' is invalid for use with the proxy mlflow-artifact scheme. The allowed tracking schemes are: {'https', 'http'}
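
The error above suggests that when artifacts are proxied through the tracking server, the client resolves artifact locations against the tracking URI, so that URI has to be an `http` or `https` address even though `mlflow gc` reads run metadata from the database. A minimal sketch of that kind of scheme check, using only the allowed schemes quoted in the error message; the function name `is_valid_proxy_tracking_uri` is hypothetical and this is not MLflow's actual implementation:

```python
from urllib.parse import urlparse

# Schemes quoted in the error message above.
ALLOWED_TRACKING_SCHEMES = {"http", "https"}


def is_valid_proxy_tracking_uri(tracking_uri: str) -> bool:
    """Return True if the URI's scheme is usable for proxied artifact access.

    Hypothetical helper for illustration only.
    """
    return urlparse(tracking_uri).scheme in ALLOWED_TRACKING_SCHEMES


if __name__ == "__main__":
    # Placeholder URIs: the HTTP server address from this issue and a
    # database-style URI with made-up credentials.
    for uri in ("http://0.0.0.0:4242", "postgresql://user:pw@db-host:5432"):
        print(uri, is_valid_proxy_tracking_uri(uri))
```

Under this reading, MLFLOW_TRACKING_URI should keep pointing at the HTTP server while the database URI is passed only through --backend-store-uri.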

harupy · 2024-05-16T00:42:39Z

@darrenjkt Thanks for trying. Do you see any logs in the tracking server? When a 500 (internal server error) occurs, the tracking server usually prints a traceback or an error message.

darrenjkt · 2024-05-21T23:50:17Z

Ah, thanks. I found the following error in the tracking server logs, related to permissions on my AWS user:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied

However, I've set up a user with AmazonRDSFullAccess and AmazonS3FullAccess, yet I still get this error. The user has DeleteObject permissions for S3.

darrenjkt · 2024-05-22T00:13:31Z

Solved it! I was exporting the AWS keys in the local environment but starting the mlflow server with a different set of AWS keys. Starting the mlflow server using the AWS keys with correct permissions resolved the gc deletion issue.

Thanks for your help!
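
For anyone hitting the same mismatch, one way to compare the credentials seen by the shell that starts `mlflow server` with those seen by the shell that runs `mlflow gc` is to print a masked summary of the AWS environment variables in each. This is a rough sketch, not part of MLflow; the helper name `aws_credential_summary` is hypothetical, and it only inspects environment variables, so credentials coming from `~/.aws/credentials` or an instance role would not show up here.

```python
import os
from typing import Mapping, Optional


def aws_credential_summary(env: Optional[Mapping[str, str]] = None) -> dict:
    """Summarize AWS credential environment variables without exposing secrets.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key_id = env.get("AWS_ACCESS_KEY_ID")
    return {
        # Profile names are not secret, so show them in full.
        "AWS_PROFILE": env.get("AWS_PROFILE", "<unset>"),
        # Show only the last 4 characters of the access key ID.
        "AWS_ACCESS_KEY_ID": "<unset>" if key_id is None else "..." + key_id[-4:],
        # Only report whether a session token is present.
        "AWS_SESSION_TOKEN": "<set>" if "AWS_SESSION_TOKEN" in env else "<unset>",
    }


if __name__ == "__main__":
    for name, value in aws_credential_summary().items():
        print(f"{name}={value}")
```

Running the script in both shells and comparing the output should reveal whether the server process was started with a different key than the one that has DeleteObject permission.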

darrenjkt added the bug label May 15, 2024

github-actions bot added the area/artifacts and area/server-infra labels May 15, 2024

darrenjkt closed this as completed May 22, 2024
