[BUG] Can't delete runs with mlflow gc due to api timeout #12005

Closed
3 of 23 tasks
darrenjkt opened this issue May 15, 2024 · 11 comments
Labels
area/artifacts Artifact stores and artifact logging area/server-infra MLflow Tracking server backend bug Something isn't working

Comments

@darrenjkt

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Client: 2.12.2
  • Tracking server: 2.12.2

System information

  • Ubuntu 22.04 LTS
  • Python 3.10.12

Describe the problem

I am trying to permanently delete a single run with the gc command.

I am hosting the mlflow server with the command:

mlflow server --backend-store-uri postgresql://<user>:<pw>@<db-url>:5432 --artifacts-destination s3://some-bucket --serve-artifacts --host 0.0.0.0 --port 4242 --gunicorn-opts "--timeout 1800"

I have set up the environment variable: export MLFLOW_TRACKING_URI=http://0.0.0.0:4242

Tracking information

No response

Code to reproduce issue

mlflow gc --backend-store-uri postgresql://<user>:<pw>@<db-url>:5432 --run-ids 839eb62c783f45c893650fa5c44840cc

Stack trace

urllib3.exceptions.ResponseError: too many 500 error responses

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gr/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 946, in urlopen
    return self.urlopen(
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 946, in urlopen
    return self.urlopen(
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 946, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 936, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/home/gr/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=4242): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 128, in http_request
    return _get_http_response_with_retries(
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/utils/request_utils.py", line 228, in _get_http_response_with_retries
    return session.request(method, url, allow_redirects=allow_redirects, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='0.0.0.0', port=4242): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gr/.local/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gr/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/cli.py", line 607, in gc
    artifact_repo.delete_artifacts()
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 113, in delete_artifacts
    resp = http_request(self._host_creds, endpoint, "DELETE", stream=True)
  File "/home/gr/.local/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 151, in http_request
    raise MlflowException(f"API request to {url} failed with exception {e}")
mlflow.exceptions.MlflowException: API request to http://0.0.0.0:4242/api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ failed with exception HTTPConnectionPool(host='0.0.0.0', port=4242): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/6/7dabad9f7c5b48c18059d84996bd1ac9/artifacts/ (Caused by ResponseError('too many 500 error responses'))

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
@darrenjkt darrenjkt added the bug Something isn't working label May 15, 2024
@github-actions github-actions bot added area/artifacts Artifact stores and artifact logging area/server-infra MLflow Tracking server backend labels May 15, 2024
@harupy
Member

harupy commented May 15, 2024

@darrenjkt Can you run curl http://0.0.0.0:4242?

@darrenjkt
Author

darrenjkt commented May 15, 2024

Running curl http://0.0.0.0:4242 gives the following:

<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no"/><link rel="shortcut icon" href="./static-files/favicon.ico"/><meta name="theme-color" content="#000000"/><link rel="manifest" href="./static-files/manifest.json" crossorigin="use-credentials"/><title>MLflow</title><script defer="defer" src="static-files/static/js/main.648bec9b.js"></script><link href="static-files/static/css/main.183e956f.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div><div id="modal"></div></body></html>

@harupy
Member

harupy commented May 15, 2024

@darrenjkt To confirm, you're running both mlflow server and mlflow gc on the same machine, right?

@darrenjkt
Author

Yes, that's correct.

@harupy
Member

harupy commented May 15, 2024

@darrenjkt can you log artifacts?

@darrenjkt
Author

darrenjkt commented May 15, 2024

Yes, I can. I usually log my artifacts from a separate MLflow client to the MLflow server, though. On the client, I usually set the tracking URI in Python before I start a run and log artifacts or a model.

@harupy
Member

harupy commented May 15, 2024

@darrenjkt What happens if you set MLFLOW_TRACKING_URI=postgresql://<user>:<pw>@<db-url>:5432 and run mlflow gc?

@darrenjkt
Author

When I do that, I get this error:
mlflow.exceptions.MlflowException: The configured tracking uri scheme: 'postgresql' is invalid for use with the proxy mlflow-artifact scheme. The allowed tracking schemes are: {'https', 'http'}
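
For context: the stack trace earlier in the thread shows that gc deletes artifacts by sending a DELETE request to the tracking server's mlflow-artifacts endpoint, so MLFLOW_TRACKING_URI has to stay an HTTP(S) URL while the database URI goes in --backend-store-uri. The following is a minimal sketch of the kind of scheme check the message describes. It is not MLflow's implementation; the function name and constant are hypothetical, and only the allowed schemes come from the error text.

```python
from urllib.parse import urlparse

# Schemes named in the error message above; everything else here is hypothetical.
ALLOWED_PROXY_SCHEMES = {"http", "https"}


def check_tracking_uri_for_proxied_artifacts(tracking_uri: str) -> str:
    """Return the URI scheme if it is usable with proxied artifact access.

    Illustrative helper only, not part of MLflow.
    """
    scheme = urlparse(tracking_uri).scheme
    if scheme not in ALLOWED_PROXY_SCHEMES:
        raise ValueError(
            f"tracking URI scheme {scheme!r} cannot be used with proxied artifacts; "
            f"expected one of {sorted(ALLOWED_PROXY_SCHEMES)}"
        )
    return scheme
```

Under this sketch, http://0.0.0.0:4242 passes and a postgresql:// URI is rejected, which matches the behaviour reported above.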

@harupy
Member

harupy commented May 16, 2024

@darrenjkt Thanks for trying. Do you see any logs in the tracking server? When a 500 (internal server error) occurs, the tracking server usually prints a traceback or error message.

@darrenjkt
Author

darrenjkt commented May 21, 2024

Ah, thanks. I found the following error in the tracking server logs, related to permissions on my AWS user:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied

However, I've set up a user with AmazonRDSFullAccess and AmazonS3FullAccess, yet I still get this error. The user has DeleteObject permissions for S3.

@darrenjkt
Author

darrenjkt commented May 22, 2024

Solved it! I was exporting the AWS keys in the local environment but starting the mlflow server with a different set of AWS keys. Starting the mlflow server using the AWS keys with correct permissions resolved the gc deletion issue.

Thanks for your help!
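
For anyone hitting the same AccessDenied error: because the server performs the S3 delete, the keys that matter are the ones in the server process's environment, not the ones exported in the shell that runs mlflow gc. Below is a rough sketch of one way to compare the two on Linux by reading /proc/<pid>/environ. All function names are hypothetical, and it assumes the keys are passed through the AWS_ACCESS_KEY_ID environment variable; credentials coming from ~/.aws/credentials or an instance role are not covered.

```python
from typing import Mapping, Optional

AWS_KEY_VAR = "AWS_ACCESS_KEY_ID"


def parse_proc_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def mask(value: Optional[str]) -> str:
    """Show only the last four characters so keys are not printed in full."""
    if not value:
        return "<unset>"
    return "*" * max(len(value) - 4, 0) + value[-4:]


def same_access_key(shell_env: Mapping[str, str], server_env: Mapping[str, str]) -> bool:
    """True when both environments export the same, non-empty access key ID."""
    shell_key = shell_env.get(AWS_KEY_VAR)
    return bool(shell_key) and shell_key == server_env.get(AWS_KEY_VAR)


def read_process_env(pid: int) -> dict:
    """Read a process's initial environment (Linux only; needs same user or root)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_proc_environ(f.read())
```

Calling same_access_key(os.environ, read_process_env(server_pid)), where server_pid is the PID of the running mlflow server, would show whether the shell and the server were started with the same access key ID. mask() is there so the key can be printed for comparison without exposing it in full.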
