Document inactivate/freeze/decommission procedures #2121

Open
bmerry opened this issue Jul 20, 2020 · 2 comments

Comments

@bmerry

bmerry commented Jul 20, 2020

I may just be using this incorrectly, but I can't find any documentation that explains the differences between inactivating, freezing and decommissioning a host.

If I go through the following steps:

  1. Mark a host inactive (via POST /api/inactive).
  2. Stop the mesos-agent on it.
  3. Start a new instance of mesos-agent on it (I'm using a Docker container to run Mesos, so I think it gets a new slave ID, but I'm not 100% sure).
  4. Mark the host active again (via DELETE /api/inactive).

Then the slave remains in the decommissioned state and won't run any tasks.
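For reference, steps 1 and 4 above could be scripted roughly as follows. This is only a sketch: the `/api/inactive` path and the POST/DELETE methods come from the steps above, but passing the hostname as a `host` query parameter is an assumption, and the scheduler URL and hostname are hypothetical placeholders. The function only builds the request; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def inactive_request(base_url: str, host: str, inactive: bool) -> Request:
    """Build (but do not send) a request that marks a host inactive or active again."""
    # Assumption: the hostname travels as a `host` query parameter.
    query = urlencode({"host": host})
    url = f"{base_url.rstrip('/')}/api/inactive?{query}"
    # POST marks the host inactive (step 1); DELETE marks it active again (step 4).
    return Request(url, method="POST" if inactive else "DELETE")


# Hypothetical scheduler URL and hostname, for illustration only.
req = inactive_request("http://scheduler.example.com", "agent1.example.com", inactive=True)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it.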

My goal is to be able to prevent new tasks running on a slave (so that once existing tasks die we can reboot/do maintenance on it - we use only on-demand tasks with finite lifetime), and later allow tasks to run on it again (possibly after doing maintenance on it). I've been using "inactive" rather than "freeze" because the former API works on hostnames, which means it can be set even if the mesos-agent isn't running at the time. But let me know what you advise for that.

@ssalinas
Member

So, inactive was something we created to deal with some EC2 impairment cases. We would frequently see a host become impaired, come back, become impaired again, and keep cycling like that. The inactive marker was meant to ensure that anything coming in with that hostname is automatically marked as decommissioned, to keep tasks from being launched on an impaired/cycling host. Reactivating here essentially just removes the host from a 'blocked' list of hosts.

Other definitions:

  • Freeze - don't launch new tasks on a host, but leave any that are already running alone
  • Decommission - don't launch new tasks on a host, and also move any that are currently running on the host elsewhere
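The two definitions above can be restated as a pair of predicates. This is a toy model for illustration only; the enum and function names are made up here and are not part of any real API.

```python
from enum import Enum


class HostState(Enum):
    ACTIVE = "active"
    FROZEN = "frozen"
    DECOMMISSIONED = "decommissioned"


def accepts_new_tasks(state: HostState) -> bool:
    # Both freeze and decommission stop new tasks from being launched.
    return state is HostState.ACTIVE


def keeps_running_tasks(state: HostState) -> bool:
    # Only decommission also moves already-running tasks elsewhere.
    return state is not HostState.DECOMMISSIONED


for state in HostState:
    print(state.value, accepts_new_tasks(state), keeps_running_tasks(state))
```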

If you just use decommission, since it is done by slave ID, the new agent coming into the cluster with a new ID will be in the active state. To clean up any that are in that inactive + decommissioned state you mentioned, you can remove them from the inactive list first, then 'reactivate' them in the UI. We can update the docs to make this clearer.
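A toy simulation of the behaviour described in this comment may make the ordering clearer. It assumes, based only on the explanation here, that the inactive list is keyed by hostname, that agent state is keyed by slave ID, and that removing a host from the inactive list does not change the state of agents already registered. All class, method, and ID names are hypothetical.

```python
class ToyCluster:
    """Toy model: hostname blocklist plus per-agent (slave ID) state."""

    def __init__(self):
        self.inactive_hosts = set()  # hostnames on the 'blocked' list
        self.agent_state = {}        # slave ID -> "active" / "decommissioned"

    def mark_inactive(self, host):
        self.inactive_hosts.add(host)

    def mark_active(self, host):
        # Only removes the host from the blocklist; existing agents are untouched.
        self.inactive_hosts.discard(host)

    def register_agent(self, slave_id, host):
        # An agent arriving from a blocked hostname starts out decommissioned.
        blocked = host in self.inactive_hosts
        self.agent_state[slave_id] = "decommissioned" if blocked else "active"

    def reactivate_agent(self, slave_id):
        self.agent_state[slave_id] = "active"


cluster = ToyCluster()
cluster.mark_inactive("agent1.example.com")
cluster.register_agent("new-slave-id", "agent1.example.com")  # restarted agent
cluster.mark_active("agent1.example.com")
print(cluster.agent_state["new-slave-id"])  # still decommissioned
cluster.reactivate_agent("new-slave-id")    # the second cleanup step
print(cluster.agent_state["new-slave-id"])  # now active
```

In this model the restarted agent stays decommissioned after the host is marked active, which matches the symptom reported in the issue, and only the explicit reactivation clears it.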

@bmerry bmerry changed the title from "Re-activating a host doesn't re-enable the slave" to "Document inactivate/freeze/decommission procedures" Jul 20, 2020
@bmerry
Author

bmerry commented Jul 20, 2020

Thanks for the quick response. I've updated the title to indicate that docs should be improved, rather than anything necessarily changed.

> To clean up any that are in that inactive + decommissioned state you mentioned, can remove them from inactive list first, then 'reactivate' in the UI.

I'll give that a try (with the API, since I'm writing a command-line tool).
