Document inactivate/freeze/decommission procedures #2121

bmerry · 2020-07-20T14:21:30Z

This may just be me not using things correctly, but I can't find any documentation that explains the differences between inactivating, freezing and decommissioning a host.

If I go through the following steps:

1. Mark a host inactive (via POST /api/inactive).
2. Stop the mesos-agent on it.
3. Start a new instance of mesos-agent on it (I'm using a Docker container to run Mesos, so I think it gets a new slave ID, but I'm not 100% sure).
4. Mark the host active again (via DELETE /api/inactive).

Then the slave remains in the decommissioned state and won't run any tasks.
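
For reference, the two inactive calls in the steps above could be built roughly like this. This is a minimal sketch, not a confirmed client: the base URL and hostname are placeholders, and passing the hostname as a `host` query parameter is an assumption. Only `POST /api/inactive` and `DELETE /api/inactive` come from the steps themselves.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; replace with the address of your own deployment.
BASE_URL = "http://singularity.example.com/singularity"


def inactive_request(host, active):
    """Build (but do not send) the request that marks a host inactive or active again.

    Assumption: the hostname is passed as a `host` query parameter.
    """
    # POST adds the host to the inactive list, DELETE removes it again.
    method = "DELETE" if active else "POST"
    url = f"{BASE_URL}/api/inactive?{urlencode({'host': host})}"
    return Request(url, method=method)


# Placeholder hostname, for illustration only.
mark = inactive_request("agent1.example.com", active=False)
unmark = inactive_request("agent1.example.com", active=True)
print(mark.get_method(), mark.full_url)
print(unmark.get_method(), unmark.full_url)
```

Passing either object to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be read and run without a live cluster.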

My goal is to be able to prevent new tasks running on a slave (so that once existing tasks die we can reboot/do maintenance on it - we use only on-demand tasks with finite lifetime), and later allow tasks to run on it again (possibly after doing maintenance on it). I've been using "inactive" rather than "freeze" because the former API works on hostnames, which means it can be set even if the mesos-agent isn't running at the time. But let me know what you advise for that.

ssalinas · 2020-07-20T14:33:28Z

So, inactive was something we created to deal with some EC2 impairment cases. We would frequently have cases where a host went impaired, came back, went impaired again, and kept cycling like that. The inactive marker was meant to make it so that anything coming in with that host name is automatically marked as decommissioned, to save tasks from being launched on an impaired/cycling host like that. The reactivate here essentially just removes the host from a 'blocked' list of hosts.

Other definitions:

Freeze - don't launch new tasks on a host, but leave any that are already running alone
Decommission - don't launch new tasks on a host, and also move any that are currently running on the host elsewhere
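
The definitions in this comment can be summarised as a small model. This is only an illustrative sketch of the behaviour described here, not Singularity's actual code; the enum, function names, and hostnames are all made up for the example.

```python
from enum import Enum


class AgentState(Enum):
    """Illustrative agent states, named after the terms used in this thread."""
    ACTIVE = "active"
    FROZEN = "frozen"
    DECOMMISSIONED = "decommissioned"


def accepts_new_tasks(state):
    # Both freeze and decommission stop new tasks from launching.
    return state is AgentState.ACTIVE


def migrates_running_tasks(state):
    # Only decommission moves already-running tasks elsewhere; freeze leaves them alone.
    return state is AgentState.DECOMMISSIONED


def state_on_registration(hostname, inactive_hosts):
    """State given to an agent that joins, based on the inactive host list.

    The inactive marker is keyed by host name, so a new agent ID on an
    inactive host is still decommissioned automatically.
    """
    if hostname in inactive_hosts:
        return AgentState.DECOMMISSIONED
    return AgentState.ACTIVE


inactive_hosts = {"agent1.example.com"}  # placeholder hostname
print(state_on_registration("agent1.example.com", inactive_hosts))
print(state_on_registration("agent2.example.com", inactive_hosts))
```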

If you just use decommission, since it is done by slave ID, a new agent coming into the cluster with a new ID will be in the active state. To clean up any that are in that inactive + decommissioned state you mentioned, you can remove them from the inactive list first, then 'reactivate' them in the UI. We can update the docs to make this clearer.
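
For anyone scripting this rather than using the UI, the ordering matters: the host has to leave the inactive list before the agent is reactivated. A rough sketch of that sequence follows. The `activate` path and the `host` query parameter are assumptions that should be checked against the API docs, and the hostname and agent ID are placeholders.

```python
from urllib.parse import quote


def cleanup_plan(host, agent_id):
    """Return the ordered (method, path) steps to recover an inactive + decommissioned agent.

    Assumptions: the inactive endpoint takes a `host` query parameter, and
    agents are reactivated through a per-agent `activate` path.
    """
    return [
        # Step 1: remove the host from the inactive list.
        ("DELETE", f"/api/inactive?host={quote(host, safe='')}"),
        # Step 2: reactivate the agent that is still marked decommissioned.
        ("POST", f"/api/slaves/slave/{quote(agent_id, safe='')}/activate"),
    ]


# Placeholder hostname and agent ID, for illustration only.
plan = cleanup_plan("agent1.example.com", "abc-S0")
for method, path in plan:
    print(method, path)
```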

bmerry · 2020-07-20T17:45:49Z

Thanks for the quick response. I've updated the title to indicate that docs should be improved, rather than anything necessarily changed.

> To clean up any that are in that inactive + decommissioned state you mentioned, can remove them from inactive list first, then 'reactivate' in the UI.

I'll give that a try (with the API, since I'm writing a command-line tool).

bmerry changed the title ~~Re-activating a host doesn't re-enable the slave~~ Document inactivate/freeze/decommission procedures Jul 20, 2020
