provisioner needs to be moved to early in each node's run list #1548

aspiers · 2012-09-05T10:52:44Z

This issue has been seen while testing reinstalls:

The provisioner barclamp has priority 1060, so its recipe runs last, or pretty much last, meaning that any recipe that fails before it will prevent the provisioner recipe from running. On a newly reinstalled node, this means there'll be no ssh authorized_keys file, leaving you in a position where you can't actually ssh to the node to figure out what's going on.
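To make the failure mode concrete, here is a minimal sketch in Python of a priority-sorted run list that stops at the first failing role. It is an illustration only, not Crowbar's actual code: the role names and every priority except the provisioner's 1060 are assumptions made up for the example.

```python
# Hypothetical priorities; only the provisioner's 1060 comes from this issue.
ROLE_PRIORITIES = {
    "deployer-client": 10,
    "swift-proxy": 80,
    "provisioner": 1060,
}


def build_run_list(priorities):
    """Return role names sorted by ascending priority (lowest runs first)."""
    return sorted(priorities, key=lambda name: (priorities[name], name))


def converge(run_list, failing=()):
    """Run roles in order and stop at the first failure, as a failed run would."""
    completed = []
    for role in run_list:
        if role in failing:
            break  # everything after the failing role never runs
        completed.append(role)
    return completed


run_list = build_run_list(ROLE_PRIORITIES)
completed = converge(run_list, failing={"swift-proxy"})
print("run list: ", run_list)
print("completed:", completed)
```

With the provisioner sorted last, a failure in any earlier role means it never runs, which is why the authorized_keys file never appears.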

1060 was clearly a very deliberate choice of number, so there must be other reasons for wanting it to be last. @galthaus can you elaborate please?

Possible workaround for users: use the Chef web UI to re-order that node's role's run list to put provisioner early. Wait for the periodic chef-client run on the client.
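As a rough illustration of what the reordered run list would look like, here is a small Python sketch. The `move_early` helper and the role names in the example list are hypothetical; the actual change is still made by hand in the Chef web UI.

```python
def move_early(run_list, item, position=0):
    """Return a copy of run_list with item moved to the given position."""
    if item not in run_list:
        raise ValueError(f"{item!r} is not in the run list")
    reordered = [entry for entry in run_list if entry != item]
    reordered.insert(position, item)
    return reordered


# Hypothetical run list for a node; the role names are assumptions.
before = [
    "role[deployer-client]",
    "role[swift-proxy]",
    "role[provisioner-base]",
]
after = move_early(before, "role[provisioner-base]", position=1)
print(after)
```

Placing the provisioner right after the first role, rather than at the very front, is only one possible choice for the sketch.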

Downstream bug: https://bugzilla.novell.com/show_bug.cgi?id=778764


galthaus · 2012-09-12T16:53:02Z

So ... You see ...

The provisioner has been an evolving piece of code that manages node state to bring the node up to "ready". This allows the rest of the crowbar system to apply/transition other proposals onto those nodes.

There are currently chef-client calls in the provisioner transition function that assume that the admin node and the provisioner node are the same. This chef-client call runs on the admin node and fixes the boot mechanism to move the node forward. Originally, this was DHCP config file changes and a DHCP restart. Over time this has become pxelinux config file changes without DHCP server restarts. These chef-client calls on the admin node were intended to be the "last" thing the transition steps did before returning control to the node to control sequencing. Hence, the 1060 number in the order.

This is confounded by a different set of changes. Originally, there was one order to rule all things. This barclamp order was used for transition order, installation order, chef run list order, and whatever other order we needed. We have since split it into separate orders for those three items. In many places, we didn't actively revisit the ordering to make sure each one was still correct.

In some regard, your question is: why shouldn't the transition order be 1060 and the chef order be 10, or just after the deployer-client role? Answer: with the current code base, nothing prevents that, and it should probably be changed to work that way.
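A minimal sketch of that split, in Python, with each barclamp carrying one number per ordering. The `Barclamp` class, its field names, and all values other than the provisioner's transition order of 1060 are assumptions for illustration, not Crowbar's real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barclamp:
    name: str
    transition_order: int  # order in which transition functions are called
    chef_order: int        # position of the barclamp's roles in the run list


# Hypothetical values; only the provisioner's 1060 comes from this thread.
BARCLAMPS = [
    Barclamp("swift", transition_order=80, chef_order=80),
    Barclamp("provisioner", transition_order=1060, chef_order=15),
    Barclamp("deployer", transition_order=10, chef_order=10),
]


def transition_sequence(barclamps):
    """Names in the order their transition functions would be called."""
    return [b.name for b in sorted(barclamps, key=lambda b: b.transition_order)]


def chef_run_list(barclamps):
    """Names in the order their roles would appear in a node's run list."""
    return [b.name for b in sorted(barclamps, key=lambda b: b.chef_order)]


print("transition:", transition_sequence(BARCLAMPS))
print("chef:      ", chef_run_list(BARCLAMPS))
```

In this sketch the provisioner stays last for transitions but lands just after the deployer in the run list, so a later recipe failure no longer blocks it.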

We don't see this ssh problem in our testing, but we may not be using the same testing methodology or rigor in this particular area. I am quite certain it could exist and could be hit.

tserong · 2012-09-13T03:12:35Z

I saw the ssh problem when doing a reinstall of a swift-proxy node. swift-proxy was executing before swift-compute, so it failed (ring files didn't exist yet), which meant the provisioner never ran, so authorized keys were never set up. Since fixing the swift-proxy ordering problem (which was the same fix as for ceph, as you mentioned on the mailing list), this test case won't cause the ssh problem anymore. The potential is still there, though, at least during initial deployment and reinstall, for a failed recipe to result in no ssh authorized_keys file. Hooray for edge cases :-)
