provisioner needs to be moved to early in each node's run list #1548

Open
aspiers opened this issue Sep 5, 2012 · 2 comments

Comments

@aspiers
Member

aspiers commented Sep 5, 2012

This issue has been seen while testing reinstalls:

The provisioner barclamp has priority 1060, so its recipe runs last, or pretty much last, meaning that any recipe that fails before it will prevent the provisioner recipe from running. On a newly-reinstalled node, this means there'll be no ssh authorized_keys file, leaving you in a position where you can't actually ssh to the node to figure out what's going on.
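
To make the failure mode concrete, here is a minimal sketch in Python (not Crowbar or Chef code) of a run that executes recipes in priority order and aborts on the first failure. Only the 1060 value comes from this issue; the `Recipe` type, the other recipe names, and their priorities are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Recipe:
    name: str
    priority: int  # lower numbers run earlier (assumed convention)
    action: Callable[[], None]


def run_in_priority_order(recipes: List[Recipe]) -> List[str]:
    """Run recipes lowest priority first and stop at the first failure,
    the way an aborted chef-client run skips everything after the error."""
    completed = []
    for recipe in sorted(recipes, key=lambda r: r.priority):
        try:
            recipe.action()
        except Exception:
            break  # everything later in the run list never executes
        completed.append(recipe.name)
    return completed


def ok() -> None:
    pass


def fail() -> None:
    raise RuntimeError("recipe failed")


# Hypothetical run lists: same recipes, provisioner late vs. early.
late = [
    Recipe("deployer-client", 10, ok),
    Recipe("some-failing-recipe", 80, fail),
    Recipe("provisioner", 1060, ok),
]
early = [
    Recipe("deployer-client", 10, ok),
    Recipe("provisioner", 11, ok),
    Recipe("some-failing-recipe", 80, fail),
]

print(run_in_priority_order(late))   # provisioner is missing from the result
print(run_in_priority_order(early))  # provisioner ran before the failure
```

With the provisioner at 1060, the failing recipe cuts the run short before the authorized keys are written; with it placed early, the keys are in place even though the run still fails.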

1060 was clearly a very deliberate choice of number, so there must be other reasons for wanting it to be last. @galthaus can you elaborate please?

Possible workaround for users: use the Chef web UI to re-order that node's role's run list to put provisioner early. Wait for the periodic chef-client run on the client.
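
The reordering itself amounts to moving one entry of the run list forward. A rough sketch in Python, assuming the run list is a plain list of strings; the role names used here are placeholders and may not match the names on a real node, and `move_early` is a hypothetical helper, not part of Chef or Crowbar.

```python
from typing import List, Optional


def move_early(run_list: List[str], item: str, after: Optional[str] = None) -> List[str]:
    """Return a copy of run_list with `item` moved to the front,
    or to just after `after` when that entry is present."""
    if item not in run_list:
        return list(run_list)
    reordered = [entry for entry in run_list if entry != item]
    index = reordered.index(after) + 1 if after in reordered else 0
    reordered.insert(index, item)
    return reordered


# Placeholder role names, for illustration only.
run_list = [
    "role[deployer-client]",
    "role[swift-proxy]",
    "role[provisioner-base]",
]

print(move_early(run_list, "role[provisioner-base]", after="role[deployer-client]"))
```

The original list is left untouched, so the result can be compared against it before anything is saved back through the web UI.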

Downstream bug: https://bugzilla.novell.com/show_bug.cgi?id=778764

@galthaus
Contributor

So ... You see ...

The provisioner has been an evolving piece of code that manages node state to bring the node up to "ready". This allows the rest of the crowbar system to apply/transition other proposals onto those nodes.

There are currently chef-client calls in the provisioner transition function that assume that the admin node and the provisioner node are the same. This chef-client call runs on the admin node and fixes the boot mechanism to move the node forward. Originally, this was DHCP config file changes and a DHCP restart. Over time this has become pxelinux config file changes without DHCP server restarts. These chef-client calls on the admin node were intended to be the "last" thing the transition steps did before returning control to the node to control sequencing. Hence, the 1060 number in the order.

This is confounded by a different set of changes. Originally, there was one order to rule all things. This barclamp order was used for transition order, installation order, chef run list order, and whatever order we needed. We have since split this order into all three of those items. In many places, we didn't actively revisit the ordering to make this correct.

In some regard, your question is: why shouldn't the transition order be 1060 and the chef order be 10, or just after the deployer-client role? Answer: with the current code base, there is no reason why not, and it should probably be changed to be that way.
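
As a sketch of what that split looks like, the snippet below keeps two independent order values per barclamp and sorts by each one separately. This is illustrative Python, not the Crowbar implementation: the `Barclamp` type, the field names `transition_order` and `chef_order`, and every number other than 1060 and 10 are assumptions.

```python
from typing import List, NamedTuple


class Barclamp(NamedTuple):
    name: str
    transition_order: int  # order in which transition functions are called
    chef_order: int        # position of the barclamp's roles in the node run list


# Hypothetical values: provisioner stays last for transitions,
# but moves to just after the deployer in the chef run list.
barclamps = [
    Barclamp("provisioner", transition_order=1060, chef_order=11),
    Barclamp("deployer", transition_order=10, chef_order=10),
    Barclamp("swift", transition_order=80, chef_order=80),
]


def names_sorted_by(items: List[Barclamp], field: str) -> List[str]:
    """Return barclamp names sorted by the given order field."""
    return [b.name for b in sorted(items, key=lambda b: getattr(b, field))]


print(names_sorted_by(barclamps, "transition_order"))
print(names_sorted_by(barclamps, "chef_order"))
```

Sorting by the transition field keeps the provisioner's admin-node chef-client call as the final transition step, while sorting by the chef field puts its recipe ahead of anything likely to fail on the node.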

We don't see this ssh problem in our testing, but we may not be using the same testing methodology or rigor in this particular area. I am quite certain it could exist and could be hit.

@tserong
Contributor

tserong commented Sep 13, 2012

I saw the ssh problem when doing a reinstall of a swift-proxy node. swift-proxy was executing before swift-compute, so it failed (the ring files didn't exist yet), which meant the provisioner never ran, so authorized keys were never set up. Since the swift-proxy ordering problem was fixed (the same fix as for ceph, as you mentioned on the mailing list), this test case won't cause the ssh problem anymore. But the potential is still there, at least during initial deployment and reinstall, for a failed recipe to result in no ssh authorized keys file. Hooray for edge cases :-)
