This repository has been archived by the owner on Dec 1, 2022. It is now read-only.

cluster-state-service initial reconcile slow #193

Open
dominicbarnes opened this issue May 8, 2017 · 5 comments

Comments

@dominicbarnes

I'm working on deploying cluster-state-service on ECS, but I'm running into some problems. The process starts and immediately begins "Reconciler loading tasks and instances", but before long ECS kills the task because it failed to pass its health check. It turns out that the initial reconcile blocks the web server from starting, meaning that until the reconciliation finishes, nothing else runs.

If I simply raise the thresholds here, I find that it eventually exceeds the context deadline if I let it run long enough. There is no other error or any indication of what is slow and keeping the service from starting. We have a number of clusters with quite a few tasks, so it's possible this just takes time, but there are no logs indicating that any work is being done, so it's hard to even gauge progress.
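For anyone else hitting this in the meantime: loosening the ALB health check in Terraform buys the reconcile more time before ECS kills the task. A sketch of what I mean (the resource name, port, and health-check path are assumptions about a given setup, not documented values of the service):

```hcl
# Roughly 30s interval * 10 failures = ~5 minutes of grace before the
# target is marked unhealthy and ECS replaces the task.
resource "aws_alb_target_group" "css" {
  name     = "cluster-state-service"
  port     = 3000
  protocol = "HTTP"
  vpc_id   = "${var.vpc_id}"

  health_check {
    path                = "/v1/ping"   # assumed health endpoint
    interval            = 30
    timeout             = 10
    healthy_threshold   = 2
    unhealthy_threshold = 10
  }
}
```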

I'm using Terraform to deploy things similarly to the "Local Deployment" instructions, so I've actually been porting things from CloudFormation to Terraform along the way. It's entirely possible I've gotten some things configured incorrectly.

I've tried adding CSS_LOG_LEVEL=debug to get more logging, but that hasn't yielded any more logs at all.

@kylbarnes
Contributor

@dominicbarnes I think this may be related to #192. There seems to be a bug in the cluster-state-service code related to querying Etcd and getting a context.DeadlineExceeded error on the reconciler. I'll try to duplicate this on my side and post the results.

@kylbarnes
Contributor

@dominicbarnes Please see my latest comment in #192. This issue is related to the amount of memory allocated to the Etcd container. I was able to reduce my Etcd server startup time from 90s to 1s by increasing the memory available to the Etcd container. Please give this a shot and let me know if it doesn't work for you.
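For reference, in an ECS task definition the memory limit is set per container, so raising it for Etcd looks roughly like this (the image tag and memory value are illustrative, not a recommendation):

```json
{
  "containerDefinitions": [
    {
      "name": "etcd",
      "image": "quay.io/coreos/etcd:v3.1.5",
      "memory": 1024
    }
  ]
}
```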

@dominicbarnes
Author

@kylbarnes I'll try that as well, but it looks like our situation was caused by a networking misconfiguration around etcd. It seems AWS ALBs are simply not compatible with etcd's gRPC interface, so even with other clients we get timeouts from etcd. I'm going to close this issue, and I'll re-open it if I still have problems and your suggestion doesn't help. Thanks!

@dominicbarnes
Author

I'm going to re-open this issue: I was able to resolve the etcd problems and got the reconcile working. However, it takes 1-3 minutes in our staging environment, and production will likely be worse.

Would it be reasonable to request that the initial reconcile not block the server from starting? This would let the process start up and pass ECS health checks. I understand that it blocks because the reconcile is important to the service's overall health, so I wouldn't mind if the process panics and exits when the initial reconcile fails for any reason.

@dominicbarnes dominicbarnes reopened this May 11, 2017
@kylbarnes
Contributor

@dominicbarnes The problem here is really the amount of data stored in Etcd, and how long it takes Etcd to load it into memory on startup. Can you perform the manual steps listed in #197? This will compact the expired historical events in Etcd to free up memory. That expired data is only needed if you're using the cluster-state-service streaming APIs. Please let me know if this doesn't speed up the cluster-state-service startup time for you. You can also add --auto-compaction-retention 24 to the Etcd task definition if you want Etcd to run this automatically every 24 hours.
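I haven't reproduced the exact steps from #197 here, but a manual etcd v3 compaction generally looks like the commands below (the endpoint is an example; adjust for your deployment, and note defrag briefly blocks the member):

```
# Find the current revision, compact history up to it, then defragment
# to return the freed space to the OS.
rev=$(ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 \
        endpoint status --write-out=json | grep -o '"revision":[0-9]*' | grep -o '[0-9]*')
ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 compact "$rev"
ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 defrag
```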
