This repository has been archived by the owner on Dec 1, 2022. It is now read-only.

cluster-state-service initial reconcile slow #193

Open
dominicbarnes opened this issue May 8, 2017 · 5 comments

Comments

@dominicbarnes

I'm working on deploying cluster-state-service on ECS, but I'm running into some problems. The process starts and immediately begins "Reconciler loading tasks and instances", but before long ECS kills the task because it failed to pass its health check. It turns out that the initial reconcile blocks the web server from starting, meaning that until the reconciliation finishes, nothing else runs.

If I simply raise the thresholds here, I find that it eventually exceeds the context deadline if I let it run long enough. There is no other error or any indication of what is slow and keeping the service from starting. We have a number of clusters with quite a few tasks, so it's possible this just takes time, but there are no logs indicating that any work is being done, so it's hard to even gauge progress.
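For anyone else hitting this in the meantime: loosening the ALB health check in Terraform buys the reconcile more time before ECS kills the task. A sketch of what I mean (the resource name, port, and health-check path are assumptions about a given setup, not documented values of the service):

```hcl
# Roughly 30s interval * 10 failures = ~5 minutes of grace before the
# target is marked unhealthy and ECS replaces the task.
resource "aws_alb_target_group" "css" {
  name     = "cluster-state-service"
  port     = 3000
  protocol = "HTTP"
  vpc_id   = "${var.vpc_id}"

  health_check {
    path                = "/v1/ping"   # assumed health endpoint
    interval            = 30
    timeout             = 10
    healthy_threshold   = 2
    unhealthy_threshold = 10
  }
}
```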

I'm using Terraform to deploy things similarly to the "Local Deployment" instructions, so I've actually been porting things from CloudFormation to Terraform along the way. It's entirely possible I've gotten some things configured incorrectly.

I've tried adding CSS_LOG_LEVEL=debug to get more logging, but that hasn't yielded any more logs at all.

@kylbarnes
Contributor

@dominicbarnes I think this may be related to #192. There seems to be a bug in the cluster-state-service code related to querying Etcd and getting a context.DeadlineExceeded error on the reconciler. I'll try to duplicate this on my side and post the results.

@kylbarnes
Contributor

@dominicbarnes Please see my latest comment in #192. This issue is related to the amount of memory allocated to the Etcd container. I was able to reduce my Etcd server startup time from 90s to 1s by increasing the memory available to the Etcd container. Please give this a shot and let me know if it doesn't work for you.
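For reference, in an ECS task definition the memory limit is set per container, so raising it for Etcd looks roughly like this (the image tag and memory value are illustrative, not a recommendation):

```json
{
  "containerDefinitions": [
    {
      "name": "etcd",
      "image": "quay.io/coreos/etcd:v3.1.5",
      "memory": 1024
    }
  ]
}
```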

@dominicbarnes
Author

@kylbarnes I'll try that as well, but it looks like our situation was caused by a networking misconfiguration around etcd. It seems AWS ALBs are simply not compatible with etcd's gRPC interface, so even with other clients we get timeouts from etcd. I'm going to close this issue, and I'll re-open it if I still have problems and your suggestion doesn't help. Thanks!

@dominicbarnes
Author

I'm going to re-open this issue: I was able to resolve the etcd problems and got the reconcile working. However, it takes 1-3 minutes in our staging environment, and production will likely be worse.

Would it be reasonable to request that the initial reconcile not block the server from starting? This would let the process start up and pass ECS health checks. I understand that it blocks because the reconcile is important to the service's overall health, so I wouldn't mind if the process panics and exits when the initial reconcile fails for any reason.

@dominicbarnes dominicbarnes reopened this May 11, 2017
@kylbarnes
Contributor

@dominicbarnes The problem here is really the amount of data stored in Etcd, and how long it takes Etcd to load it into memory on startup. Can you perform the manual steps listed in #197? This will compact the expired historical events in Etcd to free up memory. That expired data is only needed if you're using the cluster-state-service streaming APIs. Please let me know if this doesn't speed up the cluster-state-service startup time for you. You can also add --auto-compaction-retention 24 to the Etcd task definition if you want Etcd to run this automatically every 24 hours.
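I haven't reproduced the exact steps from #197 here, but a manual etcd v3 compaction generally looks like the commands below (the endpoint is an example; adjust for your deployment, and note defrag briefly blocks the member):

```
# Find the current revision, compact history up to it, then defragment
# to return the freed space to the OS.
rev=$(ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 \
        endpoint status --write-out=json | grep -o '"revision":[0-9]*' | grep -o '[0-9]*')
ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 compact "$rev"
ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 defrag
```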
