When adding new node to an existing cluster, the new node gets deployed by K8s but stays at 1/2 forever #1610
Comments
What is the reason why the Pod isn't getting ready? What does
Also, the logs of the pod can be seen with:
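For reference, a sketch of the usual inspection commands for a pod stuck at 1/2 Ready (pod and namespace names are taken from this issue report and are illustrative; the snippet only assembles and prints the command strings, so review and run them against your own cluster):

```shell
# Hypothetical helper: build the standard inspection commands.
# NS and POD come from this issue report; substitute your own names.
NS=kiran1-cassandra-int
POD=kiran1-cassandra-int-dc1-default-sts-3

# Events, including readiness-probe failures:
describe_cmd="kubectl describe pod -n $NS $POD"
# Logs of the container that is failing readiness:
logs_cmd="kubectl logs -n $NS $POD -c cassandra"

echo "$describe_cmd"
echo "$logs_cmd"
```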
Yes, the operator is very old and we will be upgrading it in some time. A small correction: we are running k8s 1.24.
Sorry, I made a typo here.
```
INFO [nioEventLoopGroup-2-1] 2023-06-08 07:11:53,434 Cli.java:617 - address=/172.31.174.206:52158 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [main] 2023-06-08 07:05:51,592 Keyspace.java:386 - Creating replication strategy system_auth params KeyspaceParams{durable_writes=true, replication=ReplicationParams{class=org.apache.cassandra.locator.NetworkTopologyStrategy, dc1=3}}
INFO [main] 2023-06-08 07:05:51,761 ColumnFamilyStore.java:2252 - Truncating system.size_estimates
INFO [main] 2023-06-08 07:06:01,564 ConfigurationLoader.java:62 - Configuration location: file:/opt/metrics-collector/config/metric-collector.yaml
```
```
INFO [GossipStage:1] 2023-06-08 07:05:56,631 StorageService.java:2722 - Nodes /100.124.109.22:7000 and /100.124.109.40:7000 have the same token 1883473184085982538. Ignoring /100.124.109.22:7000
```
That is a massive problem. Are you somehow sharing disks or something between nodes? Or how did the nodes get the same token?
We use EBS volumes and do not share any disks between nodes. I can quickly recreate a new cluster and share the logs again.
@burmanm I have created a new cluster and tested it. The result is the same when I use Helm to update the size to 4; however, when I directly update the cassdc object, the process works as expected and within a few minutes the fourth pod becomes healthy.
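For concreteness, a minimal sketch of what the direct cassdc update amounts to (datacenter and namespace names come from this issue; the snippet only prints the command so it can be reviewed before running it against a real cluster):

```shell
# Bump the CassandraDatacenter size directly, bypassing Helm.
# dc1 and the namespace are taken from this issue report.
NS=kiran1-cassandra-int
patch_cmd="kubectl patch cassandradatacenter dc1 -n $NS --type merge -p '{\"spec\":{\"size\":4}}'"
echo "$patch_cmd"
```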
What's the helm command you use to update the size? |
I use
Hmm, my helm doesn't have a helm diff command (or helm upgrade diff), so I wonder what that's doing (would it be possible to get the dry-run output to see what it modifies?). I transferred the ticket to k8ssandra/k8ssandra, as apparently you're still using a k8ssandra 1.x installation and not just the cass-operator Helm chart.
helm diff basically does a dry run and displays what the upgrade command would change; these are the values that have changed in k8ssandra/templates/cassandra/cassdc.yaml:
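For illustration only (the actual diff is not shown here), the relevant part of the rendered cassdc.yaml looks roughly like this; when scaling from 3 to 4 nodes, only spec.size should change:

```yaml
# Illustrative CassandraDatacenter fragment; field names follow the
# cassandra.datastax.com/v1beta1 API, values are taken from this issue.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  size: 4   # was 3 before the scale-up
```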
That seems to indicate a failure in your Cassandra cluster to get to a replication factor of 4 when you use the Helm command to update; that replication is used to distribute the passwords for authentication purposes. It's difficult to say more from the Helm chart / operator point of view, as the problem is not there: it's something in the cluster itself.
Thanks @burmanm. I restarted cass-operator, which restarted the Pods in the StatefulSet and fixed the issue.
What happened?
I followed the documentation to scale the k8ssandra cluster from 3 nodes to 4. The new pod gets scheduled in k8s, but its status stays at 1/2:
```
NAME                                                   READY   STATUS    RESTARTS   AGE
pod/kiran1-cassandra-int-cass-operator-6d987f86c8-6wsht   1/1     Running   0          55m
pod/kiran1-cassandra-int-dc1-default-sts-0                2/2     Running   0          17m
pod/kiran1-cassandra-int-dc1-default-sts-1                2/2     Running   0          18m
pod/kiran1-cassandra-int-dc1-default-sts-2                2/2     Running   0          19m
pod/kiran1-cassandra-int-dc1-default-sts-3                1/2     Running   0          7m50s
```
Describing the pod returns:
```
Warning  Unhealthy  3m7s (x32 over 7m43s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
```
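To see that 500 by hand, you can hit the readiness endpoint from inside the pod. A sketch (pod name from this issue; port 8080 is an assumption based on the management-api default and may differ in your deployment; the snippet only prints the command string):

```shell
NS=kiran1-cassandra-int
POD=kiran1-cassandra-int-dc1-default-sts-3
# /api/v0/probes/readiness is the URL seen returning 500 in the pod logs.
probe_cmd="kubectl exec -n $NS $POD -c cassandra -- curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/api/v0/probes/readiness"
echo "$probe_cmd"
```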
I don't see any errors in cass-operator:
```
2023-06-08T06:49:50.769Z INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller Created PodDisruptionBudget dc1-pdb {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "kiran1-cassandra-int", "reason": "CreatedResource", "eventType": "Normal"}
2023-06-08T06:49:50.769Z INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller starting CheckRackPodTemplate() {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "kiran1-cassandra-int", "namespace": "kiran1-cassandra-int", "datacenterName": "dc1", "clusterName": "kiran1-cassandra-int"}
2023-06-08T06:49:50.769Z DEBUG events Normal {"object": {"kind":"CassandraDatacenter","namespace":"kiran1-cassandra-int","name":"dc1","uid":"fdc43783-1c3a-42c8-a2af-3e29b2375d0d","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"8006192120"}, "reason": "CreatedResource", "message": "Created PodDisruptionBudget dc1-pdb"}
2023-06-08T06:49:50.769Z INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller statefulset needs an update {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "kiran1-cassandra-int", "namespace": "kiran1-cassandra-int", "datacenterName": "dc1", "clusterName": "kiran1-cassandra-int", "rackName": "default"}
2023-06-08T06:49:50.769Z INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller Updating rack default {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "kiran1-cassandra-int", "reason": "UpdatingRack", "eventType": "Normal"}
2023-06-08T06:49:50.769Z DEBUG events Normal {"object": {"kind":"CassandraDatacenter","namespace":"kiran1-cassandra-int","name":"dc1","uid":"fdc43783-1c3a-42c8-a2af-3e29b2375d0d","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"8006192120"}, "reason": "UpdatingRack", "message": "Updating rack default"}
2023-06-08T06:49:50.790Z INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller Updating statefulset pod specs {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "kiran1-cassandra-int", "namespace": "kiran1-cassandra-int", "datacenterName": "dc1", "clusterName": "kiran1-cassandra-int", "statefulSet": {"namespace": "kiran1-cassandra-int", "name": "kiran1-cassandra-int-dc1-default-sts"}}
2023-06-08T06:49:50.840Z INFO controllers.CassandraDatacenter Reconcile loop completed {"cassandradatacenter": "kiran1-cassandra-int/dc1", "requestNamespace": "kiran1-cassandra-int", "requestName": "dc1", "loopID": "1bd1d375-64d2-4e09-bc51-29e1a1d933cc", "duration": 0.185828658}
```
The StatefulSet stays at 3/4:
```
NAME                                                   READY   AGE
statefulset.apps/kiran1-cassandra-int-dc1-default-sts   3/4     24m
```
When I tried to run the Cassandra script from the new container (/opt/cassandra/bin/cassandra), it throws the error below:
```
cassandra@kiran1-cassandra-int-dc1-default-sts-3:/$
INFO [Messaging-EventLoop-3-15] 2023-06-08 07:05:58,589 NoSpamLogger.java:92 - /100.124.109.40:7000->/100.124.109.22:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /100.124.109.22:7000
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
INFO [main] 2023-06-08 07:06:01,414 Gossiper.java:2245 - No gossip backlog; proceeding
INFO [main] 2023-06-08 07:06:01,418 AuthCache.java:215 - (Re)initializing CredentialsCache (validity period/update interval/max entries) (3600000/3600000/1000)
```
What did you expect to happen?
The new pod should be up and healthy
How can we reproduce it (as minimally and precisely as possible)?
Step 1. Create a 3 node cluster
Step 2. Update the size to 4 in cassdc
Step 3. Wait until a new pod gets created
Step 4. Notice that the new pod stays at 1/2
cass-operator version
Version: 0.35.0 Type: application AppVersion: 1.10.0
Kubernetes version
22
Method of installation
Helm
Anything else we need to know?
No response