Operator behaving differently running in cluster compared to out of cluster #6678

Open
coillteoir opened this issue Feb 11, 2024 · 4 comments

coillteoir commented Feb 11, 2024

Type of question

General operator-related help

Question

I am creating an operator to work with a CI/CD system. When I run it locally, it creates pods as expected. But when I deploy it to the cluster, it fails to check whether a pod has already been created and creates multiple pods for the same "task".

Pipeline Spec: [screenshot]

Locally, using make run: [screenshot]

In cluster, after pushing the Docker image and using make deploy: [screenshot]

What did you do?

To run individual tasks in a pipeline, I wrote a function which uses DFS to walk a tree data structure and checks the status of child pods before generating a new pod for that task.
The operator then loops over the generated list of pods and creates them in the cluster.
[screenshot of the function]
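
For reference, a minimal sketch of the kind of existence/status check described above, assuming a deterministic per-task pod name and the standard controller-runtime client; the helper name and package are illustrative, not the actual bramble code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// taskPodState reports whether the pod that would run a task already exists
// and whether it has finished successfully, so the DFS can decide whether a
// new pod needs to be generated for that task.
func taskPodState(ctx context.Context, c client.Client, namespace, podName string) (exists bool, succeeded bool, err error) {
	var pod corev1.Pod
	err = c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: podName}, &pod)
	if apierrors.IsNotFound(err) {
		return false, false, nil
	}
	if err != nil {
		return false, false, err
	}
	return true, pod.Status.Phase == corev1.PodSucceeded, nil
}
```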

What did you expect to see?

The correct number of pods being created.

What did you see instead? Under which circumstances?

Multiple pods being created and the pipeline not being validated.

Environment

Operator type:

/language go

Kubernetes cluster type:

$ operator-sdk version

1.33

$ go version

1.22

$ kubectl version

1.29

Additional context

Current branch for the bug: https://github.com/coillteoir/bramble/tree/develop
The bug is in the execution group of controllers.
It occurs in both Kind and minikube.

openshift-ci bot added the language/go label on Feb 11, 2024
coillteoir (Author) commented

I'm unsure where to start with this issue, in particular whether it's a bug in my code or in an upstream library such as controller-runtime.

jberkhahn (Contributor) commented

So, reconciliation loops aren't really run in a deterministic manner - multiple controllers might pick up the same event and try to reconcile it, which is why it's always a good idea to check the state of the system before trying to modify it. It looks like you're just always firing off this function that tries to create a bunch of pods.

Not sure why you're experiencing different behavior on/off cluster, though. It might just be that the increased latency means fewer controller loops are firing, or something like that.
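
To make the "check the state of the system before modifying it" point concrete, here is a minimal sketch of an idempotent create, assuming deterministically named per-task pods and the kubebuilder-scaffolded reconciler fields (embedded client plus r.Scheme); the helper and argument names are illustrative only:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensurePod creates the desired pod only if it does not already exist, so a
// repeated or concurrent reconcile of the same event cannot produce duplicates.
func (r *PipelineReconciler) ensurePod(ctx context.Context, owner client.Object, desired *corev1.Pod) error {
	// Owning the pod ties its lifecycle to the pipeline and, if the pod type
	// is watched with Owns(), lets pod changes trigger further reconciles.
	if err := controllerutil.SetControllerReference(owner, desired, r.Scheme); err != nil {
		return err
	}
	if err := r.Create(ctx, desired); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```

With deterministic pod names, the API server's AlreadyExists error acts as the final guard even if two reconciles race past the same pre-create check.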

jberkhahn added the triage/support label on Feb 12, 2024
jberkhahn self-assigned this on Feb 12, 2024
jberkhahn added this to the Backlog milestone on Feb 12, 2024
coillteoir (Author) commented

Just curious, is the controller runtime synchronous or does it use goroutines under the hood? And if it does, would there be a way to force my reconcile loop to wait for the controller to finish provisioning/getting resources before continuing?
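
For what it's worth, the usual controller-runtime pattern is not to wait inside Reconcile at all: each pass re-reads cluster state, and if things aren't ready yet it returns a requeue so a later pass can try again. A minimal sketch, assuming the scaffolded PipelineReconciler with an embedded client and an assumed label tying pods to their pipeline:

```go
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Reconcile re-checks cluster state on every pass; when the child pods are
// not finished yet it asks to be requeued instead of blocking in-process.
func (r *PipelineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pods corev1.PodList
	// "bramble.dev/pipeline" is an assumed label, not the actual bramble API.
	if err := r.List(ctx, &pods, client.InNamespace(req.Namespace),
		client.MatchingLabels{"bramble.dev/pipeline": req.Name}); err != nil {
		return ctrl.Result{}, err
	}

	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodSucceeded {
			// Not done yet: come back later rather than waiting here.
			return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
		}
	}
	return ctrl.Result{}, nil
}
```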

openshift-bot commented

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on May 15, 2024