Add event recording and status conditions for worker deployments#203
thearcticwatch wants to merge 8 commits into main
Conversation
carlydf
left a comment
Also, `make fmt-imports` will solve some of your lint errors.
carlydf
left a comment
looking good! just did an initial review; we should still add a functional test once these comments are addressed.
I found https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#events and https://book.kubebuilder.io/reference/raising-events#creating-events helpful while reviewing.
internal/controller/execplan.go (Outdated):

```go
}
if err != nil {
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "TestWorkflowStartFailed",
		"Failed to start gate workflow %q (buildID %s): %v", wf.workflowType, wf.buildID, err)
```
can you put the task queue name in this event?
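A minimal sketch of that suggestion. Only `workflowType` and `buildID` appear in the diff, so `wf.taskQueue` is an assumed field name:

```go
// Include the task queue so the event tells the operator where the gate
// workflow failed to start. wf.taskQueue is a hypothetical field name.
if err != nil {
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "TestWorkflowStartFailed",
		"Failed to start gate workflow %q on task queue %q (buildID %s): %v",
		wf.workflowType, wf.taskQueue, wf.buildID, err)
}
```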
```go
if err := r.executeK8sOperations(ctx, l, workerDeploy, p); err != nil {
	return err
}

deploymentHandler := temporalClient.WorkerDeploymentClient().GetHandle(p.WorkerDeploymentName)

if err := r.startTestWorkflows(ctx, l, workerDeploy, temporalClient, p); err != nil {
	return err
}

if err := r.updateVersionConfig(ctx, l, workerDeploy, deploymentHandler, p); err != nil {
```
was this refactor just to make the code easier to understand, or was it necessary for the events change? (I'm all for it, just curious)
Yes, the tests were failing on Cognitive Complexity and the refactor fixed them. The extra readability is just a bonus.
```go
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionRolloutReady, metav1.ConditionTrue,
	"RolloutSucceeded", "Target version rollout complete "+workerDeploy.Status.TargetVersion.BuildID)
```
It feels inconsistent to me to be saying "rollout ready", "rollout succeeded", and "rollout complete" all together. Maybe just "rollout complete" for all of them?
Also, I just realized: the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we will emit the same event every 30s.
I think if we instead emit this event in the actual updateVersionConfig function, where we successfully set the Current Version, that would ensure we only emit it once per completed rollout.
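A sketch of that suggestion, reusing names from the diffs in this PR (`updateVersionConfig`, `workerDeploy`); the exact surrounding code is an assumption:

```go
// Inside updateVersionConfig, immediately after the version update
// succeeds on the server. This path only runs on the reconcile that
// actually flips the current version, so the event fires once per
// completed rollout instead of on every 30s requeue.
r.Recorder.Eventf(workerDeploy, corev1.EventTypeNormal, "RolloutComplete",
	"Rollout complete for build %s", workerDeploy.Status.TargetVersion.BuildID)
```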
```go
//+kubebuilder:rbac:groups=core,resources=secrets,verbs=get;list;watch
//+kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=apps,resources=deployments/scale,verbs=update
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
```
https://book.kubebuilder.io/reference/raising-events#granting-the-required-permissions says that this would be enough:

```go
// +kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch
```

so I think `kubebuilder:rbac:groups=""` may be overly permissive?
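For reference, a sketch of the narrower marker together with the usual recorder wiring in main.go; the recorder name "temporal-worker-controller" is an assumption:

```go
// Narrower RBAC marker suggested above:
// +kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch

// controller-runtime's manager provides a named recorder:
r := &TemporalWorkerDeploymentReconciler{
	Client:   mgr.GetClient(),
	Scheme:   mgr.GetScheme(),
	Recorder: mgr.GetEventRecorderFor("temporal-worker-controller"),
}
```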
```go
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "TemporalConnectionNotFound",
	"Unable to fetch TemporalConnection %q: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err)
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionFalse,
	"TemporalConnectionNotFound", fmt.Sprintf("TemporalConnection %q not found: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err))
_ = r.Status().Update(ctx, &workerDeploy)
```
Seems like this pattern of setting a condition and recording an event is repeated throughout the codebase; I think we should clean this up with a helper, which would also make the call sites neater.
wdyt?
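One possible shape for such a helper (signature and names are illustrative, not from this PR): it records the warning event, sets the matching condition, and persists status in one call.

```go
// recordFailure pairs a Warning event with a False condition and a
// best-effort status update, so call sites collapse to one line.
// condType is typed as string here for illustration; the real type
// presumably matches setCondition's signature.
func (r *TemporalWorkerDeploymentReconciler) recordFailure(
	ctx context.Context,
	wd *temporaliov1alpha1.TemporalWorkerDeployment,
	condType string, reason, msgFmt string, args ...any,
) {
	msg := fmt.Sprintf(msgFmt, args...)
	r.Recorder.Event(wd, corev1.EventTypeWarning, reason, msg)
	r.setCondition(wd, condType, metav1.ConditionFalse, reason, msg)
	if err := r.Status().Update(ctx, wd); err != nil {
		// Best effort: the event and logs still carry the failure.
		log.FromContext(ctx).Error(err, "failed to persist status condition", "reason", reason)
	}
}
```

The TemporalConnection call site above would then collapse to a single `r.recordFailure(...)` call.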
```diff
 )

-func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, temporalClient sdkclient.Client, p *plan) error {
+func (r *TemporalWorkerDeploymentReconciler) executeK8sOperations(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, p *plan) error {
```
Any specific reason to rename this? I quite liked the approach of having these three files/steps where we first generate a status, make a plan, and then act on it.
```go
	ConflictToken: vcfg.ConflictToken,
	Identity:      getControllerIdentity(),
}); err != nil {
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
```
We should also log the error here, in addition to recording the event, imo. We follow this pattern elsewhere in this file and in the codebase, and I think it's good practice: if we do have a persistence failure, the logs are still an option for an operator.
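A sketch of the pattern being asked for; `p.WorkerDeploymentName` is borrowed from the surrounding diffs and the message text is illustrative:

```go
if err != nil {
	// Structured log first: still visible to operators even if the
	// Event can't be persisted to the API server.
	l.Error(err, "failed to register worker deployment version",
		"deployment", p.WorkerDeploymentName)
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
		"Failed to register version for deployment %q: %v", p.WorkerDeploymentName, err)
	return fmt.Errorf("registering version: %w", err)
}
```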
```diff
 	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanGenerationFailed",
 		"Unable to generate reconciliation plan: %v", err)
 	return ctrl.Result{}, err
 }

 // Execute the plan, handling any errors
-if err := r.executePlan(ctx, l, temporalClient, plan); err != nil {
+if err := r.executePlan(ctx, l, &workerDeploy, temporalClient, plan); err != nil {
 	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanExecutionFailed",
 		"Unable to execute reconciliation plan: %v", err)
```
Any specific reason you chose not to persist failures for the plan generation and execution phases, but did persist when we generate the status from the server?
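If there's no particular reason, a sketch of persisting the execution failure too, mirroring the status-generation path (the condition type and reason used here are illustrative):

```go
if err := r.executePlan(ctx, l, &workerDeploy, temporalClient, plan); err != nil {
	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanExecutionFailed",
		"Unable to execute reconciliation plan: %v", err)
	// Persist the failure as a condition as well, so it stays visible
	// in `kubectl describe` output even after the event expires.
	r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionRolloutReady, metav1.ConditionFalse,
		"PlanExecutionFailed", err.Error())
	_ = r.Status().Update(ctx, &workerDeploy)
	return ctrl.Result{}, err
}
```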
What changed: Added Kubernetes events and status conditions (TemporalConnectionHealthy, RolloutReady) to the worker controller reconciliation loop.

Why: Reconciliation failures were only visible in controller logs; events and conditions let users diagnose issues directly via kubectl.

Closes #28 (Add events to the TemporalWorkerDeployment CRD when there is a problem)

How was this tested: added unit tests

Any docs updates needed? N/A