Conversation

@alex-hunt-materialize (Contributor) commented Feb 9, 2026

Simplified rollout triggers and CRD design doc

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

- Change `forcePromote` from `Uuid` to `Option<String>` - Instead of triggering promotion when it matches the UUID from `requestRollout`, it now triggers promotion when it matches the hash stored in `status.requestedRolloutSpecHash`.

**Status changes:**
- Replace `lastCompletedRolloutRequest` (`Uuid`) with `lastCompletedRolloutSpecHash` (`Option<String>`) - Stores the spec hash of the last successful rollout. Will be `None` on a first deployment or when upgrading from `v1alpha1` (see the sketch after this list).
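
For illustration, here is a minimal sketch of how the spec and status fields described above could look as Rust types. The struct names, derives, and everything beyond the listed fields are assumptions for this example, not the actual operator code.

```rust
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Sketch only: struct names and derives are assumed for illustration.
#[derive(Clone, Debug, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "camelCase")]
pub struct MaterializeSpec {
    /// Promotion is forced when this matches the hash stored in
    /// `status.requestedRolloutSpecHash` (previously a `Uuid` that had to
    /// match `requestRollout`).
    pub force_promote: Option<String>,
    // ...other spec fields elided...
}

#[derive(Clone, Debug, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "camelCase")]
pub struct MaterializeStatus {
    /// Spec hash of the last successful rollout; `None` on a first
    /// deployment or when upgrading from `v1alpha1`.
    pub last_completed_rollout_spec_hash: Option<String>,
    /// Spec hash of the currently requested rollout; the review thread
    /// below discusses whether this should stay an `Option`.
    pub requested_rollout_spec_hash: Option<String>,
    // ...other status fields elided...
}
```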
Contributor

Once we get ready to move to a real v1, we should be able to drop the `Option` here. At that point, the only time we don't have a value is when the CR is first created, but we already have the spec at that point, so we can just fill it in.

Contributor Author

I don't think that is true. This is the last completed rollout request. A lot happens between starting a rollout and considering it complete. We save the status multiple times along the way, and we should indicate that it isn't complete in that case.

Contributor

My point, though, is that in all of those in-progress states the last completed hash will be different from the requested rollout hash, so we aren't actually gaining any extra information from the `Option`. The way it's described here, the `Option` is `None` if and only if the hashes are equal, which I think just adds an extra invalid state for no benefit?

Contributor Author

I agree about the requested rollout hash, just not about the last completed hash. Specifically, what would you put there during the first rollout?

Contributor

Oh, yeah, sorry if I was unclear; I was only referring to the requested rollout hash here. The last completed rollout hash being `None` before the first rollout makes sense to me.


- Replace `resourcesHash` (`String`) with `requestedRolloutSpecHash` (`Option<String>`) - Stores the spec hash of the currently requested rollout. Will be `None` when no rollout is ongoing.
Contributor

I think it'd probably be simpler to have this just always be set (not an `Option`); the `UpToDate` condition will be an easier thing for users to check. Otherwise it feels like it'd be possible to accidentally get into an invalid state (`lastCompletedRolloutSpecHash == requestedRolloutSpecHash`) that would be hard to recover from; it's easier if we just don't make that kind of invalid state representable.

Contributor Author

I think you're right. We can always set it to the calculated hash of the current CR, except if we have already begun promoting.

Note to self: ensure that any status updates made after we've reached the promoting state use the value from the status, not the currently calculated value.
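
A hedged sketch of that logic follows; the function and parameter names are made up for illustration and aren't the operator's actual API.

```rust
// Sketch only: names are illustrative assumptions, not the operator's API.
fn next_requested_rollout_spec_hash(
    hash_in_status: Option<&str>, // requestedRolloutSpecHash as last written to the status
    calculated_hash: &str,        // hash calculated from the current CR
    promotion_has_begun: bool,
) -> String {
    if promotion_has_begun {
        // Once we've reached the promoting state, keep the value already
        // recorded in the status so a later spec edit can't retarget the
        // in-flight rollout.
        hash_in_status
            .map(str::to_string)
            .unwrap_or_else(|| calculated_hash.to_string())
    } else {
        // Otherwise, always set it to the hash calculated from the current CR.
        calculated_hash.to_string()
    }
}
```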

- `environmentdScratchVolumeStorageRequirement`
- `serviceAccountName`
- `serviceAccountAnnotations`
- `serviceAccountLabels`
Contributor

The service account annotations and labels are also applied immediately, so they probably shouldn't be here (although a change to `serviceAccountName` will require a rollout, since that needs to update the corresponding field on the StatefulSet).

Contributor Author

I think they might still be load-bearing, despite being applied immediately. If we change the annotations on the service account to add an AWS IAM role ARN, for example, do the credentials get applied to existing pods? I'm not sure they do.

Contributor

Ah, I guess that's true. It may be worth testing to see what the behavior is here, but I think you're probably right. We'll probably want to leave a comment explaining this, since it's not immediately obvious.
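
To capture the outcome of this thread, here is a hedged sketch of the rollout-triggering fields from the list above, including the explanatory comment suggested here; the struct and field names are assumptions for illustration only.

```rust
use std::collections::BTreeMap;

// Sketch only: struct and field names are assumed for illustration.
pub struct RolloutTriggeringFields {
    pub environmentd_scratch_volume_storage_requirement: String,
    /// Requires a rollout because the corresponding field on the
    /// StatefulSet must be updated.
    pub service_account_name: String,
    /// Applied to the ServiceAccount immediately, but still treated as
    /// rollout-triggering: credentials implied by annotations (e.g. an AWS
    /// IAM role ARN) are not necessarily picked up by already-running pods.
    pub service_account_annotations: BTreeMap<String, String>,
    pub service_account_labels: BTreeMap<String, String>,
}
```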

