Split "set target release" endpoint into two: one for update, one for mupdate recovery#9887
Split "set target release" endpoint into two: one for update, one for mupdate recovery#9887jgallagher wants to merge 15 commits intomainfrom
Conversation
|
Schema diff. Very simple, nice that they take the same params. Do you think I should expose this functionality in the console? Probably not, right? --- a/2026021301.0.0-6e51ab/spec.json
+++ b/2026021800.0.0-38e767/spec.json
@@ -7,7 +7,7 @@
"url": "https://oxide.computer",
"email": "api@oxide.computer"
},
- "version": "2026021301.0.0"
+ "version": "2026021800.0.0"
},
"paths": {
"/device/auth": {
@@ -12383,6 +12383,35 @@
}
}
},
+ "/v1/system/update/target-release/recovery": {
+ "put": {
+ "tags": ["system/update"],
+ "summary": "Recover from an Oxide-support-driven system update",
+ "description": "Inform the control plane of the release of the rack's system software it is now running due to a recovery operation (\"mupdate\") performed by Oxide support.\n\nThis endpoint should only be called at the direction of Oxide support.",
+ "operationId": "target_release_update_recovery",
+ "requestBody": {
+ "content": {
+ "application/json": {
+ "schema": {
+ "$ref": "#/components/schemas/SetTargetReleaseParams"
+ }
+ }
+ },
+ "required": true
+ },
+ "responses": {
+ "204": {
+ "description": "resource updated"
+ },
+ "4XX": {
+ "$ref": "#/components/responses/Error"
+ },
+ "5XX": {
+ "$ref": "#/components/responses/Error"
+ }
+ }
+ }
+ },
"/v1/system/update/trust-roots": {
"get": {
"tags": ["system/update"], |
Probably not, yeah. @ahl and I chatted about this a few weeks ago, and IIRC we wanted to tuck this operation somewhere out of the main path even in the CLI, since it should only be called after support performs a mupdate (and will fail if called any other time anyway). |
davepacheco
left a comment
There was a problem hiding this comment.
(still going through nexus/src/app/deployment.rs but wanted to leave this before the watercooler)
nexus/external-api/src/lib.rs
Outdated
| /// This endpoint should only be called at the direction of Oxide support. | ||
| #[endpoint { | ||
| method = PUT, | ||
| path = "/v1/system/update/target-release/recovery", |
There was a problem hiding this comment.
I know we talked about doing this as a separate endpoint. But seeing the change, I wonder if it'd be clearer as a single endpoint with a new required field on the request body with this "intent" in it? I'm not that sure either way, to be honest.
It's interesting to me that I think this is a semantically breaking change to the existing endpoint, but this construction avoids actually breaking the wire protocol. If you kept one endpoint and added an intent, you'd be forced to make the explicit choice that in the older version, the intent is inferred to be Update. I like that, but I don't think it's a reason to prefer that approach.
It does feel goofy to have this recovery path underneath the other one. If we want a separate endpoint I'd consider PUT /v1/system/update/recovery_finish or something.
There was a problem hiding this comment.
I think we should have two separate endpoints instead of adding a required intent field. My reasoning is that if someone is reading through the API to learn how to do this from an automation tool they're writing, they are going to become aware of the fact that the parameter is there and required, and that the only two options are "this is the normal one" and "don't use unless Oxide tells you to". Seems like a weird amount of attention to be paid to it.
I agree with moving the endpoint URL to be adjacent to /target-release and not under it.
There was a problem hiding this comment.
I like two endpoints, same reason. Also agree with making it parallel. Didn't notice it was under the existing endpoint.
There was a problem hiding this comment.
Kept as two endpoints, but moved to /v1/system/update/recovery-finish.
| ) -> Result<(), TargetReleaseChangeError> { | ||
| // We cannot update to the _identical_ version we're already at. | ||
| if proposed_new_version == current_version { | ||
| warn!(log, "cannot start update: attempt to update to current version"); |
There was a problem hiding this comment.
I don't object to extra logging, but won't the message already showed up in the Nexus log as the error_message on the 400-level response here?
There was a problem hiding this comment.
Yeah; I put this in here so I could get better logs from the unit tests when they were still failing (which call these functions directly without going through the API). I could take the logs out now? I don't feel strongly either way.
There was a problem hiding this comment.
I often feel that this sort of code path logging is hard to keep complete and not-noisy and I usually go for a single summary message in one spot. But I don't feel strongly either. There's no harm.
nexus/src/app/deployment.rs
Outdated
| // A mupdate has occurred; we must not allow an update. | ||
| warn!( | ||
| log, | ||
| "cannot start update: mupdate override in place"; | ||
| "sled_id" => %sled_id, | ||
| ); | ||
| return Err(TargetReleaseChangeError::WaitingForMupdateToBeCleared); |
There was a problem hiding this comment.
It's a little surprising that the (non-trivial) logic about determining whether a MUPdate is outstanding isn't shared between this function and validate_can_set_target_release_for_mupdate_recovery. Are these very different? (Is this just to account for the ambiguity in the case where both host OS slots report CurrentContents? In that case maybe a helper could report the ambiguity.)
There was a problem hiding this comment.
I think it's kinda awkward because in this function, we also care about extracting the exact versions of the contents if they're not CurrentContents. Maybe I can extract a helper that does exactly that, though; I'll give it a shot.
There was a problem hiding this comment.
I tried this in 3851bb9. It's pretty wordy, but probably still worth it overall?
I didn't include checking zones in the helper for two reasons:
- We don't have a
sled_config.in_service_zone_configs()method; we only have that at the blueprint level. - Writing a separate
check_zone_config_for_mupdate()helper is pretty pointless, because it's just a match of the zone's image source.
There was a problem hiding this comment.
I do think it's better.
Some other thoughts, again, take 'em or leave 'em: we don't really use this with ? so I'm not sure it's useful to use Result. I wonder if it'd be clearer, both within the function and in the callers, if it returned:
enum SledMupdatePending {
MupdatePending(SledMupdateDetectedHow),
NoMupdatePending(os_versions),
}
enum SledMupdateDetectedHow {
RemoveMupdateOverridePresent,
BootDiskContents,
}or maybe even better?
fn sled_update_status(sled_config: &BlueprintSledConfig, current_target_version: semer::Version) -> SledUpdateStatus { ... }
enum SledUpdateStatus {
HasUnresolvedMupdate(SledMupdateDetectedHow),
RunningRequestedVersion, // `current_target_version` was found in at least one slot
PreviousUpdatePending, // `current_target_version` not found anywhere
}Basically, trying to put the tricky decision-making into one simpler, stateless function and the higher-level logic in the callers, which might then read more clearly.
| TargetReleaseSource::Unspecified => { | ||
| // There is no current target release; it's always fine to | ||
| // set the first one. | ||
| } |
There was a problem hiding this comment.
Is this right? If you have a newly-provisioned control plane are you in mupdate-recovery state, or are we in a subtly different state where you can upload a new repo but you can't take certain actions that would require the presence of artifacts?
(I wanted to ask about the add sled flow but I think we mupdate the sled before adding it to the control plane. But in the future we've talked about wanting to have Nexus manage that recovery flow for a new sled, using the artifacts available to it.)
There was a problem hiding this comment.
Is this right?
I hope so - I didn't change this behavior! I did move it - it used to be inlined in the HTTP endpoint handler.
If you have a newly-provisioned control plane are you in mupdate-recovery state, or are we in a subtly different state where you can upload a new repo but you can't take certain actions that would require the presence of artifacts?
We are in a subtly different state as far as the planner is concerned - it's willing to continue to add new zones, etc., all sourced out of ::InstallDataset, even though it doesn't know what version of software the rack is running. It assumes if no target release has ever been set, we're in a mupdate-only world.
(I wanted to ask about the add sled flow but I think we mupdate the sled before adding it to the control plane. But in the future we've talked about wanting to have Nexus manage that recovery flow for a new sled, using the artifacts available to it.)
I believe after a rack has had an online update, adding a mupdated sled will put the rack into "needs mupdate recovery" mode, and we'll need to hit this endpoint to move past that. But yeah in the fullness of time Nexus should manage that process, which will remove the need to do that.
davepacheco
left a comment
There was a problem hiding this comment.
This looks good!
I think it wouldn't hurt to get another set of eyes on it, given how tricky and important this is.
| ) -> Result<(), TargetReleaseChangeError> { | ||
| // We cannot update to the _identical_ version we're already at. | ||
| if proposed_new_version == current_version { | ||
| warn!(log, "cannot start update: attempt to update to current version"); |
There was a problem hiding this comment.
I often feel that this sort of code path logging is hard to keep complete and not-noisy and I usually go for a single summary message in one spot. But I don't feel strongly either. There's no harm.
nexus/src/app/deployment.rs
Outdated
| // A mupdate has occurred; we must not allow an update. | ||
| warn!( | ||
| log, | ||
| "cannot start update: mupdate override in place"; | ||
| "sled_id" => %sled_id, | ||
| ); | ||
| return Err(TargetReleaseChangeError::WaitingForMupdateToBeCleared); |
There was a problem hiding this comment.
I do think it's better.
Some other thoughts, again, take 'em or leave 'em: we don't really use this with ? so I'm not sure it's useful to use Result. I wonder if it'd be clearer, both within the function and in the callers, if it returned:
enum SledMupdatePending {
MupdatePending(SledMupdateDetectedHow),
NoMupdatePending(os_versions),
}
enum SledMupdateDetectedHow {
RemoveMupdateOverridePresent,
BootDiskContents,
}or maybe even better?
fn sled_update_status(sled_config: &BlueprintSledConfig, current_target_version: semer::Version) -> SledUpdateStatus { ... }
enum SledUpdateStatus {
HasUnresolvedMupdate(SledMupdateDetectedHow),
RunningRequestedVersion, // `current_target_version` was found in at least one slot
PreviousUpdatePending, // `current_target_version` not found anywhere
}Basically, trying to put the tricky decision-making into one simpler, stateless function and the higher-level logic in the callers, which might then read more clearly.
The existing "set target release" external API endpoint is used for two reasons:
However, the checks we ought to perform for "should the new target release version be allowed" are pretty different for the two cases, and we were both too strict and too loose. A couple examples of incorrect behavior prior to this PR:
As of this change, there are separate "set target release for update" and "set target release for mupdate recovery" endpoints with more correct validation for each intent. In the two examples above:
Closes #9113. Also addresses an issue @askfongjojo ran into on a racklette recently with needing to "downgrade"; e.g., in a sequence like this:
After this change, we can now correct the mistake in step 4: because 18 wasn't the release actually deployed, we'd still be in the "need to recover from mupdate" state, allowing the operator to set the target release back to 17.