Split "set target release" endpoint into two: one for update, one for mupdate recovery by jgallagher · Pull Request #9887 · oxidecomputer/omicron

jgallagher · 2026-02-19T22:36:53Z

The existing "set target release" external API endpoint is used for two reasons:

To start a new online update
To inform Nexus that a mupdate has occurred, and to allow reconfigurator to recover from that mupdate

However, the checks we ought to perform for "should the new target release version be allowed" are pretty different for the two cases, and we were both too strict and too loose. A couple examples of incorrect behavior prior to this PR:

We refused to recover from a mupdate if the version we mupdated to was below the current target version (even if no downgrade had actually taken place! - see below for an example of how this could happen)
We allowed setting the target release to itself spuriously (should not be able to set target release to itself (unless MUPdate happened)? #9113)

As of this change, there are separate "set target release for update" and "set target release for mupdate recovery" endpoints with more correct validation for each intent. In the two examples above:

This is now allowed - if we're in a "need recovery from mupdate" case, we allow any new target version. (If it doesn't match the software that we actually mupdated to, the planner won't be able to match up artifacts, so we'll stay in the "need recovery from mupdate" case until the correct version is set.)
This is no longer allowed - "set target release for update" now rejects setting the release version to itself.
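
One way to picture the split is as two validators that take the same proposed version but apply different rules. The sketch below is illustrative only and is not the code in nexus/src/app/deployment.rs: the tuple `Version`, the `mupdate_pending` boolean, both function names, and the `SameAsCurrent` and `NoMupdateToRecoverFrom` variants are assumptions made for the example. The real checks inspect blueprint sled configs, and any further checks the update path performs are omitted here.

```rust
/// Simplified stand-in for a release version; the real code uses semver.
type Version = (u64, u64, u64);

#[derive(Debug, PartialEq)]
enum TargetReleaseChangeError {
    /// Hypothetical variant: the proposed version is already the target.
    SameAsCurrent,
    /// Named in the PR: a mupdate is outstanding, so no update may start.
    WaitingForMupdateToBeCleared,
    /// Hypothetical variant: recovery requested with no mupdate pending.
    NoMupdateToRecoverFrom,
}

/// Checks for the "start an online update" intent.
fn validate_for_update(
    current: Option<Version>,
    proposed: Version,
    mupdate_pending: bool,
) -> Result<(), TargetReleaseChangeError> {
    // Setting the target release to itself is rejected.
    if current == Some(proposed) {
        return Err(TargetReleaseChangeError::SameAsCurrent);
    }
    // An unresolved mupdate blocks normal updates.
    if mupdate_pending {
        return Err(TargetReleaseChangeError::WaitingForMupdateToBeCleared);
    }
    Ok(())
}

/// Checks for the "recover from mupdate" intent.
fn validate_for_mupdate_recovery(
    _proposed: Version,
    mupdate_pending: bool,
) -> Result<(), TargetReleaseChangeError> {
    // Any version is accepted, including one below the current target,
    // but only while a mupdate is actually outstanding.
    if mupdate_pending {
        Ok(())
    } else {
        Err(TargetReleaseChangeError::NoMupdateToRecoverFrom)
    }
}

fn main() {
    let r17 = (17, 0, 0);
    let r18 = (18, 0, 0);
    // Setting the target to itself through the update path is an error.
    println!("{:?}", validate_for_update(Some(r18), r18, false));
    // A lower version is fine on the recovery path while a mupdate is pending.
    println!("{:?}", validate_for_mupdate_recovery(r17, true));
}
```

Keeping the two functions separate means neither has to guess the caller's intent from the state of the rack.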

Closes #9113. Also addresses an issue @askfongjojo ran into on a racklette recently with needing to "downgrade"; e.g., in a sequence like this:

1. Install R16
2. Mupdate to 17
3. Upload TUF repos for R17 and R18
4. Set target release to 18 (oops! - this should have been 17, and now we have no way to proceed other than mupdating to R18)

After this change, we can now correct the mistake in step 4: because 18 wasn't the release actually deployed, we'd still be in the "need to recover from mupdate" state, allowing the operator to set the target release back to 17.
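
The sequence can be walked through with a toy state model. This is a sketch under loud assumptions, not how Nexus tracks state: the `Rack` struct and its methods are hypothetical, releases are plain integers, and the "mupdate resolved" transition happens immediately here, whereas in the real system the planner clears it once it can match up artifacts.

```rust
/// Toy model of the rack state relevant to mupdate recovery (hypothetical).
#[derive(Debug, Default)]
struct Rack {
    /// Release written by the most recent unresolved mupdate, if any.
    unresolved_mupdate: Option<u32>,
    /// Target release most recently set by the operator.
    target: Option<u32>,
}

impl Rack {
    /// Record that support mupdated the rack to `release`.
    fn mupdate(&mut self, release: u32) {
        self.unresolved_mupdate = Some(release);
    }

    /// Models the recovery endpoint: any release is accepted while a
    /// mupdate is outstanding, but the mupdate is only resolved when the
    /// release matches what was actually deployed.
    fn set_target_for_recovery(&mut self, release: u32) -> Result<(), &'static str> {
        let deployed = self
            .unresolved_mupdate
            .ok_or("no mupdate to recover from")?;
        self.target = Some(release);
        if release == deployed {
            self.unresolved_mupdate = None;
        }
        Ok(())
    }

    fn needs_recovery(&self) -> bool {
        self.unresolved_mupdate.is_some()
    }
}

fn main() {
    let mut rack = Rack::default();
    rack.mupdate(17);

    // The mistake: 18 is accepted, but the rack still needs recovery.
    rack.set_target_for_recovery(18).unwrap();
    println!("after 18: needs_recovery = {}", rack.needs_recovery());

    // The correction: 17 matches the deployed release, so recovery completes.
    rack.set_target_for_recovery(17).unwrap();
    println!("after 17: needs_recovery = {}", rack.needs_recovery());
}
```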


david-crespo · 2026-02-19T23:41:29Z

Schema diff. Very simple, nice that they take the same params. Do you think I should expose this functionality in the console? Probably not, right?

--- a/2026021301.0.0-6e51ab/spec.json
+++ b/2026021800.0.0-38e767/spec.json
@@ -7,7 +7,7 @@
       "url": "https://oxide.computer",
       "email": "api@oxide.computer"
     },
-    "version": "2026021301.0.0"
+    "version": "2026021800.0.0"
   },
   "paths": {
     "/device/auth": {
@@ -12383,6 +12383,35 @@
         }
       }
     },
+    "/v1/system/update/target-release/recovery": {
+      "put": {
+        "tags": ["system/update"],
+        "summary": "Recover from an Oxide-support-driven system update",
+        "description": "Inform the control plane of the release of the rack's system software it is now running due to a recovery operation (\"mupdate\") performed by Oxide support.\n\nThis endpoint should only be called at the direction of Oxide support.",
+        "operationId": "target_release_update_recovery",
+        "requestBody": {
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/SetTargetReleaseParams"
+              }
+            }
+          },
+          "required": true
+        },
+        "responses": {
+          "204": {
+            "description": "resource updated"
+          },
+          "4XX": {
+            "$ref": "#/components/responses/Error"
+          },
+          "5XX": {
+            "$ref": "#/components/responses/Error"
+          }
+        }
+      }
+    },
     "/v1/system/update/trust-roots": {
       "get": {
         "tags": ["system/update"],

jgallagher · 2026-02-20T15:18:42Z

Schema diff. Very simple, nice that they take the same params. Do you think I should expose this functionality in the console? Probably not, right?

Probably not, yeah. @ahl and I chatted about this a few weeks ago, and IIRC we wanted to tuck this operation somewhere out of the main path even in the CLI, since it should only be called after support performs a mupdate (and will fail if called any other time anyway).

davepacheco

(still going through nexus/src/app/deployment.rs but wanted to leave this before the watercooler)

nexus/external-api/src/lib.rs

davepacheco · 2026-02-24T21:21:09Z

+    /// This endpoint should only be called at the direction of Oxide support.
+    #[endpoint {
+        method = PUT,
+        path = "/v1/system/update/target-release/recovery",


I know we talked about doing this as a separate endpoint. But seeing the change, I wonder if it'd be clearer as a single endpoint with a new required field on the request body with this "intent" in it? I'm not that sure either way, to be honest.

It's interesting to me that I think this is a semantically breaking change to the existing endpoint, but this construction avoids actually breaking the wire protocol. If you kept one endpoint and added an intent, you'd be forced to make the explicit choice that in the older version, the intent is inferred to be Update. I like that, but I don't think it's a reason to prefer that approach.

It does feel goofy to have this recovery path underneath the other one. If we want a separate endpoint I'd consider PUT /v1/system/update/recovery_finish or something.
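
For comparison, the single-endpoint alternative with a required intent field might look roughly like the sketch below. Everything here is hypothetical: the type names, the `intent` field, and the `system_version` field name are assumptions for illustration, and none of this is part of the PR. The `From` impl is where the older API version's intent would have to be inferred explicitly as `Update`.

```rust
/// Hypothetical intent field for a single-endpoint design.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Intent {
    Update,
    MupdateRecovery,
}

/// Request body as accepted by the older API version (field name assumed).
struct ParamsOld {
    system_version: String,
}

/// Request body with the new required field.
#[derive(Debug, PartialEq)]
struct ParamsNew {
    system_version: String,
    intent: Intent,
}

impl From<ParamsOld> for ParamsNew {
    fn from(old: ParamsOld) -> Self {
        // Older clients could only have meant a normal update.
        ParamsNew { system_version: old.system_version, intent: Intent::Update }
    }
}

fn main() {
    let old = ParamsOld { system_version: "18.0.0".to_string() };
    let new: ParamsNew = old.into();
    println!("{:?}", new);

    // A newer client would have to state the intent on every call.
    let recovery = ParamsNew {
        system_version: "17.0.0".to_string(),
        intent: Intent::MupdateRecovery,
    };
    println!("{:?}", recovery);
}
```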

I think we should have two separate endpoints instead of adding a required intent field. My reasoning is that if someone is reading through the API to learn how to do this from an automation tool they're writing, they are going to become aware of the fact that the parameter is there and required, and that the only two options are "this is the normal one" and "don't use unless Oxide tells you to". Seems like a weird amount of attention to be paid to it.

I agree with moving the endpoint URL to be adjacent to /target-release and not under it.

I like two endpoints, same reason. Also agree with making it parallel. Didn't notice it was under the existing endpoint.

Kept as two endpoints, but moved to /v1/system/update/recovery-finish.

nexus/src/app/deployment.rs

davepacheco · 2026-02-24T22:43:50Z

+) -> Result<(), TargetReleaseChangeError> {
+    // We cannot update to the _identical_ version we're already at.
+    if proposed_new_version == current_version {
+        warn!(log, "cannot start update: attempt to update to current version");


I don't object to extra logging, but won't the message already show up in the Nexus log as the error_message on the 400-level response here?

Yeah; I put this in here so I could get better logs from the unit tests when they were still failing (which call these functions directly without going through the API). I could take the logs out now? I don't feel strongly either way.

I often feel that this sort of code path logging is hard to keep complete and not-noisy and I usually go for a single summary message in one spot. But I don't feel strongly either. There's no harm.

nexus/src/app/deployment.rs

davepacheco · 2026-02-24T22:49:33Z

+            // A mupdate has occurred; we must not allow an update.
+            warn!(
+                log,
+                "cannot start update: mupdate override in place";
+                "sled_id" => %sled_id,
+            );
+            return Err(TargetReleaseChangeError::WaitingForMupdateToBeCleared);


It's a little surprising that the (non-trivial) logic about determining whether a MUPdate is outstanding isn't shared between this function and validate_can_set_target_release_for_mupdate_recovery. Are these very different? (Is this just to account for the ambiguity in the case where both host OS slots report CurrentContents? In that case maybe a helper could report the ambiguity.)

I think it's kinda awkward because in this function, we also care about extracting the exact versions of the contents if they're not CurrentContents. Maybe I can extract a helper that does exactly that, though; I'll give it a shot.

I tried this in 3851bb9. It's pretty wordy, but probably still worth it overall?

I didn't include checking zones in the helper for two reasons:

We don't have a sled_config.in_service_zone_configs() method; we only have that at the blueprint level.

Writing a separate check_zone_config_for_mupdate() helper is pretty pointless, because it's just a match of the zone's image source.

I do think it's better.

Some other thoughts, again, take 'em or leave 'em: we don't really use this with ? so I'm not sure it's useful to use Result. I wonder if it'd be clearer, both within the function and in the callers, if it returned:

    enum SledMupdatePending {
        MupdatePending(SledMupdateDetectedHow),
        NoMupdatePending(os_versions),
    }

    enum SledMupdateDetectedHow {
        RemoveMupdateOverridePresent,
        BootDiskContents,
    }

or maybe even better?

    fn sled_update_status(
        sled_config: &BlueprintSledConfig,
        current_target_version: semver::Version,
    ) -> SledUpdateStatus { ... }

    enum SledUpdateStatus {
        HasUnresolvedMupdate(SledMupdateDetectedHow),
        RunningRequestedVersion, // `current_target_version` was found in at least one slot
        PreviousUpdatePending,   // `current_target_version` not found anywhere
    }

Basically, trying to put the tricky decision-making into one simpler, stateless function and the higher-level logic in the callers, which might then read more clearly.
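
A runnable sketch of that second shape follows, to make the "one stateless function" idea concrete. The enum names come from the suggestion above; everything else is an assumption: `SledConfig` is a much-reduced, hypothetical stand-in for `BlueprintSledConfig`, `SlotContents` is a simplified model of a host OS slot, versions are tuples rather than `semver::Version`, and the rule that "both slots report CurrentContents" indicates a mupdate is a guess at the intended logic rather than a description of what the PR does.

```rust
/// Simplified stand-in for semver::Version.
type Version = (u64, u64, u64);

/// Hypothetical simplification of one host OS slot's desired contents.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotContents {
    CurrentContents,
    Artifact(Version),
}

/// Hypothetical, much-reduced stand-in for BlueprintSledConfig.
struct SledConfig {
    remove_mupdate_override: bool,
    slot_a: SlotContents,
    slot_b: SlotContents,
}

#[derive(Debug, PartialEq)]
enum SledMupdateDetectedHow {
    RemoveMupdateOverridePresent,
    BootDiskContents,
}

#[derive(Debug, PartialEq)]
enum SledUpdateStatus {
    HasUnresolvedMupdate(SledMupdateDetectedHow),
    RunningRequestedVersion,
    PreviousUpdatePending,
}

/// Classify one sled without touching any shared state.
fn sled_update_status(
    sled: &SledConfig,
    current_target_version: Version,
) -> SledUpdateStatus {
    if sled.remove_mupdate_override {
        return SledUpdateStatus::HasUnresolvedMupdate(
            SledMupdateDetectedHow::RemoveMupdateOverridePresent,
        );
    }
    // Assumption: both slots reporting CurrentContents signals a mupdate.
    if sled.slot_a == SlotContents::CurrentContents
        && sled.slot_b == SlotContents::CurrentContents
    {
        return SledUpdateStatus::HasUnresolvedMupdate(
            SledMupdateDetectedHow::BootDiskContents,
        );
    }
    let target = SlotContents::Artifact(current_target_version);
    if sled.slot_a == target || sled.slot_b == target {
        SledUpdateStatus::RunningRequestedVersion
    } else {
        SledUpdateStatus::PreviousUpdatePending
    }
}

fn main() {
    let sled = SledConfig {
        remove_mupdate_override: false,
        slot_a: SlotContents::Artifact((17, 0, 0)),
        slot_b: SlotContents::Artifact((16, 0, 0)),
    };
    println!("{:?}", sled_update_status(&sled, (17, 0, 0)));
}
```

With a classifier like this, each caller reduces to a `match` on the returned status, which is where the update and recovery paths would differ.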

iliana · 2026-02-24T22:59:56Z

nexus/src/app/deployment.rs

+            TargetReleaseSource::Unspecified => {
+                // There is no current target release; it's always fine to
+                // set the first one.
+            }


Is this right? If you have a newly-provisioned control plane are you in mupdate-recovery state, or are we in a subtly different state where you can upload a new repo but you can't take certain actions that would require the presence of artifacts?

(I wanted to ask about the add sled flow but I think we mupdate the sled before adding it to the control plane. But in the future we've talked about wanting to have Nexus manage that recovery flow for a new sled, using the artifacts available to it.)

Is this right?

I hope so - I didn't change this behavior! I did move it - it used to be inlined in the HTTP endpoint handler.

If you have a newly-provisioned control plane are you in mupdate-recovery state, or are we in a subtly different state where you can upload a new repo but you can't take certain actions that would require the presence of artifacts?

We are in a subtly different state as far as the planner is concerned - it's willing to continue to add new zones, etc., all sourced out of ::InstallDataset, even though it doesn't know what version of software the rack is running. It assumes if no target release has ever been set, we're in a mupdate-only world.

(I wanted to ask about the add sled flow but I think we mupdate the sled before adding it to the control plane. But in the future we've talked about wanting to have Nexus manage that recovery flow for a new sled, using the artifacts available to it.)

I believe after a rack has had an online update, adding a mupdated sled will put the rack into "needs mupdate recovery" mode, and we'll need to hit this endpoint to move past that. But yeah in the fullness of time Nexus should manage that process, which will remove the need to do that.

davepacheco

This looks good!

I think it wouldn't hurt to get another set of eyes on it, given how tricky and important this is.


jgallagher added 6 commits February 19, 2026 14:43

add separate endpoint for "set target releaes for mupdate recovery"

2a3d2e6

implement distinct "set target release version" validity checking

b6a47d5

openapi

2cbcdf0

fix test_audit_log_coverage

fb65943

fix external API integration tests

bdc3bf2

tighten mupdate checks (both OS slots)

4f938f7

jgallagher requested review from davepacheco and iliana February 19, 2026 22:36

jgallagher added 2 commits February 19, 2026 18:34

Merge remote-tracking branch 'origin/main' into john/split-target-release-endpoint

99b83f4

cargo fmt

7b0481a

openapi after merging main

3f25760

davepacheco reviewed Feb 24, 2026

iliana reviewed Feb 24, 2026

jgallagher added 6 commits February 25, 2026 16:38

Merge branch 'main' into john/split-target-release-endpoint

8355609

external API: better docs, different path

8b5d617

major version -> schedule release; clearer comments

4362a3c

extract check_sled_config_for_mupdate() helper

3851bb9

change operation id

cabceeb

remove linkering references to "major versions"

c1d1390

davepacheco approved these changes Feb 25, 2026

