[db_metadata] Parse schema on migration, await backfill #9878

Open

smklein wants to merge 20 commits into main from migration-await-backfill

Conversation

@smklein (Collaborator) commented Feb 18, 2026

Non-atomic schema changes ARE real and CAN hurt you

  • Adds CRDB-specific preprocessing and sqlparser parsing to each schema change within Nexus
  • Verifies that each change contains "at most one" schema-modifying DDL statement
  • Adds a secondary transaction to each "apply schema change" step, which verifies that the underlying backfill operation has completed. The objective here is to transform "delayed, async DDL statements" into synchronous operations, so we can know "it succeeded fully" or "it failed" before moving on (a rough sketch of this shape is below)
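
As a rough illustration of that last bullet, here is the apply-then-verify shape this PR aims for. The type and function names below are invented for the sketch and are not the PR's actual API:

```rust
// Hedged sketch of the "apply, then verify" flow; SchemaStep, SqlRunner, and
// apply_schema_change are illustrative names, not the PR's actual types.
struct SchemaStep {
    /// The migration SQL itself; in CRDB this may kick off an async backfill.
    sql: String,
    /// Follow-up query that only succeeds once the backfill has completed.
    verify_sql: Option<String>,
}

trait SqlRunner {
    fn execute(&mut self, sql: &str) -> Result<(), String>;
}

fn apply_schema_change(conn: &mut dyn SqlRunner, step: &SchemaStep) -> Result<(), String> {
    // Apply the change. With IF NOT EXISTS-style statements, a concurrent
    // Nexus may have already started the backfill, making this a no-op.
    conn.execute(&step.sql)?;

    // Verify in a separate step. If this fails, the error propagates and the
    // outer startup retry loop re-attempts the whole migration.
    if let Some(verify) = &step.verify_sql {
        conn.execute(verify)?;
    }
    Ok(())
}
```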

Part of #9866
Fixes #9888

@david-crespo (Contributor)

Wild, but also way simpler than I expected.

/// verification queries.
///
/// If verification fails, the error propagates immediately — the outer
/// startup retry loop will re-attempt the entire migration.
@david-crespo (Contributor) commented Feb 18, 2026

It's worth spelling out (at least in the PR, if not in the comment) exactly how this prevents the problem we saw. What prevents this from running into the same error over and over upon re-running the migration?

@smklein (Collaborator, Author)

Without @sunshowers' fix for #9874, this would continue failing over and over again.

This PR is intended to prevent Nexuses from "crossing the barrier" of an individual schema change (basically, to help us treat schema changes as synchronous, atomic steps again). It makes no changes to max memory/batch sizes, nor to the retry behavior.

I can update the PR description to make this more clear!

/// detect them via regex here, in the same pass that removes them.
///
/// This is **only for classification** — the original SQL is always what
/// gets executed against the database.
Contributor

Calling out this line for other reviewers — important!

// Both are valid PostgreSQL, but sqlparser only handles the
// bracket notation. We match `ARRAY` followed by `,` or end-of-
// definition (not `ARRAY[` which is the array literal syntax).
let type_array_re = Regex::new(r"(?i)(\w+)\s+ARRAY\s*([,)\n])").unwrap();
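
For context (not part of the hunk above): a minimal sketch of the rewrite this preprocessing performs. Only the regex appears in the hunk, so the replacement string here is an assumption:

```rust
use regex::Regex;

// Illustrative only: rewrite the `<type> ARRAY` keyword form into the
// `<type>[]` bracket form that sqlparser accepts. Used purely for
// classification; the original SQL is what actually runs.
fn rewrite_array_types(sql: &str) -> String {
    let type_array_re = Regex::new(r"(?i)(\w+)\s+ARRAY\s*([,)\n])").unwrap();
    // Assumed replacement: keep the type name and trailing delimiter, swap
    // the ARRAY keyword for brackets.
    type_array_re.replace_all(sql, "${1}[]${2}").to_string()
}

// e.g. rewrite_array_types("tags STRING ARRAY,") == "tags STRING[],"
```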
Contributor

This function is impressive, but honestly turns me off of the whole sqlparser plan entirely; I didn't realize it couldn't parse our schema.

Is it worth going back to the idea board for other ways to accomplish this?

@david-crespo (Contributor) commented Feb 18, 2026

I agree that it is wild, but I would vote against "throw it out" at this point on the grounds that the main failure mode here is a test failure due to SQL not parsing. The scarier failure mode is a modification significant enough to be classified wrong, but I feel like we should be able to rule that out decently well with testing. I also wonder if it's worth considering a different parser like https://crates.io/crates/pg_query (much lower download count but still seems legit) since it's just a dev dep. I'm having the robot try it.

Contributor

Probably not worth it — sounds like we might have to do actual work to get this to work on illumos — but I'll give it a shot and see how it looks.


@smklein (Collaborator, Author)

I'm currently working on eliminating a chunk of the regex by just pulling in a more recent version of sqlparser. Will look into the difficulty of creating our own dialect next.

Contributor

It's a lot better now. Meanwhile $10 later my pg_query attempt is crap.

@smklein (Collaborator, Author)

I'm gonna try to tweak this further to basically do the following:

  1. Try to remove all DML
  2. Do the crdb-specific pre-processing, but only for the DDL which could have backfilled operations
  3. Treat all other DDL as "Some other DDL", which we don't pass to sqlparser

This should reduce the regex usage further, and hopefully get it into a more bearable territory. The amount of crdb-specific regex will scale with the number of "actually async backfilling DDL" operations we're catching, but maybe that will be tolerable enough to let us punt on a CRDB-specific SQL parser a little longer... (a rough sketch of this partitioning follows)
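
As a hedged sketch of that partitioning (invented names; the real logic would inspect parsed statements rather than prefixes):

```rust
// Illustrative only: a coarse, prefix-based stand-in for the partitioning
// described above. The point is that only potentially-backfilling DDL ever
// reaches sqlparser; everything else is bucketed without parsing.
enum StatementClass {
    /// DDL that can trigger an async CRDB backfill (index creation,
    /// constraint addition, etc.): handed to sqlparser for classification.
    BackfillingDdl,
    /// Any other DDL: tracked, but never passed to sqlparser.
    OtherDdl,
    /// Data fixups (INSERT/UPDATE/DELETE): stripped out first.
    Dml,
}

fn classify(stmt: &str) -> StatementClass {
    let s = stmt.trim_start().to_ascii_uppercase();
    if s.starts_with("INSERT") || s.starts_with("UPDATE") || s.starts_with("DELETE") {
        StatementClass::Dml
    } else if s.starts_with("CREATE INDEX")
        || s.starts_with("CREATE UNIQUE INDEX")
        || s.starts_with("ALTER TABLE")
    {
        // Over-approximates: not every ALTER TABLE backfills, but anything
        // that might is worth parsing.
        StatementClass::BackfillingDdl
    } else {
        StatementClass::OtherDdl
    }
}
```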

@smklein (Collaborator, Author)

> Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.

I believe this is true

@smklein (Collaborator, Author)

I think this latest version strips the regex about as far as it can go, by partitioning the world into "DDL we care about parsing" and "Other DDL which we aren't gonna bother parsing at all".

@jgallagher (Contributor) commented Feb 19, 2026

> Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.
>
> I believe this is true

I like this idea a lot. If we had something like an expectorate test that this code filled in to maintain a static list of migration steps that need particular async-backfill follow-up, we'd get:

  • SQL parsing out of prod and into tests - prod can read a compiled form of the static list
  • A chance to confirm in every PR review that the parser is catching all the new indices being created (I wouldn't want to rely on this at all - that's the whole point of tooling - but it seems good that there's an opportunity for us to review)

@smklein (Collaborator, Author)

Alright, I went ahead and did that, and updated the docs. sqlparser and regex are now dev-dependencies, and we have an EXPECTORATE test that generates .verify.sql files which sit alongside the migrations.

This also retroactively added checks for all migrations, which I figured would be worthwhile for perusal. However, if we want me to drop that generation (e.g., only generate the .verify.sql files for relatively new migrations) I could do that too.
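
A hedged sketch of what that EXPECTORATE-style check could look like; the helper names and file layout below are illustrative and may not match the PR:

```rust
// Sketch of an expectorate-backed snapshot test: regenerate the verification
// SQL from each migration's DDL and compare it to the checked-in file.
#[cfg(test)]
mod tests {
    #[test]
    fn verify_sql_is_up_to_date() {
        for migration in list_migrations() {
            // Re-derive the verification SQL from the migration's DDL...
            let generated = generate_verify_sql(&migration.up_sql);
            // ...and compare against the checked-in `.verify.sql` file.
            // Running with EXPECTORATE=overwrite rewrites the file instead.
            expectorate::assert_contents(
                migration.dir.join("up.verify.sql"),
                &generated,
            );
        }
    }

    // Stubs so the sketch is self-contained; the real helpers live in Nexus.
    struct Migration {
        dir: std::path::PathBuf,
        up_sql: String,
    }
    fn list_migrations() -> Vec<Migration> {
        Vec::new()
    }
    fn generate_verify_sql(_ddl: &str) -> String {
        String::new()
    }
}
```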

@smklein force-pushed the migration-await-backfill branch from f792e27 to 0bf9368 on February 18, 2026 21:48
@smklein changed the base branch from main to sqlparser on February 18, 2026 21:48
@smklein marked this pull request as ready for review on February 20, 2026 15:34
Comment on lines +2452 to +2454
// ---------------------------------------------------------------
// Integration tests for schema change verification
// ---------------------------------------------------------------
@smklein (Collaborator, Author)

I tried to include a fair number of unit tests in this PR, but this sequence of integration tests is pretty dang important for the behavior we're trying to test.

These cover the verification queries:

  • CREATE INDEX IF NOT EXISTS
    • Checking that the verification query passes/fails if the index exists/doesn't exist.
    • Checking that concurrent invocations of index creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.
  • ALTER TABLE ADD CONSTRAINT
    • Checking that the verification query passes/fails if the constraint exists/doesn't exist
    • Checking that concurrent invocations of constraint creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.

Additionally, these cover some DDL changes that have backfill, but don't need verification queries:

  • ADD COLUMN IF NOT EXISTS ... NOT NULL DEFAULT
    • Confirms that concurrent callers block each other out, so no verification query is needed
  • ALTER COLUMN ... SET NOT NULL
    • Confirms that concurrent callers block each other out, so no verification query is needed

(an earlier revision of this PR emitted verification queries for these last two cases, but that has since been removed, as they are not necessary)
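
For a concrete sense of the race the first pair of tests exercises, here is an illustrative sketch; the table/index names and the exact verification query are invented, not taken from the PR:

```rust
// Illustrative only: if Nexus A has already started the index build, Nexus
// B's statement is an IF NOT EXISTS no-op, so B would otherwise proceed
// without waiting for the backfill; the verification query is the backstop.
const CREATE_IDX: &str =
    "CREATE INDEX IF NOT EXISTS lookup_widget_by_name ON widget (name)";

// One plausible shape for the backstop: a lookup whose result Nexus can
// check to confirm the index actually exists.
const VERIFY_IDX: &str = "SELECT 1 FROM pg_indexes \
    WHERE tablename = 'widget' AND indexname = 'lookup_widget_by_name'";
```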

Base automatically changed from sqlparser to main on February 23, 2026 19:11
///
/// Permits only `[a-zA-Z0-9_]` characters, which covers all valid
/// SQL identifiers and our enum variant values.
#[doc(hidden)]
@smklein (Collaborator, Author)

This struct, and SchemaChangeInfo, are pretty much only used at test-time, but they're used by multiple test modules, so it's handy to have them exposed publicly. I could probably use some cargo manipulation to do a "testing" feature flag, but for the moment I just opted to "use them in cfg(test) stuff, and keep them doc(hidden)".

// NOT VALID constraints are synchronous — no async
// validation job runs, so there is no race to guard
// against and no verification query is needed.
if *not_valid {
@smklein (Collaborator, Author)

We have exactly one migration which adds a constraint with the NOT VALID keyword (it was one of yours, @david-crespo!), but it seems like a legit use case, so I added it here.
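
For illustration (names invented, not the migration in question):

```rust
// With NOT VALID, the constraint is recorded without scanning existing rows,
// so there is no async validation job to wait for and no verification query
// is emitted for it.
const ADD_NOT_VALID_CONSTRAINT: &str = "ALTER TABLE widget \
    ADD CONSTRAINT widget_name_nonempty CHECK (length(name) > 0) NOT VALID";
```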

Contributor

Probably because only Claude knows about these arcane SQL features. I would never have come up with it myself.

///
/// Returns trimmed, non-empty statement fragments with comments
/// stripped.
fn split_and_strip_sql(sql: &str) -> Vec<String> {
@smklein (Collaborator, Author)

This was "strip comments" and "split statements" as two separate functions, but they kinda need to know about each other.

I know there's a ton of string manipulation going on here, but:

  • I have a lot of tests for it
  • I added proptests for it
  • It'll only be running at test-time (EXPECTORATE time) anyway
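
A rough sketch of the combined pass, not the PR's implementation: it handles only `--` line comments and single-quoted strings, but it shows why the splitter and the comment stripper need to cooperate, since a `;` or `--` inside a string literal must not end a statement or start a comment.

```rust
fn split_and_strip_sql(sql: &str) -> Vec<String> {
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut chars = sql.chars().peekable();
    let mut in_string = false;
    while let Some(c) = chars.next() {
        match c {
            '\'' => {
                // Track single-quoted string literals so separators and
                // comment markers inside them are ignored.
                in_string = !in_string;
                current.push(c);
            }
            '-' if !in_string && chars.peek() == Some(&'-') => {
                // Drop everything up to the end of the `--` line comment.
                while let Some(c2) = chars.next() {
                    if c2 == '\n' {
                        current.push('\n');
                        break;
                    }
                }
            }
            ';' if !in_string => {
                // Statement boundary: emit the trimmed fragment if non-empty.
                let stmt = current.trim();
                if !stmt.is_empty() {
                    statements.push(stmt.to_string());
                }
                current.clear();
            }
            _ => current.push(c),
        }
    }
    let tail = current.trim();
    if !tail.is_empty() {
        statements.push(tail.to_string());
    }
    statements
}
```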

@smklein (Collaborator, Author) commented Feb 24, 2026

Merged with #9889, now using the max_sql_memory_mib argument. Thanks @sunshowers



Development

Successfully merging this pull request may close these issues.

Asynchronous CRDB schema migrations can break our migration engine
