[db_metadata] Parse schema on migration, await backfill #9878

Open

smklein wants to merge 20 commits into main from migration-await-backfill

Conversation

@smklein (Collaborator) commented Feb 18, 2026

Non-atomic schema changes ARE real and CAN hurt you

  • Adds CRDB-specific preprocessing and sqlparser parsing to each schema change within Nexus
  • Verifies that each change contains "at most one" schema-modifying DDL statement
  • Adds a secondary transaction to each "apply schema change" step, which verifies that the underlying backfill operation has completed. The objective here is to transform "delayed, async DDL statements" into synchronous operations, so we can know "it succeeded fully" or "it failed" before moving on (a rough sketch of this shape is below)
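
As a rough illustration of that last bullet, here is the apply-then-verify shape this PR aims for. The type and function names below are invented for the sketch and are not the PR's actual API:

```rust
// Hedged sketch of the "apply, then verify" flow; SchemaStep, SqlRunner, and
// apply_schema_change are illustrative names, not the PR's actual types.
struct SchemaStep {
    /// The migration SQL itself; in CRDB this may kick off an async backfill.
    sql: String,
    /// Follow-up query that only succeeds once the backfill has completed.
    verify_sql: Option<String>,
}

trait SqlRunner {
    fn execute(&mut self, sql: &str) -> Result<(), String>;
}

fn apply_schema_change(conn: &mut dyn SqlRunner, step: &SchemaStep) -> Result<(), String> {
    // Apply the change. With IF NOT EXISTS-style statements, a concurrent
    // Nexus may have already started the backfill, making this a no-op.
    conn.execute(&step.sql)?;

    // Verify in a separate step. If this fails, the error propagates and the
    // outer startup retry loop re-attempts the whole migration.
    if let Some(verify) = &step.verify_sql {
        conn.execute(verify)?;
    }
    Ok(())
}
```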

Part of #9866
Fixes #9888

@david-crespo (Contributor)

Wild, but also way simpler than I expected.

/// verification queries.
///
/// If verification fails, the error propagates immediately — the outer
/// startup retry loop will re-attempt the entire migration.
@david-crespo (Contributor) commented Feb 18, 2026

It's worth spelling out (at least in the PR, if not in the comment) exactly how this prevents the problem we saw. What prevents this from running into the same error over and over upon re-running the migration?

@smklein (Collaborator, Author)

Without @sunshowers' fix for #9874, this would continue failing over and over again.

This PR is intended to prevent Nexuses from "crossing the barrier" of an individual schema change (basically, to help us treat schema changes as synchronous, atomic steps again). It makes no changes to max memory/batch sizes, nor to the retry behavior.

I can update the PR description to make this more clear!

/// detect them via regex here, in the same pass that removes them.
///
/// This is **only for classification** — the original SQL is always what
/// gets executed against the database.
Contributor

Calling out this line for other reviewers — important!

// Both are valid PostgreSQL, but sqlparser only handles the
// bracket notation. We match `ARRAY` followed by `,` or end-of-
// definition (not `ARRAY[` which is the array literal syntax).
let type_array_re = Regex::new(r"(?i)(\w+)\s+ARRAY\s*([,)\n])").unwrap();
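
For context (not part of the hunk above): a minimal sketch of the rewrite this preprocessing performs. Only the regex appears in the hunk, so the replacement string here is an assumption:

```rust
use regex::Regex;

// Illustrative only: rewrite the `<type> ARRAY` keyword form into the
// `<type>[]` bracket form that sqlparser accepts. Used purely for
// classification; the original SQL is what actually runs.
fn rewrite_array_types(sql: &str) -> String {
    let type_array_re = Regex::new(r"(?i)(\w+)\s+ARRAY\s*([,)\n])").unwrap();
    // Assumed replacement: keep the type name and trailing delimiter, swap
    // the ARRAY keyword for brackets.
    type_array_re.replace_all(sql, "${1}[]${2}").to_string()
}

// e.g. rewrite_array_types("tags STRING ARRAY,") == "tags STRING[],"
```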
Contributor

This function is impressive, but honestly turns me off of the whole sqlparser plan entirely; I didn't realize it couldn't parse our schema.

Is it worth going back to the idea board for other ways to accomplish this?

@david-crespo (Contributor) commented Feb 18, 2026

I agree that it is wild, but I would vote against "throw it out" at this point on the grounds that the main failure mode here is a test failure due to SQL not parsing. The scarier failure mode is a modification significant enough to be classified wrong, but I feel like we should be able to rule that out decently well with testing. I also wonder if it's worth considering a different parser like https://crates.io/crates/pg_query (much lower download count but still seems legit) since it's just a dev dep. I'm having the robot try it.

Contributor

Probably not worth it — sounds like we might have to do actual work to get this to work on illumos — but I'll give it a shot and see how it looks.


@smklein (Collaborator, Author)

I'm currently working on eliminating a chunk of the regex by just pulling in a more recent version of sqlparser. Will look into the difficulty of creating our own dialect next.

Contributor

It's a lot better now. Meanwhile $10 later my pg_query attempt is crap.

@smklein (Collaborator, Author)

I'm gonna try to tweak this further to basically do the following:

  1. Try to remove all DML
  2. Do the crdb-specific pre-processing, but only for the DDL which could have backfilled operations
  3. Treat all other DDL as "Some other DDL", which we don't pass to sqlparser

This should reduce the regex usage further, and hopefully get it into a more bearable territory. The amount of crdb-specific regex will scale with the number of "actually async backfilling DDL" operations we're catching, but maybe that will be tolerable enough to let us punt on a CRDB-specific SQL parser a little longer... (a rough sketch of this partitioning follows)
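
As a hedged sketch of that partitioning (invented names; the real logic would inspect parsed statements rather than prefixes):

```rust
// Illustrative only: a coarse, prefix-based stand-in for the partitioning
// described above. The point is that only potentially-backfilling DDL ever
// reaches sqlparser; everything else is bucketed without parsing.
enum StatementClass {
    /// DDL that can trigger an async CRDB backfill (index creation,
    /// constraint addition, etc.): handed to sqlparser for classification.
    BackfillingDdl,
    /// Any other DDL: tracked, but never passed to sqlparser.
    OtherDdl,
    /// Data fixups (INSERT/UPDATE/DELETE): stripped out first.
    Dml,
}

fn classify(stmt: &str) -> StatementClass {
    let s = stmt.trim_start().to_ascii_uppercase();
    if s.starts_with("INSERT") || s.starts_with("UPDATE") || s.starts_with("DELETE") {
        StatementClass::Dml
    } else if s.starts_with("CREATE INDEX")
        || s.starts_with("CREATE UNIQUE INDEX")
        || s.starts_with("ALTER TABLE")
    {
        // Over-approximates: not every ALTER TABLE backfills, but anything
        // that might is worth parsing.
        StatementClass::BackfillingDdl
    } else {
        StatementClass::OtherDdl
    }
}
```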

@smklein (Collaborator, Author)

> Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.

I believe this is true

@smklein (Collaborator, Author)

I think this latest version strips the regex about as far as it can go, by partitioning the world into "DDL we care about parsing" and "Other DDL which we aren't gonna bother parsing at all".

@jgallagher (Contributor) commented Feb 19, 2026

> Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.
>
> I believe this is true

I like this idea a lot. If we had something like an expectorate test that this code filled in to maintain a static list of migration steps that need particular async-backfill follow-up, we'd get:

  • SQL parsing out of prod and into tests - prod can read a compiled form of the static list
  • A chance to confirm in every PR review that the parser is catching all the new indices being created (I wouldn't want to rely on this at all - that's the whole point of tooling - but it seems good that there's an opportunity for us to review)

@smklein (Collaborator, Author)

Alright, I went ahead and did that, and updated the docs. sqlparser and regex are now dev-dependencies, and we have an EXPECTORATE test that generates .verify.sql files which sit alongside the migrations.

This also retroactively added checks for all migrations, which I figured would be worthwhile for perusal. However, if we want me to drop that generation (e.g., only generate the .verify.sql files for relatively new migrations) I could do that too.
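
A hedged sketch of what that EXPECTORATE-style check could look like; the helper names and file layout below are illustrative and may not match the PR:

```rust
// Sketch of an expectorate-backed snapshot test: regenerate the verification
// SQL from each migration's DDL and compare it to the checked-in file.
#[cfg(test)]
mod tests {
    #[test]
    fn verify_sql_is_up_to_date() {
        for migration in list_migrations() {
            // Re-derive the verification SQL from the migration's DDL...
            let generated = generate_verify_sql(&migration.up_sql);
            // ...and compare against the checked-in `.verify.sql` file.
            // Running with EXPECTORATE=overwrite rewrites the file instead.
            expectorate::assert_contents(
                migration.dir.join("up.verify.sql"),
                &generated,
            );
        }
    }

    // Stubs so the sketch is self-contained; the real helpers live in Nexus.
    struct Migration {
        dir: std::path::PathBuf,
        up_sql: String,
    }
    fn list_migrations() -> Vec<Migration> {
        Vec::new()
    }
    fn generate_verify_sql(_ddl: &str) -> String {
        String::new()
    }
}
```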

@smklein force-pushed the migration-await-backfill branch from f792e27 to 0bf9368 on February 18, 2026 21:48
@smklein changed the base branch from main to sqlparser on February 18, 2026 21:48
@smklein marked this pull request as ready for review on February 20, 2026 15:34
Comment on lines +2452 to +2454
// ---------------------------------------------------------------
// Integration tests for schema change verification
// ---------------------------------------------------------------
@smklein (Collaborator, Author)

I tried to include a fair number of unit tests in this PR, but this sequence of integration tests is pretty dang important for the behavior we're trying to test.

These cover the verification queries:

  • CREATE INDEX IF NOT EXISTS
    • Checking that the verification query passes/fails if the index exists/doesn't exist.
    • Checking that concurrent invocations of index creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.
  • ALTER TABLE ADD CONSTRAINT
    • Checking that the verification query passes/fails if the constraint exists/doesn't exist
    • Checking that concurrent invocations of constraint creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.

Additionally, these cover some DDL changes that have backfill, but don't need verification queries:

  • ADD COLUMN IF NOT EXISTS ... NOT NULL DEFAULT
    • Confirms that concurrent callers block each other out, so no verification query is needed
  • ALTER COLUMN ... SET NOT NULL
    • Confirms that concurrent callers block each other out, so no verification query is needed

(an earlier revision of this PR emitted verification queries for these last two cases, but that has since been removed, as they are not necessary)
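
For a concrete sense of the race the first pair of tests exercises, here is an illustrative sketch; the table/index names and the exact verification query are invented, not taken from the PR:

```rust
// Illustrative only: if Nexus A has already started the index build, Nexus
// B's statement is an IF NOT EXISTS no-op, so B would otherwise proceed
// without waiting for the backfill; the verification query is the backstop.
const CREATE_IDX: &str =
    "CREATE INDEX IF NOT EXISTS lookup_widget_by_name ON widget (name)";

// One plausible shape for the backstop: a lookup whose result Nexus can
// check to confirm the index actually exists.
const VERIFY_IDX: &str = "SELECT 1 FROM pg_indexes \
    WHERE tablename = 'widget' AND indexname = 'lookup_widget_by_name'";
```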

Base automatically changed from sqlparser to main on February 23, 2026 19:11
///
/// Permits only `[a-zA-Z0-9_]` characters, which covers all valid
/// SQL identifiers and our enum variant values.
#[doc(hidden)]
@smklein (Collaborator, Author)

This struct, and SchemaChangeInfo, are pretty much only used at test-time, but they're used by multiple test modules, so it's handy to have them exposed publicly. I could probably use some cargo manipulation to do a "testing" feature flag, but for the moment I just opted to "use them in cfg(test) stuff, and keep them doc(hidden)".

// NOT VALID constraints are synchronous — no async
// validation job runs, so there is no race to guard
// against and no verification query is needed.
if *not_valid {
@smklein (Collaborator, Author)

We have exactly one migration which adds a constraint with the NOT VALID keyword (it was one of yours, @david-crespo!), but it seems like a legit use case, so I added it here.
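
For illustration (names invented, not the migration in question):

```rust
// With NOT VALID, the constraint is recorded without scanning existing rows,
// so there is no async validation job to wait for and no verification query
// is emitted for it.
const ADD_NOT_VALID_CONSTRAINT: &str = "ALTER TABLE widget \
    ADD CONSTRAINT widget_name_nonempty CHECK (length(name) > 0) NOT VALID";
```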

Contributor

Probably because only Claude knows about these arcane SQL features. I would never have come up with it myself.

///
/// Returns trimmed, non-empty statement fragments with comments
/// stripped.
fn split_and_strip_sql(sql: &str) -> Vec<String> {
@smklein (Collaborator, Author)

This was "strip comments" and "split statements" as two separate functions, but they kinda need to know about each other.

I know there's a ton of string manipulation going on here, but:

  • I have a lot of tests for it
  • I added proptests for it
  • It'll only be running at test-time (EXPECTORATE time) anyway
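
A rough sketch of the combined pass, not the PR's implementation: it handles only `--` line comments and single-quoted strings, but it shows why the splitter and the comment stripper need to cooperate, since a `;` or `--` inside a string literal must not end a statement or start a comment.

```rust
fn split_and_strip_sql(sql: &str) -> Vec<String> {
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut chars = sql.chars().peekable();
    let mut in_string = false;
    while let Some(c) = chars.next() {
        match c {
            '\'' => {
                // Track single-quoted string literals so separators and
                // comment markers inside them are ignored.
                in_string = !in_string;
                current.push(c);
            }
            '-' if !in_string && chars.peek() == Some(&'-') => {
                // Drop everything up to the end of the `--` line comment.
                while let Some(c2) = chars.next() {
                    if c2 == '\n' {
                        current.push('\n');
                        break;
                    }
                }
            }
            ';' if !in_string => {
                // Statement boundary: emit the trimmed fragment if non-empty.
                let stmt = current.trim();
                if !stmt.is_empty() {
                    statements.push(stmt.to_string());
                }
                current.clear();
            }
            _ => current.push(c),
        }
    }
    let tail = current.trim();
    if !tail.is_empty() {
        statements.push(tail.to_string());
    }
    statements
}
```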

@smklein (Collaborator, Author) commented Feb 24, 2026

Merged with #9889, now using the max_sql_memory_mib argument. Thanks @sunshowers



Development

Successfully merging this pull request may close these issues.

Asynchronous CRDB schema migrations can break our migration engine
