[db_metadata] Parse schema on migration, await backfill #9878
Conversation
Wild, but also way simpler than I expected.
/// verification queries.
///
/// If verification fails, the error propagates immediately — the outer
/// startup retry loop will re-attempt the entire migration.
It's worth spelling out (at least in the PR, if not in the comment) exactly how this prevents the problem we saw. What prevents this from running into the same error over and over upon re-running the migration?
Without @sunshowers' fix for #9874, this would continue failing over and over again.
This PR is intended to prevent Nexuses from "crossing the barrier" of an individual schema change (basically: trying to help us treat them as synchronous atomic steps again). It makes no changes to "max memory/batch sizes", nor to attempted retries.
I can update the PR description to make this more clear!
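The control flow being discussed could be sketched roughly like this. All names below are hypothetical stand-ins, not the PR's actual types; the point is just that a schema change's DDL plus its verification queries form one unit, and any failure bubbles out to the startup retry loop, which re-attempts the whole step:

```rust
// Illustrative sketch only; MigrationStep, execute_sql, and verify
// are invented names standing in for the real database machinery.
struct MigrationStep {
    ddl: &'static str,
    verification_queries: Vec<&'static str>,
}

// Stand-ins for real database calls.
fn execute_sql(_sql: &str) -> Result<(), String> {
    Ok(())
}

fn verify(query: &str) -> Result<(), String> {
    if query.contains("missing_index") {
        Err(format!("verification failed: {query}"))
    } else {
        Ok(())
    }
}

fn apply_step(step: &MigrationStep) -> Result<(), String> {
    execute_sql(step.ddl)?;
    for query in &step.verification_queries {
        // A failed verification aborts the step, so a Nexus never
        // "crosses the barrier" of a schema change whose backfill
        // it skipped.
        verify(query)?;
    }
    Ok(())
}

// The outer startup loop: on any error, re-run the entire step.
fn apply_with_retry(step: &MigrationStep, max_attempts: u32) -> Result<(), String> {
    let mut last = Err("never attempted".to_string());
    for _ in 0..max_attempts {
        last = apply_step(step);
        if last.is_ok() {
            break;
        }
    }
    last
}
```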
/// detect them via regex here, in the same pass that removes them.
///
/// This is **only for classification** — the original SQL is always what
/// gets executed against the database.
Calling out this line for other reviewers — important!
// Both are valid PostgreSQL, but sqlparser only handles the
// bracket notation. We match `ARRAY` followed by `,` or end-of-
// definition (not `ARRAY[` which is the array literal syntax).
let type_array_re = Regex::new(r"(?i)(\w+)\s+ARRAY\s*([,)\n])").unwrap();
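To make the transformation concrete: the regex above rewrites PostgreSQL's keyword form of array types (`TEXT ARRAY`) into the bracket form (`TEXT[]`) that sqlparser understands, while leaving the `ARRAY[...]` literal syntax alone. A dependency-free sketch of that rewrite (simplified: only handles a single space before `ARRAY`, and assumes ASCII SQL):

```rust
// Illustrative only, not the PR's regex-based code: rewrite
// `col TEXT ARRAY` into `col TEXT[]`, but leave `ARRAY[...]`
// (the array-literal syntax) untouched. Assumes ASCII input.
fn rewrite_array_types(sql: &str) -> String {
    let bytes = sql.as_bytes();
    let mut out = String::new();
    let mut i = 0;
    while i < bytes.len() {
        if i + 5 <= bytes.len() && sql[i..i + 5].eq_ignore_ascii_case("ARRAY") {
            let next = bytes.get(i + 5).copied();
            // Only rewrite when ARRAY ends the type definition:
            // followed by ',' / ')' / newline / end of input, not '['.
            let ends_def = matches!(next, None | Some(b',') | Some(b')') | Some(b'\n'));
            if ends_def && out.ends_with(' ') {
                out.pop(); // drop the space before ARRAY
                out.push_str("[]");
                i += 5;
                continue;
            }
        }
        out.push(bytes[i] as char);
        i += 1;
    }
    out
}
```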
This function is impressive, but honestly turns me off of the whole sqlparser plan entirely; I didn't realize it couldn't parse our schema.
Is it worth going back to the idea board for other ways to accomplish this?
I agree that it is wild, but I would vote against "throw it out" at this point on the grounds that the main failure mode here is a test failure due to SQL not parsing. The scarier failure mode is a modification significant enough to be classified wrong, but I feel like we should be able to rule that out decently well with testing. I also wonder if it's worth considering a different parser like https://crates.io/crates/pg_query (much lower download count but still seems legit) since it's just a dev dep. I'm having the robot try it.
I'm currently working on eliminating a chunk of the regex by just pulling in a more recent version of sqlparser. Will look into the difficulty of creating our own dialect next.
It's a lot better now. Meanwhile $10 later my pg_query attempt is crap.
I'm gonna try to tweak this further to basically do the following:
- Try to remove all DML
- Do the crdb-specific pre-processing, but only for the DDL which could have back-filled operations
- Treat all other DDL as "Some other DDL", which we don't pass to sqlparser
This should reduce the regex usage further, and hopefully get it into a more bearable territory. The amount of crdb-specific regex will scale with the amount of "actually async backfilling DDL" operations we're catching, but maybe that will be tolerable enough to let us punt on a CRDB-specific SQL parser a little longer...
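The partitioning described above might look something like this (a sketch with hypothetical names and deliberately coarse keyword matching, not the PR's real classifier):

```rust
// Rough sketch of the three-way partition: DML is removed, DDL that
// can trigger async backfill jobs in CockroachDB gets real parsing,
// and everything else is "some other DDL" that never reaches
// sqlparser. The keyword lists here are illustrative, not complete.
#[derive(Debug, PartialEq)]
enum StatementKind {
    Dml,
    BackfillDdl,
    OtherDdl,
}

fn classify(stmt: &str) -> StatementKind {
    let upper = stmt.trim_start().to_ascii_uppercase();
    if ["INSERT", "UPDATE", "DELETE", "UPSERT"]
        .iter()
        .any(|kw| upper.starts_with(*kw))
    {
        StatementKind::Dml
    } else if upper.starts_with("CREATE INDEX")
        || upper.starts_with("CREATE UNIQUE INDEX")
        || upper.starts_with("ALTER TABLE")
    {
        // These statement families may backfill asynchronously, so
        // they get the full crdb-specific pre-processing + parsing.
        StatementKind::BackfillDdl
    } else {
        StatementKind::OtherDdl
    }
}
```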
Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.
I believe this is true
I think this latest version strips the regex about as far as it can go, by partitioning the world into "DDL we care about parsing" and "Other DDL which we aren't gonna bother parsing at all".
I like this idea a lot. If we had something like an expectorate test that this code filled in, maintaining a static list of migration steps that need particular async backfill followup, we'd get:
- SQL parsing out of prod and into tests - prod can read a compiled form of the static list
- A chance to confirm in every PR review that the parser is catching all the new indices being created (I wouldn't want to rely on this at all - that's the whole point of tooling - but it seems good that there's an opportunity for us to review)
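The "classify at dev time, write it down, snapshot it" mechanic boils down to regenerating the classification and diffing it against a checked-in file. A sketch of that core (the names and the `Result`-returning shape are mine; the real expectorate crate panics on mismatch and takes its overwrite flag from the `EXPECTORATE=overwrite` environment variable):

```rust
// Sketch of an expectorate-style snapshot check. In the real crate
// a mismatch panics the test, and `overwrite` comes from setting
// EXPECTORATE=overwrite; here it's a plain parameter so both
// branches are easy to exercise.
use std::path::Path;

fn check_snapshot(path: &Path, actual: &str, overwrite: bool) -> Result<(), String> {
    let expected = std::fs::read_to_string(path).unwrap_or_default();
    if expected == actual {
        return Ok(());
    }
    if overwrite {
        // Dev-time path: rewrite the checked-in snapshot file.
        std::fs::write(path, actual).map_err(|e| e.to_string())?;
        Ok(())
    } else {
        // CI path: fail loudly so the stale snapshot gets reviewed.
        Err(format!("snapshot {} is out of date", path.display()))
    }
}
```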
Alright, I went ahead and did that, and updated the docs. sqlparser and regex are now dev-dependencies, and we have an EXPECTORATE test that generates .verify.sql files which sit alongside the migrations.
This also retroactively added checks for all migrations, which I figured would be worthwhile for perusal. However, if we want me to drop that generation (e.g., only generate the .verify.sql files for relatively new migrations) I could do that too.
// ---------------------------------------------------------------
// Integration tests for schema change verification
// ---------------------------------------------------------------
I tried to include a fair number of unit tests in this PR, but this sequence of integration tests is pretty dang important for the behavior we're trying to test.
These cover the verification queries:
- `CREATE INDEX IF NOT EXISTS`
  - Checking that the verification query passes/fails if the index exists/doesn't exist.
  - Checking that concurrent invocations of index creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.
- `ALTER TABLE ADD CONSTRAINT`
  - Checking that the verification query passes/fails if the constraint exists/doesn't exist.
  - Checking that concurrent invocations of constraint creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.

Additionally, these cover some DDL changes that have backfill, but don't need verification queries:
- `ADD COLUMN IF NOT EXISTS ... NOT NULL DEFAULT` - Confirms that concurrent callers block each other out, so no verification query is needed
- `ALTER COLUMN ... SET NOT NULL` - Confirms that concurrent callers block each other out, so no verification query is needed

(an earlier revision of this PR emitted verification queries for these last two cases, but that has since been removed, as they are not necessary)
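For a sense of what such a back-stop verification query might look like, here is a guess at the shape (not the PR's actual generated SQL; `pg_catalog.pg_indexes` is the PostgreSQL-compatible catalog view, and whether the real `.verify.sql` files query it is an assumption, as is the example index name):

```rust
// Hypothetical helper, not the PR's code: build a query that checks
// whether an index exists, as a back-stop after a concurrent caller
// may have caused `CREATE INDEX IF NOT EXISTS` to be "skipped".
fn index_exists_query(table: &str, index: &str) -> String {
    format!(
        "SELECT count(*) > 0 FROM pg_catalog.pg_indexes \
         WHERE tablename = '{table}' AND indexname = '{index}'"
    )
}
```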
///
/// Permits only `[a-zA-Z0-9_]` characters, which covers all valid
/// SQL identifiers and our enum variant values.
#[doc(hidden)]
This struct, and SchemaChangeInfo, are pretty much only used at test-time, but they're used by multiple test modules, so it's handy to have them exposed publicly. I could probably use some cargo manipulation to do a "testing" feature flag, but for the moment I just opted to "use them in cfg(test) stuff, and keep them doc(hidden)".
// NOT VALID constraints are synchronous — no async
// validation job runs, so there is no race to guard
// against and no verification query is needed.
if *not_valid {
We have exactly one migration which adds a constraint with the NOT VALID keyword (it was one of yours, @david-crespo!), but it seems like a legit use-case, so I added it here.
Probably because only Claude knows about these arcane SQL features. I would never have come up with it myself.
///
/// Returns trimmed, non-empty statement fragments with comments
/// stripped.
fn split_and_strip_sql(sql: &str) -> Vec<String> {
This was "strip comments" and "split statements" as two separate functions, but they kinda need to know about each other.
I know there's a ton of string manipulation going on here, but:
- I have a lot of tests for it
- I added proptests for it
- It'll only be running at test-time (EXPECTORATE time) anyway
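A deliberately naive sketch of what a combined "strip comments and split statements" helper can look like (this ignores everything the real, proptest-covered version has to handle, e.g. `--` sequences inside string literals and dollar-quoting):

```rust
// Naive illustration, not the PR's implementation: strip `--` line
// comments, split on `;`, trim, and drop empty fragments. Breaks on
// `--` inside string literals, which the real version must handle.
fn split_and_strip_sql(sql: &str) -> Vec<String> {
    sql.lines()
        .map(|line| match line.find("--") {
            Some(idx) => &line[..idx],
            None => line,
        })
        .collect::<Vec<_>>()
        .join("\n")
        .split(';')
        .map(|stmt| stmt.trim().to_string())
        .filter(|stmt| !stmt.is_empty())
        .collect()
}
```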
Merged with #9889, now using the

Non-atomic schema changes ARE real and CAN hurt you
Adds `sqlparser` parsing to each schema change within Nexus.

Part of #9866
Fixes #9888