
[crdb] set index backfill batch size to 5000, max-sql-memory to 256MiB#9889

Merged
sunshowers merged 5 commits into main from sunshowers/spr/crdb-set-index-backfill-batch-size-to-5000-max-sql-memory-to-256mib on Feb 24, 2026
Conversation

@sunshowers
Contributor

Per the analysis in https://github.com/oxidecomputer/omicron-9874-findings.

An interesting note is that we previously didn't set max-sql-memory at all, which meant we were relying on the CockroachDB default. On Linux that was 25% of system memory, but on illumos the detection fell back to a default of 128 MiB. So developers on Linux would effectively have very different SQL memory limits from those on illumos. With this change, the two OSes now have similar behavior.
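The Linux/illumos divergence described above can be sketched in a few lines. This is an illustrative model of the default behavior as described in this PR (25% of detected system memory, 128 MiB fallback), not CockroachDB's actual implementation:

```rust
/// Illustrative sketch (not CockroachDB's real code) of the default
/// --max-sql-memory behavior described above: 25% of detected system
/// memory, or a 128 MiB fallback when detection fails (the illumos path).
fn default_max_sql_memory(detected_system_bytes: Option<u64>) -> u64 {
    const FALLBACK: u64 = 128 << 20; // 128 MiB
    match detected_system_bytes {
        Some(total) => total / 4, // 25% of system memory (Linux path)
        None => FALLBACK,         // detection failed (illumos path)
    }
}

fn main() {
    // A 64 GiB Linux box gets a 16 GiB SQL memory budget...
    assert_eq!(default_max_sql_memory(Some(64 << 30)), 16 << 30);
    // ...while illumos fell back to 128 MiB.
    assert_eq!(default_max_sql_memory(None), 128 << 20);
}
```

Pinning the flag to 256MiB removes this OS-dependent spread entirely.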

Fixes #9874.

Created using spr 1.3.6-beta.1
Comment on lines +589 to +591
// See https://github.com/oxidecomputer/omicron-9874-findings for
// why we set the max SQL memory to be 256MiB.
.arg("--max-sql-memory=256MiB")
Collaborator

This is more a note-to-self: I tweak this value in #9878 too, but only for testing to make reproductions more consistent. We should merge this carefully.

(I'm sure we can figure something out, just calling it out early)

Contributor Author

Oh good catch, I guess we should make the max sql memory an option on the cockroach command builder.

Contributor Author

Done -- also added this to `cargo xtask db-dev run`
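One hypothetical shape for such a builder option (the type and method names below are illustrative, not omicron's actual API):

```rust
/// Hypothetical sketch of a cockroach command builder with an optional
/// max-sql-memory setting; names are illustrative, not omicron's API.
struct CockroachStartArgs {
    max_sql_memory: Option<String>,
}

impl CockroachStartArgs {
    fn new() -> Self {
        Self { max_sql_memory: None }
    }

    /// Opt into an explicit SQL memory budget (e.g. "256MiB").
    fn max_sql_memory(mut self, value: &str) -> Self {
        self.max_sql_memory = Some(value.to_string());
        self
    }

    /// Render the command-line arguments.
    fn build(&self) -> Vec<String> {
        let mut args = vec!["start-single-node".to_string()];
        if let Some(mem) = &self.max_sql_memory {
            args.push(format!("--max-sql-memory={mem}"));
        }
        args
    }
}

fn main() {
    let args = CockroachStartArgs::new().max_sql_memory("256MiB").build();
    assert!(args.contains(&"--max-sql-memory=256MiB".to_string()));
    // Callers that don't set it (e.g. a test wanting a different value)
    // simply omit the flag.
    assert!(!CockroachStartArgs::new()
        .build()
        .iter()
        .any(|a| a.starts_with("--max-sql-memory")));
}
```

Making it a builder option lets #9878 plug in its own value for reproduction testing without the two PRs stepping on each other.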

Comment on lines +893 to +898
similar_asserts::assert_eq!(
observed_cluster_settings,
expected_cluster_settings,
"Cluster settings do not match between migrations and dbinit.sql. \
If dbinit.sql contains a SET CLUSTER SETTING statement, there must \
be a corresponding migration (and vice versa)."
Collaborator

Real smart to include this as a check, good call!
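The kind of check being praised here can be sketched as a simple scan over the SQL text. This is a simplified, assumption-laden sketch (one statement per line, case-insensitive prefix match), not the actual implementation:

```rust
/// Simplified sketch (not the real check): collect SET CLUSTER SETTING
/// statements from a blob of SQL so the sets found in migrations and in
/// dbinit.sql can be compared. Assumes one statement per line.
fn cluster_setting_statements(sql: &str) -> Vec<String> {
    sql.lines()
        .map(str::trim)
        .filter(|line| {
            line.to_ascii_uppercase().starts_with("SET CLUSTER SETTING")
        })
        .map(|line| line.trim_end_matches(';').to_string())
        .collect()
}

fn main() {
    let dbinit = "CREATE TABLE t (id INT);\n\
                  SET CLUSTER SETTING bulkio.index_backfill.batch_size = 5000;";
    let found = cluster_setting_statements(dbinit);
    assert_eq!(found.len(), 1);
    assert_eq!(
        found[0],
        "SET CLUSTER SETTING bulkio.index_backfill.batch_size = 5000"
    );
}
```

Comparing the two collections with `similar_asserts::assert_eq!` (as the snippet above does) then produces a readable diff when a migration and dbinit.sql drift apart.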

}

/// Pragma comment that marks a migration step as non-transactional.
const NON_TRANSACTIONAL_PRAGMA: &str = "-- NON-TRANSACTIONAL";
Collaborator

I hope no one complains that the `--NON-TRANSACTIONAL` flag isn't working, if they don't realize the `--` is syntax for SQL comments and we're checking the space here 😅

(This is probably fine... but I am a little nervous that "almost correct" pragmas will silently do nothing)
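The failure mode being described can be illustrated with a tiny sketch (an exact-match check like this is an assumption for illustration, not the actual migration code):

```rust
/// Illustrative sketch of the worry above: an exact-match pragma check
/// means "almost correct" spellings are silently treated as ordinary
/// SQL comments. Not the actual migration code.
const NON_TRANSACTIONAL_PRAGMA: &str = "-- NON-TRANSACTIONAL";

fn is_non_transactional(sql: &str) -> bool {
    sql.lines().any(|line| line.trim() == NON_TRANSACTIONAL_PRAGMA)
}

fn main() {
    // The exact pragma is recognized...
    assert!(is_non_transactional("-- NON-TRANSACTIONAL\nSET ..."));
    // ...but drop the space after `--` and it silently does nothing.
    assert!(!is_non_transactional("--NON-TRANSACTIONAL\nSET ..."));
}
```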

Contributor

It sure looks like a comment in up.sql 😬

I assume this is not something we expect to come up often, given it's the first time we've needed it. Would it be unreasonable to build a static list of migration files that are nontransactional instead of embedding a special marker in the file itself? Maybe kinda janky, but I'm imagining something like

static NON_TRANSACTIONAL_UPGRADE_STEPS: &[&str] = &[
    "index-backfill-batch-size/up.sql",
];

Contributor Author

(This is probably fine... but I am a little nervous that "almost correct" pragmas will silently do nothing)

FWIW SET CLUSTER SETTING blows up quite loudly if in a (multi-statement) transaction.

I assume this is not something we expect to come up often, given it's the first time we've needed it. Would it be unreasonable to build a static list of migration files that are nontransactional instead of embedding a special marker in the file itself? Maybe kinda janky, but I'm imagining something like

I considered that but felt that was slightly worse than embedding this knowledge locally in the file. I don't have a very strong preference here though.

Contributor

I considered that but felt that was slightly worse than embedding this knowledge locally in the file. I don't have a very strong preference here though.

I don't love the "static list" suggestion I made, but I more strongly dislike these two bits about embedding the knowledge locally in the file:

  • We now have to check every schema migration step for a special magic comment (not worried about runtime overhead or anything; more the cognitive/complexity overhead - today we just read the file, and that seems like a nice property to maintain as long as we can)
  • It allows any schema migration check to opt into non-transactional mode

The latter one bugs me more, I think? Maybe this is paranoia, but I'm worried about someone having some problem that they can address by opting out of transactions (but ought to address some other way), and code review missing it because the opt-out "flag" is a comment in a migration step, which I think is pretty easy to overlook.

Static list directly addresses the first of these, and maybe makes the second less likely? Hopefully a big scary comment on the static list plus a comment-per-entry works as a wakeup in PR review, and anyone stumbling across that list after changes are merged should be able to easily tell which migration steps have this opt-out and why.

Collaborator

One downside of static lists, IMO:

  • Someone writes a migration with steps "01.sql, 02.sql, 03.sql"
  • They mark "02.sql" as "the one needing to be non-transactional"
  • They realize they need another change before that, so they add it, and rename all the files
  • Now the steps are "01.sql, 02.sql, 03.sql (this is the one which should be non-txn), 04.sql"
  • The static list still points to 02.sql as the non-transactional one

I guess this is sorta my argument about the danger of non-locality of labelling

Collaborator

Maybe bad idea: what about putting it in the filename somehow? It's more local than a static list but doesn't involve sidebanding control information in a comment. (The filename is still a sideband, but it feels less surprising that the filename could be semantically meaningful than a comment.) On the flip side: it seems potentially less visible in code review.

FWIW this is the approach I'm taking with #9878 - I'm auto-generating NN.verify.sql queries to live next to NN.sql files

Contributor Author

Hmm I would rather not encode this in the file name because this isn't an ancillary file the way NN.verify.sql is.

The solution @jgallagher and I came up with on the update watercooler today is:

  • rename `-- NON-TRANSACTIONAL` to `-- DANGER-NON-TRANSACTIONAL` -- make it semiotically not a place of honor
  • remove the documentation from the README, move it to a comment, with a clear warning sign

Thoughts?

Collaborator

Hmm I would rather not encode this in the file name because this isn't an ancillary file the way NN.verify.sql is.

The solution @jgallagher and I came up with on the update watercooler today is:

  • rename `-- NON-TRANSACTIONAL` to `-- DANGER-NON-TRANSACTIONAL` -- make it semiotically not a place of honor
  • remove the documentation from the README, move it to a comment, with a clear warning sign

Thoughts?

I'd also be fine with this, especially given your note that you used this to update a cluster setting, and if we somehow beef this process and accidentally make the SQL statement happen in a txn, the migration would blow up in tests.

Contributor Author

Ok I just put up a change to encode this in the filename.
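A hypothetical sketch of what filename-based encoding could look like; the naming scheme below (`upNN.sql` vs. `upNN.non-transactional.sql`) and the parser are illustrative assumptions, and the actual merged change may use a different convention:

```rust
/// Hypothetical sketch of encoding the non-transactional flag in the
/// filename; the actual naming scheme in the merged change may differ.
/// Here `upNN.sql` is transactional and `upNN.non-transactional.sql`
/// opts out. Returns (step number, is_non_transactional).
fn parse_up_sql(name: &str) -> Option<(u64, bool)> {
    let rest = name.strip_prefix("up")?;
    if let Some(num) = rest.strip_suffix(".non-transactional.sql") {
        Some((num.parse().ok()?, true))
    } else {
        let num = rest.strip_suffix(".sql")?;
        Some((num.parse().ok()?, false))
    }
}

fn main() {
    assert_eq!(parse_up_sql("up03.sql"), Some((3, false)));
    assert_eq!(parse_up_sql("up04.non-transactional.sql"), Some((4, true)));
    // Non-migration files are simply ignored.
    assert_eq!(parse_up_sql("readme.md"), None);
}
```

Compared with the magic comment, the opt-out is visible in the file listing and in every diff that touches the file.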

Collaborator

17d18fb looks good to me; I also think it's reasonable to keep this out of the README given how sketchy non-transactional operations are.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
// We have 2 or more `up*.sql`; they should be numbered
// exactly 1..=up_sqls.len().
-if i as u64 + 1 != *up_number {
+if i as u64 + 1 != parsed.up_number {
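The numbering invariant being discussed can be sketched as follows (an illustrative standalone check, not the actual omicron code):

```rust
/// Sketch of the numbering check discussed here (not the actual code):
/// with two or more `up*.sql` files, the sorted parsed numbers must be
/// exactly 1..=up_sqls.len(), which also rejects two files (such as a
/// transactional and a non-transactional step) claiming the same number.
fn validate_numbering(up_numbers: &[u64]) -> Result<(), String> {
    for (i, up_number) in up_numbers.iter().enumerate() {
        if i as u64 + 1 != *up_number {
            return Err(format!(
                "expected up{} at position {}, found up{}",
                i + 1,
                i,
                up_number
            ));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_numbering(&[1, 2, 3]).is_ok());
    // A duplicate number breaks the exact 1..=len sequence.
    assert!(validate_numbering(&[1, 2, 2, 3]).is_err());
    // So does a gap.
    assert!(validate_numbering(&[1, 3]).is_err());
}
```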
Collaborator

This is where we'd catch "a transactional + non-transactional SQL change both using the same number", right?

Contributor Author

Yeah (and there's a test for this too)

Created using spr 1.3.6-beta.1
@sunshowers sunshowers merged commit efe4c20 into main Feb 24, 2026
16 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/crdb-set-index-backfill-batch-size-to-5000-max-sql-memory-to-256mib branch February 24, 2026 01:36


Development

Successfully merging this pull request may close these issues.

creating crdb indexes can fail on sufficiently large tables