
fix: use branch name + date + hash for CI schema names to prevent collisions #940

Open
devin-ai-integration[bot] wants to merge 4 commits into master from devin/1772185362-fix-databricks-schema-collision

Conversation


@devin-ai-integration devin-ai-integration bot commented Feb 27, 2026

Summary

Fixes flaky Databricks CI failures (SCHEMA_NOT_FOUND / TABLE_OR_VIEW_NOT_FOUND) caused by cross-branch schema name collisions.

Root cause: The "Write dbt profiles" step truncated the schema identifier to 40 characters via head -c 40. For databricks_catalog, the warehouse prefix alone consumes 39 of those 40 characters, leaving only 1 character to differentiate branches. Any two branches starting with the same letter (e.g. core-402-duckdb-support and core-397-clickhouse-support) produced identical schemas. When their CI jobs ran concurrently (different concurrency groups, same schema), one job's env.clear() would drop the other's schemas mid-test.
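The collision can be reproduced with a small sketch. The input format (`<warehouse>_dbt_<version>_<branch>`) and the sanitization steps are assumptions reconstructed from the schema name quoted in this PR, and `old_schema_suffix` is a hypothetical helper, not the workflow's actual code:

```shell
# Hypothetical reconstruction of the old naming: sanitize the ref string,
# then keep only the first 40 characters. The input format is an assumption.
old_schema_suffix() {
  # $1 = warehouse type, $2 = dbt version, $3 = branch name
  printf '%s_dbt_%s_%s' "$1" "$2" "$3" \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9]/_/g' \
    | head -c 40
}

a=$(old_schema_suffix databricks_catalog latest_official core-402-duckdb-support)
b=$(old_schema_suffix databricks_catalog latest_official core-397-clickhouse-support)

# Both lines print the same schema name, because only the leading "c"
# of each branch survives the 40-character cut.
echo "dbt_pkg_$a"
echo "dbt_pkg_$b"
```

Under these assumptions both calls yield `dbt_pkg_databricks_catalog_dbt_latest_official_c`, the schema named in the evidence for the two overlapping runs.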

Evidence: Runs 22466863627 and 22466754085 overlapped (00:08–00:42 UTC, Feb 27), both targeting schema dbt_pkg_databricks_catalog_dbt_latest_official_c, both failed. The latest run with no concurrent Databricks job passed.

Fix: Replace head -c 40 truncation with a structured suffix: <branch_24>_<YYMMDD>_<8-char SHA-256 of concurrency group>.

  • Truncated branch name (≤24 chars) — keeps schemas human-readable in the data platform
  • Date stamp (YYMMDD) — makes it easy to identify and clean up old schemas by age
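  • 8-char hash of the full concurrency group (warehouse type + dbt version + branch): keeps names distinct across concurrent jobs even when the truncated branch segments match (an 8-hex-char hash is 32 bits, so a collision is very unlikely rather than strictly impossible)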

Example schema: dbt_pkg_core_402_duckdb_suppor_260227_a1b2c3d4
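A minimal sketch of how such a suffix could be computed. The variable names follow the walkthrough in this PR and the group string follows the concurrency key format quoted in the review; the exact sanitization rules and the hash command used in the workflow are assumptions, and `schema_suffix` is a hypothetical helper:

```shell
# Sketch only: the real workflow step may sanitize or hash differently.
schema_suffix() {
  # $1 = warehouse type, $2 = dbt version, $3 = branch, $4 = optional YYMMDD
  local group="tests_${1}_dbt_${2}_${3}"
  local short_hash safe_branch date_stamp
  # First 8 hex characters of the SHA-256 of the full concurrency group.
  short_hash=$(printf '%s' "$group" | sha256sum | cut -c1-8)
  # Lowercase, replace anything outside [a-z0-9] with "_", cap at 24 chars.
  safe_branch=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9]/_/g' | cut -c1-24)
  # Default to the runner's current UTC date when no stamp is passed in.
  date_stamp=${4:-$(date -u +%y%m%d)}
  printf '%s_%s_%s' "$safe_branch" "$date_stamp" "$short_hash"
}

s1=$(schema_suffix databricks_catalog latest_official core-402-duckdb-support 260227)
s2=$(schema_suffix databricks_catalog latest_official core-397-clickhouse-support 260227)
echo "dbt_pkg_$s1"
echo "dbt_pkg_$s2"
```

In this sketch the two branches that used to collide now differ in both the branch segment and the hash. With a 24-character cap, `core-402-duckdb-support` (23 characters after sanitizing) is kept whole, while `core-397-clickhouse-support` is cut to `core_397_clickhouse_supp`.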

Worst-case total: dbt_pkg_ (8) + branch (24) + _ (1) + date (6) + _ (1) + hash (8) + _elementary (11) + _gw7 (4) = 63 chars — exactly at PostgreSQL's limit (the most restrictive warehouse).
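The budget can be checked mechanically by building a worst-case name from placeholder characters. Only the lengths matter here; the branch, date, and hash values are made up, and the order of the two trailing suffixes is taken from the sum above:

```shell
# Worst-case schema name assembled from placeholders of the stated lengths.
limit=63                            # PostgreSQL's default identifier limit
prefix="dbt_pkg_"                   # 8 chars
branch=$(printf '%024d' 0)          # 24-char stand-in for the truncated branch
date_stamp="260227"                 # 6 chars
short_hash="a1b2c3d4"               # 8 chars
worst="${prefix}${branch}_${date_stamp}_${short_hash}_elementary_gw7"

echo "length=${#worst} limit=$limit"
if [ "${#worst}" -le "$limit" ]; then
  echo "within budget"
else
  echo "over budget"
fi
```

The name is 63 characters long, so there is no slack: a longer worker suffix (for example a three-digit `_gw` index) or a longer package suffix would exceed the limit.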

Review & Testing Checklist for Human

  • Verify schema suffix in CI logs: Check the "Write dbt profiles" step output in this PR's CI run — it now logs the resolved SCHEMA_SUFFIX. Confirm it matches the <branch>_<YYMMDD>_<hash> format.
  • Run a second PR's CI concurrently against this one (ideally with a branch starting with the same characters) to validate that the collision is actually resolved end-to-end.
  • Spot-check the 63-char budget math against the actual schema names visible in Databricks/Snowflake after a CI run. The _elementary and _gw7 suffixes are appended by dbt and pytest-xdist, respectively; verify that these haven't changed length.
  • Orphaned schemas from old naming convention: The old dbt_pkg_databricks_catalog_dbt_latest_official_c* schemas on Databricks/Snowflake/etc won't be automatically cleaned up. Consider a one-time manual cleanup.

Notes

  • Other warehouse types were less affected (snowflake/bigquery/redshift had 10-11 chars of differentiation), but the old approach was still fragile — this fix eliminates the truncation risk entirely.
  • The cancel-in-progress + --clear-on-end race condition (where a cancelled run's teardown drops the new run's schemas) is a separate, less frequent issue not addressed here.
  • The date stamp uses the CI runner's date at execution time (UTC), not the PR creation date.
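As a rough illustration of age-based cleanup, the stamp can be read back out of a base schema name and compared with a cutoff. `schema_date_stamp` and `is_older_than` are hypothetical helpers, and the sketch assumes any `_elementary` or worker suffix has already been stripped from the name:

```shell
# Extract the YYMMDD stamp from dbt_pkg_<branch>_<YYMMDD>_<hash>.
schema_date_stamp() {
  local without_hash=${1%_*}           # drop the trailing _<hash>
  printf '%s' "${without_hash##*_}"    # keep the last remaining field
}

# Exit status 0 when the schema's stamp is strictly older than the cutoff.
# Comparing YYMMDD values as numbers is only valid within one century.
is_older_than() {
  [ "$(schema_date_stamp "$1")" -lt "$2" ]
}

today=$(date -u +%y%m%d)               # UTC, same format as the CI stamp
name="dbt_pkg_core_402_duckdb_suppor_260227_a1b2c3d4"
echo "stamp=$(schema_date_stamp "$name") today=$today"
if is_older_than "$name" 260329; then
  echo "older than cutoff"
fi
```

The cutoff is passed in by the caller. With GNU date it could be computed as `date -u -d '30 days ago' +%y%m%d`, but that flag is not portable to every platform.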

Link to Devin run: https://app.devin.ai/sessions/e01c77a2322a476faf5f6162e9390351
Requested by: @haritamar

Summary by CodeRabbit

  • Chores
    • Updated test workflow to use a safe, hash-based schema suffix combining a sanitized branch segment, a YYMMDD date stamp, and an 8-character hash for uniqueness.
    • Added branch sanitization, length-safety safeguards, and an echo of the computed suffix for visibility.
    • Kept existing concurrency grouping while improving collision avoidance and documentation.

… collisions

The previous approach truncated the full ref string to 40 characters with
`head -c 40`. For databricks_catalog, the warehouse prefix alone consumed
39 of those 40 characters, leaving only 1 character to differentiate
branches. Any two branches starting with the same letter (e.g.
core-402-duckdb-support and core-397-clickhouse-support) produced
identical schema names, causing concurrent CI jobs to drop each other's
schemas (SCHEMA_NOT_FOUND / TABLE_OR_VIEW_NOT_FOUND).

New approach: use `pr<number>_<8-char hash of concurrency group>` (or
`run<id>_<hash>` for workflow_dispatch). This guarantees uniqueness
across concurrent jobs while staying well within PostgreSQL's 63-char
schema name limit (worst case: 46 chars).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>



coderabbitai bot commented Feb 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between aa8ae1d and c58c8c0.

📒 Files selected for processing (1)
  • .github/workflows/test-warehouse.yml

📝 Walkthrough


Replaces substring-based schema naming with a collision-resistant scheme: builds CONCURRENCY_GROUP from warehouse-type, dbt-version, and branch; computes an 8-char SHA-256 SHORT_HASH; creates a sanitized lowercase SAFE_BRANCH; adds DATE_STAMP (YYMMDD); composes SCHEMA_SUFFIX as ${SAFE_BRANCH}_${DATE_STAMP}_${SHORT_HASH} and echoes it; and sets the schema to dbt_pkg_$SCHEMA_SUFFIX.

Changes

Cohort: GitHub Workflow Schema Naming
File(s): .github/workflows/test-warehouse.yml
Summary: Replaces truncated-ref schema derivation with a hash-based suffix. Adds CONCURRENCY_GROUP (from warehouse-type, dbt-version, branch), SHORT_HASH (first 8 chars of SHA-256), SAFE_BRANCH (lowercased, sanitized, max 24 chars), and DATE_STAMP (YYMMDD); composes SCHEMA_SUFFIX as ${SAFE_BRANCH}_${DATE_STAMP}_${SHORT_HASH}, echoes the suffix, and writes the schema as dbt_pkg_$SCHEMA_SUFFIX. Retains the existing concurrency grouping key while changing the written schema name; includes inline comments on collision avoidance and PostgreSQL name-length budgeting.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nibble on hashes in moonlit code,
SAFE_BRANCH curled, DATE_STAMP in tow,
SHORT_HASH hops where conflicts won't grow,
SCHEMA_SUFFIX stitched soft and low,
CI gardens bloom where binaries sow. 🌿

🚥 Pre-merge checks | ✅ 3 passed
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately captures the main fix: switching from truncation to a branch-name + date + hash scheme for CI schema names to prevent collisions.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.github/workflows/test-warehouse.yml (1)

142-153: Optional: log resolved schema suffix once for easier CI verification.

A single log line would simplify validating workflow_call propagation and diagnosing future schema issues.

💡 Minimal observability tweak
           if [ -n "$PR_NUMBER" ]; then
             SCHEMA_SUFFIX="pr${PR_NUMBER}_${SHORT_HASH}"
           else
             # workflow_dispatch or other non-PR triggers — use run id
             SCHEMA_SUFFIX="run${{ github.run_id }}_${SHORT_HASH}"
           fi
+          echo "Using CI schema suffix: $SCHEMA_SUFFIX"
 
           python "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/generate_profiles.py" \
             --template "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/profiles.yml.j2" \
             --output ~/.dbt/profiles.yml \
             --schema-name "dbt_pkg_$SCHEMA_SUFFIX"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 142 - 153, Add a single
echo log that prints the resolved SCHEMA_SUFFIX (constructed from
PR_NUMBER/SHORT_HASH or run${{ github.run_id }}_SHORT_HASH) right after its
assignment and before invoking the python generate_profiles.py step so CI shows
the final schema name; reference the SCHEMA_SUFFIX variable and the
PR_NUMBER/SHORT_HASH resolution logic and ensure the log runs in both PR and
non-PR branches.

📥 Commits

Reviewing files that changed from the base of the PR and between 03bc6dd and 0917081.

📒 Files selected for processing (1)
  • .github/workflows/test-warehouse.yml

devin-ai-integration bot and others added 2 commits February 27, 2026 09:49
…ee schemas

Per feedback: use <truncated_branch>_<8-char hash of concurrency group>
instead of pr<number>_<hash>. This keeps schemas human-readable in the
data platform so you can tell which branch a schema belongs to.

The hash of the full concurrency group (warehouse type + dbt version +
branch) guarantees uniqueness even when branch names are truncated.

Also adds a log line echoing the resolved schema suffix for CI debugging.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

Schema format is now: dbt_pkg_<branch_24>_<YYMMDD>_<hash_8>
Example: dbt_pkg_core_402_duckdb_suppor_260227_a1b2c3d4

The date stamp makes it easy to identify and clean up old schemas
by age (e.g. DROP schemas older than 30 days).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title from "fix: use PR number + hash for CI schema names to prevent cross-branch collisions" to "fix: use branch name + date + hash for CI schema names to prevent collisions" on Feb 27, 2026
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.github/workflows/test-warehouse.yml (1)

145-148: Consider extracting the concurrency group pattern to avoid duplication.

The concurrency group string is defined in two places:

  • Line 60: tests_${{ inputs.warehouse-type }}_dbt_${{ inputs.dbt-version }}_${{ github.head_ref || github.ref_name }}
  • Line 145: tests_${{ inputs.warehouse-type }}_dbt_${{ inputs.dbt-version }}_${BRANCH_NAME}

While functionally equivalent (since BRANCH_NAME is set from the same GitHub expression), if someone modifies the concurrency group format in one location, they could forget the other, causing the hash to no longer match the actual concurrency group—defeating the collision-avoidance purpose.

One option to reduce this risk is to add a comment near line 60 pointing to the reconstruction here, or vice versa.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 145 - 148, Extract the
concurrency group construction into a single reusable variable and reference it
in both places instead of reconstructing it: create a single step or env var
named CONCURRENCY_GROUP (used by the existing CONCURRENCY_GROUP assignment and
by the earlier occurrence at line ~60) and then derive SHORT_HASH, SAFE_BRANCH,
and SCHEMA_SUFFIX from that single CONCURRENCY_GROUP/BRANCH_NAME source;
alternatively, if you prefer minimal change, add a clear comment next to the
first concurrency-group declaration pointing to the reconstruction here (and
vice versa) so future edits keep both formats in sync.

📥 Commits

Reviewing files that changed from the base of the PR and between 0917081 and 7f66573.

📒 Files selected for processing (1)
  • .github/workflows/test-warehouse.yml

# Build a short, collision-free schema identifier from the branch
# name and an 8-char hash of the full concurrency group.
#
# Previous approach used `head -c 40` on the full ref string, which
Collaborator

Explaining the history is not interesting. Replace with a short explanation about how it is computed.

Contributor Author

Done — simplified the comment to just explain the format and budget, removed the history explanation. See c58c8c0.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>