fix: use branch name + date + hash for CI schema names to prevent collisions #940
devin-ai-integration[bot] wants to merge 4 commits into master
Conversation
… collisions

The previous approach truncated the full ref string to 40 characters with `head -c 40`. For `databricks_catalog`, the warehouse prefix alone consumed 39 of those 40 characters, leaving only 1 character to differentiate branches. Any two branches starting with the same letter (e.g. `core-402-duckdb-support` and `core-397-clickhouse-support`) produced identical schema names, causing concurrent CI jobs to drop each other's schemas (`SCHEMA_NOT_FOUND` / `TABLE_OR_VIEW_NOT_FOUND`).

New approach: use `pr<number>_<8-char hash of concurrency group>` (or `run<id>_<hash>` for workflow_dispatch). This guarantees uniqueness across concurrent jobs while staying well within PostgreSQL's 63-char schema name limit (worst case: 46 chars).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
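An illustrative shell sketch of the suffix scheme this commit describes; the concurrency group value, PR number, and run id below are made-up examples, not the exact workflow code:

```shell
# Illustrative sketch of the pr<number>_<hash> suffix scheme.
# CONCURRENCY_GROUP, PR_NUMBER, and RUN_ID are example values.
CONCURRENCY_GROUP="tests_postgres_dbt_latest_official_core-402-duckdb-support"
PR_NUMBER="940"
RUN_ID="12345"

# 8-char SHA-256 prefix of the full concurrency group.
SHORT_HASH=$(printf '%s' "$CONCURRENCY_GROUP" | sha256sum | cut -c1-8)

if [ -n "$PR_NUMBER" ]; then
  SCHEMA_SUFFIX="pr${PR_NUMBER}_${SHORT_HASH}"
else
  # workflow_dispatch or other non-PR triggers fall back to the run id.
  SCHEMA_SUFFIX="run${RUN_ID}_${SHORT_HASH}"
fi
echo "$SCHEMA_SUFFIX"
```

Because the hash covers the whole concurrency group (warehouse type, dbt version, and branch), two concurrent jobs cannot collide even when their other components match.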
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
📒 Files selected for processing (1)
.github/workflows/test-warehouse.yml
📝 Walkthrough
Replaces substring-based schema naming with a collision-resistant scheme.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🧹 Nitpick comments (1)
.github/workflows/test-warehouse.yml (1)
142-153: Optional: log the resolved schema suffix once for easier CI verification.

A single log line would simplify validating `workflow_call` propagation and diagnosing future schema issues.

💡 Minimal observability tweak
```diff
 if [ -n "$PR_NUMBER" ]; then
   SCHEMA_SUFFIX="pr${PR_NUMBER}_${SHORT_HASH}"
 else
   # workflow_dispatch or other non-PR triggers — use run id
   SCHEMA_SUFFIX="run${{ github.run_id }}_${SHORT_HASH}"
 fi
+echo "Using CI schema suffix: $SCHEMA_SUFFIX"
 python "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/generate_profiles.py" \
   --template "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/profiles.yml.j2" \
   --output ~/.dbt/profiles.yml \
   --schema-name "dbt_pkg_$SCHEMA_SUFFIX"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test-warehouse.yml around lines 142 - 153, Add a single echo log that prints the resolved SCHEMA_SUFFIX (constructed from PR_NUMBER/SHORT_HASH or run${{ github.run_id }}_SHORT_HASH) right after its assignment and before invoking the python generate_profiles.py step so CI shows the final schema name; reference the SCHEMA_SUFFIX variable and the PR_NUMBER/SHORT_HASH resolution logic and ensure the log runs in both PR and non-PR branches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/test-warehouse.yml:
- Around line 142-153: Add a single echo log that prints the resolved
SCHEMA_SUFFIX (constructed from PR_NUMBER/SHORT_HASH or run${{ github.run_id
}}_SHORT_HASH) right after its assignment and before invoking the python
generate_profiles.py step so CI shows the final schema name; reference the
SCHEMA_SUFFIX variable and the PR_NUMBER/SHORT_HASH resolution logic and ensure
the log runs in both PR and non-PR branches.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
.github/workflows/test-warehouse.yml
…ee schemas

Per feedback: use `<truncated_branch>_<8-char hash of concurrency group>` instead of `pr<number>_<hash>`. This keeps schemas human-readable in the data platform so you can tell which branch a schema belongs to. The hash of the full concurrency group (warehouse type + dbt version + branch) guarantees uniqueness even when branch names are truncated.

Also adds a log line echoing the resolved schema suffix for CI debugging.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Schema format is now: `dbt_pkg_<branch_24>_<YYMMDD>_<hash_8>`

Example: `dbt_pkg_core_402_duckdb_suppor_260227_a1b2c3d4`

The date stamp makes it easy to identify and clean up old schemas by age (e.g. DROP schemas older than 30 days).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
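A minimal shell sketch of how a suffix in this shape could be assembled; the variable names, the sanitization rule (`tr` to underscores), and the example branch are assumptions for illustration, not the exact workflow step:

```shell
# Illustrative sketch of the <branch_24>_<YYMMDD>_<hash_8> scheme.
# BRANCH_NAME, the sanitization rule, and CONCURRENCY_GROUP are
# example assumptions, not the real workflow code.
BRANCH_NAME="core-402-duckdb-support"
CONCURRENCY_GROUP="tests_postgres_dbt_latest_official_${BRANCH_NAME}"

# Lowercase alphanumerics only, truncated to 24 chars for readability.
SAFE_BRANCH=$(printf '%s' "$BRANCH_NAME" | tr -c 'a-z0-9' '_' | cut -c1-24)
DATE_STAMP=$(date -u +%y%m%d)                                   # YYMMDD
SHORT_HASH=$(printf '%s' "$CONCURRENCY_GROUP" | sha256sum | cut -c1-8)

SCHEMA_SUFFIX="${SAFE_BRANCH}_${DATE_STAMP}_${SHORT_HASH}"
echo "dbt_pkg_${SCHEMA_SUFFIX}"
```

The branch segment keeps the schema human-readable, the date stamp supports age-based cleanup, and the hash of the full concurrency group carries the uniqueness guarantee even after truncation.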
🧹 Nitpick comments (1)
.github/workflows/test-warehouse.yml (1)
145-148: Consider extracting the concurrency group pattern to avoid duplication.

The concurrency group string is defined in two places:

- Line 60: `tests_${{ inputs.warehouse-type }}_dbt_${{ inputs.dbt-version }}_${{ github.head_ref || github.ref_name }}`
- Line 145: `tests_${{ inputs.warehouse-type }}_dbt_${{ inputs.dbt-version }}_${BRANCH_NAME}`

While functionally equivalent (since `BRANCH_NAME` is set from the same GitHub expression), if someone modifies the concurrency group format in one location, they could forget the other, causing the hash to no longer match the actual concurrency group and defeating the collision-avoidance purpose.

One option to reduce this risk is to add a comment near line 60 pointing to the reconstruction here, or vice versa.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test-warehouse.yml around lines 145 - 148, Extract the concurrency group construction into a single reusable variable and reference it in both places instead of reconstructing it: create a single step or env var named CONCURRENCY_GROUP (used by the existing CONCURRENCY_GROUP assignment and by the earlier occurrence at line ~60) and then derive SHORT_HASH, SAFE_BRANCH, and SCHEMA_SUFFIX from that single CONCURRENCY_GROUP/BRANCH_NAME source; alternatively, if you prefer minimal change, add a clear comment next to the first concurrency-group declaration pointing to the reconstruction here (and vice versa) so future edits keep both formats in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/test-warehouse.yml:
- Around line 145-148: Extract the concurrency group construction into a single
reusable variable and reference it in both places instead of reconstructing it:
create a single step or env var named CONCURRENCY_GROUP (used by the existing
CONCURRENCY_GROUP assignment and by the earlier occurrence at line ~60) and then
derive SHORT_HASH, SAFE_BRANCH, and SCHEMA_SUFFIX from that single
CONCURRENCY_GROUP/BRANCH_NAME source; alternatively, if you prefer minimal
change, add a clear comment next to the first concurrency-group declaration
pointing to the reconstruction here (and vice versa) so future edits keep both
formats in sync.
.github/workflows/test-warehouse.yml
```yaml
# Build a short, collision-free schema identifier from the branch
# name and an 8-char hash of the full concurrency group.
#
# Previous approach used `head -c 40` on the full ref string, which
```
Explaining the history is not interesting. Replace with a short explanation about how it is computed.
Done — simplified the comment to just explain the format and budget, removed the history explanation. See c58c8c0.
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Summary
Fixes flaky Databricks CI failures (`SCHEMA_NOT_FOUND` / `TABLE_OR_VIEW_NOT_FOUND`) caused by cross-branch schema name collisions.

Root cause: The "Write dbt profiles" step truncated the schema identifier to 40 characters via `head -c 40`. For `databricks_catalog`, the warehouse prefix alone consumes 39 of those 40 characters, leaving only 1 character to differentiate branches. Any two branches starting with the same letter (e.g. `core-402-duckdb-support` and `core-397-clickhouse-support`) produced identical schemas. When their CI jobs ran concurrently (different concurrency groups, same schema), one job's `env.clear()` would drop the other's schemas mid-test.

Evidence: Runs 22466863627 and 22466754085 overlapped (00:08–00:42 UTC, Feb 27), both targeting schema `dbt_pkg_databricks_catalog_dbt_latest_official_c`, and both failed. The latest run with no concurrent Databricks job passed.

Fix: Replace the `head -c 40` truncation with a structured suffix: `<branch_24>_<YYMMDD>_<8-char SHA-256 of concurrency group>`.

- The date stamp (`YYMMDD`) makes it easy to identify and clean up old schemas by age.

Example schema: `dbt_pkg_core_402_duckdb_suppor_260227_a1b2c3d4`

Worst-case total: `dbt_pkg_` (8) + branch (24) + `_` (1) + date (6) + `_` (1) + hash (8) + `_elementary` (11) + `_gw7` (4) = 63 chars, exactly at PostgreSQL's limit (the most restrictive warehouse).

Review & Testing Checklist for Human

- Confirm the logged `SCHEMA_SUFFIX` matches the `<branch>_<YYMMDD>_<hash>` format.
- The `_elementary` and `_gw7` suffixes are appended by pytest-xdist and dbt; verify these haven't changed length.
- Old `dbt_pkg_databricks_catalog_dbt_latest_official_c*` schemas on Databricks/Snowflake/etc. won't be automatically cleaned up. Consider a one-time manual cleanup.

Notes

- The `cancel-in-progress` + `--clear-on-end` race condition (where a cancelled run's teardown drops the new run's schemas) is a separate, less frequent issue not addressed here.

Link to Devin run: https://app.devin.ai/sessions/e01c77a2322a476faf5f6162e9390351
Requested by: @haritamar
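As a quick sanity check on the 63-character worst case claimed in the summary, the budget arithmetic can be replayed in shell (component lengths are taken from the summary's breakdown; `_elementary` and `_gw7` are the downstream suffixes it mentions):

```shell
# Replay the worst-case schema name length budget from the summary.
PREFIX="dbt_pkg_"          # fixed schema prefix (8 chars)
BRANCH_MAX=24              # truncated branch segment
DATE_LEN=6                 # YYMMDD date stamp
HASH_LEN=8                 # 8-char SHA-256 prefix
ELEMENTARY="_elementary"   # suffix appended downstream (11 chars)
GW="_gw7"                  # pytest-xdist worker suffix (4 chars)

# Two underscore separators join branch/date/hash.
TOTAL=$(( ${#PREFIX} + BRANCH_MAX + 1 + DATE_LEN + 1 + HASH_LEN + ${#ELEMENTARY} + ${#GW} ))
echo "$TOTAL"   # prints 63, exactly PostgreSQL's identifier limit
```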