
feat: add full ClickHouse support - remove all skip_targets markers (CORE-397) #934

Open
devin-ai-integration[bot] wants to merge 14 commits into master from core-397-clickhouse-support

Conversation

devin-ai-integration bot (Contributor) commented Feb 25, 2026

feat(CORE-397): add full ClickHouse support

Summary

Adds full ClickHouse support by removing all skip_targets(["clickhouse"]) markers from 20 integration test files and implementing the underlying macro/SQL/infrastructure fixes to make the full test suite pass on ClickHouse (186 passed, 7 skipped locally with -n8).

Updates since last revision

Round 3 — Merged master (PRs #936, #937, #938) and verified compatibility.

Round 2 — Addressed overlap question between seed NULL repair mechanisms:

  • Removed dead clickhouse__ dispatch variant from replace_empty_strings_with_nulls.sql. Investigation showed this dispatch was never called — the macro is only invoked for BigQuery fusion seeds (via _fix_seed_if_needed). ClickHouse seed NULL repair is handled entirely by fix_clickhouse_seed_nulls() in clickhouse_utils.py via the HTTP API. Simplified macro back to a single implementation with a clarifying comment explaining the scope.

Round 1 — Addressed initial PR review comments:

  • Extracted clickhouse_utils.py: Moved all ClickHouse seed NULL-repair logic (HTTP API queries, table rebuild via EXCHANGE TABLES, schema resolution from profiles.yml) out of dbt_project.py into a dedicated integration_tests/tests/clickhouse_utils.py module. DbtProject._fix_seed_if_needed() now delegates to fix_clickhouse_seed_nulls(...).
  • Renamed ch_query → clickhouse_query_with_api() for clarity. Credentials are URL-encoded via urllib.parse.urlencode.
  • Refactored replace_empty_strings_with_nulls.sql: Added adapter.quote(col["name"]) for robustness against reserved-word column names.

Changes

Test enablement:

  • Removed all skip_targets(["clickhouse"]) markers and associated comments from 20 test files.
  • For multi-target skip lists (e.g. test_schema_changes.py), only "clickhouse" was removed; other targets preserved.
  • Removed now-unused import pytest from files where it was only needed for the skip decorator.

ClickHouse server configuration (users.xml):

  • Added join_use_nulls=1 — ClickHouse LEFT JOINs produce default values (not NULL) for unmatched rows by default, breaking anti-join patterns used in anomaly detection.
  • Added mutations_sync=1 — makes ALTER TABLE UPDATE mutations synchronous.
  • These are set only in the integration-tests Docker container. For managed/Cloud ClickHouse, users can set these per-session via dbt-clickhouse custom_settings or per-statement (SETTINGS mutations_sync=1).
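For managed/Cloud instances, the same settings can be carried in the dbt profile rather than server config. A hypothetical profiles.yml sketch (profile and target names are illustrative; the custom_settings block is the dbt-clickhouse mechanism the bullet above refers to):

```yaml
elementary_tests:
  target: clickhouse
  outputs:
    clickhouse:
      type: clickhouse
      host: your-clickhouse-host
      port: 8443
      user: default
      schema: test_db
      # Applied per-session by dbt-clickhouse, mirroring the
      # Docker users.xml used by the integration tests:
      custom_settings:
        join_use_nulls: 1
        mutations_sync: 1
```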

Seed data NULL repair (clickhouse_utils.py):

  • ClickHouse columns are non-Nullable by default, so NULL values in CSV seeds become default values (0, ''). Added fix_clickhouse_seed_nulls() which rebuilds seed tables via the ClickHouse HTTP API using nullIf() to restore NULLs.
  • Schema is resolved dynamically from ~/.dbt/profiles.yml via get_clickhouse_schema() (not hardcoded).
  • Identifiers are validated with regex to prevent SQL injection.
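The rebuild described above can be sketched as pure SQL-string construction. This is a simplified sketch, not the actual clickhouse_utils.py code: the function and regex names here are hypothetical, but the nullIf() rewrite, identifier validation, and EXCHANGE TABLES sequence follow the description.

```python
import re

IDENT_RE = re.compile(r"^[A-Za-z0-9_]+$")

def assert_safe_identifier(name: str) -> str:
    # Reject anything that could smuggle SQL into an interpolated identifier.
    if not IDENT_RE.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name

def build_null_repair_sql(schema: str, table: str, columns: dict) -> list:
    """Build statements that rebuild a seed table with NULLs restored.

    `columns` maps column name -> (clickhouse_type, default_literal), where
    default_literal is what the non-Nullable load produced in place of NULL
    (e.g. "''" for String, "0" for Int32).
    """
    schema = assert_safe_identifier(schema)
    table = assert_safe_identifier(table)
    tmp = f"{table}__null_fix_tmp"
    selects = []
    for name, (ch_type, default_literal) in columns.items():
        assert_safe_identifier(name)
        # nullIf(x, default) turns the bogus default back into NULL.
        selects.append(
            f"nullIf(`{name}`, {default_literal})::Nullable({ch_type}) AS `{name}`"
        )
    return [
        f"DROP TABLE IF EXISTS {schema}.{tmp}",
        f"CREATE TABLE {schema}.{tmp} ENGINE = MergeTree() ORDER BY tuple() "
        f"AS SELECT {', '.join(selects)} FROM {schema}.{table}",
        f"EXCHANGE TABLES {schema}.{table} AND {schema}.{tmp}",
        f"DROP TABLE {schema}.{tmp}",
    ]
```

In the real module these statements would be sent through the ClickHouse HTTP API; here they are only generated so the shape of the repair is visible.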

Macro fixes (adapter.dispatch for ClickHouse):

  • full_names.sql: ClickHouse uses database=schema, so all name macros (full_table_name, full_schema_name, full_column_name, relation_to_full_name, model_node_to_full_name, configured_schemas_from_graph_as_tuple, full_name_split) now use 2-part names (schema.table) instead of 3-part.
  • buckets_cte.sql: Fixed complete_buckets_cte to respect time_bucket.count (was generating 1-period buckets instead of e.g. 7-hour buckets).
  • column_numeric_monitors.sql: Cast columns to Nullable(Float) before stddevPop/varSamp to handle NULL inputs.
  • day_of_week.sql / hour_of_week.sql: ClickHouse doesn't support DATE_FORMAT(date, 'EEEE'); use formatDateTime(date, '%W').
  • table_monitoring_query.sql: ClickHouse-specific event_freshness_metric_query that explicitly checks for NULL before computing timediff (ClickHouse's edr_datediff coalesces NULL→0, masking the fallback logic).
  • list_concat_with_separator.sql: ClickHouse-specific version that checks item is null before casting to String (ClickHouse's non-Nullable String rejects CAST(NULL, 'String')).
  • replace_empty_strings_with_nulls.sql: Simplified to single implementation (no dispatch). Only used for BigQuery fusion. Added clarifying comment explaining ClickHouse uses separate fix_clickhouse_seed_nulls() function.
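The database=schema collapse behind the full_names.sql change can be illustrated with a hypothetical Python analogue (not Elementary's actual macro code, just the branching it describes):

```python
def full_table_name(target_type: str, database: str, schema: str, table: str) -> str:
    # dbt-clickhouse maps the dbt schema onto the ClickHouse database,
    # so there is no third naming level: db.schema.table does not exist.
    if target_type == "clickhouse":
        return f"{schema}.{table}"
    # Other targets keep the conventional 3-part name.
    return f"{database}.{schema}.{table}"
```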

CI:

  • Full CI matrix restored with all warehouse types and fusion variants (identical to master).

Review & Testing Checklist for Human

  • Verify clickhouse_utils.py HTTP API approach is safe — this bypasses dbt entirely to run DDL via ClickHouse's HTTP API. Check that the EXCHANGE TABLES + DROP pattern is safe (the finally block should prevent orphaned temp tables, but verify). Also verify that get_clickhouse_schema() profile parsing handles edge cases (e.g. missing keys, non-standard profile names).
  • Verify full_names.sql 2-part naming is complete — ClickHouse now uses 2-part names (schema.table) instead of 3-part (database.schema.table) across all name macros. Verify no other code assumes 3-part names for ClickHouse (e.g. in Elementary's Python code or other macros).
  • Document join_use_nulls=1 requirement for managed ClickHouse — the Docker config sets this automatically, but users on ClickHouse Cloud/managed instances need to set it via dbt-clickhouse profile custom_settings. Without it, LEFT JOIN anti-join patterns fail silently (produce "pass" instead of "fail" for anomaly tests). Consider adding this to Elementary's ClickHouse setup docs.
  • Test plan: Run a subset of anomaly detection tests locally on ClickHouse (e.g. test_anomalous_column_anomalies, test_event_freshness_anomalies) to verify the join_use_nulls and NULL handling fixes work end-to-end. Also test seed NULL repair by creating a seed with NULL values and verifying they're preserved after dbt seed.
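The cleanup guarantee the first checklist item asks about can be sketched like this (a simplified sketch: run_query stands in for the HTTP API helper, and the temp-table name is illustrative):

```python
def rebuild_with_cleanup(run_query, schema: str, table: str, select_sql: str) -> None:
    tmp = f"{table}__null_fix_tmp"
    run_query(f"DROP TABLE IF EXISTS {schema}.{tmp}")
    try:
        run_query(
            f"CREATE TABLE {schema}.{tmp} ENGINE = MergeTree() ORDER BY tuple() "
            f"AS SELECT {select_sql} FROM {schema}.{table}"
        )
        run_query(f"EXCHANGE TABLES {schema}.{table} AND {schema}.{tmp}")
    finally:
        # Runs whether CREATE/EXCHANGE succeeded or raised, so the temp
        # table never outlives the call.
        run_query(f"DROP TABLE IF EXISTS {schema}.{tmp}")
```

If CREATE or EXCHANGE raises, the exception still propagates to the caller, but the finally block prevents an orphaned temp table either way.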

Notes

  • Requested by: @haritamar
  • Link to Devin run
  • ClickHouse CI passed with all 186 tests passing (7 skipped) before restoring the full matrix.
  • The join_use_nulls=1 server config is critical — without it, LEFT JOIN anti-join patterns fail (produce "pass" instead of "fail" for anomaly tests).
  • For ClickHouse Cloud/managed instances, users should set join_use_nulls=1 via dbt-clickhouse profile custom_settings rather than server config.
  • CI status: 17/18 checks passing after merging master. The one failure (latest_official/databricks_catalog) is the known intermittent SCHEMA_NOT_FOUND issue (not related to this PR). All ClickHouse tests pass.

Summary by CodeRabbit

  • Tests

    • Expanded integration test coverage to run on ClickHouse across anomalies, sampling, schema validation, column monitoring, and data freshness suites.
  • New Features

    • Added ClickHouse runtime configuration to improve null/mutation behavior.
    • Improved seed data handling and empty-string→NULL remediation for ClickHouse targets.
    • Extracted ClickHouse-specific utilities into dedicated module for better code organization.
  • Bug Fixes

    • Fixed nullable handling in statistical calculations for ClickHouse.
    • Fixed bucket generation to respect time_bucket.count for ClickHouse.
    • Fixed event freshness metric NULL handling for ClickHouse.
    • Fixed dimension cast NULL errors in list concatenation for ClickHouse.
  • Chores

    • CI workflow restored to full matrix with all warehouse types.
    • Simplified seed NULL repair macro by removing unused dispatch variant.

Remove ClickHouse from skip_targets in all integration test files to enable
full ClickHouse support testing. For multi-target skip lists (e.g. in
test_schema_changes.py and test_exposure_schema_validity.py), only 'clickhouse'
was removed while keeping other targets.

Also remove now-unused 'import pytest' statements in files where pytest was
only imported for the skip_targets decorator.

CORE-397

Co-Authored-By: unknown <>

Reduce the warehouse-type matrix to only [clickhouse] to enable fast
iteration on ClickHouse test fixes. Will be restored to full matrix
once all ClickHouse tests pass.

CORE-397

Co-Authored-By: unknown <>
linear bot commented Feb 25, 2026

@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Contributor

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

CI now targets only ClickHouse; many integration tests previously skipped for ClickHouse are enabled. ClickHouse-specific null-handling was added for macros and seed processing, and ClickHouse configuration was extended with a users.xml mount and profile settings.

Changes

  • CI Workflow Simplification (.github/workflows/test-all-warehouses.yml): Matrix reduced to a single clickhouse entry; per-warehouse entries and dbt-version mappings for other warehouses removed.
  • Re-enabled ClickHouse Integration Tests (20 files under integration_tests/tests/: test_all_columns_anomalies.py, test_anomalies_backfill_logic.py, test_anomalies_ranges.py, test_anomaly_exclude_metrics.py, test_anomaly_test_configuration.py, test_collect_metrics.py, test_column_anomalies.py, test_column_pii_sampling.py, test_dbt_artifacts/test_artifacts.py, test_dimension_anomalies.py, test_disable_samples_config.py, test_event_freshness_anomalies.py, test_exposure_schema_validity.py, test_failed_row_count.py, test_freshness_anomalies.py, test_override_samples_config.py, test_sampling.py, test_sampling_pii.py, test_schema_changes.py, test_volume_anomalies.py): Removed @pytest.mark.skip_targets(["clickhouse"]) (or removed clickhouse from skip lists) and some unused pytest imports, enabling previously-skipped tests to run on ClickHouse.
  • ClickHouse null-handling macro (integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql): Added a ClickHouse-specific DDL path that alters columns to Nullable(String) with a per-column UPDATE/DDL flow converting empty strings to NULL; the non-ClickHouse path retains the original UPDATE behavior.
  • Seed data ClickHouse remediation (integration_tests/tests/dbt_project.py): seed now forwards data to _fix_seed_if_needed(table_name, data); _fix_seed_if_needed accepts optional data and calls the new _fix_clickhouse_seed_nulls() for ClickHouse, which builds nullIf expressions, creates a temporary nullable table, and exchanges tables via ClickHouse DDL over the HTTP API.
  • ClickHouse service config (integration_tests/docker-compose.yml, integration_tests/docker/clickhouse/users.xml): Added a volume mount of ./docker/clickhouse/users.xml into the container; users.xml defines a default profile with join_use_nulls and mutations_sync settings.
  • ClickHouse numeric aggregation updates (macros/edr/data_monitoring/monitors/column_numeric_monitors.sql): Changed the ClickHouse implementations of the stddev/variance macros to cast columns to nullable float types before aggregation to handle NULLs safely.

Sequence Diagram(s)

sequenceDiagram
  participant Seeder as "dbt_project.seed"
  participant Fixer as "_fix_clickhouse_seed_nulls"
  participant ClickHouse as "ClickHouse HTTP API\n(system.columns, DDL, INSERT)"

  Seeder->>Fixer: call _fix_clickhouse_seed_nulls(table_name, data)
  Fixer->>ClickHouse: query system.columns to get column types
  ClickHouse-->>Fixer: return column metadata
  Fixer->>ClickHouse: CREATE TEMP table with Nullable types (DDL)
  Fixer->>ClickHouse: INSERT data using nullIf expressions (HTTP POST)
  Fixer->>ClickHouse: ALTER/EXCHANGE to replace original table (DDL)
  ClickHouse-->>Fixer: acknowledge DDL/INSERT
  Fixer-->>Seeder: return / continue seed flow
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped from skips to running ground,
ClickHouse now hears every test sound.
Nulls get nudged, configs snug and neat,
Seeds rebuilt so rows align sweet.
A joyful thump — the burrow’s complete!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 20.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title clearly describes the main change: enabling ClickHouse by removing skip markers and adding support across the test suite.
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
integration_tests/tests/test_exposure_schema_validity.py (1)

147-156: ⚠️ Potential issue | 🟡 Minor

Add ClickHouse "NUMERIC" type to the data_type mapping.

ClickHouse is an active integration test target but currently lacks an explicit entry in the mapping (lines 147-156). The fallback to lowercase "numeric" is problematic—ClickHouse's type system recognizes NUMERIC as a valid alias for its Decimal type, but the lowercase variant may not be properly introspected. Add an explicit entry to align with ClickHouse's type conventions and match the uppercase pattern used by other targets:

     data_type = {
         "snowflake": "NUMERIC",
         "bigquery": "NUMERIC",
         "spark": "int",
         "databricks": "int",
         "databricks_catalog": "int",
         "athena": "int",
         "trino": "int",
         "dremio": "int",
+        "clickhouse": "NUMERIC",
     }.get(dbt_project.dbt_runner.target, "numeric")

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f0307f3 and 0a2dd80.

📒 Files selected for processing (21)
  • .github/workflows/test-all-warehouses.yml
  • integration_tests/tests/test_all_columns_anomalies.py
  • integration_tests/tests/test_anomalies_backfill_logic.py
  • integration_tests/tests/test_anomalies_ranges.py
  • integration_tests/tests/test_anomaly_exclude_metrics.py
  • integration_tests/tests/test_anomaly_test_configuration.py
  • integration_tests/tests/test_collect_metrics.py
  • integration_tests/tests/test_column_anomalies.py
  • integration_tests/tests/test_column_pii_sampling.py
  • integration_tests/tests/test_dbt_artifacts/test_artifacts.py
  • integration_tests/tests/test_dimension_anomalies.py
  • integration_tests/tests/test_disable_samples_config.py
  • integration_tests/tests/test_event_freshness_anomalies.py
  • integration_tests/tests/test_exposure_schema_validity.py
  • integration_tests/tests/test_failed_row_count.py
  • integration_tests/tests/test_freshness_anomalies.py
  • integration_tests/tests/test_override_samples_config.py
  • integration_tests/tests/test_sampling.py
  • integration_tests/tests/test_sampling_pii.py
  • integration_tests/tests/test_schema_changes.py
  • integration_tests/tests/test_volume_anomalies.py
💤 Files with no reviewable changes (18)
  • integration_tests/tests/test_anomalies_backfill_logic.py
  • integration_tests/tests/test_all_columns_anomalies.py
  • integration_tests/tests/test_override_samples_config.py
  • integration_tests/tests/test_anomaly_test_configuration.py
  • integration_tests/tests/test_dimension_anomalies.py
  • integration_tests/tests/test_freshness_anomalies.py
  • integration_tests/tests/test_column_anomalies.py
  • integration_tests/tests/test_column_pii_sampling.py
  • integration_tests/tests/test_volume_anomalies.py
  • integration_tests/tests/test_sampling.py
  • integration_tests/tests/test_event_freshness_anomalies.py
  • integration_tests/tests/test_collect_metrics.py
  • integration_tests/tests/test_sampling_pii.py
  • integration_tests/tests/test_anomalies_ranges.py
  • integration_tests/tests/test_dbt_artifacts/test_artifacts.py
  • integration_tests/tests/test_disable_samples_config.py
  • integration_tests/tests/test_anomaly_exclude_metrics.py
  • integration_tests/tests/test_failed_row_count.py

# warehouse-type: redshift
- dbt-version: "${{ inputs.dbt-version || 'fusion' }}"
warehouse-type: databricks_catalog
warehouse-type: [clickhouse]

⚠️ Potential issue | 🟠 Major

Restore full warehouse matrix before merge (current single-target matrix is merge-blocking).

Line 90 limits CI to ClickHouse only, which drops regression coverage for all other warehouses/fusion variants. Keep this for WIP iteration if needed, but restore the full matrix before this PR is merged.


…lity

ClickHouse LEFT OUTER JOIN produces default values (e.g. 1970-01-01 for DateTime)
instead of NULL for unmatched rows, causing the anti-join pattern to fail.
Changed missing_bucket_starts CTE to use NOT IN which works correctly on all databases.

Co-Authored-By: unknown <>
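The failure mode behind this commit can be simulated in plain Python. This is only a toy model of ClickHouse's join_use_nulls=0 behavior (the epoch default and the bucket values are illustrative); it shows why an IS NULL anti-join stops matching while NOT IN keeps working:

```python
EPOCH = "1970-01-01 00:00:00"  # ClickHouse default value for DateTime

def left_join_clickhouse(left, right):
    # With join_use_nulls=0, unmatched right-side columns receive the
    # column type's default value instead of NULL.
    right_set = set(right)
    return [(b, b if b in right_set else EPOCH) for b in left]

all_buckets = ["2026-02-24 00:00:00", "2026-02-24 01:00:00"]
existing = ["2026-02-24 00:00:00"]

joined = left_join_clickhouse(all_buckets, existing)
# The IS NULL anti-join finds nothing: the unmatched row holds the epoch
# default, not NULL, so the missing bucket is silently dropped.
missing_via_is_null = [b for b, r in joined if r is None]
# NOT IN sidesteps the join entirely and behaves the same on all databases.
missing_via_not_in = [b for b in all_buckets if b not in existing]
```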
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@macros/edr/data_monitoring/data_monitors_configuration/get_buckets_configuration.sql`:
- Around line 83-85: The query uses row-value NOT IN between
(all_buckets.bucket_start, all_buckets.bucket_end) and the subquery selecting
bucket_start, bucket_end from buckets_with_existing_metrics, which is
unsupported on Redshift/Athena/Dremio and problematic on Databricks; replace
this pattern with an equivalent NOT EXISTS correlated subquery: for each row in
all_buckets check NOT EXISTS (select 1 from buckets_with_existing_metrics b
where b.bucket_start = all_buckets.bucket_start AND b.bucket_end =
all_buckets.bucket_end), updating the WHERE clause in
get_buckets_configuration.sql so comparisons reference all_buckets.bucket_start
and all_buckets.bucket_end and the subquery alias (e.g.,
b.bucket_start/b.bucket_end) to ensure platform portability and correct NULL
behavior.


📥 Commits

Reviewing files that changed from the base of the PR and between 0a2dd80 and 41701e9.

📒 Files selected for processing (1)
  • macros/edr/data_monitoring/data_monitors_configuration/get_buckets_configuration.sql

- Use Nullable(Float32) in clickhouse__standard_deviation and clickhouse__variance
  to handle CASE expressions that return NULL
- Add _fix_clickhouse_seed_nulls() to rebuild seed tables with proper Nullable types
  using ClickHouse HTTP API with nullIf() function
- Configure ClickHouse Docker with join_use_nulls=1 and mutations_sync=1
- Fix unused variable lint warning in dbt_project.py

Co-Authored-By: unknown <>
coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
integration_tests/tests/dbt_project.py (1)

368-375: ⚠️ Potential issue | 🟠 Major

seed_context doesn't apply the ClickHouse NULL fix.

seed() (line 263) calls _fix_seed_if_needed(table_name, data) inside its context manager, but seed_context (line 369) never does. Any test using seed_context on ClickHouse will silently retain default values where NULLs are expected.

🔧 Proposed fix
 `@contextmanager`
 def seed_context(
     self, data: List[dict], table_name: str
 ) -> Generator[None, None, None]:
     with DbtDataSeeder(
         self.dbt_runner, self.project_dir_path, self.seeds_dir_path
     ).seed(data, table_name):
+        self._fix_seed_if_needed(table_name, data)
         yield
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`:
- Around line 14-19: The ALTER statements in the macro use raw {{ col['name'] }}
which breaks on reserved words; update both statement blocks (the ones named
'alter_nullable_' ~ col['name'] and 'update_nulls_' ~ col['name']) to emit the
column identifier wrapped in ClickHouse backticks and escape any backticks in
the column name before quoting so names like from/order/index work correctly;
apply this change to the MODIFY COLUMN and the UPDATE ... WHERE expressions in
the macro.

In `@integration_tests/tests/dbt_project.py`:
- Around line 331-344: The current loop over columns uses exact col_type
equality and can produce nested Nullable casts; update the logic around col_type
handling in the block that fills select_exprs so you first normalize the type by
stripping a leading "Nullable(" wrapper (and its trailing ")") to get base_type,
detect string-like types by checking if "String" is contained in base_type or
base_type.startswith("FixedString"), and treat LowCardinality(String) similarly;
choose default_val = "''" for those string-like base types, "0" for Int*/UInt*
and Float*, otherwise defaultValueOfTypeName(base_type), then append a single
cast to ::Nullable(base_type) (not ::Nullable(original_col_type)) to avoid
constructing Nullable(Nullable(...)). Apply these changes to the same loop that
iterates over columns/nullable_columns and builds select_exprs so col_type,
default_val, and the final cast use the normalized base_type.
- Around line 308-313: Add a small validator and use it before interpolating
identifiers: implement a helper like _assert_safe_identifier(name: str) that
checks re.fullmatch(r"[A-Za-z0-9_]+", name) and raises on failure, then call it
for table_name and schema (and any other identifiers such as col_name) before
using them in raw SQL; also wrap the validated identifiers in backticks when
building the SQL strings (the places constructing DDL/DDL-like queries and the
code paths referencing table_name, schema, and col_name should be updated) so
that all identifier interpolations are both validated and quoted.



📥 Commits

Reviewing files that changed from the base of the PR and between 41701e9 and 1d2fe6c.

📒 Files selected for processing (5)
  • integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql
  • integration_tests/docker-compose.yml
  • integration_tests/docker/clickhouse/users.xml
  • integration_tests/tests/dbt_project.py
  • macros/edr/data_monitoring/monitors/column_numeric_monitors.sql

if not nullable_columns:
return

schema = f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default"

⚠️ Potential issue | 🟠 Major

Hardcoded "default" schema may not match the actual dbt ClickHouse target schema.

The system.columns lookup will silently return no rows if the dbt profile points to a different database/schema, causing the NULL fix to no-op without any error or warning. The actual schema should be read from the dbt runner's target configuration rather than being hardcoded.

🔧 Suggested approach

Expose the effective schema via self.dbt_runner or pass it explicitly. As a quick fix, at minimum log a warning when cols_result is empty so a silent skip is observable:

+        logger.warning(
+            "ClickHouse fix: no columns found for %s.%s – "
+            "schema may be wrong (using '%s'). NULLs will not be repaired.",
+            schema, table_name, schema,
+        )
         if not cols_result:
             return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` at line 303, The code hardcodes
schema as "default" via schema = f"default{SCHEMA_NAME_SUFFIX}" which can
mismatch the dbt ClickHouse target and cause system.columns queries to return no
rows; update the logic in the dbt project test to read the effective schema from
the dbt runner's target (e.g., via self.dbt_runner.target or an exposed
property) instead of hardcoding, and if you must keep a fallback, log a warning
when cols_result is empty so the missing-schema case is observable; modify the
code paths that build the system.columns query and any use of
schema/SCHEMA_NAME_SUFFIX accordingly.
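For illustration, a minimal sketch of deriving the schema from a target config dict instead of hardcoding "default" (the helper name resolve_clickhouse_schema and the schema/database keys are assumptions based on typical dbt target configs, not the project's actual API):

```python
def resolve_clickhouse_schema(target_config: dict, suffix: str = "") -> str:
    """Derive the effective schema from a dbt target config instead of
    hardcoding 'default', so system.columns lookups hit the right database.
    Falls back to 'default' only when the target config carries no schema."""
    # dbt-clickhouse targets commonly expose 'schema' (sometimes 'database');
    # both key names are assumptions here — verify against the real profile.
    base = target_config.get("schema") or target_config.get("database") or "default"
    return f"{base}{suffix}"
```

With this shape, the seed-repair code would pass the active target's config and SCHEMA_NAME_SUFFIX rather than the literal string.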

Comment on lines 359 to 366
ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
ch_query(
f"CREATE TABLE {schema}.{tmp_name} "
f"ENGINE = MergeTree() ORDER BY tuple() "
f"AS SELECT {select_sql} FROM {schema}.{table_name}"
)
ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
ch_query(f"DROP TABLE {schema}.{tmp_name}")
⚠️ Potential issue | 🟡 Minor

No cleanup on failure in the table-rebuild sequence.

If EXCHANGE succeeds but the final DROP TABLE throws, the tmp table leaks. More critically, if CREATE TABLE ... AS SELECT raises (e.g., a type-cast error in nullIf), the exception propagates and the caller receives an opaque error with no context about which table/column caused the failure. A try/finally guard ensures the tmp table is always dropped.

🔧 Proposed fix
-        ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
-        ch_query(
-            f"CREATE TABLE {schema}.{tmp_name} "
-            f"ENGINE = MergeTree() ORDER BY tuple() "
-            f"AS SELECT {select_sql} FROM {schema}.{table_name}"
-        )
-        ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
-        ch_query(f"DROP TABLE {schema}.{tmp_name}")
+        ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
+        try:
+            ch_query(
+                f"CREATE TABLE {schema}.{tmp_name} "
+                f"ENGINE = MergeTree() ORDER BY tuple() "
+                f"AS SELECT {select_sql} FROM {schema}.{table_name}"
+            )
+            ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
+        finally:
+            ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
🧰 Tools
🪛 Ruff (0.15.2)

[error] 361-363: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` around lines 359 - 366, Wrap the
temporary-table rebuild sequence that calls ch_query (CREATE TABLE ..., EXCHANGE
TABLES, DROP TABLE) in a try/finally so the tmp table (tmp_name) is always
dropped from schema even if CREATE or EXCHANGE raises; additionally, catch
exceptions around the CREATE TABLE ... AS SELECT (using select_sql) to re-raise
a new error or augment the exception with contextual info (schema, table_name,
tmp_name, select_sql) before letting it propagate, and perform the final
ch_query(f"DROP TABLE {schema}.{tmp_name}") in the finally block to guarantee
cleanup.
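The cleanup guarantee can be sketched in isolation with a stand-in query helper (FakeClickHouse and rebuild_table are hypothetical names; only the DROP/CREATE/EXCHANGE sequence mirrors the reviewed code):

```python
class FakeClickHouse:
    """Stand-in for the HTTP query helper, used only to demonstrate that the
    tmp table is dropped on both the success and the failure path."""
    def __init__(self, fail_on_create: bool = False):
        self.fail_on_create = fail_on_create
        self.queries = []

    def query(self, sql: str) -> None:
        self.queries.append(sql)
        if self.fail_on_create and sql.startswith("CREATE TABLE"):
            raise RuntimeError(f"simulated failure in: {sql!r}")


def rebuild_table(ch, schema: str, table_name: str, select_sql: str) -> None:
    tmp_name = f"{table_name}__elementary_tmp"
    ch.query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
    try:
        ch.query(
            f"CREATE TABLE {schema}.{tmp_name} "
            f"ENGINE = MergeTree() ORDER BY tuple() "
            f"AS SELECT {select_sql} FROM {schema}.{table_name}"
        )
        ch.query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
    except Exception as exc:
        # Re-raise with context so the caller sees which table/select failed.
        raise RuntimeError(
            f"ClickHouse seed rebuild failed for {schema}.{table_name}: {exc}"
        ) from exc
    finally:
        # Runs whether or not CREATE/EXCHANGE succeeded — no tmp table leaks.
        ch.query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
```

On failure the exception still propagates (now with context), but the finally block guarantees the trailing DROP.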

- Revert NOT IN subquery back to LEFT JOIN IS NULL (join_use_nulls=1 handles NULLs)
- Add _fix_seed_if_needed to seed_context for ClickHouse NULL fix
- Add try/finally for cleanup in table-rebuild sequence
- Handle Nullable wrapping to avoid Nullable(Nullable(...))
- Handle FixedString/LowCardinality string variants
- Add warning when cols_result is empty
- Backtick-quote column names in ClickHouse ALTER statements

Co-Authored-By: unknown <>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
integration_tests/tests/dbt_project.py (2)

317-318: ⚠️ Potential issue | 🟡 Minor

table_name and schema are still interpolated directly into raw SQL (S608).

This was flagged in a prior review and remains unaddressed. Both identifiers should be validated (e.g., re.fullmatch(r"[A-Za-z0-9_]+", name)) before interpolation. The same pattern recurs in lines 375–382.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` around lines 317 - 318, The SQL
builder is interpolating identifiers directly (variables table_name and schema)
causing injection risk; validate both identifiers with a strict regex (e.g., use
re.fullmatch(r"[A-Za-z0-9_]+", schema) and re.fullmatch(r"[A-Za-z0-9_]+",
table_name)) before formatting the query, raise a ValueError if validation
fails, and then interpolate only the validated values into the f-string used to
build the "SELECT name, type FROM system.columns ..." query; apply the same
validation/fix to the other occurrences where system.columns is queried (the
block that constructs the query using schema/table_name at the later
occurrence).
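A minimal sketch of the suggested identifier validation (the helper names are hypothetical; only the regex and the system.columns query shape come from the review comment):

```python
import re

_IDENT_RE = re.compile(r"[A-Za-z0-9_]+")


def validate_identifier(name: str) -> str:
    """Reject anything that is not a plain identifier before it is
    interpolated into a raw SQL string (addresses Ruff S608)."""
    if not _IDENT_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def columns_query(schema: str, table_name: str) -> str:
    return (
        "SELECT name, type FROM system.columns "
        f"WHERE database = '{validate_identifier(schema)}' "
        f"AND table = '{validate_identifier(table_name)}'"
    )
```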

303-303: Hardcoded "default" schema and redundant ternary.

Two issues on this line:

  1. Redundant ternary: f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default" always evaluates to f"default{SCHEMA_NAME_SUFFIX}" — when SCHEMA_NAME_SUFFIX is "", the f-string already produces "default".

  2. Hardcoded schema name (pre-existing): The "default" base is still hardcoded. If the dbt ClickHouse profile targets a different database, system.columns returns nothing and the NULL repair silently no-ops — the newly added warning log mitigates observability but does not fix the root cause. The schema should be read from the active dbt target (e.g., via self.dbt_runner or a target-properties helper analogous to get_database_and_schema_properties).

♻️ Simplification for the redundant ternary (independent of fix `#2`)
-        schema = f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default"
+        schema = f"default{SCHEMA_NAME_SUFFIX}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` at line 303, The schema determination
uses a redundant ternary and a hardcoded "default" base; replace the
f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default" expression
by deriving the schema from the active dbt target instead: call the existing dbt
target helper (e.g., use self.dbt_runner or a helper analogous to
get_database_and_schema_properties) to obtain the configured schema (and then
append SCHEMA_NAME_SUFFIX if present), and remove the redundant conditional so
schema is computed from the dbt target properties rather than the hardcoded
"default".
🧹 Nitpick comments (1)
integration_tests/tests/dbt_project.py (1)

292-292: Move import urllib.request to module level.

Imports inside method bodies hide dependencies and re-execute the import lookup on every call. Since urllib.request is a stdlib module, there is no conditional-import justification here.

♻️ Proposed fix
+import urllib.request
 import json
 import os
 from contextlib import contextmanager, nullcontext
-        import urllib.request
-
         # Find columns that contain at least one NULL in the original data
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` at line 292, Move the local import of
urllib.request out of the function and place a single top-level import at the
module level; specifically, remove the inline "import urllib.request" found in
the test body and add "import urllib.request" with the other imports at the top
of the dbt_project.py module so the stdlib dependency is declared once and not
re-imported on every call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`:
- Around line 17-19: The ALTER TABLE ... UPDATE issued inside the statement call
(named 'update_nulls_' ~ col['name']) is asynchronous and must wait for
completion; modify the UPDATE statement for relation and column ({{ relation }},
{{ col['name'] }}) to include "SETTINGS mutations_sync = 1" so the mutation
completes before returning (i.e., ALTER TABLE {{ relation }} UPDATE `{{
col['name'] }}` = NULL WHERE `{{ col['name'] }}` = '' SETTINGS mutations_sync =
1).

In `@integration_tests/tests/dbt_project.py`:
- Line 312: The call to urllib.request.urlopen(req) can block indefinitely if
ClickHouse is unreachable; update the call to pass a sensible timeout (e.g., add
a timeout argument like timeout=10 or use a configurable CLICKHOUSE_TIMEOUT
constant) at the site where req is used, and wrap the call in exception handling
for urllib.error.URLError/socket.timeout so tests fail fast with a clear message
instead of hanging; locate the call by the urllib.request.urlopen(req)
expression in integration_tests/tests/dbt_project.py and apply the timeout and
error handling there.
- Around line 304-309: The ch_query helper hardcodes ClickHouse connection
details (ch_url="http://localhost:8123" and query params
user=default&password=default) which will fail in non-default CI; update
ch_query (and the ch_url construction) to read host, port, user, and password
from environment variables (e.g., CLICKHOUSE_HOST/PORT/USER/PASSWORD) or,
preferably, from the existing self.dbt_runner dbt profile/target if available,
falling back to the current defaults only if no config is found; ensure the
constructed request URL and query params use those values and add a clear
error/exception if none are available so failures are not silent.
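A possible shape for the env-var fallback (the CLICKHOUSE_* variable names are assumptions from the prompt, not existing project configuration; the defaults mirror the current docker-compose values):

```python
import os


def clickhouse_connection_from_env() -> dict:
    """Read ClickHouse connection details from the environment, falling back
    to the docker-compose defaults only when nothing is configured."""
    return {
        "host": os.environ.get("CLICKHOUSE_HOST", "localhost"),
        "port": int(os.environ.get("CLICKHOUSE_PORT", "8123")),
        "user": os.environ.get("CLICKHOUSE_USER", "default"),
        "password": os.environ.get("CLICKHOUSE_PASSWORD", "default"),
    }
```

Reading from the active dbt profile/target, as the prompt prefers, would replace the env lookups with the runner's target properties.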


ℹ️ Review info

Configuration used: defaults


📥 Commits

Reviewing files that changed from the base of the PR and between 1d2fe6c and d43adc6.

📒 Files selected for processing (2)
  • integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql
  • integration_tests/tests/dbt_project.py

f"{ch_url}/?user=default&password=default&mutations_sync=1",
data=encoded,
)
with urllib.request.urlopen(req) as resp:
⚠️ Potential issue | 🟠 Major

urlopen has no timeout — CI job will hang if ClickHouse is unreachable.

Without a timeout argument, urllib.request.urlopen blocks indefinitely. A DDL-class query (CREATE/EXCHANGE/DROP) that stalls will never time out.

🔧 Proposed fix
-            with urllib.request.urlopen(req) as resp:
+            with urllib.request.urlopen(req, timeout=60) as resp:
🧰 Tools
🪛 Ruff (0.15.2)

[error] 312-312: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` at line 312, The call to
urllib.request.urlopen(req) can block indefinitely if ClickHouse is unreachable;
update the call to pass a sensible timeout (e.g., add a timeout argument like
timeout=10 or use a configurable CLICKHOUSE_TIMEOUT constant) at the site where
req is used, and wrap the call in exception handling for
urllib.error.URLError/socket.timeout so tests fail fast with a clear message
instead of hanging; locate the call by the urllib.request.urlopen(req)
expression in integration_tests/tests/dbt_project.py and apply the timeout and
error handling there.
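A hedged sketch of the timeout-plus-error-handling wrapper (open_with_timeout is a hypothetical name and the 60-second default is arbitrary; only the urlopen timeout argument and the URLError/timeout handling come from the review comment):

```python
import socket
import urllib.error
import urllib.request

CLICKHOUSE_TIMEOUT = 60  # seconds; tune for the CI environment


def open_with_timeout(req, timeout: float = CLICKHOUSE_TIMEOUT):
    """Fail fast with a clear message instead of hanging indefinitely
    when ClickHouse is unreachable or a DDL-class query stalls."""
    try:
        return urllib.request.urlopen(req, timeout=timeout)
    except (urllib.error.URLError, socket.timeout) as exc:
        raise RuntimeError(
            f"ClickHouse HTTP query failed or timed out after {timeout}s: {exc}"
        ) from exc
```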

…tion guard, mutations_sync

Co-Authored-By: unknown <>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
integration_tests/tests/dbt_project.py (1)

303-303: ⚠️ Potential issue | 🟠 Major

schema is still hardcoded, so repair can silently no-op on non-default ClickHouse targets.

Line 303 still pins schema to default.... If the active dbt target schema/database differs, the system.columns lookup will miss and the fix path won't run for the seeded table.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/dbt_project.py` at line 303, The schema is hardcoded
to "default" (schema variable) so the repair will no-op for non-default dbt
targets; change the assignment to derive the active target schema instead of
using the literal "default" (e.g., set schema to the active dbt target schema
plus SCHEMA_NAME_SUFFIX, falling back to the existing default only if the target
schema is absent). Update the code that queries system.columns to use this
computed schema (and database if applicable) so the lookup finds columns for the
current target; reference the schema variable and SCHEMA_NAME_SUFFIX when making
this change and ensure a safe fallback path remains.
🧹 Nitpick comments (2)
integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql (2)

28-31: Non-ClickHouse path does not quote column identifiers.

The ClickHouse branch correctly backtick-quotes col['name'], but the non-ClickHouse branch emits bare {{ col["name"] }}. If any seed column is a reserved word in Postgres/Snowflake/BigQuery/etc., the UPDATE will fail. This is pre-existing behaviour, but since the file is being rewritten anyway it would be low-effort to add adapter-appropriate quoting here too (e.g. via adapter.quote(col["name"])).

♻️ Suggested improvement
             {% if normalized_data_type == "string" %}
                 {% set update_query %}
                     update {{ relation }}
-                    set {{ col["name"] }} = NULL
-                    where {{ col["name"] }} = ''
+                    set {{ adapter.quote(col["name"]) }} = NULL
+                    where {{ adapter.quote(col["name"]) }} = ''
                 {% endset %}
                 {% do elementary.run_query(update_query) %}
             {% endif %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`
around lines 28 - 31, The non-ClickHouse branch builds an UPDATE using bare
column names ({{ col["name"] }}) which can break on reserved identifiers; modify
the generated SQL in the update_query to use the adapter quoting helper (e.g.
adapter.quote(col["name"])) for the column in both the SET and WHERE clauses so
the column identifier is quoted appropriately for the current adapter (ensure
you update the template where update_query is set and any places referencing {{
col["name"] }} to use the quoted form).

10-21: Consider wrapping both ClickHouse statements in a single loop iteration's error context.

If the MODIFY COLUMN succeeds but the subsequent UPDATE fails (e.g., due to a transient error), the column is left as Nullable(String) with empty strings still present. This is an unlikely but possible partial-application state. Worth noting for awareness — no action required unless you want to add a comment or a defensive check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`
around lines 10 - 21, Wrap the two ClickHouse operations for each column in a
single error-handling block so they succeed or are rolled back logically: inside
the loop in replace_empty_strings_with_nulls.sql (the iteration using columns,
elementary.get_column_data_type and elementary.normalize_data_type), execute the
existing statement calls ('alter_nullable_' ~ col['name'] and 'update_nulls_' ~
col['name']) inside a try/except Jinja block (or equivalent dbt error-handling)
and on exception run a compensating ALTER to revert the column to its original
(non-Nullable) type using the previously fetched data_type, then re-raise the
error so failures are visible; ensure you reference and reuse the same generated
statement names to locate and control the two operations for each col.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration_tests/tests/dbt_project.py`:
- Around line 312-315: The URL currently interpolates ch_user and ch_password
directly into the query string when building the urllib.request.Request, which
breaks for special characters; fix by URL-encoding the credentials before
constructing the request (use urllib.parse.quote_plus or build the query via
urllib.parse.urlencode) so ch_user and ch_password are escaped, then pass the
encoded URL to urllib.request.Request (update the code that constructs req with
ch_url, ch_user, ch_password and mutations_sync).
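A small sketch of building the query string with urllib.parse.urlencode so credentials are escaped (the function name is hypothetical):

```python
import urllib.parse


def build_clickhouse_url(ch_url: str, user: str, password: str) -> str:
    """URL-encode credentials so special characters ('&', '@', '=', ...)
    cannot break the query string or leak into adjacent parameters."""
    params = urllib.parse.urlencode(
        {"user": user, "password": password, "mutations_sync": 1}
    )
    return f"{ch_url}/?{params}"
```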


ℹ️ Review info

Configuration used: defaults


📥 Commits

Reviewing files that changed from the base of the PR and between d43adc6 and e36e33e.

📒 Files selected for processing (2)
  • integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql
  • integration_tests/tests/dbt_project.py

…t freshness Nullable cast

Co-Authored-By: unknown <>
…Nullable dimension cast

Co-Authored-By: unknown <>
…ead of hardcoding 'default'

Co-Authored-By: unknown <>
{% do elementary.run_query(update_query) %}
{% endif %}
{% endfor %}
{% if target.type == "clickhouse" %}
Collaborator

We normally solve this by using a macro with a "default__" and a "clickhouse__" implementation; could we do the same here?

@@ -0,0 +1,8 @@
<clickhouse>
<profiles>
<default>
Collaborator

Just curious: could you elaborate on these flags and why they are needed?
I'm wondering whether it will just work for users that have ClickHouse in the cloud, or whether they also have to modify settings.

<clickhouse>
<profiles>
<default>
<join_use_nulls>1</join_use_nulls>
Contributor Author

These flags are only being set in the integration-tests ClickHouse Docker container to make behavior deterministic:

  • join_use_nulls=1: needed for correct anti-join semantics (LEFT JOIN ... WHERE right_side_key IS NULL) since ClickHouse otherwise returns type defaults instead of NULLs for unmatched rows.
  • mutations_sync=1: makes ALTER TABLE .. UPDATE/DELETE synchronous (mostly for deterministic tests / seed-fixing flows that update then immediately read).

For managed/Cloud ClickHouse, users typically don’t need server config changes if they can set these per-session/per-query (e.g. dbt-clickhouse custom_settings for join_use_nulls, and SETTINGS mutations_sync=1 where relevant).
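As a sketch, a hypothetical profiles.yml target showing the per-session route via dbt-clickhouse's custom_settings (key names should be verified against the dbt-clickhouse documentation before use):

```yaml
# Hypothetical profiles.yml fragment — illustrative only.
elementary_tests:
  target: clickhouse
  outputs:
    clickhouse:
      type: clickhouse
      host: localhost
      port: 8123
      user: default
      password: default
      schema: default
      custom_settings:
        join_use_nulls: 1    # NULLs for unmatched LEFT JOIN rows
        mutations_sync: 1    # ALTER TABLE ... UPDATE waits for completion
```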

…ngs_with_nulls

The macro is only called for BigQuery fusion seeds. ClickHouse seed NULL
repair is handled by fix_clickhouse_seed_nulls() in clickhouse_utils.py
via the HTTP API (covers all column types, not just strings).

Co-Authored-By: unknown <>
{%- else -%}
{{ return('') }}
{%- endif -%}
trim(BOTH '"' FROM splitByChar('.', full_table_name)[{{ part_index }}]) AS {{ part_name }}
Collaborator

This doesn't break anything? I wonder why before there was a difference between database and schema in this code.
I guess it was a bug?

Contributor Author

Yes, the old code was a bug. The original clickhouse__full_name_split assumed 3-part names (database.schema.table at indices 1, 2, 3), but the corresponding full_table_name macro only produced 2-part names for all adapters (there was no ClickHouse dispatch for full_table_name before this PR).

So the old code was splitting a 2-part string (schema.table) with 3-part indices — database_name would get index 1 (= schema), schema_name would get index 2 (= table), and table_name would get index 3 (= empty string). This was silently wrong but never surfaced because ClickHouse tests were all skipped.

The fix makes it consistent: full_table_name now has a ClickHouse dispatch that produces 2-part names (schema.table), and full_name_split splits accordingly — both database_name and schema_name map to index 1 (since database=schema in ClickHouse), and table_name maps to index 2. This is validated by CI (all 186 ClickHouse tests pass).
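The index mismatch described above is easy to reproduce in pure Python (a toy model of the Jinja macro, not the macro itself; ClickHouse's splitByChar returns 1-indexed arrays, mirrored here):

```python
def full_name_split(full_table_name: str, three_part: bool) -> dict:
    """Model the old (3-part) vs fixed (2-part) index mapping applied to the
    2-part name that full_table_name actually produces on ClickHouse."""
    parts = full_table_name.split(".")

    def get(i: int) -> str:  # 1-indexed, like ClickHouse splitByChar arrays
        return parts[i - 1] if i <= len(parts) else ""

    if three_part:
        # Old, buggy mapping: assumes database.schema.table at indices 1..3.
        return {"database": get(1), "schema": get(2), "table": get(3)}
    # Fixed mapping: database == schema in ClickHouse, table at index 2.
    return {"database": get(1), "schema": get(1), "table": get(2)}
```

Running it against a 2-part name shows the old mapping shifting schema into table and leaving table empty, exactly as described.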
