feat: add full ClickHouse support - remove all skip_targets markers (CORE-397) #934

devin-ai-integration[bot] wants to merge 14 commits into master
Conversation
Remove ClickHouse from skip_targets in all integration test files to enable full ClickHouse support testing. For multi-target skip lists (e.g. in test_schema_changes.py and test_exposure_schema_validity.py), only 'clickhouse' was removed while keeping other targets. Also remove now-unused 'import pytest' statements in files where pytest was only imported for the skip_targets decorator. CORE-397 Co-Authored-By: unknown <>
Reduce the warehouse-type matrix to only [clickhouse] to enable fast iteration on ClickHouse test fixes. Will be restored to full matrix once all ClickHouse tests pass. CORE-397 Co-Authored-By: unknown <>
Note: Reviews paused. This branch is under active development; to avoid an influx of review comments on new commits, CodeRabbit has automatically paused this review. This behavior is configurable, and reviews can be managed with CodeRabbit's commands.
📝 Walkthrough

CI now targets only ClickHouse; many integration tests previously skipped for ClickHouse are enabled. ClickHouse-specific null-handling was added for macros and seed processing, and the ClickHouse configuration was extended with a users.xml mount and profile settings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Seeder as "dbt_project.seed"
    participant Fixer as "_fix_clickhouse_seed_nulls"
    participant ClickHouse as "ClickHouse HTTP API\n(system.columns, DDL, INSERT)"
    Seeder->>Fixer: call _fix_clickhouse_seed_nulls(table_name, data)
    Fixer->>ClickHouse: query system.columns to get column types
    ClickHouse-->>Fixer: return column metadata
    Fixer->>ClickHouse: CREATE TEMP table with Nullable types (DDL)
    Fixer->>ClickHouse: INSERT data using nullIf expressions (HTTP POST)
    Fixer->>ClickHouse: ALTER/EXCHANGE to replace original table (DDL)
    ClickHouse-->>Fixer: acknowledge DDL/INSERT
    Fixer-->>Seeder: return / continue seed flow
```
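As a rough illustration of the nullIf step in the flow above, a helper along these lines could build the SELECT list for the rebuilt table. This is a sketch only; `build_nullif_exprs` is a hypothetical name, and the PR's actual implementation may handle more types:

```python
def build_nullif_exprs(columns):
    """Build SELECT expressions that map sentinel defaults back to NULL.

    columns: list of (name, clickhouse_type) pairs, e.g. from system.columns.
    Empty strings in string columns become NULL via nullIf, and every column
    is cast to a Nullable type so the rebuilt table can actually hold NULLs.
    """
    exprs = []
    for name, col_type in columns:
        if "String" in col_type:
            # Treat '' as the sentinel for NULL in string columns.
            exprs.append(f"nullIf(`{name}`, '')::Nullable({col_type}) AS `{name}`")
        else:
            # Non-string columns are only re-typed as Nullable.
            exprs.append(f"`{name}`::Nullable({col_type}) AS `{name}`")
    return ", ".join(exprs)


select_sql = build_nullif_exprs([("id", "Int64"), ("name", "String")])
```

The resulting `select_sql` would then be interpolated into the `CREATE TABLE ... AS SELECT` rebuild shown later in this review.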
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
integration_tests/tests/test_exposure_schema_validity.py (1)
147-156: ⚠️ Potential issue | 🟡 Minor

Add a ClickHouse "NUMERIC" entry to the `data_type` mapping.

ClickHouse is an active integration test target but currently lacks an explicit entry in the mapping (lines 147-156). The fallback to lowercase "numeric" is problematic: ClickHouse's type system recognizes NUMERIC as a valid alias for its Decimal type, but the lowercase variant may not be properly introspected. Add an explicit entry to align with ClickHouse's type conventions and match the uppercase pattern used by other targets:

```diff
 data_type = {
     "snowflake": "NUMERIC",
     "bigquery": "NUMERIC",
     "spark": "int",
     "databricks": "int",
     "databricks_catalog": "int",
     "athena": "int",
     "trino": "int",
     "dremio": "int",
+    "clickhouse": "NUMERIC",
 }.get(dbt_project.dbt_runner.target, "numeric")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/test_exposure_schema_validity.py` around lines 147 - 156, The mapping for data_type in test_exposure_schema_validity.py omits ClickHouse, causing it to fall back to a lowercase "numeric"; update the literal mapping used to derive data_type (the dict assigned to variable data_type) to include an explicit "clickhouse": "NUMERIC" entry so dbt_project.dbt_runner.target resolves to the uppercase NUMERIC for ClickHouse like other adapters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/test-all-warehouses.yml:
- Line 90: The CI workflow currently restricts the test matrix by setting the
"warehouse-type" matrix to just [clickhouse]; restore the full warehouse matrix
before merging by reverting that change so "warehouse-type" includes all
intended entries (e.g., clickhouse plus the other warehouse and fusion variants
used elsewhere in the matrix), ensuring the job definitions that reference
"warehouse-type" (the matrix key "warehouse-type") run across the full set of
targets rather than a single [clickhouse] value.
---
Outside diff comments:
In `@integration_tests/tests/test_exposure_schema_validity.py`:
- Around line 147-156: The mapping for data_type in
test_exposure_schema_validity.py omits ClickHouse, causing it to fall back to a
lowercase "numeric"; update the literal mapping used to derive data_type (the
dict assigned to variable data_type) to include an explicit "clickhouse":
"NUMERIC" entry so dbt_project.dbt_runner.target resolves to the uppercase
NUMERIC for ClickHouse like other adapters.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (21)
- .github/workflows/test-all-warehouses.yml
- integration_tests/tests/test_all_columns_anomalies.py
- integration_tests/tests/test_anomalies_backfill_logic.py
- integration_tests/tests/test_anomalies_ranges.py
- integration_tests/tests/test_anomaly_exclude_metrics.py
- integration_tests/tests/test_anomaly_test_configuration.py
- integration_tests/tests/test_collect_metrics.py
- integration_tests/tests/test_column_anomalies.py
- integration_tests/tests/test_column_pii_sampling.py
- integration_tests/tests/test_dbt_artifacts/test_artifacts.py
- integration_tests/tests/test_dimension_anomalies.py
- integration_tests/tests/test_disable_samples_config.py
- integration_tests/tests/test_event_freshness_anomalies.py
- integration_tests/tests/test_exposure_schema_validity.py
- integration_tests/tests/test_failed_row_count.py
- integration_tests/tests/test_freshness_anomalies.py
- integration_tests/tests/test_override_samples_config.py
- integration_tests/tests/test_sampling.py
- integration_tests/tests/test_sampling_pii.py
- integration_tests/tests/test_schema_changes.py
- integration_tests/tests/test_volume_anomalies.py
💤 Files with no reviewable changes (18)
- integration_tests/tests/test_anomalies_backfill_logic.py
- integration_tests/tests/test_all_columns_anomalies.py
- integration_tests/tests/test_override_samples_config.py
- integration_tests/tests/test_anomaly_test_configuration.py
- integration_tests/tests/test_dimension_anomalies.py
- integration_tests/tests/test_freshness_anomalies.py
- integration_tests/tests/test_column_anomalies.py
- integration_tests/tests/test_column_pii_sampling.py
- integration_tests/tests/test_volume_anomalies.py
- integration_tests/tests/test_sampling.py
- integration_tests/tests/test_event_freshness_anomalies.py
- integration_tests/tests/test_collect_metrics.py
- integration_tests/tests/test_sampling_pii.py
- integration_tests/tests/test_anomalies_ranges.py
- integration_tests/tests/test_dbt_artifacts/test_artifacts.py
- integration_tests/tests/test_disable_samples_config.py
- integration_tests/tests/test_anomaly_exclude_metrics.py
- integration_tests/tests/test_failed_row_count.py
```yaml
# warehouse-type: redshift
- dbt-version: "${{ inputs.dbt-version || 'fusion' }}"
  warehouse-type: databricks_catalog
warehouse-type: [clickhouse]
```
Restore full warehouse matrix before merge (current single-target matrix is merge-blocking).
Line 90 limits CI to ClickHouse only, which drops regression coverage for all other warehouses/fusion variants. Keep this for WIP iteration if needed, but restore the full matrix before this PR is merged.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/test-all-warehouses.yml at line 90, The CI workflow
currently restricts the test matrix by setting the "warehouse-type" matrix to
just [clickhouse]; restore the full warehouse matrix before merging by reverting
that change so "warehouse-type" includes all intended entries (e.g., clickhouse
plus the other warehouse and fusion variants used elsewhere in the matrix),
ensuring the job definitions that reference "warehouse-type" (the matrix key
"warehouse-type") run across the full set of targets rather than a single
[clickhouse] value.
…lity ClickHouse LEFT OUTER JOIN produces default values (e.g. 1970-01-01 for DateTime) instead of NULL for unmatched rows, causing the anti-join pattern to fail. Changed missing_bucket_starts CTE to use NOT IN which works correctly on all databases. Co-Authored-By: unknown <>
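The failure mode this commit describes can be simulated in plain Python: with `join_use_nulls=0`, an unmatched LEFT JOIN row carries the column's type default (the epoch for DateTime) rather than NULL, so an `IS NULL` anti-join finds nothing, while NOT IN is unaffected. This is a toy model of the semantics, not ClickHouse itself:

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)  # ClickHouse DateTime default for "no value"

all_buckets = [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)]
existing = {datetime(2024, 1, 2)}

# LEFT JOIN with join_use_nulls=0: unmatched rows carry the type default
# (the epoch), not NULL, so "right side IS NULL" matches nothing.
joined = [(b, b if b in existing else EPOCH) for b in all_buckets]
anti_join_missing = [b for b, right in joined if right is None]

# NOT IN sidesteps the issue entirely: no join row, no default value.
not_in_missing = [b for b in all_buckets if b not in existing]

print(anti_join_missing)  # []
print(not_in_missing)     # [datetime.datetime(2024, 1, 1, 0, 0), datetime.datetime(2024, 1, 3, 0, 0)]
```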
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@macros/edr/data_monitoring/data_monitors_configuration/get_buckets_configuration.sql`:
- Around line 83-85: The query uses row-value NOT IN between
(all_buckets.bucket_start, all_buckets.bucket_end) and the subquery selecting
bucket_start, bucket_end from buckets_with_existing_metrics, which is
unsupported on Redshift/Athena/Dremio and problematic on Databricks; replace
this pattern with an equivalent NOT EXISTS correlated subquery: for each row in
all_buckets check NOT EXISTS (select 1 from buckets_with_existing_metrics b
where b.bucket_start = all_buckets.bucket_start AND b.bucket_end =
all_buckets.bucket_end), updating the WHERE clause in
get_buckets_configuration.sql so comparisons reference all_buckets.bucket_start
and all_buckets.bucket_end and the subquery alias (e.g.,
b.bucket_start/b.bucket_end) to ensure platform portability and correct NULL
behavior.
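The correlated NOT EXISTS shape the prompt describes is portable across engines; a quick sketch against SQLite (stand-in data, with the table and column names taken from the comment) shows the rewrite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_buckets (bucket_start TEXT, bucket_end TEXT);
    CREATE TABLE buckets_with_existing_metrics (bucket_start TEXT, bucket_end TEXT);
    INSERT INTO all_buckets VALUES ('01', '02'), ('02', '03'), ('03', '04');
    INSERT INTO buckets_with_existing_metrics VALUES ('02', '03');
""")

# Correlated NOT EXISTS: a portable replacement for row-value NOT IN,
# and immune to the NULL pitfalls of NOT IN subqueries.
missing = sorted(conn.execute("""
    SELECT bucket_start, bucket_end
    FROM all_buckets
    WHERE NOT EXISTS (
        SELECT 1 FROM buckets_with_existing_metrics b
        WHERE b.bucket_start = all_buckets.bucket_start
          AND b.bucket_end = all_buckets.bucket_end
    )
""").fetchall())

print(missing)  # [('01', '02'), ('03', '04')]
```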
📒 Files selected for processing (1)
macros/edr/data_monitoring/data_monitors_configuration/get_buckets_configuration.sql
- Use Nullable(Float32) in clickhouse__standard_deviation and clickhouse__variance to handle CASE expressions that return NULL - Add _fix_clickhouse_seed_nulls() to rebuild seed tables with proper Nullable types using ClickHouse HTTP API with nullIf() function - Configure ClickHouse Docker with join_use_nulls=1 and mutations_sync=1 - Fix unused variable lint warning in dbt_project.py Co-Authored-By: unknown <>
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
integration_tests/tests/dbt_project.py (1)
368-375: ⚠️ Potential issue | 🟠 Major

`seed_context` doesn't apply the ClickHouse NULL fix.

`seed()` (line 263) calls `_fix_seed_if_needed(table_name, data)` inside its context manager, but `seed_context` (line 369) never does. Any test using `seed_context` on ClickHouse will silently retain default values where NULLs are expected.

🔧 Proposed fix

```diff
 @contextmanager
 def seed_context(
     self, data: List[dict], table_name: str
 ) -> Generator[None, None, None]:
     with DbtDataSeeder(
         self.dbt_runner, self.project_dir_path, self.seeds_dir_path
     ).seed(data, table_name):
+        self._fix_seed_if_needed(table_name, data)
         yield
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/dbt_project.py` around lines 368 - 375, seed_context fails to apply the ClickHouse NULL transformation because it enters the DbtDataSeeder.seed context without invoking the seeder's _fix_seed_if_needed; update seed_context to create a seeder instance, call seeder._fix_seed_if_needed(table_name, data) before entering the with seeder.seed(data, table_name) block (or call the public helper that applies the fix if available) so ClickHouse NULLs are fixed prior to seeding; references: seed_context, DbtDataSeeder.seed, DbtDataSeeder._fix_seed_if_needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`:
- Around line 14-19: The ALTER statements in the macro use raw {{ col['name'] }}
which breaks on reserved words; update both statement blocks (the ones named
'alter_nullable_' ~ col['name'] and 'update_nulls_' ~ col['name']) to emit the
column identifier wrapped in ClickHouse backticks and escape any backticks in
the column name before quoting so names like from/order/index work correctly;
apply this change to the MODIFY COLUMN and the UPDATE ... WHERE expressions in
the macro.
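A helper for the backtick-quoting described above might look like the following sketch. `quote_ident` is a hypothetical name, and the doubling-based escape is an assumption borrowed from MySQL-style quoting; verify it against ClickHouse's identifier-escaping rules before relying on it:

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, escaping embedded backticks by
    doubling them (assumed MySQL-style escaping; confirm for ClickHouse)."""
    return "`" + name.replace("`", "``") + "`"


# Reserved words and odd names become safe to interpolate into DDL:
print(quote_ident("from"))        # `from`
print(quote_ident("weird`name"))  # `weird``name`
```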
In `@integration_tests/tests/dbt_project.py`:
- Around line 359-366: Wrap the temporary-table rebuild sequence that calls
ch_query (CREATE TABLE ..., EXCHANGE TABLES, DROP TABLE) in a try/finally so the
tmp table (tmp_name) is always dropped from schema even if CREATE or EXCHANGE
raises; additionally, catch exceptions around the CREATE TABLE ... AS SELECT
(using select_sql) to re-raise a new error or augment the exception with
contextual info (schema, table_name, tmp_name, select_sql) before letting it
propagate, and perform the final ch_query(f"DROP TABLE {schema}.{tmp_name}") in
the finally block to guarantee cleanup.
- Line 303: The code hardcodes schema as "default" via schema =
f"default{SCHEMA_NAME_SUFFIX}" which can mismatch the dbt ClickHouse target and
cause system.columns queries to return no rows; update the logic in the dbt
project test to read the effective schema from the dbt runner's target (e.g.,
via self.dbt_runner.target or an exposed property) instead of hardcoding, and if
you must keep a fallback, log a warning when cols_result is empty so the
missing-schema case is observable; modify the code paths that build the
system.columns query and any use of schema/SCHEMA_NAME_SUFFIX accordingly.
- Around line 331-344: The current loop over columns uses exact col_type
equality and can produce nested Nullable casts; update the logic around col_type
handling in the block that fills select_exprs so you first normalize the type by
stripping a leading "Nullable(" wrapper (and its trailing ")") to get base_type,
detect string-like types by checking if "String" is contained in base_type or
base_type.startswith("FixedString"), and treat LowCardinality(String) similarly;
choose default_val = "''" for those string-like base types, "0" for Int*/UInt*
and Float*, otherwise defaultValueOfTypeName(base_type), then append a single
cast to ::Nullable(base_type) (not ::Nullable(original_col_type)) to avoid
constructing Nullable(Nullable(...)). Apply these changes to the same loop that
iterates over columns/nullable_columns and builds select_exprs so col_type,
default_val, and the final cast use the normalized base_type.
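The type normalization and default selection described in the item above can be sketched as follows (hypothetical helper names; the real loop in dbt_project.py may differ):

```python
def normalize_base_type(col_type: str) -> str:
    """Strip a single leading Nullable(...) wrapper to get the base type,
    so a later cast never produces Nullable(Nullable(...))."""
    if col_type.startswith("Nullable(") and col_type.endswith(")"):
        return col_type[len("Nullable("):-1]
    return col_type


def default_for(col_type: str) -> str:
    """Pick a sentinel default literal for a ClickHouse type."""
    base = normalize_base_type(col_type)
    # String-like types: String, FixedString(N), LowCardinality(String)
    if "String" in base or base.startswith("FixedString"):
        return "''"
    if base.startswith(("Int", "UInt", "Float")):
        return "0"
    # Fall back to ClickHouse's own default for the type.
    return f"defaultValueOfTypeName('{base}')"


print(normalize_base_type("Nullable(Int64)"))  # Int64
print(default_for("Nullable(String)"))         # ''
```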
- Around line 308-313: Add a small validator and use it before interpolating
identifiers: implement a helper like _assert_safe_identifier(name: str) that
checks re.fullmatch(r"[A-Za-z0-9_]+", name) and raises on failure, then call it
for table_name and schema (and any other identifiers such as col_name) before
using them in raw SQL; also wrap the validated identifiers in backticks when
building the SQL strings (the places constructing DDL/DDL-like queries and the
code paths referencing table_name, schema, and col_name should be updated) so
that all identifier interpolations are both validated and quoted.
---
Outside diff comments:
In `@integration_tests/tests/dbt_project.py`:
- Around line 368-375: seed_context fails to apply the ClickHouse NULL
transformation because it enters the DbtDataSeeder.seed context without invoking
the seeder's _fix_seed_if_needed; update seed_context to create a seeder
instance, call seeder._fix_seed_if_needed(table_name, data) before entering the
with seeder.seed(data, table_name) block (or call the public helper that applies
the fix if available) so ClickHouse NULLs are fixed prior to seeding;
references: seed_context, DbtDataSeeder.seed, DbtDataSeeder._fix_seed_if_needed.
📒 Files selected for processing (5)
- integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql
- integration_tests/docker-compose.yml
- integration_tests/docker/clickhouse/users.xml
- integration_tests/tests/dbt_project.py
- macros/edr/data_monitoring/monitors/column_numeric_monitors.sql
```python
if not nullable_columns:
    return

schema = f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default"
```
Hardcoded "default" schema may not match the actual dbt ClickHouse target schema.
The system.columns lookup will silently return no rows if the dbt profile points to a different database/schema, causing the NULL fix to no-op without any error or warning. The actual schema should be read from the dbt runner's target configuration rather than being hardcoded.
🔧 Suggested approach
Expose the effective schema via self.dbt_runner or pass it explicitly. As a quick fix, at minimum log a warning when cols_result is empty so a silent skip is observable:
```diff
+    logger.warning(
+        "ClickHouse fix: no columns found for %s.%s – "
+        "schema may be wrong (using '%s'). NULLs will not be repaired.",
+        schema, table_name, schema,
+    )
     if not cols_result:
         return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@integration_tests/tests/dbt_project.py` at line 303, The code hardcodes
schema as "default" via schema = f"default{SCHEMA_NAME_SUFFIX}" which can
mismatch the dbt ClickHouse target and cause system.columns queries to return no
rows; update the logic in the dbt project test to read the effective schema from
the dbt runner's target (e.g., via self.dbt_runner.target or an exposed
property) instead of hardcoding, and if you must keep a fallback, log a warning
when cols_result is empty so the missing-schema case is observable; modify the
code paths that build the system.columns query and any use of
schema/SCHEMA_NAME_SUFFIX accordingly.
```python
ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
ch_query(
    f"CREATE TABLE {schema}.{tmp_name} "
    f"ENGINE = MergeTree() ORDER BY tuple() "
    f"AS SELECT {select_sql} FROM {schema}.{table_name}"
)
ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
ch_query(f"DROP TABLE {schema}.{tmp_name}")
```
No cleanup on failure in the table-rebuild sequence.
If EXCHANGE succeeds but the final DROP TABLE throws, the tmp table leaks. More critically, if CREATE TABLE ... AS SELECT raises (e.g., a type-cast error in nullIf), the exception propagates and the caller receives an opaque error with no context about which table/column caused the failure. A try/finally guard ensures the tmp table is always dropped.
🔧 Proposed fix
```diff
-ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
-ch_query(
-    f"CREATE TABLE {schema}.{tmp_name} "
-    f"ENGINE = MergeTree() ORDER BY tuple() "
-    f"AS SELECT {select_sql} FROM {schema}.{table_name}"
-)
-ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
-ch_query(f"DROP TABLE {schema}.{tmp_name}")
+ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
+try:
+    ch_query(
+        f"CREATE TABLE {schema}.{tmp_name} "
+        f"ENGINE = MergeTree() ORDER BY tuple() "
+        f"AS SELECT {select_sql} FROM {schema}.{table_name}"
+    )
+    ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
+finally:
+    ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
try:
    ch_query(
        f"CREATE TABLE {schema}.{tmp_name} "
        f"ENGINE = MergeTree() ORDER BY tuple() "
        f"AS SELECT {select_sql} FROM {schema}.{table_name}"
    )
    ch_query(f"EXCHANGE TABLES {schema}.{table_name} AND {schema}.{tmp_name}")
finally:
    ch_query(f"DROP TABLE IF EXISTS {schema}.{tmp_name}")
```
🧰 Tools
🪛 Ruff (0.15.2)
[error] 361-363: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@integration_tests/tests/dbt_project.py` around lines 359 - 366, Wrap the
temporary-table rebuild sequence that calls ch_query (CREATE TABLE ..., EXCHANGE
TABLES, DROP TABLE) in a try/finally so the tmp table (tmp_name) is always
dropped from schema even if CREATE or EXCHANGE raises; additionally, catch
exceptions around the CREATE TABLE ... AS SELECT (using select_sql) to re-raise
a new error or augment the exception with contextual info (schema, table_name,
tmp_name, select_sql) before letting it propagate, and perform the final
ch_query(f"DROP TABLE {schema}.{tmp_name}") in the finally block to guarantee
cleanup.
- Revert NOT IN subquery back to LEFT JOIN IS NULL (join_use_nulls=1 handles NULLs) - Add _fix_seed_if_needed to seed_context for ClickHouse NULL fix - Add try/finally for cleanup in table-rebuild sequence - Handle Nullable wrapping to avoid Nullable(Nullable(...)) - Handle FixedString/LowCardinality string variants - Add warning when cols_result is empty - Backtick-quote column names in ClickHouse ALTER statements Co-Authored-By: unknown <>
Actionable comments posted: 3
♻️ Duplicate comments (2)
integration_tests/tests/dbt_project.py (2)
317-318: ⚠️ Potential issue | 🟡 Minor

`table_name` and `schema` are still interpolated directly into raw SQL (S608).

This was flagged in a prior review and remains unaddressed. Both identifiers should be validated (e.g., `re.fullmatch(r"[A-Za-z0-9_]+", name)`) before interpolation. The same pattern recurs in lines 375–382.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/dbt_project.py` around lines 317 - 318, The SQL builder is interpolating identifiers directly (variables table_name and schema) causing injection risk; validate both identifiers with a strict regex (e.g., use re.fullmatch(r"[A-Za-z0-9_]+", schema) and re.fullmatch(r"[A-Za-z0-9_]+", table_name)) before formatting the query, raise a ValueError if validation fails, and then interpolate only the validated values into the f-string used to build the "SELECT name, type FROM system.columns ..." query; apply the same validation/fix to the other occurrences where system.columns is queried (the block that constructs the query using schema/table_name at the later occurrence).
303-303: Hardcoded `"default"` schema and redundant ternary.

Two issues on this line:

1. Redundant ternary: `f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default"` always evaluates to `f"default{SCHEMA_NAME_SUFFIX}"`; when `SCHEMA_NAME_SUFFIX` is `""`, the f-string already produces `"default"`.

2. Hardcoded schema name (pre-existing): the `"default"` base is still hardcoded. If the dbt ClickHouse profile targets a different database, `system.columns` returns nothing and the NULL repair silently no-ops. The newly added warning log mitigates observability but does not fix the root cause; the schema should be read from the active dbt target (e.g., via `self.dbt_runner` or a target-properties helper analogous to `get_database_and_schema_properties`).

♻️ Simplification for the redundant ternary (independent of fix #2)

```diff
-schema = f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default"
+schema = f"default{SCHEMA_NAME_SUFFIX}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/dbt_project.py` at line 303, The schema determination uses a redundant ternary and a hardcoded "default" base; replace the f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX else "default" expression by deriving the schema from the active dbt target instead: call the existing dbt target helper (e.g., use self.dbt_runner or a helper analogous to get_database_and_schema_properties) to obtain the configured schema (and then append SCHEMA_NAME_SUFFIX if present), and remove the redundant conditional so schema is computed from the dbt target properties rather than the hardcoded "default".
🧹 Nitpick comments (1)
integration_tests/tests/dbt_project.py (1)
292-292: Move `import urllib.request` to module level.

Imports inside method bodies hide dependencies and re-execute the import lookup on every call. Since `urllib.request` is a stdlib module, there is no conditional-import justification here.

♻️ Proposed fix

```diff
+import urllib.request
 import json
 import os
 from contextlib import contextmanager, nullcontext
```

```diff
-        import urllib.request
-
         # Find columns that contain at least one NULL in the original data
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/dbt_project.py` at line 292, Move the local import of urllib.request out of the function and place a single top-level import at the module level; specifically, remove the inline "import urllib.request" found in the test body and add "import urllib.request" with the other imports at the top of the dbt_project.py module so the stdlib dependency is declared once and not re-imported on every call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`:
- Around line 17-19: The ALTER TABLE ... UPDATE issued inside the statement call
(named 'update_nulls_' ~ col['name']) is asynchronous and must wait for
completion; modify the UPDATE statement for relation and column ({{ relation }},
{{ col['name'] }}) to include "SETTINGS mutations_sync = 1" so the mutation
completes before returning (i.e., ALTER TABLE {{ relation }} UPDATE `{{
col['name'] }}` = NULL WHERE `{{ col['name'] }}` = '' SETTINGS mutations_sync =
1).
In `@integration_tests/tests/dbt_project.py`:
- Line 312: The call to urllib.request.urlopen(req) can block indefinitely if
ClickHouse is unreachable; update the call to pass a sensible timeout (e.g., add
a timeout argument like timeout=10 or use a configurable CLICKHOUSE_TIMEOUT
constant) at the site where req is used, and wrap the call in exception handling
for urllib.error.URLError/socket.timeout so tests fail fast with a clear message
instead of hanging; locate the call by the urllib.request.urlopen(req)
expression in integration_tests/tests/dbt_project.py and apply the timeout and
error handling there.
- Around line 304-309: The ch_query helper hardcodes ClickHouse connection
details (ch_url="http://localhost:8123" and query params
user=default&password=default) which will fail in non-default CI; update
ch_query (and the ch_url construction) to read host, port, user, and password
from environment variables (e.g., CLICKHOUSE_HOST/PORT/USER/PASSWORD) or,
preferably, from the existing self.dbt_runner dbt profile/target if available,
falling back to the current defaults only if no config is found; ensure the
constructed request URL and query params use those values and add a clear
error/exception if none are available so failures are not silent.
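One way to implement the suggested environment-driven configuration, using the env var names proposed in the comment and falling back to the current defaults. A sketch, not the project's code; `clickhouse_conn_params` is a hypothetical helper name:

```python
import os


def clickhouse_conn_params() -> dict:
    """Resolve ClickHouse connection details from the environment,
    falling back to the currently hardcoded defaults."""
    return {
        "host": os.environ.get("CLICKHOUSE_HOST", "localhost"),
        "port": os.environ.get("CLICKHOUSE_PORT", "8123"),
        "user": os.environ.get("CLICKHOUSE_USER", "default"),
        "password": os.environ.get("CLICKHOUSE_PASSWORD", "default"),
    }


params = clickhouse_conn_params()
ch_url = f"http://{params['host']}:{params['port']}"
```

Reading the values from the dbt profile/target, as the comment prefers, would keep the test helper and dbt pointed at the same server without duplicating configuration.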
---
Duplicate comments:
In `@integration_tests/tests/dbt_project.py`:
- Around line 317-318: The SQL builder is interpolating identifiers directly
(variables table_name and schema) causing injection risk; validate both
identifiers with a strict regex (e.g., use re.fullmatch(r"[A-Za-z0-9_]+",
schema) and re.fullmatch(r"[A-Za-z0-9_]+", table_name)) before formatting the
query, raise a ValueError if validation fails, and then interpolate only the
validated values into the f-string used to build the "SELECT name, type FROM
system.columns ..." query; apply the same validation/fix to the other
occurrences where system.columns is queried (the block that constructs the query
using schema/table_name at the later occurrence).
- Line 303: The schema determination uses a redundant ternary and a hardcoded
"default" base; replace the f"default{SCHEMA_NAME_SUFFIX}" if SCHEMA_NAME_SUFFIX
else "default" expression by deriving the schema from the active dbt target
instead: call the existing dbt target helper (e.g., use self.dbt_runner or a
helper analogous to get_database_and_schema_properties) to obtain the configured
schema (and then append SCHEMA_NAME_SUFFIX if present), and remove the redundant
conditional so schema is computed from the dbt target properties rather than the
hardcoded "default".
---
Nitpick comments:
In `@integration_tests/tests/dbt_project.py`:
- Line 292: Move the local import of urllib.request out of the function and
place a single top-level import at the module level; specifically, remove the
inline "import urllib.request" found in the test body and add "import
urllib.request" with the other imports at the top of the dbt_project.py module
so the stdlib dependency is declared once and not re-imported on every call.
📒 Files selected for processing (2)
- integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql
- integration_tests/tests/dbt_project.py
```python
    f"{ch_url}/?user=default&password=default&mutations_sync=1",
    data=encoded,
)
with urllib.request.urlopen(req) as resp:
```
urlopen has no timeout — CI job will hang if ClickHouse is unreachable.
Without a timeout argument, urllib.request.urlopen blocks indefinitely. A DDL-class query (CREATE/EXCHANGE/DROP) that stalls will never time out.
🔧 Proposed fix
```diff
-with urllib.request.urlopen(req) as resp:
+with urllib.request.urlopen(req, timeout=60) as resp:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
with urllib.request.urlopen(req, timeout=60) as resp:
```
🧰 Tools
🪛 Ruff (0.15.2)
[error] 312-312: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@integration_tests/tests/dbt_project.py` at line 312, The call to
urllib.request.urlopen(req) can block indefinitely if ClickHouse is unreachable;
update the call to pass a sensible timeout (e.g., add a timeout argument like
timeout=10 or use a configurable CLICKHOUSE_TIMEOUT constant) at the site where
req is used, and wrap the call in exception handling for
urllib.error.URLError/socket.timeout so tests fail fast with a clear message
instead of hanging; locate the call by the urllib.request.urlopen(req)
expression in integration_tests/tests/dbt_project.py and apply the timeout and
error handling there.
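The prompt's advice (a bounded timeout plus fail-fast error handling) can be sketched as follows; `CLICKHOUSE_TIMEOUT` and `run_clickhouse_query` are illustrative names, not the PR's actual identifiers:

```python
import socket
import urllib.error
import urllib.request

CLICKHOUSE_TIMEOUT = 10  # seconds; hypothetical constant, tune for your CI

def run_clickhouse_query(url: str, sql: str) -> str:
    req = urllib.request.Request(url, data=sql.encode("utf-8"))
    try:
        # timeout= bounds both connect and read, so a stalled DDL cannot hang CI
        with urllib.request.urlopen(req, timeout=CLICKHOUSE_TIMEOUT) as resp:
            return resp.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout) as exc:
        # Fail fast with a clear message instead of blocking the whole job
        raise RuntimeError(f"ClickHouse query failed or timed out: {exc}") from exc
```

The same pattern applies to every `urlopen` call site in the helper, not just the DDL path.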
…tion guard, mutations_sync Co-Authored-By: unknown <>
Actionable comments posted: 1
♻️ Duplicate comments (1)
integration_tests/tests/dbt_project.py (1)
303-303: ⚠️ Potential issue | 🟠 Major

`schema` is still hardcoded, so the repair can silently no-op on non-default ClickHouse targets. Line 303 still pins `schema` to `default`. If the active dbt target schema/database differs, the `system.columns` lookup will miss and the fix path won't run for the seeded table.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/dbt_project.py` at line 303, The schema is hardcoded to "default" (schema variable) so the repair will no-op for non-default dbt targets; change the assignment to derive the active target schema instead of using the literal "default" (e.g., set schema to the active dbt target schema plus SCHEMA_NAME_SUFFIX, falling back to the existing default only if the target schema is absent). Update the code that queries system.columns to use this computed schema (and database if applicable) so the lookup finds columns for the current target; reference the schema variable and SCHEMA_NAME_SUFFIX when making this change and ensure a safe fallback path remains.
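The repair under discussion rebuilds the seed table through ClickHouse's HTTP API. A minimal sketch of the statement generation, with a parameterized schema rather than a hardcoded default. `build_rebuild_statements` and all identifiers are hypothetical; the real helper also widens columns to Nullable and covers non-string types:

```python
def build_rebuild_statements(schema: str, table: str, string_columns: list) -> list:
    """Return the SQL statements for the nullIf()-based seed rebuild pattern."""
    tmp = f"{table}__null_fix_tmp"
    select_cols = ", ".join(
        f"nullIf(`{c}`, '') AS `{c}`" for c in string_columns
    )
    return [
        # Clone the structure, re-insert with empty strings restored to NULL,
        # then atomically swap the tables and drop the old copy.
        f"CREATE TABLE `{schema}`.`{tmp}` AS `{schema}`.`{table}`",
        f"INSERT INTO `{schema}`.`{tmp}` SELECT {select_cols} FROM `{schema}`.`{table}`",
        f"EXCHANGE TABLES `{schema}`.`{tmp}` AND `{schema}`.`{table}`",
        f"DROP TABLE `{schema}`.`{tmp}`",
    ]
```

Passing the active target schema into such a helper is exactly what the comment above asks for, instead of pinning `default`.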
🧹 Nitpick comments (2)
integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql (2)
28-31: Non-ClickHouse path does not quote column identifiers.

The ClickHouse branch correctly backtick-quotes `col['name']`, but the non-ClickHouse branch emits bare `{{ col["name"] }}`. If any seed column is a reserved word in Postgres/Snowflake/BigQuery/etc., the UPDATE will fail. This is pre-existing behaviour, but since the file is being rewritten anyway it would be low-effort to add adapter-appropriate quoting here too (e.g. via `adapter.quote(col["name"])`).

♻️ Suggested improvement

```diff
 {% if normalized_data_type == "string" %}
   {% set update_query %}
     update {{ relation }}
-    set {{ col["name"] }} = NULL
-    where {{ col["name"] }} = ''
+    set {{ adapter.quote(col["name"]) }} = NULL
+    where {{ adapter.quote(col["name"]) }} = ''
   {% endset %}
   {% do elementary.run_query(update_query) %}
 {% endif %}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql` around lines 28 - 31, The non-ClickHouse branch builds an UPDATE using bare column names ({{ col["name"] }}) which can break on reserved identifiers; modify the generated SQL in the update_query to use the adapter quoting helper (e.g. adapter.quote(col["name"])) for the column in both the SET and WHERE clauses so the column identifier is quoted appropriately for the current adapter (ensure you update the template where update_query is set and any places referencing {{ col["name"] }} to use the quoted form).
10-21: Consider wrapping both ClickHouse statements in a single loop iteration's error context.

If the `MODIFY COLUMN` succeeds but the subsequent `UPDATE` fails (e.g., due to a transient error), the column is left as `Nullable(String)` with empty strings still present. This is an unlikely but possible partial-application state. Worth noting for awareness — no action required unless you want to add a comment or a defensive check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql` around lines 10 - 21, Wrap the two ClickHouse operations for each column in a single error-handling block so they succeed or are rolled back logically: inside the loop in replace_empty_strings_with_nulls.sql (the iteration using columns, elementary.get_column_data_type and elementary.normalize_data_type), execute the existing statement calls ('alter_nullable_' ~ col['name'] and 'update_nulls_' ~ col['name']) inside a try/except Jinja block (or equivalent dbt error-handling) and on exception run a compensating ALTER to revert the column to its original (non-Nullable) type using the previously fetched data_type, then re-raise the error so failures are visible; ensure you reference and reuse the same generated statement names to locate and control the two operations for each col.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@integration_tests/tests/dbt_project.py`:
- Around line 312-315: The URL currently interpolates ch_user and ch_password
directly into the query string when building the urllib.request.Request, which
breaks for special characters; fix by URL-encoding the credentials before
constructing the request (use urllib.parse.quote_plus or build the query via
urllib.parse.urlencode) so ch_user and ch_password are escaped, then pass the
encoded URL to urllib.request.Request (update the code that constructs req with
ch_url, ch_user, ch_password and mutations_sync).
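A sketch of the encoding fix, assuming the `ch_url`, `ch_user` and `ch_password` variables named in the comment (values here are made up):

```python
from urllib.parse import urlencode

ch_url = "http://localhost:8123"
ch_user = "default"
ch_password = "p@ss w0rd&1"  # characters that break naive f-string interpolation

# urlencode() percent-/plus-encodes each value, so '@', ' ' and '&' in the
# password cannot be mistaken for query-string syntax.
query = urlencode({"user": ch_user, "password": ch_password, "mutations_sync": 1})
url = f"{ch_url}/?{query}"
```

The encoded `url` can then be passed to `urllib.request.Request` unchanged.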
---
Duplicate comments:
In `@integration_tests/tests/dbt_project.py`:
- Line 303: The schema is hardcoded to "default" (schema variable) so the repair
will no-op for non-default dbt targets; change the assignment to derive the
active target schema instead of using the literal "default" (e.g., set schema to
the active dbt target schema plus SCHEMA_NAME_SUFFIX, falling back to the
existing default only if the target schema is absent). Update the code that
queries system.columns to use this computed schema (and database if applicable)
so the lookup finds columns for the current target; reference the schema
variable and SCHEMA_NAME_SUFFIX when making this change and ensure a safe
fallback path remains.
---
Nitpick comments:
In `@integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql`:
- Around line 28-31: The non-ClickHouse branch builds an UPDATE using bare
column names ({{ col["name"] }}) which can break on reserved identifiers; modify
the generated SQL in the update_query to use the adapter quoting helper (e.g.
adapter.quote(col["name"])) for the column in both the SET and WHERE clauses so
the column identifier is quoted appropriately for the current adapter (ensure
you update the template where update_query is set and any places referencing {{
col["name"] }} to use the quoted form).
- Around line 10-21: Wrap the two ClickHouse operations for each column in a
single error-handling block so they succeed or are rolled back logically: inside
the loop in replace_empty_strings_with_nulls.sql (the iteration using columns,
elementary.get_column_data_type and elementary.normalize_data_type), execute the
existing statement calls ('alter_nullable_' ~ col['name'] and 'update_nulls_' ~
col['name']) inside a try/except Jinja block (or equivalent dbt error-handling)
and on exception run a compensating ALTER to revert the column to its original
(non-Nullable) type using the previously fetched data_type, then re-raise the
error so failures are visible; ensure you reference and reuse the same generated
statement names to locate and control the two operations for each col.
📒 Files selected for processing (2)
- integration_tests/dbt_project/macros/replace_empty_strings_with_nulls.sql
- integration_tests/tests/dbt_project.py
…t freshness Nullable cast Co-Authored-By: unknown <>
…Nullable dimension cast Co-Authored-By: unknown <>
…ead of hardcoding 'default' Co-Authored-By: unknown <>
Co-Authored-By: unknown <>
```jinja
    {% do elementary.run_query(update_query) %}
  {% endif %}
{% endfor %}
{% if target.type == "clickhouse" %}
```
We normally solve this by using a macro with a "default__" and "clickhouse__" implementation, could we do the same here?
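For reference, the convention looks like this; the macro bodies are placeholders and the namespace argument is an assumption:

```jinja
{# Hypothetical layout following the default__/clickhouse__ dispatch convention. #}
{% macro replace_empty_strings_with_nulls(relation) %}
  {{ return(adapter.dispatch("replace_empty_strings_with_nulls", "elementary")(relation)) }}
{% endmacro %}

{% macro default__replace_empty_strings_with_nulls(relation) %}
  {# cross-database implementation #}
{% endmacro %}

{% macro clickhouse__replace_empty_strings_with_nulls(relation) %}
  {# ClickHouse-specific implementation #}
{% endmacro %}
```

dbt resolves the `<adapter>__` prefix at runtime, so callers never branch on `target.type` themselves.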
```diff
@@ -0,0 +1,8 @@
+<clickhouse>
+  <profiles>
+    <default>
```
just curious, could you elaborate on these flags and why they are needed?
I'm wondering if for users that have Clickhouse on the cloud it will just work or they also have to modify settings.
…g NULL macro Co-Authored-By: unknown <>
```xml
<clickhouse>
  <profiles>
    <default>
      <join_use_nulls>1</join_use_nulls>
```
These flags are only being set in the integration-tests ClickHouse Docker container to make behavior deterministic:
- `join_use_nulls=1`: needed for correct anti-join semantics (`LEFT JOIN ... WHERE right_side_key IS NULL`) since ClickHouse otherwise returns type defaults instead of NULLs for unmatched rows.
- `mutations_sync=1`: makes `ALTER TABLE .. UPDATE/DELETE` synchronous (mostly for deterministic tests / seed-fixing flows that update then immediately read).
For managed/Cloud ClickHouse, users typically don’t need server config changes if they can set these per-session/per-query (e.g. dbt-clickhouse custom_settings for join_use_nulls, and SETTINGS mutations_sync=1 where relevant).
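A hypothetical profiles.yml target illustrating the per-session route; the `custom_settings` key follows dbt-clickhouse's profile options, but verify the exact keys against your adapter version:

```yaml
# Illustrative ClickHouse Cloud target; host and schema values are made up.
elementary:
  target: clickhouse
  outputs:
    clickhouse:
      type: clickhouse
      host: my-instance.clickhouse.cloud
      port: 8443
      secure: true
      user: default
      password: "{{ env_var('CLICKHOUSE_PASSWORD') }}"
      schema: analytics
      custom_settings:
        join_use_nulls: 1
        mutations_sync: 1
```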
…ngs_with_nulls The macro is only called for BigQuery fusion seeds. ClickHouse seed NULL repair is handled by fix_clickhouse_seed_nulls() in clickhouse_utils.py via the HTTP API (covers all column types, not just strings). Co-Authored-By: unknown <>
Co-Authored-By: unknown <>
Co-Authored-By: unknown <>
```jinja
{%- else -%}
  {{ return('') }}
{%- endif -%}
trim(BOTH '"' FROM splitByChar('.', full_table_name)[{{ part_index }}]) AS {{ part_name }}
```
This doesn't break anything? I wonder why before there was a difference between database and schema in this code.
I guess it was a bug?
Yes, the old code was a bug. The original clickhouse__full_name_split assumed 3-part names (database.schema.table at indices 1, 2, 3), but the corresponding full_table_name macro only produced 2-part names for all adapters (there was no ClickHouse dispatch for full_table_name before this PR).
So the old code was splitting a 2-part string (schema.table) with 3-part indices — database_name would get index 1 (= schema), schema_name would get index 2 (= table), and table_name would get index 3 (= empty string). This was silently wrong but never surfaced because ClickHouse tests were all skipped.
The fix makes it consistent: full_table_name now has a ClickHouse dispatch that produces 2-part names (schema.table), and full_name_split splits accordingly — both database_name and schema_name map to index 1 (since database=schema in ClickHouse), and table_name maps to index 2. This is validated by CI (all 186 ClickHouse tests pass).
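A Python analogue of the corrected 2-part mapping may make the index change concrete; `full_name_split_2part` is a hypothetical illustration, since the real logic lives in the Jinja/SQL macro:

```python
def full_name_split_2part(full_table_name: str) -> dict:
    # ClickHouse full names are schema.table; database == schema, so both
    # database_name and schema_name come from part 1 and table_name from
    # part 2 (matching splitByChar's 1-based indices in the macro).
    parts = [p.strip('"') for p in full_table_name.split(".")]
    schema, table = parts[0], parts[1]
    return {"database_name": schema, "schema_name": schema, "table_name": table}
```

Under the old 3-part indexing, `table_name` would have read a non-existent part 3 and come back empty.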
feat(CORE-397): add full ClickHouse support
Summary
Adds full ClickHouse support by removing all `skip_targets(["clickhouse"])` markers from 20 integration test files and implementing the underlying macro/SQL/infrastructure fixes to make the full test suite pass on ClickHouse (186 passed, 7 skipped locally with `-n8`).

Updates since last revision
Round 3 — Merged master (PRs #936, #937, #938) and verified compatibility:
- `dbt_project.py` (both the ClickHouse `fix_clickhouse_seed_nulls` import and the new `AdapterQueryRunner`/`UnsupportedJinjaError` imports from PR #936, "fix: use direct adapter connection for run_query (bypass log parsing)", are now present).
- The `profiles.yml.j2` template already includes ClickHouse configuration (host, port, user, password, schema, threads) and ClickHouse is in the `targets` loop for the `elementary` profile alias. No additional changes needed.
- Failures now surface as `RuntimeError` messages instead of being silently swallowed. This benefits ClickHouse debugging as well.

Round 2 — Addressed overlap question between seed NULL repair mechanisms:
- Removed the `clickhouse__` dispatch variant from `replace_empty_strings_with_nulls.sql`. Investigation showed this dispatch was never called — the macro is only invoked for BigQuery fusion seeds (via `_fix_seed_if_needed`). ClickHouse seed NULL repair is handled entirely by `fix_clickhouse_seed_nulls()` in `clickhouse_utils.py` via the HTTP API. Simplified the macro back to a single implementation with a clarifying comment explaining the scope.

Round 1 — Addressed initial PR review comments:
- `clickhouse_utils.py`: Moved all ClickHouse seed NULL-repair logic (HTTP API queries, table rebuild via `EXCHANGE TABLES`, schema resolution from `profiles.yml`) out of `dbt_project.py` into a dedicated `integration_tests/tests/clickhouse_utils.py` module. `DbtProject._fix_seed_if_needed()` now delegates to `fix_clickhouse_seed_nulls(...)`.
- Renamed `ch_query` → `clickhouse_query_with_api()` for clarity. Credentials are URL-encoded via `urllib.parse.urlencode`.
- `replace_empty_strings_with_nulls.sql`: Added `adapter.quote(col["name"])` for robustness against reserved-word column names.

Changes
Test enablement:
- Removed `skip_targets(["clickhouse"])` markers and associated comments from 20 test files.
- For multi-target skip lists (e.g. `test_schema_changes.py`), only `"clickhouse"` was removed; other targets were preserved.
- Removed `import pytest` from files where it was only needed for the skip decorator.
users.xml):join_use_nulls=1— ClickHouse LEFT JOINs produce default values (not NULL) for unmatched rows by default, breaking anti-join patterns used in anomaly detection.mutations_sync=1— makes ALTER TABLE UPDATE mutations synchronous.custom_settingsor per-statement (SETTINGS mutations_sync=1).Seed data NULL repair (
clickhouse_utils.py):fix_clickhouse_seed_nulls()which rebuilds seed tables via the ClickHouse HTTP API usingnullIf()to restore NULLs.~/.dbt/profiles.ymlviaget_clickhouse_schema()(not hardcoded).Macro fixes (adapter.dispatch for ClickHouse):
full_names.sql: ClickHouse uses database=schema, so all name macros (full_table_name,full_schema_name,full_column_name,relation_to_full_name,model_node_to_full_name,configured_schemas_from_graph_as_tuple,full_name_split) now use 2-part names (schema.table) instead of 3-part.buckets_cte.sql: Fixedcomplete_buckets_cteto respecttime_bucket.count(was generating 1-period buckets instead of e.g. 7-hour buckets).column_numeric_monitors.sql: Cast columns toNullable(Float)beforestddevPop/varSampto handle NULL inputs.day_of_week.sql/hour_of_week.sql: ClickHouse doesn't supportDATE_FORMAT(date, 'EEEE'); useformatDateTime(date, '%W').table_monitoring_query.sql: ClickHouse-specificevent_freshness_metric_querythat explicitly checks for NULL before computing timediff (ClickHouse'sedr_datediffcoalesces NULL→0, masking the fallback logic).list_concat_with_separator.sql: ClickHouse-specific version that checksitem is nullbefore casting to String (ClickHouse's non-Nullable String rejectsCAST(NULL, 'String')).replace_empty_strings_with_nulls.sql: Simplified to single implementation (no dispatch). Only used for BigQuery fusion. Added clarifying comment explaining ClickHouse uses separatefix_clickhouse_seed_nulls()function.CI:
Review & Testing Checklist for Human
clickhouse_utils.pyHTTP API approach is safe — this bypasses dbt entirely to run DDL via ClickHouse's HTTP API. Check that theEXCHANGE TABLES+DROPpattern is safe (thefinallyblock should prevent orphaned temp tables, but verify). Also verify thatget_clickhouse_schema()profile parsing handles edge cases (e.g. missing keys, non-standard profile names).full_names.sql2-part naming is complete — ClickHouse now uses 2-part names (schema.table) instead of 3-part (database.schema.table) across all name macros. Verify no other code assumes 3-part names for ClickHouse (e.g. in Elementary's Python code or other macros).join_use_nulls=1requirement for managed ClickHouse — the Docker config sets this automatically, but users on ClickHouse Cloud/managed instances need to set it via dbt-clickhouse profilecustom_settings. Without it, LEFT JOIN anti-join patterns fail silently (produce "pass" instead of "fail" for anomaly tests). Consider adding this to Elementary's ClickHouse setup docs.test_anomalous_column_anomalies,test_event_freshness_anomalies) to verify thejoin_use_nullsand NULL handling fixes work end-to-end. Also test seed NULL repair by creating a seed with NULL values and verifying they're preserved afterdbt seed.Notes
join_use_nulls=1server config is critical — without it, LEFT JOIN anti-join patterns fail (produce "pass" instead of "fail" for anomaly tests).join_use_nulls=1via dbt-clickhouse profilecustom_settingsrather than server config.latest_official/databricks_catalog) is the known intermittentSCHEMA_NOT_FOUNDissue (not related to this PR). All ClickHouse tests pass.Summary by CodeRabbit
Tests
New Features
Bug Fixes
Chores