OH case-insensitive writes via OHSparkCatalog for spark.sql.casesensitive=true scenario by pandaamit91 · Pull Request #597 · linkedin/openhouse

pandaamit91 · 2026-05-22T18:50:11Z

Summary

Adds case-insensitive write support for OpenHouse Spark catalog so that df.writeTo(...).append() and INSERT INTO ... SELECT succeed even when the DataFrame's column casing differs from the stored OH table's casing, regardless of spark.sql.caseSensitive. Covers flat columns, partition columns, and arbitrarily-nested struct fields.

Changes

Client-facing API Changes
Internal API Changes
Bug Fixes
New Features
Performance Improvements
Code Style
Refactoring
Documentation
Tests

Two commits:
1. OHSparkCatalog + base OHWriteSchemaNormalizationRule
  - Wraps every loaded SparkTable with TableCapability.ACCEPT_ANY_SCHEMA so Spark's ResolveOutputRelation skips OH writes at analysis time (otherwise it throws "Cannot find data for output column" when caseSensitive=true).
  - Adds a post-hoc resolution rule that aligns source columns to target casing case-insensitively, replicating what ResolveOutputRelation would have done (rename + Cast for type widening).
2. Recursive name-based nested struct alignment
  - Replaces positional struct Cast with alignExpressionToTargetType that builds CreateNamedStruct, pulling each target struct field from the source by case-insensitive name. Prevents silent value misrouting when source nested fields are in a different order than target.
  - Handles arbitrary depth of nested structs; preserves null-struct semantics with If(IsNull(srcExpr), null, built).
  - Adds tests for partition columns (writeTo + SQL), reordered nested fields, and deep struct-in-struct case mismatches.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

:integrations:spark:spark-3.1:openhouse-spark-itest:catalogTest passes
:integrations:spark:spark-3.1:openhouse-spark-itest:test (chains catalogTest → statementTest → test) passes
:integrations:spark:spark-3.5:openhouse-spark-3.5-itest:test passes (spark-3.5 itests inherit the spark-3.1 test source dir, so the new tests run there too)
:integrations:spark:spark-3.{1,5}:openhouse-spark-runtime:test passes (vacuously; no unit tests in those modules, but compile + assembly verified)
Backward-compat: the 3 pre-existing case-mismatch tests still pass with the new recursive helper in place

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

…hemaNormalizationRule

Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names with different casing than what the OH table stores (e.g. "id" vs "ID"). With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such writes with "Cannot find data for output column" before the OH server is reached.

Fix (two-part):
1. OHSparkCatalog extends SparkCatalog and annotates every loaded OH table with TableCapability.ACCEPT_ANY_SCHEMA. This causes DataSourceV2Relation.skipSchemaResolution to return true, making V2WriteCommand.outputResolved true and causing ResolveOutputRelation to skip schema validation for OH write commands.
2. OHWriteSchemaNormalizationRule (injectPostHocResolutionRule) runs after all standard resolution rules. For each resolved V2WriteCommand targeting an OH relation, it inserts a Project node that renames source columns to match the stored column casing (matched by field ID). This ensures Iceberg sees the correct stored casing without mutating spark.sql.caseSensitive.

Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded from normalization: the target is ambiguous and writes must use exact casing.

TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog instead of the bare SparkCatalog so all integration test sessions pick up the ACCEPT_ANY_SCHEMA capability.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
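The renaming step and the case-duplicate exclusion described in this commit message can be illustrated with a small, Spark-free model. This is only a sketch under assumptions: `plan_renames` is a hypothetical name, columns are plain strings, and matching is done on lowercased names for simplicity (the commit message mentions field IDs, which are not modeled here). It is not the code in OHWriteSchemaNormalizationRule.

```python
from typing import Dict, List, Optional


def plan_renames(source_cols: List[str], target_cols: List[str]) -> Optional[Dict[str, str]]:
    """Map each source column name to the casing stored in the target table.

    Returns None when the target has case-duplicate columns (e.g. "id" and "ID"),
    modeling the exclusion above: normalization is skipped and the writer must
    supply exact casing. (Hypothetical helper, not part of OpenHouse.)
    """
    stored_by_lower: Dict[str, str] = {}
    for name in target_cols:
        key = name.lower()
        if key in stored_by_lower:
            return None  # ambiguous target: two columns differ only by case
        stored_by_lower[key] = name
    # Assumption: a source column with no case-insensitive match keeps its own name.
    return {src: stored_by_lower.get(src.lower(), src) for src in source_cols}


renames = plan_renames(["id", "Name"], ["ID", "name"])
skipped = plan_renames(["id"], ["id", "ID"])
```

In this model `renames` maps `"id"` to `"ID"` and `"Name"` to `"name"`, which is the information a Project node of aliases would carry, while `skipped` is `None` because the target is ambiguous.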

… rule

OHWriteSchemaNormalizationRule previously delegated nested-struct type mismatches to Spark's struct Cast, which is positional. When a source struct had fields in a different order than the target, values were silently misrouted (e.g. source <lastName, firstName> -> target <firstname, lastname> put lastName's value in firstname).

This commit replaces the Cast with a recursive alignExpressionToTargetType helper that builds CreateNamedStruct, pulling each target struct field from the source by case-insensitive name. Null-struct semantics are preserved with If(IsNull(srcExpr), null, built).

Adds 4 new itests:
- testWritePartitionColumnCaseMismatch_writeToAppend_succeeds
- testWritePartitionColumnCaseMismatch_sqlInsert_succeeds
- testWriteNestedStructReorderedFields_succeeds (the critical one; fails with positional Cast, passes with name-based matching)
- testWriteDeepNestedStructCaseMismatch_succeeds

Also fixes a pre-existing compile error: getOpenHouseCatalog was referenced in CatalogOperationTest but never defined in that class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
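The difference between positional and name-based struct alignment can be sketched on plain Python dictionaries. Everything here is an assumption made for illustration: `positional_cast` and `align_to_target` are hypothetical names, structs are modeled as dicts, a target type is either a type-name string or a dict of fields, and a source field missing from the input is filled with `None`. The real helper builds Catalyst expressions (CreateNamedStruct, If(IsNull(...))) rather than transforming values.

```python
from typing import Any, Dict, Union

# A target type is a primitive type name (str) or a struct (field name -> type).
TargetType = Union[str, Dict[str, Any]]


def positional_cast(value: Dict[str, Any], target: Dict[str, Any]) -> Dict[str, Any]:
    """Model of a positional struct cast: the i-th source value goes to the i-th target field."""
    return dict(zip(target.keys(), value.values()))


def align_to_target(value: Any, target: TargetType) -> Any:
    """Recursively rebuild `value` in the target's field order and casing, matching by name."""
    if not isinstance(target, dict):
        return value  # primitive leaf: passed through (type widening is not modeled)
    if value is None:
        return None  # a null struct stays null instead of becoming a struct of nulls
    src_by_lower = {name.lower(): v for name, v in value.items()}
    # Assumption: a target field absent from the source becomes None.
    return {
        name: align_to_target(src_by_lower.get(name.lower()), field_type)
        for name, field_type in target.items()
    }


target_name = {"firstname": "string", "lastname": "string"}
source_name = {"lastName": "Doe", "firstName": "Jane"}

misrouted = positional_cast(source_name, target_name)
aligned = align_to_target(source_name, target_name)
```

With the reordered source from the commit message, `misrouted` puts `"Doe"` into `firstname`, whereas `aligned` yields `firstname="Jane"` and `lastname="Doe"`. The same function handles struct-in-struct inputs because it recurses whenever the target field type is itself a struct.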

pandaamit91 and others added 2 commits May 22, 2026 02:18

pandaamit91 changed the title ~~Ampanda/oh case insensitive nested writes~~ OH case-insensitive writes via OHSparkCatalog for spark.sql.casesensitive=true scenario May 22, 2026


pandaamit91 wants to merge 2 commits into linkedin:main from pandaamit91:ampanda/oh-case-insensitive-nested-writes
