
OH case-insensitive writes via OHSparkCatalog for spark.sql.casesensitive=true scenario #597

Open: pandaamit91 wants to merge 2 commits into linkedin:main from pandaamit91:ampanda/oh-case-insensitive-nested-writes



@pandaamit91
Contributor

Summary

Adds case-insensitive write support to the OpenHouse Spark catalog so that df.writeTo(...).append() and INSERT INTO ... SELECT succeed even when the DataFrame's column casing differs from the stored OH table's casing, regardless of the spark.sql.caseSensitive setting. Covers flat columns, partition columns, and arbitrarily nested struct fields.

Changes

  • Client-facing API Changes

  • Internal API Changes

  • Bug Fixes

  • New Features

  • Performance Improvements

  • Code Style

  • Refactoring

  • Documentation

  • Tests

    Two commits:

    1. OHSparkCatalog + base OHWriteSchemaNormalizationRule

      • Wraps every loaded SparkTable with TableCapability.ACCEPT_ANY_SCHEMA so Spark's ResolveOutputRelation skips OH writes at analysis time (otherwise it throws "Cannot find data for output column" when caseSensitive=true).
      • Adds a post-hoc resolution rule that aligns source columns to target casing case-insensitively, replicating what ResolveOutputRelation would have done (rename + Cast for type widening).
    2. Recursive name-based nested struct alignment

      • Replaces positional struct Cast with alignExpressionToTargetType that builds CreateNamedStruct, pulling each target struct field from the source by case-insensitive name. Prevents silent value misrouting when source nested fields are in a different order than target.
      • Handles arbitrary depth of nested structs; preserves null-struct semantics with If(IsNull(srcExpr), null, built).
      • Adds tests for partition columns (writeTo + SQL), reordered nested fields, and deep struct-in-struct case mismatches.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.
  1. :integrations:spark:spark-3.1:openhouse-spark-itest:catalogTest passes
  2. :integrations:spark:spark-3.1:openhouse-spark-itest:test (chains catalogTest → statementTest → test) passes
  3. :integrations:spark:spark-3.5:openhouse-spark-3.5-itest:test passes (spark-3.5 itests inherit the spark-3.1 test source dir, so the new tests run there too)
  4. :integrations:spark:spark-3.{1,5}:openhouse-spark-runtime:test passes (vacuously; no unit tests in those modules, but compile + assembly verified)
  5. Backward-compat: the 3 pre-existing case-mismatch tests still pass with the new recursive helper in place

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

pandaamit91 and others added 2 commits May 22, 2026 02:18
…hemaNormalizationRule

Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names
with different casing than what the OH table stores (e.g. "id" vs "ID").
With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such
writes with "Cannot find data for output column" before the OH server is reached.

Fix (two-part):

1. OHSparkCatalog extends SparkCatalog and annotates every loaded OH table
   with TableCapability.ACCEPT_ANY_SCHEMA. This causes
   DataSourceV2Relation.skipSchemaResolution to return true, making
   V2WriteCommand.outputResolved true and causing ResolveOutputRelation to
   skip schema validation for OH write commands.

2. OHWriteSchemaNormalizationRule (injectPostHocResolutionRule) runs after
   all standard resolution rules. For each resolved V2WriteCommand targeting
   an OH relation, it inserts a Project node that renames source columns to
   match the stored column casing (matched by field ID). This ensures Iceberg
   sees the correct stored casing without mutating spark.sql.caseSensitive.
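The rename step in part 2 can be illustrated with a small, Spark-free sketch. It is only a model of the idea, not the rule's code: `plan_renames` is a hypothetical helper, columns are plain strings, and matching is done by lower-cased name (as the PR summary describes) rather than by field ID. Each returned pair corresponds to one aliased column in the inserted Project node.

```python
from typing import Dict, List, Tuple


def plan_renames(source_cols: List[str], target_cols: List[str]) -> List[Tuple[str, str]]:
    """Return (source_name, stored_name) pairs in target column order.

    Hypothetical helper: models a Project of `source AS stored` aliases.
    Assumes the target has no case-duplicate columns.
    """
    # Index the source columns by their lower-cased name.
    by_lower: Dict[str, str] = {name.lower(): name for name in source_cols}
    plan = []
    for stored in target_cols:
        source = by_lower.get(stored.lower())
        if source is None:
            raise ValueError(f"no source column for target column {stored!r}")
        plan.append((source, stored))
    return plan


# The writer submits "id" and "Name"; the table stores "ID" and "name".
renames = plan_renames(["id", "Name"], ["ID", "name"])
```

Here `renames` pairs each submitted column with the stored casing, so the write that reaches Iceberg carries the table's own column names.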

Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded
from normalization — the target is ambiguous and writes must use exact casing.
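A minimal sketch of how such an exclusion check could look, assuming column names are compared after lower-casing. `has_case_duplicates` is a hypothetical name used for illustration; the PR does not show how the rule detects this case.

```python
from collections import Counter
from typing import List


def has_case_duplicates(column_names: List[str]) -> bool:
    """True if two or more columns differ only by letter casing."""
    counts = Counter(name.lower() for name in column_names)
    return any(count > 1 for count in counts.values())


# A table storing both "id" and "ID" is ambiguous and would be skipped.
ambiguous = has_case_duplicates(["id", "ID", "name"])
unambiguous = has_case_duplicates(["id", "name"])
```

When the check returns true, a case-insensitive lookup has more than one candidate, which is why such tables fall back to exact casing.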

TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog
instead of the bare SparkCatalog so all integration test sessions pick up
the ACCEPT_ANY_SCHEMA capability.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… rule

OHWriteSchemaNormalizationRule previously delegated nested-struct type
mismatches to Spark's struct Cast, which is positional. When a source
struct had fields in a different order than the target, values were
silently misrouted (e.g. source <lastName, firstName> -> target
<firstname, lastname> put lastName's value in firstname).
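The misrouting can be reproduced in a few lines without Spark. The sketch below models a struct as an ordered list of (field name, value) pairs; `positional_cast` and `name_based_align` are hypothetical stand-ins for the two strategies, and the sample values are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

Struct = List[Tuple[str, Any]]


def positional_cast(source: Struct, target_fields: List[str]) -> Dict[str, Any]:
    """Assign source values to target fields by position, ignoring names."""
    return {target: value for target, (_, value) in zip(target_fields, source)}


def name_based_align(source: Struct, target_fields: List[str]) -> Dict[str, Any]:
    """Assign source values to target fields by case-insensitive name."""
    by_lower = {name.lower(): value for name, value in source}
    return {target: by_lower[target.lower()] for target in target_fields}


source = [("lastName", "Doe"), ("firstName", "Jane")]
target = ["firstname", "lastname"]

wrong = positional_cast(source, target)   # {'firstname': 'Doe', 'lastname': 'Jane'}
right = name_based_align(source, target)  # {'firstname': 'Jane', 'lastname': 'Doe'}
```

The positional result raises no error, which is what makes the problem silent: the row is written with the two names swapped.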

This commit replaces the Cast with a recursive alignExpressionToTargetType
helper that builds CreateNamedStruct, pulling each target struct field
from the source by case-insensitive name. Null-struct semantics are
preserved with If(IsNull(srcExpr), null, built).
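The recursion can be sketched on plain Python values. This is an analogy under stated assumptions, not the helper itself: structs are dicts, a target type is either a primitive type name (a string) or a nested dict of field types, and `align_to_target` is a hypothetical name. The real helper builds Catalyst expressions rather than transforming values, and leaf type widening (the Cast) is left out here.

```python
from typing import Any


def align_to_target(value: Any, target_type: Any) -> Any:
    """Rebuild `value` with the target's field names and field order.

    Assumptions: a struct type is a dict of field name -> type, and any
    other target type is a primitive leaf that passes through unchanged.
    """
    if not isinstance(target_type, dict):
        return value  # primitive leaf; the real rule would apply a Cast here
    if value is None:
        return None  # a null struct stays null, mirroring If(IsNull(src), null, built)
    by_lower = {name.lower(): field for name, field in value.items()}
    aligned = {}
    for name, field_type in target_type.items():
        key = name.lower()
        if key not in by_lower:
            raise KeyError(f"source struct has no field matching {name!r}")
        # Recurse so nested structs are also matched by name, at any depth.
        aligned[name] = align_to_target(by_lower[key], field_type)
    return aligned


target_type = {
    "id": "long",
    "person": {"name": {"firstname": "string", "lastname": "string"}, "age": "int"},
}
source_row = {"ID": 1, "Person": {"AGE": 30, "Name": {"lastName": "Doe", "firstName": "Jane"}}}

aligned_row = align_to_target(source_row, target_type)
```

The null check matters because rebuilding a struct field by field from a null source would otherwise yield a non-null struct whose fields are all null, which is a different value from a null struct.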

Adds 4 new itests:
- testWritePartitionColumnCaseMismatch_writeToAppend_succeeds
- testWritePartitionColumnCaseMismatch_sqlInsert_succeeds
- testWriteNestedStructReorderedFields_succeeds (the critical one;
  fails with positional Cast, passes with name-based matching)
- testWriteDeepNestedStructCaseMismatch_succeeds

Also fixes a pre-existing compile error: getOpenHouseCatalog was
referenced in CatalogOperationTest but never defined in that class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pandaamit91 changed the title from "Ampanda/oh case insensitive nested writes" to "OH case-insensitive writes via OHSparkCatalog for spark.sql.casesensitive=true scenario" on May 22, 2026