OH case-insensitive writes via OHSparkCatalog for spark.sql.caseSensitive=true scenario #597
Open
pandaamit91 wants to merge 2 commits into
…hemaNormalizationRule

Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names with different casing than what the OH table stores (e.g. "id" vs "ID"). With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such writes with "Cannot find data for output column" before the OH server is reached.

Fix (two-part):

1. OHSparkCatalog extends SparkCatalog and annotates every loaded OH table with TableCapability.ACCEPT_ANY_SCHEMA. This causes DataSourceV2Relation.skipSchemaResolution to return true, making V2WriteCommand.outputResolved true and causing ResolveOutputRelation to skip schema validation for OH write commands.

2. OHWriteSchemaNormalizationRule (injectPostHocResolutionRule) runs after all standard resolution rules. For each resolved V2WriteCommand targeting an OH relation, it inserts a Project node that renames source columns to match the stored column casing (matched by field ID). This ensures Iceberg sees the correct stored casing without mutating spark.sql.caseSensitive.

Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded from normalization: the target is ambiguous and writes must use exact casing.

TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog instead of the bare SparkCatalog so all integration test sessions pick up the ACCEPT_ANY_SCHEMA capability.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
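The renaming step described in this commit can be pictured with a small Spark-free sketch. This is only an illustrative model, not the rule's actual code: the helper names `has_case_duplicates` and `normalize_column_names` are hypothetical, and matching here is done by lowercased name rather than by field ID. It shows the two behaviors the commit message calls out: source names are rewritten to the stored casing, and tables with case-duplicate columns are left untouched.

```python
from typing import Dict, List


def has_case_duplicates(stored_columns: List[str]) -> bool:
    """True when two stored columns differ only by case (e.g. "id" and "ID")."""
    lowered = [name.lower() for name in stored_columns]
    return len(set(lowered)) != len(lowered)


def normalize_column_names(source_columns: List[str], stored_columns: List[str]) -> List[str]:
    """Return the source column names rewritten to the stored casing.

    Hypothetical helper: it models only the renaming decision, not the
    Project node that the real rule inserts into the Spark plan.
    """
    if has_case_duplicates(stored_columns):
        # Ambiguous target: skip normalization so the write must use exact casing.
        return list(source_columns)
    by_lower: Dict[str, str] = {name.lower(): name for name in stored_columns}
    # Assumption: names with no case-insensitive match pass through unchanged.
    return [by_lower.get(name.lower(), name) for name in source_columns]
```

For example, `normalize_column_names(["id", "Name"], ["ID", "name"])` yields the stored casing `["ID", "name"]`, while a stored schema containing both "id" and "ID" returns the source names as given.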
… rule

OHWriteSchemaNormalizationRule previously delegated nested-struct type mismatches to Spark's struct Cast, which is positional. When a source struct had fields in a different order than the target, values were silently misrouted (e.g. source <lastName, firstName> -> target <firstname, lastname> put lastName's value in firstname).

This commit replaces the Cast with a recursive alignExpressionToTargetType helper that builds CreateNamedStruct, pulling each target struct field from the source by case-insensitive name. Null-struct semantics are preserved with If(IsNull(srcExpr), null, built).

Adds 4 new itests:
- testWritePartitionColumnCaseMismatch_writeToAppend_succeeds
- testWritePartitionColumnCaseMismatch_sqlInsert_succeeds
- testWriteNestedStructReorderedFields_succeeds (the critical one; fails with positional Cast, passes with name-based matching)
- testWriteDeepNestedStructCaseMismatch_succeeds

Also fixes a pre-existing compile error: getOpenHouseCatalog was referenced in CatalogOperationTest but never defined in that class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
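To make the positional-versus-name-based difference concrete, here is a minimal Spark-free sketch. It is an assumption-laden model rather than the helper from this commit: structs are plain dicts, a target type is either a leaf type name or a dict of fields, and `positional_cast` and `align_to_target` are hypothetical names. The second function mirrors the described behavior of pulling each target field from the source by case-insensitive name and keeping a null struct null.

```python
from typing import Any, Dict, Union

# A target type is a leaf type name (str) or a struct: ordered field name -> type.
TargetType = Union[str, Dict[str, "TargetType"]]


def positional_cast(value: Dict[str, Any], target: Dict[str, Any]) -> Dict[str, Any]:
    """Model of a positional struct cast: the i-th source value lands in the i-th target field."""
    return dict(zip(target.keys(), value.values()))


def align_to_target(value: Any, target: TargetType) -> Any:
    """Recursively rebuild `value` in the target's field order and casing."""
    if not isinstance(target, dict):
        return value  # leaf: nothing to rename or reorder
    if value is None:
        return None  # a null struct stays null instead of becoming a struct of nulls
    by_lower = {name.lower(): field for name, field in value.items()}
    # Assumption: a target field missing from the source is filled with None.
    return {
        name: align_to_target(by_lower.get(name.lower()), field_type)
        for name, field_type in target.items()
    }


source = {"lastName": "Doe", "firstName": "Jane"}
target = {"firstname": "string", "lastname": "string"}
misrouted = positional_cast(source, target)
aligned = align_to_target(source, target)
```

With the reordered source above, `misrouted` puts "Doe" under firstname, while `aligned` routes "Jane" to firstname and "Doe" to lastname.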
Summary
Adds case-insensitive write support for OpenHouse Spark catalog so that df.writeTo(...).append() and INSERT INTO ... SELECT succeed even when the DataFrame's column casing differs from the stored OH table's casing, regardless of spark.sql.caseSensitive. Covers flat columns, partition columns, and arbitrarily-nested struct fields.

Changes
Client-facing API Changes
Internal API Changes
Bug Fixes
New Features
Performance Improvements
Code Style
Refactoring
Documentation
Tests
Two commits:
1. OHSparkCatalog + base OHWriteSchemaNormalizationRule
- OHSparkCatalog annotates every loaded SparkTable with TableCapability.ACCEPT_ANY_SCHEMA so Spark's ResolveOutputRelation skips OH writes at analysis time (otherwise it throws "Cannot find data for output column" when caseSensitive=true).
- OHWriteSchemaNormalizationRule then performs the alignment that ResolveOutputRelation would have done (rename + Cast for type widening).

2. Recursive name-based nested struct alignment
- Replaces the positional struct Cast with alignExpressionToTargetType that builds CreateNamedStruct, pulling each target struct field from the source by case-insensitive name. Prevents silent value misrouting when source nested fields are in a different order than target.
- Preserves null-struct semantics with If(IsNull(srcExpr), null, built).

Testing Done
- :integrations:spark:spark-3.1:openhouse-spark-itest:catalogTest passes
- :integrations:spark:spark-3.1:openhouse-spark-itest:test (chains catalogTest → statementTest → test) passes
- :integrations:spark:spark-3.5:openhouse-spark-3.5-itest:test passes (spark-3.5 itests inherit the spark-3.1 test source dir, so the new tests run there too)
- :integrations:spark:spark-3.{1,5}:openhouse-spark-runtime:test passes (vacuously; no unit tests in those modules, but compile + assembly verified)

Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.