-
Notifications
You must be signed in to change notification settings - Fork 141
SNOW-2408964: support custom_schema when reading xml #4033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| and format.lower() == "xml" | ||
| and XML_ROW_TAG_STRING not in self._cur_options | ||
| ): | ||
| raise ValueError("When read XML with user schema, rowtag must be set.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| raise ValueError("When read XML with user schema, rowtag must be set.") | |
| raise ValueError("When reading XML with user schema, rowtag must be set.") |
| return None | ||
|
|
||
|
|
||
| def generate_norm_column_name_to_ori_column_name_dict(result: dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this really need to be a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, I removed this function
| # cast to input custom schema type | ||
| if self._user_schema: | ||
| cols = [ | ||
| df[single_quote(field._name)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we using single_quote here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently all column name is produced with a single quote around,
I am not sure why we are having this behavior but I think we should not change now as it would be a BCR
single quote is added here so that we can correctly read the column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I think I remember Jianzhun mentioning we use PIVOT and it snowflake pivot adds single quote for string, and datetime columns. This could break since server side is planning to do a BCR to remove quotes.
| with pytest.raises( | ||
| ValueError, match="When read XML with user schema, rowtag must be set." | ||
| ): | ||
| session.read.schema(user_schema).xml(f"@{tmp_stage_name}/{test_file_books_xml}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test regex pattern mismatch with actual error message. The test expects "When read XML with user schema, rowtag must be set." but the actual error message in dataframe_reader.py:1392 is "When reading XML with user schema, rowtag must be set." (note "reading" vs "read"). This test will fail.
# Fix: Update the match pattern to match the actual error message
with pytest.raises(
ValueError, match="When reading XML with user schema, rowtag must be set."
):| with pytest.raises( | |
| ValueError, match="When read XML with user schema, rowtag must be set." | |
| ): | |
| session.read.schema(user_schema).xml(f"@{tmp_stage_name}/{test_file_books_xml}") | |
| with pytest.raises( | |
| ValueError, match="When reading XML with user schema, rowtag must be set." | |
| ): | |
| session.read.schema(user_schema).xml(f"@{tmp_stage_name}/{test_file_books_xml}") | |
Spotted by Graphite Agent
Is this helpful? React 👍 or 👎 to let us know.
| ).option("mode", "failfast").xml(f"@{tmp_stage_name}/{test_file_books_xml}") | ||
|
|
||
|
|
||
| def test_read_xml_with_custom_schema(session): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have one simple case for nested data? books2.xml seems to be a nested one
sfc-gh-aling
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
can you also help address the test coverage before merging it?
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2408964
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Here is a doc to talk about the behavior of XML read with user schema in snowpark:
https://docs.google.com/document/d/1W3os2SGbO_M89TjIa7xVuNrQ0RDDuGMHEOBHFuw0OYw/edit?tab=t.0