feat: support nested STRUCT and ARRAY data display in anywidget mode #2359
f583833 to
60785f3
Compare
bigframes/display/_flatten.py
Outdated
def flatten_nested_data(
    dataframe: pd.DataFrame,
) -> tuple[pd.DataFrame, dict[str, list[int]], list[str], set[str]]:
Tuple is hard to understand. Can we use a frozen dataclass, instead?
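A frozen dataclass gives each position of that four-part return value a name and prevents accidental mutation. The following is a minimal sketch, not the PR's final code: `dataframe`, `continuation_rows`, and `cleared_on_continuation` are names that appear later in this review, while `row_labels` is a hypothetical name for the `dict[str, list[int]]` element, whose meaning the diff excerpt does not show.

```python
import dataclasses
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    import pandas as pd  # only needed for the type annotation below


@dataclasses.dataclass(frozen=True)
class FlattenResult:
    """Named container replacing a positional 4-tuple (illustrative only)."""

    dataframe: "pd.DataFrame"
    # Hypothetical field name: the excerpt does not say what this mapping holds.
    row_labels: dict[str, list[int]]
    cleared_on_continuation: list[str]
    continuation_rows: Optional[set[int]]


result = FlattenResult(
    dataframe=None,  # stand-in; a real caller would pass a DataFrame
    row_labels={},
    cleared_on_continuation=["name"],
    continuation_rows={1, 2},
)
print(result.continuation_rows)
```

Callers then read `result.continuation_rows` instead of remembering which tuple position holds which value.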
2bb97d3 to
3944249
Compare
bigframes/display/_flatten.py
Outdated
)

new_cols_to_add[new_col_name] = pd.Series(
    new_list_array.to_pylist(),
to_pylist() can be quite expensive to call. If we already have a pyarrow array, I don't think it's necessary to convert it.
Done. I've removed the .to_pylist() calls and now pass the Arrow arrays directly to pandas for better performance.
bigframes/display/_flatten.py
Outdated
new_cols_to_add[new_col_name] = pd.Series(
    new_list_array.to_pylist(),
    dtype=pd.ArrowDtype(pa.list_(field.type)),
I'm confused. Why are we creating a list type here? Could you explain in comments what the purpose is? I thought we were flattening based on the function name.
Good point. I've added a comment to clarify that the function is transforming an array<struct<...>> into separate array columns.
bigframes/display/_flatten.py
Outdated
for orig_idx in dataframe.index:
    non_array_data = non_array_df.loc[orig_idx].to_dict()
    array_values = {}
    max_len_in_row = 0
    non_na_array_found = False

    for col_name in array_columns:
        val = dataframe.loc[orig_idx, col_name]
This is looping through each value in Python, which is going to be very slow. Please use native code such as https://arrow.apache.org/docs/python/generated/pyarrow.compute.list_flatten.html to avoid such loops.
Thanks for the suggestion. I've refactored the array explosion logic to use a much faster vectorized approach with pandas.explode and merge, which removes the Python loops entirely.
bigframes/display/_flatten.py
Outdated
    continue

# Create one row per array element, up to max_len_in_row
for array_idx in range(max_len_in_row):
This is looping through each element of each array in Python, which is going to be even slower.
I have completely refactored _explode_array_columns to use a vectorized approach with pandas.explode and merge. This eliminated all Python loops, including the slow inner loop you pointed out, significantly improving performance.
- Replaced Python-based row explosion with optimized PyArrow computation for nested arrays.
- Cleaned up comments in to strictly adhere to Google Python Style Guide (focused on 'why', removed redundant 'what').
- Renamed variable to for clarity.
- Verified changes with Python unit tests and JavaScript frontend tests.
bigframes/display/_flatten.py
Outdated
    return "struct"
if pa.types.is_list(pa_type):
    return (
        "array_of_struct"
        if pa.types.is_struct(pa_type.value_type)
        else "array"
    )
return "clear"
These magic strings worry me. Could you create an enum for category, instead?
Done. I've replaced the strings with a private _ColumnCategory Enum.
bigframes/display/_flatten.py
Outdated
continuation_rows: A set of row indices that are continuation rows.
cleared_on_continuation: A list of column names that should be cleared on continuation rows.
It's not 100% clear to me what is meant by "continuation". I assume that it means rows post-flattening that correspond to the second element of an array and beyond? Please expand these docstrings further.
You are right. I've updated the docstrings in FlattenResult to explicitly clarify that "continuation rows" refer to the 2nd element onwards of an exploded array, and "cleared" columns are those (typically scalars) that are replicated but shouldn't be visually repeated.
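The two terms can be illustrated with plain Python. In this sketch, a continuation row is any flattened row that came from the same source row as the row before it, and cleared columns are blanked on those rows for display. The helper names and the list-of-dicts row representation are hypothetical, chosen only to keep the example self-contained.

```python
def find_continuation_rows(parent_rows: list[int]) -> set[int]:
    """Return positions whose source row matches the previous position's."""
    return {
        pos
        for pos in range(1, len(parent_rows))
        if parent_rows[pos] == parent_rows[pos - 1]
    }


def blank_cleared_columns(
    rows: list[dict],
    continuation_rows: set[int],
    cleared_on_continuation: list[str],
) -> list[dict]:
    """Blank the replicated scalar columns on continuation rows."""
    displayed = []
    for pos, row in enumerate(rows):
        shown = dict(row)
        if pos in continuation_rows:
            for name in cleared_on_continuation:
                shown[name] = ""
        displayed.append(shown)
    return displayed


# Source row 0 had a two-element array; source row 1 had one element.
parent_rows = [0, 0, 1]
rows = [
    {"id": 10, "tag": "a"},
    {"id": 10, "tag": "b"},
    {"id": 20, "tag": "c"},
]
continuation = find_continuation_rows(parent_rows)
display_rows = blank_cleared_columns(rows, continuation, ["id"])
print(continuation)
print(display_rows)
```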
bigframes/display/_flatten.py
Outdated
"""The result of flattening a DataFrame.

Attributes:
    dataframe: The flattened DataFrame.
Please add some comments about what happens to the original index columns. Based on the description of the other fields, I assume that a unique index is created post-flatten?
I've updated the docstrings and the implementation. The original index (including named Index and MultiIndex) is preserved and duplicated across the exploded rows. This serves as the visual grouping key for the table display.
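The behaviour described here, where index labels repeat across exploded rows, matches what `pandas.DataFrame.explode` does by default, so it can be demonstrated with pandas alone. This is only an analogy using invented data; it does not call the PR's flattening code.

```python
import pandas as pd

# Invented sample frame with a named MultiIndex and one array column.
df = pd.DataFrame(
    {"tags": [["a", "b"], ["c"]]},
    index=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["group", "n"]),
)

# explode() keeps the original index labels, repeating them per element.
exploded = df.explode("tags")

print(exploded.index.tolist())
print(list(exploded.index.names))
```

Because the labels are duplicated rather than renumbered, the index is no longer unique after flattening, which is what lets it act as a grouping key.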
bigframes/display/_flatten.py
Outdated
@dataclasses.dataclass(frozen=True)
class ColumnClassification:
Please put a leading _ in front of class names that aren't intended to be used outside of this module.
continuation_rows: set[int] | None,
clear_on_continuation: list[str],
Same here, add some more explanation to the docstrings. To keep it shorter, you could reference bigframes/display/_flatten.py so that folks can look there for the complete explanation.
Done. I updated the docstrings to reference bigframes.display._flatten.FlattenResult for the detailed definitions.
Neat feature!
Please create a test_flatten.py file with a few tests that check some of the flattening logic directly without the HTML rendering part. Specifically, let's focus on what happens to index/multiindex columns, as that's my main worry / question.
Done. I created tests/unit/display/test_flatten.py. I moved the logic-specific tests there and added dedicated test cases (test_flatten_preserves_original_index, test_flatten_preserves_multiindex) to verify that indices are correctly preserved and duplicated during the flattening process.
8eb7211 to
ca19957
Compare
Implements flattening and expansion for complex data types in the interactive display for anywidget mode.
Key Features:
verified at:
Fixes #<438181139> 🦕