Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Dec 23, 2025

Rationale for this change

Two TODOs in test-dplyr-join.R (lines 191-192) requested test coverage for edge cases in join operations. These TODOs were added in commit c273ea7.

What changes are included in this PR?

Added one test case to r/tests/testthat/test-dplyr-join.R for duplicate columns, and added a comment about type casting because it seems it disallows the type cast. Here is example error message:

══ Failed tests ════════════════════════════════════════════════════════════════
── Error ('test-dplyr-join.R:238:3'): joins with type casting between int and float ──
Error in `compute.arrow_dplyr_query(x)`: Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(x) of type int32 and FieldRef.Name(x) of type double
Backtrace:
     ▆
  1. ├─arrow:::compare_dplyr_binding(...) at test-dplyr-join.R:238:3
  2. │ ├─testthat::expect_warning(...) at ./helper-expectation.R:95:3
  3. │ │ └─testthat:::expect_condition_matching_(...)
  4. │ │   └─testthat:::quasi_capture(...)
  5. │ │     ├─testthat (local) .capture(...)
  6. │ │     │ └─base::withCallingHandlers(...)
  7. │ │     └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
  8. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(.input = arrow_table(tbl))))
  9. ├─dplyr::collect(left_join(.input, right_float, by = "x"))
 10. └─arrow:::collect.arrow_dplyr_query(...)
 11.   └─arrow:::compute.arrow_dplyr_query(x)
 12.     └─base::tryCatch(...)
 13.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 14.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 15.           └─value[[3L]](cond)
 16.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
 17.               └─rlang::abort(msg, call = call)

Therefore this PR removes that todo.

Are these changes tested?

Yes, tests were added.

Are there any user-facing changes?

No, test-only.

@github-actions
Copy link

⚠️ GitHub issue #48629 has been automatically assigned in GitHub to PR creator.

@HyukjinKwon HyukjinKwon marked this pull request as draft December 23, 2025 06:08
@HyukjinKwon HyukjinKwon changed the title GH-48629: [R] Add test coverage for joins with duplicate columns and type casting GH-48629: [R] Add test coverage for joins with duplicate columns with a comment Dec 23, 2025
@HyukjinKwon HyukjinKwon changed the title GH-48629: [R] Add test coverage for joins with duplicate columns with a comment GH-48629: [R] Add test coverage for joins with duplicate columns and type casting Dec 23, 2025
@HyukjinKwon HyukjinKwon changed the title GH-48629: [R] Add test coverage for joins with duplicate columns and type casting GH-48629: [R] Add test coverage for joins with duplicate columns with a comment Dec 23, 2025
@HyukjinKwon HyukjinKwon changed the title GH-48629: [R] Add test coverage for joins with duplicate columns with a comment GH-48629: [R] Add test coverage for joins with duplicate columns with removing a todo Dec 23, 2025
@HyukjinKwon HyukjinKwon marked this pull request as ready for review December 23, 2025 07:06
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I say this in more detail in the comment below, but we do have tests for some aspects of joins with duplicate columns, though I will admit they aren't named very well and follow a slightly non-standard parttern. Mind using this PR to clean those up as we add slightly more coverage (e.g. duplicate columns with no (explicit) suffix additions)?

Also, in the description, you mention

added a comment about type casting because it seems it disallows the type cast. Here is example error message:

But I don't think I see that? Would you mind either leaving it or creating a test that asserts the failure with a note explaining what's going on?

Thanks!

Comment on lines +218 to +225
# Test with custom suffixes
compare_dplyr_binding(
.input |>
left_join(right_dup, by = "x", suffix = c("_left", "_right")) |>
collect(),
left_dup
)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test especially is effectively duplicated by

test_that("suffix", {
left_suf <- Table$create(
key = c(1, 2),
left_unique = c(2.1, 3.1),
shared = c(10.1, 10.3)
)
right_suf <- Table$create(
key = c(1, 2, 3, 10, 20),
right_unique = c(1.1, 1.2, 3.1, 4.1, 4.3),
shared = c(20.1, 30, 40, 50, 60)
)
join_op <- inner_join(left_suf, right_suf, by = "key", suffix = c("_left", "_right"))
output <- collect(join_op)
res_col_names <- names(output)
expected_col_names <- c("key", "left_unique", "shared_left", "right_unique", "shared_right")
expect_equal(expected_col_names, res_col_names)
})

Though I will admit that the tests there do not follow the established pattern you follow here. Would you mind updating merging these two tests together? You can use your fixture for that purpose if you like, but if you do, would you mind adding a column with floats and also making the key column more obvious (e.g. by naming it key or to_join or something like that?

I'm happy to help describe in more detail if that's helpful!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Dec 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants