Skip to content

Conversation

@becomeStar
Copy link
Contributor

Description

This PR implements the proactive connection logic in RingHashLoadBalancer as outlined in gRFC A61.

Previously, the Java implementation only initialized child balancers when a ring-chosen endpoint was in TRANSIENT_FAILURE during a picker's pickSubchannel call. This PR adds the missing logic: when a child balancer reports TRANSIENT_FAILURE, the LoadBalancer now proactively initializes the first available IDLE child if no other children are currently connecting or ready.

This ensures a backup subchannel starts warming up immediately outside the RPC flow, reducing failover latency and improving overall resilience. This behavior was previously present but was inadvertently lost after #10610.

Changes

  • Added maybeTriggerIdleChildConnection() to RingHashLoadBalancer to trigger proactive connections.
  • Updated existing unit tests to reflect the new connection timing (adjusting requestConnection verification counts).
  • Added new test cases to verify:
    • Proactive connection triggering when no other subchannels are active.
    • Suppression of proactive connections when a READY subchannel already exists.

Fixes #12024

Implement proactive connection logic in RingHashLoadBalancer as
outlined in gRFC A61. This address the missing logic where the
balancer should initialize the first IDLE child when a child
balancer reports TRANSIENT_FAILURE and no other children are
connecting.

This behavior, which was previously present before grpc#10610, ensures
that a backup subchannel starts connecting immediately outside of
the picker flow, reducing failover latency.

Fixes grpc#12024
Update existing unit tests and add new test cases to validate the
proactive connection behavior. Verification counts in several tests
have been adjusted to reflect that the balancer now initiates
connections immediately upon subchannel failure.
@becomeStar
Copy link
Contributor Author

@ejona86

Thank you for the detailed review.
I've addressed your feedback in the latest push.

  • ​Logic location & Quote: Moved the proactive connection condition to updateOverallBalancingState() to align with existing comments and local variables. I've also included the specific quote from gRFC A61 in the comments as suggested.
  • Method Refactoring: Renamed the method to triggerIdleChildConnection() and simplified it to focus solely on the triggering action.
  • ​Restored Tests: I've restored the deleted verify() lines in the test. I initially removed them out of concern that fixed indexing might make the test brittle if hash results changed. However, since the test remains deterministic and passes, I agree that keeping them provides stricter verification.

​Please let me know if there are any further adjustments needed.

@becomeStar becomeStar force-pushed the ring-hash-proactive-connection branch from ce9ea98 to f5a4934 Compare January 6, 2026 12:28
@becomeStar
Copy link
Contributor Author

I merged the latest master branch expecting the CI tests to pass, but unfortunately, they are still failing.
The logs (403 Forbidden) suggest this is likely an external infrastructure issue rather than a code-related one. I will try triggering the CI again tomorrow with an empty commit to see if the issue resolves itself.

@ejona86 ejona86 added the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label Jan 6, 2026
@grpc-kokoro grpc-kokoro removed the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label Jan 6, 2026
@ejona86 ejona86 merged commit 172f650 into grpc:master Jan 6, 2026
18 of 29 checks passed
@ejona86
Copy link
Member

ejona86 commented Jan 6, 2026

Thank you!

@becomeStar
Copy link
Contributor Author

Thank you for the detailed explanation regarding the determinism of the Ring Hash test. I now clearly understand how voidHash() ensures the test remains reliable despite the underlying complexity. I also appreciate you re-triggering the CI and merging the PR. Thanks for your support and for the great learning opportunity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

xds: ring_hash: Lazily initializing child pickfirst LBs is not happening when TF is reported at the point that no other children are connecting.

3 participants