@vbabanin (Member) commented on Aug 21, 2025

This PR removes the ClusterFixture.getPrimaryRTT()-based adjustment in the CSOT Prose Tests, because in practice the adjustment collapses to 0ms (after min-subtraction) and introduces unnecessary order-dependent noise/flakiness.

Problem

We rely on ClusterFixture.getPrimaryRTT() to adjust for RTT fluctuations across environments.

ClusterFixture.getPrimaryRTT() reports a running average over the last 10 samples from its monitor thread. Test execution order is not deterministic, so the CSOT tests may run before the cluster has accumulated stable RTT samples. In that case, the early samples can be dominated by the first handshake, whose RTT is higher because more time is spent on the server side, and that skews the running average.
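To make the skew concrete, here is a minimal Java sketch of the behavior described above (not the driver's actual implementation): a running average over the last 10 samples where the first sample is a slow initial handshake. The outlier keeps the reported average in the hundreds of milliseconds until the window fills with near-zero samples.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch only: illustrates how a 10-sample running average is skewed
// by a single expensive initial handshake. Not the driver's code.
public final class RunningAverageSkewSketch {

    private static final int WINDOW = 10;
    private static final Deque<Long> samples = new ArrayDeque<>();

    static long addSampleAndAverage(long rttMillis) {
        samples.addLast(rttMillis);
        if (samples.size() > WINDOW) {
            samples.removeFirst();
        }
        return (long) samples.stream().mapToLong(Long::longValue).average().orElse(0);
    }

    public static void main(String[] args) {
        // First sample: initial handshake outlier (~1020ms was observed in the histograms below).
        System.out.println(addSampleAndAverage(1020)); // 1020
        // Subsequent samples are near zero, but the early outlier still dominates the window.
        for (int i = 0; i < 4; i++) {
            System.out.println(addSampleAndAverage(0)); // 510, 340, 255, 204
        }
    }
}
```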

Observed RTT distributions (examples):

  1. Early / low-sample distribution can contain large outliers
```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 520
    height: 320
---
xychart-beta
  title "ClusterFixture.getPrimaryRTT() - Total samples: 15"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms","1-1ms","8-8ms","17-17ms","33-33ms","41-41ms","51-51ms"]
  y-axis "samples" 0 --> 7
  bar [7,2,2,1,1,1,1]
```
  2. Min RTT used internally (TimeoutContext) is consistently 0ms (see the sketch after this list)
```mermaid
---
config:
  xyChart:
    width: 520
    height: 320
---
xychart-beta
  title "Getting min RTT in TimeoutContext - Total samples: 209"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms"]
  y-axis "samples" 0 --> 209
  bar [209]
```
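For context on item 2, the sketch below shows, under the CSOT rule that the server-side time allowance is derived from the remaining client-side timeout minus the minimum observed RTT, why a consistently 0ms min RTT makes that subtraction a no-op. The method name and values are illustrative, not the driver's TimeoutContext API.

```java
// Illustrative only: rough shape of the min-RTT subtraction described above.
final class MinRttSubtractionSketch {

    // Derive the server-side time allowance from the remaining client-side
    // timeout and the minimum observed RTT (names and signature are hypothetical).
    static long computeMaxTimeMS(long remainingTimeoutMS, long minRttMS) {
        return Math.max(0, remainingTimeoutMS - minRttMS);
    }

    public static void main(String[] args) {
        // With min RTT consistently 0ms (as in the histogram above),
        // the subtraction changes nothing: 150 - 0 = 150.
        System.out.println(computeMaxTimeMS(150, 0));
    }
}
```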

Root cause 1: early connection-opening outliers dominate the running average

A per-source histogram shows that “Opening connection” samples can include extreme outliers (up to ~1020ms), while subsequent samples immediately return to near zero. Because getPrimaryRTT() is a running average over 10 samples, these early outliers can dominate the reported RTT when CSOT runs early.

```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 980
    height: 560
---
xychart-beta
  title "Histogram by source (bucket size: 10ms)"
  x-axis ["0-9ms","10-19ms","20-29ms","30-39ms","40-49ms","50-59ms","60-69ms","70-79ms","80-89ms","90-99ms","100-109ms","110-119ms","120-129ms","140-149ms","150-159ms","190-199ms","500-509ms","1010-1019ms","1020-1029ms"]
  y-axis "samples" 0 --> 7751
  bar "Total" [7751,52,22,8,10,7,5,2,5,1,4,2,3,1,1,1,1,1,1]
  bar "O=Opening connection" [4324,13,7,1,3,1,1,0,2,1,2,1,2,1,1,1,1,1,1]
  bar "H=Heartbeat" [2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
  bar "R=RTT monitor" [3425,39,15,7,7,6,4,2,3,0,2,1,1,0,0,0,0,0,0]
```

Root cause 2: an unrelated test polluted the shared cluster RTT monitor and cascaded into later tests

A separate source of skew was identified: shouldUseConnectTimeoutMsWhenEstablishingConnectionInBackground blocked "hello" / "isMaster" for ~500ms without using a uniquely scoped failpoint. This affected the shared ClusterFixture monitor, causing elevated RTT readings that cascaded into subsequent tests and increased flakiness.

This issue is independent of test ordering: even if CSOT runs later, a shared-monitor disruption can inflate RTT and affect any tests that rely on it.
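For illustration, a uniquely scoped failpoint of the kind that avoids this pollution can be set via the failCommand failpoint's appName filter, so only connections from the test's own MongoClient (which sets that applicationName) are affected while the shared monitor is not. The connection string, applicationName value, and mode below are hypothetical; this is a sketch, not the test's actual code.

```java
import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

// Sketch only: scope a failCommand failpoint to a single test via "appName"
// so the shared ClusterFixture monitor connections are not blocked.
public final class ScopedFailPointSketch {

    public static void main(String[] args) {
        try (MongoClient adminClient = MongoClients.create("mongodb://localhost:27017")) {
            Document failPoint = new Document("configureFailPoint", "failCommand")
                    .append("mode", new Document("times", 1))
                    .append("data", new Document("failCommands", Arrays.asList("hello", "isMaster"))
                            .append("blockConnection", true)
                            .append("blockTimeMS", 500)
                            // Only connections whose client set this applicationName are affected;
                            // the value is hypothetical.
                            .append("appName", "connect-timeout-background-test"));
            adminClient.getDatabase("admin").runCommand(failPoint);
        }
    }
}
```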

Post-fix observation: after addressing the failpoint/blocking behavior, ClusterFixture.getPrimaryRTT() distributions collapsed to all zeros in the observed runs:

```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 520
    height: 260
---
xychart-beta
  title "ClusterFixture.getPrimaryRTT() - Total samples: 40"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms"]
  y-axis "samples" 0 --> 40
  bar [40]
```
```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 520
    height: 260
---
xychart-beta
  title "ClusterFixture.getPrimaryRTT() - Total samples: 49"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms"]
  y-axis "samples" 0 --> 49
  bar [49]
```

Change in this PR

  • Remove the ClusterFixture.getPrimaryRTT()-based adjustment (see the illustrative sketch after this list).
  • Adjust timeout settings to reduce flakiness.
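For illustration only, the shape of the change in the timing assertions looks roughly like the sketch below; the bounds, method names, assertion style, and the assumed ClusterFixture package are hypothetical, not the actual test code.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

// Package of the driver test fixture is assumed for this sketch.
import com.mongodb.ClusterFixture;

class CsotTimingAssertionSketch {

    // Before (illustrative): the expected bound was padded by a measured RTT that,
    // in practice, collapsed to 0ms and only added order-dependent noise.
    void assertTimingBefore(long elapsedMillis) {
        long rttMillis = ClusterFixture.getPrimaryRTT(); // return type/signature assumed
        assertTrue(elapsedMillis < 150 + rttMillis);
    }

    // After (illustrative): a fixed bound with a little extra headroom, no RTT-based padding.
    void assertTimingAfter(long elapsedMillis) {
        assertTrue(elapsedMillis < 200);
    }
}
```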

JAVA-5375

@vbabanin self-assigned this on Aug 21, 2025