fix(cluster): --use-slaves creates cluster-aware replicas by fcostaoliveira · Pull Request #253 · RedisLabsModules/RLTest

fcostaoliveira · 2026-06-07T09:30:00Z

Summary

--env oss-cluster --use-slaves previously created standalone replicas (via --slaveof) without --cluster-enabled yes. The slaves were not in cluster gossip — CLUSTER SLOTS returned empty replica arrays from masters, silently breaking tests that check cluster-aware replica routing.

Closes #252

Changes

redis_std.py: when clusterEnabled, slaves get --cluster-enabled yes and no --slaveof. Slaves boot as empty cluster-enabled nodes ready to join gossip.
redis_cluster.py::startEnv: after master slot assignment, MEET each slave from a master so it joins gossip, then run CLUSTER REPLICATE on the slave connection to attach it to its master.
redis_cluster.py::waitCluster: when slaves are configured, also wait for the replica entries to appear in CLUSTER SLOTS (one per shard) before returning. Without this, callers that read CLUSTER SLOTS immediately after waitCluster would see only the masters even though slaves are attached.

Test plan

Existing unit test suite passes locally (pytest tests/unit/ — 157 passed, 5 skipped)
Manual repro: oss-cluster --shards-count 3 --use-slaves shows replica entries in CLUSTER SLOTS from each master immediately after waitCluster returns
Manual repro: each slave's INFO cluster shows cluster_enabled:1
Manual repro: each slave's ROLE reports slave pointing at the correct master
Manual repro: CLUSTER NODES from any master shows the full 6-node topology (3 masters + 3 slaves) with proper master/slave relationships

Backward compatibility

Standalone mode (oss-standalone --use-slaves) is unchanged — --slaveof is still used because clusterEnabled=False. Only the cluster-mode path is affected. Any existing user who relied on --use-slaves producing standalone replicas inside oss-cluster mode (an unusual configuration) will need to adjust.

Downstream context

Surfaced during redis/memtier_benchmark#456 review. memtier's read-preference feature adds replica routing; the new CI matrix cell silently skipped because of this gap. This fix closes the gap so the CI signal becomes load-bearing for cluster-aware replica routing tests in any downstream project.

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

In oss-cluster mode with --use-slaves, RLTest previously booted slaves with --slaveof <master_port> and WITHOUT --cluster-enabled yes, making them standalone replicas outside cluster gossip. CLUSTER SLOTS returned empty replica arrays from masters; tests that check cluster-aware replica routing silently skipped. * redis_std.py: when clusterEnabled, slaves get --cluster-enabled yes and no --slaveof. Slaves boot as empty cluster nodes ready to join gossip. * redis_cluster.py::startEnv: after master slot assignment, MEET each slave from a master so it joins gossip, then run CLUSTER REPLICATE on the slave connection to attach it to its master. * redis_cluster.py::waitCluster: also wait for replicas to appear in CLUSTER SLOTS (one per shard) when slaves are configured, so callers that read CLUSTER SLOTS immediately after waitCluster see the full master+replica topology. Closes RedisLabsModules#252 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-06-07T09:41:20Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 68.58006% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.54%. Comparing base (02cfbb9) to head (4808b5f).
⚠️ Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
RLTest/redis_std.py	63.00%	74 Missing ⚠️
RLTest/redis_cluster.py	79.03%	26 Missing ⚠️
RLTest/__main__.py	0.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #253      +/-   ##
==========================================
+ Coverage   32.46%   38.54%   +6.08%     
==========================================
  Files          17       18       +1     
  Lines        2597     2978     +381     
==========================================
+ Hits          843     1148     +305     
- Misses       1754     1830      +76

Flag	Coverage Δ
unittests	`38.54% <68.58%> (+6.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Upstream RLTest does not currently support cluster-aware replicas when --use-slaves is set (slaves are started with --slaveof, not joined to cluster gossip). The fork branch fix/cluster-aware-replicas (PR RedisLabsModules/RLTest#253) implements the fix. Pinning temporarily so the OSS-CLUSTER + replicas: read-preference matrix cell becomes load-bearing. Revert to RedisLabsModules/RLTest@master once upstream merges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Currently --use-slaves creates exactly one replica per shard. Add --replicas-per-shard=N (default 1) to allow N replicas per shard, extending the cluster-aware replicas fix landed earlier in this PR. * StandardEnv now stores slaves as per-replica lists (slaveProcesses / slavePorts / slaveStdouts / slaveStderrs / slaveExitCodes / slaveServerIds / slaveCmdArgsList / slaveOSEnvList) rather than scalars. getSlaveConnection(idx) and getSlavePort(idx) accept an index; getNumSlaves() returns the count. The legacy scalar attributes (slaveProcess, slavePort, slaveStdout, slaveStderr, slaveExitCode, slaveServerId, slaveCmdArgs, slaveOSEnv) remain available via @Property accessors that map to index 0, so external code and the existing test suite continue to work unchanged when replicasPerShard == 1. Standalone (non-cluster) mode is intentionally clamped to a single replica. * ClusterEnv allocates shardsCount * (1 + replicasPerShard) redises when useSlaves, and _attachSlavesToCluster loops over every live replica in each shard to MEET and CLUSTER REPLICATE it. _countAgreeSlots now normalizes per-row replica ordering so a multi-replica cluster can converge even when masters report replicas in different orders. * waitCluster's _expectedReplicasInSlots returns the total replica count across all shards, so the cluster is only marked ready once every replica appears in CLUSTER SLOTS. * New tests/unit/test_replicas_per_shard.py asserts both the construction-time bookkeeping (port allocation, list lengths, standalone clamping) and the end-to-end topology (3 shards x 2 replicas yields 9 redises and 2 replica entries per CLUSTER SLOTS row). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

iter2-R3 audit found 3 polish items worth landing before merge: * Wrap topology phase (_attachSlavesToCluster + waitCluster) in try/except with self.stopEnv() cleanup. Without this guard, a RuntimeError from waitCluster timeout or _attachSlavesToCluster retry exhaustion leaves all N * (1 + replicasPerShard) redis-server processes running — up to 9+ leaked processes with multi-replica. * Extract gossip/retry magic numbers (0.5s settle, 20 retries, 0.25s interval) into module-level constants (GOSSIP_SETTLE_SEC, REPLICATE_RETRY_MAX, REPLICATE_RETRY_INTERVAL_SEC) matching the existing CLUSTER_STATUS_INTERVAL_SEC precedent. * Docstring on _expectedReplicasInSlots clarifies that the count comes from process state, not CLUSTER SLOTS rows — robust to non-contiguous slot distribution during partial migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Tests that deliberately stop a replica (e.g. via SHUTDOWN NOSAVE to exercise a failover code path) currently trip the teardown checks: * checkExitCode() flags slaveExitCodes[i] == None as a bad exit code * __stopProcessImpl() prints "process is not alive, might have crash durring test execution, check this out. server id : N" Both are false positives — the test asked for the replica to die. Expose a small public method on StandardEnv, markSlaveDeadByTest(idx), that records the slave as expected-dead: * adds idx to a new _expectedDeadSlaves set on the shard * sets slaveExitCodes[idx] = 0 so checkExitCode() passes * sets slaveProcesses[idx] = None so stopEnv() skips the process * suppresses the "process is not alive" warning in __stopProcessImpl when the slave is in _expectedDeadSlaves (defensive, since stopEnv already short-circuits on the cleared process entry) * logs a yellow notice line for debuggability The index is the 0-based per-shard slot into slaveProcesses / slaveExitCodes / slaveServerIds — NOT the absolute serverId printed by RLTest in failure output. The docstring spells this out and points callers at slavePorts for the typical port -> idx lookup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fcostaoliveira and others added 3 commits June 7, 2026 11:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cluster): --use-slaves creates cluster-aware replicas#253

fix(cluster): --use-slaves creates cluster-aware replicas#253
fcostaoliveira wants to merge 4 commits into
RedisLabsModules:masterfrom
fcostaoliveira:fix/cluster-aware-replicas

fcostaoliveira commented Jun 7, 2026

Uh oh!

codecov-commenter commented Jun 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

fcostaoliveira commented Jun 7, 2026

Summary

Changes

Test plan

Backward compatibility

Downstream context

Uh oh!

codecov-commenter commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov-commenter commented Jun 7, 2026 •

edited

Loading