fix(cluster): --use-slaves creates cluster-aware replicas#253
Open
fcostaoliveira wants to merge 4 commits into
Open
fix(cluster): --use-slaves creates cluster-aware replicas#253fcostaoliveira wants to merge 4 commits into
fcostaoliveira wants to merge 4 commits into
Conversation
In oss-cluster mode with --use-slaves, RLTest previously booted slaves with --slaveof <master_port> and WITHOUT --cluster-enabled yes, making them standalone replicas outside cluster gossip. CLUSTER SLOTS returned empty replica arrays from masters; tests that check cluster-aware replica routing silently skipped. * redis_std.py: when clusterEnabled, slaves get --cluster-enabled yes and no --slaveof. Slaves boot as empty cluster nodes ready to join gossip. * redis_cluster.py::startEnv: after master slot assignment, MEET each slave from a master so it joins gossip, then run CLUSTER REPLICATE on the slave connection to attach it to its master. * redis_cluster.py::waitCluster: also wait for replicas to appear in CLUSTER SLOTS (one per shard) when slaves are configured, so callers that read CLUSTER SLOTS immediately after waitCluster see the full master+replica topology. Closes RedisLabsModules#252 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #253 +/- ##
==========================================
+ Coverage 32.46% 38.54% +6.08%
==========================================
Files 17 18 +1
Lines 2597 2978 +381
==========================================
+ Hits 843 1148 +305
- Misses 1754 1830 +76
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
fcostaoliveira
added a commit
to redis/memtier_benchmark
that referenced
this pull request
Jun 7, 2026
Upstream RLTest does not currently support cluster-aware replicas when --use-slaves is set (slaves are started with --slaveof, not joined to cluster gossip). The fork branch fix/cluster-aware-replicas (PR RedisLabsModules/RLTest#253) implements the fix. Pinning temporarily so the OSS-CLUSTER + replicas: read-preference matrix cell becomes load-bearing. Revert to RedisLabsModules/RLTest@master once upstream merges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Currently --use-slaves creates exactly one replica per shard. Add --replicas-per-shard=N (default 1) to allow N replicas per shard, extending the cluster-aware replicas fix landed earlier in this PR. * StandardEnv now stores slaves as per-replica lists (slaveProcesses / slavePorts / slaveStdouts / slaveStderrs / slaveExitCodes / slaveServerIds / slaveCmdArgsList / slaveOSEnvList) rather than scalars. getSlaveConnection(idx) and getSlavePort(idx) accept an index; getNumSlaves() returns the count. The legacy scalar attributes (slaveProcess, slavePort, slaveStdout, slaveStderr, slaveExitCode, slaveServerId, slaveCmdArgs, slaveOSEnv) remain available via @Property accessors that map to index 0, so external code and the existing test suite continue to work unchanged when replicasPerShard == 1. Standalone (non-cluster) mode is intentionally clamped to a single replica. * ClusterEnv allocates shardsCount * (1 + replicasPerShard) redises when useSlaves, and _attachSlavesToCluster loops over every live replica in each shard to MEET and CLUSTER REPLICATE it. _countAgreeSlots now normalizes per-row replica ordering so a multi-replica cluster can converge even when masters report replicas in different orders. * waitCluster's _expectedReplicasInSlots returns the total replica count across all shards, so the cluster is only marked ready once every replica appears in CLUSTER SLOTS. * New tests/unit/test_replicas_per_shard.py asserts both the construction-time bookkeeping (port allocation, list lengths, standalone clamping) and the end-to-end topology (3 shards x 2 replicas yields 9 redises and 2 replica entries per CLUSTER SLOTS row). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iter2-R3 audit found 3 polish items worth landing before merge: * Wrap topology phase (_attachSlavesToCluster + waitCluster) in try/except with self.stopEnv() cleanup. Without this guard, a RuntimeError from waitCluster timeout or _attachSlavesToCluster retry exhaustion leaves all N * (1 + replicasPerShard) redis-server processes running — up to 9+ leaked processes with multi-replica. * Extract gossip/retry magic numbers (0.5s settle, 20 retries, 0.25s interval) into module-level constants (GOSSIP_SETTLE_SEC, REPLICATE_RETRY_MAX, REPLICATE_RETRY_INTERVAL_SEC) matching the existing CLUSTER_STATUS_INTERVAL_SEC precedent. * Docstring on _expectedReplicasInSlots clarifies that the count comes from process state, not CLUSTER SLOTS rows — robust to non-contiguous slot distribution during partial migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests that deliberately stop a replica (e.g. via SHUTDOWN NOSAVE to
exercise a failover code path) currently trip the teardown checks:
* checkExitCode() flags slaveExitCodes[i] == None as a bad exit code
* __stopProcessImpl() prints "process is not alive, might have crash
durring test execution, check this out. server id : N"
Both are false positives — the test asked for the replica to die.
Expose a small public method on StandardEnv, markSlaveDeadByTest(idx),
that records the slave as expected-dead:
* adds idx to a new _expectedDeadSlaves set on the shard
* sets slaveExitCodes[idx] = 0 so checkExitCode() passes
* sets slaveProcesses[idx] = None so stopEnv() skips the process
* suppresses the "process is not alive" warning in __stopProcessImpl
when the slave is in _expectedDeadSlaves (defensive, since stopEnv
already short-circuits on the cleared process entry)
* logs a yellow notice line for debuggability
The index is the 0-based per-shard slot into slaveProcesses /
slaveExitCodes / slaveServerIds — NOT the absolute serverId printed
by RLTest in failure output. The docstring spells this out and points
callers at slavePorts for the typical port -> idx lookup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--env oss-cluster --use-slavespreviously created standalone replicas (via--slaveof) without--cluster-enabled yes. The slaves were not in cluster gossip —CLUSTER SLOTSreturned empty replica arrays from masters, silently breaking tests that check cluster-aware replica routing.Closes #252
Changes
redis_std.py: whenclusterEnabled, slaves get--cluster-enabled yesand no--slaveof. Slaves boot as empty cluster-enabled nodes ready to join gossip.redis_cluster.py::startEnv: after master slot assignment, MEET each slave from a master so it joins gossip, then runCLUSTER REPLICATEon the slave connection to attach it to its master.redis_cluster.py::waitCluster: when slaves are configured, also wait for the replica entries to appear inCLUSTER SLOTS(one per shard) before returning. Without this, callers that readCLUSTER SLOTSimmediately afterwaitClusterwould see only the masters even though slaves are attached.Test plan
pytest tests/unit/— 157 passed, 5 skipped)oss-cluster --shards-count 3 --use-slavesshows replica entries inCLUSTER SLOTSfrom each master immediately afterwaitClusterreturnsINFO clustershowscluster_enabled:1ROLEreportsslavepointing at the correct masterCLUSTER NODESfrom any master shows the full 6-node topology (3 masters + 3 slaves) with proper master/slave relationshipsBackward compatibility
Standalone mode (
oss-standalone --use-slaves) is unchanged —--slaveofis still used becauseclusterEnabled=False. Only the cluster-mode path is affected. Any existing user who relied on--use-slavesproducing standalone replicas insideoss-clustermode (an unusual configuration) will need to adjust.Downstream context
Surfaced during redis/memtier_benchmark#456 review. memtier's read-preference feature adds replica routing; the new CI matrix cell silently skipped because of this gap. This fix closes the gap so the CI signal becomes load-bearing for cluster-aware replica routing tests in any downstream project.
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com