Skip to content

fix(cluster): --use-slaves creates cluster-aware replicas#253

Open
fcostaoliveira wants to merge 4 commits into
RedisLabsModules:masterfrom
fcostaoliveira:fix/cluster-aware-replicas
Open

fix(cluster): --use-slaves creates cluster-aware replicas#253
fcostaoliveira wants to merge 4 commits into
RedisLabsModules:masterfrom
fcostaoliveira:fix/cluster-aware-replicas

Conversation

@fcostaoliveira
Copy link
Copy Markdown
Contributor

Summary

--env oss-cluster --use-slaves previously created standalone replicas (via --slaveof) without --cluster-enabled yes. The slaves were not in cluster gossip — CLUSTER SLOTS returned empty replica arrays from masters, silently breaking tests that check cluster-aware replica routing.

Closes #252

Changes

  • redis_std.py: when clusterEnabled, slaves get --cluster-enabled yes and no --slaveof. Slaves boot as empty cluster-enabled nodes ready to join gossip.
  • redis_cluster.py::startEnv: after master slot assignment, MEET each slave from a master so it joins gossip, then run CLUSTER REPLICATE on the slave connection to attach it to its master.
  • redis_cluster.py::waitCluster: when slaves are configured, also wait for the replica entries to appear in CLUSTER SLOTS (one per shard) before returning. Without this, callers that read CLUSTER SLOTS immediately after waitCluster would see only the masters even though slaves are attached.

Test plan

  • Existing unit test suite passes locally (pytest tests/unit/ — 157 passed, 5 skipped)
  • Manual repro: oss-cluster --shards-count 3 --use-slaves shows replica entries in CLUSTER SLOTS from each master immediately after waitCluster returns
  • Manual repro: each slave's INFO cluster shows cluster_enabled:1
  • Manual repro: each slave's ROLE reports slave pointing at the correct master
  • Manual repro: CLUSTER NODES from any master shows the full 6-node topology (3 masters + 3 slaves) with proper master/slave relationships

Backward compatibility

Standalone mode (oss-standalone --use-slaves) is unchanged — --slaveof is still used because clusterEnabled=False. Only the cluster-mode path is affected. Any existing user who relied on --use-slaves producing standalone replicas inside oss-cluster mode (an unusual configuration) will need to adjust.

Downstream context

Surfaced during redis/memtier_benchmark#456 review. memtier's read-preference feature adds replica routing; the new CI matrix cell silently skipped because of this gap. This fix closes the gap so the CI signal becomes load-bearing for cluster-aware replica routing tests in any downstream project.

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

In oss-cluster mode with --use-slaves, RLTest previously booted slaves
with --slaveof <master_port> and WITHOUT --cluster-enabled yes, making
them standalone replicas outside cluster gossip. CLUSTER SLOTS returned
empty replica arrays from masters; tests that check cluster-aware
replica routing silently skipped.

* redis_std.py: when clusterEnabled, slaves get --cluster-enabled yes
  and no --slaveof. Slaves boot as empty cluster nodes ready to join
  gossip.

* redis_cluster.py::startEnv: after master slot assignment, MEET each
  slave from a master so it joins gossip, then run CLUSTER REPLICATE
  on the slave connection to attach it to its master.

* redis_cluster.py::waitCluster: also wait for replicas to appear in
  CLUSTER SLOTS (one per shard) when slaves are configured, so callers
  that read CLUSTER SLOTS immediately after waitCluster see the full
  master+replica topology.

Closes RedisLabsModules#252

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 7, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 68.58006% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.54%. Comparing base (02cfbb9) to head (4808b5f).
⚠️ Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
RLTest/redis_std.py 63.00% 74 Missing ⚠️
RLTest/redis_cluster.py 79.03% 26 Missing ⚠️
RLTest/__main__.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #253      +/-   ##
==========================================
+ Coverage   32.46%   38.54%   +6.08%     
==========================================
  Files          17       18       +1     
  Lines        2597     2978     +381     
==========================================
+ Hits          843     1148     +305     
- Misses       1754     1830      +76     
Flag Coverage Δ
unittests 38.54% <68.58%> (+6.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fcostaoliveira added a commit to redis/memtier_benchmark that referenced this pull request Jun 7, 2026
Upstream RLTest does not currently support cluster-aware replicas when
--use-slaves is set (slaves are started with --slaveof, not joined to
cluster gossip). The fork branch fix/cluster-aware-replicas (PR
RedisLabsModules/RLTest#253) implements the fix.

Pinning temporarily so the OSS-CLUSTER + replicas: read-preference
matrix cell becomes load-bearing. Revert to RedisLabsModules/RLTest@master
once upstream merges.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fcostaoliveira and others added 3 commits June 7, 2026 11:09
Currently --use-slaves creates exactly one replica per shard. Add
--replicas-per-shard=N (default 1) to allow N replicas per shard,
extending the cluster-aware replicas fix landed earlier in this PR.

* StandardEnv now stores slaves as per-replica lists (slaveProcesses /
  slavePorts / slaveStdouts / slaveStderrs / slaveExitCodes /
  slaveServerIds / slaveCmdArgsList / slaveOSEnvList) rather than
  scalars. getSlaveConnection(idx) and getSlavePort(idx) accept an
  index; getNumSlaves() returns the count. The legacy scalar attributes
  (slaveProcess, slavePort, slaveStdout, slaveStderr, slaveExitCode,
  slaveServerId, slaveCmdArgs, slaveOSEnv) remain available via
  @Property accessors that map to index 0, so external code and the
  existing test suite continue to work unchanged when
  replicasPerShard == 1. Standalone (non-cluster) mode is intentionally
  clamped to a single replica.

* ClusterEnv allocates shardsCount * (1 + replicasPerShard) redises
  when useSlaves, and _attachSlavesToCluster loops over every live
  replica in each shard to MEET and CLUSTER REPLICATE it.
  _countAgreeSlots now normalizes per-row replica ordering so a
  multi-replica cluster can converge even when masters report replicas
  in different orders.

* waitCluster's _expectedReplicasInSlots returns the total replica
  count across all shards, so the cluster is only marked ready once
  every replica appears in CLUSTER SLOTS.

* New tests/unit/test_replicas_per_shard.py asserts both the
  construction-time bookkeeping (port allocation, list lengths,
  standalone clamping) and the end-to-end topology (3 shards x 2
  replicas yields 9 redises and 2 replica entries per CLUSTER SLOTS
  row).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iter2-R3 audit found 3 polish items worth landing before merge:

* Wrap topology phase (_attachSlavesToCluster + waitCluster) in
  try/except with self.stopEnv() cleanup. Without this guard, a
  RuntimeError from waitCluster timeout or _attachSlavesToCluster
  retry exhaustion leaves all N * (1 + replicasPerShard) redis-server
  processes running — up to 9+ leaked processes with multi-replica.

* Extract gossip/retry magic numbers (0.5s settle, 20 retries, 0.25s
  interval) into module-level constants (GOSSIP_SETTLE_SEC,
  REPLICATE_RETRY_MAX, REPLICATE_RETRY_INTERVAL_SEC) matching the
  existing CLUSTER_STATUS_INTERVAL_SEC precedent.

* Docstring on _expectedReplicasInSlots clarifies that the count
  comes from process state, not CLUSTER SLOTS rows — robust to
  non-contiguous slot distribution during partial migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests that deliberately stop a replica (e.g. via SHUTDOWN NOSAVE to
exercise a failover code path) currently trip the teardown checks:

  * checkExitCode() flags slaveExitCodes[i] == None as a bad exit code
  * __stopProcessImpl() prints "process is not alive, might have crash
    durring test execution, check this out. server id : N"

Both are false positives — the test asked for the replica to die.

Expose a small public method on StandardEnv, markSlaveDeadByTest(idx),
that records the slave as expected-dead:

  * adds idx to a new _expectedDeadSlaves set on the shard
  * sets slaveExitCodes[idx] = 0 so checkExitCode() passes
  * sets slaveProcesses[idx] = None so stopEnv() skips the process
  * suppresses the "process is not alive" warning in __stopProcessImpl
    when the slave is in _expectedDeadSlaves (defensive, since stopEnv
    already short-circuits on the cleared process entry)
  * logs a yellow notice line for debuggability

The index is the 0-based per-shard slot into slaveProcesses /
slaveExitCodes / slaveServerIds — NOT the absolute serverId printed
by RLTest in failure output. The docstring spells this out and points
callers at slavePorts for the typical port -> idx lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

oss-cluster + --use-slaves creates standalone replicas, not cluster-aware ones

2 participants