fix: add @Exclusive tag to prevent cluster-mutating tests from running in parallel#2360

Open
delthas wants to merge 2 commits into development/2.14 from improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag

Conversation

@delthas
Contributor

@delthas commented Mar 20, 2026

Summary

Fixes intermittent CI failures in ctst-end2end-sharded caused by parallel cucumber workers interfering with each other when one worker runs scenarios that mutate cluster-wide state.

Problem

The ctst-end2end-sharded job runs cucumber with 4 parallel workers (--parallel $PARALLEL_RUNS). Some test scenarios create or modify Zenko locations via the management API, which triggers operator reconciliation and rolling restarts of backbeat components (replication data processor, notification processors, sorbet, etc.).

When these cluster-mutating scenarios run in one worker, the other 3 workers' tests are affected — their backbeat pods get killed and recreated mid-flight, causing replication timeouts, kafka cleaner failures, and azure archive restore retry timeouts.

Observed failure (run #8809)

8 out of 4418 scenarios failed:

  • 6 replication scenarios (s3utils + location stripping) — objects stuck in pending/processing state
  • 1 azure archive restore retry — timeout waiting for restored state
  • 1 kafka cleaner — topics not cleaned in time

Root cause timeline

  1. 11:08 — Azure archive CRUD test starts on worker pid:62, creates location e2e-azure-archive-2-non-versioned via POST /config/{id}/location
  2. 11:08–11:28 — Operator reconciles, triggering 23+ rolling update events for backbeat-replication-data-processor across 6 different ReplicaSets. The data processor is killed and recreated 15 times.
  3. 11:22–11:28 — A replication-data-processor pod fails to mount backbeat-config secret (v21 doesn't exist yet), is killed. Processor is completely down for 6 minutes.
  4. 11:28:39 — Final processor pod created, becomes ready at ~11:29
  5. 11:29:10 — Replication tests start on workers pid:48 and pid:54 — seconds after the processor came up. The freshly-started processor hasn't re-joined Kafka consumer groups yet.
  6. 11:34–11:46 — All 6 replication scenarios timeout (300s) because the processor can't keep up.

The CRUD scenario creates 3 locations + modifies 3 locations = 6 reconciliation rounds, each triggering a full rolling restart of all backbeat deployments. The waitForZenkoToStabilize() call in the CRUD test only blocks that specific worker — the other 3 workers are unaware that pods are being churned.

Solution

Add an @Exclusive tag mechanism to cucumber's setParallelCanAssign that gives tagged scenarios exclusive access to all workers:

  • When an @Exclusive scenario is running, no other scenario can start on any worker
  • An @Exclusive scenario only starts when all other running scenarios have finished
  • The existing atMostOnePicklePerTag logic for @ColdStorage, @PRA, etc. is preserved as a fallback

This is safe from races because the coordinator runs in a single Node.js process — setParallelCanAssign is called synchronously from the event loop when deciding work placement. Cucumber also has a built-in deadlock safety valve: if all workers go idle but pickles remain, it force-assigns the first one.
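The predicate below is a minimal, self-contained sketch of this mechanism (the PR's actual diff is not shown here, so names like `hasTag` and `exclusiveCanAssign` are illustrative). It follows cucumber-js's `setParallelCanAssign` contract, where the coordinator calls a predicate with the candidate pickle and the list of in-progress pickles; `atMostOnePicklePerTag` is re-implemented inline rather than imported from `@cucumber/cucumber`'s `parallelCanAssignHelpers` to keep the sketch runnable on its own:

```javascript
// Does this pickle carry the given tag (tags look like [{ name: '@PRA' }, ...])?
const hasTag = (pickle, tagName) =>
  pickle.tags.some((tag) => tag.name === tagName);

// Stand-in for cucumber's parallelCanAssignHelpers.atMostOnePicklePerTag:
// a pickle carrying one of the listed tags may only start if no
// in-progress pickle carries that same tag.
const atMostOnePicklePerTag = (...tagNames) => (pickle, inProgress) =>
  tagNames.every(
    (tagName) =>
      !hasTag(pickle, tagName) ||
      inProgress.every((p) => !hasTag(p, tagName))
  );

const fallback = atMostOnePicklePerTag('@ColdStorage', '@PRA');

function exclusiveCanAssign(pickleInQuestion, picklesInProgress) {
  // An @Exclusive scenario only starts once every worker is idle...
  if (hasTag(pickleInQuestion, '@Exclusive')) {
    return picklesInProgress.length === 0;
  }
  // ...and while one is running, nothing else may start on any worker.
  if (picklesInProgress.some((p) => hasTag(p, '@Exclusive'))) {
    return false;
  }
  // Otherwise, keep the existing per-tag mutual exclusion.
  return fallback(pickleInQuestion, picklesInProgress);
}
```

In the real support code this predicate would be registered once via `setParallelCanAssign(exclusiveCanAssign)`; cucumber's built-in safety valve (force-assigning the first pickle if all workers go idle with pickles remaining) still applies unchanged.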

Scenarios tagged with @Exclusive

| Scenario | Feature | Mutation |
| --- | --- | --- |
| Create, read, update and delete azure archive location | azureArchive.feature | Creates 3 locations + modifies them → 6 reconciliation rounds |
| Bucket Website CRUD | bucketWebsite.feature | Adds endpoint to overlay (no stabilization wait) |
| PRA (nominal case) | pra.feature | Installs/uninstalls entire DR site |
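In the feature files, the tag sits above the scenario like any other cucumber tag. A hypothetical, trimmed example (the real scenario steps are not shown in this PR):

```gherkin
@Exclusive
Scenario: Create, read, update and delete azure archive location
  # Mutates cluster-wide state: creating/modifying locations triggers
  # operator reconciliation and rolling restarts of backbeat pods.
  Given ...
```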

Note: "Pause and resume archiving to azure" was initially tagged but removed after review — it only calls the Backbeat API (/_/lifecycle/pause/{location}) and does not modify the overlay or trigger operator reconciliation.

Alternatives considered

  1. Move location creation to configure-e2e-ctst.sh — Would eliminate the problem for azure archive CRUD but doesn't generalize to other cluster-mutating scenarios (PRA, website). Would also require significant refactoring of the CRUD test itself.

  2. Tag-based ordering (run mutating tests in a separate phase) — Cucumber doesn't natively support phased execution. Would require splitting into multiple cucumber-js invocations, losing the single-report output.

  3. Reduce parallelism globally — Would slow down all tests, not just the problematic ones.

The @Exclusive approach is the most targeted: it only serializes the specific scenarios that cause cluster-wide churn, while allowing all other tests to run in parallel as before.

Estimated performance impact

Based on a successful run (attempt 4 of #8809):

| Scenario | Duration | Extra wall-clock if exclusive |
| --- | --- | --- |
| Azure CRUD (3 examples + 1 retry) | ~17 min | ~13 min |
| Bucket Website CRUD | ~1 s | ~0 s |
| PRA | N/A (excluded by not @PRA) | 0 |

Current pipeline: ~82 min → estimated with @Exclusive: ~95 min (+16%)

Without retries, the cost drops to ~10%. This is a worthwhile tradeoff for eliminating a major source of CI flakiness that currently requires re-running the entire job (adding 82+ min per retry).

Issue: ZENKO-5228

@bert-e
Contributor

bert-e commented Mar 20, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
| --- | --- | --- | --- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset | TBA | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
| --- | --- | --- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request | TBA |
| /clear | Remove all comments from Bert-E from the history | TBA |
| /retry | Re-start a fresh build | TBA |
| /build | Re-start a fresh build | TBA |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 20, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

@delthas force-pushed the improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag branch from 57627e3 to ffb4756 on March 20, 2026 11:15
Contributor

@SylvainSenechal left a comment


General thoughts:

  • Waiting for you to rerun 3/4 times to analyze the real impact on both timing and flakiness
  • Still wish we could modify the zenko operator to not reconcile every backbeat pod when they are not concerned at all by the configuration change
  • One trick we may consider here: while we can't control the order of test execution that much, I believe cucumber runs the tests from top to bottom of each file, and may also run the files from a to z (that's why the kafka cleaner scenario has its file called zzz.xxx 🌚). So it might be interesting to put all the problematic tests together at the end of the file instead of having them in the middle

Other things:

  • If we merge this PR, we need to update "HOW_TO_WRITE_TEST.MD": document the new Exclusive tag, and drop rule 3 about not reconfiguring the env during tests
  • William had a different mechanism based on locking a file so that all workers run a single task at a time; I think you can see the implementation in the Cli-testing folder. Worth taking a look to compare his solution with yours

Edit:
After spending time on this CI, I have also come to the same conclusion that the flakiness comes from those pods being killed and recreated many times while the tests are running, so this is definitely one of the big things we need to fix.

Copy link
Contributor

@francoisferrand left a comment


I have really mixed feelings on this one: we are just putting on a bandaid, not addressing the actual underlying issue (the system should remain stable even when creating locations!!!), just masking the problem and increasing build time...

Maybe this is acceptable as a temporary measure or to facilitate investigations, but only in that context: work must not stop there.

  • The system MUST remain stable and functional even through location creations, etc.
  • Either there is a bug in these tests (not waiting on the right events or in the right way), or an issue in the software: either way, it must be fixed
  • Introducing exclusivity is the exact opposite of good practice for tests (tests must be idempotent, work in parallel, etc.)

delthas added 2 commits March 23, 2026 10:50
- Remove @exclusive from "Pause and resume archiving to azure" scenario:
  it only calls the Backbeat API to pause/resume lifecycle, no overlay
  change or operator reconciliation is triggered.
- Update HOW_TO_WRITE_TESTS.md: document @exclusive tag in Rule 3,
  add note to Rule 4 distinguishing it from atMostOnePicklePerTag.

Issue: ZENKO-5228
@delthas force-pushed the improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag branch from a63364f to 149ade8 on March 23, 2026 09:50
@delthas
Contributor Author

delthas commented Mar 23, 2026

For now, 2 successes out of 2 tries on the CTST runs (the previous failures were unrelated prerequisites)

@francoisferrand
Contributor

francoisferrand commented Mar 23, 2026

For now, 2 successes out of 2 tries on the CTST runs (the previous failures were unrelated prerequisites)

  • "thanks" to the change I made to "merge results", once you succeed all re-runs will succeed.... So you should double-check the logs to be sure...
  • probably worth doing more than 2 times
  • if it repeats, it does confirm the flakiness is due to the zenko reconfigurations : which is a great step, yet reconfiguration is a normal event, so this PR would simply hide the bug... We cannot just bury the issue, how do we proceed to find & address the issue?

@francoisferrand
Contributor

  • If we merge this pr, need to update the "HOW_TO_WRITE_TEST.MD" : document the new Exclusive tag, and drop the rule 3 about not reconfiguring the env during tests

Exclusive is not recommended, since we should not have such a limitation. Do we want to document it and make it normal?

  • William had a different mechanism based on locking a file to do a single task at the same time for all worker, I think you can see the implementation in the Cli-testing folder : Worth to take a look and compare his solution with yours

This mechanism works the other way: it is used to "share" some action between different tests, by ensuring it is actually performed just once. Not sure how it translates to "exclusive" tests, which should block all other tests.

@delthas
Contributor Author

delthas commented Mar 24, 2026

Followup created at ZENKO-5240

So far:

  • 2 runs without any failure
  • 4 runs with only "Bucket Websites.Bucket Website CRUD" failing -- I think this test is flaky; working on it -- will open a separate PR
  • 1 run with only "Quota Management for APIs.Object restoration implements strict quotas" failing

I'd suggest merging this, then merging Sylvain's PR, then I work on ZENKO-5240

@delthas
Contributor Author

delthas commented Mar 24, 2026

As for documenting @Exclusive I do not have a strong opinion, but I'd argue it's better to document it with a warning than keeping it a secret that developers then have to explore to understand what it does when they encounter it.

@francoisferrand
Contributor

francoisferrand commented Mar 24, 2026

Followup created at ZENKO-5240

Out of 5 runs:

  • 2 without any failure
  • 2 with only "Bucket Websites.Bucket Website CRUD" failed -- I think this test is flaky, working on it -- will open a separate PR
  • 1 with only "Quota Management for APIs.Object restoration implements strict quotas" failed

I'd suggest merging this, then merging Sylvain's PR, then I work on ZENKO-5240

Flakiness is an issue that could not be tackled for years: the latest attempt is the current one, where we try to first run locally and improve troubleshooting capabilities, so we can then eventually fix flakiness.
How do you plan to address the issue, so we can be confident we will fix it and not end up just hiding it? (Not asking you to answer here, just that we need a plan likely to succeed: so please think it out and put it in the ticket.)

Another way of saying it: once we have identified the root cause (and reconciling is not the root cause, it's just a normal event which is a circumstance of the failure), there is no problem tweaking the test to reduce flakiness while we do the actual fix (or don't, if we deem it not an issue in production). But as long as we haven't, changing this with no plan for how to find the issue has a very high risk of just sweeping the problem under the rug...


I also added a snapshot of the flaky tests identified by trunk.io in the ticket, for reference when work continues. Not sure how precise or trustworthy the values are, but a few thoughts anyway:

  • the tests you mention were not in the list: did they become flaky due to some recent changes? Or do they just have very low flakiness rates?
  • the failure rate of individual tests is at 10% or less for most of them, and 26-33% for the top 6 most flaky. So I wonder if we should do a few more runs, just to eliminate statistical anomalies even more...
