fix: add @Exclusive tag to prevent cluster-mutating tests from running in parallel#2360

Open
delthas wants to merge 2 commits into development/2.14 from improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag

Conversation

@delthas
Contributor

@delthas commented Mar 20, 2026

Summary

Fixes intermittent CI failures in ctst-end2end-sharded caused by parallel cucumber workers interfering with each other when one worker runs scenarios that mutate cluster-wide state.

Problem

The ctst-end2end-sharded job runs cucumber with 4 parallel workers (--parallel $PARALLEL_RUNS). Some test scenarios create or modify Zenko locations via the management API, which triggers operator reconciliation and rolling restarts of backbeat components (replication data processor, notification processors, sorbet, etc.).

When these cluster-mutating scenarios run in one worker, the other 3 workers' tests are affected — their backbeat pods get killed and recreated mid-flight, causing replication timeouts, kafka cleaner failures, and azure archive restore retry timeouts.

Observed failure (run #8809)

8 out of 4418 scenarios failed:

  • 6 replication scenarios (s3utils + location stripping) — objects stuck in pending/processing state
  • 1 azure archive restore retry — timeout waiting for restored state
  • 1 kafka cleaner — topics not cleaned in time

Root cause timeline

  1. 11:08 — Azure archive CRUD test starts on worker pid:62, creates location e2e-azure-archive-2-non-versioned via POST /config/{id}/location
  2. 11:08–11:28 — Operator reconciles, triggering 23+ rolling update events for backbeat-replication-data-processor across 6 different ReplicaSets. The data processor is killed and recreated 15 times.
  3. 11:22–11:28 — A replication-data-processor pod fails to mount backbeat-config secret (v21 doesn't exist yet), is killed. Processor is completely down for 6 minutes.
  4. 11:28:39 — Final processor pod created, becomes ready at ~11:29
  5. 11:29:10 — Replication tests start on workers pid:48 and pid:54 — seconds after the processor came up. The freshly-started processor hasn't re-joined Kafka consumer groups yet.
  6. 11:34–11:46 — All 6 replication scenarios timeout (300s) because the processor can't keep up.

The CRUD scenario creates 3 locations + modifies 3 locations = 6 reconciliation rounds, each triggering a full rolling restart of all backbeat deployments. The waitForZenkoToStabilize() call in the CRUD test only blocks that specific worker — the other 3 workers are unaware that pods are being churned.

Solution

Add an @Exclusive tag mechanism to cucumber's setParallelCanAssign that gives tagged scenarios exclusive access to all workers:

  • When an @Exclusive scenario is running, no other scenario can start on any worker
  • An @Exclusive scenario only starts when all other running scenarios have finished
  • The existing atMostOnePicklePerTag logic for @ColdStorage, @PRA, etc. is preserved as a fallback

This is safe from races because the coordinator runs in a single Node.js process — setParallelCanAssign is called synchronously from the event loop when deciding work placement. Cucumber also has a built-in deadlock safety valve: if all workers go idle but pickles remain, it force-assigns the first one.
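The predicate below is a minimal, self-contained sketch of this mechanism (the PR's actual diff is not shown here, so names like `hasTag` and `exclusiveCanAssign` are illustrative). It follows cucumber-js's `setParallelCanAssign` contract, where the coordinator calls a predicate with the candidate pickle and the list of in-progress pickles; `atMostOnePicklePerTag` is re-implemented inline rather than imported from `@cucumber/cucumber`'s `parallelCanAssignHelpers` to keep the sketch runnable on its own:

```javascript
// Does this pickle carry the given tag (tags look like [{ name: '@PRA' }, ...])?
const hasTag = (pickle, tagName) =>
  pickle.tags.some((tag) => tag.name === tagName);

// Stand-in for cucumber's parallelCanAssignHelpers.atMostOnePicklePerTag:
// a pickle carrying one of the listed tags may only start if no
// in-progress pickle carries that same tag.
const atMostOnePicklePerTag = (...tagNames) => (pickle, inProgress) =>
  tagNames.every(
    (tagName) =>
      !hasTag(pickle, tagName) ||
      inProgress.every((p) => !hasTag(p, tagName))
  );

const fallback = atMostOnePicklePerTag('@ColdStorage', '@PRA');

function exclusiveCanAssign(pickleInQuestion, picklesInProgress) {
  // An @Exclusive scenario only starts once every worker is idle...
  if (hasTag(pickleInQuestion, '@Exclusive')) {
    return picklesInProgress.length === 0;
  }
  // ...and while one is running, nothing else may start on any worker.
  if (picklesInProgress.some((p) => hasTag(p, '@Exclusive'))) {
    return false;
  }
  // Otherwise, keep the existing per-tag mutual exclusion.
  return fallback(pickleInQuestion, picklesInProgress);
}
```

In the real support code this predicate would be registered once via `setParallelCanAssign(exclusiveCanAssign)`; cucumber's built-in safety valve (force-assigning the first pickle if all workers go idle with pickles remaining) still applies unchanged.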

Scenarios tagged with @Exclusive

| Scenario | Feature | Mutation |
| --- | --- | --- |
| Create, read, update and delete azure archive location | azureArchive.feature | Creates 3 locations + modifies them → 6 reconciliation rounds |
| Bucket Website CRUD | bucketWebsite.feature | Adds endpoint to overlay (no stabilization wait) |
| PRA (nominal case) | pra.feature | Installs/uninstalls entire DR site |
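In the feature files, the tag sits above the scenario like any other cucumber tag. A hypothetical, trimmed example (the real scenario steps are not shown in this PR):

```gherkin
@Exclusive
Scenario: Create, read, update and delete azure archive location
  # Mutates cluster-wide state: creating/modifying locations triggers
  # operator reconciliation and rolling restarts of backbeat pods.
  Given ...
```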

Note: "Pause and resume archiving to azure" was initially tagged but removed after review — it only calls the Backbeat API (/_/lifecycle/pause/{location}) and does not modify the overlay or trigger operator reconciliation.

Alternatives considered

  1. Move location creation to configure-e2e-ctst.sh — Would eliminate the problem for azure archive CRUD but doesn't generalize to other cluster-mutating scenarios (PRA, website). Would also require significant refactoring of the CRUD test itself.

  2. Tag-based ordering (run mutating tests in a separate phase) — Cucumber doesn't natively support phased execution. Would require splitting into multiple cucumber-js invocations, losing the single-report output.

  3. Reduce parallelism globally — Would slow down all tests, not just the problematic ones.

The @Exclusive approach is the most targeted: it only serializes the specific scenarios that cause cluster-wide churn, while allowing all other tests to run in parallel as before.

Estimated performance impact

Based on a successful run (attempt 4 of #8809):

| Scenario | Duration | Extra wall-clock if exclusive |
| --- | --- | --- |
| Azure CRUD (3 examples + 1 retry) | ~17 min | ~13 min |
| Bucket Website CRUD | ~1 s | ~0 s |
| PRA | N/A (excluded by not @PRA) | 0 |

Current pipeline: ~82 min → estimated with @Exclusive: ~95 min (+16%)

Without retries, the cost drops to ~10%. This is a worthwhile tradeoff for eliminating a major source of CI flakiness that currently requires re-running the entire job (adding 82+ min per retry).

Issue: ZENKO-5228

@bert-e
Contributor

bert-e commented Mar 20, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
| --- | --- | --- | --- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset | TBA | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
| --- | --- | --- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request | TBA |
| /clear | Remove all comments from Bert-E from the history | TBA |
| /retry | Re-start a fresh build | TBA |
| /build | Re-start a fresh build | TBA |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 20, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

@delthas force-pushed the improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag branch from 57627e3 to ffb4756 on March 20, 2026 11:15
Contributor

@SylvainSenechal left a comment


General thoughts:

  • Waiting for you to rerun 3/4 times to analyze the real impact on both timing and flakiness
  • Still wish we could modify the zenko operator to not reconcile every backbeat pod when they are not concerned at all by the configuration change
  • One trick we may consider here: while we can't control the order of test execution that much, I believe cucumber runs the tests from top to bottom of each file, and may also run the files from a to z (that's why the kafka cleaner scenario has its file called zzz.xxx 🌚). So it might be interesting to put all the problematic tests together at the end of the file instead of having them in the middle

Other things:

  • If we merge this PR, we need to update "HOW_TO_WRITE_TEST.MD": document the new Exclusive tag, and drop rule 3 about not reconfiguring the env during tests
  • William had a different mechanism based on locking a file so that all workers run a single task at a time; I think you can see the implementation in the Cli-testing folder. Worth taking a look to compare his solution with yours

Edit:
After spending time on this CI, I have also come to the same conclusion that the flakiness comes from those pods being killed and recreated many times while the tests are running, so this is definitely one of the big things we need to fix.

Copy link
Contributor

@francoisferrand left a comment


I have really mixed feelings on this one: we are just putting on a bandaid, not addressing the actual underlying issue (the system should remain stable even when creating locations!!!), just masking the problem and increasing build time...

Maybe this is acceptable as a temporary measure or to facilitate investigations, but only in that context: work must not stop there.

  • The system MUST remain stable and functional even through location creations, etc.
  • Either there is a bug in these tests (not waiting on the right events or in the right way), or an issue in the software: either way, it must be fixed
  • Introducing exclusivity is the exact opposite of good practice for tests (tests must be idempotent, work in parallel, etc.)

delthas added 2 commits March 23, 2026 10:50
- Remove @exclusive from "Pause and resume archiving to azure" scenario:
  it only calls the Backbeat API to pause/resume lifecycle, no overlay
  change or operator reconciliation is triggered.
- Update HOW_TO_WRITE_TESTS.md: document @exclusive tag in Rule 3,
  add note to Rule 4 distinguishing it from atMostOnePicklePerTag.

Issue: ZENKO-5228
@delthas force-pushed the improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag branch from a63364f to 149ade8 on March 23, 2026 09:50
@delthas
Contributor Author

delthas commented Mar 23, 2026

For now, 2 successes out of 2 tries on the CTST runs (the previous failures were unrelated prerequisites)

@francoisferrand
Contributor

francoisferrand commented Mar 23, 2026

For now, 2 successes out of 2 tries on the CTST runs (the previous failures were unrelated prerequisites)

  • "thanks" to the change I made to "merge results", once you succeed all re-runs will succeed.... So you should double-check the logs to be sure...
  • probably worth doing more than 2 times
  • if it repeats, it does confirm the flakiness is due to the zenko reconfigurations : which is a great step, yet reconfiguration is a normal event, so this PR would simply hide the bug... We cannot just bury the issue, how do we proceed to find & address the issue?

@francoisferrand
Contributor

  • If we merge this pr, need to update the "HOW_TO_WRITE_TEST.MD" : document the new Exclusive tag, and drop the rule 3 about not reconfiguring the env during tests

Exclusive is not recommended, since we should not have such a limitation. Do we want to document it and make it normal?

  • William had a different mechanism based on locking a file to do a single task at the same time for all worker, I think you can see the implementation in the Cli-testing folder : Worth to take a look and compare his solution with yours

This mechanism works the other way: it is used to "share" some action between different tests, by ensuring it is actually performed just once. Not sure how it translates to "exclusive" tests, which should block all other tests.

@delthas
Contributor Author

delthas commented Mar 24, 2026

Followup created at ZENKO-5240

So far:

  • 2 runs without any failure
  • 4 runs with only "Bucket Websites.Bucket Website CRUD" failing -- I think this test is flaky; working on it -- will open a separate PR
  • 1 run with only "Quota Management for APIs.Object restoration implements strict quotas" failing

I'd suggest merging this, then merging Sylvain's PR, then I work on ZENKO-5240

@delthas
Contributor Author

delthas commented Mar 24, 2026

As for documenting @Exclusive I do not have a strong opinion, but I'd argue it's better to document it with a warning than keeping it a secret that developers then have to explore to understand what it does when they encounter it.

@francoisferrand
Contributor

francoisferrand commented Mar 24, 2026

Followup created at ZENKO-5240

Out of 5 runs:

  • 2 without any failure
  • 2 with only "Bucket Websites.Bucket Website CRUD" failed -- I think this test is flaky, working on it -- will open a separate PR
  • 1 with only "Quota Management for APIs.Object restoration implements strict quotas" failed

I'd suggest merging this, then merging Sylvain's PR, then I work on ZENKO-5240

Flakiness is an issue that could not be tackled for years: the latest attempt is the current one, where we try to first run locally and improve troubleshooting capabilities, so we can then eventually fix flakiness.
How do you plan to address the issue, so we can be confident we will fix it and not end up just hiding it? (Not asking you to answer here, just that we need a plan likely to succeed: so please think it out and put it in the ticket.)

Another way of saying it: once we have identified the root cause (and reconciling is not the root cause, it's just a normal event which is a circumstance of the failure), there is no problem tweaking the test to reduce flakiness while we do the actual fix (or don't, if we deem it not an issue in production). But as long as we haven't, changing this with no plan for how to find the issue has a very high risk of just sweeping the problem under the rug...


I also added a snapshot of the flaky tests identified by trunk.io in the ticket, for reference when work continues. Not sure how precise or trustworthy the values are, but a few thoughts anyway:

  • the tests you mention were not in the list: did they become flaky due to some recent changes? Or do they just have very low flakiness rates?
  • the failure rate of individual tests is at 10% or less for most of them, and 26-33% for the top 6 most flaky. So I wonder if we should do a few more runs, just to eliminate statistical anomalies even more...
