
monitor lifecycle conductor#2723

Open
benzekrimaha wants to merge 7 commits into development/9.3 from improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
  • /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  • /bypass_author_approval: Bypass the pull request author's approval
  • /bypass_build_status: Bypass the build and test status
  • /bypass_commit_size: Bypass the check on the size of the changeset (TBA)
  • /bypass_incompatible_branch: Bypass the check on the source branch prefix
  • /bypass_jira_check: Bypass the Jira issue check
  • /bypass_peer_approval: Bypass the pull request peers' approval
  • /bypass_leader_approval: Bypass the pull request leaders' approval
  • /approve: Instruct Bert-E that the author has approved the pull request. ✍️
  • /create_pull_requests: Allow the creation of integration pull requests.
  • /create_integration_branches: Allow the creation of integration branches.
  • /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
  • /unanimity: Change review acceptance criteria from at least one reviewer to all reviewers
  • /wait: Instruct Bert-E not to run until further notice.
Available commands
  • /help: Print Bert-E's manual in the pull request.
  • /status: Print Bert-E's current status in the pull request (TBA)
  • /clear: Remove all comments from Bert-E from the history (TBA)
  • /retry: Re-start a fresh build (TBA)
  • /build: Re-start a fresh build (TBA)
  • /force_reset: Delete integration branches & pull requests, and restart the merge process from the beginning.
  • /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32
@codecov

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 71.57895% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.62%. Comparing base (145f7a6) to head (b73140a).
⚠️ Report is 19 commits behind head on development/9.3.

Files with missing lines Patch % Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 39.39% 20 Missing ⚠️
extensions/lifecycle/LifecycleMetrics.js 90.47% 2 Missing ⚠️
...tensions/lifecycle/conductor/LifecycleConductor.js 96.55% 1 Missing ⚠️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 66.66% 1 Missing ⚠️
extensions/lifecycle/tasks/LifecycleTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 66.66% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (71.57%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


Files with missing lines Coverage Δ
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.88% <ø> (ø)
...tensions/lifecycle/conductor/LifecycleConductor.js 84.13% <96.55%> (+0.42%) ⬆️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.25% <66.66%> (-0.51%) ⬇️
extensions/lifecycle/tasks/LifecycleTask.js 91.41% <66.66%> (-0.14%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 80.51% <66.66%> (-0.57%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 91.17% <66.66%> (-0.75%) ⬇️
extensions/lifecycle/LifecycleMetrics.js 96.10% <90.47%> (-2.12%) ⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 72.91% <39.39%> (-6.96%) ⬇️

... and 3 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 81.18% <ø> (+0.48%) ⬆️
Ingestion 71.14% <ø> (ø)
Lifecycle 78.31% <71.57%> (-0.32%) ⬇️
Oplog Populator 85.83% <ø> (ø)
Replication 59.61% <ø> (-0.04%) ⬇️
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
+ Coverage            74.48%   74.62%   +0.14%     
===================================================
  Files                  200      200              
  Lines                13603    13681      +78     
===================================================
+ Hits                 10132    10210      +78     
  Misses                3461     3461              
  Partials                10       10              
Flag Coverage Δ
api:retry 9.09% <0.00%> (-0.06%) ⬇️
api:routes 8.91% <0.00%> (-0.06%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 10.85% <5.26%> (+0.82%) ⬆️
ingestion 12.48% <0.00%> (-0.08%) ⬇️
lib 7.57% <0.00%> (-0.05%) ⬇️
lifecycle 19.06% <57.89%> (+0.21%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.42% <5.26%> (-0.08%) ⬇️
unit 51.10% <67.36%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.


@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title Improvement/bb 740 monitor lifecycle conductor Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026
@francoisferrand francoisferrand changed the title Improvement/BB-740 monitor lifecycle conductor monitor lifecycle conductor Mar 18, 2026
const log = this.logger.newRequestLogger();
const start = Date.now(); // was: const start = new Date();
this._scanId = uuid();
Contributor


Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43
@scality scality deleted 8 comments from claude bot Mar 19, 2026
@claude

claude bot commented Mar 19, 2026

LGTM

Review by Claude Code

@scality scality deleted 2 comments from claude bot Mar 19, 2026
This commit adds new Prometheus metrics to track lifecycle conductor scan health:
- s3_lifecycle_conductor_full_scan_elapsed_seconds: scan duration
- s3_lifecycle_conductor_scan_count: buckets/workflows per scan
- s3_lifecycle_bucket_processor_scan_start_time: active scan timestamp
- s3_lifecycle_bucket_processor_buckets_count: processed buckets

It also generates a unique conductorScanId (UUID) per conductor scan and
passes it through all listing methods (zookeeper, mongodb, bucketd),
includes the scanId and bucketSource in Kafka messages via contextInfo,
and records scan-level metrics (elapsed time, bucket count, workflow
count, lifecycle bucket count) at the end of each full scan.

In LifecycleBucketProcessor, it detects scan boundaries by comparing the
incoming conductorScanId and emits scan start/end metrics, tracks
per-bucket completion after successful task processing, and resets the
scan gauge when the internal task queue drains.

We also moved the onProcessBuckets() call to scan start and added
onConductorFullScan() at scan end for accurate timing, and switched from
new Date() to Date.now() for consistent elapsed-time calculations.

Issue: BB-740
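The scan-level metric recording described above can be sketched in miniature. Only the metric names come from the commit message; the gauge helper below is an in-memory stand-in for whatever Prometheus client the real code uses.

```javascript
// In-memory stand-in for the Prometheus gauges named in the commit
// message; the real code presumably goes through a Prometheus client.
const gauges = new Map();
function setGauge(name, labels, value) {
    gauges.set(`${name}${JSON.stringify(labels)}`, value);
}

// Record scan-level metrics at the end of a full conductor scan, using
// Date.now() (not `new Date()`) for the elapsed-time arithmetic.
function onConductorFullScan(scanStartTimestamp, counts) {
    const elapsedSeconds = (Date.now() - scanStartTimestamp) / 1000;
    setGauge('s3_lifecycle_conductor_full_scan_elapsed_seconds', {}, elapsedSeconds);
    Object.entries(counts).forEach(([type, value]) => {
        setGauge('s3_lifecycle_conductor_scan_count', { type }, value);
    });
}
```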
Here we forward conductorScanId from the conductor Kafka message contextInfo
through all lifecycle task entries and action queue entries:

- LifecycleTask: includes conductorScanId in contextInfo when creating
  bucket re-queue entries, version listing entries, and all expiration
  and transition action entries
- LifecycleTaskV2: same propagation for v2 listing, re-queue, and
  action entries
- LifecycleDeleteObjectTask, LifecycleUpdateExpirationTask,
  LifecycleUpdateTransitionTask: extract conductorScanId from the
  action entry and add it as a log default field for end-to-end
  traceability

This ensures every log line emitted during lifecycle processing can
be correlated back to the originating conductor scan.

Issue: BB-740
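The propagation pattern this commit describes can be sketched as a small helper. The helper itself is hypothetical; the real tasks presumably inline this logic when building re-queue and action entries.

```javascript
// Hypothetical helper showing the propagation pattern: read
// conductorScanId off the incoming entry and stamp it onto any entry
// derived from it, so every downstream log line can be correlated back
// to the originating conductor scan.
function propagateScanId(sourceEntry, derivedEntry) {
    const conductorScanId =
        sourceEntry.contextInfo && sourceEntry.contextInfo.conductorScanId;
    return {
        ...derivedEntry,
        contextInfo: { ...derivedEntry.contextInfo, conductorScanId },
    };
}
```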
Here we add new Grafana dashboard panels to visualize lifecycle conductor scan
health and bucket processor activity:

- Conductor full scan duration (elapsed seconds)
- Conductor scan counts (total buckets, lifecycle buckets, workflows)
- Bucket processor scan start time with idle detection
- Bucket processor buckets processed per scan

Issue: BB-740
In this commit we add unit tests for the new LifecycleMetrics scan monitoring methods:
- onConductorFullScan: verify elapsed seconds, bucket/workflow/
  lifecycle_bucket counts are set correctly
- onBucketProcessorScanStart/ScanEnd: verify scan timestamp gauge
  is set and reset to 0
- onBucketProcessorBucketDone: verify bucket counter is incremented
- Error handling: verify errors are caught and logged

LifecycleConductor unit tests were also updated to verify that:
- conductorScanId is present in Kafka bucket-task messages
- bucketSource is included in contextInfo
- scanId is a valid UUID generated per scan

BackbeatTestConsumer was updated to normalize conductorScanId and
bucketSource fields for deep equality comparison in functional tests.

Issue: BB-740
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 253df18 to 7a649cc Compare March 26, 2026 17:18
@bert-e
Contributor

bert-e commented Mar 26, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.2

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@claude

claude bot commented Mar 26, 2026

  • Premature scan-end signal in bucket processor: the drain callback on the internal task queue fires every time the queue goes empty, which can happen between Kafka message batches within the same scan. Once fired, the scan start time resets to 0 and _processorScanMetricsActive becomes false, but subsequent messages from the same scan will not re-trigger onBucketProcessorScanStart (same scanId). This makes the Bucket Processor Scan Progress panel unreliable under non-saturating workloads.
    - Consider resetting only on new-scan-start rather than on drain, or adding a timeout/debounce before declaring the scan complete.

    Review by Claude Code

…dle signals

The drain callback on _internalTaskScheduler fires every time the
queue goes from non-empty to empty. Under light load this happens
between Kafka message batches within the same conductor scan,
prematurely resetting the scan start time gauge to 0 and setting
_processorScanMetricsActive to false.

Because subsequent messages carry the same conductorScanId, the
new-scan branch is skipped and onBucketProcessorScanStart is never
called again, leaving the Bucket Processor Scan Progress dashboard
panel hidden for the remainder of the scan.

Fix: replace the immediate reset with a 30-second debounced timeout.
If new messages from the same scan arrive during the debounce window
the timeout is cancelled, keeping the scan active. A new scan also
cancels any pending debounce before performing a full metric reset.

Issue: BB-740
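The debounce this commit describes can be sketched roughly as follows. The 30-second window matches the commit message; the class and callback names are illustrative, not the actual LifecycleBucketProcessor API.

```javascript
// Rough sketch of the debounced scan-end described above.
const SCAN_END_DEBOUNCE_MS = 30 * 1000;

class ScanEndDebouncer {
    constructor(onScanEnd) {
        this._onScanEnd = onScanEnd;
        this._scanEndTimeout = null;
    }

    // Called for every incoming message of the current scan: a message
    // arriving during the debounce window keeps the scan alive.
    onMessage() {
        if (this._scanEndTimeout) {
            clearTimeout(this._scanEndTimeout);
            this._scanEndTimeout = null;
        }
    }

    // Called when the internal task queue drains: instead of declaring
    // the scan over immediately, wait out the debounce window.
    onDrain() {
        this.onMessage(); // cancel any previously pending scan-end
        this._scanEndTimeout = setTimeout(() => {
            this._scanEndTimeout = null;
            this._onScanEnd();
        }, SCAN_END_DEBOUNCE_MS);
    }
}
```

A drain between two Kafka batches of the same scan arms the timeout, and the next batch's first message disarms it, so the scan-active gauge survives non-saturating workloads.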
@claude

claude bot commented Mar 26, 2026

  • Missing timeout cleanup on shutdown: this._scanEndTimeout is not cleared in LifecycleBucketProcessor.close(), so the 30s debounce callback can fire after the processor is partially torn down.
    • Add clearTimeout(this._scanEndTimeout) at the top of close().

Review by Claude Code

@bert-e
Contributor

bert-e commented Mar 26, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.2

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@claude

claude bot commented Mar 26, 2026

  • Scan-end debounce timer not cleared on shutdown: _scanEndTimeout in LifecycleBucketProcessor is never cleared when the process stops. If the bucket processor shuts down while the 30s debounce is pending, the callback fires on a partially torn-down instance. Clear it in the stop/close path.
  • Scan start timestamp inaccuracy: onBucketProcessorScanStart records Date.now() at the time the bucket processor sees the first Kafka message, not when the conductor actually started the scan. The Bucket Processor Scan Progress dashboard panel will undercount scan duration by the Kafka delivery lag. Consider passing the conductor start timestamp via contextInfo.
  • bucketProcessorBucketsCount.reset() clears all label combinations: calling .reset() on a gauge with labels zeros out all label combos, not just the current bucket_source. Fine if each deployment has one source, but worth documenting.
  • lifecycle_bucket metric emits 0 for bucketd: the conductorScanCount metric with type=lifecycle_bucket will always be 0 for bucketd since isLifecycled is never set in listBucketdBuckets. The dashboard handles this correctly by summing bucket processor counts for bucketd, but the raw metric could be misleading without context.

    Review by Claude Code

- Clear _scanEndTimeout in close() to prevent callback firing
  on a partially torn-down bucket processor instance
- Propagate conductor's start timestamp (Date.now() at scan
  start) via contextInfo.scanStartTimestamp instead of using
  Date.now() in the bucket processor, so the scan progress
  dashboard accurately reflects the full scan duration
- Document that bucketProcessorBucketsCount.reset() clears all
  label combinations (acceptable: single bucket source per
  deployment)
- Update tests and BackbeatTestConsumer normalizer for the new
  scanStartTimestamp field
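The shutdown fix in the first bullet can be sketched as below. This is a hypothetical close(); the real method would also shut down Kafka consumers and producers, and the class name is illustrative.

```javascript
// Hypothetical sketch of the close() fix: clear the pending scan-end
// debounce first, so its callback cannot fire on a torn-down instance.
class ProcessorSketch {
    constructor() {
        this._scanEndTimeout = null;
    }

    armScanEnd(cb, delayMs) {
        this._scanEndTimeout = setTimeout(cb, delayMs);
    }

    close(done) {
        if (this._scanEndTimeout) {
            clearTimeout(this._scanEndTimeout);
            this._scanEndTimeout = null;
        }
        // ...the real close() would also tear down Kafka clients here...
        process.nextTick(done);
    }
}
```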
@claude

claude bot commented Mar 26, 2026

LGTM

Review by Claude Code
