
monitor lifecycle conductor#2723

Open
benzekrimaha wants to merge 7 commits into development/9.3 from improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
  • /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  • /bypass_author_approval: Bypass the pull request author's approval
  • /bypass_build_status: Bypass the build and test status
  • /bypass_commit_size: Bypass the check on the size of the changeset (TBA)
  • /bypass_incompatible_branch: Bypass the check on the source branch prefix
  • /bypass_jira_check: Bypass the Jira issue check
  • /bypass_peer_approval: Bypass the pull request peers' approval
  • /bypass_leader_approval: Bypass the pull request leaders' approval
  • /approve: Instruct Bert-E that the author has approved the pull request. ✍️
  • /create_pull_requests: Allow the creation of integration pull requests.
  • /create_integration_branches: Allow the creation of integration branches.
  • /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
  • /unanimity: Change review acceptance criteria from at least one reviewer to all reviewers
  • /wait: Instruct Bert-E not to run until further notice.
Available commands
  • /help: Print Bert-E's manual in the pull request.
  • /status: Print Bert-E's current status in the pull request (TBA)
  • /clear: Remove all comments from Bert-E from the history (TBA)
  • /retry: Re-start a fresh build (TBA)
  • /build: Re-start a fresh build (TBA)
  • /force_reset: Delete integration branches & pull requests, and restart the merge process from the beginning.
  • /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32
@codecov

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 71.57895% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.62%. Comparing base (145f7a6) to head (b73140a).
⚠️ Report is 19 commits behind head on development/9.3.

Files with missing lines Patch % Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 39.39% 20 Missing ⚠️
extensions/lifecycle/LifecycleMetrics.js 90.47% 2 Missing ⚠️
...tensions/lifecycle/conductor/LifecycleConductor.js 96.55% 1 Missing ⚠️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 66.66% 1 Missing ⚠️
extensions/lifecycle/tasks/LifecycleTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 66.66% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (71.57%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


Files with missing lines Coverage Δ
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.88% <ø> (ø)
...tensions/lifecycle/conductor/LifecycleConductor.js 84.13% <96.55%> (+0.42%) ⬆️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.25% <66.66%> (-0.51%) ⬇️
extensions/lifecycle/tasks/LifecycleTask.js 91.41% <66.66%> (-0.14%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 80.51% <66.66%> (-0.57%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 91.17% <66.66%> (-0.75%) ⬇️
extensions/lifecycle/LifecycleMetrics.js 96.10% <90.47%> (-2.12%) ⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 72.91% <39.39%> (-6.96%) ⬇️

... and 3 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 81.18% <ø> (+0.48%) ⬆️
Ingestion 71.14% <ø> (ø)
Lifecycle 78.31% <71.57%> (-0.32%) ⬇️
Oplog Populator 85.83% <ø> (ø)
Replication 59.61% <ø> (-0.04%) ⬇️
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
+ Coverage            74.48%   74.62%   +0.14%     
===================================================
  Files                  200      200              
  Lines                13603    13681      +78     
===================================================
+ Hits                 10132    10210      +78     
  Misses                3461     3461              
  Partials                10       10              
Flag Coverage Δ
api:retry 9.09% <0.00%> (-0.06%) ⬇️
api:routes 8.91% <0.00%> (-0.06%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 10.85% <5.26%> (+0.82%) ⬆️
ingestion 12.48% <0.00%> (-0.08%) ⬇️
lib 7.57% <0.00%> (-0.05%) ⬇️
lifecycle 19.06% <57.89%> (+0.21%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.42% <5.26%> (-0.08%) ⬇️
unit 51.10% <67.36%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.


@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title Improvement/bb 740 monitor lifecycle conductor Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026
@francoisferrand francoisferrand changed the title Improvement/BB-740 monitor lifecycle conductor monitor lifecycle conductor Mar 18, 2026
const log = this.logger.newRequestLogger();
const start = Date.now(); // was: const start = new Date();
this._scanId = uuid();
Contributor


Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43
@scality scality deleted 8 comments from claude bot Mar 19, 2026
@claude

claude bot commented Mar 19, 2026

LGTM

Review by Claude Code

@scality scality deleted 2 comments from claude bot Mar 19, 2026
This commit adds new Prometheus metrics to track lifecycle conductor scan health:
- s3_lifecycle_conductor_full_scan_elapsed_seconds: scan duration
- s3_lifecycle_conductor_scan_count: buckets/workflows per scan
- s3_lifecycle_bucket_processor_scan_start_time: active scan timestamp
- s3_lifecycle_bucket_processor_buckets_count: processed buckets

It also generates a unique conductorScanId (UUID) per conductor scan and
passes it through all listing methods (zookeeper, mongodb, bucketd),
includes the scanId and bucketSource in Kafka messages via contextInfo,
and records scan-level metrics (elapsed time, bucket count, workflow
count, lifecycle bucket count) at the end of each full scan.

In LifecycleBucketProcessor, it detects scan boundaries by comparing the
incoming conductorScanId and emits scan start/end metrics, tracks
per-bucket completion after successful task processing, and resets the
scan gauge when the internal task queue drains.

We also moved the onProcessBuckets() call to scan start and added
onConductorFullScan() at scan end for accurate timing, and switched from
new Date() to Date.now() for consistent elapsed-time calculations.

Issue: BB-740
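The scan-level metric recording described above can be sketched in miniature. Only the metric names come from the commit message; the gauge helper below is an in-memory stand-in for whatever Prometheus client the real code uses.

```javascript
// In-memory stand-in for the Prometheus gauges named in the commit
// message; the real code presumably goes through a Prometheus client.
const gauges = new Map();
function setGauge(name, labels, value) {
    gauges.set(`${name}${JSON.stringify(labels)}`, value);
}

// Record scan-level metrics at the end of a full conductor scan, using
// Date.now() (not `new Date()`) for the elapsed-time arithmetic.
function onConductorFullScan(scanStartTimestamp, counts) {
    const elapsedSeconds = (Date.now() - scanStartTimestamp) / 1000;
    setGauge('s3_lifecycle_conductor_full_scan_elapsed_seconds', {}, elapsedSeconds);
    Object.entries(counts).forEach(([type, value]) => {
        setGauge('s3_lifecycle_conductor_scan_count', { type }, value);
    });
}
```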
Here we forward conductorScanId from the conductor Kafka message contextInfo
through all lifecycle task entries and action queue entries:

- LifecycleTask: includes conductorScanId in contextInfo when creating
  bucket re-queue entries, version listing entries, and all expiration
  and transition action entries
- LifecycleTaskV2: same propagation for v2 listing, re-queue, and
  action entries
- LifecycleDeleteObjectTask, LifecycleUpdateExpirationTask,
  LifecycleUpdateTransitionTask: extract conductorScanId from the
  action entry and add it as a log default field for end-to-end
  traceability

This ensures every log line emitted during lifecycle processing can
be correlated back to the originating conductor scan.

Issue: BB-740
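The propagation pattern this commit describes can be sketched as a small helper. The helper itself is hypothetical; the real tasks presumably inline this logic when building re-queue and action entries.

```javascript
// Hypothetical helper showing the propagation pattern: read
// conductorScanId off the incoming entry and stamp it onto any entry
// derived from it, so every downstream log line can be correlated back
// to the originating conductor scan.
function propagateScanId(sourceEntry, derivedEntry) {
    const conductorScanId =
        sourceEntry.contextInfo && sourceEntry.contextInfo.conductorScanId;
    return {
        ...derivedEntry,
        contextInfo: { ...derivedEntry.contextInfo, conductorScanId },
    };
}
```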
Here we add new Grafana dashboard panels to visualize lifecycle conductor scan
health and bucket processor activity:

- Conductor full scan duration (elapsed seconds)
- Conductor scan counts (total buckets, lifecycle buckets, workflows)
- Bucket processor scan start time with idle detection
- Bucket processor buckets processed per scan

Issue: BB-740
In this commit we add unit tests for the new LifecycleMetrics scan monitoring methods:
- onConductorFullScan: verify elapsed seconds, bucket/workflow/
  lifecycle_bucket counts are set correctly
- onBucketProcessorScanStart/ScanEnd: verify scan timestamp gauge
  is set and reset to 0
- onBucketProcessorBucketDone: verify bucket counter is incremented
- Error handling: verify errors are caught and logged

LifecycleConductor unit tests were also updated to verify that:
- conductorScanId is present in Kafka bucket-task messages
- bucketSource is included in contextInfo
- scanId is a valid UUID generated per scan

BackbeatTestConsumer was updated to normalize conductorScanId and
bucketSource fields for deep equality comparison in functional tests.

Issue: BB-740
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 253df18 to 7a649cc Compare March 26, 2026 17:18
@bert-e
Contributor

bert-e commented Mar 26, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.2

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@claude

claude bot commented Mar 26, 2026

  • Premature scan-end signal in bucket processor: the drain callback on the internal task queue fires every time the queue goes empty, which can happen between Kafka message batches within the same scan. Once fired, the scan start time resets to 0 and _processorScanMetricsActive becomes false, but subsequent messages from the same scan will not re-trigger onBucketProcessorScanStart (same scanId). This makes the Bucket Processor Scan Progress panel unreliable under non-saturating workloads.
    - Consider resetting only on new-scan-start rather than on drain, or adding a timeout/debounce before declaring the scan complete.

    Review by Claude Code

…dle signals

The drain callback on _internalTaskScheduler fires every time the
queue goes from non-empty to empty. Under light load this happens
between Kafka message batches within the same conductor scan,
prematurely resetting the scan start time gauge to 0 and setting
_processorScanMetricsActive to false.

Because subsequent messages carry the same conductorScanId, the
new-scan branch is skipped and onBucketProcessorScanStart is never
called again, leaving the Bucket Processor Scan Progress dashboard
panel hidden for the remainder of the scan.

Fix: replace the immediate reset with a 30-second debounced timeout.
If new messages from the same scan arrive during the debounce window
the timeout is cancelled, keeping the scan active. A new scan also
cancels any pending debounce before performing a full metric reset.

Issue: BB-740
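The debounce this commit describes can be sketched roughly as follows. The 30-second window matches the commit message; the class and callback names are illustrative, not the actual LifecycleBucketProcessor API.

```javascript
// Rough sketch of the debounced scan-end described above.
const SCAN_END_DEBOUNCE_MS = 30 * 1000;

class ScanEndDebouncer {
    constructor(onScanEnd) {
        this._onScanEnd = onScanEnd;
        this._scanEndTimeout = null;
    }

    // Called for every incoming message of the current scan: a message
    // arriving during the debounce window keeps the scan alive.
    onMessage() {
        if (this._scanEndTimeout) {
            clearTimeout(this._scanEndTimeout);
            this._scanEndTimeout = null;
        }
    }

    // Called when the internal task queue drains: instead of declaring
    // the scan over immediately, wait out the debounce window.
    onDrain() {
        this.onMessage(); // cancel any previously pending scan-end
        this._scanEndTimeout = setTimeout(() => {
            this._scanEndTimeout = null;
            this._onScanEnd();
        }, SCAN_END_DEBOUNCE_MS);
    }
}
```

A drain between two Kafka batches of the same scan arms the timeout, and the next batch's first message disarms it, so the scan-active gauge survives non-saturating workloads.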
@claude

claude bot commented Mar 26, 2026

  • Missing timeout cleanup on shutdown: this._scanEndTimeout is not cleared in LifecycleBucketProcessor.close(), so the 30s debounce callback can fire after the processor is partially torn down.
    • Add clearTimeout(this._scanEndTimeout) at the top of close().

Review by Claude Code

@bert-e
Contributor

bert-e commented Mar 26, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.2

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@claude

claude bot commented Mar 26, 2026

  • Scan-end debounce timer not cleared on shutdown: _scanEndTimeout in LifecycleBucketProcessor is never cleared when the process stops. If the bucket processor shuts down while the 30s debounce is pending, the callback fires on a partially torn-down instance. Clear it in the stop/close path.
  • Scan start timestamp inaccuracy: onBucketProcessorScanStart records Date.now() at the time the bucket processor sees the first Kafka message, not when the conductor actually started the scan. The Bucket Processor Scan Progress dashboard panel will undercount scan duration by the Kafka delivery lag. Consider passing the conductor start timestamp via contextInfo.
  • bucketProcessorBucketsCount.reset() clears all label combinations: calling .reset() on a gauge with labels zeros out all label combos, not just the current bucket_source. Fine if each deployment has one source, but worth documenting.
  • lifecycle_bucket metric emits 0 for bucketd: the conductorScanCount metric with type=lifecycle_bucket will always be 0 for bucketd since isLifecycled is never set in listBucketdBuckets. The dashboard handles this correctly by summing bucket processor counts for bucketd, but the raw metric could be misleading without context.

    Review by Claude Code

- Clear _scanEndTimeout in close() to prevent callback firing
  on a partially torn-down bucket processor instance
- Propagate conductor's start timestamp (Date.now() at scan
  start) via contextInfo.scanStartTimestamp instead of using
  Date.now() in the bucket processor, so the scan progress
  dashboard accurately reflects the full scan duration
- Document that bucketProcessorBucketsCount.reset() clears all
  label combinations (acceptable: single bucket source per
  deployment)
- Update tests and BackbeatTestConsumer normalizer for the new
  scanStartTimestamp field
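The shutdown fix in the first bullet can be sketched as below. This is a hypothetical close(); the real method would also shut down Kafka consumers and producers, and the class name is illustrative.

```javascript
// Hypothetical sketch of the close() fix: clear the pending scan-end
// debounce first, so its callback cannot fire on a torn-down instance.
class ProcessorSketch {
    constructor() {
        this._scanEndTimeout = null;
    }

    armScanEnd(cb, delayMs) {
        this._scanEndTimeout = setTimeout(cb, delayMs);
    }

    close(done) {
        if (this._scanEndTimeout) {
            clearTimeout(this._scanEndTimeout);
            this._scanEndTimeout = null;
        }
        // ...the real close() would also tear down Kafka clients here...
        process.nextTick(done);
    }
}
```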
@claude

claude bot commented Mar 26, 2026

LGTM

Review by Claude Code
