fix(deps): bump node-rdkafka to ^3.6.0 to fix cooperative-sticky rebalance bug #2728
delthas wants to merge 1 commit into development/9.3
Conversation
Hello delthas, my role is to assist you with the merge of this pull request.

Waiting for approval: the following approvals are needed before I can proceed with the merge.
Codecov Report
✅ All modified and coverable lines are covered by tests.
```diff
@@              Coverage Diff              @@
##           development/9.3    #2728      +/-   ##
===================================================
+ Coverage            74.51%   74.76%    +0.25%
===================================================
  Files                  200      200
  Lines                13610    13610
===================================================
+ Hits                 10141    10176      +35
+ Misses                3459     3424      -35
  Partials                10       10
```
LGTM — the dependency bump is clean with no code changes required. The node-rdkafka 3.x release has no API breaking changes (only dropped EOL Node.js support), and the librdkafka upgrade path from 2.3.0 to 2.12.0 has no consumer-breaking changes. The root cause analysis in the PR body is thorough.
Upgrades node-rdkafka to ^3.6.0 (resolving to 3.6.1 / librdkafka 2.12.0) to fix a cooperative-sticky partition assignor bug that causes partitions to become orphaned during consumer group rebalances. Issue: BB-760
Force-pushed from f528083 to 1735ada
Excellent investigation and write-up. The dependency bump itself is clean.
Summary
Bumps `node-rdkafka` from `^2.12.0` (librdkafka 2.3.0) to `^3.6.0` (currently resolving to 3.6.1 / librdkafka 2.12.0) to pick up a fix for a cooperative-sticky partition assignor bug that causes partitions to become permanently orphaned during consumer group rebalances.

Investigation
This was discovered while investigating a flaky CI failure in the "Kafka Cleaner — Verify that consumed messages gets deleted by kafkacleaner" test (Zenko CI run). The test passed on attempts 1-2 but failed on attempt 6.
What the kafkacleaner test does
The test lists all Kafka topics, snapshots their low/high watermarks, then periodically checks that the low watermark advances (messages consumed and cleaned by kafka-cleaner) or that the topic is empty. It fails if any topic remains uncleaned after ~60 minutes.
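The check described above can be sketched as a predicate over each partition's watermarks (a hypothetical reconstruction for illustration; the actual test source is not part of this PR):

```javascript
// Sketch of the kafkacleaner verification described above (illustrative,
// not the actual test code). A partition counts as cleaned once its low
// watermark has caught up with the high watermark; the `low + 1 >= high`
// form is also trivially true for an empty partition (low === high === 0).
function partitionCleaned(low, high) {
    return low + 1 >= high;
}

// A topic is considered cleaned when every partition satisfies the
// predicate; the test polls this periodically and fails if any topic
// remains uncleaned after ~60 minutes.
function topicCleaned(watermarks) {
    return watermarks.every(({ low, high }) => partitionCleaned(low, high));
}
```

Under this predicate, a topic with unconsumed messages (low watermark stuck at 0 while the high watermark advances) never passes, which is exactly the failure mode investigated below.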
What failed
The `backbeat-metrics` topic had ~28 messages across partitions 0-4, but they were never consumed. With no committed offsets, kafka-cleaner could not advance the low watermark.

Root cause analysis
We traced the failure through three layers:
Layer 1 — Partitions orphaned for 50 minutes
During operator reconciliation, pods were rapidly replaced (18+ instances of each deployment). Tracing the `rdkafka.assign`/`rdkafka.revoke` events in fluentbit logs revealed that after a round of pod kills at ~07:38, the consumer group `backbeat-metrics-group-crr` ended up with a single surviving member (notification-producer) holding only partition 4 out of 5. Partitions 0-3 were left completely unassigned for 50 minutes: no consumer was reading from them.

Layer 2 — The librdkafka bug (root cause)
`BackbeatConsumer._onRebalance` calls `this._consumer.commit()` in the revoke callback (line 733 of `BackbeatConsumer.js`). With the cooperative-sticky protocol, this commit during rebalance bumps the Kafka generation ID. The next JoinGroup request then fails with "illegal generation", which causes librdkafka to rejoin the group, but in doing so the current assignment is lost. The rebalance never converges, and partitions from dead members are never redistributed to surviving consumers.

This is a known librdkafka bug, present since v1.6.0, fixed in confluentinc/librdkafka#4908 (merged 2025-03-25). The fix was first released in librdkafka 2.10.0.
See also: confluentinc/librdkafka#4059 for the original issue report and detailed explanation.
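A minimal sketch of the problematic pattern, using illustrative error-code constants and a stand-in consumer object (this is a simplification for the reader, not the actual BackbeatConsumer source):

```javascript
// node-rdkafka invokes the rebalance callback with an error whose code
// distinguishes assignment from revocation (values from librdkafka).
const ERR__ASSIGN_PARTITIONS = -175; // RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS
const ERR__REVOKE_PARTITIONS = -174; // RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS

// Hypothetical simplification of the _onRebalance handler described above.
function onRebalance(consumer, err, assignment) {
    if (err.code === ERR__ASSIGN_PARTITIONS) {
        consumer.assign(assignment);
    } else if (err.code === ERR__REVOKE_PARTITIONS) {
        // Problematic step: committing here during a cooperative-sticky
        // rebalance bumps the group generation ID, so the member's next
        // JoinGroup fails with "illegal generation" and its assignment is
        // lost on rejoin (pre-2.10.0 librdkafka behavior).
        consumer.commit();
        consumer.unassign();
    }
}
```

The upgrade fixes this at the library level, so the application-side commit-on-revoke can stay as is.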
Layer 3 — `auto.offset.reset=latest` skips orphaned messages

When new consumers finally got assigned partitions 0-3 (after 08:31, when new pods joined), there were no committed offsets for those partitions.
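Where a consumer starts reading when it has no committed offset is governed by `auto.offset.reset`; a simplified sketch of that behavior (illustrative function, not librdkafka code):

```javascript
// Illustrative model of librdkafka's starting-position logic for one
// partition. A committed offset always wins; otherwise auto.offset.reset
// decides: 'earliest' replays from the low watermark, while the default
// 'largest'/'latest' starts at the high watermark, skipping everything
// already in the partition.
function startingOffset(committed, low, high, autoOffsetReset) {
    if (committed !== null) {
        return committed;
    }
    return autoOffsetReset === 'earliest' ? low : high;
}
```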
`MetricsConsumer` does not set `fromOffset` when constructing `BackbeatConsumer`, so librdkafka defaults to `auto.offset.reset=largest` (latest). The new consumers started reading from the end of each partition, silently skipping all 23 messages that had been produced during the orphan window. The consumers then polled happily for 4+ hours, seeing nothing.

Why the test passed on attempts 1-2
On earlier attempts, the `backbeat-metrics` topic was empty (0 messages produced), so the kafkacleaner test's `low + 1 >= high` condition (`0 + 1 >= 0`) was trivially true.

Verification on a live cluster
We verified on a live ARTESCA cluster (`artesca-1`) that MetricsConsumer consumption works correctly under normal conditions (no pod churn): consumer groups were at lag 0 with all partitions assigned and messages consumed immediately. The bug only manifests during rapid pod replacement with multiple rebalances.

Solution
Bump `node-rdkafka` to `^3.6.0`, currently resolving to 3.6.1 (librdkafka 2.12.0). This includes the fix from librdkafka 2.10.0 that prevents the commit-during-rebalance from derailing the cooperative-sticky protocol, ensuring partitions are correctly redistributed after member deaths.

Why ^3.6.0 specifically
We set the floor to `^3.6.0` (librdkafka 2.12.0) since nothing in the 2.12.0 changes affects our use case: the KIP-848 consumer group protocol is opt-in (not enabled unless `group.protocol=consumer` is explicitly set) and the metadata recovery behavior change is minor.

Alternatives considered
- Add `backbeat-metrics` to the kafkacleaner test's `ignoredTopics`: would hide the symptom but not fix the underlying bug. The metrics topic isn't special: any topic consumed via BackbeatConsumer with cooperative-sticky assignment is affected.
- Set `fromOffset: 'earliest'` in MetricsConsumer: would fix the "skipped messages" symptom (layer 3) but not the partition orphaning (layer 2). Worth doing separately as defense-in-depth but not sufficient alone.
- Remove `commit()` from the revoke callback: would fix the root cause in BackbeatConsumer but is a riskier code change that could affect offset commit guarantees across all consumers. The librdkafka upgrade fixes the same issue at the library level without changing application semantics.

Upgrade safety
librdkafka 2.3.0 → 2.12.0: The librdkafka CHANGELOG shows no consumer-breaking changes in this range. The only notable breaking change is in v2.4.0, where `INVALID_RECORD` producer errors became non-retriable; this does not affect consumers. KIP-848 (new consumer group protocol) in 2.12.0 is opt-in (`group.protocol=consumer` must be explicitly set) and does not affect the default `classic` protocol. The metadata recovery behavior change in 2.12.0 (brokers not in metadata responses are removed and clients re-bootstrap) is a minor behavioral difference that should not impact normal operation.

node-rdkafka 2.x → 3.x: The 3.0.0 release only dropped support for EOL Node.js versions; no API changes. Backbeat requires Node >= 20 and runs on Node 22.14.0.
Issue: BB-760