fix: send FinishedHeight using safe height when finalized unavailable#72

Closed
init4samwise wants to merge 1 commit into main from samwise/eng-1830-wal-finishedheight-fix

Conversation

@init4samwise

Summary

When the host chain doesn't report finalized blocks (beacon client not synced, checkpoint sync incomplete, or testnet without proper finality reporting), the WAL grows unbounded because FinishedHeight is never sent to reth.

This was observed on Pecorino with the warning:

```
WARN WAL contains too many blocks and is not getting cleared. That will lead to increased disk space usage. Check that you emit the FinishedHeight event from your ExExes. blocks=177 threshold=128
```

Root Cause

In `update_canon_heights()`, `FinishedHeight` is only sent when `finalized_ru_block_hash != genesis_ru_hash`. When the host doesn't report finalized blocks, `load_finalized_block_heights()` returns `PairedHeights { host: 0, rollup: 0 }`, and `FinishedHeight` is never sent.
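A minimal sketch of the failure mode. `PairedHeights` is reduced to its two fields, and the hash comparison is modeled as a height check (the real code compares `finalized_ru_block_hash` against `genesis_ru_hash`; everything else here is illustrative):

```rust
// Simplified stand-in for the real type; field names match the PR.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PairedHeights {
    host: u64,
    rollup: u64,
}

// Models the guard in update_canon_heights(): FinishedHeight is only
// sent when the finalized rollup block is not the genesis block.
// Height 0 stands in for "hash equals the genesis rollup hash".
fn should_send_finished_height(finalized: PairedHeights) -> bool {
    finalized.rollup != 0
}

fn main() {
    // Host without finality reporting: heights load as { host: 0, rollup: 0 },
    // so FinishedHeight is never sent and the WAL grows unbounded.
    assert!(!should_send_finished_height(PairedHeights { host: 0, rollup: 0 }));
    // Normal operation: finalized heights advance and the event is sent.
    assert!(should_send_finished_height(PairedHeights { host: 120, rollup: 118 }));
}
```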

Fix

Add a fallback to use the safe block height for `FinishedHeight` when finalized blocks aren't available. Safe blocks are unlikely to reorg, making them a reasonable threshold for clearing the WAL.

If there's a reorg beyond the safe block, reth will replay from the WAL as expected.
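The selection logic described above can be sketched as follows. `finished_height` is a hypothetical helper (not the PR's actual function), `PairedHeights` is simplified, and height 0 again stands in for "no finalized/safe block reported":

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct PairedHeights {
    host: u64,
    rollup: u64,
}

// Hypothetical helper: choose which host height to report as FinishedHeight.
// Prefer the finalized height; fall back to the safe height when the host
// reports no finality; send nothing if neither is available.
fn finished_height(finalized: PairedHeights, safe: PairedHeights) -> Option<u64> {
    if finalized.rollup != 0 {
        Some(finalized.host)
    } else if safe.rollup != 0 {
        // Safe blocks are unlikely to reorg, so this is a reasonable
        // threshold for clearing the WAL.
        Some(safe.host)
    } else {
        None
    }
}

fn main() {
    // Finalized unavailable, safe available: fall back to the safe height.
    assert_eq!(
        finished_height(
            PairedHeights { host: 0, rollup: 0 },
            PairedHeights { host: 50, rollup: 48 },
        ),
        Some(50)
    );
    // Neither available: do not send FinishedHeight at all.
    assert_eq!(
        finished_height(
            PairedHeights { host: 0, rollup: 0 },
            PairedHeights { host: 0, rollup: 0 },
        ),
        None
    );
}
```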

Closes ENG-1830

@init4samwise init4samwise requested a review from a team as a code owner February 7, 2026 01:28
```diff
-let PairedHeights { host: _, rollup: safe_ru_height } =
-    self.load_safe_block_heights(ru_height)?;
+let safe_heights = self.load_safe_block_heights(ru_height)?;
+let safe_ru_height = safe_heights.rollup;
```
@prestwich (Member) Feb 7, 2026

this is just a style change

@init4samwise (Author)

Yes, this is just refactoring to capture the full PairedHeights struct so we can use safe_heights.host in the fallback below. Happy to revert to the destructuring style if preferred — I can rename it to make the intent clearer:

```rust
let safe_heights @ PairedHeights { rollup: safe_ru_height, .. } =
    self.load_safe_block_heights(ru_height)?;
```

Or keep the current approach. Let me know your preference.

```rust
// and rollup finalized heights are both 0. Otherwise, we load the finalized
// RU header and set the finalized block hash.
let finalized_ru_block_hash =
    self.set_finalized_ru_block_hash(finalized_heights, genesis_ru_hash)?;
```
@prestwich (Member)

update the rustdoc for `set_finalized_ru_block_hash` to be more informative.

@init4samwise (Author)

Good point. I'll update this in the next commit. The current doc doesn't capture the full picture — it should also document:

  • When it returns genesis hash (both heights = 0)
  • When it loads and sets the actual finalized header
  • The relationship with the new safe-height fallback

Will push an update.

```rust
// Fallback: use safe height when finalized isn't available.
// Safe blocks are unlikely to reorg, so this is a reasonable
// threshold for clearing the WAL.
self.update_highest_processed_height(safe_heights.host)?;
```
@prestwich (Member)

in what cases will this fail? will failures cause difficult-to-recover issues?

@init4samwise (Author)

Good question. Looking at `update_highest_processed_height`:

Failure cases:

  1. `sealed_header(safe_heights.host)?` — fails if the DB read errors. Returns `Err`, propagates up.
  2. `.expect("db inconsistent...")` — panics if the header doesn't exist. This shouldn't happen since `load_safe_block_heights` already validated that these heights exist, but it's a hard crash if the DB is corrupted.
  3. `update_exex_head(...)` / `self.host.events.send(...)` — fails if the reth node has shut down (channel closed).

Recovery:

  • DB errors (1): Node crashes, restarts, resumes from last consistent state. No data loss.
  • DB inconsistency (2): Indicates deeper corruption — needs investigation. Same as existing behavior.
  • Channel closed (3): Node is shutting down anyway, graceful exit.

Is this safe for the fallback path?
The same failure modes exist on the finalized path. The only difference is when we call it — using the safe vs. the finalized height. If `safe_heights` is valid (which `load_safe_block_heights` ensures), the call should succeed.

However, I see now from the Slack thread that this might be unrelated to the mainnet issue you're seeing. Should we hold off on this PR until the root cause is clearer?

@prestwich (Member)

i mean in what cases will using safe height to clear the WAL result in issues

i.e. what is the behavior when Ethereum has a deep reorg? How can a signet node recover from that?

what other edge cases have you considered?

@init4samwise (Author)

You're right to push on this — I hadn't fully thought through the reorg implications.

Deep reorg scenario:
If we clear WAL up to safe height X, and then Ethereum reorgs past X:

  1. reth sends revert notifications for the reorged blocks
  2. `on_host_revert` tries to unwind the RU chain
  3. But reth may have already pruned the WAL entries we cleared — we can't replay from the ExEx side

Why this is problematic:
Safe blocks can reorg (unlike finalized). It requires >1/3 malicious validators, but it's not cryptographically impossible. If it happens, we'd be in an inconsistent state.

Edge cases I missed:

  • Beacon client temporarily reports stale safe/finalized (network issues)
  • Checkpoint sync completing mid-operation
  • Safe head progressing faster than we process during backfill

Conclusion:
You're right — using safe height for WAL clearing is not safe. The original code is correct: only clear WAL for truly finalized blocks.

The WAL warning on testnets without finality is annoying but not critical — it's a space issue, not a correctness issue. Better to live with disk growth than risk unrecoverable state.

Should I close this PR?

@prestwich (Member)

seems to be a red herring. if we see this behavior in testnets we can revisit

@prestwich prestwich closed this Feb 7, 2026