persist: replace file-per-entry with WAL and refactor into generic indexedWAL #3044
wen-coding merged 1 commit into main
Conversation
Buf / buf (pull_request): latest Buf results on this PR.
Codecov Report

```
@@ Coverage Diff @@
##             main    #3044    +/-   ##
========================================
  Coverage   58.60%   58.61%
========================================
  Files        2096     2097      +1
  Lines      173400   173502    +102
========================================
+ Hits       101621   101696     +75
- Misses      62739    62761     +22
- Partials     9040     9045      +5
```
Old file-per-entry deletion:

```go
lane, fileN, err := parseBlockFilename(entry.Name())
if err != nil {
	continue
}
first, ok := laneFirsts[lane]
if ok && fileN >= first {
	continue
}
path := filepath.Join(bp.dir, entry.Name())
if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
	logger.Warn("failed to delete block file", "path", path, "err", err)
}
```

New WAL truncation:

```go
for lane, first := range laneFirsts {
	lw, ok := bp.lanes[lane]
	if !ok {
		continue // no WAL yet; PersistBlock will create one lazily
	}
	firstBN, ok := lw.firstBlockNum().Get()
	if !ok || first <= firstBN {
		continue
	}
	walIdx := lw.firstIdx + uint64(first-firstBN)
	if err := lw.TruncateBefore(walIdx); err != nil {
		return fmt.Errorf("truncate lane %s WAL before block %d: %w", lane, first, err)
	}
}
```
Check warning (Code scanning / CodeQL): Iteration over map
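CodeQL flags `range` over a map because Go deliberately randomizes map iteration order. If the truncation order ever needs to be reproducible (for deterministic logs or tests), one common remedy is to iterate over sorted keys. A minimal sketch, using plain strings as stand-ins for the PR's `types.LaneID` and block numbers:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedLanes returns the keys of laneFirsts in a fixed order, so a
// caller can range over lanes deterministically despite Go's map
// iteration randomization. Hypothetical helper, not from the PR.
func sortedLanes(laneFirsts map[string]uint64) []string {
	lanes := make([]string, 0, len(laneFirsts))
	for lane := range laneFirsts {
		lanes = append(lanes, lane)
	}
	sort.Strings(lanes) // fixed order regardless of map hashing
	return lanes
}

func main() {
	laneFirsts := map[string]uint64{"0xbeef": 7, "0x01": 3, "0xaa": 5}
	for _, lane := range sortedLanes(laneFirsts) {
		fmt.Println(lane, laneFirsts[lane])
	}
}
```

If only correctness (not ordering) matters, the warning can also simply be acknowledged, since each lane's truncation is independent.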
```go
dbwal.Config{
	WriteBufferSize: 0, // synchronous writes
	WriteBatchSize:  1, // no batching
	FsyncEnabled:    true,
}
```
Be aware of the latency impact here: if this write sits in the critical path, it will introduce noticeable latency. I'd recommend synchronous write + no fsync for perf reasons. Fsync does provide stronger guarantees, but the chance of all validators hitting a power-off at the same time is pretty rare.
That's reasonable, changed
@yzang2019 Talked with @pompon0 about this: since lane block QC is only f+1, theoretically one OS crash can screw us. We are changing all lane and commitqc writes to happen concurrently, so latency is less of a concern. Is it okay if I change fsync back to true here?
Sure, if latency is less of a concern, we can start with fsync=true. But I'd recommend adding a metric/log around this to measure how long the flush takes, in case the latency becomes a problem in the future.
Sounds good, this PR is already too big, I'll add metrics in a future PR.
```go
if err := w.wal.Write(entry); err != nil {
	return err
}
if w.firstIdx == 0 {
```
Recommend using Count() == 0 instead of relying on firstIdx == 0 as a sentinel for "WAL is empty", in case the assumption that the WAL index starts from 0 stops holding if we switch WAL libraries in the future.
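The suggestion above can be illustrated with a toy stand-in for the PR's `indexedWAL[T]`: deriving emptiness from a count survives a switch to a backing WAL whose indices do not start at 0.

```go
package main

import "fmt"

// toyWAL is a simplified stand-in for indexedWAL's index bookkeeping.
type toyWAL struct {
	firstIdx uint64 // index of the oldest retained entry
	nextIdx  uint64 // index the next Write will receive
}

// Count works whatever base index the backing WAL library uses.
func (w *toyWAL) Count() uint64 { return w.nextIdx - w.firstIdx }

// Empty avoids the firstIdx == 0 sentinel entirely.
func (w *toyWAL) Empty() bool { return w.Count() == 0 }

func main() {
	// A WAL whose backing library starts indices at 1, not 0:
	w := &toyWAL{firstIdx: 1, nextIdx: 1}
	fmt.Println(w.Empty()) // true, even though firstIdx != 0
}
```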
Force-pushed c0727e6 to 422aa8f
```diff
-	if ok && fileN >= first {
+	if first >= lw.nextBlockNum {
+		// Anchor advanced past all persisted blocks for this lane.
+		if err := lw.TruncateAll(); err != nil {
```
Is truncation synchronous? How expensive is it?
TruncateAll is synchronous; it removes the segment files and then changes some internal pointers, so it's not too expensive. I don't expect this to happen very often, though. In practice, if every validator keeps emitting blocks, there should always be 1 or 2 lane blocks that are generated but not yet in an AppQC. Unless they do "generate a block, then wait for a while; generate another block, then wait for a while", are you imagining that as an attack?
I'm interested in the happy path. It would be nice to make sure that removing blocks does not affect the latency of persisting blocks. What I want to avoid is doing synchronously: remove block -> fsync -> insert block -> fsync (if we have a steady stream of one insertion/removal of a block per batch), in which case the latency of insertion is 2x larger than it should be (2 fsyncs).
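The 2x-fsync concern above can be sketched with a toy model, under assumed names: if the prune and the append are coalesced under one lock and followed by a single flush, a steady one-in/one-out stream pays one fsync per batch instead of two. `flush` here is a stand-in counter for fsync, not the real WAL API.

```go
package main

import "fmt"

// batchedWAL is a toy log that counts flush (fsync-equivalent) calls.
type batchedWAL struct {
	entries []string
	flushes int
}

func (w *batchedWAL) flush() { w.flushes++ } // stand-in for fsync

// pruneAndAppend drops the first keepFrom entries and appends the new
// ones, then flushes once so both mutations share a single fsync.
func (w *batchedWAL) pruneAndAppend(keepFrom int, newEntries []string) {
	if keepFrom > len(w.entries) {
		keepFrom = len(w.entries)
	}
	w.entries = append(w.entries[keepFrom:], newEntries...)
	w.flush() // one fsync covers both the prune and the append
}

func main() {
	w := &batchedWAL{entries: []string{"a", "b", "c"}}
	w.pruneAndAppend(1, []string{"d"})
	fmt.Println(w.entries, w.flushes) // [b c d] 1
}
```

This mirrors the PR's eventual shape, where the truncate-then-append sequence runs under one lock per lane.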
Force-pushed aeeddbf to 8de2a83
Force-pushed b6be8e4 to 273ab80
sei-tendermint/internal/autobahn/consensus/persist/commitqcs.go (outdated review thread, resolved)
Force-pushed a870b63 to 8921433
```go
func (s *laneWALState) truncateForAnchor(lane types.LaneID, qc *types.CommitQC) error {
	first := qc.LaneRange(lane).First()
	firstBN, ok := s.firstBlockNum().Get()
	if !ok || first <= firstBN {
```
If firstBlockNum is None, this function does not set nextBlockNum correctly when an anchor is present.
Good catch, fixed.
…dexedWAL

Replaces the file-per-entry persistence for blocks and CommitQCs with WAL-based persistence (sei-db/wal), wrapped in a generic indexedWAL[T] that tracks monotonic indices and typed entries.

Key changes:
- indexedWAL[T]: generic WAL wrapper with Write, TruncateBefore, TruncateAll, ReadAll, and index tracking.
- BlockPersister: three-layer architecture (laneWALState -> laneWAL -> BlockPersister) with per-lane WALs for independent truncation. MaybePruneAndPersistLane holds the per-lane lock for the entire truncate-then-append sequence.
- CommitQCPersister: two-layer architecture (commitQCState -> CommitQCPersister). MaybePruneAndPersist holds the lock for the entire truncate-then-append sequence.
- afterEach callbacks use utils.Option[func(T)] and are documented as being invoked under lock.
- truncateForAnchor takes BlockNumber directly instead of CommitQC, allowing testDeleteBefore to call it without duplicating logic.
- Fix: truncateForAnchor on an empty WAL now advances nextBlockNum when the anchor is ahead.

Made-with: Cursor
Force-pushed 157112c to 33720ff
Summary

Replace file-per-block and file-per-commitQC persistence with sei-db/wal, extract common WAL mechanics into a generic `indexedWAL[T]`, and expose truncate-then-append APIs so the caller (avail/state) controls all parallelism.

New:

**wal.go: generic `indexedWAL[T]`**

Generic wrapper around sei-db/wal with monotonic index tracking, typed serialization via `codec[T]`, and lifecycle methods: `Write`, `ReadAll`, `TruncateBefore` (with verify callback and bounds validation), `TruncateAll`, `FirstIdx`, `Count`, `Close`. Enforces `INVARIANT: firstIdx <= nextIdx`. Opens with `AllowEmpty: true` (critical for correct empty-log arithmetic). `ReadAll` includes a post-replay count check to detect silent data loss. Fsync is enabled for additional durability beyond the prune anchor.

**blocks.go: per-lane WAL persistence**

Three-layer structure: `laneWALState` (lock-free helpers) -> `laneWAL` (per-lane mutex) -> `BlockPersister` (lane map + RWMutex).

- `blocks/<hex_lane_id>/` subdirectories with independent per-lane truncation.
- `laneWALState` provides lock-free methods (`persistBlock`, `truncateForAnchor`, `loadAll`) that assume the caller holds the per-lane lock. `truncateForAnchor` derives the prune cursor from the CommitQC, verifying block numbers via a defense-in-depth callback. It handles `TruncateAll` when the anchor advanced past all persisted blocks and re-anchors `nextBlockNum`. `loadAll` detects gaps at replay time.
- `laneWAL` wraps a `laneWALState` with a `utils.Mutex`, providing `maybePruneAndPersist` and `close`. The lock is held for the entire truncate-then-append sequence.
- `MaybePruneAndPersistLane` (public): truncate-then-append API for a single lane. Creates lane WALs lazily via `getOrCreateLane` (double-checked locking under `utils.RWMutex`), then delegates to `laneWAL.maybePruneAndPersist`. Does not spawn goroutines; the caller schedules parallelism. The no-op persister (dir=None) skips disk I/O but still invokes `afterEach`.
- `close()` unexported; only used by tests and constructor error cleanup. Uses `errors.Join`.
- Removed: `PersistBlock`, `DeleteBefore`, `DeleteBeforeLane`, `LaneIDsWithWAL`.

**commitqcs.go: single CommitQC WAL**

Two-layer structure: `commitQCState` (lock-free helpers) -> `CommitQCPersister` (single mutex).

- Stored in `commitqcs/`, with a linear RoadIndex-to-WAL-index mapping via `FirstIdx()`.
- `commitQCState` provides lock-free methods: `persistCommitQC` silently ignores duplicates (idx < next) for idempotent startup and rejects gaps (idx > next). `deleteBefore` truncates the WAL, handles TruncateAll plus cursor advance when the anchor is past all entries, and re-persists the anchor QC for crash recovery.
- `MaybePruneAndPersist` (public): truncate-then-append API. Holds the lock for the entire sequence: truncates below the anchor, then appends new QCs with the `afterEach` callback. Callers need not coordinate ordering.
- `Close()` is idempotent. `loadAllCommitQCs` detects gaps at replay time.
- Removed: `PersistCommitQC`, `DeleteBefore`, the `CommitQCPersistBatch` struct, `ResetNext`.

**state.go: orchestration and parallelism**

- `NewState` startup: prunes stale WAL entries via `MaybePruneAndPersistLane` (per committee lane) and `MaybePruneAndPersist` with anchor-only (nil proposals), replacing the old `DeleteBefore` calls.
- `runPersist` write order: `scope.Parallel` fans out one task for `MaybePruneAndPersist` (commit-QCs) and one task per committee lane for `MaybePruneAndPersistLane` (blocks). No early cancellation; the first error is returned after all tasks finish. Each path invokes `markCommitQCsPersisted` / `markBlockPersisted` per entry so voting unblocks ASAP.
- `anchorQC` typed as `utils.Option[*types.CommitQC]`; it drives both commit-QC and block WAL truncation.
- `blocksByLane` built from the batch; all committee lanes are iterated for `scope.Parallel` so truncation runs even for lanes with no new blocks.
- `persistBatch` simplified: removed `laneFirsts` and `commitQCFirst` (derivable from anchorQC).

**inner.go**

- `nextBlockToPersist` doc clarifies per-lane-per-batch notification frequency.

**Design decisions**

- When `stateDir` is None, disk I/O is skipped but cursors still advance via `afterEach` callbacks, preventing `runPersist` from spinning.
- `go.mod` requires Go 1.25+, so per-iteration loop variables are guaranteed; no explicit `v := v` in closures.

**Concurrency design**

Both `BlockPersister` and `CommitQCPersister` are internally thread-safe. Mutable state is protected with `utils.Mutex` / `utils.RWMutex`. Both hold their respective locks for the full truncate-then-append sequence, so callers need not coordinate ordering.

`BlockPersister` uses two-level locking:

- `utils.RWMutex[map[LaneID]*laneWAL]` on the lanes map: `getOrCreateLane` uses double-checked locking; `close()` takes the write lock.
- `utils.Mutex[*laneWALState]` per lane: serializes writes and truncations within `laneWAL.maybePruneAndPersist`. Different lanes are fully parallel.
- `MaybePruneAndPersistLane` releases the map RLock before acquiring the per-lane lock. This is safe because lanes are only added, never removed.

`CommitQCPersister` uses a `utils.Mutex[*commitQCState]` protecting the WAL handle and cursor. `MaybePruneAndPersist` holds the lock for truncation, anchor re-persist, and all appends.

**Test plan**

- Unit tests (with `-race`) on the persist and avail packages
- `TestState` (no-op persist) and `TestStateWithPersistence` (disk persist with prune cycles)
- `TestStateRestartFromPersisted` (full restart from WAL data)
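The double-checked locking the description ascribes to `getOrCreateLane` can be sketched with plain `sync` primitives in place of the codebase's `utils.RWMutex` wrapper (names here are simplified stand-ins). The pattern is safe precisely because lanes are only ever added, never removed:

```go
package main

import (
	"fmt"
	"sync"
)

// laneWAL is a placeholder for the real per-lane WAL handle.
type laneWAL struct{ id string }

// lanes guards a grow-only map of lane ID to WAL.
type lanes struct {
	mu sync.RWMutex
	m  map[string]*laneWAL
}

func (l *lanes) getOrCreate(id string) *laneWAL {
	// Fast path: read lock only, which lets different lanes proceed
	// fully in parallel.
	l.mu.RLock()
	lw, ok := l.m[id]
	l.mu.RUnlock()
	if ok {
		return lw
	}
	// Slow path: take the write lock and re-check, since another
	// goroutine may have created the lane between RUnlock and Lock.
	l.mu.Lock()
	defer l.mu.Unlock()
	if lw, ok := l.m[id]; ok {
		return lw
	}
	lw = &laneWAL{id: id}
	l.m[id] = lw
	return lw
}

func main() {
	ls := &lanes{m: map[string]*laneWAL{}}
	a := ls.getOrCreate("lane-1")
	b := ls.getOrCreate("lane-1")
	fmt.Println(a == b) // true: same instance both times
}
```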