Skip to content

feat(log): add daily+size rotation mode with multi-process safety#5083

Open
m199369309 wants to merge 1 commit into
xorbitsai:mainfrom
m199369309:feat/log-rotation-daily-size
Open

feat(log): add daily+size rotation mode with multi-process safety#5083
m199369309 wants to merge 1 commit into
xorbitsai:mainfrom
m199369309:feat/log-rotation-daily-size

Conversation

@m199369309

Copy link
Copy Markdown
Collaborator

Summary

  • New daily+size log rotation mode that caps individual log files at XINFERENCE_LOG_MAX_BYTES (default 100 MB) while preserving the daily rotation semantics. Files exceeding the cap within a day are rotated with a numeric suffix (xinference.log.YYYY-MM-DD.N); the midnight-rotated file keeps the bare YYYY-MM-DD suffix and represents the last segment of the day.
  • New SafeTimedAndSizeRotatingFileHandler (daily+size mode) and SafeRotatingFileHandler (size mode) with multi-process safety: fcntl rotation lock, inode check + stream reopen on foreign rotation, and os.fstat().st_size for size checks so all xoscar sub-pool processes agree on the true file size.
  • getFilesToDelete is rewritten to do retention_days-based cleanup first, then fall back to backupCount (by mtime) so retention is stable regardless of traffic.
  • audit.log is switched to daily+size, reusing XINFERENCE_LOG_MAX_BYTES / XINFERENCE_LOG_BACKUP_COUNT. Audit retention (XINFERENCE_AUDIT_LOG_RETENTION_DAYS=90) keeps its days-based semantics; backupCount acts as a disk-bounded fallback under burst traffic.
  • XINFERENCE_LOG_RETENTION_DAYS (previously dead code) is wired into the daily+size mode as the days-based retention window.

Behavior Changes

Default Before After
XINFERENCE_LOG_ROTATION daily daily+size
XINFERENCE_LOG_BACKUP_COUNT 30 300
  • Old default (daily with no per-file size cap) allowed single-day log files to grow unbounded (1 GB+ observed in production); the new default bounds disk usage at ≈ min(retention_days × daily_volume, backupCount × maxBytes).
  • Users who explicitly set XINFERENCE_LOG_ROTATION / XINFERENCE_LOG_BACKUP_COUNT are unaffected.
  • Log file names may now include a .N suffix within a single day. Log shippers (Filebeat, Fluentbit, etc.) should match xinference.log.YYYY-MM-DD[.N].

Multi-process Safety

Observed issues on a multi-pool supervisor (t-xinference-supervisor-001, 2026-06-25) before this change:

Issue Cause
Archive loss (file shrinks, .1 missing) Multiple processes os.rename race
Cross-writing to rotated files (mtime spans multiple days) After rename, other process's fd still points to old inode
Unbalanced rotation duty stream.tell() only reflects this process's fd
Cleanup deletes files still held by another process getFilesToDelete ignores foreign fds

Fixes (all in daily+size and size modes; daily mode is unchanged because its once-per-day rename frequency is tolerable):

  1. fcntl.flock serializes doRollover across processes.
  2. shouldRollover checks inode at entry; if another process renamed the file, reopen the stream.
  3. Size check uses os.fstat().st_size instead of stream.tell() to reflect the true file size.
  4. getFilesToDelete runs inside the lock.

Windows Caveat

fcntl is unavailable on Windows. The rotation lock degrades to a no-op there. Single-process use is unaffected; multi-process use on Windows may regress to lost archives or cleanup of files still held by another process. The multi-process concurrency tests are skipped on Windows. The NFS caveat still applies — fcntl.flock is unreliable on NFS.

Test plan

  • pytest xinference/deploy/test/test_log_rotation.py -v — 29 tests pass (shouldRollover / doRollover / getFilesToDelete / get_config_dict / create_rotating_handler / SafeRotatingFileHandler / multi-process concurrency with 4 workers)
  • pre-commit run --files <modified> — black / flake8 / isort / mypy / codespell all pass
  • CI matrix run on Linux / macOS / Windows (Windows job verifies the no-lock degradation path does not crash at import)
  • 24-hour integration run on a multi-pool supervisor to confirm archive stability (no shrinkage, no cross-day mtime)

🤖 Generated with Claude Code

Introduce a combined daily+size log rotation mode that caps individual
log files at XINFERENCE_LOG_MAX_BYTES while preserving the daily
rotation semantics. The new mode is the default; users can opt out via
XINFERENCE_LOG_ROTATION=daily or size.

New handlers in xinference/deploy/utils.py:
- SafeTimedAndSizeRotatingFileHandler: midnight rotation plus an
  extra size-triggered rotation within the day, named with a numeric
  suffix (e.g. xinference.log.YYYY-MM-DD.N) to avoid clobbering the
  midnight-rotated file.
- SafeRotatingFileHandler: multi-process safe replacement for the
  stdlib RotatingFileHandler used by the size mode.

Both handlers serialize doRollover across processes with fcntl, reopen
the stream when another process rotates the file (inode check), and
use os.fstat().st_size for size checks so all processes agree on the
true file size. getFilesToDelete is rewritten to do retention-days
cleanup plus a backupCount fallback.

audit.log is switched to daily+size as well, reusing
XINFERENCE_LOG_MAX_BYTES and XINFERENCE_LOG_BACKUP_COUNT. Audit
retention (XINFERENCE_AUDIT_LOG_RETENTION_DAYS=90) keeps its
days-based semantics; the backupCount acts as a disk-bounded
fallback under burst traffic.

XINFERENCE_LOG_RETENTION_DAYS (previously dead code) is now wired
into the daily+size mode as the days-based retention window.

Windows: fcntl is unavailable, so the rotation lock is a no-op
there. Single-process use is unaffected; multi-process use on
Windows may regress to lost archives. The multi-process tests are
skipped on Windows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@XprobeBot XprobeBot added this to the v2.x milestone Jun 26, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new multi-process safe log rotation strategy, daily+size, which rotates logs based on both time (daily) and size limits. It implements SafeTimedAndSizeRotatingFileHandler and integrates it across the application, backed by comprehensive unit tests. The review feedback identifies two important issues: a critical bug where self.rolloverAt is not updated when a rollover is skipped because another process already performed it (leading to subsequent redundant rollovers), and an off-by-one error in the retention cutoff calculation that causes logs to be deleted one day earlier than configured.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +276 to +279
if current_inode != stream_inode:
self.stream.close()
self.stream = self._open()
return True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Bug: When a process skips a time-based rollover because another process already performed it (detected via _check_inode_and_reopen()), its self.rolloverAt is never updated. On its very next log write, shouldRollover will see that time.time() >= self.rolloverAt is still True. Since _check_inode_and_reopen() will now return False (as the stream was already reopened), it will proceed to perform a redundant, concurrent, and destructive rollover!

This completely defeats the multi-process safety and can lead to log loss or corrupted log archives.

To fix this, we must update self.rolloverAt using self.computeRollover(time.time()) when an inode change is detected and we are past the rollover time.

            if current_inode != stream_inode:
                self.stream.close()
                self.stream = self._open()
                if time.time() >= self.rolloverAt:
                    self.rolloverAt = self.computeRollover(time.time())
                return True

to_delete = set()

if self.retention_days > 0:
cutoff = time.time() - self.retention_days * 86400

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Off-by-one Day Bug in Retention: The cutoff for date-based retention is calculated as cutoff = time.time() - self.retention_days * 86400. Because date_epoch is parsed as midnight of that day, and cutoff is current_time - retention_days, any file from exactly retention_days ago will always have date_epoch < cutoff (since current_time is past midnight).

This results in logs being deleted 1 day earlier than configured. Specifically, if retention_days is set to 1, yesterday's logs are deleted immediately after midnight, keeping 0 days of backups!

Using cutoff = time.time() - (self.retention_days + 1) * 86400 correctly preserves the configured number of retention days.

Suggested change
cutoff = time.time() - self.retention_days * 86400
cutoff = time.time() - (self.retention_days + 1) * 86400

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants