feat(log): add daily+size rotation mode with multi-process safety by m199369309 · Pull Request #5083 · xorbitsai/inference

m199369309 · 2026-06-26T03:17:15Z

Summary

New daily+size log rotation mode that caps individual log files at XINFERENCE_LOG_MAX_BYTES (default 100 MB) while preserving the daily rotation semantics. Files exceeding the cap within a day are rotated with a numeric suffix (xinference.log.YYYY-MM-DD.N); the midnight-rotated file keeps the bare YYYY-MM-DD suffix and represents the last segment of the day.
New SafeTimedAndSizeRotatingFileHandler (daily+size mode) and SafeRotatingFileHandler (size mode) with multi-process safety: fcntl rotation lock, inode check + stream reopen on foreign rotation, and os.fstat().st_size for size checks so all xoscar sub-pool processes agree on the true file size.
getFilesToDelete is rewritten to do retention_days-based cleanup first, then fall back to backupCount (by mtime) so retention is stable regardless of traffic.
audit.log is switched to daily+size, reusing XINFERENCE_LOG_MAX_BYTES / XINFERENCE_LOG_BACKUP_COUNT. Audit retention (XINFERENCE_AUDIT_LOG_RETENTION_DAYS=90) keeps its days-based semantics; backupCount acts as a disk-bounded fallback under burst traffic.
XINFERENCE_LOG_RETENTION_DAYS (previously dead code) is wired into the daily+size mode as the days-based retention window.

Behavior Changes

Default	Before	After
`XINFERENCE_LOG_ROTATION`	`daily`	`daily+size`
`XINFERENCE_LOG_BACKUP_COUNT`	`30`	`300`

Old default (daily with no per-file size cap) allowed single-day log files to grow unbounded (1 GB+ observed in production); the new default bounds disk usage at ≈ min(retention_days × daily_volume, backupCount × maxBytes).
Users who explicitly set XINFERENCE_LOG_ROTATION / XINFERENCE_LOG_BACKUP_COUNT are unaffected.
Log file names may now include a .N suffix within a single day. Log shippers (Filebeat, Fluentbit, etc.) should match xinference.log.YYYY-MM-DD[.N].

Multi-process Safety

Observed issues on a multi-pool supervisor (t-xinference-supervisor-001, 2026-06-25) before this change:

Issue	Cause
Archive loss (file shrinks, `.1` missing)	Multiple processes `os.rename` race
Cross-writing to rotated files (mtime spans multiple days)	After rename, other process's fd still points to old inode
Unbalanced rotation duty	`stream.tell()` only reflects this process's fd
Cleanup deletes files still held by another process	`getFilesToDelete` ignores foreign fds

Fixes (all in daily+size and size modes; daily mode is unchanged because its once-per-day rename frequency is tolerable):

fcntl.flock serializes doRollover across processes.
shouldRollover checks inode at entry; if another process renamed the file, reopen the stream.
Size check uses os.fstat().st_size instead of stream.tell() to reflect the true file size.
getFilesToDelete runs inside the lock.

Windows Caveat

fcntl is unavailable on Windows. The rotation lock degrades to a no-op there. Single-process use is unaffected; multi-process use on Windows may regress to lost archives or cleanup of files still held by another process. The multi-process concurrency tests are skipped on Windows. The NFS caveat still applies — fcntl.flock is unreliable on NFS.

Test plan

pytest xinference/deploy/test/test_log_rotation.py -v — 29 tests pass (shouldRollover / doRollover / getFilesToDelete / get_config_dict / create_rotating_handler / SafeRotatingFileHandler / multi-process concurrency with 4 workers)
pre-commit run --files <modified> — black / flake8 / isort / mypy / codespell all pass
CI matrix run on Linux / macOS / Windows (Windows job verifies the no-lock degradation path does not crash at import)
24-hour integration run on a multi-pool supervisor to confirm archive stability (no shrinkage, no cross-day mtime)

🤖 Generated with Claude Code

Introduce a combined daily+size log rotation mode that caps individual log files at XINFERENCE_LOG_MAX_BYTES while preserving the daily rotation semantics. The new mode is the default; users can opt out via XINFERENCE_LOG_ROTATION=daily or size. New handlers in xinference/deploy/utils.py: - SafeTimedAndSizeRotatingFileHandler: midnight rotation plus an extra size-triggered rotation within the day, named with a numeric suffix (e.g. xinference.log.YYYY-MM-DD.N) to avoid clobbering the midnight-rotated file. - SafeRotatingFileHandler: multi-process safe replacement for the stdlib RotatingFileHandler used by the size mode. Both handlers serialize doRollover across processes with fcntl, reopen the stream when another process rotates the file (inode check), and use os.fstat().st_size for size checks so all processes agree on the true file size. getFilesToDelete is rewritten to do retention-days cleanup plus a backupCount fallback. audit.log is switched to daily+size as well, reusing XINFERENCE_LOG_MAX_BYTES and XINFERENCE_LOG_BACKUP_COUNT. Audit retention (XINFERENCE_AUDIT_LOG_RETENTION_DAYS=90) keeps its days-based semantics; the backupCount acts as a disk-bounded fallback under burst traffic. XINFERENCE_LOG_RETENTION_DAYS (previously dead code) is now wired into the daily+size mode as the days-based retention window. Windows: fcntl is unavailable, so the rotation lock is a no-op there. Single-process use is unaffected; multi-process use on Windows may regress to lost archives. The multi-process tests are skipped on Windows. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a new multi-process safe log rotation strategy, daily+size, which rotates logs based on both time (daily) and size limits. It implements SafeTimedAndSizeRotatingFileHandler and integrates it across the application, backed by comprehensive unit tests. The review feedback identifies two important issues: a critical bug where self.rolloverAt is not updated when a rollover is skipped because another process already performed it (leading to subsequent redundant rollovers), and an off-by-one error in the retention cutoff calculation that causes logs to be deleted one day earlier than configured.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-26T03:19:38Z

+            if current_inode != stream_inode:
+                self.stream.close()
+                self.stream = self._open()
+                return True


Critical Bug: When a process skips a time-based rollover because another process already performed it (detected via _check_inode_and_reopen()), its self.rolloverAt is never updated. On its very next log write, shouldRollover will see that time.time() >= self.rolloverAt is still True. Since _check_inode_and_reopen() will now return False (as the stream was already reopened), it will proceed to perform a redundant, concurrent, and destructive rollover!

This completely defeats the multi-process safety and can lead to log loss or corrupted log archives.

To fix this, we must update self.rolloverAt using self.computeRollover(time.time()) when an inode change is detected and we are past the rollover time.

if current_inode != stream_inode: self.stream.close() self.stream = self._open() if time.time() >= self.rolloverAt: self.rolloverAt = self.computeRollover(time.time()) return True

gemini-code-assist · 2026-06-26T03:19:38Z

+        to_delete = set()
+
+        if self.retention_days > 0:
+            cutoff = time.time() - self.retention_days * 86400


Off-by-one Day Bug in Retention: The cutoff for date-based retention is calculated as cutoff = time.time() - self.retention_days * 86400. Because date_epoch is parsed as midnight of that day, and cutoff is current_time - retention_days, any file from exactly retention_days ago will always have date_epoch < cutoff (since current_time is past midnight).

This results in logs being deleted 1 day earlier than configured. Specifically, if retention_days is set to 1, yesterday's logs are deleted immediately after midnight, keeping 0 days of backups!

Using cutoff = time.time() - (self.retention_days + 1) * 86400 correctly preserves the configured number of retention days.

Suggested change

cutoff = time.time() - self.retention_days * 86400

cutoff = time.time() - (self.retention_days + 1) * 86400

XprobeBot added the feature label Jun 26, 2026

XprobeBot added this to the v2.x milestone Jun 26, 2026

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(log): add daily+size rotation mode with multi-process safety#5083

feat(log): add daily+size rotation mode with multi-process safety#5083
m199369309 wants to merge 1 commit into
xorbitsai:mainfrom
m199369309:feat/log-rotation-daily-size

m199369309 commented Jun 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	cutoff = time.time() - self.retention_days * 86400
	cutoff = time.time() - (self.retention_days + 1) * 86400

Uh oh!

Conversation

m199369309 commented Jun 26, 2026

Summary

Behavior Changes

Multi-process Safety

Windows Caveat

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants