feat(log): add daily+size rotation mode with multi-process safety#5083
feat(log): add daily+size rotation mode with multi-process safety#5083m199369309 wants to merge 1 commit into
Conversation
Introduce a combined daily+size log rotation mode that caps individual log files at XINFERENCE_LOG_MAX_BYTES while preserving the daily rotation semantics. The new mode is the default; users can opt out via XINFERENCE_LOG_ROTATION=daily or size. New handlers in xinference/deploy/utils.py: - SafeTimedAndSizeRotatingFileHandler: midnight rotation plus an extra size-triggered rotation within the day, named with a numeric suffix (e.g. xinference.log.YYYY-MM-DD.N) to avoid clobbering the midnight-rotated file. - SafeRotatingFileHandler: multi-process safe replacement for the stdlib RotatingFileHandler used by the size mode. Both handlers serialize doRollover across processes with fcntl, reopen the stream when another process rotates the file (inode check), and use os.fstat().st_size for size checks so all processes agree on the true file size. getFilesToDelete is rewritten to do retention-days cleanup plus a backupCount fallback. audit.log is switched to daily+size as well, reusing XINFERENCE_LOG_MAX_BYTES and XINFERENCE_LOG_BACKUP_COUNT. Audit retention (XINFERENCE_AUDIT_LOG_RETENTION_DAYS=90) keeps its days-based semantics; the backupCount acts as a disk-bounded fallback under burst traffic. XINFERENCE_LOG_RETENTION_DAYS (previously dead code) is now wired into the daily+size mode as the days-based retention window. Windows: fcntl is unavailable, so the rotation lock is a no-op there. Single-process use is unaffected; multi-process use on Windows may regress to lost archives. The multi-process tests are skipped on Windows. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a new multi-process safe log rotation strategy, daily+size, which rotates logs based on both time (daily) and size limits. It implements SafeTimedAndSizeRotatingFileHandler and integrates it across the application, backed by comprehensive unit tests. The review feedback identifies two important issues: a critical bug where self.rolloverAt is not updated when a rollover is skipped because another process already performed it (leading to subsequent redundant rollovers), and an off-by-one error in the retention cutoff calculation that causes logs to be deleted one day earlier than configured.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if current_inode != stream_inode: | ||
| self.stream.close() | ||
| self.stream = self._open() | ||
| return True |
There was a problem hiding this comment.
Critical Bug: When a process skips a time-based rollover because another process already performed it (detected via _check_inode_and_reopen()), its self.rolloverAt is never updated. On its very next log write, shouldRollover will see that time.time() >= self.rolloverAt is still True. Since _check_inode_and_reopen() will now return False (as the stream was already reopened), it will proceed to perform a redundant, concurrent, and destructive rollover!
This completely defeats the multi-process safety and can lead to log loss or corrupted log archives.
To fix this, we must update self.rolloverAt using self.computeRollover(time.time()) when an inode change is detected and we are past the rollover time.
if current_inode != stream_inode:
self.stream.close()
self.stream = self._open()
if time.time() >= self.rolloverAt:
self.rolloverAt = self.computeRollover(time.time())
return True| to_delete = set() | ||
|
|
||
| if self.retention_days > 0: | ||
| cutoff = time.time() - self.retention_days * 86400 |
There was a problem hiding this comment.
Off-by-one Day Bug in Retention: The cutoff for date-based retention is calculated as cutoff = time.time() - self.retention_days * 86400. Because date_epoch is parsed as midnight of that day, and cutoff is current_time - retention_days, any file from exactly retention_days ago will always have date_epoch < cutoff (since current_time is past midnight).
This results in logs being deleted 1 day earlier than configured. Specifically, if retention_days is set to 1, yesterday's logs are deleted immediately after midnight, keeping 0 days of backups!
Using cutoff = time.time() - (self.retention_days + 1) * 86400 correctly preserves the configured number of retention days.
| cutoff = time.time() - self.retention_days * 86400 | |
| cutoff = time.time() - (self.retention_days + 1) * 86400 |
Summary
daily+sizelog rotation mode that caps individual log files atXINFERENCE_LOG_MAX_BYTES(default 100 MB) while preserving the daily rotation semantics. Files exceeding the cap within a day are rotated with a numeric suffix (xinference.log.YYYY-MM-DD.N); the midnight-rotated file keeps the bareYYYY-MM-DDsuffix and represents the last segment of the day.SafeTimedAndSizeRotatingFileHandler(daily+size mode) andSafeRotatingFileHandler(size mode) with multi-process safety:fcntlrotation lock, inode check + stream reopen on foreign rotation, andos.fstat().st_sizefor size checks so all xoscar sub-pool processes agree on the true file size.getFilesToDeleteis rewritten to doretention_days-based cleanup first, then fall back tobackupCount(by mtime) so retention is stable regardless of traffic.audit.logis switched todaily+size, reusingXINFERENCE_LOG_MAX_BYTES/XINFERENCE_LOG_BACKUP_COUNT. Audit retention (XINFERENCE_AUDIT_LOG_RETENTION_DAYS=90) keeps its days-based semantics;backupCountacts as a disk-bounded fallback under burst traffic.XINFERENCE_LOG_RETENTION_DAYS(previously dead code) is wired into the daily+size mode as the days-based retention window.Behavior Changes
XINFERENCE_LOG_ROTATIONdailydaily+sizeXINFERENCE_LOG_BACKUP_COUNT30300dailywith no per-file size cap) allowed single-day log files to grow unbounded (1 GB+ observed in production); the new default bounds disk usage at ≈min(retention_days × daily_volume, backupCount × maxBytes).XINFERENCE_LOG_ROTATION/XINFERENCE_LOG_BACKUP_COUNTare unaffected..Nsuffix within a single day. Log shippers (Filebeat, Fluentbit, etc.) should matchxinference.log.YYYY-MM-DD[.N].Multi-process Safety
Observed issues on a multi-pool supervisor (
t-xinference-supervisor-001, 2026-06-25) before this change:.1missing)os.renameracestream.tell()only reflects this process's fdgetFilesToDeleteignores foreign fdsFixes (all in
daily+sizeandsizemodes;dailymode is unchanged because its once-per-day rename frequency is tolerable):fcntl.flockserializesdoRolloveracross processes.shouldRolloverchecks inode at entry; if another process renamed the file, reopen the stream.os.fstat().st_sizeinstead ofstream.tell()to reflect the true file size.getFilesToDeleteruns inside the lock.Windows Caveat
fcntlis unavailable on Windows. The rotation lock degrades to a no-op there. Single-process use is unaffected; multi-process use on Windows may regress to lost archives or cleanup of files still held by another process. The multi-process concurrency tests are skipped on Windows. The NFS caveat still applies —fcntl.flockis unreliable on NFS.Test plan
pytest xinference/deploy/test/test_log_rotation.py -v— 29 tests pass (shouldRollover / doRollover / getFilesToDelete / get_config_dict / create_rotating_handler / SafeRotatingFileHandler / multi-process concurrency with 4 workers)pre-commit run --files <modified>— black / flake8 / isort / mypy / codespell all pass🤖 Generated with Claude Code