Stream Ray job logs with rolling buffer to bound memory usage #2101

Draft
avilches wants to merge 3 commits into main from ray-logs-streaming
Conversation

@avilches (Contributor) commented May 7, 2026

Problem

When a Ray job produces very large logs (e.g. 1 GB), RayRunner.logs() was calling JobSubmissionClient.get_job_logs(), which downloads the entire log into a Python string before check_logs() can truncate it to the 50 MB limit, causing OOM errors or HTTP timeouts.

Solution

Replace the blocking get_job_logs() call with a direct streaming HTTP GET to the Ray dashboard endpoint ({host}/api/jobs/{job_id}/logs). Logs are read in 64 KB chunks into a collections.deque rolling buffer that discards the oldest chunks once the accumulated size exceeds FUNCTIONS_LOGS_SIZE_LIMIT.

Result: memory usage is bounded to ~50 MB regardless of total log size. The same amount of data travels over the network, but it is never all resident in RAM at once.
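The rolling-buffer approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `tail_chunks` and the constant values are hypothetical stand-ins, and the `requests` usage shown in the comment assumes the dashboard endpoint behaves as a plain streaming HTTP response.

```python
from collections import deque

CHUNK_SIZE = 64 * 1024               # read logs in 64 KB chunks
LOGS_SIZE_LIMIT = 50 * 1024 * 1024   # stand-in for FUNCTIONS_LOGS_SIZE_LIMIT

def tail_chunks(chunks, limit=LOGS_SIZE_LIMIT):
    """Keep only the most recent chunks whose total size fits in `limit`.

    Memory stays bounded at roughly `limit` plus one chunk, regardless
    of how many bytes flow through `chunks`.
    """
    buf = deque()
    size = 0
    for chunk in chunks:
        buf.append(chunk)
        size += len(chunk)
        # Discard the oldest chunks once the rolling buffer exceeds the limit.
        while size > limit and len(buf) > 1:
            size -= len(buf.popleft())
    # A single oversized chunk can still exceed the limit; trim the tail.
    return b"".join(buf)[-limit:]

# Hypothetical wiring against the streaming endpoint, e.g. with requests:
#   resp = requests.get(f"{host}/api/jobs/{job_id}/logs", stream=True)
#   tail = tail_chunks(resp.iter_content(CHUNK_SIZE))
```

Using a deque of chunks (rather than one growing bytes object) keeps eviction O(1) per discarded chunk and avoids repeatedly copying a ~50 MB buffer.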

Behaviour preserved

  • Returns the last N bytes of logs (same as the previous check_logs truncation)
  • Prepends the same user-visible truncation warning when logs are cut
  • check_logs() call sites are untouched and continue to handle the FAILED + empty logs case and act as a safety net for other runners
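The truncation-warning behaviour in the list above amounts to prepending a notice only when older data was actually dropped. A hedged sketch, with a hypothetical warning string and helper name (the PR reuses the existing user-visible warning text, which is not shown here):

```python
TRUNCATION_WARNING = "[logs truncated to the most recent 50 MB]\n"  # hypothetical text

def format_logs(tail: bytes, truncated: bool) -> str:
    """Decode the rolling-buffer tail, flagging it when logs were cut."""
    text = tail.decode("utf-8", errors="replace")
    # Prepend the warning only if older log data was discarded upstream.
    return TRUNCATION_WARNING + text if truncated else text
```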

@avilches avilches self-assigned this May 8, 2026