Add a health check for the worker processes #686

Open
jaredoconnell wants to merge 1 commit into vllm-project:main from jaredoconnell:feat/process-health-check

Conversation

@jaredoconnell
Collaborator

Summary

Creates an asyncio task that polls the worker processes for failures.

This is necessary because, at present, a worker process failure goes undetected: guidellm continues waiting indefinitely for the process to be ready, causing a hang.

Details

  • Polls the worker processes
  • In the event of a failure, it creates a human-readable error message with the exit code or the type of failure.
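
As a rough illustration of the two points above (not the code in this PR), a polling task could look something like the sketch below. The names `describe_exit` and `monitor_processes`, the `poll_interval` parameter, and the wording of the non-signal message are hypothetical. The sketch relies only on `multiprocessing.Process.exitcode`, which is `None` while a process is running and `-N` when it was terminated by signal N.

```python
import asyncio


def describe_exit(pid, exitcode):
    """Turn a multiprocessing exit code into a human-readable message."""
    if exitcode < 0:
        # multiprocessing reports termination by signal N as exit code -N.
        return f"Worker process {pid} died unexpectedly (signal {-exitcode})"
    # Hypothetical wording for a plain exit; the PR's actual text may differ.
    return f"Worker process {pid} exited unexpectedly (exit code {exitcode})"


async def monitor_processes(processes, error_event, poll_interval=0.5):
    """Poll the processes until one has exited or error_event is set elsewhere.

    Any exit counts as a failure in this sketch, because workers are
    expected to stay alive while the group waits for them to become ready.
    """
    while not error_event.is_set():
        failures = [
            describe_exit(p.pid, p.exitcode)
            for p in processes
            if p.exitcode is not None
        ]
        if failures:
            error_event.set()
            return failures
        # Yield to the event loop between polls so other tasks keep running.
        await asyncio.sleep(poll_interval)
    return []
```

In a sketch like this, the caller would start the coroutine with `asyncio.create_task(...)` next to the code that waits for readiness, and raise once `error_event` is set instead of waiting forever.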

Here is what it looks like with a segmentation fault, which I've been getting.

╭─ Benchmarks ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [--:--:--] ⠏   0% constant@1.00 (pending )                                                                                                                                                                                                                  │
│                                                                                                                                                                                                                                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (1/1) [ 0:00:28 < -:--:-- ]26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13820 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13821 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13822 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13823 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13824 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13825 died unexpectedly (signal 11)
Traceback (most recent call last):
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/bin/guidellm", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/__main__.py", line 476, in run
    asyncio.run(
  File "/opt/homebrew/Cellar/python@3.12/3.12.10_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.10_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/benchmark/entrypoints.py", line 554, in benchmark_generative_text
    async for benchmark in benchmarker.run(
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/benchmark/benchmarker.py", line 133, in run
    async for (
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/scheduler/scheduler.py", line 143, in run
    raise err
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/scheduler/scheduler.py", line 126, in run
    await worker_group.create_processes()
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/scheduler/worker_group.py", line 273, in create_processes
    raise RuntimeError(f"Worker process group startup failed: {detail}")
RuntimeError: Worker process group startup failed: Worker process 13820 died unexpectedly (signal 11); Worker process 13821 died unexpectedly (signal 11); Worker process 13822 died unexpectedly (signal 11); Worker process 13823 died unexpectedly (signal
11); Worker process 13824 died unexpectedly (signal 11); Worker process 13825 died unexpectedly (signal 11); Worker process 13826 died unexpectedly (signal 11); Worker process 13827 died unexpectedly (signal 11); Worker process 13828 died unexpectedly
(signal 11); Worker process 13829 died unexpectedly (signal 11). Check system logs for details. Consider an alternative multiprocessing start method (spawn, fork, forkserver) via the GUIDELLM__MP_CONTEXT_TYPE environment variable

In this case, all of the worker processes hit a segmentation fault at the same time for some reason. The operating system also reported a segmentation fault to me.

Test Plan

  • You can probably kill a worker process to see it work.
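
For anyone who wants to see the exit code a killed worker produces before trying it against guidellm, here is a small standalone sketch. It is POSIX-only and independent of guidellm: the `fork` start method is used only to keep the example self-contained, and `_sleep_forever` and `kill_and_report_exitcode` are made-up names.

```python
import multiprocessing
import signal
import time


def _sleep_forever():
    # Stand-in for a worker that would normally run until told to stop.
    time.sleep(60)


def kill_and_report_exitcode():
    """Start a child process, kill it, and return the exit code the parent sees."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX system
    proc = ctx.Process(target=_sleep_forever)
    proc.start()
    proc.kill()  # sends SIGKILL on POSIX
    proc.join(timeout=10)
    return proc.exitcode


if __name__ == "__main__":
    code = kill_and_report_exitcode()
    print(f"exit code: {code} (negative means killed by signal {-code})")
```

A health monitor that reads `exitcode` would see the negative value here and could report it as a signal, in the same style as the `(signal 11)` messages in the log above.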

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

This is necessary because if this happens presently, it doesn't detect the failure, and continues waiting indefinitely for the process to be ready, causing a hang.

Generated-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>

@sjmonson (Collaborator) left a comment

Didn't test but looks fine. See minor nits below.

Comment on lines +300 to +302
" Consider an alternative multiprocessing start method"
" (spawn, fork, forkserver) via the"
" GUIDELLM__MP_CONTEXT_TYPE environment variable"

I don't like this error message because it's making an assumption about the cause that we cannot guarantee.

detail = self._worker_error_details or "error_event is set"
raise RuntimeError(f"Worker process group startup failed: {detail}")

async def _process_health_monitor(self):

This might make more sense as a thread, since we are querying information about other processes and that could block on I/O. Not 100% sure though, so I will not block the merge on it.
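
For comparison, a thread-based variant of the monitor might look roughly like the sketch below. It only illustrates the suggestion and is not code from the PR: `start_health_thread` and `failures` are hypothetical names, the message text is invented, and it assumes a `threading.Event`, which may not match the type of the PR's `error_event`.

```python
import threading


def start_health_thread(processes, error_event, failures, poll_interval=0.5):
    """Poll process exit codes from a daemon thread instead of the event loop.

    `processes` is any iterable of objects with `pid` and `exitcode`
    attributes (for example multiprocessing.Process instances).
    `error_event` is a threading.Event; `failures` is a list to append to.
    """

    def _poll():
        # Event.wait returns True once the event is set, which ends the loop.
        while not error_event.wait(poll_interval):
            dead = [p for p in processes if p.exitcode is not None]
            if dead:
                failures.extend(
                    f"Worker process {p.pid} exited with code {p.exitcode}"
                    for p in dead
                )
                error_event.set()
                return

    thread = threading.Thread(
        target=_poll, name="process-health-monitor", daemon=True
    )
    thread.start()
    return thread
```

The trade-off in this sketch is that a slow liveness check cannot stall the event loop, but the asyncio side then needs some way to observe the `threading.Event`, for example by checking it periodically.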
