Multiple squeue calls in _cancel_if_blocked in reframe/core/schedulers/slurm.py are hitting Slurm's RPC rate limit #3640

@lgerhardt

Thanks for all the work you've done to use sacct wherever you can; it's been great at reducing load on Slurm.

When running ReFrame at our site, we see evidence that Slurm is being polled at a high enough rate that it is getting automatically rate limited. Looking at the code, _cancel_if_blocked (which calls squeue) appears to be called on a job-by-job basis. Since the full list of jobs is already known when this call is made (the poll function earlier in the same file iterates over that list), it would put less load on Slurm to query squeue for all the jobs in a single call.
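To make the suggestion concrete, here is a rough sketch of what a batched query could look like. This is not ReFrame's actual code: the helper names (`build_squeue_cmd`, `parse_squeue_output`, `query_blocked_info`) are hypothetical, and the `%i|%T|%r` output format (job ID, state, reason) is just one choice I'm assuming for illustration. The idea is simply to pass all job IDs to a single `squeue --jobs` invocation and look up each job's state and reason from the parsed result.

```python
import subprocess


def build_squeue_cmd(job_ids):
    """Build a single squeue command covering every job ID in job_ids.

    Hypothetical helper; the '|' separator is an arbitrary choice.
    """
    return [
        'squeue',
        '--noheader',
        '--jobs=' + ','.join(job_ids),
        '--format=%i|%T|%r',   # job ID | state | reason
    ]


def parse_squeue_output(output):
    """Parse 'jobid|state|reason' lines into {jobid: (state, reason)}."""
    jobs = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue

        # Split at most twice so any stray '|' stays in the reason field
        jobid, state, reason = line.split('|', 2)
        jobs[jobid] = (state, reason)

    return jobs


def query_blocked_info(job_ids):
    """Query squeue once for all jobs instead of once per job.

    Error handling is deliberately omitted in this sketch. Jobs that
    squeue no longer reports are simply absent from the result.
    """
    if not job_ids:
        return {}

    proc = subprocess.run(build_squeue_cmd(job_ids),
                          capture_output=True, text=True)
    return parse_squeue_output(proc.stdout)
```

The per-job check would then become a dictionary lookup against the result of one `query_blocked_info` call per poll cycle, so the number of squeue RPCs no longer grows with the number of jobs.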


Status

In Progress
