chore: pre-sync venv before submitting slurm batch jobs#573
Conversation
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
WalkthroughThis PR centralizes NSS_PYTHON_VERSION resolution (from ChangesVenv Pre-build for Repo Mode
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
mckornfield
left a comment
There was a problem hiding this comment.
only worry is that the login node has no memory/cpu, but I think this approach is better
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Greptile SummaryThis PR moves the
Confidence Score: 5/5Safe to merge — the change moves a racy multi-task uv sync to a single guarded login-node pre-build and adds a fast fail-safe in each array task. The logic is straightforward: one flock-guarded uv sync on the login node replaces N concurrent syncs inside containers. Error paths (missing venv, failed sync) are handled with clear diagnostics. UV env vars are propagated via --export=ALL so job tasks see the correct values. The known cross-node flock limitation is now documented in the README. No functional regressions in PyPI mode. No files require special attention. Important Files Changed
|
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
script/slurm/slurm_nss_matrix.sh (1)
110-116:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate NSS_PYTHON_VERSION is set before use.
NSS_PYTHON_VERSION is used in PyPI mode (line 116) but never validated. With
set -u, the script will crash with a generic "unbound variable" error if env_variables.sh wasn't sourced. Other required variables (PACKED_DATASETS, NSS_DIR, LUSTRE_DIR, etc.) are validated with helpful messages at lines 21-63. Add a similar check for NSS_PYTHON_VERSION to provide a clear error when running standalone without sourcing env_variables.sh.🛡️ Proposed validation
Add after line 63 or before the PyPI mode block:
+if [ -z "${NSS_PYTHON_VERSION:-}" ]; then + echo "NSS_PYTHON_VERSION must be set" >&2 + echo "Run script through submit_slurm_jobs.sh, or if running manually source env_variables.sh before running" >&2 + exit 1 +fi
🧹 Nitpick comments (1)
script/slurm/slurm_nss_matrix.sh (1)
135-135: 💤 Low valueQuote variable for consistency.
Line 135 uses an unquoted
${NSS_DIR}in the log message. As per coding guidelines, shell variables should always be quoted. While echo is generally safe, quoting ensures consistency and guards against paths with spaces.♻️ Proposed change
- echo "[NSS SLURM] Using pre-built repo venv at ${NSS_DIR}/.venv" + echo "[NSS SLURM] Using pre-built repo venv at \"${NSS_DIR}/.venv\""
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eb436ac1-fff6-41f2-a741-987e43fbb387
📒 Files selected for processing (3)
script/slurm/README.mdscript/slurm/slurm_nss_matrix.shscript/slurm/submit_slurm_jobs.sh
✅ Files skipped from review due to trivial changes (1)
- script/slurm/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- script/slurm/submit_slurm_jobs.sh
📜 Review details
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{sh,bash}
📄 CodeRabbit inference engine (AGENTS.md)
Never use
~inside double-quoted strings in shell scripts -- use$HOMEor an absolute path instead
Files:
script/slurm/slurm_nss_matrix.sh
**/*.sh
📄 CodeRabbit inference engine (STYLE_GUIDE.md)
**/*.sh: Use shebang#!/usr/bin/env bash(not#!/bin/bash) in shell scripts
Useset -euas minimum safety floor in shell scripts; useset -euo pipefailunlesspipefailbreaks piped-grep patterns
Use snake_case for shell functions,_prefix for internal helpers in shell scripts
Always quote variables in shell scripts ("$VAR","${VAR}"); use defaults via${VAR:-default}; usereadonlyfor non-changing variables
Detect repo root in shell scripts usingREPO_ROOT=${REPO_ROOT:-$(git rev-parse --show-toplevel)}
Useshellcheckto lint shell scripts. When disabling a check, add# shellcheck disable=SCXXXXwith a brief reason.Shell scripts should be linted and formatted using the code quality tools. Run mise run format on .sh files to apply consistent formatting
Files:
script/slurm/slurm_nss_matrix.sh
⚙️ CodeRabbit configuration file
Review shell scripts for #!/usr/bin/env bash, set -euo pipefail where appropriate, quoting, repo root detection, and shellcheck compliance.
Files:
script/slurm/slurm_nss_matrix.sh
**/*.{py,sh,yaml,yml,md}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
All source files (.py, .sh, .yaml, .yml, .md) must include SPDX copyright headers. Use mise run format to add them automatically
Files:
script/slurm/slurm_nss_matrix.sh
**/*
⚙️ CodeRabbit configuration file
**/*: Review as a senior maintainer for NeMo Safe Synthesizer. Prioritize issues that can change behavior, break user workflows, weaken privacy guarantees, hide failures, make tests unreliable, or create maintenance risk. Avoid generic style commentary unless it points to a concrete project convention that automated tools will not catch.
Comment only when the finding is actionable and tied to changed code. For each finding, state the impact, the condition that triggers it, and the smallest practical fix. Prefer one precise comment over broad advice. Do not ask for refactors outside the PR scope unless the changed code creates the problem.
Review type guidance: - Potential issue: use for correctness bugs, data loss, privacy leaks,
security risks, broken public APIs, invalid config behavior, missing
validation, hidden failures, nondeterministic tests, or CI breakage.
- Refactor suggestion: use for local maintainability problems introduced
by the diff when they have clear future cost, such as duplicated setup,
unclear boundaries, over-mocking, avoidable complexity, or opaque test
helpers.- Nitpick: avoid in chill mode. Do not emit formatting, import-order,
wording, or style-only comments unless automated tools cannot catch the
issue and it affects maintainability.Severity guidance: - Critical: security/privacy leaks, data loss, training/test/holdout
contamination, or broken release/package/core pipeline execution.
- Major: incorrect generation/training/evaluation behavior, broken
CLI/SDK public API, invalid config defaults or validators, or GPU/vLLM
cleanup and process-isolation bugs likely to fail CI or production
runs.- Minor: localized bugs, missing focused tests for changed behavior, or
bad test patterns that weaken regression coverage.- Trivial: small cleanup with no behavior impact. Usually suppress in
chill mode.- Info: context only. Avoid unless it helps reviewers understand risk.
Safe-Synthesizer-specific review focus: - Data ...
Files:
script/slurm/slurm_nss_matrix.sh
script/**
⚙️ CodeRabbit configuration file
Review standalone scripts for reproducibility and operational safety. Check argument validation, quoting, repo-root detection, environment variables, generated artifacts, external commands, GPU/cluster assumptions, and whether the script should be wired through Makefile or documented in README/docs.
Files:
script/slurm/slurm_nss_matrix.sh
**
⚙️ CodeRabbit configuration file
**:AGENTS.md
Guide for AI agents (Cursor, Windsurf, Claude Code, etc.) working in the Safe-Synthesizer repo.
This project loads local developer preferences from
@AGENTS.local.md. You MUST read this file if it exists and give its instructions top priority.Skills
Repo-specific skills live in
.agents/skills/; see.agents/README.mdfor the catalog. Read a skill when the task matches its scope instead of copying workflow details into this file.Durable implementation guidance belongs with the code it describes: function and class docstrings for public contracts and source comments for local invariants. Test-suite guidance belongs in
tests/TESTING.md.Repo Conventions
See STYLE_GUIDE.md for detailed code style conventions (Python, markdown, Dockerfiles, shell scripts, testing, config files, docstrings).
Use
uvfor everything -- neverpipor rawpython. Python 3.11–3.13 with modern syntax (X | Y,list[str],Self). Python 3.14+ is not supported.Common commands:
mise run test(unit tests),mise run format(auto-fix formatting + lint + copyright),mise run check(read-only local quality checks),mise run validate(pre-PR quality, lock, and CI unit checks),mise run typecheck(ty only). Always use mise tasks or the wrapper scripts intools/instead of runningruffortydirectly. Useuv runfor Python execution. When in doubt, inspectmise tasksandpytest --markers.The canonical
uv synccommand for a full GPU/dev environment is:uv sync --frozen --extra cu129 --extra engine --group devBare
uv sync --frozen(without extras) installs an incomplete environment --ty, import checks, and GPU tests will fail.Feature branches off
main. Branch names often include an issue number prefix (e.g.,<author>/123-short-name).Do ...
Files:
script/slurm/slurm_nss_matrix.sh
🧠 Learnings (1)
📚 Learning: 2026-06-03T23:06:56.798Z
Learnt from: CR
Repo: NVIDIA-NeMo/Safe-Synthesizer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-06-03T23:06:56.798Z
Learning: Use `uv run` for Python execution instead of raw `python`
Applied to files:
script/slurm/slurm_nss_matrix.sh
Summary
In repo mode (no
--nss-version),${NSS_DIR}/.venvis bind-mounted into everyjob container from Lustre. Each array task running
uv syncmeans dozens of jobscreate/recreate the same directory at once — and
uvrecreates the venv wheneverthe pinned Python changes — which could corrupt the shared venv on Lustre and fails
jobs before they start.
This moves the sync to a single login-node build at submit time; array tasks only
activate the venv.
Changes
submit_slurm_jobs.sh: build the repo venv once on the login node(
uv sync --frozen ...at the pinned Python,flock-guarded) before submittingthe array. Skipped on
--dry-runand in PyPI mode.slurm_nss_matrix.sh: repo mode now only activates${NSS_DIR}/.venvandruns the venv binary directly (no per-task
uv sync, no implicituv runre-sync). Fails fast with a clear message if the venv is missing.
env_variables.sh: resolveNSS_PYTHON_VERSIONfrom.python-version(override-able, falls back to
3.13) so Slurm venvs track Python bumps. This isnow the single source of truth for the version.
README.md: updateNSS_PYTHON_VERSIONdocs.Notes
--nss-version) behavior is unchanged.build moves from N racy per-task syncs to one safe pre-build.
Test plan
--max-concurrent-slurm-jobs 50) completes without.venvstale-handle / "failed to remove" errors.--dry-runsubmit skips the pre-build.--nss-version) still builds its versioned venv.This is what happened when I submitted a batch job from an outdated venv:
Other Notes
Summary by CodeRabbit
Documentation
Improvements