chore: pre-sync venv before submitting slurm batch jobs by nina-xu · Pull Request #573 · NVIDIA-NeMo/Safe-Synthesizer

nina-xu · 2026-06-08T15:59:23Z

Summary

In repo mode (no --nss-version), ${NSS_DIR}/.venv is bind-mounted into every
job container from Lustre. Each array task running uv sync means dozens of jobs
create/recreate the same directory at once — and uv recreates the venv whenever
the pinned Python changes — which could corrupt the shared venv on Lustre and fails
jobs before they start.

This moves the sync to a single login-node build at submit time; array tasks only
activate the venv.

Changes

submit_slurm_jobs.sh: build the repo venv once on the login node
(uv sync --frozen ... at the pinned Python, flock-guarded) before submitting
the array. Skipped on --dry-run and in PyPI mode.
slurm_nss_matrix.sh: repo mode now only activates ${NSS_DIR}/.venv and
runs the venv binary directly (no per-task uv sync, no implicit uv run
re-sync). Fails fast with a clear message if the venv is missing.
env_variables.sh: resolve NSS_PYTHON_VERSION from .python-version
(override-able, falls back to 3.13) so Slurm venvs track Python bumps. This is
now the single source of truth for the version.
README.md: update NSS_PYTHON_VERSION docs.

Notes

PyPI mode (--nss-version) behavior is unchanged.
Resulting venv contents are identical to before (frozen lockfile) — only the
build moves from N racy per-task syncs to one safe pre-build.

Test plan

Full matrix submit (e.g. --max-concurrent-slurm-jobs 50) completes without
.venv stale-handle / "failed to remove" errors.
--dry-run submit skips the pre-build.
PyPI mode (--nss-version) still builds its versioned venv.

This is what happened when I submitted a batch job from an outdated venv:

bash submit_slurm_jobs.sh --runs 5 --exp-name nss_weekly_run_2026-06-08 --dataset-group short --max-concurrent-slurm-jobs 50 
[submit] pre-building repo venv at /lustre/fsw/portfolios/nemotron/projects/nemotron_data_dev/users/ninaxu/Safe-Synthesizer/.venv (python=3.13)
Using CPython 3.13.12
Removed virtual environment at: .venv
Creating virtual environment at: .venv
      Built nemo-safe-synthesizer @ file:///lustre/fs11/portfolios/nemotron/projects/nemotron_data
Prepared 1 package in 6.69s
Installed 335 packages in 2m 45s
 + accelerate==1.12.0
 + aiohappyeyeballs==2.6.1
….

Other Notes

Closes #

Summary by CodeRabbit

Documentation
- Clarified Python-version selection behavior and added guidance to run job submissions one at a time to avoid concurrent environment build corruption.
Improvements
- Python version now defaults to the repository's pinned version with an automatic fallback.
- Added a cap on concurrent package downloads for greater stability.
- Shared virtual environment is prebuilt and validated before batch runs to reduce repeated setup and failures.

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

coderabbitai · 2026-06-08T15:59:31Z

Walkthrough

This PR centralizes NSS_PYTHON_VERSION resolution (from ${NSS_DIR}/.python-version, fallback 3.13), adds UV_CONCURRENT_DOWNLOADS default, pre-builds a shared repo ${NSS_DIR}/.venv once on the login node under an flock lock (skipped in PyPI/dry-run), validates the venv, and updates job tasks to use the pre-built venv’s safe-synthesizer instead of per-task uv runs.

Changes

Venv Pre-build for Repo Mode

Layer / File(s)	Summary
Environment variable resolution and setup `script/slurm/env_variables.sh`, `script/slurm/slurm_nss_matrix.sh`, `script/slurm/README.md`	`env_variables.sh` adds `UV_CONCURRENT_DOWNLOADS` (default `20`) and resolves `NSS_PYTHON_VERSION` from `${NSS_DIR}/.python-version` (whitespace-stripped) with `3.13` fallback; `slurm_nss_matrix.sh` removes its local fallback and expects the variable to be set; README updated to document the resolution and PyPI-mode venv naming.
Login-node venv pre-build with locking and validation `script/slurm/submit_slurm_jobs.sh`, `script/slurm/slurm_nss_matrix.sh`, `script/slurm/README.md`	Adds `prebuild_repo_venv()` that serializes building `${NSS_DIR}/.venv` with `flock` under `${LUSTRE_DIR}/.locks`, skips in PyPI mode or `--dry-run`, validates via `${NSS_DIR}/.venv/bin/safe-synthesizer --version`, and requires job tasks to activate/use the pre-built venv instead of running `uv sync`/`uv run` per-task. README adds admonition to run submits one at a time to avoid concurrent venv corruption.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA-NeMo/Safe-Synthesizer#490: Modifies the same SLURM scripts and addresses venv activation and concurrent venv build issues.

Suggested reviewers

kendrickb-nvidia

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: moving venv pre-build from individual Slurm tasks to a single pre-sync before job submission, which is the core objective of the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch nina-xu/slurm-pre-sync

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

mckornfield

only worry is that the login node has no memory/cpu, but I think this approach is better

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

greptile-apps · 2026-06-08T17:14:02Z

Greptile Summary

This PR moves the uv sync venv build from per-task execution inside each Slurm array job to a single login-node pre-build before submission, eliminating the race condition that corrupted the shared Lustre .venv when dozens of tasks tried to recreate it concurrently.

submit_slurm_jobs.sh: adds prebuild_repo_venv(), which runs uv sync --frozen under flock and then functionally verifies the venv with safe-synthesizer --version; skipped in dry-run and PyPI mode.
slurm_nss_matrix.sh: repo-mode tasks now only activate the pre-built venv and invoke the binary directly; a fast -x guard fails the task immediately if the venv is missing.
env_variables.sh: NSS_PYTHON_VERSION is now derived from .python-version (with a 3.13 fallback) and UV_CONCURRENT_DOWNLOADS is capped at 20 to limit login-node memory during the pre-build.

Confidence Score: 5/5

Safe to merge — the change moves a racy multi-task uv sync to a single guarded login-node pre-build and adds a fast fail-safe in each array task.

The logic is straightforward: one flock-guarded uv sync on the login node replaces N concurrent syncs inside containers. Error paths (missing venv, failed sync) are handled with clear diagnostics. UV env vars are propagated via --export=ALL so job tasks see the correct values. The known cross-node flock limitation is now documented in the README. No functional regressions in PyPI mode.

No files require special attention.

Important Files Changed

Filename	Overview
script/slurm/submit_slurm_jobs.sh	Adds prebuild_repo_venv() to run uv sync once on the login node before sbatch submission; skipped in dry-run and PyPI mode; flock-guarded on Lustre (same-node only, documented limitation in README).
script/slurm/slurm_nss_matrix.sh	Removes per-task uv sync in repo mode; adds a fast -x guard on the pre-built safe-synthesizer binary and activates the venv directly; NSS_PYTHON_VERSION fallback removed (relies on env_variables.sh sourced at submit time).
script/slurm/env_variables.sh	NSS_PYTHON_VERSION now reads from .python-version (stripping whitespace) with a 3.13 fallback; UV_CONCURRENT_DOWNLOADS capped at 20 to reduce login-node memory pressure during pre-build.
script/slurm/README.md	Docs updated to reflect new NSS_PYTHON_VERSION default behavior and adds a warning against running multiple concurrent submits.

Sequence Diagram

sequenceDiagram
    participant U as User (login node)
    participant S as submit_slurm_jobs.sh
    participant E as env_variables.sh
    participant F as flock / uv sync
    participant SL as Slurm Scheduler
    participant T as Array Task (slurm_nss_matrix.sh)

    U->>S: bash submit_slurm_jobs.sh --runs N ...
    S->>E: source env_variables.sh
    E-->>S: NSS_PYTHON_VERSION (from .python-version or 3.13)
    S->>F: prebuild_repo_venv()
    Note over F: flock -x on Lustre lock file
    F->>F: uv sync --frozen --python NSS_PYTHON_VERSION
    F->>F: safe-synthesizer --version (functional verify)
    F-->>S: venv ready
    S->>SL: "sbatch --export=ALL (array 0..N-1)"
    SL-->>U: array job ID

    loop Each array task
        SL->>T: launch task i
        T->>T: source uv env
        alt repo mode
            T->>T: check -x .venv/bin/safe-synthesizer
            T->>T: source .venv/bin/activate
            T->>T: "NSS_RUN_CMD = .venv/bin/safe-synthesizer"
        else PyPI mode (--nss-version)
            T->>T: uv venv + uv pip install (unchanged)
        end
        T->>T: safe-synthesizer run ...
    end

_{Reviews (3): Last reviewed commit: "shorten comments" | Re-trigger Greptile}

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

script/slurm/slurm_nss_matrix.sh (1)
110-116: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate NSS_PYTHON_VERSION is set before use.

NSS_PYTHON_VERSION is used in PyPI mode (line 116) but never validated. With set -u, the script will crash with a generic "unbound variable" error if env_variables.sh wasn't sourced. Other required variables (PACKED_DATASETS, NSS_DIR, LUSTRE_DIR, etc.) are validated with helpful messages at lines 21-63. Add a similar check for NSS_PYTHON_VERSION to provide a clear error when running standalone without sourcing env_variables.sh.
🛡️ Proposed validation

Add after line 63 or before the PyPI mode block:
+if [ -z "${NSS_PYTHON_VERSION:-}" ]; then
+    echo "NSS_PYTHON_VERSION must be set" >&2
+    echo "Run script through submit_slurm_jobs.sh, or if running manually source env_variables.sh before running" >&2
+    exit 1
+fi

🧹 Nitpick comments (1)

script/slurm/slurm_nss_matrix.sh (1)
135-135: 💤 Low value

Quote variable for consistency.

Line 135 uses an unquoted ${NSS_DIR} in the log message. As per coding guidelines, shell variables should always be quoted. While echo is generally safe, quoting ensures consistency and guards against paths with spaces.
♻️ Proposed change
-    echo "[NSS SLURM] Using pre-built repo venv at ${NSS_DIR}/.venv"
+    echo "[NSS SLURM] Using pre-built repo venv at \"${NSS_DIR}/.venv\""

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb436ac1-fff6-41f2-a741-987e43fbb387

📥 Commits

Reviewing files that changed from the base of the PR and between f0a94d3 and aeeb8c6.

📒 Files selected for processing (3)

script/slurm/README.md
script/slurm/slurm_nss_matrix.sh
script/slurm/submit_slurm_jobs.sh

✅ Files skipped from review due to trivial changes (1)

script/slurm/README.md

🚧 Files skipped from review as they are similar to previous changes (1)

script/slurm/submit_slurm_jobs.sh

📜 Review details

🧰 Additional context used

📓 Path-based instructions (6)

**/*.{sh,bash}

📄 CodeRabbit inference engine (AGENTS.md)

Never use ~ inside double-quoted strings in shell scripts -- use $HOME or an absolute path instead

Files:

script/slurm/slurm_nss_matrix.sh

**/*.sh

📄 CodeRabbit inference engine (STYLE_GUIDE.md)

**/*.sh: Use shebang #!/usr/bin/env bash (not #!/bin/bash) in shell scripts
Use set -eu as minimum safety floor in shell scripts; use set -euo pipefail unless pipefail breaks piped-grep patterns
Use snake_case for shell functions, _ prefix for internal helpers in shell scripts
Always quote variables in shell scripts ("$VAR", "${VAR}"); use defaults via ${VAR:-default}; use readonly for non-changing variables
Detect repo root in shell scripts using REPO_ROOT=${REPO_ROOT:-$(git rev-parse --show-toplevel)}
Use shellcheck to lint shell scripts. When disabling a check, add # shellcheck disable=SCXXXX with a brief reason.

Shell scripts should be linted and formatted using the code quality tools. Run mise run format on .sh files to apply consistent formatting

Files:

script/slurm/slurm_nss_matrix.sh

⚙️ CodeRabbit configuration file

Review shell scripts for #!/usr/bin/env bash, set -euo pipefail where appropriate, quoting, repo root detection, and shellcheck compliance.

Files:

script/slurm/slurm_nss_matrix.sh

**/*.{py,sh,yaml,yml,md}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

All source files (.py, .sh, .yaml, .yml, .md) must include SPDX copyright headers. Use mise run format to add them automatically

Files:

script/slurm/slurm_nss_matrix.sh

**/*

⚙️ CodeRabbit configuration file

**/*: Review as a senior maintainer for NeMo Safe Synthesizer. Prioritize issues that can change behavior, break user workflows, weaken privacy guarantees, hide failures, make tests unreliable, or create maintenance risk. Avoid generic style commentary unless it points to a concrete project convention that automated tools will not catch.
Comment only when the finding is actionable and tied to changed code. For each finding, state the impact, the condition that triggers it, and the smallest practical fix. Prefer one precise comment over broad advice. Do not ask for refactors outside the PR scope unless the changed code creates the problem.
Review type guidance: - Potential issue: use for correctness bugs, data loss, privacy leaks,
security risks, broken public APIs, invalid config behavior, missing
validation, hidden failures, nondeterministic tests, or CI breakage.

Refactor suggestion: use for local maintainability problems introduced
by the diff when they have clear future cost, such as duplicated setup,
unclear boundaries, over-mocking, avoidable complexity, or opaque test
helpers.

Nitpick: avoid in chill mode. Do not emit formatting, import-order,
wording, or style-only comments unless automated tools cannot catch the
issue and it affects maintainability.

Severity guidance: - Critical: security/privacy leaks, data loss, training/test/holdout
contamination, or broken release/package/core pipeline execution.

Major: incorrect generation/training/evaluation behavior, broken
CLI/SDK public API, invalid config defaults or validators, or GPU/vLLM
cleanup and process-isolation bugs likely to fail CI or production
runs.

Minor: localized bugs, missing focused tests for changed behavior, or
bad test patterns that weaken regression coverage.

Trivial: small cleanup with no behavior impact. Usually suppress in
chill mode.

Info: context only. Avoid unless it helps reviewers understand risk.
Safe-Synthesizer-specific review focus: - Data ...

Files:

script/slurm/slurm_nss_matrix.sh

script/**

⚙️ CodeRabbit configuration file

Review standalone scripts for reproducibility and operational safety. Check argument validation, quoting, repo-root detection, environment variables, generated artifacts, external commands, GPU/cluster assumptions, and whether the script should be wired through Makefile or documented in README/docs.

Files:

script/slurm/slurm_nss_matrix.sh

**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

Guide for AI agents (Cursor, Windsurf, Claude Code, etc.) working in the Safe-Synthesizer repo.

This project loads local developer preferences from @AGENTS.local.md. You MUST read this file if it exists and give its instructions top priority.

Skills

Repo-specific skills live in .agents/skills/; see .agents/README.md for the catalog. Read a skill when the task matches its scope instead of copying workflow details into this file.

Durable implementation guidance belongs with the code it describes: function and class docstrings for public contracts and source comments for local invariants. Test-suite guidance belongs in tests/TESTING.md.

Repo Conventions

See STYLE_GUIDE.md for detailed code style conventions (Python, markdown, Dockerfiles, shell scripts, testing, config files, docstrings).

Use uv for everything -- never pip or raw python. Python 3.11–3.13 with modern syntax (X | Y, list[str], Self). Python 3.14+ is not supported.

Common commands: mise run test (unit tests), mise run format (auto-fix formatting + lint + copyright), mise run check (read-only local quality checks), mise run validate (pre-PR quality, lock, and CI unit checks), mise run typecheck (ty only). Always use mise tasks or the wrapper scripts in tools/ instead of running ruff or ty directly. Use uv run for Python execution. When in doubt, inspect mise tasks and pytest --markers.

The canonical uv sync command for a full GPU/dev environment is:
uv sync --frozen --extra cu129 --extra engine --group dev
Bare uv sync --frozen (without extras) installs an incomplete environment -- ty, import checks, and GPU tests will fail.

Feature branches off main. Branch names often include an issue number prefix (e.g., <author>/123-short-name).

Do ...

Files:

script/slurm/slurm_nss_matrix.sh

🧠 Learnings (1)

📚 Learning: 2026-06-03T23:06:56.798Z

Learnt from: CR
Repo: NVIDIA-NeMo/Safe-Synthesizer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-06-03T23:06:56.798Z
Learning: Use `uv run` for Python execution instead of raw `python`

Applied to files:

script/slurm/slurm_nss_matrix.sh

pre-sync venv before submitting slurm batch jobs

1e9b353

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

github-actions Bot added the area:tests label Jun 8, 2026

nina-xu added the area:dev-ex Affects build or dev experience label Jun 8, 2026

mckornfield previously approved these changes Jun 8, 2026

View reviewed changes

Comment thread script/slurm/slurm_nss_matrix.sh

mckornfield changed the title ~~pre-sync venv before submitting slurm batch jobs~~ fix: pre-sync venv before submitting slurm batch jobs Jun 8, 2026

mckornfield changed the title ~~fix: pre-sync venv before submitting slurm batch jobs~~ chore: pre-sync venv before submitting slurm batch jobs Jun 8, 2026

limit vu download concurrency

f0a94d3

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

nina-xu dismissed mckornfield’s stale review via f0a94d3 June 8, 2026 16:55

nina-xu marked this pull request as ready for review June 8, 2026 17:09

nina-xu requested a review from a team as a code owner June 8, 2026 17:09

nina-xu requested a review from mckornfield June 8, 2026 17:10

coderabbitai Bot added docs Documentation-only change chore Maintenance not tied to a user-visible change labels Jun 8, 2026

greptile-apps Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread script/slurm/submit_slurm_jobs.sh

Comment thread script/slurm/slurm_nss_matrix.sh Outdated

mckornfield reviewed Jun 8, 2026

View reviewed changes

Comment thread script/slurm/env_variables.sh

mckornfield previously approved these changes Jun 8, 2026

View reviewed changes

greptile feedback

69eaba3

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

nina-xu dismissed mckornfield’s stale review via 69eaba3 June 8, 2026 18:33

nina-xu requested a review from mckornfield June 8, 2026 18:37

shorten comments

aeeb8c6

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>

coderabbitai Bot removed docs Documentation-only change chore Maintenance not tied to a user-visible change labels Jun 8, 2026

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread script/slurm/slurm_nss_matrix.sh

mckornfield approved these changes Jun 8, 2026

View reviewed changes

nina-xu added this pull request to the merge queue Jun 8, 2026

Merged via the queue into main with commit e470749 Jun 8, 2026
18 checks passed

nina-xu deleted the nina-xu/slurm-pre-sync branch June 8, 2026 19:24

nina-xu mentioned this pull request Jun 9, 2026

fix: slurm venv functional check #580

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

chore: pre-sync venv before submitting slurm batch jobs#573

chore: pre-sync venv before submitting slurm batch jobs#573
nina-xu merged 4 commits into
mainfrom
nina-xu/slurm-pre-sync

nina-xu commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

mckornfield left a comment

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 8, 2026 •

edited

Loading

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

AGENTS.md

Skills

Repo Conventions

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nina-xu commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Notes

Test plan

Other Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

mckornfield left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

AGENTS.md

Skills

Repo Conventions

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nina-xu commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

greptile-apps Bot commented Jun 8, 2026 •

edited

Loading