Skip to content

chore: pre-sync venv before submitting slurm batch jobs#573

Merged
nina-xu merged 4 commits into
mainfrom
nina-xu/slurm-pre-sync
Jun 8, 2026
Merged

chore: pre-sync venv before submitting slurm batch jobs#573
nina-xu merged 4 commits into
mainfrom
nina-xu/slurm-pre-sync

Conversation

@nina-xu

@nina-xu nina-xu commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

In repo mode (no --nss-version), ${NSS_DIR}/.venv is bind-mounted into every
job container from Lustre. Each array task running uv sync means dozens of jobs
create/recreate the same directory at once — and uv recreates the venv whenever
the pinned Python changes — which could corrupt the shared venv on Lustre and fails
jobs before they start.

This moves the sync to a single login-node build at submit time; array tasks only
activate the venv.

Changes

  • submit_slurm_jobs.sh: build the repo venv once on the login node
    (uv sync --frozen ... at the pinned Python, flock-guarded) before submitting
    the array. Skipped on --dry-run and in PyPI mode.
  • slurm_nss_matrix.sh: repo mode now only activates ${NSS_DIR}/.venv and
    runs the venv binary directly (no per-task uv sync, no implicit uv run
    re-sync). Fails fast with a clear message if the venv is missing.
  • env_variables.sh: resolve NSS_PYTHON_VERSION from .python-version
    (override-able, falls back to 3.13) so Slurm venvs track Python bumps. This is
    now the single source of truth for the version.
  • README.md: update NSS_PYTHON_VERSION docs.

Notes

  • PyPI mode (--nss-version) behavior is unchanged.
  • Resulting venv contents are identical to before (frozen lockfile) — only the
    build moves from N racy per-task syncs to one safe pre-build.

Test plan

  • Full matrix submit (e.g. --max-concurrent-slurm-jobs 50) completes without
    .venv stale-handle / "failed to remove" errors.
  • --dry-run submit skips the pre-build.
  • PyPI mode (--nss-version) still builds its versioned venv.

This is what happened when I submitted a batch job from an outdated venv:

bash submit_slurm_jobs.sh --runs 5 --exp-name nss_weekly_run_2026-06-08 --dataset-group short --max-concurrent-slurm-jobs 50 
[submit] pre-building repo venv at /lustre/fsw/portfolios/nemotron/projects/nemotron_data_dev/users/ninaxu/Safe-Synthesizer/.venv (python=3.13)
Using CPython 3.13.12
Removed virtual environment at: .venv
Creating virtual environment at: .venv
      Built nemo-safe-synthesizer @ file:///lustre/fs11/portfolios/nemotron/projects/nemotron_data
Prepared 1 package in 6.69s
Installed 335 packages in 2m 45s
 + accelerate==1.12.0
 + aiohappyeyeballs==2.6.1
….

Other Notes

  • Closes #

Summary by CodeRabbit

  • Documentation

    • Clarified Python-version selection behavior and added guidance to run job submissions one at a time to avoid concurrent environment build corruption.
  • Improvements

    • Python version now defaults to the repository's pinned version with an automatic fallback.
    • Added a cap on concurrent package downloads for greater stability.
    • Shared virtual environment is prebuilt and validated before batch runs to reduce repeated setup and failures.

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

This PR centralizes NSS_PYTHON_VERSION resolution (from ${NSS_DIR}/.python-version, fallback 3.13), adds UV_CONCURRENT_DOWNLOADS default, pre-builds a shared repo ${NSS_DIR}/.venv once on the login node under an flock lock (skipped in PyPI/dry-run), validates the venv, and updates job tasks to use the pre-built venv’s safe-synthesizer instead of per-task uv runs.

Changes

Venv Pre-build for Repo Mode

Layer / File(s) Summary
Environment variable resolution and setup
script/slurm/env_variables.sh, script/slurm/slurm_nss_matrix.sh, script/slurm/README.md
env_variables.sh adds UV_CONCURRENT_DOWNLOADS (default 20) and resolves NSS_PYTHON_VERSION from ${NSS_DIR}/.python-version (whitespace-stripped) with 3.13 fallback; slurm_nss_matrix.sh removes its local fallback and expects the variable to be set; README updated to document the resolution and PyPI-mode venv naming.
Login-node venv pre-build with locking and validation
script/slurm/submit_slurm_jobs.sh, script/slurm/slurm_nss_matrix.sh, script/slurm/README.md
Adds prebuild_repo_venv() that serializes building ${NSS_DIR}/.venv with flock under ${LUSTRE_DIR}/.locks, skips in PyPI mode or --dry-run, validates via ${NSS_DIR}/.venv/bin/safe-synthesizer --version, and requires job tasks to activate/use the pre-built venv instead of running uv sync/uv run per-task. README adds admonition to run submits one at a time to avoid concurrent venv corruption.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • kendrickb-nvidia
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: moving venv pre-build from individual Slurm tasks to a single pre-sync before job submission, which is the core objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch nina-xu/slurm-pre-sync

Comment @coderabbitai help to get the list of available commands and usage tips.

@nina-xu nina-xu added the area:dev-ex Affects build or dev experience label Jun 8, 2026
mckornfield
mckornfield previously approved these changes Jun 8, 2026

@mckornfield mckornfield left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only worry is that the login node has no memory/cpu, but I think this approach is better

Comment thread script/slurm/slurm_nss_matrix.sh
@mckornfield mckornfield changed the title pre-sync venv before submitting slurm batch jobs fix: pre-sync venv before submitting slurm batch jobs Jun 8, 2026
@mckornfield mckornfield changed the title fix: pre-sync venv before submitting slurm batch jobs chore: pre-sync venv before submitting slurm batch jobs Jun 8, 2026
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu marked this pull request as ready for review June 8, 2026 17:09
@nina-xu nina-xu requested a review from a team as a code owner June 8, 2026 17:09
@nina-xu nina-xu requested a review from mckornfield June 8, 2026 17:10
@coderabbitai coderabbitai Bot added docs Documentation-only change chore Maintenance not tied to a user-visible change labels Jun 8, 2026
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR moves the uv sync venv build from per-task execution inside each Slurm array job to a single login-node pre-build before submission, eliminating the race condition that corrupted the shared Lustre .venv when dozens of tasks tried to recreate it concurrently.

  • submit_slurm_jobs.sh: adds prebuild_repo_venv(), which runs uv sync --frozen under flock and then functionally verifies the venv with safe-synthesizer --version; skipped in dry-run and PyPI mode.
  • slurm_nss_matrix.sh: repo-mode tasks now only activate the pre-built venv and invoke the binary directly; a fast -x guard fails the task immediately if the venv is missing.
  • env_variables.sh: NSS_PYTHON_VERSION is now derived from .python-version (with a 3.13 fallback) and UV_CONCURRENT_DOWNLOADS is capped at 20 to limit login-node memory during the pre-build.

Confidence Score: 5/5

Safe to merge — the change moves a racy multi-task uv sync to a single guarded login-node pre-build and adds a fast fail-safe in each array task.

The logic is straightforward: one flock-guarded uv sync on the login node replaces N concurrent syncs inside containers. Error paths (missing venv, failed sync) are handled with clear diagnostics. UV env vars are propagated via --export=ALL so job tasks see the correct values. The known cross-node flock limitation is now documented in the README. No functional regressions in PyPI mode.

No files require special attention.

Important Files Changed

Filename Overview
script/slurm/submit_slurm_jobs.sh Adds prebuild_repo_venv() to run uv sync once on the login node before sbatch submission; skipped in dry-run and PyPI mode; flock-guarded on Lustre (same-node only, documented limitation in README).
script/slurm/slurm_nss_matrix.sh Removes per-task uv sync in repo mode; adds a fast -x guard on the pre-built safe-synthesizer binary and activates the venv directly; NSS_PYTHON_VERSION fallback removed (relies on env_variables.sh sourced at submit time).
script/slurm/env_variables.sh NSS_PYTHON_VERSION now reads from .python-version (stripping whitespace) with a 3.13 fallback; UV_CONCURRENT_DOWNLOADS capped at 20 to reduce login-node memory pressure during pre-build.
script/slurm/README.md Docs updated to reflect new NSS_PYTHON_VERSION default behavior and adds a warning against running multiple concurrent submits.

Sequence Diagram

sequenceDiagram
    participant U as User (login node)
    participant S as submit_slurm_jobs.sh
    participant E as env_variables.sh
    participant F as flock / uv sync
    participant SL as Slurm Scheduler
    participant T as Array Task (slurm_nss_matrix.sh)

    U->>S: bash submit_slurm_jobs.sh --runs N ...
    S->>E: source env_variables.sh
    E-->>S: NSS_PYTHON_VERSION (from .python-version or 3.13)
    S->>F: prebuild_repo_venv()
    Note over F: flock -x on Lustre lock file
    F->>F: uv sync --frozen --python NSS_PYTHON_VERSION
    F->>F: safe-synthesizer --version (functional verify)
    F-->>S: venv ready
    S->>SL: "sbatch --export=ALL (array 0..N-1)"
    SL-->>U: array job ID

    loop Each array task
        SL->>T: launch task i
        T->>T: source uv env
        alt repo mode
            T->>T: check -x .venv/bin/safe-synthesizer
            T->>T: source .venv/bin/activate
            T->>T: "NSS_RUN_CMD = .venv/bin/safe-synthesizer"
        else PyPI mode (--nss-version)
            T->>T: uv venv + uv pip install (unchanged)
        end
        T->>T: safe-synthesizer run ...
    end
Loading

Reviews (3): Last reviewed commit: "shorten comments" | Re-trigger Greptile

Comment thread script/slurm/submit_slurm_jobs.sh
Comment thread script/slurm/slurm_nss_matrix.sh Outdated
Comment thread script/slurm/env_variables.sh
mckornfield
mckornfield previously approved these changes Jun 8, 2026
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@coderabbitai coderabbitai Bot removed docs Documentation-only change chore Maintenance not tied to a user-visible change labels Jun 8, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
script/slurm/slurm_nss_matrix.sh (1)

110-116: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate NSS_PYTHON_VERSION is set before use.

NSS_PYTHON_VERSION is used in PyPI mode (line 116) but never validated. With set -u, the script will crash with a generic "unbound variable" error if env_variables.sh wasn't sourced. Other required variables (PACKED_DATASETS, NSS_DIR, LUSTRE_DIR, etc.) are validated with helpful messages at lines 21-63. Add a similar check for NSS_PYTHON_VERSION to provide a clear error when running standalone without sourcing env_variables.sh.

🛡️ Proposed validation

Add after line 63 or before the PyPI mode block:

+if [ -z "${NSS_PYTHON_VERSION:-}" ]; then
+    echo "NSS_PYTHON_VERSION must be set" >&2
+    echo "Run script through submit_slurm_jobs.sh, or if running manually source env_variables.sh before running" >&2
+    exit 1
+fi
🧹 Nitpick comments (1)
script/slurm/slurm_nss_matrix.sh (1)

135-135: 💤 Low value

Quote variable for consistency.

Line 135 uses an unquoted ${NSS_DIR} in the log message. As per coding guidelines, shell variables should always be quoted. While echo is generally safe, quoting ensures consistency and guards against paths with spaces.

♻️ Proposed change
-    echo "[NSS SLURM] Using pre-built repo venv at ${NSS_DIR}/.venv"
+    echo "[NSS SLURM] Using pre-built repo venv at \"${NSS_DIR}/.venv\""

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb436ac1-fff6-41f2-a741-987e43fbb387

📥 Commits

Reviewing files that changed from the base of the PR and between f0a94d3 and aeeb8c6.

📒 Files selected for processing (3)
  • script/slurm/README.md
  • script/slurm/slurm_nss_matrix.sh
  • script/slurm/submit_slurm_jobs.sh
✅ Files skipped from review due to trivial changes (1)
  • script/slurm/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • script/slurm/submit_slurm_jobs.sh
📜 Review details
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{sh,bash}

📄 CodeRabbit inference engine (AGENTS.md)

Never use ~ inside double-quoted strings in shell scripts -- use $HOME or an absolute path instead

Files:

  • script/slurm/slurm_nss_matrix.sh
**/*.sh

📄 CodeRabbit inference engine (STYLE_GUIDE.md)

**/*.sh: Use shebang #!/usr/bin/env bash (not #!/bin/bash) in shell scripts
Use set -eu as minimum safety floor in shell scripts; use set -euo pipefail unless pipefail breaks piped-grep patterns
Use snake_case for shell functions, _ prefix for internal helpers in shell scripts
Always quote variables in shell scripts ("$VAR", "${VAR}"); use defaults via ${VAR:-default}; use readonly for non-changing variables
Detect repo root in shell scripts using REPO_ROOT=${REPO_ROOT:-$(git rev-parse --show-toplevel)}
Use shellcheck to lint shell scripts. When disabling a check, add # shellcheck disable=SCXXXX with a brief reason.

Shell scripts should be linted and formatted using the code quality tools. Run mise run format on .sh files to apply consistent formatting

Files:

  • script/slurm/slurm_nss_matrix.sh

⚙️ CodeRabbit configuration file

Review shell scripts for #!/usr/bin/env bash, set -euo pipefail where appropriate, quoting, repo root detection, and shellcheck compliance.

Files:

  • script/slurm/slurm_nss_matrix.sh
**/*.{py,sh,yaml,yml,md}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

All source files (.py, .sh, .yaml, .yml, .md) must include SPDX copyright headers. Use mise run format to add them automatically

Files:

  • script/slurm/slurm_nss_matrix.sh
**/*

⚙️ CodeRabbit configuration file

**/*: Review as a senior maintainer for NeMo Safe Synthesizer. Prioritize issues that can change behavior, break user workflows, weaken privacy guarantees, hide failures, make tests unreliable, or create maintenance risk. Avoid generic style commentary unless it points to a concrete project convention that automated tools will not catch.
Comment only when the finding is actionable and tied to changed code. For each finding, state the impact, the condition that triggers it, and the smallest practical fix. Prefer one precise comment over broad advice. Do not ask for refactors outside the PR scope unless the changed code creates the problem.
Review type guidance: - Potential issue: use for correctness bugs, data loss, privacy leaks,
security risks, broken public APIs, invalid config behavior, missing
validation, hidden failures, nondeterministic tests, or CI breakage.

  • Refactor suggestion: use for local maintainability problems introduced
    by the diff when they have clear future cost, such as duplicated setup,
    unclear boundaries, over-mocking, avoidable complexity, or opaque test
    helpers.
  • Nitpick: avoid in chill mode. Do not emit formatting, import-order,
    wording, or style-only comments unless automated tools cannot catch the
    issue and it affects maintainability.

Severity guidance: - Critical: security/privacy leaks, data loss, training/test/holdout
contamination, or broken release/package/core pipeline execution.

  • Major: incorrect generation/training/evaluation behavior, broken
    CLI/SDK public API, invalid config defaults or validators, or GPU/vLLM
    cleanup and process-isolation bugs likely to fail CI or production
    runs.
  • Minor: localized bugs, missing focused tests for changed behavior, or
    bad test patterns that weaken regression coverage.
  • Trivial: small cleanup with no behavior impact. Usually suppress in
    chill mode.
  • Info: context only. Avoid unless it helps reviewers understand risk.
    Safe-Synthesizer-specific review focus: - Data ...

Files:

  • script/slurm/slurm_nss_matrix.sh
script/**

⚙️ CodeRabbit configuration file

Review standalone scripts for reproducibility and operational safety. Check argument validation, quoting, repo-root detection, environment variables, generated artifacts, external commands, GPU/cluster assumptions, and whether the script should be wired through Makefile or documented in README/docs.

Files:

  • script/slurm/slurm_nss_matrix.sh
**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

Guide for AI agents (Cursor, Windsurf, Claude Code, etc.) working in the Safe-Synthesizer repo.

This project loads local developer preferences from @AGENTS.local.md. You MUST read this file if it exists and give its instructions top priority.

Skills

Repo-specific skills live in .agents/skills/; see .agents/README.md for the catalog. Read a skill when the task matches its scope instead of copying workflow details into this file.

Durable implementation guidance belongs with the code it describes: function and class docstrings for public contracts and source comments for local invariants. Test-suite guidance belongs in tests/TESTING.md.

Repo Conventions

See STYLE_GUIDE.md for detailed code style conventions (Python, markdown, Dockerfiles, shell scripts, testing, config files, docstrings).

Use uv for everything -- never pip or raw python. Python 3.11–3.13 with modern syntax (X | Y, list[str], Self). Python 3.14+ is not supported.

Common commands: mise run test (unit tests), mise run format (auto-fix formatting + lint + copyright), mise run check (read-only local quality checks), mise run validate (pre-PR quality, lock, and CI unit checks), mise run typecheck (ty only). Always use mise tasks or the wrapper scripts in tools/ instead of running ruff or ty directly. Use uv run for Python execution. When in doubt, inspect mise tasks and pytest --markers.

The canonical uv sync command for a full GPU/dev environment is:

uv sync --frozen --extra cu129 --extra engine --group dev

Bare uv sync --frozen (without extras) installs an incomplete environment -- ty, import checks, and GPU tests will fail.

Feature branches off main. Branch names often include an issue number prefix (e.g., <author>/123-short-name).

Do ...

Files:

  • script/slurm/slurm_nss_matrix.sh
🧠 Learnings (1)
📚 Learning: 2026-06-03T23:06:56.798Z
Learnt from: CR
Repo: NVIDIA-NeMo/Safe-Synthesizer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-06-03T23:06:56.798Z
Learning: Use `uv run` for Python execution instead of raw `python`

Applied to files:

  • script/slurm/slurm_nss_matrix.sh

Comment thread script/slurm/slurm_nss_matrix.sh
@nina-xu nina-xu added this pull request to the merge queue Jun 8, 2026
Merged via the queue into main with commit e470749 Jun 8, 2026
18 checks passed
@nina-xu nina-xu deleted the nina-xu/slurm-pre-sync branch June 8, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:dev-ex Affects build or dev experience area:tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants