env probe: recover Buck-target torch native libs + capture triton/fbgemm commits (schema 1.8) #226

Open
vivekkhandelwal1 wants to merge 2 commits into main from env-probe/schema-1.8-buck-libtorch-recovery-and-pkg-commits

@vivekkhandelwal1

Collaborator

Summary

Two recoverability gaps surfaced while confirming aorta env probe against a Buck torch target (fbcode//caffe2:torch), where import torch succeeds in-process but the Python package is a link-tree and the C++ runtime is dlopen'd from a build-artifact dir:

  • Buck/monorepo native-lib recovery. libtorch_hip.so was located only at <torch.__file__>/../lib, which does not exist in a Buck par layout, so composable_kernel.pytorch_bundled, aotriton.*, and pytorch_build.binary_introspection.* came back null. Added a /proc/self/maps fallback (_loaded_lib_path_from_maps / _torch_native_lib_dir) that recovers the real lib directory from the process's mapped libraries (torch is already imported). The original on-disk path and reason strings are preserved as the last-resort fallback, so existing "missing lib" diagnostics are unchanged when nothing is locatable. Linux-only, fail-soft.
  • Package commit capture. Generalized aiter's +g<sha> parse into _extract_commit_from_version / _capture_python_package_commit and added a commit field to the triton and fbgemm blocks. Best-effort: setuptools_scm +g<sha>, ROCm fork .git<sha>, or a git_version/__commit__ module attr; null when no SHA (e.g. "3.5.0+fb").
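The `/proc/self/maps` fallback described in the first bullet can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: the function names mirror `_loaded_lib_path_from_maps` and `_torch_native_lib_dir` without the leading underscore, and their signatures, the basename-prefix match, and the `maps_path` parameter are assumptions made for the example.

```python
import os
from typing import Iterable, Optional


def loaded_lib_path_from_maps(lib_name: str, maps_lines: Iterable[str]) -> Optional[str]:
    """Return the path of the first file-backed mapping whose basename starts with lib_name."""
    for line in maps_lines:
        # Each line is: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if path.startswith("/") and os.path.basename(path).startswith(lib_name):
            return path
    return None


def torch_native_lib_dir(
    lib_name: str = "libtorch_hip.so",
    maps_path: str = "/proc/self/maps",
) -> Optional[str]:
    """Directory of the already-loaded library, or None when it cannot be located."""
    try:
        with open(maps_path, "r", encoding="utf-8", errors="replace") as fh:
            path = loaded_lib_path_from_maps(lib_name, fh)
    except OSError:
        return None  # non-Linux host or unreadable maps file: fail soft
    return os.path.dirname(path) if path else None
```

Because the lookup only reads the process's own mappings, it works only after `import torch` has loaded the library, which matches the precondition stated above. A caller would fall back to the original `<torch.__file__>/../lib` path when this returns `None`.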

Schema 1.7 → 1.8, additive nested keys only — EnvSnapshot.from_dict still round-trips pre-1.8 snapshots. Updates docs/env-probe.md.
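To illustrate why an additive nested key keeps older snapshots loadable, here is a minimal sketch. `PackageBlock` is a hypothetical stand-in for the triton/fbgemm blocks; the real `EnvSnapshot.from_dict` may be structured differently.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class PackageBlock:
    """Hypothetical nested block; `commit` is the key added in the newer schema."""

    package_version: Optional[str] = None
    commit: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PackageBlock":
        # dict.get returns None for keys an older snapshot never wrote,
        # so a pre-1.8 payload loads without raising.
        return cls(
            package_version=data.get("package_version"),
            commit=data.get("commit"),
        )


old_payload = {"package_version": "3.5.1"}  # shape of a pre-1.8 block
block = PackageBlock.from_dict(old_payload)
print(asdict(block))
```

The design choice is that readers never require the new key, so only writers need to know about schema 1.8.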

Test plan

  • tests/instrumentation/test_environment.py — added TestPackageCommitExtraction and TestTorchNativeLibDir (incl. a test proving the symbol dump recovers via /proc/self/maps for a Buck-style torch with no <torch>/lib); updated the 3 existing triton/fbgemm assertions + schema-version pin.
  • env + probe e2e + recipe suites pass (one unrelated pre-existing failure: test_partial_false_on_clean_full_probe, which requires triton installed in the venv and fails identically on main).
  • Live validation against a real ROCm libtorch_hip.so on a gfx950 Buck torch target (unit-tested with a fake mapped lib; not yet exercised against a live HIP build).

Made with Cursor

…emm commits (schema 1.8)

Two recoverability gaps surfaced while confirming the probe against a
Buck torch target (fbcode//caffe2:torch), where torch imports in-process
but its Python package is a link-tree and the C++ runtime is dlopen'd
from a build-artifact dir:

- libtorch_hip.so was located only at <torch.__file__>/../lib, which
  does not exist in a Buck par layout -> composable_kernel.pytorch_bundled,
  aotriton.*, and pytorch_build.binary_introspection came back null. Add a
  /proc/self/maps fallback (_loaded_lib_path_from_maps / _torch_native_lib_dir)
  that recovers the real lib dir from the process's mapped libraries, since
  torch is already imported. Original on-disk path + reason strings are kept
  as the fallback-of-last-resort, so existing diagnostics are unchanged when
  nothing is locatable. Linux-only, fail-soft.

- Generalize aiter's +g<sha> commit parse into _extract_commit_from_version /
  _capture_python_package_commit and add a `commit` field to the triton and
  fbgemm blocks (best-effort: setuptools_scm +g<sha>, ROCm fork .git<sha>, or
  a git_version/__commit__ module attr; null when no SHA, e.g. "3.5.0+fb").

Schema 1.7 -> 1.8 (additive nested keys only; from_dict still round-trips
pre-1.8 snapshots). Adds unit tests for commit extraction and the maps-based
lib recovery; updates docs/env-probe.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
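The version-string parsing in the second bullet of this commit message might look like the sketch below. The two regular expressions, the 7 to 40 character bound, and the function name are assumptions for illustration; the PR's `_extract_commit_from_version` may differ in detail.

```python
import re
from typing import Optional

# setuptools_scm local version, e.g. "0.1.4.dev12+g1a2b3c4d"
_SCM_SHA = re.compile(r"\+g([0-9a-fA-F]{7,40})(?![0-9a-fA-F])")
# ROCm-fork style suffix, e.g. "3.2.0+git0123abcd" or "2.0.0.git0123abcd"
_GIT_SHA = re.compile(r"[.+]git([0-9a-fA-F]{7,40})(?![0-9a-fA-F])")


def extract_commit_from_version(version: Optional[str]) -> Optional[str]:
    """Return a lowercase SHA embedded in a version string, or None."""
    if not version:
        return None
    for pattern in (_SCM_SHA, _GIT_SHA):
        match = pattern.search(version)
        if match:
            return match.group(1).lower()
    return None  # e.g. "3.5.0+fb" carries a local tag but no SHA
```

Returning `None` rather than the raw local-version suffix keeps the `commit` field from reporting a value that is not a commit.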
Copilot AI review requested due to automatic review settings June 16, 2026 10:40

Copilot AI left a comment

Contributor

Pull request overview

Improves aorta env probe recoverability and metadata capture when running against Buck-target PyTorch builds by (1) finding torch native libraries via /proc/self/maps when <torch>/lib is absent, and (2) capturing best-effort git commits for triton and fbgemm_gpu into the env snapshot (schema bump 1.7 → 1.8).

Changes:

  • Bump env snapshot schema version to 1.8 and add nested commit fields under triton and fbgemm.
  • Add /proc/self/maps fallback to locate torch native shared libraries for Buck/monorepo layouts.
  • Update unit tests and documentation to reflect the new schema and recovery behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/aorta/instrumentation/environment.py | Adds commit extraction helpers, adds triton.commit/fbgemm.commit, and introduces /proc/self/maps fallback for torch native lib discovery. |
| tests/instrumentation/test_environment.py | Updates schema pin and assertions; adds new tests for commit parsing and Buck torch native-lib recovery. |
| docs/env-probe.md | Documents schema 1.8 additions and the Buck/monorepo native-lib recovery behavior. |

Comment thread src/aorta/instrumentation/environment.py
Comment thread docs/env-probe.md
@vivekkhandelwal1

Collaborator Author

Live-validated on a gfx950 node against a real ROCm libtorch_hip.so (torch 2.9.1+rocm6.4, torch.version.hip=6.4.43484-..., 371 MB lib):

  • maps locator found the loaded libtorch_hip.so from /proc/self/maps.
  • Primary path (real <torch>/lib): symbol counts populated (pytorch_flash::=4, _efficient_attention=18, aotriton::=3, mha_fwd_aot=1), torch_lib_bundled={libaotriton_v2.so: true}, aotriton bundled_version=0.11.0 with real hash — confirms the refactor doesn't regress the normal layout.
  • Fallback path (bogus <torch>/lib, lib still mapped — i.e. the Buck-target scenario): _torch_native_lib_dir recovered the real lib dir via /proc/self/maps, the nm | c++filt dump processed 30,482 demangled symbols, no "not found" reasons, and symbol counts were identical to the primary path.
  • triton commit: this ROCm wheel reports package_version=3.5.1 with no SHA → commit: null (correctly no false-positive).

So the previously-null composable_kernel.pytorch_bundled / aotriton.* / pytorch_build.binary_introspection.* fields now populate for a Buck torch target. Checking the live-validation box.
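For readers unfamiliar with the symbol dump, the counting step can be approximated as below. The pattern list is taken from the counts reported above; the `nm -D` invocation, the function names, and the substring matching are assumptions, since the probe's actual command line is not shown in this thread.

```python
import shutil
import subprocess
from typing import Dict, Iterable, List, Optional, Sequence

PATTERNS = ("pytorch_flash::", "_efficient_attention", "aotriton::", "mha_fwd_aot")


def count_symbol_patterns(
    lines: Iterable[str], patterns: Sequence[str] = PATTERNS
) -> Dict[str, int]:
    """Count demangled symbol lines that contain each pattern as a substring."""
    counts = {p: 0 for p in patterns}
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


def demangled_symbols(lib_path: str) -> Optional[List[str]]:
    """Run nm and c++filt on lib_path; return None if either tool is missing or fails."""
    if not (shutil.which("nm") and shutil.which("c++filt")):
        return None
    try:
        nm = subprocess.run(
            ["nm", "-D", lib_path], capture_output=True, text=True, check=True
        )
        filt = subprocess.run(
            ["c++filt"], input=nm.stdout, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # fail soft, mirroring the probe's behavior
    return filt.stdout.splitlines()
```

Passing the directory recovered from `/proc/self/maps` instead of `<torch>/lib` changes only the `lib_path` argument, which is consistent with the identical counts observed on the primary and fallback paths.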

…cs to 1.8)

- _capture_python_package_commit no longer returns an arbitrary module
  attribute value: a commit attr (git_version/__commit__/...) is accepted
  only if it embeds or is a 7-40 char hex SHA, via the new
  _commit_from_attr_value helper. Non-SHA values ("unknown", "dirty", a
  tag) now yield None so the `commit` field is always a real SHA or null.
- docs/env-probe.md: schema_version row + sample output bumped 1.7 -> 1.8,
  and added the 1.8 changelog entry (demoting 1.7 from "current").
- Tests: reject non-SHA attrs, accept/lowercase a bare full SHA, and a
  unit test for _commit_from_attr_value.

Co-authored-by: Cursor <cursoragent@cursor.com>
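The SHA gate described in this commit could be sketched like this. The regular expression and function name are assumptions modeled on the description of `_commit_from_attr_value` (accept a value only if it is or embeds a 7 to 40 character hex run), not a copy of the PR's helper.

```python
import re
from typing import Any, Optional

# A run of 7-40 hex digits that is not part of a longer hex run.
_HEX_SHA = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{7,40}(?![0-9a-fA-F])")


def commit_from_attr_value(value: Any) -> Optional[str]:
    """Return a lowercase SHA found in a module attribute value, else None."""
    if not isinstance(value, str):
        return None
    match = _HEX_SHA.search(value.strip())
    return match.group(0).lower() if match else None
```

A purely syntactic check like this one cannot distinguish a SHA from other hex-looking text such as an eight-digit date, so a stricter variant might additionally require the attribute name to be commit-specific.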

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
