env probe: recover Buck-target torch native libs + capture triton/fbgemm commits (schema 1.8) #226
…emm commits (schema 1.8)

Two recoverability gaps surfaced while confirming the probe against a Buck torch target (fbcode//caffe2:torch), where torch imports in-process but its Python package is a link-tree and the C++ runtime is dlopen'd from a build-artifact dir:

- libtorch_hip.so was located only at <torch.__file__>/../lib, which does not exist in a Buck par layout -> composable_kernel.pytorch_bundled, aotriton.*, and pytorch_build.binary_introspection came back null. Add a /proc/self/maps fallback (_loaded_lib_path_from_maps / _torch_native_lib_dir) that recovers the real lib dir from the process's mapped libraries, since torch is already imported. Original on-disk path + reason strings are kept as the fallback-of-last-resort, so existing diagnostics are unchanged when nothing is locatable. Linux-only, fail-soft.
- Generalize aiter's +g<sha> commit parse into _extract_commit_from_version / _capture_python_package_commit and add a `commit` field to the triton and fbgemm blocks (best-effort: setuptools_scm +g<sha>, ROCm fork .git<sha>, or a git_version/__commit__ module attr; null when no SHA, e.g. "3.5.0+fb").

Schema 1.7 -> 1.8 (additive nested keys only; from_dict still round-trips pre-1.8 snapshots). Adds unit tests for commit extraction and the maps-based lib recovery; updates docs/env-probe.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
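For readers unfamiliar with the technique, the `/proc/self/maps` fallback can be sketched roughly as follows. This is an illustrative sketch only, not the PR's code: the names `find_mapped_lib` and `native_lib_dir` are hypothetical stand-ins for `_loaded_lib_path_from_maps` / `_torch_native_lib_dir`, and the prefix-match on the basename is an assumption about how a library would be identified.

```python
import os
from typing import Iterable, Optional


def find_mapped_lib(lib_name: str, maps_lines: Iterable[str]) -> Optional[str]:
    """Return the path of the first file-backed mapping whose basename starts with lib_name."""
    for line in maps_lines:
        # Each line reads: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no pathname column
        path = fields[5].strip()
        if path.endswith(" (deleted)"):
            path = path[: -len(" (deleted)")]
        # Skip pseudo-paths such as [heap] or [stack]; keep real files only.
        if path.startswith("/") and os.path.basename(path).startswith(lib_name):
            return path
    return None


def native_lib_dir(lib_name: str = "libtorch_hip.so") -> Optional[str]:
    """Fail-soft lookup: None on hosts without /proc or when the library is not mapped."""
    try:
        with open("/proc/self/maps", "r", encoding="utf-8", errors="replace") as fh:
            path = find_mapped_lib(lib_name, fh)
    except OSError:
        return None
    return os.path.dirname(path) if path else None
```

Separating the parser from the file read keeps the parsing logic testable with a fake maps listing, which is in the spirit of the "fake mapped lib" unit test mentioned in the test plan.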
Pull request overview
Improves aorta env probe recoverability and metadata capture when running against Buck-target PyTorch builds by (1) finding torch native libraries via /proc/self/maps when <torch>/lib is absent, and (2) capturing best-effort git commits for triton and fbgemm_gpu into the env snapshot (schema bump 1.7 → 1.8).
Changes:
- Bump env snapshot schema version to 1.8 and add nested `commit` fields under `triton` and `fbgemm`.
- Add `/proc/self/maps` fallback to locate torch native shared libraries for Buck/monorepo layouts.
- Update unit tests and documentation to reflect the new schema and recovery behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/aorta/instrumentation/environment.py` | Adds commit extraction helpers, adds `triton.commit`/`fbgemm.commit`, and introduces `/proc/self/maps` fallback for torch native lib discovery. |
| `tests/instrumentation/test_environment.py` | Updates schema pin and assertions; adds new tests for commit parsing and Buck torch native-lib recovery. |
| `docs/env-probe.md` | Documents schema 1.8 additions and the Buck/monorepo native-lib recovery behavior. |
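To illustrate why an additive nested key keeps older snapshots loadable, here is a minimal sketch. The `PackageInfo` dataclass is a hypothetical stand-in for a nested block such as `triton` or `fbgemm`; the real `EnvSnapshot` structure is not shown in this PR page and may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class PackageInfo:
    """Hypothetical nested snapshot block; field names are assumptions."""

    version: Optional[str] = None
    commit: Optional[str] = None  # new in 1.8; absent from older snapshots

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PackageInfo":
        # .get() tolerates the missing key, so a pre-1.8 dict still loads.
        return cls(version=data.get("version"), commit=data.get("commit"))
```

Because the new key defaults to `None` when absent, a pre-1.8 dict and a 1.8 dict with `"commit": null` load to the same object.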
Live-validated on a gfx950 node against a real ROCm
So the previously-null |
…cs to 1.8)
- _capture_python_package_commit no longer returns an arbitrary module
attribute value: a commit attr (git_version/__commit__/...) is accepted
only if it embeds or is a 7-40 char hex SHA, via the new
_commit_from_attr_value helper. Non-SHA values ("unknown", "dirty", a
tag) now yield None so the `commit` field is always a real SHA or null.
- docs/env-probe.md: schema_version row + sample output bumped 1.7 -> 1.8,
and added the 1.8 changelog entry (demoting 1.7 from "current").
- Tests: reject non-SHA attrs, accept/lowercase a bare full SHA, and a
unit test for _commit_from_attr_value.
Co-authored-by: Cursor <cursoragent@cursor.com>
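The "real SHA or null" rule from this commit could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's `_commit_from_attr_value`: the function name `sha_from_attr_value` and the exact regular expression are hypothetical. One known limitation of this particular pattern is that any standalone 7-40 character hex-only token (for example an all-digit date stamp) would also be accepted.

```python
import re
from typing import Optional

# A run of 7-40 hex characters that is not part of a longer alphanumeric token.
_SHA_RE = re.compile(r"(?<![0-9A-Za-z])[0-9a-fA-F]{7,40}(?![0-9A-Za-z])")


def sha_from_attr_value(value: object) -> Optional[str]:
    """Return a lowercased SHA embedded in value, or None if there is none."""
    if not isinstance(value, str):
        return None
    match = _SHA_RE.search(value)
    return match.group(0).lower() if match else None
```

Under this sketch, values such as `"unknown"`, `"dirty"`, or a tag like `"v2.1.0"` yield `None`, while a bare full SHA is returned lowercased.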
Summary
Two recoverability gaps surfaced while confirming `aorta env probe` against a Buck torch target (`fbcode//caffe2:torch`), where `import torch` succeeds in-process but the Python package is a link-tree and the C++ runtime is `dlopen`'d from a build-artifact dir:

- `libtorch_hip.so` was located only at `<torch.__file__>/../lib`, which does not exist in a Buck par layout, so `composable_kernel.pytorch_bundled`, `aotriton.*`, and `pytorch_build.binary_introspection.*` came back `null`. Added a `/proc/self/maps` fallback (`_loaded_lib_path_from_maps` / `_torch_native_lib_dir`) that recovers the real lib directory from the process's mapped libraries (torch is already imported). The original on-disk path and reason strings are preserved as the last-resort fallback, so existing "missing lib" diagnostics are unchanged when nothing is locatable. Linux-only, fail-soft.
- Generalized `aiter`'s `+g<sha>` parse into `_extract_commit_from_version` / `_capture_python_package_commit` and added a `commit` field to the `triton` and `fbgemm` blocks. Best-effort: setuptools_scm `+g<sha>`, ROCm fork `.git<sha>`, or a `git_version` / `__commit__` module attr; `null` when no SHA (e.g. `"3.5.0+fb"`).

Schema 1.7 → 1.8, additive nested keys only: `EnvSnapshot.from_dict` still round-trips pre-1.8 snapshots. Updates `docs/env-probe.md`.

Test plan

- `tests/instrumentation/test_environment.py`: added `TestPackageCommitExtraction` and `TestTorchNativeLibDir` (incl. a test proving the symbol dump recovers via `/proc/self/maps` for a Buck-style torch with no `<torch>/lib`); updated the 3 existing triton/fbgemm assertions + schema-version pin.
- … `test_partial_false_on_clean_full_probe`, which requires `triton` installed in the venv and fails identically on `main`).
- … `libtorch_hip.so` on a gfx950 Buck torch target (unit-tested with a fake mapped lib; not yet exercised against a live HIP build).
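The version-string half of the commit capture described in the summary can be sketched as below. This is a hedged illustration rather than the PR's `_extract_commit_from_version`: the name `commit_from_version` and the regular expression are assumptions, covering only the two suffix styles the summary names (`+g<sha>` and `.git<sha>`).

```python
import re
from typing import Optional

# "+g<sha>" (setuptools_scm local version) or ".git<sha>" (fork-style suffix).
_VERSION_SHA_RE = re.compile(r"(?:\+g|\.git)([0-9a-fA-F]{7,40})(?![0-9a-fA-F])")


def commit_from_version(version: Optional[str]) -> Optional[str]:
    """Return the lowercased SHA embedded in a version string, or None."""
    if not version:
        return None
    match = _VERSION_SHA_RE.search(version)
    return match.group(1).lower() if match else None
```

A version such as `"3.5.0+fb"` carries no SHA, so the sketch returns `None` for it, matching the `null` behavior described in the summary.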