[telemetry] Detect Python package manager(s) at project setup#1918
Conversation
Why: We need first-party data on which Python package manager(s) our users' projects actually use (pip/conda/uv/poetry) to prioritize VPEX setup-flow investment, replacing public-survey estimates. Measurement only -- no setup behavior changes. What: - Add packageManagerDetection.ts: a pure, signal-based classifier that reports all applicable managers plus a best-guess primary (uv > poetry > conda > pip), the firing signals, hasLockfile, and interpreter source. Treats bare uv/poetry on PATH as weak signals. - Add Events.PYTHON_ENV_SETUP_DETECTED with a typed, documented schema in telemetry/constants.ts (reuses existing Telemetry client; opt-out honored; categorical data only, no paths/package/cluster names). - Add telemetry/packageManagerExtensions.ts: the emit half, layered onto the Telemetry class via the commandExtensions declare-module pattern (recordPackageManagerDetection). Keeps disk/Python-extension deps out of the Telemetry client. - Add PackageManagerTelemetry.ts: the collection half -- a best-effort, non-blocking collector (disk + already-resolved interpreter metadata) that gathers signals, runs the pure classifier, and calls the emit method. Deduplicated per session on (trigger, projectRoot); failures degrade to unknown and are swallowed. - Wire emission into three touchpoints: project-open env check (auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug). - Add unit tests for the detector and pure helpers, and a dashboard-owner handoff note. Detection correctness: - interpreterSource is derived from the active interpreter alone, never from project files: a uv.lock project on a conda/venv/system interpreter reports that interpreter's real source, keeping the setup-flow gap visible. A genuinely uv-provisioned venv is identified by the `uv =` marker in pyvenv.cfg (pure pyvenvCfgMarksUv), not by uv.lock. - conda is attributed only when the active interpreter resides under CONDA_PREFIX (pure interpreterUnderCondaPrefix, with a path-boundary check), not on the bare env var, which is session-global in the extension host (launching from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects. - pyproject [tool.uv]/[tool.poetry] detection uses a pure, bounded table-header scan (pyprojectHasToolSection) instead of substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn), and matches subtable and array-of-table headers (e.g. [tool.uv.sources], [[tool.poetry.source]]) that the substring check missed. - No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal); detection now only reads disk and already-resolved interpreter metadata. Verification: - yarn run build (typecheck) passes. - eslint clean; prettier formatted. - yarn run test:unit: 228 passing, 0 failing (includes detector + helper tests). Co-authored-by: Isaac
c236f29 to
be9f174
Compare
|
Integration tests: https://github.com/databricks-eng/eng-dev-ecosystem/actions/runs/27942293534 |
| /** A Python package/environment manager detected for a project. */ | ||
| export type PackageManagerName = "uv" | "poetry" | "pip" | "conda"; | ||
| /** Best-guess primary manager, or "unknown" when no signal fires. */ | ||
| export type PrimaryManagerName = PackageManagerName | "unknown"; | ||
| /** How the active interpreter was provisioned. */ | ||
| export type InterpreterSource = "uv" | "conda" | "system" | "venv" | "unknown"; |
There was a problem hiding this comment.
It looks to be duplicated from packageManagerDetection.ts, can we re-use same types?
There was a problem hiding this comment.
Done in ab23a28 — constants.ts now imports PackageManager/PrimaryManager/InterpreterSource (type-only) from the pure packageManagerDetection module as the single source of truth; the duplicate unions and the as casts are gone, so the compiler enforces schema/classifier alignment.
| * free-form content (see {@link detectPackageManagers}). Telemetry opt-out is | ||
| * honoured by the underlying {@link Telemetry} client. | ||
| */ | ||
| export class PackageManagerTelemetry { |
There was a problem hiding this comment.
No tests for PackageManagerTelemetry?
There was a problem hiding this comment.
Added in ab23a28 — PackageManagerTelemetry.test.ts covering the emit payload, per-(trigger, project) dedupe, the disconnected guard (incl. not consuming the dedupe slot), the telemetry-disabled short-circuit, and the requirements-suffix cases.
| prefix.startsWith(base + "/") || | ||
| prefix.startsWith(base + "\\") |
There was a problem hiding this comment.
This assumes filesystem is case sensitive, on Windows case C:\Some and C:\some returning false, but they denote same folder
There was a problem hiding this comment.
Fixed in ab23a28 — interpreterUnderCondaPrefix now folds case via a caseInsensitive param defaulting to process.platform === "win32", so Windows paths differing only in case match. Added tests.
| try { | ||
| return fs | ||
| .readdirSync(projectRoot) | ||
| .some((name) => /^requirements.*\.txt$/.test(name)); |
There was a problem hiding this comment.
This regex matches "requirementswhatever.txt" as well, was it intentional?
There was a problem hiding this comment.
Fixed in ab23a28 — tightened to /^requirements([-_.].+)?\.txt$/, so requirements.txt / requirements-dev.txt / requirements_test.txt match but requirementsfoo.txt does not. Covered by tests.
anton-107
left a comment
There was a problem hiding this comment.
A few follow-up comments from a closer pass (correctness/maintainability + one telemetry-gating suggestion). Nice structure overall — the pure/emit/collect split and the privacy posture of the payload are good. None of these are blockers; the casts (1) and the opt-out gating (5) are the two I'd most want addressed.
| context: PackageManagerDetectionContext | ||
| ) { | ||
| this.recordEvent(Events.PYTHON_ENV_SETUP_DETECTED, { | ||
| managersDetected: detection.managers as PackageManagerName[], |
There was a problem hiding this comment.
Unsafe casts hide schema divergence. detection.managers as PackageManagerName[] (and interpreterSource as InterpreterSource on L46) bridge two parallel union definitions: PackageManager/InterpreterSource/PrimaryManager in packageManagerDetection.ts vs PackageManagerName/InterpreterSource/PrimaryManagerName in constants.ts. These as casts defeat the one check that matters here — if someone later adds a manager to one union but not the other, it compiles silently and ships a mismatched event schema. Suggest a single source of truth: import the detection types into constants.ts (or re-export from one place) and drop the casts so the compiler enforces alignment.
There was a problem hiding this comment.
Done in ab23a28 — unified the unions (single source of truth in packageManagerDetection.ts, imported type-only into constants.ts) and removed both as casts here, so a divergence now fails to compile.
| override async check() { | ||
| // First environment check on project open: emit package-manager | ||
| // detection telemetry (deduplicated per session, never throws). | ||
| void this.packageManagerTelemetry.emitDetection("auto_open"); |
There was a problem hiding this comment.
targetCompute is nondeterministic for auto_open. This fires before await this.connectionManager.waitForConnect() on the next line. emitDetection awaits resolveEnvironment() and then reads connectionManager.cluster/.serverless via getComputeType(), so whether the connection has resolved by then is a race against the parallel waitForConnect(). Result: targetCompute for auto_open will usually be none but sometimes the real value — dirty data on the trigger that fires most. Either move the emit after waitForConnect() to make it deterministic, or accept+document that auto_open always reports none for compute.
There was a problem hiding this comment.
Fixed in 34693f5 — moved the emit to after await waitForConnect() (which resolves only on CONNECTED) and added an isConnected() guard. targetCompute is now deterministic for auto_open, and unauthenticated sessions are no longer reported at all.
| * A `pyproject.toml` exists but declares neither `[tool.uv]` nor | ||
| * `[tool.poetry]` (i.e. a plain PEP 621 / pip-installable project). | ||
| */ | ||
| hasPyprojectPipOnly?: boolean; |
There was a problem hiding this comment.
hasPyprojectPipOnly can misattribute uv projects as pip. uv works with a bare [project] table and no [tool.uv] section, so a uv project without a committed uv.lock classifies as pip. In the common case (uv.lock present) uv still fires and wins primary, so impact is limited to lockfile-less uv projects — but since the whole point is sizing uv adoption accurately, worth a one-line caveat here (or in the handoff doc) so the analytics owner knows pyproject.pipOnly slightly over-counts pip / under-counts uv.
There was a problem hiding this comment.
Documented in ab23a28 — caveat added on the hasPyprojectPipOnly field and a "Known measurement caveats" section in the handoff doc, noting pip is slightly over-counted / uv under-counted for lockfile-less uv projects.
There was a problem hiding this comment.
Follow-up: in b05df5d this is now fixed in code, not just documented. pyproject.toml is attributed to pip only when it declares an actual packaging table ([project]/[build-system]) and no uv/poetry section — a file carrying only tool config like [tool.ruff] is no longer counted as pip. (Found independently by a Codex review pass too.) The lockfile-less-uv caveat in the handoff doc still stands.
| * Pure over its inputs; returns false if either is missing. Accepts both `/` | ||
| * and `\\` separators so it is platform-agnostic and deterministic in tests. | ||
| */ | ||
| export function interpreterUnderCondaPrefix( |
There was a problem hiding this comment.
Case-sensitivity vs the "platform-agnostic" claim. The boundary check handles / vs \ separators but compares case-sensitively, so on Windows C:\Conda\envs\ml vs c:\conda\... would not match even though Windows paths are case-insensitive. Low impact (both values usually come from the same source), but the doc comment says "platform-agnostic" which overstates it. Either lowercase both sides when on Windows, or soften the comment.
There was a problem hiding this comment.
Fixed in ab23a28 — case-insensitive comparison on Windows, and corrected the "platform-agnostic" comment to reflect it. Added tests for both case modes.
| * Deduplicated per `(project root, trigger)` within the session. Never | ||
| * throws. | ||
| */ | ||
| async emitDetection(trigger: SetupTrigger): Promise<void> { |
There was a problem hiding this comment.
Suggestion: short-circuit when telemetry isn't at all. The event payload is correctly suppressed by @vscode/extension-telemetry when the user's telemetry.telemetryLevel isn't all (regular events only send at all), so nothing leaks. But recordEvent only gates on !this.reporter, and reporter is constructed regardless of the user's setting — so for an opted-out user, collectSignals still runs fs.readdirSync(projectRoot) + fs.readFileSync on pyproject.toml/pyvenv.cfg on every auto_open/run/setup; only the send is dropped. That's wasted work, and the optics are poor for a privacy-conscious user auditing the extension (a telemetry code path reading project files despite telemetry being off). Recommend bailing out of emitDetection early when telemetry is disabled (e.g. check vscode.env.isTelemetryEnabled / the reporter level before collecting) so opted-out users get zero telemetry-driven disk access. Separately, the handoff doc says "no paths/names" but doesn't mention that every event — including this one — inherits the ambient user.hashedUserName / user.host / workspaceId envelope; worth stating so reviewers know this links toolchain choices to a stable identity.
There was a problem hiding this comment.
Done in ab23a28 — emitDetection now bails before any disk read when telemetry is disabled (new Telemetry.isTelemetryEnabled, true only at level all), so opted-out users get zero telemetry-driven file access. Also documented the ambient user.hashedUserName/host/workspaceId identity envelope in the handoff doc.
Why: The detection should describe active Databricks users' projects, not extension installs that never authenticate. Previously auto_open fired during activation, before sign-in, so unauthenticated sessions (and non-Databricks folders, which can never reach CONNECTED) were reported with target_compute=none and no user/workspace metadata. What: - Add an isConnected() accessor (connectionManager.state === "CONNECTED") and guard emitDetection on it, before the dedupe bookkeeping so a pre-connect call doesn't consume the dedupe slot and suppress the real emit once connected. - Move the auto_open emit in EnvironmentDependenciesVerifier.check() to after await waitForConnect() (which resolves only on CONNECTED), so it fires once the workspace connects. This also makes targetCompute deterministic for auto_open (was racing the parallel waitForConnect). - Document the connected-only behavior in the handoff note. Verification: - yarn run build, eslint, prettier clean; yarn run test:unit green. Co-authored-by: Isaac
- Single source of truth for the manager/interpreter unions: constants.ts now imports PackageManager/PrimaryManager/InterpreterSource from the pure packageManagerDetection module (type-only, no cycle) and the unsafe `as` casts in packageManagerExtensions.ts are removed, so the compiler enforces schema/classifier alignment. (misha, anton) - Add PackageManagerTelemetry.test.ts: emit payload, per-(trigger,project) dedupe, disconnected guard (incl. not consuming the dedupe slot), telemetry-disabled short-circuit, and the requirements-suffix cases. (misha) - interpreterUnderCondaPrefix folds case via a caseInsensitive param defaulting to win32, so Windows paths that differ only in case match; fixed the "platform-agnostic" comment and added tests. (misha, anton) - Tighten the requirements matcher to /^requirements([-_.].+)?\.txt$/ so it no longer matches requirementsfoo.txt. (misha) - Short-circuit emitDetection when telemetry is disabled (new Telemetry.isTelemetryEnabled, true only at level "all") so opted-out users get zero telemetry-driven disk reads; document the ambient user.hashedUserName/host/workspaceId identity envelope. (anton) - Document the pyproject.pipOnly pip-over-count / uv-under-count caveat for lockfile-less uv projects. (anton) Verification: - yarn run build, eslint (0 warnings), prettier clean. - yarn run test:unit: 236 passing, 0 failing. Co-authored-by: Isaac
- pyproject is attributed to pip only when it declares an actual packaging table ([project]/[build-system]); a file with only tool config (e.g. [tool.ruff]) is no longer counted as pip. Adds pure pyprojectHasPackagingTable and gates hasPyprojectPipOnly on it. - Poetry-managed interpreters are attributed to poetry instead of being collapsed into the generic venv signal (which the detector reads as pip). Adds a distinct "poetry" InterpreterSource and an interpreter.poetry signal; the collector maps the Python extension's Poetry tool to it. - Omit pythonVersion from the event when the interpreter is unresolved, instead of shipping the literal string "undefined" (recordEvent serializes undefined via String()). - Tests for all three, plus the new pure helpers; handoff doc schema updated (interpreter_source gains "poetry"; signals gains interpreter.poetry). Verification: - yarn run build, eslint (0 warnings), prettier clean. - yarn run test:unit: 246 passing, 0 failing. - codex review --base main: no actionable findings. Co-authored-by: Isaac
Update: review feedback + Codex review addressedPushed three follow-up commits on top of the original (no force-push, so the inline threads above stay anchored):
I replied inline on each addressed thread with the specific commit. Schema note for the analytics owner: Verification: |
|
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success. |
anton-107
left a comment
There was a problem hiding this comment.
Verified all 8 addressed review comments against the source (not just the reply text), and built + ran the unit suite locally.
Comments — all correctly resolved:
- Type duplication / unsafe casts (misha-db, anton-107): single source of truth now in
packageManagerDetection.ts, imported type-only intoconstants.ts; bothascasts gone, so schema/classifier divergence fails to compile. - Case-sensitive path compare (misha-db, anton-107):
interpreterUnderCondaPrefixfolds case viacaseInsensitivedefaulting towin32; doc comment corrected. requirements*.txtover-broad regex (misha-db): tightened to/^requirements([-_.].+)?\.txt$/;requirementsfoo.txtexcluded.hasPyprojectPipOnlymisattributing uv (anton-107): went beyond the doc caveat — now gated onpyprojectHasPackagingTable()so a tool-onlypyproject.tomlisn't counted as pip.- Telemetry-off short-circuit (anton-107):
emitDetectionbails before any disk read when telemetry isn'tall. auto_opencompute race (anton-107): emit moved afterawait waitForConnect()+isConnected()guard;targetComputedeterministic, unauthenticated sessions not reported.- Collector tests (misha-db):
PackageManagerTelemetry.test.tsadded, covering emit payload, dedupe, disconnected-guard (incl. not consuming the dedupe slot), telemetry-disabled, and requirements-suffix cases.
Local verification: yarn run build passes; yarn run test:unit = 260 passing. The 6 failing tests are all in CliWrapper.test.js (missing bundled CLI binary in a clean checkout) — the PR touches no CLI files, so they're pre-existing/environmental, not regressions.
LGTM. Only non-blocking note: the lockfile-less-uv skew remains a documented known limitation by design.
Changes
Measurement-only telemetry to learn which Python package manager(s) our users' projects actually use (pip / conda / uv / poetry), so the VPEX setup-flow investment can be prioritized from first-party data instead of public-survey estimates. No setup behavior changes — this is detection only.
The work splits cleanly into three layers so each is independently testable and the dependency direction stays correct (high-level → low-level):
packageManagerDetection.ts): given a set of already-collected signals, reports every applicable manager, a best-guess primary (priorityuv > poetry > conda > pip), the firing signals,hasLockfile, and interpreter source. Side-effect free and total.telemetry/packageManagerExtensions.ts): addsrecordPackageManagerDetectionto the existingTelemetryclass via the samedeclare modulepattern ascommandExtensions.ts. Keeps disk/Python-extension dependencies out of the telemetry client.PackageManagerTelemetry.ts): a best-effort, non-blocking collector that reads disk and already-resolved interpreter metadata, runs the pure classifier, and calls the emit method. Deduplicated per session on(trigger, projectRoot); any failure degrades tounknownand is swallowed so it never disrupts setup.Emission is wired into three setup touchpoints: project-open environment check (
auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug).A new
Events.PYTHON_ENV_SETUP_DETECTEDevent carries a typed, documented schema (reuses the existing telemetry transport; opt-out honored; categorical data only — no paths, package names, or cluster names). A handoff note for the analytics/dashboard owner is included atsrc/telemetry/PACKAGE_MANAGER_DETECTION.md.Detection correctness (the parts most worth reviewing):
interpreterSourceis derived from the active interpreter alone, never from project files. Auv.lockproject running a conda/venv/system interpreter reports that interpreter's real source, keeping the "uv project, interpreter not uv-managed yet" setup-flow gap visible. A genuinely uv-provisioned venv is identified by theuv =marker inpyvenv.cfg, not byuv.lock.CONDA_PREFIX(path-boundary checked), not on the bare env var — which is session-global in the extension host (launching VS Code from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.pyproject[tool.uv]/[tool.poetry]detection uses a bounded table-header scan, not substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g.tool.uvicorn), and matches subtable and array-of-table headers ([tool.uv.sources],[[tool.poetry.source]]).uvfor a weak, non-attributing signal). Detection reads only disk and already-resolved interpreter metadata.Scope / privacy: measurement only — no changes to setup behavior (the VPEX flows are a separate effort). Only enum/categorical data and a closed set of signal identifiers are emitted; the existing telemetry opt-out (
telemetry.telemetryLevel) is respected by the transport.Tests
yarn run test:unit: 202 passing, 0 failing — includes the pure classifier (each manager, interpreter sources, overlaps like uv+pip / conda+pip / poetry+uv, weak signals, none) and pure helpers (pyprojectHasToolSection,pyvenvCfgMarksUv,interpreterUnderCondaPrefix), covering the conda-prefix boundary and shell-global false-positive cases.yarn run build(typecheck) passes.eslintclean;prettierformatted.Reviewer can validate with: