Skip to content

[telemetry] Detect Python package manager(s) at project setup#1918

Merged
rugpanov merged 5 commits into
mainfrom
telemetry-package-manager-detection
Jun 26, 2026
Merged

[telemetry] Detect Python package manager(s) at project setup#1918
rugpanov merged 5 commits into
mainfrom
telemetry-package-manager-detection

Conversation

@rugpanov

Copy link
Copy Markdown
Contributor

Changes

Measurement-only telemetry to learn which Python package manager(s) our users' projects actually use (pip / conda / uv / poetry), so the VPEX setup-flow investment can be prioritized from first-party data instead of public-survey estimates. No setup behavior changes — this is detection only.

The work splits cleanly into three layers so each is independently testable and the dependency direction stays correct (high-level → low-level):

  • Pure classifier (packageManagerDetection.ts): given a set of already-collected signals, reports every applicable manager, a best-guess primary (priority uv > poetry > conda > pip), the firing signals, hasLockfile, and interpreter source. Side-effect free and total.
  • Emit (telemetry/packageManagerExtensions.ts): adds recordPackageManagerDetection to the existing Telemetry class via the same declare module pattern as commandExtensions.ts. Keeps disk/Python-extension dependencies out of the telemetry client.
  • Collection (PackageManagerTelemetry.ts): a best-effort, non-blocking collector that reads disk and already-resolved interpreter metadata, runs the pure classifier, and calls the emit method. Deduplicated per session on (trigger, projectRoot); any failure degrades to unknown and is swallowed so it never disrupts setup.

Emission is wired into three setup touchpoints: project-open environment check (auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug).

A new Events.PYTHON_ENV_SETUP_DETECTED event carries a typed, documented schema (reuses the existing telemetry transport; opt-out honored; categorical data only — no paths, package names, or cluster names). A handoff note for the analytics/dashboard owner is included at src/telemetry/PACKAGE_MANAGER_DETECTION.md.

Detection correctness (the parts most worth reviewing):

  • interpreterSource is derived from the active interpreter alone, never from project files. A uv.lock project running a conda/venv/system interpreter reports that interpreter's real source, keeping the "uv project, interpreter not uv-managed yet" setup-flow gap visible. A genuinely uv-provisioned venv is identified by the uv = marker in pyvenv.cfg, not by uv.lock.
  • conda is attributed only when the active interpreter resides under CONDA_PREFIX (path-boundary checked), not on the bare env var — which is session-global in the extension host (launching VS Code from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.
  • pyproject [tool.uv]/[tool.poetry] detection uses a bounded table-header scan, not substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn), and matches subtable and array-of-table headers ([tool.uv.sources], [[tool.poetry.source]]).
  • No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved uv for a weak, non-attributing signal). Detection reads only disk and already-resolved interpreter metadata.

Scope / privacy: measurement only — no changes to setup behavior (the VPEX flows are a separate effort). Only enum/categorical data and a closed set of signal identifiers are emitted; the existing telemetry opt-out (telemetry.telemetryLevel) is respected by the transport.

Tests

  • yarn run test:unit: 202 passing, 0 failing — includes the pure classifier (each manager, interpreter sources, overlaps like uv+pip / conda+pip / poetry+uv, weak signals, none) and pure helpers (pyprojectHasToolSection, pyvenvCfgMarksUv, interpreterUnderCondaPrefix), covering the conda-prefix boundary and shell-global false-positive cases.
  • yarn run build (typecheck) passes.
  • eslint clean; prettier formatted.

Reviewer can validate with:

cd packages/databricks-vscode
yarn run build
yarn run test:unit
npx eslint src --ext ts && npx prettier . -c

Why:
We need first-party data on which Python package manager(s) our users'
projects actually use (pip/conda/uv/poetry) to prioritize VPEX setup-flow
investment, replacing public-survey estimates. Measurement only -- no setup
behavior changes.

What:
- Add packageManagerDetection.ts: a pure, signal-based classifier that
  reports all applicable managers plus a best-guess primary (uv > poetry >
  conda > pip), the firing signals, hasLockfile, and interpreter source.
  Treats bare uv/poetry on PATH as weak signals.
- Add Events.PYTHON_ENV_SETUP_DETECTED with a typed, documented schema in
  telemetry/constants.ts (reuses existing Telemetry client; opt-out honored;
  categorical data only, no paths/package/cluster names).
- Add telemetry/packageManagerExtensions.ts: the emit half, layered onto the
  Telemetry class via the commandExtensions declare-module pattern
  (recordPackageManagerDetection). Keeps disk/Python-extension deps out of the
  Telemetry client.
- Add PackageManagerTelemetry.ts: the collection half -- a best-effort,
  non-blocking collector (disk + already-resolved interpreter metadata) that
  gathers signals, runs the pure classifier, and calls the emit method.
  Deduplicated per session on (trigger, projectRoot); failures degrade to
  unknown and are swallowed.
- Wire emission into three touchpoints: project-open env check (auto_open),
  the set-up-environment command (explicit_command), and first Run/Debug
  with Databricks Connect (run/debug).
- Add unit tests for the detector and pure helpers, and a dashboard-owner
  handoff note.

Detection correctness:
- interpreterSource is derived from the active interpreter alone, never from
  project files: a uv.lock project on a conda/venv/system interpreter reports
  that interpreter's real source, keeping the setup-flow gap visible. A
  genuinely uv-provisioned venv is identified by the `uv =` marker in
  pyvenv.cfg (pure pyvenvCfgMarksUv), not by uv.lock.
- conda is attributed only when the active interpreter resides under
  CONDA_PREFIX (pure interpreterUnderCondaPrefix, with a path-boundary check),
  not on the bare env var, which is session-global in the extension host
  (launching from an activated conda shell) and would otherwise over-count
  conda for uv/poetry/pip projects.
- pyproject [tool.uv]/[tool.poetry] detection uses a pure, bounded table-header
  scan (pyprojectHasToolSection) instead of substring matching: ignores
  comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn),
  and matches subtable and array-of-table headers (e.g. [tool.uv.sources],
  [[tool.poetry.source]]) that the substring check missed.
- No external executable is run for telemetry: the uv-on-PATH probe was
  removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal);
  detection now only reads disk and already-resolved interpreter metadata.

Verification:
- yarn run build (typecheck) passes.
- eslint clean; prettier formatted.
- yarn run test:unit: 228 passing, 0 failing (includes detector + helper tests).

Co-authored-by: Isaac
@rugpanov rugpanov force-pushed the telemetry-package-manager-detection branch from c236f29 to be9f174 Compare June 19, 2026 09:26
@rugpanov rugpanov temporarily deployed to test-trigger-is June 19, 2026 09:26 — with GitHub Actions Inactive
@rugpanov

Copy link
Copy Markdown
Contributor Author

@rugpanov rugpanov requested review from anton-107 and misha-db June 22, 2026 13:19
Comment on lines +47 to +52
/** A Python package/environment manager detected for a project. */
export type PackageManagerName = "uv" | "poetry" | "pip" | "conda";
/** Best-guess primary manager, or "unknown" when no signal fires. */
export type PrimaryManagerName = PackageManagerName | "unknown";
/** How the active interpreter was provisioned. */
export type InterpreterSource = "uv" | "conda" | "system" | "venv" | "unknown";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to be duplicated from packageManagerDetection.ts, can we re-use same types?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ab23a28constants.ts now imports PackageManager/PrimaryManager/InterpreterSource (type-only) from the pure packageManagerDetection module as the single source of truth; the duplicate unions and the as casts are gone, so the compiler enforces schema/classifier alignment.

* free-form content (see {@link detectPackageManagers}). Telemetry opt-out is
* honoured by the underlying {@link Telemetry} client.
*/
export class PackageManagerTelemetry {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests for PackageManagerTelemetry?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in ab23a28PackageManagerTelemetry.test.ts covering the emit payload, per-(trigger, project) dedupe, the disconnected guard (incl. not consuming the dedupe slot), the telemetry-disabled short-circuit, and the requirements-suffix cases.

Comment on lines +284 to +285
prefix.startsWith(base + "/") ||
prefix.startsWith(base + "\\")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes filesystem is case sensitive, on Windows case C:\Some and C:\some returning false, but they denote same folder

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ab23a28interpreterUnderCondaPrefix now folds case via a caseInsensitive param defaulting to process.platform === "win32", so Windows paths differing only in case match. Added tests.

try {
return fs
.readdirSync(projectRoot)
.some((name) => /^requirements.*\.txt$/.test(name));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex matches "requirementswhatever.txt" as well, was it intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ab23a28 — tightened to /^requirements([-_.].+)?\.txt$/, so requirements.txt / requirements-dev.txt / requirements_test.txt match but requirementsfoo.txt does not. Covered by tests.

@anton-107 anton-107 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few follow-up comments from a closer pass (correctness/maintainability + one telemetry-gating suggestion). Nice structure overall — the pure/emit/collect split and the privacy posture of the payload are good. None of these are blockers; the casts (1) and the opt-out gating (5) are the two I'd most want addressed.

context: PackageManagerDetectionContext
) {
this.recordEvent(Events.PYTHON_ENV_SETUP_DETECTED, {
managersDetected: detection.managers as PackageManagerName[],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe casts hide schema divergence. detection.managers as PackageManagerName[] (and interpreterSource as InterpreterSource on L46) bridge two parallel union definitions: PackageManager/InterpreterSource/PrimaryManager in packageManagerDetection.ts vs PackageManagerName/InterpreterSource/PrimaryManagerName in constants.ts. These as casts defeat the one check that matters here — if someone later adds a manager to one union but not the other, it compiles silently and ships a mismatched event schema. Suggest a single source of truth: import the detection types into constants.ts (or re-export from one place) and drop the casts so the compiler enforces alignment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ab23a28 — unified the unions (single source of truth in packageManagerDetection.ts, imported type-only into constants.ts) and removed both as casts here, so a divergence now fails to compile.

override async check() {
// First environment check on project open: emit package-manager
// detection telemetry (deduplicated per session, never throws).
void this.packageManagerTelemetry.emitDetection("auto_open");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targetCompute is nondeterministic for auto_open. This fires before await this.connectionManager.waitForConnect() on the next line. emitDetection awaits resolveEnvironment() and then reads connectionManager.cluster/.serverless via getComputeType(), so whether the connection has resolved by then is a race against the parallel waitForConnect(). Result: targetCompute for auto_open will usually be none but sometimes the real value — dirty data on the trigger that fires most. Either move the emit after waitForConnect() to make it deterministic, or accept+document that auto_open always reports none for compute.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 34693f5 — moved the emit to after await waitForConnect() (which resolves only on CONNECTED) and added an isConnected() guard. targetCompute is now deterministic for auto_open, and unauthenticated sessions are no longer reported at all.

* A `pyproject.toml` exists but declares neither `[tool.uv]` nor
* `[tool.poetry]` (i.e. a plain PEP 621 / pip-installable project).
*/
hasPyprojectPipOnly?: boolean;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasPyprojectPipOnly can misattribute uv projects as pip. uv works with a bare [project] table and no [tool.uv] section, so a uv project without a committed uv.lock classifies as pip. In the common case (uv.lock present) uv still fires and wins primary, so impact is limited to lockfile-less uv projects — but since the whole point is sizing uv adoption accurately, worth a one-line caveat here (or in the handoff doc) so the analytics owner knows pyproject.pipOnly slightly over-counts pip / under-counts uv.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in ab23a28 — caveat added on the hasPyprojectPipOnly field and a "Known measurement caveats" section in the handoff doc, noting pip is slightly over-counted / uv under-counted for lockfile-less uv projects.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: in b05df5d this is now fixed in code, not just documented. pyproject.toml is attributed to pip only when it declares an actual packaging table ([project]/[build-system]) and no uv/poetry section — a file carrying only tool config like [tool.ruff] is no longer counted as pip. (Found independently by a Codex review pass too.) The lockfile-less-uv caveat in the handoff doc still stands.

* Pure over its inputs; returns false if either is missing. Accepts both `/`
* and `\\` separators so it is platform-agnostic and deterministic in tests.
*/
export function interpreterUnderCondaPrefix(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case-sensitivity vs the "platform-agnostic" claim. The boundary check handles / vs \ separators but compares case-sensitively, so on Windows C:\Conda\envs\ml vs c:\conda\... would not match even though Windows paths are case-insensitive. Low impact (both values usually come from the same source), but the doc comment says "platform-agnostic" which overstates it. Either lowercase both sides when on Windows, or soften the comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ab23a28 — case-insensitive comparison on Windows, and corrected the "platform-agnostic" comment to reflect it. Added tests for both case modes.

* Deduplicated per `(project root, trigger)` within the session. Never
* throws.
*/
async emitDetection(trigger: SetupTrigger): Promise<void> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: short-circuit when telemetry isn't at all. The event payload is correctly suppressed by @vscode/extension-telemetry when the user's telemetry.telemetryLevel isn't all (regular events only send at all), so nothing leaks. But recordEvent only gates on !this.reporter, and reporter is constructed regardless of the user's setting — so for an opted-out user, collectSignals still runs fs.readdirSync(projectRoot) + fs.readFileSync on pyproject.toml/pyvenv.cfg on every auto_open/run/setup; only the send is dropped. That's wasted work, and the optics are poor for a privacy-conscious user auditing the extension (a telemetry code path reading project files despite telemetry being off). Recommend bailing out of emitDetection early when telemetry is disabled (e.g. check vscode.env.isTelemetryEnabled / the reporter level before collecting) so opted-out users get zero telemetry-driven disk access. Separately, the handoff doc says "no paths/names" but doesn't mention that every event — including this one — inherits the ambient user.hashedUserName / user.host / workspaceId envelope; worth stating so reviewers know this links toolchain choices to a stable identity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ab23a28emitDetection now bails before any disk read when telemetry is disabled (new Telemetry.isTelemetryEnabled, true only at level all), so opted-out users get zero telemetry-driven file access. Also documented the ambient user.hashedUserName/host/workspaceId identity envelope in the handoff doc.

rugpanov added 2 commits June 25, 2026 14:48
Why:
The detection should describe active Databricks users' projects, not extension
installs that never authenticate. Previously auto_open fired during activation,
before sign-in, so unauthenticated sessions (and non-Databricks folders, which
can never reach CONNECTED) were reported with target_compute=none and no
user/workspace metadata.

What:
- Add an isConnected() accessor (connectionManager.state === "CONNECTED") and
  guard emitDetection on it, before the dedupe bookkeeping so a pre-connect
  call doesn't consume the dedupe slot and suppress the real emit once
  connected.
- Move the auto_open emit in EnvironmentDependenciesVerifier.check() to after
  await waitForConnect() (which resolves only on CONNECTED), so it fires once
  the workspace connects. This also makes targetCompute deterministic for
  auto_open (was racing the parallel waitForConnect).
- Document the connected-only behavior in the handoff note.

Verification:
- yarn run build, eslint, prettier clean; yarn run test:unit green.

Co-authored-by: Isaac
- Single source of truth for the manager/interpreter unions: constants.ts now
  imports PackageManager/PrimaryManager/InterpreterSource from the pure
  packageManagerDetection module (type-only, no cycle) and the unsafe `as`
  casts in packageManagerExtensions.ts are removed, so the compiler enforces
  schema/classifier alignment. (misha, anton)
- Add PackageManagerTelemetry.test.ts: emit payload, per-(trigger,project)
  dedupe, disconnected guard (incl. not consuming the dedupe slot),
  telemetry-disabled short-circuit, and the requirements-suffix cases. (misha)
- interpreterUnderCondaPrefix folds case via a caseInsensitive param defaulting
  to win32, so Windows paths that differ only in case match; fixed the
  "platform-agnostic" comment and added tests. (misha, anton)
- Tighten the requirements matcher to /^requirements([-_.].+)?\.txt$/ so it no
  longer matches requirementsfoo.txt. (misha)
- Short-circuit emitDetection when telemetry is disabled (new
  Telemetry.isTelemetryEnabled, true only at level "all") so opted-out users
  get zero telemetry-driven disk reads; document the ambient
  user.hashedUserName/host/workspaceId identity envelope. (anton)
- Document the pyproject.pipOnly pip-over-count / uv-under-count caveat for
  lockfile-less uv projects. (anton)

Verification:
- yarn run build, eslint (0 warnings), prettier clean.
- yarn run test:unit: 236 passing, 0 failing.

Co-authored-by: Isaac
- pyproject is attributed to pip only when it declares an actual packaging
  table ([project]/[build-system]); a file with only tool config (e.g.
  [tool.ruff]) is no longer counted as pip. Adds pure pyprojectHasPackagingTable
  and gates hasPyprojectPipOnly on it.
- Poetry-managed interpreters are attributed to poetry instead of being
  collapsed into the generic venv signal (which the detector reads as pip).
  Adds a distinct "poetry" InterpreterSource and an interpreter.poetry signal;
  the collector maps the Python extension's Poetry tool to it.
- Omit pythonVersion from the event when the interpreter is unresolved, instead
  of shipping the literal string "undefined" (recordEvent serializes undefined
  via String()).
- Tests for all three, plus the new pure helpers; handoff doc schema updated
  (interpreter_source gains "poetry"; signals gains interpreter.poetry).

Verification:
- yarn run build, eslint (0 warnings), prettier clean.
- yarn run test:unit: 246 passing, 0 failing.
- codex review --base main: no actionable findings.

Co-authored-by: Isaac
@rugpanov

Copy link
Copy Markdown
Contributor Author

Update: review feedback + Codex review addressed

Pushed three follow-up commits on top of the original (no force-push, so the inline threads above stay anchored):

  • 34693f5 — Gate detection on workspace connection. Emit only while CONNECTED (so unauthenticated sessions / non-Databricks folders are never reported); auto_open moved after waitForConnect(), which also makes targetCompute deterministic.
  • ab23a28 — Address review feedback: single source of truth for the manager/interpreter unions (dropped the unsafe casts), PackageManagerTelemetry tests, Windows case-insensitive conda-path compare, tightened requirements regex, short-circuit before any disk read when telemetry is disabled, and doc caveats (pyproject.pipOnly, identity envelope).
  • b05df5d — Address a Codex review pass: pip is attributed only when pyproject.toml has a real packaging table ([project]/[build-system]), Poetry-managed interpreters now attribute to poetry (new poetry interpreter source) instead of pip, and pythonVersion is omitted when unresolved (no "undefined" string).

I replied inline on each addressed thread with the specific commit. Schema note for the analytics owner: interpreterSource gained a poetry value and signals gained interpreter.poetry (both additive) — see src/telemetry/PACKAGE_MANAGER_DETECTION.md.

Verification: yarn run build clean, eslint 0 warnings, prettier clean, yarn run test:unit 246 passing / 0 failing, and codex review --base main reports no actionable findings.

@rugpanov rugpanov requested review from anton-107 and misha-db June 25, 2026 15:34
@rugpanov rugpanov deployed to test-trigger-is June 25, 2026 15:42 — with GitHub Actions Active
@github-actions

Copy link
Copy Markdown
Contributor

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/vscode

Inputs:

  • PR number: 1918
  • Commit SHA: 5967215d0192fe26cf1a26904951941560511d0f

Checks will be approved automatically on success.

@anton-107 anton-107 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified all 8 addressed review comments against the source (not just the reply text), and built + ran the unit suite locally.

Comments — all correctly resolved:

  • Type duplication / unsafe casts (misha-db, anton-107): single source of truth now in packageManagerDetection.ts, imported type-only into constants.ts; both as casts gone, so schema/classifier divergence fails to compile.
  • Case-sensitive path compare (misha-db, anton-107): interpreterUnderCondaPrefix folds case via caseInsensitive defaulting to win32; doc comment corrected.
  • requirements*.txt over-broad regex (misha-db): tightened to /^requirements([-_.].+)?\.txt$/; requirementsfoo.txt excluded.
  • hasPyprojectPipOnly misattributing uv (anton-107): went beyond the doc caveat — now gated on pyprojectHasPackagingTable() so a tool-only pyproject.toml isn't counted as pip.
  • Telemetry-off short-circuit (anton-107): emitDetection bails before any disk read when telemetry isn't all.
  • auto_open compute race (anton-107): emit moved after await waitForConnect() + isConnected() guard; targetCompute deterministic, unauthenticated sessions not reported.
  • Collector tests (misha-db): PackageManagerTelemetry.test.ts added, covering emit payload, dedupe, disconnected-guard (incl. not consuming the dedupe slot), telemetry-disabled, and requirements-suffix cases.

Local verification: yarn run build passes; yarn run test:unit = 260 passing. The 6 failing tests are all in CliWrapper.test.js (missing bundled CLI binary in a clean checkout) — the PR touches no CLI files, so they're pre-existing/environmental, not regressions.

LGTM. Only non-blocking note: the lockfile-less-uv skew remains a documented known limitation by design.

@rugpanov rugpanov merged commit 133c453 into main Jun 26, 2026
6 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants