Persistent on-disk hash cache keyed by git SHA + bazel version #360

@tinder-maxwellelliott

Background

I compared bazel-diff against ewhauser/bazel-differ and bazel-contrib/target-determinator (TD). Both ship a persistent on-disk cache for hash output, keyed by git revision; TD additionally keys on the bazel binary SHA, the bazel version, and the workspace SHA. bazel-diff today re-runs generate-hashes from scratch on every invocation, even when CI is replaying the same commit hour after hour.

What ewhauser/bazel-differ does

  • internal/cache/disk_cache.go — file path layout <cacheDir>/<sha[0:2]>/<sha>/hashes.json.
  • cmd/get_targets.go — the wrapper command does git checkout + lookup-or-compute + diff + a downstream query in one go, so cache hits skip the Bazel call entirely.
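
For illustration, the path layout described above can be sketched in a few lines. This is not bazel-differ's actual code; the function name `cache_entry_path` is hypothetical, and only the `<cacheDir>/<sha[0:2]>/<sha>/hashes.json` shape comes from the description above.

```python
from pathlib import Path


def cache_entry_path(cache_dir: str, sha: str) -> Path:
    """Build <cacheDir>/<sha[0:2]>/<sha>/hashes.json for a revision SHA.

    The two-character prefix directory keeps any single directory from
    accumulating one child per cached revision.
    """
    if len(sha) < 2:
        raise ValueError("sha must have at least two characters")
    return Path(cache_dir) / sha[:2] / sha / "hashes.json"
```

The same fan-out scheme is used by git's own object store, which is presumably why it shows up here.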

What target-determinator does

  • pkg/cache.go (in their tree) — cache entry key includes bazel-binary SHA256, bazel-version string, git tree SHA, target pattern, and the cquery expression. This is the more correct keying — bazel-version changes alone can flip our skylarkEnvironmentHashCode, and a bazel binary swap (e.g. local bazelisk upgrade) silently changes outputs.
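
A minimal sketch of that kind of composite key, assuming the five fields listed above are simply serialized and hashed together. This is not TD's implementation; the function name, the JSON encoding, and the choice of SHA-256 over the serialized fields are all assumptions made for illustration.

```python
import hashlib
import json


def cache_key(
    bazel_binary_sha256: str,
    bazel_version: str,
    git_tree_sha: str,
    target_pattern: str,
    cquery_expression: str,
) -> str:
    """Derive one hex key from every input that can change the hash output."""
    fields = {
        "bazel_binary_sha256": bazel_binary_sha256,
        "bazel_version": bazel_version,
        "git_tree_sha": git_tree_sha,
        "target_pattern": target_pattern,
        "cquery_expression": cquery_expression,
    }
    # Sorted keys and fixed separators make the serialization canonical,
    # so equal inputs always produce the same key.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, a bazel version bump on the same git tree yields a different key, so a stale entry is never consulted.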

Why we'd want it

  • CI hosts that retry the same commit (re-runs, rebases that re-push the same tree, matrix flakes) currently redo the full bazel query graph walk every time. On a graph with tens of thousands of targets, that costs minutes per re-run, paid in cloud cost and developer wait time.
  • Local dev: a developer iterating on a branch often checks out the same base SHA again between experiments; caching makes the second generate-hashes run against that base near-instant.

Sketch

  • New flag(s) on generate-hashes (and optionally a wrapper command):
    • --cacheDir=<path> — opt-in; absent means today's behavior.
    • Implicit key fragments to include in the entry: git rev-parse HEAD, bazel binary SHA256, bazel --version string, the set of --bazelStartupOptions / --bazelCommandOptions / --cqueryCommandOptions / --useCquery / --fineGrainedHashExternalRepos* / --ignoredRuleHashingAttributes (all of which materially affect the output), the workspace path (for cross-workspace safety).
  • Cache invalidation is purely "don't trust on key mismatch"; no TTL, no GC (let the user manage cacheDir size).
  • The cache lives alongside the existing hash JSON, not inside it — so consumers that read hashes.json directly are unaffected.
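
The opt-in, trust-only-on-exact-key behavior could look roughly like the sketch below. Everything here is hypothetical: `load_or_generate` is not an existing bazel-diff function, `generate` stands in for whatever runs generate-hashes today, `key` is assumed to be a hex digest already derived from the key fragments listed above, and the on-disk layout is borrowed from bazel-differ for concreteness.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def load_or_generate(
    cache_dir: Optional[str],
    key: str,
    generate: Callable[[], Dict[str, str]],
) -> Dict[str, str]:
    """Return cached hashes for `key`, computing and storing them on a miss."""
    if cache_dir is None:
        # No --cacheDir given: behave exactly as today.
        return generate()

    entry = Path(cache_dir) / key[:2] / key / "hashes.json"
    if entry.is_file():
        # Exact key match is the only validity check: no TTL, no GC.
        return json.loads(entry.read_text(encoding="utf-8"))

    hashes = generate()
    entry.parent.mkdir(parents=True, exist_ok=True)
    # Plain write for brevity; a real implementation would write atomically.
    entry.write_text(json.dumps(hashes, sort_keys=True), encoding="utf-8")
    return hashes
```

Because the entry is a separate file under the cache directory, whatever path the caller normally writes its hash JSON to is untouched.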

Open questions

  • Should the cache be at the command level (one entry per generate-hashes invocation) or per-target (allowing partial-incremental hashing)? TD does the former; per-target would be a bigger lift but unlocks proper incremental hashing.
  • Atomic writes: standard "write temp + rename" so concurrent CI on the same workspace doesn't tear.
  • Cache hit logging: surface to verbose / structured logs so CI can confirm hits.
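
For the last two points, a minimal sketch of "write temp + rename" plus hit/miss logging, assuming a JSON entry file per key. The function names and the logger name `hash_cache` are hypothetical and not part of bazel-diff.

```python
import json
import logging
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional

log = logging.getLogger("hash_cache")


def write_entry_atomically(entry: Path, hashes: Dict[str, str]) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    entry.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=entry.parent, prefix=".hashes-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(hashes, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in one step, so readers see either the
        # old entry or the complete new one, never a partial write.
        os.replace(tmp_name, entry)
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_entry(entry: Path) -> Optional[Dict[str, str]]:
    """Return the cached hashes, logging whether the lookup hit or missed."""
    try:
        text = entry.read_text(encoding="utf-8")
    except FileNotFoundError:
        log.info("hash cache miss: %s", entry)
        return None
    log.info("hash cache hit: %s", entry)
    return json.loads(text)
```

If two CI jobs race on the same key, both compute the same content and the last rename wins, which is harmless as long as the key captures every input that affects the output.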

This is a feature, not a correctness fix — filing as an issue so we can discuss before anyone starts on it.
