flake: Go module verification fails when sum.golang.org returns 500 (lint/offlinedocs) #1276

@flake-investigator

Description

CI run: https://github.com/coder/coder/actions/runs/21086635027

Commit: 4d414a0df79ed37dafff5c9d5951d5799a63d672 ("feat: add --use-parameter-defaults flag") by Asher [email protected]

What failed

Two separate jobs failed because sum.golang.org returned 500 while Go was verifying modules:

lint job

... verifying module: github.com/prometheus/[email protected]: reading https://sum.golang.org/tile/8/0/x025/567: 500 Internal Server Error

offlinedocs job (during setup-sqlc)

... github.com/pganalyze/pg_query_go/[email protected]: verifying module: ... reading https://sum.golang.org/tile/8/0/x141/114: 500 Internal Server Error

The required job then failed because both of these required checks were red.

Root cause classification

Infrastructure / external dependency outage (Go checksum database).

Why this is worth tracking

Even a brief, intermittent upstream outage hard-fails CI, because go install and go mod download have no built-in retries. We may want CI-level mitigation.

Suggested mitigations

  • Wrap Go module download/install steps in retry/backoff (especially tool installs in CI actions).
  • Consider a checksum DB fallback/mirror for CI (e.g. alternate GOSUMDB), if acceptable.
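As a rough illustration of the first mitigation, here is a minimal sketch of a retry-with-exponential-backoff wrapper that a CI step could use. The `retry` function name and the `MAX_ATTEMPTS` / `INITIAL_DELAY` variables are hypothetical; nothing here is taken from the coder/coder workflows, and the defaults would need tuning.

```shell
#!/bin/sh
# retry: run a command, retrying with exponential backoff on failure.
# Hypothetical helper (name, variables, and defaults are assumptions).
#   MAX_ATTEMPTS   total number of attempts (default 5)
#   INITIAL_DELAY  seconds to sleep before the second attempt (default 2)
# Plain prefixed variables are used instead of `local` to stay POSIX-compatible.
retry() {
  _retry_max="${MAX_ATTEMPTS:-5}"
  _retry_delay="${INITIAL_DELAY:-2}"
  _retry_attempt=1
  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is exhausted.
    if [ "$_retry_attempt" -ge "$_retry_max" ]; then
      echo "retry: '$*' failed after ${_retry_attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${_retry_attempt} failed; sleeping ${_retry_delay}s" >&2
    sleep "$_retry_delay"
    # Double the delay before the next attempt.
    _retry_delay=$(( _retry_delay * 2 ))
    _retry_attempt=$(( _retry_attempt + 1 ))
  done
}

# Example usage in a CI step (not executed here):
#   retry go mod download
```

This retries on any non-zero exit, so it would also retry real failures such as a bad module path. A stricter version could inspect stderr for the sum.golang.org error before retrying.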

Assignment rationale

This is CI resiliency work (not tied to a particular product component). Assigning to kacpersaw as a recent maintainer of CI resiliency changes (e.g. get.helm.sh outage fallback in PR coder/coder#21268).
