Skip to content

feat: differentiated exit codes (0/1/2/3/4) for the Kosli CLI#714

Open
dangrondahl wants to merge 15 commits intomainfrom
error_codes
Open

feat: differentiated exit codes (0/1/2/3/4) for the Kosli CLI#714
dangrondahl wants to merge 15 commits intomainfrom
error_codes

Conversation

@dangrondahl
Copy link
Contributor

@dangrondahl dangrondahl commented Mar 19, 2026

Summary

Differentiates the existing uniform exit-1-on-all-errors into structured, semantic exit codes so CI/CD pipelines and shell scripts can distinguish between failure types without parsing stderr.

Code Meaning Error type
0 Success
1 Compliance/policy violation ErrCompliance
2 Server unreachable or 5xx ErrServer
3 Auth/config error (401/403) ErrConfig
4 CLI usage error ErrUsage

Important: read the ADR first

Before reviewing the code, please read docs/adr/20260319-differentiated-exit-codes.md. It documents:

  • The decision and alternatives considered
  • A showstopper bug found during review (and its fix)
  • The risk profile — scripts checking [ $? -eq 1 ] for "any error" will miss exit codes 2, 3, 4
  • Whether this warrants a major version bump

The ADR should be discussed and agreed upon before this PR is approved.

What changed

  • internal/errors/ — new error type taxonomy (ErrCompliance, ErrServer, ErrConfig, ErrUsage) and ExitCodeFor() dispatch function
  • cmd/kosli/main.go — bypasses logger.Error (which calls log.Fatalfos.Exit(1)) and uses fmt.Fprintf + os.Exit(ExitCodeFor(err)) so differentiated codes actually take effect
  • internal/requests/requests.go — HTTP 5xx → ErrServer, 401/403 → ErrConfig
  • cmd/kosli/*.go — all compliance-producing commands wrap failures as ErrCompliance
  • cmd/kosli/docs.go + internal/docgen/ — per-command exit code tables in generated CLI docs
  • 55 wantExitCode test annotations across 14 test files

Test plan

  • make test_integration passes
  • Verify binary exit codes manually:
    • ./kosli version → exit 0
    • KOSLI_API_TOKEN=bad ./kosli list environments --org kosli → exit 3
    • ./kosli list environments --host http://localhost:1 --max-api-retries 0 --org x --api-token x → exit 2
    • ./kosli list environments --bogus-flag → exit 4
  • Review ADR and agree on versioning strategy
  • Confirm generated docs render exit code tables correctly

References

@dangrondahl
Copy link
Contributor Author

dangrondahl commented Mar 19, 2026

How exit codes appear in generated docs

Each command gets an Exit Codes section in its generated documentation, with the table varying by command type.

Example (in Mintlify)

image

Commands

Default commands (e.g. kosli attest custom, kosli begin trail, kosli list environments)

Code Meaning
0 No error.
1 Unexpected error.
2 Kosli server is unreachable or returned a server error.
3 Invalid API token or unauthorized access.
4 CLI usage error (e.g. missing or invalid flags).

Assert commands (e.g. kosli assert artifact, kosli assert snapshot)

Code Meaning
0 No error.
1 Assertion/compliance violation.
2 Kosli server is unreachable or returned a server error.
3 Invalid API token or unauthorized access.
4 CLI usage error (e.g. missing or invalid flags).

Attest commands with --assert (e.g. kosli attest jira, kosli attest pullrequest github)

Code Meaning
0 No error.
1 Assertion/compliance violation (only when --assert is used).
2 Kosli server is unreachable or returned a server error.
3 Invalid API token or unauthorized access.
4 CLI usage error (e.g. missing or invalid flags).

Evaluate commands (e.g. kosli evaluate trail)

Code Meaning
0 No error (policy allowed).
1 Policy denied.
2 Kosli server is unreachable or returned a server error.
3 Invalid API token or unauthorized access.
4 CLI usage error (e.g. missing or invalid flags).

kosli assert status (special case)

Code Meaning
0 Kosli server is responsive.
2 Kosli server is unreachable or down.

Non-API commands (kosli version, kosli completion, kosli docs)

Code Meaning
0 No error.
4 CLI usage error.

@dangrondahl
Copy link
Contributor Author

Given the error code options each command has, I think it's worth to consider if e.g we should write a skill for creating new commands, or have it in our CLAUDE.md

@dangrondahl
Copy link
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- internal/errors: ErrCompliance(1), ErrServer(2), ErrConfig(3), ErrUsage(4) + ExitCodeFor
- main.go: os.Exit with ExitCodeFor(err) instead of always returning 0
- assertArtifact: compliance failure now returns ErrCompliance (exit 1)
- Network errors and 5xx (after retries) → ErrServer (exit 2)
- 401/403 responses → ErrConfig (exit 3)
- Other 4xx responses remain plain errors (exit 1)
- ErrorBeforePrintingUsage returns ErrUsage, covering all PreRunE validation failures
- innerMain unknown-flag branch returns ErrUsage instead of the raw Cobra error
- assertSnapshot/assertApproval: INCOMPLIANT/not-approved → ErrCompliance (exit 1)
- status.go: unresponsive → ErrServer (exit 2)
- docgen: ExitCodeEntry type + ExitCodesSection on Formatter interface (Hugo + Mintlify)
- docs.go: per-command exit code sets wired through metaFn; all runnable commands
  get a default set (0/2/3/4), assert commands additionally get 1
…in docs

- All four assert pullrequest commands now return ErrCompliance (exit 1)
  when no PR/MR is found for the given commit
- commandExitCodes map extended with all four pullrequest subcommand paths
  so their generated docs show exit code 1 in the Exit Codes table
…ode 1

- Wrap compliance assertion failures in attestPROptions.run and attestJira.run
  as ErrCompliance so --assert exits with code 1 not 0
- Fix wrapAttestationError to preserve typed errors when no string replacement
  occurs (previously fmt.Errorf stripped the error type)
- Add commandExitCodes entries for all four attest pullrequest providers and
  attest jira so their docs show the exit-code 1 row
- Add TestWrapAttestationError covering nil, ErrCompliance, ErrServer, and
  message-rewrite cases
…itCodesAssert

- exitCodesDefault now includes exit 1 ("Unexpected error.") since unclassified
  errors fall through to ExitCompliance=1 in ExitCodeFor
- New exitCodesAttest for attest commands with --assert: exit 1 meaning clarifies
  it only applies when --assert is used
- attest pullrequest * and attest jira now use exitCodesAttest instead of
  exitCodesAssert
…liance)

- printEvaluateResultAsJson and printEvaluateResultAsTable now return
  ErrCompliance on policy denial so the process exits with code 1
- Add exitCodesEvaluate set (code 1: "Policy denied.") and register
  kosli evaluate trail and kosli evaluate trails in commandExitCodes
- Add TestPolicyDenialIsErrCompliance covering both formatters for
  allow and deny cases
…cenario tests

- ExitCodeFor now recognises Cobra's built-in usage error patterns
  (unknown flag, required flag not set, wrong arg count) so tests that
  use executeCommandC get the same exit code as the real binary
- Add wantExitCode int to cmdTestCase; runTestCmd asserts ExitCodeFor(err)
  when the field is non-zero
- New TestExitCodeScenarios covers codes 0–4 end-to-end using httptest
  (no local server required): version (0), deny-all policy (1),
  5xx / unreachable (2), 401/403 (3), unknown flag / cobra required flag /
  wrong arg count (4)
- Add wantExitCode: 4 to all arg/flag usage error cases in evaluateTrail
  and wantExitCode: 1 to all deny-all policy cases
- Add wantExitCode annotations to assertSnapshot tests 01 (exit 4) and
  04 (exit 1)
- Extend TestExitCodeFor with four Cobra error pattern cases
…fore ExitCodeFor

Three bugs found by five-agent swarm review:

1. main.go: logger.Error() calls log.Fatalf() which exits 1 immediately,
   making os.Exit(ExitCodeFor(err)) unreachable dead code. Replaced with
   fmt.Fprintf to stderr so the differentiated exit code actually takes
   effect. Same fix in innerMain for the unknown-subcommand path.

2. requests.go: HTTP 5xx responses were returned as plain fmt.Errorf,
   falling through to exit 1. Now wrapped as ErrServer (exit 2).

3. docs.go/testHelpers/etc: import ordering and alignment (gofmt).
Documents the decision to split the uniform exit-1-on-all-errors into
structured codes (1=compliance, 2=server, 3=config, 4=usage), the
review process that found the Fatalf dead-code bug, and remaining
test coverage work.
…e tests

Add wantExitCode to 44 test cases across 12 files (55 total, up from 11).
Covers compliance failures (exit 1), usage errors (exit 4), and success
regression guards (exit 0) for all assert, attest --assert, and evaluate
command families.
…ode annotations

- Add exit codes table to 4 docs golden files (hugo/mintlify × snyk/artifact)
- Change wantExitCode from 4 to 1 for 3 RunE validation errors that are
  plain fmt.Errorf (not wrapped via ErrorBeforePrintingUsage as ErrUsage)
@dangrondahl
Copy link
Contributor Author

Exit code behavior when --org / --api-token come from env vars

Investigated whether exit code 4 (usage error) could incorrectly fire when --org or --api-token are omitted from the CLI but provided via KOSLI_ORG / KOSLI_API_TOKEN.

Finding: no issue. The env vars are resolved before validation runs:

  1. root.go:PersistentPreRunE calls bindFlags() which runs viper.AutomaticEnv() with the KOSLI_ prefix
  2. For each flag not explicitly set on the command line, bindFlags checks if Viper has a value (from env var or config file) and calls cmd.Flags().Set() to populate it — this writes directly into the global struct fields (global.Org, global.ApiToken)
  3. Only after this, each command's PreRunE calls RequireGlobalFlags(global, ...) which checks global.Org == "" / global.ApiToken == ""

So when the env var is set, the global field is non-empty by the time validation runs, and no error is returned. Exit code 4 only fires when neither the flag nor the env var provides a value.

The execution order is: PersistentPreRunE (binds env vars → global) → PreRunE (validates global via RequireGlobalFlagsErrorBeforePrintingUsageErrUsage) → RunE.

@dangrondahl dangrondahl marked this pull request as ready for review March 20, 2026 11:00
@dangrondahl
Copy link
Contributor Author

How exit code 2 (ErrServer) relates to --max-api-retries

There are two code paths to exit code 2, both in internal/requests/requests.go. The --max-api-retries flag (default 3) controls how many attempts happen before exit code 2 is returned — it doesn't change the exit code itself, just delays it.

Path 1 — retries exhausted (line 249-252): The retryablehttp client retries up to --max-api-retries times on 5xx, 429, and network errors (customCheckRetry). When all retries fail, c.HttpClient.Do(req) returns an error like giving up after 4 attempt(s): ... which is wrapped as ErrServer → exit 2.

Path 2 — 5xx with parseable response body (line 293-294): If the final retry returns a 5xx with a JSON body (rather than a connection error), the response is parsed and wrapped as ErrServer → exit 2. In practice this is a safety net for edge cases like --max-api-retries=0.

With the default retry count of 3, almost all server/network errors go through Path 1.

@dangrondahl
Copy link
Contributor Author

dangrondahl commented Mar 20, 2026

Manual test examples for exit codes

For anyone building locally, here are quick ways to trigger each exit code:

Exit 0 — Success:

./kosli version
echo $?  # 0

Exit 1 — Compliance / unexpected error:

# Requires a valid token 
# --assert with a non-existent artifact triggers a compliance failure.
./kosli assert artifact --fingerprint 0000000000000000000000000000000000000000000000000000000000000000 \
  --org <insert-org-or-have-as-env-var> --api-token <insert-api-token-or-have-as-env-var> --flow nonexistent
echo $?  # 1

Exit 2 — Server unreachable:

./kosli list flows --org myorg --api-token fake --host http://localhost:9999
echo $?  # 2

Exit 3 — Auth error (401/403):

./kosli list flows --org <insert-org-or-have-as-env-var> --api-token fake
echo $?  # 3

Exit 4 — Usage error (missing/invalid flags):

./kosli list flows
echo $?  # 4  (--org is not set)

@ToreMerkely
Copy link
Contributor

ToreMerkely commented Mar 20, 2026

I suggest the following edits.
Kosli error codes start at 50 (or 51) and then goes up. That shuld make it clear that these are Kosli error codes that we have made a causes desission about and that you can trust it.

Exit 1 can not be trusted. It can come from anywhere.

@dangrondahl
Copy link
Contributor Author

dangrondahl commented Mar 20, 2026

I suggest the following edits. Kosli error codes start at 50 (or 51) and then goes up. That shuld make it clear that these are Kosli error codes that we have made a causes desission about and that you can trust it.

Exit 1 can not be trusted. It can come from anywhere.

I think I agree. with the starting from the 50s.
But for the exit 1, what do we do then, are you thinking:

Code Meaning
0 No error (policy allowed).
51 Unexpected error.
52 Kosli server is unreachable or returned a server error.
53 Invalid API token or unauthorized access.
54 CLI usage error (e.g. missing or invalid flags).

or

Code Meaning
0 No error (policy allowed).
1 Unexpected error. (comes from anywhere)
51 Kosli server is unreachable or returned a server error.
52 Invalid API token or unauthorized access.
53 CLI usage error (e.g. missing or invalid flags).

@ToreMerkely :
The second one. We start on 51 and any exit code not listed is unknow to us.

Example of other error messages that we are not in control of:

  • ./kosli is mising you get 127
  • ./kosli does not have write permission you get 126

@ToreMerkely
Copy link
Contributor

Do we want to consider this, or is that too far out?

Leave gaps between categories. If you pack them 51-52-53-54, you can't subdivide later.
Something like:

                                                                                                  
  ┌───────┬───────────────────┐                                                                   
  │ Range │      Domain       │
  ├───────┼───────────────────┤                                                                   
  │ 51-59 │ Compliance/policy │
  ├───────┼───────────────────┤
  │ 61-69 │ Server/infra      │                                                                   
  ├───────┼───────────────────┤
  │ 71-79 │ Auth/config       │                                                                   
  ├───────┼───────────────────┤                                                                   
  │ 81-89 │ Usage             │
  └───────┴───────────────────┘                                                                   
                  

Then you could later split e.g. 51 = assertion failed, 52 = approval missing, 53 = artifact not
found — without renumbering.

HTTP status gaps to consider:

  • 404 — currently not its own category. "Environment not found" vs "server broken" is a useful
    distinction for scripts. Does it belong in compliance, config, or its own bucket?
  • 429 rate limiting — retry-worthy like 5xx but different cause. Same bucket or separate?
  • Timeouts vs connection refused — both "server" but scripts might want to distinguish

Other edge cases:

  • Partial failures in multi-host mode — what code when 2 of 3 hosts succeed?
  • Dry-run currently forces exit 0 — should it still, or should it return "what would have been"
    the exit code?
  • Signal handling — if the CLI gets SIGTERM mid-request, does it exit cleanly with a code or
    just die? (128+N territory)

The biggest design question: should the codes be stable across minor versions? If yes, document
that contract explicitly. CI pipelines will hardcode these numbers and any renumbering is a
breaking change.

Captures swarm investigation of ~80 commands' error surfaces,
consumption patterns, and industry research. Evaluates compact
exit codes, ranged exit codes, JSON envelopes, and a hybrid.
Identifies MCP server as the most concrete near-term consumer
that would benefit from structured JSON error output.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants