Skip to content

feat: add metrics for auth_query and SecretChecker erpc calls#969

Draft
mentels wants to merge 6 commits into
mainfrom
pooler-392-auth-query-alerting
Draft

feat: add metrics for auth_query and SecretChecker erpc calls#969
mentels wants to merge 6 commits into
mainfrom
pooler-392-auth-query-alerting

Conversation

@mentels
Copy link
Copy Markdown
Contributor

@mentels mentels commented May 6, 2026

Summary

Closes POOLER-392.

Adds four distribution metrics to improve observability of auth_query and SecretChecker operations:

  • supavisor_auth_query_connection_duration — Postgrex connection setup time, emitted from AuthQuery.start_link/2
  • supavisor_auth_query_query_duration — queue wait + query RTT, emitted from AuthQuery.fetch_user_secret/3; logs a warning when execution exceeds 1s
  • supavisor_secret_checker_get_secrets_duration_{local,remote} — full erpc round-trip for get_secrets, split by locality (same pattern as pool_checkout)
  • supavisor_secret_checker_update_credentials_duration_{local,remote} — same for update_credentials

All metrics tagged with status only — no per-tenant tags to avoid cardinality explosion.

Also fixes Logger.error in do_get_secrets/1 to include project: tenant structured metadata (the function runs in the erpc caller process, not the SecretChecker GenServer, so the metadata was previously missing).

🤖 Generated with Claude Code

Adds four telemetry distributions to make auth_query failures observable
without per-tenant cardinality:

- supavisor_auth_query_connection_duration: Postgrex connection time in
  AuthQuery.start_link/2, tagged by result/reason
- supavisor_auth_query_query_duration: query RTT including queue wait in
  fetch_user_secret/3, tagged by result/reason; logs a warning when
  execution exceeds 1s
- supavisor_secret_checker_get_secrets_duration_{local,remote}: full
  erpc round-trip for get_secrets, split by same-node vs cross-node
- supavisor_secret_checker_update_credentials_duration_{local,remote}:
  same for update_credentials

Also fixes Logger.error in do_get_secrets/1 to include project: tenant
structured metadata so the tenant is visible when the erpc call exits.

Motivated by an euc11 incident where Postgrex queue buildup silently
blocked get_secrets for all tenants with no metric to alert on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mentels mentels requested a review from a team as a code owner May 6, 2026 19:15
@mentels mentels marked this pull request as draft May 6, 2026 19:21
require Supavisor

@disabled Application.compile_env(:supavisor, :metrics_disabled, false)
@slow_auth_query_ms 1_000
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the right number, we can see what the times are in the first cluster we deploy to.

)
end

defp client_auth_metrics do
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, these are not per-tenant metrics but thought it's not worth creating a new module just for them

Comment on lines +135 to +136
queue_target: 1,
queue_interval: 4
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making the test fast

Comment on lines +44 to +46
Logger.error("SecretChecker: get_secrets call exited: #{inspect(reason)}",
project: Supavisor.id(id, :tenant)
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we were missing the project md if that was an RPC call

queue_interval: 5_000,
ssl_opts: ssl_opts
]
|> Postgrex.start_link()
Copy link
Copy Markdown
Member

@v0idpwn v0idpwn May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connection in postgrex is async unless we pass the sync flag, this function returns a pid even without the connection being complete, so the measurement won't get slow connections

Comment on lines +134 to +136
if duration_ms > @slow_auth_query_ms do
Logger.warning("auth_query took over #{@slow_auth_query_ms}ms (#{duration_ms}ms)")
end
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like this!

@@ -2,6 +2,7 @@ defmodule Supavisor.AuthQueryTest do
use ExUnit.Case, async: true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think after the changes in this file, this test case must be sync?

Comment on lines 172 to +173
{:error, :not_started}
|> tap(&telemetry_stop(fun, start, &1, :local))
Copy link
Copy Markdown
Member

@v0idpwn v0idpwn May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not_started as in "pool not up" should not be tagged as error in telemetry but as something else... Maybe should review the return values of the function too and do telemetry via reason, e.g.: not_started, exited, timeout, ...?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants