feat: add metrics for auth_query and SecretChecker erpc calls#969
feat: add metrics for auth_query and SecretChecker erpc calls#969mentels wants to merge 6 commits into
Conversation
Adds four telemetry distributions to make auth_query failures observable
without per-tenant cardinality:
- supavisor_auth_query_connection_duration: Postgrex connection time in
AuthQuery.start_link/2, tagged by result/reason
- supavisor_auth_query_query_duration: query RTT including queue wait in
fetch_user_secret/3, tagged by result/reason; logs a warning when
execution exceeds 1s
- supavisor_secret_checker_get_secrets_duration_{local,remote}: full
erpc round-trip for get_secrets, split by same-node vs cross-node
- supavisor_secret_checker_update_credentials_duration_{local,remote}:
same for update_credentials
Also fixes Logger.error in do_get_secrets/1 to include project: tenant
structured metadata so the tenant is visible when the erpc call exits.
Motivated by an euc11 incident where Postgrex queue buildup silently
blocked get_secrets for all tenants with no metric to alert on.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| require Supavisor | ||
|
|
||
| @disabled Application.compile_env(:supavisor, :metrics_disabled, false) | ||
| @slow_auth_query_ms 1_000 |
There was a problem hiding this comment.
I'm not sure if this is the right number, we can see what the times are in the first cluster we deploy to.
| ) | ||
| end | ||
|
|
||
| defp client_auth_metrics do |
There was a problem hiding this comment.
technically, these are not per-tenant metrics but thought it's not worth creating a new module just for them
| queue_target: 1, | ||
| queue_interval: 4 |
There was a problem hiding this comment.
making the test fast
| Logger.error("SecretChecker: get_secrets call exited: #{inspect(reason)}", | ||
| project: Supavisor.id(id, :tenant) | ||
| ) |
There was a problem hiding this comment.
we were missing the project md if that was an RPC call
| queue_interval: 5_000, | ||
| ssl_opts: ssl_opts | ||
| ] | ||
| |> Postgrex.start_link() |
There was a problem hiding this comment.
Connection in postgrex is async unless we pass the sync flag, this function returns a pid even without the connection being complete, so the measurement won't get slow connections
| if duration_ms > @slow_auth_query_ms do | ||
| Logger.warning("auth_query took over #{@slow_auth_query_ms}ms (#{duration_ms}ms)") | ||
| end |
| @@ -2,6 +2,7 @@ defmodule Supavisor.AuthQueryTest do | |||
| use ExUnit.Case, async: true | |||
There was a problem hiding this comment.
I think after the changes in this file, this test case must be sync?
| {:error, :not_started} | ||
| |> tap(&telemetry_stop(fun, start, &1, :local)) |
There was a problem hiding this comment.
I think not_started as in "pool not up" should not be tagged as error in telemetry but as something else... Maybe should review the return values of the function too and do telemetry via reason, e.g.: not_started, exited, timeout, ...?
Summary
Closes POOLER-392.
Adds four distribution metrics to improve observability of
auth_queryandSecretCheckeroperations:supavisor_auth_query_connection_duration— Postgrex connection setup time, emitted fromAuthQuery.start_link/2supavisor_auth_query_query_duration— queue wait + query RTT, emitted fromAuthQuery.fetch_user_secret/3; logs a warning when execution exceeds 1ssupavisor_secret_checker_get_secrets_duration_{local,remote}— full erpc round-trip forget_secrets, split by locality (same pattern aspool_checkout)supavisor_secret_checker_update_credentials_duration_{local,remote}— same forupdate_credentialsAll metrics tagged with
statusonly — no per-tenant tags to avoid cardinality explosion.Also fixes
Logger.errorindo_get_secrets/1to includeproject: tenantstructured metadata (the function runs in the erpc caller process, not the SecretChecker GenServer, so the metadata was previously missing).🤖 Generated with Claude Code