Fix SQL timeout: use wall-clock time, not accumulated sleep counter#557
Open
MikhailKuklin wants to merge 2 commits into
Open
Fix SQL timeout: use wall-clock time, not accumulated sleep counter#557MikhailKuklin wants to merge 2 commits into
MikhailKuklin wants to merge 2 commits into
Conversation
The polling loop in SQLExecutor.execute() incremented an `elapsed`
counter by `poll_interval` each iteration, which only tracks sleep
time and ignores how long each `get_statement` RPC takes. Under
warehouse load, get_statement can take several seconds per call, so
each iteration is much longer than poll_interval. The counter
undercounts real wall clock, and the configured timeout fires later
than intended — or, for sufficiently slow RPCs, the query completes
naturally before the timeout ever fires.
Empirically reproduced (2026-06-04): a query with `timeout=900` on
a loaded warehouse ran 1999 seconds wall clock before completing
naturally, never hitting the timeout — the counter only reached ~666.
Fix: use `time.monotonic()` for the timeout boundary. The timeout
now fires at real wall clock regardless of RPC latency.
Validated against the fix: same query, same timeout, fired at 903s
wall clock with the statement reaching `CANCELED`. Error message
also now reports actual elapsed time, not the configured limit.
The error message changes from:
'SQL query timed out after {timeout} seconds...'
to:
'SQL query timed out after {elapsed_wall:.1f} seconds (limit:
{timeout}s)...'
so users can tell at a glance whether the timeout fired close to the
configured boundary or whether something delayed cancellation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SQLExecutor.execute()measures its timeout by incrementing anelapsedcounter bypoll_interval(2s) each iteration. The counter only counts sleep time and ignores how long eachget_statementRPC takes. Under warehouse load,get_statementcan take several seconds per call, so the counter undercounts real wall clock and the configured timeout fires much later than intended — or, for sufficiently slow RPCs, never fires at all before the statement completes naturally.This PR switches the timeout boundary to
time.monotonic(), so the timeout fires at real wall clock regardless of RPC latency.Reproduction
A heavy cross-join query with
timeout=900was expected to raiseSQLExecutionErrorat ~900s wall clock. Actual behavior on a warehouse running another query in parallel:elapsedcounter only reached ~666 over the full 1999s, because each iteration was averaging ~6s wall (2s sleep + ~4sget_statementRPC under load) but only incrementing the counter by 2.Fix
The error message now reports actual elapsed wall clock alongside the configured limit, so callers can see at a glance whether the timeout fired close to the boundary or whether something delayed cancellation.
Validation
Validated against the patched code on a live Databricks SQL warehouse:
timeout=30on a query naturally taking minutespoll_intervalovershoot), statement stateCANCELED✓timeout=900on the same query that previously ran 1999s without the fixCANCELED✓Full-scale post-fix error message:
"SQL query timed out after 902.3 seconds (limit: 900s) and was canceled. Statement ID: 01f16014-883e-1d8b-8694-de041fd3db03".Scope
Single-file change. Strict tightening of timeout behavior; no caller observably depended on timeouts being lenient.
parallel_executor.pyuses the sameSQLExecutorand inherits the fix transparently.Note
We discovered this downstream and shipped the fix internally on a fork. Filing here so the broader
ai-dev-kituser base benefits — the latent bug only manifests under warehouse load, which is why it has gone unnoticed for ~5 months despite the file being touched several times for unrelated changes./cc @QuentinAmbard since you authored this file and are the most active maintainer — happy to address any review comments.