Reduce metadata refresh retries from 20 to 3 #576
Iceberg's BaseMetastoreTableOperations defaults META_DATA_REFRESH_RETRIES to 20 with exponential backoff capped at 5s, so a failing refresh stalls for ~90 seconds before surfacing the underlying error. HTS already returns the authoritative metadata pointer, so the high retry budget mostly serves to mask read-after-write windows on object stores; 3 retries is enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
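The ~90 second figure can be sanity-checked with rough arithmetic. The sketch below is not Iceberg code; it assumes a 100 ms first sleep and a 4x growth factor (the 100ms, 400ms, 1.6s sequence quoted in the review), the 5 s cap mentioned above, and it ignores any jitter. `BackoffBudget` and `totalSleepMs` are hypothetical names used only for illustration.

```java
final class BackoffBudget {
    // Sums the sleep time spent across `retries` retries, without jitter.
    // Assumed schedule: minSleepMs * scaleFactor^attempt, capped at maxSleepMs.
    static long totalSleepMs(int retries, long minSleepMs, long maxSleepMs, double scaleFactor) {
        long total = 0;
        for (int attempt = 0; attempt < retries; attempt++) {
            long delayMs = (long) Math.min(minSleepMs * Math.pow(scaleFactor, attempt), maxSleepMs);
            total += delayMs;
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 + 400 + 1600 + 17 * 5000 = 87100 ms, roughly the ~90 s stall
        System.out.println(totalSleepMs(20, 100, 5000, 4.0));
        // 100 + 400 + 1600 = 2100 ms
        System.out.println(totalSleepMs(3, 100, 5000, 4.0));
    }
}
```

Under these assumptions, almost all of the 20-retry budget is spent in the capped 5 s sleeps, while 3 retries never reach the cap.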
IllegalArgumentException - why is this treated as a deterministic error? Your fix will probably turn these 504 errors into 500 errors, but retries should probably be skipped for these exceptions entirely.

The actual root cause is retrying deterministic exceptions. Invalid update timestamp ... is an IllegalArgumentException thrown by Preconditions.checkArgument on an immutable metadata file, so retrying it can never succeed. Could we pass a shouldRetry predicate that excludes IllegalArgumentException instead of (or in addition to) lowering the retry count? That way real transients (HDFS NameNode failovers, S3 throttling) still get some headroom for retries.

Also, 3 retries ≈ 2 seconds of sleep, not 7. With Iceberg's exponentialBackoff the first three sleeps are 100ms, 400ms, 1.6s. That's likely too short for legitimate transients on flaky HDFS and could cause regressions. Worth either keeping retries higher and excluding deterministic exceptions, or justifying the choice with empirical p99 data.
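To make the shouldRetry suggestion concrete, here is a minimal standalone sketch of a retry loop that fails fast on deterministic exceptions. It is not Iceberg's implementation and does not use its API; `RetrySketch`, `runWithRetry`, and `SKIP_DETERMINISTIC` are hypothetical names, and the backoff schedule (first sleep, cap, growth factor) is passed in by the caller.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

final class RetrySketch {
    // Hypothetical predicate: retry everything except IllegalArgumentException,
    // which is treated as deterministic (retrying cannot change the outcome).
    static final Predicate<Exception> SKIP_DETERMINISTIC =
        e -> !(e instanceof IllegalArgumentException);

    // Calls the task, retrying up to maxRetries times with capped exponential
    // backoff, but only while shouldRetry accepts the thrown exception.
    static <T> T runWithRetry(Callable<T> task, int maxRetries,
                              Predicate<Exception> shouldRetry,
                              long minSleepMs, long maxSleepMs, double scaleFactor)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Rethrow once the budget is spent or the failure is not retryable.
                if (attempt >= maxRetries || !shouldRetry.test(e)) {
                    throw e;
                }
                long delayMs =
                    (long) Math.min(minSleepMs * Math.pow(scaleFactor, attempt), maxSleepMs);
                Thread.sleep(delayMs);
            }
        }
    }
}
```

With a predicate like this, a failure such as Invalid update timestamp surfaces on the first attempt regardless of the retry count, so the retry budget can be tuned for transient failures alone.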
Summary
A SELECT statement on some malformed tables hangs for 60s, times out, and returns a 504 error. However, the underlying server shows a different error. Take u_openhouse.test_rohit2 as an example. It is a replica table with wrongly ordered snapshot logs, so refreshFromMetadataLocation failed with Invalid update timestamp 1738457438373: before last snapshot log entry at 1738534225676, and was retried 20 times (Iceberg's default config). The total time spent on retries is 90s, which causes the client to time out. The same pattern of retrying 20 times over 90s will happen for every table with bad metadata. So I want to reduce the retry count to achieve 2 things:
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.