Skip to content

Reduce metadata refresh retries from 20 to 3#576

Open
jiang95-dev wants to merge 2 commits into
linkedin:mainfrom
jiang95-dev:lejiang/reduce-retry
Open

Reduce metadata refresh retries from 20 to 3#576
jiang95-dev wants to merge 2 commits into
linkedin:mainfrom
jiang95-dev:lejiang/reduce-retry

Conversation

@jiang95-dev
Copy link
Copy Markdown
Collaborator

@jiang95-dev jiang95-dev commented May 11, 2026

Summary

The select statement on some malformed tables will stuck for 60s, timeout, and throws a 504 error. However, the underlying server shows a different error. Take u_openhouse.test_rohit2 as an example. It is a replica table with a wrongly-ordered snapshot logs. So refreshFromMetadataLocation failed for Invalid update timestamp 1738457438373: before last snapshot log entry at 1738534225676, and retried 20 times (iceberg's default config). The total time spent on retry is 90s which causes the client to timeout. The same pattern of retrying 20 times for 90s will happen to all tables with a bad metadata. So I want to reduce the retry number to acheive 2 things:

  1. Avoid a wrong error code on the client side.
  2. Fast fail at parsing the metadata to reduce waste of computing resources.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

jiang95-dev and others added 2 commits May 11, 2026 11:38
Iceberg's BaseMetastoreTableOperations defaults META_DATA_REFRESH_RETRIES
to 20 with exponential backoff capped at 5s, so a failing refresh stalls
for ~90 seconds before surfacing the underlying error. HTS already returns
the authoritative metadata pointer, so the high retry budget mostly serves
to mask read-after-write windows on object stores - 3 retries is enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cbb330
Copy link
Copy Markdown
Collaborator

cbb330 commented May 11, 2026

  1. client timeout is 4min. why is 90 causing issues?

  2. can you verify that other tables are not benefitting from the default retry? we don't want to regress on it, so i'm hoping that you can show 0 logs for "happy path" queries are utilizing >2 retries.

@maluchari
Copy link
Copy Markdown
Collaborator

Summary

The select statement on some malformed tables will stuck for 60s, timeout, and throws a 504 error. However, the underlying server shows a different error. Take u_openhouse.test_rohit2 as an example. It is a replica table with a wrongly-ordered snapshot logs. So refreshFromMetadataLocation failed for Invalid update timestamp 1738457438373: before last snapshot log entry at 1738534225676, and retried 20 times (iceberg's default config). The total time spent on retry is 90s which causes the client to timeout. The same pattern of retrying 20 times for 90s will happen to all tables with a bad metadata. So I want to reduce the retry number to acheive 2 things:

  1. Avoid a wrong error code on the client side.
  2. Fast fail at parsing the metadata to reduce waste of computing resources.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

IllegalArgumentException - why is this treated as a deterministic error? Your fix will probably make these 504 to 500 errors but s retry should probably be skipped for these. The actual root cause is retrying deterministic exceptions. Invalid update timestamp ... is an IllegalArgumentException from Preconditions.checkArgument on an immutable metadata file — retrying it can never succeed. Could we pass a shouldRetry predicate that excludes IllegalArgumentException instead of (or in addition to) lowering the retry count? That way real transients (HDFS NameNode failovers, S3 throttling) still get some headroom for retries. Also 3 retries ≈ 2 seconds of sleep, not 7. With Iceberg's exponentialBackoff the first three sleeps are 100ms, 400ms, 1.6s. That's likely too short for legitimate transients on flaky HDFS and could cause regressions. Worth either keeping retries higher and excluding deterministic exceptions, or justifying the choice with empirical p99 data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants