Reduce metadata refresh retries from 20 to 3 by jiang95-dev · Pull Request #576 · linkedin/openhouse

jiang95-dev · 2026-05-11T19:23:06Z

Summary

The select statement on some malformed tables will stuck for 60s, timeout, and throws a 504 error. However, the underlying server shows a different error. Take u_openhouse.test_rohit2 as an example. It is a replica table with a wrongly-ordered snapshot logs. So refreshFromMetadataLocation failed for Invalid update timestamp 1738457438373: before last snapshot log entry at 1738534225676, and retried 20 times (iceberg's default config). The total time spent on retry is 90s which causes the client to timeout. The same pattern of retrying 20 times for 90s will happen to all tables with a bad metadata. So I want to reduce the retry number to acheive 2 things:

Avoid a wrong error code on the client side.
Fast fail at parsing the metadata to reduce waste of computing resources.

Changes

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Iceberg's BaseMetastoreTableOperations defaults META_DATA_REFRESH_RETRIES to 20 with exponential backoff capped at 5s, so a failing refresh stalls for ~90 seconds before surfacing the underlying error. HTS already returns the authoritative metadata pointer, so the high retry budget mostly serves to mask read-after-write windows on object stores - 3 retries is enough. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cbb330 · 2026-05-11T19:56:39Z

client timeout is 4min. why is 90 causing issues?
can you verify that other tables are not benefitting from the default retry? we don't want to regress on it, so i'm hoping that you can show 0 logs for "happy path" queries are utilizing >2 retries.

maluchari · 2026-05-11T21:13:59Z

Summary

The select statement on some malformed tables will stuck for 60s, timeout, and throws a 504 error. However, the underlying server shows a different error. Take u_openhouse.test_rohit2 as an example. It is a replica table with a wrongly-ordered snapshot logs. So refreshFromMetadataLocation failed for Invalid update timestamp 1738457438373: before last snapshot log entry at 1738534225676, and retried 20 times (iceberg's default config). The total time spent on retry is 90s which causes the client to timeout. The same pattern of retrying 20 times for 90s will happen to all tables with a bad metadata. So I want to reduce the retry number to acheive 2 things:

Avoid a wrong error code on the client side.

Fast fail at parsing the metadata to reduce waste of computing resources.

Changes

Client-facing API Changes

Internal API Changes

Bug Fixes

New Features

Performance Improvements

Code Style

Refactoring

Documentation

Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.

Added new tests for the changes made.

Updated existing tests to reflect the changes made.

No tests added or updated. Please explain why. If unsure, please feel free to ask for help.

Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

Breaking Changes

Deprecations

Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

IllegalArgumentException - why is this treated as a deterministic error? Your fix will probably make these 504 to 500 errors but s retry should probably be skipped for these. The actual root cause is retrying deterministic exceptions. Invalid update timestamp ... is an IllegalArgumentException from Preconditions.checkArgument on an immutable metadata file — retrying it can never succeed. Could we pass a shouldRetry predicate that excludes IllegalArgumentException instead of (or in addition to) lowering the retry count? That way real transients (HDFS NameNode failovers, S3 throttling) still get some headroom for retries. Also 3 retries ≈ 2 seconds of sleep, not 7. With Iceberg's exponentialBackoff the first three sleeps are 100ms, 400ms, 1.6s. That's likely too short for legitimate transients on flaky HDFS and could cause regressions. Worth either keeping retries higher and excluding deterministic exceptions, or justifying the choice with empirical p99 data.

jiang95-dev and others added 2 commits May 11, 2026 11:38

fix

8cd8111

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce metadata refresh retries from 20 to 3#576

Reduce metadata refresh retries from 20 to 3#576
jiang95-dev wants to merge 2 commits into
linkedin:mainfrom
jiang95-dev:lejiang/reduce-retry

jiang95-dev commented May 11, 2026 •

edited

Loading

Uh oh!

cbb330 commented May 11, 2026

Uh oh!

maluchari commented May 11, 2026

Summary

Changes

Testing Done

Additional Information

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jiang95-dev commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing Done

Additional Information

Uh oh!

cbb330 commented May 11, 2026

Uh oh!

maluchari commented May 11, 2026

Summary

Changes

Testing Done

Additional Information

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jiang95-dev commented May 11, 2026 •

edited

Loading