Apache Iceberg version
1.10.0
Query engine
Spark
Please describe the bug 🐞
We are seeing severe outbound socket exhaustion (TIME_WAIT) when running Iceberg maintenance operations (specifically CALL system.rewrite_data_files) on large tables stored on S3.
This happens even with Apache HTTP client + connection pooling enabled and after removing any Hadoop/S3A usage.
The issue seems to correlate strongly with metadata / manifest (.avro) downloads, not with large data file reads.
Environment
- Iceberg version: 1.10.0
- Spark version: 4.0.1
- Spark on Kubernetes (Spark Operator / SparkApplication CRD)
- Storage: Amazon S3
- FileIO: org.apache.iceberg.aws.s3.S3FileIO
- HTTP client: Apache HTTP client (Iceberg shaded)
- AWS SDK: via iceberg-aws-bundle
- No s3a://, no hadoop-aws in use
- REST Catalog (Lakekeeper), but REST traffic is minimal; sockets are clearly to S3
Observed behavior
During rewrite_data_files on a table with ~26k data files:
- Outbound connections to S3 explode to 40k–45k sockets in TIME_WAIT
- Remote IPs are public S3 endpoints (3.x, 52.x)
- Happens primarily while reading metadata files: metadata/*.json, manifest-list.avro, snap-*.avro
- Kernel ephemeral ports get exhausted, causing job instability
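The port-exhaustion point is easy to sanity-check from inside the pod (a sketch; assumes a Linux container with procfs mounted). The default Linux ephemeral range is 32768–60999, roughly 28k ports, which is well below the ~43k TIME_WAIT sockets observed:

```shell
# Print the ephemeral port range and the number of usable source ports.
# On most kernels this defaults to 32768-60999, i.e. ~28k ports.
read lo hi < /proc/sys/net/ipv4/ip_local_port_range
echo "ephemeral ports available: $((hi - lo + 1))"
```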
And socket inspection from inside the executor pod shows:
- ~43k TIME_WAIT sockets
- Top destinations: 3.5.x.x, 52.218.x.x
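For reproducibility, the stats above can be gathered with a pipeline along these lines (a sketch; assumes iproute2's `ss` is available in the executor image — the helper names are ours, not from any tool):

```shell
# Count sockets currently in TIME_WAIT (skip the ss header line).
count_time_wait() {
  ss -tan state time-wait | tail -n +2 | wc -l
}

# Rank remote IPs among TIME_WAIT sockets. With a state filter, ss prints
# the peer address:port in column 4; strip the port, then count per IP.
top_time_wait_peers() {
  ss -tan state time-wait \
    | tail -n +2 \
    | awk '{ sub(/:[0-9]+$/, "", $4); print $4 }' \
    | sort | uniq -c | sort -rn | head
}
```

Run during the rewrite, `top_time_wait_peers` is what surfaces the 3.5.x.x / 52.218.x.x S3 endpoints dominating the counts.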
Relevant configuration
spark.sql.catalog.lakehouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.lakehouse.http-client.type=apache
spark.sql.catalog.lakehouse.http-client.apache.max-connections=200
spark.sql.catalog.lakehouse.http-client.apache.connection-max-idle-time-ms=300000
spark.sql.catalog.lakehouse.http-client.apache.connection-time-to-live-ms=3600000
spark.sql.iceberg.planning.max-threads=4 # reducing to 1 helps but does not eliminate the issue
Why this looks like an Iceberg-level issue
- The connection explosion correlates with manifest/metadata access, not data file I/O
- Planning and rewrite phases appear to trigger bursty, highly parallel small-object GETs
- Even with pooling, connections are frequently closed and recreated
This suggests:
- Metadata access patterns may be too aggressively parallel
- Manifest downloads may bypass or defeat effective connection reuse
- Planning threads / metadata splits may cause connection churn beyond what pooling can absorb
Questions / possible directions
- Is metadata/manifest I/O intentionally parallelized at this level?
- Are there known issues with connection reuse during manifest reads?
- Should planning.max-threads or metadata split behavior be auto-throttled?
- Are there additional cache knobs or client reuse guarantees for metadata reads?
- Has similar behavior been observed or addressed in newer versions?
We’re happy to provide:
- Additional logs (with request paths)
- Repro steps
- Packet/socket stats
- A minimal test case if needed
Thanks — this one is pretty brutal in production environments with strict networking limits.
Willingness to contribute