Massive TIME_WAIT socket exhaustion during metadata (manifest/avro) reads with S3FileIO + Apache HTTP client #14951

@Sbaia

Description

Apache Iceberg version

1.10.0

Query engine

Spark

Please describe the bug 🐞

We are seeing severe outbound socket exhaustion (TIME_WAIT) when running Iceberg maintenance operations (specifically CALL system.rewrite_data_files) on large tables stored on S3.

This happens even with the Apache HTTP client and connection pooling enabled, and after removing all Hadoop/S3A usage from the job.
The issue correlates strongly with metadata/manifest (.avro) downloads, not with large data file reads.


Environment

  • Iceberg version: 1.10.0
  • Spark version: 4.0.1
  • Spark on Kubernetes (Spark Operator / SparkApplication CRD)
  • Storage: Amazon S3
  • FileIO: org.apache.iceberg.aws.s3.S3FileIO
  • HTTP client: Apache HTTP client (Iceberg shaded)
  • AWS SDK: via iceberg-aws-bundle
  • No s3a://, no hadoop-aws in use
  • REST Catalog (Lakekeeper), but REST traffic is minimal; sockets are clearly to S3

Observed behavior

During rewrite_data_files on a table with ~26k data files:

  • Outbound connections to S3 explode to 40k–45k sockets in TIME_WAIT

  • Remote IPs are public S3 endpoints (3.x, 52.x)

  • Happens primarily while reading metadata files:

    • metadata/*.json
    • manifest-list.avro
    • snap-*.avro
  • Kernel ephemeral ports get exhausted, causing job instability

Socket inspection from inside the executor pod shows:

~43k TIME_WAIT sockets
Top destinations:
- 3.5.x.x
- 52.218.x.x
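
For reference, the counts above came from inside the pod with something like the following (assumes ss is available in the image; the sysctl lines at the end are only a stopgap we are testing, not a fix):

# total TIME_WAIT sockets
ss -tan state time-wait | tail -n +2 | wc -l

# group TIME_WAIT sockets by remote IP to spot the S3 endpoints
ss -tan state time-wait | awk 'NR>1 {split($4, a, ":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head

# stopgap mitigation (requires privileged sysctls in the pod): widen the
# ephemeral port range and allow TIME_WAIT reuse for outgoing connections
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1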

Relevant configuration

spark.sql.catalog.lakehouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO

spark.sql.catalog.lakehouse.http-client.type=apache
spark.sql.catalog.lakehouse.http-client.apache.max-connections=200
spark.sql.catalog.lakehouse.http-client.apache.connection-max-idle-time-ms=300000
spark.sql.catalog.lakehouse.http-client.apache.connection-time-to-live-ms=3600000

spark.sql.iceberg.planning.max-threads=4   # reducing to 1 helps but does not eliminate the issue
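
For completeness, these are the other Apache-client pool knobs we are aware of from Iceberg's HttpClientProperties; we have not yet confirmed whether any of them reduces the churn, and the values below are illustrative rather than recommendations:

spark.sql.catalog.lakehouse.http-client.apache.connection-acquisition-timeout-ms=10000
spark.sql.catalog.lakehouse.http-client.apache.tcp-keep-alive-enabled=true
spark.sql.catalog.lakehouse.http-client.apache.use-idle-connection-reaper-enabled=true
spark.sql.catalog.lakehouse.http-client.apache.expect-continue-enabled=false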

Why this looks like an Iceberg-level issue

  • The connection explosion correlates with manifest/metadata access, not data file I/O
  • Planning and rewrite phases appear to trigger bursty, highly parallel small-object GETs
  • Even with pooling, connections are frequently closed and recreated

This suggests:

  • Metadata access patterns may be too aggressively parallel
  • Manifest downloads may bypass or defeat effective connection reuse (see the logging sketch after this list)
  • Planning threads / metadata splits may cause connection churn beyond what pooling can absorb
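
One way to confirm the churn would be DEBUG logging on the connection pool. With an unshaded Apache client the relevant logger is org.apache.http.impl.conn.PoolingHttpClientConnectionManager, which logs connection lease/release; since the bundle shades the client, the actual logger name presumably carries a shaded package prefix, which we still need to verify. A log4j2.properties sketch for the unshaded name:

# watch pool lease/release to see whether connections are actually reused
logger.httppool.name = org.apache.http.impl.conn.PoolingHttpClientConnectionManager
logger.httppool.level = debug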

Questions / possible directions

  • Is metadata/manifest I/O intentionally parallelized at this level?
  • Are there known issues with connection reuse during manifest reads?
  • Should planning.max-threads or metadata split behavior be auto-throttled?
  • Are there additional cache knobs or client reuse guarantees for metadata reads?
  • Has similar behavior been observed or addressed in newer versions?

We’re happy to provide:

  • Additional logs (with request paths)
  • Repro steps
  • Packet/socket stats
  • A minimal test case if needed (first sketch below)
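
As a starting point for that test case, a minimal standalone sketch of what we have in mind (runs outside Spark; the thread count and the manifest paths passed as arguments are placeholders, and credentials come from the default AWS chain):

import java.io.InputStream;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.aws.s3.S3FileIO;
import org.apache.iceberg.io.InputFile;

public class ManifestReadRepro {
  public static void main(String[] args) throws Exception {
    // Mirror the job's client settings: Apache HTTP client with a bounded pool.
    S3FileIO io = new S3FileIO();
    io.initialize(Map.of(
        "http-client.type", "apache",
        "http-client.apache.max-connections", "200"));

    // Placeholder: in the real repro these would be the manifest-list and
    // manifest paths taken from a snapshot of the 26k-file table.
    List<String> manifests = List.of(args);

    ExecutorService pool = Executors.newFixedThreadPool(16); // parallelism is a guess
    for (String path : manifests) {
      pool.submit(() -> {
        InputFile in = io.newInputFile(path);
        try (InputStream stream = in.newStream()) {
          stream.readAllBytes(); // one small-object GET per manifest, as we suspect planning does
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
    io.close();
    // While this runs, watch TIME_WAIT counts with ss from another shell.
  }
}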

Thanks — this one is pretty brutal in production environments with strict networking limits.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
