Add limits.max_agent_doc_size config by michel-laterman · Pull Request #7135 · elastic/fleet-server

michel-laterman · 2026-06-01T20:29:50Z

What is the problem this PR solves?

Prevent fleet-server from running out of memory (OOM) when a compressed checkin request expands to a much larger size on decompression.

How does this PR solve the problem?

Add a new config option called limits.max_agent_doc_size that defaults to 20mb. This limit is applied to the gzip.Reader for compressed requests.

Design Checklist

~~I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.~~
~~I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.~~
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

Add a new config option that applies to checkin request bodies after decompression. This is intended to prevent a compressed request from leading to an OOM on decompression.

github-actions · 2026-06-01T21:51:41Z

TL;DR

The E2E Test Buildkite step failed at package level (testing/e2e), but the provided log is truncated and does not include the actual failing test/error line. Immediate next action is to re-run the job (or fetch build/test-e2e-<os>.out) to capture the first failure line.

Remediation

Re-run Buildkite build 14954 (or open the full raw log/artifact) and capture the first --- FAIL: / panic: / WARNING: DATA RACE line from testing/e2e output.
If available, upload/extract build/test-e2e-linux.out from the failing run (the mage target writes this file even when go test fails) and use that as the canonical failure source.

Investigation details

Root Cause

Not conclusive from available data. The pre-fetched step log ends with package-level failure only:

FAIL github.com/elastic/fleet-server/testing/e2e 1242.627s
No corresponding --- FAIL: <test>, panic:, or WARNING: DATA RACE line is present in the provided file.

Related execution paths:

.buildkite/scripts/e2e_test.sh:16 runs mage test:e2e test:junitReport
magefile.go:2162 runs e2e with go test ... -race ... ./...
magefile.go:2172 writes build/test-e2e-<os>.out after test execution

Evidence

Build: https://buildkite.com/elastic/fleet-server/builds/14954
Job/step: E2E Test (.buildkite/scripts/e2e_test.sh)
Key log excerpt:
- FAIL github.com/elastic/fleet-server/testing/e2e 1242.627s
- Error: exit status 1

Verification

Not run locally: Docker daemon is unavailable in this environment, so e2e reproduction here is not possible.

Follow-up

Once the first failing line is available, I can map it to the exact source/test and provide a concrete code-level fix.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

Add limits.max_agent_doc_size config #7135 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
#7135 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

From workflow: PR Buildkite Detective

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

* Add limits.max_agent_doc_size config

  Add a new config option that applies to checkin request bodies after decompression. This is intended to prevent a compressed request from leading to an OOM on decompression.
* Change to using a MaxBodyDecompressed limit per endpoint
* cleanup
* Apply suggestions from code review

---------

(cherry picked from commit 44083ae)

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

Add limits.max_agent_doc_size config

c8e4519

Add a new config option that applies to checkin request bodies after decompression. This is intended to prevent a compressed request from leading to an OOM on decompression.

michel-laterman added the bug, Team:Elastic-Agent-Control-Plane, and backport-9.4 labels Jun 1, 2026

michel-laterman mentioned this pull request Jun 1, 2026

Health check request should be chunked elastic/elastic-agent#11981

Open

michel-laterman marked this pull request as ready for review June 1, 2026 20:47

michel-laterman requested a review from a team as a code owner June 1, 2026 20:47

michel-laterman requested review from blakerouse and samuelvl June 1, 2026 20:47

Merge branch 'main' into fix/checkin-max-agent-size

442e055

michel-laterman commented Jun 3, 2026

Comment thread internal/pkg/config/env_defaults.go Outdated

github-actions Bot mentioned this pull request Jun 4, 2026

fix: enforce policy-based access control on artifact downloads #7009

Merged

5 tasks

Merge branch 'main' into fix/checkin-max-agent-size

589b99d

ycombinator reviewed Jun 4, 2026

Comment thread internal/pkg/api/handleCheckin.go Outdated

michel-laterman added 2 commits June 4, 2026 14:43

Change to using a MaxBodyDecompressed limit per endpoint

8c9643a

cleanup

2b7e5be

michel-laterman requested a review from ycombinator June 5, 2026 15:39

ycombinator reviewed Jun 5, 2026

Comment thread internal/pkg/api/handleCheckin_test.go Outdated

ycombinator reviewed Jun 5, 2026

Comment thread internal/pkg/config/limits.go

Apply suggestions from code review

a305978

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

ycombinator approved these changes Jun 5, 2026

michel-laterman merged commit 44083ae into elastic:main Jun 5, 2026
10 checks passed

michel-laterman deleted the fix/checkin-max-agent-size branch June 5, 2026 17:24

mergify Bot mentioned this pull request Jun 5, 2026

[9.4](backport #7135) Add limits.max_agent_doc_size config #7168

Merged

6 tasks

michel-laterman merged 6 commits into elastic:main from michel-laterman:fix/checkin-max-agent-size
