
Add limits.max_agent_doc_size config#7135

Merged
michel-laterman merged 6 commits into
elastic:mainfrom
michel-laterman:fix/checkin-max-agent-size
Jun 5, 2026
Conversation

@michel-laterman
Contributor

What is the problem this PR solves?

Prevent fleet-server from going OOM when a compressed checkin request size explodes on decompression.

How does this PR solve the problem?

Add a new config option called limits.max_agent_doc_size that defaults to 20mb. This limit is applied to the gzip.Reader for compressed requests.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Add a new config option that applies to checkin request bodies after
decompression. This is intended to prevent a compressed request from
leading to an OOM on decompression.
@michel-laterman michel-laterman added the bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), and backport-9.4 (Automated backport to 9.4 branch) labels Jun 1, 2026
@michel-laterman michel-laterman marked this pull request as ready for review June 1, 2026 20:47
@michel-laterman michel-laterman requested a review from a team as a code owner June 1, 2026 20:47
@github-actions
Contributor

github-actions Bot commented Jun 1, 2026

TL;DR

The E2E Test Buildkite step failed at package level (testing/e2e), but the provided log is truncated and does not include the actual failing test/error line. Immediate next action is to re-run the job (or fetch build/test-e2e-<os>.out) to capture the first failure line.

Remediation

  • Re-run Buildkite build 14954 (or open the full raw log/artifact) and capture the first --- FAIL: / panic: / WARNING: DATA RACE line from testing/e2e output.
  • If available, upload/extract build/test-e2e-linux.out from the failing run (the mage target writes this file even when go test fails) and use that as the canonical failure source.
Investigation details

Root Cause

Not conclusive from available data. The pre-fetched step log ends with package-level failure only:

  • FAIL github.com/elastic/fleet-server/testing/e2e 1242.627s
  • no corresponding --- FAIL: <test> / panic / race line is present in the provided file.

Related execution paths:

  • .buildkite/scripts/e2e_test.sh:16 runs mage test:e2e test:junitReport
  • magefile.go:2162 runs e2e with go test ... -race ... ./...
  • magefile.go:2172 writes build/test-e2e-<os>.out after test execution

Evidence

Verification

  • Not run locally: Docker daemon is unavailable in this environment, so e2e reproduction here is not possible.

Follow-up

Once the first failing line is available, I can map it to the exact source/test and provide a concrete code-level fix.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

  • Add limits.max_agent_doc_size config #7135 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #7135 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

From workflow: PR Buildkite Detective

Comment thread internal/pkg/config/env_defaults.go Outdated
Comment thread internal/pkg/api/handleCheckin.go Outdated
Comment thread internal/pkg/api/handleCheckin_test.go Outdated
Comment thread internal/pkg/config/limits.go
Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
@michel-laterman michel-laterman merged commit 44083ae into elastic:main Jun 5, 2026
10 checks passed
@michel-laterman michel-laterman deleted the fix/checkin-max-agent-size branch June 5, 2026 17:24
ebeahan pushed a commit that referenced this pull request Jun 5, 2026
* Add limits.max_agent_doc_size config

Add a new config option that applies to checkin request bodies after
decompression. This is intended to prevent a compressed request from
leading to an OOM on decompression.

* Change to using a MaxBodyDecompressed limit per endpoint

* cleanup

* Apply suggestions from code review



---------


(cherry picked from commit 44083ae)

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

Labels

backport-9.4 (Automated backport to 9.4 branch), bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)


2 participants