Skip to content

fix: close response body in ParseErr to prevent goroutine leak#804

Merged
frodenas merged 3 commits into
vmware:mainfrom
sircthulhu:fix/parse-err-body-leak
May 5, 2026
Merged

fix: close response body in ParseErr to prevent goroutine leak#804
frodenas merged 3 commits into
vmware:mainfrom
sircthulhu:fix/parse-err-body-leak

Conversation

@sircthulhu
Copy link
Copy Markdown
Contributor

@sircthulhu sircthulhu commented May 4, 2026

Description

Related issue: #803

Three response-body leaks in govcd that all show up under sustained 4xx / 5xx loads:

  1. ParseErr (XML / JSON error path)checkRespWithErrType returns (nil, ParseErr(bodyType, resp, errType)), so the caller has no handle on the response and cannot close it. ParseErr itself only forwards the response to decodeBody, which io.ReadAlls the body but never closes it.
  2. executeJsonRequest (OpenAPI / JSON path) — on every non-2xx the function reads the body with io.ReadAll and returns (resp, err). All callers follow if err != nil { return nil, err } and only defer closeBody(resp) afterwards, so the deferred close never runs.
  3. rewrapRespBodyNoopCloser — reads the body and replaces resp.Body with io.NopCloser(...). After that any subsequent resp.Body.Close() is a no-op, so the underlying TCP connection is never released. Used from api_vcd.go, entity_client.go, and openapi.go — every request that goes through these paths leaks a goroutine.

Same class of leak (net/http.setRequestCancel.func4 goroutine pinned to a never-closed body), three different code paths.

Verification

A real-world Kubernetes operator built on top of go-vcloud-director was leaking about 28 goroutines per second of error rate; one snapshot showed 12 651 leaked setRequestCancel.func4 goroutines (~200 MB RSS) before the pod OOMed. With all three fixes applied the same workload's goroutine count stays flat at the controller-runtime baseline (~600) and RSS hovers under 100 MB indefinitely.

checkRespWithErrType returns (nil, ParseErr(bodyType, resp, errType))
on every 4xx / 5xx response, which means the caller never gets the
response and therefore cannot close its Body. ParseErr passes the
response on to decodeBody, which io.ReadAll's the body but does not
close it either. As a result every failed VCD API call leaves one
net/http.setRequestCancel.func4 goroutine alive forever, holding the
underlying TCP connection.

Observed in a Kubernetes operator that retries failed VCD calls in a
reconcile loop: the controller pod accumulated ~28 goroutines per
second of error rate (12 651 in net/http.setRequestCancel.func4 in
one snapshot, ~200 MB RSS) until it OOMed.

Fix it at ParseErr — the last point that has access to the response
on the error path — by deferring resp.Body.Close() before
decodeBody is invoked.

Signed-off-by: Kirill Ilin <stitch14@yandex.ru>
@sircthulhu sircthulhu requested a review from frodenas as a code owner May 4, 2026 15:33
sircthulhu added 2 commits May 4, 2026 20:43
executeJsonRequest reads the body with io.ReadAll on every non-2xx
response and returns (resp, err) to the caller. All callers follow the
pattern

    resp, err := client.executeJsonRequest(...)
    if err != nil {
        return nil, err
    }
    defer closeBody(resp)

so on the error path the deferred close never runs and resp.Body stays
open, leaking one net/http.setRequestCancel.func4 goroutine per failed
call. This is the same class of leak as vmware#803/vmware#804 in ParseErr but on a
different code path (OpenAPI / JSON requests).

Close the body explicitly after the read.

Signed-off-by: Kirill Ilin <stitch14@yandex.ru>
rewrapRespBodyNoopCloser reads the response body into a byte slice and
then replaces resp.Body with an io.NopCloser. After that any subsequent
resp.Body.Close() in the caller is a no-op, so the underlying TCP
connection is never released and the matching net/http.setRequestCancel
goroutine waits forever.

Three call-sites are affected (api_vcd.go, entity_client.go,
openapi.go), so under any sustained workload that goes through OpenAPI
or the entity client the operator on top of this library leaks one
goroutine per request. Same class of leak as vmware#803/vmware#804 in ParseErr and
in executeJsonRequest, just on a third code path.

Close the original Body before replacing it with the NopCloser.

Signed-off-by: Kirill Ilin <stitch14@yandex.ru>
@frodenas
Copy link
Copy Markdown
Member

frodenas commented May 5, 2026

Thanks @sircthulhu

@frodenas frodenas merged commit 425e0f6 into vmware:main May 5, 2026
2 checks passed
@sircthulhu sircthulhu deleted the fix/parse-err-body-leak branch May 5, 2026 04:48
@frodenas frodenas mentioned this pull request May 5, 2026
frodenas added a commit that referenced this pull request May 5, 2026
Signed-off-by: Ferran Rodenas <ferran.rodenas@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants