fix: break circular references so Agent cleanup doesn't hang with MCPClient#1830
fix: break circular references so Agent cleanup doesn't hang with MCPClient#1830dbschmigelski wants to merge 5 commits intostrands-agents:mainfrom
Conversation
…Client Agent had circular references (Agent → _ToolCaller._agent → Agent and Agent → _PluginRegistry._agent → Agent) that prevented refcount-based destruction. When an Agent with a managed MCPClient was created inside a function, cleanup was deferred to interpreter shutdown where MCPClient.stop() → thread.join() would hang. Use weakref.ref for the back-references so the Agent is collected immediately when it goes out of scope. Closes strands-agents#1732
|
Issue: The PR description is missing a "Documentation PR" section. Suggestion: Since this is an internal bug fix that doesn't change public API, please add a "Documentation PR" section with a justification explaining why documentation updates are not required. For example: |
| """Return the agent, raising ReferenceError if it has been garbage collected.""" | ||
| agent = self._agent_ref() | ||
| if agent is None: | ||
| raise ReferenceError("Agent has been garbage collected") |
There was a problem hiding this comment.
Suggestion: The error message is clear and appropriate. Consider whether this ReferenceError could propagate to user code unexpectedly - though since these are internal classes with private naming (_PluginRegistry), this should be fine in practice.
There was a problem hiding this comment.
The PR description now thoroughly documents this edge case under "What could break" - the behavior change is clearly communicated. Resolving this thread. ✅
There was a problem hiding this comment.
Assessment: Request Changes
This PR correctly addresses issue #1732 by using weakref.ref to break circular references in _ToolCaller and _PluginRegistry. The implementation is clean and the tests effectively validate the fix.
Review Details
Blocking Issue:
- Documentation PR section missing: Please add a "Documentation PR" section to the PR description with either a link to documentation changes or a justification (e.g., "Not required - internal bug fix that doesn't affect public API")
Implementation Quality:
- The weakref pattern is correctly applied with appropriate property accessors
- Tests comprehensively cover the circular reference fix scenario from issue #1732
- Error handling via
ReferenceErroris appropriate for this edge case
Note for future consideration:
- The experimental bidi module (
src/strands/experimental/bidi/agent/loop.py) also storesself._agentwhich could create similar circular references - this may be worth addressing in a follow-up PR
Good fix for a tricky memory management issue! 🔧
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Python 3.14 introduces deferred reference counting where del may not immediately trigger collection on some platforms (observed on macOS). The tests were too strict by asserting refcount-only collection. Allow a single gc.collect() as a fallback, which still proves the important property: no persistent reference cycle exists. The Agent is collected promptly during normal execution, not deferred to interpreter shutdown where MCPClient.stop() would hang.
Add tests that exercise the ReferenceError path in _ToolCaller._agent and _PluginRegistry._agent properties when the Agent has been garbage collected. Addresses Codecov patch coverage gap.
…ed_up_on_del Same Python 3.14 deferred refcounting issue — del agent may not immediately trigger __del__ on all platforms.
| self._agent_ref = weakref.ref(agent) | ||
|
|
||
| @property | ||
| def _agent(self) -> "Agent | BidiAgent": |
There was a problem hiding this comment.
This seems like a fine restriction; this is convenience and if we need a work around, we can come up with one
|
I belive that you've seen an agent suggested : why we deceided to choose |
Description
When an Agent with an MCPClient tool provider was created inside a function, the Python script would hang indefinitely on exit.
Root Cause: Reference Cycles Preventing Timely Cleanup
CPython uses two garbage collection strategies:
The Agent class had two circular reference chains:
Because of these cycles, when an Agent went out of scope (e.g., a function returned), its refcount never reached zero. The object stayed alive, waiting for Python's cyclic GC to eventually detect and collect it.
Why This Caused a Hang
The Agent has a del finalizer (agent.py:647) that calls self.tool_registry.cleanup(). For MCPClient tool providers, cleanup flows through:
MCPClient.stop() signals a background thread to shut down its async event loop, then calls thread.join() to wait for it to finish.
The critical issue: during interpreter shutdown, the async event loop infrastructure is already being torn down. The background thread can't complete its async cleanup, so thread.join() blocks forever → hang.
This only happened when the Agent was created inside a function, because:
The Fix
Replace the strong back-references with weakref.ref in both _ToolCaller and _PluginRegistry:
The weak references don't contribute to the Agent's refcount, so the cycle is broken. Now when an Agent goes out of scope, refcounting can immediately destroy it, triggering del → MCPClient.stop() while the
runtime is still fully operational.
What could break
Stashing agent.tool and using it after the agent is gone:
Previously this would have worked because _ToolCaller held a strong reference keeping the Agent alive. Now it doesn't.
Related Issues
#1732
Documentation PR
N/A
Type of Change
Bug fix
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
hatch run prepareChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.