Skip to content

fix: break circular references so Agent cleanup doesn't hang with MCPClient#1830

Open
dbschmigelski wants to merge 5 commits intostrands-agents:mainfrom
dbschmigelski:weakref
Open

fix: break circular references so Agent cleanup doesn't hang with MCPClient#1830
dbschmigelski wants to merge 5 commits intostrands-agents:mainfrom
dbschmigelski:weakref

Conversation

@dbschmigelski
Copy link
Member

@dbschmigelski dbschmigelski commented Mar 6, 2026

Description

When an Agent with an MCPClient tool provider was created inside a function, the Python script would hang indefinitely on exit.

Root Cause: Reference Cycles Preventing Timely Cleanup

CPython uses two garbage collection strategies:

  1. Reference counting: immediate, deterministic. Objects are destroyed the instant their refcount hits zero.
  2. Cyclic garbage collector: periodic, non-deterministic. Catches objects involved in reference cycles that refcounting can't handle.

The Agent class had two circular reference chains:

  Agent → Agent.tool (_ToolCaller) → _ToolCaller._agent → Agent
  Agent → Agent._plugin_registry (_PluginRegistry) → _PluginRegistry._agent → Agent

Because of these cycles, when an Agent went out of scope (e.g., a function returned), its refcount never reached zero. The object stayed alive, waiting for Python's cyclic GC to eventually detect and collect it.

Why This Caused a Hang

The Agent has a del finalizer (agent.py:647) that calls self.tool_registry.cleanup(). For MCPClient tool providers, cleanup flows through:

  Agent.__del__() → tool_registry.cleanup() → MCPClient.remove_consumer() → MCPClient.stop()

MCPClient.stop() signals a background thread to shut down its async event loop, then calls thread.join() to wait for it to finish.

The critical issue: during interpreter shutdown, the async event loop infrastructure is already being torn down. The background thread can't complete its async cleanup, so thread.join() blocks forever → hang.

This only happened when the Agent was created inside a function, because:

  • Function-scoped Agent: Goes out of scope when the function returns, but the reference cycle keeps it alive. The cyclic GC only collects it during interpreter shutdown → hang.
  • Module-scoped Agent: Module-level variables are explicitly cleared early in shutdown (while the runtime is still functional), so cleanup succeeds normally.

The Fix

Replace the strong back-references with weakref.ref in both _ToolCaller and _PluginRegistry:

  # Before (strong reference → creates cycle)
  self._agent = agent

  # After (weak reference → no cycle)
  self._agent_ref = weakref.ref(agent)

  @property
  def _agent(self):
      agent = self._agent_ref()
      if agent is None:
          raise ReferenceError("Agent has been garbage collected")
      return agent

The weak references don't contribute to the Agent's refcount, so the cycle is broken. Now when an Agent goes out of scope, refcounting can immediately destroy it, triggering del → MCPClient.stop() while the
runtime is still fully operational.


What could break

Stashing agent.tool and using it after the agent is gone:

  caller = agent.tool   # holds a reference to _ToolCaller
  del agent             # agent is now collected (refcount hits 0)
  caller.my_tool()      # ReferenceError — weakref is dead

Previously this would have worked because _ToolCaller held a strong reference keeping the Agent alive. Now it doesn't.

Related Issues

#1732

Documentation PR

N/A

Type of Change

Bug fix

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…Client

Agent had circular references (Agent → _ToolCaller._agent → Agent and
Agent → _PluginRegistry._agent → Agent) that prevented refcount-based
destruction. When an Agent with a managed MCPClient was created inside a
function, cleanup was deferred to interpreter shutdown where
MCPClient.stop() → thread.join() would hang.

Use weakref.ref for the back-references so the Agent is collected
immediately when it goes out of scope.

Closes strands-agents#1732
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Issue: The PR description is missing a "Documentation PR" section.

Suggestion: Since this is an internal bug fix that doesn't change public API, please add a "Documentation PR" section with a justification explaining why documentation updates are not required. For example:

## Documentation PR
Not required - internal bug fix for circular reference cleanup that doesn't affect public API.

"""Return the agent, raising ReferenceError if it has been garbage collected."""
agent = self._agent_ref()
if agent is None:
raise ReferenceError("Agent has been garbage collected")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The error message is clear and appropriate. Consider whether this ReferenceError could propagate to user code unexpectedly - though since these are internal classes with private naming (_PluginRegistry), this should be fine in practice.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description now thoroughly documents this edge case under "What could break" - the behavior change is clearly communicated. Resolving this thread. ✅

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: Request Changes

This PR correctly addresses issue #1732 by using weakref.ref to break circular references in _ToolCaller and _PluginRegistry. The implementation is clean and the tests effectively validate the fix.

Review Details

Blocking Issue:

  • Documentation PR section missing: Please add a "Documentation PR" section to the PR description with either a link to documentation changes or a justification (e.g., "Not required - internal bug fix that doesn't affect public API")

Implementation Quality:

  • The weakref pattern is correctly applied with appropriate property accessors
  • Tests comprehensively cover the circular reference fix scenario from issue #1732
  • Error handling via ReferenceError is appropriate for this edge case

Note for future consideration:

  • The experimental bidi module (src/strands/experimental/bidi/agent/loop.py) also stores self._agent which could create similar circular references - this may be worth addressing in a follow-up PR

Good fix for a tricky memory management issue! 🔧

@codecov
Copy link

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Python 3.14 introduces deferred reference counting where del may not
immediately trigger collection on some platforms (observed on macOS).
The tests were too strict by asserting refcount-only collection.

Allow a single gc.collect() as a fallback, which still proves the
important property: no persistent reference cycle exists. The Agent is
collected promptly during normal execution, not deferred to interpreter
shutdown where MCPClient.stop() would hang.
Add tests that exercise the ReferenceError path in _ToolCaller._agent
and _PluginRegistry._agent properties when the Agent has been garbage
collected. Addresses Codecov patch coverage gap.
…ed_up_on_del

Same Python 3.14 deferred refcounting issue — del agent may not
immediately trigger __del__ on all platforms.
self._agent_ref = weakref.ref(agent)

@property
def _agent(self) -> "Agent | BidiAgent":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a fine restriction; this is convenience and if we need a work around, we can come up with one

@JackYPCOnline
Copy link
Contributor

JackYPCOnline commented Mar 9, 2026

I belive that you've seen an agent suggested :

class Agent:
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.tool_registry.cleanup()  # deterministic, explicit

    def close(self):
        self.tool_registry.cleanup()

why we deceided to choose weekrefthan other? Is it due to backward-compatibility?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants