fix(browser): make get_text extraction method and get_html truncation configurable #400

Open
danielferreira-dias wants to merge 1 commit into strands-agents:main from danielferreira-dias:fix/browser-get-text-and-get-html
@danielferreira-dias

Summary

  • get_text: Default text extraction changed from text_content() to inner_text(),
    which excludes script/style/hidden content. The old behavior is available via
    method: "text_content" on GetTextAction.
  • get_html: Removed the hard-coded 1,000-character truncation. Full HTML is now returned
    by default. An optional max_length field on GetHtmlAction lets callers opt into truncation.
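
As a rough illustration of the two options above, here is a minimal sketch. The dataclasses are simplified stand-ins for the real action models, the `selector` field and the helper names `extract_text` and `truncate_html` are hypothetical, and `StubPage` only mimics Playwright's `inner_text()` / `text_content()` page methods so the example runs without a browser. The real tool may be asynchronous and structured differently.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class GetTextAction:
    """Simplified stand-in for the real action model (assumption)."""
    selector: str
    # "inner_text" is the new default; "text_content" restores the old behavior.
    method: Literal["inner_text", "text_content"] = "inner_text"


@dataclass
class GetHtmlAction:
    """Simplified stand-in for the real action model (assumption)."""
    selector: Optional[str] = None
    # None means "return the full HTML", which is the new default.
    max_length: Optional[int] = None


def extract_text(page, action: GetTextAction) -> str:
    """Dispatch to the extraction method requested by the action."""
    if action.method == "text_content":
        return page.text_content(action.selector)
    return page.inner_text(action.selector)


def truncate_html(html: str, max_length: Optional[int]) -> str:
    """Truncate only when the caller opted in and the HTML is too long."""
    if max_length is not None and len(html) > max_length:
        return html[:max_length] + "..."
    return html


class StubPage:
    """Hypothetical stand-in for a Playwright page, for illustration only."""

    def inner_text(self, selector: str) -> str:
        return "Hello"

    def text_content(self, selector: str) -> str:
        return "var x = 1; Hello"


if __name__ == "__main__":
    page = StubPage()
    print(extract_text(page, GetTextAction(selector="body")))
    print(extract_text(page, GetTextAction(selector="body", method="text_content")))
    print(len(truncate_html("a" * 600, GetHtmlAction(max_length=500).max_length)))
```

Callers that never set `method` or `max_length` get rendered text and untruncated HTML; both older behaviors remain reachable through explicit fields.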

Motivation

  • text_content() includes <script>, <style>, and hidden element text, polluting LLM
    context when agents read pages. inner_text() is style-aware and produces much cleaner output.
  • The 1,000-char HTML truncation made get_html unusable for reading full page content,
    and the LLM was never informed about it. Meanwhile get_text had no truncation at all,
    making the limit inconsistent.
  • get_html returns proper HTML with tags (<script>, <style>, <nav>, etc.), which
    downstream processing (like markdown conversion) can strip effectively. Without full HTML,
    the only option is get_text where JavaScript/CSS appears as untagged noise that cannot
    be reliably cleaned.
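
To make the last point concrete, the sketch below shows why tagged HTML is easy to clean: once `<script>` and `<style>` elements are identifiable by tag, a small standard-library parser can drop them. The class and function names (`NoiseStripper`, `visible_text`) are hypothetical and not part of this PR; a real pipeline would more likely use a dedicated HTML or markdown conversion library.

```python
from html.parser import HTMLParser


class NoiseStripper(HTMLParser):
    """Collect text nodes, skipping anything nested inside noisy tags."""

    def __init__(self, skip_tags=("script", "style")):
        super().__init__(convert_charrefs=True)
        self._skip_tags = set(skip_tags)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def visible_text(html, skip_tags=("script", "style")):
    """Return the text of html with the given elements removed."""
    parser = NoiseStripper(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><nav>Home</nav><p>Hello</p></body></html>"
    )
    print(visible_text(sample))
    print(visible_text(sample, skip_tags=("script", "style", "nav")))
```

This only works when the markup is complete, which is why the truncated output defeats it: a fragment cut off inside a `<script>` element has no closing tag to mark where the noise ends.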

References

  • Playwright: innerText vs textContent (text_content() includes scripts/styles; inner_text() excludes them)
  • Playwright Issue #18894 — edge case where inner_text can still include script content; Playwright team recommends stripping <script> elements as workaround
  • Playwright itself imposes no character limit on content(), inner_html(), or text_content() — the 1,000-char truncation was an arbitrary application-level constraint

Test plan

  • Verify get_text with default (no method field) uses inner_text and excludes scripts/styles
  • Verify get_text with method: "text_content" returns raw text including scripts
  • Verify get_html with no max_length returns full HTML without truncation
  • Verify get_html with max_length: 500 truncates and appends "..."
  • Existing browser tests still pass

… configurable

get_text now defaults to inner_text (excludes script/style/hidden content)
instead of text_content, with an opt-in method field to restore the old
behavior. get_html removes the hard-coded 1000-char truncation in favor
of an optional max_length field that defaults to no truncation.

Co-Authored-By: Daniel Dias <DDias@euronext.com>
@danielferreira-dias
Author

We have a browser automation agent that navigates pages and processes their content. We built an AfterToolCallEvent hook that intercepts get_html output, strips noise (<script>, <style>, hidden elements, etc.) and converts the HTML to clean markdown using markdownify.

The problem is the 1,000-char hard truncation on get_html — by the time our hook receives the output, it's already an incomplete HTML fragment. This makes the entire get_html → markdown conversion pipeline useless.

Removing the truncation would let downstream hooks and tools process the full page HTML. That is a cleaner approach than relying on get_text with text_content(), whose output includes unrendered scripts and CSS noise.

@danielferreira-dias
Author

This would be extremely valuable to fix. At a minimum, removing the get_html truncation would help, since it currently makes downstream HTML processing (e.g., HTML-to-markdown conversion via hooks) non-functional.
