fix(browser): make get_text extraction method and get_html truncation configurable #400
… configurable get_text now defaults to inner_text (excludes script/style/hidden content) instead of text_content, with an opt-in method field to restore the old behavior. get_html removes the hard-coded 1000-char truncation in favor of an optional max_length field that defaults to no truncation. Co-Authored-By: Daniel Dias <DDias@euronext.com>
We have a browser automation agent that navigates pages and processes their content. We built an AfterToolCallEvent hook that intercepts get_html output, strips noise (<script>, <style>, hidden elements, etc.), and converts the HTML to clean markdown using markdownify. The problem is the hard 1,000-char truncation on get_html: by the time our hook receives the output, it is already an incomplete HTML fragment, which makes the entire get_html → markdown conversion pipeline useless. Removing the truncation would let downstream hooks and tools process full page HTML, which is a cleaner approach than relying on get_text with text_content(), whose output includes unrendered scripts and CSS noise.
This would be extremely valuable to fix. At a minimum, removing the get_html truncation would help, since it currently makes downstream HTML processing (e.g., HTML-to-markdown conversion via hooks) non-functional.
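To make the failure mode in the comment above concrete, here is a minimal sketch of a noise-stripping step and what happens when its input is cut at 1,000 characters. It is not the commenter's hook: it uses the standard library's html.parser as a stand-in for markdownify, does not handle hidden elements, and the names NOISE_TAGS, VisibleTextExtractor, and clean_html are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as noise (an assumption for this sketch).
NOISE_TAGS = {"script", "style", "noscript"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that sits outside NOISE_TAGS.

    A stdlib stand-in for a real HTML-to-markdown converter.
    """

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # > 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside noise elements.
        if self._noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(html: str) -> str:
    """Return the text of `html` with script/style content removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# A page whose <head> alone is longer than 1,000 characters.
full_html = (
    "<html><head><style>" + "a{color:red}" * 100 + "</style></head>"
    "<body><h1>Title</h1><p>Body text</p></body></html>"
)

# What a downstream hook would receive under the old hard-coded limit.
truncated_html = full_html[:1000]
```

With the full document, clean_html recovers the heading and paragraph text. With the truncated fragment the cut falls inside the <style> element, so the body never arrives and nothing useful can be recovered.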
Summary
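- Switch the get_text default from text_content() to inner_text(), which excludes script/style/hidden content. The old behavior is available via method: "text_content" on GetTextAction.
- Remove the hard-coded 1,000-char truncation from get_html, so output is no longer truncated by default. An optional max_length field on GetHtmlAction lets callers opt into truncation.

Motivation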
- text_content() includes <script>, <style>, and hidden element text, polluting LLM context when agents read pages. inner_text() is style-aware and produces much cleaner output.
- The hard-coded 1,000-char truncation made get_html unusable for reading full page content, and the LLM was never informed about it. Meanwhile get_text had no truncation at all, making the limit inconsistent.
- get_html returns proper HTML with tags (<script>, <style>, <nav>, etc.), which downstream processing (like markdown conversion) can strip effectively. Without full HTML, the only option is get_text, where JavaScript/CSS appears as untagged noise that cannot be reliably cleaned.
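The two new fields described in the Summary can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in this PR: plain dataclasses stand in for whatever model classes the project uses, the selector fields and the apply_max_length helper are hypothetical, and only method, max_length, and their defaults come from the description.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class GetTextAction:
    selector: str  # hypothetical field, for illustration only
    # New default is inner_text; "text_content" restores the old behavior.
    method: Literal["inner_text", "text_content"] = "inner_text"


@dataclass
class GetHtmlAction:
    selector: Optional[str] = None  # hypothetical field, for illustration only
    # None means "no truncation", which is the new default.
    max_length: Optional[int] = None


def apply_max_length(html: str, max_length: Optional[int]) -> str:
    """Truncate `html` only when the caller opted in via max_length.

    Hypothetical helper; it appends "..." when truncating, as the test plan
    describes.
    """
    if max_length is None or len(html) <= max_length:
        return html
    return html[:max_length] + "..."
```

Making truncation opt-in keeps get_html consistent with get_text, which never truncated, and leaves any size trade-off to the caller.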
References
- text_content() includes scripts/styles, inner_text() excludes them
- inner_text can still include script content; the Playwright team recommends stripping <script> elements as a workaround
- No length limit applies to content(), inner_html(), or text_content() themselves; the 1,000-char truncation was an arbitrary application-level constraint

Test plan
- get_text with default (no method field) uses inner_text and excludes scripts/styles
- get_text with method: "text_content" returns raw text including scripts
- get_html with no max_length returns full HTML without truncation
- get_html with max_length: 500 truncates and appends "..."
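The first two checks could be exercised without a browser along these lines. This is only a sketch: FakeLocator is a hypothetical test double whose canned strings imitate the difference between the two Playwright methods, and get_text here is a simplified stand-in for the tool's handler, not its real signature.

```python
class FakeLocator:
    """Test double exposing the two Playwright-style text methods."""

    def inner_text(self) -> str:
        # Rendered text only.
        return "Hello"

    def text_content(self) -> str:
        # Raw text, including script and style bodies.
        return "Hello console.log('x') .a{color:red}"


def get_text(locator, method: str = "inner_text") -> str:
    """Dispatch to the requested extraction method (simplified stand-in)."""
    if method not in ("inner_text", "text_content"):
        raise ValueError(f"unsupported method: {method!r}")
    return getattr(locator, method)()


def test_default_uses_inner_text():
    text = get_text(FakeLocator())
    assert text == "Hello"
    assert "console.log" not in text


def test_text_content_is_opt_in():
    text = get_text(FakeLocator(), method="text_content")
    assert "console.log" in text
    assert "color:red" in text
```

A test against a real page would still be needed to confirm what inner_text() excludes in practice, since the double only verifies the dispatch logic.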