Implement embedder and reranker input length limits and sanitization #744
base: main
Conversation
force-pushed from 6476b2c to 5822ba4
force-pushed from eec97d4 to b73a668
force-pushed from 68fb4d9 to 5fbc372
Signed-off-by: Edwin Yu <[email protected]>
force-pushed from 20b8e5c to eb4a5cc
Pull request overview
This PR implements input length limits for embedders and rerankers to protect against excessively long inputs that could exceed model context limits or introduce security risks. When inputs exceed the configured maximum length (specified in Unicode code points), embedders partition text into balanced chunks and average the resulting embeddings, while cross-encoder rerankers use maximal chunks and return the maximum score across partitions.
Key changes include:
- Added `max_input_length` parameter to embedder and reranker configurations
- Implemented text chunking utilities (`chunk_text`, `chunk_text_balanced`, `unflatten_like`)
- Modified embedders to handle long inputs by chunking and averaging embeddings
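The balanced-chunking idea behind these changes can be sketched roughly as follows. The name `chunk_text_balanced` appears in the PR's utilities, but this body is an illustrative assumption about its behavior, not the actual implementation:

```python
import math


def chunk_text_balanced(text: str, max_length: int) -> list[str]:
    """Split text into roughly equal-sized chunks, each at most
    max_length Unicode code points long.

    Hypothetical sketch based on the utility's name; the PR's real
    implementation may differ (e.g. splitting on word boundaries).
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(text) <= max_length:
        return [text]
    # Choose the smallest chunk count that fits, then balance sizes.
    num_chunks = math.ceil(len(text) / max_length)
    chunk_size = math.ceil(len(text) / num_chunks)
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Balancing matters for embedders because averaging over very uneven chunks would weight short fragments the same as long ones.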
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/memmachine/conftest.py | Added Cohere test fixtures and max_input_length to OpenAI embedder |
| tests/memmachine/common/test_utils.py | New test file for text chunking utilities |
| tests/memmachine/common/resource_manager/test_embedder_manager.py | Fixed class name typo from Config to Conf |
| tests/memmachine/common/reranker/test_cross_encoder_reranker.py | Converted to integration tests with real model and added large input tests |
| tests/memmachine/common/reranker/test_cohere_reranker.py | New integration tests for Cohere reranker with large input tests |
| tests/memmachine/common/embedder/test_sentence_transformer_embedder.py | New integration tests with large input test cases |
| tests/memmachine/common/embedder/test_openai_embedder.py | New integration tests with large input test cases |
| tests/memmachine/common/embedder/test_amazon_bedrock_embedder.py | New integration tests with large input test cases |
| src/memmachine/common/utils.py | Implemented chunking and unflattening utilities |
| src/memmachine/common/resource_manager/reranker_manager.py | Added max_input_length parameter to cross-encoder initialization |
| src/memmachine/common/resource_manager/embedder_manager.py | Added max_input_length parameter to all embedder initializations |
| src/memmachine/common/reranker/cross_encoder_reranker.py | Implemented chunking and max-score logic for long inputs |
| src/memmachine/common/embedder/sentence_transformer_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/embedder/openai_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/embedder/amazon_bedrock_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/configuration/reranker_conf.py | Added max_input_length field to CrossEncoderRerankerConf |
| src/memmachine/common/configuration/embedder_conf.py | Added max_input_length field and renamed Config to Conf |
```diff
         )
-        return response.astype(float).tolist()
+        chunk_embeddings = response.astype(float).tolist()
```
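Once per-chunk embeddings come back as one flat list, a helper like `unflatten_like` (one of the utilities named in this PR) would regroup them per original input. This body is a hypothetical sketch based only on the name:

```python
def unflatten_like(flat: list, template: list[list]) -> list[list]:
    """Regroup a flat list into sublists shaped like template.

    Hypothetical sketch inferred from the utility's name; the PR's
    actual signature and behavior may differ.
    """
    result, idx = [], 0
    for group in template:
        result.append(flat[idx : idx + len(group)])
        idx += len(group)
    if idx != len(flat):
        raise ValueError("flat list length does not match template shape")
    return result
```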
Since this is common logic, we could create a function that does the chunking and calls the vendor-specific embed functions.
This logic is not entirely common. There may be embedder implementations that do their own chunking internally, like how the Cohere reranker does on AWS. This is also not applicable to other modalities.
Maybe a new API is needed for other modalities, so that's not too big a consideration.
force-pushed from 99b64e9 to da64438
Signed-off-by: Edwin Yu <[email protected]>
force-pushed from da64438 to 3111580
```diff
         response = (
             await self._client.embeddings.create(
-                input=inputs,
+                input=chunks,
```
Is there any limitation on the total input length or the input chunk number?
There is a limit of 2048 inputs per request, and a limit of 300,000 total tokens, for OpenAI.
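A simple way to respect a per-request input-count limit like this is to split the chunk list into batches before calling the API. This is a hypothetical sketch, not the PR's code; the 2048 figure comes from the comment above, and real code would also need to budget total tokens per batch:

```python
def batch_inputs(chunks: list[str], max_batch_size: int = 2048) -> list[list[str]]:
    """Group chunks into batches no larger than the API's per-request
    input limit, preserving order.

    Illustrative sketch; a production version would also cap the
    approximate token count of each batch.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [
        chunks[i : i + max_batch_size]
        for i in range(0, len(chunks), max_batch_size)
    ]
```

Each batch would then be sent as a separate `embeddings.create` call, and the results concatenated in order.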
I've added these limits to OpenAI. It will probably take some more time to add support for the other Embedder and Reranker implementations and it will blow up this PR even more. Thoughts?
Signed-off-by: Edwin Yu <[email protected]>
force-pushed from 557a355 to fc6d1b1
Purpose of the change
Embedders and rerankers have context length limits, which are often quite short. Embedding time can grow with input length, and malicious inputs may be arbitrarily long. Allow the configuration to specify a length limit.
Embedders and rerankers also do not handle empty, whitespace-only, or punctuation-only strings consistently.
Description
Maximum input lengths are specified in Unicode code points because that is the most natural unit of string length in Python, and because, where necessary, a conservative hard limit in code points can still be derived from a model's token limit.
For embedders:
Chunk into balanced partitions when the text is too long. Average the embeddings of the partitions.
For cross-encoder rerankers:
Chunk text into maximal partitions when the text is too long. Take the maximum score of the partitions.
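The two aggregation strategies above can be sketched as follows. These are illustrative stand-ins, not the PR's actual code:

```python
def average_embeddings(chunk_embeddings: list[list[float]]) -> list[float]:
    """Average per-chunk embeddings into a single vector: the embedder
    strategy described above. Illustrative sketch."""
    if not chunk_embeddings:
        raise ValueError("need at least one chunk embedding")
    dim = len(chunk_embeddings[0])
    n = len(chunk_embeddings)
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]


def max_chunk_score(chunk_scores: list[float]) -> float:
    """Take the maximum relevance score across chunks: the cross-encoder
    reranker strategy described above. Illustrative sketch."""
    if not chunk_scores:
        raise ValueError("need at least one chunk score")
    return max(chunk_scores)
```

Averaging keeps the embedding's dimensionality fixed regardless of chunk count, while max-score reranking rewards a document if any one of its chunks is highly relevant to the query.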
Fixes #823
Type of change
How Has This Been Tested?
Checklist
Maintainer Checklist