@edwinyyyu edwinyyyu commented Dec 9, 2025

Purpose of the change

Embedders and rerankers have a context length limit, which is often short. Embedding time may grow with input length, and malicious inputs may be arbitrarily long. Allow the configuration to specify an input length limit.

Embedders and rerankers do not handle empty, whitespace-only, or punctuation-only strings consistently.

Description

Maximum input lengths are specified in Unicode code points because that is the most natural unit in Python, and because, where a hard limit is necessary, a safe bound is easy to compute:

  • A Unicode code point encodes to at most 4 bytes in any common encoding (UTF-8, UTF-16, UTF-32).
  • A token consumes at least 1 byte.
  • Thus, for a model with a hard context limit of 8192 tokens, the input length limit can safely be set to 8192 / 4 = 2048 Unicode code points.
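The bound above can be sketched as a one-line helper (hypothetical, not part of the PR):

```python
def safe_code_point_limit(max_tokens: int) -> int:
    # A code point encodes to at most 4 bytes; a token consumes at least
    # 1 byte, so max_tokens // 4 code points can never exceed max_tokens tokens.
    return max_tokens // 4
```

For a model with an 8192-token context, this yields a limit of 2048 code points.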

For embedders:
When the text is too long, chunk it into balanced partitions and average the embeddings of the partitions.
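A minimal sketch of this embedder-side strategy, assuming the PR's chunk_text_balanced splits text into the fewest near-equal chunks (the actual implementation in src/memmachine/common/utils.py may differ; average_embedding is a hypothetical helper):

```python
import math

def chunk_text_balanced(text: str, max_length: int) -> list[str]:
    # Split into the fewest chunks that fit, with sizes differing by at
    # most one code point, so no chunk is pathologically short.
    num_chunks = max(1, math.ceil(len(text) / max_length))
    base, extra = divmod(len(text), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(text[start:end])
        start = end
    return chunks

def average_embedding(chunk_embeddings: list[list[float]]) -> list[float]:
    # Element-wise mean of the per-chunk embedding vectors.
    n = len(chunk_embeddings)
    return [sum(dims) / n for dims in zip(*chunk_embeddings)]
```

For example, a 10-code-point text with a limit of 4 splits into chunks of sizes 4, 3, 3 rather than 4, 4, 2.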

For cross-encoder rerankers:
When the text is too long, chunk it into maximal partitions and take the maximum score across the partitions.
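A sketch of the reranker-side strategy, assuming the PR's chunk_text is a greedy maximal splitter (the real implementation may differ; max_chunk_score and score_fn are hypothetical names for the scoring path):

```python
def chunk_text(text: str, max_length: int) -> list[str]:
    # Greedy split into maximal chunks of at most max_length code points.
    if not text:
        return [text]
    return [text[i : i + max_length] for i in range(0, len(text), max_length)]

def max_chunk_score(query: str, text: str, score_fn, max_length: int) -> float:
    # Score each chunk against the query and keep the best match, so a
    # relevant passage buried in a long document is not diluted.
    return max(score_fn(query, chunk) for chunk in chunk_text(text, max_length))
```

Taking the maximum (rather than the mean) preserves a high score when only one partition matches the query.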

Fixes #823

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

@edwinyyyu edwinyyyu changed the title WIP: Implement embedder input length limits WIP: Implement embedder and reranker input length limits Dec 10, 2025
@edwinyyyu edwinyyyu force-pushed the length_limits branch 6 times, most recently from 68fb4d9 to 5fbc372 Compare December 12, 2025 17:40
@edwinyyyu edwinyyyu force-pushed the length_limits branch 2 times, most recently from 20b8e5c to eb4a5cc Compare December 12, 2025 17:56
@edwinyyyu edwinyyyu marked this pull request as ready for review December 12, 2025 17:56
@sscargal sscargal requested a review from Copilot December 12, 2025 20:59
Copilot AI left a comment
Pull request overview

This PR implements input length limits for embedders and rerankers to protect against excessively long inputs that could exceed model context limits or introduce security risks. When inputs exceed the configured maximum length (specified in Unicode code points), embedders partition text into balanced chunks and average the resulting embeddings, while cross-encoder rerankers use maximal chunks and return the maximum score across partitions.

Key changes include:

  • Added max_input_length parameter to embedder and reranker configurations
  • Implemented text chunking utilities (chunk_text, chunk_text_balanced, unflatten_like)
  • Modified embedders to handle long inputs by chunking and averaging embeddings
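The unflatten_like utility presumably regroups a flat batch of chunk embeddings back into one group per original input after a single batched embedding call; a sketch of that behavior (the real implementation may differ):

```python
def unflatten_like(flat: list, template: list[list]) -> list[list]:
    # Regroup a flat list into sublists matching the shape of template,
    # e.g. per-input chunk embeddings after one batched embed call.
    result, index = [], 0
    for group in template:
        result.append(flat[index : index + len(group)])
        index += len(group)
    return result
```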

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

  • tests/memmachine/conftest.py: Added Cohere test fixtures and max_input_length to the OpenAI embedder
  • tests/memmachine/common/test_utils.py: New test file for the text chunking utilities
  • tests/memmachine/common/resource_manager/test_embedder_manager.py: Fixed class name typo from Config to Conf
  • tests/memmachine/common/reranker/test_cross_encoder_reranker.py: Converted to integration tests with a real model and added large-input tests
  • tests/memmachine/common/reranker/test_cohere_reranker.py: New integration tests for the Cohere reranker with large-input tests
  • tests/memmachine/common/embedder/test_sentence_transformer_embedder.py: New integration tests with large-input test cases
  • tests/memmachine/common/embedder/test_openai_embedder.py: New integration tests with large-input test cases
  • tests/memmachine/common/embedder/test_amazon_bedrock_embedder.py: New integration tests with large-input test cases
  • src/memmachine/common/utils.py: Implemented the chunking and unflattening utilities
  • src/memmachine/common/resource_manager/reranker_manager.py: Added max_input_length parameter to cross-encoder initialization
  • src/memmachine/common/resource_manager/embedder_manager.py: Added max_input_length parameter to all embedder initializations
  • src/memmachine/common/reranker/cross_encoder_reranker.py: Implemented chunking and max-score logic for long inputs
  • src/memmachine/common/embedder/sentence_transformer_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/embedder/openai_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/embedder/amazon_bedrock_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/configuration/reranker_conf.py: Added max_input_length field to CrossEncoderRerankerConf
  • src/memmachine/common/configuration/embedder_conf.py: Added max_input_length field and renamed Config to Conf


)

- return response.astype(float).tolist()
+ chunk_embeddings = response.astype(float).tolist()
Contributor:

Since this is common logic, we could create a function that does the chunking and calls the vendor-specific embed functions.

Contributor Author:

This logic is not entirely common. Some embedder implementations may do their own chunking internally, as the Cohere reranker does on AWS. It is also not applicable to other modalities.

Contributor Author:

Maybe a new API is needed for other modalities, so that's not too big a consideration.

@edwinyyyu edwinyyyu requested a review from malatewang December 16, 2025 17:55
@edwinyyyu edwinyyyu changed the title WIP: Implement embedder and reranker input length limits Implement embedder and reranker input length limits Dec 16, 2025
@edwinyyyu edwinyyyu force-pushed the length_limits branch 2 times, most recently from 99b64e9 to da64438 Compare December 17, 2025 23:17
@edwinyyyu edwinyyyu changed the title Implement embedder and reranker input length limits Implement embedder and reranker input length limits and sanitization Dec 17, 2025
Signed-off-by: Edwin Yu <[email protected]>
  response = (
      await self._client.embeddings.create(
-         input=inputs,
+         input=chunks,
Contributor:

Is there any limit on the total input length or on the number of input chunks?

Contributor Author (Dec 18, 2025):

OpenAI enforces a limit of 2048 inputs per request and 300,000 total tokens.
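Those request caps can be respected by batching the flattened chunk list before calling the API; a sketch with a hypothetical batch_chunks helper (honoring the 300,000-token cap would additionally need a tokenizer-based estimate):

```python
def batch_chunks(chunks: list[str], max_inputs: int = 2048) -> list[list[str]]:
    # Split the flat chunk list into batches no larger than the
    # per-request input cap; each batch becomes one embeddings call.
    return [chunks[i : i + max_inputs] for i in range(0, len(chunks), max_inputs)]
```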

Contributor Author:

I've added these limits for OpenAI. Supporting them in the other Embedder and Reranker implementations will take more time and would blow up this PR even more. Thoughts?



Development

Successfully merging this pull request may close these issues.

[Bug]: send rerank request when there's no episodes in the project

2 participants