Implement embedder and reranker input length limits and sanitization #744
base: main
Conversation
force-pushed from 6476b2c to 5822ba4
force-pushed from eec97d4 to b73a668
force-pushed from 68fb4d9 to 5fbc372
Signed-off-by: Edwin Yu <[email protected]>
force-pushed from 20b8e5c to eb4a5cc
Pull request overview
This PR implements input length limits for embedders and rerankers to protect against excessively long inputs that could exceed model context limits or introduce security risks. When inputs exceed the configured maximum length (specified in Unicode code points), embedders partition text into balanced chunks and average the resulting embeddings, while cross-encoder rerankers use maximal chunks and return the maximum score across partitions.
Key changes include:
- Added `max_input_length` parameter to embedder and reranker configurations
- Implemented text chunking utilities (`chunk_text`, `chunk_text_balanced`, `unflatten_like`)
- Modified embedders to handle long inputs by chunking and averaging embeddings
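The balanced-chunking idea behind these changes can be sketched roughly as follows. The name `chunk_text_balanced` appears in the PR's utilities, but this body is an illustrative assumption about its behavior, not the actual implementation:

```python
import math


def chunk_text_balanced(text: str, max_length: int) -> list[str]:
    """Split text into roughly equal-sized chunks, each at most
    max_length Unicode code points long.

    Hypothetical sketch based on the utility's name; the PR's real
    implementation may differ (e.g. splitting on word boundaries).
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(text) <= max_length:
        return [text]
    # Choose the smallest chunk count that fits, then balance sizes.
    num_chunks = math.ceil(len(text) / max_length)
    chunk_size = math.ceil(len(text) / num_chunks)
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Balancing matters for embedders because averaging over very uneven chunks would weight short fragments the same as long ones.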
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/memmachine/conftest.py | Added Cohere test fixtures and max_input_length to OpenAI embedder |
| tests/memmachine/common/test_utils.py | New test file for text chunking utilities |
| tests/memmachine/common/resource_manager/test_embedder_manager.py | Fixed class name typo from Config to Conf |
| tests/memmachine/common/reranker/test_cross_encoder_reranker.py | Converted to integration tests with real model and added large input tests |
| tests/memmachine/common/reranker/test_cohere_reranker.py | New integration tests for Cohere reranker with large input tests |
| tests/memmachine/common/embedder/test_sentence_transformer_embedder.py | New integration tests with large input test cases |
| tests/memmachine/common/embedder/test_openai_embedder.py | New integration tests with large input test cases |
| tests/memmachine/common/embedder/test_amazon_bedrock_embedder.py | New integration tests with large input test cases |
| src/memmachine/common/utils.py | Implemented chunking and unflattening utilities |
| src/memmachine/common/resource_manager/reranker_manager.py | Added max_input_length parameter to cross-encoder initialization |
| src/memmachine/common/resource_manager/embedder_manager.py | Added max_input_length parameter to all embedder initializations |
| src/memmachine/common/reranker/cross_encoder_reranker.py | Implemented chunking and max-score logic for long inputs |
| src/memmachine/common/embedder/sentence_transformer_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/embedder/openai_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/embedder/amazon_bedrock_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/configuration/reranker_conf.py | Added max_input_length field to CrossEncoderRerankerConf |
| src/memmachine/common/configuration/embedder_conf.py | Added max_input_length field and renamed Config to Conf |
```diff
         )
-        return response.astype(float).tolist()
+        chunk_embeddings = response.astype(float).tolist()
```
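Once per-chunk embeddings come back as one flat list, a helper like `unflatten_like` (one of the utilities named in this PR) would regroup them per original input. This body is a hypothetical sketch based only on the name:

```python
def unflatten_like(flat: list, template: list[list]) -> list[list]:
    """Regroup a flat list into sublists shaped like template.

    Hypothetical sketch inferred from the utility's name; the PR's
    actual signature and behavior may differ.
    """
    result, idx = [], 0
    for group in template:
        result.append(flat[idx : idx + len(group)])
        idx += len(group)
    if idx != len(flat):
        raise ValueError("flat list length does not match template shape")
    return result
```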
Since this is common logic, we could create a function that does the chunking and calls the vendor-specific embed functions.
This logic is not entirely common. There may be embedder implementations that do their own chunking internally, like how the Cohere reranker does on AWS. This is also not applicable to other modalities.
Maybe a new API is needed for other modalities, so that's not too big a consideration.
force-pushed from 99b64e9 to da64438
Signed-off-by: Edwin Yu <[email protected]>
force-pushed from da64438 to 3111580
```diff
         response = (
             await self._client.embeddings.create(
-                input=inputs,
+                input=chunks,
```
Is there any limitation on the total input length or the input chunk number?
There is a limit of 2048 inputs per request, and a limit of 300,000 total tokens, for OpenAI.
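A simple way to respect a per-request input-count limit like this is to split the chunk list into batches before calling the API. This is a hypothetical sketch, not the PR's code; the 2048 figure comes from the comment above, and real code would also need to budget total tokens per batch:

```python
def batch_inputs(chunks: list[str], max_batch_size: int = 2048) -> list[list[str]]:
    """Group chunks into batches no larger than the API's per-request
    input limit, preserving order.

    Illustrative sketch; a production version would also cap the
    approximate token count of each batch.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [
        chunks[i : i + max_batch_size]
        for i in range(0, len(chunks), max_batch_size)
    ]
```

Each batch would then be sent as a separate `embeddings.create` call, and the results concatenated in order.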
I've added these limits to OpenAI. It will probably take some more time to add support for the other Embedder and Reranker implementations and it will blow up this PR even more. Thoughts?
Signed-off-by: Edwin Yu <[email protected]>
force-pushed from 557a355 to fc6d1b1
Purpose of the change
Embedders and rerankers have context length limits, which are often quite short. Embedding time can grow with input length, and malicious inputs may be arbitrarily long. Allow the configuration to specify a length limit.
Embedders and rerankers also do not handle empty, whitespace-only, or punctuation-only strings consistently.
Description
Maximum input lengths are specified in Unicode code points because that is the most natural unit of string length in Python, and because, where necessary, a conservative hard limit in code points can still be derived from a model's token limit.
For embedders:
Chunk into balanced partitions when the text is too long. Average the embeddings of the partitions.
For cross-encoder rerankers:
Chunk text into maximal partitions when the text is too long. Take the maximum score of the partitions.
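The two aggregation strategies above can be sketched as follows. These are illustrative stand-ins, not the PR's actual code:

```python
def average_embeddings(chunk_embeddings: list[list[float]]) -> list[float]:
    """Average per-chunk embeddings into a single vector: the embedder
    strategy described above. Illustrative sketch."""
    if not chunk_embeddings:
        raise ValueError("need at least one chunk embedding")
    dim = len(chunk_embeddings[0])
    n = len(chunk_embeddings)
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]


def max_chunk_score(chunk_scores: list[float]) -> float:
    """Take the maximum relevance score across chunks: the cross-encoder
    reranker strategy described above. Illustrative sketch."""
    if not chunk_scores:
        raise ValueError("need at least one chunk score")
    return max(chunk_scores)
```

Averaging keeps the embedding's dimensionality fixed regardless of chunk count, while max-score reranking rewards a document if any one of its chunks is highly relevant to the query.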
Fixes #823
Type of change
How Has This Been Tested?
Checklist
Maintainer Checklist