refactor: initialize translation pipeline from config #44

Open
ClemDoum wants to merge 1 commit into main from refactor(translation-worker)/from-config

@ClemDoum ClemDoum commented May 28, 2026

⚠️ breaking (translated content changed from map to list)

Description

Make the translation worker API more generic by allowing multiple translation pipeline implementations.
Initializes the translation pipeline from a config object and decouples the instantiation of pipeline components from the loading of language-specific resources:

json_config = '{"sentence_splitter": {"model": "ARGOS"}, "translator": {"model": "ARGOS"}}'
config = TranslationConfig.model_validate_json(json_config)
translator = config.to_translator()
sentence_splitter = config.to_sentence_splitter()

with translator.load(source="en", target="es"), sentence_splitter.load(language="en"):
    ...
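
For reviewers who want a mental model of the split between instantiation and resource loading, here is a minimal stdlib-only sketch. It is not the code in this PR: `TranslationConfigSketch`, `ComponentConfig`, `EchoTranslator` and the `TRANSLATORS` registry are hypothetical names, and `from_json` stands in for the Pydantic `model_validate_json` call used above.

```python
import json
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentConfig:
    model: str  # e.g. "ARGOS"


class EchoTranslator:
    """Hypothetical stand-in; a real translator would wrap a translation model."""

    def __init__(self):
        # Instantiation is cheap: no language resources are held yet.
        self.loaded_pair = None

    @contextmanager
    def load(self, *, source: str, target: str):
        # Language-specific resources are acquired here, not in __init__.
        self.loaded_pair = (source, target)
        try:
            yield self
        finally:
            # Resources are released when the with-block exits.
            self.loaded_pair = None


# Hypothetical registry mapping a config "model" name to an implementation.
TRANSLATORS = {"ARGOS": EchoTranslator}


@dataclass(frozen=True)
class TranslationConfigSketch:
    sentence_splitter: ComponentConfig
    translator: ComponentConfig

    @classmethod
    def from_json(cls, raw: str) -> "TranslationConfigSketch":
        data = json.loads(raw)
        return cls(
            sentence_splitter=ComponentConfig(**data["sentence_splitter"]),
            translator=ComponentConfig(**data["translator"]),
        )

    def to_translator(self):
        # Pick the implementation named in the config and instantiate it.
        return TRANSLATORS[self.translator.model]()
```

In this sketch, supporting another pipeline implementation would only require a new entry in the registry; the calling code that uses `load(...)` stays unchanged.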

Fixed translation format to be consistent with ES translator translations.

Changes

datashare-python

Added

  • added DatashareLanguage to reflect DS language formatting and validation (uppercase language names)
  • added IETFLanguage to support locales
  • defined Language = DatashareLanguage | IETFLanguage
  • added Translation to reflect the translation format in DS

Fixed

  • changed Document.content_translated from a dict[str, str] to a list[Translation]
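
To illustrate what this breaking change could mean for callers, here is a rough sketch of converting the old map shape into the new list shape. The field names on `TranslationSketch` (`source_language`, `target_language`, `translator`, `content`) and the assumption that the old dict was keyed by target language are guesses for illustration only; the actual `Translation` definition in this PR may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationSketch:
    # Hypothetical fields; check the real Translation class for the actual schema.
    source_language: str
    target_language: str
    translator: str
    content: str


def from_legacy_map(
    legacy: dict[str, str], *, source_language: str, translator: str
) -> list[TranslationSketch]:
    """Convert an old-style {target_language: content} map into a list of entries.

    Assumes the legacy dict was keyed by target language, which is not
    confirmed by the PR description.
    """
    return [
        TranslationSketch(
            source_language=source_language,
            target_language=target_language,
            translator=translator,
            content=content,
        )
        for target_language, content in legacy.items()
    ]
```

A list of records leaves room for several translations into the same target language (for example from different translators), which a plain `dict[str, str]` cannot represent.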

translation-worker

Added

  • added the Translator and SentenceSplitter abstractions and made the argos components inherit from them
  • updated the implementation to allow initializing a translation pipeline from a TranslationConfig
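
As a rough picture of the abstraction layer described above, the sketch below defines two abstract base classes and toy implementations. The names `SentenceSplitterBase`, `TranslatorBase`, `RegexSplitter`, `UppercaseTranslator` and the method names `split` and `translate` are hypothetical; the real `Translator` and `SentenceSplitter` interfaces in this PR may expose different methods.

```python
import re
from abc import ABC, abstractmethod


class SentenceSplitterBase(ABC):
    @abstractmethod
    def split(self, text: str) -> list[str]:
        """Split text into sentences."""


class TranslatorBase(ABC):
    @abstractmethod
    def translate(self, sentences: list[str]) -> list[str]:
        """Translate a batch of sentences, preserving order."""


class RegexSplitter(SentenceSplitterBase):
    def split(self, text: str) -> list[str]:
        # Naive rule: break after '.', '!' or '?' followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class UppercaseTranslator(TranslatorBase):
    def translate(self, sentences: list[str]) -> list[str]:
        # Toy "translation" so the example runs without any model.
        return [s.upper() for s in sentences]


def translate_text(
    text: str, splitter: SentenceSplitterBase, translator: TranslatorBase
) -> str:
    # The caller depends only on the abstract interfaces, never on a backend.
    return " ".join(translator.translate(splitter.split(text)))
```

Because `translate_text` only sees the base classes, swapping in another backend would not require touching the worker logic.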

Changed

  • refactored batching to allow multiple workers to process batches from the same source language. Running multiple batch translations inside the same worker is no longer allowed. For parallel CPU processing, we rely solely on CUDA batch processing and horizontal scaling
  • refactored translation in a publisher/consumer fashion, where the publisher translates batches and populates an asyncio Queue when a translation buffer is full. The consumer consumes the queue and concurrently writes translations to ES
  • improved logging
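
The publisher/consumer flow described above can be sketched with `asyncio.Queue` roughly as follows. This is an illustration under assumptions, not the worker's code: `publish`, `consume`, `run_pipeline`, the `None` sentinel, the buffer size and the in-memory `write` function (standing in for the ES write) are all hypothetical.

```python
import asyncio


async def publish(batches, queue, buffer_size, translate):
    """Translate batches and enqueue the buffer each time it fills up."""
    buffer = []
    for batch in batches:
        # In a real worker this call would be the expensive model inference.
        buffer.extend(translate(batch))
        if len(buffer) >= buffer_size:
            await queue.put(buffer)
            buffer = []
    if buffer:
        await queue.put(buffer)  # flush whatever is left
    await queue.put(None)  # sentinel: nothing more to consume


async def consume(queue, write):
    """Drain the queue and start writes concurrently."""
    pending = set()
    while (chunk := await queue.get()) is not None:
        pending.add(asyncio.create_task(write(chunk)))
    if pending:
        await asyncio.gather(*pending)


async def run_pipeline(batches, *, buffer_size=4):
    written = []

    async def write(chunk):
        await asyncio.sleep(0)  # stands in for an async write to the index
        written.extend(chunk)

    # A bounded queue applies backpressure if writes fall behind translation.
    queue = asyncio.Queue(maxsize=2)
    await asyncio.gather(
        publish(batches, queue, buffer_size, lambda b: [s.upper() for s in b]),
        consume(queue, write),
    )
    return written
```

Calling `asyncio.run(run_pipeline([["a", "b"], ["c"], ["d", "e"]], buffer_size=3))` pushes one full buffer and one final partial buffer through the queue.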

@ClemDoum ClemDoum force-pushed the refactor(translation-worker)/from-config branch 7 times, most recently from 078f655 to 0396302 Compare June 1, 2026 10:24
@ClemDoum ClemDoum force-pushed the refactor(translation-worker)/from-config branch from 0396302 to 5f441e0 Compare June 1, 2026 10:28
@ClemDoum ClemDoum marked this pull request as ready for review June 1, 2026 11:01

ClemDoum commented Jun 1, 2026

Addresses: #26
