Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder) by mohit-twelvelabs · Pull Request #129 · pathwaycom/llm-app

mohit-twelvelabs · 2026-06-25T10:34:32Z

Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).

Process note: Per CONTRIBUTING.md and the PR template, this change ideally starts with an issue/Discord discussion and requires signing the CLA. I'm opening this PR as a concrete proposal to make the discussion easier — happy to file an issue, adjust scope, or iterate however the maintainers prefer, and I'll sign the CLA when prompted.

Introduction

This adds a new, fully opt-in application template: Video RAG with TwelveLabs (templates/video_rag_twelvelabs/). It lets a Pathway pipeline do RAG over video by bringing in two TwelveLabs models:

Pegasus (video understanding) — a Pathway parser (TwelveLabsVideoParser, a pw.UDF) that uploads each video as a TwelveLabs asset and turns it into a rich text description (what happens on screen, who/what appears, spoken and on-screen text, the overall topic). Pathway then indexes that text exactly like it indexes a PDF.
Marengo (multimodal embeddings, 512-dim) — a Pathway embedder (MarengoEmbedder, a BaseEmbedder subclass) used as the retriever embedder.
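For readers who want the shape of the parser step, here is a minimal sketch of an upload-then-analyze flow with failed-asset handling. It is not the code in this PR and it does not call the TwelveLabs SDK: `FakeVideoClient`, `upload_asset`, `analyze`, `AssetFailedError`, and `parse_video` are hypothetical names used only for illustration, and the `(text, metadata)` return shape is an assumption about what a document-style parser would hand to the indexer.

```python
from dataclasses import dataclass


class AssetFailedError(RuntimeError):
    """Raised when the (hypothetical) service reports a failed asset."""


@dataclass
class FakeVideoClient:
    """Stand-in for a video-understanding SDK; NOT the real TwelveLabs API."""

    fail_upload: bool = False

    def upload_asset(self, video_bytes: bytes) -> dict:
        # A real client would upload the bytes and poll until processing ends.
        status = "failed" if self.fail_upload else "ready"
        return {"id": "asset-1", "status": status, "size": len(video_bytes)}

    def analyze(self, asset_id: str, prompt: str) -> str:
        # A real client would return a model-generated description.
        return f"[{asset_id}] description for prompt: {prompt}"


def parse_video(client, video_bytes: bytes, prompt: str) -> list[tuple[str, dict]]:
    """Upload the video, analyze it, and return (text, metadata) chunks."""
    asset = client.upload_asset(video_bytes)
    if asset["status"] != "ready":
        # Surface failed assets instead of indexing an empty description.
        raise AssetFailedError(
            f"asset {asset['id']} ended in status {asset['status']!r}"
        )
    text = client.analyze(asset["id"], prompt)
    return [(text, {"asset_id": asset["id"]})]
```

Injecting the client as a parameter is what makes the flow testable without network access: a stub can be passed in place of a live SDK client.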

Both components live in a local pathway_twelvelabs package and are wired in entirely through app.yaml (mirroring the multimodal_rag and slides_ai_search templates), so models, prompts, the data source, and the LLM can all be swapped without touching Python.

Context

The existing templates handle documents (PDF/DOCX/slides) but not video. Video is hard to drop into RAG because most stacks only transcribe the audio and discard everything visual. Pegasus captures the whole video as text, and Marengo gives a shared multimodal embedding space. This extends Pathway's live-sync + in-memory-index story to a new modality with zero new infrastructure.

How has this been tested?

templates/video_rag_twelvelabs/test_twelvelabs.py: 4 no-network unit tests (stubbed SDK; run without credentials) cover the embedder vector shape, the Pegasus upload-then-analyze flow, failed-asset handling, and an embedding-dimension regression. 2 of these are dimension/default checks. A 5th test is a live smoke test that is skipped unless TWELVELABS_API_KEY is set.
Live, end to end against the real TwelveLabs API:
- Marengo text embedding returns a 512-dim vector, and MarengoEmbedder.get_embedding_dimension() correctly reports 512 (this required overriding the base probe, which assumes a single-vector return).
- Pegasus analyzed a short public sample video end to end (asset upload → analyze → text) in ~7s, returning a correct description.
Linters matching the repo's CI (black, isort --profile black, flake8) pass on all new files. The new module type-checks cleanly under mypy; the template dir is added to the existing [tool.mypy] exclude list, consistent with the other RAG templates.
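The dimension override mentioned above can be illustrated with a small sketch. It assumes a base class that probes the dimension by embedding one string and taking `len()` of the result, which matches the description "assumes a single-vector return" but is not copied from Pathway's `BaseEmbedder`. All class names here (`ProbingEmbedder`, `BatchReturningEmbedder`, `FixedEmbedder`) are hypothetical.

```python
class ProbingEmbedder:
    """Hypothetical base class: probes the dimension by embedding one string."""

    def embed(self, text: str):
        raise NotImplementedError

    def get_embedding_dimension(self) -> int:
        # Assumes embed() returns a single flat vector.
        return len(self.embed("."))


class BatchReturningEmbedder(ProbingEmbedder):
    """Returns a list of vectors (one per input), as a batch-style API might."""

    DIM = 512

    def embed(self, text: str):
        # One 512-dim vector wrapped in an outer list.
        return [[0.0] * self.DIM]


class FixedEmbedder(BatchReturningEmbedder):
    """Overrides the probe so it measures the vector, not the batch."""

    def get_embedding_dimension(self) -> int:
        # Unwrap the outer list before measuring the vector length.
        return len(self.embed(".")[0])
```

With the inherited probe, `BatchReturningEmbedder().get_embedding_dimension()` measures the batch length (1) rather than the vector length; the override in `FixedEmbedder` reports 512. A regression test on this value guards against the probe silently reverting.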

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature or improvement (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

This is purely additive: a new template directory plus one row in the main README table and one entry in the mypy exclude list. No existing template, default, or behavior is changed.

Related issue(s):

(none yet — happy to open one if preferred)

Checklist:

My code follows the code style of this project,
My change requires a change to the documentation (added a template README and a row in the main README),
I described the modification in the CHANGELOG.md file. (No CHANGELOG.md exists in this repo.)

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

CLAassistant · 2026-06-25T10:34:46Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder)

b0083d0

mohit-twelvelabs wants to merge 1 commit into pathwaycom:main from mohit-twelvelabs:feat/twelvelabs-integration
