📰 https://news.opensuse.org/2025/10/08/gsoc-semantic-video-search
An end-to-end video ingestion and semantic search system. The pipeline extracts transcript, speaker, visual, action, and audio-event metadata from videos, enriches segments with an LLM, indexes them in ChromaDB, and serves hybrid text/visual search through FastAPI and Streamlit.
- FastAPI search API backed by ChromaDB vector search.
- Streamlit search UI for selecting videos and jumping to result timestamps.
- Batch ingestion pipeline for extraction, segmentation, enrichment, and indexing.
- RabbitMQ ingestion queue with publisher and worker entrypoints.
- Docker Compose stack for local API, UI, ChromaDB, RabbitMQ, and optional worker.
- Kubernetes manifests for production-style deployment.
- Lightweight tests, Compose validation, and CI.
```
data/videos/*.mp4
  -> ingestion_pipeline.run_pipeline
  -> extracted transcript, shots, visual captions, actions, audio events
  -> manual speaker map
  -> enriched segments
  -> ChromaDB collection
  -> FastAPI search API
  -> Streamlit search UI
```
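The end of this flow is a set of enriched segments indexed into ChromaDB. The sketch below shows what one indexed record might look like; the field names are hypothetical, modeled on the metadata the pipeline extracts, and the real schema may differ:

```python
from dataclasses import dataclass, field

# Hypothetical shape of one enriched segment. Field names mirror the
# metadata kinds described above (transcript, speakers, actions, audio
# events) but are illustrative, not the project's real schema.
@dataclass
class Segment:
    video: str
    start_sec: float
    end_sec: float
    transcript: str
    speakers: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    audio_events: list = field(default_factory=list)

def to_chroma_record(seg: Segment) -> tuple[str, dict]:
    """Flatten a segment into a (document, metadata) pair the way an
    indexer might: free text for embedding, pipe-delimited token fields
    plus a numeric duration for server-side filtering."""
    meta = {
        "video_filename": seg.video,
        "duration_sec": seg.end_sec - seg.start_sec,
        "speakers_tokens": "|".join(seg.speakers),
        "actions_tokens": "|".join(seg.actions),
        "audio_events_tokens": "|".join(seg.audio_events),
    }
    return seg.transcript, meta
```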
RabbitMQ can decouple ingestion from callers:
```
publisher CLI -> RabbitMQ queue -> ingestion worker -> pipeline -> ChromaDB
```
- Python 3.12 recommended.
- Docker and Docker Compose.
- FFmpeg for local ingestion outside Docker.
- Enough disk for model caches and processed media.
- `HF_TOKEN` for WhisperX speaker diarization.
- `GEMINI_API_KEY` when `LLM_PROVIDER=gemini`.
- Optional `TMDB_API_KEY` for movie metadata lookup.
```
cp .env.example .env
make install-dev
make validate
```
Edit `.env` and set real secrets. Tracked config files are safe defaults only; credentials must come from environment variables.
For local ingestion outside Docker, install the heavier runtime dependencies into the same virtualenv:
```
.venv/bin/python -m pip install -r requirements-ingestion.txt -r requirements-ui.txt
```
Start ChromaDB, RabbitMQ, the API, and the search UI:
```
make compose-up
```
That alone gets you a working search demo: the API auto-indexes a tiny bundled Sintel preview into Chroma on first boot via `api/demo_bootstrap.py`, so opening http://localhost:8501/ immediately shows the multi-page Streamlit UI with searchable results. No HF token, Gemini key, or worker is needed for that path. Set `DEMO_BOOTSTRAP=0` in `.env` to disable it.
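The `DEMO_BOOTSTRAP` toggle reads as a simple guard: any value other than `0` (including unset) keeps the demo indexing on. This sketch is illustrative, not the actual `api/demo_bootstrap.py` logic:

```python
import os

def demo_bootstrap_enabled(environ=os.environ) -> bool:
    """Return True unless DEMO_BOOTSTRAP is explicitly set to "0".
    Hypothetical helper mirroring the toggle described above."""
    return environ.get("DEMO_BOOTSTRAP", "1") != "0"
```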
Default URLs:
- Demo UI: http://localhost:8501 — multi-page Streamlit (Home / Submit / Pipeline / Search). The Pipeline page animates per-step progress live as the worker advances. For the Submit page to actually queue jobs, also run `make compose-worker`.
- Speaker identification UI: http://localhost:5050 when started with `make compose-speaker`.
- Search API: http://localhost:1234
- API liveness: http://localhost:1234/healthz
- API readiness: http://localhost:1234/readyz
- RabbitMQ management: http://localhost:15672 with credentials from `.env`, or `video_se`/`video_se_dev` if unset.
- ChromaDB: http://localhost:8000
If you copied .env.example, RabbitMQ uses the local development credentials from that file. The container worker connects through the Compose service name, while host-side publisher commands use localhost.
Stop services:
```
make compose-down
```
Put videos under `data/videos/`.
Run ingestion synchronously from the local virtualenv:
```
.venv/bin/python -m ingestion_pipeline.run_pipeline --video data/videos/your_video.mp4
```
Optional metadata:
```
.venv/bin/python -m ingestion_pipeline.run_pipeline \
    --video data/videos/your_video.mp4 \
    --title "Movie Title" \
    --year 2024
```
By default `SPEAKER_UI_MODE=external`, so the pipeline waits for the configured `speaker_map.json` after extraction. Set `SPEAKER_MAP_TIMEOUT_SECONDS` to cap that wait for workers, or leave it empty for no timeout. Run the speaker identification UI in another terminal when needed:
```
.venv/bin/streamlit run app/ui/speaker_id_tool.py
```
Or run the containerized speaker UI against the Compose data mounts:
```
make compose-speaker
```
Start the worker profile:
```
make compose-worker
```
Publish a job. When targeting the container worker, use the path as seen inside the worker container. The host-side publisher reads `RABBITMQ_URL` from `.env`, or you can pass `--rabbitmq-url` explicitly:
```
make publish-ingest VIDEO=/data/videos/your_video.mp4
```
Equivalent direct command:
```
.venv/bin/python -m ingestion_pipeline.publisher --video /data/videos/your_video.mp4
```
The worker requires `RABBITMQ_URL`, consumes `INGESTION_QUEUE`, runs the same pipeline, acknowledges successful jobs, and rejects failed jobs without requeueing. Compose wires this URL automatically for the container worker.
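The ack/reject behavior maps naturally onto a pika-style delivery callback. `handle_delivery` and the injected `run_pipeline` callable below are hypothetical names for illustration, not the project's worker code:

```python
import json

def handle_delivery(channel, method, body, run_pipeline):
    """Sketch of a worker delivery callback: parse the job, run the
    pipeline, ack on success, reject without requeue on failure."""
    try:
        job = json.loads(body)
        run_pipeline(job["video"])
    except Exception:
        # Failed jobs are rejected and NOT requeued, matching the
        # behavior described above.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
    else:
        channel.basic_ack(delivery_tag=method.delivery_tag)
```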
Config is loaded from CONFIG_PATH or config.yaml, then environment variables override runtime values.
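The precedence rule (file values first, environment overrides second) can be sketched as follows; `load_config` and the uppercase-name mapping are assumptions for illustration, not the project's actual loader:

```python
import os

def load_config(file_values: dict, environ=os.environ) -> dict:
    """Start from the YAML file's values, then let a matching environment
    variable (key name uppercased) override each one."""
    merged = dict(file_values)
    for key in file_values:
        env_val = environ.get(key.upper())
        if env_val is not None:
            merged[key] = env_val
    return merged
```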
Important variables:
- `HF_TOKEN`, `GEMINI_API_KEY`, `TMDB_API_KEY`
- `ML_DEVICE`, `OUTPUT_DIR`, `VIDEO_DATA_PATH`, `MODEL_CACHE_DIR`
- `SPEAKER_UI_MODE`, `SPEAKER_MAP_TIMEOUT_SECONDS`
- `API_HOST`, `API_PORT`, `SEARCH_API_TIMEOUT_SECONDS`, `UI_HOST`, `UI_PORT`
- `CHROMA_HOST`, `CHROMA_PORT`, `CHROMA_COLLECTION`, `CHROMA_IMAGE_TAG`
- `RABBITMQ_URL`, `INGESTION_QUEUE`
- `RABBITMQ_IMAGE_TAG`, `RABBITMQ_DEFAULT_USER`, `RABBITMQ_DEFAULT_PASS` for local Compose or the bundled Kubernetes RabbitMQ
- `LLM_PROVIDER`, `GEMINI_MODEL`, `OLLAMA_HOST`, `OLLAMA_PORT`, `OLLAMA_MODEL`
- `LOG_LEVEL` for runtime log verbosity (defaults to INFO)
Use config.example.yaml and .env.example as references. Do not put secrets in tracked YAML.
POST http://localhost:1234/search with a JSON body:
| Field | Required | Description |
|---|---|---|
| `query` | yes | Free-text query, ≤ 1000 chars. |
| `top_k` | no (default 5) | 1–50. |
| `video_filename` | no | Restrict to one video's segments by filename stem. |
| `min_duration_sec` | no | Drop segments shorter than this. |
| `max_duration_sec` | no | Drop segments longer than this. |
Example:
```
curl -s -X POST http://localhost:1234/search \
  -H 'Content-Type: application/json' \
  -d '{"query":"explosion fire","top_k":3,"min_duration_sec":5}'
```
Each indexed segment also carries pipe-delimited token fields (`speakers_tokens`, `keywords_tokens`, `actions_tokens`, `audio_events_tokens`) usable with Chroma's `$contains` operator for precise server-side filtering, plus a `duration_sec` numeric metadata field. See `docs/operations.md` for filter recipes.
Run the lightweight validation suite:
```
make validate
```
This runs unit tests, validates Compose config with and without the worker profile, and compiles lightweight Python entrypoints. CI runs the same target.
For an integration test that talks to a real Chroma:
```
make compose-up          # in another terminal
make test-integration
```
The `tests/integration/` suite skips itself when Chroma is unreachable, so the same files work locally and in the compose-smoke CI lane.
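The skip-when-unreachable behavior can be implemented with a cheap TCP probe; `chroma_reachable` here is a hypothetical helper, not the suite's actual guard:

```python
import socket

def chroma_reachable(host: str = "localhost", port: int = 8000,
                     timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to Chroma succeeds within the
    timeout. Integration tests could skip themselves when this is False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```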
Run the benchmark suite when working on a hot path (search service, ingestion validators, segmentation loop, ...):
```
make bench           # full suite, text report
make bench-smoke     # 10% iterations, suitable for a quick sanity check
make bench-baseline  # capture benchmarks/reports/baseline.json
make bench-check     # gate on regression vs baseline
```
See `benchmarks/README.md` for what is measured, how to add a new benchmark, and how to read the report.
Build and publish the Docker images from docker/:
- `docker/api.Dockerfile`
- `docker/ingestion.Dockerfile`
- `docker/ui-search.Dockerfile`
- `docker/ui-speaker.Dockerfile`
Kubernetes manifests live in k8s/. Create secrets outside Git:
```
kubectl apply -f k8s/ns.yaml
kubectl -n video-se create secret generic video-se-secrets \
  --from-literal=HF_TOKEN=... \
  --from-literal=GEMINI_API_KEY=... \
  --from-literal=TMDB_API_KEY=... \
  --from-literal=RABBITMQ_DEFAULT_USER=video_se \
  --from-literal=RABBITMQ_DEFAULT_PASS='change-me' \
  --from-literal=RABBITMQ_URL='amqp://video_se:change-me@rabbitmq:5672/%2F'
kubectl apply -k k8s/
```
Replace the `<REG>/video-se-*:TAG` image placeholders in the manifests with your registry and immutable image tags. Service Dockerfiles use `python:3.12.13-slim`; Chroma is pinned to `chromadb/chroma:1.5.6`; the bundled RabbitMQ is pinned to `rabbitmq:4.1.4-management`. Update them intentionally after validation. The bundled kustomization includes RabbitMQ and an ingestion-worker deployment; set `RABBITMQ_URL` to a managed broker endpoint instead if you do not want to run RabbitMQ in-cluster. `k8s/ingestion-job.yaml` and `k8s/toolbox.yaml` are manual operational helpers and are intentionally excluded from the default kustomization.
See docs/operations.md for local runbook commands, environment notes, queue behavior, and troubleshooting.