📰 https://news.opensuse.org/2025/10/08/gsoc-semantic-video-search
An end-to-end video ingestion and semantic search system. The pipeline extracts transcript, speaker, visual, action, and audio-event metadata from videos, enriches segments with an LLM, indexes them in ChromaDB, and serves hybrid text/visual search through FastAPI and Streamlit.
- FastAPI search API backed by ChromaDB vector search.
- Streamlit search UI for selecting videos and jumping to result timestamps.
- Batch ingestion pipeline for extraction, segmentation, enrichment, and indexing.
- RabbitMQ ingestion queue with publisher and worker entrypoints.
- Docker Compose stack for local API, UI, ChromaDB, RabbitMQ, and optional worker.
- Kubernetes manifests for production-style deployment.
- Lightweight tests, Compose validation, and CI.
```
data/videos/*.mp4
  -> ingestion_pipeline.run_pipeline
  -> extracted transcript, shots, visual captions, actions, audio events
  -> manual speaker map
  -> enriched segments
  -> ChromaDB collection
  -> FastAPI search API
  -> Streamlit search UI
```
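The end of this flow is a set of enriched segments indexed into ChromaDB. The sketch below shows what one indexed record might look like; the field names are hypothetical, modeled on the metadata the pipeline extracts, and the real schema may differ:

```python
from dataclasses import dataclass, field

# Hypothetical shape of one enriched segment. Field names mirror the
# metadata kinds described above (transcript, speakers, actions, audio
# events) but are illustrative, not the project's real schema.
@dataclass
class Segment:
    video: str
    start_sec: float
    end_sec: float
    transcript: str
    speakers: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    audio_events: list = field(default_factory=list)

def to_chroma_record(seg: Segment) -> tuple[str, dict]:
    """Flatten a segment into a (document, metadata) pair the way an
    indexer might: free text for embedding, pipe-delimited token fields
    plus a numeric duration for server-side filtering."""
    meta = {
        "video_filename": seg.video,
        "duration_sec": seg.end_sec - seg.start_sec,
        "speakers_tokens": "|".join(seg.speakers),
        "actions_tokens": "|".join(seg.actions),
        "audio_events_tokens": "|".join(seg.audio_events),
    }
    return seg.transcript, meta
```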
RabbitMQ can decouple ingestion from callers:
```
publisher CLI -> RabbitMQ queue -> ingestion worker -> pipeline -> ChromaDB
```
- Python 3.12 recommended.
- Docker and Docker Compose.
- FFmpeg for local ingestion outside Docker.
- Enough disk for model caches and processed media.
- `HF_TOKEN` for WhisperX speaker diarization.
- `GEMINI_API_KEY` when `LLM_PROVIDER=gemini`.
- Optional `TMDB_API_KEY` for movie metadata lookup.
```
cp .env.example .env
make install-dev
make validate
```
Edit `.env` and set real secrets. Tracked config files are safe defaults only; credentials must come from environment variables.
For local ingestion outside Docker, install the heavier runtime dependencies into the same virtualenv:
```
.venv/bin/python -m pip install -r requirements-ingestion.txt -r requirements-ui.txt
```
Start ChromaDB, RabbitMQ, the API, and the search UI:
```
make compose-up
```
That alone gets you a working search demo: the API auto-indexes a tiny bundled Sintel preview into Chroma on first boot via `api/demo_bootstrap.py`, so opening http://localhost:8501/ immediately shows the multi-page Streamlit UI with searchable results. No HF token, Gemini key, or worker is needed for that path. Set `DEMO_BOOTSTRAP=0` in `.env` to disable it.
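The `DEMO_BOOTSTRAP` toggle reads as a simple guard: any value other than `0` (including unset) keeps the demo indexing on. This sketch is illustrative, not the actual `api/demo_bootstrap.py` logic:

```python
import os

def demo_bootstrap_enabled(environ=os.environ) -> bool:
    """Return True unless DEMO_BOOTSTRAP is explicitly set to "0".
    Hypothetical helper mirroring the toggle described above."""
    return environ.get("DEMO_BOOTSTRAP", "1") != "0"
```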
Default URLs:
- Demo UI: http://localhost:8501 — multi-page Streamlit (Home / Submit / Pipeline / Search). The Pipeline page animates per-step progress live as the worker advances. For the Submit page to actually queue jobs, also run `make compose-worker`.
- Speaker identification UI: http://localhost:5050 when started with `make compose-speaker`.
- Search API: http://localhost:1234
- API liveness: http://localhost:1234/healthz
- API readiness: http://localhost:1234/readyz
- RabbitMQ management: http://localhost:15672 with credentials from `.env`, or `video_se`/`video_se_dev` if unset.
- ChromaDB: http://localhost:8000
If you copied .env.example, RabbitMQ uses the local development credentials from that file. The container worker connects through the Compose service name, while host-side publisher commands use localhost.
Stop services:
```
make compose-down
```
Put videos under `data/videos/`.
Run ingestion synchronously from the local virtualenv:
```
.venv/bin/python -m ingestion_pipeline.run_pipeline --video data/videos/your_video.mp4
```
Optional metadata:
```
.venv/bin/python -m ingestion_pipeline.run_pipeline \
    --video data/videos/your_video.mp4 \
    --title "Movie Title" \
    --year 2024
```
By default `SPEAKER_UI_MODE=external`, so the pipeline waits for the configured `speaker_map.json` after extraction. Set `SPEAKER_MAP_TIMEOUT_SECONDS` to cap that wait for workers, or leave it empty for no timeout. Run the speaker identification UI in another terminal when needed:
```
.venv/bin/streamlit run app/ui/speaker_id_tool.py
```
Or run the containerized speaker UI against the Compose data mounts:
```
make compose-speaker
```
Start the worker profile:
```
make compose-worker
```
Publish a job. When targeting the container worker, use the path as seen inside the worker container. The host-side publisher reads `RABBITMQ_URL` from `.env`, or you can pass `--rabbitmq-url` explicitly:
```
make publish-ingest VIDEO=/data/videos/your_video.mp4
```
Equivalent direct command:
```
.venv/bin/python -m ingestion_pipeline.publisher --video /data/videos/your_video.mp4
```
The worker requires `RABBITMQ_URL`, consumes `INGESTION_QUEUE`, runs the same pipeline, acknowledges successful jobs, and rejects failed jobs without requeueing. Compose wires this URL automatically for the container worker.
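The ack/reject behavior maps naturally onto a pika-style delivery callback. `handle_delivery` and the injected `run_pipeline` callable below are hypothetical names for illustration, not the project's worker code:

```python
import json

def handle_delivery(channel, method, body, run_pipeline):
    """Sketch of a worker delivery callback: parse the job, run the
    pipeline, ack on success, reject without requeue on failure."""
    try:
        job = json.loads(body)
        run_pipeline(job["video"])
    except Exception:
        # Failed jobs are rejected and NOT requeued, matching the
        # behavior described above.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
    else:
        channel.basic_ack(delivery_tag=method.delivery_tag)
```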
Config is loaded from CONFIG_PATH or config.yaml, then environment variables override runtime values.
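The precedence rule (file values first, environment overrides second) can be sketched as follows; `load_config` and the uppercase-name mapping are assumptions for illustration, not the project's actual loader:

```python
import os

def load_config(file_values: dict, environ=os.environ) -> dict:
    """Start from the YAML file's values, then let a matching environment
    variable (key name uppercased) override each one."""
    merged = dict(file_values)
    for key in file_values:
        env_val = environ.get(key.upper())
        if env_val is not None:
            merged[key] = env_val
    return merged
```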
Important variables:
- `HF_TOKEN`, `GEMINI_API_KEY`, `TMDB_API_KEY`
- `ML_DEVICE`, `OUTPUT_DIR`, `VIDEO_DATA_PATH`, `MODEL_CACHE_DIR`
- `SPEAKER_UI_MODE`, `SPEAKER_MAP_TIMEOUT_SECONDS`
- `API_HOST`, `API_PORT`, `SEARCH_API_TIMEOUT_SECONDS`, `UI_HOST`, `UI_PORT`
- `CHROMA_HOST`, `CHROMA_PORT`, `CHROMA_COLLECTION`, `CHROMA_IMAGE_TAG`
- `RABBITMQ_URL`, `INGESTION_QUEUE`
- `RABBITMQ_IMAGE_TAG`, `RABBITMQ_DEFAULT_USER`, `RABBITMQ_DEFAULT_PASS` for local Compose or the bundled Kubernetes RabbitMQ
- `LLM_PROVIDER`, `GEMINI_MODEL`, `OLLAMA_HOST`, `OLLAMA_PORT`, `OLLAMA_MODEL`
- `LOG_LEVEL` for runtime log verbosity (defaults to INFO)
Use config.example.yaml and .env.example as references. Do not put secrets in tracked YAML.
POST http://localhost:1234/search with a JSON body:
| Field | Required | Description |
|---|---|---|
| `query` | yes | Free-text query, ≤ 1000 chars. |
| `top_k` | no (default 5) | 1–50. |
| `video_filename` | no | Restrict to one video's segments by filename stem. |
| `min_duration_sec` | no | Drop segments shorter than this. |
| `max_duration_sec` | no | Drop segments longer than this. |
Example:
```
curl -s -X POST http://localhost:1234/search \
  -H 'Content-Type: application/json' \
  -d '{"query":"explosion fire","top_k":3,"min_duration_sec":5}'
```
Each indexed segment also carries pipe-delimited token fields (`speakers_tokens`, `keywords_tokens`, `actions_tokens`, `audio_events_tokens`) usable with Chroma's `$contains` operator for precise server-side filtering, plus a `duration_sec` numeric metadata field. See `docs/operations.md` for filter recipes.
Run the lightweight validation suite:
```
make validate
```
This runs unit tests, validates Compose config with and without the worker profile, and compiles lightweight Python entrypoints. CI runs the same target.
For an integration test that talks to a real Chroma:
```
make compose-up          # in another terminal
make test-integration
```
The `tests/integration/` suite skips itself when Chroma is unreachable, so the same files work locally and in the compose-smoke CI lane.
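The skip-when-unreachable behavior can be implemented with a cheap TCP probe; `chroma_reachable` here is a hypothetical helper, not the suite's actual guard:

```python
import socket

def chroma_reachable(host: str = "localhost", port: int = 8000,
                     timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to Chroma succeeds within the
    timeout. Integration tests could skip themselves when this is False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```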
Run the benchmark suite when working on a hot path (search service, ingestion validators, segmentation loop, ...):
```
make bench           # full suite, text report
make bench-smoke     # 10% iterations, suitable for a quick sanity check
make bench-baseline  # capture benchmarks/reports/baseline.json
make bench-check     # gate on regression vs baseline
```
See `benchmarks/README.md` for what is measured, how to add a new benchmark, and how to read the report.
Build and publish the Docker images from docker/:
- `docker/api.Dockerfile`
- `docker/ingestion.Dockerfile`
- `docker/ui-search.Dockerfile`
- `docker/ui-speaker.Dockerfile`
Kubernetes manifests live in k8s/. Create secrets outside Git:
```
kubectl apply -f k8s/ns.yaml
kubectl -n video-se create secret generic video-se-secrets \
  --from-literal=HF_TOKEN=... \
  --from-literal=GEMINI_API_KEY=... \
  --from-literal=TMDB_API_KEY=... \
  --from-literal=RABBITMQ_DEFAULT_USER=video_se \
  --from-literal=RABBITMQ_DEFAULT_PASS='change-me' \
  --from-literal=RABBITMQ_URL='amqp://video_se:change-me@rabbitmq:5672/%2F'
kubectl apply -k k8s/
```
Replace the `<REG>/video-se-*:TAG` image placeholders in the manifests with your registry and immutable image tags. Service Dockerfiles use `python:3.12.13-slim`; Chroma is pinned to `chromadb/chroma:1.5.6`; the bundled RabbitMQ is pinned to `rabbitmq:4.1.4-management`. Update them intentionally after validation. The bundled kustomization includes RabbitMQ and an ingestion-worker deployment; set `RABBITMQ_URL` to a managed broker endpoint instead if you do not want to run RabbitMQ in-cluster. `k8s/ingestion-job.yaml` and `k8s/toolbox.yaml` are manual operational helpers and are intentionally excluded from the default kustomization.
See docs/operations.md for local runbook commands, environment notes, queue behavior, and troubleshooting.