chrisk60331/VideoAnalyzer

# VideoAnalyzer

Drop in a video. Ask anything about it.

VideoAnalyzer runs a full multi-modal analysis pipeline on any video file — object detection, transcription, scene segmentation, audio classification, OCR — then spins up an AI assistant that knows exactly what happened, when, and why.


## What it does

Upload a video → everything below runs automatically in the background:

| Step | What happens |
|---|---|
| Metadata probe | Duration, resolution, FPS via `ffprobe` |
| Whisper transcription | Full VTT transcript with timestamps (faster-whisper) |
| YOLO object detection | Frame-by-frame detection at 1 FPS (YOLOv8) |
| Scene segmentation | Cut detection + per-scene brightness, motion, color palette |
| Audio classification | Speech / silence / music+noise segmentation |
| OCR | On-screen text extracted from scene keyframes (EasyOCR) |
| Context assembly | Everything merged into a structured document for the AI |

Then you chat with a Backboard AI assistant that can answer questions like:

- "What objects appear between 1:30 and 2:00?"
- "Find every moment someone says 'product launch'"
- "Describe what's happening at 0:45 — visually, audibly, and any text on screen"
- "When does the car first appear and when does it leave?"


## Stack

- Python 3.10+ · Flask · uv
- YOLOv8 (Ultralytics) — object detection
- faster-whisper — speech transcription
- EasyOCR — on-screen text recognition
- ffmpeg — frame extraction + audio processing
- Backboard — AI assistant with tool-call loop, thread memory, document storage

## Quickstart

### Prerequisites

```sh
brew install ffmpeg        # macOS
# or: sudo apt install ffmpeg
```

You'll also need a Backboard API key.
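If you want to fail fast when `ffmpeg` is missing, a stdlib check is enough. This is a sketch; the helper name is ours, not part of the project:

```python
import shutil

def have_tool(name: str) -> bool:
    """Return True if an executable is available on PATH."""
    return shutil.which(name) is not None

# Example guard at startup:
# if not have_tool("ffmpeg"):
#     raise SystemExit("ffmpeg not found: install it first")
```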

### Install & run

```sh
git clone https://github.com/your-username/video-analyzer
cd video-analyzer

cp .env.example .env
# → add your BACKBOARD_API_KEY to .env

./start.sh
```

Open http://localhost:5050 and drop in a video.

`start.sh` syncs dependencies via uv, clears temp files, and starts the server. Press `Ctrl+C` to stop.


## Configuration

```sh
# .env
BACKBOARD_API_KEY=your_api_key_here
WHISPER_MODEL=base          # tiny | base | small | medium | large
FLASK_PORT=5050
```

Whisper model size trades speed for accuracy. `base` is a good starting point; use `small` or `medium` for better results on noisy audio.
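Reading these settings might look like the following. A stdlib-only sketch: the keys match the `.env` above, but the defaults and validation set are our assumptions, not the project's loader:

```python
import os

# Model names documented in the .env comment above.
VALID_WHISPER_MODELS = {"tiny", "base", "small", "medium", "large"}

def load_config(env=os.environ) -> dict:
    """Read app settings, falling back to the documented defaults."""
    model = env.get("WHISPER_MODEL", "base")
    if model not in VALID_WHISPER_MODELS:
        raise ValueError(f"unknown WHISPER_MODEL: {model!r}")
    return {
        "api_key": env.get("BACKBOARD_API_KEY", ""),
        "whisper_model": model,
        "port": int(env.get("FLASK_PORT", "5050")),
    }
```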


## API

All logic lives in the API — the UI is thin.

### Videos

```
POST   /api/videos                  Upload a video (multipart/form-data, field: file)
GET    /api/videos                  List all videos + status
GET    /api/videos/{id}             Full analysis JSON
GET    /api/videos/{id}/video       Stream source file
GET    /api/videos/{id}/transcript.vtt  VTT transcript
```

Processing is async. Poll `GET /api/videos/{id}` and watch `status`: `uploading` → `processing` → `ready` (or `error`).
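A client-side poll loop could look like this sketch, where `fetch_status` stands in for an HTTP GET against `/api/videos/{id}` (the helper itself is our illustration, not part of the API):

```python
import time
from typing import Callable

def wait_until_ready(fetch_status: Callable[[], str],
                     interval: float = 2.0, timeout: float = 600.0) -> str:
    """Poll until status leaves uploading/processing, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status not in ("uploading", "processing"):
            return status  # "ready" or "error"
        time.sleep(interval)
    raise TimeoutError("video analysis did not finish in time")
```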

### Chat

```
POST   /api/chat                    Send a message (returns task_id)
GET    /api/chat/task/{task_id}     Poll for response
```

Chat uses a task-polling pattern — post a message, get a `task_id`, poll until `status: done`.

Chat request body:

```json
{
  "thread_id": "...",
  "content": "What objects appear in the first minute?",
  "video_id": "..."
}
```
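Assembling and sanity-checking that body client-side might look like this sketch (the required-field set mirrors the request body shown above; the helper is hypothetical):

```python
REQUIRED_CHAT_FIELDS = ("thread_id", "content", "video_id")

def make_chat_payload(thread_id: str, content: str, video_id: str) -> dict:
    """Build the POST /api/chat body, rejecting empty fields."""
    payload = {"thread_id": thread_id, "content": content, "video_id": video_id}
    missing = [k for k in REQUIRED_CHAT_FIELDS if not payload[k]]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return payload
```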

## Assistant tools

The AI has six tools it can call mid-conversation:

| Tool | What it returns |
|---|---|
| `get_transcript` | Full or time-filtered VTT transcript |
| `search_transcript` | Timestamps matching a word or phrase |
| `get_objects_at_time` | Objects detected at a specific timestamp |
| `get_object_timeline` | Full appearance timeline for a named object |
| `get_scene_info` | Scene detail: colors, motion, audio, OCR text |
| `get_audio_segments` | Speech / silence / music timeline |
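At its core, a tool-call dispatcher can be a name → handler map. A minimal sketch, not the project's actual `tool_handler.py` (handler signatures are our assumption):

```python
from typing import Any, Callable, Dict

def make_dispatcher(handlers: Dict[str, Callable[..., Any]]):
    """Return a function that routes an assistant tool call to its handler."""
    def dispatch(tool_name: str, arguments: Dict[str, Any]) -> Any:
        if tool_name not in handlers:
            raise KeyError(f"unknown tool: {tool_name}")
        # The assistant supplies arguments as a JSON object; unpack as kwargs.
        return handlers[tool_name](**arguments)
    return dispatch
```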

## Project structure

```
src/
├── app.py                  Flask app factory
├── models.py               Pydantic models (Video, Scene, ObjectSpan, ...)
├── backboard_client.py     Backboard SDK client
├── api/
│   ├── videos.py           Upload, list, serve endpoints
│   └── chat.py             Chat + task-polling + tool-call loop
├── assistant/
│   ├── setup.py            Assistant + system prompt
│   └── tools.py            Tool definitions (JSON schema)
└── services/
    ├── pipeline.py         Orchestrates all analysis steps
    ├── detector.py         YOLO frame detection
    ├── transcriber.py      Whisper transcription
    ├── audio.py            Audio segmentation
    ├── visual.py           Scene analysis + color palette
    ├── ocr.py              EasyOCR on keyframes
    └── tool_handler.py     Dispatches assistant tool calls
templates/
├── index.html              Upload page
└── workspace.html          Video + chat workspace
models/
└── yolo26n.pt              YOLOv8 weights
```

## Supported formats

`.mp4` `.mov` `.webm` `.avi` `.mkv` — up to 500 MB
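Upload validation against these limits could look like the following sketch (the constant values come from the line above; the function name is our illustration):

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv"}
MAX_BYTES = 500 * 1024 * 1024  # 500 MB

def is_acceptable_upload(filename: str, size_bytes: int) -> bool:
    """Check the extension and size against the documented limits."""
    ext = Path(filename).suffix.lower()
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_BYTES
```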


## License

MIT
