A fully local, privacy-first AI study assistant powered by Intel OpenVINO inference. No cloud APIs, no telemetry — all computation stays on your machine.
✨ Features:
- 🎯 Real-time streaming chat with markdown/LaTeX rendering
- 🧠 Study Mode for active recall testing
- 📷 Vision model support (image analysis)
- 💾 Persistent chat history (SQLite)
- 🎮 Multi-device support (CPU and GPU; NPU is detected but not yet functional)
- 📡 Ollama & OpenAI-compatible APIs
- ⚡ Fast inference with OpenVINO INT4 models
The project is split into two components:
- server/: ovserve, a FastAPI model server with Ollama-compatible and OpenAI-compatible APIs, running OpenVINO inference on Intel CPU/GPU/NPU.
- client/: StudyBot, a Flask web UI that streams responses, manages chat history, and adds pedagogical features like Study Mode.
┌────────────────────────────────────────────────────────────────┐
│ Browser (http://localhost:5000) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ StudyBot UI (index.html) │ │
│ │ • Model selector (dynamic from /api/tags) │ │
│ │ • Streaming chat with markdown/LaTeX/code rendering │ │
│ │ • Study Mode toggle (active recall testing) │ │
│ │ • Image upload for VLM queries │ │
│ │ • Chat history sidebar (SQLite-backed) │ │
│ └──────────────────┬───────────────────────────────────────┘ │
└─────────────────────┼──────────────────────────────────────────┘
│ SSE stream (POST /chat)
▼
┌────────────────────────────────────────────────────────────────┐
│ Flask Client (client/app.py — port 5000) │
│ • Prepends system prompt + Study Mode instructions │
│ • Proxies requests to ovserve │
│ • Persists chat sessions in SQLite (client/history.db) │
│ • Handles abort relay to server │
└──────────────────────┬─────────────────────────────────────────┘
│ NDJSON stream (POST /api/chat)
▼
┌────────────────────────────────────────────────────────────────┐
│ ovserve (server/server.py — port 11435) │
│ FastAPI + OpenVINO │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Endpoints: │ │
│ │ /api/chat, /api/generate, /v1/chat/completions │ │
│ │ /api/tags, /api/pull, /api/delete, /api/load │ │
│ │ /api/ps, /api/devices, /api/benchmarks, /api/abort │ │
│ └──────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────▼───────────────────────────────────────┐ │
│ │ model_manager.py │ │
│ │ • Loads models via openvino-genai (LLMPipeline/VLMPipe) │ │
│ │ • Fallback: optimum-intel OVModelForCausalLM │ │
│ │ • Token streaming (callback + TextIteratorStreamer) │ │
│ │ • Per-session abort, keep-alive auto-unload (1hr) │ │
│ └──────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────▼───────────────────────────────────────┐ │
│ │ device_selector.py │ │
│ │ • Detects CPU, GPU, NPU via OpenVINO Core │ │
│ │ • Benchmarks each device per model │ │
│ │ • Selects fastest working device automatically │ │
│ │ • Manages compile cache (server/ov_cache/) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Model Storage: │
│ • Bundled: server/gemma-3-4b-it-int4-ov/ (VLM) │
│ • Bundled: server/qwen2.5-7b-instruct-int4-ov/ (LLM) │
│ • Pulled: ~/.ovserve/models/ (registry.json) │
└────────────────────────────────────────────────────────────────┘
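The diagram shows the Flask client receiving an NDJSON stream from ovserve and relaying it to the browser as SSE. The translation step could look roughly like the sketch below. It assumes Ollama-style chunks of the form `{"message": {"content": ...}, "done": ...}`. The function name `ndjson_to_sse` and the `token`/`done` fields in the SSE payload are hypothetical and not taken from client/app.py.

```python
import json
from typing import Iterable, Iterator


def ndjson_to_sse(lines: Iterable[str]) -> Iterator[str]:
    """Convert Ollama-style NDJSON chunks into SSE frames for the browser."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        # Assumed chunk shape: {"message": {"content": "..."}, "done": bool}
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield "data: " + json.dumps({"token": text}) + "\n\n"
        if chunk.get("done"):
            yield "data: " + json.dumps({"done": True}) + "\n\n"
            break


if __name__ == "__main__":
    sample = [
        '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
        '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
        '{"message": {"role": "assistant", "content": ""}, "done": true}',
    ]
    for frame in ndjson_to_sse(sample):
        print(frame, end="")
```

Each SSE frame is a `data:` line followed by a blank line, which is what a browser-side stream reader expects as a message boundary.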
- CPU: Intel 12th Gen or newer (Alder Lake+)
- RAM: 16 GB (models are INT4-quantized, ~4-7 GB each)
- Disk: 20 GB free for model storage
- OS: Windows 10/11 or Linux
- Python: 3.10+
- CPU: Intel Core Ultra (Meteor Lake / Arrow Lake)
- iGPU: Intel Arc Graphics (Xe-LPG), the primary inference device
  - Throughput: 10-17 tokens/sec for 4B-7B parameter INT4 models
  - First-token latency: roughly 0.5-2.8 s (improves with compile cache)
- NPU: Intel AI Boost — not currently functional (see Roadmap)
- RAM: 16-32 GB (shared with iGPU)
| Model | Device | Tokens/sec | First Token Latency |
|---|---|---|---|
| Gemma 3 4B VLM (INT4) | GPU | 12-17 tok/s | 0.5-2.8s |
| Qwen 2.5 7B (INT4) | GPU | 10-12 tok/s | 0.6-1.4s |
| Phi-3 Mini 4K (INT4) | GPU | 13-15 tok/s | 0.9-1.2s |
| Qwen 2.5 7B (INT4) | CPU | ~0.4 tok/s | ~55s |
Note: GPU numbers are from an Intel Arc iGPU on a Core Ultra system. The first inference after a model load is slower because OpenVINO compiles the model graph. The compile cache (server/ov_cache/) avoids most of that cost on subsequent runs.
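To see what these figures mean for a whole reply, a rough estimate is first-token latency plus the token count divided by throughput. The sketch below applies that approximation to the Qwen 2.5 7B rows. The 256-token reply length is an arbitrary example, and the helper name is hypothetical.

```python
def estimated_seconds(tokens: int, tokens_per_sec: float, first_token_latency: float) -> float:
    """Rough wall-clock time: wait for the first token, then stream the rest."""
    return first_token_latency + tokens / tokens_per_sec


if __name__ == "__main__":
    # Figures from the table above (GPU: lower-bound throughput, upper-bound latency).
    gpu = estimated_seconds(256, tokens_per_sec=10.0, first_token_latency=1.4)
    cpu = estimated_seconds(256, tokens_per_sec=0.4, first_token_latency=55.0)
    print(f"GPU: ~{gpu:.0f} s, CPU: ~{cpu:.0f} s")
```

Under these assumptions the same reply takes about 27 seconds on the iGPU and about 695 seconds (over 11 minutes) on the CPU, which is why the CPU is treated as a fallback only.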
- Python 3.10+
- Intel-compatible GPU drivers (for iGPU inference)
- Model weight directories in server/ (not included in Git; see .gitignore)
Server (OpenVINO inference engine):
cd server
pip install -r Requirements.txt
Client (StudyBot web UI):
cd client
pip install -r requirements.txt
If model directories are not present, you can pull them through the API after starting the server:
# Pull from HuggingFace (auto-exports to OpenVINO format)
curl -X POST http://localhost:11435/api/pull \
-H "Content-Type: application/json" \
-d '{"model": "OpenVINO/gemma-3-4b-it-int4-ov"}'
Or manually place pre-exported OpenVINO model directories:
- server/gemma-3-4b-it-int4-ov/
- server/qwen2.5-7b-instruct-int4-ov/
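Before starting the server, it can help to check whether the bundled model directories are actually in place. The sketch below is a hypothetical helper, not part of the repository. It assumes the two directory names listed above and only checks that they exist, not that their contents are valid.

```python
from pathlib import Path

# Directory names taken from this README; adjust if your layout differs.
BUNDLED_MODELS = ("gemma-3-4b-it-int4-ov", "qwen2.5-7b-instruct-int4-ov")


def missing_model_dirs(server_dir: str) -> list[str]:
    """Return the bundled model directory names that are absent under server_dir."""
    root = Path(server_dir)
    return [name for name in BUNDLED_MODELS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs("server")
    if missing:
        print("Missing model directories:", ", ".join(missing))
    else:
        print("All bundled model directories found.")
```

Any name it reports as missing can be pulled through /api/pull as shown above, or copied into server/ by hand.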
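OVSERVE_MODELS_DIR is meant to let you store pulled models somewhere else, but the Roadmap lists this override as not yet implemented: pulled models currently always go to ~/.ovserve/models/. Once the override is available, set it before starting the server: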
$env:OVSERVE_MODELS_DIR = "D:\ovserve\models"
python server\ovserve.py
To start both services with one command on Windows (adjust the path to your own checkout):
cd c:\Users\Manthan\OneDrive\ai-project
start.bat
This launches both services:
- ovserve (FastAPI) on http://localhost:11435
- StudyBot (Flask) on http://localhost:5000
Then open your browser to http://localhost:5000 and start chatting! 🎉
Terminal 1: start the model server
cd server
python ovserve.py
The server listens on http://localhost:11435. You'll see a hardware banner listing detected CPU/GPU/NPU devices.
Terminal 2: start the StudyBot client
cd client
python app.py
The client listens on http://localhost:5000. Open this URL in your browser.
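If the browser cannot reach the UI, a quick first check is whether both ports are accepting connections. The sketch below uses only the standard library and the two ports named above. The helper `is_listening` is hypothetical. It confirms only that something is listening, not that it is the right service.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in (("ovserve", 11435), ("StudyBot", 5000)):
        state = "up" if is_listening("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```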
The server ships its own lightweight chat UI at http://localhost:11435/ (via chat.html). This bypasses the Flask client and talks directly to the API — useful for quick testing without Study Mode features.
Base URL: http://localhost:11435
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/tags | List all local models |
| POST | /api/pull | Download model from HuggingFace |
| POST | /api/load | Preload model on a specific device |
| POST | /api/unload | Unload model from memory |
| DELETE | /api/delete | Remove model files and registry entry |
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/generate | Single-turn text generation (streaming/non-streaming) |
| POST | /api/chat | Multi-turn chat (streaming/non-streaming) |
| POST | /v1/chat/completions | OpenAI-compatible chat API |
| POST | /api/abort | Abort an in-progress generation by session ID |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/version | Server version |
| GET | /api/ps | List currently loaded models |
| GET | /api/devices | Show detected hardware (CPU/GPU/NPU) |
| GET | /api/benchmarks | Generation performance history |
| GET | / | Server-side chat UI |
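The same endpoints can be called from Python with the standard library alone. The sketch below lists local models through /api/tags. It assumes the response follows Ollama's shape, `{"models": [{"name": ...}, ...]}`, which this README implies but does not spell out. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:11435"


def model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama-style /api/tags response."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def fetch_model_names(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Call GET /api/tags on a running ovserve instance and return model names."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    # Offline demonstration with a hand-written payload in the assumed shape.
    sample = {"models": [{"name": "OpenVINO/gemma-3-4b-it-int4-ov"}]}
    print(model_names(sample))
```

With the server running, `fetch_model_names()` performs the real request. It raises `urllib.error.URLError` if nothing is listening on port 11435.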
Chat (streaming):
curl -X POST http://localhost:11435/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "OpenVINO/qwen2.5-7b-instruct-int4-ov",
"messages": [{"role":"user","content":"What is OpenVINO?"}],
"stream": true
}'
Force a specific device:
curl -X POST http://localhost:11435/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "OpenVINO/qwen2.5-7b-instruct-int4-ov",
"prompt": "Explain recursion briefly.",
"stream": false,
"options": { "device": "GPU", "num_predict": 128 }
}'
We've created comprehensive documentation to help you get the most out of StudyBot:
| Document | For | Contents |
|---|---|---|
| SETUP_GUIDE.md | 🆕 New users | Step-by-step installation on Windows/macOS/Linux, venv setup, hardware detection, first-time model downloads, troubleshooting setup issues |
| USAGE_GUIDE.md | 👤 End users | How to use StudyBot UI, Study Mode, image upload, chat history, keyboard shortcuts, tips & tricks, common questions |
| API_DOCUMENTATION.md | 👨‍💻 Developers | Complete API reference with request/response examples, cURL tests, OpenAI compatibility, streaming formats, rate limiting (planned) |
| ARCHITECTURE.md | 🏗️ Maintainers | System design, component breakdown, data flow, performance considerations, security analysis, future enhancements |
| CONTRIBUTING.md | 🤝 Contributors | Code style guide, testing framework, PR process, areas needing help, commit message conventions |
| TROUBLESHOOTING.md | 🆘 Debugging | Solutions to 50+ common issues (server won't start, GPU not detected, chat not responding, etc.) with steps and workarounds |
ai-project/
├── README.md # This file
├── .gitignore
├── client/
│ ├── app.py # Flask server, SSE proxy, session management
│ ├── db.py # SQLite persistence layer
│ ├── requirements.txt # Flask + requests
│ └── templates/
│ └── index.html # Full StudyBot UI (48KB)
└── server/
├── ovserve.py # Entry point — banner + uvicorn launch
├── server.py # FastAPI routes and request/response schemas
├── model_manager.py # Model lifecycle, generation, caching
├── device_selector.py # Device detection, benchmarking, auto-selection
├── chat.html # Lightweight server-side chat UI
├── Requirements.txt # OpenVINO + FastAPI dependencies
├── run_chat.py # Example: Python API client
├── test_gemma.py # Example: VLM pipeline test
├── device_benchmarks.json # (Generated) Device performance cache
├── ov_cache/ # (Generated) Compiled IR cache per device
├── gemma-3-4b-it-int4-ov/ # (Git-ignored) Pre-bundled VLM
└── qwen2.5-7b-instruct-int4-ov/ # (Git-ignored) Pre-bundled LLM
The server benchmarks all detected devices per model and selects the fastest one that produces valid output.
| Device | Status | Notes |
|---|---|---|
| CPU | ✅ Always works | Fallback device. Slow (~0.4 tok/s for 7B) but universally compatible. |
| GPU | ✅ Primary device | Intel Arc iGPU provides 10-17 tok/s. Compile cache accelerates reloads. |
| NPU | ❌ Not working | Intel AI Boost detected but fails on all current models due to dynamic-shape compilation errors. See Roadmap. |
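The selection rule described above (fastest device that produces valid output, with CPU as the fallback) can be sketched as follows. This is an illustration of the rule only, not the code in device_selector.py. The function name and the convention of using `None` for a failed device are assumptions.

```python
from typing import Optional


def pick_device(benchmarks: dict[str, Optional[float]], fallback: str = "CPU") -> str:
    """Pick the device with the highest tokens/sec; None marks a failed device."""
    working = {
        dev: tps for dev, tps in benchmarks.items() if tps is not None and tps > 0
    }
    if not working:
        return fallback  # nothing benchmarked successfully
    return max(working, key=working.get)


if __name__ == "__main__":
    # Approximate figures from the performance table; NPU fails to compile.
    print(pick_device({"CPU": 0.4, "GPU": 11.0, "NPU": None}))  # prints "GPU"
```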
You can force a device per request:
{ "options": { "device": "GPU" } }
ovserve runs on port 11435 and implements Ollama's REST API surface. Any Ollama-compatible client can be pointed at it:
OLLAMA_HOST=http://localhost:11435 ollama list
from ollama import Client
client = Client(host='http://localhost:11435')
response = client.generate(model='OpenVINO/gemma-3-4b-it-int4-ov', prompt='Hello')
| Symptom | Fix |
|---|---|
| /health reports backend offline | Ensure python ovserve.py is running on port 11435. |
| Model load errors on NPU | Use "options": {"device":"GPU"} or "CPU" instead. NPU is not yet functional. |
| Client starts but no models shown | Check curl http://localhost:11435/api/tags returns models. |
| Module import errors | Reinstall: pip install -r server/Requirements.txt and pip install -r client/requirements.txt. |
| Slow first inference | Expected — OpenVINO compiles the model graph on first use. Compile cache (ov_cache/) speeds up reloads. |
| Out of memory on GPU | Switch to CPU: "options": {"device":"CPU"}. Or use a smaller model. |
| Feature | Status | Details |
|---|---|---|
| NPU inference | 🔴 Blocked | Intel AI Boost is detected but compilation fails on all models (dynamic shape errors). Requires openvino-genai StaticLLMPipeline or updated NPU driver/compiler. |
| VLM benchmarking | 🟡 Missing | device_selector.py only benchmarks text models via OVModelForCausalLM. VLM models skip benchmarks and use a GPU-first heuristic. |
| OVSERVE_MODELS_DIR env var | 🟡 Missing | Server README claimed this existed, but models dir is hardcoded to ~/.ovserve/models/. |
| Multi-model concurrent loading | 🟡 Not supported | Only one model is loaded at a time. Loading a new model unloads the previous one. |
- Add StaticLLMPipeline support for NPU (requires openvino-genai ≥ 2025.x)
- Implement VLM device benchmarking via openvino_genai.VLMPipeline
- Add OVSERVE_MODELS_DIR environment variable override
- Reduce Flask proxy overhead (consider direct browser→ovserve with CORS)
- Batch benchmark file writes instead of per-generation I/O
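For the planned OVSERVE_MODELS_DIR override, one possible shape is a small resolver that prefers the environment variable and falls back to the current hardcoded location. This is a sketch of the roadmap item, not existing server code. The function name and the `env` parameter (there to make testing easier) are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Current hardcoded location, per the Roadmap table above.
DEFAULT_MODELS_DIR = Path.home() / ".ovserve" / "models"


def resolve_models_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return OVSERVE_MODELS_DIR if set and non-empty, else the default directory."""
    env = os.environ if env is None else env
    override = env.get("OVSERVE_MODELS_DIR", "").strip()
    return Path(override).expanduser() if override else DEFAULT_MODELS_DIR


if __name__ == "__main__":
    print(resolve_models_dir())
```

Treating an empty value the same as an unset one avoids resolving to the current working directory by accident.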
- API changes: Start from server/server.py for endpoint modifications.
- Model lifecycle: Start from server/model_manager.py for generation, caching, or device logic.
- UI changes: Start from client/templates/index.html for the StudyBot interface.
- Device logic: Start from server/device_selector.py for hardware detection and routing.
| Package | Version | Purpose |
|---|---|---|
| openvino | ≥2024.0.0 | CPU/GPU inference runtime |
| openvino-genai | latest | Token streaming, VLM pipeline |
| optimum-intel[openvino] | latest | Model conversion and loading |
| transformers | ≥4.40.0 | Tokenizers and chat templates |
| fastapi | ≥0.110.0 | Web server |
| uvicorn | ≥0.29.0 | ASGI runner |
| pydantic | ≥2.0.0 | Request validation |
| requests | latest | HTTP client (for pull) |
| Package | Version | Purpose |
|---|---|---|
| flask | ≥3.0.0 | Web framework |
| requests | ≥2.31.0 | HTTP client (proxy to ovserve) |
| Issue | Impact | Workaround |
|---|---|---|
| No input validation on model names | Path traversal vulnerability | Only pull from known models |
| CORS allows all origins | Security risk (CSRF/XSS) | Runs locally only (port 5000 is localhost) |
| Memory leak: _abort_events dict | Grows indefinitely over time | Restart server periodically |
| No rate limiting | DoS attack possible | Runs locally only |
| Database has no indexes | Slow with 1000+ messages | Keep chat history below ~10K messages |
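The path traversal issue in the first row could be mitigated by validating model names before they are used to build filesystem paths. The sketch below is one possible check, not existing server code. It assumes names look like `org/name` or `name`, as in the examples in this README, and the allowed character set is an assumption that may be too strict for some repositories.

```python
import re

# Assumed naming scheme: "org/name" or "name", e.g. "OpenVINO/gemma-3-4b-it-int4-ov".
_SEGMENT = r"[A-Za-z0-9][A-Za-z0-9._-]*"
_MODEL_NAME = re.compile(rf"{_SEGMENT}(?:/{_SEGMENT})?")


def is_safe_model_name(name: str) -> bool:
    """Accept one or two plain segments; reject traversal and absolute paths."""
    return bool(_MODEL_NAME.fullmatch(name)) and ".." not in name


if __name__ == "__main__":
    for candidate in ("OpenVINO/gemma-3-4b-it-int4-ov", "../../etc/passwd", "/abs/path"):
        print(candidate, "->", is_safe_model_name(candidate))
```

Rejecting anything outside an allow-list is generally safer than trying to strip dangerous sequences from the input.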
- Image size not validated (DoS risk from huge base64 payloads)
- Model cache never garbage-collected
- No timeout on model loading (can hang server)
- Single model at a time (no concurrent inference)
- Hardcoded paths in start.bat (not portable)
- No structured logging (only print statements)
- No API versioning
- Chat history title is naive (first 30 chars)
- No request logging/analytics
- No error tracking (Sentry integration)
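For the image size item at the top of this list, the decoded size of a padded base64 string can be computed from its length alone, so an oversized upload can be rejected before it is decoded. The sketch below shows that check. The 10 MiB cap and both function names are hypothetical, and it assumes standard padded base64 without embedded line breaks.

```python
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB cap


def decoded_size(b64: str) -> int:
    """Decoded byte count of a padded base64 string, computed without decoding."""
    data = b64.strip()
    padding = len(data) - len(data.rstrip("="))
    return (len(data) * 3) // 4 - padding


def image_within_limit(b64: str, limit: int = MAX_IMAGE_BYTES) -> bool:
    """Return True if the decoded payload would fit within limit bytes."""
    return decoded_size(b64) <= limit


if __name__ == "__main__":
    encoded = base64.b64encode(b"hello").decode("ascii")
    print(encoded, decoded_size(encoded))  # prints "aGVsbG8= 5"
```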
- ✨ User authentication (JWT)
- ✨ Full-text search in chat history
- ✨ Chat branching (continue from any point)
- ✨ Message editing/deletion
- ✨ Export to Markdown/PDF
- ✨ Model comparison (A/B testing)
- ✨ Fine-tuning support (LoRA)
- ✨ Webhooks for events
- ✨ Docker containerization
- New here? → Start with SETUP_GUIDE.md
- How do I...? → Check USAGE_GUIDE.md
- What's an API? → See API_DOCUMENTATION.md
- Something broke? → Try TROUBLESHOOTING.md
- Want to contribute? → Read CONTRIBUTING.md
Found a bug? Create a GitHub Issue with:
- What you did
- What happened
- What you expected
- Your environment (OS, Python version, GPU model)
- Full error message from terminal
- GitHub Discussions — Ask anything, get community help
- GitHub Issues — Report confirmed bugs
- OpenVINO Docs — How OpenVINO works
- FastAPI Docs — API framework
- Ollama API Spec — Ollama compatibility
- Transformers Docs — Model tokenization
We welcome contributions! See CONTRIBUTING.md for:
- Code style guide
- How to write tests
- Pull request process
- Areas that need help
Quick ways to help:
- 🐛 Report bugs with reproducible steps
- 📝 Improve documentation
- ✨ Implement planned features
- 🧪 Write tests (especially for model_manager.py)
- 🎨 Improve UI/UX
This project is open source. See LICENSE file for details.
- Intel OpenVINO — Efficient inference runtime
- FastAPI — Modern async web framework
- Ollama — API compatibility inspiration
- Transformers/HuggingFace — Tokenizers and model hosting
| Component | Status | Notes |
|---|---|---|
| Core inference | ✅ Working | Tested on Intel CPU/GPU |
| Chat UI | ✅ Working | Streaming, history, Study Mode |
| API endpoints | ✅ Mostly working | See known issues above |
| NPU support | 🔴 Broken | Compilation fails on dynamic shapes |
| Testing | 🟡 Minimal | Unit tests needed for model_manager.py |
| Documentation | 🟢 Good | Setup, usage, API, troubleshooting guides |
Last updated: April 2026
Version: 0.1.0-ovserve
Python: 3.10+
OpenVINO: 2024.0.0+