StudyBot + ovserve — Local AI Study Assistant

A fully local, privacy-first AI study assistant powered by Intel OpenVINO inference. No cloud APIs, no telemetry — all computation stays on your machine.

✨ Features:

🎯 Real-time streaming chat with markdown/LaTeX rendering
🧠 Study Mode for active recall testing
📷 Vision model support (image analysis)
💾 Persistent chat history (SQLite)
🎮 Multi-device support (CPU and GPU; NPU is detected but not yet functional)
📡 Ollama & OpenAI-compatible APIs
⚡ Fast inference with OpenVINO INT4 models

The project is split into two components:

server/ — ovserve, a FastAPI model server with Ollama-compatible and OpenAI-compatible APIs, running OpenVINO inference on Intel CPU/GPU/NPU.
client/ — StudyBot, a Flask web UI that streams responses, manages chat history, and adds pedagogical features like Study Mode.
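
As a rough illustration of how the client could add its system prompt and Study Mode instructions before forwarding a conversation, here is a minimal sketch. The prompt strings and the `build_messages` helper are hypothetical; the actual logic lives in client/app.py and may differ.

```python
from typing import Dict, List

Message = Dict[str, str]

# Hypothetical prompt text, for illustration only.
SYSTEM_PROMPT = "You are StudyBot, a helpful study assistant."
STUDY_MODE_PROMPT = (
    "Study Mode is on: ask the student a recall question before explaining."
)


def build_messages(history: List[Message], study_mode: bool) -> List[Message]:
    """Return the message list to send upstream, with one system message first."""
    system_text = SYSTEM_PROMPT
    if study_mode:
        system_text += "\n\n" + STUDY_MODE_PROMPT
    # Drop system messages sent by the browser so the server-side prompt wins.
    turns = [m for m in history if m.get("role") != "system"]
    return [{"role": "system", "content": system_text}] + turns
```

Filtering out browser-supplied system messages is a design choice in this sketch, not a documented behaviour of StudyBot.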

Architecture

┌────────────────────────────────────────────────────────────────┐
│  Browser (http://localhost:5000)                               │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  StudyBot UI (index.html)                                │  │
│  │  • Model selector (dynamic from /api/tags)               │  │
│  │  • Streaming chat with markdown/LaTeX/code rendering      │  │
│  │  • Study Mode toggle (active recall testing)             │  │
│  │  • Image upload for VLM queries                          │  │
│  │  • Chat history sidebar (SQLite-backed)                  │  │
│  └──────────────────┬───────────────────────────────────────┘  │
└─────────────────────┼──────────────────────────────────────────┘
                      │ SSE stream (POST /chat)
                      ▼
┌────────────────────────────────────────────────────────────────┐
│  Flask Client (client/app.py — port 5000)                     │
│  • Prepends system prompt + Study Mode instructions           │
│  • Proxies requests to ovserve                                │
│  • Persists chat sessions in SQLite (client/history.db)       │
│  • Handles abort relay to server                              │
└──────────────────────┬─────────────────────────────────────────┘
                       │ NDJSON stream (POST /api/chat)
                       ▼
┌────────────────────────────────────────────────────────────────┐
│  ovserve (server/server.py — port 11435)                      │
│  FastAPI + OpenVINO                                           │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Endpoints:                                              │  │
│  │  /api/chat, /api/generate, /v1/chat/completions          │  │
│  │  /api/tags, /api/pull, /api/delete, /api/load            │  │
│  │  /api/ps, /api/devices, /api/benchmarks, /api/abort      │  │
│  └──────────────────┬───────────────────────────────────────┘  │
│                     │                                          │
│  ┌──────────────────▼───────────────────────────────────────┐  │
│  │  model_manager.py                                        │  │
│  │  • Loads models via openvino-genai (LLMPipeline/VLMPipe) │  │
│  │  • Fallback: optimum-intel OVModelForCausalLM            │  │
│  │  • Token streaming (callback + TextIteratorStreamer)      │  │
│  │  • Per-session abort, keep-alive auto-unload (1hr)       │  │
│  └──────────────────┬───────────────────────────────────────┘  │
│                     │                                          │
│  ┌──────────────────▼───────────────────────────────────────┐  │
│  │  device_selector.py                                      │  │
│  │  • Detects CPU, GPU, NPU via OpenVINO Core               │  │
│  │  • Benchmarks each device per model                      │  │
│  │  • Selects fastest working device automatically          │  │
│  │  • Manages compile cache (server/ov_cache/)              │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  Model Storage:                                                │
│  • Bundled: server/gemma-3-4b-it-int4-ov/ (VLM)              │
│  • Bundled: server/qwen2.5-7b-instruct-int4-ov/ (LLM)       │
│  • Pulled:  ~/.ovserve/models/ (registry.json)                │
└────────────────────────────────────────────────────────────────┘
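
The diagram shows the Flask client receiving an NDJSON stream from ovserve and sending an SSE stream to the browser. The re-framing step might look roughly like the sketch below. The `ndjson_to_sse` name is hypothetical, and the `done` field is assumed to follow Ollama's chunk format.

```python
import json
from typing import Iterable, Iterator


def ndjson_to_sse(lines: Iterable[bytes]) -> Iterator[str]:
    """Re-frame NDJSON lines from the model server as SSE events."""
    for raw in lines:
        text = raw.decode("utf-8").strip()
        if not text:
            continue  # skip blank lines between chunks
        chunk = json.loads(text)  # parse so malformed lines fail early
        # An SSE event is a "data:" field terminated by a blank line.
        yield f"data: {json.dumps(chunk)}\n\n"
        if chunk.get("done"):
            break  # assumed Ollama-style end-of-stream marker
```

In a Flask view, a generator like this could be wrapped in a `Response` with the `text/event-stream` MIME type.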

Hardware Requirements

Minimum

CPU: Intel 12th Gen or newer (Alder Lake+)
RAM: 16 GB (models are INT4-quantized, ~4-7 GB each)
Disk: 20 GB free for model storage
OS: Windows 10/11 or Linux
Python: 3.10+

Recommended (Intel Core Ultra)

CPU: Intel Core Ultra (Meteor Lake / Arrow Lake)
iGPU: Intel Arc Graphics (Xe-LPG) — primary inference device
- The iGPU provides 10-17 tokens/sec for 4B-7B parameter INT4 models
- First-token latency: ~0.5-2.8 s (improves with compile cache)
NPU: Intel AI Boost — not currently functional (see Roadmap)
RAM: 16-32 GB (shared with iGPU)

Observed Performance (from actual benchmarks)

Model	Device	Tokens/sec	First Token Latency
Gemma 3 4B VLM (INT4)	GPU	12-17 tok/s	0.5-2.8s
Qwen 2.5 7B (INT4)	GPU	10-12 tok/s	0.6-1.4s
Phi-3 Mini 4K (INT4)	GPU	13-15 tok/s	0.9-1.2s
Qwen 2.5 7B (INT4)	CPU	~0.4 tok/s	~55s

Note: GPU numbers are from Intel Arc iGPU on a Core Ultra system. First inference after model load is slower due to OpenVINO compilation. The compile cache (server/ov_cache/) largely avoids this recompilation on subsequent runs.
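
To put the table in practical terms, total response time can be approximated as first-token latency plus decode time for the remaining tokens. The helper below is a back-of-the-envelope sketch; `estimated_seconds` is a hypothetical name, and the model ignores prompt length and other variation.

```python
def estimated_seconds(num_tokens: int, tokens_per_sec: float,
                      first_token_s: float) -> float:
    """First-token latency plus steady-state decode time for the rest."""
    if num_tokens <= 0:
        return 0.0
    return first_token_s + (num_tokens - 1) / tokens_per_sec


# Qwen 2.5 7B (INT4) figures taken from the table above, 128-token reply.
gpu = estimated_seconds(128, tokens_per_sec=10.0, first_token_s=0.6)
cpu = estimated_seconds(128, tokens_per_sec=0.4, first_token_s=55.0)
print(f"GPU: {gpu:.1f} s, CPU: {cpu:.1f} s")
```

With these inputs the estimate is about 13 seconds on the GPU and a little over 6 minutes on the CPU, which is why the GPU is treated as the primary device.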

Setup

Prerequisites

Python 3.10+
Intel-compatible GPU drivers (for iGPU inference)
Model weight directories in server/ (not included in Git — see .gitignore)

Install Dependencies

Server (OpenVINO inference engine):

cd server
pip install -r Requirements.txt

Client (StudyBot web UI):

cd client
pip install -r requirements.txt

Download Models

If model directories are not present, you can pull them through the API after starting the server:

# Pull from HuggingFace (auto-exports to OpenVINO format)
curl -X POST http://localhost:11435/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "OpenVINO/gemma-3-4b-it-int4-ov"}'

Or manually place pre-exported OpenVINO model directories:

server/gemma-3-4b-it-int4-ov/
server/qwen2.5-7b-instruct-int4-ov/
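
The pull request can also be issued from Python using only the standard library. This is a minimal sketch: `build_pull_request` and `pull` are hypothetical helper names, and because the format of the pull response is not documented here, the sketch prints the response lines unparsed.

```python
import json
import urllib.request

OVSERVE_URL = "http://localhost:11435"


def build_pull_request(model: str,
                       base_url: str = OVSERVE_URL) -> urllib.request.Request:
    """Build the POST /api/pull request for a HuggingFace model ID."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pull(model: str) -> None:
    """Send the request and echo whatever the server returns, line by line."""
    with urllib.request.urlopen(build_pull_request(model)) as resp:
        for line in resp:
            print(line.decode("utf-8").rstrip())
```

Calling `pull("OpenVINO/gemma-3-4b-it-int4-ov")` requires the server to be running on port 11435.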

Pulled models are stored in ~/.ovserve/models/. An OVSERVE_MODELS_DIR override is planned but not implemented yet (see Roadmap): the directory is currently hardcoded, so setting the variable has no effect.

Quick Start

🚀 One Command (Windows)

cd path\to\ai-project
start.bat

This launches both services:

ovserve (FastAPI) on http://localhost:11435
StudyBot (Flask) on http://localhost:5000

Then open your browser to http://localhost:5000 and start chatting! 🎉

🔧 Manual Setup (Two Terminals)

Terminal 1 — Start the model server

cd server
python ovserve.py

Server listens on http://localhost:11435. You'll see a hardware banner listing detected CPU/GPU/NPU devices.

Terminal 2 — Start StudyBot

cd client
python app.py

Client listens on http://localhost:5000. Open this URL in your browser.

Alternative: Server-only mode

The server ships its own lightweight chat UI at http://localhost:11435/ (via chat.html). This bypasses the Flask client and talks directly to the API — useful for quick testing without Study Mode features.

API Reference

Base URL: http://localhost:11435

Model Management

Method	Endpoint	Description
`GET`	`/api/tags`	List all local models
`POST`	`/api/pull`	Download model from HuggingFace
`POST`	`/api/load`	Preload model on a specific device
`POST`	`/api/unload`	Unload model from memory
`DELETE`	`/api/delete`	Remove model files and registry entry

Generation

Method	Endpoint	Description
`POST`	`/api/generate`	Single-turn text generation (streaming/non-streaming)
`POST`	`/api/chat`	Multi-turn chat (streaming/non-streaming)
`POST`	`/v1/chat/completions`	OpenAI-compatible chat API
`POST`	`/api/abort`	Abort an in-progress generation by session ID
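
For the OpenAI-compatible endpoint, the request and response can be handled as in the sketch below. It assumes ovserve follows the standard OpenAI chat-completions shape (`choices[0].message.content`), which this README does not spell out. Both helper names are hypothetical.

```python
from typing import Any, Dict, List


def build_completion_payload(model: str, messages: List[Dict[str, str]],
                             stream: bool = False) -> Dict[str, Any]:
    """Body for POST /v1/chat/completions in the OpenAI request format."""
    return {"model": model, "messages": messages, "stream": stream}


def extract_reply(response: Dict[str, Any]) -> str:
    """Pull the assistant text out of a non-streaming response (assumed shape)."""
    return response["choices"][0]["message"]["content"]
```

An existing OpenAI client library can usually be pointed at the server by setting its base URL to http://localhost:11435/v1.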

System Info

Method	Endpoint	Description
`GET`	`/api/version`	Server version
`GET`	`/api/ps`	List currently loaded models
`GET`	`/api/devices`	Show detected hardware (CPU/GPU/NPU)
`GET`	`/api/benchmarks`	Generation performance history
`GET`	`/`	Server-side chat UI

Example Requests

Chat (streaming):

curl -X POST http://localhost:11435/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenVINO/qwen2.5-7b-instruct-int4-ov",
    "messages": [{"role":"user","content":"What is OpenVINO?"}],
    "stream": true
  }'

Force a specific device:

curl -X POST http://localhost:11435/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenVINO/qwen2.5-7b-instruct-int4-ov",
    "prompt": "Explain recursion briefly.",
    "stream": false,
    "options": { "device": "GPU", "num_predict": 128 }
  }'
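
When `stream` is true, the chat endpoint returns one JSON object per line. A client can assemble the full reply as sketched below, assuming Ollama-style chunks with `message.content` and a final `done` flag. For /api/generate, the text field would presumably be `response` instead. `collect_reply` is a hypothetical helper.

```python
import json
from typing import Iterable


def collect_reply(lines: Iterable[str]) -> str:
    """Concatenate the content of streamed chat chunks into one string."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # assumed end-of-stream marker
    return "".join(parts)
```

With the `requests` library, the lines could come from `response.iter_lines(decode_unicode=True)` on a request made with `stream=True`.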

📚 Documentation

The following guides cover setup, usage, the API, and troubleshooting in more detail:

Document	For	Contents
SETUP_GUIDE.md	🆕 New users	Step-by-step installation on Windows/macOS/Linux, venv setup, hardware detection, first-time model downloads, troubleshooting setup issues
USAGE_GUIDE.md	👤 End users	How to use StudyBot UI, Study Mode, image upload, chat history, keyboard shortcuts, tips & tricks, common questions
API_DOCUMENTATION.md	👨‍💻 Developers	Complete API reference with request/response examples, cURL tests, OpenAI compatibility, streaming formats, rate limiting (planned)
ARCHITECTURE.md	🏗️ Maintainers	System design, component breakdown, data flow, performance considerations, security analysis, future enhancements
CONTRIBUTING.md	🤝 Contributors	Code style guide, testing framework, PR process, areas needing help, commit message conventions
TROUBLESHOOTING.md	🆘 Debugging	Solutions to 50+ common issues (server won't start, GPU not detected, chat not responding, etc.) with steps and workarounds

Project Structure

ai-project/
├── README.md                          # This file
├── .gitignore
├── client/
│   ├── app.py                         # Flask server, SSE proxy, session management
│   ├── db.py                          # SQLite persistence layer
│   ├── requirements.txt               # Flask + requests
│   └── templates/
│       └── index.html                 # Full StudyBot UI (48KB)
└── server/
    ├── ovserve.py                     # Entry point — banner + uvicorn launch
    ├── server.py                      # FastAPI routes and request/response schemas
    ├── model_manager.py               # Model lifecycle, generation, caching
    ├── device_selector.py             # Device detection, benchmarking, auto-selection
    ├── chat.html                      # Lightweight server-side chat UI
    ├── Requirements.txt               # OpenVINO + FastAPI dependencies
    ├── run_chat.py                    # Example: Python API client
    ├── test_gemma.py                  # Example: VLM pipeline test
    ├── device_benchmarks.json         # (Generated) Device performance cache
    ├── ov_cache/                      # (Generated) Compiled IR cache per device
    ├── gemma-3-4b-it-int4-ov/         # (Git-ignored) Pre-bundled VLM
    └── qwen2.5-7b-instruct-int4-ov/  # (Git-ignored) Pre-bundled LLM

Device Routing

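For text models, the server benchmarks all detected devices per model and selects the fastest one that produces valid output. Vision models currently skip benchmarking and use a GPU-first heuristic (see Roadmap).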

Device	Status	Notes
CPU	✅ Always works	Fallback device. Slow (~0.4 tok/s for 7B) but universally compatible.
GPU	✅ Primary device	Intel Arc iGPU provides 10-17 tok/s. Compile cache accelerates reloads.
NPU	❌ Not working	Intel AI Boost detected but fails on all current models due to dynamic-shape compilation errors. See Roadmap.

You can force a device per request:

{ "options": { "device": "GPU" } }
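
The selection rule described in this section can be sketched as follows. The `BenchResult` structure and `pick_device` function are hypothetical and only illustrate the idea: prefer a forced device, otherwise take the fastest device that produced valid output, and fall back to CPU. The real logic is in server/device_selector.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchResult:
    device: str
    tokens_per_sec: float
    valid_output: bool


def pick_device(results: List[BenchResult],
                forced: Optional[str] = None) -> str:
    """Choose a device: forced option first, else fastest working, else CPU."""
    if forced:
        return forced
    working = [r for r in results if r.valid_output]
    if not working:
        return "CPU"  # universal fallback
    return max(working, key=lambda r: r.tokens_per_sec).device


results = [
    BenchResult("CPU", 0.4, True),
    BenchResult("GPU", 11.0, True),
    BenchResult("NPU", 0.0, False),  # compilation failed
]
print(pick_device(results))
```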

Ollama Compatibility

ovserve runs on port 11435 and implements Ollama's REST API surface. Any Ollama-compatible client can be pointed at it:

OLLAMA_HOST=http://localhost:11435 ollama list

from ollama import Client
client = Client(host='http://localhost:11435')
response = client.generate(model='OpenVINO/gemma-3-4b-it-int4-ov', prompt='Hello')

Troubleshooting

Symptom	Fix
`/health` reports backend offline	Ensure `python ovserve.py` is running on port 11435.
Model load errors on NPU	Use `"options": {"device":"GPU"}` or `"CPU"` instead. NPU is not yet functional.
Client starts but no models shown	Check `curl http://localhost:11435/api/tags` returns models.
Module import errors	Reinstall: `pip install -r server/Requirements.txt` and `pip install -r client/requirements.txt`.
Slow first inference	Expected — OpenVINO compiles the model graph on first use. Compile cache (`ov_cache/`) speeds up reloads.
Out of memory on GPU	Switch to CPU: `"options": {"device":"CPU"}`. Or use a smaller model.
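
Before working through the table, it can help to confirm that both services are actually listening. The following sketch checks the two default ports with a plain TCP connect; `port_open` and `check_services` are hypothetical helpers, not part of the project.

```python
import socket
from typing import Dict


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services() -> Dict[str, bool]:
    """Probe the default ovserve and StudyBot ports on localhost."""
    return {
        "ovserve (11435)": port_open("127.0.0.1", 11435),
        "StudyBot (5000)": port_open("127.0.0.1", 5000),
    }
```

A `False` for ovserve corresponds to the first row of the table: start `python ovserve.py` and retry.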

Roadmap

Not Yet Implemented

Feature	Status	Details
NPU inference	🔴 Blocked	Intel AI Boost is detected but compilation fails on all models (`dynamic shape` errors). Requires `openvino-genai` StaticLLMPipeline or updated NPU driver/compiler.
VLM benchmarking	🟡 Missing	`device_selector.py` only benchmarks text models via `OVModelForCausalLM`. VLM models skip benchmarks and use a GPU-first heuristic.
`OVSERVE_MODELS_DIR` env var	🟡 Missing	Server README claimed this existed, but models dir is hardcoded to `~/.ovserve/models/`.
Multi-model concurrent loading	🟡 Not supported	Only one model is loaded at a time. Loading a new model unloads the previous one.
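
The single-model constraint combined with the 1-hour keep-alive can be modelled as a one-slot cache. The class below is a simplified sketch with hypothetical names and an injectable clock for testing. It only tracks which model is resident and does not load anything.

```python
import time
from typing import Callable, Optional

KEEP_ALIVE_SECONDS = 3600  # 1-hour auto-unload, as described in this README


class SingleModelSlot:
    """Tracks the one loaded model and unloads it after an idle period."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.loaded: Optional[str] = None
        self._last_used = 0.0

    def expire_if_idle(self) -> None:
        idle = self._clock() - self._last_used
        if self.loaded is not None and idle >= KEEP_ALIVE_SECONDS:
            self.loaded = None

    def use(self, model: str) -> bool:
        """Mark `model` as in use; return True if it had to be (re)loaded."""
        self.expire_if_idle()
        reloaded = self.loaded != model
        self.loaded = model  # a new model replaces the previous one
        self._last_used = self._clock()
        return reloaded
```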

Planned Improvements

Add StaticLLMPipeline support for NPU (requires openvino-genai ≥ 2025.x)
Implement VLM device benchmarking via openvino_genai.VLMPipeline
Add OVSERVE_MODELS_DIR environment variable override
Reduce Flask proxy overhead (consider direct browser→ovserve with CORS)
Batch benchmark file writes instead of per-generation I/O
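
As one possible shape for the planned OVSERVE_MODELS_DIR override, the models directory could be resolved as in this sketch. `resolve_models_dir` is a hypothetical function, not existing project code.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_MODELS_DIR = Path.home() / ".ovserve" / "models"


def resolve_models_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Use OVSERVE_MODELS_DIR when set and non-empty, else the default."""
    env = os.environ if env is None else env
    override = env.get("OVSERVE_MODELS_DIR", "").strip()
    return Path(override).expanduser() if override else DEFAULT_MODELS_DIR
```

Accepting the environment as a parameter keeps the function easy to test without modifying the process environment.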

Development Notes

API changes: Start from server/server.py for endpoint modifications.
Model lifecycle: Start from server/model_manager.py for generation, caching, or device logic.
UI changes: Start from client/templates/index.html for the StudyBot interface.
Device logic: Start from server/device_selector.py for hardware detection and routing.

Dependencies

Server

Package	Version	Purpose
`openvino`	≥2024.0.0	CPU/GPU inference runtime
`openvino-genai`	latest	Token streaming, VLM pipeline
`optimum-intel[openvino]`	latest	Model conversion and loading
`transformers`	≥4.40.0	Tokenizers and chat templates
`fastapi`	≥0.110.0	Web server
`uvicorn`	≥0.29.0	ASGI runner
`pydantic`	≥2.0.0	Request validation
`requests`	latest	HTTP client (for pull)

Client

Package	Version	Purpose
`flask`	≥3.0.0	Web framework
`requests`	≥2.31.0	HTTP client (proxy to ovserve)

Known Issues & Limitations

🔴 Critical Issues (Should Fix ASAP)

Issue	Impact	Workaround
No input validation on model names	Path traversal vulnerability	Only pull from known models
CORS allows all origins	Security risk (CSRF/XSS)	Runs locally only (port 5000 is localhost)
Memory leak: `_abort_events` dict	Grows indefinitely over time	Restart server periodically
No rate limiting	DoS attack possible	Runs locally only
No database indexes	Slow with 1000+ messages	Keep chat history small (under ~10K messages)
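
For the model-name issue, an allow-list check along these lines would reject path traversal before a name reaches the filesystem. This is a sketch under the assumption that valid names look like `org/name` or `name`. The pattern and the `is_safe_model_name` helper are hypothetical.

```python
import re

# One optional "org/" prefix; each part starts with a letter or digit.
_MODEL_NAME = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*(/[A-Za-z0-9][A-Za-z0-9._-]*)?"
)


def is_safe_model_name(name: str) -> bool:
    """Accept only plain model IDs; reject traversal and absolute paths."""
    if ".." in name:
        return False
    return _MODEL_NAME.fullmatch(name) is not None
```

The check is deliberately strict: anything with backslashes, a leading slash, a drive letter, or more than one slash is refused.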

🟠 High Priority Issues

Image size not validated (DoS via huge base64 payloads)
Model cache never garbage-collected
No timeout on model loading (can hang server)
Single model at a time (no concurrent inference)
Hardcoded paths in start.bat (not portable)
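
The image-size issue could be addressed by computing the decoded size from the base64 string length, without decoding it. The sketch below assumes padded standard base64 with any `data:` URL prefix already stripped. The 10 MiB cap and the helper names are hypothetical.

```python
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB cap


def decoded_size(b64: str) -> int:
    """Bytes the padded base64 string will occupy once decoded."""
    data = b64.strip()
    padding = len(data) - len(data.rstrip("="))
    # Every 4 base64 characters encode 3 bytes, minus the padding bytes.
    return (len(data) // 4) * 3 - padding


def image_within_limit(b64: str, limit: int = MAX_IMAGE_BYTES) -> bool:
    """Cheap pre-check to run before decoding an uploaded image."""
    return decoded_size(b64) <= limit
```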

🟡 Medium Priority

No structured logging (only print statements)
No API versioning
Chat history title is naive (first 30 chars)
No request logging/analytics
No error tracking (Sentry integration)

🟢 Future Features

✨ User authentication (JWT)
✨ Full-text search in chat history
✨ Chat branching (continue from any point)
✨ Message editing/deletion
✨ Export to Markdown/PDF
✨ Model comparison (A/B testing)
✨ Fine-tuning support (LoRA)
✨ Webhooks for events
✨ Docker containerization

Getting Help

📖 Read the Docs

New here? → Start with SETUP_GUIDE.md
How do I...? → Check USAGE_GUIDE.md
How do I call the API? → See API_DOCUMENTATION.md
Something broke? → Try TROUBLESHOOTING.md
Want to contribute? → Read CONTRIBUTING.md

🐛 Report Issues

Found a bug? Create a GitHub Issue with:

What you did
What happened
What you expected
Your environment (OS, Python version, GPU model)
Full error message from terminal

💬 Ask Questions

GitHub Discussions — Ask anything, get community help
GitHub Issues — Report confirmed bugs

🔗 External Resources

OpenVINO Docs — How OpenVINO works
FastAPI Docs — API framework
Ollama API Spec — Ollama compatibility
Transformers Docs — Model tokenization

Contributing

We welcome contributions! See CONTRIBUTING.md for:

Code style guide
How to write tests
Pull request process
Areas that need help

Quick ways to help:

🐛 Report bugs with reproducible steps
📝 Improve documentation
✨ Implement planned features
🧪 Write tests (especially for model_manager.py)
🎨 Improve UI/UX

License

This project is open source. See LICENSE file for details.

Acknowledgments

Intel OpenVINO — Efficient inference runtime
FastAPI — Modern async web framework
Ollama — API compatibility inspiration
Transformers/HuggingFace — Tokenizers and model hosting

Status

Component	Status	Notes
Core inference	✅ Working	Tested on Intel CPU/GPU
Chat UI	✅ Working	Streaming, history, Study Mode
API endpoints	✅ Mostly working	See known issues above
NPU support	🔴 Broken	Compilation fails on dynamic shapes
Testing	🟡 Minimal	Unit tests needed for model_manager.py
Documentation	🟢 Good	Setup, usage, API, troubleshooting guides

Last updated: April 2026
Version: 0.1.0-ovserve
Python: 3.10+
OpenVINO: 2024.0.0+
