Welcome to The Final Cut, a high-performance, locally hosted AI chat application and API server. Powered by MLX on Apple Silicon, it is designed from the ground up to offer both a beautiful web UI and an OpenAI-compatible, high-speed streaming API for external clients such as LM Studio.
To start the entire platform (both the backend AI Server and the frontend Chat UI):
- Open Finder and navigate to this folder (`Project_N2K`).
- Double-click the `start_tfc.command` file.
This script will automatically:
- Start the Python FastAPI backend on Port 8000.
- Start the React/Vite development server for the UI on Port 5173.
- Open your default web browser directly to the Chat UI.
- Allow you to shut down everything safely by closing the terminal window or pressing `Ctrl+C`.
Once the server is running, the pristine chat interface is accessible at:
http://localhost:5173
- Thinking Blocks: The UI intelligently captures all internal reasoning emitted by deep-thinking models (like Qwen3.5) until the closing `</think>` tag, then isolates this monologue in an auto-collapsible dropdown to keep the chat clean while preserving insight into the AI's reasoning.
- Syntax Highlighting: Full support for Markdown rendering and code blocks with one-click copy-to-clipboard functionality.
- Telemetry Footer: Every response displays its token speed (tokens/sec), generation time, and total token count.
- Local Persistence: Your chat history, folder organization, and sidebar settings are saved automatically using your browser's `localStorage`.
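The thinking-block capture described above can be sketched in a few lines of Python. This is an illustrative helper, not the actual UI code (which lives in the React frontend); it assumes the model wraps its monologue in `<think>…</think>` and splits the response at the closing tag mentioned above:

```python
def split_reasoning(content: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split a model response into (reasoning, answer).

    Deep-thinking models terminate their internal monologue with a
    closing </think> tag; everything before the tag is reasoning and
    everything after is the user-facing answer. If no tag is present,
    the whole response is treated as the answer.
    """
    head, sep, tail = content.partition(close_tag)
    if not sep:  # no closing tag: nothing to collapse
        return "", content
    # Drop the (assumed) opening <think> tag if the model emitted one.
    return head.removeprefix("<think>").strip(), tail.strip()
```

The reasoning half is what the UI folds into the auto-collapsible dropdown; the answer half is rendered in the chat bubble.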
The backend server exposes a lightning-fast, fully OpenAI-compatible chat completions endpoint. You can drop it into any application, SDK, or UI that accepts a custom OpenAI Base URL.
- Base URL: `http://localhost:8000/v1` (note for LM Studio: add this as a custom OpenAI endpoint).
- Model Name: `"qwen3.5"` (any string works; the server automatically routes to your loaded local model).
- Context Length: Adjust as needed in LM Studio.
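As a quick sanity check of the settings above, here is a minimal, dependency-free Python sketch that calls the endpoint directly with the standard library (the official OpenAI SDK works too with the same Base URL). The prompt and key are placeholders; the local server accepts any key:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"

def chat(prompt: str) -> str:
    """POST a non-streaming chat completion to the local endpoint."""
    payload = {
        "model": "qwen3.5",  # any string; the server routes to the loaded model
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # local server: any key works
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Hello!"))
```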
The /v1/chat/completions endpoint strictly adheres to the March 2026 OpenAI streaming specification. It fully supports:
- High-speed `text/event-stream` SSE generation containing `delta` objects.
- `system_fingerprint` identifiers (`"fp_finalcut_2026"`).
- Exact `usage` statistics injected into the final chunk (with `finish_reason: "stop"`). This ensures LM Studio correctly parses and displays your backend's tokens-per-second metrics.
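For clients that consume the stream by hand, the chunk handling above can be sketched as follows. This is a minimal parser over already-decoded SSE lines (assuming the standard OpenAI chunk shape: `delta` objects per choice, `usage` on the final chunk, and a `[DONE]` terminator):

```python
import json

def parse_sse_stream(lines):
    """Accumulate delta text from OpenAI-style `data:` SSE lines.

    `lines` is any iterable of decoded lines (e.g. an HTTP response
    body). Returns (full_text, usage), where usage is the statistics
    dict injected into the final chunk, or None if absent.
    """
    parts, usage = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # stream terminator
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
        if chunk.get("usage"):  # only present on the final chunk
            usage = chunk["usage"]
    return "".join(parts), usage
```

Dividing `usage["completion_tokens"]` by the wall-clock generation time reproduces the tokens-per-second figure shown in the Telemetry Footer.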
Enjoy your absolute control over The Final Cut.