🤖 Oracle: Production-Grade AI RAG Assistant

Ask grounded questions over your documents and GitHub repositories with intelligent retrieval.

Full-stack Retrieval-Augmented Generation (RAG) platform with hybrid search, reranking, OCR ingestion, repository indexing, speech transcription, and evaluation pipelines.

🚀 Live Demo • 💻 Repository


✨ Overview

Oracle is a production-focused AI knowledge assistant designed to provide accurate, grounded answers from uploaded documents and GitHub repositories.

Instead of relying on generic LLM responses, Oracle retrieves relevant context from your own data through a hybrid retrieval and reranking pipeline before generating answers.

It combines:

  • 📄 Document ingestion
  • 🐙 GitHub repository indexing
  • 🔍 Hybrid semantic + keyword retrieval
  • 🧠 Context reranking
  • 🎤 Speech transcription
  • 📊 RAG evaluation metrics
  • ⚑ Real-time chat interface

This makes Oracle ideal for:

  • Developer documentation assistants
  • Repository Q&A systems
  • Internal enterprise knowledge assistants
  • Research document querying
  • Technical codebase exploration

🚀 Key Features

📂 Multi-Source Knowledge Ingestion

Oracle can ingest and understand content from multiple sources:

✅ PDF documents
✅ GitHub repositories
✅ OCR-extracted scanned content
✅ Audio transcription workflows


🧠 Advanced RAG Pipeline

Unlike basic chatbot wrappers, Oracle uses a robust retrieval architecture:

  • Structure-aware chunking
  • Semantic vector retrieval
  • BM25 keyword retrieval
  • Reciprocal Rank Fusion (RRF)
  • Cohere reranking
  • Grounded answer generation
  • Automated RAG evaluation

This improves factual accuracy and reduces hallucinations.
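
As a rough illustration of the BM25 keyword retrieval step listed above, the sketch below scores a handful of documents against a query. It is not Oracle's implementation: the `tokenize` and `bm25Scores` names, the regex tokenizer, and the `k1`/`b` defaults are all assumptions chosen for the example.

```typescript
// Minimal BM25 scorer (illustrative only; names and parameters are assumptions).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const totalLen = tokenized.reduce((sum, d) => sum + d.length, 0);
  const avgLen = totalLen / Math.max(tokenized.length, 1);

  // Document frequency: how many documents contain each term.
  const df = new Map<string, number>();
  tokenized.forEach((doc) => {
    new Set(doc).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  });

  const queryTerms = Array.from(new Set(tokenize(query)));
  return tokenized.map((doc) => {
    // Term frequency within this document.
    const tf = new Map<string, number>();
    doc.forEach((term) => tf.set(term, (tf.get(term) ?? 0) + 1));

    let score = 0;
    queryTerms.forEach((term) => {
      const f = tf.get(term) ?? 0;
      if (f === 0) return;
      const n = df.get(term) ?? 0;
      // Smoothed IDF variant that stays non-negative.
      const idf = Math.log(1 + (tokenized.length - n + 0.5) / (n + 0.5));
      // Length normalization: longer-than-average documents are penalized.
      const norm = 1 - b + b * (doc.length / avgLen);
      score += (idf * f * (k1 + 1)) / (f + k1 * norm);
    });
    return score;
  });
}
```

Documents that share no terms with the query score zero, so in a hybrid setup they can only surface through the vector side.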


🐙 GitHub Repository Intelligence

Index entire repositories and ask questions like:

“How does authentication work?”
“Where is the API routing implemented?”
“Which file handles embeddings?”
“Explain the database architecture.”

Oracle parses repository trees, extracts meaningful code context, and enables intelligent codebase querying.


🎤 Speech + OCR Support

Supports richer input workflows:

  • Audio transcription using AssemblyAI
  • OCR extraction from scanned PDFs
  • Voice-driven knowledge querying

📊 RAG Evaluation Layer

Oracle includes evaluation pipelines to assess response quality using:

  • Faithfulness
  • Answer relevance
  • Context precision

This makes the system engineering-focused, not just UI-focused.


🏗 Architecture

System Flow

User Query
   ↓
Frontend Chat UI
   ↓
Backend API Layer
   ↓
Document / GitHub Ingestion
   ↓
Text Parsing + Structure-Aware Chunking
   ↓
Embedding Generation (Voyage AI)
   ↓
Pinecone Vector Storage
   ↓
Hybrid Retrieval (Vector + BM25)
   ↓
Reciprocal Rank Fusion
   ↓
Cohere Reranking
   ↓
Groq Answer Generation
   ↓
RAG Evaluation
   ↓
Grounded Response to User

🔬 RAG Pipeline Deep Dive

1. Content Ingestion

Documents and repositories are parsed into clean structured text.

Supported ingestion:

  • PDFs
  • GitHub repositories
  • OCR-extracted content
  • speech transcripts

2. Intelligent Chunking

Instead of naive fixed-size chunking, Oracle uses structure-aware segmentation:

  • heading boundaries
  • function blocks
  • class boundaries
  • file segmentation
  • table-aware chunking

This preserves semantic meaning during retrieval.
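
To make the idea concrete, here is a minimal sketch of one of the listed strategies, splitting at heading boundaries. It assumes markdown-style `#` headings, and the `Chunk` type and `chunkByHeadings` function are hypothetical names for this example; Oracle's real chunker (including its handling of functions, classes, and tables) is not shown in this README.

```typescript
// Hypothetical chunk shape for illustration.
interface Chunk {
  heading: string; // nearest heading above the chunk, or "" for preamble text
  text: string;
}

// Split text at markdown-style heading lines so each chunk keeps one section together.
function chunkByHeadings(source: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];

  // Emit the buffered section (if it has any content) and reset the buffer.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) chunks.push({ heading, text });
    buffer = [];
  };

  for (const line of source.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before starting a new one
      heading = match[1].trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Keeping the heading alongside each chunk means a retrieved passage still carries the section it came from, which a fixed-size splitter would lose.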


3. Embedding Generation

Chunks are converted into dense semantic vectors using:

Voyage AI embeddings

Each chunk is enriched with source metadata for grounded retrieval.
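
A possible shape for that enrichment is sketched below. The field names (`sourceType`, `sourceId`, `location`), the `id`/`values`/`metadata` record layout, and the `toVectorRecords` helper are assumptions made for illustration, and the embeddings are taken as already computed rather than fetched from any provider.

```typescript
// Hypothetical shapes for illustration; Oracle's actual schema is not documented here.
type SourceType = "pdf" | "github" | "ocr" | "transcript";

interface ChunkInput {
  text: string;
  sourceType: SourceType;
  sourceId: string; // e.g. a file name or an "owner/repo" string
  location: string; // e.g. "page 3" or "src/auth.ts"
}

interface VectorRecord {
  id: string;
  values: number[];
  metadata: { text: string; sourceType: SourceType; sourceId: string; location: string };
}

// Pair each chunk with its precomputed embedding and attach source metadata.
function toVectorRecords(chunks: ChunkInput[], embeddings: number[][]): VectorRecord[] {
  if (chunks.length !== embeddings.length) {
    throw new Error("Expected exactly one embedding per chunk");
  }
  return chunks.map((chunk, index) => ({
    id: `${chunk.sourceId}#${index}`,
    values: embeddings[index],
    metadata: {
      text: chunk.text,
      sourceType: chunk.sourceType,
      sourceId: chunk.sourceId,
      location: chunk.location,
    },
  }));
}
```

Storing the chunk text and location in metadata lets the answer cite where a passage came from without a second lookup.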


4. Hybrid Retrieval

Oracle combines:

Semantic Search

Dense vector similarity from Pinecone.

Keyword Search

BM25 lexical retrieval.

Then merges both via:

Reciprocal Rank Fusion (RRF)

Because a chunk can surface if either retriever ranks it highly, fusing the two lists improves recall over using a single retriever.
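
The fusion step can be sketched as follows. Each document receives `1 / (k + rank)` from every ranked list it appears in, and the contributions are summed. The function name and the `k = 60` default (a commonly used constant) are assumptions; the README does not state which value Oracle uses.

```typescript
// Reciprocal Rank Fusion over any number of ranked ID lists (illustrative sketch).
// Each list is ordered best-first; k dampens the influence of top ranks.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((a, b) => b.score - a.score);
}

// Usage: fuse a vector ranking with a BM25 ranking.
const vectorRanking = ["a", "b", "c"];
const keywordRanking = ["b", "d", "a"];
const fusedOrder = reciprocalRankFusion([vectorRanking, keywordRanking]).map((r) => r.id);
// fusedOrder is ["b", "a", "d", "c"]
```

RRF uses only ranks, not raw scores, so it avoids having to normalize cosine similarities against BM25 scores.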


5. Reranking

Retrieved candidates are reranked using:

Cohere Rerank API

This ensures only the most relevant context reaches the LLM.


6. Grounded Generation

Final context is passed to:

Groq LLM inference

with strict grounding instructions to reduce hallucination.
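
One way such grounding instructions might look is sketched below. The prompt wording, the `ContextChunk` and `ChatMessage` types, and the `buildGroundedMessages` helper are hypothetical; the sketch only builds chat-style messages and does not call any provider.

```typescript
// Hypothetical types for illustration.
interface ContextChunk {
  text: string;
  source: string; // e.g. a file path or document name
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Build a system + user message pair that restricts the model to the supplied context.
function buildGroundedMessages(question: string, chunks: ContextChunk[]): ChatMessage[] {
  // Number each passage so the model can cite it.
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source})\n${chunk.text}`)
    .join("\n\n");

  const system = [
    "Answer using only the numbered context passages provided by the user.",
    "Cite passages by number, for example [1].",
    "If the context does not contain the answer, say that you do not know.",
  ].join(" ");

  return [
    { role: "system", content: system },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}
```

Asking for numbered citations also gives the evaluation stage something concrete to check against the retrieved passages.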


7. Evaluation

Generated responses are scored on:

  • faithfulness
  • relevance
  • precision

to monitor answer quality.
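
As an example of how one of these metrics could be computed, the sketch below implements a rank-weighted context precision: the mean of precision@k taken at each position that holds a relevant chunk. This is one common formulation, not necessarily the one Oracle uses, and it assumes per-chunk relevance labels are already available (for instance from a human or an LLM judge). The `contextPrecision` name is hypothetical.

```typescript
// Rank-weighted context precision (illustrative; the formulation is an assumption).
// relevance[i] says whether the chunk retrieved at rank i + 1 was judged relevant.
function contextPrecision(relevance: boolean[]): number {
  let relevantSoFar = 0;
  let precisionSum = 0;
  relevance.forEach((isRelevant, index) => {
    if (!isRelevant) return;
    relevantSoFar += 1;
    // precision@k at this rank, counted only where a relevant chunk appears
    precisionSum += relevantSoFar / (index + 1);
  });
  return relevantSoFar === 0 ? 0 : precisionSum / relevantSoFar;
}
```

Under this definition a retrieval that places relevant chunks first scores higher than one that returns the same chunks lower in the list.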


🛠 Tech Stack

Frontend

  • React
  • Vite
  • TypeScript

Backend

  • Node.js
  • Express
  • TypeScript

AI / Retrieval

  • Pinecone
  • Voyage AI
  • Cohere
  • Groq
  • BM25
  • Reciprocal Rank Fusion

Integrations

  • GitHub API
  • Clerk
  • Supabase
  • OCR services
  • AssemblyAI

Deployment

  • Vercel

📁 Project Structure

backend/
 ┣ src/
 ┃ ┣ rag/          → ingestion, chunking, retrieval, reranking, evaluation
 ┃ ┣ routes/       → REST API endpoints
 ┃ ┣ services/     → GitHub, OCR, auth, storage integrations
 ┃ ┗ lib/          → provider wrappers and shared utilities

frontend/
 ┣ src/
 ┃ ┣ pages/        → main application screens
 ┃ ┣ components/   → reusable UI components
 ┃ ┣ hooks/        → frontend custom hooks
 ┃ ┗ lib/          → API utilities

⚙️ Local Setup

Backend

cd backend
cp .env.example .env
npm install
npm run build
npm run dev

Frontend

cd frontend
npm install
npm run dev

🌍 Deployment

Oracle supports deployment across modern platforms.

See:

  • Vercel
  • Render
  • Railway

Detailed deployment instructions:

DEPLOYMENT.md

🎯 Why This Project Matters

Oracle demonstrates practical software engineering + applied AI concepts:

✅ Production-grade RAG architecture
✅ Retrieval engineering
✅ Hybrid search systems
✅ LLM grounding strategies
✅ AI evaluation pipelines
✅ GitHub repository intelligence
✅ Full-stack TypeScript architecture
✅ Real-world deployment workflows


👨‍💻 Author

Tushar Panwar

B.Tech @ NIT Jalandhar
Full-Stack Developer | AI Engineering Enthusiast


⭐ If you found this interesting, consider starring the repo.
