hyprex-deva/Wiki-RAG

Local Wikipedia RAG with Gemma 3 4B

A fully local Retrieval-Augmented Generation (RAG) system built using Wikipedia .zim dumps, Ollama, Gemma 3 4B, embeddings, and FAISS.

This project extracts Wikipedia data from a .zim archive, cleans and chunks the text, generates embeddings, stores them in a vector database, and allows semantic question-answering locally using Gemma.


Features

  • Local Wikipedia semantic search
  • Fully offline RAG pipeline
  • Ollama integration
  • Gemma 3 4B support
  • FAISS vector search
  • Chunk metadata filtering
  • CLI chatbot interface
  • ASCII loading animation
  • Wikipedia .zim extraction using libzim

Tech Stack

Component    Technology
LLM          Gemma 3 4B
Runtime      Ollama
Embeddings   nomic-embed-text
Vector DB    FAISS
Dataset      Wikipedia .zim
Language     Python
Extraction   libzim

Project Architecture

Wikipedia .zim
    ↓
Extract Articles
    ↓
Clean Text
    ↓
Chunk Text
    ↓
Generate Embeddings
    ↓
Store in FAISS
    ↓
User Query
    ↓
Retrieve Relevant Chunks
    ↓
Send Context to Gemma
    ↓
Generate Answer

Installation

Clone the repository

git clone https://github.com/hyprex-deva/Wiki-RAG.git
cd Wiki-RAG

Create a virtual environment

python -m venv .venv

Activate:

Windows

.venv\Scripts\activate

Linux / macOS

source .venv/bin/activate

Install dependencies

pip install libzim numpy requests tqdm faiss-cpu notebook jupyter

Optional (only needed to try sentence-transformers embeddings instead of Ollama):

pip install sentence-transformers torch

Install Ollama

Download: https://ollama.com/download

Pull models:

ollama pull gemma3:4b
ollama pull nomic-embed-text

Dataset

Download a Wikipedia .zim file from:

https://library.kiwix.org/

Example:

  • Wikipedia English
  • Wikipedia Mini
  • Custom datasets

Extraction Pipeline

The pipeline:

  1. Opens the .zim archive using libzim
  2. Extracts articles
  3. Cleans HTML/text
  4. Chunks text into overlapping segments
  5. Generates embeddings
  6. Stores embeddings in FAISS
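
As an illustration of step 3, here is a minimal sketch of HTML cleaning using only the standard library. The names clean_html and _TextExtractor are hypothetical, and the project's actual cleaning code may work differently (for example, it may also drop navigation boxes or reference lists).

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Strip tags from an article's HTML and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; the trade-off is that
    # inline markup inside a single word would also introduce a space.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned string is what would then be passed on to the chunking step.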

Chunking Strategy

chunk_size = 300
overlap = 50

Each chunk stores:

  • text
  • metadata
  • chunk length
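
The README does not say whether chunk_size and overlap are measured in words, characters, or tokens. The sketch below assumes words, and the function name chunk_text and the metadata fields (title, start_word) are hypothetical. It shows one way to produce overlapping chunks that each carry text, metadata, and length.

```python
def chunk_text(text, title, chunk_size=300, overlap=50):
    """Split text into overlapping word windows (units assumed to be words)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # each new window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "metadata": {"title": title, "start_word": start},  # assumed fields
            "length": len(window),
        })
        # Stop once a window has reached the end of the text, so no trailing
        # chunk made up purely of overlap is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, a 600-word article yields three chunks of 300, 300, and 100 words, and the last 50 words of each chunk reappear at the start of the next.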

Retrieval Strategy

The system:

  • retrieves more chunks than required
  • filters low-quality chunks
  • limits context size before generation

This improves:

  • answer quality
  • retrieval relevance
  • context efficiency
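
A rough sketch of the three retrieval steps above, applied to hits that have already come back from the vector search. The function name select_context, the thresholds, and the hit layout (a dict with "text" and "score") are assumptions for illustration. It also assumes a higher score means a closer match; with an L2-distance FAISS index the ordering would need to be reversed.

```python
def select_context(hits, k=4, min_length=20, max_context_words=800):
    """Pick up to k hits from an over-retrieved candidate list.

    hits: list of dicts with "text" and "score" keys (assumed layout).
    """
    # Best candidates first (assumes higher score = more similar).
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)

    selected = []
    used_words = 0
    for hit in ranked:
        n_words = len(hit["text"].split())
        if n_words < min_length:
            continue  # filter low-quality (very short) chunks
        if used_words + n_words > max_context_words:
            continue  # would exceed the context budget, try a smaller one
        selected.append(hit)
        used_words += n_words
        if len(selected) == k:
            break
    return selected
```

Retrieving, say, 3 to 4 times k candidates from the index before calling a function like this leaves room for the filter to discard weak chunks without starving the prompt.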

Running the Chatbot

python rag_chat.py

Example:

Ask: What is a black hole?

Thinking... ⠸

Answer:
A black hole is a region of spacetime where gravity is so strong that nothing, including light, can escape.

Problems Faced During Development

libzim API differences

Different versions of libzim exposed different APIs:

  • missing iter_entries
  • missing get_entry_by_id
  • different entry handling

Fix

Used:

zim.get_random_entry()

with:

  • deduplication
  • redirect filtering
  • namespace filtering
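
The sampling loop could look roughly like the sketch below. It only relies on get_random_entry() plus the entry attributes path and is_redirect; the function name sample_articles, the allowed_prefixes parameter, and the attempt limit are hypothetical. Whether paths carry a namespace prefix such as "A/" depends on the archive, so the prefix filter is optional here.

```python
def sample_articles(archive, target, allowed_prefixes=None, max_attempts=None):
    """Collect up to `target` unique, non-redirect entries by random sampling.

    archive: any object with get_random_entry(), for example a
             libzim.reader.Archive opened on a .zim file.
    allowed_prefixes: optional tuple of path prefixes to keep (assumption:
             article paths share a prefix in archives that use namespaces).
    """
    # Random sampling can repeat entries forever, so cap the attempts.
    limit = max_attempts if max_attempts is not None else target * 20
    seen = set()
    entries = []
    attempts = 0
    while len(entries) < target and attempts < limit:
        attempts += 1
        entry = archive.get_random_entry()
        if entry.path in seen:
            continue  # deduplication
        seen.add(entry.path)
        if entry.is_redirect:
            continue  # redirect filtering
        if allowed_prefixes is not None and not entry.path.startswith(
            tuple(allowed_prefixes)
        ):
            continue  # namespace filtering
        entries.append(entry)
    return entries
```

Because sampling is random, the result is a subset of the archive rather than a full pass over it, and the function may return fewer than target entries if the attempt limit is reached first.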

Memoryview decode errors

item.content returned a memoryview instead of bytes, and a memoryview has no .decode() method, so decoding it directly failed.

Fix

bytes(item.content).decode("utf-8", errors="ignore")
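
Wrapped in a small helper (the name decode_item_content is hypothetical), the fix might look like this. Note that errors="ignore" silently drops bytes that are not valid UTF-8, which is convenient for bulk extraction but can hide encoding problems.

```python
def decode_item_content(content) -> str:
    """Convert a bytes-like object (bytes, bytearray, memoryview) to text."""
    # bytes() copies the buffer, which gives access to .decode().
    return bytes(content).decode("utf-8", errors="ignore")
```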

Embedding context overflow

Large chunks exceeded embedding model context limits.

Fix

  • reduced chunk size
  • added hard text trimming before embedding
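
A minimal sketch of hard trimming before embedding. The function name trim_for_embedding and the 2000-character default are assumptions, not values from the project; a character limit is only a rough stand-in for the model's token limit.

```python
def trim_for_embedding(text: str, max_chars: int = 2000) -> str:
    """Cut text to at most max_chars, preferring to end on a word boundary."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    # Fall back to a hard cut when there is no space to break on.
    return cut[:last_space] if last_space > 0 else cut
```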

Ollama API response issues

Sometimes Ollama returned:

{"error": "..."}

instead of:

{"response": "..."}

Fix

Added validation and debugging for API responses.
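
One possible shape for that validation, shown as a sketch rather than the project's actual code. It uses the standard library instead of requests to stay self-contained; the function names extract_response and generate are hypothetical, and the URL assumes Ollama's default local port.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def extract_response(payload):
    """Return the generated text, or raise with a readable message."""
    if not isinstance(payload, dict):
        raise ValueError(f"Unexpected Ollama payload: {payload!r}")
    if "error" in payload:
        raise RuntimeError(f"Ollama error: {payload['error']}")
    answer = payload.get("response")
    if not isinstance(answer, str):
        raise ValueError(f"No 'response' field in payload: {payload!r}")
    return answer


def generate(prompt, model="gemma3:4b", timeout=120):
    """Send a non-streaming generate request and validate the reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            raw = resp.read()
    except urllib.error.HTTPError as exc:
        # Error statuses still carry a body, which may hold {"error": "..."}.
        raw = exc.read()
    return extract_response(json.loads(raw.decode("utf-8")))
```

Raising on the error payload keeps the chatbot from printing an empty answer when, for example, the requested model has not been pulled.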


Notebook input issues

Interactive loops inside Jupyter notebooks behaved inconsistently.

Fix

Moved the chatbot loop into a standalone .py script.


Current Limitations

  • FAISS can be difficult to install on some Windows setups
  • Generating embeddings through Ollama becomes slow for very large corpora
  • The current pipeline still loads large chunk lists into memory
  • Retrieval quality could be improved further with reranking

Future Improvements

  • Switch embeddings to sentence-transformers
  • Streaming dataset processing
  • SQLite / JSONL chunk storage
  • Better FAISS indexes (IVF / HNSW)
  • Web UI
  • Citation-aware answers
  • Hybrid retrieval
  • Multi-threaded embedding generation

About

Using Wikipedia Data with RAG to give answers from an LLM
