DIME: Dimension Importance Estimation for Dense Retrieval

This repository contains a comprehensive implementation of DIME (Dimension Importance Estimation) methods for improving dense retrieval systems. The implementation includes three different approaches for determining dimension importance in query embeddings, each with both reranking and refetching capabilities.

📖 Medium Article

For a detailed explanation and analysis, read the accompanying Medium article: [Your Article Title Here] - Add your Medium article link

🌟 Overview

Dense retrieval systems encode queries and documents as fixed-size vectors, but not every embedding dimension contributes equally to relevance for a given query. DIME addresses this by estimating the importance of each dimension and zeroing out the least important ones in the query vector, which can improve retrieval performance.

🔬 Implemented Approaches

Magnitude-based DIME 🎯
- Uses absolute values |q_i| of query dimensions
- Simplest and fastest approach
- No external dependencies

PRF-based DIME 📊
- Uses Pseudo-Relevance Feedback from initial retrieval
- Computes centroids from top-k retrieved documents
- Supports weighted and unweighted averaging

LLM-based DIME 🤖
- Uses LLM-generated documents for importance estimation
- Works without initial retrieval
- Can be enhanced with actual LLM APIs

🔄 Operation Modes

Each approach supports two operation modes:

- Rerank: Re-score existing retrieval results with modified query vectors
- Refetch: Perform new retrieval with modified query vectors
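
The difference between the two modes can be sketched with plain NumPy. This is a minimal illustration, not the notebook's code: the function names `rerank` and `refetch` and their signatures are hypothetical, and scoring is assumed to be a dot product over normalized embeddings.

```python
import numpy as np

def rerank(query_vec, doc_embeddings, candidate_ids, top_k=5):
    """Re-score only an existing candidate list with the (modified) query."""
    candidate_ids = np.asarray(candidate_ids)
    scores = doc_embeddings[candidate_ids] @ query_vec
    order = np.argsort(-scores)[:top_k]  # best candidates first
    return candidate_ids[order], scores[order]

def refetch(query_vec, doc_embeddings, top_k=5):
    """Search the whole collection again with the (modified) query."""
    scores = doc_embeddings @ query_vec
    order = np.argsort(-scores)[:top_k]  # best documents first
    return order, scores[order]
```

Reranking can only reorder documents that the original query already retrieved, while refetching can surface documents that the original query missed, at the cost of a second pass over the collection.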

🚀 Quick Start

Installation

Clone the repository:

git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name

Install dependencies:

pip install -r requirements.txt

Usage

Open and run the Jupyter notebook:

jupyter notebook dime_implementation.ipynb

Or use the example below:

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

# Load data and model
df = pd.read_csv('apparel_dataset.csv')
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
product_embeddings = model.encode(
    df['text'].tolist(), 
    normalize_embeddings=True
)

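# Initialize DIME classes
# Note: the classes live in dime_implementation.ipynb; this import assumes
# they have been exported to an importable dime_implementation.py module.
from dime_implementation import MagnitudeBasedDIME, PRFBasedDIME, LLMBasedDIME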

magnitude_dime = MagnitudeBasedDIME(model, product_embeddings, df)
prf_dime = PRFBasedDIME(model, product_embeddings, df)
llm_dime = LLMBasedDIME(model, product_embeddings, df)

# Test different approaches
query = "black dress shirt"

# Magnitude-based approach
mag_results = magnitude_dime.magnitude_rerank(query, zero_out_ratio=0.2)

# PRF-based approach
prf_results = prf_dime.prf_rerank(query, prf_k=5, zero_out_ratio=0.2)

# LLM-based approach
llm_results = llm_dime.llm_rerank(query, zero_out_ratio=0.2)

📊 Dataset

The included apparel_dataset.csv contains 2,000 e-commerce apparel products with:

- Product ID: Unique identifier
- Title: Product name
- Description: Detailed product description
- Text: Combined text for embedding (title + description + brand + category)

🔧 Configuration

Key parameters you can adjust:

- zero_out_ratio: Fraction of dimensions to zero out (0.0-0.5)
- prf_k: Number of documents for PRF centroid computation (3-20)
- initial_top_k: Initial retrieval size (100-1000)
- final_top_k: Final results to return (5-20)
- weighted: Use weighted vs. unweighted centroids
- attention_type: "linear" or "softmax" attention
- temperature: Softmax temperature (0.5-2.0)
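
To show how `weighted`, `attention_type`, and `temperature` might interact when building a PRF centroid, here is a hedged sketch. The helper `centroid_weights` is hypothetical and not taken from the notebook. In particular, the meaning of "linear" attention is an assumption here: non-negative retrieval scores normalized to sum to 1.

```python
import numpy as np

def centroid_weights(scores, weighted=True, attention_type="softmax", temperature=1.0):
    """Turn top-k retrieval scores into weights that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    uniform = np.full(len(scores), 1.0 / len(scores))
    if not weighted:
        return uniform  # unweighted centroid: plain average
    if attention_type == "softmax":
        # Subtract the max for numerical stability before exponentiating.
        z = (scores - scores.max()) / temperature
        w = np.exp(z)
        return w / w.sum()
    if attention_type == "linear":
        # Assumption: clip negative scores to 0, then normalize.
        clipped = np.clip(scores, 0.0, None)
        total = clipped.sum()
        return clipped / total if total > 0 else uniform
    raise ValueError(f"unknown attention_type: {attention_type!r}")

def weighted_centroid(doc_vectors, weights):
    """Weighted average of the top-k document vectors."""
    return np.asarray(weights) @ np.asarray(doc_vectors)
```

Under this sketch, a lower temperature concentrates the softmax weights on the highest-scoring documents, and a higher temperature moves them toward a plain average.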

🔍 Algorithm Details

Magnitude-based DIME

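1. Compute importance = |query_embedding|
2. Sort dimensions by importance
3. Zero out the bottom (1-α) fraction of dimensions, where α is the fraction retained (so zero_out_ratio corresponds to 1-α)
4. Use modified query for retrieval/reranking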
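
The steps above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the notebook's implementation: the function name `magnitude_dime` is hypothetical, and the number of dropped dimensions is assumed to be rounded down.

```python
import numpy as np

def magnitude_dime(query_vec, zero_out_ratio=0.2):
    """Zero out the query dimensions with the smallest absolute value."""
    query_vec = np.asarray(query_vec, dtype=float)
    importance = np.abs(query_vec)                   # importance per dimension
    n_zero = int(len(query_vec) * zero_out_ratio)    # assumption: round down
    modified = query_vec.copy()                      # leave the input untouched
    modified[np.argsort(importance)[:n_zero]] = 0.0  # drop the least important
    return modified
```

The source does not say whether the modified query is re-normalized afterwards. For ranking by dot product it makes no difference, since rescaling the query does not change the order of the scores.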

PRF-based DIME

1. Initial retrieval with original query
2. Compute centroid from top-k results
3. Compute importance = centroid ⊙ query_embedding
4. Zero out least important dimensions
5. Use modified query for retrieval/reranking
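
A minimal sketch of these steps, assuming dot-product retrieval over a NumPy matrix of document embeddings and an unweighted centroid. The function name `prf_dime` is hypothetical and its parameters only mirror the configuration names used in this README.

```python
import numpy as np

def prf_dime(query_vec, doc_embeddings, prf_k=5, zero_out_ratio=0.2):
    """Estimate dimension importance from the top-k pseudo-relevant documents."""
    query_vec = np.asarray(query_vec, dtype=float)
    # Step 1: initial retrieval with the original query.
    scores = doc_embeddings @ query_vec
    top = np.argsort(-scores)[:prf_k]
    # Step 2: unweighted centroid of the top-k documents.
    centroid = doc_embeddings[top].mean(axis=0)
    # Step 3: importance is the element-wise product of centroid and query.
    importance = centroid * query_vec
    # Step 4: zero out the least important dimensions.
    n_zero = int(len(query_vec) * zero_out_ratio)
    modified = query_vec.copy()
    modified[np.argsort(importance)[:n_zero]] = 0.0
    return modified
```

A dimension scores high when the query and the feedback documents agree on it in both sign and size, so dimensions where they disagree in sign are dropped first.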

LLM-based DIME

1. Generate expanded document for query
2. Embed the LLM document
3. Compute importance = llm_embedding ⊙ query_embedding
4. Zero out least important dimensions
5. Use modified query for retrieval/reranking
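
The same pattern can be sketched for the LLM variant. Here `generate` and `embed` are hypothetical callables supplied by the caller (for example, an LLM client and a sentence encoder). Neither name comes from the notebook, and no particular LLM API is assumed.

```python
import numpy as np

def llm_dime(query, embed, generate, zero_out_ratio=0.2):
    """Estimate dimension importance from an LLM-generated document.

    embed:    callable mapping text to a 1-D NumPy vector (assumed)
    generate: callable mapping a query to generated document text (assumed)
    """
    query_vec = np.asarray(embed(query), dtype=float)
    llm_doc = generate(query)                         # step 1: expanded document
    llm_vec = np.asarray(embed(llm_doc), dtype=float) # step 2: embed it
    importance = llm_vec * query_vec                  # step 3: element-wise product
    n_zero = int(len(query_vec) * zero_out_ratio)
    modified = query_vec.copy()
    modified[np.argsort(importance)[:n_zero]] = 0.0   # step 4: drop the weakest
    return modified
```

Passing the generator and encoder in as arguments keeps the sketch independent of any specific provider and makes it easy to substitute a template-based generator when no LLM API is available.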

⭐ Star this repository if you find it helpful! ⭐
