This repository contains a comprehensive implementation of DIME (Dimension Importance Estimation) methods for improving dense retrieval systems. The implementation includes three different approaches for determining dimension importance in query embeddings, each with both reranking and refetching capabilities.
For a detailed explanation and analysis, read the accompanying Medium article: [Your Article Title Here] - Add your Medium article link
In dense retrieval, not all embedding dimensions contribute equally to relevance for a given query, and some mostly add noise. DIME addresses this by identifying the less important dimensions of the query vector and zeroing them out, which can improve retrieval performance.
- Magnitude-based DIME 🎯
  - Uses absolute values |q_i| of query dimensions
  - Simplest and fastest approach
  - No external dependencies
- PRF-based DIME 📊
  - Uses Pseudo-Relevance Feedback from initial retrieval
  - Computes centroids from top-k retrieved documents
  - Supports weighted and unweighted averaging
- LLM-based DIME 🤖
  - Uses LLM-generated documents for importance estimation
  - Works without initial retrieval
  - Can be enhanced with actual LLM APIs
Each approach supports two operation modes:
- Rerank: Re-score existing retrieval results with modified query vectors
- Refetch: Perform new retrieval with modified query vectors
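The difference between the two modes can be illustrated with a minimal NumPy sketch. The helper names `refetch` and `rerank` and the toy vectors are hypothetical, chosen for illustration; they are not the notebook's actual functions.

```python
import numpy as np

def refetch(query_vec, doc_embeddings, top_k):
    """Score the whole corpus with the given query vector; return top_k doc indices."""
    scores = doc_embeddings @ query_vec
    return np.argsort(-scores)[:top_k]

def rerank(query_vec, doc_embeddings, candidate_ids):
    """Re-score only the given candidates; return them reordered by the new scores."""
    scores = doc_embeddings[candidate_ids] @ query_vec
    return candidate_ids[np.argsort(-scores)]

# Toy data: 4 documents in a 3-dimensional embedding space.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
    [0.0, 0.6, 0.8],
])
original_query = np.array([0.7, 0.1, 0.7])
modified_query = np.array([0.7, 0.0, 0.7])  # one dimension zeroed out

candidates = refetch(original_query, docs, top_k=2)  # initial retrieval
reranked = rerank(modified_query, docs, candidates)  # same documents, order may change
refetched = refetch(modified_query, docs, top_k=2)   # may surface different documents
```

Reranking can only reorder the documents already retrieved, so it is cheap but bounded by the quality of the initial candidate set. Refetching searches the whole index again with the modified query.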
- Clone the repository:
  git clone https://github.com/your-username/your-repo-name.git
  cd your-repo-name
- Install dependencies:
  pip install -r requirements.txt
- Open and run the Jupyter notebook:
  jupyter notebook dime_implementation.ipynb

Or use the example below:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
# Load data and model
df = pd.read_csv('apparel_dataset.csv')
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
product_embeddings = model.encode(
    df['text'].tolist(),
    normalize_embeddings=True
)
# Initialize DIME classes
from dime_implementation import MagnitudeBasedDIME, PRFBasedDIME, LLMBasedDIME
magnitude_dime = MagnitudeBasedDIME(model, product_embeddings, df)
prf_dime = PRFBasedDIME(model, product_embeddings, df)
llm_dime = LLMBasedDIME(model, product_embeddings, df)
# Test different approaches
query = "black dress shirt"
# Magnitude-based approach
mag_results = magnitude_dime.magnitude_rerank(query, zero_out_ratio=0.2)
# PRF-based approach
prf_results = prf_dime.prf_rerank(query, prf_k=5, zero_out_ratio=0.2)
# LLM-based approach
llm_results = llm_dime.llm_rerank(query, zero_out_ratio=0.2)

The included apparel_dataset.csv contains 2000 e-commerce apparel products with:
- Product ID: Unique identifier
- Title: Product name
- Description: Detailed product description
- Text: Combined text for embedding (title + description + brand + category)
Key parameters you can adjust:
- zero_out_ratio: Fraction of dimensions to zero out (0.0-0.5)
- prf_k: Number of documents for PRF centroid computation (3-20)
- initial_top_k: Initial retrieval size (100-1000)
- final_top_k: Final results to return (5-20)
- weighted: Use weighted vs. unweighted centroids
- attention_type: "linear" or "softmax" attention
- temperature: Softmax temperature (0.5-2.0)
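The README names the `weighted`, `attention_type`, and `temperature` parameters but does not spell out how they combine. One plausible reading is sketched below: retrieval scores of the top-k documents are turned into weights, either proportionally ("linear") or through a temperature-scaled softmax, and the centroid is the weighted average. The function names and the exact weighting formulas are assumptions, not taken from the notebook.

```python
import numpy as np

def feedback_weights(scores, attention_type="linear", temperature=1.0):
    """Turn retrieval scores of the top-k documents into weights that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    if attention_type == "softmax":
        logits = scores / temperature
        weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    elif attention_type == "linear":
        weights = np.clip(scores, 0.0, None)     # assumption: negative scores get no weight
    else:
        raise ValueError(f"unknown attention_type: {attention_type!r}")
    total = weights.sum()
    if total == 0.0:
        return np.full(len(scores), 1.0 / len(scores))  # fall back to uniform weights
    return weights / total

def feedback_centroid(doc_vecs, scores, weighted=True, **kwargs):
    """Average the top-k document vectors, optionally weighted by their scores."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    if not weighted:
        return doc_vecs.mean(axis=0)
    return feedback_weights(scores, **kwargs) @ doc_vecs

# Toy example: three feedback documents in a 2-dimensional space.
top_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
top_scores = [0.9, 0.6, 0.3]
sharp = feedback_weights(top_scores, "softmax", temperature=0.5)
flat = feedback_weights(top_scores, "softmax", temperature=2.0)
centroid = feedback_centroid(top_docs, top_scores, weighted=True, attention_type="linear")
```

Under this reading, a lower temperature concentrates more weight on the highest-scoring document, while a higher temperature moves the result toward the unweighted mean.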
Magnitude-based DIME:
- Compute importance = |query_embedding|
- Sort dimensions by importance
- Zero out the bottom (1-α) fraction of dimensions, where α is the fraction kept
- Use modified query for retrieval/reranking
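The magnitude-based steps can be sketched in a few lines of NumPy. The function name `magnitude_dime` is hypothetical, and the rounding of the number of dropped dimensions (truncation via `int`) is an assumption; the notebook may handle it differently.

```python
import numpy as np

def magnitude_dime(query_vec, zero_out_ratio=0.2):
    """Zero out the query dimensions with the smallest absolute value."""
    q = np.asarray(query_vec, dtype=float)
    n_zero = int(len(q) * zero_out_ratio)  # assumption: truncate to a whole number
    importance = np.abs(q)                 # importance = |query_embedding|
    drop = np.argsort(importance)[:n_zero] # indices of the least important dimensions
    masked = q.copy()
    masked[drop] = 0.0
    return masked

q = np.array([0.5, -0.05, 0.3, 0.01, -0.8])
masked = magnitude_dime(q, zero_out_ratio=0.4)  # drops the 2 smallest-magnitude dimensions
```

The masked vector is not re-normalized here. Scaling the query does not change the ranking under dot-product scoring, so this only matters if absolute scores are compared across queries.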
PRF-based DIME:
- Initial retrieval with original query
- Compute centroid from top-k results
- Compute importance = centroid ⊙ query_embedding
- Zero out least important dimensions
- Use modified query for retrieval/reranking
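A minimal sketch of the PRF-based steps, using an unweighted centroid and dot-product scoring. The function name `prf_dime` and the toy data are hypothetical; only the parameter names `prf_k` and `zero_out_ratio` come from this README.

```python
import numpy as np

def prf_dime(query_vec, doc_embeddings, prf_k=5, zero_out_ratio=0.2):
    """Zero out query dimensions that agree least with the top-k feedback centroid."""
    q = np.asarray(query_vec, dtype=float)
    scores = doc_embeddings @ q                    # initial retrieval with the original query
    top = np.argsort(-scores)[:prf_k]              # indices of the top-k documents
    centroid = doc_embeddings[top].mean(axis=0)    # unweighted centroid of the feedback set
    importance = centroid * q                      # element-wise product (⊙)
    n_zero = int(len(q) * zero_out_ratio)          # assumption: truncate to a whole number
    drop = np.argsort(importance)[:n_zero]         # least important dimensions
    masked = q.copy()
    masked[drop] = 0.0
    return masked

# Toy corpus: 4 documents in a 4-dimensional space.
docs = np.array([
    [0.9, 0.1, 0.0, 0.4],
    [0.8, 0.0, 0.1, 0.5],
    [0.0, 0.9, 0.4, 0.0],
    [0.1, 0.8, 0.5, 0.1],
])
q = np.array([0.7, 0.2, -0.3, 0.6])
masked = prf_dime(q, docs, prf_k=2, zero_out_ratio=0.5)
```

Unlike the magnitude variant, the importance score here is signed: a dimension where the query and the centroid point in opposite directions gets a negative score and is dropped first.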
LLM-based DIME:
- Generate expanded document for query
- Embed the LLM document
- Compute importance = llm_embedding ⊙ query_embedding
- Zero out least important dimensions
- Use modified query for retrieval/reranking
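The LLM-based steps differ from the PRF variant only in where the reference vector comes from. The sketch below takes that vector as an input, so it makes no assumption about which LLM or prompt produces the document. The function name `llm_dime` and the stand-in vectors are hypothetical; in practice `llm_embedding` would be the embedding of the generated text, produced by the same encoder as the query.

```python
import numpy as np

def llm_dime(query_vec, llm_embedding, zero_out_ratio=0.2):
    """Zero out query dimensions that agree least with an LLM-document embedding."""
    q = np.asarray(query_vec, dtype=float)
    ref = np.asarray(llm_embedding, dtype=float)
    if q.shape != ref.shape:
        raise ValueError("query and LLM-document embeddings must have the same shape")
    importance = ref * q                    # element-wise product (⊙)
    n_zero = int(len(q) * zero_out_ratio)   # assumption: truncate to a whole number
    drop = np.argsort(importance)[:n_zero]  # least important dimensions
    masked = q.copy()
    masked[drop] = 0.0
    return masked

# Stand-in vectors; real ones would come from the embedding model.
q = np.array([0.6, -0.4, 0.5, 0.2])
llm_embedding = np.array([0.7, 0.3, 0.6, -0.1])
masked = llm_dime(q, llm_embedding, zero_out_ratio=0.5)
```

Because no first-pass retrieval is needed to build the reference vector, this variant does not depend on the quality of the initial ranking, at the cost of one generation call per query.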
⭐ Star this repository if you find it helpful! ⭐