An autonomous GitHub Action that uses semantic embeddings and vector search to detect duplicate GitHub issues. Instead of relying on keyword matching, it understands the meaning behind issue descriptions and flags potentially duplicate reports for human review.
I started exploring this idea during my time contributing to open source through GSSoC. While things did not unfold exactly as I had hoped, the experience exposed me to a real challenge faced by maintainers: identifying duplicate issues hidden behind different wording.
That observation eventually led to this project, which uses semantic embeddings and vector search to help surface potential duplicates automatically while keeping maintainers in control.
For example, these two reports describe the same problem:
- "Unable to log in to dashboard"
- "Authentication fails after entering credentials"
Although they mean the same thing, keyword matching may miss the connection because they share almost no words.
AI Duplicate Issue Detector solves this problem by:
- Converting issue descriptions into semantic embeddings.
- Storing embeddings in a vector database.
- Performing similarity searches against previously reported issues.
- Alerting maintainers when highly similar issues are detected.
The final decision always remains with a human maintainer.
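
To make the contrast concrete, the sketch below compares a simple keyword-overlap score with cosine similarity, the measure commonly used to compare embedding vectors. The function names and the three-dimensional toy vectors are illustrative assumptions, not code from `main.py`; real embeddings from text-embedding-3-small have 1536 dimensions by default.

```python
import math
import re


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a stand-in for keyword matching)."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# The two example reports share no words, so keyword overlap is zero.
overlap = keyword_overlap(
    "Unable to log in to dashboard",
    "Authentication fails after entering credentials",
)

# Toy vectors standing in for embeddings (hypothetical values, not model output).
toy_login = [0.9, 0.1, 0.3]
toy_auth = [0.8, 0.2, 0.35]
similarity = cosine_similarity(toy_login, toy_auth)
```

With real embeddings the score for two paraphrased reports will vary by model and wording, which is why the result is only ever a suggestion for a maintainer to review.
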
- Semantic duplicate detection using vector embeddings
- GitHub Actions integration
- Automated issue scanning on issue creation and updates
- Pinecone-powered similarity search
- Markdown and code-block preprocessing
- Dockerized execution environment
- Human-in-the-loop workflow
- Graceful failure handling for missing tokens or services
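
As an illustration of the Markdown and code-block preprocessing listed above, here is a minimal sketch of what such a cleaning step might do before text is embedded. The function name `clean_issue_text` and the exact rules (drop fenced code, keep inline code and link text, strip heading and quote markers) are assumptions; the project's real preprocessing may differ.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fence
FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
INLINE_CODE = re.compile(r"`([^`\n]+)`")
IMAGE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
HEADING_OR_QUOTE = re.compile(r"^\s{0,3}(#{1,6}|>)\s*", re.MULTILINE)
WHITESPACE = re.compile(r"\s+")


def clean_issue_text(title: str, body: Optional[str]) -> str:
    """Reduce an issue to plain prose suitable for embedding."""
    text = f"{title}\n{body or ''}"
    text = FENCED_CODE.sub(" ", text)        # stack traces and snippets add noise
    text = INLINE_CODE.sub(r"\1", text)      # keep identifiers, drop the backticks
    text = IMAGE.sub(r"\1", text)            # keep alt text only
    text = LINK.sub(r"\1", text)             # keep link text, drop the URL
    text = HEADING_OR_QUOTE.sub("", text)    # strip leading '#' and '>' markers
    text = text.replace("**", "")            # strip bold markers
    return WHITESPACE.sub(" ", text).strip()
```

Dropping fenced code entirely is a design choice: long logs can dominate an embedding, but they sometimes carry the only distinguishing detail, so a real implementation might keep the first few lines instead.
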
GitHub Issue
│
▼
Text Preprocessing
│
▼
OpenAI Embeddings
│
▼
Pinecone Vector Search
│
▼
Similarity Evaluation
│
▼
GitHub Comment Notification
| Component | Technology |
|---|---|
| Language | Python 3.11 |
| Embeddings | OpenAI text-embedding-3-small |
| Vector Database | Pinecone |
| CI/CD | GitHub Actions |
| Containerization | Docker |
| Repository Integration | GitHub REST API |
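
The sketch below shows how the embedding and vector-search components in this table are typically called from Python. It assumes the official `openai` (v1+) and `pinecone` client libraries; the helper names `embed_text` and `nearest_issues` are hypothetical, and the fake objects at the bottom exist only so the example runs offline without API keys.

```python
from types import SimpleNamespace

EMBEDDING_MODEL = "text-embedding-3-small"


def embed_text(openai_client, text: str) -> list:
    """Return the embedding vector for one piece of text."""
    response = openai_client.embeddings.create(model=EMBEDDING_MODEL, input=text)
    return response.data[0].embedding


def nearest_issues(index, vector: list, top_k: int = 3) -> list:
    """Return (id, score) pairs for the closest stored vectors."""
    response = index.query(vector=vector, top_k=top_k, include_metadata=True)
    return [(match.id, match.score) for match in response.matches]


# With real services the clients would be created roughly like this:
#   from openai import OpenAI;      openai_client = OpenAI(api_key=...)
#   from pinecone import Pinecone;  index = Pinecone(api_key=...).Index(index_name)

# Offline stand-ins that mimic the shape of the real responses (assumed values).
class FakeEmbeddings:
    def create(self, model, input):
        return SimpleNamespace(data=[SimpleNamespace(embedding=[0.1, 0.2, 0.3])])


class FakeIndex:
    def query(self, vector, top_k, include_metadata):
        return SimpleNamespace(matches=[SimpleNamespace(id="3", score=0.82, metadata={})])


fake_openai = SimpleNamespace(embeddings=FakeEmbeddings())
vector = embed_text(fake_openai, "Cannot sign in to dashboard")
matches = nearest_issues(FakeIndex(), vector)
```

Passing the clients in as arguments keeps the helpers easy to test and avoids creating network connections at import time.
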
issue-detector/
│
├── .github/
│   ├── workflows/
│   │   └── duplicate-checker.yml
│   └── ISSUE_TEMPLATE/
│       └── bug_report.yml      # File for the issue form
├── .gitignore                  # Ignored files and folders
├── action.yml                  # Defines the GitHub Action inputs/outputs
├── CONTRIBUTING.md             # Open source contribution guidelines
├── Dockerfile                  # Container configuration
├── LICENSE                     # MIT License
├── main.py                     # Core application logic
├── README.md                   # Project documentation
└── requirements.txt            # Python dependencies
Before using this project, ensure you have:
- Python 3.11+
- OpenAI API Key
- Pinecone API Key
- Pinecone Index
- GitHub Repository with Actions enabled
git clone https://github.com/Shashank-8p/issue-detector.git
cd issue-detector

python -m venv venv

Linux/macOS:

source venv/bin/activate

Windows:

.\venv\Scripts\activate

pip install -r requirements.txt

Create a .env file:

OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX_NAME=your_index_name
GITHUB_EVENT_PATH=dummy_payload.json

| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API access key |
| PINECONE_API_KEY | Pinecone access key |
| PINECONE_INDEX_NAME | Pinecone vector index name |
| GITHUB_EVENT_PATH | Local GitHub event payload |
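
A script can validate these variables up front and exit quietly when one is missing, in the spirit of the graceful failure handling mentioned earlier. The following is a minimal sketch under that assumption; the `Settings` class, `load_settings` function, and the `[SKIP]` message are hypothetical names, and it assumes the .env values have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "GITHUB_EVENT_PATH",
)


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    pinecone_api_key: str
    pinecone_index_name: str
    github_event_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Optional[Settings]:
    """Return validated settings, or None if any required variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report what is missing without raising, so a workflow run is not marked failed.
        print(f"[SKIP] Missing configuration: {', '.join(missing)}")
        return None
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        pinecone_api_key=env["PINECONE_API_KEY"],
        pinecone_index_name=env["PINECONE_INDEX_NAME"],
        github_event_path=env["GITHUB_EVENT_PATH"],
    )
```

Accepting the environment as a parameter makes the check easy to exercise with a plain dictionary in tests.
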
GitHub Actions passes the triggering event to scripts as a temporary JSON file whose path is stored in GITHUB_EVENT_PATH. To test the bot locally without touching a live repository, point that variable at a mock payload file of your own.
Before starting, make sure GITHUB_EVENT_PATH=dummy_payload.json is set in your local .env file.
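
For reference, reading the event file usually takes only a few lines. This is a sketch of how a script might do it, not the actual code in `main.py`; the function names are hypothetical, and only the `issue.number`, `issue.title`, and `issue.body` fields are assumed to be needed.

```python
import json
import os
from pathlib import Path


def parse_issue(payload: dict) -> dict:
    """Pull the fields the detector needs out of an `issues` event payload."""
    issue = payload.get("issue") or {}
    return {
        "number": issue.get("number"),
        "title": issue.get("title") or "",
        # GitHub sends null when an issue has no description.
        "body": issue.get("body") or "",
    }


def read_issue_from_event(event_path=None) -> dict:
    """Load the JSON file named by GITHUB_EVENT_PATH (or an explicit path)."""
    path = Path(event_path or os.environ["GITHUB_EVENT_PATH"])
    return parse_issue(json.loads(path.read_text(encoding="utf-8")))
```

Because the same code path runs locally and inside a workflow, a mock file is enough to exercise the whole script.
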
Create a file named dummy_payload.json in your root folder and paste this mock issue data:
{
  "issue": {
    "number": 99,
    "title": "Login page not working",
    "body": "Users cannot authenticate after entering credentials."
  }
}

Execute the script in your terminal:

python main.py

Expected Outcome: Because this is the first time the bot has seen this text, it should calculate the embeddings, query Pinecone, fail to find a match, and print [UNIQUE ISSUE] No severe duplicate detected. It will then save this issue to your test database.
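
Saving an issue to Pinecone amounts to an upsert of one record. The sketch below shows one plausible shape for that step; the record layout (issue number as the ID, title in the metadata) and the helper names are assumptions, and `RecordingIndex` is a stand-in so the example runs without a Pinecone account.

```python
def build_vector_record(number: int, title: str, vector: list) -> dict:
    """Build one Pinecone record; IDs must be strings."""
    return {
        "id": str(number),
        "values": list(vector),
        "metadata": {"title": title},
    }


def save_issue(index, number: int, title: str, vector: list) -> None:
    """Store the issue so later reports can be compared against it."""
    # Upserting an existing ID overwrites it, so an edited issue replaces its old vector.
    index.upsert(vectors=[build_vector_record(number, title, vector)])


class RecordingIndex:
    """Offline stand-in that remembers what was upserted."""

    def __init__(self):
        self.records = {}

    def upsert(self, vectors):
        for record in vectors:
            self.records[record["id"]] = record


index = RecordingIndex()
save_issue(index, 99, "Login page not working", [0.1, 0.2, 0.3])
```
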
To check that semantic matching works, open your dummy_payload.json file and change the wording to simulate a different user reporting the same bug:
{
  "issue": {
    "number": 100,
    "title": "Cannot sign in to the platform",
    "body": "The authentication system is broken when I type my password."
  }
}

Run python main.py one more time.
Expected Outcome: The bot should embed the new text, find that it is semantically close to issue #99, and print [MATCH FOUND] This issue appears to be a duplicate!
Create:
.github/workflows/duplicate-checker.yml
name: Issue Duplicate Checker

on:
  issues:
    types:
      - opened
      - edited

jobs:
  check-duplicates:
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4
      - name: Run AI Duplicate Detector
        uses: Shashank-8p/issue-detector@main
        with:
          openai_api_key: ${{ secrets.OPENAI_API_KEY }}
          pinecone_api_key: ${{ secrets.PINECONE_API_KEY }}
          pinecone_index_name: ${{ secrets.PINECONE_INDEX_NAME }}
          github_token: ${{ secrets.GITHUB_TOKEN }}

- A GitHub issue is created or edited.
- The issue content is cleaned and preprocessed.
- OpenAI generates a semantic embedding.
- Pinecone searches for the nearest existing issues.
- Similarity scores are evaluated.
- If the score exceeds the threshold, a notification comment is posted.
- Maintainers decide whether the issue should be closed as a duplicate.
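
The steps above can be sketched as a single function that wires the stages together. Everything here is illustrative: `check_issue`, its parameters, and the 0.80 default threshold are assumptions rather than the project's actual values, and the text is assumed to be preprocessed already. Each external service is passed in as a callable so the flow can be run without API keys.

```python
from typing import Callable, List, Optional, Tuple

Match = Tuple[int, float]  # (issue number, similarity score)


def check_issue(
    number: int,
    text: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float]], List[Match]],
    store: Callable[[int, List[float]], None],
    notify: Callable[[int, Match], None],
    threshold: float = 0.80,  # hypothetical default
) -> Optional[Match]:
    """Run one issue through the pipeline and return the best match, if any."""
    vector = embed(text)
    # An edited issue may already be stored, so ignore a match against itself.
    candidates = [m for m in search(vector) if m[0] != number]
    best = max(candidates, key=lambda m: m[1], default=None)
    if best is not None and best[1] >= threshold:
        # Only a comment is posted; closing the issue is left to maintainers.
        notify(number, best)
        return best
    # No close match: remember this issue for future comparisons.
    store(number, vector)
    return None
```

Whether flagged issues should also be stored is a design decision this sketch leaves open; here only issues without a close match are saved.
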
Issue #15:
"Cannot sign in to dashboard"
↓ Embedding Generated
↓ Pinecone Search
Issue #3:
"Authentication failing on main page"
Similarity Score: 0.82
↓ Threshold Passed
Bot Comment:
"This issue appears highly similar to #3.
Please verify whether it is a duplicate."
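
Posting the bot comment goes through the GitHub REST API endpoint for issue comments. The sketch below builds such a request with the standard library only; the helper names are hypothetical, and the request is constructed but not sent so the example has no side effects.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def format_comment(duplicate_of: int) -> str:
    """Compose the notification text shown in the example above."""
    return (
        f"This issue appears highly similar to #{duplicate_of}.\n"
        "Please verify whether it is a duplicate."
    )


def build_comment_request(
    repo: str, issue_number: int, body: str, token: str
) -> urllib.request.Request:
    """Prepare POST /repos/{owner}/{repo}/issues/{issue_number}/comments."""
    url = f"{API_ROOT}/repos/{repo}/issues/{issue_number}/comments"
    data = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


request = build_comment_request("octo-org/demo", 15, format_comment(3), "example-token")
# To actually post the comment: urllib.request.urlopen(request)
```

Inside a workflow, the `owner/repo` string is available in the GITHUB_REPOSITORY environment variable, and the token needs the `issues: write` permission granted in the workflow file.
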
Contributions are welcome.
Potential areas for improvement:
- Better text preprocessing
- Multi-language issue support
- Adaptive similarity thresholds
- Issue clustering
- Repository analytics dashboard
- Additional vector database support
- Support for multiple embedding models
- Automatic duplicate issue labeling
- Maintainer feedback learning loop
- Slack and Discord notifications
- Historical issue trend analysis
- Repository health metrics
MIT License
Copyright (c) Shashank Pathak
- OpenAI
- Pinecone
- GitHub Actions
- Open-source maintainers who inspired this project