feat: knowledge records and topics implementation #7
base: main
Conversation
- Add Knowledge Records endpoints and schemas for contextual AI enrichment
- Add hierarchical Topics (taxonomy) endpoints and schemas
- Add documentation for AI enrichment architecture and design decisions
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbcdf66277
openapi.yaml (outdated diff)

```yaml
parent_id:
  type: string
  format: uuid
  nullable: true
  description: Parent topic ID (null for top-level topics)
```
Model null parent_id with JSON Schema, not nullable
The spec declares openapi: 3.1.0, which uses JSON Schema 2020-12, but parent_id is marked with nullable: true. In OAS 3.1, nullable isn’t part of the schema vocabulary, so many validators and code generators ignore it and treat parent_id as a plain string. That will reject or mis-deserialize responses where top-level topics return parent_id: null (as shown in the examples). To avoid validation/client breakage, model this as type: [string, "null"] or anyOf for both TopicData.parent_id and CreateTopicInputBody.parent_id.
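A minimal sketch of the two OAS 3.1-compliant fixes the review suggests, assuming the property otherwise stays as in the diff:

```yaml
# Option 1 - JSON Schema 2020-12 type array:
parent_id:
  type: ["string", "null"]
  format: uuid
  description: Parent topic ID (null for top-level topics)
---
# Option 2 - equivalent anyOf form:
parent_id:
  anyOf:
    - type: string
      format: uuid
    - type: "null"
  description: Parent topic ID (null for top-level topics)
```

Either form lets validators accept `parent_id: null` for top-level topics; `format: uuid` only applies when the value is a string.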
Implements the full backend for knowledge records and topics:
- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer with CRUD operations (internal/repository/)
- Service layer with business logic (internal/service/)
- HTTP handlers with proper error handling (internal/api/handlers/)
- ConflictError type for 409 responses (internal/errors/)
- Route registration in main.go
- Comprehensive integration tests

Key features:
- Topics: hierarchical with auto-calculated levels, cascade delete
- Knowledge Records: bulk delete by tenant_id
- Cross-tenant validation for parent topics
- Title uniqueness within parent+tenant scope
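The key features above imply a schema roughly like the following sketch; the actual DDL in sql/002_knowledge_and_topics.sql may differ in names and details:

```sql
-- Hypothetical topics DDL: self-referencing parent with cascade delete,
-- and title uniqueness scoped to (tenant, parent).
CREATE TABLE topics (
    id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    parent_id UUID REFERENCES topics(id) ON DELETE CASCADE,
    title     TEXT NOT NULL,
    level     INT  NOT NULL,  -- auto-calculated from the parent chain in the service layer
    UNIQUE (tenant_id, parent_id, title)
);
```

One caveat worth noting: in Postgres, a plain UNIQUE constraint treats NULLs as distinct, so uniqueness among top-level topics (parent_id IS NULL) needs either `UNIQUE NULLS NOT DISTINCT` (Postgres 15+) or a partial unique index.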
This commit adds the foundation for AI-powered feedback enrichment:

## New Features
- Automatic embedding generation for knowledge records, topics, and text feedback
- pgvector integration for vector similarity search
- OpenAI text-embedding-3-small model support (1536 dimensions)

## Changes

### Database Schema (sql/003_embeddings.sql)
- Added embedding vector columns to knowledge_records, topics, and feedback_records
- Added AI enrichment fields to feedback_records: topic_id, classification_confidence, sentiment, sentiment_score, emotion
- Created HNSW indexes for fast vector similarity search
- Added indexes for sentiment/emotion filtering

### New Package: internal/embeddings/
- client.go: Embedding client interface
- openai.go: OpenAI implementation using text-embedding-3-small
- mock.go: Deterministic mock client for testing

### Configuration
- Added OPENAI_API_KEY to config (optional - AI enrichment disabled if not set)

### Services
- KnowledgeRecordsService: Auto-generates embedding on create/update
- TopicsService: Auto-generates embedding on create/update
- FeedbackRecordsService: Auto-generates embedding for text feedback

### Repositories
- Added UpdateEmbedding methods to all repositories
- Added UpdateEnrichment for feedback records with full AI fields
- Extended GetByID/List queries to include new AI enrichment fields

## Dependencies
- github.com/sashabaranov/go-openai v1.36.1
- github.com/pgvector/pgvector-go v0.3.0

## Usage
Set the OPENAI_API_KEY environment variable to enable AI enrichment. When enabled, embeddings are generated asynchronously after record creation.
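As a rough illustration of the deterministic mock client the commit mentions for mock.go — the interface name, method signature, and hashing scheme here are assumptions, not the repo's actual code:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Client is a hypothetical sketch of the embedding client interface;
// the real interface lives in internal/embeddings/client.go.
type Client interface {
	Embed(text string) ([]float32, error)
}

// MockClient derives a pseudo-embedding from a hash of the input, so tests
// get stable vectors without calling the OpenAI API.
type MockClient struct {
	Dimensions int
}

func (m MockClient) Embed(text string) ([]float32, error) {
	sum := sha256.Sum256([]byte(text))
	vec := make([]float32, m.Dimensions)
	for i := range vec {
		// Cycle through the 32-byte digest to fill every dimension.
		b := binary.BigEndian.Uint32(sum[(i*4)%28:])
		vec[i] = float32(b%1000) / 1000.0
	}
	return vec, nil
}

func main() {
	c := MockClient{Dimensions: 1536}
	v1, _ := c.Embed("login is broken")
	v2, _ := c.Embed("login is broken")
	// Same input always yields the same vector of the configured length.
	fmt.Println(len(v1), v1[0] == v2[0])
}
```

A deterministic mock keeps service-layer tests hermetic: assertions on "embedding was generated and stored" don't depend on network access or on OpenAI's nondeterministic output.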
Removes sentiment, sentiment_score, and emotion fields from feedback_records. These fields were added prematurely: they require separate LLM API calls (not just embeddings), which adds cost and complexity beyond the original requirements.

Keeping only embedding-based enrichment:
- embedding: vector for semantic search
- topic_id: classification via vector similarity
- classification_confidence: confidence score for topic match

Changes:
- sql/003_embeddings.sql: Removed sentiment/emotion columns and indexes
- sql/004_remove_sentiment_fields.sql: Migration to drop existing columns
- internal/models/feedback_records.go: Removed Sentiment, SentimentScore, Emotion
- internal/repository/feedback_records_repository.go: Updated queries
Implements automatic topic classification using vector similarity search. When feedback is created, it is now automatically classified against existing topics based on embedding similarity.

## New Features

### Topic Classification
- Feedback records are automatically matched to the most similar topic
- Uses cosine similarity with configurable threshold (default: 0.5)
- Classification happens asynchronously after embedding generation
- Results stored in topic_id and classification_confidence fields

### Filter by Topic
- Added topic_id filter to the GET /v1/feedback-records endpoint
- Allows querying all feedback classified under a specific topic

## Changes

### Models
- Added TopicMatch struct to models/topics.go (shared type)
- Added TopicID filter to ListFeedbackRecordsFilters

### Repository
- Added FindSimilarTopic method to TopicsRepository
- Uses pgvector cosine distance operator (<=>)
- Added topic_id condition to feedback list query

### Service
- Added TopicClassifier interface
- Added NewFeedbackRecordsServiceWithClassification constructor
- Updated enrichRecord to classify after embedding generation
- Logs classification results at debug level

### Main
- Reordered initialization (topics repo before feedback service)
- Wired topics repo as classifier for feedback service

## Usage
1. Create topics with embeddings (auto-generated on create)
2. Create feedback records - they auto-classify to the best matching topic
3. Query feedback by topic: GET /v1/feedback-records?topic_id=<uuid>
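The FindSimilarTopic lookup described above could be sketched with pgvector's cosine distance operator like this; table, column, and parameter names are assumptions:

```sql
-- `<=>` is pgvector's cosine distance, so 1 - distance is cosine similarity.
-- $1 = feedback embedding, $2 = tenant_id.
SELECT id, 1 - (embedding <=> $1) AS similarity
FROM topics
WHERE tenant_id = $2
  AND embedding IS NOT NULL
ORDER BY embedding <=> $1
LIMIT 1;
-- The service then keeps the match only if similarity clears the
-- configured threshold (default 0.5).
```

Ordering by the distance operator directly (rather than by the computed similarity) lets Postgres use the HNSW index for the nearest-neighbor scan.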
…upport
- Add theme_id column to feedback_records for hierarchical taxonomy
- Implement threshold-based classification (0.30 for themes, 0.40 for subtopics)
- Update FindSimilarTopic to support level filtering
- Add TopicMatch model for classification results
- Update OpenAPI spec with theme_id field and filter
- Add pgAdmin to docker-compose for database visualization
- Add CSV ingestion script for testing with sample data
- Include sample feedback data in testdata/
…ification only
- Add parent_id column back to topics table for explicit Level 1 → Level 2 hierarchy
- Update /topics/{id}/similar to /topics/{id}/children endpoint
- Level 2 topics now require parent_id when created
- Embeddings are used only for feedback → topic classification
- Update OpenAPI spec with parent_id field and children endpoint
- Add embedding-classification documentation
- Update ingestion script to create topics with parent_id
- Simplify classification to only classify to Level 2 topics
This adds a new ClassificationWorker that periodically retries classification of feedback records that have embeddings but no topic classification.
- Add configuration options for the classification retry interval and batch size
- Update the feedback records model with an UnclassifiedRecord type for records needing re-classification
- Implement repository methods to list unclassified records and update their classification
- Modify the feedback records service to support retrying classification in batches
- Add a new CLI tool for ingesting feedback from CSV files into the system
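The batch-retry step the worker performs on each tick can be sketched as a small pure function; the type and function names here are hypothetical, not the actual internal API:

```go
package main

import "fmt"

// UnclassifiedRecord mirrors the model type the commit describes;
// field names are assumptions for illustration.
type UnclassifiedRecord struct {
	ID        string
	Embedding []float32
}

// ClassifyFunc returns the matched topic ID and confidence, or ok=false
// when no topic clears the similarity threshold.
type ClassifyFunc func(embedding []float32) (topicID string, confidence float64, ok bool)

// RetryBatch re-attempts classification for one batch and reports how many
// records were assigned a topic; unmatched records wait for the next tick.
func RetryBatch(records []UnclassifiedRecord, classify ClassifyFunc) int {
	classified := 0
	for _, r := range records {
		if _, _, ok := classify(r.Embedding); ok {
			classified++
		}
	}
	return classified
}

func main() {
	records := []UnclassifiedRecord{{ID: "a"}, {ID: "b"}, {ID: "c"}}
	// Toy classifier: only non-empty embeddings can match a topic.
	n := RetryBatch(records, func(e []float32) (string, float64, bool) {
		return "topic-1", 0.9, len(e) > 0
	})
	fmt.Println(n)
}
```

The worker then simply wraps this in a ticker loop: every retry interval, fetch up to the configured batch size of unclassified records and run them through the classifier, leaving the remainder for the next tick.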
…T-4o labeling
This commit introduces a complete taxonomy generation pipeline for automatically
categorizing feedback records into hierarchical topics.
## Python Microservice (services/taxonomy-generator/)
- FastAPI service for ML-intensive clustering operations
- UMAP dimensionality reduction (1536 → 10 dimensions)
- HDBSCAN clustering for automatic cluster discovery
- GPT-4o labeling to generate human-readable topic names
- Supports Level 1 (broad categories) and Level 2 (sub-topics)
- Level 2 topics generated only for dense clusters (500+ items)
## Go API Integration
- TaxonomyClient: HTTP client to communicate with Python service
- TaxonomyHandler: REST endpoints for taxonomy operations
- POST /v1/taxonomy/{tenant_id}/generate (async)
- POST /v1/taxonomy/{tenant_id}/generate/sync (blocking)
- GET /v1/taxonomy/{tenant_id}/status
- Schedule management endpoints
## Periodic Re-clustering
- clustering_jobs table for scheduling taxonomy regeneration
- TaxonomyScheduler worker polls for due jobs
- Supports daily, weekly, monthly intervals per tenant
## Infrastructure
- Dockerfile for Go API (multi-stage build)
- Docker Compose orchestration for all services
- Environment configuration for taxonomy service URL
…nt scheduling

Resolved conflicts with origin/feat/taxonomies:
- Kept taxonomy scheduler (required for per-tenant periodic clustering)
- Removed classification retry worker (was removed in remote)
- Added TaxonomyServiceURL, TaxonomySchedulerEnabled, TaxonomyPollInterval config
- Scheduler disabled by default (TAXONOMY_SCHEDULER_ENABLED=false)
- Added ListByTopicWithDescendants repository method for direct topic_id lookup
- Modified service to use direct lookup by default instead of similarity search
- Added UseSimilarity filter option to explicitly request vector similarity search
- Direct lookup uses pre-computed topic assignments from taxonomy generation
- Includes descendant topics (Level 1 shows all Level 2 feedback)

Benefits:
- Much faster queries (simple WHERE clause vs vector similarity)
- Accurate cluster-based results matching taxonomy generation
- Falls back to similarity search with the use_similarity=true query param
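A sketch of how the descendant-inclusive lookup could work in SQL (names are assumptions): a recursive CTE gathers the topic's subtree, then feedback is filtered by that set.

```sql
-- $1 = the requested topic_id; descends through parent_id links,
-- so a Level 1 topic also returns feedback assigned to its Level 2 children.
WITH RECURSIVE subtree AS (
    SELECT id FROM topics WHERE id = $1
    UNION ALL
    SELECT t.id
    FROM topics t
    JOIN subtree s ON t.parent_id = s.id
)
SELECT f.*
FROM feedback_records f
WHERE f.topic_id IN (SELECT id FROM subtree);
```

This is the "simple WHERE clause" path: it reads pre-computed topic_id assignments instead of recomputing vector distances per query.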
…omies

Resolved conflicts:
- .env.example: Combined River job queue and Taxonomy service settings
- internal/config/config.go: Combined both configurations with all helper functions
- api: Removed binary from tracking, added to .gitignore

Fixed linting issues in incoming code:
- taxonomy_client.go: Check resp.Body.Close() error returns
- taxonomy_scheduler.go: Check UpdateAfterRun() error returns
- feedback_records_repository.go: Remove duplicate query assignment
- Changed level2_min_cluster_size from 500 to 50 to allow for more flexible clustering.
This adds support for configurable taxonomy hierarchy depth, allowing
users to generate taxonomies with 1 to 4+ levels without code changes.
## Changes
### Python Taxonomy Service
- Refactored clustering to use recursive algorithm supporting N levels
- Added `max_levels` config parameter (default: 4)
- Added per-level cluster size configurations
- Updated GPT-4o prompts with level-aware context
### Configuration Options
- `max_levels`: Maximum taxonomy depth (1-10, default: 4)
- `level_min_cluster_sizes`: Min items needed to create children per level
- `level_hdbscan_min_cluster_sizes`: HDBSCAN cluster size per level
### CSV Ingestion Script
- Added semicolon delimiter support for normalized CSV format
- Fixed column mapping for hub-combined-test-data format
## Usage
Change depth via API request (no code changes needed):
```bash
# 2 levels
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
-H "Authorization: Bearer API_KEY" \
-d '{"max_levels": 2}'
# 4 levels (default)
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
-H "Authorization: Bearer API_KEY" \
-d '{"max_levels": 4}'
```
## Sample 4-Level Hierarchy
```
Account Testing
└─ Email Errors
└─ Email Change Issues
└─ Workspace Creation Errors
```
After embedding generation completes for feedback records, automatically assign the most similar topic based on vector similarity. This provides immediate topic classification without waiting for batch clustering.

- Add TenantID to EmbeddingJobArgs for tenant-isolated topic lookup
- Add FindMostSpecificTopic() to find the highest-level topic above the threshold
- Add AssignTopic() with idempotent behavior (preserves manual overrides)
- Extend EmbeddingWorker with topic assignment after embedding success
- Wire TopicMatcher and FeedbackAssigner dependencies in main.go

Failures during topic assignment are logged but don't fail the embedding job, ensuring graceful degradation when no topics exist yet.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
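The "highest-level topic above threshold" selection can be sketched as a small pure function; the TopicMatch fields and the similarity tie-break are assumptions for illustration:

```go
package main

import "fmt"

// TopicMatch mirrors the shared model type; field names are assumptions.
type TopicMatch struct {
	TopicID    string
	Level      int
	Similarity float64
}

// MostSpecificTopic picks, among candidates above the threshold, the deepest
// (most specific) level, breaking ties by similarity. ok=false means nothing
// qualified, e.g. when no topics exist yet.
func MostSpecificTopic(matches []TopicMatch, threshold float64) (TopicMatch, bool) {
	var best TopicMatch
	found := false
	for _, m := range matches {
		if m.Similarity < threshold {
			continue
		}
		if !found || m.Level > best.Level ||
			(m.Level == best.Level && m.Similarity > best.Similarity) {
			best, found = m, true
		}
	}
	return best, found
}

func main() {
	matches := []TopicMatch{
		{TopicID: "billing", Level: 1, Similarity: 0.62},
		{TopicID: "refunds", Level: 2, Similarity: 0.45},
	}
	if m, ok := MostSpecificTopic(matches, 0.40); ok {
		// The Level 2 candidate wins despite lower similarity,
		// because both clear the threshold.
		fmt.Println(m.TopicID)
	}
}
```

Keeping this rule a pure function over candidate matches also makes the graceful-degradation path trivial: an empty candidate list simply returns ok=false, and the embedding job proceeds without an assignment.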
- Add FeedbackCount field to Topic model for API response
- Reduce HDBSCAN min_cluster_size thresholds for smaller datasets
- Lower level_min_cluster_sizes for more granular subdivision
- Change default max_levels from 4 to 3 for <10k datasets
- Suppress expected UMAP n_jobs warning when random_state is set
- Remove unused strPtr function from ingest script

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
This PR implements the Knowledge Records and Topics (taxonomy) feature for AI Enrichment in Formbricks Hub.
Changes

API Specification:
- Added Knowledge Records endpoints (/v1/knowledge-records) for contextual AI enrichment data
- Added Topics endpoints (/v1/topics) for hierarchical feedback classification
- Updated openapi.yaml with full CRUD support, schema definitions, and comprehensive examples

Go Implementation:
- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer (internal/repository/)
- Service layer (internal/service/)
- HTTP handlers (internal/api/handlers/)
- ConflictError type (internal/errors/)
- Route registration in main.go

Documentation:
- docs/enrichment.md detailing the architecture, design decisions, and roadmap

Key Features

Test Plan
- make lint (0 issues)
- go build ./...
- make tests

To run tests locally: