
Conversation


@mattinannt mattinannt commented Jan 26, 2026

Summary

This PR implements the Knowledge Records and Topics (taxonomy) feature for AI Enrichment in Formbricks Hub.

Changes

API Specification:

  • Added Knowledge Records endpoints (/v1/knowledge-records) for contextual AI enrichment data
  • Added Topics endpoints (/v1/topics) for hierarchical feedback classification
  • Updated openapi.yaml with full CRUD support, schema definitions, and comprehensive examples

Go Implementation:

  • Database schema migration (sql/002_knowledge_and_topics.sql)
  • Models with validation tags (internal/models/)
  • Repository layer with CRUD operations (internal/repository/)
  • Service layer with business logic (internal/service/)
  • HTTP handlers with proper error handling (internal/api/handlers/)
  • ConflictError type for 409 responses (internal/errors/); see the sketch after this list
  • Route registration in main.go
  • Comprehensive integration tests
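
As a rough sketch of how the ConflictError type can surface as a 409 with the RFC 7807 problem details mentioned under Key Features (the field name and handler shape are assumptions, not the actual internal/errors/ code):

```go
package errors

import "fmt"

// ConflictError is returned by the service layer when a request violates a
// uniqueness constraint (e.g. a duplicate topic title) and should surface as
// HTTP 409 Conflict.
type ConflictError struct {
	Detail string
}

func (e *ConflictError) Error() string {
	return fmt.Sprintf("conflict: %s", e.Detail)
}
```

A handler can then detect it with errors.As, set Content-Type: application/problem+json, and write a 409 body whose detail field carries Detail, per RFC 7807.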

Documentation:

  • Added docs/enrichment.md detailing the architecture, design decisions, and roadmap

Key Features

  • Topics: Hierarchical structure with auto-calculated levels, cascade delete on parent removal
  • Knowledge Records: Contextual data for AI enrichment, bulk delete by tenant_id
  • Cross-tenant validation: Parent topics must belong to the same tenant
  • Title uniqueness: Enforced within parent+tenant scope for topics (both rules are sketched after this list)
  • Enterprise-grade: Full validation, proper error handling, RFC 7807 problem details
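
A minimal sketch of how the service layer might enforce the last two rules (the repository surface here, including TitleExists, is hypothetical):

```go
package service

import (
	"context"
	"fmt"
)

// topicsRepo is the subset of the repository this sketch needs.
type topicsRepo interface {
	GetByID(ctx context.Context, id string) (*Topic, error)
	TitleExists(ctx context.Context, tenantID string, parentID *string, title string) (bool, error)
}

type Topic struct {
	ID       string
	TenantID string
}

type TopicsService struct{ repo topicsRepo }

// validateParent enforces the two rules above: a referenced parent must
// belong to the same tenant, and titles must be unique per parent+tenant.
func (s *TopicsService) validateParent(ctx context.Context, tenantID, title string, parentID *string) error {
	if parentID != nil {
		parent, err := s.repo.GetByID(ctx, *parentID)
		if err != nil {
			return err
		}
		if parent.TenantID != tenantID {
			return fmt.Errorf("parent topic %s belongs to a different tenant", *parentID)
		}
	}
	taken, err := s.repo.TitleExists(ctx, tenantID, parentID, title)
	if err != nil {
		return err
	}
	if taken {
		return fmt.Errorf("conflict: title %q already exists in this parent+tenant scope", title)
	}
	return nil
}
```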

Test Plan

  • OpenAPI spec validated with YAML parser
  • Go code passes make lint (0 issues)
  • Go code compiles with go build ./...
  • Integration tests pass with a running database (make tests)
  • Manual API testing via Swagger UI or curl

To run tests locally:

```bash
make dev-setup    # Start Postgres via Docker
make init-db      # Apply schema migrations (including new 002_knowledge_and_topics.sql)
make tests        # Run integration tests
```

- Add Knowledge Records endpoints and schemas for contextual AI enrichment
- Add hierarchical Topics (taxonomy) endpoints and schemas
- Add documentation for AI enrichment architecture and design decisions

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbcdf66277


openapi.yaml Outdated
Comment on lines 1630 to 1634
parent_id:
type: string
format: uuid
nullable: true
description: Parent topic ID (null for top-level topics)


P2: Model null parent_id with JSON Schema, not nullable

The spec declares openapi: 3.1.0, which uses JSON Schema 2020-12, but parent_id is marked with nullable: true. In OAS 3.1, nullable isn’t part of the schema vocabulary, so many validators and code generators ignore it and treat parent_id as a plain string. That will reject or mis-deserialize responses where top-level topics return parent_id: null (as shown in the examples). To avoid validation/client breakage, model this as type: [string, "null"] or anyOf for both TopicData.parent_id and CreateTopicInputBody.parent_id.


mattinannt and others added 12 commits January 26, 2026 16:16
Implements the full backend for knowledge records and topics:

- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer with CRUD operations (internal/repository/)
- Service layer with business logic (internal/service/)
- HTTP handlers with proper error handling (internal/api/handlers/)
- ConflictError type for 409 responses (internal/errors/)
- Route registration in main.go
- Comprehensive integration tests

Key features:
- Topics: hierarchical with auto-calculated levels, cascade delete
- Knowledge Records: bulk delete by tenant_id
- Cross-tenant validation for parent topics
- Title uniqueness within parent+tenant scope
This commit adds the foundation for AI-powered feedback enrichment:

## New Features
- Automatic embedding generation for knowledge records, topics, and text feedback
- pgvector integration for vector similarity search
- OpenAI text-embedding-3-small model support (1536 dimensions)

## Changes

### Database Schema (sql/003_embeddings.sql)
- Added embedding vector columns to knowledge_records, topics, and feedback_records
- Added AI enrichment fields to feedback_records: topic_id, classification_confidence,
  sentiment, sentiment_score, emotion
- Created HNSW indexes for fast vector similarity search (see the DDL sketch below)
- Added indexes for sentiment/emotion filtering
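
Reconstructed from the bullets above, the core DDL might look roughly like this (identifiers and index parameters are assumptions; the sentiment columns, which a later commit drops again, are left out):

```go
package migrations

// Sketch of sql/003_embeddings.sql; names and parameters are assumed.
const embeddings = `
CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE knowledge_records ADD COLUMN embedding vector(1536);
ALTER TABLE topics            ADD COLUMN embedding vector(1536);
ALTER TABLE feedback_records
    ADD COLUMN embedding vector(1536),
    ADD COLUMN topic_id uuid REFERENCES topics (id),
    ADD COLUMN classification_confidence real;

-- HNSW indexes for fast approximate nearest-neighbour search (cosine)
CREATE INDEX knowledge_records_embedding_idx
    ON knowledge_records USING hnsw (embedding vector_cosine_ops);
CREATE INDEX topics_embedding_idx
    ON topics USING hnsw (embedding vector_cosine_ops);
CREATE INDEX feedback_records_embedding_idx
    ON feedback_records USING hnsw (embedding vector_cosine_ops);
`
```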

### New Package: internal/embeddings/
- client.go: Embedding client interface (sketched below)
- openai.go: OpenAI implementation using text-embedding-3-small
- mock.go: Deterministic mock client for testing
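
One plausible shape for this package, using the go-openai dependency pinned below; the interface and constructor names are illustrative, while the model and its 1536-dimension output come from the commit itself:

```go
package embeddings

import (
	"context"

	openai "github.com/sashabaranov/go-openai"
)

// Client turns text into an embedding vector; openai.go and mock.go both
// implement it.
type Client interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// OpenAIClient implements Client with text-embedding-3-small (1536 dims).
type OpenAIClient struct {
	api *openai.Client
}

func NewOpenAIClient(apiKey string) *OpenAIClient {
	return &OpenAIClient{api: openai.NewClient(apiKey)}
}

func (c *OpenAIClient) Embed(ctx context.Context, text string) ([]float32, error) {
	resp, err := c.api.CreateEmbeddings(ctx, openai.EmbeddingRequest{
		Input: []string{text},
		Model: openai.SmallEmbedding3, // text-embedding-3-small
	})
	if err != nil {
		return nil, err
	}
	return resp.Data[0].Embedding, nil
}
```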

### Configuration
- Added OPENAI_API_KEY to config (optional - AI enrichment disabled if not set)

### Services
- KnowledgeRecordsService: Auto-generates embedding on create/update
- TopicsService: Auto-generates embedding on create/update
- FeedbackRecordsService: Auto-generates embedding for text feedback

### Repositories
- Added UpdateEmbedding methods to all repositories
- Added UpdateEnrichment for feedback records with full AI fields
- Extended GetByID/List queries to include new AI enrichment fields

## Dependencies
- github.com/sashabaranov/go-openai v1.36.1
- github.com/pgvector/pgvector-go v0.3.0

## Usage
Set OPENAI_API_KEY environment variable to enable AI enrichment.
When enabled, embeddings are generated asynchronously after record creation.
Removes sentiment, sentiment_score, and emotion fields from feedback_records.

These fields were added prematurely - they require separate LLM API calls
(not just embeddings), which add cost and complexity beyond the original
requirements.

Keeping only embedding-based enrichment:
- embedding: vector for semantic search
- topic_id: classification via vector similarity
- classification_confidence: confidence score for topic match

Changes:
- sql/003_embeddings.sql: Removed sentiment/emotion columns and indexes
- sql/004_remove_sentiment_fields.sql: Migration to drop existing columns
- internal/models/feedback_records.go: Removed Sentiment, SentimentScore, Emotion
- internal/repository/feedback_records_repository.go: Updated queries
Implements automatic topic classification using vector similarity search.
When feedback is created, it's now automatically classified against existing
topics based on embedding similarity.

## New Features

### Topic Classification
- Feedback records are automatically matched to the most similar topic
- Uses cosine similarity with configurable threshold (default: 0.5)
- Classification happens asynchronously after embedding generation
- Results stored in topic_id and classification_confidence fields

### Filter by Topic
- Added topic_id filter to GET /v1/feedback-records endpoint
- Allows querying all feedback classified under a specific topic

## Changes

### Models
- Added TopicMatch struct to models/topics.go (shared type)
- Added TopicID filter to ListFeedbackRecordsFilters

### Repository
- Added FindSimilarTopic method to TopicsRepository (sketched below)
- Uses pgvector cosine distance operator (<=>)
- Added topic_id condition to feedback list query
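
A sketch of what FindSimilarTopic could look like with pgx and pgvector-go (the surrounding types are assumptions; the `<=>` operator and the similarity = 1 - cosine distance relationship are standard pgvector):

```go
package repository

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
	pgvector "github.com/pgvector/pgvector-go"
)

// TopicMatch mirrors the shared models.TopicMatch type described above.
type TopicMatch struct {
	TopicID    string
	Similarity float64
}

type TopicsRepository struct {
	pool *pgxpool.Pool // assumes pgvector codecs are registered on the pool
}

// FindSimilarTopic returns the tenant's closest topic by cosine distance.
func (r *TopicsRepository) FindSimilarTopic(ctx context.Context, tenantID string, embedding []float32) (*TopicMatch, error) {
	const q = `
		SELECT id, 1 - (embedding <=> $2) AS similarity
		FROM topics
		WHERE tenant_id = $1 AND embedding IS NOT NULL
		ORDER BY embedding <=> $2
		LIMIT 1`

	var m TopicMatch
	err := r.pool.QueryRow(ctx, q, tenantID, pgvector.NewVector(embedding)).
		Scan(&m.TopicID, &m.Similarity)
	if err != nil {
		return nil, err // pgx.ErrNoRows when the tenant has no topics yet
	}
	return &m, nil
}
```

The service then compares Similarity against the 0.5 default threshold before writing topic_id and classification_confidence.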

### Service
- Added TopicClassifier interface
- Added NewFeedbackRecordsServiceWithClassification constructor
- Updated enrichRecord to classify after embedding generation
- Logs classification results at debug level

### Main
- Reordered initialization (topics repo before feedback service)
- Wired topics repo as classifier for feedback service

## Usage

1. Create topics with embeddings (auto-generated on create)
2. Create feedback records - they auto-classify to best matching topic
3. Query feedback by topic: GET /v1/feedback-records?topic_id=<uuid>
…upport

- Add theme_id column to feedback_records for hierarchical taxonomy
- Implement threshold-based classification (0.30 for themes, 0.40 for subtopics)
- Update FindSimilarTopic to support level filtering
- Add TopicMatch model for classification results
- Update OpenAPI spec with theme_id field and filter
- Add pgAdmin to docker-compose for database visualization
- Add CSV ingestion script for testing with sample data
- Include sample feedback data in testdata/
…ification only

- Add parent_id column back to topics table for explicit Level 1 → Level 2 hierarchy
- Update /topics/{id}/similar to /topics/{id}/children endpoint
- Level 2 topics now require parent_id when created
- Embeddings are used only for feedback → topic classification
- Update OpenAPI spec with parent_id field and children endpoint
- Add embedding-classification documentation
- Update ingestion script to create topics with parent_id
- Simplify classification to only classify to Level 2 topics
This adds a new ClassificationWorker that periodically retries classification of feedback records that have embeddings but no topic classification yet (a sketch of the worker loop follows this list).

- Add configuration options for the classification retry interval and batch size.
- Add an UnclassifiedRecord type to the feedback records model for records needing re-classification.
- Implement repository methods to list unclassified records and update their classification.
- Update the feedback records service to retry classification in batches.
- Add a new CLI tool for ingesting feedback from CSV files into the system.
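
A minimal sketch of such a worker loop (interface and field names are illustrative; only the interval/batch-size configuration and the batched retry come from the commit):

```go
package worker

import (
	"context"
	"log/slog"
	"time"
)

// Reclassifier retries topic classification for a batch of feedback records
// that have embeddings but no topic yet; it returns the number processed.
type Reclassifier interface {
	RetryClassification(ctx context.Context, batchSize int) (int, error)
}

// ClassificationWorker periodically retries classification in batches.
type ClassificationWorker struct {
	svc       Reclassifier
	interval  time.Duration
	batchSize int
}

func (w *ClassificationWorker) Run(ctx context.Context) {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			n, err := w.svc.RetryClassification(ctx, w.batchSize)
			if err != nil {
				slog.Error("classification retry failed", "error", err)
				continue
			}
			if n > 0 {
				slog.Debug("re-classified feedback records", "count", n)
			}
		}
	}
}
```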
mattinannt and others added 14 commits January 28, 2026 12:09
…T-4o labeling

This commit introduces a complete taxonomy generation pipeline for automatically
categorizing feedback records into hierarchical topics.

## Python Microservice (services/taxonomy-generator/)
- FastAPI service for ML-intensive clustering operations
- UMAP dimensionality reduction (1536 → 10 dimensions)
- HDBSCAN clustering for automatic cluster discovery
- GPT-4o labeling to generate human-readable topic names
- Supports Level 1 (broad categories) and Level 2 (sub-topics)
- Level 2 topics generated only for dense clusters (500+ items)

## Go API Integration
- TaxonomyClient: HTTP client to communicate with Python service (sketched below)
- TaxonomyHandler: REST endpoints for taxonomy operations
  - POST /v1/taxonomy/{tenant_id}/generate (async)
  - POST /v1/taxonomy/{tenant_id}/generate/sync (blocking)
  - GET /v1/taxonomy/{tenant_id}/status
  - Schedule management endpoints
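
A rough sketch of the TaxonomyClient referenced above, assuming the Python service is called with a JSON body mirroring the endpoint shape (the URL layout, request fields, and error handling are assumptions):

```go
package taxonomy

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// Client calls the Python taxonomy-generator service.
type Client struct {
	baseURL string
	http    *http.Client
}

type GenerateRequest struct {
	MaxLevels int `json:"max_levels,omitempty"`
}

// Generate kicks off taxonomy generation for one tenant.
func (c *Client) Generate(ctx context.Context, tenantID string, req GenerateRequest) (err error) {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/v1/taxonomy/%s/generate", c.baseURL, tenantID)
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := c.http.Do(httpReq)
	if err != nil {
		return err
	}
	// A later commit's lint fix checks this Close error too.
	defer func() {
		if cerr := resp.Body.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("taxonomy generate: unexpected status %d", resp.StatusCode)
	}
	return nil
}
```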

## Periodic Re-clustering
- clustering_jobs table for scheduling taxonomy regeneration
- TaxonomyScheduler worker polls for due jobs
- Supports daily, weekly, monthly intervals per tenant

## Infrastructure
- Dockerfile for Go API (multi-stage build)
- Docker Compose orchestration for all services
- Environment configuration for taxonomy service URL
…nt scheduling

- Resolved conflicts with origin/feat/taxonomies
- Kept taxonomy scheduler (required for per-tenant periodic clustering)
- Removed classification retry worker (was removed in remote)
- Added TaxonomyServiceURL, TaxonomySchedulerEnabled, TaxonomyPollInterval config
- Scheduler disabled by default (TAXONOMY_SCHEDULER_ENABLED=false)
- Added ListByTopicWithDescendants repository method for direct topic_id lookup (query sketched below)
- Modified service to use direct lookup by default instead of similarity search
- Added UseSimilarity filter option to explicitly request vector similarity search
- Direct lookup uses pre-computed topic assignments from taxonomy generation
- Includes descendant topics (Level 1 shows all Level 2 feedback)

Benefits:
- Much faster queries (simple WHERE clause vs vector similarity)
- Accurate cluster-based results matching taxonomy generation
- Falls back to similarity search with use_similarity=true query param
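
One plausible query shape for the descendant-aware lookup is a recursive CTE (table and column names follow earlier migrations in this PR; the exact query is an assumption):

```go
package repository

// listByTopicWithDescendants resolves the requested topic plus all of its
// descendants, then filters feedback by that set with a plain WHERE clause
// rather than a vector similarity scan. Sketch; names assumed.
const listByTopicWithDescendants = `
WITH RECURSIVE topic_tree AS (
    SELECT id FROM topics WHERE id = $1
    UNION ALL
    SELECT t.id
    FROM topics t
    JOIN topic_tree tt ON t.parent_id = tt.id
)
SELECT f.*
FROM feedback_records f
WHERE f.topic_id IN (SELECT id FROM topic_tree)
ORDER BY f.created_at DESC`
```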
…omies

Resolved conflicts:
- .env.example: Combined River job queue and Taxonomy service settings
- internal/config/config.go: Combined both configurations with all helper functions
- api: Removed binary from tracking, added to .gitignore

Fixed linting issues in incoming code:
- taxonomy_client.go: Check resp.Body.Close() error returns
- taxonomy_scheduler.go: Check UpdateAfterRun() error returns
- feedback_records_repository.go: Remove duplicate query assignment
- Changed level2_min_cluster_size from 500 to 50 to allow for more flexible clustering.
This adds support for configurable taxonomy hierarchy depth, allowing
users to generate taxonomies with 1 to 4+ levels without code changes.

## Changes

### Python Taxonomy Service
- Refactored clustering to use recursive algorithm supporting N levels
- Added `max_levels` config parameter (default: 4)
- Added per-level cluster size configurations
- Updated GPT-4o prompts with level-aware context

### Configuration Options
- `max_levels`: Maximum taxonomy depth (1-10, default: 4)
- `level_min_cluster_sizes`: Min items needed to create children per level
- `level_hdbscan_min_cluster_sizes`: HDBSCAN cluster size per level

### CSV Ingestion Script
- Added semicolon delimiter support for normalized CSV format
- Fixed column mapping for hub-combined-test-data format

## Usage

Change depth via API request (no code changes needed):

```bash
# 2 levels
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
  -H "Authorization: Bearer API_KEY" \
  -d '{"max_levels": 2}'

# 4 levels (default)
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
  -H "Authorization: Bearer API_KEY" \
  -d '{"max_levels": 4}'
```

## Sample 4-Level Hierarchy

```
Account Testing
  └─ Email Errors
      └─ Email Change Issues
          └─ Workspace Creation Errors
```
After embedding generation completes for feedback records, automatically
assign the most similar topic based on vector similarity. This provides
immediate topic classification without waiting for batch clustering.

- Add TenantID to EmbeddingJobArgs for tenant-isolated topic lookup
- Add FindMostSpecificTopic() to find the highest-level topic above the threshold
- Add AssignTopic() with idempotent behavior (preserves manual overrides; see the sketch after this commit message)
- Extend EmbeddingWorker with topic assignment after embedding success
- Wire TopicMatcher and FeedbackAssigner dependencies in main.go

Failures during topic assignment are logged but don't fail the embedding
job, ensuring graceful degradation when no topics exist yet.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
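
The idempotence described above can come from a guarded UPDATE that never overwrites an existing assignment; a sketch (column names as used earlier in this PR, everything else assumed):

```go
package repository

// assignTopicSQL is idempotent: the topic_id IS NULL guard skips records that
// were already classified or manually overridden. Sketch; names assumed.
const assignTopicSQL = `
UPDATE feedback_records
SET    topic_id = $2,
       classification_confidence = $3
WHERE  id = $1
  AND  topic_id IS NULL`
```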
- Add FeedbackCount field to Topic model for API response
- Reduce HDBSCAN min_cluster_size thresholds for smaller datasets
- Lower level_min_cluster_sizes for more granular subdivision
- Change default max_levels from 4 to 3 for <10k datasets
- Suppress expected UMAP n_jobs warning when random_state is set
- Remove unused strPtr function from ingest script

Co-Authored-By: Claude Opus 4.5 <[email protected]>