Built-in ingestion pipeline with configurable user profiles #32

@jmjava

Summary

Add a built-in ingestion pipeline that runs on startup, configurable per-user profiles, and resilient per-document error handling with structured failure reporting.

Changes

Built-in ingestion runner

  • IngestionRunner implements ApplicationRunner and performs RAG content ingestion on startup when guide.reload-content-on-startup=true
  • IngestionResult record captures loaded/failed URLs, directories, and individual documents with elapsed time
  • IngestionFailure record pairs each failure with a human-readable reason (extracted from the exception message)
  • On completion, a structured INGESTION COMPLETE banner is printed to stdout showing what loaded, what failed and why, RAG store stats, and the MCP endpoint
  • DataManagerController.loadReferences() now returns the structured IngestionResult
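The two records described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the field names (`source`, `reason`, `loadedUrls`, `failedUrls`, `elapsed`), the `fromException` and `summary` helpers, and the banner layout are all assumptions, and directories and individual documents are left out to keep it short.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical shape: pairs a failed source with a human-readable reason. */
record IngestionFailure(String source, String reason) {

    /** Uses the exception message as the reason, falling back to the class name. */
    static IngestionFailure fromException(String source, Throwable error) {
        String message = error.getMessage();
        String reason = (message == null || message.isBlank())
                ? error.getClass().getSimpleName()
                : message;
        return new IngestionFailure(source, reason);
    }
}

/** Hypothetical shape: what loaded, what failed, and how long the run took. */
record IngestionResult(List<String> loadedUrls,
                       List<IngestionFailure> failedUrls,
                       Duration elapsed) {

    // Defensive copies so the result stays immutable after construction.
    IngestionResult {
        loadedUrls = List.copyOf(loadedUrls);
        failedUrls = List.copyOf(failedUrls);
    }

    boolean hasFailures() {
        return !failedUrls.isEmpty();
    }

    /** Simplified stand-in for the summary banner (layout is an assumption). */
    String summary() {
        StringBuilder out = new StringBuilder("INGESTION COMPLETE (")
                .append(elapsed.toMillis()).append(" ms)\n");
        out.append("Loaded URLs: ").append(loadedUrls.size()).append('\n');
        for (IngestionFailure failure : failedUrls) {
            out.append("FAILED ").append(failure.source())
               .append(": ").append(failure.reason()).append('\n');
        }
        return out.toString();
    }
}
```

Keeping the reason as a plain string means the banner and the controller response can both render it without knowing the exception type.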

Use existing rag-core storage interfaces

  • DataManager now depends on ChunkingContentElementRepository (from embabel-agent-rag-core) instead of a custom RagStore wrapper, so any backend implementing that library interface (e.g. DrivineStore for Neo4j) can be plugged in without changes
  • Uses ContentElementRepositoryInfo (from embabel-agent-rag-core) for store metrics instead of a custom RagStats record
  • Removed RagStore, DrivineRagStoreAdapter, and RagStats — these duplicated abstractions already provided by the rag and rag-neo modules
  • DataManager.ingestPage() now calls ContentRefreshPolicy.ingestUriIfNeeded() directly instead of going through a custom interface method

Resilient error handling

  • URL ingestion: each URL is wrapped in its own try/catch, so one timeout or parse failure doesn't block the rest
  • Directory ingestion: each document within a directory is wrapped in its own try/catch, so one bad file doesn't cause the remaining documents in that directory to be skipped
  • All failures are collected with source identity and reason, then displayed in the summary banner
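The per-item isolation described above can be sketched as a loop that catches inside the iteration rather than around it. The class, record, and method names here (`ResilientIngestSketch`, `Failure`, `Outcome`, `ingestAll`) are hypothetical and do not come from the PR; the `ingestOne` callback stands in for whatever actually fetches and stores a URL or document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ResilientIngestSketch {

    /** Hypothetical failure entry: which source failed and why. */
    record Failure(String source, String reason) {}

    /** Hypothetical outcome: sources that loaded and sources that failed. */
    record Outcome(List<String> loaded, List<Failure> failed) {}

    /**
     * Ingests each source independently. An exception from one source is
     * recorded and the loop moves on to the next source.
     */
    static Outcome ingestAll(List<String> sources, Consumer<String> ingestOne) {
        List<String> loaded = new ArrayList<>();
        List<Failure> failed = new ArrayList<>();
        for (String source : sources) {
            try {
                ingestOne.accept(source);
                loaded.add(source);
            } catch (Exception e) {
                // Keep the source identity and a readable reason for the summary.
                failed.add(new Failure(source, String.valueOf(e.getMessage())));
            }
        }
        return new Outcome(List.copyOf(loaded), List.copyOf(failed));
    }

    private ResilientIngestSketch() {}
}
```

The same loop shape applies at both levels: once over URLs, and once over the documents inside each directory.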

Configurable user profiles

  • GUIDE_PROFILE environment variable (default: user) controls which application-{profile}.yml is loaded
  • user-config/ directory holds personal profile overrides (gitignored); application-user.yml.example provided as a template
  • application-user.yml checked into src/main/resources/ as a sensible default
  • .env.example documents required environment variables including GUIDE_PROFILE
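As an illustration of the defaulting rule only, the mapping from GUIDE_PROFILE to a config file name can be written as below. This is a sketch under assumptions: `ProfileResolverSketch` and its methods are hypothetical, and the PR may well rely on Spring's own property placeholder and profile mechanisms rather than any code like this.

```java
import java.util.Map;

final class ProfileResolverSketch {

    /** Default profile named in the description above. */
    static final String DEFAULT_PROFILE = "user";

    /** Returns GUIDE_PROFILE from the given environment, or "user" when unset or blank. */
    static String resolveProfile(Map<String, String> env) {
        String value = env.get("GUIDE_PROFILE");
        return (value == null || value.isBlank()) ? DEFAULT_PROFILE : value.trim();
    }

    /** Builds the profile-specific file name, following application-{profile}.yml. */
    static String configFileName(String profile) {
        return "application-" + profile + ".yml";
    }

    private ProfileResolverSketch() {}
}
```

Passing `System.getenv()` as the map would resolve the profile for the current process; taking the map as a parameter keeps the rule easy to test.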

Helper script: scripts/fresh-ingest.sh

  • Convenience script to wipe Neo4j RAG data and re-ingest from scratch
  • Starts Neo4j via Docker Compose, clears all ContentElement nodes, then launches the app with reload-content-on-startup=true
  • Automatically kills any existing process on the app port before starting
  • Passes --spring.config.additional-location=file:./user-config/ so Spring Boot picks up personal profile files

Documentation

  • scripts/README.md — usage instructions for the helper scripts
  • scripts/INGESTION-TESTING.md — step-by-step testing guide for the ingestion pipeline

Tests

  • Unit tests for IngestionFailure, IngestionResult, IngestionRunner, and DataManagerController
  • Fixed flaky JWT token comparison in HubApiControllerTest
