[RFC] Asset Library Pipeline - Entity Extraction to Icon Sourcing #27

@madjin

Overview

This RFC describes a pipeline that extracts entities from daily content and sources visual assets (icons/logos) for them. Seeking collaboration to improve coverage and methodology.

Related PR: #26

Current Pipeline

Daily Facts → Entity Extraction (LLM) → Inventory → Asset Matching → Coverage Report
                                            ↓
                                    CoinGecko (tokens)
                                    Manual curation (others)

Scripts

Script                                        Purpose
scripts/etl/extract-entities.py               Extract entities via LLM
scripts/posters/fetch-icons.py                Fetch token icons from CoinGecko
scripts/posters/generate-asset-checklist.py   Generate coverage report

Current Coverage

Category     Coverage
Tokens       20% (19/96)
Platforms    17% (33/189)
Tech         11% (18/157)
Projects     14% (34/244)
Plugins      30% (53/175)
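
For anyone re-checking these numbers, here is a minimal sketch of the arithmetic. It assumes each pair is (icons found, entities in the category); the function names are illustrative and not taken from generate-asset-checklist.py. Across the five categories listed it gives 157/861, roughly 18%.

```python
from typing import Dict, Tuple

# Counts copied from the coverage table: (icons found, entities in category).
COVERAGE_COUNTS: Dict[str, Tuple[int, int]] = {
    "Tokens": (19, 96),
    "Platforms": (33, 189),
    "Tech": (18, 157),
    "Projects": (34, 244),
    "Plugins": (53, 175),
}


def coverage_percent(found: int, total: int) -> int:
    """Return coverage as a whole-number percentage (0 when total is 0)."""
    if total == 0:
        return 0
    return round(100 * found / total)


def overall(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum found and total counts across all categories."""
    found = sum(f for f, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return found, total


if __name__ == "__main__":
    for category, (found, total) in COVERAGE_COUNTS.items():
        print(f"{category}: {coverage_percent(found, total)}% ({found}/{total})")
    found, total = overall(COVERAGE_COUNTS)
    print(f"Overall: {coverage_percent(found, total)}% ({found}/{total})")
```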

Strengths

  1. Automated extraction - LLM identifies entities from unstructured content
  2. Normalization - The --normalize-only flag dedupes entities without re-running extraction (saves API calls)
  3. CoinGecko integration - Reliable token icons with rate limiting
  4. Fuzzy matching - Containment matching reduces false negatives
  5. Pre-scan efficiency - Checks existing files before making API calls
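
To make items 2, 4 and 5 concrete for reviewers, here is a rough sketch of how normalization, containment matching and the pre-scan could fit together. It is an illustration under assumptions, not the code in the scripts: the helper names, the normalization rule (lowercase, alphanumerics only) and the min_len guard are all hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def normalize(name: str) -> str:
    """Lowercase and strip everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def dedupe(names: Iterable[str]) -> List[str]:
    """Keep the first spelling seen for each normalized key."""
    seen: Dict[str, str] = {}
    for name in names:
        key = normalize(name)
        if key and key not in seen:
            seen[key] = name
    return list(seen.values())


def find_icon(entity: str, icon_stems: Iterable[str], min_len: int = 3) -> Optional[str]:
    """Containment match: return the first icon stem whose normalized form
    contains, or is contained in, the normalized entity name.

    min_len is an assumed guard so very short keys do not match everything.
    """
    key = normalize(entity)
    if len(key) < min_len:
        return None
    for stem in icon_stems:
        stem_key = normalize(stem)
        if len(stem_key) < min_len:
            continue
        if key in stem_key or stem_key in key:
            return stem
    return None


def existing_icon_stems(icon_dir: Path) -> List[str]:
    """Pre-scan: list icon file stems already on disk (empty if dir is missing)."""
    if not icon_dir.is_dir():
        return []
    return sorted(p.stem for p in icon_dir.iterdir() if p.is_file())
```

With this shape, an entity such as "Uniswap V3" would match an existing uniswap icon file, and only entities where find_icon returns None would need an API call. Containment also risks false positives on short names, which is why the sketch includes a length guard.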

Weaknesses / Open Questions

  1. Low platform coverage - No reliable automated source for platform icons
  2. Manual curation - Plugins/projects need manual sourcing
  3. Entity noise - Extraction sometimes includes generic terms
  4. No OSINT automation - Finding official sources is still manual research
  5. No validation - No way to verify that an icon is authentic or up to date
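
One cheap way to attack item 3 (entity noise), separate from any prompt changes, would be a stop list applied after extraction. This is a sketch of that idea only and not something the pipeline does today; GENERIC_TERMS and both function names are hypothetical, and the term list would need tuning against the real inventory.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical examples of generic terms; not taken from the actual inventory.
GENERIC_TERMS: FrozenSet[str] = frozenset(
    {"ai", "api", "blockchain", "community", "crypto", "developer", "token"}
)


def is_noise(name: str, generic_terms: FrozenSet[str] = GENERIC_TERMS) -> bool:
    """Treat a name as noise if it is under 2 characters or on the stop list.

    Comparison is case-insensitive and collapses runs of whitespace.
    """
    key = " ".join(name.lower().split())
    return len(key) < 2 or key in generic_terms


def filter_entities(names: Iterable[str]) -> List[str]:
    """Drop noisy names, preserving the order and spelling of the rest."""
    return [n for n in names if not is_noise(n)]
```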

Ideas for Improvement

  • Better extraction prompts to reduce noise
  • GitHub API for project avatars/social images
  • Web scraping for official brand pages (og:image, favicons)
  • Community-sourced icon contributions
  • Image similarity detection to avoid duplicates
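
As a starting point for the og:image / favicon idea, here is a small sketch that pulls candidate icon URLs out of a page's HTML using only the standard library. Fetching the page, choosing among candidates and checking licensing are deliberately left out; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collect the first og:image and any <link rel="... icon ..."> hrefs."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None
        self.favicons: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        if tag == "meta" and attr.get("property", "").lower() == "og:image":
            if self.og_image is None and attr.get("content"):
                self.og_image = attr["content"]
        elif tag == "link" and "icon" in attr.get("rel", "").lower().split():
            # Matches rel="icon" and rel="shortcut icon", not "apple-touch-icon".
            if attr.get("href"):
                self.favicons.append(attr["href"])


def candidate_icon_urls(html: str, base_url: str) -> List[str]:
    """Return absolute candidate URLs: og:image first, then favicons."""
    parser = IconLinkParser()
    parser.feed(html)
    candidates = ([parser.og_image] if parser.og_image else []) + parser.favicons
    return [urljoin(base_url, c) for c in candidates]
```

Note that og:image is often a wide social card rather than a square logo, so results from this approach would probably still need a manual review pass.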

How to Contribute

  1. Improve coverage - Add CoinGecko ID mappings for missing tokens in fetch-icons.py
  2. Source research - Find reliable APIs/methods for platform/tech icons
  3. Pipeline feedback - Suggest improvements to extraction/matching logic
  4. Icon contributions - Submit PRs with properly sourced icons
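
For item 1, a symbol-to-ID mapping might look roughly like the sketch below. The dict name and layout are assumptions for illustration; check fetch-icons.py for the actual structure before adding entries. The URL pattern and the "image" field follow CoinGecko's public /coins/{id} endpoint, which returns thumb/small/large image URLs.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from ticker symbol to CoinGecko coin ID.
COINGECKO_IDS: Dict[str, str] = {
    "BTC": "bitcoin",
    "ETH": "ethereum",
    "SOL": "solana",
}

API_BASE = "https://api.coingecko.com/api/v3"


def coin_url(symbol: str) -> Optional[str]:
    """Build the /coins/{id} URL for a symbol, or None if it is unmapped."""
    coin_id = COINGECKO_IDS.get(symbol.upper())
    if coin_id is None:
        return None
    return f"{API_BASE}/coins/{coin_id}"


def image_url(payload: Dict[str, Any], size: str = "large") -> Optional[str]:
    """Pick an image URL out of a decoded /coins/{id} JSON response."""
    return (payload.get("image") or {}).get(size)
```

Tokens for which coin_url returns None are the ones that still need a mapping entry.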

Files

  • scripts/posters/assets/entity-inventory.json - Current entity list (1143 entities)
  • scripts/posters/assets/asset-checklist.md - Coverage report
  • scripts/posters/assets/icons/ - Downloaded icons
