
πŸ₯· ShadowCrawl MCP β€” v3.0.0

ShadowCrawl Logo

Search Smarter. Scrape Anything. Block Nothing.

The God-Tier Intelligence Engine for AI Agents

The Sovereign, Self-Hosted Alternative to Firecrawl, Jina, and Tavily.



ShadowCrawl is not just a scraper or a search wrapper: it is a complete intelligence layer purpose-built for AI agents, and it ships a native Rust meta-search engine inside the same binary. Zero extra containers. Parallel engines. LLM-grade clean output.

When every other tool gets blocked, ShadowCrawl doesn't retreat β€” it escalates: native engines β†’ native Chromium CDP headless β†’ Human-In-The-Loop (HITL) nuclear option. You always get results.


⚑ God-Tier Internal Meta-Search (v3.0.0)

ShadowCrawl v3.0.0 ships a 100% Rust-native metasearch engine that queries 4 engines in parallel and fuses results intelligently:

| Engine | Coverage | Notes |
|---|---|---|
| πŸ”΅ DuckDuckGo | General Web | HTML scrape, no API key needed |
| 🟒 Bing | General + News | Best for current events |
| πŸ”΄ Google | Authoritative Results | High-relevance, deduped |
| 🟠 Brave Search | Privacy-Focused | Independent index, low overlap |

🧠 What makes it God-Tier?

Parallel Concurrency β€” All 4 engines fire simultaneously. Total latency = slowest engine, not sum of all.

Smart Deduplication + Scoring β€” Cross-engine results are merged by URL fingerprint. Pages confirmed by 2+ engines receive a corroboration score boost. Domain authority weighting (docs, .gov, .edu, major outlets) pushes high-trust sources to the top.

Ultra-Clean Output for LLMs β€” Clean fields and predictable structure:

  • published_at is parsed and stored as a clean ISO-8601 field (2025-07-23T00:00:00)
  • content / snippet is clean β€” zero date-prefix garbage
  • breadcrumbs extracted from URL path for navigation context
  • domain and source_type auto-classified (blog, docs, reddit, news, etc.)

Result: LLMs receive dense, token-efficient, structured data β€” not a wall of noisy text.
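To make that shape concrete, here is a minimal consumer-side sketch of one cleaned result. The field names come from the list above; the struct and the sample values are illustrative assumptions (using the serde and serde_json crates), not ShadowCrawl's internal types:

```rust
use serde::Deserialize;

/// Illustrative consumer-side view of one cleaned result.
/// Field names follow the README; the type itself is assumed.
#[derive(Debug, Deserialize)]
struct CleanResult {
    url: String,
    title: String,
    /// Parsed ISO-8601 timestamp, e.g. "2025-07-23T00:00:00"
    published_at: Option<String>,
    /// Cleaned snippet with no date-prefix noise
    snippet: String,
    /// Breadcrumbs derived from the URL path
    breadcrumbs: Vec<String>,
    domain: String,
    /// Auto-classified source type: "blog", "docs", "reddit", "news", ...
    source_type: String,
}

fn main() {
    // Sample payload (illustrative values, not real output).
    let raw = r#"{
        "url": "https://example.com/docs/getting-started",
        "title": "Getting Started",
        "published_at": "2025-07-23T00:00:00",
        "snippet": "Install the CLI and run your first crawl.",
        "breadcrumbs": ["docs", "getting-started"],
        "domain": "example.com",
        "source_type": "docs"
    }"#;
    let result: CleanResult = serde_json::from_str(raw).expect("valid JSON");
    println!("{result:?}");
}
```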

Unstoppable Fallback β€” If an engine returns a bot-challenge page (anomaly.js, Cloudflare, PerimeterX), it is automatically retried via the native Chromium CDP instance (headless Chrome, bundled in-binary). No manual intervention. No 0-result failures.

Quality > Quantity β€” ~20 deduplicated, scored results rather than 50 raw duplicates. For an AI agent with a limited context window, 20 high-quality results outperform 50 noisy ones every time.
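To picture the fan-out and merge described above, here is a small self-contained sketch of the technique. The engine calls are stubbed, the 0.5 corroboration boost is an arbitrary illustrative weight, and names like fetch_engine are assumptions rather than ShadowCrawl's actual API (the sketch assumes the tokio and futures crates):

```rust
use futures::future::join_all;
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Hit {
    url: String,
    title: String,
    score: f32,
    engines: Vec<&'static str>,
}

/// Stand-in for one native engine query. In ShadowCrawl each engine is
/// scraped directly; if the response looks like a bot challenge, the same
/// query is retried through the bundled Chromium CDP instance.
async fn fetch_engine(engine: &'static str, _query: &str) -> Vec<Hit> {
    vec![Hit {
        url: format!("https://example.com/{engine}"),
        title: format!("Result from {engine}"),
        score: 1.0,
        engines: vec![engine],
    }]
}

#[tokio::main]
async fn main() {
    let query = "rust async scraping";
    let engines = ["duckduckgo", "bing", "google", "brave"];

    // All engines fire simultaneously: total latency = slowest engine.
    let batches = join_all(engines.into_iter().map(|e| fetch_engine(e, query))).await;

    // Merge by URL; pages confirmed by 2+ engines get a corroboration boost.
    let mut merged: HashMap<String, Hit> = HashMap::new();
    for hit in batches.into_iter().flatten() {
        if let Some(existing) = merged.get_mut(&hit.url) {
            existing.engines.extend(hit.engines);
            existing.score += 0.5; // corroboration boost
        } else {
            merged.insert(hit.url.clone(), hit);
        }
    }

    // Rank and keep a small, high-quality slice for the agent.
    let mut ranked: Vec<Hit> = merged.into_values().collect();
    ranked.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    for hit in ranked.iter().take(20) {
        println!("{:.1}  {}  {:?}", hit.score, hit.title, hit.engines);
    }
}
```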


πŸ›  Full Feature Roster

| Feature | Details |
|---|---|
| πŸ” God-Tier Meta-Search | Parallel Google / Bing / DDG / Brave Β· dedup Β· scoring Β· breadcrumbs Β· published_at |
| πŸ•· Universal Scraper | Rust-native + native Chromium CDP for JS-heavy and anti-bot sites |
| πŸ›‚ Human Auth (HITL) | human_auth_session: real browser + persistent cookies + instruction overlay + automatic re-injection. Fetch any protected URL. |
| 🧠 Semantic Memory | Embedded LanceDB + Model2Vec for long-term research recall (no DB container) |
| πŸ€– HITL Non-Robot Search | Visible Brave Browser + keyboard hooks for human CAPTCHA / login-wall bypass |
| 🌐 Deep Crawler | Recursive, bounded crawl to map entire subdomains |
| πŸ”’ Proxy Master | Native HTTP/SOCKS5 pool rotation with health checks |
| 🧽 Universal Janitor | Strips cookie banners, popups, skeleton screens β€” delivers clean Markdown |
| πŸ”₯ Hydration Extractor | Resolves React/Next.js hydration JSON (__NEXT_DATA__, embedded state) |
| πŸ›‘ Anti-Bot Arsenal | Stealth UA rotation, fingerprint spoofing, CDP automation, mobile profile emulation |
| πŸ“Š Structured Extract | CSS-selector + prompt-driven field extraction from any page |
| πŸ” Batch Scrape | Parallel scrape of N URLs with configurable concurrency |

πŸ— Zero-Bloat Architecture

ShadowCrawl is pure binary: a single Rust executable exposes MCP tools (stdio) and an optional HTTP server β€” no Docker, no sidecars.


πŸ’Ž The Nuclear Option: Human Auth Session (v3.0.0)

When standard automation fails (Cloudflare, CAPTCHA, complex logins), ShadowCrawl activates the human element.

πŸ›‚ human_auth_session β€” The "Unblocker"

This is our signature tool that surpasses all competitors. While most scrapers fail on login-walled content, human_auth_session opens a real, visible browser window for you to solve the challenge.

Once you click FINISH & RETURN, all authentication cookies are transparently captured and persisted in ~/.shadowcrawl/sessions/. Subsequent requests to the same domain automatically inject these cookies β€” making future fetches fully automated and effortless.

  • 🟒 Instruction Overlay β€” A native green banner guides the user on what to solve.
  • πŸͺ Persistent Sessions β€” Solve once, scrape forever. No need to log in manually again for weeks.
  • πŸ›‘ Security first β€” Cookies are stored locally and encrypted (optional/upcoming).
  • πŸš€ Auto-injection β€” Next web_fetch or web_crawl calls automatically load found sessions.

πŸ’₯ Boss-Level Anti-Bot Evidence

We don't claim β€” we show receipts. All captured with human_auth_session and our advanced CDP engines (2026-02-20):

| Target | Protection | Evidence | Extracted |
|---|---|---|---|
| LinkedIn | Cloudflare + Auth | JSON Β· Snippet | 60+ job listings βœ… |
| Ticketmaster | Cloudflare Turnstile | JSON Β· Snippet | Tour dates & venues βœ… |
| Airbnb | DataDome | JSON Β· Snippet | 1,000+ Tokyo listings βœ… |
| Upwork | reCAPTCHA | JSON Β· Snippet | 160K+ job postings βœ… |
| Amazon | AWS Shield | JSON Β· Snippet | RTX 5070 Ti search results βœ… |
| nowsecure.nl | Cloudflare | JSON | Manual button verified βœ… |

πŸ“– Full analysis: proof/README.md


πŸ“¦ Quick Start

Option A β€” Download Prebuilt Binaries (Recommended)

Download the latest release assets from GitHub Releases. Prebuilt assets are published for windows-x64, windows-arm64, linux-x64, and linux-arm64. Run one of:

  • shadowcrawl-mcp β€” MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
  • shadowcrawl β€” HTTP server (default port 5000; override via --port, PORT, or SHADOWCRAWL_PORT)

Confirm the HTTP server is alive:

./shadowcrawl --port 5000
curl http://localhost:5000/health

Option B β€” Build / Install from Source

git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl

Build:

cd mcp-server
cargo build --release --features non_robot_search --bin shadowcrawl --bin shadowcrawl-mcp

Or build with all optional features enabled:

cargo build --release --all-features

Or install from the repository root (puts binaries into your Cargo bin directory):

cargo install --path mcp-server --locked

Binaries land at:

  • target/release/shadowcrawl β€” HTTP server (default port 5000; override via --port, PORT, or SHADOWCRAWL_PORT)
  • target/release/shadowcrawl-mcp β€” MCP stdio server

Prerequisites for HITL:

  • Brave Browser (brave.com/download)
  • Accessibility permission (macOS: System Preferences β†’ Privacy & Security β†’ Accessibility)
  • A desktop session (not SSH-only)

Platform guides: docs/window_setup.md Β· docs/ubuntu_setup.md

After any binary rebuild/update, restart your MCP client session to pick up new tool definitions.


βœ… Agent Best Practices (ShadowCrawl Rules)

Use this exact decision flow to get the highest-quality results with minimal tokens:

  1. memory_search first (avoid re-fetching)
  2. web_search_json for initial research (search + content summaries in one call)
  3. web_fetch for specific URLs (docs/articles): use output_format="clean_json" for token-efficient output; set query + strict_relevance=true when you want only query-relevant paragraphs
  4. If web_fetch returns 403/429/rate-limit β†’ proxy_control grab then retry with use_proxy=true
  5. If web_fetch returns auth_risk_score >= 0.4 β†’ visual_scout (confirm login wall) β†’ human_auth_session (The God-Tier Nuclear Option)
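As a rough illustration of that escalation order, the sketch below walks the same steps. call_tool is a hypothetical stand-in for however your agent framework invokes MCP tools; only the tool names, parameters, and thresholds are taken from the rules above, everything else is assumed:

```rust
use serde_json::{json, Value};

/// Hypothetical MCP dispatch; your agent framework provides the real one.
fn call_tool(name: &str, args: Value) -> Value {
    println!("calling {name} with {args}");
    json!({ "status": 200, "similarity": 0.0, "auth_risk_score": 0.0 })
}

fn research(question: &str) -> Value {
    // 1. Cached memory first: skip live fetching on a strong hit.
    let mem = call_tool("memory_search", json!({ "query": question }));
    if mem["similarity"].as_f64().unwrap_or(0.0) >= 0.60 {
        return mem;
    }

    // 2. One-call search + content summaries for initial research.
    let search = call_tool("web_search_json", json!({ "query": question }));

    // 3. Deep-dive a specific URL with token-efficient output.
    let url = search["results"][0]["url"].as_str().unwrap_or("https://example.com");
    let mut page = call_tool("web_fetch", json!({
        "url": url,
        "output_format": "clean_json",
        "strict_relevance": true,
        "query": question,
    }));

    // 4. Blocked (403/429)? Grab a proxy and retry.
    if matches!(page["status"].as_u64(), Some(403) | Some(429)) {
        call_tool("proxy_control", json!({ "action": "grab" }));
        page = call_tool("web_fetch", json!({ "url": url, "use_proxy": true }));
    }

    // 5. Login wall suspected? Confirm it, then escalate to the human session.
    if page["auth_risk_score"].as_f64().unwrap_or(0.0) >= 0.4 {
        call_tool("visual_scout", json!({ "url": url }));
        page = call_tool("human_auth_session", json!({ "url": url }));
    }
    page
}

fn main() {
    let answer = research("What changed in ShadowCrawl v3.0.0?");
    println!("{answer}");
}
```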

Structured extraction (schema-first):

  • Prefer fetch_then_extract for one-shot fetch + extract.
  • strict=true (default) enforces schema shape: missing arrays become [], missing scalars become null (no schema drift).
  • Treat confidence=0.0 as β€œplaceholder / unrendered page” (often JS-only like crates.io). Escalate to browser rendering (CDP/HITL) instead of trusting the fields.
  • πŸ’‘ New in v3.0.0: Placeholder detection is now scalar-only. Pure-array schemas (only lists/structs) never trigger confidence=0.0, fixing prior regressions.

clean_json notes:

  • Large pages are truncated to respect max_chars (look for clean_json_truncated warning). Increase max_chars to see more.
  • key_code_blocks is extracted from fenced blocks and signature-like inline code; short docs pages are supported.
  • πŸ•· v3.0.0 fix: Module extraction on docs.rs works recursively for all relative and absolute sub-paths.

🧩 MCP Integration

ShadowCrawl exposes all tools via the Model Context Protocol (stdio transport).

VS Code / Copilot Chat

Add to your MCP config (~/.config/Code/User/mcp.json):

{
  "servers": {
    "shadowcrawl": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/shadowcrawl/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",
        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}

Cursor / Claude Desktop

Use the same stdio setup as VS Code (run shadowcrawl-mcp locally and pass env vars via env or your client’s env field).

πŸ“– Full multi-IDE guide: docs/IDE_SETUP.md


βš™οΈ Key Environment Variables

| Variable | Default | Description |
|---|---|---|
| CHROME_EXECUTABLE | auto-detected | Override path to Chromium/Chrome/Brave binary |
| SEARCH_ENGINES | google,bing,duckduckgo,brave | Active search engines (comma-separated) |
| SEARCH_MAX_RESULTS_PER_ENGINE | 10 | Results per engine before merge |
| SEARCH_CDP_FALLBACK | true if browser found | Auto-retry blocked engines via native Chromium CDP (alias: SEARCH_BROWSERLESS_FALLBACK) |
| SEARCH_SIMULATE_BLOCK | β€” | Force blocked path for testing: duckduckgo,bing or all |
| LANCEDB_URI | β€” | Path for semantic research memory (optional) |
| SHADOWCRAWL_NEUROSIPHON | 1 (enabled) | Set to 0 / false / off to disable all NeuroSiphon techniques (import nuking, SPA extraction, semantic shaving, search reranking) |
| HTTP_TIMEOUT_SECS | 30 | Per-request timeout |
| OUTBOUND_LIMIT | 32 | Max concurrent outbound connections |
| MAX_CONTENT_CHARS | 10000 | Max chars per scraped document |
| IP_LIST_PATH | β€” | Path to proxy IP list |
| SCRAPE_DELAY_PRESET | polite | fast / polite / cautious |
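For reference, a minimal sketch of reading a few of these variables with the defaults listed above. The parsing shown here is illustrative, not ShadowCrawl's actual configuration code:

```rust
use std::env;
use std::time::Duration;

/// Illustrative config loader using defaults from the table above.
struct Config {
    search_engines: Vec<String>,
    http_timeout: Duration,
    max_content_chars: usize,
}

fn load_config() -> Config {
    let engines = env::var("SEARCH_ENGINES")
        .unwrap_or_else(|_| "google,bing,duckduckgo,brave".to_string());
    let timeout_secs = env::var("HTTP_TIMEOUT_SECS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(30);
    let max_chars = env::var("MAX_CONTENT_CHARS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(10_000);

    Config {
        search_engines: engines.split(',').map(|s| s.trim().to_string()).collect(),
        http_timeout: Duration::from_secs(timeout_secs),
        max_content_chars: max_chars,
    }
}

fn main() {
    let cfg = load_config();
    println!(
        "engines={:?} timeout={:?} max_chars={}",
        cfg.search_engines, cfg.http_timeout, cfg.max_content_chars
    );
}
```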

πŸ† Comparison

| Feature | Firecrawl / Jina / Tavily | ShadowCrawl v3.0.0 |
|---|---|---|
| Cost | $49–$499/mo | $0 β€” self-hosted |
| Privacy | They see your queries | 100% private, local-only |
| Search Engine | Proprietary / 3rd-party API | Native Rust (4 engines, parallel) |
| Result Quality | Mixed, noisy snippets | Deduped, scored, LLM-clean |
| Cloudflare Bypass | Rarely | Native Chromium CDP + HITL fallback |
| LinkedIn / Airbnb | Blocked | 99.99% success (HITL) |
| JS Rendering | Cloud API | Native Brave + bundled Chromium CDP |
| Semantic Memory | None | Embedded LanceDB + Model2Vec |
| Proxy Support | Paid add-on | Native SOCKS5/HTTP rotation |
| MCP Native | Partial | Full MCP stdio + HTTP |

πŸ€– Agent Optimal Setup: IDE Copilot Instructions

ShadowCrawl works best when your AI agent knows the operational rules before it starts β€” which tool to call first, when to rotate proxies, and when not to use extract_fields. Without these rules, agents waste tokens re-fetching cached data and can misuse tools on incompatible sources.

The complete rules file is already in this repo as .clinerules. Copy the block below into the IDE-specific file for your editor.


πŸ—‚οΈ VS Code β€” .github/copilot-instructions.md

Create (or append to) .github/copilot-instructions.md in your workspace root:

## MCP Usage Guidelines β€” ShadowCrawl

### Shadowcrawl Priority Rules
- ALWAYS call `memory_search` BEFORE `web_search` or `web_search_json` β€” skip live fetch when similarity β‰₯ 0.60
- For initial research, use `web_search_json` (search + content in one call) instead of `web_search` + separate `web_fetch`
- For doc/article pages: `web_fetch` with `output_format: clean_json`, `strict_relevance: true`, `query: "<your question>"`
- If `web_fetch` returns 403/429/rate-limit: call `proxy_control` with `action: "grab"` then retry with `use_proxy: true`
- Use `extract_fields` ONLY on structured HTML (docs, articles). NOT on raw `.md` / `.json` / `.txt` files
- Use `web_crawl` to discover sub-pages on a large doc site before targeted fetching
- `hitl_web_fetch` is last resort only β€” try automated methods + proxy rotation first

🐾 Cursor β€” .cursorrules

Create or append to .cursorrules in your project root with the same block above.


🟩 Cline (VS Code extension) β€” .clinerules

Already included in this repository as .clinerules. Cline loads it automatically β€” no action needed.


🧠 Claude Desktop β€” System Prompt / Custom Instructions

Paste the rules block into the Custom Instructions or System Prompt field in Claude Desktop settings (Settings β†’ Advanced β†’ System Prompt).


🧳 Other Agents (Windsurf, Aider, Continue, AutoGen, etc.)

Any agent that accepts a system prompt or workspace instruction file: paste the same block. The rules are plain markdown and tool-agnostic.


Quick Decision Flow

Question / research task
        β”‚
        β–Ό
memory_search ──► hit (β‰₯ 0.60)? ──► use cached result, STOP
        β”‚ miss
        β–Ό
web_search_json ──► enough content? ──► use it, STOP
        β”‚ need deeper page
        β–Ό
web_fetch(clean_json + strict_relevance + query)
        β”‚ 403 / 429 / blocked?
        β–Ό
proxy_control grab ──► retry with use_proxy: true
        β”‚ still blocked?
        β–Ό
hitl_web_fetch  (LAST RESORT)

πŸ“– Full rules + per-tool quick-reference table: .clinerules


v3.0.0 (2026-02-20)

Added

  • human_auth_session (The Nuclear Option): Launches a visible browser for human login/CAPTCHA solving. Captures and persists full authentication cookies to ~/.shadowcrawl/sessions/{domain}.json. Enables full automation for protected URLs after a single manual session.
  • Instruction Overlay: human_auth_session now displays a custom green "ShadowCrawl" instruction banner on top of the browser window to guide users through complex auth walls.
  • Persistent Session Auto-Injection: web_fetch, web_crawl, and visual_scout now automatically check for and inject matching cookies from the local session store.
  • extract_structured / fetch_then_extract: new optional params placeholder_word_threshold (int, default 10) and placeholder_empty_ratio (float 0–1, default 0.9) allow agents to tune placeholder detection sensitivity per-call.
  • web_crawl: new optional max_chars param (default 10 000) caps total JSON output size to prevent workspace storage spill.
  • Rustdoc module extraction: extract_structured / fetch_then_extract correctly populate modules: [...] on docs.rs pages using the NAME/index.html sub-directory convention.
  • GitHub Discussions & Issues hydration: fetch_via_cdp detects github.com/*/discussions/* and /issues/* URLs; extends network-idle window to 2.5 s / 12 s max and polls for .timeline-comment, .js-discussion, .comment-body DOM nodes.
  • Contextual code blocks (clean_json mode): SniperCodeBlock gains a context: Option<String> field. Performs two-pass extraction for prose preceding fenced blocks and Markdown sentences containing inline snippets.
  • IDE copilot-instructions guide (README): new πŸ€– Agent Optimal Setup section.
  • .clinerules workspace file: all 7 priority rules + decision-flow diagram + per-tool quick-reference table.
  • Agent priority rules in tool schemas: every MCP tool description now carries machine-readable ⚠️ AGENT RULE / βœ… BEST PRACTICE.

Changed

  • Placeholder detection (Scalar-Only Logic): Confidence override to 0.0 now only considers scalar (non-array) fields. Pure-array schemas (headers, modules, structs) never trigger fake placeholder warnings, fixing false-positives on rich but list-heavy documentation pages.
  • web_fetch(output_format="clean_json"): applies a max_chars-based paragraph budget and emits clean_json_truncated when output is clipped.
  • extract_fields / fetch_then_extract: placeholder/unrendered pages (very low content + mostly empty schema fields) force confidence=0.0.
  • Short-content bypass (strict_relevance / extract_relevant_sections): early exit with a descriptive warning when word_count < 200. Short pages (GitHub Discussions, Q&A threads) are returned whole.

Fixed

  • BUG-6: modules: [] always empty on rustdoc pages β€” refactored regex to support both absolute and simple relative module links (init/index.html, optim/index.html).
  • BUG-7: false-positive confidence=0.0 on real docs.rs pages; replaced whole-schema empty ratio with scalar-only ratio + raised threshold.
  • BUG-9: web_crawl could spill 16 KB+ of JSON into VS Code workspace storage; handler now truncates response to max_chars (default 10 000).
  • web_fetch(output_format="clean_json"): paragraph filter now adapts for word_count < 200.
  • fetch_then_extract: prevents false-high confidence on JS-only placeholder pages (e.g. crates.io) by overriding confidence to 0.0.
  • cdp_fallback_failed on GitHub Discussions: extended CDP hydration window and selector polling ensures full thread capture.

β˜• Acknowledgments & Support

ShadowCrawl is built with ❀️ by a solo developer for the open-source AI community. If this tool saved you from a $500/mo scraping API bill:

  • ⭐ Star the repo β€” it helps others discover this
  • πŸ› Found a bug? Open an issue
  • πŸ’‘ Feature request? Start a discussion
  • β˜• Fuel more updates:

Sponsor

License: MIT β€” free for personal and commercial use.
