The God-Tier Intelligence Engine for AI Agents
The Sovereign, Self-Hosted Alternative to Firecrawl, Jina, and Tavily.
ShadowCrawl is not just a scraper or a search wrapper: it is a complete intelligence layer purpose-built for AI Agents. It ships a native Rust meta-search engine inside the same binary. Zero extra containers. Parallel engines. LLM-grade clean output.
When every other tool gets blocked, ShadowCrawl doesn't retreat; it escalates: native engines → native Chromium CDP headless → Human-In-The-Loop (HITL) nuclear option. You always get results.
ShadowCrawl v3.0.0 ships a 100% Rust-native metasearch engine that queries 4 engines in parallel and fuses results intelligently:
| Engine | Coverage | Notes |
|---|---|---|
| DuckDuckGo | General Web | HTML scrape, no API key needed |
| Bing | General + News | Best for current events |
| Google | Authoritative Results | High-relevance, deduped |
| Brave Search | Privacy-Focused | Independent index, low overlap |
Parallel Concurrency: All 4 engines fire simultaneously. Total latency = slowest engine, not the sum of all.
Smart Deduplication + Scoring: Cross-engine results are merged by URL fingerprint. Pages confirmed by 2+ engines receive a corroboration score boost. Domain authority weighting (docs, .gov, .edu, major outlets) pushes high-trust sources to the top.
Ultra-Clean Output for LLMs: Clean fields and predictable structure:
- `published_at` is parsed and stored as a clean ISO-8601 field (`2025-07-23T00:00:00`)
- `content`/`snippet` is clean: zero date-prefix garbage
- `breadcrumbs` extracted from the URL path for navigation context
- `domain` and `source_type` auto-classified (`blog`, `docs`, `reddit`, `news`, etc.)
Result: LLMs receive dense, token-efficient, structured data, not a wall of noisy text.
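For illustration, a single merged result could look roughly like this; the `domain`, `source_type`, `breadcrumbs`, `published_at`, and `snippet` fields are the ones described above, while `title`, `url`, `score`, and `engines` are assumptions about the surrounding shape rather than a documented schema:

```json
{
  "title": "Runtime in tokio::runtime - Rust",
  "url": "https://docs.rs/tokio/latest/tokio/runtime/",
  "domain": "docs.rs",
  "source_type": "docs",
  "breadcrumbs": ["tokio", "latest", "tokio", "runtime"],
  "published_at": "2025-07-23T00:00:00",
  "snippet": "The Tokio runtime drives asynchronous tasks, timers and I/O resources.",
  "score": 0.92,
  "engines": ["google", "bing", "brave"]
}
```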
Unstoppable Fallback: If an engine returns a bot-challenge page (anomaly.js, Cloudflare, PerimeterX), it is automatically retried via the native Chromium CDP instance (headless Chrome, bundled in-binary). No manual intervention. No 0-result failures.
Quality > Quantity: ~20 deduplicated, scored results rather than 50 raw duplicates. For an AI agent with a limited context window, 20 high-quality results outperform 50 noisy ones every time.
| Feature | Details |
|---|---|
| God-Tier Meta-Search | Parallel Google / Bing / DDG / Brave · dedup · scoring · breadcrumbs · `published_at` |
| Universal Scraper | Rust-native + native Chromium CDP for JS-heavy and anti-bot sites |
| Human Auth (HITL) | `human_auth_session`: real browser + persistent cookies + instruction overlay + automatic re-injection. Fetch any protected URL. |
| Semantic Memory | Embedded LanceDB + Model2Vec for long-term research recall (no DB container) |
| HITL Non-Robot Search | Visible Brave browser + keyboard hooks for human CAPTCHA / login-wall bypass |
| Deep Crawler | Recursive, bounded crawl to map entire subdomains |
| Proxy Master | Native HTTP/SOCKS5 pool rotation with health checks |
| Universal Janitor | Strips cookie banners, popups, skeleton screens; delivers clean Markdown |
| Hydration Extractor | Resolves React/Next.js hydration JSON (`__NEXT_DATA__`, embedded state) |
| Anti-Bot Arsenal | Stealth UA rotation, fingerprint spoofing, CDP automation, mobile profile emulation |
| Structured Extract | CSS-selector + prompt-driven field extraction from any page |
| Batch Scrape | Parallel scrape of N URLs with configurable concurrency |
ShadowCrawl is a pure binary: a single Rust executable exposes MCP tools (stdio) and an optional HTTP server. No Docker, no sidecars.
When standard automation fails (Cloudflare, CAPTCHA, complex logins), ShadowCrawl activates the human element.
This is our signature tool and the one competitors cannot match. While most scrapers fail on login-walled content, `human_auth_session` opens a real, visible browser window for you to solve the challenge.
Once you click FINISH & RETURN, all authentication cookies are transparently captured and persisted in `~/.shadowcrawl/sessions/`. Subsequent requests to the same domain automatically inject these cookies, making future fetches fully automated and effortless (a short sketch of this flow follows the list below).
- Instruction Overlay: a native green banner guides the user on what to solve.
- Persistent Sessions: solve once, scrape forever. No need to log in manually again for weeks.
- Security first: cookies are stored locally and encrypted (optional/upcoming).
- Auto-injection: the next `web_fetch` or `web_crawl` calls automatically load found sessions.
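A minimal sketch of that flow from an MCP client's point of view, using the standard MCP `name`/`arguments` call shape. The `url` parameter name and the example URLs are assumptions; consult the tool schemas exposed by `shadowcrawl-mcp` for the exact fields:

```json
[
  { "name": "human_auth_session", "arguments": { "url": "https://www.example.com/login" } },
  {
    "name": "web_fetch",
    "arguments": { "url": "https://www.example.com/protected-page", "output_format": "clean_json" }
  }
]
```

After the first call completes, the captured cookies for the domain live under `~/.shadowcrawl/sessions/`, so the second call runs fully automated with no visible browser.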
We don't claim; we show receipts. All captured with `human_auth_session` and our advanced CDP engines (2026-02-20):
| Target | Protection | Evidence | Extracted |
|---|---|---|---|
| | Cloudflare + Auth | JSON · Snippet | 60+ job listings |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Tour dates & venues |
| Airbnb | DataDome | JSON · Snippet | 1,000+ Tokyo listings |
| Upwork | reCAPTCHA | JSON · Snippet | 160K+ job postings |
| Amazon | AWS Shield | JSON · Snippet | RTX 5070 Ti search results |
| nowsecure.nl | Cloudflare | JSON | Manual button verified |
Full analysis: `proof/README.md`
Download the latest release assets from GitHub Releases and run one of:
Prebuilt assets are published for: windows-x64, windows-arm64, linux-x64, linux-arm64.
- `shadowcrawl-mcp`: MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
- `shadowcrawl`: HTTP server (default port `5000`; override via `--port`, `PORT`, or `SHADOWCRAWL_PORT`)
Confirm the HTTP server is alive:
```bash
./shadowcrawl --port 5000
curl http://localhost:5000/health
```

Build all binaries with all optional features enabled:

```bash
cd mcp-server
cargo build --release --all-features
```

Clone the repository:

```bash
git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl
```

Build:

```bash
cd mcp-server
cargo build --release --features non_robot_search --bin shadowcrawl --bin shadowcrawl-mcp
```

Or install (puts binaries into your Cargo bin directory):

```bash
cargo install --path mcp-server --locked
```

Binaries land at:

- `target/release/shadowcrawl`: HTTP server (default port `5000`; override via `--port`, `PORT`, or `SHADOWCRAWL_PORT`)
- `target/release/shadowcrawl-mcp`: MCP stdio server
Prerequisites for HITL:
- Brave Browser (brave.com/download)
- Accessibility permission (macOS: System Preferences → Privacy & Security → Accessibility)
- A desktop session (not SSH-only)
Platform guides: `docs/window_setup.md` · `docs/ubuntu_setup.md`
After any binary rebuild/update, restart your MCP client session to pick up new tool definitions.
Use this exact decision flow to get the highest-quality results with minimal tokens:
- `memory_search` first (avoid re-fetching)
- `web_search_json` for initial research (search + content summaries in one call)
- `web_fetch` for specific URLs (docs/articles)
  - `output_format="clean_json"` for token-efficient output
  - set `query` + `strict_relevance=true` when you want only query-relevant paragraphs
- If `web_fetch` returns 403/429/rate-limit: `proxy_control` `grab`, then retry with `use_proxy=true`
- If `web_fetch` returns `auth_risk_score >= 0.4`: `visual_scout` (confirm login wall) → `human_auth_session` (The God-Tier Nuclear Option)
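As a concrete sketch, the arguments for a token-efficient `web_fetch` call might look like the following; the parameter names come from the rules above, while the URL, the question, and the `max_chars` value are placeholders chosen for illustration:

```json
{
  "url": "https://docs.rs/tokio/latest/tokio/runtime/index.html",
  "output_format": "clean_json",
  "query": "How do I build a multi-threaded runtime?",
  "strict_relevance": true,
  "max_chars": 10000
}
```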
Structured extraction (schema-first):
- Prefer `fetch_then_extract` for one-shot fetch + extract.
- `strict=true` (default) enforces schema shape: missing arrays become `[]`, missing scalars become `null` (no schema drift).
- Treat `confidence=0.0` as "placeholder / unrendered page" (often JS-only, like crates.io). Escalate to browser rendering (CDP/HITL) instead of trusting the fields.
- New in v3.0.0: placeholder detection is now scalar-only. Pure-array schemas (only lists/structs) never trigger `confidence=0.0`, fixing prior regressions.
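A hedged sketch of a schema-first `fetch_then_extract` call: `strict` is documented above, while the `schema` parameter name and its `"field": "type"` notation are assumptions made purely for illustration:

```json
{
  "url": "https://docs.rs/serde/latest/serde/",
  "strict": true,
  "schema": {
    "crate_name": "string",
    "latest_version": "string",
    "modules": ["string"],
    "feature_flags": ["string"]
  }
}
```

With `strict=true`, a field the page never mentions comes back as `null` (scalar) or `[]` (array) rather than being dropped, and `confidence=0.0` signals a likely unrendered placeholder page.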
`clean_json` notes:
- Large pages are truncated to respect `max_chars` (look for the `clean_json_truncated` warning). Increase `max_chars` to see more.
- `key_code_blocks` is extracted from fenced blocks and signature-like inline code; short docs pages are supported.
- v3.0.0 fix: module extraction on docs.rs works recursively for all relative and absolute sub-paths.
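Putting those notes together, a truncated `clean_json` response might be shaped roughly like this. Only `key_code_blocks`, the per-block `context` field, `word_count`, and the `clean_json_truncated` warning are named in this README; the surrounding keys are illustrative assumptions, not the tool's exact output:

```json
{
  "url": "https://docs.rs/tokio/latest/tokio/",
  "word_count": 1840,
  "paragraphs": [
    "Tokio is a runtime for writing reliable asynchronous applications with Rust."
  ],
  "key_code_blocks": [
    {
      "context": "Creating a multi-threaded runtime:",
      "code": "let rt = tokio::runtime::Builder::new_multi_thread().enable_all().build()?;"
    }
  ],
  "warnings": ["clean_json_truncated"]
}
```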
ShadowCrawl exposes all tools via the Model Context Protocol (stdio transport).
Add to your MCP config (`~/.config/Code/User/mcp.json`); a complete example configuration appears at the end of this README.
Use the same stdio setup as VS Code (run `shadowcrawl-mcp` locally and pass env vars via `env` or your client's `env` field).
Full multi-IDE guide: `docs/IDE_SETUP.md`
| Variable | Default | Description |
|---|---|---|
| `CHROME_EXECUTABLE` | auto-detected | Override path to Chromium/Chrome/Brave binary |
| `SEARCH_ENGINES` | `google,bing,duckduckgo,brave` | Active search engines (comma-separated) |
| `SEARCH_MAX_RESULTS_PER_ENGINE` | `10` | Results per engine before merge |
| `SEARCH_CDP_FALLBACK` | `true` if browser found | Auto-retry blocked engines via native Chromium CDP (alias: `SEARCH_BROWSERLESS_FALLBACK`) |
| `SEARCH_SIMULATE_BLOCK` | unset | Force blocked path for testing: `duckduckgo,bing` or `all` |
| `LANCEDB_URI` | unset | Path for semantic research memory (optional) |
| `SHADOWCRAWL_NEUROSIPHON` | `1` (enabled) | Set to `0` / `false` / `off` to disable all NeuroSiphon techniques (import nuking, SPA extraction, semantic shaving, search reranking) |
| `HTTP_TIMEOUT_SECS` | `30` | Per-request timeout |
| `OUTBOUND_LIMIT` | `32` | Max concurrent outbound connections |
| `MAX_CONTENT_CHARS` | `10000` | Max chars per scraped document |
| `IP_LIST_PATH` | unset | Path to proxy IP list |
| `SCRAPE_DELAY_PRESET` | `polite` | `fast` / `polite` / `cautious` |
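For example, a subset of these variables could be passed through an MCP client's `env` block; the values below are illustrative, and the repository's own VS Code example at the end of this README instead passes them as `KEY=VALUE` arguments to `env`:

```json
{
  "env": {
    "SEARCH_ENGINES": "google,brave",
    "SEARCH_MAX_RESULTS_PER_ENGINE": "10",
    "HTTP_TIMEOUT_SECS": "30",
    "MAX_CONTENT_CHARS": "10000",
    "SCRAPE_DELAY_PRESET": "cautious",
    "LANCEDB_URI": "/YOUR_PATH/shadowcrawl/lancedb"
  }
}
```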
| Feature | Firecrawl / Jina / Tavily | ShadowCrawl v3.0.0 |
|---|---|---|
| Cost | $49–$499/mo | $0, self-hosted |
| Privacy | They see your queries | 100% private, local-only |
| Search Engine | Proprietary / 3rd-party API | Native Rust (4 engines, parallel) |
| Result Quality | Mixed, noisy snippets | Deduped, scored, LLM-clean |
| Cloudflare Bypass | Rarely | Native Chromium CDP + HITL fallback |
| LinkedIn / Airbnb | Blocked | 99.99% success (HITL) |
| JS Rendering | Cloud API | Native Brave + bundled Chromium CDP |
| Semantic Memory | None | Embedded LanceDB + Model2Vec |
| Proxy Support | Paid add-on | Native SOCKS5/HTTP rotation |
| MCP Native | Partial | Full MCP stdio + HTTP |
ShadowCrawl works best when your AI agent knows the operational rules before it starts: which tool to call first, when to rotate proxies, and when not to use `extract_fields`. Without these rules, agents waste tokens re-fetching cached data and can misuse tools on incompatible sources.
The complete rules file is already in this repo as `.clinerules`. Copy the block below into the IDE-specific file for your editor.
Create (or append to) `.github/copilot-instructions.md` in your workspace root:
```markdown
## MCP Usage Guidelines: ShadowCrawl

### Shadowcrawl Priority Rules
- ALWAYS call `memory_search` BEFORE `web_search` or `web_search_json`; skip live fetch when similarity ≥ 0.60
- For initial research, use `web_search_json` (search + content in one call) instead of `web_search` + separate `web_fetch`
- For doc/article pages: `web_fetch` with `output_format: clean_json`, `strict_relevance: true`, `query: "<your question>"`
- If `web_fetch` returns 403/429/rate-limit: call `proxy_control` with `action: "grab"` then retry with `use_proxy: true`
- Use `extract_fields` ONLY on structured HTML (docs, articles). NOT on raw `.md` / `.json` / `.txt` files
- Use `web_crawl` to discover sub-pages on a large doc site before targeted fetching
- `hitl_web_fetch` is last resort only; try automated methods + proxy rotation first
```

Create or append to `.cursorrules` in your project root with the same block above.
Already included in this repository as `.clinerules`. Cline loads it automatically; no action needed.
Paste the rules block into the Custom Instructions or System Prompt field in Claude Desktop settings (Settings → Advanced → System Prompt).
Any agent that accepts a system prompt or workspace instruction file: paste the same block. The rules are plain markdown and tool-agnostic.
```text
Question / research task
        │
        ▼
memory_search ──► hit (≥ 0.60)? ──► use cached result, STOP
        │ miss
        ▼
web_search_json ──► enough content? ──► use it, STOP
        │ need deeper page
        ▼
web_fetch (clean_json + strict_relevance + query)
        │ 403 / 429 / blocked?
        ▼
proxy_control grab ──► retry with use_proxy: true
        │ still blocked?
        ▼
hitl_web_fetch (LAST RESORT)
```
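The proxy escalation step, sketched as two MCP-style calls: `action: "grab"` and `use_proxy: true` are documented in the rules above, the `name`/`arguments` wrapper follows the MCP `tools/call` convention, and the URL is a placeholder:

```json
[
  { "name": "proxy_control", "arguments": { "action": "grab" } },
  {
    "name": "web_fetch",
    "arguments": { "url": "https://example.com/blocked-page", "use_proxy": true }
  }
]
```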
Full rules + per-tool quick-reference table: `.clinerules`
- `human_auth_session` (The Nuclear Option): launches a visible browser for human login/CAPTCHA solving. Captures and persists full authentication cookies to `~/.shadowcrawl/sessions/{domain}.json`. Enables full automation for protected URLs after a single manual session.
- Instruction Overlay: `human_auth_session` now displays a custom green "ShadowCrawl" instruction banner on top of the browser window to guide users through complex auth walls.
- Persistent Session Auto-Injection: `web_fetch`, `web_crawl`, and `visual_scout` now automatically check for and inject matching cookies from the local session store.
- `extract_structured` / `fetch_then_extract`: new optional params `placeholder_word_threshold` (int, default 10) and `placeholder_empty_ratio` (float 0-1, default 0.9) allow agents to tune placeholder detection sensitivity per call.
- `web_crawl`: new optional `max_chars` param (default 10 000) caps total JSON output size to prevent workspace storage spill.
- Rustdoc module extraction: `extract_structured` / `fetch_then_extract` correctly populate `modules: [...]` on docs.rs pages using the `NAME/index.html` sub-directory convention.
- GitHub Discussions & Issues hydration: `fetch_via_cdp` detects `github.com/*/discussions/*` and `/issues/*` URLs; extends the network-idle window to 2.5 s / 12 s max and polls for `.timeline-comment`, `.js-discussion`, `.comment-body` DOM nodes.
- Contextual code blocks (`clean_json` mode): `SniperCodeBlock` gains a `context: Option<String>` field. Performs two-pass extraction for prose preceding fenced blocks and Markdown sentences containing inline snippets.
- IDE copilot-instructions guide (README): new "Agent Optimal Setup" section.
- `.clinerules` workspace file: all 7 priority rules + decision-flow diagram + per-tool quick-reference table.
- Agent priority rules in tool schemas: every MCP tool description now carries machine-readable `AGENT RULE` / `BEST PRACTICE` markers.
- Placeholder detection (Scalar-Only Logic): Confidence override to 0.0 now only considers scalar (non-array) fields. Pure-array schemas (headers, modules, structs) never trigger fake placeholder warnings, fixing false-positives on rich but list-heavy documentation pages.
- `web_fetch` (`output_format="clean_json"`): applies a `max_chars`-based paragraph budget and emits `clean_json_truncated` when output is clipped.
- `extract_fields` / `fetch_then_extract`: placeholder/unrendered pages (very low content + mostly empty schema fields) force `confidence=0.0`.
- Short-content bypass (`strict_relevance` / `extract_relevant_sections`): early exit with a descriptive warning when `word_count < 200`. Short pages (GitHub Discussions, Q&A threads) are returned whole.
- BUG-6: `modules: []` always empty on rustdoc pages; refactored regex to support both absolute and simple relative module links (`init/index.html`, `optim/index.html`).
- BUG-7: false-positive `confidence=0.0` on real docs.rs pages; replaced whole-schema empty ratio with scalar-only ratio + raised threshold.
- BUG-9: `web_crawl` could spill 16 KB+ of JSON into VS Code workspace storage; handler now truncates the response to `max_chars` (default 10 000).
- `web_fetch` (`output_format="clean_json"`): paragraph filter now adapts for `word_count < 200`.
- `fetch_then_extract`: prevents false-high confidence on JS-only placeholder pages (e.g. crates.io) by overriding confidence to 0.0.
- `cdp_fallback_failed` on GitHub Discussions: extended CDP hydration window and selector polling ensure full thread capture.
ShadowCrawl is built with ❤️ by a solo developer for the open-source AI community. If this tool saved you from a $500/mo scraping API bill:
- Star the repo: it helps others discover this
- Found a bug? Open an issue
- Feature request? Start a discussion
- Fuel more updates:
License: MIT. Free for personal and commercial use.
{ "servers": { "shadowcrawl": { "type": "stdio", "command": "env", "args": [ "RUST_LOG=info", "SEARCH_ENGINES=google,bing,duckduckgo,brave", "LANCEDB_URI=/YOUR_PATH/shadowcrawl/lancedb", "HTTP_TIMEOUT_SECS=30", "MAX_CONTENT_CHARS=10000", "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt", "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json", "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp" ] } } }