A basic web server for tracking URL content snapshots (SHA-256 fingerprints).
Tracking whether data sources have changed is an essential part of any pipeline that consumes web content. WebCrawl demonstrates one way to implement it: a simple web server backed by a database that stores a hash of each web page's source, so that modifications can be detected. The API can create a new entry or look up an existing one for a particular URL.
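
The change-detection idea can be sketched in a few lines of plain Ruby. This is only an illustration, not the project's actual code: the helper names `content_fingerprint` and `content_changed?` are hypothetical, and the sketch assumes the fingerprint is the hex-encoded SHA-256 of the raw response body.

```ruby
require "digest"

# Hypothetical helper: hex-encoded SHA-256 of a response body.
def content_fingerprint(body)
  Digest::SHA256.hexdigest(body)
end

# Hypothetical helper: true when a freshly fetched body no longer
# matches the fingerprint that was stored earlier.
def content_changed?(stored_sha256, new_body)
  stored_sha256 != content_fingerprint(new_body)
end

stored = content_fingerprint("<html>v1</html>")
puts content_changed?(stored, "<html>v1</html>") # prints false
puts content_changed?(stored, "<html>v2</html>") # prints true
```

Because only the digest is stored, a change can be detected but the previous content cannot be recovered or diffed.
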
Note: This project is intended for educational purposes, and as such is limited in its functionality. A practical solution should take into account dynamic content.
- Ruby 3.4.9
- Rails 8.1.2.1
- Bundler 4
- SQLite 3
The required Ruby version is listed in [.ruby-version](.ruby-version). Install it, then run:
bin/setup

All JSON endpoints are under /api/v1.
This app does not authenticate requests. A production deployment will usually want authentication and authorization (e.g. API keys, OAuth2, or mutual TLS) in front of or inside the app, depending on the network it is deployed on.
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/snapshots | Provide url. Fetches the page over HTTP(S), stores a normalized URL and SHA-256 of the response body. 201 if new, 200 if updated. 422 with { "error": "fetch_failed", ... } when the URL cannot be fetched. |
| GET | /api/v1/snapshots/read | Query param url. Returns the stored snapshot. 200 with { "url", "content_sha256", "added_at" }. 404 if unknown. 400 for missing or invalid URL. |
| GET | /up | Rails health check. |
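
As a rough illustration of how a client might call these endpoints, here is a small Ruby sketch using only the standard library. Several things are assumptions rather than facts about this app: the base address `http://localhost:3000`, sending `url` to the POST endpoint as a JSON body, and the helper names `build_create_request`, `build_read_request`, and `send_request`.

```ruby
require "json"
require "net/http"
require "uri"

# Assumption: the server is running locally on the default Rails port.
BASE = URI("http://localhost:3000")

# Builds the POST that creates or updates a snapshot.
# Assumption: the endpoint accepts the url parameter as a JSON body.
def build_create_request(page_url)
  request = Net::HTTP::Post.new("/api/v1/snapshots", "Content-Type" => "application/json")
  request.body = JSON.generate(url: page_url)
  request
end

# Builds the GET that reads a stored snapshot; the url query
# parameter is percent-encoded so it survives as a single value.
def build_read_request(page_url)
  query = URI.encode_www_form(url: page_url)
  Net::HTTP::Get.new("/api/v1/snapshots/read?#{query}")
end

# Sends a prepared request to the assumed base address.
def send_request(request)
  Net::HTTP.start(BASE.host, BASE.port) { |http| http.request(request) }
end
```

With the server running, `send_request(build_create_request("https://example.com/"))` returns a `Net::HTTPResponse`. Its `code` can be compared against the status codes listed for each endpoint, and `JSON.parse(response.body)` yields the fields described there.
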
Risk: POST /api/v1/snapshots causes the server to request an attacker-chosen HTTP(S) URL. The fetcher allows redirects and only validates scheme and host presence, so requests can reach internal addresses or unintended hosts after a redirect chain.
Mitigations:
- Network: Restrict outbound traffic, or force HTTP through a controlled forward proxy that enforces policy.
- Application URL policy: Allowlist permitted hostnames or URL prefixes, or maintain a blocklist of IP ranges.
- Redirects: Disable following redirects, or re-validate each hop against the same URL/IP rules.
- Product: Only fetch URLs you trust, or run fetches in an isolated worker with no access to internal networks.
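
The application-level URL policy and per-hop re-validation could look roughly like the sketch below. It is not the project's fetcher: `BLOCKED_RANGES`, `blocked_ip?`, and `fetch_allowed?` are hypothetical names, and the range list is an illustrative subset rather than a complete one. The check also resolves the host separately from the eventual fetch, so on its own it does not defend against DNS rebinding. Pinning the connection to the validated address, or enforcing the policy at the network layer, closes that gap.

```ruby
require "ipaddr"
require "resolv"
require "uri"

# Illustrative (not exhaustive) list of ranges a fetcher should not reach.
BLOCKED_RANGES = %w[
  0.0.0.0/8 10.0.0.0/8 100.64.0.0/10 127.0.0.0/8
  169.254.0.0/16 172.16.0.0/12 192.168.0.0/16
  ::1/128 fc00::/7 fe80::/10
].map { |cidr| IPAddr.new(cidr) }.freeze

# True when the address falls inside any blocked range.
# IPv4-mapped IPv6 addresses are converted to IPv4 before checking.
def blocked_ip?(ip_string)
  ip = IPAddr.new(ip_string).native
  BLOCKED_RANGES.any? { |range| range.include?(ip) }
end

# True only for http(s) URLs whose host resolves exclusively to
# non-blocked addresses. Call it again for every redirect hop.
# The resolver is injectable so the policy can be tested without DNS.
def fetch_allowed?(url, resolver: ->(host) { Resolv.getaddresses(host) })
  uri = URI.parse(url)
  return false unless uri.is_a?(URI::HTTP) && uri.hostname

  addresses = resolver.call(uri.hostname)
  return false if addresses.empty?

  addresses.none? { |address| blocked_ip?(address) }
rescue URI::InvalidURIError, IPAddr::Error
  false
end
```

Rejecting a host when any one of its resolved addresses is blocked is deliberately conservative: a name that maps to both a public and an internal address is refused.
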
Risk: Without authentication, rate limits, or quotas, an attacker can exhaust threads, bandwidth, CPU, or database capacity.
Mitigations:
- Edge: Rate limiting and bot protection at API gateway, load balancer, CDN, or WAF; mutual TLS or allowlists for private APIs.
- Application: Per-IP or per-key rate limits and concurrency caps on snapshot creation; queues (background job + worker pool) so web processes stay bounded; idempotency keys so that retries do not repeat heavy work.
- Fetch policy: Tighten timeouts and maximum body size for your SLA; consider refusing very large URLs or bodies.
- Observability: Alerts on error rate, latency, queue depth, and outbound connection count.
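
To make the application-level rate limit concrete, here is a minimal fixed-window limiter in plain Ruby. It is a sketch under stated assumptions, not part of this app: the class name `FixedWindowLimiter` is hypothetical, and the counters live in one process's memory, so it neither shares state across several web processes nor evicts old keys. A real deployment would more likely use a shared store or limits enforced at the edge.

```ruby
# Hypothetical in-memory limiter: at most `limit` calls per key
# within each fixed window of `window_seconds`.
class FixedWindowLimiter
  def initialize(limit:, window_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @limit = limit
    @window = window_seconds
    @clock = clock      # injectable so the window can be tested without sleeping
    @counters = {}      # key => [window_index, count]
    @mutex = Mutex.new  # guards @counters across threads
  end

  # Returns true and records the call if the key is under its limit
  # for the current window, false otherwise.
  def allow?(key)
    window_index = (@clock.call / @window).floor
    @mutex.synchronize do
      index, count = @counters[key]
      count = 0 unless index == window_index # a new window resets the count
      return false if count >= @limit

      @counters[key] = [window_index, count + 1]
      true
    end
  end
end

limiter = FixedWindowLimiter.new(limit: 2, window_seconds: 60)
puts limiter.allow?("203.0.113.7") # prints true
```

A controller could call `allow?` with the client IP or API key before starting a fetch and respond with 429 when it returns false. Fixed windows permit short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more bookkeeping.
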
See comments in [Dockerfile](Dockerfile) for build and run examples.