
WebCrawl

A basic web server for tracking URL content snapshots (SHA-256 fingerprints).

Purpose

Tracking the status of data sources is an essential feature of any pipeline that consumes web content. WebCrawl demonstrates one way to implement that: a simple web server backed by a database that stores the hash of each web page's source, so that modifications can be detected. The API can be used to create a new entry or look up an existing one for a given URL.
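The fingerprinting idea can be sketched with Ruby's standard Digest library (an illustration of the concept, not the app's actual code):

```ruby
require "digest"

# Fingerprint a page body: identical content yields an identical
# SHA-256 hex digest, so any change to the source changes the
# fingerprint and can be detected on the next fetch.
def content_fingerprint(body)
  Digest::SHA256.hexdigest(body)
end
```

Comparing the stored digest with a freshly computed one is enough to tell whether the page changed, without storing the page itself.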

Note: This project is intended for educational purposes and is therefore limited in functionality. A practical solution would also need to account for dynamic content.

Requirements

  • Ruby 3.4.9
  • Rails 8.1.2.1
  • Bundler 4
  • SQLite 3

Setup

Install the Ruby version listed in [.ruby-version](.ruby-version), then run:

bin/setup

API

All JSON endpoints are under /api/v1.

This app does not authenticate requests. A production deployment will usually want authentication and authorization (e.g. API keys, OAuth2, or mutual TLS) in front of or inside the app, depending on the network it is deployed on.

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/snapshots | Provide `url`. Fetches the page over HTTP(S) and stores a normalized URL plus the SHA-256 of the response body. Returns 201 if new, 200 if updated, and 422 with `{ "error": "fetch_failed", ... }` when the URL cannot be fetched. |
| GET | /api/v1/snapshots/read | Takes query param `url`. Returns the stored snapshot: 200 with `{ "url", "content_sha256", "added_at" }`, 404 if unknown, 400 for a missing or invalid URL. |
| GET | /up | Rails health check. |
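A minimal client sketch using Ruby's standard Net::HTTP (the base URL, port, and helper name are assumptions, not part of the app):

```ruby
require "json"
require "net/http"

# Hypothetical helper: builds the POST /api/v1/snapshots request
# described above, with a JSON body containing the target URL.
def snapshot_request(url)
  req = Net::HTTP::Post.new("/api/v1/snapshots")
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(url: url)
  req
end

# Sending it requires a running server, e.g. on localhost:3000:
#   base = URI("http://localhost:3000")
#   res = Net::HTTP.start(base.host, base.port) do |http|
#     http.request(snapshot_request("https://example.com"))
#   end
#   res.code  # "201" if the snapshot is new, "200" if it was updated
```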

Security

Server-Side Request Forgery (SSRF)

Risk: POST /api/v1/snapshots causes the server to request an attacker-chosen HTTP(S) URL. The fetcher allows redirects and only validates scheme and host presence, so requests can reach internal addresses or unintended hosts after a redirect chain.

Mitigations:

  • Network: Restrict outbound traffic at the network level, or route outbound HTTP(S) through a controlled forward proxy that enforces policy.
  • Application URL policy: Allowlist permitted hostnames or URL prefixes, or maintain a blocklist of IP ranges.
  • Redirects: Disable following redirects, or re-validate each hop against the same URL/IP rules.
  • Product: Only fetch URLs you trust, or run fetches in an isolated worker with no access to internal networks.
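One form the URL-policy mitigation above could take (illustrative only, not code from this app): reject URLs whose host is, or resolves to, a private, loopback, or link-local address. To cover redirect chains, the same check would be applied to every hop.

```ruby
require "ipaddr"
require "resolv"
require "uri"

# RFC 1918, loopback, link-local, and IPv6 unique-local/link-local ranges.
BLOCKED_RANGES = %w[
  0.0.0.0/8 10.0.0.0/8 127.0.0.0/8 169.254.0.0/16
  172.16.0.0/12 192.168.0.0/16 ::1/128 fc00::/7 fe80::/10
].map { |cidr| IPAddr.new(cidr) }

def safe_url?(raw)
  uri = URI.parse(raw)
  return false unless %w[http https].include?(uri.scheme) && uri.hostname
  addresses =
    begin
      [IPAddr.new(uri.hostname)]  # host is a literal IP
    rescue IPAddr::InvalidAddressError
      Resolv.getaddresses(uri.hostname).map { |a| IPAddr.new(a) }
    end
  # Reject if resolution failed or any resolved address is in a blocked range.
  addresses.any? &&
    addresses.none? { |ip| BLOCKED_RANGES.any? { |r| r.family == ip.family && r.include?(ip) } }
rescue URI::InvalidURIError
  false
end
```

Checking resolved addresses rather than hostnames matters because DNS can point a harmless-looking name at an internal IP.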

Denial of Service (DoS)

Risk: Without authentication, rate limits, or quotas, an attacker can exhaust threads, bandwidth, CPU usage, or database capacity.

Mitigations:

  • Edge: Rate limiting and bot protection at API gateway, load balancer, CDN, or WAF; mutual TLS or allowlists for private APIs.
  • Application: Per-IP or per-key rate limits and concurrency caps on snapshot creation; queues (background job + worker pool) so web processes stay bounded; idempotency keys to prevent heavy retries.
  • Fetch policy: Tighten timeouts and maximum body size for your SLA; consider refusing very large URLs or bodies.
  • Observability: Alerts on error rate, latency, queue depth, and outbound connection count.
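The application-level rate limit above can be sketched as a per-key fixed-window counter (illustrative only; real deployments typically use a shared store such as Redis, e.g. via the rack-attack gem, so limits hold across processes):

```ruby
# Minimal in-process fixed-window rate limiter.
class RateLimiter
  def initialize(limit:, period:)
    @limit = limit     # max requests per window
    @period = period   # window length in seconds
    @windows = Hash.new { |h, k| h[k] = [0, 0] }  # key => [window_start, count]
  end

  # Returns true if the request identified by `key` (e.g. client IP)
  # is allowed within the current window.
  def allow?(key, now = Time.now.to_i)
    window = now - (now % @period)
    start, count = @windows[key]
    count = 0 if start != window   # a new window resets the count
    @windows[key] = [window, count + 1]
    count < @limit
  end
end
```

A fixed window is the simplest policy; sliding-window or token-bucket variants smooth out bursts at window boundaries.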

Docker

See comments in [Dockerfile](Dockerfile) for build and run examples.
