
WebCrawl

A basic web server for tracking URL content snapshots (SHA-256 fingerprints).

Purpose

Tracking the status of data sources is an essential feature of any pipeline that consumes web content. WebCrawl demonstrates one way to implement that: a simple web server backed by a database that stores the hash of each web page's source, so that modifications can be detected. The API can be used to create a new entry or look up an existing one for a given URL.
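The fingerprinting idea can be sketched with Ruby's standard Digest library (an illustration of the concept, not the app's actual code):

```ruby
require "digest"

# Fingerprint a page body: identical content yields an identical
# SHA-256 hex digest, so any change to the source changes the
# fingerprint and can be detected on the next fetch.
def content_fingerprint(body)
  Digest::SHA256.hexdigest(body)
end
```

Comparing the stored digest with a freshly computed one is enough to tell whether the page changed, without storing the page itself.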

Note: This project is intended for educational purposes and is therefore limited in functionality. A practical solution would also need to account for dynamic content.

Requirements

  • Ruby 3.4.9
  • Rails 8.1.2.1
  • Bundler 4
  • SQLite 3

Setup

Install the Ruby version listed in [.ruby-version](.ruby-version), then run:

bin/setup

API

All JSON endpoints are under /api/v1.

This app does not authenticate requests. A production deployment will usually want authentication and authorization (e.g. API keys, OAuth2, or mutual TLS) in front of or inside the app, depending on the network it is deployed on.

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/snapshots | Provide `url`. Fetches the page over HTTP(S) and stores a normalized URL plus the SHA-256 of the response body. Returns 201 if new, 200 if updated, and 422 with `{ "error": "fetch_failed", ... }` when the URL cannot be fetched. |
| GET | /api/v1/snapshots/read | Takes query param `url`. Returns the stored snapshot: 200 with `{ "url", "content_sha256", "added_at" }`, 404 if unknown, 400 for a missing or invalid URL. |
| GET | /up | Rails health check. |
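A minimal client sketch using Ruby's standard Net::HTTP (the base URL, port, and helper name are assumptions, not part of the app):

```ruby
require "json"
require "net/http"

# Hypothetical helper: builds the POST /api/v1/snapshots request
# described above, with a JSON body containing the target URL.
def snapshot_request(url)
  req = Net::HTTP::Post.new("/api/v1/snapshots")
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(url: url)
  req
end

# Sending it requires a running server, e.g. on localhost:3000:
#   base = URI("http://localhost:3000")
#   res = Net::HTTP.start(base.host, base.port) do |http|
#     http.request(snapshot_request("https://example.com"))
#   end
#   res.code  # "201" if the snapshot is new, "200" if it was updated
```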

Security

Server-Side Request Forgery (SSRF)

Risk: POST /api/v1/snapshots causes the server to request an attacker-chosen HTTP(S) URL. The fetcher allows redirects and only validates scheme and host presence, so requests can reach internal addresses or unintended hosts after a redirect chain.

Mitigations:

  • Network: Restrict outbound traffic at the network level, or route outbound HTTP(S) through a controlled forward proxy that enforces policy.
  • Application URL policy: Allowlist permitted hostnames or URL prefixes, or maintain a blocklist of IP ranges.
  • Redirects: Disable following redirects, or re-validate each hop against the same URL/IP rules.
  • Product: Only fetch URLs you trust, or run fetches in an isolated worker with no access to internal networks.
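One form the URL-policy mitigation above could take (illustrative only, not code from this app): reject URLs whose host is, or resolves to, a private, loopback, or link-local address. To cover redirect chains, the same check would be applied to every hop.

```ruby
require "ipaddr"
require "resolv"
require "uri"

# RFC 1918, loopback, link-local, and IPv6 unique-local/link-local ranges.
BLOCKED_RANGES = %w[
  0.0.0.0/8 10.0.0.0/8 127.0.0.0/8 169.254.0.0/16
  172.16.0.0/12 192.168.0.0/16 ::1/128 fc00::/7 fe80::/10
].map { |cidr| IPAddr.new(cidr) }

def safe_url?(raw)
  uri = URI.parse(raw)
  return false unless %w[http https].include?(uri.scheme) && uri.hostname
  addresses =
    begin
      [IPAddr.new(uri.hostname)]  # host is a literal IP
    rescue IPAddr::InvalidAddressError
      Resolv.getaddresses(uri.hostname).map { |a| IPAddr.new(a) }
    end
  # Reject if resolution failed or any resolved address is in a blocked range.
  addresses.any? &&
    addresses.none? { |ip| BLOCKED_RANGES.any? { |r| r.family == ip.family && r.include?(ip) } }
rescue URI::InvalidURIError
  false
end
```

Checking resolved addresses rather than hostnames matters because DNS can point a harmless-looking name at an internal IP.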

Denial of Service (DoS)

Risk: Without authentication, rate limits, or quotas, an attacker can exhaust threads, bandwidth, CPU usage, or database capacity.

Mitigations:

  • Edge: Rate limiting and bot protection at API gateway, load balancer, CDN, or WAF; mutual TLS or allowlists for private APIs.
  • Application: Per-IP or per-key rate limits and concurrency caps on snapshot creation; queues (background job + worker pool) so web processes stay bounded; idempotency keys to prevent heavy retries.
  • Fetch policy: Tighten timeouts and maximum body size for your SLA; consider refusing very large URLs or bodies.
  • Observability: Alerts on error rate, latency, queue depth, and outbound connection count.
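The application-level rate limit above can be sketched as a per-key fixed-window counter (illustrative only; real deployments typically use a shared store such as Redis, e.g. via the rack-attack gem, so limits hold across processes):

```ruby
# Minimal in-process fixed-window rate limiter.
class RateLimiter
  def initialize(limit:, period:)
    @limit = limit     # max requests per window
    @period = period   # window length in seconds
    @windows = Hash.new { |h, k| h[k] = [0, 0] }  # key => [window_start, count]
  end

  # Returns true if the request identified by `key` (e.g. client IP)
  # is allowed within the current window.
  def allow?(key, now = Time.now.to_i)
    window = now - (now % @period)
    start, count = @windows[key]
    count = 0 if start != window   # a new window resets the count
    @windows[key] = [window, count + 1]
    count < @limit
  end
end
```

A fixed window is the simplest policy; sliding-window or token-bucket variants smooth out bursts at window boundaries.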

Docker

See comments in [Dockerfile](Dockerfile) for build and run examples.
