PDF Finder

PDF Finder is an end-to-end pipeline for discovering and downloading PDF documents from the public web using the Google Custom Search API. The script builds structured search queries, paginates through a configurable number of result pages and extracts metadata for every returned item, including titles, snippets and MIME types.
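The pagination and metadata extraction described above can be sketched roughly as follows. This is a minimal illustration rather than the code in app.py: the helper names `page_params` and `extract_items` are hypothetical, and the sketch assumes the Custom Search JSON API conventions of at most 10 items per request and a 1-based `start` index.

```python
from typing import Dict, Iterator, List

# Endpoint of the Google Custom Search JSON API.
API_URL = "https://www.googleapis.com/customsearch/v1"
PAGE_SIZE = 10  # the API returns at most 10 items per request


def page_params(query: str, api_key: str, engine_id: str, pages: int) -> Iterator[Dict[str, object]]:
    """Yield one request-parameter dict per results page (hypothetical helper)."""
    for page in range(pages):
        yield {
            "key": api_key,
            "cx": engine_id,
            "q": query,
            "num": PAGE_SIZE,
            "start": page * PAGE_SIZE + 1,  # 1-based index of the first result
        }


def extract_items(payload: dict) -> List[Dict[str, str]]:
    """Pull title, link, snippet and MIME type from one decoded JSON response."""
    return [
        {
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "snippet": item.get("snippet", ""),
            "mime": item.get("mime", ""),
        }
        for item in payload.get("items", [])
    ]
```

Each dict from `page_params` could be passed as query parameters in a GET request to `API_URL`, and the decoded JSON body handed to `extract_items`.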

By appending filetype:pdf to each query and checking the MIME type the server reports during download, the script aims to process and store only valid PDF files. This makes it a reliable tool for gathering targeted documents at scale without manual oversight.
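A minimal sketch of these two checks is shown below. The function names are hypothetical, and the additional test for the `%PDF-` file signature is an assumption added here for illustration; the README only states that MIME responses are evaluated.

```python
def with_pdf_filter(query: str) -> str:
    """Append the filetype:pdf operator unless the query already contains it."""
    cleaned = query.strip()
    if "filetype:pdf" in cleaned.lower():
        return cleaned
    return f"{cleaned} filetype:pdf"


def looks_like_pdf(content_type: str, first_bytes: bytes) -> bool:
    """Accept a response only if its media type and leading bytes both say PDF.

    The signature check is an extra safeguard assumed for this sketch.
    """
    # Drop parameters such as "; charset=binary" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf" and first_bytes.startswith(b"%PDF-")
```

In a download loop, `content_type` would come from the response's Content-Type header and `first_bytes` from the first chunk of the body, so a mislabelled HTML error page can be rejected before anything is written to disk.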

A major strength of PDF Finder is its emphasis on safety, traceability and consistency. Every downloaded file is assigned a sanitized, filesystem-safe filename based on its discovered title or URL, with automatic conflict resolution to prevent overwrites.
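One way to implement sanitized filenames with conflict resolution is sketched below. The character whitelist, the `document` fallback name and the `_1`, `_2` suffix scheme are assumptions made for illustration, not necessarily what app.py does.

```python
import re
from pathlib import Path


def safe_filename(title: str, max_length: int = 100) -> str:
    """Turn a title or URL fragment into a filesystem-safe .pdf filename."""
    # Collapse every run of characters outside the whitelist into one underscore.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("._")
    name = name[:max_length] or "document"  # fallback when nothing usable remains
    return name if name.lower().endswith(".pdf") else f"{name}.pdf"


def unique_path(directory: Path, filename: str) -> Path:
    """Return a path in `directory` that does not collide with an existing file."""
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

With these helpers, a title such as `Annual Report: 2023/24` becomes `Annual_Report_2023_24.pdf`, and a second download with the same title would be saved as `Annual_Report_2023_24_1.pdf` rather than overwriting the first.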

Comprehensive logging captures each stage of execution, providing full transparency for debugging and auditing. Additionally, the deduplication mechanism removes links that repeat across queries, which prevents redundant downloads and saves bandwidth and storage.
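Deduplication with logging might look like the sketch below. The normalization rule (lowercase host, fragment removed) and the logger name are assumptions for illustration; the README does not say how links are compared.

```python
import logging
from typing import Dict, Iterable, List
from urllib.parse import urlsplit, urlunsplit

logger = logging.getLogger("pdf_finder_sketch")  # hypothetical logger name


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment, so trivially
    different spellings of the same link compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def dedupe_items(items: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first item seen for each normalized link, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = normalize_url(item["link"])
        if key in seen:
            logger.info("Skipping duplicate link: %s", item["link"])
            continue
        seen.add(key)
        unique.append(item)
    return unique
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup would make the skipped-duplicate messages visible on the console.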

The Python script also includes a complete manifest system for writing structured results to both JSON and CSV formats, allowing seamless integration with downstream workflows such as data analysis, archival or ingestion into research tools.
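Writing the same records to both formats can be sketched with the standard library alone. The field list, the `saved_as` column and the file names `manifest.json` and `manifest.csv` are hypothetical choices for this example.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List, Tuple

# Column order for the CSV manifest (assumed for this sketch).
FIELDS = ["title", "link", "snippet", "mime", "saved_as"]


def write_manifests(records: List[Dict[str, str]], out_dir: str) -> Tuple[Path, Path]:
    """Write the records as JSON and as CSV, returning both file paths."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)

    json_path = folder / "manifest.json"
    json_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    csv_path = folder / "manifest.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # Unknown keys are ignored and missing keys are written as empty cells.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

    return json_path, csv_path
```

The JSON file keeps every key of each record, while the CSV file is restricted to the fixed columns so it opens cleanly in spreadsheet tools.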

Configurable environment variables make it easy to adjust behavior for large or repeated jobs, including pagination depth, request delay, timeout settings, output directories and user-defined search queries. Combined, these features make PDF Finder a flexible and production-ready solution for researchers, analysts and engineers who need an automated method for bulk PDF retrieval and metadata collection.
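Reading such settings from the environment could be done along these lines. Every variable name below (`PDF_FINDER_QUERIES` and so on), the semicolon separator and the default values are hypothetical; the names the project actually uses are defined in .env.template.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    queries: List[str]
    pages_per_query: int
    request_delay: float  # seconds to wait between API requests
    timeout: float        # seconds before a download is abandoned
    output_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build a Settings object from environment variables (names are assumed)."""
    raw_queries = env.get("PDF_FINDER_QUERIES", "")
    return Settings(
        # Queries are separated by semicolons in this sketch.
        queries=[q.strip() for q in raw_queries.split(";") if q.strip()],
        pages_per_query=int(env.get("PDF_FINDER_PAGES", "1")),
        request_delay=float(env.get("PDF_FINDER_DELAY", "1.0")),
        timeout=float(env.get("PDF_FINDER_TIMEOUT", "30")),
        output_dir=env.get("PDF_FINDER_OUTPUT_DIR", "downloads"),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the loader easy to exercise with a plain dict during testing.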

