PolyFile is a file analysis utility that identifies and maps the semantic and syntactic structure of files—including polyglots, chimeras, and "schizophrenic" files that are validly multiple types simultaneously.
Key capabilities:
- Pure-Python libmagic implementation (263+ MIME types)
- Recursive embedded file detection (like binwalk)
- Parsers for PDF, ZIP, JPEG, iNES, and 188 Kaitai Struct formats
- Interactive HTML hex viewer with structure mapping
- Drop-in replacement for Unix
filecommand
Part of the ALAN Parsers Project alongside PolyTracker.
Matchers classify file types. Two types:
- libmagic DSL matchers (defined in
polyfile/magic_defs/) - Python matchers (classes extending
Matcher)
Parsers create AST representations of file structure. Registered via decorator:
from polyfile import register_parser
@register_parser("application/pdf")
def parse_pdf(file_stream, match):
# Return parsed structure
...| Module | Purpose | Size |
|---|---|---|
polyfile.py |
Core engine—Match, Parser, Submatch classes | 15 KB |
magic.py |
Pure-Python libmagic DSL implementation | 117 KB |
pdf.py |
PDF parser with embedded file detection | 48 KB |
debugger.py |
Interactive GDB-style debugger | 42 KB |
kaitaimatcher.py |
Kaitai Struct format bridge | — |
polyfile/
├── polyfile/ # Main package
│ ├── magic_defs/ # 354 libmagic definition files
│ ├── kaitai/parsers/ # 188 auto-generated Kaitai parsers (excluded from lint)
│ └── templates/ # HTML output templates
├── polymerge/ # Companion merge tool
├── tests/ # Test suite
├── docs/ # Extension guide, JSON format spec
└── kaitai_struct_formats/ # Git submodule with KSY definitions
# Install from source (requires Java for Kaitai compiler)
pip install -e .[dev]
# Install from PyPI
pip install polyfile# Run flake8 (excludes auto-generated kaitai parsers)
flake8 polyfile polymerge --max-complexity=10 --max-line-length=127 \
--exclude=polyfile/kaitai/parsers# Run all tests
pytest tests
# Run specific test file
pytest tests/test_magic.py
pytest tests/test_pdf.py
pytest tests/test_corkami.py # Polyglot corpuspip-auditRun all checks before committing changes:
# Lint
flake8 polyfile polymerge --max-complexity=10 --max-line-length=127 \
--exclude=polyfile/kaitai/parsers
# Security audit (checks for vulnerable dependencies)
pip-audit
# Tests
pytest tests# Find libmagic definitions by MIME type
rg "application/pdf" polyfile/magic_defs/
# Find Python matchers
ast-grep --pattern 'class $NAME(Matcher): $$$' --lang py polyfile/# Find registered parsers
rg "@register_parser" polyfile/
# Find parser for specific MIME type
rg 'register_parser.*application/zip' polyfile/- CLI:
polyfile/__main__.py - Core analysis:
polyfile/polyfile.py:PolyFile.struc() - Magic matching:
polyfile/magic.py:MagicMatcher.match()
tests/
├── test_magic.py # libmagic implementation vs corpus
├── test_pdf.py # PDF parsing
├── test_corkami.py # Polyglot/chimera edge cases
├── test_kaitai.py # Kaitai format tests
└── unit/
├── test_ast.py # AST utilities
└── test_http.py # HTTP protocol parsing
- Uses real file corpus including libmagic's official test suite
- Tests polyglot files to verify multi-type detection
- Parser tests validate structure extraction
from polyfile import Matcher, Match
class MyMatcher(Matcher):
def match(self, data: bytes) -> Match | None:
if data.startswith(b'MAGIC'):
return Match(
mime_type="application/x-myformat",
name="My Format",
offset=0,
length=len(data)
)
return Nonefrom polyfile import register_parser, Parser, Submatch
@register_parser("application/x-myformat")
class MyParser(Parser):
def parse(self, file_stream, match) -> Iterator[Submatch]:
# Yield Submatch objects representing structure
yield Submatch(
name="header",
start=0,
length=8,
value=file_stream.read(8)
)- Add
.ksyfile tokaitai_struct_formats/ - Map MIME type in
polyfile/kaitai/parsers/__init__.py - Rebuild:
python compile_kaitai_parsers.py
See docs/extending_polyfile.md for detailed guide.
- Use
FileStreamabstraction for seeking/reading PathOrStdin/PathOrStdoutfor CLI flexibility
Match→ top-level file type matchSubmatch→ nested structure within a match- Build trees for embedded files (ZIP contents, PDF streams)
- Raise
InvalidMatchwhen parser cannot process data - Matchers return
Nonefor non-matching data
polyfile/kaitai/parsers/is auto-generated—never edit manually- Java required at install time for Kaitai compilation
- libmagic DSL has quirks—see blog post