Integration

Compression, distortion, novelty, and meaning in large language models

Independent research — no affiliation, no funding, no product.

Every token a language model generates comes with a surprisal value: the negative log-probability the model assigned to that token, a measure of its own prediction error that falls out of inference at no extra cost. We use it during training (averaged over tokens, it is the cross-entropy loss). We occasionally inspect it for debugging. And then, at inference time, we throw it away.

This paper argues that's a mistake, and that the reason it's a mistake reveals something fundamental about what these systems are doing.
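For concreteness, here is a minimal sketch of how per-token surprisal can be read off the logits a model already produces at each step. The function names and the toy four-token vocabulary are illustrative assumptions, not code from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surprisal_bits(logits, token_id):
    """Surprisal of the emitted token in bits: -log2 p(token | context)."""
    return -log_softmax(logits)[token_id] / math.log(2)


# Toy example (hypothetical): uniform logits over a 4-token vocabulary.
# Each token has probability 1/4, so its surprisal is about 2 bits.
uniform_logits = [0.0, 0.0, 0.0, 0.0]
print(surprisal_bits(uniform_logits, token_id=2))

# A confident prediction yields low surprisal for the favoured token
# and high surprisal for the others.
peaked_logits = [8.0, 0.0, 0.0, 0.0]
print(surprisal_bits(peaked_logits, token_id=0))
print(surprisal_bits(peaked_logits, token_id=1))
```

Nothing here needs an extra forward pass: the logits are the same ones the sampler consumes, which is the sense in which the value is free.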

The argument in brief

The paper treats what LLMs do as hierarchical multi-scale compression, and the autoregressive loop as sequential coding through a bandwidth-limited channel. From this, a single measurable variable falls out: the enrichment fraction — the proportion of tokens doing novel compressive work over a window of generation. It determines whether output is productive, degenerate, or noise — a distinction invisible in fluency metrics but directly observable in the model's own surprisal.

Figure: the three generation regimes. Left: attractor collapse (enrichment → 0). Centre: coherent prose (mixed enriching/stabilising). Right: noise (enrichment → 1).
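One rough way to make the enrichment fraction concrete is sketched below. It is a proxy under stated assumptions, not the paper's formal definition: a token is counted as "enriching" when its surprisal exceeds a fixed threshold, and the threshold, the window size, and the regime cutoffs are all hypothetical values chosen for illustration.

```python
def enrichment_fraction(surprisals, threshold_bits=2.0):
    """Fraction of tokens in a window whose surprisal exceeds the threshold.

    Assumption: 'doing novel compressive work' is approximated by
    surprisal > threshold_bits. The paper's construction may differ.
    """
    if not surprisals:
        raise ValueError("window must contain at least one token")
    return sum(s > threshold_bits for s in surprisals) / len(surprisals)


def rolling_enrichment(surprisals, window=4, threshold_bits=2.0):
    """Enrichment fraction over each full sliding window of the sequence."""
    return [
        enrichment_fraction(surprisals[i:i + window], threshold_bits)
        for i in range(len(surprisals) - window + 1)
    ]


def regime(fraction, low=0.1, high=0.9):
    """Map an enrichment fraction to a coarse regime label (hypothetical cutoffs)."""
    if fraction <= low:
        return "collapse"
    if fraction >= high:
        return "noise"
    return "productive"


# Hypothetical per-token surprisals (bits) for three kinds of output.
looping = [0.1] * 8                                  # repeated phrase
prose = [0.5, 4.0, 1.0, 3.2, 0.3, 5.1, 0.8, 0.2]     # mixed
garbage = [9.0] * 8                                  # near-random tokens

for name, window in [("looping", looping), ("prose", prose), ("garbage", garbage)]:
    f = enrichment_fraction(window)
    print(name, f, regime(f))
# looping 0.0 collapse
# prose 0.375 productive
# garbage 1.0 noise
```

A fixed threshold is the simplest possible choice; a threshold tied to the model's own entropy at each step would be a natural alternative, and which is closer to the paper's intent is left open here.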

The three-regime structure itself is well-established empirically (Holtzman et al. 2020, Basu et al. 2021, Nakaishi 2024, Mikhaylovskiy 2025). The contribution is the enrichment fraction as a continuous, theoretically grounded metric — and its embedding within the compression hierarchy, which explains why these regimes exist rather than merely documenting that they do.

What's novel

The autoregressive loop as sequential coding through a bandwidth-limited channel — the mismatch between internal state dimensionality and single-token output is not metaphorical but information-theoretic.
The enrichment fraction as a continuous variable characterising generation regime, derived from the model's own surprisal.
What grows across depth despite the data processing inequality — not information about the input, but the organisation of preserved information (statistical complexity).
Grounding as a rate-distortion question — converts a philosophical debate into one with measurable parameters.
Alignment as bandwidth constraint — the binding constraint is feedback bandwidth, not feedback quality. The receiver's distortion measure is too rich to convey through available channels.
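For readers who want the two information-theoretic results referred to above in symbols, the standard textbook statements are given below. The notation (X for the input, H_ℓ for the representation at layer ℓ, d for a distortion measure) is an assumption made here for illustration and may not match the paper's.

```latex
% Data processing inequality: if X -> Y -> Z forms a Markov chain, then
I(X; Z) \le I(X; Y).

% Applied to a stack of layers, where each H_{\ell+1} is computed from H_\ell alone:
I(X; H_{\ell+1}) \le I(X; H_\ell), \qquad \ell = 1, \dots, L-1.

% Rate-distortion function: the minimum rate needed to describe X
% with expected distortion at most D under the measure d:
R(D) = \min_{p(\hat{x} \mid x) \,:\, \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}).
```

The first inequality is why information about the input cannot increase with depth. The last definition shows where the choice of distortion measure d enters.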

Predictions

The framework generates testable predictions — four measurable via generation statistics, seven concerning internal structure. Two are already partially established by convergent results across independent research programmes. The predictions and their current status are tracked in the paper.

Sections

The paper is five sections plus a conclusion, with formal constructions in an appendix.

Introduction — The bandwidth problem, prior work across five research threads, contributions, and outline.
The Compression Hierarchy — Rate-distortion as the organising principle. The one sharp boundary (lossless/lossy), the compression continuum, what transformers do, and how training fixes the distortion measure.
Structure Across Depth — What grows across layers despite the data processing inequality. The DPI resolution, statistical complexity as the measure of what grows, and why depth buys capability rather than merely capacity.
The Autoregressive Loop — The projection bottleneck formalised. Enrichment fraction, the three regimes, chain-of-thought and in-context learning as steering strategies. Where the framework meets and extends the Mirostat / typical sampling literature.
Grounding and Alignment — Two distortion measures (training-induced vs task-specific). Grounding as rate-distortion question. Alignment as distortion measure mismatch — the binding constraint is feedback bandwidth, not quality.
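As an illustration of what using surprisal at inference time can look like, the sketch below implements a single step of a surprisal-feedback sampler loosely in the spirit of Mirostat (Basu et al. 2021). It is a simplified assumption, not a faithful reimplementation of that algorithm and not the paper's method: tokens whose surprisal exceeds a running cap `mu` are dropped, one token is sampled from the rest, and `mu` is nudged toward a target surprisal `tau`. The parameter names and defaults are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surprisal_controlled_step(logits, mu, tau=3.0, eta=0.1, rng=random):
    """Sample one token and update the surprisal cap.

    Returns (token_id, observed_surprisal_bits, new_mu).
    """
    probs = softmax(logits)
    # Keep only tokens whose surprisal (in bits) is within the current cap.
    kept = [i for i, p in enumerate(probs) if p > 0 and -math.log2(p) <= mu]
    if not kept:
        # Cap too tight: fall back to the single most likely token.
        kept = [max(range(len(probs)), key=probs.__getitem__)]
    weights = [probs[i] for i in kept]
    token = rng.choices(kept, weights=weights, k=1)[0]
    # Surprisal of the sampled token under the renormalised distribution.
    observed = -math.log2(probs[token] / sum(weights))
    # Feedback: lower the cap after a surprising token, raise it after a dull one.
    new_mu = mu - eta * (observed - tau)
    return token, observed, new_mu


# Toy example: 8 equally likely tokens, each carrying 3 bits of surprisal.
token, observed, mu = surprisal_controlled_step([0.0] * 8, mu=10.0, tau=2.0)
print(token, observed, mu)
```

The point of the sketch is only that a signal already present at every step can close a feedback loop. How such a controller relates to the enrichment fraction is the paper's subject, not something this example settles.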

Versioning

Tagged releases (v0.x) mark stable versions. main is the working branch; publish tracks the latest tagged release and serves the live site. The version string is injected automatically at render time via git describe.
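The repository's actual injection mechanism is not shown here. As a hedged sketch of the general idea, a version string can be obtained from `git describe` and split into its parts roughly as follows. The helper names, the fallback value, and the example tag `v0.3` are hypothetical.

```python
import re
import subprocess


def git_describe(repo_dir=".", fallback="unknown"):
    """Return `git describe --tags --dirty` output, or a fallback string
    when git is unavailable or the repository has no tags."""
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--dirty"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return fallback
    return result.stdout.strip() or fallback


def parse_describe(text):
    """Split '<tag>[-<n>-g<sha>][-dirty]' into its components."""
    match = re.fullmatch(
        r"(?P<tag>.+?)(?:-(?P<ahead>\d+)-g(?P<sha>[0-9a-f]+))?(?P<dirty>-dirty)?",
        text,
    )
    if match is None:
        raise ValueError(f"unrecognised describe output: {text!r}")
    return {
        "tag": match["tag"],
        "commits_ahead": int(match["ahead"] or 0),
        "sha": match["sha"],
        "dirty": match["dirty"] is not None,
    }


# On a tagged commit the output is just the tag; otherwise it carries
# the number of commits since the tag and an abbreviated hash.
print(parse_describe("v0.3"))
print(parse_describe("v0.3-5-gabc1234"))
print(git_describe())
```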

Building locally

Requires Python, Quarto, and TinyTeX.

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
quarto render paper   # outputs to paper/docs/

Citation

If you reference this work, please cite:

@misc{disisto2026integration,
  author       = {DiSisto, Daniel},
  title        = {Integration: Compression, Distortion, Novelty, and Meaning},
  year         = {2026},
  url          = {https://github.com/ddisisto/integration-framework},
  note         = {Preprint}
}

License

This work is licensed under CC BY 4.0.
