
Treat \\<space> as non-breaking space in inline text (issue #162, bd-1aip) #165

Merged
cscheid merged 1 commit into main from bugfix/1aip-inline-backslash-space-nbsp
May 8, 2026

@cscheid cscheid commented May 7, 2026

Fixes #162.

Summary

A literal \ followed by an ASCII space inside inline text was being absorbed into a single pandoc_str token together with the space (and any following text), producing AST shapes that drifted on round trip — a \ b parsed as [Str "a", Space, Str "\ b"] and re-parsed as [Str "a", Space, Str "\", Space, Str "b"] after one cycle. Reporter linked three quarto-web docs that hit this in pipe-table cells (cell padding inserts the space).

This change adopts Pandoc's non-breaking-space shorthand: \<ASCII space> collapses to a single U+00A0 (NBSP). The writer already emits NBSP literally as its UTF-8 bytes, and the existing grammar already parses literal NBSP correctly, so the round trip is now stable end-to-end with no grammar regen and no writer changes — the entire fix is one extra branch in process_backslash_escapes (crates/pampa/src/pandoc/treesitter_utils/text_helpers.rs).

This changes the AST for inputs containing \<space> in inline text: previously a single Str containing literal \<space>X, now a Str containing U+00A0 X (matching Pandoc). Consumers reading raw Str text will see a different character, but the source rendering is unchanged.
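The rule described above can be sketched in a few lines. This is a minimal illustration, not the code in text_helpers.rs: the function name `unescape_inline_sketch` is hypothetical, and the handling of escapes other than \<space> (backslash plus ASCII punctuation yields the punctuation, anything else keeps the backslash) is an assumption made so the example is complete.

```rust
/// Hypothetical stand-in for the escape pass; NOT the actual pampa
/// implementation of `process_backslash_escapes`.
fn unescape_inline_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            // The new branch: backslash + ASCII space collapses to one NBSP.
            Some(' ') => {
                chars.next();
                out.push('\u{a0}');
            }
            // Assumed behavior: backslash + ASCII punctuation (including a
            // second backslash) yields the punctuation character itself.
            Some(next) if next.is_ascii_punctuation() => {
                chars.next();
                out.push(next);
            }
            // Anything else, or a trailing backslash: keep it literally.
            _ => out.push('\\'),
        }
    }
    out
}
```

Under these assumptions, `a \ b` maps to `a <NBSP>b`, while an escaped backslash followed by a space stays a literal backslash plus a space, because the first backslash consumes the second before the space is seen.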

End-to-end verification

$ printf -- 'a \\ b\n' | cargo run -q --bin pampa
[ Para [Str "a", Space, Str "\u{a0}b"] ]

$ printf -- 'a \\ b\n' | cargo run -q --bin pampa -- -t qmd | xxd
00000000: 6120 c2a0 620a                           a ..b.

$ printf -- 'a \\ b\n' | cargo run -q --bin pampa -- -t qmd \
    | cargo run -q --bin pampa
[ Para [Str "a", Space, Str "\u{a0}b"] ]

Round-trip stable, byte-for-byte equal to Pandoc's markdown -> markdown output for the same input.
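For readers checking the xxd output by hand: U+00A0 encodes in UTF-8 as the two bytes c2 a0, and 160 decimal is 0xA0. A small sketch (the helper name `hex_bytes` is made up for this example) that reproduces the byte sequence:

```rust
/// Format a string's UTF-8 bytes as space-separated lowercase hex pairs.
/// Illustrative helper only; not part of pampa.
fn hex_bytes(text: &str) -> String {
    text.bytes()
        .map(|b| format!("{b:02x}"))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling `hex_bytes("a \u{a0}b\n")` gives the same six bytes that xxd prints above.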

Tests

  • Three unit tests in crates/pampa/tests/test_treesitter_refactoring.rs:
    • test_backslash_space_becomes_nbsp: the reporter's exact case.
    • test_backslash_space_at_paragraph_start: \<space> at the start of inline content.
    • test_backslash_space_in_table_cell_round_trips: the pipe-table cell case (where the reporter hit the bug in the wild). Compares native AST text rather than going through the JSON-roundtrip suite; see "Side find" below.
  • Two new round-trip fixtures under crates/pampa/tests/roundtrip_tests/qmd-json-qmd/ for stability: inline_backslash_space.qmd and inline_backslash_space_paragraph_start.qmd.
  • Existing test_escaped_backslash_then_space_remains_literal locks in that \\<space> (escaped backslash + space) is unchanged and does NOT produce NBSP.

cargo xtask verify --skip-hub-build --skip-hub-tests --skip-trace-viewer-build --skip-trace-viewer-tests is clean.
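The property the round-trip fixtures check can be stated generically: parsing the source and parsing the re-written output must give the same AST. The sketch below uses toy stand-ins (`toy_parse`, `toy_write`, and `round_trip_stable` are all hypothetical names, not pampa APIs) where the "AST" is just a list of words split on ASCII spaces.

```rust
/// True when parsing `src` and parsing the re-written output agree.
fn round_trip_stable<Ast: PartialEq>(
    src: &str,
    parse: impl Fn(&str) -> Ast,
    write: impl Fn(&Ast) -> String,
) -> bool {
    let first = parse(src);
    let second = parse(&write(&first));
    first == second
}

/// Toy parser: split on ASCII space only, so an NBSP stays inside its word.
fn toy_parse(src: &str) -> Vec<String> {
    src.split(' ').map(str::to_owned).collect()
}

/// Toy writer: join the words back with single ASCII spaces.
fn toy_write(ast: &Vec<String>) -> String {
    ast.join(" ")
}
```

With these toys, `round_trip_stable("a \u{a0}b", toy_parse, toy_write)` holds because the NBSP never turns back into a word separator, which is the same reason the real round trip stabilizes.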

Scope

Only \<ASCII space> for now. Pandoc treats \<TAB> and \<NBSP> similarly, but those didn't surface in the reporter's bug or the linked docs, so they're left out of this change.

\<EOL> remains a LineBreak (different code path, unchanged).

Side find — bd-j9wp

While building the table-cell test, I tried to use a qmd-json-qmd round-trip fixture and the test failed even though the round-trip is content-stable. The codebase is fully deterministic — verified across 5 repeated runs that the same input always produces the same JSON1 (captionS: 17) and the same JSON3 (captionS: 15). The mismatch is real and reproducible.

What's happening: each parse builds its own astContext.sourceInfoPool (23 entries for the original qmd, 21 for the regenerated qmd — each pool is deterministic but the regenerated qmd has different source positions, so the pool contents and traversal-order IDs differ). The test's remove_location_fields (crates/pampa/tests/test.rs:395-411) scrubs astContext itself plus most foreign-keys-into-the-pool (s, attrS, targetS, citationIdS), but it misses scalar S-foreign-keys not nested inside a scrubbed envelope. captionS is one such miss. After scrubbing, captionS: 17 and captionS: 15 are both dangling references to the now-deleted pools, but the test still compares them numerically and fails.
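To make the gap concrete, here is a self-contained sketch of a key-based scrub. It is not the real remove_location_fields: the `Json` enum, `scrub`, and `table_node` are hypothetical, and the node shape is invented for the example. It only illustrates how a key absent from the scrub list survives and breaks an otherwise equal comparison.

```rust
use std::collections::BTreeMap;

/// Tiny JSON-like value, just enough for the illustration.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(i64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Recursively drop every object key named in `drop_keys`.
fn scrub(value: &mut Json, drop_keys: &[&str]) {
    match value {
        Json::Obj(map) => {
            map.retain(|key, _| !drop_keys.contains(&key.as_str()));
            for child in map.values_mut() {
                scrub(child, drop_keys);
            }
        }
        Json::Arr(items) => {
            for item in items {
                scrub(item, drop_keys);
            }
        }
        Json::Num(_) | Json::Str(_) => {}
    }
}

/// Invented node shape: a table carrying two pool indices.
fn table_node(caption_s: i64) -> Json {
    let mut map = BTreeMap::new();
    map.insert("t".to_string(), Json::Str("Table".to_string()));
    map.insert("s".to_string(), Json::Num(caption_s + 1));
    map.insert("captionS".to_string(), Json::Num(caption_s));
    Json::Obj(map)
}
```

Scrubbing `table_node(17)` and `table_node(15)` with the list `s, attrS, targetS, citationIdS` leaves them unequal, since captionS still holds 17 and 15. Adding captionS to the list makes them compare equal, which is the shape of the fix bd-j9wp asks for.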

I worked around this for #165 by switching the table case to a unit test that compares native AST text (no source info). The underlying scrub-list gap is filed as bd-j9wp (priority 3) — not a determinism issue, just a missing key in the comparison filter.

Test plan

  • Confirm that the AST change (Str "\<space>X" -> Str "<NBSP>X") doesn't break a downstream consumer that was working around the previous behavior. None expected, but worth a glance at quarto-web rendering of one of the three linked docs to be safe.

🤖 Generated with Claude Code

…aip)

A literal backslash followed by an ASCII space inside inline text was
being captured into a single Str together with the space (and any
following text), producing AST shapes that drifted on round trip.
Because the writer emits \\ for any literal \, the output re-parsed
as a split Str + Space + Str: a \ b -> [Str "a", Space, Str "\ b"]
became [Str "a", Space, Str "\", Space, Str "b"] after one round trip.

Pandoc handles \<space> as its non-breaking-space shorthand (U+00A0).
Adopt the same rule in process_backslash_escapes: \<ASCII space>
collapses to a single NBSP character. The writer emits NBSP as its
UTF-8 bytes and the existing grammar already handles literal NBSP, so
the round trip is now stable end-to-end.

User-visible incidence: trailing \ inside pipe-table cells in
quarto-web docs (cell padding inserts the space before the column
terminator). End-to-end verification:

  $ printf -- 'a \\ b\n' | cargo run -q --bin pampa
  [ Para [Str "a", Space, Str "\u{a0}b"] ]
  $ printf -- 'a \\ b\n' | cargo run -q --bin pampa -- -t qmd | xxd
  00000000: 6120 c2a0 620a                           a ..b.

Matches Pandoc:
  $ printf -- 'a \\ b\n' | pandoc -f markdown -t native
  [ Para [ Str "a" , Space , Str "\160b" ] ]

Tests:
- Three unit tests in test_treesitter_refactoring.rs covering the
  inline case, paragraph-start case, and a pipe-table cell case
  (the last compares native AST to bypass an unrelated source-info
  scrubbing gap in remove_location_fields, filed as bd-j9wp).
- Two round-trip fixtures under tests/roundtrip_tests/qmd-json-qmd/
  for stability (inline_backslash_space.qmd and
  inline_backslash_space_paragraph_start.qmd).
- Existing test_escaped_backslash_then_space_remains_literal
  locks in that \\<space> (escaped backslash + space) is unchanged
  and does NOT produce NBSP.

Scope: ASCII space only. Pandoc's reader also maps \<TAB>, \<NBSP>,
etc. to NBSP; that can be added later if a user need surfaces.
\<EOL> remains LineBreak (unchanged code path).

Closes bd-1aip.

cscheid commented May 8, 2026

(We're not currently rendering quarto-web, but nice of you to remind us, Claude)

@cscheid cscheid merged commit 68de2bc into main May 8, 2026
4 checks passed
@cscheid cscheid deleted the bugfix/1aip-inline-backslash-space-nbsp branch May 8, 2026 14:05

Development

Successfully merging this pull request may close these issues.

Inline \ round-trip splits one Str into three inlines