…aip)
A literal backslash followed by an ASCII space inside inline text was
being captured into a single Str together with the space (and any
following text), producing AST shapes that drifted on round trip.
The writer emits \\ for any literal \, and that output then re-parsed as a
split Str + Space + Str: a \ b parsed as [Str "a", Space, Str "\ b"] but
became [Str "a", Space, Str "\", Space, Str "b"] after one round trip.
Pandoc handles \<space> as its non-breaking-space shorthand (U+00A0).
Adopt the same rule in process_backslash_escapes: \<ASCII space>
collapses to a single NBSP character. The writer emits NBSP as its
UTF-8 bytes and the existing grammar already handles literal NBSP, so
the round trip is now stable end-to-end.
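The rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the actual body of process_backslash_escapes: the function name collapse_backslash_space is hypothetical, and the handling of \\ (one literal backslash, with the following space left as an ordinary space) is an assumption based on the existing test described under Tests.

```rust
/// Hypothetical stand-in for the new branch; not the real pampa function.
fn collapse_backslash_space(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            // `\<ASCII space>` collapses to a single NBSP (U+00A0).
            Some(' ') => {
                chars.next();
                out.push('\u{a0}');
            }
            // Assumption: `\\` is an escaped backslash. Consume both and
            // emit one literal backslash, so a following space is untouched.
            Some('\\') => {
                chars.next();
                out.push('\\');
            }
            // Every other escape is out of scope for this sketch; the
            // backslash is passed through unchanged.
            _ => out.push('\\'),
        }
    }
    out
}
```

Consuming the escaped backslash as a pair is what keeps \\<space> from being misread as \<space>: by the time the space is seen, no backslash is pending.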
User-visible incidence: trailing \ inside pipe-table cells in
quarto-web docs (cell padding inserts the space before the column
terminator). End-to-end verification:
    $ printf -- 'a \\ b\n' | cargo run -q --bin pampa
    [ Para [Str "a", Space, Str "\u{a0}b"] ]
    $ printf -- 'a \\ b\n' | cargo run -q --bin pampa -- -t qmd | xxd
    00000000: 6120 c2a0 620a                           a ..b.

Matches Pandoc:

    $ printf -- 'a \\ b\n' | pandoc -f markdown -t native
    [ Para [ Str "a" , Space , Str "\160b" ] ]
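To read the xxd line above: NBSP (U+00A0) encodes as the two UTF-8 bytes c2 a0, so the six bytes are a, space, NBSP, b, newline. The sketch below reproduces the hex column for that string; hex_pairs is a hypothetical helper written for this note, not part of pampa.

```rust
/// Formats bytes as lowercase hex, two bytes per group, like the hex
/// column of xxd's default output. Hypothetical helper for illustration.
fn hex_pairs(bytes: &[u8]) -> String {
    bytes
        .chunks(2)
        .map(|pair| pair.iter().map(|b| format!("{:02x}", b)).collect::<String>())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling hex_pairs("a \u{a0}b\n".as_bytes()) yields 6120 c2a0 620a, the same hex column as the xxd output.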
Tests:
- Three unit tests in test_treesitter_refactoring.rs covering the
inline case, paragraph-start case, and a pipe-table cell case
(the last compares native AST to bypass an unrelated source-info
scrubbing gap in remove_location_fields, filed as bd-j9wp).
- Two round-trip fixtures under tests/roundtrip_tests/qmd-json-qmd/
for stability (inline_backslash_space.qmd and
inline_backslash_space_paragraph_start.qmd).
- Existing test_escaped_backslash_then_space_remains_literal
locks in that \\<space> (escaped backslash + space) is unchanged
and does NOT produce NBSP.
Scope: ASCII space only. Pandoc's reader also treats \<TAB>, \<NBSP>, etc.
as NBSP; support can be extended later if a user need surfaces.
\<EOL> remains LineBreak (unchanged code path).
Closes bd-1aip.
(We're not currently rendering |
Fixes #162.
Summary
A literal \ followed by an ASCII space inside inline text was being absorbed into a single pandoc_str token together with the space (and any following text), producing AST shapes that drifted on round trip: a \ b parsed as [Str "a", Space, Str "\ b"] and re-parsed as [Str "a", Space, Str "\", Space, Str "b"] after one cycle. Reporter linked three quarto-web docs that hit this in pipe-table cells (cell padding inserts the space).

This change adopts Pandoc's non-breaking-space shorthand: \<ASCII space> collapses to a single U+00A0 (NBSP). The writer already emits NBSP literally as its UTF-8 bytes, and the existing grammar already parses literal NBSP correctly, so the round trip is now stable end-to-end with no grammar regen and no writer changes. The entire fix is one extra branch in process_backslash_escapes (crates/pampa/src/pandoc/treesitter_utils/text_helpers.rs).

This changes the AST for inputs containing \<space> in inline text: previously a single Str containing literal \<space>X, now a Str containing U+00A0 X (matching Pandoc). Consumers reading raw Str text will see a different character, but the source rendering is unchanged.

End-to-end verification
Round-trip stable, byte-for-byte equal to Pandoc's markdown -> markdown output for the same input.

Tests
- Three unit tests in crates/pampa/tests/test_treesitter_refactoring.rs:
  - test_backslash_space_becomes_nbsp: reporter's exact case.
  - test_backslash_space_at_paragraph_start: \<space> at the start of inline content.
  - test_backslash_space_in_table_cell_round_trips: pipe-table cell case (the incidence the reporter hit in the wild). Compares native AST text rather than going through the JSON-roundtrip suite; see "Side find" below.
- Two round-trip fixtures under crates/pampa/tests/roundtrip_tests/qmd-json-qmd/ for stability: inline_backslash_space.qmd and inline_backslash_space_paragraph_start.qmd.
- Existing test_escaped_backslash_then_space_remains_literal locks in that \\<space> (escaped backslash + space) is unchanged and does NOT produce NBSP.
- cargo xtask verify --skip-hub-build --skip-hub-tests --skip-trace-viewer-build --skip-trace-viewer-tests is clean.

Scope
Only \<ASCII space> for now. Pandoc treats \<TAB> and \<NBSP> similarly, but those didn't surface in the reporter's bug or the linked docs, so they're left out of this change. \<EOL> remains a LineBreak (different code path, unchanged).

Side find: bd-j9wp
While building the table-cell test, I tried to use a qmd-json-qmd round-trip fixture and the test failed even though the round-trip is content-stable. The codebase is fully deterministic: verified across 5 repeated runs that the same input always produces the same JSON1 (captionS: 17) and the same JSON3 (captionS: 15). The mismatch is real and reproducible.

What's happening: each parse builds its own astContext.sourceInfoPool (23 entries for the original qmd, 21 for the regenerated qmd; each pool is deterministic, but the regenerated qmd has different source positions, so the pool contents and traversal-order IDs differ). The test's remove_location_fields (crates/pampa/tests/test.rs:395-411) scrubs astContext itself plus most foreign-keys-into-the-pool (s, attrS, targetS, citationIdS), but it misses scalar S-foreign-keys not nested inside a scrubbed envelope. captionS is one such miss. After scrubbing, captionS: 17 and captionS: 15 are both dangling references to the now-deleted pools, but the test still compares them numerically and fails.

I worked around this for #165 by switching the table case to a unit test that compares native AST text (no source info). The underlying scrub-list gap is filed as bd-j9wp (priority 3). It is not a determinism issue, just a missing key in the comparison filter.
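The scrub-list gap can be illustrated with a small self-contained sketch. Everything here is hypothetical: the Json enum stands in for whatever JSON value type the test suite uses, scrub_location_fields is not the real remove_location_fields, and listing captionS among the scrubbed keys is only one possible way bd-j9wp might be closed.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like tree, standing in for the real JSON value type.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(i64),
    Text(String),
    List(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Keys that point into (or are) the per-parse source-info pool.
/// The first five come from the description above; `captionS` is the
/// key the description says is currently missed.
const POOL_KEYS: [&str; 6] =
    ["astContext", "s", "attrS", "targetS", "citationIdS", "captionS"];

/// Recursively drops every pool-related key so two parses can be
/// compared on content alone.
fn scrub_location_fields(value: &mut Json) {
    match value {
        Json::Object(map) => {
            map.retain(|key, _| !POOL_KEYS.contains(&key.as_str()));
            for child in map.values_mut() {
                scrub_location_fields(child);
            }
        }
        Json::List(items) => {
            for item in items.iter_mut() {
                scrub_location_fields(item);
            }
        }
        Json::Num(_) | Json::Text(_) => {}
    }
}

/// Builds a toy table node whose only difference between parses is the
/// pool ID stored in `captionS`.
fn table_with_caption_id(id: i64) -> Json {
    let mut map = BTreeMap::new();
    map.insert("t".to_string(), Json::Text("Table".to_string()));
    map.insert("captionS".to_string(), Json::Num(id));
    Json::Object(map)
}
```

With captionS in the list, table_with_caption_id(17) and table_with_caption_id(15) compare equal after scrubbing. Without it they differ only in a dangling pool ID, which is the failure described above.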
Test plan
- Confirm that the AST change (Str "\<space>X" → Str "<NBSP>X") doesn't break a downstream consumer that was working around the previous behavior. None expected, but worth a glance at quarto-web rendering of one of the three linked docs to be safe.

🤖 Generated with Claude Code