Use UTF-8 instead of ASCII for character data at lexical levels 0 and 1 #5

Merged
philliphoff merged 1 commit into main from philliphoff/fix-utf8-text-encoding
May 19, 2026
Conversation

@philliphoff
Owner

Problem

ConvertCharacterData in Iso8211FieldReader.cs uses Encoding.ASCII for lexical levels 0 and 1, which replaces any byte above 127 with ?. Real-world S-101 datasets (e.g., Canadian Hydrographic Service) encode French accented characters in UTF-8, so text like Île d'Orléans is decoded as ??le d'Orl??ans.
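The mismatch described above can be reproduced outside the reader. The following is a minimal sketch, not code from this repository: the class name `LossyDecodeDemo` and its members are hypothetical, and the byte array is simply the UTF-8 encoding of Île d'Orléans.

```csharp
using System.Text;

// Hypothetical helper for illustration only; not part of Iso8211FieldReader.
public static class LossyDecodeDemo
{
    // "Île d'Orléans" as UTF-8: Î is C3 8E and é is C3 A9; every other byte is plain ASCII.
    public static readonly byte[] Sample =
    {
        0xC3, 0x8E, 0x6C, 0x65, 0x20, 0x64, 0x27,
        0x4F, 0x72, 0x6C, 0xC3, 0xA9, 0x61, 0x6E, 0x73
    };

    // Old behavior: each byte above 0x7F is replaced with '?'.
    public static string DecodeAsAscii(byte[] data) => Encoding.ASCII.GetString(data);

    // New behavior: each two-byte sequence decodes to a single character.
    public static string DecodeAsUtf8(byte[] data) => Encoding.UTF8.GetString(data);
}
```

`DecodeAsAscii(Sample)` yields `??le d'Orl??ans` (two `?` per accented character, one for each byte), while `DecodeAsUtf8(Sample)` yields `Île d'Orléans`.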

Fix

Changed Encoding.ASCII to Encoding.UTF8 in the ConvertCharacterData method. This is safe because:

  • UTF-8 is a strict superset of ASCII, so all ASCII text decodes identically
  • Real-world S-101 datasets use UTF-8 for accented characters
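As a rough illustration of the shape of the change, here is a sketch under stated assumptions. The real signature of ConvertCharacterData is not shown in this PR, so `CharacterDataSketch.Decode` and its parameters are hypothetical, and the handling of lexical level 2 is deliberately left out rather than guessed.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for ConvertCharacterData; names and signature are assumptions.
public static class CharacterDataSketch
{
    public static string Decode(byte[] data, int lexicalLevel)
    {
        if (data is null) throw new ArgumentNullException(nameof(data));

        if (lexicalLevel < 2)
        {
            // Previously Encoding.ASCII. UTF-8 gives the same result for bytes 0x00-0x7F
            // and additionally decodes multi-byte sequences.
            return Encoding.UTF8.GetString(data);
        }

        // Level 2 is not touched by this change, so the sketch does not model it.
        throw new NotSupportedException($"Lexical level {lexicalLevel} is not covered by this sketch.");
    }
}
```

One behavior worth keeping in mind: with `Encoding.UTF8`, bytes above 0x7F that do not form valid UTF-8 (for example a lone 0xE9) decode to the replacement character U+FFFD rather than `?`.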

Tests

Added 3 new tests to Iso8211FieldReaderTests.cs:

  • ASCII text still decodes correctly at lexical level 0 (regression)
  • UTF-8 encoded Île d'Orléans decodes correctly at lexical level 0
  • UTF-8 encoded Île d'Orléans decodes correctly at lexical level 1

All 54 field reader tests pass.
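The three cases listed above could be expressed roughly as follows. This is a framework-free sketch, not the actual contents of Iso8211FieldReaderTests.cs: `DecodingChecks.Run` is a hypothetical helper, and the decoder is passed in as a delegate because the reader's real API is not shown in this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical check runner for illustration; the real tests use the project's test framework.
public static class DecodingChecks
{
    // Returns a description of each failed case; an empty list means all three passed.
    public static List<string> Run(Func<byte[], int, string> decode)
    {
        var cases = new (string Name, string Text, int Level)[]
        {
            ("ASCII at level 0", "Hello", 0),
            ("UTF-8 at level 0", "Île d'Orléans", 0),
            ("UTF-8 at level 1", "Île d'Orléans", 1),
        };

        var failures = new List<string>();
        foreach (var (name, text, level) in cases)
        {
            // Encode the expected text as UTF-8, then decode it with the function under test.
            string actual = decode(Encoding.UTF8.GetBytes(text), level);
            if (actual != text) failures.Add($"{name}: expected '{text}', got '{actual}'");
        }
        return failures;
    }
}
```

Passing `(bytes, level) => Encoding.UTF8.GetString(bytes)` returns an empty list, whereas passing `(bytes, level) => Encoding.ASCII.GetString(bytes)` reports the two UTF-8 cases as failures and still passes the ASCII regression case.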

Change ConvertCharacterData in Iso8211FieldReader to use Encoding.UTF8
instead of Encoding.ASCII for lexical levels below 2. This fixes decoding
of UTF-8 encoded text (e.g. French accented characters like Île d'Orléans)
in real-world S-101 datasets. UTF-8 is backward-compatible with ASCII so
existing pure-ASCII data continues to decode identically.

Add tests verifying ASCII regression and UTF-8 decoding at levels 0 and 1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@philliphoff philliphoff merged commit 408d3d8 into main May 19, 2026
5 checks passed