Incorrect parsing of <angle brackets > as text characters #779

@levensta

  • Are you running the latest version?
  • Have you included sample input, output, error, and expected output?
  • Have you checked if you are using correct configuration?
  • Did you try online tool?
  • Have you checked the docs for helpful APIs and examples?

Description

I encountered a case where text content such as <...> inside a tag is handled incorrectly by the parser. The parser tries to interpret it as a tag name and breaks the tree structure. The same happens with the single character U+2026 (Horizontal Ellipsis).

What does <...> mean?

In text, <...> usually means that something has been intentionally left out. It is a placeholder for omitted words, or for text that is not being shown, quoted, or repeated.

That said, the . character can indeed be used in an XML tag name, but only if it is not the first character (see NameStartChar in the spec). The ellipsis character, in turn, seems to be invalid even as a NameChar. Please see the Names and Tokens section here: https://www.w3.org/TR/xml/#sec-common-syn
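
To make the naming rule concrete, here is a minimal sketch of a check against the Name production. The character ranges are transcribed from the NameStartChar and NameChar productions in the XML 1.0 spec; the helper name isValidXmlName is hypothetical and is not part of the parser's API.

```javascript
// NameStartChar ranges from the XML 1.0 "Names and Tokens" section.
const NAME_START_CHAR =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}" +
  "\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}" +
  "\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";

// NameChar adds "-", ".", digits, U+00B7, and two more ranges.
const NAME_CHAR =
  NAME_START_CHAR + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// Name ::= NameStartChar (NameChar)*
const XML_NAME = new RegExp(`^[${NAME_START_CHAR}][${NAME_CHAR}]*$`, "u");

// Hypothetical helper: true if `name` matches the Name production.
function isValidXmlName(name) {
  return XML_NAME.test(name);
}

console.log(isValidXmlName("p"));      // a plain tag name
console.log(isValidXmlName("a.b"));    // "." is allowed after the first character
console.log(isValidXmlName("..."));    // "." is not a NameStartChar
console.log(isValidXmlName("\u2026")); // Horizontal Ellipsis is in neither set
```

Under this check, both "..." and the U+2026 character are rejected as names, which is why they could be treated as text rather than as tags.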

Also, while writing this issue, I tried using a single angle bracket character on its own and saw strange behavior. I've described this below as the second case.

I assume that the correct parser behavior, when the angle brackets don't form a valid tag name according to the XML specification, would be to leave the angle bracket characters in the #text node or to escape them as entities.
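
As an illustration of the "escape them as entities" option, here is a rough sketch of a producer-side workaround: escaping the text before it is placed into the document, so that the parser never sees a bare < in character data. The helper name escapeXmlText is hypothetical. It also assumes the text contains no entities that are already escaped, since every & is rewritten.

```javascript
// Hypothetical helper: escape characters that act as markup delimiters in XML text.
// "&" is replaced first so the entities added afterwards are not escaped twice.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// The second case from this issue, with the text escaped before embedding.
const secondCaseEscaped =
  `<root><p>${escapeXmlText("if (1 < 3) return text;")}</p></root>`;

console.log(secondCaseEscaped);
// <root><p>if (1 &lt; 3) return text;</p></root>
```

This only helps when the input is generated by your own code. It does not cover documents that arrive with unescaped angle brackets already in them.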

Input

Code

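// Assumes the fast-xml-parser package, which exports XMLParser and XMLBuilder
const { XMLParser, XMLBuilder } = require("fast-xml-parser");

const xmlParser = new XMLParser({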
  preserveOrder: true,
  allowBooleanAttributes: true,
  ignoreAttributes: false,
  ignoreDeclaration: true,
});

const firstCase = `<?xml version="1.0"?>
<root>
  <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies <...> Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis. <...> Nulla gravida erat a tortor sollicitudin laoreet.</p>
  <foo></foo>
</root>`;

const secondCase = `<?xml version="1.0"?>
<root>
<p>if (1 < 3) return text;</p>
</root>
`;

const jObj = xmlParser.parse(firstCase); // replace firstCase with secondCase to reproduce the second case

const xmlBuilder = new XMLBuilder({
  ignoreAttributes: false,
  preserveOrder: true,
});
const xmlContent = xmlBuilder.build(jObj);
console.log(xmlContent);

Output

In the first case:

jObj
[
  {
    "root": [
      {
        "p": [
          {
            "#text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies"
          },
          {
            "...": [
              {
                "#text": "Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis."
              },
              {
                "...": [
                  {
                    "#text": "Nulla gravida erat a tortor sollicitudin laoreet."
                  }
                ]
              },
              {
                "foo": []
              }
            ]
          }
        ]
      }
    ]
  }
]

xmlContent
<root><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies<...>Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis.<...>Nulla gravida erat a tortor sollicitudin laoreet.</...><foo></foo></...></p></root>

In the second case:

jObj
[
  {
    "root": [
      {
        "p": [
          {
            "#text": "if (1"
          },
          {
            "": [],
            ":@": {
              "@_3)": true,
              "@_return": true,
              "@_text;": true,
              "@_</p": true
            }
          }
        ]
      }
    ]
  }
]

xmlContent
<root><p>if (1< 3) return text; </p></></p></root>
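
Until the parser handles this, the broken trees above can at least be detected after parsing. The following is a minimal sketch that walks a preserveOrder result and collects keys that cannot be tag names. The function name findSuspiciousTagNames is hypothetical, the name check is a simplified ASCII-only stand-in for the full XML Name production, and the "#text" and ":@" keys are assumed to be the ones shown in the output above.

```javascript
// Simplified, ASCII-only stand-in for the XML Name production (an assumption for brevity).
const ASCII_XML_NAME = /^[:A-Z_a-z][:A-Z_a-z\-.0-9]*$/;

// Hypothetical helper: collect tag keys in a preserveOrder tree that are not valid names.
function findSuspiciousTagNames(nodes, found = []) {
  for (const node of nodes) {
    for (const [key, value] of Object.entries(node)) {
      if (key === "#text" || key === ":@") continue; // text node or attribute holder
      if (!ASCII_XML_NAME.test(key)) found.push(key);
      if (Array.isArray(value)) findSuspiciousTagNames(value, found); // recurse into children
    }
  }
  return found;
}

// Shortened copies of the two jObj results shown above.
const firstCaseTree = [{ root: [{ p: [
  { "#text": "Lorem ipsum" },
  { "...": [
    { "#text": "Sed sagittis" },
    { "...": [{ "#text": "Nulla gravida" }] },
    { foo: [] },
  ] },
] }] }];

const secondCaseTree = [{ root: [{ p: [
  { "#text": "if (1" },
  { "": [], ":@": { "@_3)": true, "@_return": true } },
] }] }];

console.log(findSuspiciousTagNames(firstCaseTree));  // [ '...', '...' ]
console.log(findSuspiciousTagNames(secondCaseTree)); // [ '' ]
```

A non-empty result signals that some text was swallowed as markup, so the caller can reject the document or fall back to escaping the source text before parsing.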

Expected output

In the first case:

<root><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies <...> Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis. <...> Nulla gravida erat a tortor sollicitudin laoreet.</p><foo></foo></root>

In the second case:

<root><p>if (1 < 3) return text;</p></root>
