[bug] Single word Wikipedia headers are removed while multiple words are parsed #998

@thiswillbeyourgithub

Hi,

For context, I noticed this issue on my karakeep instance, a FOSS bookmarking solution. I opened an issue there and was told the bug might be on Readability's side. So I opened the webpage in Firefox and confirmed that the bug is indeed caused by Readability.


Describe the Bug

I noticed that saving Wikipedia pages to karakeep would sometimes drop section headers.

I finally figured out how to reproduce this: headers that consist of a single word are always removed, whereas those that span multiple words are kept.

This might happen outside of Wikipedia too, but on other sites I have no way of knowing what is missing.

Steps to Reproduce

  1. In Firefox, go to German tank problem.
  2. Look at its table of contents and notice that some section titles are single words, like Suppositions, while others span multiple words, like Historical example of the problem.
Image
  3. Open Readability on that page.
  4. Search for each section header from the table of contents. Present are Historical example of the problem, Frequentist analysis and Bayesian Analysis, as well as See also, Further reading and External links. Missing are Suppositions, Example, Countermeasures, Notes and References.
  5. Notice that the missing titles have one thing in common: they are all single words.
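
To make the manual search in the steps above less error-prone, here is a minimal sketch of the same check in Python. It is not part of Readability or karakeep. The function names are made up for illustration, and `READER_TEXT` is a hypothetical stand-in for the text copied out of the Readability view: it contains only the headers I found, with a placeholder instead of the real article body. A header counts as present only when it appears on a line of its own, so that Example is not matched inside Historical example of the problem.

```python
# Section headers from the article's table of contents, as listed in this report.
TOC_HEADERS = [
    "Suppositions",
    "Example",
    "Historical example of the problem",
    "Frequentist analysis",
    "Bayesian Analysis",
    "Countermeasures",
    "See also",
    "Notes",
    "References",
    "Further reading",
    "External links",
]

# Hypothetical stand-in for the text copied from the Readability view.
# Only the headers reported as present are included; the body is a placeholder.
READER_TEXT = "\n".join([
    "German tank problem",
    "(body text)",
    "Historical example of the problem",
    "(body text)",
    "Frequentist analysis",
    "(body text)",
    "Bayesian Analysis",
    "(body text)",
    "See also",
    "Further reading",
    "External links",
])


def missing_headers(headers, reader_text):
    """Return the headers that do not appear on a line of their own."""
    lines = {line.strip().casefold() for line in reader_text.splitlines()}
    return [h for h in headers if h.casefold() not in lines]


def is_single_word(header):
    """True when the header contains no whitespace-separated second word."""
    return len(header.split()) == 1


missing = missing_headers(TOC_HEADERS, READER_TEXT)
present = [h for h in TOC_HEADERS if h not in missing]

print("Missing:", missing)
print("All missing headers are single words:", all(is_single_word(h) for h in missing))
print("All present headers are multi-word:", not any(is_single_word(h) for h in present))
```

With the real reader-view text pasted into `READER_TEXT`, the same script can be reused on other pages to see whether the single-word pattern holds outside of Wikipedia.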

Expected Behaviour

All headers should be kept.

Screenshots or Additional Context

Here is the table of contents:

Image

And here, notice that the Suppositions section header is missing:

Image Image

But other section headers do exist:

Image

Device Details

No response

Exact Firefox Version

144.0
