After addressing issues #838, #839 and #840, I pushed revised XML metadata to DataCite for all records. The new validation check implemented in PR #916 failed for a total of 18 DOIs (out of ~180k) due to the presence of control characters (other than whitespace) from incorrectly formatted LaTeX encodings (unescaped backslashes).
The validation implemented in the HEPData code raised an exception, but the metadata was still pushed successfully to DataCite without errors. It seems that metadata is stripped of control characters on the DataCite side. Therefore, this is not a practical problem, but nevertheless we should check that new uploads do not contain non-whitespace control characters (HEPData/hepdata-validator#70). To avoid validation errors, we could also strip non-whitespace control characters from the DataCite XML before validating against the DataCite schema, e.g.
```python
import unicodedata

xml = "".join(char for char in xml if unicodedata.category(char) != "Cc" or char in "\n\r\t")
```
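As a rough illustration of how these characters can arise and what the stripping does, here is a minimal sketch. The helper names `find_control_chars` and `strip_control_chars` and the constant `ALLOWED_WHITESPACE` are hypothetical and not part of the HEPData code. It assumes the control characters come from escape sequences such as `\b` being interpreted when the backslash is left unescaped, shown here with ordinary Python string literals.

```python
import unicodedata

# Whitespace control characters that are allowed to remain (assumption: same set as above).
ALLOWED_WHITESPACE = "\n\r\t"


def find_control_chars(text):
    """Return (index, code point) pairs for non-whitespace control characters."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_WHITESPACE
    ]


def strip_control_chars(text):
    """Drop control characters (Unicode category Cc) except allowed whitespace."""
    return "".join(
        ch
        for ch in text
        if unicodedata.category(ch) != "Cc" or ch in ALLOWED_WHITESPACE
    )


# In a normal string literal, "\b" is parsed as a backspace (U+0008).
broken = "$t\bar{t}$"
# A raw string keeps the backslash, so the LaTeX survives intact.
intact = r"$t\bar{t}$"

print(find_control_chars(broken))   # [(2, 'U+0008')]
print(strip_control_chars(broken))  # $tar{t}$
print(find_control_chars(intact))   # []
```

Note that stripping only removes the control character: the result `$tar{t}$` passes validation but is still not the intended LaTeX, which is why checking new uploads at validation time remains worthwhile.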
The 18 affected DOIs are:

- `comment` contains `$t\bar{t}$` for 81945, 81945.v1, 81945.v1/t1 and 81945.v1/t2.
- `description` contains `$m_{t\bar{t}}$` for 99692.v1/t15, 99692.v1/t16, 99692.v1/t17, 99692.v1/t18, 99692.v1/t19 and 99692.v1/t20.
- `observables` contains `$\frac{1}{N_{\\mathrm{\\gamma}}}\frac{\\mathrm{d}^3N}{\\mathrm{d}z_{\\mathrm{T}}\\\ mathrm{d}\\Delta\varphi\\mathrm{d}\\Delta\\eta}$` for 98564.v1/t4.
- `description` contains `$c\bar{c}$` for 96269.v1/t14 and 96269.v1/t17.
- `description` contains `\abs{V_{tq}}` for 95117.v1/t23.
- `description` contains `\bold{p}_X^B` for 131599.v1/t5 and 131599.v1/t6.
- `description` contains `tab$\beta$` for 155628.v1/t43.
- `description` contains `\as` for 157601.v1/t5.