fix: handle escaped backslashes in SQL string literals #835
RamiNoodle733 wants to merge 1 commit into andialbrecht:master
The tokenizer regex for single-quoted strings didn't properly handle escaped backslashes (\\). This caused strings containing escaped backslashes to be parsed incorrectly. This fix adds \\ to the regex pattern to properly match escaped backslashes within string literals. Fixes andialbrecht#814
Pull request overview
This PR fixes a bug in SQL string literal tokenization where escaped backslashes were incorrectly parsed. The tokenizer regex for single-quoted strings didn't properly handle the sequence \\ (escaped backslash), causing the regex to treat the closing quote as part of the string content when preceded by \\.
Changes:
- Modified the single-quoted string regex in `sqlparse/keywords.py` to explicitly match `\\\\` (escaped backslashes) before `\\'` (escaped quotes)
- Added comprehensive tests for escaped backslashes, escaped quotes (SQL standard), and backslash-escaped quotes
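
To make the failure mode concrete, here is a minimal sketch that uses only Python's `re` module rather than sqlparse's lexer. The two patterns are copied from the diff in this PR; the helper `string_tokens` and the variable names are illustrative and not part of sqlparse, and the input is assumed to be the two-backslash example from the issue.

```python
import re

# Patterns copied from the diff (sqlparse/keywords.py), before and after the fix.
OLD_SINGLE = re.compile(r"'(''|\\'|[^'])*'")
NEW_SINGLE = re.compile(r"'(''|\\\\|\\'|[^'])*'")

# Assumed input: two string literals, each holding one escaped backslash.
sql = r"SELECT '\\', '\\'"


def string_tokens(pattern, text):
    """Hypothetical helper: list every substring the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


old_matches = string_tokens(OLD_SINGLE, sql)
new_matches = string_tokens(NEW_SINGLE, sql)

# Old pattern: a single over-long match that swallows the comma and the
# opening quote of the second literal.
print(old_matches)
# New pattern: two separate matches, one per literal.
print(new_matches)
```

With the old pattern, the second backslash pairs with the closing quote and is consumed as `\'`, so the match only ends at the next quote it finds. Trying `\\\\` first consumes the backslashes as a pair and leaves the closing quote free to terminate the literal.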
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sqlparse/keywords.py | Fixed regex pattern for single-quoted strings to handle escaped backslashes by adding \\\\ alternative |
| tests/test_tokenize.py | Added three test functions covering escaped backslashes, SQL-standard escaped quotes, and backslash-escaped quotes |

    def test_escaped_backslash_in_string():
        # issue814 - Escaped backslashes in string literals
        sql = r"SELECT '\\\\', '\\\\'"
        tokens = list(lexer.tokenize(sql))
        # Should have: SELECT, space, string, comma, space, string
        assert len(tokens) == 6
        assert tokens[0] == (T.Keyword.DML, 'SELECT')
        assert tokens[1] == (T.Whitespace, ' ')
        # The string contains two backslashes in the SQL, which is represented
        # as 4 backslashes in the Python raw string
        assert tokens[2] == (T.Literal.String.Single, "'\\\\\\\\'")
        assert tokens[3] == (T.Punctuation, ',')
        assert tokens[4] == (T.Whitespace, ' ')
        assert tokens[5] == (T.Literal.String.Single, "'\\\\\\\\'")

Consider adding a test case that exactly matches the issue example (r"SELECT '\\', '\\'" with 2 backslashes per string) to ensure that specific reported case is covered. The current test with 4 backslashes is good for thorough testing, but having the exact issue case would make it clearer that the bug is fixed.

    -    (r"'(''|\\'|[^'])*'", tokens.String.Single),
    +    (r"'(''|\\\\|\\'|[^'])*'", tokens.String.Single),
         # not a real string literal in ANSI SQL:
         (r'"(""|\\"|[^"])*"', tokens.String.Symbol),

The double-quoted string pattern on line 64 has the same vulnerability as the single-quoted pattern that's being fixed. It should also include \\\\ to handle escaped backslashes correctly. The pattern should be r'"(""|\\\\|\\"|[^"])*"' to match the fix being applied to single-quoted strings.

    -    (r'"(""|\\"|[^"])*"', tokens.String.Symbol),
    +    (r'"(""|\\\\|\\"|[^"])*"', tokens.String.Symbol),

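
The double-quoted suggestion can be checked the same way. The sketch below is a standalone `re` experiment, not sqlparse's own lexer: the two patterns are copied from the suggestion, while the sample inputs and the names `OLD_SYMBOL`, `NEW_SYMBOL`, and `full_match` are assumptions chosen for illustration. It also checks that the suggested pattern still accepts doubled quotes and backslash-escaped quotes.

```python
import re

# Current and suggested patterns for double-quoted symbols, copied from the
# suggestion; everything else in this block is illustrative.
OLD_SYMBOL = re.compile(r'"(""|\\"|[^"])*"')
NEW_SYMBOL = re.compile(r'"(""|\\\\|\\"|[^"])*"')

# Assumed input mirroring the single-quoted case from the issue.
sql = r'SELECT "\\", "\\"'

old_symbols = [m.group(0) for m in OLD_SYMBOL.finditer(sql)]
new_symbols = [m.group(0) for m in NEW_SYMBOL.finditer(sql)]


def full_match(pattern, text):
    """Hypothetical helper: True if the pattern consumes the entire text."""
    return pattern.fullmatch(text) is not None


# Regression checks: both existing escape forms should still be one token.
doubled_quote_ok = full_match(NEW_SYMBOL, '"a""b"')
escaped_quote_ok = full_match(NEW_SYMBOL, r'"a\"b"')

print(old_symbols)
print(new_symbols)
print(doubled_quote_ok, escaped_quote_ok)
```

Under these assumptions the current pattern produces one over-long match, while the suggested pattern yields one match per quoted symbol and leaves the two existing escape forms intact.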

        # The string contains two backslashes in the SQL, which is represented
        # as 4 backslashes in the Python raw string

The comment is confusing. It says "The string contains two backslashes in the SQL" but the SQL source code actually contains 4 backslashes per string literal ('\\\\'). The comment should clarify whether it's referring to the SQL source code (4 backslashes) or the interpreted string value in databases that support backslash escaping (2 backslashes).

    -    # The string contains two backslashes in the SQL, which is represented
    -    # as 4 backslashes in the Python raw string
    +    # Each SQL string literal contains four backslashes in the source, which
    +    # databases with backslash escaping interpret as two backslashes; this is
    +    # written as four backslashes in the Python raw string

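
The backslash counts are easy to get wrong, so here is a small sketch that counts them directly. It takes the raw string from the test, slices out the first literal, and compares it with the non-raw spelling used in the assertions. The `replace` call is only a crude stand-in for how a database with backslash escaping would decode the literal; it is an assumption for illustration, not sqlparse behaviour.

```python
# The SQL source exactly as written in the test (a Python raw string).
sql = r"SELECT '\\\\', '\\\\'"

# Slice out the first literal, including its surrounding quotes.
first_literal = sql.split(", ")[0][len("SELECT "):]

# Number of backslash characters in the SQL source of that literal.
source_backslashes = first_literal.count("\\")

# The same literal spelled as a regular (non-raw) Python string needs
# eight backslashes, as in the test's assertions.
same_as_non_raw = first_literal == "'\\\\\\\\'"

# Crude stand-in for backslash-escape decoding: each escaped pair
# collapses to a single backslash.
decoded = first_literal.strip("'").replace("\\\\", "\\")

print(source_backslashes, same_as_non_raw, len(decoded))
```

So the SQL source holds four backslashes per literal, which matches the reworded comment: four in the source, two after decoding.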