fix: Validate the presence of null unicode in output #3164

gregfurman merged 3 commits into hatchet-dev:main
Conversation
```python
if "\\u0000" in serialized_output:
    raise IllegalTaskOutputError(dedent(f"""
        Task outputs cannot contain the unicode null character \\u0000
```
it might be easier to leave the check as it was before (using the serialized_output) instead of moving it up, because the way this is implemented we need to have the serialization code in multiple places now. WDYT?
The issue here is that this dump_json serialises the unicode character in the output string, making it difficult to tell whether this is an illegal character ("\u0000" == "\x00") or a correctly escaped string ("\\u0000").
Like, the following cases are false positives that are incorrectly rejected by the string-based check, despite the unicode char not being present:

```python
import json

assert "\\u0000" in json.dumps({"foo\\u0000": "bar"})
assert "\\u0000" in json.dumps({"foo\\\\u0000": "bar"})
```

TBH all of this makes me think that we shouldn't even be handling this validation on the SDK/client side, instead just allowing the server to reject invalid requests...
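To make the ambiguity concrete, here's a standard-library illustration (not from the PR itself) of why the substring check can't distinguish a real null byte from escaped text:

```python
import json

# A real U+0000 null byte is escaped by json.dumps as the six
# characters \u0000 in the serialized text, so the substring
# check fires correctly here...
real = json.dumps({"foo": "bar\x00"})
assert "\\u0000" in real

# ...but the literal six-character text \u0000 round-trips with a
# doubled backslash, and the substring check still fires even
# though no null byte exists anywhere in the payload.
escaped = json.dumps({"foo": "bar\\u0000"})
assert "\\u0000" in escaped
assert "\x00" not in json.loads(escaped)["foo"]
```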
ah yeah, makes sense 🤔
I think that doing this only server-side makes sense if we can raise this same sort of error on the SDK, but it might be complicated because of how we're sending payloads to the server, which I don't think will allow us to raise an error from the task. There's also an argument that raising the error from the task doesn't really help at all anyways, so if you subscribe to that theory, I'm open to removing this check on the SDK!
The error response from the server describes this. Unsure how clear it'll be to a user though 👇

```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Invalid request: encoded jsonb contains invalid null character \u0000 in field `payload`"
	debug_error_string = "UNKNOWN:Error received from peer {grpc_status:3, grpc_message:"Invalid request: encoded jsonb contains invalid null character \\u0000 in field `payload`"}"
>
```
yeah, I think the issue here is that if you e.g. had OTel set up or were using something like Sentry, you'd see this error in your worker logs, but I don't think (need to verify) that it'd be tied to a task, which might make it hard to debug. I'm sure we can work around this somehow though
Ah sorry yeah I think that's what I was suggesting might not work - you mean catching the error from the server on the worker, right? We can do that, but I think the issue is in that case, the error won't be tied to the task that caused it without a bigger refactor (assuming I'm not missing anything here, happy to be wrong on this one!)
Makes sense. Sounds like exclusively handling validations server-side is out-of-scope for this PR. Also, we should probably be additionally validating the input payload for this \u0000 value, unless we're OK with a gRPC error being returned.
Otherwise, are you happy with the current approach? Or do you think we need some e2e testing to be safe?
yep! happy with it, but I don't love `contains_null_unicode_character` as-is, since I think it'll be annoying to maintain in the future given that it differs from the TypeAdapter logic we have elsewhere heh
So should we try and include this null unicode check as a pydantic validation step?
If the type adapter can do that, that'd be awesome. If not, there might be no free lunch here, and we either need to have it as-is or do it post-serialization on the bytes/string directly, which would be unfortunate as you mentioned before.
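For reference, a minimal pre-serialization check along these lines could look like the sketch below. This is a hypothetical reconstruction, not the SDK's actual `contains_null_unicode_character`, and it only walks plain containers (the real util may also need to handle pydantic models):

```python
from typing import Any


def contains_null_unicode_character(value: Any) -> bool:
    """Recursively check a not-yet-serialized value for U+0000.

    Hypothetical sketch of a pre-serialization check; the real
    util in the SDK may differ.
    """
    if isinstance(value, str):
        # Checks for the actual null character, not the escaped text \u0000.
        return "\x00" in value
    if isinstance(value, dict):
        return any(
            contains_null_unicode_character(k) or contains_null_unicode_character(v)
            for k, v in value.items()
        )
    if isinstance(value, (list, tuple, set)):
        return any(contains_null_unicode_character(item) for item in value)
    return False


assert contains_null_unicode_character({"foo": "bar\x00"}) is True
assert contains_null_unicode_character({"foo\\u0000": "bar"}) is False  # escaped text, no null byte
assert contains_null_unicode_character(3.14) is False
```

Because it runs before serialization, this avoids the `\u0000`-vs-`\\u0000` ambiguity entirely, at the cost of duplicating traversal logic that the TypeAdapter already has.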
```python
assert contains_null_unicode_character(3.14) is False
assert contains_null_unicode_character(True) is False
```

```python
from pydantic import BaseModel
```

I think we should pull these imports up to the top level
mrkaye97 left a comment
qq: did you try creating a task that returns something with the null unicode sequence and allowing it to get through the SDK (can just comment out the check) to make sure it errors correctly instead of being rejected by the db?
Yeah I did. The server returns a gRPC error that is raised. Otherwise, without either the SDK or Golang check, Postgres rejects the payload with its own error. Should I add an e2e test case for this?
@mrkaye97 I've reworked the approach to instead use pydantic's WrapSerializer to automatically validate upon serialization.

Tbh this is not the cleanest approach, and it'd probably be better to use a …

Lmk if you're happy with this, else I'm open to suggestions for how to close that original issue 😅
Hmm... what happens if the …?

I also have to wonder: is it possible to use regex or some such with the original solution? I feel like there has to be a more straightforward way to do this.... 😅
I think something like this on the serialized output would work:

```python
import re

# Matches \u0000 in JSON that represents an actual null byte, NOT the escaped text
# \\u0000. In JSON, a real null byte is serialized as \u0000, while the literal text
# \u0000 is serialized as \\u0000. We match \u0000 preceded by an even number of
# backslashes (including zero); an odd count means the backslash before 'u' is itself
# escaped, making it literal text rather than a null escape.
_REAL_NULL_ESCAPE_RE = re.compile(r"(?:^|[^\\])(?:\\\\)*\\u0000")


def _json_contains_null_byte(serialized: str) -> bool:
    """Check if a JSON string contains an escaped null byte (\\u0000) that represents
    an actual null character, as opposed to the literal text "\\u0000".

    JSON serializers escape actual null bytes as \\u0000, but they escape the literal
    text "\\u0000" as \\\\u0000. The naive substring check `"\\u0000" in s` matches
    both, causing false positives.
    """
    return bool(_REAL_NULL_ESCAPE_RE.search(serialized))
```
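To sanity-check the even-backslash reasoning, here's a short self-contained snippet (re-declaring the regex so it runs on its own):

```python
import json
import re

_REAL_NULL_ESCAPE_RE = re.compile(r"(?:^|[^\\])(?:\\\\)*\\u0000")


def _json_contains_null_byte(serialized: str) -> bool:
    # True only when \u0000 is preceded by an even number of backslashes,
    # i.e. when it encodes an actual null byte.
    return bool(_REAL_NULL_ESCAPE_RE.search(serialized))


# A real null byte serializes to \u0000 and is flagged.
assert _json_contains_null_byte(json.dumps({"k": "a\x00b"})) is True
# The literal text \u0000 serializes to \\u0000 and is not flagged.
assert _json_contains_null_byte(json.dumps({"k": "a\\u0000b"})) is False
# A literal backslash followed by a real null byte is still flagged.
assert _json_contains_null_byte(json.dumps({"k": "a\\" + "\x00"})) is True
```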
Nice! That is definitely cleaner.
Description

This PR addresses the Hatchet server and Python SDK incorrectly interpreting escaped unicode as invalid.

For example, validations would interpret `"\\u0000"` (which is actually the characters `\, u, 0, 0, 0, 0`) as the unicode null character `\u0000`.

Fixes #3153
What's Changed

- `ValidateJSONB` now accounts for properly escaped unicode (i.e. `"\\u0000"`).
- `contains_null_unicode_character` util.