Skip to content

fix: Validate the presence of null unicode in output#3164

Merged
gregfurman merged 3 commits intohatchet-dev:mainfrom
gregfurman:fix/repository/jsonb-unicode
Mar 14, 2026
Merged

fix: Validate the presence of null unicode in output#3164
gregfurman merged 3 commits intohatchet-dev:mainfrom
gregfurman:fix/repository/jsonb-unicode

Conversation

@gregfurman
Copy link
Copy Markdown
Collaborator

Description

This PR addresses the hatchet server and python SDK incorrectly interpreting escaped unicode as invalid.

For example, validations would interpret "\\u0000" (which is actually [\, u, 0, 0, 0, 0]) as the unicode for\u0000.

Fixes #3153

Type of change

  • Bug fix (non-breaking change which fixes an issue)

What's Changed

  • Fixes ValidateJSONB to account for properly escaped unicode (i.e "\\u0000").
  • Updates python SDK to check for null unicode before serialization using new contains_null_unicode_character util.

@gregfurman gregfurman self-assigned this Mar 4, 2026
@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 4, 2026

@gregfurman is attempting to deploy a commit to the Hatchet Team on Vercel.

A member of the Team first needs to authorize it.

@gregfurman gregfurman force-pushed the fix/repository/jsonb-unicode branch 4 times, most recently from 50e0462 to 7df3a7c Compare March 5, 2026 13:29
@gregfurman gregfurman requested a review from mrkaye97 March 6, 2026 08:40
@gregfurman gregfurman marked this pull request as ready for review March 6, 2026 08:41
@promptless-for-oss
Copy link
Copy Markdown

📝 Documentation updates detected!

New suggestion: Add troubleshooting section for null unicode character errors


Tip: Enable auto-create PR in your Docs Collection to publish suggestions automatically 🤖

@gregfurman gregfurman requested a review from mnafees March 6, 2026 14:52
if "\\u0000" in serialized_output:
raise IllegalTaskOutputError(dedent(f"""
Task outputs cannot contain the unicode null character \\u0000

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be easier to leave the check as it was before (using the serialized_output) instead of moving it up, because the way this is implemented we need to have the serialization code in multiple places now. WDYT?

Copy link
Copy Markdown
Collaborator Author

@gregfurman gregfurman Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that this dump_json serialises the unicode character in the output string, making it difficult to tell whether this is an illegal character ("\u0000" == "\x00") or a correctly escaped string ("\\u0000").

Like, the following cases are false-positives that are incorrectly rejected by the string-based check, despite the unicode char not being present

import json

assert "\\u0000" in json.dumps({"foo\\u0000":"bar"})
assert "\\u0000" in json.dumps({"foo\\\\u0000":"bar"})

TBH all of this makes me think that we shouldn't even be handling this validation on the SDK/client side, instead just allowing the server to reject invalid requests...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, makes sense 🤔

I think that doing this only server-side makes sense if we can raise this same sort of error on the SDK, but it might be complicated because of how we're sending payloads to the server, which I don't think will allow us to raise an error from the task. there's also an argument that raising the error from the task doesn't really help at all anyways, so if you subscribe to that theory, I'm open to removing this check on the SDK!

Copy link
Copy Markdown
Collaborator Author

@gregfurman gregfurman Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error response from the server describes this. Unsure how clear it'll be to a user though 👇

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Invalid request: encoded jsonb contains invalid null character \u0000 in field `payload`"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:3, grpc_message:"Invalid request: encoded jsonb contains invalid null character \\u0000 in field `payload`"}"
>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think the issue here is that if you e.g. had OTel set up or were using something like Sentry, you'd see this error in your workers logs but I don't think (need to verify) that it'd be tied to a task, which might make it hard to debug. I'm sure we can work around this somehow though

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry yeah I think that's what I was suggesting might not work - you mean catching the error from the server on the worker, right? We can do that, but I think the issue is in that case, the error won't be tied to the task that caused it without a bigger refactor (assuming I'm not missing anything here, happy to be wrong on this one!)

Copy link
Copy Markdown
Collaborator Author

@gregfurman gregfurman Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Sounds like exclusively handing validations server-side is out-of-scope for this PR. Also, we should probably be additionally validating the input payload for this \u0000 value, unless we're OK with a gRPC error being returned.

Otherwise, are you happy with the current approach? Or do you think we need some e2e testing to be safe?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! happy with it but don't love the contains_null_unicode_character as-is since it'll be annoying to maintain in the future I think since it differs from the TypeAdapter logic we have elsewhere heh

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should we try and include this null unicode check as a pydantic validation step?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the type adapter can do that, that'd be awesome - if not, there might be no free lunch here and we either need to have it as-is or do it post-serialization on the bytes / string directly which would be unfortunate as you mentioned before

Comment thread sdks/python/tests/test_serde.py Outdated
assert contains_null_unicode_character(3.14) is False
assert contains_null_unicode_character(True) is False

from pydantic import BaseModel
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pull these imports up to the top level

Copy link
Copy Markdown
Contributor

@mrkaye97 mrkaye97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: did you try creating a task that returns something with the null unicode sequence and allowing it to get through the SDK (can just comment out the check) to make sure it errors correctly instead of being rejected by the db?

@gregfurman
Copy link
Copy Markdown
Collaborator Author

qq: did you try creating a task that returns something with the null unicode sequence and allowing it to get through the SDK (can just comment out the check) to make sure it errors correctly instead of being rejected by the db?

Yeah I did. The server returns a gRPC error that is raised:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Invalid request: encoded jsonb contains invalid null character \u0000 in field `payload`"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:3, grpc_message:"Invalid request: encoded jsonb contains invalid null character \\u0000 in field `payload`"}"

Otherwise, without the SDK nor Golang check, Postgres rejects the payload with the following error:

2026-03-09T13:44:28.375+02:00 ERR error pre-acking message error="could not trigger workflows from names: failed to create tasks: failed to create tasks for step id cd80d7e1-00a5-40aa-88e5-02863d3a43c9: ERROR: unsupported Unicode escape sequence (SQLSTATE 22P05)" service=server

Should I add an e2e test case for this?

@gregfurman gregfurman requested a review from mrkaye97 March 9, 2026 17:42
@gregfurman
Copy link
Copy Markdown
Collaborator Author

@mrkaye97 I've reworked the approach to instead use pydantic's WrapSerializer to automatically validate upon dump_json.

Tbh this is not the cleanest approach, and it'd probably be better to use a BaseModel type with field validation over this Annotated hack, but this approach is backwards compatible + will validate input and output payloads.

Lmk if you're happy with this, else I'm open to suggestions for how to close that original issue 😅

@mrkaye97
Copy link
Copy Markdown
Contributor

mrkaye97 commented Mar 9, 2026

@mrkaye97 I've reworked the approach to instead use pydantic's WrapSerializer to automatically validate upon dump_json.

Tbh this is not the cleanest approach, and it'd probably be better to use a BaseModel type with field validation over this Annotated hack, but this approach is backwards compatible + will validate input and output payloads.

Lmk if you're happy with this, else I'm open to suggestions for how to close that original issue 😅

Hmm... what happens if the validator is a dataclass, or is dict or something? I don't love this but I do prefer it to listing out all the possible validators though. If it works with dataclass, typeddict, etc., I'm good with it

I also have to wonder: Is it possible to use regex or some such with the original solution? I feel like there has to be a more straightforward way to do this.... 😅

@jfaust
Copy link
Copy Markdown

jfaust commented Mar 11, 2026

I think something like this on the serialized output would work

# Matches \u0000 in JSON that represents an actual null byte, NOT the escaped text
# \\u0000. In JSON, a real null byte is serialized as \u0000, while the literal text
# \u0000 is serialized as \\u0000. We match \u0000 preceded by an even number of
# backslashes (including zero) — an odd count means the backslash before 'u' is itself
# escaped, making it literal text rather than a null escape.
_REAL_NULL_ESCAPE_RE = re.compile(r"(?:^|[^\\])(?:\\\\)*\\u0000")


def _json_contains_null_byte(serialized: str) -> bool:
    """Check if a JSON string contains an escaped null byte (\\u0000) that represents
    an actual null character, as opposed to the literal text "\\u0000".

    JSON serializers escape actual null bytes as \\u0000, but they escape the literal
    text "\\u0000" as \\\\u0000. The naive substring check `"\\u0000" in s` matches
    both, causing false positives.
    """
    return bool(_REAL_NULL_ESCAPE_RE.search(serialized))

@gregfurman gregfurman force-pushed the fix/repository/jsonb-unicode branch from 741bbdb to 7c9820b Compare March 12, 2026 16:13
@gregfurman
Copy link
Copy Markdown
Collaborator Author

@mrkaye97 OK have pushed a regex change instead of the (ridiculously) convoluted pydantic typing approach.

Also, @jfaust thanks for the suggestion! I've adapted your regex to use a negative look-behind which simplifies things a bit 🙂

@jfaust
Copy link
Copy Markdown

jfaust commented Mar 12, 2026

Nice! That is definitely cleaner.

Copy link
Copy Markdown
Contributor

@mrkaye97 mrkaye97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff! mind bumping the patch version of the SDK and writing a changelog entry for this?

and /ht @jfaust for the report + help here too :)

@gregfurman gregfurman requested a review from mrkaye97 March 12, 2026 16:51
@gregfurman gregfurman force-pushed the fix/repository/jsonb-unicode branch from 67064db to 63020d5 Compare March 12, 2026 16:53
@gregfurman gregfurman force-pushed the fix/repository/jsonb-unicode branch 4 times, most recently from 3a21c40 to def1391 Compare March 13, 2026 16:44
@gregfurman gregfurman force-pushed the fix/repository/jsonb-unicode branch from def1391 to d20d75c Compare March 14, 2026 11:14
@gregfurman gregfurman merged commit 794fbe7 into hatchet-dev:main Mar 14, 2026
45 of 46 checks passed
@gregfurman gregfurman deleted the fix/repository/jsonb-unicode branch March 14, 2026 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Python: tasks outputting anything with the substring \u0000 get failed as illegal

4 participants