Skip to content

feat!: add hash_kind to TreeRefIter and Data#2497

Open
Christoph Rüßler (cruessler) wants to merge 16 commits intoGitoxideLabs:mainfrom
cruessler:pass-hash-len-to-tree-ref-iter
Open

feat!: add hash_kind to TreeRefIter and Data#2497
Christoph Rüßler (cruessler) wants to merge 16 commits intoGitoxideLabs:mainfrom
cruessler:pass-hash-len-to-tree-ref-iter

Conversation

@cruessler
Copy link
Copy Markdown
Contributor

@cruessler Christoph Rüßler (cruessler) commented Mar 31, 2026

This PR is an attempt at making the hash length in gix_object::tree::ref_iter::decode depend on the decoded tree’s hash’s size. Previously, it was hardcoded to 20 which is only correct for SHA1, but not for SHA256.

There were quite a few follow-up changes as TreeRef::from_bytes as well as TreeRefIter::from_bytes now take hash_kind as an argument. Also, since TreeRef can be instantiated in Data::decode, Data now also has to carry hash_kind. Another option that I didn’t explore would have been to add hash_kind as an argument to Data::decode.

Update 2026-04-04: I changed the initial addition of hash_len to hash_kind. Tests are now passing, with the exception of the fuzzer having found an issue.

Copy link
Copy Markdown
Member

@Byron Sebastian Thiel (Byron) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! I think it's great to see that overall, passing the hash kind through isn't crazy. I also left a few comments which I hope can lead to a small direction change.

impl<'a> ObjectRef<'a> {
/// Deserialize an object from a loose serialisation
pub fn from_loose(data: &'a [u8]) -> Result<ObjectRef<'a>, LooseDecodeError> {
pub fn from_loose(data: &'a [u8], hash_len: usize) -> Result<ObjectRef<'a>, LooseDecodeError> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very uncommon to take a hash_len I think, and would expect this to be gix_hash::Kind as this is what it boils down to.
Let's do that instead unless there is a very good reason to use a hash_len.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes way more sense, in particular as TreeRefIter and Data don’t grow in size that way.

(The main reason I initially picked hash_len was that it felt more targeted and precise, but in hindsight I agree that gix_hash::Kind is a much better way to carry that information.)

Comment on lines +245 to +252
#[derive(Debug)]
pub struct State {
pub hash_len: usize,
}

pub type Stream<'is> = Stateful<&'is [u8], State>;

pub fn tree<'a, E: ParserError<&'a [u8]>>(stream: &mut Stream<'a>) -> ModalResult<TreeRef<'a>, E> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love it if object_hash: gix_hash::Kind could be a parameter here with everything else unchanged.
Probably this was first attempted and this is what it needs to be for Winnow, but… honestly, maybe the slow version can even be removed in favor of fast_decode. I wonder if anything speaks against that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you suspected, that’s what I tried first. :-) But winnow places certain constraints on what it accepts as a parsing function. Just looking at tree locally, I don’t see why it shouldn’t be possible to rewrite it not to need winnow anymore. I haven’t tried yet, but could, either in this PR or in a follow-up.

Comment on lines +266 to +268
/// TODO:
/// Document.
hash_len: usize,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my mind it's fine to use hash_len if it stays private (after all, that's what the iter really wants). One can even then consider keeping Kind instead as the function to get its len is const for sure and thus has no chance of being worse than the whole usize.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to hash_kind. (Aside: this is called object_hash in other places, I started wondering whether it should also be called that here.)

Using hash_kind instead of hash_len also has the advantage that it can be packed and doesn’t grow the size of either TreeRefIter or Data.

Comment on lines +295 to +297
/// TODO:
/// Document.
pub hash_len: usize,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely gix_hash::Kind here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@cruessler Christoph Rüßler (cruessler) changed the title feat!: add hash_len to TreeRefIter and Data feat!: add hash_kind to TreeRefIter and Data Apr 4, 2026
@cruessler Christoph Rüßler (cruessler) marked this pull request as ready for review April 4, 2026 09:20
@cruessler
Copy link
Copy Markdown
Contributor Author

Sebastian Thiel (@Byron) I made some changes based on your feedback. The PR contains a lot of small commits now, feel free to shuffle them around as you see fit!

The fuzzer seems to have found something, though at first sight, I wasn’t sure what to make of that finding. I’ll try figuring out what this is about, but maybe you’re more familiar with it and can quickly trace it back to one of my changes.

I added one TODO(SHA256) in gix-diff-tests. I plan on addressing it in the follow-up PR that makes gix-diff-tests run with SHA256 (we don’t do that yet).

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7383f0210e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".

/// Parse our tree data and return the parse tree for direct access to its entries.
pub fn decode(&self) -> Result<gix_object::TreeRef<'_>, gix_object::decode::Error> {
gix_object::TreeRef::from_bytes(&self.data)
gix_object::TreeRef::from_bytes(&self.data, self.repo.object_hash())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Decode tree bytes with the tree id hash kind

Tree::decode() now passes self.repo.object_hash() into TreeRef::from_bytes, but the rest of this type (e.g. find_entry()/iter()) uses self.id.kind(). This inconsistency can misdecode or reject valid tree data when a Tree is constructed via Tree::from_data() with an id kind that differs from the repository default (an allowed public API path), because tree entry parsing depends on hash length. Using self.id.kind() here keeps behavior consistent and tied to the actual object id being decoded.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants