Conversation

@Mr-Brocoli commented Dec 25, 2025

I believe that when zstd_fast uses a 6-byte or stronger hash, it is faster to have a ZSTD_match6Found function instead of ZSTD_match4Found. My reasoning is that a 6- or 7-byte hash is almost guaranteed to produce a match of 6 bytes or more; any 4- or 5-byte matches it does find are usually unhappy accidents, and they are not very efficient to encode anyway. So I added an extra 2-byte check after the initial 4 bytes are matched, which almost always goes in our favour. The few extra cycles spent guaranteeing a match length of at least 6 up front prevent the ZSTD_count call that happens later from mispredicting as much and counting 16 bytes rather than 8. Overall this choice seems favourable and improves encode speed, though it is not a dramatic improvement: I estimate it at around 0.8-0.9% compression speed for every zstd_fast mode that uses a 6-byte or stronger hash (Silesia / enwik). Additionally, compression ratio seems to improve very marginally in the 7-byte-hash modes, because some very weak 4-byte matches end up Huffman-encoded as literals instead, which is interesting.
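
To make the idea concrete, here is a rough sketch of the extra probe. The name ZSTD_match6Found and the exact signature are illustrative and may differ from the actual diff; MEM_read32/MEM_read16 and BYTE come from zstd's lib/common/mem.h:

```c
#include "mem.h"   /* BYTE, MEM_STATIC, MEM_read32, MEM_read16 */

/* Sketch only: extend the existing 4-byte equality test by 2 more bytes,
 * so that any match handed onward is already known to be >= 6 bytes long. */
MEM_STATIC int ZSTD_match6Found(const BYTE* ip, const BYTE* match)
{
    /* same 4-byte test a ZSTD_match4Found-style check performs */
    if (MEM_read32(ip) != MEM_read32(match)) return 0;
    /* extra 2-byte probe: with a 6+ byte hash this almost always succeeds,
     * so the later ZSTD_count call starts from a longer confirmed prefix */
    return MEM_read16(ip + 4) == MEM_read16(match + 4);
}
```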

As for testing and thinking about the optimization itself: very hard-to-compress files like x-ray show a small regression, whereas very easy-to-compress files like nci compress much faster (almost ~2%). Mixed data like Silesia as a whole and enwik8/9 get an overall speed boost of 0.8-0.9%. My explanation is that hard-to-compress data accumulates very short matches and rarely reaches match lengths of 12 bytes or more, so it doesn't hit the slow path of ZSTD_count much; for that data, the optimization just spends a few extra cycles per token always probing 2 more bytes. I don't know a lot about data compression formally, so I assume the goal is to win on the averages and on real-world data, for which Silesia and enwik are always cited. My intuition also says I can't think of why an engineer would install zstd to chug through very hard-to-compress files, so gearing the logic toward helping mid- to low-entropy files seems good to me. Also, I was reading about OpenZL and how it can still be useful to run zstd on general text data regardless of what OpenZL does, which is why I thought it was important to make sure dickens and enwik8/9 saw improvements.

Finally, thanks for taking the time to read my reasoning and thoughts. I know changing fast mode is scary, but I wouldn't be writing this if it didn't seem at least a little worth investigating.

FINALLY finally: for now I only made the change in ZSTD_compressBlock_fast_noDict_generic, and only in the non-branching path, to start out small and see whether the change is worthwhile enough to apply to every fast-related function appropriately.

EDIT:
Hm, I want to be honest: I didn't realize just how much instruction alignment matters. After more testing I found that manually adding some NOPs before _start gives a similar speed boost to the pre-emptive 6-byte verification I'm suggesting in this pull request. So now I'm thinking that all the reasoning I had for why my idea could be faster sits within the noise of the compiler ordering instructions slightly differently, such that the algorithm overall happens to run faster. That's an annoying lesson to learn, and it leaves me more confused and with more questions. Determining whether my change is actually helpful would now require inspecting the assembly to see if it just happens to make the ordering better and happier. Well, I'm going to have to think harder than I am now if I'm going to be suggesting changes...
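
For reference, this is the kind of alignment experiment I mean; the byte count and flags below are arbitrary, and the snippet only shifts code placement without touching the algorithm:

```c
/* Alignment experiment only (GCC/Clang on x86-64; the count of 16 is
 * arbitrary). Placed at file scope near the top of a translation unit, this
 * shifts every function that follows by 16 bytes, so any speed difference
 * measured afterwards is code-placement noise, not an algorithmic effect. */
__asm__(".fill 16, 1, 0x90");   /* emit 16 NOP (0x90) bytes */

/* A build-level alternative is to pin alignment with compiler flags, e.g.
 *   make CFLAGS="-O3 -falign-functions=64 -falign-loops=32"
 * and compare against a default build to gauge how much placement alone
 * moves the numbers. */
```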

EDIT 2:
Ah, now with more extensive testing I'm seeing that even "good" changes I'm almost certain should work in principle don't seem to matter: every compilation with slight changes can move the needle in either direction in terms of speed. Now I almost wonder if it's time to optimize the compilers themselves so they handle alignment and layout better and faster. At one point I managed to create a build of zstd that was just randomly 1.6% faster all around. I could close this pull request altogether now, I think, but I'm curious what the thoughts are on these sorts of things. Given the scale at which algorithms like zstd are used, I feel more research should be done into getting the "magically runs faster" builds of applications out of compilers deliberately. It all seems quite... fragile...

meta-cla bot added the CLA Signed label Dec 25, 2025
