Skip to content

Add an indirection field to split/splice calls#376

Open
meroton-benjamin wants to merge 1 commit into
bazelbuild:mainfrom
meroton:main
Open

Add an indirection field to split/splice calls#376
meroton-benjamin wants to merge 1 commit into
bazelbuild:mainfrom
meroton:main

Conversation

@meroton-benjamin

Copy link
Copy Markdown

The split and splice api calls have an implicit maximum file size they can represent due to the maximum grpc message size. Remove this file size limit by adding an indirect_chunk_list field to the SplitBlobResponse and SpliceBlobRequest respectively.

This allows implementations to, in a backwards compatible manner, defer the list of chunks to an object stored in CAS making the split and splice blob requests capable of representing blobs of any size.

The split and splice api calls have an implicit maximum file size they
can represent due to the maximum grpc message size. Remove this file
size limit by adding an indirect_chunk_list field to the
SplitBlobResponse and SpliceBlobRequest respectively.

This allows implementations to, in a backwards compatible manner, defer
the list of chunks to an object stored in CAS making the split and
splice blob requests capable of representing blobs of any size.
@sluongng

Copy link
Copy Markdown
Collaborator

cc: @tyler-french

@sluongng sluongng left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising the PR. This looks great for first draft.

I know we discussed this on the call, but it might be worth pointing out for other reviewers who werent there:

  • Size limitation of the current split/splice + CDC chunking implementation: what's the threshold, etc...

  • Whats' the current behavior? If it's undefined, or perhaps results in wrong file content, then worth calling that out.

  • Provide some level of guidance on how a server should decide when to use the new field and/or when a client should read the new field over the old one.

I hope we can either provide this information inline with the doc, or at least leave some traces inside the PR body/commit message to help guide others when implementing it. Thanks

// Deprecated: Use DigestFunction_Value.Descriptor instead.
func (DigestFunction_Value) EnumDescriptor() ([]byte, []int) {
return file_build_bazel_remote_execution_v2_remote_execution_proto_rawDescGZIP(), []int{40, 0}
return file_build_bazel_remote_execution_v2_remote_execution_proto_rawDescGZIP(), []int{41, 0}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: we should add pb.go into gitattribute as generated file so it's treated as so during code review

@mostynb mostynb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a breaking change that requires an API version update?

ie if an older client doesn't use the new field gets a response that requires the use of the new field (because the list of blobs is too large), will things quietly break?

// The chunking function used to split the blob.
ChunkingFunction.Value chunking_function = 2;

// The digest of the indirect chunk list which MUST be present in the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backwards-compatibility reasons, I guess this field needs to be optional? In which case this first sentence should say something like "if specified and non-null".

Comment on lines 2016 to 2018
// The server MUST use the same digest function as the one explicitly or
// implicitly (through hash length) specified in the split request.
repeated Digest chunk_digests = 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explicitly state that this field is required?

// The chunking function that the client used to split the blob.
ChunkingFunction.Value chunking_function = 5;

// The digest of the indirect chunk list stored in the CAS. The client MAY set

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should specify that the blob is of type ChunkList.

@tyler-french

Copy link
Copy Markdown
Contributor

Perhaps I missed some prior discussion here, but I'd like to propose a slightly different approach for very large blobs: add streaming variants of the split/splice APIs, where chunk digests are sent in groups.

I agree that very large blobs can expose issues with CDC, but storing an indirect chunk list in CAS feels like using CAS as API transport. That leaks what should be a server-side storage/detail into the protocol, requires additional client behavior to read/write the indirection object, and does not address one of the main scalability issues with the current SpliceBlob flow: the server only learns about the blob/chunk mapping at the very end.

It also does not seem fully backward-compatible in practice. A client needs to indicate that it supports the indirect representation, and then needs extra read/write logic to materialize or dereference the chunk-list object. Older clients that only understand the existing SplitBlob / SpliceBlob shape would not know how to use that indirection.

For the case that motivated this PR, has the indirection approach worked well in practice? In particular, is the server still able to verify, within a reasonable time, that concatenating all chunks produces the expected blob digest?

The current client flow is roughly:

FindMissingBlobs(some group of chunks)
Write(some chunks)
FindMissingBlobs(some group of chunks)
Write(some chunks)
FindMissingBlobs(some group of chunks)
Write(some chunks)
SpliceBlob(all chunk digests) // This can be very large and may time out.

The problem is that the client is uploading a blob chunk-by-chunk, but the server does not know how those chunks compose the final blob until the final SpliceBlob call. For example, for a 20 GB blob with ~40k chunks, SpliceBlob may need to read all chunk data, hash the concatenation, and verify that it matches the requested blob digest in one RPC.

With an indirect chunk list, the write path also gets more complicated: the client uploads the chunks, writes the chunk-list object into CAS, and then calls SpliceBlob with a reference to that object. The read path has a similar extra step: SplitBlob returns indirection metadata, and the client then has to fetch and parse the chunk-list object from CAS. That solves one message-size limit, but by adding more protocol states and moving chunk-list transport into CAS.

A more idiomatic and scalable approach would be to add streaming variants of these APIs. This keeps the chunk-list transport in the RPC layer, separate from any particular server-side storage strategy. It also lets clients and servers exchange chunk digests incrementally without exceeding unary message limits or requiring clients to understand how the server represents oversized chunk lists internally.

For example, the write-side API could be:

// SpliceChunks is a streaming variant of SpliceBlob.
// Clients stream chunk digests as they become available, then commit the
// splice by sending the expected blob digest on the final request.
rpc SpliceChunks(stream SpliceChunksRequest) returns (SpliceBlobResponse);
message SpliceChunksRequest {
  // The instance of the execution system to operate against. A server may
  // support multiple instances of the execution system (with their own workers,
  // storage, caches, etc.). The server MAY require use of this field to select
  // between them in an implementation-defined fashion, otherwise it can be
  // omitted. Subsequent requests MAY omit it, but if set it MUST match the
  // first request.
  string instance_name = 1;

  // Expected digest of the spliced blob. This commits the splice. Clients
  // SHOULD only set this on the final request and MUST set it exactly once.
  Digest blob_digest = 2;

  // The ordered list of digests of the chunks which need to be concatenated to
  // assemble the original blob. Chunk digests may be split across multiple
  // stream requests.
  repeated Digest chunk_digests = 3;

  // The digest function of all chunks to be concatenated and of the blob to be
  // spliced. The server MUST use the same digest function for both cases.
  // Subsequent requests MAY omit it, but if set it MUST match the first
  // request.
  DigestFunction.Value digest_function = 4;

  // The chunking function that the client used to split the blob.
  ChunkingFunction.Value chunking_function = 5;
}

And the read-side equivalent could be:

// SplitChunks is a streaming variant of SplitBlob.
// The complete chunk list is the concatenation of chunk_digests across all
// responses in stream order. The server sets blob_digest on the final response
// to indicate that there are no more chunks.
rpc SplitChunks(SplitBlobRequest) returns (stream SplitChunksResponse);
message SplitChunksResponse {
  // The ordered list of digests of the chunks into which the blob was split.
  repeated Digest chunk_digests = 1;

  // The digest of the split blob. This indicates that all chunk digests have
  // been sent. Servers SHOULD only set this on the final response and MUST set
  // it exactly once.
  Digest blob_digest = 2;

  // The chunking function used to split the blob. The server MUST send the same
  // value in all responses.
  ChunkingFunction.Value chunking_function = 3;
}

I put together a draft here: #377

Would you be open to this kind of streaming API instead? Are there benefits to keeping the chunk-list extension on the unary APIs that I am missing?

@meroton-benjamin

Copy link
Copy Markdown
Author

Thanks for raising the PR. This looks great for first draft.

I know we discussed this on the call, but it might be worth pointing out for other reviewers who werent there:

* Size limitation of the current split/splice + CDC chunking implementation: what's the threshold, etc...

With regards to the current limit, there are quite a bit of moving pieces to the current behavior as there are many parameters that could be set. In essence its $x_{min}\cdot\frac{g}{d}$ where $x_{min}$ is the minimum chunk size, $g$ is your max grpc message size and $d$ is the size of your digest including overhead for repeated etc. For small sized blobs in sha256 $d$ is about 73 bytes.

This gives us multiple scenarios. Using RepMaxCDC with an optimized profile (256KiB min chunks, 2MiB grpc messages) we get a worst case blob size of 7GiB (average 9GiB). When doing something similar with FastCDC and 512KiB average chunk size we get a worst case blob size of 3.5GiB (average 14GiB). The worst case chunking is unlikely to materialize outside of synthetic data.

Sacrificing performance to maximize the chunk size while remaining inside of 4MiB grpc messages we can get a worst case blobs of 112GiB lower limit (150GiB average) for RepMaxCDC or 14GiB lower limit (56GiB average) for FastCDC.

* Whats' the _current_ behavior? If it's undefined, or perhaps results in wrong file content, then worth calling that out.

The current behavior is a transport layer error. If a client were to attempt to send chunk lists that wouldn't fit they would get an RESOURCE_EXHAUSTED error together with a helpful message like "Message was N bytes but this server only accepts messages up to M bytes" or similar.

If the Server on the other hand tried to respond with a message that was too big the client would simply hang up with an RST_STREAM frame.

I was previously under the impression that we negotiated max_message_size as part of the GetCapabilities call but I don't think we do. We simply rely on all messages always being below the maximum message size of the server/client pair.

* Provide some level of guidance on how a server should decide when to use the new field and/or when a client should read the new field over the old one.

I hope we can either provide this information inline with the doc, or at least leave some traces inside the PR body/commit message to help guide others when implementing it. Thanks

My intention was that if the new field is set then the full list of chunks must be assumed to be in the new field, the old field should still be set to allow fetching of blobs during the roundtrip to the cas. Servers and clients should set the new field as soon as they believe they are close to exceeding maximum message size if they want to maximize compatibility.

@meroton-benjamin

Copy link
Copy Markdown
Author

@tyler-french I agree that this is less than ideal, but I think the larger issue stems from that we in REv2 only represent blobs by their hash leaving no room for having e.g. a file represented by a more lightweight tree construct that doesn't require streaming the full blob to splice even if 99% of the blob is unchanged from earlier versions.

This means that no matter how we construct api we do we will always have to deal with that issue, a compliant splice call for a 100GB blob will always have to download all 100GB worth of chunks to hash them and write the resulting blob somewhere. If we instead allowed files to be represented by more than one digest we we wouldn't be in this pickle but that's probably a change that requires a v3.

For your streaming variant suggestion, one issue I see is that it makes it difficult to cache a splice call, the splicing server needs to know which blob it is trying to splice at first call in order to be able to skip it if it has already been done. In flight deduplication is also not really possible as we at no point can know whether two streams are trying to splice the same blob with the same chunks until the very last message arrives for the stream.

@tyler-french

Copy link
Copy Markdown
Contributor

This means that no matter how we construct api we do we will always have to deal with that issue, a compliant splice call for a 100GB blob will always have to download all 100GB worth of chunks to hash them and write the resulting blob somewhere. If we instead allowed files to be represented by more than one digest we we wouldn't be in this pickle but that's probably a change that requires a v3.

Agreed. I like the idea mentioned about using Prolly trees - seems pretty nifty.

the splicing server needs to know which blob it is trying to splice at first call in order to be able to skip it if it has already been done. In flight deduplication is also not really possible as we at no point can know whether two streams are trying to splice the same blob with the same chunks until the very last message arrives for the stream.

This is good feedback, thank you! I adjusted the implementation to use a more "finish_write" style bool flag to commit the splice. I agree this is an important implementation detail: #377

@meroton-benjamin

Copy link
Copy Markdown
Author

I have been giving it some thought, performed some implementation work
and some benchmarking. I think it's better if we do not implement this
feature. Not implementing this feature does place a limit on the maximum
size of binary blobs while using cdc. The limit is decently large but
not at all unreasonable. Follows are tables of the smallest blob which
would reach the limit as well as the average blob that reaches the
limit:

With throughput optimized settings:

RepMaxCDC FastCDC
Minimum 7GiB 3.5GiB
Average 9GiB 14GiB

The throughput optimized parameters chosen for the example are using
sha256, a maximum message size of 2MiB, average chunk size of 512KiB for
FastCDC and a minimum chunk size of 256KiB for RepMaxCDC.

With size optimized settings:

RepMaxCDC FastCDC
Minimum 112GiB 14GiB
Average 150GiB 56GiB

The size optimized parameters chosen for the example are using sha256,
a maximum message size of 4MiB, an average chunk size of 1MiB for
FastCDC and a minimum chunk size of 2MiB for RepMaxCDC. It should be
noted that implementations will probably want to set their chunk sizes a
few hundred bytes below these value in order to guarantee they can fit
the blob inside of a single message such as BatchUpdate/BatchRead
messages.

For larger or smaller digest algorithms such as the one from sha512 or
md5 the sizes will be smaller or larger respectively based on their
relative digest size.

Either this suggestion or #377 would overcome these limitation.
Effectively adding support for arbitrary sized blobs I do not think this
is something we want to pursue. The reason is that performing a
SplitBlob and/or a SpliceBlob call are extremely expensive to perform in
practice.

Benchmarking the performance of these calls in a test cluster capable of
doing bytestream reads at 200 MiB/s (~1.6Gbit/s) shows that the vast
majority of time is spent doing requests to the bytestream api. This is
a consequence of the blobs having a flat structure addressed by their
content. That is in order to split a blob we have to read the whole blob
and similarly in order to splice a blob we have to read all of the
individual chunks that constitute the blob. When benchmarking with my
test cluster this could be done in tens of seconds for the throughput
optimized settings to several hundred seconds for the size optimized
settings.

Even in a production grade cluster capable of performing bytestream
reads/writes at tens to hundreds of gigabits per second this would still
be a very expensive call and support for even larger blobs would be
theoretical.

My suggestion would instead be to encourage implementations to set a
maximum blob size (per the GetCapabilities call) to less than the
minimum blob size in the tables above and instead we would add a new
object type to the api for the purpose of representing a large file.

This new object file could, much like a directory object, be a Merkle
tree (or other similar structure). This way parts of the Merkle tree
could have already been computed and validated (such as one uploaded for
a mostly similar blob) and importantly the tree itself can be validated
based on the digest of its tree nodes rather than the content of the
digests of the tree nodes, thereby mostly removing the costly data
transfer steps required when using the split and splice blob apis.

Care should be takes to make sure we pick a good stable algorithm to
produce these Merkle trees such that we get good and stable overlap
between trees of blobs that are mostly similar. In other forums there
have been suggestions to use Prollytrees but I have not performed a
thorough review of the options available.

@sluongng

sluongng commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

I think it's better if we do not implement this feature.
...
Either this suggestion or #377 would overcome these limitation.
Effectively adding support for arbitrary sized blobs I do not think this
is something we want to pursue. The reason is that performing a
SplitBlob and/or a SpliceBlob call are extremely expensive to perform in
practice.

I am a bit confused by this statement. Are you saying that we should close this PR? or should we close #377 as well?

Or are you saying that either #376 or #377 is fine, but BuildBarn will not implement them in favor of something else?

I am open to reading more about this ProllyTree alternative if somebody is willing to put forward a PR proposal.

@sluongng

sluongng commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Reading your comment a bit more, I think there are 2 separate problems that are being conflated:

  • Enablement: allowing blobs that are above 10 GBs to be expressed and transferred using split/splice api for chunk deduplication.

  • Performance: improving the speed of having to cold transfer the whole large blob, assuming all chunks are missing.

#377 should still help solve the first bullet point.

I think your concern is about the second bullet point, which I suspected can be implementations-dependent (client and server). Once the blob is chunked, you can parallelize the BS Read/Write rpcs to maximize your network throughput. Yes the verification can be slower, because you will need to concat these chunks orderly. I think this is where a multi-threads blake3 digest would win out clearly versus a sha256 + CPU sha_ni extension, but some additional testing is required.

I think the performance conversation would be much more productive if you had a test blob vector that we all can use to test against different implementations. This way, we can discuss if the slowness can be reproduced across implementations. Then we can use that same test vector to discuss alternative big blob representations, such as ProllyTree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants