feat(io): add bulk delete API to FileIO.#659
Conversation
|
@wgtmac Could you please help review this PR? This is my first code contribution to the iceberg-cpp module, and I would really appreciate your guidance and help. Thank you very much! |
| /// | ||
| /// \param file_locations The locations of the files to delete. | ||
| /// \return void if all deletes succeeded, an error code if any delete failed. | ||
| virtual Status DeleteFiles(std::span<const std::string> file_locations) { |
There was a problem hiding this comment.
It would be better if you could make some changes that utilize this new API. I saw you mentioned ExpireSnapshots.
There was a problem hiding this comment.
Thanks, that makes sense. I originally planned to split this into two PRs: first add the FileIO::DeleteFiles API, then update ExpireSnapshots to use it.
I agree that this PR would be more useful if it also adopted the new API. I’ll take a look at updating ExpireSnapshots to use DeleteFiles for grouped file cleanup.
| #include "iceberg/test/matchers.h" | ||
|
|
||
| namespace iceberg { | ||
| namespace { |
There was a problem hiding this comment.
We don't need this or its counterpart. If you want to wrap RecordingFileIO in an anonymous namespace, you should shrink the scope instead.
| namespace { |
There was a problem hiding this comment.
Makes sense. I’ll shrink the anonymous namespace scope to cover only the helper RecordingFileIO and keep the tests outside of the iceberg namespace.
There was a problem hiding this comment.
Why not putting all cases in the iceberg namespace so we don't need to use iceberg:: prefix everywhere?
There was a problem hiding this comment.
Fixed. I moved the test cases into the iceberg namespace and kept RecordingFileIO in an anonymous namespace, so the iceberg:: prefixes are no longer needed.
| /// \return void if all deletes succeeded, an error code if any delete failed. | ||
| virtual Status DeleteFiles(std::span<const std::string> file_locations) { | ||
| for (const auto& file_location : file_locations) { | ||
| auto status = DeleteFile(file_location); |
There was a problem hiding this comment.
nit: put the implementation to file_io.cc and use ICEBERG_RETURN_UNEXPECTED to save some lines.
There was a problem hiding this comment.
Will update it. I'll move the default DeleteFiles implementation to file_io.cc and use ICEBERG_RETURN_UNEXPECTED for error propagation.
| /// | ||
| /// \param file_locations The locations of the files to delete. | ||
| /// \return void if all deletes succeeded, an error code if any delete failed. | ||
| virtual Status DeleteFiles(std::span<const std::string> file_locations) { |
There was a problem hiding this comment.
To be complete, let's override this function in arrow_io_internal.h by adapting to https://github.com/apache/arrow/blob/eab3d11b484df5bb6ddd75630f9bfa17c0a3f491/cpp/src/arrow/filesystem/filesystem.h#L274
A relevant question is whether to use const std::vector<std::string>& as the parameter so no conversion is needed to call the Arrow equivalent.
There was a problem hiding this comment.
I changed FileIO::DeleteFiles to take const std::vectorstd::string& to match Arrow FileSystem::DeleteFiles, and added an ArrowFileSystemFileIO override that resolves Iceberg file locations and delegates to arrow_fs_->DeleteFiles(paths). I also added a LocalFileIO test for the bulk delete path.
| #include "iceberg/test/matchers.h" | ||
|
|
||
| namespace iceberg { | ||
| namespace { |
There was a problem hiding this comment.
Why not putting all cases in the iceberg namespace so we don't need to use iceberg:: prefix everywhere?
04cef13 to
1e963a0
Compare
Co-authored-by: Gang Wu <ustcwg@gmail.com>
|
|
||
| class RecordingFileIO : public FileIO { | ||
| public: | ||
| explicit RecordingFileIO(std::string failure_path = "") |
There was a problem hiding this comment.
The name RecordingFileIO is not that obvious about its purpose but this is fine in the test.
|
Thanks @slfan1989 for working on this and @zhjwpku for the review! |
@wgtmac @zhjwpku Thank you for the reviews and guidance! I really appreciate your time and help throughout the PR. |
Summary
Add a new
FileIO::DeleteFiles(...)API as a bulk deletion entry point.The default implementation deletes files sequentially by calling the existing
DeleteFile(...)method and returns the first deletion error encountered.This PR only adds the API and backward-compatible fallback behavior. It does not
yet update
ExpireSnapshotsto useDeleteFiles(...), and it does not introduceparallel deletion.
Fixed: #658
Motivation
ExpireSnapshotsand other cleanup flows may need to delete many files. Abulk deletion API gives FileIO implementations a common extension point for
future optimized deletion strategies, such as storage-native batch deletion or
parallel fallback deletion.