feat: Add batch uploader and document re-parsing tools to Python SDK by CHLK · Pull Request #38 · oceanbase/powerrag

CHLK · 2026-01-09T12:22:40Z

Summary

This PR extends the RAGFlow Python SDK with batch processing tools for document management, including a batch uploader and a tool for re-parsing failed documents.

Key Features

Batch Uploader: New BatchUploader class for efficient bulk document uploads with progress tracking and error handling
Document Extractor: Utility for extracting documents from various sources (files, directories, URLs)
Field Mapper: Flexible field mapping system for document metadata transformation
File Reader: Support for reading multiple file formats (PDF, DOCX, TXT, etc.)
Reparse Tool: Tool for identifying and re-parsing failed documents in datasets
Comprehensive Documentation: Extensive README with usage examples and best practices
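
For readers who want a concrete picture of "bulk uploads with progress tracking and error handling", here is a minimal, self-contained sketch. It is not the `BatchUploader` code from this PR: `UploadReport`, `chunked`, `upload_in_batches`, and `fake_upload` are hypothetical names made up for illustration, and the upload call is a stand-in rather than a real SDK method.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class UploadReport:
    """Running totals for a bulk upload (hypothetical, not the SDK's model)."""
    uploaded: int = 0
    failed: int = 0
    errors: List[str] = field(default_factory=list)


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


def upload_in_batches(
    docs: Iterable[dict],
    upload_batch: Callable[[List[dict]], None],
    batch_size: int = 2,
    on_progress: Optional[Callable[[UploadReport], None]] = None,
) -> UploadReport:
    """Upload docs batch by batch; a failing batch is recorded, not fatal."""
    report = UploadReport()
    for batch in chunked(docs, batch_size):
        try:
            upload_batch(batch)
            report.uploaded += len(batch)
        except Exception as exc:  # keep going so one bad batch does not stop the run
            report.failed += len(batch)
            report.errors.append(str(exc))
        if on_progress is not None:
            on_progress(report)
    return report


# Demo with a stand-in upload function (assumption: no real server involved).
docs = [{"name": n} for n in ["a.txt", "b.txt", "bad.txt", "c.txt", "d.txt"]]


def fake_upload(batch: List[dict]) -> None:
    # Pretend the server rejects any batch containing "bad.txt".
    if any(d["name"] == "bad.txt" for d in batch):
        raise ValueError("server rejected bad.txt")


progress = []
report = upload_in_batches(
    docs,
    fake_upload,
    batch_size=2,
    on_progress=lambda r: progress.append((r.uploaded, r.failed)),
)
```

In this sketch a failure is counted per batch, so one bad document marks its whole batch as failed. A real tool might instead retry the batch item by item.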

Changes

Added ragflow_sdk/tools/batch_uploader.py - Batch upload functionality
Added ragflow_sdk/tools/reparse_failed_documents.py - Document re-parsing tool
Added ragflow_sdk/tools/document_extractor.py - Document extraction utilities
Added ragflow_sdk/tools/field_mapper.py - Field mapping system
Added ragflow_sdk/tools/file_reader.py - Multi-format file reader
Added ragflow_sdk/tools/models.py - Data models for tools
Updated ragflow_sdk/modules/dataset.py with batch operation support
Added comprehensive test suites for all new tools
Updated SDK documentation with examples

Solution Description

# Conflicts:
#   .gitignore
#   api/apps/sdk/doc.py
#   sdk/python/ragflow_sdk/modules/dataset.py

…ailed_documents.py

CLAassistant · 2026-01-09T12:22:46Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

keyang.lk seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot

Pull request overview

This PR adds comprehensive batch processing tools to the RAGFlow Python SDK, including a batch uploader for efficient bulk document uploads and a tool for re-parsing failed documents. The implementation includes extensive test coverage, documentation, and example scripts.

Key Changes:

New tools module with BatchUploader, DocumentExtractor, FieldMapper, FileReader, and FailedDocumentReparser
Batch document upload API endpoint with metadata support
Optimized document parsing with batch processing
Comprehensive test suites with unit tests
Enhanced error handling and response parsing in SDK core

Reviewed changes

Copilot reviewed 22 out of 24 changed files in this pull request and generated 30 comments.


| File | Description |
| --- | --- |
| `sdk/python/ragflow_sdk/tools/batch_uploader.py` | Implements batch upload with snapshot-based resume support and retry logic |
| `sdk/python/ragflow_sdk/tools/reparse_failed_documents.py` | Tool for identifying and re-parsing failed documents with pagination |
| `sdk/python/ragflow_sdk/tools/document_extractor.py` | Iterator-based document extraction from various file formats |
| `sdk/python/ragflow_sdk/tools/field_mapper.py` | Flexible field mapping with auto-detection capabilities |
| `sdk/python/ragflow_sdk/tools/file_reader.py` | Multi-format file reader supporting JSON, CSV, Excel, etc. |
| `sdk/python/ragflow_sdk/tools/models.py` | Data models for tools including Snapshot, FileCursor, Document |
| `sdk/python/test/test_tools/` | Comprehensive unit tests for all new tools |
| `api/apps/sdk/doc.py` | New batch upload API endpoint and optimized document parsing |
| `sdk/python/ragflow_sdk/ragflow.py` | Enhanced response parsing with proper error handling |
| `sdk/python/ragflow_sdk/modules/dataset.py` | New upload_documents_with_meta method with batch support |
| `sdk/python/pyproject.toml` | Added test dependencies (pandas, pytest-cov, etc.) |
| `sdk/python/examples/` | Example scripts for batch upload and document reparsing |
| `web/.env` | PORT changed from 9222 to 9223 |
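
The "snapshot-based resume support and retry logic" noted for `batch_uploader.py` can be pictured with the sketch below. It is an illustration under assumptions, not the PR's implementation: `ResumePoint`, `load_resume_point`, `save_resume_point`, `with_retry`, and `process_records` are hypothetical names, and the snapshot layout (a small JSON file holding a source name and a count of finished records) is invented for the example.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class ResumePoint:
    """Hypothetical snapshot: which source was in progress and how far it got."""
    file_path: str = ""
    records_done: int = 0


def load_resume_point(path: Path) -> ResumePoint:
    """Read the snapshot if one exists, otherwise start from scratch."""
    if path.exists():
        return ResumePoint(**json.loads(path.read_text(encoding="utf-8")))
    return ResumePoint()


def save_resume_point(path: Path, point: ResumePoint) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written snapshot."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(asdict(point)), encoding="utf-8")
    tmp.replace(path)


def with_retry(action: Callable[[], None], attempts: int = 3, base_delay: float = 0.01) -> None:
    """Run `action`, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            action()
            return
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def process_records(source: str, records: List[str],
                    send: Callable[[str], None], snapshot_path: Path) -> int:
    """Send records, skipping those the snapshot marks as done; return how many were sent."""
    point = load_resume_point(snapshot_path)
    start = point.records_done if point.file_path == source else 0
    for index in range(start, len(records)):
        with_retry(lambda rec=records[index]: send(rec))
        save_resume_point(snapshot_path, ResumePoint(source, index + 1))
    return len(records) - start


# Demo: the first run fails at "c", the second run resumes from the snapshot.
with tempfile.TemporaryDirectory() as tmp_dir:
    snapshot = Path(tmp_dir) / "snapshot.json"
    records = ["a", "b", "c", "d"]
    sent: List[str] = []

    def crashing_send(record: str) -> None:
        if record == "c":
            raise ConnectionError("simulated outage")
        sent.append(record)

    try:
        process_records("data.jsonl", records, crashing_send, snapshot)
    except ConnectionError:
        pass  # retries exhausted; the snapshot still records the finished items

    saved_after_crash = load_resume_point(snapshot).records_done
    resumed_count = process_records("data.jsonl", records, sent.append, snapshot)
```

Saving the snapshot after every record keeps the sketch simple. A real uploader would more likely save once per batch to limit disk writes.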

Copilot · 2026-01-09T12:34:50Z

sdk/python/ragflow_sdk/tools/batch_uploader.py

+                snapshot_file=snapshot_file,
+                file_extension=file_extension
+            ):
+                current_file_path = file_path


Variable current_file_path is not used.

Suggested change

-                current_file_path = file_path

Copilot · 2026-01-09T12:34:51Z

sdk/python/test/conftest.py

+    """Disable beartype by monkey patching beartype_this_package to do nothing."""
+    try:
+        import beartype.claw
+        original_beartype_this_package = beartype.claw.beartype_this_package


Variable original_beartype_this_package is not used.

Suggested change

-        original_beartype_this_package = beartype.claw.beartype_this_package

Copilot · 2026-01-09T12:34:51Z

sdk/python/ragflow_sdk/ragflow.py

+            return res.json()
+        except Exception as e:
+            error_url = url or res.url if hasattr(res, 'url') else 'unknown'
+            raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")


Variable error_url is not used.

Suggested change

-            raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")
+            raise Exception(
+                f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
+                f"{str(e)}. Response text: {res.text[:500]}"
+            )
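
To see the suggested message in action, here is a small standalone sketch. `FakeResponse` and `parse_json_response` are hypothetical stand-ins (the real code works on an HTTP response object inside the SDK), and the sketch raises `RuntimeError` rather than a bare `Exception` purely as an illustrative choice. It also computes the URL fallback with `getattr`, because Python groups `url or res.url if hasattr(res, 'url') else 'unknown'` as `(url or res.url) if hasattr(res, 'url') else 'unknown'`, so an explicitly passed `url` would be dropped whenever the response has no `url` attribute.

```python
import json
from typing import Any, Optional


class FakeResponse:
    """Stand-in for an HTTP response; only the attributes used below (assumption)."""

    def __init__(self, text: str, status_code: int = 200,
                 url: str = "http://example.invalid/api") -> None:
        self.text = text
        self.status_code = status_code
        self.url = url

    def json(self) -> Any:
        return json.loads(self.text)


def parse_json_response(res: Any, url: Optional[str] = None) -> Any:
    """Return the decoded JSON body, or raise an error that names the URL."""
    try:
        return res.json()
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        # Prefer the caller's URL, then the response's own URL, then a placeholder.
        error_url = url or getattr(res, "url", None) or "unknown"
        raise RuntimeError(
            f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
            f"{exc}. Response text: {res.text[:500]}"
        ) from exc


# Demo: a valid body parses, an HTML error page produces a descriptive error.
ok_payload = parse_json_response(FakeResponse('{"code": 0}'))
try:
    parse_json_response(FakeResponse("<html>Bad Gateway</html>", status_code=502))
    error_message = ""
except RuntimeError as err:
    error_message = str(err)
```

Chaining with `from exc` keeps the original decode error in the traceback, which helps when the truncated response text alone is not enough to diagnose the failure.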

Copilot · 2026-01-09T12:34:51Z