feat: Add batch uploader and document re-parsing tools to Python SDK #38
CHLK wants to merge 4 commits into oceanbase:main
# Conflicts:
#   .gitignore
#   api/apps/sdk/doc.py
#   sdk/python/ragflow_sdk/modules/dataset.py
…ailed_documents.py
keyang.lk seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Pull request overview
This PR adds comprehensive batch processing tools to the RAGFlow Python SDK, including a batch uploader for efficient bulk document uploads and a tool for re-parsing failed documents. The implementation includes extensive test coverage, documentation, and example scripts.
Key Changes:
- New tools module with BatchUploader, DocumentExtractor, FieldMapper, FileReader, and FailedDocumentReparser
- Batch document upload API endpoint with metadata support
- Optimized document parsing with batch processing
- Comprehensive test suites with unit tests
- Enhanced error handling and response parsing in SDK core
Reviewed changes
Copilot reviewed 22 out of 24 changed files in this pull request and generated 30 comments.
| File | Description |
|---|---|
| sdk/python/ragflow_sdk/tools/batch_uploader.py | Implements batch upload with snapshot-based resume support and retry logic |
| sdk/python/ragflow_sdk/tools/reparse_failed_documents.py | Tool for identifying and re-parsing failed documents with pagination |
| sdk/python/ragflow_sdk/tools/document_extractor.py | Iterator-based document extraction from various file formats |
| sdk/python/ragflow_sdk/tools/field_mapper.py | Flexible field mapping with auto-detection capabilities |
| sdk/python/ragflow_sdk/tools/file_reader.py | Multi-format file reader supporting JSON, CSV, Excel, etc. |
| sdk/python/ragflow_sdk/tools/models.py | Data models for tools including Snapshot, FileCursor, Document |
| sdk/python/test/test_tools/ | Comprehensive unit tests for all new tools |
| api/apps/sdk/doc.py | New batch upload API endpoint and optimized document parsing |
| sdk/python/ragflow_sdk/ragflow.py | Enhanced response parsing with proper error handling |
| sdk/python/ragflow_sdk/modules/dataset.py | New upload_documents_with_meta method with batch support |
| sdk/python/pyproject.toml | Added test dependencies (pandas, pytest-cov, etc.) |
| sdk/python/examples/ | Example scripts for batch upload and document reparsing |
| web/.env | PORT changed from 9222 to 9223 |
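
The summary above describes batch_uploader.py as combining snapshot-based resume with retry logic. The following is only a minimal sketch of that general pattern, not the code in this PR: `load_snapshot`, `save_snapshot`, `upload_with_resume`, the `next_index` snapshot field, and the `send_batch` callback are all hypothetical names chosen for illustration.

```python
import json
import os
import time


def load_snapshot(path):
    """Return the saved resume offset, or 0 when no snapshot file exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)["next_index"]


def save_snapshot(path, next_index):
    """Persist the index of the next document that has not been sent."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"next_index": next_index}, fh)


def upload_with_resume(docs, send_batch, snapshot_path,
                       batch_size=2, max_retries=3, delay=0.0):
    """Send docs in batches, retrying failures and recording progress.

    send_batch is a hypothetical callable standing in for the real upload call.
    """
    start = load_snapshot(snapshot_path)
    for i in range(start, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                send_batch(batch)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; the snapshot still points at this batch
                time.sleep(delay * attempt)  # simple linear backoff
        # Only advance the snapshot after the batch has been accepted.
        save_snapshot(snapshot_path, i + len(batch))
```

Because the snapshot is written only after a batch succeeds, a rerun with the same snapshot file skips everything already sent and restarts at the first unconfirmed batch.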
        snapshot_file=snapshot_file,
        file_extension=file_extension
    ):
        current_file_path = file_path

Variable current_file_path is not used.

Suggested change:

    -    current_file_path = file_path
    """Disable beartype by monkey patching beartype_this_package to do nothing."""
    try:
        import beartype.claw
        original_beartype_this_package = beartype.claw.beartype_this_package

Variable original_beartype_this_package is not used.

Suggested change:

    -    original_beartype_this_package = beartype.claw.beartype_this_package
        return res.json()
    except Exception as e:
        error_url = url or res.url if hasattr(res, 'url') else 'unknown'
        raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")

Variable error_url is not used.

Suggested change:

    -    raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")
    +    raise Exception(
    +        f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
    +        f"{str(e)}. Response text: {res.text[:500]}"
    +    )
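
One detail worth noting when applying this suggestion: in Python a conditional expression binds more loosely than `or`, so `url or res.url if hasattr(res, 'url') else 'unknown'` is evaluated as `(url or res.url) if hasattr(res, 'url') else 'unknown'`, and an explicit `url` is discarded whenever the response object has no `url` attribute. A possible way to sidestep that is sketched below. `parse_json_response` and `FakeResponse` are hypothetical names for illustration, and the sketch assumes the parse failure surfaces as a `ValueError` (as `json.JSONDecodeError` does).

```python
import json


def parse_json_response(res, url=None):
    """Return res.json(), or raise with status, URL, and a body excerpt."""
    try:
        return res.json()
    except ValueError as e:
        # Prefer the explicit URL, then the response's own URL, then a placeholder.
        error_url = url or getattr(res, "url", None) or "unknown"
        raise RuntimeError(
            f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
            f"{e}. Response text: {res.text[:500]}"
        ) from e


class FakeResponse:
    """Stand-in for an HTTP response that returns HTML instead of JSON."""
    status_code = 502
    text = "<html>Bad Gateway</html>"

    def json(self):
        return json.loads(self.text)
```

Calling `parse_json_response(FakeResponse(), url="http://example.invalid/api")` raises a `RuntimeError` whose message includes the explicit URL even though `FakeResponse` has no `url` attribute.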
    # Create test file
    filename = f"test.{extension}"
    create_method = getattr(self, create_func)
    filepath = create_method(temp_dir, filename, test_data)

Variable filepath is not used.

Suggested change:

    -    filepath = create_method(temp_dir, filename, test_data)
    +    _ = create_method(temp_dir, filename, test_data)
        filepath = create_method(temp_dir, filename, test_data)
    else:  # xlsx or xls
        filepath = self._create_excel_file(temp_dir, filename, test_data)

Variable filepath is not used.

Suggested change:

    -        filepath = create_method(temp_dir, filename, test_data)
    -    else:  # xlsx or xls
    -        filepath = self._create_excel_file(temp_dir, filename, test_data)
    +        create_method(temp_dir, filename, test_data)
    +    else:  # xlsx or xls
    +        self._create_excel_file(temp_dir, filename, test_data)
    import os
    import tempfile
    import pytest
    from unittest.mock import Mock, MagicMock, patch, call

Import of 'MagicMock' is not used.
Import of 'patch' is not used.
Import of 'call' is not used.

Suggested change:

    -    from unittest.mock import Mock, MagicMock, patch, call
    +    from unittest.mock import Mock
    import tempfile
    import pytest
    from unittest.mock import Mock, MagicMock, patch, call
    from pathlib import Path

Import of 'Path' is not used.

Suggested change:

    -    from pathlib import Path
    from pathlib import Path

    from ragflow_sdk.tools import BatchUploader, DocumentExtractor, FieldMapper
    from ragflow_sdk.tools.models import Snapshot, FileCursor

Import of 'Snapshot' is not used.

Suggested change:

    -    from ragflow_sdk.tools.models import Snapshot, FileCursor
    +    from ragflow_sdk.tools.models import FileCursor
    #

    import pytest
    from unittest.mock import Mock, MagicMock, patch, call

Import of 'MagicMock' is not used.
Import of 'patch' is not used.

Suggested change:

    -    from unittest.mock import Mock, MagicMock, patch, call
    +    from unittest.mock import Mock
        beartype.claw.beartype_this_package = noop_beartype_this_package
        os.environ['BEARTYPE_DISABLE'] = '1'
    except ImportError:

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

    -    except ImportError:
    +    except ImportError:
    +        # beartype is an optional dependency in tests; if it's not installed, just skip the monkey-patch.
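
For reference, the optional-dependency pattern this comment and the earlier unused-variable comment touch on can be written so that the skip path is explicit and no unused binding is needed. This is only a sketch: `disable_beartype` and its boolean return value are hypothetical, and the `BEARTYPE_DISABLE` environment variable is copied from the snippet under review rather than from beartype's documentation.

```python
import importlib
import os


def disable_beartype():
    """Replace beartype_this_package with a no-op if beartype is installed.

    Returns True when the patch was applied, False when beartype is absent.
    """
    try:
        claw = importlib.import_module("beartype.claw")
    except ImportError:
        # beartype is an optional dependency in tests; nothing to patch.
        return False

    def noop_beartype_this_package(*args, **kwargs):
        """Accept any arguments and do nothing."""
        return None

    claw.beartype_this_package = noop_beartype_this_package
    # Environment variable taken from the code under review (assumption).
    os.environ["BEARTYPE_DISABLE"] = "1"
    return True
```

Returning a flag lets a test fixture log or assert whether the patch was applied instead of silently passing.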
Summary
This PR extends the RAGFlow Python SDK with powerful batch processing tools for document management, including a batch uploader and a tool for re-parsing failed documents.
Key Features
- BatchUploader class for efficient bulk document uploads with progress tracking and error handling
Changes
- ragflow_sdk/tools/batch_uploader.py - Batch upload functionality
- ragflow_sdk/tools/reparse_failed_documents.py - Document re-parsing tool
- ragflow_sdk/tools/document_extractor.py - Document extraction utilities
- ragflow_sdk/tools/field_mapper.py - Field mapping system
- ragflow_sdk/tools/file_reader.py - Multi-format file reader
- ragflow_sdk/tools/models.py - Data models for tools
- ragflow_sdk/modules/dataset.py with batch operation support
Solution Description