# AI Content Processing Project - Cursor Rules
## Project Overview
This is a Python-based AI content processing tool that extracts text from various file types (PDFs, documents, videos, audio) using OpenAI GPT and Google Gemini APIs. The project follows a modular processor architecture with parallel processing capabilities.
## Code Style & Standards
### Python Standards
- Follow PEP 8 style guidelines
- Use type hints for all function parameters and return values
- Use descriptive variable and function names
- Maximum line length: 100 characters
- Use snake_case for variables and functions, PascalCase for classes
- Use double quotes for strings consistently
### Import Organization
```python
# Standard library imports first
import os
import sys
from pathlib import Path
# Third-party imports
import openai
from dotenv import load_dotenv
# Local imports last
from src.config import Config
from src.text_extractor import TextExtractor
```
### Documentation
- Use docstrings for all classes and functions
- Follow Google-style docstrings
- Include type information in docstrings when helpful
- Document exception handling and error scenarios
## Architecture Patterns
### Processor Pattern
- All file processors inherit from `BaseProcessor` abstract class
- Implement `can_process()` and `extract_text()` methods
- Use the factory pattern for processor selection
- Keep processors stateless and reusable
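A minimal sketch of this pattern, assuming only the `BaseProcessor`, `can_process()`, and `extract_text()` names given above; the `TextFileProcessor` subclass and `get_processor()` factory are illustrative, not the project's actual classes:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Sequence


class BaseProcessor(ABC):
    """Abstract base for all file processors (stateless and reusable)."""

    @abstractmethod
    def can_process(self, file_path: Path) -> bool:
        """Return True if this processor handles the given file type."""

    @abstractmethod
    def extract_text(self, file_path: Path) -> str:
        """Extract text content from the file."""


class TextFileProcessor(BaseProcessor):
    """Illustrative processor for plain-text files."""

    def can_process(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == ".txt"

    def extract_text(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")


def get_processor(file_path: Path, processors: Sequence[BaseProcessor]) -> BaseProcessor:
    """Factory: return the first processor that accepts the file."""
    for processor in processors:
        if processor.can_process(file_path):
            return processor
    raise ValueError(f"No processor available for {file_path.suffix!r} files")
```

Because processors hold no per-file state, a single instance can be shared safely across parallel workers.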
### Configuration Management
- Store all configuration in `src/config.py`
- Use environment variables for API keys and sensitive data
- Provide sensible defaults for optional settings
- Validate configuration at startup
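A sketch of startup-time validation; the attribute and environment-variable names are illustrative assumptions, not the project's actual `src/config.py`:

```python
import os


class Config:
    """Central configuration; key names here are illustrative.

    In the real project, load_dotenv() from python-dotenv would run first
    so a local .env file can populate os.environ.
    """

    def __init__(self) -> None:
        self.openai_api_key = os.getenv("OPENAI_API_KEY", "")
        self.gemini_api_key = os.getenv("GEMINI_API_KEY", "")
        self.max_workers = int(os.getenv("MAX_WORKERS", "4"))  # sensible default

    def validate(self) -> None:
        """Fail fast at startup if required settings are missing."""
        if not self.openai_api_key and not self.gemini_api_key:
            raise ValueError(
                "No API key configured: set OPENAI_API_KEY or GEMINI_API_KEY"
            )
```

Calling `Config().validate()` once at program start surfaces a clear error before any files are touched.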
### Error Handling
- Use specific exception types rather than generic Exception
- Provide helpful error messages with context
- Log errors appropriately
- Don't let individual file failures stop batch processing
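One way to keep a batch alive through individual failures, sketched with a hypothetical `UnsupportedFileError` exception type:

```python
import logging
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

logger = logging.getLogger(__name__)


class UnsupportedFileError(Exception):
    """Raised when no processor can handle a file (illustrative exception type)."""


def process_batch(
    paths: Iterable[Path], extract: Callable[[Path], str]
) -> Tuple[Dict[Path, str], Dict[Path, str]]:
    """Process each file; a single failure never aborts the whole batch."""
    results: Dict[Path, str] = {}
    failures: Dict[Path, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except (UnsupportedFileError, OSError) as exc:
            # Catch specific, expected errors; log with context and move on.
            logger.error("Failed to process %s: %s", path, exc)
            failures[path] = str(exc)
    return results, failures
```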
## File Organization
### Project Structure
```
src/
├── __init__.py
├── config.py              # Configuration and environment variables
├── text_extractor.py      # Main extractor orchestrator
└── file_processors/
    ├── __init__.py
    ├── base_processor.py    # Abstract base class
    ├── openai_processor.py  # OpenAI/GPT processor
    ├── gemini_processor.py  # Google Gemini processor
    └── youtube_processor.py # YouTube-specific processor
```
### Naming Conventions
- Use descriptive module names that indicate purpose
- Test files: `test_*.py`
- Example files: `*_examples.py`
- Configuration files: `config.py` or `settings.py`
## API Integration Best Practices
### OpenAI Integration
- Use the latest OpenAI client library patterns
- Handle rate limiting gracefully
- Implement proper retry logic
- Use appropriate models for different content types
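Retry logic can be sketched as a generic backoff decorator. This is an illustration, not the project's implementation: a real version would catch the client's specific rate-limit error (for the OpenAI library, `openai.RateLimitError`) rather than the broad `Exception` used here to keep the sketch self-contained:

```python
import random
import time
from functools import wraps


def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry decorator with exponential backoff and jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the original error
                    # Exponential backoff (base, 2*base, 4*base, ...) plus
                    # a little jitter so parallel workers do not retry in sync.
                    time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
        return wrapper
    return decorator
```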
### Google Gemini Integration
- Follow Google AI SDK patterns
- Handle multimodal inputs (text, video, audio)
- Implement proper content safety handling
- Use appropriate Gemini models for media processing
### API Key Management
- Never hardcode API keys
- Use environment variables via python-dotenv
- Validate API keys at startup
- Provide clear error messages for missing keys
## Performance Guidelines
### Parallel Processing
- Use ThreadPoolExecutor for I/O-bound operations
- Default to parallel processing with configurable worker counts
- Provide options for sequential processing when needed
- Handle exceptions in parallel workers gracefully
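These rules can be sketched with `concurrent.futures`; the `extract_all()` helper and its error-capture convention are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def extract_all(
    paths: Iterable[str], extract: Callable[[str], str], max_workers: int = 4
) -> Dict[str, str]:
    """Run extract() over files in parallel, capturing exceptions per file."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one worker failing must not stop the batch
                results[path] = f"ERROR: {exc}"
    return results
```

Sequential processing is then just `max_workers=1`, which keeps one code path for both modes.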
### File Handling
- Use pathlib.Path for all file operations
- Validate file existence and permissions early
- Implement file size limits to prevent memory issues
- Use appropriate chunk sizes for large files
### Memory Management
- Process files in streams when possible
- Avoid loading entire large files into memory
- Clean up temporary files and resources
- Monitor memory usage in batch operations
## CLI Design Principles
### Argument Parsing
- Use argparse with clear help text
- Provide examples in the epilog
- Support both individual files and directory processing
- Allow glob patterns for file selection
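An `argparse` skeleton following these rules; the flag names and the epilog example are illustrative, not the project's actual interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser with help text and a usage example."""
    parser = argparse.ArgumentParser(
        description="Extract text from documents, video, and audio files.",
        epilog="Example: extract.py docs/*.pdf --workers 8 --output results.json",
    )
    parser.add_argument("inputs", nargs="+",
                        help="Files, directories, or glob patterns to process")
    parser.add_argument("--output", "-o", help="Write results to this file")
    parser.add_argument("--workers", type=int, default=4,
                        help="Number of parallel workers (default: 4)")
    parser.add_argument("--sequential", action="store_true",
                        help="Process files one at a time")
    return parser
```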
### User Experience
- Show progress indicators for long operations
- Provide informative error messages
- Support different output formats
- Allow saving results to files
### Output Formatting
- Use consistent formatting for results
- Provide both detailed and summary views
- Include processing statistics
- Handle Unicode content properly
## Testing Guidelines
### Test Structure
- Use pytest as the testing framework
- Test both success and failure scenarios
- Mock external API calls for unit tests
- Test file processing with sample files
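Mocking an external API call can be done with the standard library's `unittest.mock`; the `summarize()` function and the `complete()` method on the client are hypothetical stand-ins for a real processor and its API client:

```python
from unittest import mock


def summarize(client, text: str) -> str:
    """Toy function standing in for a processor that calls an external API."""
    return client.complete(prompt=f"Summarize: {text}")


def test_summarize_without_network():
    """Unit test: the API client is mocked, so no real request is made."""
    fake_client = mock.Mock()
    fake_client.complete.return_value = "short summary"
    assert summarize(fake_client, "long document text") == "short summary"
    fake_client.complete.assert_called_once()
```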
### Test Data
- Use small, representative test files
- Include edge cases (empty files, large files, corrupted files)
- Test with different file encodings
- Create fixtures for common test scenarios
## Security Considerations
### API Security
- Protect API keys in environment variables
- Implement rate limiting to prevent abuse
- Validate file types and sizes before processing
- Sanitize file paths to prevent directory traversal
### Input Validation
- Validate file extensions against allowed types
- Check file sizes before processing
- Sanitize user inputs in CLI arguments
- Handle malicious or corrupted files gracefully
## Dependencies Management
### Core Dependencies
- openai: Latest stable version for GPT integration
- google-generativeai: For Gemini API access
- fastapi: For API server functionality
- python-dotenv: Environment variable management
### Development Dependencies
- pytest: Testing framework
- black: Code formatting
- mypy: Static type checking
- flake8: Linting
### Version Pinning
- Pin to a compatible major-version range in requirements.txt (e.g. `package>=1.0,<2.0`)
- Use >= only to express minimum version requirements
- Test with multiple Python versions (3.8+)
- Update dependencies regularly to pick up security fixes
## Logging and Monitoring
### Logging Strategy
- Use Python's logging module consistently
- Different log levels for different environments
- Log API calls and processing statistics
- Include correlation IDs for batch operations
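One way to attach a correlation ID to every record in a batch is a `LoggerAdapter`; the logger name and format string here are illustrative:

```python
import logging
import uuid


def get_batch_logger(level: int = logging.INFO) -> logging.LoggerAdapter:
    """Logger whose records carry a correlation ID for one batch run."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s [%(batch_id)s] %(name)s: %(message)s",
    )
    batch_id = uuid.uuid4().hex[:8]  # short ID shared by all records in this batch
    return logging.LoggerAdapter(logging.getLogger("extractor"), {"batch_id": batch_id})
```

Grepping logs for one batch's ID then reconstructs its full history, even when batches run concurrently.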
### Error Tracking
- Log full stack traces for debugging
- Include context information in error logs
- Monitor API usage and rate limits
- Track processing success rates
## Environment Setup
### Development Environment
- Use virtual environments (venv)
- Provide clear setup instructions
- Include example environment files
- Document all required API keys
### Production Considerations
- Environment-specific configuration
- Proper secret management
- Resource monitoring and limits
- Graceful error handling and recovery
## Code Review Checklist
Before submitting code:
- [ ] All functions have type hints and docstrings
- [ ] Error handling is appropriate and informative
- [ ] Configuration is externalized properly
- [ ] Tests cover new functionality
- [ ] API keys and secrets are not hardcoded
- [ ] File operations use pathlib.Path
- [ ] Parallel processing is implemented safely
- [ ] Memory usage is considered for large files
- [ ] CLI interface follows established patterns
- [ ] Code follows PEP 8 style guidelines