# Financial Document Processor

## Overview
This skill provides guidance for extracting structured data from financial documents (invoices, receipts, statements, etc.) using OCR and PDF text extraction. It emphasizes data safety practices that prevent catastrophic failures from destructive operations.
## Critical Data Safety Principles

NEVER perform destructive operations on source data without verification or backup.

Before any file processing:
- Create a backup of all source files before processing
- Work on copies, not originals
- Verify outputs match expectations before any cleanup
- Use atomic operations (copy → verify → delete) instead of direct moves
### Safe File Operation Pattern

```bash
# CORRECT: Copy first, verify, then clean up
cp -r /source/documents/ /backup/documents/
# ... process files ...
# ... verify outputs match expectations ...
# Only after verification: rm -r /backup/documents/

# WRONG: Delete before moving (data loss risk)
rm -f /source/*.pdf && mv /source/* /dest/  # Files deleted before move!
```
## Workflow

### Step 1: Assess the Environment and Requirements

Before writing any processing code:

- List all source files and note their exact paths
- Identify file types (text-based PDF, scanned PDF, JPG, PNG, etc.)
- Check available tools: `which tesseract`, `which pdftotext`, `python3 -c "import pypdf"` (scripted check below)
- Understand the expected output format (CSV columns, required fields, etc.)
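The tool check can be scripted so processing fails fast; a minimal sketch in Python, equivalent to the shell commands above:

```python
import shutil

# Fail fast if any required extraction tool is missing
for tool in ("tesseract", "pdftotext"):
    if shutil.which(tool) is None:
        raise SystemExit(f"Missing required tool: {tool}")
try:
    import pypdf  # noqa: F401  (only checking availability)
except ImportError:
    raise SystemExit("Missing required Python package: pypdf")
print("All extraction tools available")
```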
### Step 2: Create Backup

Always back up source files before any processing:

```bash
# Create a timestamped backup directory
BACKUP_DIR="/tmp/backup_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
cp -r /path/to/source/documents/* "$BACKUP_DIR/"
echo "Backup created at: $BACKUP_DIR"
```
### Step 3: Test Extraction on Sample Files

Before processing all documents:

- Select 2-3 representative files (different formats, edge cases)
- Test OCR/extraction on these samples
- Verify extracted values match visual inspection
- Adjust extraction logic based on sample results

```python
# Test extraction on a single file first
sample_file = "/path/to/sample_invoice.pdf"
extracted_data = extract_document(sample_file)
print(f"Extracted: {extracted_data}")
# Manually verify these values match the document
```
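The `extract_document` helper used above is not defined in this skill; a minimal sketch, assuming text-based PDFs and an invoice layout with `Total:` and `Date:` labels (both assumptions, to be adapted per document set):

```python
import re
from pypdf import PdfReader

def extract_document(path):
    """Extract labeled fields from a text-based PDF (illustrative patterns)."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # These label patterns are placeholders; match them to the real layout
    total = re.search(r"Total[:\s]+([\d.,]+)", text)
    date = re.search(r"Date[:\s]+([\d/.-]+)", text)

    return {
        "filename": path,
        "total": total.group(1) if total else None,  # None, never a guessed value
        "date": date.group(1) if date else None,
    }
```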
### Step 4: Handle Format Variations

Financial documents often have format variations:

- Number formats: European (1.234,56) vs US (1,234.56)
- Date formats: DD/MM/YYYY vs MM/DD/YYYY vs YYYY-MM-DD (normalization sketch below)
- Currency symbols: $, €, £, or spelled out
- Empty/missing fields: VAT may be blank, not zero

```python
import re

def parse_amount(text):
    """Handle multiple number format conventions."""
    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[$€£\s]', '', text)
    # Detect European format (comma as decimal separator)
    if re.match(r'^\d{1,3}(\.\d{3})*,\d{2}$', cleaned):
        cleaned = cleaned.replace('.', '').replace(',', '.')
    # US format (comma as thousands separator)
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '')
    return float(cleaned) if cleaned else None
```
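Dates need the same care; a sketch that normalizes the formats listed above to ISO 8601, assuming day-first interpretation of the ambiguous DD/MM vs MM/DD case (that assumption must come from knowing the documents' origin, not from guessing per file):

```python
from datetime import datetime

def parse_date(text, day_first=True):
    """Normalize common date formats to YYYY-MM-DD; None if unparseable."""
    formats = ["%Y-%m-%d"]
    formats += ["%d/%m/%Y", "%m/%d/%Y"] if day_first else ["%m/%d/%Y", "%d/%m/%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # report unparseable dates; do not fabricate one
```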
### Step 5: Process All Documents

After successful sample testing:

- Process documents one at a time with error handling
- Log extraction results for each document
- Collect all results before writing the output file (CSV sketch below)

```python
results = []
errors = []

for doc_path in document_paths:
    try:
        data = extract_document(doc_path)
        results.append(data)
        print(f"✓ Processed: {doc_path}")
    except Exception as e:
        errors.append((doc_path, str(e)))
        print(f"✗ Failed: {doc_path} - {e}")

if errors:
    print(f"\nWarning: {len(errors)} documents failed to process")
    for path, error in errors:
        print(f"  - {path}: {error}")
```
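Writing the collected results in one pass keeps a half-written output file from being mistaken for a finished one; a sketch assuming the `results` list above, with illustrative column names and output path:

```python
import csv

fieldnames = ["filename", "total", "date"]  # adjust to the required columns
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in results:
        # Missing fields stay empty in the CSV rather than being invented
        writer.writerow({key: row.get(key) for key in fieldnames})
```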
### Step 6: Verify Before File Operations

Before moving files or writing final outputs:

- Compare the extracted record count to the source file count
- Spot-check extracted values against source documents
- Verify the output format matches requirements

```python
import random

# Verification checklist
assert len(results) == len(document_paths), "Record count mismatch"

# Spot-check a few values
for sample in random.sample(results, min(3, len(results))):
    print(f"Please verify: {sample['filename']} -> Total: {sample['total']}")
```
### Step 7: Move Files (Only After Verification)

Only after verification passes:

```bash
# Move files to destination (not delete!)
for file in /source/documents/*.pdf; do
    mv "$file" /processed/
done

# Only remove the backup after confirming processed files exist
ls /processed/*.pdf && rm -rf "$BACKUP_DIR"
```
## Common Pitfalls

### 1. Destructive Commands Without Backup

Problem: Using `rm` or overwriting files before verifying success.
Prevention: Always create backups first; use the copy-verify-delete pattern.

### 2. Command Order in Shell Pipelines

Problem: `rm -f *.pdf && mv *.pdf /dest/` deletes the files before the move runs.
Prevention: Test commands on sample data; understand execution order.

### 3. Incomplete Script Verification

Problem: Running truncated or incomplete scripts on production data.
Prevention: Verify script content before execution; test on samples first.

### 4. Fabricating Missing Data

Problem: Writing guessed values when extraction fails.
Prevention: Report failures explicitly; use null/empty for missing values, as in the sketch below.
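For example, a failed total extraction should surface as an explicit missing value, not a plausible number (sketch; `extracted_total` and `doc_path` are illustrative names):

```python
# WRONG: silently turns "extraction failed" into "amount is zero"
total = extracted_total or 0.0

# CORRECT: keep None and flag the document for manual review
total = extracted_total
if total is None:
    print(f"Needs manual review: {doc_path} (total not found)")
```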
### 5. Premature Reprocessing

Problem: Immediately reprocessing when values look wrong, without investigating.
Prevention: First analyze the OCR output and extraction logic without moving any files.

### 6. PDF vs Image Handling

Problem: Using OCR on text-based PDFs, or text extraction on scanned PDFs.
Prevention: Check whether the PDF has extractable text before choosing an extraction method:
```python
from pypdf import PdfReader

def is_text_based_pdf(pdf_path):
    """Check if a PDF contains extractable text."""
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        if (page.extract_text() or "").strip():
            return True
    return False
```
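Combined with that check, method selection becomes a single dispatch; a sketch using pdf2image and pytesseract (the tools named in the selection guide below):

```python
import pytesseract
from pdf2image import convert_from_path
from pypdf import PdfReader

def extract_text_any(pdf_path):
    """Use direct text extraction when possible, OCR otherwise."""
    if is_text_based_pdf(pdf_path):
        reader = PdfReader(pdf_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    # Scanned PDF: rasterize each page, then OCR the images
    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)
```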
## Verification Strategies

### Pre-Processing Verification

- Source files exist and are readable
- Backup created successfully
- Required tools installed (tesseract, pdftotext, pypdf)
- Sample extraction produces reasonable values

### Post-Processing Verification

- Output record count matches input file count
- No extraction errors occurred (or errors are documented)
- Spot-checked values match source documents
- Output format matches requirements (correct columns, types)
- Files moved to correct destinations
- Original backup preserved until final verification
## Recovery Plan

If something goes wrong:

- Stop immediately - do not continue processing
- Restore from backup: `cp -r "$BACKUP_DIR"/* /source/`
- Investigate the failure before retrying
- Fix extraction logic on samples before reprocessing all files
## Tool Selection Guide

| File Type | Primary Tool | Fallback |
|---|---|---|
| Text-based PDF | pypdf, pdftotext | - |
| Scanned PDF | tesseract (after pdf2image) | - |
| JPG/PNG images | tesseract | - |
| Mixed PDF (text + scans) | pypdf first, tesseract for image pages | - |
Install dependencies:

```bash
# System packages
apt-get install tesseract-ocr poppler-utils

# Python packages
pip install pypdf pytesseract pdf2image pillow
```
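A quick import check confirms the installed stack is usable before any documents are touched (sketch):

```python
# Smoke test: packages import and the tesseract binary is reachable
import pdf2image
import pypdf
import pytesseract

print("tesseract version:", pytesseract.get_tesseract_version())
```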