Hebrew OCR Forms
Instructions
Step 1: Identify the Form Type
| Form Type | Source | Key Identifiers | Common Fields |
|---|---|---|---|
| Nesach Tabu | Land Registry | "נסח טאבו", "לשכת רישום המקרקעין" | Gush, Chelka, Owner, Liens |
| Tofes 106 | Tax Authority | "טופס 106", "דו״ח שנתי למעביד" | Salary, Tax, Employer |
| Ishur Nikui | Tax Authority | "אישור ניכוי מס במקור" | Tax rate, Validity, TZ |
| Tofes 857 | Tax Authority | "טופס 857", "רווח הון" | Transaction, Gain, Tax |
| Ishur Zkauyot | Bituach Leumi | "אישור זכאויות", "ביטוח לאומי" | Benefit type, Amount |
| Tofes 100 | Bituach Leumi | "טופס 100", "דין וחשבון" | Employees, Wages |
| Rishayon Rechev | Vehicle Licensing | "רישיון רכב" | Plate, Owner, Expiry |
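The identifier strings in the table above can drive a simple auto-detection pass over raw OCR output. A minimal sketch, where the `FORM_MARKERS` dict and `identify_form` name are illustrative and not part of the bundled scripts:

```python
# Key identifier strings per form type, taken from the table above.
FORM_MARKERS = {
    'nesach_tabu': ['נסח טאבו', 'לשכת רישום המקרקעין'],
    'tofes_106': ['טופס 106', 'דו״ח שנתי למעביד'],
    'ishur_nikui': ['אישור ניכוי מס במקור'],
    'tofes_857': ['טופס 857', 'רווח הון'],
    'ishur_zkauyot': ['אישור זכאויות'],
    'tofes_100': ['טופס 100', 'דין וחשבון'],
    'rishayon_rechev': ['רישיון רכב'],
}

def identify_form(ocr_text):
    """Return the form type with the most marker hits, or None."""
    best, best_hits = None, 0
    for form, markers in FORM_MARKERS.items():
        hits = sum(1 for m in markers if m in ocr_text)
        if hits > best_hits:
            best, best_hits = form, hits
    return best
```

Substring matching is deliberately loose here: OCR output will have spacing noise, so in practice you may want to normalize whitespace first (see Step 5) before matching.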
Step 2: Preprocess the Scanned Image
See scripts/preprocess_image.py for the full preprocessing pipeline. Key steps:
- Convert to grayscale
- Deskew -- Israeli forms are often slightly rotated from scanning
- Binarize with adaptive threshold -- handles uneven lighting from scanners
- Remove noise with morphological operations
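The adaptive-binarization step can be sketched without OpenCV using an integral-image local mean. This is a simplified stand-in for the full pipeline in scripts/preprocess_image.py: deskewing and morphological cleanup are omitted, and the function names are illustrative.

```python
import numpy as np

def to_grayscale(rgb):
    """ITU-R BT.601 luma conversion; rgb is an HxWx3 uint8 array."""
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

def adaptive_binarize(gray, block=15, c=10):
    """Threshold each pixel against its local mean over a block x block
    window, computed with an integral image. Local (rather than global)
    thresholding handles uneven lighting from scanners."""
    pad = block // 2
    padded = np.pad(gray.astype(np.float64), pad, mode='edge')
    # Integral image with a leading row/column of zeros
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = gray.shape
    # Window sums via the standard four-corner integral-image lookup
    s = (ii[block:block + h, block:block + w]
         - ii[:h, block:block + w]
         - ii[block:block + h, :w]
         + ii[:h, :w])
    mean = s / (block * block)
    # Dark text below (local mean - c) -> foreground 0, else background 255
    return np.where(gray < mean - c, 0, 255).astype(np.uint8)
```

The constant `c` biases the threshold so faint background texture is not promoted to foreground; 300 DPI scans of printed Hebrew usually tolerate the defaults, but dot-matrix forms may need a larger block size.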
Step 3: Run Hebrew OCR with Tesseract
See scripts/extract_form_fields.py for the full extraction pipeline.
Tesseract configuration for Hebrew forms:
```python
config = (
    '--oem 1 '                        # LSTM neural net (best for Hebrew)
    '--psm 6 '                        # Assume uniform block of text
    '-l heb+eng '                     # Hebrew + English (forms have both)
    '-c preserve_interword_spaces=1'  # Keep spacing for field alignment
)
```
- For tabular forms (Tabu, Tofes 106), use PSM 4 instead of PSM 6
- Always use LSTM mode (--oem 1) for best Hebrew accuracy
- Include both heb and eng languages since forms mix Hebrew and English/numbers
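The PSM-per-form-type rule above can be wrapped in a small helper. A sketch; `tesseract_config` and the form-type keys are illustrative names, not part of the bundled scripts:

```python
def tesseract_config(form_type):
    """Build a Tesseract config string, switching page segmentation
    mode per form layout: tabular forms (Nesach Tabu, Tofes 106) get
    PSM 4 (single column of variable-size text), others PSM 6."""
    tabular = {'nesach_tabu', 'tofes_106'}
    psm = 4 if form_type in tabular else 6
    return (
        f'--oem 1 --psm {psm} -l heb+eng '
        '-c preserve_interword_spaces=1'
    )
```

The returned string is what you would pass as the `config` argument to pytesseract's `image_to_string`/`image_to_data`, or append to a `tesseract` CLI invocation.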
Step 4: Extract Fields by Form Type
Tabu Extract (Nesach Tabu) key fields:
- Gush (block) number: look for "גוש" followed by digits
- Chelka (parcel) number: look for "חלקה" followed by digits
- Owner name: follows "בעלים" or "שם הבעלים"
- ID number (TZ): follows "ת.ז." or "מספר זהות", 9 digits
Tax Form (Tofes 106) key fields:
- Tax year: follows "שנת מס"
- Employer number: follows "מספר מעביד" or "מס׳ עוסק", 9 digits
- Gross salary: follows "שכר ברוטו" or "הכנסה חייבת"
- Tax deducted: follows "מס שנוכה" or "ניכוי מס"
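The field anchors listed above translate directly into regex extraction. A minimal sketch for the Nesach Tabu fields; the patterns are simplified (the document points to references/israeli-form-types.md for the fuller catalog) and the names are illustrative:

```python
import re

# Anchor-word patterns for Nesach Tabu fields, per the list above.
TABU_PATTERNS = {
    'gush': r'גוש\s*:?\s*(\d+)',                     # block number
    'chelka': r'חלקה\s*:?\s*(\d+)',                  # parcel number
    'tz': r'(?:ת\.ז\.|מספר זהות)\s*:?\s*(\d{9})',    # 9-digit ID
}

def extract_tabu_fields(text):
    """Return a dict of whichever fields were found in the OCR text."""
    fields = {}
    for name, pattern in TABU_PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            fields[name] = m.group(1)
    return fields
```

Because Python regex works on logical character order, these patterns apply unchanged to RTL text; run the bidi normalization from Step 5 first so stray control characters don't sit between the anchor word and the digits.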
Step 5: Handle RTL and Bidirectional Text
```python
import unicodedata

def normalize_bidi_text(text):
    """Normalize bidirectional text from Hebrew OCR output."""
    lines = text.split('\n')
    normalized = []
    for line in lines:
        # Strip bidi control characters (Unicode category Cf covers
        # LRM, RLM, and other invisible formatting marks)
        clean = ''.join(
            c for c in line
            if unicodedata.category(c) != 'Cf'
        )
        # Collapse runs of whitespace
        clean = ' '.join(clean.split())
        if clean:
            normalized.append(clean)
    return '\n'.join(normalized)
```
Step 6: Validate Extracted Data
After extraction, validate key fields:
- TZ numbers: Run through Israeli ID validation algorithm (check digit)
- Dates: Verify DD/MM/YYYY format and valid date ranges
- Amounts: Check that numeric fields parse correctly, no OCR artifacts in digits
- Gush/Chelka: Verify numeric format, reasonable ranges
- Cross-reference: If TZ appears in multiple fields, verify consistency
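The TZ check-digit validation above follows the standard Israeli ID algorithm: weight the nine digits alternately by 1 and 2, reduce any two-digit product by summing its digits, and require the total to be divisible by 10. A sketch:

```python
def valid_israeli_id(tz):
    """Check-digit validation for a 9-digit Israeli ID (mispar zehut).
    Real IDs shorter than 9 digits are zero-padded on the left before
    validation, so callers should pad first."""
    if not (tz.isdigit() and len(tz) == 9):
        return False
    total = 0
    for i, ch in enumerate(tz):
        p = int(ch) * (1 if i % 2 == 0 else 2)
        # Reduce two-digit products by summing their digits (p - 9
        # is equivalent for p in 10..18)
        total += p if p < 10 else p - 9
    return total % 10 == 0
```

Rejecting OCR results that fail this check catches the single-digit misreads called out in the Gotchas section; a failed check should route the field to review rather than silently correcting it.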
Examples
Example 1: Process Tabu Extract
User says: "Extract data from this scanned Tabu document"
Result: Preprocess image, run Hebrew OCR, identify as Nesach Tabu, extract gush/chelka/owner/rights fields, validate TZ, return structured JSON.
Example 2: Batch Process Tax Forms
User says: "I have 50 scanned Tofes 106 forms -- extract salary and tax data"
Result: Set up batch pipeline -- preprocess each image, OCR with Hebrew+English, extract Tofes 106 fields, validate, output to CSV/JSON with confidence scores.
Example 3: OCR Quality Issues
User says: "The OCR isn't reading the Hebrew text correctly"
Result: Diagnose preprocessing -- check image resolution (recommend 300 DPI minimum), verify deskewing, adjust binarization threshold, try different PSM modes, check Hebrew language pack installation.
Bundled Resources
Scripts
- scripts/preprocess_image.py — Prepare scanned Israeli form images for OCR: grayscale conversion, deskewing rotated scans, adaptive binarization for uneven lighting, morphological noise removal, optional CLAHE contrast enhancement, and border removal. Run: python scripts/preprocess_image.py --help
- scripts/extract_form_fields.py — Run Tesseract Hebrew OCR on preprocessed form images and extract structured fields by form type. Supports auto-detection of Tabu, Tofes 106, and other Israeli government forms. Outputs JSON with extracted fields and Israeli ID validation. Run: python scripts/extract_form_fields.py --help
References
- references/israeli-form-types.md — Detailed catalog of Israeli government form types (Tabu/land registry, Tax Authority forms, Bituach Leumi documents) with field descriptions, regex extraction patterns, ID validation rules, date/currency formats, and OCR tips per form layout. Consult when identifying an unknown form or building field extraction logic for a specific document type.
OCR Engine Options
Tesseract is a strong free baseline for Hebrew print, but modern cloud vision APIs often match or beat it on noisy scans and stamped forms. Pick per use case:
| Engine | Hebrew print | Handwriting | Cost model | When to use |
|---|---|---|---|---|
| Tesseract (heb+eng, LSTM) | Good for clean scans at 300 DPI+ | Weak | Free, self-hosted | Default for batch local pipelines, privacy-sensitive data |
| Google Cloud Vision – Document Text Detection | Very good, robust to noise | Partial (printed-looking handwriting only) | Per-request | Mixed-quality scans, large batches, PDF forms |
| AWS Textract (AnalyzeDocument) | Good; stronger on forms/tables | Partial | Per-page | Forms with structured fields, key-value extraction |
| Azure AI Vision – Read API | Very good, layout-aware | Partial | Per-transaction | Enterprise Azure environments, signed PDFs |
| Claude Vision (claude-sonnet-4-6 / opus-4-6) | Very good, context-aware | Good (with prompt guidance) | Token-based | Unusual form layouts, cross-field validation, when you want the model to also reason about the data |
Notes: none of these engines is reliable for cursive handwritten Hebrew on forms like old Tabu extracts. For those, flag for human review instead of auto-extraction.
Gotchas
- Hebrew OCR accuracy drops significantly for handwritten text, especially for the letters vav, zayin, and yod, which look similar. Always include confidence scores and flag low-confidence characters.
- Scanned Israeli government forms often have mixed Hebrew and Arabic text. Agents may OCR the Arabic sections as Hebrew or skip them entirely. Both languages use RTL but different Unicode ranges.
- Israeli ID numbers (mispar zehut) and phone numbers extracted via OCR should be validated with check-digit algorithms after extraction, as single-digit OCR errors are common.
- Many Israeli forms use dot-matrix printed Hebrew, which has lower OCR accuracy than laser-printed text. Agents may report higher confidence than warranted for older government documents.
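The confidence-score flagging recommended in the first gotcha can be sketched as a filter over word-level OCR results. The dict shape below assumes Tesseract TSV-style output (the kind of rows pytesseract's image_to_data produces); the function name is illustrative:

```python
def flag_low_confidence(words, threshold=70):
    """Split OCR words into accepted and flagged-for-review lists.
    `words` is a list of {'text': str, 'conf': float} dicts; Tesseract
    reports conf as -1 for rows with no recognized text, so those are
    always flagged."""
    accepted, flagged = [], []
    for w in words:
        if w['conf'] >= threshold:
            accepted.append(w['text'])
        else:
            flagged.append(w['text'])
    return accepted, flagged
```

For dot-matrix or stamped forms, consider raising the threshold rather than trusting Tesseract's self-reported confidence, per the last gotcha above.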
Reference Links
| Source | URL | What to Check |
|---|---|---|
| Tesseract documentation | https://tesseract-ocr.github.io/tessdoc/ | Installation, language packs, LSTM mode, PSM values |
| Tesseract Hebrew traineddata | https://github.com/tesseract-ocr/tessdata | Hebrew traineddata model for Tesseract |
| Google Cloud Vision – Documents | https://cloud.google.com/vision/docs/fulltext-annotations | Document Text Detection API reference |
| AWS Textract – AnalyzeDocument | https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html | Forms and tables extraction |
| Azure AI Vision – Read API | https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr | Multi-language OCR including Hebrew |
| Israeli ID check-digit spec | https://he.wikipedia.org/wiki/מספר_זהות_(ישראל) | Algorithm for validating a 9-digit Israeli ID |
Troubleshooting
Error: "Tesseract Hebrew language pack not found"
Cause: Hebrew traineddata not installed
Solution: Install with sudo apt-get install tesseract-ocr-heb (Ubuntu) or brew install tesseract-lang (macOS). Verify with tesseract --list-langs.
Error: "OCR output is garbled or reversed"
Cause: Bidirectional text ordering issue
Solution: Use normalize_bidi_text() function. Ensure Tesseract is using --oem 1 (LSTM mode). Check that the image is not mirrored.
Error: "Low accuracy on specific form sections"
Cause: Poor scan quality, stamps/signatures overlapping text, colored backgrounds
Solution: Increase preprocessing -- apply stronger denoising, crop to specific form regions, use ROI-based extraction for known form layouts.