hebrew-ocr-forms

Installation

SKILL.md

Hebrew OCR Forms

Instructions

Step 1: Identify the Form Type

Form Type	Source	Key Identifiers	Common Fields
Nesach Tabu	Land Registry	"נסח טאבו", "לשכת רישום המקרקעין"	Gush, Chelka, Owner, Liens
Tofes 106	Tax Authority	"טופס 106", "דו״ח שנתי למעביד"	Salary, Tax, Employer
Ishur Nikui	Tax Authority	"אישור ניכוי מס במקור"	Tax rate, Validity, TZ
Tofes 857	Tax Authority	"טופס 857", "רווח הון"	Transaction, Gain, Tax
Ishur Zkauyot	Bituach Leumi	"אישור זכאויות", "ביטוח לאומי"	Benefit type, Amount
Tofes 100	Bituach Leumi	"טופס 100", "דין וחשבון"	Employees, Wages
Rishayon Rechev	Vehicle Licensing	"רישיון רכב"	Plate, Owner, Expiry

Step 2: Preprocess the Scanned Image

See scripts/preprocess_image.py for the full preprocessing pipeline. Key steps:

Convert to grayscale
Deskew -- Israeli forms are often slightly rotated from scanning
Binarize with adaptive threshold -- handles uneven lighting from scanners
Remove noise with morphological operations

Step 3: Run Hebrew OCR with Tesseract

See scripts/extract_form_fields.py for the full extraction pipeline.

Tesseract configuration for Hebrew forms:

config = (
    '--oem 1 '          # LSTM neural net (best for Hebrew)
    '--psm 6 '          # Assume uniform block of text
    '-l heb+eng '       # Hebrew + English (forms have both)
    '-c preserve_interword_spaces=1'  # Keep spacing for field alignment
)

For tabular forms (Tabu, Tofes 106), use PSM 4 instead of PSM 6
Always use LSTM mode (--oem 1) for best Hebrew accuracy
Include both heb and eng languages since forms mix Hebrew and English/numbers

Step 4: Extract Fields by Form Type

Tabu Extract (Nesach Tabu) key fields:

Gush (block) number: look for "גוש" followed by digits
Chelka (parcel) number: look for "חלקה" followed by digits
Owner name: follows "בעלים" or "שם הבעלים"
ID number (TZ): follows "ת.ז." or "מספר זהות", 9 digits

Tax Form (Tofes 106) key fields:

Tax year: follows "שנת מס"
Employer number: follows "מספר מעביד" or "מס׳ עוסק", 9 digits
Gross salary: follows "שכר ברוטו" or "הכנסה חייבת"
Tax deducted: follows "מס שנוכה" or "ניכוי מס"

Step 5: Handle RTL and Bidirectional Text

import unicodedata

def normalize_bidi_text(text):
    """Normalize bidirectional text from Hebrew OCR output."""
    lines = text.split('
')
    normalized = []
    for line in lines:
        # Strip bidi control characters
        clean = ''.join(
            c for c in line
            if unicodedata.category(c) != 'Cf'
        )
        clean = ' '.join(clean.split())
        if clean:
            normalized.append(clean)
    return '
'.join(normalized)

Step 6: Validate Extracted Data

After extraction, validate key fields:

TZ numbers: Run through Israeli ID validation algorithm (check digit)
Dates: Verify DD/MM/YYYY format and valid date ranges
Amounts: Check that numeric fields parse correctly, no OCR artifacts in digits
Gush/Chelka: Verify numeric format, reasonable ranges
Cross-reference: If TZ appears in multiple fields, verify consistency

Examples

Example 1: Process Tabu Extract

User says: "Extract data from this scanned Tabu document" Result: Preprocess image, run Hebrew OCR, identify as Nesach Tabu, extract gush/chelka/owner/rights fields, validate TZ, return structured JSON.

Example 2: Batch Process Tax Forms

User says: "I have 50 scanned Tofes 106 forms -- extract salary and tax data" Result: Set up batch pipeline -- preprocess each image, OCR with Hebrew+English, extract Tofes 106 fields, validate, output to CSV/JSON with confidence scores.

Example 3: OCR Quality Issues

User says: "The OCR isn't reading the Hebrew text correctly" Result: Diagnose preprocessing -- check image resolution (recommend 300 DPI minimum), verify deskewing, adjust binarization threshold, try different PSM modes, check Hebrew language pack installation.

Bundled Resources

Scripts

scripts/preprocess_image.py — Prepare scanned Israeli form images for OCR: grayscale conversion, deskewing rotated scans, adaptive binarization for uneven lighting, morphological noise removal, optional CLAHE contrast enhancement, and border removal. Run: python scripts/preprocess_image.py --help
scripts/extract_form_fields.py — Run Tesseract Hebrew OCR on preprocessed form images and extract structured fields by form type. Supports auto-detection of Tabu, Tofes 106, and other Israeli government forms. Outputs JSON with extracted fields and Israeli ID validation. Run: python scripts/extract_form_fields.py --help

References

references/israeli-form-types.md — Detailed catalog of Israeli government form types (Tabu/land registry, Tax Authority forms, Bituach Leumi documents) with field descriptions, regex extraction patterns, ID validation rules, date/currency formats, and OCR tips per form layout. Consult when identifying an unknown form or building field extraction logic for a specific document type.

OCR Engine Options

Tesseract is a strong free baseline for Hebrew print, but modern cloud vision APIs often match or beat it on noisy scans and stamped forms. Pick per use case:

Engine	Hebrew print	Handwriting	Cost model	When to use
Tesseract (heb+eng, LSTM)	Good for clean scans at 300 DPI+	Weak	Free, self-hosted	Default for batch local pipelines, privacy-sensitive data
Google Cloud Vision – Document Text Detection	Very good, robust to noise	Partial (printed-looking handwriting only)	Per-request	Mixed-quality scans, large batches, PDF forms
AWS Textract (AnalyzeDocument)	Good; stronger on forms/tables	Partial	Per-page	Forms with structured fields, key-value extraction
Azure AI Vision – Read API	Very good, layout-aware	Partial	Per-transaction	Enterprise Azure environments, signed PDFs
Claude Vision (claude-sonnet-4-6 / opus-4-6)	Very good, context-aware	Good (with prompt guidance)	Token-based	Unusual form layouts, cross-field validation, when you want the model to also reason about the data

Notes: none of these engines is reliable for cursive handwritten Hebrew on forms like old Tabu extracts. For those, flag for human review instead of auto-extraction.

Gotchas

Hebrew OCR accuracy drops significantly for handwritten text, especially for the letters vav, zayin, and yod which look similar. Always include confidence scores and flag low-confidence characters.
Scanned Israeli government forms often have mixed Hebrew and Arabic text. Agents may OCR the Arabic sections as Hebrew or skip them entirely. Both languages use RTL but different Unicode ranges.
Israeli ID numbers (mispar zehut) and phone numbers extracted via OCR should be validated with check-digit algorithms after extraction, as single-digit OCR errors are common.
Many Israeli forms use dot-matrix printed Hebrew, which has lower OCR accuracy than laser-printed text. Agents may report higher confidence than warranted for older government documents.

Reference Links

Source	URL	What to Check
Tesseract documentation	https://tesseract-ocr.github.io/tessdoc/	Installation, language packs, LSTM mode, PSM values
Tesseract Hebrew traineddata	https://github.com/tesseract-ocr/tessdata	Hebrew model for tesseract
Google Cloud Vision – Documents	https://cloud.google.com/vision/docs/fulltext-annotations	Document Text Detection API reference
AWS Textract – AnalyzeDocument	https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html	Forms and tables extraction
Azure AI Vision – Read API	https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr	Multi-language OCR including Hebrew
Israeli ID check-digit spec	https://he.wikipedia.org/wiki/מספר_זהות_(ישראל)	Algorithm for validating a 9-digit Israeli ID

Troubleshooting

Error: "Tesseract Hebrew language pack not found"

Cause: Hebrew traineddata not installed Solution: Install with sudo apt-get install tesseract-ocr-heb (Ubuntu) or brew install tesseract-lang (macOS). Verify with tesseract --list-langs.

Error: "OCR output is garbled or reversed"

Cause: Bidirectional text ordering issue Solution: Use normalize_bidi_text() function. Ensure Tesseract is using --oem 1 (LSTM mode). Check that the image is not mirrored.

Error: "Low accuracy on specific form sections"

Cause: Poor scan quality, stamps/signatures overlapping text, colored backgrounds Solution: Increase preprocessing -- apply stronger denoising, crop to specific form regions, use ROI-based extraction for known form layouts.

Related skills

More from skills-il/localization

Installs

Repository

skills-il/localization

GitHub Stars

First Seen

Mar 18, 2026

Security Audits

SnykPass