hebrew-ocr-forms

Installation
SKILL.md

Hebrew OCR Forms

Instructions

Step 1: Identify the Form Type

Form Type Source Key Identifiers Common Fields
Nesach Tabu Land Registry "נסח טאבו", "לשכת רישום המקרקעין" Gush, Chelka, Owner, Liens
Tofes 106 Tax Authority "טופס 106", "דו״ח שנתי למעביד" Salary, Tax, Employer
Ishur Nikui Tax Authority "אישור ניכוי מס במקור" Tax rate, Validity, TZ
Tofes 857 Tax Authority "טופס 857", "רווח הון" Transaction, Gain, Tax
Ishur Zkauyot Bituach Leumi "אישור זכאויות", "ביטוח לאומי" Benefit type, Amount
Tofes 100 Bituach Leumi "טופס 100", "דין וחשבון" Employees, Wages
Rishayon Rechev Vehicle Licensing "רישיון רכב" Plate, Owner, Expiry

Step 2: Preprocess the Scanned Image

See scripts/preprocess_image.py for the full preprocessing pipeline. Key steps:

  1. Convert to grayscale
  2. Deskew -- Israeli forms are often slightly rotated from scanning
  3. Binarize with adaptive threshold -- handles uneven lighting from scanners
  4. Remove noise with morphological operations

Step 3: Run Hebrew OCR with Tesseract

See scripts/extract_form_fields.py for the full extraction pipeline.

Tesseract configuration for Hebrew forms:

config = (
    '--oem 1 '          # LSTM neural net (best for Hebrew)
    '--psm 6 '          # Assume uniform block of text
    '-l heb+eng '       # Hebrew + English (forms have both)
    '-c preserve_interword_spaces=1'  # Keep spacing for field alignment
)
  • For tabular forms (Tabu, Tofes 106), use PSM 4 instead of PSM 6
  • Always use LSTM mode (--oem 1) for best Hebrew accuracy
  • Include both heb and eng languages since forms mix Hebrew and English/numbers

Step 4: Extract Fields by Form Type

Tabu Extract (Nesach Tabu) key fields:

  • Gush (block) number: look for "גוש" followed by digits
  • Chelka (parcel) number: look for "חלקה" followed by digits
  • Owner name: follows "בעלים" or "שם הבעלים"
  • ID number (TZ): follows "ת.ז." or "מספר זהות", 9 digits

Tax Form (Tofes 106) key fields:

  • Tax year: follows "שנת מס"
  • Employer number: follows "מספר מעביד" or "מס׳ עוסק", 9 digits
  • Gross salary: follows "שכר ברוטו" or "הכנסה חייבת"
  • Tax deducted: follows "מס שנוכה" or "ניכוי מס"

Step 5: Handle RTL and Bidirectional Text

import unicodedata

def normalize_bidi_text(text):
    """Normalize bidirectional text from Hebrew OCR output."""
    lines = text.split('
')
    normalized = []
    for line in lines:
        # Strip bidi control characters
        clean = ''.join(
            c for c in line
            if unicodedata.category(c) != 'Cf'
        )
        clean = ' '.join(clean.split())
        if clean:
            normalized.append(clean)
    return '
'.join(normalized)

Step 6: Validate Extracted Data

After extraction, validate key fields:

  • TZ numbers: Run through Israeli ID validation algorithm (check digit)
  • Dates: Verify DD/MM/YYYY format and valid date ranges
  • Amounts: Check that numeric fields parse correctly, no OCR artifacts in digits
  • Gush/Chelka: Verify numeric format, reasonable ranges
  • Cross-reference: If TZ appears in multiple fields, verify consistency

Examples

Example 1: Process Tabu Extract

User says: "Extract data from this scanned Tabu document" Result: Preprocess image, run Hebrew OCR, identify as Nesach Tabu, extract gush/chelka/owner/rights fields, validate TZ, return structured JSON.

Example 2: Batch Process Tax Forms

User says: "I have 50 scanned Tofes 106 forms -- extract salary and tax data" Result: Set up batch pipeline -- preprocess each image, OCR with Hebrew+English, extract Tofes 106 fields, validate, output to CSV/JSON with confidence scores.

Example 3: OCR Quality Issues

User says: "The OCR isn't reading the Hebrew text correctly" Result: Diagnose preprocessing -- check image resolution (recommend 300 DPI minimum), verify deskewing, adjust binarization threshold, try different PSM modes, check Hebrew language pack installation.

Bundled Resources

Scripts

  • scripts/preprocess_image.py — Prepare scanned Israeli form images for OCR: grayscale conversion, deskewing rotated scans, adaptive binarization for uneven lighting, morphological noise removal, optional CLAHE contrast enhancement, and border removal. Run: python scripts/preprocess_image.py --help
  • scripts/extract_form_fields.py — Run Tesseract Hebrew OCR on preprocessed form images and extract structured fields by form type. Supports auto-detection of Tabu, Tofes 106, and other Israeli government forms. Outputs JSON with extracted fields and Israeli ID validation. Run: python scripts/extract_form_fields.py --help

References

  • references/israeli-form-types.md — Detailed catalog of Israeli government form types (Tabu/land registry, Tax Authority forms, Bituach Leumi documents) with field descriptions, regex extraction patterns, ID validation rules, date/currency formats, and OCR tips per form layout. Consult when identifying an unknown form or building field extraction logic for a specific document type.

OCR Engine Options

Tesseract is a strong free baseline for Hebrew print, but modern cloud vision APIs often match or beat it on noisy scans and stamped forms. Pick per use case:

Engine Hebrew print Handwriting Cost model When to use
Tesseract (heb+eng, LSTM) Good for clean scans at 300 DPI+ Weak Free, self-hosted Default for batch local pipelines, privacy-sensitive data
Google Cloud Vision – Document Text Detection Very good, robust to noise Partial (printed-looking handwriting only) Per-request Mixed-quality scans, large batches, PDF forms
AWS Textract (AnalyzeDocument) Good; stronger on forms/tables Partial Per-page Forms with structured fields, key-value extraction
Azure AI Vision – Read API Very good, layout-aware Partial Per-transaction Enterprise Azure environments, signed PDFs
Claude Vision (claude-sonnet-4-6 / opus-4-6) Very good, context-aware Good (with prompt guidance) Token-based Unusual form layouts, cross-field validation, when you want the model to also reason about the data

Notes: none of these engines is reliable for cursive handwritten Hebrew on forms like old Tabu extracts. For those, flag for human review instead of auto-extraction.

Gotchas

  • Hebrew OCR accuracy drops significantly for handwritten text, especially for the letters vav, zayin, and yod which look similar. Always include confidence scores and flag low-confidence characters.
  • Scanned Israeli government forms often have mixed Hebrew and Arabic text. Agents may OCR the Arabic sections as Hebrew or skip them entirely. Both languages use RTL but different Unicode ranges.
  • Israeli ID numbers (mispar zehut) and phone numbers extracted via OCR should be validated with check-digit algorithms after extraction, as single-digit OCR errors are common.
  • Many Israeli forms use dot-matrix printed Hebrew, which has lower OCR accuracy than laser-printed text. Agents may report higher confidence than warranted for older government documents.

Reference Links

Source URL What to Check
Tesseract documentation https://tesseract-ocr.github.io/tessdoc/ Installation, language packs, LSTM mode, PSM values
Tesseract Hebrew traineddata https://github.com/tesseract-ocr/tessdata Hebrew model for tesseract
Google Cloud Vision – Documents https://cloud.google.com/vision/docs/fulltext-annotations Document Text Detection API reference
AWS Textract – AnalyzeDocument https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html Forms and tables extraction
Azure AI Vision – Read API https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr Multi-language OCR including Hebrew
Israeli ID check-digit spec https://he.wikipedia.org/wiki/מספר_זהות_(ישראל) Algorithm for validating a 9-digit Israeli ID

Troubleshooting

Error: "Tesseract Hebrew language pack not found"

Cause: Hebrew traineddata not installed Solution: Install with sudo apt-get install tesseract-ocr-heb (Ubuntu) or brew install tesseract-lang (macOS). Verify with tesseract --list-langs.

Error: "OCR output is garbled or reversed"

Cause: Bidirectional text ordering issue Solution: Use normalize_bidi_text() function. Ensure Tesseract is using --oem 1 (LSTM mode). Check that the image is not mirrored.

Error: "Low accuracy on specific form sections"

Cause: Poor scan quality, stamps/signatures overlapping text, colored backgrounds Solution: Increase preprocessing -- apply stronger denoising, crop to specific form regions, use ROI-based extraction for known form layouts.

Related skills

More from skills-il/localization

Installs
5
GitHub Stars
8
First Seen
Mar 18, 2026
Security Audits