Data Extractor

Overview

Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.

Instructions

When a user asks you to extract data from a document, follow this process:

Step 1: Identify the document format and install dependencies

# Determine file type
file document.pdf

# Install dependencies based on format
pip install pdfplumber python-docx beautifulsoup4 lxml openpyxl

Library selection by format:

PDF: pdfplumber (text + tables), PyMuPDF (fitz) for complex layouts
DOCX: python-docx
HTML: beautifulsoup4 with lxml
Excel: openpyxl or pandas
Images: pytesseract (OCR) with Pillow
JSON/XML: Python standard library

Step 2: Extract raw content

PDF extraction:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        print(f"--- Page {i+1} ---")
        print(text)

        # Extract tables if present
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

DOCX extraction:

from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Extract tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

HTML extraction:

from bs4 import BeautifulSoup

with open("document.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Extract specific elements
for table in soup.find_all("table"):
    rows = table.find_all("tr")
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all(["td", "th"])]
        print(cells)

Step 3: Parse and structure the data

Once you have raw text, extract the target fields:

Pattern-based extraction:

import re
import json

text = "..."  # extracted text

# Define patterns for common fields
patterns = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+[-/]?\w+)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        extracted[field] = match.group(1) if match.lastindex else match.group(0)

print(json.dumps(extracted, indent=2))

Line-item extraction from tables:

import pandas as pd

# From a list of table rows
headers = table_data[0]
rows = table_data[1:]
df = pd.DataFrame(rows, columns=headers)

# Clean up
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.dropna(how="all")

Step 4: Validate and clean the output

# Type conversion
extracted["total"] = float(extracted["total"].replace(",", ""))

# Date normalization
from datetime import datetime
extracted["date"] = datetime.strptime(extracted["date"], "%m/%d/%Y").isoformat()

# Validate required fields
required = ["invoice_number", "date", "total"]
missing = [f for f in required if f not in extracted]
if missing:
    print(f"Warning: missing fields: {missing}")

Step 5: Output in the desired format

# JSON output
with open("extracted_data.json", "w") as f:
    json.dump(extracted, f, indent=2)

# CSV output
df.to_csv("extracted_items.csv", index=False)

# Pretty print summary
print(f"Extracted {len(extracted)} fields from document")
print(f"Line items: {len(df)} rows")

Examples

Example 1: Extract invoice data from a PDF

User request: "Extract the invoice details from this PDF"

Actions:

Open the PDF with pdfplumber and extract text
Use regex patterns to find invoice number, date, vendor, subtotal, tax, total
Extract the line items table into a DataFrame
Output a JSON file with header fields and a CSV with line items

Output:

{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}

Example 2: Extract contacts from a DOCX directory

User request: "Pull all names and email addresses from this company directory document"

Actions:

Parse the DOCX file, iterate through paragraphs and tables
Use regex to find email addresses and associated names
Deduplicate and output as CSV

Output: A CSV file with columns: name, email, department, phone.

Example 3: Convert an HTML report to structured data

User request: "Extract the quarterly results table from this HTML page"

Actions:

Parse the HTML with BeautifulSoup
Find the target table by heading or class
Extract headers and rows into a DataFrame
Clean column names and convert numeric values
Export as CSV and provide summary statistics

Output: A clean CSV with quarterly metrics and a summary of key figures.

Guidelines

Always inspect the raw extracted text before writing parsers. Understanding the layout saves time.
Use pdfplumber for most PDF extraction. Fall back to PyMuPDF for complex multi-column layouts.
For scanned PDFs (image-based), use OCR with pytesseract before parsing.
Validate extracted data types: convert strings to numbers, normalize dates.
Report extraction confidence: note any fields that could not be found or seem incorrect.
Handle multi-page documents by accumulating results across pages.
For batch extraction (many documents of the same type), build a reusable extraction function and apply it across all files.
Always preserve the original document alongside extracted data for verification.
When patterns fail, fall back to positional extraction based on text layout.

data-extractor

Data Extractor

Overview

Instructions

Step 1: Identify the document format and install dependencies

Step 2: Extract raw content

Step 3: Parse and structure the data

Step 4: Validate and clean the output

Step 5: Output in the desired format

Examples

Example 1: Extract invoice data from a PDF

Example 2: Extract contacts from a DOCX directory

Example 3: Convert an HTML report to structured data

Guidelines

More from terminalskills/skills

api-tester

instagram-marketing

directus

agent-memory

reddit-insights

trpc