langextract — LLM-Powered Structured Information Extraction
Extract structured data from unstructured text with character-level provenance. Every extracted entity traces back to exact character offsets in the source document.
When to use this skill
- Extracting entities, relationships, or facts from unstructured text
- Processing clinical notes, legal documents, research papers, or reports
- Building NLP pipelines that need citation-level traceability (not just extracted values)
- Long-document extraction (chunking + parallel workers + multi-pass for recall)
- Replacing fragile regex/rule-based extraction with LLM-driven schema enforcement
- Generating interactive HTML visualizations of annotated text
1. Installation
# Standard install (Gemini backend — default)
pip install langextract
# With OpenAI support (quoted so shells such as zsh do not expand the brackets)
pip install "langextract[openai]"
# Development (run from a clone of the repository)
pip install -e ".[dev]"
API key setup:
export LANGEXTRACT_API_KEY="your-gemini-or-openai-key"
# Gemini keys: https://aistudio.google.com/app/apikey
# OpenAI keys: https://platform.openai.com/api-keys
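A minimal sketch of failing fast when no key is configured, assuming the key lives in LANGEXTRACT_API_KEY as shown above. `resolve_api_key` is a hypothetical helper, not part of langextract; it only mirrors the precedence this document describes, where an explicit `api_key` overrides the environment variable.

```python
import os


def resolve_api_key(explicit=None, env=None):
    """Return an API key, preferring an explicit argument over the environment.

    Hypothetical helper (not part of langextract): an explicit key wins,
    otherwise LANGEXTRACT_API_KEY is read from the environment mapping.
    """
    env = os.environ if env is None else env
    key = explicit or env.get("LANGEXTRACT_API_KEY")
    if not key:
        raise RuntimeError("Set LANGEXTRACT_API_KEY or pass api_key explicitly")
    return key
```

Calling this once at startup surfaces a missing key before any chunk is sent to a model.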
2. Core concepts
| Concept | Description |
|---|---|
| Source grounding | Every extraction carries (start, end) char offsets into original text |
| Controlled generation | Gemini uses schema-constrained decoding; no hallucinated field names |
| Few-shot examples | Schema is inferred from ExampleData objects — zero fine-tuning needed |
| Multi-pass extraction | extraction_passes=N runs N independent passes; results are merged |
| Parallel chunking | max_workers=N processes text chunks concurrently |
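To make the multi-pass row concrete, here is a sketch of one plausible merge policy: keep every span from the first pass, then add spans from later passes only when they do not overlap anything already kept. This is an assumption for illustration, not langextract's actual merge code; `merge_passes` and the tuple layout are hypothetical.

```python
def merge_passes(passes):
    """Merge extraction passes; earlier passes win on overlapping spans.

    Each pass is a list of (start, end, label) tuples with end exclusive.
    Illustrative policy only, not langextract's implementation.
    """
    merged = []
    for extractions in passes:
        for start, end, label in extractions:
            # Two half-open intervals overlap when each starts before the other ends.
            overlaps = any(start < e and s < end for s, e, _ in merged)
            if not overlaps:
                merged.append((start, end, label))
    return sorted(merged)


pass_1 = [(0, 5, "character")]
pass_2 = [(3, 8, "character"), (10, 14, "emotion")]
print(merge_passes([pass_1, pass_2]))
# prints [(0, 5, 'character'), (10, 14, 'emotion')]
```

The second pass contributes only the span that the first pass missed, which is the recall gain that extra passes are meant to provide.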
3. Basic extraction
import langextract as lx
import textwrap
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Lady Juliet gazed longingly at the stars...",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
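# Access results. Offsets live on extraction.char_interval, which can be None
# when an extraction could not be aligned to the source text.
for extraction in result.extractions:
    span = extraction.char_interval
    offsets = f"@ chars {span.start_pos}–{span.end_pos}" if span else "(unaligned)"
    print(f"[{extraction.extraction_class}] '{extraction.extraction_text}' {offsets}")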
4. Long-document extraction (URL input, multi-pass, parallel)
result = lx.extract(
text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
extraction_passes=3, # 3 independent runs, results merged
max_workers=20, # parallel chunk processing
max_char_buffer=1000 # smaller focused context windows
)
# Romeo & Juliet (147k chars / ~44k tokens) → 4,088 entities extracted
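The chunk-then-parallelize idea behind `max_char_buffer` and `max_workers` can be sketched with the standard library alone. Everything below is a simplified assumption: `chunk_text`, `find_in_chunk`, and `extract_parallel` are hypothetical names, the "model call" is a plain substring search, and the fixed-width splitter can cut an entity in half at a chunk boundary, which a real chunker would try to avoid.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into (offset, chunk) pairs of at most max_chars characters."""
    return [(i, text[i:i + max_chars]) for i in range(0, len(text), max_chars)]


def find_in_chunk(offset, chunk, needle):
    """Stand-in for a model call: return document-level spans of needle."""
    spans = []
    pos = chunk.find(needle)
    while pos != -1:
        # Shift the chunk-local position back into document coordinates.
        spans.append((offset + pos, offset + pos + len(needle)))
        pos = chunk.find(needle, pos + 1)
    return spans


def extract_parallel(text, needle, max_chars=1000, max_workers=4):
    """Process chunks concurrently; results come back in chunk order."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda c: find_in_chunk(c[0], c[1], needle), chunks)
    return [span for spans in results for span in spans]
```

The point of carrying `offset` alongside each chunk is that spans found inside a chunk still index into the original document, which is what keeps provenance intact after chunking.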
5. OpenAI backend
import os
import langextract as lx
# input_text is your source string; prompt and examples are defined as in section 3
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="gpt-4o",
api_key=os.environ.get("OPENAI_API_KEY"),
fence_output=True,
use_schema_constraints=False
)
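Without schema-constrained decoding, the reply has to be parsed out of free text. The sketch below assumes `fence_output=True` means the model is asked to wrap its JSON in a Markdown code fence; `parse_model_output` is a hypothetical helper showing how such a reply could be unwrapped, not the parser langextract uses.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the example stays inside one code block
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_output(raw):
    """Parse JSON from a reply, unwrapping a fenced block if one is present."""
    match = FENCED_JSON.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


reply = "Here you go:\n" + FENCE + 'json\n{"extractions": []}\n' + FENCE
print(parse_model_output(reply))
# prints {'extractions': []}
```

Fencing gives the parser a clear boundary when a model adds commentary around its answer, which is a common failure mode for models that lack constrained decoding.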
6. Local LLMs via Ollama
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="gemma2:2b",
model_url="http://localhost:11434",
fence_output=False,
use_schema_constraints=False
)
7. Visualize results
lx.io.save_annotated_documents([result], output_name="results.jsonl", output_dir=".")
html_content = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    # In notebook environments the return value may be a display object with a .data attribute
    f.write(html_content.data if hasattr(html_content, "data") else html_content)
# Open visualization.html in browser → color-coded annotations over source text
8. Key parameters reference
| Parameter | Type | Description |
|---|---|---|
| text_or_documents | str / URL | Raw text, URL to fetch, or list of documents |
| prompt_description | str | Natural language extraction instructions |
| examples | list[ExampleData] | Few-shot examples that define the schema |
| model_id | str | gemini-2.5-flash, gpt-4o, gemma2:2b, … |
| api_key | str | API key (overrides LANGEXTRACT_API_KEY env var) |
| model_url | str | Base URL for Ollama or custom endpoints |
| extraction_passes | int | Independent extraction runs (default: 1) |
| max_workers | int | Parallel chunk workers (default: 1) |
| max_char_buffer | int | Characters per chunk |
| fence_output | bool | Use JSON fencing instead of constrained decoding |
| use_schema_constraints | bool | Controlled generation (Gemini default: True) |
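The backend-specific flags from sections 5 and 6 can be collected in one place. This is a sketch only: `backend_options` is a hypothetical helper, the prefix checks are an assumption about how model IDs are named, and the values simply restate the settings used earlier in this document.

```python
def backend_options(model_id, ollama_url="http://localhost:11434"):
    """Return extra keyword arguments for a given model_id.

    Hypothetical helper: the values only restate the settings used in the
    OpenAI and Ollama sections of this document.
    """
    if model_id.startswith("gemini"):
        return {"use_schema_constraints": True}
    if model_id.startswith("gpt-"):
        return {"fence_output": True, "use_schema_constraints": False}
    # Anything else is treated as a local Ollama model in this sketch.
    return {
        "model_url": ollama_url,
        "fence_output": False,
        "use_schema_constraints": False,
    }
```

The returned dictionary could then be unpacked into the call, for example `lx.extract(..., model_id=model_id, **backend_options(model_id))`.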
9. Custom provider plugin
import langextract as lx

@lx.providers.registry.register(r'^mymodel', r'^custom')
class MyProviderLanguageModel(lx.inference.BaseLanguageModel):
    def __init__(self, model_id: str, api_key: str = None, **kwargs):
        # MyProviderClient is a placeholder for your backend's own client class
        self.client = MyProviderClient(api_key=api_key)

    def infer(self, batch_prompts, **kwargs):
        # Yield one list of scored outputs per prompt, in input order
        for prompt in batch_prompts:
            result = self.client.generate(prompt, **kwargs)
            yield [lx.inference.ScoredOutput(score=1.0, output=result)]
Package as a PyPI plugin with an entry point in pyproject.toml:
[project.entry-points."langextract.providers"]
myprovider = "langextract_myprovider:MyProviderLanguageModel"
Disable all plugins: LANGEXTRACT_DISABLE_PLUGINS=1
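The decorator above pairs regex patterns with a provider class so that a `model_id` can be routed to the right backend. A toy version of that idea, written from scratch, might look like the following. It is an illustration of pattern-based dispatch under assumed semantics (first match wins), not langextract's registry; `register`, `resolve`, and `ToyProvider` are hypothetical names.

```python
import re

_REGISTRY = []  # list of (compiled pattern, provider class) pairs


def register(*patterns):
    """Class decorator that maps model_id regex patterns to a provider."""
    def decorator(cls):
        for pattern in patterns:
            _REGISTRY.append((re.compile(pattern), cls))
        return cls
    return decorator


def resolve(model_id):
    """Return the first registered provider whose pattern matches model_id."""
    for pattern, cls in _REGISTRY:
        if pattern.search(model_id):
            return cls
    raise ValueError(f"No provider registered for {model_id!r}")


@register(r"^mymodel", r"^custom")
class ToyProvider:
    """Placeholder provider used only to demonstrate the lookup."""


print(resolve("mymodel-large").__name__)
# prints ToyProvider
```

Anchoring patterns with `^` keeps a provider from accidentally claiming model IDs that merely contain its name somewhere in the middle.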
10. Use cases
| Domain | Example |
|---|---|
| Medical/clinical | Medication names, dosages, routes from clinical notes |
| Legal | Clause extraction, party identification from contracts |
| Literary analysis | Character, emotion, relationship graphs |
| Finance | Structured data extraction from earnings reports |
| Radiology | Free-text radiology reports → structured format |
| Research | Entity/relation extraction from academic papers |
Best practices
- Write precise prompts: specify "use exact text, do not paraphrase" to keep offsets accurate
- Use few-shot examples: 2–3 examples covering edge cases improve accuracy
- Tune max_char_buffer: smaller values (500–1000) give more focused context; larger values reduce API calls
- Use extraction_passes=3 for long docs: independent runs catch entities missed in a single pass
- Set max_workers: parallelization speeds up long-document processing
- Verify offsets: for exactly aligned extractions, result.text[extraction.char_interval.start_pos:extraction.char_interval.end_pos] should equal extraction_text
- Use visualization: HTML output makes it easy to spot extraction errors and coverage gaps
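The offset check from the list above can be run over a whole result in one pass. The sketch below works on plain tuples so it stays self-contained; with langextract you would build those tuples from each extraction's character interval and text. `check_offsets` is a hypothetical helper.

```python
def check_offsets(source_text, extractions):
    """Return the extractions whose offsets do not reproduce their text.

    Each extraction is a (start, end, extraction_text) tuple, end exclusive.
    """
    return [
        (start, end, text)
        for start, end, text in extractions
        if source_text[start:end] != text
    ]


source = "Lady Juliet gazed longingly at the stars"
spans = [(5, 11, "Juliet"), (12, 17, "gazes")]
print(check_offsets(source, spans))
# prints [(12, 17, 'gazes')]
```

An empty list means every span is grounded; anything returned is either a paraphrase by the model or a misaligned offset and is worth inspecting in the visualization.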