
Named Entity Extractor

Extract named entities from text including people, organizations, locations, dates, and more.

Features

  • Entity Types: People, organizations, locations, dates, money, percentages
  • Multiple Models: spaCy for accuracy, regex for speed
  • Batch Processing: Process multiple documents
  • Entity Linking: Group mentions of the same entity across the text
  • Export: JSON, CSV output formats
  • Visualization: Entity highlighting

Quick Start

from entity_extractor import EntityExtractor

extractor = EntityExtractor()

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."

entities = extractor.extract(text)
for entity in entities:
    print(f"{entity['text']}: {entity['type']}")

# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
# 1976: DATE

CLI Usage

# Extract from text
python entity_extractor.py --text "Steve Jobs founded Apple in California."

# Extract from file
python entity_extractor.py --input document.txt

# Batch process folder
python entity_extractor.py --input ./documents/ --output entities.csv

# Filter by entity type
python entity_extractor.py --input document.txt --types PERSON,ORG

# Use regex mode (faster, less accurate)
python entity_extractor.py --input document.txt --mode regex

# JSON output
python entity_extractor.py --input document.txt --json

API Reference

EntityExtractor Class

class EntityExtractor:
    def __init__(self, mode: str = "spacy", model: str = "en_core_web_sm")

    # Extraction
    def extract(self, text: str) -> list
    def extract_file(self, filepath: str) -> list
    def extract_batch(self, folder: str) -> dict

    # Filtering
    def filter_entities(self, entities: list, types: list) -> list
    def get_unique_entities(self, entities: list) -> list
    def group_by_type(self, entities: list) -> dict

    # Analysis
    def entity_frequency(self, text: str) -> dict
    def find_relationships(self, text: str) -> list

    # Export
    def to_csv(self, entities: list, output: str) -> str
    def to_json(self, entities: list, output: str) -> str
    def highlight_text(self, text: str) -> str

Entity Types

Standard Entity Types (spaCy)

| Type | Description | Example |
|------|-------------|---------|
| PERSON | People, including fictional | "Steve Jobs" |
| ORG | Companies, agencies, institutions | "Apple Inc." |
| GPE | Countries, cities, states | "California" |
| LOC | Non-GPE locations, mountains, water | "Pacific Ocean" |
| DATE | Dates, periods | "January 2024" |
| TIME | Times | "3:30 PM" |
| MONEY | Monetary values | "$1.5 million" |
| PERCENT | Percentages | "20%" |
| PRODUCT | Products | "iPhone" |
| EVENT | Events | "World Cup" |
| WORK_OF_ART | Books, songs, etc. | "The Great Gatsby" |
| LAW | Laws, regulations | "GDPR" |
| LANGUAGE | Languages | "English" |
| NORP | Nationalities, groups | "American" |

Regex Mode Entities

Faster extraction with regex patterns:

| Type | Description |
|------|-------------|
| EMAIL | Email addresses |
| PHONE | Phone numbers |
| URL | Web URLs |
| DATE | Common date formats |
| MONEY | Currency amounts |
| PERCENTAGE | Percentages |
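
The skill's actual patterns are not published here, but regex-mode extraction can be sketched in a few lines. The patterns below are illustrative stand-ins, and `extract_regex` is my own name, not part of the skill's API; it does return entities in the documented output shape:

```python
import re

# Illustrative patterns only; the skill's real regexes may differ.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "PHONE": r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}",
    "URL":   r"https?://[^\s]+",
}

def extract_regex(text):
    """Return entity dicts matching the documented output shape."""
    entities = []
    for etype, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            entities.append({"text": m.group(), "type": etype,
                             "start": m.start(), "end": m.end()})
    # Sort by position so results read in document order.
    return sorted(entities, key=lambda e: e["start"])
```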

Output Format

Entity Result

{
    "text": "Steve Jobs",
    "type": "PERSON",
    "start": 10,
    "end": 20,
    "confidence": 0.95
}

Full Extraction Result

{
    "text": "Original text...",
    "entities": [
        {"text": "Steve Jobs", "type": "PERSON", "start": 10, "end": 20},
        {"text": "Apple Inc.", "type": "ORG", "start": 30, "end": 40}
    ],
    "summary": {
        "total_entities": 5,
        "unique_entities": 4,
        "by_type": {
            "PERSON": 2,
            "ORG": 1,
            "GPE": 2
        }
    }
}
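
The summary block can be derived from the entity list alone. A minimal sketch using collections.Counter (`build_summary` is a hypothetical helper name, not part of the skill's API):

```python
from collections import Counter

def build_summary(entities):
    """Compute the documented summary block from a list of entity dicts."""
    by_type = Counter(e["type"] for e in entities)
    # Uniqueness here means distinct (text, type) pairs.
    unique = {(e["text"], e["type"]) for e in entities}
    return {
        "total_entities": len(entities),
        "unique_entities": len(unique),
        "by_type": dict(by_type),
    }
```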

Filtering and Grouping

Filter by Type

entities = extractor.extract(text)

# Get only people and organizations
filtered = extractor.filter_entities(entities, ["PERSON", "ORG"])

Get Unique Entities

# Remove duplicates, keep first occurrence
unique = extractor.get_unique_entities(entities)

Group by Type

grouped = extractor.group_by_type(entities)

# Returns:
{
    "PERSON": ["Steve Jobs", "Tim Cook"],
    "ORG": ["Apple Inc."],
    "GPE": ["California", "Cupertino"]
}
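
These three helpers are simple list-and-dict operations; the sketch below shows plausible implementations (not necessarily the skill's own) that match the documented signatures and output shapes:

```python
from collections import defaultdict

def filter_entities(entities, types):
    """Keep only entities whose type is in `types`."""
    wanted = set(types)
    return [e for e in entities if e["type"] in wanted]

def get_unique_entities(entities):
    """Drop duplicates, keeping the first occurrence of each (text, type)."""
    seen, unique = set(), []
    for e in entities:
        key = (e["text"], e["type"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

def group_by_type(entities):
    """Map entity type -> list of unique entity texts."""
    grouped = defaultdict(list)
    for e in get_unique_entities(entities):
        grouped[e["type"]].append(e["text"])
    return dict(grouped)
```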

Entity Frequency

frequency = extractor.entity_frequency(text)

# Returns:
{
    "Steve Jobs": {"count": 5, "type": "PERSON"},
    "Apple": {"count": 8, "type": "ORG"},
    "California": {"count": 2, "type": "GPE"}
}
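
Frequency counting reduces to tallying surface forms. A sketch of the idea; note the documented method takes raw text, while this stand-in works on an already-extracted entity list:

```python
from collections import Counter

def entity_frequency(entities):
    """Count mentions per entity surface form, keeping the entity type."""
    counts = Counter((e["text"], e["type"]) for e in entities)
    return {text: {"count": n, "type": etype}
            for (text, etype), n in counts.items()}
```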

Batch Processing

Process Folder

results = extractor.extract_batch("./documents/")

# Returns:
{
    "doc1.txt": {
        "entities": [...],
        "summary": {...}
    },
    "doc2.txt": {
        "entities": [...],
        "summary": {...}
    }
}
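
Batch processing is a loop over the folder's text files. A sketch assuming `.txt` inputs; to keep it self-contained, this version takes any text-to-entities callable rather than the skill's internal extractor:

```python
from pathlib import Path

def extract_batch(folder, extract):
    """Run `extract` (text -> entity list) over every .txt file in `folder`."""
    results = {}
    for path in sorted(Path(folder).glob("*.txt")):
        entities = extract(path.read_text())
        results[path.name] = {"entities": entities}
    return results
```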

Export to CSV

extractor.to_csv(results, "entities.csv")

# Creates CSV with columns:
# filename, entity_text, entity_type, start, end
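
Flattening batch results into those columns is a nested loop over the csv module. A sketch (`batch_to_csv` is a hypothetical name; it writes to any file-like object rather than a path):

```python
import csv

def batch_to_csv(results, fileobj):
    """Write batch results as rows: filename, entity_text, entity_type, start, end."""
    writer = csv.writer(fileobj)
    writer.writerow(["filename", "entity_text", "entity_type", "start", "end"])
    for filename, data in results.items():
        for e in data["entities"]:
            writer.writerow([filename, e["text"], e["type"],
                             e.get("start", ""), e.get("end", "")])
```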

Text Highlighting

Generate HTML with highlighted entities:

html = extractor.highlight_text(text)

# Returns HTML with colored spans for each entity type
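
The key trick in offset-based highlighting is inserting tags from the last entity backwards, so earlier character offsets stay valid. A minimal sketch (this stand-in takes the entity list explicitly and skips HTML-escaping, which a real implementation should add):

```python
def highlight_entities(text, entities):
    """Wrap each entity span in a <mark> tag labeled with its type."""
    out = text
    # Process right-to-left so insertions don't shift earlier offsets.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        span = f'<mark class="{e["type"]}">{out[e["start"]:e["end"]]}</mark>'
        out = out[:e["start"]] + span + out[e["end"]:]
    return out
```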

Example Workflows

Document Analysis

extractor = EntityExtractor()

# Analyze a document
with open("article.txt") as f:
    text = f.read()
result = extractor.extract(text)

# Get key people mentioned
people = extractor.filter_entities(result, ["PERSON"])
print(f"People mentioned: {len(people)}")

# Get frequency
freq = extractor.entity_frequency(text)
top_entities = sorted(freq.items(), key=lambda x: x[1]["count"], reverse=True)[:10]

Contact Information Extraction

extractor = EntityExtractor(mode="regex")

text = """
Contact John Smith at john.smith@example.com
or call (555) 123-4567.
"""

entities = extractor.extract(text)
# Finds: EMAIL, PHONE entities

Content Tagging

extractor = EntityExtractor()

articles = ["article1.txt", "article2.txt", "article3.txt"]
tags = {}

for article in articles:
    entities = extractor.extract_file(article)
    tags[article] = extractor.get_unique_entities(entities)

Dependencies

  • spacy>=3.7.0
  • pandas>=2.0.0
  • en_core_web_sm (spaCy model)

Note: Run python -m spacy download en_core_web_sm to install the model.
