# NLP Pipeline Builder

## Overview
Specialized ML pipelines for natural language processing. Handles text preprocessing, tokenization, transformer models (BERT, RoBERTa, GPT), fine-tuning, and deployment for production NLP systems.
## NLP Tasks Supported

### 1. Text Classification

```python
from specweave import NLPPipeline

# Binary or multi-class text classification
pipeline = NLPPipeline(
    task="classification",
    classes=["positive", "negative", "neutral"],
    increment="0042"
)

# Automatically configures:
# - Text preprocessing (lowercase, clean)
# - Tokenization (BERT tokenizer)
# - Model selection (BERT, RoBERTa, DistilBERT)
# - Fine-tuning on your data
# - Inference pipeline
pipeline.fit(train_texts, train_labels)
```
### 2. Named Entity Recognition (NER)

```python
# Extract entities from text
pipeline = NLPPipeline(
    task="ner",
    entities=["PERSON", "ORG", "LOC", "DATE"],
    increment="0042"
)

# Returns: [(entity_text, entity_type, start_pos, end_pos), ...]
```
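The offset-based return format documented above can be consumed directly, since each tuple's character offsets index back into the source string. A minimal sketch (the text and entity tuples below are made-up illustrations, not actual pipeline output):

```python
# Hypothetical NER result in the documented
# (entity_text, entity_type, start_pos, end_pos) format.
text = "Ada Lovelace joined Acme Corp on 10 June 1843."
entities = [
    ("Ada Lovelace", "PERSON", 0, 12),
    ("Acme Corp", "ORG", 20, 29),
    ("10 June 1843", "DATE", 33, 45),
]

# Character offsets slice directly into the original text,
# so each span can be recovered or highlighted in place.
for entity_text, entity_type, start, end in entities:
    assert text[start:end] == entity_text

# Group extracted entities by type for downstream use.
by_type = {}
for entity_text, entity_type, start, end in entities:
    by_type.setdefault(entity_type, []).append(entity_text)

print(by_type["PERSON"])  # → ['Ada Lovelace']
```

Keeping offsets rather than only entity strings matters when the same surface form appears twice in a document with different meanings.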
### 3. Sentiment Analysis

```python
# Sentiment classification (specialized)
pipeline = NLPPipeline(
    task="sentiment",
    increment="0042"
)

# Fine-tuned for sentiment (positive/negative/neutral)
```
### 4. Text Generation

```python
# Generate text continuations
pipeline = NLPPipeline(
    task="generation",
    model="gpt2",
    increment="0042"
)

# Fine-tune on your domain-specific text
```
## Best Practices for NLP

### Text Preprocessing

```python
from specweave import TextPreprocessor

preprocessor = TextPreprocessor(increment="0042")

# Standard preprocessing
preprocessor.add_steps([
    "lowercase",
    "remove_html",
    "remove_urls",
    "remove_emails",
    "remove_special_chars",
    "remove_extra_whitespace"
])

# Advanced preprocessing
preprocessor.add_advanced([
    "spell_correction",
    "lemmatization",
    "stopword_removal"
])
```
### Model Selection

**Text Classification:**
- Small datasets (<10K examples): DistilBERT (~40% smaller and ~60% faster than BERT, retaining most of its accuracy)
- Medium datasets (10K–100K): BERT-base
- Large datasets (>100K): RoBERTa-large

**NER:**
- General: BERT + CRF layer
- Domain-specific: fine-tune BERT on a domain corpus

**Sentiment:**
- Product reviews: DistilBERT fine-tuned on Amazon reviews
- Social media: RoBERTa fine-tuned on Twitter
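The classification rules of thumb above can be encoded as a small helper. The checkpoint names follow the usual Hugging Face identifiers; the size thresholds are this guide's heuristics, not hard limits:

```python
# Encode the dataset-size heuristics above as a lookup helper.
def suggest_classifier(n_examples):
    if n_examples < 10_000:
        return "distilbert-base-uncased"  # small data: fast, cheap
    if n_examples <= 100_000:
        return "bert-base-uncased"        # medium data
    return "roberta-large"                # large data: highest capacity

print(suggest_classifier(5_000))    # → distilbert-base-uncased
print(suggest_classifier(250_000))  # → roberta-large
```

In practice the decision also depends on latency and GPU budget: roberta-large wins on accuracy but costs several times DistilBERT's inference time.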
### Transfer Learning

```python
# Start from pre-trained language models
pipeline = NLPPipeline(task="classification")

# Option 1: Use the pre-trained model as-is (no fine-tuning)
pipeline.use_pretrained("distilbert-base-uncased")

# Option 2: Fine-tune on your data
pipeline.use_pretrained_and_finetune(
    model="bert-base-uncased",
    epochs=3,
    learning_rate=2e-5
)
```
### Handling Long Text

```python
# For text longer than 512 tokens
pipeline = NLPPipeline(
    task="classification",
    max_length=512,
    truncation_strategy="head_and_tail"  # Keep start + end
)

# Or use Longformer for long documents
pipeline.use_model("longformer")  # Handles up to 4,096 tokens
```
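The idea behind head-and-tail truncation is easy to sketch on a plain token list: split the length budget between the start and end of the document and drop the middle, so both the opening and the conclusion survive. This is an illustration of the strategy, not SpecWeave's internals (real tokenizers also reserve slots for special tokens like `[CLS]` and `[SEP]`):

```python
# Keep the first `head` tokens and the last (max_length - head)
# tokens of an over-long sequence; drop the middle.
def head_and_tail(tokens, max_length, head=None):
    if len(tokens) <= max_length:
        return tokens  # nothing to truncate
    if head is None:
        head = max_length // 2  # split the budget evenly by default
    tail = max_length - head
    return tokens[:head] + tokens[-tail:]

tokens = [f"t{i}" for i in range(1000)]
truncated = head_and_tail(tokens, max_length=512)
print(len(truncated))  # → 512
print(truncated[0], truncated[-1])  # → t0 t999
```

Head-and-tail tends to beat plain head-only truncation on documents like reviews and reports, where the verdict often sits in the final sentences.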
## Integration with SpecWeave

```
# NLP increment structure
.specweave/increments/0042-sentiment-classifier/
├── spec.md
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── models/
│   ├── tokenizer/
│   ├── model-epoch-1/
│   ├── model-epoch-2/
│   └── model-epoch-3/
├── experiments/
│   ├── distilbert-baseline/
│   ├── bert-base-finetuned/
│   └── roberta-large/
└── deployment/
    ├── model.onnx
    └── inference.py
```
## Commands

```
/ml:nlp-pipeline --task classification --model bert-base
/ml:nlp-evaluate 0042   # Evaluate on test set
/ml:nlp-deploy 0042     # Export for production
```
Quick setup for NLP projects with state-of-the-art transformer models.