data-science-feature-engineering
Feature Engineering
Use this skill for creating, transforming, and selecting features that improve model performance.
When to use this skill
- After EDA — convert insights into features
- Model underperforming — need better representations
- Handling different data types (numerical, categorical, text, datetime)
- Reducing dimensionality or selecting most predictive features
Feature engineering workflow
Numerical features
- Scaling (StandardScaler, MinMaxScaler, RobustScaler)
- Transformations (log, sqrt, Box-Cox for skewness)
- Binning (equal-width, quantile, custom)
- Interaction features
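A minimal sketch of the transform-then-scale pattern above, using a hypothetical skewed column (`income` is made-up illustration data): `log1p` compresses the long tail before `StandardScaler` centers and scales it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed feature (e.g. income): log1p compresses the long tail
income = np.array([[20_000.0], [35_000.0], [50_000.0], [1_000_000.0]])
log_income = np.log1p(income)

# Scale after transforming, so the scaler sees the more symmetric distribution
scaled = StandardScaler().fit_transform(log_income)
```

Applying the log first matters: scaling alone would leave the outlier dominating the variance.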
Categorical features
- One-hot encoding (low cardinality)
- Target/Mean encoding (high cardinality)
- Ordinal encoding (ordered categories)
- Frequency/rare category handling
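Frequency encoding, one of the high-cardinality options above, can be sketched in a few lines of pandas (the city column here is hypothetical example data):

```python
import pandas as pd

# Hypothetical high-cardinality column: replace each category by its relative frequency
cities = pd.Series(["NYC", "NYC", "LA", "SF", "NYC", "LA"])
freq_map = cities.value_counts(normalize=True)
encoded = cities.map(freq_map)
```

In a real pipeline the frequency map must be computed on the training split only and then applied to test data, for the same leakage reasons covered under the implementation rules below.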
Datetime features
- Extract components (year, month, day, hour, dayofweek)
- Cyclical encoding (sin/cos for time cycles)
- Time since/duration features
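Cyclical encoding from the list above can be sketched as follows (timestamps are illustrative): mapping hour-of-day onto the unit circle makes 23:00 and 00:00 close, which a raw integer hour would not capture.

```python
import numpy as np
import pandas as pd

# Map hour-of-day onto the unit circle so 23:00 and 00:00 end up close together
ts = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 23:00"])
hour = ts.hour.to_numpy()
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
```

The same sin/cos pair works for any cycle (day-of-week with period 7, month with period 12).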
Text features
- TF-IDF, CountVectorizer
- Embeddings (sentence-transformers)
- Basic text stats (length, word count)
Feature selection
- Filter methods (correlation, mutual information)
- Wrapper methods (recursive feature elimination)
- Embedded methods (L1 regularization, tree importance)
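A filter-method sketch using mutual information with `SelectKBest` (synthetic data from `make_classification`, so the numbers here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic dataset: 10 features, only 3 of which carry signal
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Keep the 5 features with the highest mutual information with the target
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
X_sel = selector.transform(X)
```

`selector.get_support()` returns the boolean mask of kept columns, which is useful for tracing selected features back to their names.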
Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| sklearn pipelines | sklearn.pipeline + ColumnTransformer | Reproducible, cross-validation safe |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Auto feature engineering | Feature-engine | Comprehensive transformations |
Core implementation rules
1) Use pipelines to prevent leakage
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])
```
2) Fit on train only, transform on all
```python
# Correct: fit_transform on train, transform on test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # Only transform!
```
3) Handle unknown categories
```python
OneHotEncoder(handle_unknown='ignore')               # Unknown → all zeros
# OR
OneHotEncoder(handle_unknown='infrequent_if_exist')  # Group rare/unknown
```
4) Document feature importance
Track which features were created, why, and their expected impact.
Common anti-patterns
- ❌ Fitting preprocessors on full dataset (leakage!)
- ❌ One-hot encoding high-cardinality features (dimension explosion)
- ❌ Ignoring feature scaling for distance-based models
- ❌ Creating features without domain reasoning
- ❌ Not validating feature distributions match between train/test
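The last anti-pattern can be checked mechanically; one common approach (a sketch on simulated data, with an arbitrary 0.01 threshold) is a two-sample Kolmogorov-Smirnov test per feature:

```python
import numpy as np
from scipy.stats import ks_2samp

# Simulated drift: the test split of this hypothetical feature is shifted by 0.5
rng = np.random.default_rng(0)
train_feat = rng.normal(0.0, 1.0, 1000)
test_feat = rng.normal(0.5, 1.0, 1000)

# KS test: a small p-value means the two distributions differ
stat, p_value = ks_2samp(train_feat, test_feat)
drifted = p_value < 0.01
```

Running this per feature after a split is a cheap guard against accidental train/test distribution mismatch.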
Progressive disclosure
- ../references/categorical-encoding.md — Comprehensive encoding guide
- ../references/datetime-features.md — Time-based feature patterns
- ../references/text-features.md — NLP feature engineering
- ../references/feature-selection.md — Selection strategies and implementations
Related skills
- @data-science-eda — Understand data before engineering
- @data-science-model-evaluation — Validate feature impact
- @data-engineering-core — Data processing fundamentals