Feature Engineering

Use this skill for creating, transforming, and selecting features that improve model performance.

When to use this skill

  • After EDA — convert insights into features
  • Model underperforming — need better representations
  • Handling different data types (numerical, categorical, text, datetime)
  • Reducing dimensionality or selecting most predictive features

Feature engineering workflow

  1. Numerical features

    • Scaling (StandardScaler, MinMaxScaler, RobustScaler)
    • Transformations (log, sqrt, Box-Cox for skewness)
    • Binning (equal-width, quantile, custom)
    • Interaction features
  2. Categorical features

    • One-hot encoding (low cardinality)
    • Target/Mean encoding (high cardinality)
    • Ordinal encoding (ordered categories)
    • Frequency/rare category handling
  3. Datetime features

    • Extract components (year, month, day, hour, dayofweek)
    • Cyclical encoding (sin/cos for time cycles)
    • Time since/duration features
  4. Text features

    • TF-IDF, CountVectorizer
    • Embeddings (sentence-transformers)
    • Basic text stats (length, word count)
  5. Feature selection

    • Filter methods (correlation, mutual information)
    • Wrapper methods (recursive feature elimination)
    • Embedded methods (L1 regularization, tree importance)
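A minimal sketch of a few of the workflow steps above (log transform, datetime components with cyclical encoding, basic text stats). The column names `amount`, `timestamp`, and `comment` and the helper `add_basic_features` are hypothetical and only serve the illustration; adapt them to your own data.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few illustrative engineered columns."""
    out = df.copy()

    # Numerical: log1p compresses a right-skewed, non-negative column.
    out["amount_log"] = np.log1p(out["amount"])

    # Datetime: extract components, then place the hour on a circle
    # so that 23:00 and 00:00 end up close to each other.
    hour = out["timestamp"].dt.hour
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["dayofweek"] = out["timestamp"].dt.dayofweek

    # Text: basic statistics.
    out["text_len"] = out["comment"].str.len()
    out["word_count"] = out["comment"].str.split().str.len()
    return out


# Hypothetical toy data.
df = pd.DataFrame({
    "amount": [0.0, 9.0, 99.0],
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 23:00"]
    ),
    "comment": ["ok", "very good product", ""],
})
features = add_basic_features(df)
```

These transforms are stateless (they learn nothing from the data), so applying them before a train/test split does not leak information. Anything that learns statistics, such as scaling or target encoding, belongs inside a pipeline instead.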

Quick tool selection

| Task | Default choice | Notes |
|---|---|---|
| sklearn pipelines | sklearn.pipeline + ColumnTransformer | Reproducible, cross-validation safe |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Auto feature engineering | Feature-engine | Comprehensive transformations |
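As one illustration of the feature selection row, the sketch below places a mutual-information filter inside a pipeline so that the selection is refit on each cross-validation fold. The synthetic dataset, `k=5`, and the logistic regression model are assumptions chosen for the example, not recommendations.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only a few carry signal.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# Fix the random state of the mutual information estimate for repeatability.
score_func = partial(mutual_info_classif, random_state=0)

selector_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=score_func, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Selection happens inside each fold, so held-out rows never influence it.
scores = cross_val_score(selector_pipeline, X, y, cv=5)

# Fit once on all data to inspect which columns were kept.
selector_pipeline.fit(X, y)
selected_mask = selector_pipeline.named_steps["select"].get_support()
```

`selected_mask` is a boolean array over the input columns; index your column names with it to see which features survived.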

Core implementation rules

1) Use pipelines to prevent leakage

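from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# numerical_features and categorical_features are lists of column names
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Preprocessing and model are fit together, so each CV fold refits the preprocessor
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])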
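To show why a pipeline like the one in rule 1 is cross-validation safe, here is a self-contained sketch that passes the whole pipeline to `cross_val_score`. The toy DataFrame, its column names (`age`, `income`, `city`), and the target definition are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "city": rng.choice(["berlin", "paris", "rome"], size=n),
})
y = (X["income"] > 50_000).astype(int)

numerical_features = ["age", "income"]
categorical_features = ["city"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold fits the scaler and encoder on that fold's training rows only.
scores = cross_val_score(pipeline, X, y, cv=5)
```

Because the raw `X` goes into `cross_val_score`, no statistic from a validation fold can reach the preprocessor.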

2) Fit on train only, transform on all

# Correct: fit_transform on train, transform on test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # Only transform!
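A small sketch of what "fit on train only" means in practice, using random data as a stand-in: the fitted scaler stores the training statistics and reuses them for the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data for the illustration.
rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train mean/std

# The stored statistics come from the training split alone.
train_mean_matches = np.allclose(scaler.mean_, X_train.mean(axis=0))
```

The scaled test split will usually not have exactly zero mean. That is expected, because it is standardized with the training statistics rather than its own.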

3) Handle unknown categories

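OneHotEncoder(handle_unknown='ignore')  # Unknown → all zeros
# OR (scikit-learn >= 1.1): needs min_frequency or max_categories so an infrequent group exists
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=10)  # Group rare/unknown; 10 is illustrative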

4) Document feature importance

Track which features were created, why, and their expected impact.
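One lightweight way to keep such a record is a small feature log stored next to the code. The `FeatureRecord` class and its fields below are hypothetical, a sketch of the idea rather than an established convention.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FeatureRecord:
    """One entry in a hand-maintained feature log (hypothetical schema)."""
    name: str
    source_columns: List[str] = field(default_factory=list)
    rationale: str = ""
    expected_impact: str = ""


feature_log = [
    FeatureRecord(
        name="hour_sin",
        source_columns=["timestamp"],
        rationale="Activity is cyclical over the day; 23:00 is close to 00:00.",
        expected_impact="Helps linear and distance-based models capture time of day.",
    ),
    FeatureRecord(
        name="amount_log",
        source_columns=["amount"],
        rationale="Raw amounts are heavily right-skewed.",
        expected_impact="Reduces the influence of extreme values.",
    ),
]

# Serialize so the log can be versioned alongside the pipeline code.
feature_log_json = json.dumps([asdict(r) for r in feature_log], indent=2)
```

After training, the recorded expectations can be compared against measured importances to see which hypotheses held up.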

Common anti-patterns

  • ❌ Fitting preprocessors on full dataset (leakage!)
  • ❌ One-hot encoding high-cardinality features (dimension explosion)
  • ❌ Ignoring feature scaling for distance-based models
  • ❌ Creating features without domain reasoning
  • ❌ Not validating feature distributions match between train/test
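For the last anti-pattern, one possible check is a two-sample Kolmogorov-Smirnov test per numeric column. The sketch below uses `scipy.stats.ks_2samp`; the helper name `drift_report`, the threshold `alpha=0.01`, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def drift_report(train: pd.DataFrame, test: pd.DataFrame, alpha: float = 0.01) -> pd.DataFrame:
    """Compare each numeric column's train and test distributions with a KS test."""
    rows = []
    for col in train.select_dtypes(include="number").columns:
        stat, p_value = ks_2samp(train[col], test[col])
        rows.append({
            "feature": col,
            "ks_stat": float(stat),
            "p_value": float(p_value),
            # A small p-value suggests the two samples differ in distribution.
            "drifted": bool(p_value < alpha),
        })
    return pd.DataFrame(rows)


# Synthetic example: "shifted" has a different mean in the test split.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "stable": rng.normal(0, 1, 500),
    "shifted": rng.normal(0, 1, 500),
})
test = pd.DataFrame({
    "stable": rng.normal(0, 1, 500),
    "shifted": rng.normal(2, 1, 500),
})
report = drift_report(train, test)
```

With many features, some columns will be flagged by chance, so treat the flags as prompts for inspection rather than as verdicts.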

Progressive disclosure

  • ../references/categorical-encoding.md — Comprehensive encoding guide
  • ../references/datetime-features.md — Time-based feature patterns
  • ../references/text-features.md — NLP feature engineering
  • ../references/feature-selection.md — Selection strategies and implementations

Related skills

  • @data-science-eda — Understand data before engineering
  • @data-science-model-evaluation — Validate feature impact
  • @data-engineering-core — Data processing fundamentals
