Engineering ML Features

Use this skill for creating, transforming, and selecting features that improve model performance. Covers categorical encoding, numeric scaling, datetime engineering, text features, and building leakage-safe pipelines.

When to use this skill

  • Categorical variables need encoding for ML algorithms
  • Numeric features require scaling or transformation
  • Datetime columns need conversion to meaningful features
  • Text data needs to be converted to numerical representations
  • Preventing data leakage during feature engineering
  • Selecting the most predictive features from a large set
  • Building reusable, production-ready preprocessing pipelines

When NOT to use this skill

  • General data exploration → use analyzing-data
  • Model evaluation and selection → use @evaluating-ml-models
  • Building interactive data apps → use @building-data-apps
  • Notebook setup and workflows → use @working-in-notebooks

Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature scaling | sklearn.preprocessing | Standard, Robust, Power transforms |
| Pipeline composition | sklearn.pipeline + ColumnTransformer | Reproducible, CV-safe |
| Text vectorization | sklearn.feature_extraction.text | TF-IDF, CountVectorizer |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |

Feature engineering workflows

1. Categorical encoding

  • Low cardinality (< 10-15 categories): one-hot encoding
  • High cardinality (~15-100+ categories): target encoding or frequency encoding
  • Ordinal: ordinal encoding with explicit category order

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# One-hot for low cardinality
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Target encoding for high cardinality
te = TargetEncoder(smoothing=10)

# Ordinal for ordered categories
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
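Frequency encoding, mentioned above for high cardinality, has no transformer in this snippet. A minimal pandas sketch, using a hypothetical city column:

import pandas as pd

train = pd.DataFrame({'city': ['paris', 'paris', 'tokyo', 'oslo']})
test = pd.DataFrame({'city': ['tokyo', 'lima']})  # 'lima' never seen in train

# Learn category frequencies on train only (leakage-safe)
freq = train['city'].value_counts(normalize=True)

train['city_freq'] = train['city'].map(freq)
test['city_freq'] = test['city'].map(freq).fillna(0.0)  # unseen category -> 0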

2. Numeric scaling and transformation

| Method | Use When | Algorithm Impact |
| --- | --- | --- |
| StandardScaler | Features normally distributed, outliers rare | Required for SVM, neural nets, PCA |
| RobustScaler | Outliers present, want median/IQR centering | Same as Standard, more robust |
| MinMaxScaler | Need bounded range [0,1] or [-1,1] | Neural nets, image data |
| PowerTransformer | Skewed distributions, want normality | Improves linear model performance |
| QuantileTransformer | Heavy tails, want uniform/normal output | Tree models unaffected, linear improves |

from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Power transform for skewness
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_train)
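QuantileTransformer appears in the table but not the snippet. A minimal sketch, assuming X_train has at least 100 rows:

from sklearn.preprocessing import QuantileTransformer

# Rank-based mapping to an approximately normal distribution;
# n_quantiles must not exceed the number of training samples
qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)
X_quantile = qt.fit_transform(X_train)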

3. Datetime feature engineering

Extract components and encode cyclical patterns:

import numpy as np

# Component extraction
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['dayofweek'] = df['timestamp'].dt.dayofweek
df['hour'] = df['timestamp'].dt.hour

# Cyclical encoding (preserves circular nature)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Duration features
df['days_since_start'] = (df['timestamp'] - df['timestamp'].min()).dt.days
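Why sin/cos rather than the raw value: a quick check showing that hour 0 and hour 23 end up adjacent in the encoded space even though the raw values are far apart:

import numpy as np

hours = np.array([0, 23])
points = np.column_stack([
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
])

# Encoded distance is ~0.26, whereas the raw gap |0 - 23| = 23
print(np.linalg.norm(points[0] - points[1]))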

4. Text feature engineering

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# TF-IDF for classical NLP
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(texts)

# Embeddings for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts, show_progress_bar=True)

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
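To keep the vocabulary leakage-safe (see section 5 below), the vectorizer can live inside a pipeline. A sketch, assuming hypothetical train_texts and y_train:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vocabulary and IDF weights are learned from training data only
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
text_clf.fit(train_texts, y_train)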

5. Leakage-safe pipelines

Critical rule: Always fit on training data only, transform on all data.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define preprocessing for each column type
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])

# Correct: fit on train only
pipeline.fit(X_train, y_train)

# Transform train and test separately through the fitted pipeline
y_pred = pipeline.predict(X_test)  # No manual transform needed

CV-safe cross-validation:

from sklearn.model_selection import cross_val_score

# Pipeline ensures preprocessing happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)
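The same pattern extends to hyperparameter search: pipeline step names become parameter prefixes. A sketch against the pipeline defined above:

from sklearn.model_selection import GridSearchCV

# '<step>__<param>' addresses parameters inside the pipeline
param_grid = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)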

6. Feature selection

| Method | Description | Best For |
| --- | --- | --- |
| Filter (mutual_info) | Statistical measure vs. target | Quick screening, many features |
| Filter (correlation) | Linear correlation with target | Linear models, fast baseline |
| Wrapper (RFE) | Recursive feature elimination | Small-medium feature sets |
| Embedded (L1) | Lasso zeroes out features | Linear models with sparsity |
| Embedded (tree) | Feature importance from trees | Tree-based models |

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import Lasso

# Mutual information filter
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X_train, y_train)

# L1 regularization (embedded)
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
selected_features = X_train.columns[lasso.coef_ != 0]
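SelectFromModel, listed in the quick tool selection table, is the embedded tree-based route. A sketch, assuming X_train is a DataFrame:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the median importance
sfm = SelectFromModel(RandomForestClassifier(n_estimators=200), threshold='median')
X_sfm = sfm.fit_transform(X_train, y_train)
selected = X_train.columns[sfm.get_support()]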

Core implementation rules

1. Prevent data leakage

❌ Wrong: fitting encoders/scalers on the full dataset. ✅ Right: fit_transform() on train, transform() on test.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Train: learn parameters and transform
X_train_scaled = scaler.fit_transform(X_train)
# Test: ONLY transform, never refit
X_test_scaled = scaler.transform(X_test)

2. Handle unknown categories

# Unknown categories become all zeros
OneHotEncoder(handle_unknown='ignore')

# Unknown categories grouped with rare ones
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)
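A quick check of the 'ignore' behavior on a toy column (hypothetical color feature):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(pd.DataFrame({'color': ['red', 'blue']}))

# 'green' was never seen during fit -> all-zero row
print(ohe.transform(pd.DataFrame({'color': ['green']})))  # [[0. 0.]]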

3. Track feature names through pipelines

# Get feature names after ColumnTransformer
feature_names = preprocessor.get_feature_names_out()
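Those names can relabel the transformed output. A sketch, assuming the fitted preprocessor from section 5 emits dense arrays (e.g. OneHotEncoder(sparse_output=False)):

import pandas as pd

X_prep = preprocessor.transform(X_test)
X_prep_df = pd.DataFrame(
    X_prep,
    columns=preprocessor.get_feature_names_out(),
    index=X_test.index,
)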

4. Document feature importance

Track which features were created, why, and their expected impact on model performance.
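One lightweight way to do this is a registry kept next to the pipeline code. An illustrative sketch; the fields and entries are examples, not a standard:

# Hypothetical feature log: name -> provenance and rationale
FEATURE_LOG = {
    'month_sin': {'source': 'timestamp', 'why': 'cyclical seasonality',
                  'expected_impact': 'helps linear models'},
    'city_freq': {'source': 'city', 'why': 'high cardinality',
                  'expected_impact': 'avoids one-hot blow-up'},
}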

Common anti-patterns

| Anti-pattern | Solution |
| --- | --- |
| ❌ Fitting preprocessors on full dataset | Use train/test split before any fitting |
| ❌ One-hot encoding high-cardinality features (>100 categories) | Use target encoding or frequency encoding |
| ❌ Ignoring scaling for distance-based models | Always scale for SVM, k-NN, neural nets, PCA |
| ❌ Creating features without domain reasoning | Validate that features make business sense |
| ❌ Not validating that feature distributions match between train and test | Use distribution tests (sketch below) or visual comparison |
| ❌ Target encoding without smoothing | Use the smoothing parameter to handle rare categories |
| ❌ Forgetting cyclical encoding for time | Use sin/cos for hour, dayofweek, month |
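For the distribution check above, a minimal sketch using a two-sample Kolmogorov-Smirnov test, assuming the numerical_features list from section 5:

from scipy.stats import ks_2samp

# Small p-values flag features whose train/test distributions differ
for col in numerical_features:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.01:
        print(f'{col}: possible train/test drift (KS p={p:.4g})')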

Progressive disclosure

Reference guides for detailed implementations:

  • references/categorical-encoding.md — Comprehensive encoding strategies and selection guidance
  • references/datetime-features.md — Time-based feature patterns and cyclical encoding
  • references/text-features.md — NLP feature engineering with TF-IDF and embeddings
  • references/feature-selection.md — Selection strategies and implementation patterns

Related skills

  • analyzing-data — Understand data before engineering features
  • @evaluating-ml-models — Validate feature impact on model performance
  • @building-data-pipelines — Data processing fundamentals and pipeline patterns
