
Engineering ML Features

Use this skill for creating, transforming, and selecting features that improve model performance. Covers categorical encoding, numeric scaling, datetime engineering, text features, and building leakage-safe pipelines.

When to use this skill

  • Categorical variables need encoding for ML algorithms
  • Numeric features require scaling or transformation
  • Datetime columns need conversion to meaningful features
  • Text data needs to be converted to numerical representations
  • Preventing data leakage during feature engineering
  • Selecting the most predictive features from a large set
  • Building reusable, production-ready preprocessing pipelines

When NOT to use this skill

  • General data exploration → use @analyzing-data
  • Model evaluation and selection → use @evaluating-ml-models
  • Building interactive data apps → use @building-data-apps
  • Notebook setup and workflows → use @working-in-notebooks

Quick tool selection

| Task | Default choice | Notes |
|---|---|---|
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature scaling | sklearn.preprocessing | Standard, Robust, Power transforms |
| Pipeline composition | sklearn.pipeline + ColumnTransformer | Reproducible, CV-safe |
| Text vectorization | sklearn.feature_extraction.text | TF-IDF, CountVectorizer |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |

Feature engineering workflows

1. Categorical encoding

  • Low cardinality (< 10-15 categories): one-hot encoding
  • High cardinality (roughly 15-100+ categories): target encoding or frequency encoding
  • Ordinal: ordinal encoding with explicit category order

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# One-hot for low cardinality
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Target encoding for high cardinality
te = TargetEncoder(smoothing=10)

# Ordinal for ordered categories
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
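Frequency encoding, mentioned above for high-cardinality columns, has no dedicated sklearn transformer; a minimal pandas sketch (column names hypothetical) maps each category to its relative frequency in the training data, with unseen test categories falling back to zero:

```python
import pandas as pd

# Toy training data; 'city' stands in for a high-cardinality column
train = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'SF', 'NY', 'LA']})
test = pd.DataFrame({'city': ['LA', 'Boston']})  # 'Boston' unseen in train

# Frequency encoding: map each category to its relative frequency in TRAIN only
freq = train['city'].value_counts(normalize=True)
train['city_freq'] = train['city'].map(freq)
test['city_freq'] = test['city'].map(freq).fillna(0)  # unseen categories -> 0
```

Computing the mapping from the training split only keeps the encoding leakage-safe (see section 5).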

2. Numeric scaling and transformation

| Method | Use when | Algorithm impact |
|---|---|---|
| StandardScaler | Features normally distributed, outliers rare | Required for SVM, neural nets, PCA |
| RobustScaler | Outliers present, want median/IQR centering | Same as StandardScaler, more robust |
| MinMaxScaler | Need bounded range [0, 1] or [-1, 1] | Neural nets, image data |
| PowerTransformer | Skewed distributions, want normality | Improves linear model performance |
| QuantileTransformer | Heavy tails, want uniform/normal output | Tree models unaffected, linear models improve |

from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Power transform for skewness
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_train)

3. Datetime feature engineering

Extract components and encode cyclical patterns:

import numpy as np

# Component extraction
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['dayofweek'] = df['timestamp'].dt.dayofweek
df['hour'] = df['timestamp'].dt.hour

# Cyclical encoding (preserves circular nature)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Duration features
df['days_since_start'] = (df['timestamp'] - df['timestamp'].min()).dt.days

4. Text feature engineering

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# TF-IDF for classical NLP
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(texts)

# Embeddings for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts, show_progress_bar=True)

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

5. Leakage-safe pipelines

Critical rule: Always fit on training data only, transform on all data.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define preprocessing for each column type
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])

# Correct: fit on train only
pipeline.fit(X_train, y_train)

# Transform train and test separately through the fitted pipeline
y_pred = pipeline.predict(X_test)  # No manual transform needed

CV-safe cross-validation:

from sklearn.model_selection import cross_val_score

# Pipeline ensures preprocessing happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)

6. Feature selection

| Method | Description | Best for |
|---|---|---|
| Filter (mutual_info) | Statistical dependence with target | Quick screening, many features |
| Filter (correlation) | Linear correlation with target | Linear models, fast baseline |
| Wrapper (RFE) | Recursive feature elimination | Small-medium feature sets |
| Embedded (L1) | Lasso zeroes out feature coefficients | Linear models with sparsity |
| Embedded (tree) | Feature importance from trees | Tree-based models |

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import Lasso

# Mutual information filter
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X_train, y_train)

# L1 regularization (embedded)
# Note: scale features first; the L1 penalty is scale-sensitive
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
selected_features = X_train.columns[lasso.coef_ != 0]

Core implementation rules

1. Prevent data leakage

❌ Wrong: fitting encoders/scalers on the full dataset
✅ Right: fit_transform() on train, transform() on test

# Train: fit and transform
X_train_scaled = scaler.fit_transform(X_train)
# Test: transform ONLY, never fit
X_test_scaled = scaler.transform(X_test)

2. Handle unknown categories

# Unknown categories become all zeros
OneHotEncoder(handle_unknown='ignore')

# Unknown categories grouped with rare ones
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)

3. Track feature names through pipelines

# Get feature names after ColumnTransformer
feature_names = preprocessor.get_feature_names_out()

4. Document feature importance

Track which features were created, why, and their expected impact on model performance.

Common anti-patterns

| Anti-pattern | Solution |
|---|---|
| ❌ Fitting preprocessors on the full dataset | Split into train/test before any fitting |
| ❌ One-hot encoding high-cardinality features (>100 categories) | Use target encoding or frequency encoding |
| ❌ Ignoring scaling for distance-based models | Always scale for SVM, k-NN, neural nets, PCA |
| ❌ Creating features without domain reasoning | Validate that features make business sense |
| ❌ Not validating that feature distributions match between train and test | Use distribution tests or visual comparison |
| ❌ Target encoding without smoothing | Use the smoothing parameter to handle rare categories |
| ❌ Forgetting cyclical encoding for time | Use sin/cos for hour, dayofweek, month |
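For the train/test distribution check above, one option is a two-sample Kolmogorov-Smirnov test per numeric feature. A minimal sketch with synthetic data (threshold and column choice are illustrative, not prescriptive):

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for one numeric feature in train and test
rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, scale=1.0, size=1000)
test_col = rng.normal(loc=0.0, scale=1.0, size=500)

# Two-sample KS test: a small p-value suggests the distributions differ
stat, p_value = ks_2samp(train_col, test_col)
if p_value < 0.05:
    print("Warning: train/test distributions may differ for this feature")
```

With many features, apply a multiple-testing correction or fall back to visual comparison (overlaid histograms) before acting on a single small p-value.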

Progressive disclosure

Reference guides for detailed implementations:

  • references/categorical-encoding.md — Comprehensive encoding strategies and selection guidance
  • references/datetime-features.md — Time-based feature patterns and cyclical encoding
  • references/text-features.md — NLP feature engineering with TF-IDF and embeddings
  • references/feature-selection.md — Selection strategies and implementation patterns

Related skills

  • @analyzing-data — Understand data before engineering features
  • @evaluating-ml-models — Validate feature impact on model performance
  • @building-data-pipelines — Data processing fundamentals and pipeline patterns
