Engineering ML Features

Use this skill for creating, transforming, and selecting features that improve model performance. Covers categorical encoding, numeric scaling, datetime engineering, text features, and building leakage-safe pipelines.

When to use this skill

  • Categorical variables need encoding for ML algorithms
  • Numeric features require scaling or transformation
  • Datetime columns need conversion to meaningful features
  • Text data needs to be converted to numerical representations
  • Preventing data leakage during feature engineering
  • Selecting the most predictive features from a large set
  • Building reusable, production-ready preprocessing pipelines

When NOT to use this skill

  • General data exploration → use analyzing-data
  • Model evaluation and selection → use @evaluating-ml-models
  • Building interactive data apps → use @building-data-apps
  • Notebook setup and workflows → use @working-in-notebooks

Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature scaling | sklearn.preprocessing | Standard, Robust, Power transforms |
| Pipeline composition | sklearn.pipeline + ColumnTransformer | Reproducible, CV-safe |
| Text vectorization | sklearn.feature_extraction.text | TF-IDF, CountVectorizer |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |

Feature engineering workflows

1. Categorical encoding

  • Low cardinality (< 10-15 categories): one-hot encoding
  • High cardinality (~15-100+ categories): target encoding or frequency encoding
  • Ordinal: ordinal encoding with explicit category order

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# One-hot for low cardinality
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Target encoding for high cardinality
te = TargetEncoder(smoothing=10)

# Ordinal for ordered categories
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
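Frequency encoding, mentioned above for high cardinality, has no transformer in this snippet. A minimal pandas sketch, using a hypothetical city column:

import pandas as pd

train = pd.DataFrame({'city': ['paris', 'paris', 'tokyo', 'oslo']})
test = pd.DataFrame({'city': ['tokyo', 'lima']})  # 'lima' never seen in train

# Learn category frequencies on train only (leakage-safe)
freq = train['city'].value_counts(normalize=True)

train['city_freq'] = train['city'].map(freq)
test['city_freq'] = test['city'].map(freq).fillna(0.0)  # unseen category -> 0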

2. Numeric scaling and transformation

| Method | Use When | Algorithm Impact |
| --- | --- | --- |
| StandardScaler | Features normally distributed, outliers rare | Required for SVM, neural nets, PCA |
| RobustScaler | Outliers present, want median/IQR centering | Same as Standard, more robust |
| MinMaxScaler | Need bounded range [0,1] or [-1,1] | Neural nets, image data |
| PowerTransformer | Skewed distributions, want normality | Improves linear model performance |
| QuantileTransformer | Heavy tails, want uniform/normal output | Tree models unaffected, linear improves |

from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Power transform for skewness
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_train)
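QuantileTransformer appears in the table but not the snippet. A minimal sketch, assuming X_train has at least 100 rows:

from sklearn.preprocessing import QuantileTransformer

# Rank-based mapping to an approximately normal distribution;
# n_quantiles must not exceed the number of training samples
qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)
X_quantile = qt.fit_transform(X_train)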

3. Datetime feature engineering

Extract components and encode cyclical patterns:

import numpy as np

# Component extraction
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['dayofweek'] = df['timestamp'].dt.dayofweek
df['hour'] = df['timestamp'].dt.hour

# Cyclical encoding (preserves circular nature)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Duration features
df['days_since_start'] = (df['timestamp'] - df['timestamp'].min()).dt.days
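Why sin/cos rather than the raw value: a quick check showing that hour 0 and hour 23 end up adjacent in the encoded space even though the raw values are far apart:

import numpy as np

hours = np.array([0, 23])
points = np.column_stack([
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
])

# Encoded distance is ~0.26, whereas the raw gap |0 - 23| = 23
print(np.linalg.norm(points[0] - points[1]))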

4. Text feature engineering

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# TF-IDF for classical NLP
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(texts)

# Embeddings for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts, show_progress_bar=True)

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
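To keep the vocabulary leakage-safe (see section 5 below), the vectorizer can live inside a pipeline. A sketch, assuming hypothetical train_texts and y_train:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vocabulary and IDF weights are learned from training data only
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
text_clf.fit(train_texts, y_train)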

5. Leakage-safe pipelines

Critical rule: Always fit on training data only, transform on all data.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define preprocessing for each column type
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])

# Correct: fit on train only
pipeline.fit(X_train, y_train)

# Transform train and test separately through the fitted pipeline
y_pred = pipeline.predict(X_test)  # No manual transform needed

CV-safe cross-validation:

from sklearn.model_selection import cross_val_score

# Pipeline ensures preprocessing happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)
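The same pattern extends to hyperparameter search: pipeline step names become parameter prefixes. A sketch against the pipeline defined above:

from sklearn.model_selection import GridSearchCV

# '<step>__<param>' addresses parameters inside the pipeline
param_grid = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)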

6. Feature selection

| Method | Description | Best For |
| --- | --- | --- |
| Filter (mutual_info) | Statistical measure vs. target | Quick screening, many features |
| Filter (correlation) | Linear correlation with target | Linear models, fast baseline |
| Wrapper (RFE) | Recursive feature elimination | Small-medium feature sets |
| Embedded (L1) | Lasso zeroes out features | Linear models with sparsity |
| Embedded (tree) | Feature importance from trees | Tree-based models |

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import Lasso

# Mutual information filter
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X_train, y_train)

# L1 regularization (embedded)
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
selected_features = X_train.columns[lasso.coef_ != 0]
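SelectFromModel, listed in the quick tool selection table, is the embedded tree-based route. A sketch, assuming X_train is a DataFrame:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the median importance
sfm = SelectFromModel(RandomForestClassifier(n_estimators=200), threshold='median')
X_sfm = sfm.fit_transform(X_train, y_train)
selected = X_train.columns[sfm.get_support()]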

Core implementation rules

1. Prevent data leakage

❌ Wrong: fitting encoders/scalers on the full dataset. ✅ Right: fit_transform() on train, transform() on test.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Train: learn parameters and transform
X_train_scaled = scaler.fit_transform(X_train)
# Test: ONLY transform, never refit
X_test_scaled = scaler.transform(X_test)

2. Handle unknown categories

# Unknown categories become all zeros
OneHotEncoder(handle_unknown='ignore')

# Unknown categories grouped with rare ones
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)
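A quick check of the 'ignore' behavior on a toy column (hypothetical color feature):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(pd.DataFrame({'color': ['red', 'blue']}))

# 'green' was never seen during fit -> all-zero row
print(ohe.transform(pd.DataFrame({'color': ['green']})))  # [[0. 0.]]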

3. Track feature names through pipelines

# Get feature names after ColumnTransformer
feature_names = preprocessor.get_feature_names_out()
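Those names can relabel the transformed output. A sketch, assuming the fitted preprocessor from section 5 emits dense arrays (e.g. OneHotEncoder(sparse_output=False)):

import pandas as pd

X_prep = preprocessor.transform(X_test)
X_prep_df = pd.DataFrame(
    X_prep,
    columns=preprocessor.get_feature_names_out(),
    index=X_test.index,
)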

4. Document feature importance

Track which features were created, why, and their expected impact on model performance.
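One lightweight way to do this is a registry kept next to the pipeline code. An illustrative sketch; the fields and entries are examples, not a standard:

# Hypothetical feature log: name -> provenance and rationale
FEATURE_LOG = {
    'month_sin': {'source': 'timestamp', 'why': 'cyclical seasonality',
                  'expected_impact': 'helps linear models'},
    'city_freq': {'source': 'city', 'why': 'high cardinality',
                  'expected_impact': 'avoids one-hot blow-up'},
}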

Common anti-patterns

| Anti-pattern | Solution |
| --- | --- |
| ❌ Fitting preprocessors on full dataset | Use train/test split before any fitting |
| ❌ One-hot encoding high-cardinality features (>100 categories) | Use target encoding or frequency encoding |
| ❌ Ignoring scaling for distance-based models | Always scale for SVM, k-NN, neural nets, PCA |
| ❌ Creating features without domain reasoning | Validate that features make business sense |
| ❌ Not validating that feature distributions match between train and test | Use distribution tests (sketch below) or visual comparison |
| ❌ Target encoding without smoothing | Use the smoothing parameter to handle rare categories |
| ❌ Forgetting cyclical encoding for time | Use sin/cos for hour, dayofweek, month |
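For the distribution check above, a minimal sketch using a two-sample Kolmogorov-Smirnov test, assuming the numerical_features list from section 5:

from scipy.stats import ks_2samp

# Small p-values flag features whose train/test distributions differ
for col in numerical_features:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.01:
        print(f'{col}: possible train/test drift (KS p={p:.4g})')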

Progressive disclosure

Reference guides for detailed implementations:

  • references/categorical-encoding.md — Comprehensive encoding strategies and selection guidance
  • references/datetime-features.md — Time-based feature patterns and cyclical encoding
  • references/text-features.md — NLP feature engineering with TF-IDF and embeddings
  • references/feature-selection.md — Selection strategies and implementation patterns

Related skills

  • analyzing-data — Understand data before engineering features
  • @evaluating-ml-models — Validate feature impact on model performance
  • @building-data-pipelines — Data processing fundamentals and pipeline patterns
