---
name: engineering-ml-features
description: Use this skill for creating, transforming, and selecting features that improve model performance. Covers categorical encoding, numeric scaling, datetime engineering, text features, and building leakage-safe pipelines.
---

# Engineering ML Features
## When to use this skill
- Categorical variables need encoding for ML algorithms
- Numeric features require scaling or transformation
- Datetime columns need conversion to meaningful features
- Text data needs to be converted to numerical representations
- Preventing data leakage during feature engineering
- Selecting the most predictive features from a large set
- Building reusable, production-ready preprocessing pipelines
## When NOT to use this skill

- General data exploration → use @analyzing-data
- Model evaluation and selection → use @evaluating-ml-models
- Building interactive data apps → use @building-data-apps
- Notebook setup and workflows → use @working-in-notebooks
## Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature scaling | sklearn.preprocessing | Standard, Robust, Power transforms |
| Pipeline composition | sklearn.pipeline + ColumnTransformer | Reproducible, CV-safe |
| Text vectorization | sklearn.feature_extraction.text | TF-IDF, CountVectorizer |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |
## Feature engineering workflows

### 1. Categorical encoding
- Low cardinality (< 10-15 categories): one-hot encoding
- High cardinality (15-100+ categories): target encoding or frequency encoding
- Ordinal: ordinal encoding with explicit category order
```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# One-hot for low cardinality
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Target encoding for high cardinality
te = TargetEncoder(smoothing=10)

# Ordinal for ordered categories
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
```
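A minimal sketch of how explicit category order plays out, using a hypothetical `priority` column (the data and column name are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data with an inherently ordered category
df = pd.DataFrame({"priority": ["low", "high", "medium", "low"]})

# Passing the category order explicitly maps low=0, medium=1, high=2
ord_enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = ord_enc.fit_transform(df[["priority"]])
print(encoded.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Without the explicit `categories` list, sklearn would assign codes alphabetically (high=0, low=1, medium=2), which scrambles the ordering a downstream model sees.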
### 2. Numeric scaling and transformation
| Method | Use When | Algorithm Impact |
|---|---|---|
| StandardScaler | Features normally distributed, outliers rare | Required for SVM, neural nets, PCA |
| RobustScaler | Outliers present, want median/IQR centering | Same as Standard, more robust |
| MinMaxScaler | Need bounded range [0,1] or [-1,1] | Neural nets, image data |
| PowerTransformer | Skewed distributions, want normality | Improves linear model performance |
| QuantileTransformer | Heavy tails, want uniform/normal | Tree models unaffected, linear improves |
```python
from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Power transform for skewness
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_train)
```
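To see why the table recommends RobustScaler when outliers are present, here is a small sketch on made-up data with one extreme value:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical 1-D feature with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# The outlier drags the mean and inflates the std, so standard scaling
# squashes the four inliers into a narrow band near -0.5; robust scaling
# centers on the median (3) and divides by the IQR (2), leaving the
# inliers well spread out
print(std[:4].ravel())  # all close to -0.5
print(rob[:4].ravel())  # [-1.0, -0.5, 0.0, 0.5]
```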
### 3. Datetime feature engineering
Extract components and encode cyclical patterns:
```python
import numpy as np

# Assumes df has a datetime64 'timestamp' column

# Component extraction
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['dayofweek'] = df['timestamp'].dt.dayofweek
df['hour'] = df['timestamp'].dt.hour

# Cyclical encoding (preserves circular nature)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Duration features
df['days_since_start'] = (df['timestamp'] - df['timestamp'].min()).dt.days
```
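A quick sketch of what the cyclical encoding buys you: hours 23 and 0 are one hour apart in time but 23 apart numerically, and the sin/cos pair makes them neighbors again (the helper function here is illustrative):

```python
import numpy as np

def hour_to_cyclic(hour):
    # Map an hour [0, 24) onto a point on the unit circle
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

h23, h0, h12 = hour_to_cyclic(23), hour_to_cyclic(0), hour_to_cyclic(12)

# Adjacent hours end up close; opposite hours end up far apart
print(np.linalg.norm(h23 - h0))  # small (~0.26)
print(np.linalg.norm(h12 - h0))  # maximal (2.0)
```

A model using the raw `hour` column would treat 23 → 0 as a jump of 23; in sin/cos space the distance matches the actual one-hour gap.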
### 4. Text feature engineering
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# TF-IDF for classical NLP
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(texts)

# Embeddings for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts, show_progress_bar=True)

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
```
### 5. Leakage-safe pipelines
Critical rule: Always fit on training data only, transform on all data.
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define preprocessing for each column type
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])

# Correct: fit on train only
pipeline.fit(X_train, y_train)

# The fitted pipeline transforms test data internally
y_pred = pipeline.predict(X_test)  # No manual transform needed
```
CV-safe cross-validation:
```python
from sklearn.model_selection import cross_val_score

# Pipeline ensures preprocessing happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)
```
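A runnable end-to-end sketch of the pattern above, using synthetic data and a logistic regression in place of a real dataset and model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Because scaling lives inside the pipeline, each CV fold refits the
# scaler on its own training split only -- the validation fold never
# leaks into the scaler's statistics
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Contrast this with calling `StandardScaler().fit_transform(X)` before `cross_val_score`: that fits the scaler on all rows, so every fold's validation data has already influenced the preprocessing.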
### 6. Feature selection
| Method | Description | Best For |
|---|---|---|
| Filter (mutual_info) | Statistical measure vs target | Quick screening, many features |
| Filter (correlation) | Linear correlation with target | Linear models, fast baseline |
| Wrapper (RFE) | Recursive feature elimination | Small-medium feature sets |
| Embedded (L1) | Lasso zeroes out features | Linear models with sparsity |
| Embedded (tree) | Feature importance from trees | Tree-based models |
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import Lasso

# Mutual information filter
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X_train, y_train)

# L1 regularization (embedded)
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
selected_features = X_train.columns[lasso.coef_ != 0]
```
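Selectors return a plain array, so recovering *which* columns survived takes one extra step. A minimal sketch with synthetic data and illustrative column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic frame with named columns (names are illustrative)
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
X_train = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

selector = SelectKBest(mutual_info_classif, k=3)
selector.fit(X_train, y)

# get_support() yields a boolean mask aligned with the input columns,
# which maps the selection back to human-readable names
selected = X_train.columns[selector.get_support()].tolist()
print(selected)  # 3 of the 6 column names
```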
## Core implementation rules

### 1. Prevent data leakage
❌ Wrong: fitting encoders/scalers on the full dataset
✅ Right: `fit_transform()` on train, `transform()` on test

```python
# Train: fit and transform
X_train_scaled = scaler.fit_transform(X_train)

# Test: ONLY transform!
X_test_scaled = scaler.transform(X_test)
```
### 2. Handle unknown categories
```python
# Unknown categories become all zeros
OneHotEncoder(handle_unknown='ignore')

# Unknown categories grouped with rare ones
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)
```
### 3. Track feature names through pipelines
```python
# Get feature names after ColumnTransformer
feature_names = preprocessor.get_feature_names_out()
```
### 4. Document feature importance
Track which features were created, why, and their expected impact on model performance.
## Common anti-patterns
| Anti-pattern | Solution |
|---|---|
| ❌ Fitting preprocessors on full dataset | Use train/test split before any fitting |
| ❌ One-hot encoding high-cardinality features (>100 categories) | Use target encoding or frequency encoding |
| ❌ Ignoring scaling for distance-based models | Always scale for SVM, k-NN, neural nets, PCA |
| ❌ Creating features without domain reasoning | Validate features make business sense |
| ❌ Not validating feature distributions match between train/test | Use distribution tests or visual comparison |
| ❌ Target encoding without smoothing | Use smoothing parameter to handle rare categories |
| ❌ Forgetting cyclical encoding for time | Use sin/cos for hour, dayofweek, month |
## Progressive disclosure

Reference guides for detailed implementations:

- references/categorical-encoding.md — Comprehensive encoding strategies and selection guidance
- references/datetime-features.md — Time-based feature patterns and cyclical encoding
- references/text-features.md — NLP feature engineering with TF-IDF and embeddings
- references/feature-selection.md — Selection strategies and implementation patterns
## Related skills

- @analyzing-data — Understand data before engineering features
- @evaluating-ml-models — Validate feature impact on model performance
- @building-data-pipelines — Data processing fundamentals and pipeline patterns