data-science-feature-engineering
Feature Engineering
Use this skill for creating, transforming, and selecting features that improve model performance.
When to use this skill
- After EDA — convert insights into features
- Model underperforming — need better representations
- Handling different data types (numerical, categorical, text, datetime)
- Reducing dimensionality or selecting most predictive features
Feature engineering workflow
Numerical features
- Scaling (StandardScaler, MinMaxScaler, RobustScaler)
- Transformations (log, sqrt, Box-Cox for skewness)
- Binning (equal-width, quantile, custom)
- Interaction features
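The numerical techniques listed above can be sketched on a small table. This is a minimal illustration, assuming a hypothetical DataFrame with `income` and `age` columns; the data and column names are invented for the example. It fits on the whole toy table for brevity, whereas real code would fit on training data only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data: a right-skewed income column and an age column.
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 80_000, 1_000_000],
    "age": [22, 35, 41, 58, 63],
})

# Transformation: log1p reduces right skew and is defined at zero.
df["income_log"] = np.log1p(df["income"])

# Scaling: RobustScaler centers on the median and scales by the IQR,
# so the single extreme income has limited influence.
df["income_scaled"] = RobustScaler().fit_transform(df[["income"]]).ravel()

# Binning: quantile bins hold roughly equal numbers of rows.
df["age_bin"] = pd.qcut(df["age"], q=3, labels=False)

# Interaction: a simple ratio of two numerical columns.
df["income_per_age"] = df["income"] / df["age"]
```

Inside a pipeline, `sklearn.preprocessing.KBinsDiscretizer` is the fit/transform counterpart of `pd.qcut`, which keeps bin edges learned from training data only.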
Categorical features
- One-hot encoding (low cardinality)
- Target/Mean encoding (high cardinality)
- Ordinal encoding (ordered categories)
- Frequency/rare category handling
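A rough sketch of three of the categorical options above, using only pandas and scikit-learn. The data, the `min_count` threshold, and the `smoothing` value are assumptions made for illustration. The target encoding here is hand-rolled rather than taken from `category_encoders`, and in real use it should be fit on training folds only, since it reads the target.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with an ordered and an unordered category.
train = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "city": ["Paris", "Paris", "Oslo", "Paris", "Lima", "Oslo"],
    "y": [1, 0, 1, 1, 0, 0],
})

# Ordinal encoding with an explicit order: S < M < L.
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]])
train["size_ord"] = ordinal.fit_transform(train[["size"]]).ravel()

# Rare-category handling: levels seen fewer than min_count times become "other".
min_count = 2  # illustrative threshold
counts = train["city"].value_counts()
rare = counts[counts < min_count].index
train["city_grouped"] = train["city"].where(~train["city"].isin(rare), "other")

# Smoothed target (mean) encoding: shrink each category mean toward the
# global mean so that small categories are not over-trusted.
prior = train["y"].mean()
smoothing = 2.0  # illustrative value
stats = train.groupby("city_grouped")["y"].agg(["sum", "count"])
encoding = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
train["city_te"] = train["city_grouped"].map(encoding)
```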
Datetime features
- Extract components (year, month, day, hour, dayofweek)
- Cyclical encoding (sin/cos for time cycles)
- Time since/duration features
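The datetime items above might look like this in pandas. The timestamps and the reference date are invented for the example; the cyclical encoding maps hour of day onto a circle so that 23:00 and 00:00 end up close together.

```python
import numpy as np
import pandas as pd

# Hypothetical event timestamps.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 00:00",
        "2024-01-01 06:00",
        "2024-03-15 18:30",
    ]),
})

# Extract calendar components.
df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek  # Monday = 0

# Cyclical encoding of the hour with a 24-hour period.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Time since a reference event, in days (the reference date is an assumption).
reference = pd.Timestamp("2024-01-01")
df["days_since_ref"] = (df["ts"] - reference).dt.total_seconds() / 86_400
```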
Text features
- TF-IDF, CountVectorizer
- Embeddings (sentence-transformers)
- Basic text stats (length, word count)
Feature selection
- Filter methods (correlation, mutual information)
- Wrapper methods (recursive feature elimination)
- Embedded methods (L1 regularization, tree importance)
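One possible way to try the three selection families above side by side, on a synthetic dataset from `make_classification`. The dataset, the choice of `k=3`, and the estimators are assumptions for illustration, and the embedded example uses tree importance rather than L1 regularization.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    RFE,
    SelectFromModel,
    SelectKBest,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter method: rank features by mutual information with the target.
mi_score = partial(mutual_info_classif, random_state=0)
filter_selector = SelectKBest(score_func=mi_score, k=3).fit(X, y)

# Wrapper method: recursive feature elimination around a linear model.
rfe_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

# Embedded method: keep features whose tree importance is at least the median.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
embedded_selector = SelectFromModel(forest, threshold="median").fit(X, y)

X_filtered = filter_selector.transform(X)
X_rfe = rfe_selector.transform(X)
X_embedded = embedded_selector.transform(X)
```

Each selector is an ordinary transformer, so it can be placed as a pipeline step and refit inside each cross-validation fold.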
Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Preprocessing pipelines | sklearn.pipeline.Pipeline + sklearn.compose.ColumnTransformer | Reproducible, cross-validation safe |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Ready-made feature transformers | Feature-engine | scikit-learn-compatible transformers for imputation, encoding, discretization, and selection |
Core implementation rules
1) Use pipelines to prevent leakage
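from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# numerical_features and categorical_features are lists of column names in X
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
# Cross-validating the whole pipeline refits the preprocessing inside each fold
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier()),
])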
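A self-contained sketch of how such a pipeline could be cross-validated. The synthetic table, its column names (`age`, `income`, `plan`), and the target definition are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numerical columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "plan": rng.choice(["free", "pro", "team"], size=n),
})
y = (X["income"] > 50_000).astype(int)  # toy target driven by income

numerical_features = ["age", "income"]
categorical_features = ["plan"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Each fold fits the scaler and encoder on that fold's training part only.
scores = cross_val_score(pipeline, X, y, cv=5)
```

Passing the pipeline, not pre-transformed arrays, to `cross_val_score` is what keeps validation rows out of the fitted preprocessing statistics.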
2) Fit on train only, transform on all
# Correct: fit_transform on train, transform on test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test) # Only transform!
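To make the rule concrete, the following sketch checks that a scaler fitted this way carries statistics from the training split only. The random data and the split size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# The learned mean is the training mean, not the mean of the full dataset.
learned_mean = scaler.mean_[0]
train_mean = X_train.mean()
```

The scaled test split will generally not have exactly zero mean, which is expected: it is being measured against the training distribution.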
3) Handle unknown categories
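from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(handle_unknown='ignore')  # Unknown categories encode as all zeros
# OR (scikit-learn >= 1.1): send unknowns to the infrequent bucket. This needs
# min_frequency or max_categories; without one it behaves like 'ignore'.
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)  # 0.01 is illustrative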
4) Document engineered features
Track which features were created, why, and their expected impact.
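One lightweight way to keep such a record is a small in-code registry. The `FeatureRecord` class and its fields are hypothetical names chosen for this sketch, not part of any library, and the two entries are invented examples.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureRecord:
    """Hypothetical record describing one engineered feature."""
    name: str
    source_columns: tuple
    rationale: str
    expected_impact: str


# Illustrative entries; real ones would come from your own EDA findings.
feature_log = [
    FeatureRecord(
        name="income_log",
        source_columns=("income",),
        rationale="Income is right-skewed; log1p compresses the tail.",
        expected_impact="More stable coefficients in linear models.",
    ),
    FeatureRecord(
        name="hour_sin",
        source_columns=("ts",),
        rationale="Hour of day is cyclical; 23:00 is close to 00:00.",
        expected_impact="Captures daily seasonality.",
    ),
]

# Plain dicts are easy to dump to JSON, CSV, or an experiment tracker.
rows = [asdict(record) for record in feature_log]
documented_names = {record.name for record in feature_log}
```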
Common anti-patterns
- ❌ Fitting preprocessors on full dataset (leakage!)
- ❌ One-hot encoding high-cardinality features (dimension explosion)
- ❌ Ignoring feature scaling for distance-based models
- ❌ Creating features without domain reasoning
- ❌ Not validating feature distributions match between train/test
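For the last item, one possible check is a two-sample Kolmogorov-Smirnov test per numerical feature. This is a sketch on synthetic data; the 0.01 threshold is an arbitrary assumption, and other drift measures may suit your data better.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical feature values: one matching test split and one shifted split.
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=1000)
test_same = rng.normal(0.0, 1.0, size=1000)
test_shifted = rng.normal(1.0, 1.0, size=1000)

same_result = ks_2samp(train_feature, test_same)
shifted_result = ks_2samp(train_feature, test_shifted)

# A small p-value suggests the two samples come from different distributions.
alpha = 0.01  # illustrative threshold
shift_detected = shifted_result.pvalue < alpha
```

With many features, repeated tests will flag some by chance, so treat a flag as a prompt to plot the two distributions rather than as proof of drift.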
Progressive disclosure
- ../references/categorical-encoding.md: Comprehensive encoding guide
- ../references/datetime-features.md: Time-based feature patterns
- ../references/text-features.md: NLP feature engineering
- ../references/feature-selection.md: Selection strategies and implementations
Related skills
- @data-science-eda: Understand data before engineering
- @data-science-model-evaluation: Validate feature impact
- @data-engineering-core: Data processing fundamentals
Repository: legout/data-pla…t-skills
First Seen: Feb 11, 2026