data-science-feature-engineering
Feature Engineering
Use this skill for creating, transforming, and selecting features that improve model performance.
When to use this skill
- After EDA — convert insights into features
- Model underperforming — need better representations
- Handling different data types (numerical, categorical, text, datetime)
- Reducing dimensionality or selecting most predictive features
Feature engineering workflow
Numerical features
- Scaling (StandardScaler, MinMaxScaler, RobustScaler)
- Transformations (log, sqrt, Box-Cox for skewness)
- Binning (equal-width, quantile, custom)
- Interaction features
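A minimal sketch of the transform-then-scale pattern above, using a hypothetical skewed column (`income` is made-up illustration data): `log1p` compresses the long tail before `StandardScaler` centers and scales it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed feature (e.g. income): log1p compresses the long tail
income = np.array([[20_000.0], [35_000.0], [50_000.0], [1_000_000.0]])
log_income = np.log1p(income)

# Scale after transforming, so the scaler sees the more symmetric distribution
scaled = StandardScaler().fit_transform(log_income)
```

Applying the log first matters: scaling alone would leave the outlier dominating the variance.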
Categorical features
- One-hot encoding (low cardinality)
- Target/Mean encoding (high cardinality)
- Ordinal encoding (ordered categories)
- Frequency/rare category handling
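Frequency encoding, one of the high-cardinality options above, can be sketched in a few lines of pandas (the city column here is hypothetical example data):

```python
import pandas as pd

# Hypothetical high-cardinality column: replace each category by its relative frequency
cities = pd.Series(["NYC", "NYC", "LA", "SF", "NYC", "LA"])
freq_map = cities.value_counts(normalize=True)
encoded = cities.map(freq_map)
```

In a real pipeline the frequency map must be computed on the training split only and then applied to test data, for the same leakage reasons covered under the implementation rules below.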
Datetime features
- Extract components (year, month, day, hour, dayofweek)
- Cyclical encoding (sin/cos for time cycles)
- Time since/duration features
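Cyclical encoding from the list above can be sketched as follows (timestamps are illustrative): mapping hour-of-day onto the unit circle makes 23:00 and 00:00 close, which a raw integer hour would not capture.

```python
import numpy as np
import pandas as pd

# Map hour-of-day onto the unit circle so 23:00 and 00:00 end up close together
ts = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 23:00"])
hour = ts.hour.to_numpy()
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
```

The same sin/cos pair works for any cycle (day-of-week with period 7, month with period 12).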
Text features
- TF-IDF, CountVectorizer
- Embeddings (sentence-transformers)
- Basic text stats (length, word count)
Feature selection
- Filter methods (correlation, mutual information)
- Wrapper methods (recursive feature elimination)
- Embedded methods (L1 regularization, tree importance)
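A filter-method sketch using mutual information with `SelectKBest` (synthetic data from `make_classification`, so the numbers here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic dataset: 10 features, only 3 of which carry signal
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Keep the 5 features with the highest mutual information with the target
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
X_sel = selector.transform(X)
```

`selector.get_support()` returns the boolean mask of kept columns, which is useful for tracing selected features back to their names.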
Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| sklearn pipelines | sklearn.pipeline + ColumnTransformer | Reproducible, cross-validation safe |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Auto feature engineering | Feature-engine | Comprehensive transformations |
Core implementation rules
1) Use pipelines to prevent leakage
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])
```
2) Fit on train only, transform on all
```python
# Correct: fit_transform on train, transform on test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # Only transform!
```
3) Handle unknown categories
```python
OneHotEncoder(handle_unknown='ignore')               # Unknown → all zeros
# OR
OneHotEncoder(handle_unknown='infrequent_if_exist')  # Group rare/unknown
```
4) Document feature importance
Track which features were created, why, and their expected impact.
Common anti-patterns
- ❌ Fitting preprocessors on full dataset (leakage!)
- ❌ One-hot encoding high-cardinality features (dimension explosion)
- ❌ Ignoring feature scaling for distance-based models
- ❌ Creating features without domain reasoning
- ❌ Not validating feature distributions match between train/test
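The last anti-pattern can be checked mechanically; one common approach (a sketch on simulated data, with an arbitrary 0.01 threshold) is a two-sample Kolmogorov-Smirnov test per feature:

```python
import numpy as np
from scipy.stats import ks_2samp

# Simulated drift: the test split of this hypothetical feature is shifted by 0.5
rng = np.random.default_rng(0)
train_feat = rng.normal(0.0, 1.0, 1000)
test_feat = rng.normal(0.5, 1.0, 1000)

# KS test: a small p-value means the two distributions differ
stat, p_value = ks_2samp(train_feat, test_feat)
drifted = p_value < 0.01
```

Running this per feature after a split is a cheap guard against accidental train/test distribution mismatch.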
Progressive disclosure
- ../references/categorical-encoding.md — Comprehensive encoding guide
- ../references/datetime-features.md — Time-based feature patterns
- ../references/text-features.md — NLP feature engineering
- ../references/feature-selection.md — Selection strategies and implementations
Related skills
- @data-science-eda — Understand data before engineering
- @data-science-model-evaluation — Validate feature impact
- @data-engineering-core — Data processing fundamentals