feature-engineer
Feature Engineer
Overview
Feature engineering often makes the difference between a mediocre model and an excellent one. This skill transforms raw data into model-ready features through systematic data quality assessment, feature creation, selection, and transformation, all integrated with SpecWeave's increment workflow.
The Feature Engineering Pipeline
Phase 1: Data Quality Assessment
Before creating features, understand your data:
from specweave import DataQualityReport
# Automated data quality check
report = DataQualityReport(df, increment="0042")
# Generates:
# - Missing value analysis
# - Outlier detection
# - Data type validation
# - Distribution analysis
# - Correlation matrix
# - Duplicate detection
Quality Report Output:
# Data Quality Report
## Dataset Overview
- Rows: 100,000
- Columns: 45
- Memory: 34.2 MB
## Missing Values
| Column | Missing | Percentage |
|-----------------|---------|------------|
| email | 15,234 | 15.2% |
| phone | 8,901 | 8.9% |
| purchase_date | 0 | 0.0% |
## Outliers Detected
- transaction_amount: 234 outliers (>3 std dev)
- user_age: 12 outliers (<18 or >100)
## Data Type Issues
- user_id: Stored as float, should be int
- date_joined: Stored as string, should be datetime
## Recommendations
1. Impute email/phone or create "missing" indicator features
2. Cap/remove outliers in transaction_amount
3. Convert data types for efficiency
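These checks can be approximated with a few lines of plain pandas. A minimal sketch (column names are illustrative; this is not the skill's internal implementation):
import numpy as np
import pandas as pd

def quick_quality_check(df: pd.DataFrame) -> None:
    # Missing values per column, as a percentage
    missing = df.isna().mean().mul(100).round(1)
    print(missing[missing > 0].sort_values(ascending=False))

    # Duplicate rows
    print(f"Duplicates: {df.duplicated().sum()}")

    # Outliers per numeric column (beyond 3 standard deviations)
    for col in df.select_dtypes(include=np.number).columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        n_out = int((z.abs() > 3).sum())
        if n_out:
            print(f"{col}: {n_out} outliers (>3 std dev)")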
Phase 2: Feature Creation
Create features from domain knowledge:
from specweave import FeatureCreator

creator = FeatureCreator(df, increment="0042")

# Temporal features (from datetime)
creator.add_temporal_features(
    date_column="purchase_date",
    features=["hour", "day_of_week", "month", "is_weekend", "is_holiday"]
)

# Aggregation features (user behavior)
creator.add_aggregation_features(
    group_by="user_id",
    target="purchase_amount",
    aggs=["mean", "std", "count", "min", "max"]
)
# Creates: user_purchase_amount_mean, user_purchase_amount_std, etc.

# Interaction features
creator.add_interaction_features(
    features=[("age", "income"), ("clicks", "impressions")],
    operations=["multiply", "divide", "subtract"]
)
# Creates: age_x_income, clicks_per_impression, etc.

# Ratio features
creator.add_ratio_features([
    ("revenue", "cost"),
    ("conversions", "visits")
])
# Creates: revenue_to_cost_ratio, conversion_rate

# Binning (discretization)
creator.add_binned_features(
    column="age",
    bins=[0, 18, 25, 35, 50, 65, 100],
    labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"]
)

# Text features (from text columns)
creator.add_text_features(
    column="product_description",
    features=["length", "word_count", "unique_words", "sentiment"]
)

# Generate all features
df_enriched = creator.generate()

# Auto-documents in increment folder
creator.save_feature_definitions(
    path=".specweave/increments/0042.../features/feature_definitions.yaml"
)
Feature Definitions (auto-generated):
# .specweave/increments/0042.../features/feature_definitions.yaml
features:
  - name: purchase_hour
    type: temporal
    source: purchase_date
    description: Hour of purchase (0-23)

  - name: user_purchase_amount_mean
    type: aggregation
    source: purchase_amount
    group_by: user_id
    description: Average purchase amount per user

  - name: age_x_income
    type: interaction
    sources: [age, income]
    operation: multiply
    description: Product of age and income

  - name: conversion_rate
    type: ratio
    sources: [conversions, visits]
    description: Conversion rate (conversions / visits)
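For intuition, the temporal and aggregation features above reduce to standard pandas operations. A rough equivalent, assuming df holds one row per transaction and purchase_date is already a datetime column:
import pandas as pd

# Temporal features from a datetime column
df["purchase_hour"] = df["purchase_date"].dt.hour
df["purchase_day_of_week"] = df["purchase_date"].dt.dayofweek
df["purchase_month"] = df["purchase_date"].dt.month
df["is_weekend"] = df["purchase_date"].dt.dayofweek.isin([5, 6]).astype(int)

# Aggregation features: per-user purchase statistics
user_stats = (
    df.groupby("user_id")["purchase_amount"]
    .agg(["mean", "std", "count", "min", "max"])
    .add_prefix("user_purchase_amount_")
    .reset_index()
)
df = df.merge(user_stats, on="user_id", how="left")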
Phase 3: Feature Selection
Reduce dimensionality and improve model performance:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from specweave import FeatureSelector
selector = FeatureSelector(X_train, y_train, increment="0042")
# Method 1: Correlation-based (remove redundant features)
selector.remove_correlated_features(threshold=0.95)
# Removes features with >95% correlation
# Method 2: Variance-based (remove near-constant features)
selector.remove_low_variance_features(threshold=0.01)
# Removes features with variance below 0.01
# Method 3: Statistical tests
selector.select_by_statistical_test(k=50)
# SelectKBest with chi2/f_classif
# Method 4: Model-based (tree importance)
selector.select_by_model_importance(
    model=RandomForestClassifier(),
    threshold=0.01
)
# Removes features with <1% importance
# Method 5: Recursive Feature Elimination
selector.select_by_rfe(
    model=LogisticRegression(),
    n_features=30
)
# Get selected features
selected_features = selector.get_selected_features()
# Generate selection report
selector.generate_report()
Feature Selection Report:
# Feature Selection Report
## Original Features: 125
## Selected Features: 35 (72% reduction)
## Selection Process
1. Removed 12 correlated features (>95% correlation)
2. Removed 8 low-variance features
3. Statistical test: Selected top 50 (chi-squared)
4. Model importance: Removed 15 low-importance features (<1%)
## Top 10 Features (by importance)
1. user_purchase_amount_mean (0.18)
2. days_since_last_purchase (0.12)
3. total_purchases (0.10)
4. age_x_income (0.08)
5. conversion_rate (0.07)
...
## Removed Features
- user_id_hash (constant)
- temp_feature_1 (99% correlated with temp_feature_2)
- random_noise (0% importance)
...
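The correlation and model-importance steps map closely onto scikit-learn primitives. A sketch of roughly equivalent logic (thresholds are illustrative; this is not SpecWeave's actual code):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Drop one feature from each highly correlated pair (>0.95)
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X_train.drop(columns=to_drop)

# Keep features whose tree importance clears a threshold
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold=0.01)
sfm.fit(X_reduced, y_train)
selected = list(X_reduced.columns[sfm.get_support()])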
Phase 4: Feature Transformation
Scale, normalize, and encode features for model compatibility:
from specweave import FeatureTransformer
import numpy as np  # used by the custom transformer below

transformer = FeatureTransformer(increment="0042")

# Numerical transformations
transformer.add_numerical_transformer(
    columns=["age", "income", "purchase_amount"],
    method="standard_scaler"  # Or: min_max, robust, quantile
)

# Categorical encoding
transformer.add_categorical_encoder(
    columns=["country", "device_type", "product_category"],
    method="onehot",  # Or: label, target, binary
    handle_unknown="ignore"
)

# Ordinal encoding (for ordered categories)
transformer.add_ordinal_encoder(
    column="education",
    order=["high_school", "bachelors", "masters", "phd"]
)

# Log transformation (for skewed distributions)
transformer.add_log_transform(
    columns=["transaction_amount", "page_views"],
    method="log1p"  # log(1 + x) to handle zeros
)

# Box-Cox transformation (for normalization; requires strictly
# positive values -- use Yeo-Johnson if zeros or negatives occur)
transformer.add_power_transform(
    columns=["revenue", "engagement_score"],
    method="box-cox"
)

# Custom transformation
def clip_outliers(x):
    # x is a pandas Series; clip to its 1st-99th percentile range
    return np.clip(x, x.quantile(0.01), x.quantile(0.99))

transformer.add_custom_transformer(
    columns=["outlier_prone_feature"],
    func=clip_outliers
)

# Fit on train, then apply the fitted pipeline to both splits
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

# Save transformer pipeline
transformer.save(
    path=".specweave/increments/0042.../features/transformer.pkl"
)
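Conceptually, this pipeline is close to a scikit-learn ColumnTransformer. A minimal sketch of the same scaling and encoding steps (an analogy, not SpecWeave's internals):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income", "purchase_amount"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["country", "device_type", "product_category"]),
    ],
    remainder="passthrough",  # pass unlisted columns through unchanged
)

# Fit on train only; apply the fitted pipeline to both splits
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)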
Phase 5: Feature Validation
Ensure features are production-ready:
from specweave import FeatureValidator
validator = FeatureValidator(
    X_train, X_test,
    increment="0042"
)
# Check for data leakage
leakage_report = validator.check_data_leakage()
# Detects: perfectly correlated features, future data in training
# Check for distribution drift
drift_report = validator.check_distribution_drift()
# Compares train vs test distributions
# Check for missing values after transformation
missing_report = validator.check_missing_values()
# Check for infinite/NaN values
invalid_report = validator.check_invalid_values()
# Generate validation report
validator.generate_report()
Validation Report:
# Feature Validation Report
## Data Leakage: ✅ PASS
No perfectly correlated features or future data detected in the training set.
## Distribution Drift: ⚠️ WARNING
Features with significant drift (KS test p < 0.05):
- user_age: p=0.023 (minor drift)
- device_type: p=0.001 (major drift)
Recommendation: Check whether the test data comes from a different time period.
## Missing Values: ✅ PASS
No missing values after transformation.
## Invalid Values: ✅ PASS
No infinite or NaN values detected.
## Overall: READY FOR TRAINING
2 warnings, 0 critical issues.
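The drift check can be approximated with a two-sample Kolmogorov-Smirnov test per numeric feature. A sketch of that idea (not SpecWeave's actual implementation):
from scipy.stats import ks_2samp

def drift_check(X_train, X_test, alpha=0.05):
    # Flag numeric features whose train/test distributions differ
    drifted = {}
    for col in X_train.select_dtypes(include="number").columns:
        result = ks_2samp(X_train[col].dropna(), X_test[col].dropna())
        if result.pvalue < alpha:
            drifted[col] = result.pvalue
    return drifted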
Integration with SpecWeave
Automatic Feature Documentation
# All feature engineering steps logged to increment
from specweave import track_experiment

with track_experiment("feature-engineering-v1", increment="0042") as exp:
    # Create features
    df_enriched = creator.generate()

    # Select features
    selected = selector.select()

    # Transform features
    X_transformed = transformer.fit_transform(X)

    # Validate
    validation = validator.validate()

    # Auto-logs:
    exp.log_param("original_features", 125)
    exp.log_param("created_features", 45)
    exp.log_param("selected_features", 35)
    exp.log_metric("feature_reduction", 0.72)
    exp.save_artifact("feature_definitions.yaml")
    exp.save_artifact("transformer.pkl")
    exp.save_artifact("validation_report.md")
Living Docs Integration
After completing feature engineering:
/sw:sync-docs update
Updates:
<!-- .specweave/docs/internal/architecture/feature-engineering.md -->
## Recommendation Model Features (Increment 0042)
### Feature Engineering Pipeline
1. Data Quality: 100K rows, 45 columns
2. Created: 45 new features (temporal, aggregation, interaction)
3. Selected: 35 features (72% reduction via importance + RFE)
4. Transformed: StandardScaler for numerical, OneHot for categorical
### Key Features
- user_purchase_amount_mean: Average user spend (top feature, 18% importance)
- days_since_last_purchase: Recency indicator (12% importance)
- age_x_income: Interaction feature (8% importance)
### Feature Store
All features documented in: `.specweave/increments/0042.../features/`
- feature_definitions.yaml: Feature catalog
- transformer.pkl: Production transformation pipeline
- validation_report.md: Quality checks
Best Practices
1. Document Feature Rationale
# Bad: Create features without explanation
df["feature_1"] = df["col_a"] * df["col_b"]
# Good: Document why features were created
creator.add_interaction_feature(
    sources=["age", "income"],
    operation="multiply",
    rationale="High-income older users have different behavior patterns"
)
2. Handle Missing Values Systematically
# Options for missing values:
# 1. Imputation (mean, median, mode)
creator.impute_missing(column="age", strategy="median")
# 2. Indicator features (flag missing as signal)
creator.add_missing_indicator(column="email")
# Creates: email_missing (0/1)
# 3. Forward/backward fill (for time series)
creator.fill_missing(column="sensor_reading", method="ffill")
# 4. Model-based imputation
from sklearn.ensemble import RandomForestRegressor
creator.impute_with_model(column="income", model=RandomForestRegressor())
3. Avoid Data Leakage
# ❌ WRONG: Fit on all data (includes test set!)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# ✅ CORRECT: Fit only on train, transform both
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# SpecWeave's transformer enforces this pattern
transformer.fit_transform(X_train) # Fits
transformer.transform(X_test) # Only transforms
4. Version Feature Engineering Pipeline
# Version features with increment
transformer.save(
    path=".specweave/increments/0042.../features/transformer-v1.pkl",
    metadata={
        "version": "v1",
        "features": selected_features,
        "transformations": ["standard_scaler", "onehot"]
    }
)

# Load specific version for reproducibility
transformer_v1 = FeatureTransformer.load(
    ".specweave/increments/0042.../features/transformer-v1.pkl"
)
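If you need the same versioning outside SpecWeave, joblib plus a small metadata file gives a comparable result. A sketch under that assumption (the paths and the fitted_pipeline variable are illustrative):
import json
import joblib

# Persist the fitted pipeline alongside human-readable metadata
joblib.dump(fitted_pipeline, "features/transformer-v1.pkl")
with open("features/transformer-v1.meta.json", "w") as f:
    json.dump({"version": "v1",
               "transformations": ["standard_scaler", "onehot"]}, f, indent=2)

# Later, reload the exact same version for reproducibility
pipeline_v1 = joblib.load("features/transformer-v1.pkl")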
5. Test Feature Engineering on New Data
# Before deploying, test on held-out data
X_production_sample = load_production_data()

try:
    X_transformed = transformer.transform(X_production_sample)
except Exception as e:
    raise FeatureEngineeringError(f"Failed on production data: {e}")

# Check for unexpected values
validator = FeatureValidator(X_train, X_production_sample)
validation_report = validator.validate()
if validation_report["status"] == "CRITICAL":
    raise FeatureEngineeringError("Feature engineering failed validation")
Common Feature Engineering Patterns
Pattern 1: RFM (Recency, Frequency, Monetary)
# For e-commerce / customer analytics
creator.add_rfm_features(
    user_id="user_id",
    transaction_date="purchase_date",
    transaction_amount="purchase_amount"
)
# Creates:
# - recency: days since last purchase
# - frequency: total purchases
# - monetary: total spend
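A plain-pandas version of the same computation, assuming one row per transaction:
import pandas as pd

snapshot = df["purchase_date"].max()  # reference date for recency
rfm = df.groupby("user_id").agg(
    recency=("purchase_date", lambda d: (snapshot - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("purchase_amount", "sum"),
).reset_index()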
Pattern 2: Rolling Window Aggregations
# For time series
creator.add_rolling_features(
    column="daily_sales",
    windows=[7, 14, 30],
    aggs=["mean", "std", "min", "max"]
)
# Creates: daily_sales_7day_mean, daily_sales_7day_std, etc.
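Equivalent pandas rolling aggregations, assuming a frame with date and daily_sales columns (illustrative names):
# Sort by date so each window looks backward in time
df = df.sort_values("date")
for window in (7, 14, 30):
    roll = df["daily_sales"].rolling(window=window, min_periods=1)
    df[f"daily_sales_{window}day_mean"] = roll.mean()
    df[f"daily_sales_{window}day_std"] = roll.std()
# Tip: shift(1) the results if the current day must be excluded to avoid leakage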
Pattern 3: Target Encoding (Categorical → Numerical)
# Encode categorical as target mean (careful: can leak!)
creator.add_target_encoding(
    column="product_category",
    target="purchase_amount",
    cv_folds=5  # Cross-validation to prevent leakage
)
# Creates: product_category_target_encoded
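The cv_folds argument matters because naive target encoding leaks the label into the feature. A sketch of the out-of-fold idea (not SpecWeave's exact implementation):
import numpy as np
from sklearn.model_selection import KFold

def target_encode_oof(df, column, target, n_splits=5):
    # Each row is encoded with target means computed on the *other* folds
    encoded = np.full(len(df), np.nan)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(df):
        fold_means = df.iloc[fit_idx].groupby(column)[target].mean()
        encoded[enc_idx] = (
            df.iloc[enc_idx][column].map(fold_means).fillna(global_mean).values
        )
    return encoded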
Pattern 4: Polynomial Features
# For non-linear relationships
creator.add_polynomial_features(
    columns=["age", "income"],
    degree=2,
    interaction_only=False  # False keeps pure powers, not just cross-terms
)
# Creates: age^2, income^2, age*income
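This mirrors scikit-learn's PolynomialFeatures (note that interaction_only=True would drop the squared terms):
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(df[["age", "income"]])
print(poly.get_feature_names_out())
# ['age' 'income' 'age^2' 'age income' 'income^2']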
Commands
# Generate feature engineering pipeline for increment
/ml:engineer-features 0042
# Validate features before training
/ml:validate-features 0042
# Generate feature importance report
/ml:feature-importance 0042
Integration with Other Skills
- ml-pipeline-orchestrator: Task 2 is "Feature Engineering" (uses this skill)
- experiment-tracker: Logs all feature engineering experiments
- model-evaluator: Uses feature importance from models
- ml-deployment-helper: Packages feature transformer for production
Summary
Feature engineering is often the highest-leverage step in an ML project. This skill ensures:
- ✅ Systematic approach (quality → create → select → transform → validate)
- ✅ No data leakage (train/test separation enforced)
- ✅ Production-ready (versioned, validated, documented)
- ✅ Reproducible (all steps tracked in increment)
- ✅ Traceable (feature definitions in living docs)
Good features make mediocre models good. Great features make good models excellent.