jupyter-notebook-skills
Interactive Data Exploration Commander
⚠️ MANDATORY COMPLIANCE ⚠️
CRITICAL: The 4-step workflow outlined in this document MUST be followed in exact order for EVERY Jupyter notebook task. Skipping steps or deviating from the procedure will result in ineffective analysis or incorrect results. This is non-negotiable.
File Structure
- SKILL.md (this file): Main instructions and MANDATORY workflow
- examples.md: Usage scenarios with different data science tasks and generated notebook code
- Memory: Project-specific memory accessed via memoryStore.getSkillMemory("jupyter-notebook-skills", "{project-name}"). See MemoryStore Interface.
- templates/:
  - eda_template.md: Exploratory Data Analysis notebook template
  - ml_template.md: Machine Learning workflow template
  - visualization_template.md: Data visualization template
Interface References
- Context: Loaded via ContextProvider Interface
- Memory: Accessed via MemoryStore Interface
- Schemas: Validated against memory_entry.schema.json
Focus Areas
Jupyter notebook data exploration spans 7 critical dimensions:
- Data Loading & Inspection: Import data from various sources (CSV, SQL, APIs), inspect structure, identify data types
- Data Cleaning & Preprocessing: Handle missing values, outliers, duplicates, type conversions, feature engineering
- Exploratory Data Analysis (EDA): Statistical summaries, distributions, correlations, patterns, anomalies
- Visualization: Create informative plots (matplotlib, seaborn, plotly) for data understanding and presentation
- Statistical Analysis: Hypothesis testing, confidence intervals, significance tests, regression analysis
- Machine Learning: Model selection, training, evaluation, hyperparameter tuning, feature importance
- Reproducibility: Clear code organization, documentation, random seeds, environment specifications
Note: The skill generates notebook cells and code snippets for the user to execute. It does not run notebooks directly unless integrated with a Jupyter kernel.
MANDATORY WORKFLOW (MUST FOLLOW EXACTLY)
⚠️ STEP 1: Understand the Data Context (REQUIRED)
YOU MUST:
- Determine the data source: CSV file, database, API, existing dataframe, web scraping
- Identify the analysis objective: EDA, predictive modeling, hypothesis testing, visualization, reporting
- Clarify the data characteristics: size (rows/columns), data types, known issues (missing values, outliers)
- Assess the user's expertise level: beginner (needs explanations), intermediate (familiar with pandas), advanced (can customize code)
- Identify the environment: Local Jupyter, JupyterLab, Google Colab, Databricks, SageMaker
- Ask clarifying questions if context is incomplete:
- What does your data look like? (Schema, sample rows)
- What question are you trying to answer with this data?
- Are there any known data quality issues?
- Which libraries are you comfortable using?
- What type of output do you need? (Insights, visualizations, models, reports)
DO NOT PROCEED WITHOUT UNDERSTANDING THE DATA AND OBJECTIVE
⚠️ STEP 2: Plan the Analysis Approach (REQUIRED)
YOU MUST:
- Map the data structure: Identify target variable (for ML), features, categorical vs numerical columns
- Determine the analysis pipeline:
- EDA: Load → Inspect → Clean → Visualize → Summarize
- Predictive Modeling: EDA → Feature Engineering → Train/Test Split → Model Selection → Evaluation
- Hypothesis Testing: EDA → Assumption Checking → Statistical Test → Interpretation
- Time Series: EDA → Stationarity Check → Decomposition → Forecasting
- Select appropriate libraries:
- Data manipulation: pandas, numpy
- Visualization: matplotlib, seaborn, plotly
- Statistics: scipy.stats, statsmodels
- ML: scikit-learn, xgboost, tensorflow, pytorch
- Identify data quality steps needed (a brief code sketch follows this list):
- Missing value imputation (mean, median, mode, forward-fill, interpolation)
- Outlier detection and handling (IQR, z-score, domain knowledge)
- Encoding categorical variables (one-hot, label encoding, target encoding)
- Feature scaling (standardization, normalization)
- Check project memory: Use memoryStore.getSkillMemory("jupyter-notebook-skills", "{project-name}") to load project-specific data schemas, feature engineering patterns, or modeling approaches. See MemoryStore Interface.
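A minimal preprocessing sketch covering the data quality steps above, assuming a pandas DataFrame df with a hypothetical numeric column 'age' and a hypothetical categorical column 'city':

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Missing value imputation: median for a numeric column
df['age'] = df['age'].fillna(df['age'].median())

# Outlier handling: cap values outside the IQR fences
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
df['age'] = df['age'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Encoding categorical variables: one-hot encoding
df = pd.get_dummies(df, columns=['city'], drop_first=True)

# Feature scaling: standardization (zero mean, unit variance)
scaler = StandardScaler()
df[['age']] = scaler.fit_transform(df[['age']])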
DO NOT PROCEED WITHOUT A CLEAR ANALYSIS PLAN
⚠️ STEP 3: Generate Notebook Code (REQUIRED)
YOU MUST:
- Structure the notebook logically with markdown cells for narrative and code cells for execution:
- Title and objective
- Library imports
- Data loading
- Data inspection
- Data cleaning
- Analysis/modeling sections
- Conclusions and next steps
- Write production-quality code (a brief helper sketch follows this list):
- Clear variable names
- Comments explaining non-obvious logic
- Error handling for data loading
- Reproducible random seeds for ML tasks
- Modular functions for reusable operations
- Type hints for Python 3.6+ (optional but recommended)
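As an illustration of these points, a hedged sketch of a fixed seed plus a small reusable helper with a type hint and docstring (the function name and logic are illustrative, not part of any template):

import numpy as np
import pandas as pd

# Set a seed once near the top of the notebook for reproducible sampling and model runs
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages, sorted descending."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    return summary.sort_values('missing_count', ascending=False)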
- For Data Loading:

import pandas as pd
import numpy as np

# Load data with error handling
try:
    df = pd.read_csv('data.csv')
    print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
except FileNotFoundError:
    print("Error: File not found. Check the file path.")
- For Data Inspection:

# Basic info
print(df.info())
print(df.describe())
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")
- For Visualizations:
- Use clear titles, axis labels, and legends
- Choose appropriate plot types (scatter, histogram, box, heatmap)
- Use color palettes that are colorblind-friendly
- Include figure size for readability
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='column_name', kde=True)
plt.title('Distribution of Column Name')
plt.xlabel('Column Name')
plt.ylabel('Frequency')
plt.show()

- For Machine Learning:
- Always use train/test split or cross-validation
- Set random seeds for reproducibility
- Evaluate with appropriate metrics (accuracy, precision, recall, F1, RMSE, R²)
- Check for overfitting (compare train vs test performance)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
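Where a single split is too noisy (e.g., small datasets), a cross-validation sketch along the same lines, assuming the X, y, and model defined in the example above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation over the full dataset
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")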
- Use templates from templates/ for consistent notebook structure
DO NOT GENERATE CODE WITH HARD-CODED ASSUMPTIONS ABOUT DATA
⚠️ STEP 4: Validate and Document (REQUIRED)
YOU MUST validate the notebook against these criteria:
- Code quality check:
- Imports are at the top of the notebook
- No unused imports or variables
- Functions are defined before use
- Random seeds are set for reproducibility
- File paths are parameterized (not hard-coded)
- Analysis validity check:
- Data types are appropriate (numeric for calculations, categorical for encoding)
- Missing value handling is appropriate for the data
- Visualizations are clear and labeled
- Statistical test assumptions are checked (normality, homoscedasticity; see the sketch below)
- ML metrics match the problem type (classification vs regression)
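A minimal sketch of these assumption checks with scipy.stats, assuming two hypothetical numeric samples group_a and group_b:

from scipy import stats

# Normality check (Shapiro-Wilk) on each group; p > 0.05 suggests normality is plausible
print(stats.shapiro(group_a))
print(stats.shapiro(group_b))

# Homoscedasticity check (Levene's test); p > 0.05 suggests equal variances are plausible
print(stats.levene(group_a, group_b))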
- Reproducibility check:
- Random seeds set (np.random.seed(42), random_state=42)
- Environment can be recreated (list library versions; see the sketch below)
- Data source is documented
- All cells run in order without errors
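One way to record the environment from within the notebook, as a minimal sketch:

# Record library versions so the environment can be recreated
import sys
import pandas as pd
import numpy as np
import sklearn

print(f"Python: {sys.version}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scikit-learn: {sklearn.__version__}")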
- Documentation:
- Markdown cells explain each major section
- Code comments clarify non-obvious logic
- Results are interpreted (not just printed)
- Conclusions and next steps are provided
- Present the notebook code to the user with clear execution instructions
- Offer alternatives: Suggest alternative approaches when applicable (e.g., different models, different visualizations)
DO NOT SKIP VALIDATION
OPTIONAL: Update Project Memory
If project-specific patterns are discovered during analysis, use memoryStore.update(layer="skill-specific", skill="jupyter-notebook-skills", project="{project-name}", ...) to store insights:
- Data schema and feature descriptions
- Successful feature engineering patterns
- Model performance benchmarks
- Common data quality issues and solutions
Timestamps and staleness tracking are handled automatically by MemoryStore. See MemoryStore Interface.
Compliance Checklist
Before completing ANY Jupyter notebook task, verify:
- Step 1: Data context understood — source, objective, characteristics, user expertise
- Step 2: Analysis approach planned — pipeline, libraries, data quality steps
- Step 3: Notebook code generated — structured, production-quality, uses templates
- Step 4: Notebook validated — code quality, analysis validity, reproducibility, documentation
FAILURE TO COMPLETE ALL STEPS INVALIDATES THE NOTEBOOK
Common Data Science Tasks
Exploratory Data Analysis (EDA)
- Data profiling: Shape, types, missing values, duplicates, unique counts
- Descriptive statistics: Mean, median, std, min/max, quartiles
- Distributions: Histograms, KDE plots, box plots
- Correlations: Heatmaps, scatter plots, pair plots
- Categorical analysis: Value counts, bar charts, grouped statistics
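A brief sketch of the correlation and categorical pieces, assuming a DataFrame df with a hypothetical categorical column 'category':

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap over numeric columns only
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Categorical analysis: value counts and grouped statistics
print(df['category'].value_counts())
print(df.groupby('category').mean(numeric_only=True))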
Data Cleaning & Preprocessing
- Missing values: Drop, impute (mean/median/mode), forward-fill, interpolate
- Outliers: Detect (IQR, z-score), remove, cap, transform
- Duplicates: Identify, remove, or flag
- Type conversion: String to datetime, numeric to categorical
- Feature engineering: Create derived features, binning, scaling, encoding
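A short sketch of a few of these operations (all column names are illustrative):

import pandas as pd

# Drop exact duplicate rows
df = df.drop_duplicates()

# Type conversion: string to datetime, numeric code to categorical
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['region_code'] = df['region_code'].astype('category')

# Missing values: forward-fill a time-ordered column, interpolate a numeric one
df['price'] = df['price'].ffill()
df['temperature'] = df['temperature'].interpolate()

# Feature engineering: binning a numeric column into labeled buckets
df['price_band'] = pd.cut(df['price'], bins=[0, 10, 50, 100], labels=['low', 'mid', 'high'])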
Statistical Analysis
- Hypothesis testing: t-test, chi-square, ANOVA, Mann-Whitney U
- Correlation analysis: Pearson, Spearman, Kendall
- Regression: Linear, polynomial, logistic regression
- Time series: Stationarity tests, decomposition, autocorrelation
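A minimal sketch of a two-sample comparison and a rank correlation with scipy.stats (group_a, group_b, and the column names are hypothetical):

from scipy import stats

# Independent two-sample t-test; Welch's variant avoids assuming equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Spearman rank correlation between two numeric columns
rho, p = stats.spearmanr(df['x'], df['y'])
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")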
Machine Learning
- Classification: Logistic regression, decision trees, random forests, SVM, neural networks
- Regression: Linear regression, ridge, lasso, random forests, gradient boosting
- Clustering: K-means, hierarchical, DBSCAN
- Dimensionality reduction: PCA, t-SNE, UMAP
- Evaluation: Cross-validation, confusion matrix, ROC/AUC, feature importance
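A brief sketch of feature importance and ROC/AUC evaluation, assuming a fitted tree-based classifier and the X, X_test, y_test objects from a binary classification split (as in the Step 3 example):

import pandas as pd
from sklearn.metrics import roc_auc_score

# Feature importance from a fitted tree-based model
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))

# ROC/AUC for binary classification, using predicted probabilities of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.3f}")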
Visualization
- Univariate: Histograms, box plots, violin plots, KDE
- Bivariate: Scatter plots, line plots, heatmaps
- Multivariate: Pair plots, parallel coordinates, 3D scatter
- Advanced: Interactive plots (plotly), geospatial (folium), network graphs (networkx)
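A minimal interactive example with plotly express (column names are illustrative):

import plotly.express as px

# Interactive scatter plot with hover tooltips
fig = px.scatter(df, x='feature_1', y='feature_2', color='category',
                 hover_data=['id'], title='Feature 1 vs Feature 2')
fig.show()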
Library Quick Reference
Essential Libraries
import pandas as pd # Data manipulation
import numpy as np # Numerical computing
import matplotlib.pyplot as plt # Basic plotting
import seaborn as sns # Statistical visualization
Statistical Analysis
from scipy import stats # Statistical functions
import statsmodels.api as sm # Statistical models
Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.metrics import classification_report, mean_squared_error, r2_score
Deep Learning
import tensorflow as tf # TensorFlow
from tensorflow import keras # Keras (high-level API)
# OR
import torch # PyTorch
import torch.nn as nn
Interactive Visualization
import plotly.express as px # High-level Plotly
import plotly.graph_objects as go # Low-level Plotly
Best Practices
- Start with EDA: Always explore data before building models
- Document assumptions: Write down what you assume about the data
- Check data quality: Missing values, outliers, and duplicates can ruin analysis
- Visualize early and often: Plots reveal patterns that statistics might miss
- Use train/test splits: Never evaluate a model on the same data it was trained on
- Set random seeds: Makes results reproducible (np.random.seed(42))
- Comment your code: Explain why, not just what
- Use markdown cells: Tell a story with your analysis
- Validate assumptions: Check normality, homoscedasticity before statistical tests
- Iterate: Analysis is rarely linear — go back and refine as you learn
Environment Setup
Local Jupyter Installation
pip install jupyter pandas numpy matplotlib seaborn scikit-learn scipy
jupyter notebook
Google Colab (Cloud-based, Free)
- No installation needed
- Go to https://colab.research.google.com/
- Libraries pre-installed: pandas, numpy, matplotlib, seaborn, sklearn, tensorflow
Conda Environment (Recommended for Data Science)
conda create -n data-science python=3.9
conda activate data-science
conda install jupyter pandas numpy matplotlib seaborn scikit-learn scipy
Requirements File for Reproducibility
# requirements.txt
pandas==1.5.3
numpy==1.24.2
matplotlib==3.7.1
seaborn==0.12.2
scikit-learn==1.2.2
scipy==1.10.1
Install with: pip install -r requirements.txt
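To capture the exact versions installed in the current environment, one option is:

pip freeze > requirements.txt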
Further Reading
Refer to official documentation and resources:
- Libraries:
- pandas: https://pandas.pydata.org/docs/
- numpy: https://numpy.org/doc/
- matplotlib: https://matplotlib.org/stable/contents.html
- seaborn: https://seaborn.pydata.org/
- scikit-learn: https://scikit-learn.org/stable/
- Learning Resources:
- Kaggle Learn: https://www.kaggle.com/learn
- Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
- Scikit-learn Tutorials: https://scikit-learn.org/stable/tutorial/
- Best Practices:
- Reproducible Research: https://the-turing-way.netlify.app/
- Cookiecutter Data Science: https://drivendata.github.io/cookiecutter-data-science/
Version History
- v1.1.0 (2026-02-10): Phase 4 Migration
- Migrated to interface-based patterns (ContextProvider + MemoryStore)
- Removed hardcoded filesystem paths
- Added interface references section
- v1.0.0 (2026-02-06): Initial release
- Mandatory 4-step workflow for Jupyter notebook tasks
- Support for EDA, statistical analysis, machine learning, visualization
- Production-quality code generation with error handling
- Project memory integration for pattern persistence
- Template-based notebook structure