data-python
Data Python Skill
Version: 1.0 Stack: Python (pandas, polars, pyspark)
Python makes it easy to write data processing code that works on sample data and fails on real data. iterrows() takes 30 seconds on 10K rows and 30 minutes on 10M. A DataFrame without explicit dtypes uses 8x the memory it needs. Chained indexing creates silent copies that lose your changes. These aren't edge cases — they're the default behavior of pandas when you write it like regular Python.
Vectorized operations, explicit schemas, and proper dtypes mean your code scales from prototype to production without rewriting.
Scope and Boundaries
This skill covers:
- DataFrame patterns (pandas, polars, pyspark)
- Data type handling and validation
- Memory-efficient processing
- Vectorized operations over loops
- Method chaining patterns
- Error handling for data pipelines
Defers to other skills:
code-quality: General code structure, testing, namingdata-sql: Query patterns when using SQL interfacesdata-pipelines: Orchestration and ETL architecture
Use this skill when: Writing Python code that processes data.
Core Principles
- Vectorize, Don't Loop — Use DataFrame operations, not row iteration.
- Fail Fast on Bad Data — Validate early, reject invalid data at boundaries.
- Memory Awareness — Know your data size, use appropriate dtypes.
- Immutable Transforms — Chain operations, don't mutate in place.
- Explicit Schemas — Define expected columns and types upfront.
Patterns
Method Chaining
# Good - readable pipeline
result = (
df
.query("status == 'active'")
.assign(total=lambda x: x["quantity"] * x["price"])
.groupby("category")
.agg({"total": "sum"})
.sort_values("total", ascending=False)
)
# Bad - intermediate variables obscure flow
filtered = df[df["status"] == "active"]
filtered["total"] = filtered["quantity"] * filtered["price"]
grouped = filtered.groupby("category")
result = grouped.agg({"total": "sum"})
result = result.sort_values("total", ascending=False)
Schema Validation
EXPECTED_COLUMNS = {"id", "name", "value", "timestamp"}
REQUIRED_COLUMNS = {"id", "value"}
def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
raise ValueError(f"Missing required columns: {missing}")
return df[list(EXPECTED_COLUMNS & set(df.columns))]
Type Optimization
def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
"""Downcast numeric types to reduce memory."""
for col in df.select_dtypes(include=["int"]).columns:
df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include=["float"]).columns:
df[col] = pd.to_numeric(df[col], downcast="float")
return df
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
for row in df.iterrows() |
Slow, defeats vectorization | Use vectorized operations |
df["col"] = df.apply(...) |
Usually slower than vectorized | Use np.where or df.assign |
Chained indexing df["a"]["b"] |
SettingWithCopyWarning | Use df.loc[:, "a"] |
| Loading entire file to check schema | Wastes memory | Use nrows=100 or chunking |
| Ignoring dtypes on read | Memory bloat | Specify dtype= parameter |
Checklist
- No
iterrows()oritertuples()for computation - Explicit dtypes on file reads
- Schema validation at boundaries
- Method chaining for transforms
- Memory profiled for large datasets
References
references/vectorization.md— Vectorized operations and performancereferences/memory-optimization.md— Memory optimization techniques
Assets
assets/pandas-cheatsheet.md— Quick reference for pandas operations
More from alexanderstephenthompson/claude-hub
unity-csharp
C# patterns for Unity - MonoBehaviour, async, architecture, and VR/mobile performance optimization
40architecture
Architecture principles, module boundaries, folder structure, and project type profiles
28design
Design and UI standards for accessibility, semantic HTML, and responsive layouts
26vrc-udon
VRChat Udon and UdonSharp patterns - networking, sync, interactions
25web-performance
Performance patterns for Apollo caching, Redis, and CloudFront optimization
25organize
Folder organization system using the L1/L2/L3 cognitive-phase framework. Applies to home, work, and shared drives for designing folder hierarchies, reorganizing structures, or deciding where files belong.
22