Pandas Best Practices

Expert guidelines for Pandas development, focusing on data manipulation, analysis, and efficient DataFrame operations.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in data analysis workflows
  • Use functional programming; avoid unnecessary classes
  • Prefer vectorized operations over explicit loops
  • Use descriptive variable names reflecting data content
  • Follow PEP 8 style guidelines

DataFrame Creation and I/O

  • Use pd.read_csv(), pd.read_excel(), pd.read_json() with appropriate parameters
  • Specify dtype parameter to ensure correct data types on load
  • Use parse_dates for automatic datetime parsing
  • Set index_col when the data has a natural index column
  • Use chunksize for reading large files incrementally
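
A minimal sketch of these load-time options, using an in-memory CSV in place of a file on disk (the column names and data are hypothetical):

```python
import io

import pandas as pd

# In-memory CSV standing in for a file on disk.
csv_data = io.StringIO(
    "order_id,order_date,region,amount\n"
    "1001,2024-01-05,north,19.99\n"
    "1002,2024-01-06,south,5.49\n"
)

# Fix dtypes at load time, parse dates, and set the natural index.
orders = pd.read_csv(
    csv_data,
    dtype={"region": "category", "amount": "float64"},
    parse_dates=["order_date"],
    index_col="order_id",
)
```

For a real large file, the same call would take a `chunksize=` argument and return an iterator of DataFrames instead of one frame.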

Data Selection

  • Use .loc[] for label-based indexing
  • Use .iloc[] for integer position-based indexing
  • Avoid chained indexing (e.g., df['col'][0]) - use .loc or .iloc instead
  • Use boolean indexing for conditional selection: df[df['col'] > value]
  • Use .query() method for complex filtering conditions
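
The four selection styles above side by side, on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["oslo", "bergen", "tromso"], "temp": [3, 6, -4]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "temp"]                 # label-based indexing
by_position = df.iloc[0, 1]                    # integer position-based
cold = df[df["temp"] < 0]                      # boolean indexing
mild = df.query("temp >= 0 and temp <= 10")    # .query() for readability
```

Note that `df.loc["b", "temp"]` in one step replaces the chained `df["temp"]["b"]`, which can silently operate on a copy.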

Method Chaining

  • Prefer method chaining for data transformations when possible
  • Use .pipe() for applying custom functions in a chain
  • Chain operations like .assign(), .query(), .groupby(), .agg()
  • Keep chains readable by breaking across multiple lines
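
A sketch of one readable chain combining these methods; `top_rows` is a hypothetical helper passed through `.pipe()`:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "units": [10, 5, 8, 12],
    "price": [2.0, 2.0, 3.0, 3.0],
})

def top_rows(df, n):
    """Hypothetical custom step, applied via .pipe()."""
    return df.nlargest(n, "revenue")

result = (
    sales
    .assign(revenue=lambda d: d["units"] * d["price"])  # derive a column
    .query("units > 5")                                 # filter rows
    .groupby("store", as_index=False)
    .agg(revenue=("revenue", "sum"))                    # aggregate
    .pipe(top_rows, n=1)                                # custom function
)
```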

Data Cleaning and Validation

Missing Data

  • Check for missing data with .isna() and .info()
  • Handle missing data appropriately: .fillna(), .dropna(), or imputation
  • Use nullable dtypes (e.g. "Int64", "boolean"), which represent missing values as pd.NA
  • Document decisions about missing data handling
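
A minimal sketch of inspecting and handling missing values, with mean imputation as one documented choice (the data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "score": [1.0, None, 3.0],
    # Nullable boolean dtype stores the missing value as pd.NA.
    "flag": pd.array([True, None, False], dtype="boolean"),
})

missing_per_column = df.isna().sum()

# Decision: impute missing scores with the column mean.
filled = df.assign(score=df["score"].fillna(df["score"].mean()))

# Decision: drop rows where the flag is missing.
dropped = df.dropna(subset=["flag"])
```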

Data Quality Checks

  • Implement data quality checks at the beginning of analysis
  • Validate data types with .dtypes and convert as needed
  • Check for duplicates with .duplicated() and handle appropriately
  • Use .describe() for quick statistical overview
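
These checks can be sketched as a short preamble at the top of an analysis (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "amount": [10.0, 20.0, 20.0, 15.0]})

# Check dtypes before relying on them.
dtypes = df.dtypes

# Find and remove exact duplicate rows.
dup_mask = df.duplicated()
clean = df.drop_duplicates()

# Quick statistical overview of a numeric column.
stats = df["amount"].describe()
```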

Type Conversion

  • Use .astype() for explicit type conversion
  • Use pd.to_datetime() for date parsing
  • Use pd.to_numeric() with errors='coerce' for safe numeric conversion
  • Utilize categorical data types for low-cardinality string columns
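
The conversion functions above in one small example; note how `errors='coerce'` turns an unparseable value into NaN rather than raising:

```python
import pandas as pd

df = pd.DataFrame({
    "when": ["2024-01-01", "2024-02-01"],
    "qty": ["3", "oops"],
    "color": ["red", "red"],
})

df["when"] = pd.to_datetime(df["when"])                  # date parsing
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")    # "oops" -> NaN
df["color"] = df["color"].astype("category")             # low-cardinality string
```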

Grouping and Aggregation

GroupBy Operations

  • Use .groupby() for efficient aggregation operations
  • Specify aggregation functions with .agg() for multiple operations
  • Use named aggregation for clearer output column names
  • Consider .transform() for broadcasting results back to original shape
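
Named aggregation and `.transform()` on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "points": [4, 6, 1, 3],
})

# Named aggregation: output columns get explicit, readable names.
summary = df.groupby("team").agg(
    total_points=("points", "sum"),
    avg_points=("points", "mean"),
)

# .transform() broadcasts the group result back to the original shape,
# so every row carries its team's total.
df["team_total"] = df.groupby("team")["points"].transform("sum")
```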

Pivot Tables and Reshaping

  • Use .pivot_table() for multi-dimensional aggregation
  • Use .melt() to convert wide to long format
  • Use .pivot() to convert long to wide format
  • Use .stack() and .unstack() for hierarchical index manipulation
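
A round trip between wide and long form, plus a `pivot_table` aggregation (the quarterly sales data is invented):

```python
import pandas as pd

wide = pd.DataFrame({
    "store": ["a", "b"],
    "q1": [100, 200],
    "q2": [110, 190],
})

# Wide -> long with melt.
long = wide.melt(id_vars="store", var_name="quarter", value_name="sales")

# Long -> wide again with pivot.
back = long.pivot(index="store", columns="quarter", values="sales")

# Aggregation across the long form with pivot_table.
table = long.pivot_table(index="quarter", values="sales", aggfunc="sum")
```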

Performance Optimization

Memory Efficiency

  • Use categorical data types for low-cardinality strings
  • Downcast numeric types when appropriate
  • Use pd.eval() and DataFrame.eval() for large expressions to avoid materializing intermediate arrays
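
A sketch of the first two points: converting a repetitive string column to categorical and downcasting an integer column, then comparing memory with `memory_usage(deep=True)`:

```python
import pandas as pd

n = 10_000
df = pd.DataFrame({
    "status": ["open", "closed"] * (n // 2),   # low-cardinality strings
    "count": list(range(n)),                   # fits comfortably in int16
})

before = df.memory_usage(deep=True).sum()

df["status"] = df["status"].astype("category")
df["count"] = pd.to_numeric(df["count"], downcast="integer")

after = df.memory_usage(deep=True).sum()
```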

Computation Speed

  • Use vectorized operations instead of .apply() with row-wise functions
  • Prefer built-in aggregation functions over custom ones
  • Prefer .to_numpy() over the legacy .values attribute when dropping to NumPy is faster
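
Both ideas in a short sketch: a column-level vectorized operation where a row-wise `.apply()` might be tempting, and a drop to NumPy for plain array math:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Vectorized: one column-level operation instead of
# df.apply(lambda row: row["a"] + row["b"], axis=1).
df["total"] = df["a"] + df["b"]

# When only array math is needed, work on the NumPy array directly.
arr = df[["a", "b"]].to_numpy()
row_sums = arr.sum(axis=1)
```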

Avoiding Common Pitfalls

  • Avoid iterating with .iterrows() - use vectorized operations
  • Don't modify DataFrames while iterating
  • Be aware of SettingWithCopyWarning - use .copy() when needed
  • Avoid growing DataFrames row by row - collect in list and create once
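
The last two pitfalls, with the recommended alternatives: build rows in a list and construct the DataFrame once, and take an explicit `.copy()` before mutating a slice:

```python
import pandas as pd

# Instead of concatenating one row at a time inside the loop,
# collect dicts in a list and create the DataFrame once.
rows = []
for i in range(3):
    rows.append({"id": i, "squared": i ** 2})
df = pd.DataFrame(rows)

# Copy before mutating a filtered slice, so the write unambiguously
# targets the new frame and never triggers SettingWithCopyWarning.
subset = df[df["id"] > 0].copy()
subset["squared"] = subset["squared"] * 10
```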

Time Series Operations

  • Use DatetimeIndex for time series data
  • Leverage .resample() for time-based aggregation
  • Use .shift() and .diff() for lag operations
  • Use .rolling() and .expanding() for window calculations
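
These four operations on a small daily series with a DatetimeIndex:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day = ts.resample("2D").sum()          # time-based aggregation
lagged = ts.shift(1)                       # previous observation
deltas = ts.diff()                         # day-over-day change
rolling_mean = ts.rolling(window=3).mean() # trailing 3-day window
```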

Merging and Joining

  • Use .merge() for SQL-style joins
  • Specify the how parameter: 'inner', 'outer', 'left', or 'right'
  • Use validate parameter to check join cardinality
  • Use .concat() for stacking DataFrames
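
A sketch of a left join with cardinality validation, and a vertical `concat` (the customer/order data is made up):

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["ada", "bob", "eve"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [9.5, 4.0, 7.25]})

# Left join; validate="many_to_one" raises MergeError if a cust_id
# were duplicated on the customers side.
joined = orders.merge(customers, on="cust_id", how="left", validate="many_to_one")

# Stack frames vertically with concat.
more_orders = pd.DataFrame({"cust_id": [2], "amount": [1.0]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
```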

Key Conventions

  • Import as import pandas as pd
  • Use snake_case for column names when possible
  • Document data sources and transformations
  • Keep notebooks reproducible with clear cell execution order