Xarray - N-Dimensional Labeled Arrays
Xarray provides a pandas-like experience for multidimensional data. It is the core of the Pangeo ecosystem and is essential for working with NetCDF, GRIB, and Zarr formats.
When to Use
- Working with multi-dimensional scientific data (Time, Lat, Lon, Level, Ensemble).
- Analyzing climate, weather, or oceanographic datasets (NetCDF files).
- Handling large datasets that don't fit in memory (via Dask integration).
- Performing complex broadcasting and alignment based on dimension names instead of axis indices.
- Storing metadata (units, descriptions) directly inside the data object.
- Remote sensing and geospatial imaging analysis.
Reference Documentation
Official docs: https://docs.xarray.dev/
Tutorials: https://tutorial.xarray.dev/
Search patterns: xr.DataArray, xr.Dataset, ds.sel, ds.groupby, ds.resample, xr.open_dataset
Core Principles
DataArray vs Dataset
| Structure | Description | Analogy |
|---|---|---|
| DataArray | A single labeled N-dimensional array. | Like a pandas.Series but N-D. |
| Dataset | A dict-like container of multiple DataArrays. | Like a pandas.DataFrame but N-D. |
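A minimal sketch of the analogy (variable names are hypothetical): a Dataset is a dict-like container of DataArrays that share dimensions, much as a DataFrame holds Series that share an index.
import numpy as np
import xarray as xr
# Two variables sharing the same dimensions, like columns of a DataFrame
temp = xr.DataArray(np.random.rand(3, 2), dims=("time", "lon"))
precip = xr.DataArray(np.random.rand(3, 2), dims=("time", "lon"))
ds = xr.Dataset({"temp": temp, "precip": precip})
print(ds["temp"])  # access one DataArray by name, like df["col"]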
Key Concepts
- Dimensions: Names of the axes (e.g., x, y, time).
- Coordinates: Values associated with dimensions (e.g., actual timestamps or latitude values).
- Attributes: Arbitrary metadata (e.g., units='Kelvin', standard_name='air_temperature').
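A quick illustration of these three concepts, reusing the da object built in the creation example below:
print(da.dims)    # ('time', 'lon') -- the axis names
print(da.coords)  # labels along those axes (timestamps, longitudes)
print(da.attrs)   # free-form metadata, e.g. {'units': 'degC'}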
Quick Reference
Installation
pip install xarray netCDF4 dask zarr
Standard Imports
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Basic Pattern - Creation
import xarray as xr
import numpy as np
import pandas as pd
# Create a DataArray
data = np.random.rand(4, 3)
times = pd.date_range("2023-01-01", periods=4)
lons = [-120, -110, -100]
da = xr.DataArray(
    data,
    coords={"time": times, "lon": lons},
    dims=("time", "lon"),
    name="temp",
    attrs={"units": "degC"},
)
# Convert to Dataset
ds = da.to_dataset()
print(ds)
Critical Rules
✅ DO
- Use Named Dimensions - Always use dim=('time', 'lat', 'lon') instead of integer axes.
- Select by Labels - Use .sel() for coordinate values and .isel() for integer indices.
- Lazy Loading - Use chunks={} in open_dataset to handle large files with Dask.
- Keep Metadata - Populate .attrs to ensure your data is self-describing.
- Alignment - Let Xarray handle broadcasting; it will automatically align data based on coordinate values.
- Accessor Power - Use .dt for datetime and .str for string operations (see the sketch below).
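A short sketch of the two accessors mentioned above (all names here are illustrative):
import pandas as pd
import xarray as xr
da = xr.DataArray(
    range(4),
    coords={"time": pd.date_range("2023-01-01", periods=4)},
    dims="time",
)
print(da.time.dt.month)   # .dt mirrors pandas datetime components
names = xr.DataArray(["Model_A", "Model_B"], dims="model")
print(names.str.lower())  # .str mirrors pandas string methods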
❌ DON'T
- Use Integer Indexing - Avoid data[0, :, 5] (unreadable and fragile). Use .isel(time=0, lon=5).
- Ignore the Encoding - When saving to NetCDF, check ds.encoding for compression/scaling settings (see the sketch below).
- Manual Loops - Don't loop over time steps; use .groupby() or .resample().
- Forget Dask - For datasets larger than RAM, ensure Dask is installed and chunks are defined.
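A minimal sketch of controlling encoding on write; the variable name "temp" and the filename are assumptions:
# Set compression explicitly instead of relying on defaults
encoding = {"temp": {"zlib": True, "complevel": 4}}
ds.to_netcdf("output_compressed.nc", encoding=encoding)
# On read, per-variable encoding is preserved on the variable itself
ds2 = xr.open_dataset("output_compressed.nc")
print(ds2["temp"].encoding)  # shows zlib, complevel, dtype, etc.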
Anti-Patterns (NEVER)
# ❌ BAD: Positional indexing (what is axis 1? lat or lon?)
mean_val = ds.temp.mean(axis=1)
# ✅ GOOD: Named dimension reduction (clear and robust)
mean_val = ds.temp.mean(dim='lat')
# ❌ BAD: Manual time slicing with list comprehensions
# subset = [ds.sel(time=t) for t in my_times if t > '2020-01-01']
# ✅ GOOD: Built-in slicing
subset = ds.sel(time=slice('2020-01-01', '2021-12-31'))
# ❌ BAD: Losing metadata during numpy conversion
raw_data = ds.temp.values # Now it's just a numpy array, units are gone!
# ✅ GOOD: Keep in Xarray as long as possible
processed = ds.temp * 10 # Units and coords are preserved
Selection and Indexing
sel vs isel
# Select by coordinate values
subset = ds.sel(lat=45.0, lon=slice(-100, -80))
# Select by index (integer)
first_step = ds.isel(time=0)
# Nearest neighbor lookup
point = ds.sel(lat=42.1, lon=-71.2, method="nearest")
# Multi-dimensional selection
high_temp_days = ds.where(ds.temp > 30, drop=True)
Computation and Math
Broadcasting and Alignment
# Xarray aligns automatically by coordinate names
da1 = xr.DataArray([1, 2], coords=[[1, 2]], dims=['x'])
da2 = xr.DataArray([1, 2, 3], coords=[[1, 2, 3]], dims=['y'])
# Broadcasts over the union of dims: result has shape (2, 3) with dims ('x', 'y')
result = da1 + da2
# Mathematical operations preserve coordinates
log_temp = np.log(ds.temp)
anomalies = ds.temp - ds.temp.mean(dim='time')
GroupBy and Resampling
Time Series and Spatial Aggregation
# Monthly means
monthly = ds.resample(time="1MS").mean()
# Climatology (group by month regardless of year)
climatology = ds.groupby("time.month").mean()
# Calculate anomalies relative to climatology
anomalies = ds.groupby("time.month") - climatology
# Rolling window (Moving average)
rolling_mean = ds.rolling(time=7, center=True).mean()
File I/O (NetCDF, Zarr)
Reading and Writing
# Open a single file
ds = xr.open_dataset("weather_data.nc")
# Open multiple files (MFDataset)
ds_all = xr.open_mfdataset("data/*.nc", combine="by_coords", chunks={'time': 100})
# Write to NetCDF
ds.to_netcdf("output.nc")
# Write to Zarr (Cloud optimized)
ds.to_zarr("data.zarr")
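Reading the store back is symmetric; a short sketch (path reused from above):
ds_back = xr.open_zarr("data.zarr")  # lazy and Dask-backed by default
print(ds_back.chunks)                # chunking comes from the on-disk layout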
Plotting
A high-level wrapper around Matplotlib
# 1D plot
ds.temp.sel(lat=0, lon=0, method='nearest').plot()
# 2D map
ds.temp.isel(time=0).plot(cmap='RdBu_r', robust=True)
# Faceting (Subplots)
ds.temp.isel(time=slice(0, 4)).plot(col="time", col_wrap=2)
Integration with pandas and NumPy
# To Pandas
df = ds.to_dataframe()
# From Pandas
new_ds = xr.Dataset.from_dataframe(df)
# To NumPy (loses coordinates and attrs)
arr = ds.temp.values
# Interoperability
# Xarray objects work in many SciPy/NumPy functions
from scipy.signal import detrend
detrended = xr.apply_ufunc(detrend, ds.temp, input_core_dims=[['time']], output_core_dims=[['time']])
Advanced: Dask for Big Data
Out-of-core computation
# Opening with chunks creates Dask arrays
ds = xr.open_dataset("huge_file.nc", chunks={'time': 500, 'lat': 100, 'lon': 100})
# Computation is now lazy
result = ds.temp.mean(dim='time') # Returns immediately
# Trigger computation
final_val = result.compute()
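If Dask is installed, a progress bar can be attached to local computations; a small sketch:
from dask.diagnostics import ProgressBar
with ProgressBar():
    final_val = result.compute()  # prints progress while chunks are reduced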
Practical Workflows
1. Global Temperature Anomaly Workflow
def calculate_temp_anomaly(filepath):
    """Calculate monthly anomalies from NetCDF data."""
    ds = xr.open_dataset(filepath)
    # 1. Compute climatology (mean for each month of the year)
    climatology = ds.temp.groupby("time.month").mean("time")
    # 2. Subtract climatology from original data
    anomalies = ds.temp.groupby("time.month") - climatology
    # 3. Global mean anomaly
    # Weighted by cos(lat) because grid cells get smaller at the poles
    weights = np.cos(np.deg2rad(ds.lat))
    weights.name = "weights"
    anom_weighted = anomalies.weighted(weights)
    return anom_weighted.mean(("lat", "lon"))
# ts_anomaly = calculate_temp_anomaly("global_temps.nc")
2. Multi-Model Ensemble Analysis
def analyze_ensemble(file_list):
    """Combine multiple model runs into a single dataset with a 'model' dimension."""
    datasets = [xr.open_dataset(f) for f in file_list]
    model_names = ["Model_A", "Model_B", "Model_C"]
    # Concatenate along a new dimension
    combined = xr.concat(datasets, dim=pd.Index(model_names, name="model"))
    # Calculate ensemble mean and spread
    ens_mean = combined.mean(dim="model")
    ens_std = combined.std(dim="model")
    return ens_mean, ens_std
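A hypothetical invocation (filenames assumed), mirroring the commented usage of workflow 1:
# ens_mean, ens_std = analyze_ensemble(["run_a.nc", "run_b.nc", "run_c.nc"])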
3. Satellite Image Processing (NDVI)
def calculate_ndvi(ds):
    """Calculate NDVI from Red and NIR bands in an Xarray Dataset."""
    # NDVI = (NIR - Red) / (NIR + Red)
    red = ds.sel(band='red')
    nir = ds.sel(band='nir')
    ndvi = (nir - red) / (nir + red)
    ndvi.attrs['long_name'] = "Normalized Difference Vegetation Index"
    return ndvi
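A hypothetical invocation; satellite_ds is assumed to carry a 'band' coordinate:
# ndvi = calculate_ndvi(satellite_ds)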
Performance Optimization
Chunking Strategies
# ❌ Problem: Small chunks lead to massive overhead
# ds = ds.chunk({'time': 1, 'lat': 1, 'lon': 1})
# ✅ Solution: Aim for 10MB - 100MB per chunk
ds = ds.chunk({'time': -1, 'lat': 100, 'lon': 100})
Vectorization with apply_ufunc
# Wrap a custom numpy function to work on Xarray objects efficiently
def my_complex_stat(x):
    return np.median(x) * np.std(x)

result = xr.apply_ufunc(
    my_complex_stat,
    ds.temp,
    input_core_dims=[['time']],  # the dimension the function consumes
    vectorize=True,
    dask="parallelized",
)
Common Pitfalls and Solutions
Coordinate Mismatch
# ❌ Problem: DataArrays don't align due to floating point jitter in lat/lon
# ✅ Solution: Use .interp_like() or .reindex_like()
ds2_aligned = ds2.interp_like(ds1)
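When grids should match exactly rather than be interpolated, reindex_like with a nearest-neighbor tolerance is an alternative; a sketch (the tolerance value is an assumption):
ds2_aligned = ds2.reindex_like(ds1, method="nearest", tolerance=1e-6)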
Memory Blow-Up with .values
# ❌ Problem: Calling .values on a huge Dask-backed array crashes the machine
# ✅ Solution: Use .compute() or subset first
subset_val = ds.temp.isel(time=0).values # This is safer
Slicing issues (Start/End)
# ❌ Problem: slice(10, 0) returns empty because order is wrong
# ✅ Solution: Check if your index is ascending or descending
# ds.sortby('lat').sel(lat=slice(-90, 90))
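A concrete sketch of guarding against coordinate order (variable names assumed):
if ds.lat[0] < ds.lat[-1]:            # ascending latitudes
    band = ds.sel(lat=slice(-30, 30))
else:                                 # descending, common in some reanalyses
    band = ds.sel(lat=slice(30, -30))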
Xarray is the bridge between raw N-dimensional math and high-level data analysis. Its ability to handle labels and metadata makes scientific code self-documenting and significantly more reliable.