# Pandera: DataFrame Validation
Pandera is an open-source framework for validating DataFrame-like objects at runtime. Define schemas once and reuse them across pandas, polars, Dask, Modin, PySpark, and Ibis backends.
## Import Convention

Since pandera v0.24.0, use the backend-specific module. The top-level `pandera` module emits a `FutureWarning` and is slated for deprecation in v0.29.0.

```python
import pandera.pandas as pa  # pandas (recommended)
import pandera.polars as pa  # polars
from pandera.typing.pandas import DataFrame, Series, Index
```
## Two Schema Styles

### Object-based API (`DataFrameSchema`)

Suitable for dynamic schema construction, or when schemas need to be built programmatically.

```python
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.gt(0)),
    "email": pa.Column(str, pa.Check.str_matches(r"^[^@]+@[^@]+\.[^@]+$")),
    "score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(1.0)]),
    "status": pa.Column(str, pa.Check.isin(["active", "inactive", "banned"])),
})

validated = schema.validate(df)
```
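As a quick sanity check, validation returns the DataFrame on success and raises on the first violation. A sketch with made-up rows (the sample data is illustrative, not part of any real dataset):

```python
import pandas as pd
from pandera.errors import SchemaError

good = pd.DataFrame({
    "user_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "score": [0.5, 0.9],
    "status": ["active", "banned"],
})
schema.validate(good)  # passes; returns the validated DataFrame

try:
    schema.validate(good.assign(score=[1.5, 0.9]))  # 1.5 violates le(1.0)
except SchemaError as exc:
    print(exc)  # the message names the column and the failed check
```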
### Class-based API (`DataFrameModel`) — preferred

Pydantic-style syntax with type annotations. Produces cleaner, reusable schemas that integrate with `@pa.check_types`.

```python
import pandera.pandas as pa
from pandera.typing.pandas import DataFrame, Series

class UserSchema(pa.DataFrameModel):
    user_id: int = pa.Field(gt=0)
    email: str = pa.Field(str_matches=r"^[^@]+@[^@]+\.[^@]+$")
    score: float = pa.Field(ge=0.0, le=1.0)
    status: str = pa.Field(isin=["active", "inactive", "banned"])

    class Config:
        strict = True   # reject extra columns
        coerce = False  # do not silently cast types

# Validate directly
UserSchema.validate(df)

# Or via typing annotation + decorator
@pa.check_types
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
```
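With `@pa.check_types`, bad data surfaces at the call boundary. A sketch using illustrative rows:

```python
import pandas as pd
from pandera.errors import SchemaError

users = pd.DataFrame({
    "user_id": [1],
    "email": ["a@example.com"],
    "score": [0.5],
    "status": ["active"],
})
process(users)  # both input and output are validated

try:
    process(users.assign(user_id=[-1]))  # violates Field(gt=0)
except SchemaError as exc:
    print(exc)
```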
## Checks

### Built-in Checks (prefer these over lambdas)

```python
pa.Check.gt(0)                     # greater than
pa.Check.ge(0)                     # greater than or equal
pa.Check.lt(100)                   # less than
pa.Check.le(100)                   # less than or equal
pa.Check.eq("value")               # equal to
pa.Check.ne("value")               # not equal to
pa.Check.isin(["a", "b"])          # membership
pa.Check.notin(["x"])              # exclusion
pa.Check.str_matches(r"^\d+$")     # regex match
pa.Check.in_range(0, 100)          # closed interval
pa.Check.str_startswith("prefix")
pa.Check.str_endswith("suffix")
pa.Check.str_length(1, 255)        # min/max string length
```
### Custom Checks

```python
# Vectorized (default, faster — operates on the whole Series)
pa.Check(lambda s: s.str.len() <= 255)

# Element-wise (scalar input; use only when vectorized is impractical)
pa.Check(lambda x: x > 0, element_wise=True)

# Always add an error message
pa.Check(lambda s: s > 0, error="values must be positive")
```
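Custom checks plug in anywhere built-ins do. A minimal sketch with an assumed `name` column:

```python
schema = pa.DataFrameSchema({
    "name": pa.Column(
        str,
        pa.Check(
            lambda s: s.str.len() <= 255,
            error="name must be at most 255 characters",
        ),
    ),
})
```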
### DataFrame-level Checks

```python
schema = pa.DataFrameSchema(
    columns={...},
    checks=pa.Check(lambda df: df["end_date"] >= df["start_date"]),
)
```

In `DataFrameModel`, use `@pa.dataframe_check`:

```python
import pandas as pd
import pandera.pandas as pa

class Schema(pa.DataFrameModel):
    start_date: int
    end_date: int

    @pa.dataframe_check
    @classmethod
    def end_after_start(cls, df: pd.DataFrame) -> pd.Series:
        return df["end_date"] >= df["start_date"]
```
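Validating against this model works like any other (sample data illustrative):

```python
import pandas as pd

Schema.validate(pd.DataFrame({"start_date": [1, 5], "end_date": [3, 6]}))  # passes
# A row with end_date < start_date would fail the end_after_start check
```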
## Nullable and Optional Columns

Note the distinction: `nullable=True` permits null values within a column that is present, while `Optional` permits the column itself to be absent.

```python
from typing import Optional

import pandera.pandas as pa
from pandera.typing.pandas import Series

# Object API: allow nulls in a column
pa.Column(float, nullable=True)

# DataFrameModel: make a column optional (it may be absent)
class Schema(pa.DataFrameModel):
    required_col: Series[int]
    optional_col: Optional[Series[float]]
```
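A frame missing `optional_col` still validates; when the column is present, it is type-checked as usual (sample data illustrative):

```python
import pandas as pd

Schema.validate(pd.DataFrame({"required_col": [1, 2]}))  # optional_col absent: OK
Schema.validate(pd.DataFrame({"required_col": [1], "optional_col": [0.5]}))  # also OK
```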
## Coercion

Enable coercion to cast data to the declared type before validation. Use it deliberately — coercion can hide upstream data issues.

```python
# Per-column
pa.Column(int, coerce=True)

# Per-field, or schema-wide via Config
class Schema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)

    class Config:
        coerce = True
```
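For instance, string-typed years are cast to `int` before the `gt` check runs (sample data illustrative):

```python
import pandas as pd

out = Schema.validate(pd.DataFrame({"year": ["2021", "2022"]}))
print(out["year"].dtype)  # int64 after coercion
```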
## Lazy Validation — Collect All Errors

By default pandera raises on the first error. Use `lazy=True` to collect all failures before raising, useful for batch reporting.

```python
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)  # DataFrame of all failures
```
## Decorator Integration

Integrate validation transparently into pipelines using decorators.

```python
# DataFrameModel + check_types (recommended)
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=df["units"] * df["price"])

# Object API: check_input / check_output
@pa.check_input(input_schema)
@pa.check_output(output_schema)
def pipeline_step(df):
    return df

# check_io: concisely specify both
@pa.check_io(raw=input_schema, out=output_schema)
def pipeline_step(raw):
    return raw
```

Decorators work on sync/async functions, methods, class methods, and static methods.
## Schema Inheritance

Build specialized schemas from a base to avoid repetition.

```python
class BaseEvent(pa.DataFrameModel):
    event_id: str
    timestamp: int = pa.Field(gt=0)

class ClickEvent(BaseEvent):
    url: str
    user_agent: str

    class Config:
        strict = True
```
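`ClickEvent` then validates both the inherited fields and its own (sample data illustrative):

```python
import pandas as pd

ClickEvent.validate(pd.DataFrame({
    "event_id": ["e1"],
    "timestamp": [1700000000],
    "url": ["https://example.com/a"],
    "user_agent": ["Mozilla/5.0"],
}))
```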
## Schema Persistence (YAML / Script)

Serialize and reload schemas to keep validation reproducible.

```python
import pandera.io

# Save
pandera.io.to_yaml(schema, "./schema.yaml")

# Load
schema = pandera.io.from_yaml("./schema.yaml")

# Generate a Python script
pandera.io.to_script(schema, "./schema_definition.py")
```
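These helpers need pandera's optional io dependencies (`pip install 'pandera[io]'`). As a sketch, calling `to_yaml` without a path returns the YAML as a string, which allows an in-memory roundtrip:

```python
import pandera.io

yaml_str = pandera.io.to_yaml(schema)              # no path: returns a YAML string
schema_roundtrip = pandera.io.from_yaml(yaml_str)  # accepts a string or a path
```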
## Schema Inference (Prototyping Only)

Infer a schema from existing data to bootstrap development. Always review and tighten the generated schema before using it in production.

```python
import pandera.pandas as pa

inferred = pa.infer_schema(df)
print(inferred.to_script())  # inspect, then copy-edit
```
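A typical workflow is to tighten the inferred result before committing it; a sketch using `update_column` (the column name here is assumed):

```python
inferred = pa.infer_schema(df)

# Inference only sees observed values; add the real constraint explicitly
schema = inferred.update_column("score", checks=pa.Check.in_range(0.0, 1.0))
```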
## Dropping Invalid Rows

Use `drop_invalid_rows=True` on `DataFrameSchema` to filter out failing rows instead of raising an error. Supported on the pandas and polars backends; it requires lazy validation (`lazy=True`).

```python
schema = pa.DataFrameSchema(
    {"score": pa.Column(float, pa.Check.ge(0))},
    drop_invalid_rows=True,
)

cleaned = schema.validate(df_with_bad_rows, lazy=True)
```
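For example (sample data illustrative):

```python
import pandas as pd

raw = pd.DataFrame({"score": [0.5, -1.0, 0.9]})
cleaned = schema.validate(raw, lazy=True)  # the -1.0 row is dropped
print(len(cleaned))  # 2
```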
## Error Handling

```python
from pandera.errors import SchemaError, SchemaErrors

# Single error (eager validation)
try:
    schema.validate(df)
except SchemaError as exc:
    print(exc.failure_cases)  # Series/DataFrame of failures

# Multiple errors (lazy validation)
try:
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    print(exc.error_counts)   # counts of failures by error type
    print(exc.failure_cases)  # every failure case in one DataFrame
    # exc.message holds a structured dict with SCHEMA and DATA keys
```
## Key Configuration Options (Config)

| Option | Type | Effect |
|---|---|---|
| `strict` | `bool` | Raise if extra columns are present |
| `coerce` | `bool` | Cast columns to declared dtypes |
| `ordered` | `bool` | Require columns in declared order |
| `name` | `str` | Schema name shown in error messages |
| `add_missing_columns` | `bool` | Insert missing columns with default values |
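For illustration, a hypothetical model combining several of these options:

```python
class OrdersSchema(pa.DataFrameModel):
    order_id: int
    total: float

    class Config:
        strict = True          # no extra columns
        ordered = True         # order_id must come before total
        name = "OrdersSchema"  # shown in error messages
```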
## Best Practices

- Use `DataFrameModel` over `DataFrameSchema` for new code — cleaner syntax, inheritance, and type-annotation integration.
- Prefer `strict=True` to catch unexpected extra columns early.
- Use built-in checks (`Check.gt`, `Check.isin`, etc.) over custom lambdas where possible — they produce better error messages.
- Write vectorized checks (`element_wise=False`, the default) for performance; only use `element_wise=True` when the logic is truly scalar.
- Always add `error=` messages to custom `Check` objects to improve debuggability.
- Use lazy validation in pipelines that process large batches so all failures surface in one pass.
- Never rely on inferred schemas in production — always explicitly define constraints.
- Use `coerce=True` deliberately — set it at the column level to limit scope; avoid schema-wide coercion unless certain.
- Prefer `raise_warning=True` only for non-critical, informational checks (e.g., normality tests), not for data integrity constraints; see the sketch below.
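A minimal sketch of such an advisory check (the threshold is made up):

```python
# Logs a warning instead of raising; reserve for informational checks only
advisory = pa.Check(
    lambda s: s.mean() > 0.5,
    raise_warning=True,
    error="mean score below expected threshold (informational)",
)
score_col = pa.Column(float, advisory)
```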
## Additional Resources

- `references/checks-and-validation.md` — Built-in check catalog, groupby checks, wide checks, hypothesis testing
- `references/dataframe-models.md` — Field spec, schema inheritance, MultiIndex, aliases, parsers, Polars usage