survy — Survey Data Analysis Skill

survy is a lightweight Python library for processing, transforming, and analyzing survey data. Its central design principle is treating survey constructs — especially multiselect questions — as first-class concepts rather than awkward DataFrame workarounds.

Install: pip install survy

Powered by: Polars (all DataFrames returned are Polars, not pandas)

1. Core Objects

Survey

Top-level container. Created via read_* functions — never instantiate directly. Access variables with survey["Q1"]. Print for a compact summary.

Variable

Wraps a single Polars Series plus survey metadata. Key attributes:

Attribute	Type	Description
`id`	`str`	Column name (read/write via property)
`label`	`str`	Human-readable label (read/write); defaults to `id` if unset
`vtype`	`VarType`	One of `VarType.SELECT`, `VarType.MULTISELECT`, `VarType.NUMBER`
`value_indices`	`dict[str, int]`	Answer code → numeric index mapping; always empty `{}` for NUMBER
`base`	`int`	Count of non-null/non-empty responses
`len`	`int`	Total row count including nulls
`dtype`	`polars.DataType`	Underlying Polars dtype
`frequencies`	`polars.DataFrame`	Frequency table (value, count, proportion)
`sps`	`str`	SPSS syntax for this variable

2. Reading Data

All readers return a Survey object. The key challenge survy solves at read time is recognizing multiselect questions — questions where one respondent can choose multiple answers. Raw data encodes these in two very different layouts, and survy needs to know which layout it's looking at so it can merge the data into a single logical variable.

Multiselect: Compact Format vs Wide Format

Compact format stores all selected answers in a single cell, joined by a separator (typically ;). One column = one question.

id,  gender,  hobby
1,   Male,    Sport;Book
2,   Female,  Sport;Movie
3,   Male,    Movie

Here hobby is one column. The cell "Sport;Book" means the respondent chose both Sport and Book. survy splits this cell on the separator to recover the individual choices.

Wide format spreads each possible answer across its own column, using a shared prefix plus a numeric suffix (_1, _2, ...). Multiple columns = one question.

id,  gender,  hobby_1,  hobby_2,  hobby_3
1,   Male,    Book,     ,         Sport
2,   Female,  ,         Movie,    Sport
3,   Male,    ,         Movie,

Here hobby_1, hobby_2, hobby_3 are three columns that together represent the single hobby question. survy groups them by matching the prefix pattern and merges them into one multiselect variable named hobby.

After reading, both formats produce the exact same Survey variable internally — a MULTISELECT variable whose data is a sorted list of chosen values per respondent:

hobby: [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
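The equivalence of the two layouts can be sketched in plain Python. This is only an illustration of the idea, not survy's internal code; the helper names `from_compact` and `from_wide` are hypothetical.

```python
# Illustrative stand-ins for the two raw layouts shown above (not survy internals).
compact_rows = ["Sport;Book", "Sport;Movie", "Movie"]
wide_rows = [
    {"hobby_1": "Book", "hobby_2": None, "hobby_3": "Sport"},
    {"hobby_1": None, "hobby_2": "Movie", "hobby_3": "Sport"},
    {"hobby_1": None, "hobby_2": "Movie", "hobby_3": None},
]


def from_compact(cell, separator=";"):
    """Split one compact cell into a sorted list of choices."""
    return sorted(part for part in cell.split(separator) if part)


def from_wide(row):
    """Collect the non-empty cells of one wide row into a sorted list."""
    return sorted(value for value in row.values() if value)


compact_result = [from_compact(cell) for cell in compact_rows]
wide_result = [from_wide(row) for row in wide_rows]
```

Both `compact_result` and `wide_result` end up as the same list of sorted lists, which is the shape shown for `hobby` above.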

How survy detects each format

Wide format is detected automatically by the name_pattern regex (default "id(_multi)?"). Any columns whose names share a base ID followed by _1, _2, etc. are grouped into one multiselect variable. This happens by default — no extra parameters needed.
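As a rough sketch of prefix grouping, the snippet below groups column names of the form `<base>_<number>`. The regular expression here is a hypothetical stand-in chosen for the illustration; it is not the syntax survy's `name_pattern` uses, and `group_wide_columns` is not a survy function.

```python
import re
from collections import defaultdict

# Hypothetical pattern for this sketch only: "<base>_<number>".
WIDE_COLUMN = re.compile(r"^(?P<base>.+)_(?P<index>\d+)$")


def group_wide_columns(columns):
    """Split column names into wide groups (keyed by base ID) and standalone columns."""
    groups = defaultdict(list)
    singles = []
    for name in columns:
        match = WIDE_COLUMN.match(name)
        if match:
            groups[match.group("base")].append(name)
        else:
            singles.append(name)
    return dict(groups), singles


groups, singles = group_wide_columns(["id", "gender", "hobby_1", "hobby_2", "hobby_3"])
```

Here `groups` maps `hobby` to its three columns, while `id` and `gender` stay standalone.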

Compact format is NOT detected by default because a semicolon in a cell could be regular text. You must tell survy which columns are compact in one of two ways:

compact_ids — explicitly list the column IDs that are compact multiselect. Use this when you know exactly which columns use the compact encoding.
auto_detect=True — survy scans every column's data for the compact_separator character. Any column containing that character in at least one cell is treated as compact multiselect. Use this when you don't know which columns are compact but you know the separator.

Rule: Do NOT combine auto_detect=True with compact_ids in the same call. Use one approach or the other.
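The scanning rule described for auto_detect can be approximated in a few lines. This is a sketch of the stated rule ("at least one cell contains the separator"), not survy's implementation; `detect_compact_columns` and the `comment` column are made up for the example.

```python
def detect_compact_columns(table, separator=";"):
    """Return the column names in which at least one cell contains the separator."""
    return [
        name
        for name, cells in table.items()
        if any(isinstance(cell, str) and separator in cell for cell in cells)
    ]


table = {
    "gender": ["Male", "Female", "Male"],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],
    "comment": ["Fine", "Good; could be cheaper", None],  # free text with a semicolon
}
detected = detect_compact_columns(table)
```

In this sketch the free-text `comment` column is flagged along with `hobby`, which is the kind of false positive that makes an explicit compact_ids list the safer choice when the compact columns are known.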

Shared Reader Parameters (read_csv, read_excel, read_polars only)

These parameters control multiselect detection and apply to read_csv, read_excel, and read_polars. They do NOT apply to read_spss (which reads SPSS-native metadata) or read_json (which reads survy's own format where variable types are already resolved).

Parameter	Type	Default	Description
`compact_ids`	`list[str] | None`	`None`	Column IDs to treat as compact multiselect
`compact_separator`	`str`	`";"`	Separator used to split compact cells
`auto_detect`	`bool`	`False`	Auto-detect compact columns by scanning for separator
`name_pattern`	`str`	`"id(_multi)?"`	Regex for grouping wide multiselect columns by prefix

read_csv / read_excel

import survy

# --- Compact format data ---

# Option A: you know which columns are compact
survey = survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")

# Option B: let survy scan for the separator automatically
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")

# --- Wide format data ---
# Wide detection is automatic via name_pattern (default works for Q1_1, Q1_2, ...)
survey = survy.read_csv("data_wide.csv")

# Pass name_pattern explicitly if your columns use a different suffix convention
# (the value shown here is simply the default)
survey = survy.read_csv("data_wide.csv", name_pattern="id(_multi)?")

# --- Mixed: some columns are wide, some are compact ---
# Wide is always auto-detected; just specify the compact ones
survey = survy.read_csv("data_mixed.csv", compact_ids=["Q5"], compact_separator=";")

# Excel — identical API to read_csv
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")

read_json

Reads a survy-format JSON file. The file must have this exact structure:

{
  "variables": [
    {
      "id": "gender",
      "data": ["Male", "Female", "Male"],
      "label": "Gender of respondent",
      "value_indices": {"Female": 1, "Male": 2}
    },
    {
      "id": "yob",
      "data": [2000, 1999, 1998],
      "label": "",
      "value_indices": {}
    },
    {
      "id": "hobby",
      "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
      "label": "Hobbies",
      "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
    }
  ]
}

Key rules for the JSON structure:

Top-level key must be "variables" (a list of variable objects).
Each variable must have "id", "data", "label", and "value_indices".
SELECT variables: "data" is a flat list of strings (or nulls).
NUMBER variables: "data" is a flat list of numbers; "value_indices" must be {}.
MULTISELECT variables: "data" is a list of lists of strings.
"value_indices" maps each answer text to a numeric index; only applied when non-empty.
Read vs Write difference: to_json() writes an extra "vtype" field per variable (e.g. "select", "multi_select", "number"). read_json() ignores this field — it re-infers the type from the data. So if you're building JSON manually, you can omit "vtype".
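If you build the JSON by hand, a small script along these lines can check the structure before passing the file to read_json. It uses only the standard library; `infer_vtype` is a guess at how a type could be inferred from the data under the rules above, not survy's actual inference logic.

```python
import json

REQUIRED_KEYS = {"id", "data", "label", "value_indices"}


def infer_vtype(data):
    """Guess a variable type from its data (hypothetical helper, not survy's code)."""
    values = [value for value in data if value is not None]
    if any(isinstance(value, list) for value in values):
        return "MULTISELECT"
    if values and all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in values
    ):
        return "NUMBER"
    return "SELECT"


document = {
    "variables": [
        {"id": "gender", "data": ["Male", "Female", None],
         "label": "Gender", "value_indices": {"Female": 1, "Male": 2}},
        {"id": "yob", "data": [2000, 1999, 1998],
         "label": "", "value_indices": {}},
        {"id": "hobby", "data": [["Book", "Sport"], ["Movie"], []],
         "label": "Hobbies", "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}},
    ]
}

# Keys each variable is missing (every list should be empty).
missing_keys = {v["id"]: sorted(REQUIRED_KEYS - set(v)) for v in document["variables"]}
inferred = {v["id"]: infer_vtype(v["data"]) for v in document["variables"]}

# Serialize the same way to_json is described: 4-space indent, non-ASCII preserved.
text = json.dumps(document, indent=4, ensure_ascii=False)
```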

survey = survy.read_json("data.json")

read_polars

Constructs a Survey from an existing Polars DataFrame. The extra parameter exclude_null (default True) drops columns that have no responses or contain only empty lists.

import polars
import survy

df = polars.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "yob": [2000, 1999, 1998],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],
    "animal_1": ["Cat", "", "Cat"],
    "animal_2": ["Dog", "Dog", ""],
})

# Compact and wide detection work the same way as in read_csv / read_excel
survey = survy.read_polars(df, auto_detect=True)
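The effect of exclude_null can be pictured with plain dictionaries. This sketch assumes that "no responses" means every cell is None or an empty list; the exact rule survy applies may differ, and `drop_empty_columns` is a hypothetical helper.

```python
def drop_empty_columns(table):
    """Keep only columns that have at least one answered cell."""

    def answered(cell):
        # Assumption for this sketch: None and [] both count as "no response".
        return cell is not None and cell != []

    return {
        name: cells
        for name, cells in table.items()
        if any(answered(cell) for cell in cells)
    }


table = {
    "gender": ["Male", None, "Male"],
    "unused": [None, None, None],
    "hobby": [["Sport"], [], ["Movie"]],
    "skipped": [[], [], []],
}
kept = drop_empty_columns(table)
```

Only `gender` and `hobby` survive, because `unused` and `skipped` contain no answered cells.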

3. Modifying the Survey

survey.update() — batch label/value_indices

survey.update([
    {"id": "Q1", "label": "Satisfaction", "value_indices": {"good": 1, "bad": 2}},
    {"id": "Q2", "label": "Channels used"},
])

Silently skips value_indices for NUMBER variables. Warns and skips unknown IDs.

survey.add() — add a variable

import polars

survey.add(some_variable)                     # an existing Variable object
survey.add(polars.Series("new", [1, 2, 3]))   # auto-wrapped into Variable

If the ID already exists, a numeric suffix is appended (e.g. "Q1#1").
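One way such de-duplication could work is sketched below. Only the "Q1#1" form comes from the text above; the incrementing counter for further collisions is an assumption, and `unique_id` is a hypothetical helper rather than part of survy.

```python
def unique_id(candidate, existing):
    """Return candidate, or candidate plus a '#n' suffix if the ID is already taken."""
    if candidate not in existing:
        return candidate
    counter = 1
    # Assumption: keep counting up until a free ID is found.
    while f"{candidate}#{counter}" in existing:
        counter += 1
    return f"{candidate}#{counter}"


first = unique_id("Q1", {"Q1", "Q2"})
second = unique_id("Q1", {"Q1", "Q1#1"})
untouched = unique_id("Q3", {"Q1", "Q2"})
```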

survey.drop() — remove a variable

survey.drop("Q3")   # silently ignored if not found

survey.sort() — reorder variables

survey.sort()                                      # alphabetical by id (default)
survey.sort(key=lambda v: v.base, reverse=True)    # by response count desc

variable.replace() — recode values

survey["gender"].replace({"Male": "M", "Female": "F"})

Works for both SELECT and MULTISELECT. Automatically rebuilds value_indices.
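The recode-then-rebuild behaviour can be sketched in plain Python. The sketch assumes that rebuilt indices are assigned alphabetically starting at 1, as in the JSON example in Section 2; survy's actual numbering is not confirmed here, and both helpers are hypothetical.

```python
def recode(data, mapping):
    """Replace values in SELECT (scalar) or MULTISELECT (list) cells."""

    def swap(value):
        return mapping.get(value, value)

    recoded = []
    for cell in data:
        if cell is None:
            recoded.append(None)
        elif isinstance(cell, list):
            recoded.append(sorted(swap(choice) for choice in cell))
        else:
            recoded.append(swap(cell))
    return recoded


def rebuild_value_indices(data):
    """Assumed rule: alphabetical order, indices starting at 1."""
    values = set()
    for cell in data:
        if cell is None:
            continue
        values.update(cell if isinstance(cell, list) else [cell])
    return {value: index for index, value in enumerate(sorted(values), start=1)}


gender = recode(["Male", "Female", "Male", None], {"Male": "M", "Female": "F"})
indices = rebuild_value_indices(gender)
```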

Direct property assignment

v = survey["Q1"]
v.id = "satisfaction"
v.label = "Overall satisfaction"
v.value_indices = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}

Caution on value_indices setter: Raises DataStructureError if any existing value in the data is missing from the new mapping. You must cover ALL values present in the data.
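The coverage requirement amounts to a set comparison, sketched below. The sketch raises a plain ValueError as a stand-in for survy's DataStructureError, and `check_value_indices` is a hypothetical helper.

```python
def check_value_indices(data, value_indices):
    """Raise if any value present in the data has no entry in value_indices."""
    present = set()
    for cell in data:
        if cell is None:
            continue
        present.update(cell if isinstance(cell, list) else [cell])
    missing = sorted(present - set(value_indices))
    if missing:
        # Stand-in for survy's DataStructureError.
        raise ValueError(f"value_indices is missing: {missing}")


data = ["very_satisfied", "neutral", None, "dissatisfied"]
try:
    check_value_indices(data, {"very_satisfied": 1, "satisfied": 2, "neutral": 3})
    error = None
except ValueError as exc:
    error = str(exc)
```

Extra keys such as `satisfied` are harmless in this sketch; only values that occur in the data but not in the mapping trigger the error.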

4. Filtering

Returns a new Survey (original is not mutated).

filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male")   # single value also works

For MULTISELECT, a row is kept if any of its selected values appears in the filter list.
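The "any selected value matches" rule can be written out as follows. This is a plain-Python sketch of the semantics described above, with a hypothetical `keep_row` helper; it does not call survy.

```python
def keep_row(cell, wanted):
    """True when a SELECT value, or any MULTISELECT choice, is in `wanted`."""
    if cell is None:
        return False
    choices = cell if isinstance(cell, list) else [cell]
    return any(choice in wanted for choice in choices)


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
wanted = {"Sport", "Book"}
kept_rows = [index for index, cell in enumerate(hobby) if keep_row(cell, wanted)]
```

Rows 0 and 1 are kept because each contains at least one wanted value; row 2 (only "Movie") is dropped.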

5. Getting a DataFrame

df = survey.get_df(
    select_dtype="text",          # "text" | "number"
    multiselect_dtype="compact",  # "compact" | "text" | "number"
)

select_dtype: "text" keeps string codes (default); "number" converts via value_indices.

multiselect_dtype:

"compact" → one List[str] column per multiselect (default)
"text" → expands to wide columns Q_1, Q_2, ... with string or null
"number" → expands to wide columns with 1/0 binary flags

Returns Polars DataFrame — use Polars methods, not pandas.

Valid dtype literals: "text", "number", "compact". Never "numeric" or "string".
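The two wide expansions of a multiselect variable can be illustrated without Polars. The sketch assumes the wide columns are numbered by value_indices, which matches the hobby example in Section 2 but is not confirmed as survy's rule; `expand` is a hypothetical helper.

```python
value_indices = {"Book": 1, "Movie": 2, "Sport": 3}
hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]


def expand(rows, value_indices, prefix, as_number):
    """Expand list cells into one column per answer: 1/0 flags or text/None."""
    columns = {}
    for answer, index in sorted(value_indices.items(), key=lambda item: item[1]):
        name = f"{prefix}_{index}"
        if as_number:
            columns[name] = [1 if answer in row else 0 for row in rows]
        else:
            columns[name] = [answer if answer in row else None for row in rows]
    return columns


as_text = expand(hobby, value_indices, "hobby", as_number=False)
as_flags = expand(hobby, value_indices, "hobby", as_number=True)
```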

6. Analysis

Frequency table

survey["Q1"].frequencies
# → Polars DataFrame: columns [variable_id, "count", "proportion"]
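For intuition, a frequency table of this shape can be computed by hand as below. The sketch assumes proportions are taken over the base (respondents with a non-empty answer); whether survy uses that denominator, especially for multiselect variables, is an assumption.

```python
from collections import Counter


def frequencies(data):
    """Return (value, count, proportion) tuples, with proportion relative to the base."""
    counts = Counter()
    base = 0
    for cell in data:
        if cell is None:
            choices = []
        elif isinstance(cell, list):
            choices = cell
        else:
            choices = [cell]
        if choices:
            base += 1
            counts.update(choices)
    return [(value, count, count / base) for value, count in sorted(counts.items())]


table = frequencies([["Book", "Sport"], ["Movie", "Sport"], ["Movie"], None])
```

With three answering respondents, "Sport" and "Movie" each have a count of 2 and a proportion of 2/3, so multiselect proportions can sum to more than 1 under this definition.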

Crosstab

result = survy.crosstab(
    column=survey["gender"],     # grouping variable (columns)
    row=survey["hobby"],         # analyzed variable (rows)
    filter=None,                 # optional: segment by another variable
    aggfunc="count",             # "count" | "percent" | "mean" | "median" | "sum"
    alpha=0.05,                  # significance level for stat tests
)
# Returns dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter-value when filter is provided
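What a "count" crosstab tabulates can be sketched with a Counter. The snippet mirrors the column/row roles and the "Total" key described above, but it is a hand-rolled illustration without significance labels, not a call into survy.

```python
from collections import Counter

gender = ["Male", "Female", "Male"]                          # column variable
hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]   # row variable (multiselect)

# Count every (row value, column value) pair; a respondent contributes once per choice.
cells = Counter()
for column_value, choices in zip(gender, hobby):
    for choice in choices:
        cells[(choice, column_value)] += 1

row_values = sorted({choice for choice, _ in cells})
column_values = sorted(set(gender))
table = {
    row: {column: cells[(row, column)] for column in column_values}
    for row in row_values
}
result = {"Total": table}  # same keying as the unfiltered case described above
```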

aggfunc options:

"count" — cell counts with significance letter labels (z-test)
"percent" — column-wise proportions with significance labels
Numeric ("mean", "median", "sum") — aggregates row variable; Welch's t-test for significance
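For readers unfamiliar with the tests named above, here is a textbook sketch of a pooled two-proportion z-test and of Welch's t statistic with its degrees of freedom. Which exact variants survy uses (pooled or unpooled, any corrections) is not stated in this document, so treat this as background rather than a description of the library.

```python
import math
import statistics


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    term_a = statistics.variance(sample_a) / len(sample_a)
    term_b = statistics.variance(sample_b) / len(sample_b)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(term_a + term_b)
    df = (term_a + term_b) ** 2 / (
        term_a ** 2 / (len(sample_a) - 1) + term_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


z, p_value = two_proportion_z_test(60, 100, 40, 100)
significant = p_value < 0.05  # alpha as in the crosstab call above
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Turning the Welch statistic into a p-value needs the Student t distribution, which the standard library does not provide, so the sketch stops at `t` and `df`.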

filter: Pass a Variable to segment the crosstab into one table per filter value.

7. Exporting

All export methods take a directory path (not a file path) plus an optional name (the base filename).

to_csv / to_excel

Writes three files per export:

{name}_data.csv — the actual survey responses. Format depends on compact param.
{name}_variables_info.csv — variable metadata with columns: id, vtype (SINGLE/MULTISELECT/NUMBER), label.
{name}_values_info.csv — value-to-index mappings with columns: id, text, index.

The compact parameter (default False) controls how multiselect variables appear in the data file: True joins values into one cell (e.g. "Book;Sport"), False expands into wide columns (e.g. hobby_1, hobby_2, hobby_3).
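To make the directory-plus-name convention concrete, the sketch below derives the three file names listed above from a directory and a base name. `export_paths` is a hypothetical helper for illustration; survy builds these paths itself.

```python
from pathlib import PurePosixPath


def export_paths(directory, name, extension="csv"):
    """Build the three output paths described above from a directory and base name."""
    base = PurePosixPath(directory)
    suffixes = ("data", "variables_info", "values_info")
    return [str(base / f"{name}_{suffix}.{extension}") for suffix in suffixes]


paths = export_paths("output/", "results")
```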

# Default (compact=False) — multiselect expanded to wide columns
survey.to_csv("output/", name="results")

# Compact mode — multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")

# Excel — identical API and output structure (.xlsx files instead of .csv)
survey.to_excel("output/", name="results")
survey.to_excel("output/", name="results", compact=True)

to_spss

Writes {name}.sav (data) + {name}.sps (syntax). Requires pyreadstat.

survey.to_spss("output/", name="results")

to_json

Writes {name}.json in the same structure read_json expects (see Section 2), plus an extra "vtype" field per variable that read_json ignores on re-read. Pretty-printed with 4-space indent, non-ASCII preserved (ensure_ascii=False).

survey.to_json("output/", name="results")

Common mistake (applies to every export method): do NOT pass a file path such as "output/results.csv". Pass the directory and set name= instead.

SPSS Syntax

print(survey.sps)  # full syntax: VARIABLE LABELS, VALUE LABELS, MRSETS, CTABLES

8. Gotchas & Rules

auto_detect vs compact_ids: Never combine both.
value_indices setter must cover all existing data values — raises DataStructureError otherwise.
value_indices is silently skipped for NUMBER variables (in update() and direct set).
Export path is a directory, not a file.
get_df() returns Polars, not pandas.
filter() returns a new Survey — does not mutate.
update() resets label to "" if "label" key is absent from the metadata dict.
Empty strings become None during CSV/Excel read.
Multiselect values are sorted alphabetically within each row.
All variables in a crosstab must have the same row count.
read_csv raises FileTypeError if the file extension is not .csv. Same for read_excel with non-.xlsx.
to_csv/to_excel default is compact=False — multiselect variables are expanded to wide columns unless you explicitly pass compact=True.

9. Quick Reference

Task	Code
Load CSV auto-detect	`survy.read_csv("f.csv", auto_detect=True, compact_separator=";")`
Load CSV explicit compact	`survy.read_csv("f.csv", compact_ids=["Q2"], compact_separator=";")`
Load CSV wide format	`survy.read_csv("f.csv")` (wide detected automatically)
Load JSON	`survy.read_json("f.json")`
Load from Polars DF	`survy.read_polars(df, auto_detect=True)`
Inspect variable	`survey["Q1"].vtype`, `.base`, `.len`, `.label`, `.value_indices`, `.dtype`
Frequencies	`survey["Q1"].frequencies`
Crosstab count	`survy.crosstab(survey["Q1"], survey["Q2"])`
Crosstab percent	`survy.crosstab(survey["Q1"], survey["Q2"], aggfunc="percent")`
Crosstab with filter	`survy.crosstab(survey["col"], survey["row"], filter=survey["seg"])`
Crosstab mean	`survy.crosstab(survey["col"], survey["row"], aggfunc="mean")`
Filter respondents	`survey.filter("Q1", ["a", "b"])`
Replace values	`survey["Q1"].replace({"old": "new"})`
Add variable	`survey.add(polars.Series("x", [1,2,3]))`
Drop variable	`survey.drop("Q3")`
Sort variables	`survey.sort(key=lambda v: v.id)`
Batch update labels	`survey.update([{"id":"Q1","label":"...","value_indices":{...}}])`
Get compact DF	`survey.get_df()`
Get wide binary DF	`survey.get_df(multiselect_dtype="number")`
Export CSV	`survey.to_csv("output/", name="results")`
Export SPSS	`survey.to_spss("output/", name="results")`
Export JSON	`survey.to_json("output/", name="results")`
SPSS syntax string	`survey.sps`
Serialize variable	`survey["Q1"].to_dict()`

10. Reference Files

references/api_reference.md — Complete method signatures with all parameters and return types
scripts/validate_survey.py — Loads a survey file, checks missing labels/value_indices, prints report
scripts/batch_export.py — Reads a survey and exports to CSV, Excel, SPSS, and JSON
assets/sample_data.csv — Wide-format sample dataset
assets/sample_data_compact.csv — Compact-format sample dataset
