survy
survy — Survey Data Analysis Skill
survy is a lightweight Python library for processing, transforming, and analyzing survey data.
Its central design principle is treating survey constructs — especially multiselect questions — as
first-class concepts rather than awkward DataFrame workarounds.
Install: pip install survy
Powered by: Polars (all DataFrames returned are Polars, not pandas)
1. Core Objects
Survey
Top-level container. Created via read_* functions — never instantiate directly.
Access variables with survey["Q1"]. Print for a compact summary.
Variable
Wraps a single Polars Series plus survey metadata. Key attributes:
| Attribute | Type | Description |
|---|---|---|
id |
str |
Column name (read/write via property) |
label |
str |
Human-readable label (read/write); defaults to id if unset |
vtype |
VarType |
One of VarType.SELECT, VarType.MULTISELECT, VarType.NUMBER |
value_indices |
dict[str, int] |
Answer code → numeric index mapping; always empty {} for NUMBER |
base |
int |
Count of non-null/non-empty responses |
len |
int |
Total row count including nulls |
dtype |
polars.DataType |
Underlying Polars dtype |
frequencies |
polars.DataFrame |
Frequency table (value, count, proportion) |
sps |
str |
SPSS syntax for this variable |
2. Reading Data
All readers return a Survey object. The key challenge survy solves at read time is
recognizing multiselect questions — questions where one respondent can choose multiple
answers. Raw data encodes these in two very different layouts, and survy needs to know
which layout it's looking at so it can merge the data into a single logical variable.
Multiselect: Compact Format vs Wide Format
Compact format stores all selected answers in a single cell, joined by a separator
(typically ;). One column = one question.
id, gender, hobby
1, Male, Sport;Book
2, Female, Sport;Movie
3, Male, Movie
Here hobby is one column. The cell "Sport;Book" means the respondent chose both
Sport and Book. survy splits this cell on the separator to recover the individual choices.
Wide format spreads each possible answer across its own column, using a shared prefix
plus a numeric suffix (_1, _2, ...). Multiple columns = one question.
id, gender, hobby_1, hobby_2, hobby_3
1, Male, Book, , Sport
2, Female, , Movie, Sport
3, Male, , Movie,
Here hobby_1, hobby_2, hobby_3 are three columns that together represent the single
hobby question. survy groups them by matching the prefix pattern and merges them into
one multiselect variable named hobby.
After reading, both formats produce the exact same Survey variable internally — a
MULTISELECT variable whose data is a sorted list of chosen values per respondent:
hobby: [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
How survy detects each format
Wide format is detected automatically by the name_pattern regex (default "id(_multi)?").
Any columns whose names share a base ID followed by _1, _2, etc. are grouped into one
multiselect variable. This happens by default — no extra parameters needed.
Compact format is NOT detected by default because a semicolon in a cell could be regular text. You must tell survy which columns are compact in one of two ways:
-
compact_ids— explicitly list the column IDs that are compact multiselect. Use this when you know exactly which columns use the compact encoding. -
auto_detect=True— survy scans every column's data for thecompact_separatorcharacter. Any column containing that character in at least one cell is treated as compact multiselect. Use this when you don't know which columns are compact but you know the separator.
Rule: Do NOT combine auto_detect=True with compact_ids in the same call.
Use one approach or the other.
Shared Reader Parameters (read_csv, read_excel, read_polars only)
These parameters control multiselect detection and apply to read_csv, read_excel,
and read_polars. They do NOT apply to read_spss (which reads SPSS-native metadata)
or read_json (which reads survy's own format where variable types are already resolved).
| Parameter | Type | Default | Description |
|---|---|---|---|
compact_ids |
`list[str] | None` | None |
compact_separator |
str |
";" |
Separator used to split compact cells |
auto_detect |
bool |
False |
Auto-detect compact columns by scanning for separator |
name_pattern |
str |
"id(_multi)?" |
Regex for grouping wide multiselect columns by prefix |
read_csv / read_excel
import survy
# --- Compact format data ---
# Option A: you know which columns are compact
survey = survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")
# Option B: let survy scan for the separator automatically
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")
# --- Wide format data ---
# Wide detection is automatic via name_pattern (default works for Q1_1, Q1_2, ...)
survey = survy.read_csv("data_wide.csv")
# Custom name_pattern if your columns use a different suffix convention
survey = survy.read_csv("data_wide.csv", name_pattern="id(_multi)?")
# --- Mixed: some columns are wide, some are compact ---
# Wide is always auto-detected; just specify the compact ones
survey = survy.read_csv("data_mixed.csv", compact_ids=["Q5"], compact_separator=";")
# Excel — identical API to read_csv
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")
read_json
Reads a survy-format JSON file. The file must have this exact structure:
{
"variables": [
{
"id": "gender",
"data": ["Male", "Female", "Male"],
"label": "Gender of respondent",
"value_indices": {"Female": 1, "Male": 2}
},
{
"id": "yob",
"data": [2000, 1999, 1998],
"label": "",
"value_indices": {}
},
{
"id": "hobby",
"data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
"label": "Hobbies",
"value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
}
]
}
Key rules for the JSON structure:
- Top-level key must be
"variables"(a list of variable objects). - Each variable must have
"id","data","label", and"value_indices". - SELECT variables:
"data"is a flat list of strings (or nulls). - NUMBER variables:
"data"is a flat list of numbers;"value_indices"must be{}. - MULTISELECT variables:
"data"is a list of lists of strings. "value_indices"maps each answer text to a numeric index; only applied when non-empty.- Read vs Write difference:
to_json()writes an extra"vtype"field per variable (e.g."select","multi_select","number").read_json()ignores this field — it re-infers the type from the data. So if you're building JSON manually, you can omit"vtype".
survey = survy.read_json("data.json")
read_polars
Construct a Survey from an existing Polars DataFrame. Extra parameter exclude_null
(default True) drops columns with no responses or all-empty lists.
import polars, survy
df = polars.DataFrame({
"gender": ["Male", "Female", "Male"],
"yob": [2000, 1999, 1998],
"hobby": ["Sport;Book", "Sport;Movie", "Movie"],
"animal_1": ["Cat", "", "Cat"],
"animal_2": ["Dog", "Dog", ""],
})
# Same concepts of compact - wide format as csv/ excel
survey = survy.read_polars(df, auto_detect=True)
3. Modifying the Survey
survey.update() — batch label/value_indices
survey.update([
{"id": "Q1", "label": "Satisfaction", "value_indices": {"good": 1, "bad": 2}},
{"id": "Q2", "label": "Channels used"},
])
Silently skips value_indices for NUMBER variables. Warns and skips unknown IDs.
survey.add() — add a variable
survey.add(some_variable) # Variable object
survey.add(polars.Series("new", [1, 2, 3])) # auto-wrapped into Variable
If the ID already exists, a numeric suffix is appended (e.g. "Q1#1").
survey.drop() — remove a variable
survey.drop("Q3") # silently ignored if not found
survey.sort() — reorder variables
survey.sort() # alphabetical by id (default)
survey.sort(key=lambda v: v.base, reverse=True) # by response count desc
variable.replace() — recode values
survey["gender"].replace({"Male": "M", "Female": "F"})
Works for both SELECT and MULTISELECT. Automatically rebuilds value_indices.
Direct property assignment
v = survey["Q1"]
v.id = "satisfaction"
v.label = "Overall satisfaction"
v.value_indices = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}
Caution on value_indices setter: Raises DataStructureError if any existing value
in the data is missing from the new mapping. You must cover ALL values present in the data.
4. Filtering
Returns a new Survey (original is not mutated).
filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male") # single value also works
For MULTISELECT, a row is kept if any of its selected values appears in the filter list.
5. Getting a DataFrame
df = survey.get_df(
select_dtype="text", # "text" | "number"
multiselect_dtype="compact", # "compact" | "text" | "number"
)
select_dtype: "text" keeps string codes (default); "number" converts via value_indices.
multiselect_dtype:
"compact"→ oneList[str]column per multiselect (default)"text"→ expands to wide columnsQ_1,Q_2, ... with string ornull"number"→ expands to wide columns with1/0binary flags
Returns Polars DataFrame — use Polars methods, not pandas.
Valid dtype literals: "text", "number", "compact". Never "numeric" or "string".
6. Analysis
Frequency table
survey["Q1"].frequencies
# → Polars DataFrame: columns [variable_id, "count", "proportion"]
Crosstab
result = survy.crosstab(
column=survey["gender"], # grouping variable (columns)
row=survey["hobby"], # analyzed variable (rows)
filter=None, # optional: segment by another variable
aggfunc="count", # "count" | "percent" | "mean" | "median" | "sum"
alpha=0.05, # significance level for stat tests
)
# Returns dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter-value when filter is provided
aggfunc options:
"count"— cell counts with significance letter labels (z-test)"percent"— column-wise proportions with significance labels- Numeric (
"mean","median","sum") — aggregates row variable; Welch's t-test for significance
filter: Pass a Variable to segment the crosstab into one table per filter value.
7. Exporting
All exports take a directory path (not file path) + optional name (base filename).
to_csv / to_excel
Writes three files per export:
{name}_data.csv— the actual survey responses. Format depends oncompactparam.{name}_variables_info.csv— variable metadata with columns:id,vtype(SINGLE/MULTISELECT/NUMBER),label.{name}_values_info.csv— value-to-index mappings with columns:id,text,index.
The compact parameter (default False) controls how multiselect variables appear in the
data file: True joins values into one cell (e.g. "Book;Sport"), False expands into
wide columns (e.g. hobby_1, hobby_2, hobby_3).
# Default (compact=False) — multiselect expanded to wide columns
survey.to_csv("output/", name="results")
# Compact mode — multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")
# Excel — identical API and output structure (.xlsx files instead of .csv)
survey.to_excel("output/", name="results")
survey.to_excel("output/", name="results", compact=True)
to_spss
Writes {name}.sav (data) + {name}.sps (syntax). Requires pyreadstat.
survey.to_spss("output/", name="results")
to_json
Writes {name}.json in the same structure read_json expects (see Section 2), plus an
extra "vtype" field per variable that read_json ignores on re-read. Pretty-printed
with 4-space indent, non-ASCII preserved (ensure_ascii=False).
survey.to_json("output/", name="results")
Common mistake: Do NOT pass "output/results.csv". Pass directory + name=.
SPSS Syntax
print(survey.sps) # full syntax: VARIABLE LABELS, VALUE LABELS, MRSETS, CTABLES
8. Gotchas & Rules
auto_detectvscompact_ids: Never combine both.value_indicessetter must cover all existing data values — raisesDataStructureErrorotherwise.value_indicesis silently skipped for NUMBER variables (inupdate()and direct set).- Export path is a directory, not a file.
get_df()returns Polars, not pandas.filter()returns a new Survey — does not mutate.update()resets label to""if"label"key is absent from the metadata dict.- Empty strings become
Noneduring CSV/Excel read. - Multiselect values are sorted alphabetically within each row.
- All variables in a crosstab must have the same row count.
read_csvraisesFileTypeErrorif the file extension is not.csv. Same forread_excelwith non-.xlsx.to_csv/to_exceldefault iscompact=False— multiselect variables are expanded to wide columns unless you explicitly passcompact=True.
9. Quick Reference
| Task | Code |
|---|---|
| Load CSV auto-detect | survy.read_csv("f.csv", auto_detect=True, compact_separator=";") |
| Load CSV explicit compact | survy.read_csv("f.csv", compact_ids=["Q2"], compact_separator=";") |
| Load CSV wide format | survy.read_csv("f.csv") (wide detected automatically) |
| Load JSON | survy.read_json("f.json") |
| Load from Polars DF | survy.read_polars(df, auto_detect=True) |
| Inspect variable | survey["Q1"].vtype, .base, .len, .label, .value_indices, .dtype |
| Frequencies | survey["Q1"].frequencies |
| Crosstab count | survy.crosstab(survey["Q1"], survey["Q2"]) |
| Crosstab percent | survy.crosstab(survey["Q1"], survey["Q2"], aggfunc="percent") |
| Crosstab with filter | survy.crosstab(survey["col"], survey["row"], filter=survey["seg"]) |
| Crosstab mean | survy.crosstab(survey["col"], survey["row"], aggfunc="mean") |
| Filter respondents | survey.filter("Q1", ["a", "b"]) |
| Replace values | survey["Q1"].replace({"old": "new"}) |
| Add variable | survey.add(polars.Series("x", [1,2,3])) |
| Drop variable | survey.drop("Q3") |
| Sort variables | survey.sort(key=lambda v: v.id) |
| Batch update labels | survey.update([{"id":"Q1","label":"...","value_indices":{...}}]) |
| Get compact DF | survey.get_df() |
| Get wide binary DF | survey.get_df(multiselect_dtype="number") |
| Export CSV | survey.to_csv("output/", name="results") |
| Export SPSS | survey.to_spss("output/", name="results") |
| Export JSON | survey.to_json("output/", name="results") |
| SPSS syntax string | survey.sps |
| Serialize variable | survey["Q1"].to_dict() |
10. Reference Files
references/api_reference.md— Complete method signatures with all parameters and return typesscripts/validate_survey.py— Loads a survey file, checks missing labels/value_indices, prints reportscripts/batch_export.py— Reads a survey and exports to CSV, Excel, SPSS, and JSONassets/sample_data.csv— Wide-format sample datasetassets/sample_data_compact.csv— Compact-format sample dataset