survey-analysis
survy — Survey Data Analysis Skill
survy is a lightweight Python library for processing, transforming, and analyzing survey data.
Its central design principle is treating survey constructs — especially multiselect questions — as
first-class concepts rather than awkward DataFrame workarounds.
Install: Always install the latest version — pip install --upgrade survy
Powered by: Polars (all DataFrames returned are Polars, not pandas)
1. Core Objects
Survey
Top-level container. Created via read_* functions — never instantiate directly.
Access variables with survey["Q1"]. Print for a compact summary.
Variable
Wraps a single Polars Series plus survey metadata. Key attributes:
| Attribute | Type | Description |
|---|---|---|
id |
str |
Column name (read/write via property) |
label |
str |
Human-readable label (read/write); defaults to id if unset |
vtype |
VarType |
One of VarType.SELECT, VarType.MULTISELECT, VarType.NUMBER |
value_indices |
dict[str, int] |
Answer code → numeric index mapping; always empty {} for NUMBER |
base |
int |
Count of non-null/non-empty responses |
len |
int |
Total row count including nulls |
dtype |
polars.DataType |
Underlying Polars dtype |
frequencies |
polars.DataFrame |
Frequency table (value, count, proportion) |
sps |
str |
SPSS syntax for this variable |
2. Reading Data
All readers return a Survey object. The key challenge survy
solves at read time is recognizing multiselect questions —
questions where one respondent can choose multiple answers.
Raw data encodes these in two very different layouts, and survy
needs to know which layout it's looking at so it can merge the
data into a single logical variable.
Multiselect: Compact Format vs Wide Format
Compact format stores all selected answers in a single
cell, joined by a separator (typically ;).
One column = one question.
id, gender, hobby
1, Male, Sport;Book
2, Female, Sport;Movie
3, Male, Movie
Here hobby is one column. The cell "Sport;Book" means the
respondent chose both Sport and Book. survy splits this cell on
the separator to recover the individual choices.
Wide format spreads each possible answer across its own
column, using a shared prefix plus a numeric suffix
(_1, _2, ...). Multiple columns = one question.
id, gender, hobby_1, hobby_2, hobby_3
1, Male, Book, , Sport
2, Female, , Movie, Sport
3, Male, , Movie,
Here hobby_1, hobby_2, hobby_3 are three columns that
together represent the single hobby question. survy groups
them by matching the prefix pattern and merges them into one
multiselect variable named hobby.
After reading, both formats produce the exact same Survey
variable internally — a MULTISELECT variable whose data is a
sorted list of chosen values per respondent:
hobby: [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
How survy detects each format
Wide format is detected via name_pattern — a format template (NOT a raw regex)
with two named tokens and a set of reserved separators:
- Tokens:
id(base variable name),multi(suffix for wide columns) - Reserved separators:
_,.,:— these are always treated as delimiters between tokens when parsing column names
With the default pattern "id(_multi)?":
hobby_1→id="hobby",multi="1"→ grouped as wide multiselecthobby_2→ sameid="hobby"→ merged withhobby_1gender→ no suffix → normal column
Other patterns:
"id.multi"→ matchesQ1.1,Q1.2, ..."id:multi"→ matchesQ1:a,Q1:b, ...
Separator conflict warning: If a column name contains more than one reserved
separator (e.g. my.var_1), parse_id will fail because it can't unambiguously
split the name into tokens. Before loading, rename such columns so only one
separator is used (e.g. rename my.var_1 to myvar_1 or my@var_1).
Compact format is NOT detected by default because a semicolon in a cell could be regular text. You must tell survy which columns are compact in one of two ways:
compact_ids— explicitly list the column IDs that are compact multiselect.auto_detect=True— survy scans every column for thecompact_separatorcharacter; any column containing it in at least one cell is treated as compact.
Rule: Do NOT combine auto_detect=True with compact_ids in the same call.
read_spss
Reads an SPSS .sav file. SPSS files are always wide format — compact multiselect does not
apply and compact_ids / auto_detect are not parameters. Wide multiselect columns (e.g.
hobby_1, hobby_2) are still auto-detected and merged via name_pattern. Value labels stored
in the .sav file are applied automatically, so variables come back as text ("Male", "Female")
rather than numeric codes. Requires pyreadstat.
# Wide multiselect detected automatically
survey = survy.read_spss("data.sav")
# Custom suffix convention (Q1.1, Q1.2, ...)
survey = survy.read_spss("data.sav", name_pattern="id.multi")
Rule: Do NOT pass compact_ids or auto_detect to read_spss — those parameters don't exist.
Shared Reader Parameters (read_csv, read_excel, read_polars only)
These parameters control multiselect detection and apply to
read_csv, read_excel, and read_polars. They do NOT apply
to read_json (which reads survy's own format where variable
types are already resolved).
| Parameter | Type | Default | Description |
|---|---|---|---|
compact_ids |
list[str] | None |
None |
Column IDs to treat as compact multiselect |
compact_separator |
str |
";" |
Separator used to split compact cells |
auto_detect |
bool |
False |
Auto-detect compact columns by scanning for separator |
name_pattern |
str |
"id(_multi)?" |
Format template for wide column names. Tokens: id, multi. Separators: _ . :. Not a raw regex. |
read_csv / read_excel
import survy
# --- Compact format data ---
# Option A: you know which columns are compact
survey = survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")
# Option B: let survy scan for the separator automatically
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")
# --- Wide format data ---
# Wide detection is automatic via name_pattern (default works for Q1_1, Q1_2, ...)
survey = survy.read_csv("data_wide.csv")
# Custom name_pattern if your columns use a different suffix convention
survey = survy.read_csv("data_wide.csv", name_pattern="id(_multi)?")
# --- Mixed: some columns are wide, some are compact ---
survey = survy.read_csv("data_mixed.csv", name_pattern="id(_multi)?", auto_detect=True)
# Excel — identical API to read_csv
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")
read_json
Reads a survy-format JSON file. The file must have this exact structure:
{
"variables": [
{
"id": "gender",
"data": ["Male", "Female", "Male"],
"label": "Gender of respondent",
"value_indices": {"Female": 1, "Male": 2}
},
{
"id": "yob",
"data": [2000, 1999, 1998],
"label": "",
"value_indices": {}
},
{
"id": "hobby",
"data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
"label": "Hobbies",
"value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
}
]
}
Key rules for the JSON structure:
- Top-level key must be
"variables"(a list of variable objects). - Each variable must have
"id","data","label", and"value_indices". - SELECT variables:
"data"is a flat list of strings (or nulls). - NUMBER variables:
"data"is a flat list of numbers;"value_indices"must be{}. - MULTISELECT variables:
"data"is a list of lists of strings. "value_indices"maps each answer text to a numeric index; only applied when non-empty.- Read vs Write difference:
to_json()writes an extra"vtype"field per variable (e.g."select","multi_select","number").read_json()ignores this field — it re-infers the type from the data. So if you're building JSON manually, you can omit"vtype".
survey = survy.read_json("data.json")
read_polars
Construct a Survey from an existing Polars DataFrame.
Extra parameter exclude_null (default True) drops columns
with no responses or all-empty lists.
read_polars also have same concepts of wide/ compact format
as read_csv.
import polars, survy
df = polars.DataFrame({
"gender": ["Male", "Female", "Male"],
"yob": [2000, 1999, 1998],
"hobby": ["Sport;Book", "Sport;Movie", "Movie"],
"animal_1": ["Cat", "", "Cat"],
"animal_2": ["Dog", "Dog", ""],
})
survey = survy.read_polars(df, auto_detect=True)
3. Modifying the Survey
survey.update() — batch label/value_indices
survey.update([
{"id": "Q1", "label": "Satisfaction", "value_indices": {"good": 1, "bad": 2}},
{"id": "Q2", "label": "Channels used"},
])
Silently skips value_indices for NUMBER variables.
Warns and skips unknown IDs.
survey.add() — add a variable
survey.add(some_variable) # Variable object
survey.add(polars.Series("new", [1, 2, 3])) # auto-wrapped into Variable
If the ID already exists, a numeric suffix is appended (e.g. "Q1#1").
survey.drop() — remove a variable
survey.drop("Q3") # silently ignored if not found
survey.sort() — reorder variables
survey.sort() # alphabetical by id (default)
survey.sort(key=lambda v: v.base, reverse=True) # by response count desc
variable.replace() — recode values
survey["gender"].replace({"Male": "M", "Female": "F"})
Works for both SELECT and MULTISELECT. Automatically rebuilds value_indices.
Direct property assignment
v = survey["Q1"]
v.id = "satisfaction"
v.label = "Overall satisfaction"
v.value_indices = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}
Caution on value_indices setter: Raises
DataStructureError if any existing value in the data is
missing from the new mapping. You must cover ALL values
present in the data.
4. Filtering
Returns a new Survey (original is not mutated).
filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male") # single value also works
For MULTISELECT, a row is kept if any of its selected values appears in the filter list.
5. Getting a DataFrame
df = survey.get_df(
select_dtype="text", # "text" | "number"
multiselect_dtype="compact", # "compact" | "text" | "number"
)
select_dtype: "text" keeps string codes (default); "number" converts via value_indices.
multiselect_dtype:
"compact"→ oneList[str]column per multiselect (default)"text"→ expands to wide columnsQ_1,Q_2, ... with string ornull"number"→ expands to wide columns with1/0binary flags
Returns Polars DataFrame — use Polars methods, not pandas.
Valid dtype literals: "text", "number", "compact". Never "numeric" or "string".
6. Analysis
Frequency table
survey["Q1"].frequencies
# → Polars DataFrame: columns [variable_id, "count", "proportion"]
Crosstab
result = survy.crosstab(
column=survey["gender"], # grouping variable (columns)
row=survey["hobby"], # analyzed variable (rows)
filter=None, # optional: segment by another variable
aggfunc="count", # "count" | "percent" | "mean" | "median" | "sum"
alpha=0.05, # significance level for stat tests
)
# Returns dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter-value when filter is provided
aggfunc options:
"count"— cell counts with significance letter labels (z-test)"percent"— column-wise proportions with significance labels- Numeric (
"mean","median","sum") — aggregates row variable; Welch's t-test for significance
filter: Pass a Variable to segment the crosstab into one table per filter value.
7. Exporting
All exports take a directory path (not file path) + optional name (base filename).
to_csv / to_excel
Writes three files per export:
{name}_data.csv— the actual survey responses. Format depends oncompactparam.{name}_variables_info.csv— variable metadata with columns:id,vtype(SINGLE/MULTISELECT/NUMBER),label.{name}_values_info.csv— value-to-index mappings with columns:id,text,index.
The compact parameter (default False) controls how multiselect variables appear in the
data file: True joins values into one cell (e.g. "Book;Sport"), False expands into
wide columns (e.g. hobby_1, hobby_2, hobby_3).
# Default (compact=False) — multiselect expanded to wide columns
survey.to_csv("output/", name="results")
# Compact mode — multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")
# Excel — identical API and output structure (.xlsx files instead of .csv)
survey.to_excel("output/", name="results")
survey.to_excel("output/", name="results", compact=True)
to_spss
Writes {name}.sav (data) + {name}.sps (syntax). Requires pyreadstat.
survey.to_spss("output/", name="results")
to_json
Writes {name}.json in the same structure read_json expects (see Section 2), plus an
extra "vtype" field per variable that read_json ignores on re-read. Pretty-printed
with 4-space indent, non-ASCII preserved (ensure_ascii=False).
survey.to_json("output/", name="results")
Common mistake: Do NOT pass "output/results.csv". Pass directory + name=.
SPSS Syntax
print(survey.sps) # full syntax: VARIABLE LABELS, VALUE LABELS, MRSETS, CTABLES
8. Gotchas & Rules
auto_detectvscompact_ids: Never combine both.value_indicessetter must cover all existing data values — raisesDataStructureErrorotherwise.value_indicesis silently skipped for NUMBER variables (inupdate()and direct set).- Export path is a directory, not a file.
get_df()returns Polars, not pandas.filter()returns a new Survey — does not mutate.- Empty strings become
Noneduring CSV/Excel read. - Multiselect values are sorted alphabetically within each row.
- All variables in a crosstab must have the same row count.
read_csvraisesFileTypeErrorif the file extension is not.csv. Same forread_excelwith non-.xlsxandread_spsswith non-.sav.to_csv/to_exceldefault iscompact=False— multiselect variables are expanded to wide columns unless you explicitly passcompact=True.- Column names must not contain multiple reserved separators
(
_,.,:). If a column likemy.var_1uses more than one,parse_idwill fail. Rename before loading so only one separator appears (e.g.myvar_1).
9. Quick Reference
| Task | Code |
|---|---|
| Load CSV auto-detect | survy.read_csv("f.csv", auto_detect=True, compact_separator=";") |
| Load CSV explicit compact | survy.read_csv("f.csv", compact_ids=["Q2"], compact_separator=";") |
| Load CSV wide format | survy.read_csv("f.csv", name_pattern="id(_multi)?") (wide detected automatically) |
| Load SPSS | survy.read_spss("f.sav") |
| Load JSON | survy.read_json("f.json") |
| Load from Polars DF | survy.read_polars(df, auto_detect=True) |
| Inspect variable | survey["Q1"].vtype, .base, .len, .label, .value_indices, .dtype |
| Frequencies | survey["Q1"].frequencies |
| Crosstab count | survy.crosstab(survey["Q1"], survey["Q2"]) |
| Crosstab percent | survy.crosstab(survey["Q1"], survey["Q2"], aggfunc="percent") |
| Crosstab with filter | survy.crosstab(survey["col"], survey["row"], filter=survey["seg"]) |
| Crosstab mean | survy.crosstab(survey["col"], survey["row"], aggfunc="mean") |
| Filter respondents | survey.filter("Q1", ["a", "b"]) |
| Replace values | survey["Q1"].replace({"old": "new"}) |
| Add variable | survey.add(polars.Series("x", [1,2,3])) |
| Drop variable | survey.drop("Q3") |
| Sort variables | survey.sort(key=lambda v: v.id) |
| Batch update labels | survey.update([{"id":"Q1","label":"...","value_indices":{...}}]) |
| Get compact DF | survey.get_df() |
| Get wide binary DF | survey.get_df(multiselect_dtype="number") |
| Export CSV | survey.to_csv("output/", name="results") |
| Export SPSS | survey.to_spss("output/", name="results") |
| Export JSON | survey.to_json("output/", name="results") |
| SPSS syntax string | survey.sps |
| Serialize variable | survey["Q1"].to_dict() |
10. Reference Files
references/api_reference.md— Complete method signatures with all parameters and return typesscripts/validate_survey.py— Loads a survey file, checks missing labels/value_indices, prints reportscripts/batch_export.py— Reads a survey and exports to CSV, Excel, SPSS, and JSONassets/sample_data.csv— Wide-format sample datasetassets/sample_data_compact.csv— Compact-format sample dataset