wrangle
Wrangle
Help the user manipulate football data effectively. This skill is about the mechanics of working with data, adapted to the user's language and tools.
First: check profile
Read .nutmeg.user.md. If it doesn't exist, tell the user to run /nutmeg:init first. Use their profile for language preference and stack.
Core operations
Coordinate transforms
Football data coordinates vary by provider. Always verify and convert before combining data.
Use search_docs(query="coordinate system", provider="[provider]") to look up the specific system. Key conversions:
- Opta (0-100) to StatsBomb (120x80):
x * 1.2,y * 0.8 - Wyscout to Opta:
xstays,y = 100 - y(invert Y) - Any to kloppy normalised: use kloppy's
.transform()in Python
Filtering events
Common filtering patterns for football event data:
By event type:
- Shots: filter for shot/miss/goal/saved event types
- Passes in final third: filter passes where x > 66.7 (Opta coords)
- Defensive actions: tackles + interceptions + ball recoveries
By match state:
- Open play only: exclude set pieces (corners, free kicks, throw-ins, penalties)
- First half vs second half: use periodId or timestamp
- Score state: track running score to filter "when winning", "when losing"
By zone:
- Penalty area actions: x > 83, 21 < y < 79 (Opta coords)
- High press: actions in opponent's defensive third (x > 66.7)
Joining datasets
Common joins in football data:
| Join | Key | Notes |
|---|---|---|
| Events + lineups | player_id + match_id | Get player names/positions for each event |
| Events + xG | match_id + event sequence | Match xG to specific shots |
| Multiple providers | match date + team names | Fuzzy matching often needed |
| Season data + Elo | date | Join Elo rating at time of match |
Fuzzy team name matching is a constant pain. Build a mapping table:
TEAM_MAP = {
'Man City': 'Manchester City',
'Man United': 'Manchester United',
'Spurs': 'Tottenham Hotspur',
'Wolves': 'Wolverhampton Wanderers',
# ...
}
Reshaping
Common reshaping operations:
- Wide to long: Season stats tables (one column per stat) to tidy format (one row per stat per team)
- Events to possession chains: Group consecutive events by the same team into possession sequences
- Match-level to season aggregates: Group by team, sum/average per-match values
- Player-match to player-season: Aggregate across matches, weight by minutes played
Handling large datasets
Full event data for a PL season is ~500MB+ (380 matches x ~1700 events). Strategies:
Python:
- Use polars instead of pandas for 5-10x speed improvement
- Process match-by-match in a loop, don't load all into memory
- Use DuckDB for SQL queries on Parquet files without loading into memory
JavaScript/TypeScript:
- Stream JSON files with
readlineorJSONStream - Use SQLite (better-sqlite3) for local queries
- Process files in parallel with worker threads
R:
- Use data.table instead of tidyverse for large datasets
- Arrow/Parquet for out-of-memory processing
Data quality checks
Always validate after wrangling:
| Check | What to look for |
|---|---|
| Event counts | ~1500-2000 events per PL match. Much less = data issue |
| Coordinate range | Should be within provider's expected range |
| Missing player IDs | Some events lack player attribution (ball out, etc.) |
| Duplicate events | Same event_id appearing twice |
| Time gaps | Large gaps in event timestamps within a match |
| Team attribution | Verify home/away assignment is consistent |
Format conversion
| From | To | Tool/method |
|---|---|---|
| JSON events | DataFrame | pandas/polars read_json or manual parsing |
| CSV | Parquet | df.write_parquet() (polars) or df.to_parquet() (pandas) |
| Provider format | kloppy model | kloppy.load_{provider}() in Python |
| kloppy model | DataFrame | dataset.to_df() |
| Any | SQLite | Load into SQLite for ad-hoc queries |