# R Data Science Skill
R-first data science workflows emphasizing tidyverse idioms, functional programming, and reproducible research.
## Core Principles

- Tidyverse-first: Prefer tidyverse solutions; fall back to data.table or base R when performance requires it
- Pipelines over scripts: Use `|>` (the native pipe) for clarity; `%>%` is acceptable in existing codebases
- Functional style: Leverage purrr for iteration; avoid explicit loops
- Lazy evaluation: Use DuckDB/dbplyr to push computation to the database
- Reproducibility: Structure projects with targets for pipeline orchestration
## Quick Reference

### Data Import/Export

```r
library(readr)

# CSV (readr - tidyverse)
df <- read_csv("data.csv", col_types = cols())

# Parquet (arrow)
df <- arrow::read_parquet("data.parquet")

# Excel
df <- readxl::read_excel("data.xlsx", sheet = 1)
```
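The export side mirrors these calls. A sketch of the common counterparts, assuming readr, arrow, and writexl are installed and `df` is any data frame:

```r
library(readr)

# CSV (readr)
write_csv(df, "data.csv")

# Parquet (arrow) - preserves column types and compresses well
arrow::write_parquet(df, "data.parquet")

# Excel (writexl)
writexl::write_xlsx(df, "data.xlsx")
```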
### Data Manipulation (dplyr + tidyr)

```r
library(dplyr)
library(tidyr)

result <- df |>
  filter(status == "active") |>
  mutate(rate = value / total) |>
  group_by(category) |>
  summarise(
    n = n(),
    mean_rate = mean(rate, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_rate))

# Pivoting
wide <- df |> pivot_wider(names_from = year, values_from = value)
long <- df |> pivot_longer(cols = -id, names_to = "year", values_to = "value")
```
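Joins and multi-column summaries round out the core dplyr verbs. A hedged sketch with two small illustrative frames (`orders` and `customers` are invented here):

```r
library(dplyr)

orders    <- tibble(id = 1:3, customer_id = c(1, 2, 2), amount = c(10, 20, 30))
customers <- tibble(customer_id = 1:2, region = c("east", "west"))

# Keep every order, attach customer attributes where they match
joined <- orders |>
  left_join(customers, by = "customer_id")

# Summarise a column several ways at once with across()
joined |>
  group_by(region) |>
  summarise(across(amount, list(total = sum, mean = mean)), .groups = "drop")
```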
### Iteration (purrr)

```r
library(purrr)

# Map over a list
results <- map(file_list, read_csv)

# Map with type safety
means <- map_dbl(df_list, \(x) mean(x$value, na.rm = TRUE))

# Row-wise operations
df |> mutate(result = pmap_dbl(list(a, b, c), \(a, b, c) a + b * c))
```
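When mapping over inputs that may fail (malformed files, bad records), purrr's `possibly()` and `safely()` capture errors instead of aborting the whole loop. A small sketch (the `inputs` list is invented for illustration):

```r
library(purrr)

inputs <- list("10", "20", "oops")

# possibly() substitutes a default value when the wrapped function errors
parse_num <- possibly(\(x) {
  out <- suppressWarnings(as.numeric(x))
  if (is.na(out)) stop("not a number")
  out
}, otherwise = NA_real_)

map_dbl(inputs, parse_num)
#> [1] 10 20 NA
```

`safely()` is the alternative when you want to keep the error objects themselves rather than a default value.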
## Database Workflows

### DuckDB (Preferred for Local Analytics)

```r
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Register a data frame as a virtual table
duckdb_register(con, "my_table", df)

# Query with SQL
result <- dbGetQuery(con, "SELECT * FROM my_table WHERE value > 100")

# Or use dplyr
tbl(con, "my_table") |>
  filter(value > 100) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```
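DuckDB can also scan Parquet and CSV files in place, without importing them first, which pairs well with the lazy-evaluation principle above. A sketch assuming a hypothetical `data/events.parquet` file:

```r
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Query the file directly; DuckDB reads only the columns and row groups it needs
events <- tbl(con, sql("SELECT * FROM read_parquet('data/events.parquet')"))

result <- events |>
  filter(value > 100) |>
  count(category) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```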
### duckplyr (Zero-Copy DuckDB Backend)

```r
library(duckplyr)

# After loading, duckplyr routes supported dplyr operations through DuckDB
df |>
  filter(value > 100) |>
  summarise(total = sum(value))
```
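If you prefer opting in per data frame rather than overriding dplyr methods for the whole session, recent duckplyr versions offer an explicit conversion. A sketch, assuming duckplyr >= 1.0:

```r
library(duckplyr)
library(dplyr)

result <- df |>
  as_duckdb_tibble() |>   # opt this frame into the DuckDB backend
  filter(value > 100) |>
  summarise(total = sum(value))
```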
### dbplyr (Remote Databases)

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RPostgres::Postgres(), ...)

# Use in_schema() for schema-qualified tables; a bare "schema.table_name"
# string would be treated as a single table name
remote_tbl <- tbl(con, in_schema("schema", "table_name"))

# Build the query lazily
query <- remote_tbl |>
  filter(date >= "2024-01-01") |>
  group_by(region) |>
  summarise(revenue = sum(amount))

# View the generated SQL
show_query(query)

# Execute and retrieve
local_df <- collect(query)
```
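You can also preview the SQL dbplyr would generate without any live connection, using `lazy_frame()` with a simulated backend; this is handy for debugging translations. A sketch:

```r
library(dplyr)
library(dbplyr)

# A lazy frame backed by a simulated Postgres connection - no database needed
lf <- lazy_frame(region = character(), amount = numeric(),
                 con = simulate_postgres())

lf |>
  group_by(region) |>
  summarise(revenue = sum(amount, na.rm = TRUE)) |>
  show_query()
```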
## Detailed References

Load these as needed based on the task:
- Time series analysis (fable, tsibble, feasts): See references/time-series.md
- Machine learning (tidymodels): See references/tidymodels.md
- Pipeline orchestration (targets): See references/targets.md
- Parallel computing (crew, mirai): See references/parallel.md
- Visualization (ggplot2, coefplot): See references/visualization.md
- Performance patterns (data.table, vectorization): See references/performance.md
## Project Structure

Standard layout for targets-based projects:

```
project/
├── _targets.R       # Pipeline definition
├── R/
│   ├── functions.R  # Reusable functions
│   └── plots.R      # Visualization functions
├── data-raw/        # Original data (gitignored if large)
├── data/            # Processed data
├── output/          # Reports, figures
└── renv.lock        # Dependency lockfile
```
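A minimal `_targets.R` matching this layout might look like the sketch below; `clean_data()` and `plot_summary()` are hypothetical helpers assumed to live in `R/`:

```r
# _targets.R - pipeline definition (sketch; clean_data() and plot_summary()
# are hypothetical functions defined under R/)
library(targets)

tar_option_set(packages = c("dplyr", "readr", "ggplot2"))
tar_source()  # sources every script under R/

list(
  tar_target(raw_file, "data-raw/input.csv", format = "file"),  # tracked file
  tar_target(raw, readr::read_csv(raw_file)),
  tar_target(clean, clean_data(raw)),
  tar_target(fig, plot_summary(clean))
)
```

Run the pipeline with `targets::tar_make()`; only targets whose inputs changed are rebuilt.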
## Code Style

- Use tidyverse style guide conventions
- Explicit `library()` calls at the top of the script; avoid `require()`
- Prefer named arguments for clarity: `mean(x, na.rm = TRUE)`, not `mean(x, T)`
- Document functions with roxygen2 comments when writing packages
- Use `stopifnot()` or `cli::cli_abort()` for assertions
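A small sketch of the assertion style, with a hypothetical validator (`check_rate()` is invented for illustration):

```r
# Validate inputs early; cli_abort() produces formatted, classed errors
check_rate <- function(rate) {
  if (!is.numeric(rate) || any(rate < 0 | rate > 1, na.rm = TRUE)) {
    cli::cli_abort("{.arg rate} must be numeric and between 0 and 1.")
  }
  invisible(rate)
}

# stopifnot() is the lighter-weight base R equivalent
stopifnot(is.numeric(0.5))
```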
Repository: jaredlander/useful