# R Data Science Skill

R-first data science workflows emphasizing tidyverse idioms, functional programming, and reproducible research.

## Core Principles

  1. Tidyverse-first: Prefer tidyverse solutions; use data.table or base R when performance requires
  2. Pipelines over scripts: Use |> (native pipe) for clarity; %>% acceptable in existing codebases
  3. Functional style: Leverage purrr for iteration; avoid explicit loops
  4. Lazy evaluation: Use DuckDB/dbplyr to push computation to the database
  5. Reproducibility: Structure projects with targets for pipeline orchestration

## Quick Reference

### Data Import/Export

```r
# CSV (readr - tidyverse)
df <- read_csv("data.csv", col_types = cols())

# Parquet (arrow)
df <- arrow::read_parquet("data.parquet")

# Excel
df <- readxl::read_excel("data.xlsx", sheet = 1)
```
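
This section lists only the readers; each has a matching writer. A minimal sketch of the export side (the output paths and toy data are illustrative, and `writexl` is just one of several Excel-writing packages):

```r
library(readr)   # write_csv()
library(arrow)   # write_parquet()

df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))  # toy data

write_csv(df, "data_out.csv")            # plain text, no row names
write_parquet(df, "data_out.parquet")    # columnar, preserves column types

# Excel needs a separate package; guard in case it is not installed
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(df, "data_out.xlsx")
}
```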

### Data Manipulation (dplyr + tidyr)

```r
result <- df |>
  filter(status == "active") |>
  mutate(rate = value / total) |>
  group_by(category) |>
  summarise(
    n = n(),
    mean_rate = mean(rate, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_rate))

# Pivoting
wide <- df |> pivot_wider(names_from = year, values_from = value)
long <- df |> pivot_longer(cols = -id, names_to = "year", values_to = "value")
```
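
The pivoting calls above form a round trip; a small worked example on toy data (column names are illustrative):

```r
library(tidyr)

sales <- data.frame(
  id    = c("a", "a", "b", "b"),
  year  = c(2023, 2024, 2023, 2024),
  value = c(10, 12, 20, 24)
)

# Widen: one row per id, one column per year
wide <- sales |> pivot_wider(names_from = year, values_from = value)

# Lengthen: back to one row per id-year pair
long <- wide |> pivot_longer(cols = -id, names_to = "year", values_to = "value")
```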

### Iteration (purrr)

```r
# Map over list
results <- map(file_list, read_csv)

# Map with type safety
means <- map_dbl(df_list, \(x) mean(x$value, na.rm = TRUE))

# Row-wise operations
df |> mutate(result = pmap_dbl(list(a, b, c), \(a, b, c) a + b * c))
```
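
When mapping over inputs that may fail (unreadable files, malformed elements), purrr's function adverbs wrap a function so one failure does not abort the whole map. A sketch using `possibly()`:

```r
library(purrr)

# Wrap log() so failing inputs yield NA instead of raising an error
safe_log <- possibly(log, otherwise = NA_real_)

vals <- map_dbl(list(10, "not a number", 100), safe_log)
# vals[2] is NA; the numeric elements are logged normally
```

`safely()` is the companion adverb when you also need to keep the error object for later inspection.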

## Database Workflows

### DuckDB (Preferred for Local Analytics)

```r
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Register data frame as virtual table
duckdb_register(con, "my_table", df)

# Query with SQL
result <- dbGetQuery(con, "SELECT * FROM my_table WHERE value > 100")

# Or use dplyr
tbl(con, "my_table") |>
  filter(value > 100) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```
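
DuckDB can also scan Parquet and CSV files directly by path, without loading them into R first. A sketch (the file name is illustrative; here DuckDB itself writes a small Parquet file so the query has something to scan):

```r
library(duckdb)

con <- dbConnect(duckdb())

# Write a two-row Parquet file, then query it by path
dbExecute(
  con,
  "COPY (SELECT 1 AS x UNION ALL SELECT 2) TO 'events.parquet' (FORMAT PARQUET)"
)
res <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM 'events.parquet'")

dbDisconnect(con, shutdown = TRUE)
```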

### duckplyr (Zero-Copy DuckDB Backend)

```r
library(duckplyr)

# Automatically uses DuckDB for supported operations
df |>
  filter(value > 100) |>
  summarise(total = sum(value))
```

### dbplyr (Remote Databases)

```r
library(dbplyr)

con <- DBI::dbConnect(RPostgres::Postgres(), ...)
remote_tbl <- tbl(con, in_schema("schema", "table_name"))

# Build query lazily
query <- remote_tbl |>
  filter(date >= "2024-01-01") |>
  group_by(region) |>
  summarise(revenue = sum(amount))

# View generated SQL
show_query(query)

# Execute and retrieve
local_df <- collect(query)
```

## Detailed References

Load these as needed based on the task:

### Project Structure

Standard layout for targets-based projects:

```
project/
├── _targets.R          # Pipeline definition
├── R/
│   ├── functions.R     # Reusable functions
│   └── plots.R         # Visualization functions
├── data-raw/           # Original data (gitignored if large)
├── data/               # Processed data
├── output/             # Reports, figures
└── renv.lock           # Dependency lockfile
```
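
A `_targets.R` for this layout might look like the following minimal sketch; the target names and the helpers `clean_data()` and `plot_summary()` are illustrative stand-ins for functions defined in `R/`:

```r
# _targets.R -- minimal pipeline sketch
library(targets)
tar_source("R")  # sources R/functions.R and R/plots.R

list(
  tar_target(raw_file, "data-raw/input.csv", format = "file"),  # tracked input
  tar_target(raw, readr::read_csv(raw_file, col_types = readr::cols())),
  tar_target(clean, clean_data(raw)),          # hypothetical helper in R/functions.R
  tar_target(summary_plot, plot_summary(clean)) # hypothetical helper in R/plots.R
)
```

Run the pipeline with `targets::tar_make()`; only targets whose upstream dependencies changed are rebuilt.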

### Code Style

- Use tidyverse style guide conventions
- Explicit `library()` calls at script top; avoid `require()`
- Prefer named arguments for clarity: `mean(x, na.rm = TRUE)` not `mean(x, T)`
- Document functions with roxygen2 comments when writing packages
- Use `stopifnot()` or `cli::cli_abort()` for assertions
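
The assertion bullet can be sketched as a small input check; the function name and the `rate` column are illustrative:

```r
check_rates <- function(df) {
  # Cheap structural checks: fail fast with stopifnot()
  stopifnot(is.data.frame(df), "rate" %in% names(df))

  # Richer, formatted error message via cli
  if (any(df$rate < 0, na.rm = TRUE)) {
    cli::cli_abort("Column {.field rate} must be non-negative.")
  }

  invisible(df)  # return input invisibly so the check can sit in a pipeline
}
```
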