Tidy R Function Design

Design R functions for humans, not computers. Optimize for low cognitive load, predictability, and composability. These principles apply to any R code, not just tidyverse packages.

Core principle: The less a user needs to think to use your function correctly, the better.

Quick Reference

| Design Goal | Pattern |
|---|---|
| Predictable names | Verb in imperative mood, prefixes for families |
| Clear arguments | Most important first, optional with defaults last |
| Pipe-friendly | Primary data as first argument |
| Type stability | Output type predictable from input types |
| Enumerated options | Use arg_match() with character vector defaults |
| Side effects | Return input invisibly; partition from computation |
| Complex strategies | Extract to strategy objects (not boolean flags) |

Function Naming

Use Verbs in Imperative Mood

# Good: imperative verbs
mutate()
filter()
summarize()

# Exception: noun-y builders
geom_point()
recipe()

Prefer Prefixes Over Suffixes

Prefixes enable autocomplete discovery:

# Good: common prefix groups related functions
str_detect(), str_replace(), str_extract()
read_csv(), read_tsv(), read_delim()

# Suffixes for variations on a theme
map_int(), map_chr(), map_dbl()

Length Inversely Proportional to Frequency

# Very frequent -> short
c(), n(), df

# Less frequent -> descriptive
create_bootstrap_samples()
validate_model_specification()

Argument Design

Most Important Arguments First

# Good: transformed data first (pipe-friendly)
str_replace(string, pattern, replacement)
left_join(x, y, by)

# Output-determining args early
read_csv(file, col_types, col_names)

Required Arguments Have No Defaults

# Good: required args have no defaults
my_function <- function(data, columns, method = "default") {
  # data and columns required, method optional
}

# Bad: everything has defaults
my_function <- function(data = NULL, columns = NULL, method = "default") {
  # nothing in the signature says that data and columns are required
}

Dots Position Matters

Place ... between required and optional arguments:

# Good: forces explicit naming of optional args
my_function <- function(x, y, ..., verbose = FALSE, na_rm = TRUE) {
  # x, y required; anything after ... must be named
}
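
A minimal sketch of how this plays out at the call site. `rescale()` is a hypothetical function, and the `...length()` check (base R 3.5.0 or later) is one assumed way to reject stray positional arguments; `rlang::check_dots_empty()` provides a similar guard.

```r
# Hypothetical example: x and center are required; everything after ... must be named.
rescale <- function(x, center, ..., verbose = FALSE, na_rm = TRUE) {
  # Reject anything swallowed by ... so positional optionals and typos fail loudly.
  if (...length() > 0) {
    stop("Unused arguments in `...`; optional arguments must be named.", call. = FALSE)
  }
  if (na_rm) x <- x[!is.na(x)]
  if (verbose) message("Centering ", length(x), " values on ", center)
  x - center
}

rescale(c(1, 2, NA), 2)                 # optional arguments keep their defaults
rescale(c(1, 2, NA), 2, na_rm = FALSE)  # a named optional argument is accepted
try(rescale(c(1, 2, NA), 2, FALSE))     # a positional optional lands in ... and errors
```

Without the check, the stray `FALSE` would be silently ignored, which is usually worse than an error.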

Keep Defaults Short

Use NULL for complex defaults, compute in body:

# Good: NULL signals "computed if not provided"
# (`%||%` comes from rlang; base R also provides it from 4.4.0)
my_function <- function(x, weights = NULL) {
  weights <- weights %||% rep(1, length(x))
  # ...
}

# Bad: complex default in signature
my_function <- function(x, weights = rep(1, length(x))) {
  # ...
}

Enumerate String Options

Use arg_match() with character vector defaults:

my_function <- function(x, method = c("fast", "accurate", "balanced")) {
  method <- rlang::arg_match(method)
  # method is now validated, first value is default
}

Standardize Common Argument Names

| Purpose | Use | Not |
|---|---|---|
| New data for prediction | new_data | newdata, newData |
| Missing value handling | na_rm | na.rm, rm.na |
| Case weights | weights | wts, w |
| Predictors (data frame) | x | predictors, features |
| Outcome (data frame) | y | response, target |
| Formula interface data | data | df, dataset |

Output Patterns

Type Stability

Output type should be predictable from input types, not values:

# Bad: type depends on VALUE
ifelse(TRUE, 1L, 2)   # returns integer
ifelse(FALSE, 1L, 2)  # returns double

# Good: type predictable from input types
dplyr::if_else(TRUE, 1L, 2L)   # always integer
dplyr::if_else(FALSE, 1L, 2L)  # always integer
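
The same distinction shows up in base R. The sketch below contrasts `sapply()`, which infers its output type from the results it happens to get, with `vapply()`, which is told the type up front. The wrapper names `double_unstable()` and `double_stable()` are hypothetical.

```r
# Type-unstable: sapply() guesses the output type from the values it sees.
double_unstable <- function(x) sapply(x, function(v) v * 2)

# Type-stable: vapply() declares that every result is a single double.
double_stable <- function(x) vapply(x, function(v) v * 2, numeric(1))

class(double_unstable(c(1, 2)))     # "numeric"
class(double_unstable(numeric(0)))  # "list": the type changed with the input's length
class(double_stable(c(1, 2)))       # "numeric"
class(double_stable(numeric(0)))    # "numeric"
```

Code downstream of `double_stable()` never needs a special case for empty input.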

Tibble Predictions

For modeling functions, predictions should return tibbles:

  • Same number of rows as input
  • Same row order as input
  • Standardized column names: .pred, .pred_class, .pred_lower

# Good prediction output
predict(model, new_data)
#> # A tibble: 100 x 1
#>     .pred
#>     <dbl>
#>  1   3.45
#>  2   2.89
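
As an illustration of the row-count and naming rules, here is a sketch of a `predict()` method for a hypothetical `toy_model` class. The class, its constructor, and the `x` column are assumptions made for the example; it requires the tibble package.

```r
library(tibble)

# Hypothetical model class, used only for illustration.
new_toy_model <- function(intercept, slope) {
  structure(list(intercept = intercept, slope = slope), class = "toy_model")
}

# One output row per input row, in input order.
# A missing input yields an NA prediction instead of a dropped row.
predict.toy_model <- function(object, new_data, ...) {
  tibble(.pred = object$intercept + object$slope * new_data$x)
}

model <- new_toy_model(intercept = 1, slope = 2)
new_data <- tibble(x = c(0, NA, 3))

preds <- predict(model, new_data)
nrow(preds) == nrow(new_data)  # TRUE
```

Because rows line up one to one, callers can bind the predictions back onto `new_data` without a join.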

Side-Effect Functions Return Invisibly

Functions called for side effects should return the first argument invisibly:

# Good: enables piping
write_csv <- function(x, file, ...) {
  # write the file
  invisible(x)
}

# Enables this pattern:
data |>
  write_csv("backup.csv") |>
  filter(important) |>
  write_csv("filtered.csv")
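
A runnable variant of the same idea in base R, assuming a hypothetical `report_rows()` helper whose side effect is a message rather than a file write:

```r
# Hypothetical side-effect function: reports the row count, then hands the data back.
report_rows <- function(x, label = "rows") {
  message(label, ": ", nrow(x))  # the side effect
  invisible(x)                   # return the input so a pipeline can continue
}

scores <- data.frame(id = 1:5, score = c(10, 40, 25, 5, 60))

result <- scores |>
  report_rows("before filter") |>
  subset(score > 20) |>
  report_rows("after filter")

withVisible(report_rows(scores))$visible  # FALSE: nothing is printed at top level
```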

Side Effects

Partition Side Effects from Computation

# Bad: computation mixed with side effects
analyze <- function(x) {
  result <- expensive_computation(x)
  cat("Computed result:", result, "\n")  # side effect buried
  options(my_option = result)            # hidden state change
  result
}

# Good: side effects isolated
analyze <- function(x, verbose = FALSE) {
  result <- expensive_computation(x)
  if (verbose) cli::cli_inform("Computed result: {result}")
  result
}

Make Side Effects Easy to Undo

Functions that change global state should return previous values:

# Good: returns previous value for restoration
old <- options(digits = 3)
# ... do work ...
options(old)  # restore
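
The same contract can be applied to your own setters. In this sketch, `.settings`, `set_verbosity()`, and `run_loudly()` are hypothetical names; the setter mirrors `options()` by returning the previous value invisibly, and `on.exit()` restores it even if the wrapped code fails.

```r
# Hypothetical package-level state, kept in an environment.
.settings <- new.env()
.settings$verbosity <- "quiet"

# The setter returns the previous value invisibly, mirroring options().
set_verbosity <- function(level) {
  old <- .settings$verbosity
  .settings$verbosity <- level
  invisible(old)
}

# Callers can change the state temporarily and restore it on exit.
run_loudly <- function(code) {
  old <- set_verbosity("loud")
  on.exit(set_verbosity(old), add = TRUE)
  force(code)  # the lazily evaluated argument runs here, under the new setting
}

run_loudly(.settings$verbosity)  # "loud" while the code runs
.settings$verbosity              # "quiet" again afterwards
```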

Strategy Patterns

Avoid Boolean Strategy Flags

# Bad: boolean flags for strategies
grepl(pattern, x, perl = TRUE, fixed = FALSE, ignore.case = TRUE)
# Which combinations are valid? What does perl + fixed mean?

# Good: strategy objects
str_detect(x, regex(pattern, ignore_case = TRUE))
str_detect(x, fixed(pattern))

Strategy Objects for Complex Options

When strategies need different arguments, create helper functions:

# Strategy helpers with strategy-specific arguments
regex <- function(pattern, ignore_case = FALSE, multiline = FALSE) {
  structure(list(pattern = pattern, ignore_case = ignore_case,
                 multiline = multiline), class = "regex")
}

fixed <- function(pattern) {
  structure(list(pattern = pattern), class = "fixed")
}

# Main function accepts strategy objects
str_detect <- function(string, pattern) {
  if (inherits(pattern, "regex")) {
    # regex-specific handling
  } else if (inherits(pattern, "fixed")) {
    # fixed-specific handling
  }
}
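
The skeleton above leaves the branches empty. One way to fill them in, sketched here with base R's `grepl()` and S3 dispatch instead of an `inherits()` chain, is shown below. All names (`match_regex()`, `match_fixed()`, `run_match()`, `detect()`) are hypothetical and chosen so they do not mask stringr's functions; this is not how stringr itself is implemented.

```r
# Hypothetical strategy constructors, each with only the arguments it needs.
match_regex <- function(pattern, ignore_case = FALSE) {
  structure(list(pattern = pattern, ignore_case = ignore_case), class = "match_regex")
}
match_fixed <- function(pattern) {
  structure(list(pattern = pattern), class = "match_fixed")
}

# S3 generic that dispatches on the strategy object.
run_match <- function(strategy, string) UseMethod("run_match")
run_match.match_regex <- function(strategy, string) {
  grepl(strategy$pattern, string, ignore.case = strategy$ignore_case, perl = TRUE)
}
run_match.match_fixed <- function(strategy, string) {
  grepl(strategy$pattern, string, fixed = TRUE)
}

# User-facing function: data first; a bare string is treated as a regex.
detect <- function(string, pattern) {
  if (is.character(pattern)) pattern <- match_regex(pattern)
  run_match(pattern, string)
}

detect(c("Apple", "banana"), match_regex("^a", ignore_case = TRUE))  # TRUE FALSE
detect(c("a.b", "ab"), match_fixed("."))                             # TRUE FALSE
```

Invalid combinations such as "fixed and case-insensitive" cannot be expressed at all, because `match_fixed()` has no such argument.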

Explicit Over Implicit

Avoid Global Option Dependencies

# Bad: behavior depends on global option
my_function <- function(x) {
  na_action <- getOption("na.action")  # implicit input
  # ...
}

# Good: explicit argument with informative default
my_function <- function(x, na_action = na.omit) {
  # ...
}

Inform Users of Important Defaults

When defaults matter, tell the user:

my_function <- function(x, tz = Sys.timezone()) {
  if (missing(tz)) {
    cli::cli_inform("Using timezone: {.val {tz}}")
  }
  # ...
}

Model Object Design

Minimize Stored Data

# Bad: stores entire training set
model$training_data <- training_set  # memory bloat

# Good: store only what's needed for prediction
model$coefficients <- coefs
model$levels <- factor_levels

Never Save Call Objects

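A stored call can hold far more than code. When a function is invoked through do.call(), the call contains the evaluated arguments, including entire datasets: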

# Bad: call may contain data
model$call <- match.call()

# Good: omit call or store only essential info
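
A small sketch of how a stored call ends up carrying data. `fit_storing_call()` is a hypothetical fitting function; the difference between the two objects comes purely from how the function was invoked.

```r
# Hypothetical fitting function that stores its call.
fit_storing_call <- function(data) {
  structure(list(call = match.call()), class = "toy_fit")
}

big <- data.frame(x = seq_len(1e5))

direct <- fit_storing_call(big)
via_do_call <- do.call(fit_storing_call, list(data = big))

is.symbol(direct$call[["data"]])                # TRUE: only the name `big` is stored
is.data.frame(via_do_call$call[["data"]])       # TRUE: the whole data frame is stored
object.size(via_do_call) > object.size(direct)  # TRUE
```

Programmatic callers often use `do.call()`, so the author of the fitting function cannot rule out the second case.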

Use Proper S3 Constructors

# Constructor (internal)
new_my_model <- function(coefficients, levels) {
  structure(
    list(coefficients = coefficients, levels = levels),
    class = "my_model"
  )
}

# Validator (internal)
validate_my_model <- function(x) {
  stopifnot(is.numeric(x$coefficients))
  x
}

# Helper (user-facing)
my_model <- function(...) {
  result <- new_my_model(...)
  validate_my_model(result)
}

Matrix Subsetting Discipline

Always preserve matrix structure:

# Bad: may return vector
X[, 1]

# Good: always returns matrix
X[, 1, drop = FALSE]
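
To see why this matters, compare the dimensions of the results. `select_columns()` is a hypothetical helper that must always return a matrix, whatever the number of columns requested.

```r
# Hypothetical helper: selects columns and always returns a matrix.
select_columns <- function(X, cols) {
  X[, cols, drop = FALSE]
}

X <- matrix(1:6, nrow = 3)

dim(X[, 1])                  # NULL: a single column was dropped to a plain vector
dim(select_columns(X, 1))    # 3 1
dim(select_columns(X, 1:2))  # 3 2
```

Code that later calls `ncol()` or `%*%` on the result keeps working when only one column is selected.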

Design Review Checklist

When reviewing R function design:

  • Function names are verbs in imperative mood (or nouns for builders)
  • Related functions share a prefix
  • Most important arguments come first
  • Primary data is first argument (pipe-friendly)
  • Required arguments have no defaults
  • ... comes between required and optional arguments
  • String options use arg_match() with enumerated defaults
  • Output type is predictable from input types
  • Side-effect functions return input invisibly
  • No hidden dependencies on global options or locale
  • Strategy variations use objects, not boolean flags
  • Model objects don't store training data or calls
  • Matrix subsetting uses drop = FALSE
