Tidy R Function Design

Design R functions for humans, not computers. Optimize for cognitive load reduction, predictability, and composability. These principles apply to any R code, not just tidyverse packages.

Core principle: The less a user needs to think to use your function correctly, the better.

Quick Reference

Design Goal	Pattern
Predictable names	Verb in imperative mood, prefixes for families
Clear arguments	Most important first, optional with defaults last
Pipe-friendly	Primary data as first argument
Type stability	Output type predictable from input types
Enumerated options	Use `arg_match()` with character vector defaults
Side effects	Return input invisibly; partition from computation
Complex strategies	Extract to strategy objects (not boolean flags)

Function Naming

Use Verbs in Imperative Mood

# Good: imperative verbs
mutate()
filter()
summarize()

# Exception: noun-y builders
geom_point()
recipe()

Prefer Prefixes Over Suffixes

Prefixes enable autocomplete discovery:

# Good: common prefix groups related functions
str_detect(), str_replace(), str_extract()
read_csv(), read_tsv(), read_delim()

# Suffixes for variations on a theme
map_int(), map_chr(), map_dbl()

Length Inversely Proportional to Frequency

# Very frequent -> short
c(), n(), df

# Less frequent -> descriptive
create_bootstrap_samples()
validate_model_specification()

Argument Design

Most Important Arguments First

# Good: transformed data first (pipe-friendly)
str_replace(string, pattern, replacement)
left_join(x, y, by)

# Output-determining args early
read_csv(file, col_types, col_names)

Required Arguments Have No Defaults

# Good: required args have no defaults
my_function <- function(data, columns, method = "default") {
  # data and columns required, method optional
}

# Bad: everything has defaults
my_function <- function(data = NULL, columns = NULL, method = "default")

Dots Position Matters

Place ... between required and optional arguments:

# Good: forces explicit naming of optional args
my_function <- function(x, y, ..., verbose = FALSE, na.rm = TRUE) {
  # x, y required; anything after ... must be named
}

Keep Defaults Short

Use NULL for complex defaults, compute in body:

# Good: NULL signals "computed if not provided"
my_function <- function(x, weights = NULL) {
  weights <- weights %||% rep(1, length(x))
}

# Bad: complex default in signature
my_function <- function(x, weights = rep(1, length(x)))

Enumerate String Options

Use arg_match() with character vector defaults:

my_function <- function(x, method = c("fast", "accurate", "balanced")) {
  method <- rlang::arg_match(method)
  # method is now validated, first value is default
}

Standardize Common Argument Names

Purpose	Use	Not
New data for prediction	`new_data`	`newdata`, `newData`
Missing value handling	`na_rm`	`na.rm`, `rm.na`
Case weights	`weights`	`wts`, `w`
Predictors (data frame)	`x`	`predictors`, `features`
Outcome (data frame)	`y`	`response`, `target`
Formula interface data	`data`	`df`, `dataset`

Output Patterns

Type Stability

Output type should be predictable from input types, not values:

# Bad: type depends on VALUE
ifelse(TRUE, 1L, 2)   # returns integer
ifelse(FALSE, 1L, 2)  # returns double

# Good: type predictable from input types
dplyr::if_else(TRUE, 1L, 2L)   # always integer
dplyr::if_else(FALSE, 1L, 2L)  # always integer

Tibble Predictions

For modeling functions, predictions should return tibbles:

Same number of rows as input
Same row order as input
Standardized column names: .pred, .pred_class, .pred_lower

# Good prediction output
predict(model, new_data)
#> # A tibble: 100 x 1
#>     .pred
#>     <dbl>
#>  1   3.45
#>  2   2.89

Side-Effect Functions Return Invisibly

Functions called for side effects should return the first argument invisibly:

# Good: enables piping
write_csv <- function(x, file, ...) {
  # write the file
  invisible(x)
}

# Enables this pattern:
data |>
  write_csv("backup.csv") |>
  filter(important) |>
  write_csv("filtered.csv")

Side Effects

Partition Side Effects from Computation

# Bad: computation mixed with side effects
analyze <- function(x) {
  result <- expensive_computation(x)
  cat("Computed result:", result, "\n")  # side effect buried
  options(my_option = result)            # hidden state change
  result
}

# Good: side effects isolated
analyze <- function(x, verbose = FALSE) {
  result <- expensive_computation(x)
  if (verbose) cli::cli_inform("Computed result: {result}")
  result
}

Make Side Effects Easy to Undo

Functions that change global state should return previous values:

# Good: returns previous value for restoration
old <- options(digits = 3)
# ... do work ...
options(old)  # restore

Strategy Patterns

Avoid Boolean Strategy Flags

# Bad: boolean flags for strategies
grepl(pattern, x, perl = TRUE, fixed = FALSE, ignore.case = TRUE)
# Which combinations are valid? What does perl + fixed mean?

# Good: strategy objects
str_detect(x, regex(pattern, ignore_case = TRUE))
str_detect(x, fixed(pattern))

Strategy Objects for Complex Options

When strategies need different arguments, create helper functions:

# Strategy helpers with strategy-specific arguments
regex <- function(pattern, ignore_case = FALSE, multiline = FALSE) {
  structure(list(pattern = pattern, ignore_case = ignore_case,
                 multiline = multiline), class = "regex")
}

fixed <- function(pattern) {
  structure(list(pattern = pattern), class = "fixed")
}

# Main function accepts strategy objects
str_detect <- function(string, pattern) {
  if (inherits(pattern, "regex")) {
    # regex-specific handling
  } else if (inherits(pattern, "fixed")) {
    # fixed-specific handling
  }
}

Explicit Over Implicit

Avoid Global Option Dependencies

# Bad: behavior depends on global option
my_function <- function(x) {
  na_action <- getOption("na.action")  # implicit input
  # ...
}

# Good: explicit argument with informative default
my_function <- function(x, na_action = na.omit) {
  # ...
}

Inform Users of Important Defaults

When defaults matter, tell the user:

my_function <- function(x, tz = Sys.timezone()) {
  if (missing(tz)) {
    cli::cli_inform("Using timezone: {.val {tz}}")
  }
  # ...
}

Model Object Design

Minimize Stored Data

# Bad: stores entire training set
model$training_data <- training_set  # memory bloat

# Good: store only what's needed for prediction
model$coefficients <- coefs
model$levels <- factor_levels

Never Save Call Objects

Call objects can embed entire datasets and environments:

# Bad: call may contain data
model$call <- match.call()

# Good: omit call or store only essential info

Use Proper S3 Constructors

# Constructor (internal)
new_my_model <- function(coefficients, levels) {
  structure(
    list(coefficients = coefficients, levels = levels),
    class = "my_model"
  )
}

# Validator (internal)
validate_my_model <- function(x) {
  stopifnot(is.numeric(x$coefficients))
  x
}

# Helper (user-facing)
my_model <- function(...) {
  result <- new_my_model(...)
  validate_my_model(result)
}

Matrix Subsetting Discipline

Always preserve matrix structure:

# Bad: may return vector
X[, 1]

# Good: always returns matrix
X[, 1, drop = FALSE]

Design Review Checklist

When reviewing R function design:

designing-tidy-r-functions