Neuroimaging Power Guide

Purpose

Statistical power in neuroimaging is fundamentally different from power in behavioral research. The massive multiple comparisons problem (testing ~100,000 voxels simultaneously), spatial correlation structure, and non-standard test statistics mean that standard power formulas underestimate required sample sizes. Meanwhile, the field has historically been severely underpowered: the median fMRI study has only ~20% power to detect a typical effect (Button et al., 2013).

A competent programmer without neuroimaging training would apply standard power calculations (e.g., G*Power for a t-test) without accounting for multiple comparison correction, would not know typical effect sizes in neuroimaging, and would dramatically underestimate the sample sizes needed. This skill encodes the domain-specific knowledge for neuroimaging power analysis.

When to Use This Skill

  • Planning sample size for a new fMRI, EEG, or MEG study
  • Estimating power for grant applications or registered reports
  • Determining whether a published study was adequately powered
  • Choosing between ROI-based and whole-brain analysis based on power constraints
  • Evaluating the reliability implications of sample size choices

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question — What specific question is this analysis/paradigm addressing?
  2. Justify the method choice — Why is this approach appropriate? What alternatives were considered?
  3. Declare expected outcomes — What results would support vs. refute the hypothesis?
  4. Note assumptions and limitations — What does this method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Why Neuroimaging Power Is Different

Standard power analysis assumes a single statistical test. Neuroimaging involves:

| Challenge | Impact on Power | Source |
|---|---|---|
| Massive multiple comparisons | ~100,000 voxels tested; correction reduces sensitivity by orders of magnitude | Nichols & Hayasaka, 2003 |
| Spatial smoothness | Adjacent voxels are correlated, reducing the effective number of independent tests but complicating power calculation | Worsley et al., 1996 |
| Multi-level inference | Subject-level estimation + group-level test; both levels contribute noise | Mumford & Nichols, 2008 |
| Effect size variability | Effects vary across voxels, regions, and subjects; no single "effect size" characterizes a study | Poldrack et al., 2017 |
| Threshold-dependent power | Power depends heavily on the statistical threshold (corrected vs. uncorrected) and correction method | Hayasaka et al., 2007 |

Key implication: A standard G*Power calculation for a two-sample t-test will dramatically overestimate the power of a whole-brain fMRI analysis because it ignores multiple comparison correction (Mumford & Nichols, 2008).

Typical Effect Sizes in Neuroimaging

fMRI Effect Sizes

| Analysis Type | Typical Effect Size | Unit | Source |
|---|---|---|---|
| Task activation (voxel-level) | Cohen's d = 0.5-1.0 | Standardized mean difference | Poldrack et al., 2017 |
| Task activation (ROI-level) | Cohen's d = 0.5-1.5 | Standardized mean difference | Poldrack et al., 2017 |
| Between-group difference (voxel) | Cohen's d = 0.3-0.8 | Standardized mean difference | Poldrack et al., 2017 |
| Functional connectivity (correlation) | r = 0.2-0.5 | Pearson correlation | Marek et al., 2022 |
| Brain-behavior association | r = 0.1-0.3 | Pearson correlation | Marek et al., 2022 |
| Brain-wide association (replicable) | r < 0.05 at N < 1000 | Pearson correlation | Marek et al., 2022 |

Critical finding: Marek et al. (2022) demonstrated that brain-behavior correlations in typical neuroimaging samples (N < 100) are severely inflated. Replicable brain-behavior associations require N > 2,000 for whole-brain analyses.

EEG/ERP Effect Sizes

| Analysis Type | Typical Effect Size | Source |
|---|---|---|
| ERP component amplitude (e.g., N400, P300) | Cohen's d = 0.3-0.8 | Boudewyn et al., 2018 |
| ERP latency differences | Cohen's d = 0.2-0.5 | Luck, 2014 |
| EEG oscillatory power | Cohen's d = 0.3-0.6 | Cohen, 2014 |
| EEG connectivity (coherence/PLV) | Cohen's d = 0.2-0.5 | Cohen, 2014 |

Sample Size Benchmarks

fMRI Sample Size Recommendations

| Design | Minimum N | Recommended N | Assumptions | Source |
|---|---|---|---|---|
| Within-subject task activation | 20 | 25-30 | Large effect (d > 0.8), lenient correction | Desmond & Glover, 2002 |
| Between-group comparison (large effect, d = 0.8) | 20 per group | 25-30 per group | Whole-brain, cluster-corrected | Thirion et al., 2007 |
| Between-group comparison (medium effect, d = 0.5) | 40 per group | 50+ per group | Whole-brain, cluster-corrected | Thirion et al., 2007; Poldrack et al., 2017 |
| Resting-state individual differences | 25+ | 50+ (much more for replicability) | Depends on reliability of measure | Marek et al., 2022 |
| Brain-behavior correlations | 100+ | N > 2,000 for replicable whole-brain | Large-scale studies only | Marek et al., 2022 |
| ROI-based analysis (a priori) | 15-20 | 25+ | Single ROI, no whole-brain correction | Desmond & Glover, 2002 |

EEG/ERP Sample Size Recommendations

| Design | Minimum | Recommended | Source |
|---|---|---|---|
| ERP trials per condition per subject | 30 trials | 40-60 trials | Boudewyn et al., 2018 |
| ERP between-group (medium d = 0.5) | 34 per group | 50+ per group | Boudewyn et al., 2018 |
| ERP within-subject (medium d = 0.5) | 25 subjects | 30+ subjects | Luck, 2014 |
| Time-frequency analysis | 40 trials | 60+ trials | Cohen, 2014 |

Power at Common Sample Sizes

| N (per group) | Power for d = 0.5 (uncorrected) | Power for d = 0.5 (corrected, whole-brain) | Power for d = 0.8 (corrected) |
|---|---|---|---|
| 10 | ~26% | < 10% | ~25% |
| 20 | ~50% | ~20% | ~50% |
| 30 | ~70% | ~35% | ~70% |
| 40 | ~82% | ~50% | ~85% |
| 60 | ~94% | ~70% | ~95% |

Values are approximate, based on simulations from Mumford & Nichols (2008) and Desmond & Glover (2002). Exact power depends on design, smoothness, effect spatial extent, and correction method.
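The qualitative pattern — correction collapsing power — can be reproduced with a quick analytic sketch. The snippet below is an illustration, not a replacement for the simulation tools discussed later: it uses the normal approximation to a two-sided two-sample t-test and a Bonferroni-style per-voxel alpha (0.05 / 100,000), ignoring spatial smoothness and cluster inference, so its numbers will not match the simulation-based table exactly.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_quantile(p):
    """Inverse normal CDF by bisection (accurate enough for power sketches)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def two_sample_power(d, n_per_group, alpha):
    """Approximate power of a two-sided two-sample t-test via the
    normal approximation: power ~ phi(ncp - z_crit), ncp = d*sqrt(n/2)."""
    ncp = d * math.sqrt(n_per_group / 2.0)
    z_crit = z_quantile(1.0 - alpha / 2.0)
    return phi(ncp - z_crit)

# d = 0.5, N = 40 per group: one test at alpha = .05 vs. a
# Bonferroni-style per-voxel alpha for ~100,000 voxels (0.05/100000 = 5e-7)
print(round(two_sample_power(0.5, 40, 0.05), 2))   # ~0.61
print(round(two_sample_power(0.5, 40, 5e-7), 3))   # ~0.003
```

Even this crude approximation shows power dropping by two orders of magnitude once a stringent per-voxel correction is applied at the same N.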

Power Decision Tree

What type of analysis are you planning?
|
+-- Whole-brain voxelwise analysis
|   |
|   +-- Within-subject (one-sample t-test)
|   |     --> Minimum N = 20; aim for N = 25-30
|   |         (Desmond & Glover, 2002)
|   |
|   +-- Between-group comparison
|   |   |
|   |   +-- Large expected effect (d > 0.8)
|   |   |     --> N = 20-25 per group (Thirion et al., 2007)
|   |   |
|   |   +-- Medium expected effect (d = 0.5)
|   |   |     --> N = 40-50 per group (Poldrack et al., 2017)
|   |   |
|   |   +-- Small expected effect (d = 0.3)
|   |         --> N = 80+ per group; consider ROI approach
|   |
|   +-- Brain-behavior correlation
|         --> N = 100+ minimum; N > 2,000 for replicability
|             (Marek et al., 2022)
|
+-- ROI-based analysis (a priori regions)
|     --> Use standard power formulas (G*Power) with expected
|         effect size from literature or pilot data.
|         No multiple comparison correction needed for a single ROI.
|         N = 15-30 typical for medium-large effects.
|
+-- ERP analysis
    |
    +-- Between-group
    |     --> 30-50 per group for medium effects
    |         (Boudewyn et al., 2018)
    |
    +-- Within-subject
          --> 25-30 subjects, 30+ trials per condition
              (Boudewyn et al., 2018; Luck, 2014)

Simulation-Based Power Approaches

fMRIpower (Mumford & Nichols, 2008)

Estimates power using pilot group-level activation maps:

  1. Run a pilot study (or use published results) to obtain group-level statistical maps
  2. Estimate effect sizes at each voxel from the pilot data
  3. Simulate new datasets with varying N by resampling from the estimated effect size and variance
  4. Apply the full statistical pipeline (including multiple comparison correction) to each simulation
  5. Power = proportion of simulations that detect the effect at a given ROI or voxel

Requirements: Pilot data from at least 10-15 subjects for stable variance estimates (Mumford & Nichols, 2008)
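Steps 3-5 can be sketched as a toy Monte Carlo. This is an illustration of the resampling logic, not the fMRIpower implementation: it simulates a one-sample group test at a single pilot-estimated effect size with unit subject variance, a normal approximation, and a fixed z threshold, whereas the real tool works from per-voxel effect and variance maps and runs the full correction pipeline.

```python
import math
import random

def simulated_power(d_pilot, n_new, z_crit=1.96, n_sims=4000, seed=0):
    """Monte Carlo power: resample group data at the pilot-estimated
    effect size, apply the planned threshold, count detections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # one-sample design: each subject contributes effect + unit noise
        sample = [rng.gauss(d_pilot, 1.0) for _ in range(n_new)]
        z_stat = (sum(sample) / n_new) * math.sqrt(n_new)
        if abs(z_stat) > z_crit:
            hits += 1
    return hits / n_sims

# Pilot effect d = 0.5 at an uncorrected alpha = .05 (z_crit = 1.96);
# for corrected inference, substitute the stricter critical value.
print(simulated_power(0.5, 30))   # roughly 0.75-0.80
print(simulated_power(0.5, 60))   # roughly 0.95+
```

Swapping in the corrected critical value for `z_crit` reproduces, in miniature, why step 4 (applying the full correction inside each simulation) matters.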

NeuroPowerTools (Durnez et al., 2016)

Web-based tool for peak-based power estimation:

  1. Upload an unthresholded statistical map from a pilot or published study
  2. The tool fits a mixture model to the peak distribution (null + alternative)
  3. Estimates the proportion of truly active voxels and their average effect size
  4. Computes power for new studies with varying N and thresholds

Advantage: Does not require individual subject data; can use published group maps.
URL: https://neuropowertools.org

Permutation-Based Power (Hayasaka et al., 2007)

  1. Generate simulated datasets under the alternative hypothesis using effect size maps from pilot data
  2. For each simulated dataset, run a full permutation test (5,000+ permutations)
  3. Compute power as the proportion of simulations in which the permutation test rejects the null

Advantage: Fully nonparametric; accounts for the exact multiple comparison correction used.
Disadvantage: Computationally expensive (requires running thousands of permutation tests per power estimate).

PowerMap (Joyce & Hayasaka, 2012)

Simulation-based power using parametric assumptions:

  1. Specify effect size map (from pilot data or assumed values)
  2. Specify noise model (based on residuals from pilot data)
  3. Simulate datasets with varying N
  4. Apply parametric statistical testing with specified correction method
  5. Estimate power at each voxel

Multiple Comparison Correction Impact on Power

The choice of correction method dramatically affects required sample size:

| Correction Method | Effective Alpha per Voxel | Relative Power | Source |
|---|---|---|---|
| None (p < 0.001 uncorrected) | 0.001 | Highest (but invalid inference) | -- |
| FDR q < 0.05 | ~0.0001-0.001 (data-dependent) | Moderate-High | Genovese et al., 2002 |
| Cluster-based (CDT p < 0.001) | Depends on cluster size | Moderate-High for large effects | Eklund et al., 2016 |
| Voxelwise FWE (RFT, p < 0.05) | ~0.0000005 (≈ 0.05/100,000) | Low | Worsley et al., 1996 |
| TFCE + permutation | Varies | Moderate | Smith & Nichols, 2009 |

Domain insight: Switching from voxelwise FWE to cluster-based or FDR correction can increase power by 50-200% for the same sample size, because these methods exploit the spatial extent of true activations (Nichols & Hayasaka, 2003).

Test-Retest Reliability and Power

For individual differences designs (correlating brain measures with behavior), reliability of the brain measure is critical (Elliott et al., 2020):

| Measure | Typical ICC | Implication | Source |
|---|---|---|---|
| Task fMRI activation (ROI) | 0.3-0.6 | Poor to moderate reliability | Elliott et al., 2020 |
| Resting-state connectivity | 0.3-0.7 | Moderate reliability; depends on scan duration | Elliott et al., 2020 |
| ERP amplitude | 0.5-0.8 | Moderate to good | Cassidy et al., 2012 |
| EEG oscillatory power | 0.6-0.9 | Good to excellent | Cohen, 2014 |

Critical formula: The maximum detectable correlation between brain and behavior is bounded by the reliabilities of both measures:

r_observed_max = r_true * sqrt(reliability_brain * reliability_behavior)

With brain ICC = 0.5 and behavior reliability = 0.8, even a true correlation of r = 0.5 would appear as r = 0.5 * sqrt(0.5 * 0.8) = 0.32 on average (Elliott et al., 2020). This attenuation means far larger samples are needed.
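The attenuation arithmetic, and its sample-size consequence, can be checked directly. The Fisher-z sample-size formula below (n ≈ ((z_α + z_β)/atanh r)² + 3, for 80% power at two-sided α = .05) is a standard textbook approximation added here for illustration; it is not taken from Elliott et al.

```python
import math

def max_observed_r(r_true, rel_brain, rel_behavior):
    """Attenuation bound: r_obs = r_true * sqrt(rel_brain * rel_behavior)."""
    return r_true * math.sqrt(rel_brain * rel_behavior)

def n_for_power(r, z_alpha=1.95996, z_beta=0.84162):
    """Approximate N to detect correlation r at 80% power,
    two-sided alpha = .05, via the Fisher z transformation."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

r_obs = max_observed_r(0.5, 0.5, 0.8)
print(round(r_obs, 2))      # 0.32, matching the worked example above
print(n_for_power(0.5))     # 30 -- N needed for the unattenuated effect
print(n_for_power(r_obs))   # 77 -- N needed after attenuation
```

Even in this optimistic scenario, attenuation alone more than doubles the required sample size — before accounting for multiple comparisons.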

Recommendation: For individual differences designs, collect longer scan sessions (at least 20-30 minutes of resting-state data; Birn et al., 2013) or use multi-session data to improve reliability.

Practical Power Calculation Workflow

For a New fMRI Study

  1. Define the primary analysis: Whole-brain voxelwise or ROI-based?
  2. Estimate effect size:
  • From pilot data (preferred): extract effect sizes from pilot activation maps
  • From literature: find the most comparable published study; correct for publication bias by assuming the true effect is ~50-75% of the published estimate (Button et al., 2013)
  • From meta-analysis: use NeuroSynth or BrainMap to estimate typical activation strength
  3. Choose the power analysis tool:
  • ROI-based: Standard power calculation (G*Power) using the estimated effect size at the ROI
  • Whole-brain: fMRIpower, NeuroPowerTools, or simulation
  4. Set target power: 80% (conventional) or 90% (recommended for costly neuroimaging studies)
  5. Account for attrition: Add 10-20% to planned N for participant exclusions due to excessive motion, incomplete data, or technical failures
  6. Report: Effect size source, power tool used, correction method, target power, final N
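The publication-bias adjustment and attrition allowance in the workflow above reduce to simple arithmetic. In this sketch, the 0.6 shrinkage factor and 15% attrition rate are illustrative midpoints of the ranges stated above, not fixed recommendations:

```python
import math

def shrunk_effect(d_published, shrinkage=0.6):
    """Winner's-curse adjustment: plan for ~50-75% of the
    published/pilot effect (0.6 is an arbitrary midpoint)."""
    return d_published * shrinkage

def recruit_n(power_n, attrition=0.15):
    """Inflate the power-analysis N for expected exclusions
    (motion, incomplete data, technical failures)."""
    return math.ceil(power_n / (1.0 - attrition))

print(round(shrunk_effect(0.8), 2))  # plan around d = 0.48, not d = 0.8
print(recruit_n(50))                 # recruit 59 to keep ~50 analyzable
```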

For a New EEG/ERP Study

  1. Estimate effect size: From pilot data or published ERP studies (see effect size table above)
  2. Determine trial count: At least 30 trials per condition post-rejection (Boudewyn et al., 2018)
  3. Plan for trial attrition: Assume 20-30% trial rejection rate; collect accordingly
  4. Subject-level power: Use G*Power with the estimated within- or between-subject effect size
  5. Account for subject attrition: Add 15-20% for exclusions due to excessive artifacts
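Steps 2-3 above amount to backing out the number of trials to record from the expected rejection rate. This is expected-value planning (a study's actual rejection rate is an assumption to be checked against pilot data):

```python
import math

def trials_to_record(target_clean, rejection_rate):
    """Trials to record so the expected post-rejection count hits the target."""
    return math.ceil(target_clean / (1.0 - rejection_rate))

print(trials_to_record(40, 0.25))  # record 54 to expect ~40 clean trials
print(trials_to_record(60, 0.30))  # record 86 for a time-frequency analysis
```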

Common Pitfalls

  1. Using uncorrected power estimates for whole-brain analyses: A study with 80% power at p < 0.001 uncorrected has far less than 80% power after FWE or FDR correction (Mumford & Nichols, 2008)
  2. Ignoring effect size inflation in pilot studies: Small pilot studies produce inflated effect sizes due to the "winner's curse." Assume the true effect is 50-75% of the pilot estimate (Button et al., 2013)
  3. Applying behavioral power formulas to neuroimaging: Standard t-test power calculations dramatically overestimate power for whole-brain analyses because they ignore multiple comparison correction
  4. Not accounting for participant attrition: In fMRI, 10-20% of participants may be excluded due to motion, scanner artifacts, or incomplete data. Over-recruit accordingly
  5. Ignoring reliability for individual differences: Brain measures with ICC < 0.5 attenuate correlations, requiring much larger samples than traditional power analysis suggests (Elliott et al., 2020)
  6. Assuming published sample sizes are adequate: Most published fMRI studies are underpowered (median power ~20%; Button et al., 2013). Do not use published N as a benchmark
  7. Neglecting the impact of design efficiency: An optimized event-related design can be 2-3x more efficient than a suboptimal one (Dale, 1999), effectively increasing power without adding subjects

Minimum Reporting Checklist

  • Target effect size and its source (pilot data, literature, meta-analysis)
  • Effect size metric used (Cohen's d, r, partial eta-squared)
  • Power analysis method (analytical, simulation-based, tool used)
  • Target power level (typically 80% or 90%)
  • Statistical test assumed (one-sample t, two-sample t, correlation, ANOVA)
  • Multiple comparison correction method and parameters
  • Planned N and justification
  • Attrition allowance (expected exclusion rate)
  • For simulation-based: number of simulations, pilot data source, software
  • For reliability-dependent designs: reliability estimates and their source

References

  • Birn, R. M., Molloy, E. K., Patriat, R., et al. (2013). The effect of scan length on the reliability of resting-state fMRI connectivity estimates. NeuroImage, 83, 550-558.
  • Boudewyn, M. A., Luck, S. J., Farrens, J. L., & Kappenman, E. S. (2018). How many trials does it take to get a significant ERP effect? Psychophysiology, 55(6), e13049.
  • Button, K. S., Ioannidis, J. P. A., Mokrysz, C., et al. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
  • Cassidy, S. M., Robertson, I. H., & O'Connell, R. G. (2012). Retest reliability of event-related potentials: Evidence from a variety of paradigms. Psychophysiology, 49(5), 659-664.
  • Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
  • Dale, A. M. (1999). Optimal experimental design for event-related fMRI. Human Brain Mapping, 8(2-3), 109-114.
  • Desmond, J. E., & Glover, G. H. (2002). Estimating sample size in functional MRI (fMRI) neuroimaging studies. Journal of Neuroscience Methods, 118(2), 115-128.
  • Durnez, J., Degryse, J., Moerkerke, B., et al. (2016). Power and sample size calculations for fMRI studies based on the prevalence of active peaks. bioRxiv, 049429.
  • Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900-7905.
  • Elliott, M. L., Knodt, A. R., Ireland, D., et al. (2020). What is the test-retest reliability of common task-functional MRI measures? Biological Psychiatry, 87(11), 934-948.
  • Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15(4), 870-878.
  • Hayasaka, S., Peiffer, A. M., Hugenschmidt, C. E., & Laurienti, P. J. (2007). Power and sample size calculation for neuroimaging studies by non-central random field theory. NeuroImage, 37(3), 721-730.
  • Joyce, K. E., & Hayasaka, S. (2012). Development of PowerMap: A software package for statistical power calculation in neuroimaging studies. Neuroinformatics, 10(4), 351-365.
  • Luck, S. J. (2014). An Introduction to the Event-Related Potential Technique (2nd ed.). MIT Press.
  • Marek, S., Tervo-Clemmens, B., Calabro, F. J., et al. (2022). Reproducible brain-wide association studies require thousands of individuals. Nature, 603(7902), 654-660.
  • Mumford, J. A., & Nichols, T. E. (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261-268.
  • Nichols, T. E., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: A comparative review. Statistical Methods in Medical Research, 12(5), 419-446.
  • Poldrack, R. A., Baker, C. I., Durnez, J., et al. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115-126.
  • Smith, S. M., & Nichols, T. E. (2009). Threshold-free cluster enhancement. NeuroImage, 44(1), 83-98.
  • Thirion, B., Pinel, P., Meriaux, S., et al. (2007). Analysis of a large fMRI cohort: Statistical and methodological issues for group analyses. NeuroImage, 35(1), 105-120.
  • Worsley, K. J., Marrett, S., Neelin, P., et al. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4(1), 58-73.

See references/ for detailed simulation examples and effect size lookup tables.
