Neuroimaging Power Guide

Purpose

Statistical power in neuroimaging is fundamentally different from power in behavioral research. The massive multiple comparisons problem (testing ~100,000 voxels simultaneously), spatial correlation structure, and non-standard test statistics mean that standard power formulas underestimate required sample sizes. Meanwhile, the field has historically been severely underpowered: the median fMRI study has only ~20% power to detect a typical effect (Button et al., 2013).

A competent programmer without neuroimaging training would apply standard power calculations (e.g., G*Power for a t-test) without accounting for multiple comparison correction, would not know typical effect sizes in neuroimaging, and would dramatically underestimate the sample sizes needed. This skill encodes the domain-specific knowledge for neuroimaging power analysis.

When to Use This Skill

  • Planning sample size for a new fMRI, EEG, or MEG study
  • Estimating power for grant applications or registered reports
  • Determining whether a published study was adequately powered
  • Choosing between ROI-based and whole-brain analysis based on power constraints
  • Evaluating the reliability implications of sample size choices

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question — What specific question is this analysis/paradigm addressing?
  2. Justify the method choice — Why is this approach appropriate? What alternatives were considered?
  3. Declare expected outcomes — What results would support vs. refute the hypothesis?
  4. Note assumptions and limitations — What does this method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Why Neuroimaging Power Is Different

Standard power analysis assumes a single statistical test. Neuroimaging involves:

| Challenge | Impact on Power | Source |
|---|---|---|
| Massive multiple comparisons | ~100,000 voxels tested; correction reduces sensitivity by orders of magnitude | Nichols & Hayasaka, 2003 |
| Spatial smoothness | Adjacent voxels are correlated, reducing the effective number of independent tests but complicating power calculation | Worsley et al., 1996 |
| Multi-level inference | Subject-level estimation + group-level test; both levels contribute noise | Mumford & Nichols, 2008 |
| Effect size variability | Effects vary across voxels, regions, and subjects; no single "effect size" characterizes a study | Poldrack et al., 2017 |
| Threshold-dependent power | Power depends heavily on the statistical threshold (corrected vs. uncorrected) and correction method | Hayasaka et al., 2007 |

Key implication: A standard G*Power calculation for a two-sample t-test will dramatically overestimate the power of a whole-brain fMRI analysis because it ignores multiple comparison correction (Mumford & Nichols, 2008).

Typical Effect Sizes in Neuroimaging

fMRI Effect Sizes

| Analysis Type | Typical Effect Size | Unit | Source |
|---|---|---|---|
| Task activation (voxel-level) | Cohen's d = 0.5-1.0 | Standardized mean difference | Poldrack et al., 2017 |
| Task activation (ROI-level) | Cohen's d = 0.5-1.5 | Standardized mean difference | Poldrack et al., 2017 |
| Between-group difference (voxel) | Cohen's d = 0.3-0.8 | Standardized mean difference | Poldrack et al., 2017 |
| Functional connectivity (correlation) | r = 0.2-0.5 | Pearson correlation | Marek et al., 2022 |
| Brain-behavior association | r = 0.1-0.3 | Pearson correlation | Marek et al., 2022 |
| Brain-wide association (replicable) | r < 0.05 at N < 1000 | Pearson correlation | Marek et al., 2022 |

Critical finding: Marek et al. (2022) demonstrated that brain-behavior correlations in typical neuroimaging samples (N < 100) are severely inflated. Replicable brain-behavior associations require N > 2,000 for whole-brain analyses.

EEG/ERP Effect Sizes

| Analysis Type | Typical Effect Size | Source |
|---|---|---|
| ERP component amplitude (e.g., N400, P300) | Cohen's d = 0.3-0.8 | Boudewyn et al., 2018 |
| ERP latency differences | Cohen's d = 0.2-0.5 | Luck, 2014 |
| EEG oscillatory power | Cohen's d = 0.3-0.6 | Cohen, 2014 |
| EEG connectivity (coherence/PLV) | Cohen's d = 0.2-0.5 | Cohen, 2014 |

Sample Size Benchmarks

fMRI Sample Size Recommendations

| Design | Minimum N | Recommended N | Assumptions | Source |
|---|---|---|---|---|
| Within-subject task activation | 20 | 25-30 | Large effect (d > 0.8), lenient correction | Desmond & Glover, 2002 |
| Between-group comparison (large effect, d = 0.8) | 20 per group | 25-30 per group | Whole-brain, cluster-corrected | Thirion et al., 2007 |
| Between-group comparison (medium effect, d = 0.5) | 40 per group | 50+ per group | Whole-brain, cluster-corrected | Thirion et al., 2007; Poldrack et al., 2017 |
| Resting-state individual differences | 25+ | 50+ (much more for replicability) | Depends on reliability of measure | Marek et al., 2022 |
| Brain-behavior correlations | 100+ | N > 2,000 for replicable whole-brain | Large-scale studies only | Marek et al., 2022 |
| ROI-based analysis (a priori) | 15-20 | 25+ | Single ROI, no whole-brain correction | Desmond & Glover, 2002 |

EEG/ERP Sample Size Recommendations

| Design | Minimum | Recommended | Source |
|---|---|---|---|
| ERP trials per condition per subject | 30 trials | 40-60 trials | Boudewyn et al., 2018 |
| ERP between-group (medium d = 0.5) | 34 per group | 50+ per group | Boudewyn et al., 2018 |
| ERP within-subject (medium d = 0.5) | 25 subjects | 30+ subjects | Luck, 2014 |
| Time-frequency analysis | 40 trials | 60+ trials | Cohen, 2014 |

Power at Common Sample Sizes

| N (per group) | Power for d = 0.5 (uncorrected) | Power for d = 0.5 (corrected, whole-brain) | Power for d = 0.8 (corrected) |
|---|---|---|---|
| 10 | ~26% | < 10% | ~25% |
| 20 | ~50% | ~20% | ~50% |
| 30 | ~70% | ~35% | ~70% |
| 40 | ~82% | ~50% | ~85% |
| 60 | ~94% | ~70% | ~95% |

Values are approximate, based on simulations from Mumford & Nichols (2008) and Desmond & Glover (2002). Exact power depends on design, smoothness, effect spatial extent, and correction method.
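The qualitative pattern — correction collapsing power — can be reproduced with a quick analytic sketch. The snippet below is an illustration, not a replacement for the simulation tools discussed later: it uses the normal approximation to a two-sided two-sample t-test and a Bonferroni-style per-voxel alpha (0.05 / 100,000), ignoring spatial smoothness and cluster inference, so its numbers will not match the simulation-based table exactly.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_quantile(p):
    """Inverse normal CDF by bisection (accurate enough for power sketches)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def two_sample_power(d, n_per_group, alpha):
    """Approximate power of a two-sided two-sample t-test via the
    normal approximation: power ~ phi(ncp - z_crit), ncp = d*sqrt(n/2)."""
    ncp = d * math.sqrt(n_per_group / 2.0)
    z_crit = z_quantile(1.0 - alpha / 2.0)
    return phi(ncp - z_crit)

# d = 0.5, N = 40 per group: one test at alpha = .05 vs. a
# Bonferroni-style per-voxel alpha for ~100,000 voxels (0.05/100000 = 5e-7)
print(round(two_sample_power(0.5, 40, 0.05), 2))   # ~0.61
print(round(two_sample_power(0.5, 40, 5e-7), 3))   # ~0.003
```

Even this crude approximation shows power dropping by two orders of magnitude once a stringent per-voxel correction is applied at the same N.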

Power Decision Tree

What type of analysis are you planning?
|
+-- Whole-brain voxelwise analysis
|   |
|   +-- Within-subject (one-sample t-test)
|   |     --> Minimum N = 20; aim for N = 25-30
|   |         (Desmond & Glover, 2002)
|   |
|   +-- Between-group comparison
|   |   |
|   |   +-- Large expected effect (d > 0.8)
|   |   |     --> N = 20-25 per group (Thirion et al., 2007)
|   |   |
|   |   +-- Medium expected effect (d = 0.5)
|   |   |     --> N = 40-50 per group (Poldrack et al., 2017)
|   |   |
|   |   +-- Small expected effect (d = 0.3)
|   |         --> N = 80+ per group; consider ROI approach
|   |
|   +-- Brain-behavior correlation
|         --> N = 100+ minimum; N > 2,000 for replicability
|             (Marek et al., 2022)
|
+-- ROI-based analysis (a priori regions)
|     --> Use standard power formulas (G*Power) with expected
|         effect size from literature or pilot data.
|         No multiple comparison correction needed for a single ROI.
|         N = 15-30 typical for medium-large effects.
|
+-- ERP analysis
    |
    +-- Between-group
    |     --> 30-50 per group for medium effects
    |         (Boudewyn et al., 2018)
    |
    +-- Within-subject
          --> 25-30 subjects, 30+ trials per condition
              (Boudewyn et al., 2018; Luck, 2014)

Simulation-Based Power Approaches

fMRIpower (Mumford & Nichols, 2008)

Estimates power using pilot group-level activation maps:

  1. Run a pilot study (or use published results) to obtain group-level statistical maps
  2. Estimate effect sizes at each voxel from the pilot data
  3. Simulate new datasets with varying N by resampling from the estimated effect size and variance
  4. Apply the full statistical pipeline (including multiple comparison correction) to each simulation
  5. Power = proportion of simulations that detect the effect at a given ROI or voxel

Requirements: Pilot data from at least 10-15 subjects for stable variance estimates (Mumford & Nichols, 2008)
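Steps 3-5 can be sketched as a toy Monte Carlo. This is an illustration of the resampling logic, not the fMRIpower implementation: it simulates a one-sample group test at a single pilot-estimated effect size with unit subject variance, a normal approximation, and a fixed z threshold, whereas the real tool works from per-voxel effect and variance maps and runs the full correction pipeline.

```python
import math
import random

def simulated_power(d_pilot, n_new, z_crit=1.96, n_sims=4000, seed=0):
    """Monte Carlo power: resample group data at the pilot-estimated
    effect size, apply the planned threshold, count detections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # one-sample design: each subject contributes effect + unit noise
        sample = [rng.gauss(d_pilot, 1.0) for _ in range(n_new)]
        z_stat = (sum(sample) / n_new) * math.sqrt(n_new)
        if abs(z_stat) > z_crit:
            hits += 1
    return hits / n_sims

# Pilot effect d = 0.5 at an uncorrected alpha = .05 (z_crit = 1.96);
# for corrected inference, substitute the stricter critical value.
print(simulated_power(0.5, 30))   # roughly 0.75-0.80
print(simulated_power(0.5, 60))   # roughly 0.95+
```

Swapping in the corrected critical value for `z_crit` reproduces, in miniature, why step 4 (applying the full correction inside each simulation) matters.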

NeuroPowerTools (Durnez et al., 2016)

Web-based tool for peak-based power estimation:

  1. Upload an unthresholded statistical map from a pilot or published study
  2. The tool fits a mixture model to the peak distribution (null + alternative)
  3. Estimates the proportion of truly active voxels and their average effect size
  4. Computes power for new studies with varying N and thresholds

Advantage: Does not require individual subject data; can use published group maps.
URL: https://neuropowertools.org

Permutation-Based Power (Hayasaka et al., 2007)

  1. Generate simulated datasets under the alternative hypothesis using effect size maps from pilot data
  2. For each simulated dataset, run a full permutation test (5,000+ permutations)
  3. Compute power as the proportion of simulations in which the permutation test rejects the null

Advantage: Fully nonparametric; accounts for the exact multiple comparison correction used.
Disadvantage: Computationally expensive (requires running thousands of permutation tests per power estimate).

PowerMap (Joyce & Hayasaka, 2012)

Simulation-based power using parametric assumptions:

  1. Specify effect size map (from pilot data or assumed values)
  2. Specify noise model (based on residuals from pilot data)
  3. Simulate datasets with varying N
  4. Apply parametric statistical testing with specified correction method
  5. Estimate power at each voxel

Multiple Comparison Correction Impact on Power

The choice of correction method dramatically affects required sample size:

| Correction Method | Effective Alpha per Voxel | Relative Power | Source |
|---|---|---|---|
| None (p < 0.001 uncorrected) | 0.001 | Highest (but invalid inference) | -- |
| FDR q < 0.05 | ~0.0001-0.001 (data-dependent) | Moderate-High | Genovese et al., 2002 |
| Cluster-based (CDT p < 0.001) | Depends on cluster size | Moderate-High for large effects | Eklund et al., 2016 |
| Voxelwise FWE (RFT, p < 0.05) | ~0.0000005 (≈ 0.05/100,000) | Low | Worsley et al., 1996 |
| TFCE + permutation | Varies | Moderate | Smith & Nichols, 2009 |

Domain insight: Switching from voxelwise FWE to cluster-based or FDR correction can increase power by 50-200% for the same sample size, because these methods exploit the spatial extent of true activations (Nichols & Hayasaka, 2003).

Test-Retest Reliability and Power

For individual differences designs (correlating brain measures with behavior), reliability of the brain measure is critical (Elliott et al., 2020):

| Measure | Typical ICC | Implication | Source |
|---|---|---|---|
| Task fMRI activation (ROI) | 0.3-0.6 | Poor to moderate reliability | Elliott et al., 2020 |
| Resting-state connectivity | 0.3-0.7 | Moderate reliability; depends on scan duration | Elliott et al., 2020 |
| ERP amplitude | 0.5-0.8 | Moderate to good | Cassidy et al., 2012 |
| EEG oscillatory power | 0.6-0.9 | Good to excellent | Cohen, 2014 |

Critical formula: The maximum detectable correlation between brain and behavior is bounded by the reliabilities of both measures:

r_observed_max = r_true * sqrt(reliability_brain * reliability_behavior)

With brain ICC = 0.5 and behavior reliability = 0.8, even a true correlation of r = 0.5 would appear as r = 0.5 * sqrt(0.5 * 0.8) = 0.32 on average (Elliott et al., 2020). This attenuation means far larger samples are needed.
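The attenuation arithmetic, and its sample-size consequence, can be checked directly. The Fisher-z sample-size formula below (n ≈ ((z_α + z_β)/atanh r)² + 3, for 80% power at two-sided α = .05) is a standard textbook approximation added here for illustration; it is not taken from Elliott et al.

```python
import math

def max_observed_r(r_true, rel_brain, rel_behavior):
    """Attenuation bound: r_obs = r_true * sqrt(rel_brain * rel_behavior)."""
    return r_true * math.sqrt(rel_brain * rel_behavior)

def n_for_power(r, z_alpha=1.95996, z_beta=0.84162):
    """Approximate N to detect correlation r at 80% power,
    two-sided alpha = .05, via the Fisher z transformation."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

r_obs = max_observed_r(0.5, 0.5, 0.8)
print(round(r_obs, 2))      # 0.32, matching the worked example above
print(n_for_power(0.5))     # 30 -- N needed for the unattenuated effect
print(n_for_power(r_obs))   # 77 -- N needed after attenuation
```

Even in this optimistic scenario, attenuation alone more than doubles the required sample size — before accounting for multiple comparisons.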

Recommendation: For individual differences designs, collect longer scan sessions (at least 20-30 minutes of resting-state data; Birn et al., 2013) or use multi-session data to improve reliability.

Practical Power Calculation Workflow

For a New fMRI Study

  1. Define the primary analysis: Whole-brain voxelwise or ROI-based?
  2. Estimate effect size:
  • From pilot data (preferred): extract effect sizes from pilot activation maps
  • From literature: find the most comparable published study; correct for publication bias by assuming the true effect is ~50-75% of the published estimate (Button et al., 2013)
  • From meta-analysis: use NeuroSynth or BrainMap to estimate typical activation strength
  3. Choose the power analysis tool:
  • ROI-based: Standard power calculation (G*Power) using the estimated effect size at the ROI
  • Whole-brain: fMRIpower, NeuroPowerTools, or simulation
  4. Set target power: 80% (conventional) or 90% (recommended for costly neuroimaging studies)
  5. Account for attrition: Add 10-20% to planned N for participant exclusions due to excessive motion, incomplete data, or technical failures
  6. Report: Effect size source, power tool used, correction method, target power, final N
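The publication-bias adjustment and attrition allowance in the workflow above reduce to simple arithmetic. In this sketch, the 0.6 shrinkage factor and 15% attrition rate are illustrative midpoints of the ranges stated above, not fixed recommendations:

```python
import math

def shrunk_effect(d_published, shrinkage=0.6):
    """Winner's-curse adjustment: plan for ~50-75% of the
    published/pilot effect (0.6 is an arbitrary midpoint)."""
    return d_published * shrinkage

def recruit_n(power_n, attrition=0.15):
    """Inflate the power-analysis N for expected exclusions
    (motion, incomplete data, technical failures)."""
    return math.ceil(power_n / (1.0 - attrition))

print(round(shrunk_effect(0.8), 2))  # plan around d = 0.48, not d = 0.8
print(recruit_n(50))                 # recruit 59 to keep ~50 analyzable
```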

For a New EEG/ERP Study

  1. Estimate effect size: From pilot data or published ERP studies (see effect size table above)
  2. Determine trial count: At least 30 trials per condition post-rejection (Boudewyn et al., 2018)
  3. Plan for trial attrition: Assume 20-30% trial rejection rate; collect accordingly
  4. Subject-level power: Use G*Power with the estimated within- or between-subject effect size
  5. Account for subject attrition: Add 15-20% for exclusions due to excessive artifacts
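Steps 2-3 above amount to backing out the number of trials to record from the expected rejection rate. This is expected-value planning (a study's actual rejection rate is an assumption to be checked against pilot data):

```python
import math

def trials_to_record(target_clean, rejection_rate):
    """Trials to record so the expected post-rejection count hits the target."""
    return math.ceil(target_clean / (1.0 - rejection_rate))

print(trials_to_record(40, 0.25))  # record 54 to expect ~40 clean trials
print(trials_to_record(60, 0.30))  # record 86 for a time-frequency analysis
```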

Common Pitfalls

  1. Using uncorrected power estimates for whole-brain analyses: A study with 80% power at p < 0.001 uncorrected has far less than 80% power after FWE or FDR correction (Mumford & Nichols, 2008)
  2. Ignoring effect size inflation in pilot studies: Small pilot studies produce inflated effect sizes due to the "winner's curse." Assume the true effect is 50-75% of the pilot estimate (Button et al., 2013)
  3. Applying behavioral power formulas to neuroimaging: Standard t-test power calculations dramatically overestimate power for whole-brain analyses because they ignore multiple comparison correction
  4. Not accounting for participant attrition: In fMRI, 10-20% of participants may be excluded due to motion, scanner artifacts, or incomplete data. Over-recruit accordingly
  5. Ignoring reliability for individual differences: Brain measures with ICC < 0.5 attenuate correlations, requiring much larger samples than traditional power analysis suggests (Elliott et al., 2020)
  6. Assuming published sample sizes are adequate: Most published fMRI studies are underpowered (median power ~20%; Button et al., 2013). Do not use published N as a benchmark
  7. Neglecting the impact of design efficiency: An optimized event-related design can be 2-3x more efficient than a suboptimal one (Dale, 1999), effectively increasing power without adding subjects

Minimum Reporting Checklist

  • Target effect size and its source (pilot data, literature, meta-analysis)
  • Effect size metric used (Cohen's d, r, partial eta-squared)
  • Power analysis method (analytical, simulation-based, tool used)
  • Target power level (typically 80% or 90%)
  • Statistical test assumed (one-sample t, two-sample t, correlation, ANOVA)
  • Multiple comparison correction method and parameters
  • Planned N and justification
  • Attrition allowance (expected exclusion rate)
  • For simulation-based: number of simulations, pilot data source, software
  • For reliability-dependent designs: reliability estimates and their source

References

  • Birn, R. M., Molloy, E. K., Patriat, R., et al. (2013). The effect of scan length on the reliability of resting-state fMRI connectivity estimates. NeuroImage, 83, 550-558.
  • Boudewyn, M. A., Luck, S. J., Farrens, J. L., & Kappenman, E. S. (2018). How many trials does it take to get a significant ERP effect? Psychophysiology, 55(6), e13049.
  • Button, K. S., Ioannidis, J. P. A., Mokrysz, C., et al. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
  • Cassidy, S. M., Robertson, I. H., & O'Connell, R. G. (2012). Retest reliability of event-related potentials: Evidence from a variety of paradigms. Psychophysiology, 49(5), 659-664.
  • Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
  • Dale, A. M. (1999). Optimal experimental design for event-related fMRI. Human Brain Mapping, 8(2-3), 109-114.
  • Desmond, J. E., & Glover, G. H. (2002). Estimating sample size in functional MRI (fMRI) neuroimaging studies. Journal of Neuroscience Methods, 118(2), 115-128.
  • Durnez, J., Degryse, J., Moerkerke, B., et al. (2016). Power and sample size calculations for fMRI studies based on the prevalence of active peaks. bioRxiv, 049429.
  • Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900-7905.
  • Elliott, M. L., Knodt, A. R., Ireland, D., et al. (2020). What is the test-retest reliability of common task-functional MRI measures? Biological Psychiatry, 87(11), 934-948.
  • Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15(4), 870-878.
  • Hayasaka, S., Peiffer, A. M., Hugenschmidt, C. E., & Laurienti, P. J. (2007). Power and sample size calculation for neuroimaging studies by non-central random field theory. NeuroImage, 37(3), 721-730.
  • Joyce, K. E., & Hayasaka, S. (2012). Development of PowerMap: A software package for statistical power calculation in neuroimaging studies. Neuroinformatics, 10(4), 351-365.
  • Luck, S. J. (2014). An Introduction to the Event-Related Potential Technique (2nd ed.). MIT Press.
  • Marek, S., Tervo-Clemmens, B., Calabro, F. J., et al. (2022). Reproducible brain-wide association studies require thousands of individuals. Nature, 603(7902), 654-660.
  • Mumford, J. A., & Nichols, T. E. (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261-268.
  • Nichols, T. E., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: A comparative review. Statistical Methods in Medical Research, 12(5), 419-446.
  • Poldrack, R. A., Baker, C. I., Durnez, J., et al. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115-126.
  • Smith, S. M., & Nichols, T. E. (2009). Threshold-free cluster enhancement. NeuroImage, 44(1), 83-98.
  • Thirion, B., Pinel, P., Meriaux, S., et al. (2007). Analysis of a large fMRI cohort: Statistical and methodological issues for group analyses. NeuroImage, 35(1), 105-120.
  • Worsley, K. J., Marrett, S., Neelin, P., et al. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4(1), 58-73.

See references/ for detailed simulation examples and effect size lookup tables.
