Cognitive Science Statistical Analysis
Purpose
This skill encodes domain-specific statistical knowledge for cognitive science and neuroscience research. It addresses the modeling decisions, correction strategies, and reporting conventions that a general-purpose statistician or programmer would get wrong without training in the field. For concrete analysis recipes with code, see references/common-analyses.md.
When to Use This Skill
- Choosing between repeated-measures ANOVA and mixed-effects models for a cognitive experiment
- Specifying random effects structure for designs with subjects and items
- Deciding how to handle reaction time (RT) data distributions
- Selecting the appropriate multiple comparison correction
- Deciding whether to use frequentist or Bayesian analysis
- Reporting effect sizes and statistical results for journal submission
Research Planning Protocol
Before executing the domain-specific steps below, you MUST:
- State the research question — What specific hypothesis is this statistical analysis testing?
- Justify the method choice — Why this statistical model? What alternatives were considered?
- Declare expected outcomes — What pattern of results would support vs. refute the hypothesis?
- Note assumptions and limitations — What does this method assume? Where could it mislead?
- Present the plan to the user and WAIT for confirmation before proceeding.
For detailed methodology guidance, see the research-literacy skill.
⚠️ Verification Notice
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Repeated-Measures ANOVA vs. Mixed-Effects Models
When to Use Repeated-Measures ANOVA
- Fully balanced design (no missing data, equal cell sizes)
- Only subjects as a random factor (no item variability)
- Simple factorial structure (2-3 factors, no continuous predictors)
- Sphericity is met or correctable (Greenhouse-Geisser / Huynh-Feldt)
When to Use Mixed-Effects Models (LMM/GLMM)
- Crossed random effects: Both subjects and items sampled from populations (Baayen, Davidson, & Bates, 2008; Clark, 1973). This is the norm in psycholinguistics, memory research, and any paradigm with stimulus variability.
- Unbalanced data or missing observations
- Continuous predictors (e.g., word frequency, stimulus duration)
- Non-normal response distributions (RT, accuracy)
- Need to generalize over both subjects AND items simultaneously
Critical domain knowledge: Clark (1973) demonstrated that failing to treat items as random effects inflates Type I error. This remains one of the most common statistical errors in cognitive science. If your stimuli are sampled from a larger population (e.g., words, faces, scenes), you must account for item variability.
Decision Logic
```
Are your stimuli sampled from a larger population?
|
+-- YES --> Mixed-effects model with crossed random effects
|           (subjects and items)
|
+-- NO (e.g., fixed set of 4 task conditions)
    |
    +-- Any missing data, unbalanced cells, or continuous predictors?
    |   |
    |   +-- YES --> Mixed-effects model (subjects as random effect)
    |   |
    |   +-- NO --> Repeated-measures ANOVA is acceptable
    |
    +-- Need trial-level analysis (e.g., RT distributions)?
        |
        +-- YES --> Mixed-effects model (operates on individual trials)
        +-- NO --> Repeated-measures ANOVA on condition means
```
Random Effects Structure
The Maximal Random Effects Principle
Barr et al. (2013) recommend fitting the maximal random effects structure justified by the design to minimize Type I error. This means including random intercepts and slopes for all within-unit factors.
For a typical 2x2 design with factors A (within-subjects, within-items) and B (within-subjects, between-items):
```r
# Maximal structure (Barr et al., 2013)
lmer(RT ~ A * B + (1 + A * B | Subject) + (1 + A | Item), data = d)
```
When Maximal Models Fail to Converge
Convergence failures are common with complex random effects. Use this hierarchy (Barr et al., 2013; Matuschek et al., 2017):
- First: Try a different optimizer (e.g., bobyqa or nlminb) with an increased iteration limit (e.g., `optCtrl = list(maxfun = 20000)` via `lmerControl`)
- Second: Remove correlations between random effects (use `||` in lme4)
- Third: Remove the highest-order random slopes first (interaction before main effects)
- Fourth: Use a parsimonious approach guided by likelihood ratio tests (Matuschek et al., 2017)
Do NOT simply drop all random slopes to achieve convergence. This inflates Type I error and undermines the purpose of mixed-effects modeling (Barr et al., 2013).
Common Cognitive Science Designs and Their Random Effects
| Design | Random Effects | Rationale |
|---|---|---|
| Lexical decision (words as items) | `(1 + condition \| subj) + (1 + condition \| word)` | Words are sampled from the language; generalize over both subjects and words |
| Stroop task (fixed conditions) | `(1 + congruency \| subj)` | Congruency levels are exhaustive, not sampled; no item population to generalize over |
| Picture naming (pictures as items) | `(1 + SOA \| subj) + (1 \| picture)` | Pictures sampled from a population; intercept-only for pictures when SOA varies between pictures |
| Multi-site study | `(1 + condition \| subj) + (1 \| site)` | Sites treated as sampled; condition varies within subjects |
Handling Reaction Time Data
RT data in cognitive experiments are positively skewed, bounded below by physiological limits, and often contaminated by outliers. The approach matters.
RT Outlier Exclusion
Apply these criteria before modeling (Ratcliff, 1993; Luce, 1986):
| Criterion | Threshold | Source |
|---|---|---|
| Fast outliers (anticipatory) | < 200 ms | Whelan, 2008; Ratcliff, 1993 |
| Slow absolute cutoff | > 2000-3000 ms (task-dependent) | Ratcliff, 1993 |
| Within-subject SD trimming | > 3 SD from participant's condition mean | Van Selst & Jolicoeur, 1994 |
| Within-subject MAD trimming | > 3 MAD from participant's condition median | Leys et al., 2013 (more robust to skew) |
Task-specific note: For simple RT tasks (e.g., detection), use 100 ms as the fast cutoff (Whelan, 2008). For choice RT tasks (e.g., lexical decision), use 200 ms (Ratcliff, 1993). Always report exclusion rates.
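The absolute cutoff and MAD-based trimming criteria above can be sketched in a few lines. This is a minimal, stdlib-only Python illustration (the function name and defaults are ours, not from any package); in practice it would be applied per participant and condition, and exclusion rates reported:

```python
from statistics import median

def mad_trim(rts, fast_cutoff=200, criterion=3.0):
    """Flag RTs for exclusion: an absolute fast cutoff plus MAD-based
    trimming around the median (Leys et al., 2013). Returns retained
    RTs in original order. Thresholds are illustrative defaults."""
    # Absolute fast cutoff (choice RT; use 100 ms for simple detection)
    kept = [rt for rt in rts if rt >= fast_cutoff]
    med = median(kept)
    # Scale MAD by 1.4826 so it estimates the SD under normality
    mad = 1.4826 * median(abs(rt - med) for rt in kept)
    return [rt for rt in kept if abs(rt - med) <= criterion * mad]
```

Here a 150 ms anticipation is removed by the absolute cutoff, and a 5000 ms straggler by the MAD criterion, while the central RTs survive.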
RT Transformation and Modeling Strategy
```
Is your primary interest in RT distributions (not just means)?
|
+-- YES --> Drift Diffusion Model or ex-Gaussian fitting
|
+-- NO --> Choose a modeling approach:
    |
    +-- Option 1: Log-transform RT, then fit LMM (Gaussian)
    |   - Pro: Simple, widely understood
    |   - Con: Back-transformation of means is biased; changes the
    |     hypothesis being tested (Lo & Andrews, 2015)
    |
    +-- Option 2: Inverse-transform RT (1/RT = speed), then LMM
    |   - Pro: Often achieves better normality than log
    |   - Con: Same back-transformation issues as log (Ratcliff, 1993)
    |
    +-- Option 3 (Recommended): Generalized LMM with Gamma family
        + identity link
        - Pro: Models RT in original units; handles skew directly;
          avoids transformation issues (Lo & Andrews, 2015)
        - Con: Computationally slower; may have convergence issues
          with complex random effects
```
Recommended default: Gamma GLMM with identity link (Lo & Andrews, 2015). Report results on the original millisecond scale.
```r
# Recommended RT model (Lo & Andrews, 2015)
glmer(RT ~ condition * group + (1 + condition | subj) + (1 | item),
      family = Gamma(link = "identity"), data = d)
```
Multiple Comparison Correction
Decision Guide for Cognitive Science
| Scenario | Method | Rationale | Source |
|---|---|---|---|
| Small number of planned contrasts (< 5) | No correction or Holm | Planned contrasts based on a priori hypotheses do not require correction if specified before data collection | Rubin, 2021 |
| All pairwise comparisons after ANOVA | Tukey HSD | Controls family-wise error for all pairwise comparisons; assumes equal variance | Tukey, 1953 |
| Many tests, correlated (e.g., EEG channels) | Cluster-based permutation | Respects spatial/temporal correlation structure | Maris & Oostenveld, 2007 |
| Many tests, independent | Bonferroni-Holm | More powerful than Bonferroni; step-down procedure | Holm, 1979 |
| Large-scale testing (fMRI voxels, genomics) | FDR (Benjamini-Hochberg) | Controls false discovery rate rather than family-wise error; appropriate when some false positives are tolerable | Benjamini & Hochberg, 1995 |
| Exploratory whole-brain fMRI | Cluster-level FWE (with cluster-forming threshold p < 0.001) | Eklund et al. (2016) showed that p < 0.01 cluster-forming threshold inflates false positive rates to ~70% | Eklund et al., 2016 |
| Confirmatory ROI analysis in fMRI | Small volume correction (SVC) with FWE | Restricts search space to a priori ROI | Worsley et al., 1996 |
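The Bonferroni-Holm row above is a step-down procedure: sort p-values ascending and compare the i-th smallest (1-indexed) to alpha/(m − i + 1), stopping at the first failure. A minimal stdlib-only sketch (function name is ours; in real analyses use an established routine such as R's `p.adjust`):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm (1979) step-down procedure. Returns a reject/retain
    flag per test, in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 -> threshold alpha/m
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one fails, all larger p retain
    return reject
```

With p-values (0.01, 0.04, 0.03, 0.005) and alpha = 0.05, the two smallest pass their thresholds (0.05/4 and 0.05/3), 0.03 fails against 0.05/2, and the procedure stops.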
When NOT to Correct
- Single planned contrast testing a specific a priori hypothesis (Rubin, 2021)
- Sequential Bayesian testing with BF stopping rules (evidence accumulation replaces correction; Schoenbrodt et al., 2017)
Bayesian Alternatives
When to Use Bayesian Analysis
- Quantifying evidence for the null hypothesis: Frequentist tests cannot support H0; Bayes factors can (Wagenmakers, 2007)
- Small sample sizes: Bayesian methods with informative priors can be more efficient (Kruschke, 2015)
- Sequential testing: Bayes factors allow continuous monitoring without alpha inflation (Schoenbrodt et al., 2017)
- Complex models where p-values are unreliable: Mixed models with small cluster sizes, or when asymptotic assumptions are questionable
Bayes Factor Interpretation
| BF10 Range | Evidence Category | Source |
|---|---|---|
| < 1/10 | Strong evidence for H0 | Jeffreys, 1961; Lee & Wagenmakers, 2013 |
| 1/10 to 1/3 | Moderate evidence for H0 | Lee & Wagenmakers, 2013 |
| 1/3 to 3 | Anecdotal / inconclusive | Lee & Wagenmakers, 2013 |
| 3 to 10 | Moderate evidence for H1 | Lee & Wagenmakers, 2013 |
| > 10 | Strong evidence for H1 | Lee & Wagenmakers, 2013 |
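As a small illustration, the table's boundaries (1/10, 1/3, 3, 10) map to evidence labels like this (a convenience sketch with our own function name; always report the exact BF alongside any label):

```python
def bf_category(bf10):
    """Map BF10 to the Lee & Wagenmakers (2013) evidence labels
    used in the table above."""
    if bf10 < 1 / 10:
        return "strong evidence for H0"
    if bf10 < 1 / 3:
        return "moderate evidence for H0"
    if bf10 <= 3:
        return "anecdotal / inconclusive"
    if bf10 <= 10:
        return "moderate evidence for H1"
    return "strong evidence for H1"
```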
Recommended Tools
| Tool | Use Case | Language |
|---|---|---|
| BayesFactor | Standard designs (t-test, ANOVA, correlation, regression) | R |
| brms | Complex models (multilevel, non-Gaussian, multivariate) | R (Stan backend) |
| JASP | GUI-based Bayesian analysis for standard tests | Standalone |
| PyMC | Custom Bayesian models | Python |
Reporting Bayes Factors
Report the exact BF, not just the category (Wagenmakers et al., 2018):
"A Bayesian paired-samples t-test indicated moderate evidence for a difference between conditions, BF10 = 5.3 (default Cauchy prior, r = 0.707)."
Always specify:
- The prior used (e.g., default Cauchy with scale r = 0.707 for BayesFactor t-test; Rouder et al., 2009)
- Direction (BF10 = evidence for H1 over H0)
- Robustness check: report BF across a range of prior widths
Effect Size Reporting
APA 7th Edition Requirements
APA 7th edition (2020, Section 6.6) requires reporting effect sizes for all primary analyses. The specific measure depends on the test:
| Test | Effect Size | Interpretation Benchmarks | Source |
|---|---|---|---|
| t-test (between groups) | Cohen's d | 0.2 small, 0.5 medium, 0.8 large | Cohen, 1988 |
| t-test (within subjects) | Cohen's d_z or d_av | d_z uses SD of difference scores | Lakens, 2013 |
| One-way ANOVA | eta-squared or omega-squared | 0.01 small, 0.06 medium, 0.14 large | Cohen, 1988 |
| Factorial ANOVA | partial eta-squared | 0.01 small, 0.06 medium, 0.14 large | Cohen, 1988; Richardson, 2011 |
| Mixed-effects model | semi-partial R-squared | No universal benchmarks; report CI | Rights & Sterba, 2019 |
| Correlation | r | 0.1 small, 0.3 medium, 0.5 large | Cohen, 1988 |
| Chi-square | Cramer's V or phi | Depends on df | Cohen, 1988 |
Domain note: Always report confidence intervals around effect sizes (APA 7th, 2020). Use `effectsize` (R) or `statsmodels` (Python) for computation. The benchmarks above are Cohen's generic guidelines; paradigm-specific benchmarks are more informative (see ../cogsci-power-analysis/references/effect-sizes.md).
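For instance, between-groups Cohen's d from the table is just the mean difference over the pooled SD. A stdlib-only sketch (function name is ours; prefer `effectsize` in R or an established Python package for reported analyses):

```python
from math import sqrt

def cohens_d(x, y):
    """Between-groups Cohen's d with pooled SD (Cohen, 1988).
    For within-subjects designs use d_z = mean(diff) / sd(diff)
    instead (Lakens, 2013)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    pooled_sd = sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```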
For Mixed-Effects Models
Traditional effect sizes are not straightforward for mixed models. Options:
- Semi-partial R-squared via the `r2glmm` or `effectsize` package (Rights & Sterba, 2019)
- Standardized regression coefficients: Standardize predictors before fitting
- Conditional and marginal R-squared: R2m (fixed effects only) and R2c (fixed + random) via `MuMIn::r.squaredGLMM()` (Nakagawa & Schielzeth, 2013)
Common Statistical Mistakes in Cognitive Science
1. Treating Items as Fixed Effects
Problem: Analyzing condition means averaged over items, ignoring item variability, fails to generalize beyond the specific stimuli used (Clark, 1973).
Fix: Use mixed-effects models with crossed random effects for subjects and items.
2. Circular Analysis ("Double-Dipping") in Neuroimaging
Problem: Selecting voxels/channels/time-windows based on the effect of interest, then testing that same effect (Kriegeskorte et al., 2009). Inflates effect sizes by 2x or more (Vul et al., 2009).
Fix: Use independent localizer, leave-one-out cross-validation, or whole-brain corrected analysis.
3. Analyzing Accuracy with ANOVA Instead of Logistic Models
Problem: ANOVA on proportion correct violates normality and homogeneity assumptions, especially at ceiling (> 90%) or floor (< 10%) (Jaeger, 2008; Dixon, 2008).
Fix: Use logistic mixed-effects model on binary (correct/incorrect) trial-level data.
4. Inappropriate Outlier Exclusion
Problem: Removing "outlier" participants based on the dependent variable (e.g., excluding subjects whose effects go in the wrong direction) without a priori criteria.
Fix: Define exclusion criteria before data collection. Base exclusions on performance metrics (accuracy below chance, excessive RTs), not on the effect of interest.
5. Running ANOVAs on RT Without Addressing Skew
Problem: ANOVA on raw RT means violates normality. Condition means conceal distributional differences (Ratcliff, 1993).
Fix: Use Gamma GLMM (Lo & Andrews, 2015) or transform RTs, and supplement with distributional analysis if warranted.
6. Using Uncorrected Cluster-Forming Thresholds in fMRI
Problem: Cluster-based inference with cluster-forming thresholds more lenient than p < 0.001 (uncorrected) produces unacceptable false positive rates up to 70% (Eklund et al., 2016).
Fix: Use voxel-level threshold of p < 0.001 (uncorrected) as minimum cluster-forming threshold, or use voxel-level FWE/FDR correction.
7. Reporting Correlation P-Values Without CIs
Problem: A "significant" correlation of r = 0.30 with N = 50 has a 95% CI of [0.02, 0.53] -- the true effect could be near zero (Cumming, 2014).
Fix: Always report bootstrap 95% CI for correlations. Use 10000 bootstrap samples (Efron & Tibshirani, 1993).
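A percentile-bootstrap CI for r needs nothing beyond the standard library: resample (x, y) pairs with replacement, recompute r each time, and take the 2.5th and 97.5th percentiles. A sketch under the 10000-sample recommendation (function names and the fixed seed are ours):

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def bootstrap_r_ci(x, y, n_boot=10000, seed=1):
    """Percentile bootstrap 95% CI for r (Efron & Tibshirani, 1993).
    Resamples whole (x, y) pairs to preserve the pairing."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot)]
```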
Minimum Statistical Reporting Checklist
Based on APA 7th edition (2020) and Appelbaum et al. (2018):
- Exact test statistic (F, t, chi-square, z) with degrees of freedom
- Exact p-value (not just "< 0.05"), to 3 decimal places or "< .001"
- Effect size with confidence interval
- For mixed models: random effects structure, optimizer, convergence confirmation
- For multiple comparisons: correction method and justification
- Sample sizes for each group/condition
- Data exclusion criteria (a priori) and proportion excluded
- For Bayesian: prior specification, exact BF, robustness check
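The "exact p-value" item in the checklist can be enforced mechanically. A small illustrative formatter (function name is ours; APA style drops the leading zero for quantities that cannot exceed 1, such as p):

```python
def format_p(p):
    """Format an exact p-value per the checklist: three decimals,
    or '< .001' below that threshold; never a bare '< 0.05'."""
    if p < 0.001:
        return "p < .001"
    # Drop the leading zero, e.g. 0.034 -> .034
    return f"p = {p:.3f}".replace("0.", ".", 1)
```

Example: a test yielding p = 0.0342 would be reported as "p = .034", not "p < .05".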
References
- American Psychological Association. (2020). Publication Manual of the APA (7th ed.).
- Appelbaum, M., et al. (2018). Journal article reporting standards for quantitative research. American Psychologist, 73(1), 3-25.
- Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390-412.
- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing. Journal of Memory and Language, 68(3), 255-278.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate. Journal of the Royal Statistical Society B, 57(1), 289-300.
- Clark, H. H. (1973). The language-as-fixed-effect fallacy. Journal of Verbal Learning and Verbal Behavior, 12(4), 335-359.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Erlbaum.
- Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.
- Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59(4), 447-456.
- Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall.
- Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900-7905.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70.
- Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs and toward logit mixed models. Journal of Memory and Language, 59(4), 434-446.
- Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford University Press.
- Kriegeskorte, N., et al. (2009). Circular analysis in systems neuroscience. Nature Neuroscience, 12(5), 535-540.
- Kruschke, J. K. (2015). Doing Bayesian Data Analysis (2nd ed.). Academic Press.
- Lakens, D. (2013). Calculating and reporting effect sizes. Frontiers in Psychology, 4, 863.
- Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian Cognitive Modeling. Cambridge University Press.
- Leys, C., et al. (2013). Detecting outliers: Do not use standard deviation around the mean. Journal of Experimental Social Psychology, 49(4), 764-766.
- Lo, S., & Andrews, S. (2015). To transform or not to transform: Using generalized linear mixed models to analyse reaction time data. Frontiers in Psychology, 6, 1171.
- Luce, R. D. (1986). Response Times. Oxford University Press.
- Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164(1), 177-190.
- Matuschek, H., et al. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305-315.
- Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133-142.
- Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510-532.
- Richardson, J. T. E. (2011). Eta squared and partial eta squared as measures of effect size. Educational Research Review, 6(2), 135-147.
- Rights, J. D., & Sterba, S. K. (2019). Quantifying explained variance in multilevel models. Journal of Educational and Behavioral Statistics, 44(2), 223-263.
- Rouder, J. N., et al. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225-237.
- Rubin, M. (2021). When to adjust alpha during multiple testing. Synthese, 199, 10969-11000.
- Schoenbrodt, F. D., et al. (2017). Sequential hypothesis testing with Bayes factors. Psychological Methods, 22(2), 322-339.
- Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. Quarterly Journal of Experimental Psychology, 47A(3), 631-650.
- Vul, E., et al. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274-290.
- Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804.
- Wagenmakers, E.-J., et al. (2018). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review, 25(1), 58-76.
- Whelan, R. (2008). Effective analysis of reaction time data. The Psychological Record, 58(3), 475-482.
- Worsley, K. J., et al. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4(1), 58-73.
See references/common-analyses.md for concrete analysis recipes with code patterns.