Cognitive Science Statistical Analysis
Purpose
This skill encodes domain-specific statistical knowledge for cognitive science and neuroscience research. It addresses the modeling decisions, correction strategies, and reporting conventions that a general-purpose statistician or programmer would get wrong without training in the field. For concrete analysis recipes with code, see references/common-analyses.md.
When to Use This Skill
- Choosing between repeated-measures ANOVA and mixed-effects models for a cognitive experiment
- Specifying random effects structure for designs with subjects and items
- Deciding how to handle reaction time (RT) data distributions
- Selecting the appropriate multiple comparison correction
- Deciding whether to use frequentist or Bayesian analysis
- Reporting effect sizes and statistical results for journal submission
Research Planning Protocol
Before executing the domain-specific steps below, you MUST:
- State the research question — What specific hypothesis is this statistical analysis testing?
- Justify the method choice — Why this statistical model? What alternatives were considered?
- Declare expected outcomes — What pattern of results would support vs. refute the hypothesis?
- Note assumptions and limitations — What does this method assume? Where could it mislead?
- Present the plan to the user and WAIT for confirmation before proceeding.
For detailed methodology guidance, see the research-literacy skill.
⚠️ Verification Notice
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Repeated-Measures ANOVA vs. Mixed-Effects Models
When to Use Repeated-Measures ANOVA
- Fully balanced design (no missing data, equal cell sizes)
- Only subjects as a random factor (no item variability)
- Simple factorial structure (2-3 factors, no continuous predictors)
- Sphericity is met or correctable (Greenhouse-Geisser / Huynh-Feldt)
When to Use Mixed-Effects Models (LMM/GLMM)
- Crossed random effects: Both subjects and items sampled from populations (Baayen, Davidson, & Bates, 2008; Clark, 1973). This is the norm in psycholinguistics, memory research, and any paradigm with stimulus variability.
- Unbalanced data or missing observations
- Continuous predictors (e.g., word frequency, stimulus duration)
- Non-normal response distributions (RT, accuracy)
- Need to generalize over both subjects AND items simultaneously
Critical domain knowledge: Clark (1973) demonstrated that failing to treat items as random effects inflates Type I error. This remains one of the most common statistical errors in cognitive science. If your stimuli are sampled from a larger population (e.g., words, faces, scenes), you must account for item variability.
Decision Logic
```
Are your stimuli sampled from a larger population?
|
+-- YES --> Mixed-effects model with crossed random effects
|           (subjects and items)
|
+-- NO (e.g., fixed set of 4 task conditions)
    |
    +-- Any missing data, unbalanced cells, or continuous predictors?
    |   |
    |   +-- YES --> Mixed-effects model (subjects as random effect)
    |   |
    |   +-- NO --> Repeated-measures ANOVA is acceptable
    |
    +-- Need trial-level analysis (e.g., RT distributions)?
        |
        +-- YES --> Mixed-effects model (operates on individual trials)
        +-- NO --> Repeated-measures ANOVA on condition means
```
Random Effects Structure
The Maximal Random Effects Principle
Barr et al. (2013) recommend fitting the maximal random effects structure justified by the design to minimize Type I error. This means including random intercepts and slopes for all within-unit factors.
For a typical 2x2 design with factors A (within-subjects, within-items) and B (within-subjects, between-items):
```r
# Maximal structure (Barr et al., 2013)
lmer(RT ~ A * B + (1 + A * B | Subject) + (1 + A | Item), data = d)
```
When Maximal Models Fail to Converge
Convergence failures are common with complex random effects. Use this hierarchy (Barr et al., 2013; Matuschek et al., 2017):
- First: Try a different optimizer (e.g., bobyqa or nlminb) with an increased iteration limit (e.g., `optCtrl = list(maxfun = 20000)` via `lmerControl`)
- Second: Remove correlations between random effects (use `||` in lme4)
- Third: Remove the highest-order random slopes first (interaction before main effects)
- Fourth: Use a parsimonious approach guided by likelihood ratio tests (Matuschek et al., 2017)
Do NOT simply drop all random slopes to achieve convergence. This inflates Type I error and undermines the purpose of mixed-effects modeling (Barr et al., 2013).
Common Cognitive Science Designs and Their Random Effects
| Design | Random Effects | Rationale |
|---|---|---|
| Lexical decision (words as items) | `(1 + condition \| subj) + (1 + condition \| word)` | Words are sampled from the language; generalize over both subjects and words |
| Stroop task (fixed conditions) | `(1 + congruency \| subj)` | Congruency levels are exhaustive, not sampled; no item population to generalize over |
| Picture naming (pictures as items) | `(1 + SOA \| subj) + (1 \| picture)` | Pictures sampled from a population; intercept-only for pictures when SOA varies between pictures |
| Multi-site study | `(1 + condition \| subj) + (1 \| site)` | Sites treated as sampled; condition varies within subjects |
Handling Reaction Time Data
RT data in cognitive experiments are positively skewed, bounded below by physiological limits, and often contaminated by outliers. The approach matters.
RT Outlier Exclusion
Apply these criteria before modeling (Ratcliff, 1993; Luce, 1986):
| Criterion | Threshold | Source |
|---|---|---|
| Fast outliers (anticipatory) | < 200 ms | Whelan, 2008; Ratcliff, 1993 |
| Slow absolute cutoff | > 2000-3000 ms (task-dependent) | Ratcliff, 1993 |
| Within-subject SD trimming | > 3 SD from participant's condition mean | Van Selst & Jolicoeur, 1994 |
| Within-subject MAD trimming | > 3 MAD from participant's condition median | Leys et al., 2013 (more robust to skew) |
Task-specific note: For simple RT tasks (e.g., detection), use 100 ms as the fast cutoff (Whelan, 2008). For choice RT tasks (e.g., lexical decision), use 200 ms (Ratcliff, 1993). Always report exclusion rates.
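The absolute cutoff and MAD-based trimming criteria above can be sketched in a few lines. This is a minimal, stdlib-only Python illustration (the function name and defaults are ours, not from any package); in practice it would be applied per participant and condition, and exclusion rates reported:

```python
from statistics import median

def mad_trim(rts, fast_cutoff=200, criterion=3.0):
    """Flag RTs for exclusion: an absolute fast cutoff plus MAD-based
    trimming around the median (Leys et al., 2013). Returns retained
    RTs in original order. Thresholds are illustrative defaults."""
    # Absolute fast cutoff (choice RT; use 100 ms for simple detection)
    kept = [rt for rt in rts if rt >= fast_cutoff]
    med = median(kept)
    # Scale MAD by 1.4826 so it estimates the SD under normality
    mad = 1.4826 * median(abs(rt - med) for rt in kept)
    return [rt for rt in kept if abs(rt - med) <= criterion * mad]
```

Here a 150 ms anticipation is removed by the absolute cutoff, and a 5000 ms straggler by the MAD criterion, while the central RTs survive.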
RT Transformation and Modeling Strategy
```
Is your primary interest in RT distributions (not just means)?
|
+-- YES --> Drift Diffusion Model or ex-Gaussian fitting
|
+-- NO --> Choose a modeling approach:
    |
    +-- Option 1: Log-transform RT, then fit LMM (Gaussian)
    |   - Pro: Simple, widely understood
    |   - Con: Back-transformation of means is biased; changes the
    |     hypothesis being tested (Lo & Andrews, 2015)
    |
    +-- Option 2: Inverse-transform RT (1/RT = speed), then LMM
    |   - Pro: Often achieves better normality than log
    |   - Con: Same back-transformation issues as log (Ratcliff, 1993)
    |
    +-- Option 3 (Recommended): Generalized LMM with Gamma family
        + identity link
        - Pro: Models RT in original units; handles skew directly;
          avoids transformation issues (Lo & Andrews, 2015)
        - Con: Computationally slower; may have convergence issues
          with complex random effects
```
Recommended default: Gamma GLMM with identity link (Lo & Andrews, 2015). Report results on the original millisecond scale.
```r
# Recommended RT model (Lo & Andrews, 2015)
glmer(RT ~ condition * group + (1 + condition | subj) + (1 | item),
      family = Gamma(link = "identity"), data = d)
```
Multiple Comparison Correction
Decision Guide for Cognitive Science
| Scenario | Method | Rationale | Source |
|---|---|---|---|
| Small number of planned contrasts (< 5) | No correction or Holm | Planned contrasts based on a priori hypotheses do not require correction if specified before data collection | Rubin, 2021 |
| All pairwise comparisons after ANOVA | Tukey HSD | Controls family-wise error for all pairwise comparisons; assumes equal variance | Tukey, 1953 |
| Many tests, correlated (e.g., EEG channels) | Cluster-based permutation | Respects spatial/temporal correlation structure | Maris & Oostenveld, 2007 |
| Many tests, independent | Bonferroni-Holm | More powerful than Bonferroni; step-down procedure | Holm, 1979 |
| Large-scale testing (fMRI voxels, genomics) | FDR (Benjamini-Hochberg) | Controls false discovery rate rather than family-wise error; appropriate when some false positives are tolerable | Benjamini & Hochberg, 1995 |
| Exploratory whole-brain fMRI | Cluster-level FWE (with cluster-forming threshold p < 0.001) | Eklund et al. (2016) showed that p < 0.01 cluster-forming threshold inflates false positive rates to ~70% | Eklund et al., 2016 |
| Confirmatory ROI analysis in fMRI | Small volume correction (SVC) with FWE | Restricts search space to a priori ROI | Worsley et al., 1996 |
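The Bonferroni-Holm row above is a step-down procedure: sort p-values ascending and compare the i-th smallest (1-indexed) to alpha/(m − i + 1), stopping at the first failure. A minimal stdlib-only sketch (function name is ours; in real analyses use an established routine such as R's `p.adjust`):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm (1979) step-down procedure. Returns a reject/retain
    flag per test, in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 -> threshold alpha/m
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one fails, all larger p retain
    return reject
```

With p-values (0.01, 0.04, 0.03, 0.005) and alpha = 0.05, the two smallest pass their thresholds (0.05/4 and 0.05/3), 0.03 fails against 0.05/2, and the procedure stops.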
When NOT to Correct
- Single planned contrast testing a specific a priori hypothesis (Rubin, 2021)
- Sequential Bayesian testing with BF stopping rules (evidence accumulation replaces correction; Schoenbrodt et al., 2017)
Bayesian Alternatives
When to Use Bayesian Analysis
- Quantifying evidence for the null hypothesis: Frequentist tests cannot support H0; Bayes factors can (Wagenmakers, 2007)
- Small sample sizes: Bayesian methods with informative priors can be more efficient (Kruschke, 2015)
- Sequential testing: Bayes factors allow continuous monitoring without alpha inflation (Schoenbrodt et al., 2017)
- Complex models where p-values are unreliable: Mixed models with small cluster sizes, or when asymptotic assumptions are questionable
Bayes Factor Interpretation
| BF10 Range | Evidence Category | Source |
|---|---|---|
| < 1/10 | Strong evidence for H0 | Jeffreys, 1961; Lee & Wagenmakers, 2013 |
| 1/10 to 1/3 | Moderate evidence for H0 | Lee & Wagenmakers, 2013 |
| 1/3 to 3 | Anecdotal / inconclusive | Lee & Wagenmakers, 2013 |
| 3 to 10 | Moderate evidence for H1 | Lee & Wagenmakers, 2013 |
| > 10 | Strong evidence for H1 | Lee & Wagenmakers, 2013 |
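As a small illustration, the table's boundaries (1/10, 1/3, 3, 10) map to evidence labels like this (a convenience sketch with our own function name; always report the exact BF alongside any label):

```python
def bf_category(bf10):
    """Map BF10 to the Lee & Wagenmakers (2013) evidence labels
    used in the table above."""
    if bf10 < 1 / 10:
        return "strong evidence for H0"
    if bf10 < 1 / 3:
        return "moderate evidence for H0"
    if bf10 <= 3:
        return "anecdotal / inconclusive"
    if bf10 <= 10:
        return "moderate evidence for H1"
    return "strong evidence for H1"
```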
Recommended Tools
| Tool | Use Case | Language |
|---|---|---|
| BayesFactor | Standard designs (t-test, ANOVA, correlation, regression) | R |
| brms | Complex models (multilevel, non-Gaussian, multivariate) | R (Stan backend) |
| JASP | GUI-based Bayesian analysis for standard tests | Standalone |
| PyMC | Custom Bayesian models | Python |
Reporting Bayes Factors
Report the exact BF, not just the category (Wagenmakers et al., 2018):
"A Bayesian paired-samples t-test indicated moderate evidence for a difference between conditions, BF10 = 5.3 (default Cauchy prior, r = 0.707)."
Always specify:
- The prior used (e.g., default Cauchy with scale r = 0.707 for BayesFactor t-test; Rouder et al., 2009)
- Direction (BF10 = evidence for H1 over H0)
- Robustness check: report BF across a range of prior widths
Effect Size Reporting
APA 7th Edition Requirements
APA 7th edition (2020, Section 6.6) requires reporting effect sizes for all primary analyses. The specific measure depends on the test:
| Test | Effect Size | Interpretation Benchmarks | Source |
|---|---|---|---|
| t-test (between groups) | Cohen's d | 0.2 small, 0.5 medium, 0.8 large | Cohen, 1988 |
| t-test (within subjects) | Cohen's d_z or d_av | d_z uses SD of difference scores | Lakens, 2013 |
| One-way ANOVA | eta-squared or omega-squared | 0.01 small, 0.06 medium, 0.14 large | Cohen, 1988 |
| Factorial ANOVA | partial eta-squared | 0.01 small, 0.06 medium, 0.14 large | Cohen, 1988; Richardson, 2011 |
| Mixed-effects model | semi-partial R-squared | No universal benchmarks; report CI | Rights & Sterba, 2019 |
| Correlation | r | 0.1 small, 0.3 medium, 0.5 large | Cohen, 1988 |
| Chi-square | Cramer's V or phi | Depends on df | Cohen, 1988 |
Domain note: Always report confidence intervals around effect sizes (APA 7th, 2020). Use `effectsize` (R) or `statsmodels` (Python) for computation. The benchmarks above are Cohen's generic guidelines; paradigm-specific benchmarks are more informative (see ../cogsci-power-analysis/references/effect-sizes.md).
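For instance, between-groups Cohen's d from the table is just the mean difference over the pooled SD. A stdlib-only sketch (function name is ours; prefer `effectsize` in R or an established Python package for reported analyses):

```python
from math import sqrt

def cohens_d(x, y):
    """Between-groups Cohen's d with pooled SD (Cohen, 1988).
    For within-subjects designs use d_z = mean(diff) / sd(diff)
    instead (Lakens, 2013)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    pooled_sd = sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```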
For Mixed-Effects Models
Traditional effect sizes are not straightforward for mixed models. Options:
- Semi-partial R-squared via the `r2glmm` or `effectsize` package (Rights & Sterba, 2019)
- Standardized regression coefficients: Standardize predictors before fitting
- Conditional and marginal R-squared: R2m (fixed effects only) and R2c (fixed + random) via `MuMIn::r.squaredGLMM()` (Nakagawa & Schielzeth, 2013)
Common Statistical Mistakes in Cognitive Science
1. Treating Items as Fixed Effects
Problem: Analyzing condition means averaged over items, ignoring item variability, fails to generalize beyond the specific stimuli used (Clark, 1973).
Fix: Use mixed-effects models with crossed random effects for subjects and items.
2. Circular Analysis ("Double-Dipping") in Neuroimaging
Problem: Selecting voxels/channels/time-windows based on the effect of interest, then testing that same effect (Kriegeskorte et al., 2009). Inflates effect sizes by 2x or more (Vul et al., 2009).
Fix: Use independent localizer, leave-one-out cross-validation, or whole-brain corrected analysis.
3. Analyzing Accuracy with ANOVA Instead of Logistic Models
Problem: ANOVA on proportion correct violates normality and homogeneity assumptions, especially at ceiling (> 90%) or floor (< 10%) (Jaeger, 2008; Dixon, 2008).
Fix: Use logistic mixed-effects model on binary (correct/incorrect) trial-level data.
4. Inappropriate Outlier Exclusion
Problem: Removing "outlier" participants based on the dependent variable (e.g., excluding subjects whose effects go in the wrong direction) without a priori criteria.
Fix: Define exclusion criteria before data collection. Base exclusions on performance metrics (accuracy below chance, excessive RTs), not on the effect of interest.
5. Running ANOVAs on RT Without Addressing Skew
Problem: ANOVA on raw RT means violates normality. Condition means conceal distributional differences (Ratcliff, 1993).
Fix: Use Gamma GLMM (Lo & Andrews, 2015) or transform RTs, and supplement with distributional analysis if warranted.
6. Using Uncorrected Cluster-Forming Thresholds in fMRI
Problem: Cluster-based inference with cluster-forming thresholds more lenient than p < 0.001 (uncorrected) produces unacceptable false positive rates up to 70% (Eklund et al., 2016).
Fix: Use voxel-level threshold of p < 0.001 (uncorrected) as minimum cluster-forming threshold, or use voxel-level FWE/FDR correction.
7. Reporting Correlation P-Values Without CIs
Problem: A "significant" correlation of r = 0.30 with N = 50 has a 95% CI of [0.02, 0.53] -- the true effect could be near zero (Cumming, 2014).
Fix: Always report bootstrap 95% CI for correlations. Use 10000 bootstrap samples (Efron & Tibshirani, 1993).
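A percentile-bootstrap CI for r needs nothing beyond the standard library: resample (x, y) pairs with replacement, recompute r each time, and take the 2.5th and 97.5th percentiles. A sketch under the 10000-sample recommendation (function names and the fixed seed are ours):

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def bootstrap_r_ci(x, y, n_boot=10000, seed=1):
    """Percentile bootstrap 95% CI for r (Efron & Tibshirani, 1993).
    Resamples whole (x, y) pairs to preserve the pairing."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot)]
```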
Minimum Statistical Reporting Checklist
Based on APA 7th edition (2020) and Appelbaum et al. (2018):
- Exact test statistic (F, t, chi-square, z) with degrees of freedom
- Exact p-value (not just "< 0.05"), to 3 decimal places or "< .001"
- Effect size with confidence interval
- For mixed models: random effects structure, optimizer, convergence confirmation
- For multiple comparisons: correction method and justification
- Sample sizes for each group/condition
- Data exclusion criteria (a priori) and proportion excluded
- For Bayesian: prior specification, exact BF, robustness check
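The "exact p-value" item in the checklist can be enforced mechanically. A small illustrative formatter (function name is ours; APA style drops the leading zero for quantities that cannot exceed 1, such as p):

```python
def format_p(p):
    """Format an exact p-value per the checklist: three decimals,
    or '< .001' below that threshold; never a bare '< 0.05'."""
    if p < 0.001:
        return "p < .001"
    # Drop the leading zero, e.g. 0.034 -> .034
    return f"p = {p:.3f}".replace("0.", ".", 1)
```

Example: a test yielding p = 0.0342 would be reported as "p = .034", not "p < .05".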
References
- American Psychological Association. (2020). Publication Manual of the APA (7th ed.).
- Appelbaum, M., et al. (2018). Journal article reporting standards for quantitative research. American Psychologist, 73(1), 3-25.
- Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390-412.
- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing. Journal of Memory and Language, 68(3), 255-278.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate. Journal of the Royal Statistical Society B, 57(1), 289-300.
- Clark, H. H. (1973). The language-as-fixed-effect fallacy. Journal of Verbal Learning and Verbal Behavior, 12(4), 335-359.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Erlbaum.
- Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.
- Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59(4), 447-456.
- Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall.
- Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900-7905.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70.
- Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs and toward logit mixed models. Journal of Memory and Language, 59(4), 434-446.
- Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford University Press.
- Kriegeskorte, N., et al. (2009). Circular analysis in systems neuroscience. Nature Neuroscience, 12(5), 535-540.
- Kruschke, J. K. (2015). Doing Bayesian Data Analysis (2nd ed.). Academic Press.
- Lakens, D. (2013). Calculating and reporting effect sizes. Frontiers in Psychology, 4, 863.
- Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian Cognitive Modeling. Cambridge University Press.
- Leys, C., et al. (2013). Detecting outliers: Do not use standard deviation around the mean. Journal of Experimental Social Psychology, 49(4), 764-766.
- Lo, S., & Andrews, S. (2015). To transform or not to transform: Using generalized linear mixed models to analyse reaction time data. Frontiers in Psychology, 6, 1171.
- Luce, R. D. (1986). Response Times. Oxford University Press.
- Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164(1), 177-190.
- Matuschek, H., et al. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305-315.
- Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133-142.
- Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510-532.
- Richardson, J. T. E. (2011). Eta squared and partial eta squared as measures of effect size. Educational Research Review, 6(2), 135-147.
- Rights, J. D., & Sterba, S. K. (2019). Quantifying explained variance in multilevel models. Journal of Educational and Behavioral Statistics, 44(2), 223-263.
- Rouder, J. N., et al. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225-237.
- Rubin, M. (2021). When to adjust alpha during multiple testing. Synthese, 199, 10969-11000.
- Schoenbrodt, F. D., et al. (2017). Sequential hypothesis testing with Bayes factors. Psychological Methods, 22(2), 322-339.
- Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. Quarterly Journal of Experimental Psychology, 47A(3), 631-650.
- Vul, E., et al. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274-290.
- Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804.
- Wagenmakers, E.-J., et al. (2018). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review, 25(1), 58-76.
- Whelan, R. (2008). Effective analysis of reaction time data. The Psychological Record, 58(3), 475-482.
- Worsley, K. J., et al. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4(1), 58-73.
See references/common-analyses.md for concrete analysis recipes with code patterns.