Research Literacy
Purpose
AI agents tend to execute analysis steps immediately without planning or justification. In research, every analysis decision needs a rationale grounded in theory, design, and data characteristics. This skill encodes the basic scientific thinking that should precede any domain-specific action.
A competent programmer without research training will typically: (a) pick a familiar method rather than the appropriate one, (b) skip assumption checks, (c) interpret results without considering alternative explanations, and (d) make undisclosed analytic choices that inflate false positive rates. This skill exists to prevent all four failure modes.
When to Use
- Before or alongside any domain-specific skill from this project (e.g., before running an ERP analysis, first formulate the research question and justify the method).
- Standalone when planning a study, reviewing an analysis pipeline, or interpreting results.
- Whenever an analysis involves researcher degrees of freedom — choices that could have been made differently and would affect the outcome.
⚠️ Verification Notice
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Research Question Formulation
From Vague Idea to Testable Hypothesis
A research question must be specific, falsifiable, and operationalized before any data analysis begins.
- Start with the phenomenon: What behavior, neural signal, or cognitive process are you interested in?
- Identify the gap: What is unknown or contested in the existing literature?
- Formulate as a directional or non-directional prediction: Specify the expected relationship between variables.
- Operationalize: Define how each construct is measured and what constitutes evidence for or against the hypothesis.
The PICOS Framework for Cognitive Science
Adapted from evidence-based medicine, PICOS structures research questions systematically:
| Element | General Definition | Cognitive Science Example |
|---|---|---|
| Population | Who is studied | Healthy adults aged 18-35; patients with aphasia |
| Intervention / Exposure | What manipulation or variable | Semantic priming; TMS to DLPFC |
| Comparison | What is the control condition | Unrelated prime; sham stimulation |
| Outcome | What is measured | N400 amplitude; reaction time; BOLD signal |
| Study design | How is the study structured | Within-subjects; longitudinal; cross-sectional |
Exploratory vs. Confirmatory Research
This distinction is critical for valid inference (Wagenmakers et al., 2012):
- Confirmatory research tests a pre-specified hypothesis. Statistical tests (p-values, confidence intervals) are only valid in this context. Requires preregistration of hypotheses and analysis plan.
- Exploratory research generates hypotheses from data. Results are descriptive and hypothesis-generating, not hypothesis-testing. Statistical tests in exploratory work should be interpreted as descriptive, not inferential.
- Mixing the two without disclosure is a primary driver of the replication crisis (Nosek et al., 2018). If you discover a pattern in the data and then test it in the same dataset, the resulting p-value is not valid.
Rule: Always declare whether an analysis is confirmatory or exploratory before executing it. If the analysis plan changed after seeing the data, label it exploratory.
Method Selection Justification
Match Question Type to Analysis Family
| Research Question Type | Analysis Family | Examples |
|---|---|---|
| Group differences | Comparison | t-test, ANOVA, Mann-Whitney, permutation test |
| Relationships between variables | Association | Correlation, regression, structural equation modeling |
| Predicting outcomes | Prediction | Regression, classification, machine learning |
| Describing patterns | Description | Descriptive statistics, factor analysis, clustering |
| Temporal dynamics | Time-series | Time-frequency, autoregressive models, HMM |
| Neural representations | Multivariate | RSA, MVPA, encoding models |
Decision Criteria for Method Selection
When choosing a method, consider and document the following:
- Data type: Continuous, ordinal, categorical, count? This constrains the model family.
- Design structure: Between-subjects, within-subjects, mixed? Nested or crossed random effects? This determines the error structure.
- Sample size: Is N sufficient for the chosen method? Underpowered studies waste resources and inflate effect size estimates (Button et al., 2013). See references/common-assumptions.md for method-specific guidance.
- Assumption profile: Do the data meet the method's assumptions? See references/common-assumptions.md.
- Multiple comparisons: How many tests will be performed? What correction is appropriate? (Benjamini & Hochberg, 1995, for FDR; Bonferroni for strict family-wise control; cluster-based permutation for neuroimaging, Maris & Oostenveld, 2007). See the sketch after this list.
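As a minimal sketch of the last point, the two most common corrections can be compared directly with statsmodels. The uncorrected p-values below are made up for illustration; in practice they would come from the actual analyses.

```python
# A minimal sketch comparing two corrections with statsmodels; the
# uncorrected p-values here are made up for illustration.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.012, 0.034, 0.041, 0.22, 0.48])

# Bonferroni: strict family-wise error control, lowest sensitivity
bonf_reject, bonf_p, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, more sensitive
fdr_reject, fdr_p, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:", int(bonf_reject.sum()))
print("FDR (BH) rejections:  ", int(fdr_reject.sum()))
```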
The "Method Hammer" Anti-Pattern
"If all you have is a hammer, everything looks like a nail."
This anti-pattern occurs when a researcher applies the method they are most comfortable with, regardless of whether it is appropriate. Examples:
- Using a t-test when the design has multiple crossed factors (requires ANOVA or mixed model)
- Applying parametric tests to ordinal Likert data without justification
- Using mass-univariate analysis when the research question is about distributed patterns (requires MVPA)
- Defaulting to frequentist tests when the question is about evidence for the null (requires Bayesian analysis or equivalence testing)
Rule: Always articulate why THIS method and not alternatives. Document the alternatives considered and why they were rejected.
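One common instance of this anti-pattern is repeated-measures data analyzed with a trial-level t-test, which ignores the dependence between trials from the same participant. Below is a minimal sketch of the mixed-model alternative; the variable names (subject, condition, rt) and all simulated numbers are illustrative, not prescribed.

```python
# A minimal sketch, assuming a long-format table with one row per trial.
# Variable names and all simulated numbers are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_trials = 20, 30
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_trials),
    "condition": np.tile(["congruent", "incongruent"], n_subj * n_trials // 2),
})
# Simulate a by-subject baseline shift plus a 30 ms congruity effect
subject_offset = rng.normal(0, 50, n_subj)[df["subject"]]
df["rt"] = (500
            + 30 * (df["condition"] == "incongruent")
            + subject_offset
            + rng.normal(0, 40, len(df)))

# The random intercept per subject models the dependence between trials
# from the same participant, which a trial-level t-test would ignore.
model = smf.mixedlm("rt ~ condition", df, groups=df["subject"]).fit()
print(model.summary())
```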
Expected Outcomes Declaration
Before running any analysis, declare what each possible outcome means:
The Three-Outcome Framework
- If H1 is supported: What specific pattern of results would you expect? (e.g., "a significant interaction between condition and group, with a larger N400 for incongruent trials in the control group but not the patient group")
- If H0 is supported: What would the data look like? (e.g., "no significant effects, Bayes factor favoring H0 > 3")
- If results are ambiguous: What would be inconclusive? (e.g., "a trend-level effect, p = .05-.10, with a small effect size below the smallest effect of interest")
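One way to make this declaration concrete is to record it in machine-readable form before touching the data. The sketch below is hypothetical; the field names and thresholds are illustrative, not a required schema.

```python
# A hypothetical sketch of recording expected outcomes before analysis;
# field names and thresholds are illustrative, not a required schema.
import json

expected_outcomes = {
    "hypothesis": "Incongruent trials elicit a larger N400 than congruent trials",
    "analysis_type": "confirmatory",
    "supports_h1": "condition effect in the predicted direction, p < .05, d >= 0.3",
    "supports_h0": "Bayes factor favoring H0 greater than 3",
    "ambiguous": "p between .05 and .10, or effect below the smallest effect of interest",
    "smallest_effect_of_interest": 0.2,  # Cohen's d, chosen a priori (illustrative)
}

# Saving the declaration before running the analysis creates a dated record.
with open("expected_outcomes.json", "w") as f:
    json.dump(expected_outcomes, f, indent=2)
```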
Why This Matters
Declaring expected outcomes in advance prevents:
- HARKing (Hypothesizing After Results are Known): presenting post-hoc hypotheses as if they were a priori predictions (Kerr, 1998). A survey of researchers found that 43% self-reported HARKing at least once (Fiedler & Schwarz, 2016).
- Post-hoc rationalization: finding a plausible story for any result after the fact.
- Outcome switching: changing the primary outcome measure after seeing which one yields significant results.
Assumptions and Limitations Awareness
Every Method Has Assumptions
No statistical method is assumption-free. Before applying any method, identify its key assumptions and check them. The full reference table is in references/common-assumptions.md.
Common Assumption Categories
- Independence: Observations are not systematically related to each other. Violated by: repeated measures, clustered data, spatial/temporal autocorrelation in neural data.
- Normality: The sampling distribution of the test statistic is normal; this is often confused with normality of the raw data. It matters most for small samples, since in large samples the central limit theorem makes many tests robust to non-normal data.
- Homogeneity of variance: Variance is equal across groups or conditions. Violated when group sizes are unequal and variances differ. Use Welch's correction or robust methods.
- Stationarity: Statistical properties do not change over time. Relevant for EEG, fMRI time series. Violated by habituation, fatigue, scanner drift.
- Measurement validity: The measure actually captures the construct of interest. No statistical test can fix a bad measure. Construct validity must be argued on theoretical grounds.
- Correct model specification: The statistical model matches the data-generating process. Omitted variables, wrong functional form, and incorrect random effects structure all threaten validity (Barr et al., 2013).
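A minimal sketch of checks for two of these categories (normality and homogeneity of variance), with a robust fallback, using scipy; the group data are simulated and all parameters are illustrative.

```python
# A minimal sketch of two assumption checks plus a robust fallback, assuming
# two independent groups of continuous scores; all values are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(100, 15, 40)
group_b = rng.normal(108, 25, 25)   # unequal n and unequal variance

# Normality of each group (most informative for small samples)
print("Shapiro-Wilk A: p =", round(stats.shapiro(group_a).pvalue, 3))
print("Shapiro-Wilk B: p =", round(stats.shapiro(group_b).pvalue, 3))

# Homogeneity of variance across groups
print("Levene: p =", round(stats.levene(group_a, group_b).pvalue, 3))

# Welch's t-test does not assume equal variances
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t-test: t = {t:.2f}, p = {p:.3f}")
```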
Limitations Are Not Optional
Every study has limitations. Common categories:
- Internal validity threats: confounds, demand characteristics, order effects
- External validity threats: limited sample demographics, artificial lab conditions
- Statistical conclusion validity: low power, violated assumptions, multiple comparisons
- Construct validity threats: impure measures, task impurity in neuropsychology
Rule: List limitations upfront, not as an afterthought. This is not a weakness; it is scientific rigor.
Human-in-the-Loop Principles
Why AI Agents Must Pause
Research involves judgment calls where reasonable experts disagree. These "researcher degrees of freedom" can inflate false positive rates from a nominal 5% to over 60% when left unchecked (Simmons et al., 2011). AI agents must not make these decisions silently.
Mandatory Pause Points
ALWAYS present the analysis plan and WAIT for user confirmation before proceeding at these decision points:
- Participant or trial exclusion: "I propose excluding 3 participants based on [criterion]. Here is the exclusion rationale and the impact on sample size."
- Outlier treatment: "These data points are [N] SDs from the mean. Options: (a) winsorize, (b) trim, (c) transform, (d) use robust methods, (e) retain. Each has different implications." See the comparison sketched after this list.
- Multiple comparisons correction: "With [N] comparisons, I recommend [method]. Alternatives are [list]. The choice affects sensitivity and specificity as follows..."
- Model specification: "I am fitting [model]. Key choices include [random effects structure, covariates, link function]. Here is why, and here are alternatives."
- Data transformation: "The data violate [assumption]. I propose [transformation/alternative method]. This changes the interpretation as follows..."
- Unexpected results: "The results do not match the predicted pattern. Before interpreting, consider: (a) the analysis may be wrong, (b) the hypothesis may be wrong, (c) there may be a confound."
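For the outlier-treatment pause point, the options can be computed side by side so the user sees the consequences before choosing. A minimal sketch with simulated reaction times (all numbers illustrative):

```python
# A minimal sketch comparing outlier-handling options on the same data,
# so the choice can be shown to the user rather than made silently.
# The reaction times below are simulated for illustration.
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(2)
rts = np.append(rng.normal(500, 60, 95), [1500, 1600, 1800, 2100, 2400])

print(f"Retain all:        mean = {np.mean(rts):.1f}")
print(f"Winsorize 5%:      mean = {np.mean(mstats.winsorize(rts, limits=[0.05, 0.05])):.1f}")
print(f"Trim 5%:           mean = {stats.trim_mean(rts, 0.05):.1f}")
print(f"Log-transform:     back-transformed mean = {np.exp(np.mean(np.log(rts))):.1f}")
print(f"Robust (median):   {np.median(rts):.1f}")
```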
Transparency Protocol
- Never silently drop data points, trials, or participants
- Never silently switch between one-tailed and two-tailed tests
- Never silently add or remove covariates
- Never silently change the dependent variable or time window
- Always report the full set of analyses, not just significant ones
Common Research Anti-Patterns
These are well-documented threats to research integrity. An AI agent must actively avoid them and flag when a user's request risks falling into one.
1. p-Hacking
Running multiple analyses, selectively reporting significant results, or tweaking analysis parameters until p < .05. Simulations show this can inflate false positive rates from 5% to over 60% (Simmons et al., 2011, Psychological Science, 22(11), 1359-1366).
How to avoid: Preregister analyses. Report all analyses conducted. Use correction for multiple comparisons.
2. HARKing (Hypothesizing After Results are Known)
Presenting post-hoc hypotheses as if they were a priori predictions (Kerr, 1998, Personality and Social Psychology Review, 2(3), 196-217).
How to avoid: Write down hypotheses before analysis. Clearly label any post-hoc exploration.
3. Confirmation Bias in Analysis
Selectively reporting evidence that supports preferred conclusions while downplaying contradictory evidence.
How to avoid: Report effect sizes and confidence intervals for all outcomes, not just significant ones. Use adversarial collaboration or preregistered analysis plans.
4. Garden of Forking Paths
Even without deliberate p-hacking, undisclosed analytic flexibility creates a "garden of forking paths" where many analysis pipelines could have been chosen, inflating the effective number of comparisons (Gelman & Loken, 2014, American Scientist, 102(6), 460-465).
How to avoid: Document every analytic decision and its alternatives. Consider multiverse analysis (Steegen et al., 2016).
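A toy multiverse sketch in the spirit of Steegen et al. (2016): the same comparison is run under every combination of a few analytic choices, and the spread of p-values shows how much the conclusion depends on them. The data and the choice set below are illustrative.

```python
# A toy multiverse sketch (in the spirit of Steegen et al., 2016), assuming
# trial-level RTs in two conditions; the data and choice set are illustrative.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cond_a = np.append(rng.normal(500, 80, 60), [1200, 1400])
cond_b = np.append(rng.normal(520, 80, 60), [1300])

outlier_cutoffs = [2.0, 2.5, 3.0, None]          # SD-based exclusion thresholds
transforms = [("raw", lambda x: x), ("log", np.log)]

for cutoff, (tname, tfunc) in itertools.product(outlier_cutoffs, transforms):
    a, b = cond_a, cond_b
    if cutoff is not None:
        a = a[np.abs(stats.zscore(a)) < cutoff]
        b = b[np.abs(stats.zscore(b)) < cutoff]
    p = stats.ttest_ind(tfunc(a), tfunc(b), equal_var=False).pvalue
    print(f"cutoff={cutoff}, transform={tname}: p = {p:.3f}")

# The spread of p-values across pipelines shows how strongly the conclusion
# depends on analytic choices that could reasonably have gone either way.
```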
5. Cargo Cult Statistics
Applying statistical procedures as rituals without understanding the underlying assumptions or logic. The "null ritual" — mechanically testing H0 at alpha = .05 without specifying H1, considering effect sizes, or evaluating power — is the canonical example (Gigerenzer, 2004, Journal of Socio-Economics, 33, 587-606).
How to avoid: For every test, articulate: What is H0? What is H1? What is the expected effect size? What is the power? Is the test appropriate for this data structure?
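The power question in particular can be answered before data collection. Below is a minimal sketch of an a priori power analysis with statsmodels; the assumed effect size of d = 0.5 is illustrative, not a recommendation.

```python
# A minimal sketch of an a priori power analysis with statsmodels; the assumed
# effect size of d = 0.5 is illustrative, not a recommendation.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size needed per group for 80% power at alpha = .05 (two-sided)
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Required n per group: {n_per_group:.1f}")

# Conversely: power actually achieved with 20 participants per group
achieved = analysis.solve_power(effect_size=0.5, nobs1=20, alpha=0.05)
print(f"Power with n = 20 per group: {achieved:.2f}")
```

statsmodels provides analogous classes for one-sample or paired designs (TTestPower) and one-way ANOVA (FTestAnovaPower).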
6. Outcome Switching
Changing the primary outcome variable after seeing the data because the original outcome was not significant.
How to avoid: Preregister primary and secondary outcomes. Report results for the preregistered primary outcome regardless of significance.
The Planning Protocol
This is the core procedure. Execute these steps before any analysis.
Step 1: State the Research Question
Write the question in one sentence. It must be specific, testable, and falsifiable. Use the PICOS framework above.
Step 2: Classify as Confirmatory or Exploratory
If confirmatory, a preregistered hypothesis must exist. If exploratory, label all results as hypothesis-generating.
Step 3: Justify the Chosen Method
Name the method, explain why it is appropriate for this question and data, and list alternatives that were considered and why they were rejected.
Step 4: Declare Expected Outcomes
For each hypothesis, state what supporting, refuting, and ambiguous results would look like, with expected effect sizes where possible.
Step 5: List Assumptions and Limitations
Enumerate the method's statistical assumptions and how they will be checked. List known limitations of the design and analysis.
Step 6: Present the Plan to the User
Show the complete plan in a structured format (see references/planning-template.md). Include decision points where user input is required.
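As a hypothetical illustration only (the canonical format is defined in references/planning-template.md, not here), the plan might be captured in a structure like the following; every field value below is invented for demonstration.

```python
# A hypothetical illustration of one way to structure the plan; the canonical
# format is defined in references/planning-template.md, and every value below
# is invented for demonstration.
analysis_plan = {
    "research_question": "Does semantic congruity modulate N400 amplitude differently in patients and controls?",
    "classification": "confirmatory",                       # Step 2
    "method": "linear mixed model with by-subject random intercepts and slopes",
    "alternatives_rejected": {
        "repeated-measures ANOVA": "cannot model trial-level covariates",
    },
    "expected_outcomes": {                                   # Step 4
        "supports_h1": "group x congruity interaction, p < .05",
        "supports_h0": "Bayes factor favoring H0 greater than 3",
        "ambiguous": "interaction smaller than the smallest effect of interest",
    },
    "assumptions_to_check": ["residual normality", "homoscedasticity", "model convergence"],
    "limitations": ["modest patient sample", "single-site recruitment"],
    "user_decision_points": ["trial exclusion criteria", "multiple-comparison correction"],
}
```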
Step 7: WAIT for User Confirmation
Do not proceed until the user approves the plan or requests modifications.
Step 8: Execute and Compare
After analysis, explicitly compare results to the expected outcomes declared in Step 4. Discuss discrepancies honestly.
Step 9: Report Limitations
Reiterate limitations, including any that became apparent during analysis (e.g., assumption violations, unexpected data patterns).
Key References
- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255-278.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
- Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
- Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
- Fiedler, K., & Schwarz, N. (2016). Questionable research practices revisited. Social Psychological and Personality Science, 7(1), 45-52.
- Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102(6), 460-465.
- Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587-606.
- Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217.
- Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164(1), 177-190.
- Munafò, M. R., Nosek, B. A., Bishop, D. V. M., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021.
- Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600-2606.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
- Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702-712.
- Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632-638.