Sentence Stimulus Norming

Purpose

This skill encodes expert methodological knowledge for norming linguistic stimuli before running psycholinguistic experiments. A competent programmer without linguistics training would likely construct stimuli based on intuition, failing to control for critical lexical variables (word frequency, length, neighborhood density), skipping cloze norming, using inappropriate rating scales, or under-powering the norming study. Poor stimulus norming is the single most common methodological weakness in psycholinguistic research, because confounds in the materials propagate to every analysis.

When to Use

Use this skill when:

  • Creating sentence stimuli for reading experiments (self-paced reading, eye-tracking, ERP)
  • Norming the predictability (cloze probability) of critical words in sentence contexts
  • Collecting plausibility, naturalness, or acceptability ratings for sentence materials
  • Controlling lexical properties of critical words across experimental conditions
  • Designing Latin square counterbalancing for within-item designs
  • Planning filler items and practice trials

Do not use this skill when:

  • Working with single-word stimuli without sentence context (use lexical database tools directly)
  • Designing non-linguistic stimuli (visual search arrays, tones)
  • Analyzing existing normed materials without creating new ones

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question -- What specific question is this analysis/paradigm addressing?
  2. Justify the method choice -- Why is this approach appropriate? What alternatives were considered?
  3. Declare expected outcomes -- What results would support vs. refute the hypothesis?
  4. Note assumptions and limitations -- What does this method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Cloze Probability Norming

What Is Cloze Probability?

Cloze probability is the proportion of people who complete a sentence fragment with a particular word (Taylor, 1953). It is the standard measure of a word's predictability in context and is a critical control variable in nearly all sentence processing research.

Procedure

  1. Create sentence fragments: Truncate each sentence immediately before the critical word
  2. Present fragments one at a time to participants
  3. Instruct: "Please complete each sentence with the first word that comes to mind. Write only one word."
  4. Score: For each item, cloze probability = (number of completions matching the target word) / (total number of respondents)
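The scoring step can be sketched in Python (a minimal illustration; the function name and cleaning rules are my own, not from any published package, and blank or multi-word responses are dropped from the denominator per the scoring conventions in this skill):

```python
def cloze_probability(responses, target):
    """Proportion of valid completions exactly matching the target word.

    responses: list of raw one-word completions from raters
    target: the critical word, e.g. "bird"
    Blank and multi-word responses are excluded from the denominator.
    """
    cleaned = [r.strip().lower() for r in responses]
    valid = [r for r in cleaned if r and " " not in r]
    if not valid:
        return 0.0
    matches = sum(1 for r in valid if r == target.lower())
    return matches / len(valid)

# Example: 30 raters, 18 produce "bird", 10 produce "plane", 2 leave blanks
responses = ["bird"] * 18 + ["plane"] * 10 + [""] * 2
print(round(cloze_probability(responses, "bird"), 3))  # 18/28 -> 0.643
```

A real scoring script would also implement the a priori decisions about morphological variants and misspellings described under Scoring Conventions.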

Design Parameters

| Parameter | Recommended Value | Citation / Rationale |
|---|---|---|
| N per item | Minimum 30 raters | Taylor, 1953; Bloom & Fischler, 1980; standard minimum for stable estimates |
| Preferred N | 40-50 raters | More stable estimates, especially for medium-cloze items |
| Items per participant | 50-100 fragments per norming session | Avoid fatigue; pilot to calibrate |
| Time limit | ~10-15 seconds per item, or untimed | Untimed is standard; a brief limit prevents overthinking |
| Population | Same as experimental population (e.g., native English speakers, same age range) | Ensures cloze values generalize |

Scoring Conventions

  • Exact match: Only the target word counts (standard)
  • Morphological variants: Decide a priori whether "run" and "running" count as the same completion. Standard practice: count only the exact form (Staub et al., 2015)
  • Spelling errors: Accept obvious misspellings of the target
  • Blank/nonsense responses: Exclude from the denominator (participant did not engage)

Cloze Probability Benchmarks

| Cloze Range | Label | Use Case |
|---|---|---|
| > 0.80 | High cloze / highly predictable | N400 amplitude studies; predictability effects (Kutas & Hillyard, 1984) |
| 0.30-0.70 | Medium cloze | Moderate predictability manipulations |
| < 0.10 | Low cloze / unpredictable | Baseline; unexpected completions |
| 0.00 | Zero cloze | Anomalous or implausible continuations |
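The benchmark bands above can be encoded as a screening helper for normed items. This is a sketch; note the bands leave 0.10-0.30 and 0.70-0.80 unlabeled, so the gap handling here is my own assumption:

```python
def cloze_label(p):
    """Map an item's observed cloze probability to a benchmark label.

    Items falling in the unlabeled gaps (0.10-0.30, 0.70-0.80) are
    flagged for the experimenter to resolve or discard.
    """
    if p == 0.0:
        return "zero cloze"
    if p > 0.80:
        return "high cloze"
    if 0.30 <= p <= 0.70:
        return "medium cloze"
    if p < 0.10:
        return "low cloze"
    return "unclassified (falls between benchmark bands)"

print(cloze_label(0.92))  # high cloze
print(cloze_label(0.15))  # unclassified (falls between benchmark bands)
```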

Online vs. Lab Norming

| Aspect | Lab | Online (e.g., Prolific, MTurk) |
|---|---|---|
| Quality control | Direct observation | Must include catch trials and attention checks |
| Sample size | Limited by lab capacity | Easy to reach N = 40-50 per item |
| Population | Typically university students | More diverse; specify inclusion criteria |
| Validity | Gold standard | Comparable for cloze (Schütze & Sprouse, 2014) |
| Cost | Lab time | Participant payment (~$10-15/hour; Prolific standards) |

Recommendation for online norming: Include 10-15% catch trials (sentences with obvious completions, e.g., "The dog chased the ___") and exclude participants who fail > 20% of catch trials.
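This exclusion rule can be applied with a few lines of Python (a sketch; the data structure is illustrative):

```python
def flag_exclusions(catch_results, max_fail_rate=0.20):
    """Return IDs of participants failing more than max_fail_rate of catch trials.

    catch_results: dict mapping participant ID -> list of booleans
                   (True = catch trial answered with the obvious completion)
    """
    excluded = []
    for pid, results in catch_results.items():
        fail_rate = 1 - sum(results) / len(results)
        if fail_rate > max_fail_rate:
            excluded.append(pid)
    return excluded

data = {
    "p01": [True] * 9 + [False],      # 10% failures -> keep
    "p02": [True] * 6 + [False] * 4,  # 40% failures -> exclude
}
print(flag_exclusions(data))  # ['p02']
```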

Plausibility and Naturalness Ratings

When to Collect

  • When cloze probability alone is insufficient (e.g., both conditions have low cloze but differ in plausibility)
  • When manipulating semantic fit or thematic role plausibility
  • When verifying that "anomalous" conditions are genuinely perceived as odd

Rating Scale Design

| Parameter | Recommended | Citation / Rationale |
|---|---|---|
| Scale type | Likert scale | Standard for sentence ratings (Schütze & Sprouse, 2014) |
| Number of points | 7-point scale | Balances sensitivity and reliability; standard in psycholinguistics (Schütze & Sprouse, 2014) |
| Anchors | 1 = "very unnatural/implausible" to 7 = "very natural/plausible" | Labeled endpoints with unlabeled intermediate points |
| N per item | Minimum 20 raters; preferred 30+ | Sufficient for stable means per item (Sprouse & Almeida, 2012) |
| Items per rater | 40-80 items per session | Avoid fatigue effects |
| Practice items | 3-5 items spanning the full range before data collection | Calibrate scale use |

Instructions Template

"You will read a series of sentences. For each sentence, please rate how natural or plausible it sounds on a scale from 1 to 7, where 1 means 'very unnatural / makes no sense' and 7 means 'perfectly natural / makes complete sense.' There are no right or wrong answers; we are interested in your intuition."

Critical Design Considerations

  • Within-list design: Each rater sees only one version of each item (Latin square). Raters should never see multiple conditions of the same item, or they will rate contrastively rather than absolutely.
  • Filler items: Include filler sentences spanning the full rating range. This prevents range restriction.
  • Order effects: Randomize item order per participant.

Acceptability Judgments

When to Collect

  • When manipulating syntactic structure (grammaticality, island constraints, movement dependencies)
  • When testing formal linguistic predictions about sentence well-formedness
  • For factorial designs crossing syntactic factors (e.g., 2x2 designs testing island effects; Sprouse et al., 2012)

Rating Methods

| Method | Description | Pros | Cons | Citation |
|---|---|---|---|---|
| Likert scale (7-point) | Rate acceptability 1-7 | Simple; familiar; sufficient for most purposes | Ceiling/floor possible; ordinal data | Schütze & Sprouse, 2014 |
| Magnitude estimation (ME) | Assign a number proportional to perceived acceptability relative to a reference sentence | Unbounded scale; ratio-level data (in theory) | More complex; participants need training; debated whether it outperforms Likert | Bard et al., 1996; Sprouse, 2011 |
| Forced choice | Choose the more acceptable of two sentences | Binary; easy; avoids scale-use differences | Low sensitivity; many trials needed | Sprouse & Almeida, 2012 |
| Yes/No judgment | "Is this sentence acceptable?" | Simple; binary | Very low sensitivity; cannot distinguish degrees of unacceptability | -- |

Recommendation: Use 7-point Likert as the default. It provides sufficient sensitivity for most research questions and has been shown to replicate formal linguistic judgments as reliably as magnitude estimation (Sprouse & Almeida, 2012; Sprouse, 2011).

Sample Size for Acceptability

| Design | Minimum N | Rationale | Citation |
|---|---|---|---|
| Simple grammatical/ungrammatical | 20 participants | Large effect sizes (d > 1.0 typical) | Sprouse & Almeida, 2012 |
| Factorial (2x2) with interaction | 30-40 participants | Interaction effects are smaller | Sprouse et al., 2012 |
| Subtle contrasts | 50+ participants | Small effect sizes require more power | Power analysis recommended |

Lexical Controls

Variables That Must Be Controlled Across Conditions

Every critical word manipulation must control for confounding lexical variables. The target word and its condition-matched alternatives should be equated on the following:

| Variable | Database / Source | Why It Matters | Citation |
|---|---|---|---|
| Word frequency | SUBTLEX-US (log10 word frequency per million) | Most powerful predictor of reading time; ~30-60 ms effect for high vs. low | Brysbaert & New, 2009 |
| Word length | Character count | Longer words = longer reading times; ~20-30 ms per character (Rayner, 2009) | Rayner, 1998 |
| Orthographic neighborhood density (N) | N-Watch; CLEARPOND | Number of words differing by one letter; affects lexical access (Coltheart et al., 1977) | Andrews, 1997 |
| Concreteness | Brysbaert et al. (2014) ratings | Concrete words processed faster than abstract words | Brysbaert et al., 2014 |
| Age of acquisition (AoA) | Kuperman et al. (2012) ratings | Earlier-acquired words processed faster | Kuperman et al., 2012 |
| Number of syllables | Any pronunciation dictionary | Affects phonological processing time | Rayner, 1998 |
| Morphological complexity | Manual coding | Derived words (e.g., un-happi-ness) processed differently than monomorphemic words | Taft, 2004 |

Frequency Database Selection

| Database | Language | Measure | Recommended? | Citation |
|---|---|---|---|---|
| SUBTLEX-US | English (US) | Subtitle-based frequency per million | Yes -- best predictor of processing times | Brysbaert & New, 2009 |
| SUBTLEX-UK | English (UK) | Subtitle-based frequency | Yes, for British English materials | van Heuven et al., 2014 |
| HAL | English | Usenet corpus frequency | Outdated; SUBTLEX preferred | Lund & Burgess, 1996 |
| CELEX | English, Dutch, German | Mixed corpus frequency | Acceptable but less predictive than SUBTLEX | Baayen et al., 1995 |

Key recommendation: Use SUBTLEX log frequency values. They explain more variance in lexical decision and naming times than older norms (Brysbaert & New, 2009).

How to Match Across Conditions

  1. Select critical words for each condition
  2. Retrieve lexical metrics from SUBTLEX-US and norming databases
  3. Compute condition means for each metric
  4. Test for differences: Run t-tests or ANOVAs across conditions on each lexical variable
  5. Criterion: No significant differences (p > 0.20 is a reasonable threshold; some use p > 0.30) on any controlled variable
  6. If matching fails: replace items or add the unmatched variable as a covariate in the analysis
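Step 4 can be sketched with a hand-rolled Welch t statistic using only the standard library (in practice you would use a statistics package such as scipy or R to get p-values and degrees of freedom; the frequency values below are hypothetical):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for comparing one lexical variable
    (e.g., SUBTLEX log10 frequency) across two condition word lists.
    Degrees of freedom and p-value are left to a full stats package.
    """
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

# Hypothetical log10 frequencies for condition A vs. condition B critical words
freq_a = [2.1, 2.4, 2.3, 2.0, 2.2, 2.5]
freq_b = [2.2, 2.3, 2.1, 2.4, 2.0, 2.6]
print(round(abs(welch_t(freq_a, freq_b)), 2))  # 0.14 -> plausibly matched
```

A small |t| (well below ~1.3, the rough point where p drops under 0.20) is what you want here: the goal is evidence of similarity, not a significant difference.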

Latin Square Counterbalancing

Purpose

In a within-item design, each item appears in all conditions, but each participant sees each item in only one condition. A Latin square assigns items to conditions across participant lists.

Construction

For a design with k conditions and n items (where n is divisible by k):

  1. Divide items into k groups of n/k items each
  2. Create k lists; in each list, assign each item group to a different condition
  3. Each participant receives one list
  4. Result: every item appears in every condition across participants; each participant sees an equal number of items per condition
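The construction steps above can be sketched as a small generator (names are illustrative; conditions are represented as indices 0..k-1):

```python
def latin_square_lists(items, k):
    """Assign n items to k conditions across k lists (n divisible by k).

    Returns k lists; each entry is an (item, condition_index) pair.
    Across lists, every item appears once in every condition; within a
    list, each condition gets exactly n/k items.
    """
    n = len(items)
    assert n % k == 0, "number of items must be divisible by k"
    size = n // k
    groups = [items[i * size:(i + 1) * size] for i in range(k)]
    lists = []
    for l in range(k):
        assignment = []
        for g, group in enumerate(groups):
            cond = (g + l) % k  # rotate group-to-condition mapping per list
            assignment.extend((item, cond) for item in group)
        lists.append(assignment)
    return lists

# 4 items, 2 conditions -> 2 lists, mirroring the 2-condition example below
lists = latin_square_lists(["i1", "i2", "i3", "i4"], 2)
print(lists[0])  # [('i1', 0), ('i2', 0), ('i3', 1), ('i4', 1)]
print(lists[1])  # [('i1', 1), ('i2', 1), ('i3', 0), ('i4', 0)]
```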

Example: 2-Condition Design

With 40 items and 2 conditions (A, B):

| List | Items 1-20 | Items 21-40 |
|---|---|---|
| List 1 | Condition A | Condition B |
| List 2 | Condition B | Condition A |

Requirements

| Parameter | Value | Rationale |
|---|---|---|
| Minimum items per condition per list | 16-24 | Standard for psycholinguistic experiments; fewer items = lower power (Brysbaert & Stevens, 2018) |
| Recommended items | 24-40 per condition | More stable estimates, especially for eye-tracking |
| Participants per list | Equal across lists; minimum 4-6 per list | Ensures balanced representation |
| Total participants | Divisible by number of lists | Critical for balanced design |

Filler Items

Purpose

Fillers prevent participants from noticing the experimental manipulation and adopting strategies.

Design Parameters

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Filler-to-target ratio | 2:1 or 3:1 (fillers:targets) | Standard in psycholinguistics; prevents pattern detection (Schütze & Sprouse, 2014) |
| Filler diversity | Span the full range of sentence types, lengths, and structures | Prevents target sentences from standing out |
| Filler acceptability range | Include some clearly good and some mildly awkward fillers | Prevents raters from using only part of the scale |
| Filler length | Match the average length of target sentences | Controls for sentence length expectations |

Filler Construction Tips

  • Use fillers from different syntactic constructions than your targets
  • Include some fillers with comprehension questions (for reading studies) to maintain attentive reading
  • If targets are semantically anomalous, include some fillers that are also slightly odd (but in different ways) so anomaly is not a cue

Practice and Warm-Up Items

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Number of practice items | 4-6 items (minimum 3) | Familiarize participants with the task and interface |
| Practice item composition | Span the range of difficulty/acceptability | Calibrate participant expectations |
| Practice data | Always exclude from analysis | Practice responses are contaminated by learning effects |
| Warm-up items at start of main experiment | 2-3 additional filler items | Allow settling into the task; exclude from analysis |

Online Norming Considerations

Platform Recommendations

| Platform | Pros | Cons | Typical Pay Rate |
|---|---|---|---|
| Prolific | Diverse participants; pre-screening; good data quality | Smaller pool than MTurk | ~$10-15/hour (Prolific minimum: $8/hour) |
| Amazon MTurk | Large pool; fast recruitment | Lower data quality; less diverse; requires careful screening | ~$10-15/hour recommended |
| PCIbex / Ibex Farm | Free hosting; designed for linguistics | Requires programming; no built-in recruitment | Hosting only |
| Gorilla | GUI-based; good for complex designs | Subscription cost | Hosting only |

Quality Control for Online Studies

| Measure | Implementation | Threshold |
|---|---|---|
| Catch trials | Include 10-15% filler items with obvious answers | Exclude participants failing > 20% |
| Completion time | Record total time | Exclude participants completing in < 50% of median time |
| Straight-lining | Check for same response on all items | Exclude participants with zero variance in ratings |
| Bot detection | Include reCAPTCHA or similar | Exclude flagged responses |
| Native speaker check | Self-report + brief language background questionnaire | Exclude non-native speakers (unless studying L2) |
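Two of these rules (completion time and straight-lining) reduce to a few lines of code. A sketch with illustrative data structures:

```python
from statistics import median

def qc_exclude(durations, ratings_by_pid):
    """Flag participants who completed in under 50% of the median time
    or gave the same rating on every item (zero variance).

    durations: dict pid -> total completion time in seconds
    ratings_by_pid: dict pid -> list of numeric ratings
    """
    cutoff = 0.5 * median(durations.values())
    excluded = set(pid for pid, t in durations.items() if t < cutoff)
    for pid, ratings in ratings_by_pid.items():
        if len(set(ratings)) == 1:  # identical response on all items
            excluded.add(pid)
    return excluded

durations = {"p1": 600, "p2": 580, "p3": 200}  # p3 finished suspiciously fast
ratings = {"p1": [3, 5, 2, 6], "p2": [4, 4, 4, 4], "p3": [2, 5, 3, 6]}
print(sorted(qc_exclude(durations, ratings)))  # ['p2', 'p3']
```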

Common Pitfalls

  1. Not norming cloze probability: Claiming words are "predictable" or "unpredictable" based on experimenter intuition rather than empirical cloze norms. Always collect cloze data (Taylor, 1953).

  2. Too few raters per item: With N < 20 raters for cloze, individual item estimates are unstable. A word with true cloze of 0.50 could yield observed cloze of 0.20-0.80 with only 10 raters. Use minimum 30 raters (Bloom & Fischler, 1980).

  3. Not controlling word frequency: Frequency is the strongest single predictor of reading time. A 1 log-unit difference in SUBTLEX frequency corresponds to ~30-40 ms in gaze duration (Brysbaert & New, 2009; Rayner, 1998). Always match or control.

  4. Using the wrong frequency database: HAL and Kucera-Francis norms are outdated. SUBTLEX-US explains significantly more variance in behavioral data (Brysbaert & New, 2009).

  5. Showing raters multiple conditions of the same item: This introduces contrastive evaluation. Raters must see each item in only one condition (Latin square for norming too).

  6. Insufficient filler items: A 1:1 target-to-filler ratio makes the manipulation transparent. Use at least 2:1 fillers to targets (Schütze & Sprouse, 2014).

  7. Not piloting the norming study: Always pilot with 5-10 participants to catch unclear instructions, ambiguous items, and timing issues before running the full norming sample.

  8. Ignoring age of acquisition: AoA effects are independent of frequency (Kuperman et al., 2012). Failing to control AoA can introduce confounds, especially for studies comparing concrete vs. abstract words.

Minimum Reporting Checklist

Based on Schütze & Sprouse (2014) and current psycholinguistic standards:

  • Number of items per condition
  • Cloze probability values: mean, SD, and range per condition (if collected)
  • Cloze norming details: N raters, population, procedure, scoring criteria
  • Plausibility/acceptability ratings: scale type, N raters, mean and SD per condition
  • Lexical control variables: list each controlled variable, database source, and condition means
  • Statistical test confirming conditions do not differ on controlled variables
  • Latin square design: number of lists, items per list per condition, participants per list
  • Filler-to-target ratio and description of filler types
  • Number of practice/warm-up items
  • For online norming: platform, pay rate, attention check procedure, exclusion criteria and N excluded
  • Full item list (in supplementary materials or online repository)

References

  • Andrews, S. (1997). The effect of orthographic similarity on lexical retrieval: Resolving neighborhood conflicts. Psychonomic Bulletin & Review, 4, 439-461.
  • Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
  • Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania.
  • Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72, 32-68.
  • Bloom, P. A., & Fischler, I. (1980). Completion norms for 329 sentence contexts. Memory & Cognition, 8, 631-642.
  • Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.
  • Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1, 9.
  • Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911.
  • Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance VI. Hillsdale, NJ: Erlbaum.
  • Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
  • Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307, 161-163.
  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203-208.
  • Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.
  • Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62, 1457-1506.
  • Schütze, C. T., & Sprouse, J. (2014). Judgment data. In R. J. Podesva & D. Sharma (Eds.), Research methods in linguistics. Cambridge University Press.
  • Sprouse, J. (2011). A test of the cognitive assumptions of magnitude estimation: Commutativity does not hold for acceptability judgments. Language, 87, 274-288.
  • Sprouse, J., & Almeida, D. (2012). Assessing the reliability of textbook data in syntax: Adger's Core Syntax. Journal of Linguistics, 48, 609-652.
  • Sprouse, J., Schütze, C. T., & Almeida, D. (2012). A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001-2010. Lingua, 134, 219-248.
  • Staub, A., Grant, M., Astheimer, L., & Cohen, A. (2015). The influence of cloze probability and item constraint on cloze task response time. Journal of Memory and Language, 82, 1-17.
  • Taft, M. (2004). Morphological decomposition and the reverse base frequency effect. Quarterly Journal of Experimental Psychology, 57A, 745-765.
  • Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30, 415-433.
  • van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.

See references/lexical-databases-guide.md for detailed instructions on accessing and querying lexical control databases.
