Sentence Stimulus Norming
Purpose
This skill encodes expert methodological knowledge for norming linguistic stimuli before running psycholinguistic experiments. A competent programmer without linguistics training would likely construct stimuli based on intuition, failing to control for critical lexical variables (word frequency, length, neighborhood density), skipping cloze norming, using inappropriate rating scales, or under-powering the norming study. Poor stimulus norming is the single most common methodological weakness in psycholinguistic research, because confounds in the materials propagate to every analysis.
When to Use
Use this skill when:
- Creating sentence stimuli for reading experiments (self-paced reading, eye-tracking, ERP)
- Norming the predictability (cloze probability) of critical words in sentence contexts
- Collecting plausibility, naturalness, or acceptability ratings for sentence materials
- Controlling lexical properties of critical words across experimental conditions
- Designing Latin square counterbalancing for within-item designs
- Planning filler items and practice trials
Do not use this skill when:
- Working with single-word stimuli without sentence context (use lexical database tools directly)
- Designing non-linguistic stimuli (visual search arrays, tones)
- Analyzing existing normed materials without creating new ones
Research Planning Protocol
Before executing the domain-specific steps below, you MUST:
- State the research question -- What specific question is this analysis/paradigm addressing?
- Justify the method choice -- Why is this approach appropriate? What alternatives were considered?
- Declare expected outcomes -- What results would support vs. refute the hypothesis?
- Note assumptions and limitations -- What does this method assume? Where could it mislead?
- Present the plan to the user and WAIT for confirmation before proceeding.
For detailed methodology guidance, see the research-literacy skill.
⚠️ Verification Notice
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Cloze Probability Norming
What Is Cloze Probability?
Cloze probability is the proportion of people who complete a sentence fragment with a particular word (Taylor, 1953). It is the standard measure of a word's predictability in context and is a critical control variable in nearly all sentence processing research.
Procedure
- Create sentence fragments: Truncate each sentence immediately before the critical word
- Present fragments one at a time to participants
- Instruct: "Please complete each sentence with the first word that comes to mind. Write only one word."
- Score: For each item, cloze probability = (number of completions matching the target word) / (total number of respondents)
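The scoring step above can be sketched in a few lines of Python. The response data and target word here are invented for illustration; blank or nonsense responses are assumed to have been removed already (see Scoring Conventions below).

```python
from collections import Counter

def cloze_probability(completions, target):
    """Proportion of respondents whose completion matches the target word.

    `completions` is a list of one-word responses for a single item;
    blank/nonsense responses should already be excluded from the list.
    """
    if not completions:
        raise ValueError("no valid completions for this item")
    counts = Counter(c.strip().lower() for c in completions)
    return counts[target.lower()] / len(completions)

# Example: 30 raters, 18 of whom produced the target "coffee"
responses = ["coffee"] * 18 + ["tea"] * 8 + ["milk"] * 4
print(cloze_probability(responses, "coffee"))  # 0.6
```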
Design Parameters
| Parameter | Recommended Value | Citation / Rationale |
|---|---|---|
| N per item | Minimum 30 raters | Taylor, 1953; Bloom & Fischler, 1980; standard minimum for stable estimates |
| Preferred N | 40-50 raters | More stable estimates, especially for medium-cloze items |
| Items per participant | 50-100 fragments per norming session | Avoid fatigue; pilot to calibrate |
| Time limit | ~10-15 seconds per item or untimed | Untimed is standard; brief limit prevents overthinking |
| Population | Same as experimental population (e.g., native English speakers, same age range) | Ensures cloze values generalize |
Scoring Conventions
- Exact match: Only the target word counts (standard)
- Morphological variants: Decide a priori whether "run" and "running" count as the same completion. Standard practice: count only the exact form (Staub et al., 2015)
- Spelling errors: Accept obvious misspellings of the target
- Blank/nonsense responses: Exclude from the denominator (participant did not engage)
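One way these conventions might be operationalized before scoring is sketched below. The `difflib` similarity cutoff of 0.85 is an illustrative choice for "obvious misspellings," not a published standard, and should be calibrated against your own data.

```python
import difflib

def normalize_response(raw, target, accept_misspellings=True):
    """Map a raw completion onto a canonical form before scoring.

    Exact-form scoring (the standard convention): "running" does NOT
    count as "run". Obvious misspellings of the target are mapped to
    the target via a string-similarity cutoff; 0.85 is an illustrative
    threshold, not a published standard.
    """
    word = raw.strip().lower()
    if not word.isalpha():
        return None  # blank/nonsense response: exclude from the denominator
    if accept_misspellings and word != target:
        if difflib.SequenceMatcher(None, word, target).ratio() >= 0.85:
            return target
    return word

print(normalize_response("Coffe", "coffee"))  # "coffee" (misspelling accepted)
print(normalize_response("running", "run"))   # "running" (exact-form scoring)
print(normalize_response("???", "coffee"))    # None (excluded)
```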
Cloze Probability Benchmarks
| Cloze Range | Label | Use Case |
|---|---|---|
| > 0.80 | High cloze / highly predictable | N400 amplitude studies; predictability effects (Kutas & Hillyard, 1984) |
| 0.30 - 0.70 | Medium cloze | Moderate predictability manipulations |
| < 0.10 | Low cloze / unpredictable | Baseline; unexpected completions |
| 0.00 | Zero cloze | Anomalous or implausible continuations |
Online vs. Lab Norming
| Aspect | Lab | Online (e.g., Prolific, MTurk) |
|---|---|---|
| Quality control | Direct observation | Must include catch trials and attention checks |
| Sample size | Limited by lab capacity | Easy to reach N = 40-50 per item |
| Population | Typically university students | More diverse; specify inclusion criteria |
| Validity | Gold standard | Comparable for cloze (Schütze & Sprouse, 2014) |
| Cost | Lab time | Participant payment (~$10-15/hour; Prolific standards) |
Recommendation for online norming: Include 10-15% catch trials (sentences with obvious completions, e.g., "The dog chased the ___") and exclude participants who fail > 20% of catch trials.
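The catch-trial exclusion rule above can be expressed as a small helper; the data layout (participant ID mapped to pass/fail booleans) is an assumption for illustration.

```python
def flag_catch_failures(catch_results, max_fail_rate=0.20):
    """Return participant IDs failing more than `max_fail_rate` of catch trials.

    `catch_results` maps participant ID -> list of booleans (True = passed).
    The 20% cutoff follows the recommendation above.
    """
    excluded = []
    for pid, passes in catch_results.items():
        fail_rate = 1 - sum(passes) / len(passes)
        if fail_rate > max_fail_rate:
            excluded.append(pid)
    return excluded

results = {"p01": [True] * 10, "p02": [True] * 7 + [False] * 3}
print(flag_catch_failures(results))  # ["p02"] (failed 30% of catch trials)
```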
Plausibility and Naturalness Ratings
When to Collect
- When cloze probability alone is insufficient (e.g., both conditions have low cloze but differ in plausibility)
- When manipulating semantic fit or thematic role plausibility
- When verifying that "anomalous" conditions are genuinely perceived as odd
Rating Scale Design
| Parameter | Recommended | Citation / Rationale |
|---|---|---|
| Scale type | Likert scale | Standard for sentence ratings (Schütze & Sprouse, 2014) |
| Number of points | 7-point scale | Balances sensitivity and reliability; standard in psycholinguistics (Schütze & Sprouse, 2014) |
| Anchors | 1 = "very unnatural/implausible" to 7 = "very natural/plausible" | Labeled endpoints with unlabeled intermediate points |
| N per item | Minimum 20 raters; preferred 30+ | Sufficient for stable means per item (Sprouse & Almeida, 2012) |
| Items per rater | 40-80 items per session | Avoid fatigue effects |
| Practice items | 3-5 items spanning the full range before data collection | Calibrate scale use |
Instructions Template
"You will read a series of sentences. For each sentence, please rate how natural or plausible it sounds on a scale from 1 to 7, where 1 means 'very unnatural / makes no sense' and 7 means 'perfectly natural / makes complete sense.' There are no right or wrong answers; we are interested in your intuition."
Critical Design Considerations
- Within-list design: Each rater sees only one version of each item (Latin square). Raters should never see multiple conditions of the same item, or they will rate contrastively rather than absolutely.
- Filler items: Include filler sentences spanning the full rating range. This prevents range restriction.
- Order effects: Randomize item order per participant.
Acceptability Judgments
When to Collect
- When manipulating syntactic structure (grammaticality, island constraints, movement dependencies)
- When testing formal linguistic predictions about sentence well-formedness
- For factorial designs crossing syntactic factors (e.g., 2x2 designs testing island effects; Sprouse et al., 2012)
Rating Methods
| Method | Description | Pros | Cons | Citation |
|---|---|---|---|---|
| Likert scale (7-point) | Rate acceptability 1-7 | Simple; familiar; sufficient for most purposes | Ceiling/floor possible; ordinal data | Schütze & Sprouse, 2014 |
| Magnitude estimation (ME) | Assign a number proportional to perceived acceptability relative to a reference sentence | Unbounded scale; ratio-level data (in theory) | More complex; participants need training; debated whether it outperforms Likert | Bard et al., 1996; Sprouse, 2011 |
| Forced choice | Choose the more acceptable of two sentences | Binary; easy; avoids scale-use differences | Low sensitivity; many trials needed | Sprouse & Almeida, 2012 |
| Yes/No judgment | "Is this sentence acceptable?" | Simple; binary | Very low sensitivity; cannot distinguish degrees of unacceptability | -- |
Recommendation: Use 7-point Likert as the default. It provides sufficient sensitivity for most research questions and has been shown to replicate formal linguistic judgments as reliably as magnitude estimation (Sprouse & Almeida, 2012; Sprouse, 2011).
Sample Size for Acceptability
| Design | Minimum N | Rationale | Citation |
|---|---|---|---|
| Simple grammatical/ungrammatical | 20 participants | Large effect sizes (d > 1.0 typical) | Sprouse & Almeida, 2012 |
| Factorial (2x2) with interaction | 30-40 participants | Interaction effects are smaller | Sprouse et al., 2012 |
| Subtle contrasts | 50+ participants | Small effect sizes require more power | Power analysis recommended |
Lexical Controls
Variables That Must Be Controlled Across Conditions
Every critical word manipulation must control for confounding lexical variables. The target word and its condition-matched alternatives should be equated on the following:
| Variable | Database / Source | Why It Matters | Citation |
|---|---|---|---|
| Word frequency | SUBTLEX-US (log10 word frequency per million) | Most powerful predictor of reading time; ~30-60 ms effect for high vs. low (Brysbaert & New, 2009) | Brysbaert & New, 2009 |
| Word length | Character count | Longer words = longer reading times; ~20-30 ms per character (Rayner, 1998) | Rayner, 1998 |
| Orthographic neighborhood density (N) | N-Watch; CLEARPOND | Number of words differing by one letter; affects lexical access (Coltheart et al., 1977) | Andrews, 1997 |
| Concreteness | Brysbaert et al. (2014) ratings | Concrete words processed faster than abstract words | Brysbaert et al., 2014 |
| Age of acquisition (AoA) | Kuperman et al. (2012) ratings | Earlier-acquired words processed faster | Kuperman et al., 2012 |
| Number of syllables | Any pronunciation dictionary | Affects phonological processing time | Rayner, 1998 |
| Morphological complexity | Manual coding | Derived words (e.g., un-happi-ness) processed differently than monomorphemic words | Taft, 2004 |
Frequency Database Selection
| Database | Language | Measure | Recommended? | Citation |
|---|---|---|---|---|
| SUBTLEX-US | English (US) | Subtitle-based frequency per million | Yes -- best predictor of processing times | Brysbaert & New, 2009 |
| SUBTLEX-UK | English (UK) | Subtitle-based frequency | Yes, for British English materials | van Heuven et al., 2014 |
| HAL | English | Usenet corpus frequency | Outdated; SUBTLEX preferred | Lund & Burgess, 1996 |
| CELEX | English, Dutch, German | Mixed corpus frequency | Acceptable but less predictive than SUBTLEX | Baayen et al., 1995 |
Key recommendation: Use SUBTLEX log frequency values. They explain more variance in lexical decision and naming times than older norms (Brysbaert & New, 2009).
How to Match Across Conditions
- Select critical words for each condition
- Retrieve lexical metrics from SUBTLEX-US and norming databases
- Compute condition means for each metric
- Test for differences: Run t-tests or ANOVAs across conditions on each lexical variable
- Criterion: No significant differences (p > 0.20 is a reasonable threshold; some use p > 0.30) on any controlled variable
- If matching fails: replace items or add the unmatched variable as a covariate in the analysis
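Steps 3-5 can be sketched as follows. Cohen's d is used here as a stdlib-only screen to complement (not replace) the t-test criterion above; in practice you would also run e.g. `scipy.stats.ttest_ind` and require p > 0.20. The frequency values and the d < 0.2 threshold are illustrative assumptions.

```python
from statistics import mean, stdev

def matching_check(cond_a, cond_b, label, d_threshold=0.2):
    """Flag a lexical variable whose condition means differ noticeably.

    cond_a/cond_b: per-item values (e.g., SUBTLEX log frequency) for the
    critical words in each condition. Cohen's d is a quick screen; the
    0.2 threshold is an illustrative choice, not a published criterion.
    """
    pooled_sd = ((stdev(cond_a) ** 2 + stdev(cond_b) ** 2) / 2) ** 0.5
    d = (mean(cond_a) - mean(cond_b)) / pooled_sd
    status = "OK" if abs(d) < d_threshold else "MISMATCH"
    print(f"{label}: A={mean(cond_a):.2f} B={mean(cond_b):.2f} d={d:+.2f} {status}")
    return abs(d) < d_threshold

# Illustrative SUBTLEX log10 frequencies for two 6-item conditions
freq_a = [2.9, 3.1, 3.0, 2.8, 3.2, 3.0]
freq_b = [3.0, 2.9, 3.1, 2.9, 3.1, 3.0]
matching_check(freq_a, freq_b, "log frequency")  # well matched -> OK
```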
Latin Square Counterbalancing
Purpose
In a within-item design, each item appears in all conditions, but each participant sees each item in only one condition. A Latin square assigns items to conditions across participant lists.
Construction
For a design with k conditions and n items (where n is divisible by k):
- Divide items into k groups of n/k items each
- Create k lists; in each list, assign each item group to a different condition
- Each participant receives one list
- Result: every item appears in every condition across participants; each participant sees an equal number of items per condition
Example: 2-Condition Design
With 40 items and 2 conditions (A, B):
| List | Items 1-20 | Items 21-40 |
|---|---|---|
| List 1 | Condition A | Condition B |
| List 2 | Condition B | Condition A |
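The construction steps above generalize to k conditions; this sketch reproduces the 2-condition example (40 items, lists A/B and B/A).

```python
def latin_square_lists(items, conditions):
    """Assign items to conditions across counterbalanced lists.

    items: list of item IDs whose length is divisible by len(conditions).
    Returns one dict per list, mapping item -> condition, such that every
    item appears in every condition across lists.
    """
    k = len(conditions)
    if len(items) % k:
        raise ValueError("number of items must be divisible by number of conditions")
    group_size = len(items) // k
    groups = [items[i * group_size:(i + 1) * group_size] for i in range(k)]
    lists = []
    for shift in range(k):
        assignment = {}
        for g, group in enumerate(groups):
            cond = conditions[(g + shift) % k]  # rotate conditions per list
            for item in group:
                assignment[item] = cond
        lists.append(assignment)
    return lists

lists = latin_square_lists(list(range(1, 41)), ["A", "B"])
print(lists[0][1], lists[0][21])  # List 1: items 1-20 -> A, items 21-40 -> B
print(lists[1][1], lists[1][21])  # List 2: assignment reversed
```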
Requirements
| Parameter | Value | Rationale |
|---|---|---|
| Minimum items per condition per list | 16-24 | Standard for psycholinguistic experiments; fewer items = lower power (Brysbaert & Stevens, 2018) |
| Recommended items | 24-40 per condition | More stable estimates, especially for eye-tracking |
| Participants per list | Equal across lists; minimum 4-6 per list | Ensures balanced representation |
| Total participants | Divisible by number of lists | Critical for balanced design |
Filler Items
Purpose
Fillers prevent participants from noticing the experimental manipulation and adopting strategies.
Design Parameters
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Filler-to-target ratio | 2:1 or 3:1 (fillers:targets) | Standard in psycholinguistics; prevents pattern detection (Schütze & Sprouse, 2014) |
| Filler diversity | Fillers should span the full range of sentence types, lengths, and structures | Prevents target sentences from standing out |
| Filler acceptability range | Include some clearly good and some mildly awkward fillers | Prevents raters from using only part of the scale |
| Filler length | Match the average length of target sentences | Controls for sentence length expectations |
Filler Construction Tips
- Use fillers from different syntactic constructions than your targets
- Include some fillers with comprehension questions (for reading studies) to maintain attentive reading
- If targets are semantically anomalous, include some fillers that are also slightly odd (but in different ways) so anomaly is not a cue
Practice and Warm-Up Items
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Number of practice items | 4-6 items (minimum 3) | Familiarize participants with the task and interface |
| Practice item composition | Span the range of difficulty/acceptability | Calibrate participant expectations |
| Practice data | Always exclude from analysis | Practice responses are contaminated by learning effects |
| Warm-up items at start of main experiment | 2-3 additional filler items | Allow settling into the task; exclude from analysis |
Online Norming Considerations
Platform Recommendations
| Platform | Pros | Cons | Typical Pay Rate |
|---|---|---|---|
| Prolific | Diverse participants; pre-screening; good data quality | Smaller pool than MTurk | ~$10-15/hour (Prolific minimum: $8/hour) |
| Amazon MTurk | Large pool; fast recruitment | Lower data quality; less diverse; requires careful screening | ~$10-15/hour recommended |
| PCIbex / Ibex Farm | Free hosting; designed for linguistics | Requires programming; no built-in recruitment | (hosting only) |
| Gorilla | GUI-based; good for complex designs | Subscription cost | (hosting only) |
Quality Control for Online Studies
| Measure | Implementation | Threshold |
|---|---|---|
| Catch trials | Include 10-15% filler items with obvious answers | Exclude participants failing > 20% |
| Completion time | Record total time | Exclude participants completing in < 50% of median time |
| Straight-lining | Check for same response on all items | Exclude participants with zero variance in ratings |
| Bot detection | Include reCAPTCHA or similar | Exclude flagged responses |
| Native speaker check | Self-report + brief language background questionnaire | Exclude non-native speakers (unless studying L2) |
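The completion-time and straight-lining checks can be combined into a small exclusion pass; the data layout (dicts keyed by participant ID) is an assumption for illustration.

```python
from statistics import median

def qc_exclusions(durations, ratings):
    """Apply the completion-time and straight-lining checks from the table.

    durations: participant ID -> total completion time (seconds)
    ratings:   participant ID -> list of that participant's ratings
    Exclusion triggers: total time < 50% of the median, or zero variance
    in ratings (same response on every item).
    """
    cutoff = 0.5 * median(durations.values())
    excluded = set()
    for pid, t in durations.items():
        if t < cutoff:
            excluded.add(pid)
    for pid, r in ratings.items():
        if len(set(r)) == 1:  # straight-lining
            excluded.add(pid)
    return excluded

durations = {"p01": 900, "p02": 300, "p03": 880}
ratings = {"p01": [3, 5, 2], "p02": [4, 4, 5], "p03": [4, 4, 4]}
print(sorted(qc_exclusions(durations, ratings)))  # ['p02', 'p03']
```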
Common Pitfalls
- Not norming cloze probability: Claiming words are "predictable" or "unpredictable" based on experimenter intuition rather than empirical cloze norms. Always collect cloze data (Taylor, 1953).
- Too few raters per item: With N < 20 raters for cloze, individual item estimates are unstable. A word with true cloze of 0.50 could yield observed cloze of 0.20-0.80 with only 10 raters. Use a minimum of 30 raters (Bloom & Fischler, 1980).
- Not controlling word frequency: Frequency is the strongest single predictor of reading time. A 1 log-unit difference in SUBTLEX frequency corresponds to ~30-40 ms in gaze duration (Brysbaert & New, 2009; Rayner, 1998). Always match or control.
- Using the wrong frequency database: HAL and Kucera-Francis norms are outdated. SUBTLEX-US explains significantly more variance in behavioral data (Brysbaert & New, 2009).
- Showing raters multiple conditions of the same item: This introduces contrastive evaluation. Raters must see each item in only one condition (use a Latin square for norming, too).
- Insufficient filler items: A 1:1 target-to-filler ratio makes the manipulation transparent. Use at least 2:1 fillers to targets (Schütze & Sprouse, 2014).
- Not piloting the norming study: Always pilot with 5-10 participants to catch unclear instructions, ambiguous items, and timing issues before running the full norming sample.
- Ignoring age of acquisition: AoA effects are independent of frequency (Kuperman et al., 2012). Failing to control AoA can introduce confounds, especially for studies comparing concrete vs. abstract words.
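The rater-N instability described in the second pitfall can be checked with a quick simulation: draw repeated samples of N binary completions from a true cloze of 0.50 and look at the spread of the observed proportions. The simulation parameters (10,000 runs, fixed seed) are illustrative.

```python
import random

def cloze_spread(true_p, n_raters, sims=10000, seed=1):
    """Simulate the 95% spread of observed cloze values for a given rater N.

    Each simulated item draws n_raters binary completions with probability
    true_p of matching the target, then records the observed proportion.
    """
    rng = random.Random(seed)
    obs = sorted(sum(rng.random() < true_p for _ in range(n_raters)) / n_raters
                 for _ in range(sims))
    return obs[int(0.025 * sims)], obs[int(0.975 * sims)]

for n in (10, 30):
    lo, hi = cloze_spread(0.5, n)
    print(f"N={n}: 95% of observed cloze values fall in [{lo:.2f}, {hi:.2f}]")
```

With N = 10 the interval spans roughly 0.2-0.8, matching the pitfall's example; with N = 30 it narrows considerably, which is why 30+ raters is the recommended minimum.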
Minimum Reporting Checklist
Based on Schütze & Sprouse (2014) and current psycholinguistic standards:
- Number of items per condition
- Cloze probability values: mean, SD, and range per condition (if collected)
- Cloze norming details: N raters, population, procedure, scoring criteria
- Plausibility/acceptability ratings: scale type, N raters, mean and SD per condition
- Lexical control variables: list each controlled variable, database source, and condition means
- Statistical test confirming conditions do not differ on controlled variables
- Latin square design: number of lists, items per list per condition, participants per list
- Filler-to-target ratio and description of filler types
- Number of practice/warm-up items
- For online norming: platform, pay rate, attention check procedure, exclusion criteria and N excluded
- Full item list (in supplementary materials or online repository)
References
- Andrews, S. (1997). The effect of orthographic similarity on lexical retrieval: Resolving neighborhood conflicts. Psychonomic Bulletin & Review, 4, 439-461.
- Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
- Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania.
- Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72, 32-68.
- Bloom, P. A., & Fischler, I. (1980). Completion norms for 329 sentence contexts. Memory & Cognition, 8, 631-642.
- Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.
- Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1, 9.
- Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911.
- Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance VI. Hillsdale, NJ: Erlbaum.
- Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
- Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307, 161-163.
- Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203-208.
- Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.
- Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62, 1457-1506.
- Schütze, C. T., & Sprouse, J. (2014). Judgment data. In R. J. Podesva & D. Sharma (Eds.), Research methods in linguistics. Cambridge University Press.
- Sprouse, J. (2011). A test of the cognitive assumptions of magnitude estimation: Commutativity does not hold for acceptability judgments. Language, 87, 274-288.
- Sprouse, J., & Almeida, D. (2012). Assessing the reliability of textbook data in syntax: Adger's Core Syntax. Journal of Linguistics, 48, 609-652.
- Sprouse, J., Schütze, C. T., & Almeida, D. (2012). A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001-2010. Lingua, 134, 219-248.
- Staub, A., Grant, M., Astheimer, L., & Cohen, A. (2015). The influence of cloze probability and item constraint on cloze task response time. Journal of Memory and Language, 82, 1-17.
- Taft, M. (2004). Morphological decomposition and the reverse base frequency effect. Quarterly Journal of Experimental Psychology, 57A, 745-765.
- Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30, 415-433.
- van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.
See references/lexical-databases-guide.md for detailed instructions on accessing and querying lexical control databases.