Behavioural Evidence for Cognitive Interference in a Colour-Naming Stroop Task

İrem Nur Yiğit

Independent researcher in cognitive psychology · contact@acarkaan.com · 2026

Abstract

The Stroop task is among the most durable behavioural probes of cognitive control: naming the ink colour of a printed colour word is slower and more error-prone when the word and the ink disagree. This study quantified that interference in a within-subjects colour-naming paradigm with — participants and 80 trials each. Reaction time was longer on incongruent than congruent trials by — (paired t(—) = —, p = —, Cohen's d = —), and accuracy was lower by — (d = —). A trial-level OLS regression returned a converging interference cost of — (95% CI [—, —], R² = —), and a linear mixed-effects model with random intercepts per participant held the population-level cost at — (SE —; ICC —). The pattern replicates the canonical Stroop effect and is consistent with conflict-monitoring accounts of cognitive control, in which response competition on incongruent trials recruits prefrontal control. The effect is present in every participant.

Keywords: Stroop task; cognitive control; selective attention; response interference; reaction time; pre-registration.

1 Introduction

Selective attention is the faculty that lets goal-relevant information win over goal-irrelevant information competing for the same processing. Its most studied behavioural signature is the Stroop effect (Stroop, 1935): asked to name the ink colour of a printed colour word, people are slower and less accurate when word and ink mismatch (RED in green ink) than when they match (RED in red ink). The phenomenon has survived nearly a century of scrutiny and hundreds of empirical articles across languages, populations, and stimulus variants (MacLeod, 1991).

The theoretical accounts converge. In a literate adult, word reading is automatic — fast, obligatory, and hard to suppress — while colour naming runs on a slower, less practised pathway (Cohen, Dunbar, & McClelland, 1990). When the two dimensions of one stimulus disagree, the dominant reading pathway produces a prepotent response that competes with the task-relevant colour response, and resolving that competition is the defining job of cognitive control. Influential neurocognitive models attribute conflict detection to the anterior cingulate cortex and conflict resolution — top-down biasing of attention toward the relevant dimension — to dorsolateral prefrontal cortex (Botvinick, Braver, Barch, Carter, & Cohen, 2001). fMRI work shows rapid trial-to-trial coupling between these regions on incongruent trials (Egner & Hirsch, 2005), and individual differences in Stroop interference track working-memory capacity and executive function (Kane & Engle, 2003; Miyake et al., 2000).

Every clean replication adds to the cumulative reliability of the effect, both as a basic finding and as a benchmark for cognitive-control instruments. This study set out to replicate the Stroop effect on reaction time and accuracy and to quantify it with classical paired tests, an OLS regression, and a mixed-effects model. Two directional predictions were pre-registered:

H1. Reaction time is longer on incongruent than congruent trials.
H2. Accuracy is lower on incongruent than congruent trials.

2 Method

2.1 Participants

The reference design comprised — healthy adults (age 18–35), all with normal or corrected-to-normal vision and no colour-vision deficiency, neurological history, or current psychoactive medication. All gave written informed consent in line with the Declaration of Helsinki. Participant identifiers are anonymous; no identifying information enters the analysis.

2.2 Materials

Stimuli were the four colour words RED, GREEN, BLUE, and YELLOW, each rendered in one of the same four ink colours, presented centrally at roughly 3° of visual angle on a neutral background. On congruent trials the word and ink matched; on incongruent trials they mismatched. No neutral condition was included, so the design indexes interference but not facilitation.

2.3 Design and procedure

A single-factor within-subjects design was used: every participant completed both levels of Trial Type. Each completed 80 randomised experimental trials (40 per condition), preceded by 10 practice trials with accuracy feedback. After consent and colour-vision screening, participants were instructed to name the ink colour and ignore the word, responding on keys R, G, B, and Y. Each trial began with a 500 ms fixation cross, then the stimulus until response or a 3000 ms timeout, then a 500 ms blank inter-trial interval.
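The trial structure described above can be sketched as a small generator. This is a minimal illustration, not the task code behind the study: the function name `build_trial_list`, the rule for drawing incongruent inks at random from the three non-matching colours, and the mapping of each ink colour to its initial letter as the correct key are all assumptions made for the example.

```python
import random

COLOURS = ["RED", "GREEN", "BLUE", "YELLOW"]


def build_trial_list(n_per_condition=40, seed=None):
    """Return a shuffled list of congruent and incongruent trials (hypothetical helper)."""
    rng = random.Random(seed)
    trials = []
    for i in range(n_per_condition):
        # Cycle through the four words so each appears equally often.
        word = COLOURS[i % len(COLOURS)]
        # Congruent trial: ink matches the word.
        trials.append({"word": word, "ink": word, "trial_type": "congruent"})
        # Incongruent trial: ink drawn from the other three colours
        # (assumed balancing rule; the source does not specify one).
        ink = rng.choice([c for c in COLOURS if c != word])
        trials.append({"word": word, "ink": ink, "trial_type": "incongruent"})
    for trial in trials:
        # Assumed key mapping: the correct key is the initial of the ink colour.
        trial["correct_key"] = trial["ink"][0]
    rng.shuffle(trials)
    return trials
```

Calling `build_trial_list(seed=1)` yields 80 dictionaries in random order, 40 per condition; the seed only makes the order reproducible.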

2.4 Analysis plan

The dataset is long-format, one row per trial, with participant_id, trial_type, reaction_time_ms, and correct. Trials below 200 ms (anticipations) or above 3000 ms (lapses) were excluded. Reaction-time analyses used correct trials only; accuracy analyses used all surviving trials. Per participant and condition, mean reaction time and accuracy were computed. The two hypotheses were tested with two-tailed paired t-tests on participant-level means, with Cohen's d for paired samples as the effect size. A trial-level OLS regression of reaction time on trial type (congruent as reference) and a linear mixed-effects model with a random intercept per participant provide converging estimates; the mixed model is the pre-specified primary inferential model. Analyses were run in Python with pandas, numpy, scipy, and statsmodels.
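The cleaning and subject-level steps of this plan can be sketched as follows. The column names come from the data description above; the function names are hypothetical, and computing Cohen's d as the mean difference divided by the standard deviation of the difference scores is one common paired-samples convention, assumed here because the text does not name the variant.

```python
import pandas as pd
from scipy import stats

RT_MIN_MS, RT_MAX_MS = 200, 3000  # exclusion window from the analysis plan


def clean_trials(trials: pd.DataFrame) -> pd.DataFrame:
    """Drop anticipations (< 200 ms) and lapses (> 3000 ms)."""
    in_window = trials["reaction_time_ms"].between(RT_MIN_MS, RT_MAX_MS)
    return trials[in_window]


def participant_means(trials: pd.DataFrame) -> pd.DataFrame:
    """One row per participant: RT from correct trials, accuracy from all surviving trials."""
    kept = clean_trials(trials)
    kept = kept.assign(correct=kept["correct"].astype(float))
    correct_only = kept[kept["correct"] == 1.0]
    rt = correct_only.pivot_table(index="participant_id", columns="trial_type",
                                  values="reaction_time_ms", aggfunc="mean")
    acc = kept.pivot_table(index="participant_id", columns="trial_type",
                           values="correct", aggfunc="mean")
    return rt.add_prefix("rt_").join(acc.add_prefix("acc_"))


def paired_contrast(incongruent: pd.Series, congruent: pd.Series) -> dict:
    """Two-tailed paired t-test plus Cohen's d on the difference scores (assumed variant)."""
    diff = incongruent - congruent
    t_stat, p_value = stats.ttest_rel(incongruent, congruent)
    return {
        "mean_diff": float(diff.mean()),
        "t": float(t_stat),
        "df": len(diff) - 1,
        "p": float(p_value),
        "d": float(diff.mean() / diff.std(ddof=1)),
    }
```

With `means = participant_means(trials)`, the reaction-time contrast is `paired_contrast(means["rt_incongruent"], means["rt_congruent"])`; the accuracy contrast uses the `acc_` columns in the same way.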

Example — simulated

An honest note on the dataset. The numbers and figures rendered below are computed from the bundle shipped with this page. Until enough consenting visitors have taken the live task, that bundle is a stochastic simulation calibrated to MacLeod (1991) — a model of the empirical Stroop literature, not human participants. It exists so the pre-registration, pipeline, and inference are fully visible and reproducible from the first read. The same code runs unchanged on real data; live responses replace this view as they arrive. The unusually large Cohen's d is partly an artefact of how cleanly a simulator behaves and of averaging many trials per person — the trial-level R² is the more sober measure of how much trial type matters.
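For readers who want a feel for what such a stand-in dataset looks like, a generator in the same long format might be sketched as below. Every numeric value (baseline, interference cost, noise, accuracy rates) is an illustrative assumption chosen for this example, not the calibration used by the bundle, and Gaussian reaction times are a simplification of the right-skewed distributions seen in human data.

```python
import numpy as np
import pandas as pd


def simulate_stroop(n_participants=30, n_per_condition=40, seed=0):
    """Generate a long-format Stroop dataset; all parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    frames = []
    for pid in range(n_participants):
        baseline_ms = rng.normal(650.0, 80.0)  # hypothetical congruent mean for this person
        cost_ms = rng.normal(100.0, 25.0)      # hypothetical interference cost for this person
        for trial_type, shift, p_correct in (("congruent", 0.0, 0.97),
                                             ("incongruent", cost_ms, 0.92)):
            frames.append(pd.DataFrame({
                "participant_id": f"P{pid:03d}",
                "trial_type": trial_type,
                # Gaussian trial noise around the participant's condition mean.
                "reaction_time_ms": rng.normal(baseline_ms + shift, 120.0, n_per_condition),
                # Bernoulli accuracy with an assumed per-condition success rate.
                "correct": (rng.random(n_per_condition) < p_correct).astype(int),
            }))
    return pd.concat(frames, ignore_index=True)
```

Because the output has the columns participant_id, trial_type, reaction_time_ms, and correct, it can be passed through the same analysis steps as collected data.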

3 Results

3.1 Data cleaning and descriptives

Of — raw trials, — were excluded by the 200–3000 ms window. Mean reaction time was — on congruent trials and — on incongruent trials; mean accuracy was — and — respectively. The condition means and the per-condition spread are shown in Figure 1.

Figure 1 · Mean reaction time by condition

Mean reaction time for congruent and incongruent trials with the per-condition spread. The difference between the bars is the Stroop interference cost, approximately — in this cohort.

3.2 Inferential statistics

A paired-samples t-test on participant-level mean reaction times revealed a reliable effect of trial type, t(—) = —, p = —, with a within-subject mean difference of — in the predicted direction and Cohen's d = —. The accuracy contrast was likewise reliable in the predicted direction, with a mean difference of — and d = —.

A trial-level OLS regression of reaction time on trial type yielded an incongruent coefficient of — (95% CI [—, —]), converging on the subject-level estimate. Trial type alone yielded R² = — at the trial level, a substantial share of variance given that this model contains no participant structure. The linear mixed-effects model, which adds a random intercept per participant and so respects the within-subject correlation, estimated the population-level cost at — (SE —) with an intraclass correlation of —.
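The two trial-level models can be sketched with the statsmodels formula interface. The function name and the returned dictionary keys are hypothetical, the input is assumed to be already cleaned to the 200–3000 ms window, and computing the intraclass correlation as random-intercept variance over total variance is the usual definition, assumed here rather than taken from the study's code.

```python
import pandas as pd
import statsmodels.formula.api as smf


def fit_interference_models(trials: pd.DataFrame) -> dict:
    """Fit trial-level OLS and a random-intercept mixed model to correct trials."""
    data = trials[trials["correct"].astype(bool)].copy()
    # Dummy-code trial type with congruent as the reference level.
    data["incongruent"] = (data["trial_type"] == "incongruent").astype(int)

    ols = smf.ols("reaction_time_ms ~ incongruent", data=data).fit()
    ci_low, ci_high = ols.conf_int().loc["incongruent"]

    mixed = smf.mixedlm("reaction_time_ms ~ incongruent", data=data,
                        groups=data["participant_id"]).fit()
    between_var = float(mixed.cov_re.iloc[0, 0])  # random-intercept variance
    within_var = float(mixed.scale)               # residual variance

    return {
        "ols_cost": float(ols.params["incongruent"]),
        "ols_ci": (float(ci_low), float(ci_high)),
        "ols_r2": float(ols.rsquared),
        "mixed_cost": float(mixed.params["incongruent"]),
        "mixed_se": float(mixed.bse["incongruent"]),
        "icc": between_var / (between_var + within_var),
    }
```

In a balanced design the OLS and mixed-model point estimates coincide; what the random intercept changes is the standard error, since between-participant variance is no longer counted as trial-level noise.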

3.3 Robustness across participants

The effect is not carried by a subgroup. Every participant showed the predicted direction of the reaction-time effect, and the distribution of per-participant interference scores lies entirely above zero (Figure 2).

Figure 2 · Distribution of the per-participant Stroop cost

One incongruent-minus-congruent reaction-time cost per participant. The whole distribution sits to the right of zero — the individual-level expression of the group effect.
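The per-participant scores behind this distribution can be computed in a few lines. This is a sketch with hypothetical function names; it assumes the long-format columns from the analysis plan and, like the reaction-time analyses, uses correct trials only.

```python
import pandas as pd


def interference_scores(trials: pd.DataFrame) -> pd.Series:
    """Incongruent-minus-congruent mean RT (ms) per participant, correct trials only."""
    correct = trials[trials["correct"].astype(bool)]
    means = (correct.groupby(["participant_id", "trial_type"])["reaction_time_ms"]
                    .mean()
                    .unstack("trial_type"))
    return (means["incongruent"] - means["congruent"]).rename("stroop_cost_ms")


def share_in_predicted_direction(scores: pd.Series) -> float:
    """Proportion of participants whose interference score is above zero."""
    return float((scores > 0).mean())
```

A return value of 1.0 from `share_in_predicted_direction` corresponds to the claim that every participant shows the effect.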

4 Discussion

The study replicates and quantifies the Stroop effect in a within-subjects colour-naming paradigm. Reaction time was markedly slower and accuracy lower on incongruent trials, with effect sizes in the large range and clear statistical evidence at α = .05. The magnitudes fall within the range MacLeod's (1991) review reports for comparable paradigms.

The pattern fits conflict-monitoring accounts of cognitive control (Botvinick et al., 2001). On incongruent trials the automatic word-reading response competes with the controlled colour-naming response, producing response conflict that the anterior cingulate cortex is hypothesised to detect; this recruits dorsolateral prefrontal cortex, which biases processing toward the task-relevant dimension. The behavioural cost, in time and in occasional errors, is the visible work of that circuit, consistent with the trial-to-trial coupling reported by Egner and Hirsch (2005). That the effect appears in every participant is what a universal control mechanism, rather than a strategy adopted by some, would predict; with a simulated dataset, however, this uniformity partly reflects the generative model and needs confirmation in human data.

5 Limitations

Three limitations bear directly on how the results should be read. First, the dataset analysed here was produced by a generative model anchored in MacLeod's (1991) review. This is a deliberate choice that makes the pipeline, pre-registration, and inference transparent and reproducible end to end; the schema accepts newly collected participant data with no code change, and replication in independent human data is the natural next step. Second, the participant-level Cohen's d is inflated because each participant's mean rests on many trials, suppressing within-person noise; the trial-level R² of — is the more informative figure. Third, with no neutral condition the design measures interference but not facilitation; a three-condition extension would decompose the effect and requires no change to the data contract.
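The second limitation can be made concrete with a small simulation. The sketch below uses invented parameter values (an 80 ms cost, 150 ms trial noise, 200 participants) and, for simplicity, gives every participant the same baseline; it is meant only to show the mechanism, not to reproduce this study's figures.

```python
import numpy as np


def effect_sizes(n_trials, n_participants=200, cost=80.0, cost_sd=20.0,
                 trial_sd=150.0, seed=0):
    """Return (paired Cohen's d, trial-level R²) for a given trial count per condition."""
    rng = np.random.default_rng(seed)
    # Each participant has their own true interference cost (column vector).
    costs = rng.normal(cost, cost_sd, size=(n_participants, 1))
    con = rng.normal(600.0, trial_sd, size=(n_participants, n_trials))
    inc = rng.normal(600.0 + costs, trial_sd, size=(n_participants, n_trials))

    # Participant-level d: averaging over trials shrinks the noise in each difference score.
    diff = inc.mean(axis=1) - con.mean(axis=1)
    d_paired = diff.mean() / diff.std(ddof=1)

    # Trial-level R² of a one-predictor regression equals the squared correlation.
    rt = np.concatenate([con.ravel(), inc.ravel()])
    is_incongruent = np.concatenate([np.zeros(con.size), np.ones(inc.size)])
    r_squared = np.corrcoef(is_incongruent, rt)[0, 1] ** 2
    return float(d_paired), float(r_squared)
```

Comparing `effect_sizes(5)` with `effect_sizes(40)` shows the paired d growing as trials per condition increase while the trial-level R² stays roughly constant, which is why the latter is the steadier index of how much trial type matters.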

Beyond replication in human data, the useful extensions are a neutral condition, a drift-diffusion fit that maps the reaction-time distributions onto evidence rate and response caution, and pairing the paradigm with EEG or fMRI to link the behavioural cost directly to the anterior cingulate–prefrontal control circuit it indexes.

6 References

Botvinick, M. M., Braver, T. S., Barch, D. M., Carter, C. S., & Cohen, J. D. (2001). Conflict monitoring and cognitive control. Psychological Review, 108(3), 624–652.
Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97(3), 332–361.
Egner, T., & Hirsch, J. (2005). Cognitive control mechanisms resolve conflict through cortical amplification of task-relevant information. Nature Neuroscience, 8(12), 1784–1790.
Kane, M. J., & Engle, R. W. (2003). Working-memory capacity and the control of attention: The contributions of goal neglect, response competition, and task set to Stroop interference. Journal of Experimental Psychology: General, 132(1), 47–70.
MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative review. Psychological Bulletin, 109(2), 163–203.
Miyake, A., Friedman, N. P., Emerson, M. J., Witzki, A. H., Howerter, A., & Wager, T. D. (2000). The unity and diversity of executive functions and their contributions to complex "frontal lobe" tasks: A latent variable analysis. Cognitive Psychology, 41(1), 49–100.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643–662.

The Stroop Study · İrem Nur Yiğit · 2026 · Built 2026-05-18 · Text CC BY 4.0 · code MIT