A live experiment in cognitive psychology

Reading the words "red", "green" and "blue", each printed in a conflicting ink colour, is faster than naming their ink colours.

You just felt it. That hesitation — wanting to say the word instead of its ink — is the Stroop effect. This site measures it on you, and reports the numbers as they come in.

I built a small colour–word interference task and put it on the open web. Take it once, in about three minutes, and your trials join a growing dataset. Every figure and test below recomputes from real responses the moment enough people have taken part.


~3 minutes · No account · Anonymous · Currently N=— responses · showing the Example cohort

Take part

Do the task yourself

The effect this whole site is about takes about three minutes to feel from the inside. Name the ink colour, ignore the word. When the word and the ink disagree, watch how your hand hesitates.

This task needs JavaScript to measure reaction times in your browser. The reading on the rest of the page does not — scroll on for the question, the method, and the results as data arrive.

Nothing here is recorded unless you reach the consent step and choose to contribute. You can run the whole task and keep your result entirely to yourself.

The question

What happens when a word fights its own ink?

One sentence holds the whole study: does it cost you time and accuracy to name the colour of a word when the word spells a different colour?

You already know the answer in your body — you felt it in the title above. I want the number. Reading is something literate adults stopped choosing to do a long time ago; it happens whether you want it to or not. Naming a colour is slower, more deliberate, less drilled. Put both on the same square of screen and point them in opposite directions, and the fast involuntary process gets in the way of the slow voluntary one. The size of that interference, in milliseconds and in errors, is what I am measuring on everyone who takes the task.

This matters past the parlour trick. The gap between an automatic response and a goal-directed one is the exact place cognitive control does its work — the machinery that lets you follow an instruction instead of a habit. A colour word in the wrong ink is about the cheapest way anyone has found to make that machinery show itself, which is why it has run in laboratories for most of a century (Stroop, 1935; MacLeod, 1991). I fix the predictions in writing before any data come in:

H1 — reaction time. Naming the ink takes longer on incongruent trials, where the word and the ink disagree, than on congruent trials, where they match.
H2 — accuracy. People answer less accurately on incongruent trials than on congruent ones.

Both are directional and both are pre-registered. An effect that pointed the other way would not count as support — it would be a result I would have to explain. That is the whole point of saying so up front.

The study

How the task is built

One factor, measured inside every person. Four colour words, four inks, a keyboard, and a strict rule about which trials are allowed to count.

One factor, every person sees both halves

This is a within-subjects design, and that choice does the heavy lifting. Each participant meets both conditions — congruent and incongruent — so each person is their own control. The Stroop cost is the difference between two numbers measured in the same nervous system minutes apart, not a comparison between two different groups of people. That removes the largest, messiest source of noise before a single test is run, and it is why a modest sample can speak loudly here.

Who takes part

The reference design is — healthy adults: normal or corrected-to-normal vision, no colour-vision deficiency, fluent readers of the testing language, no neurological history or psychoactive medication. In the lab version every participant gives written informed consent and passes Ishihara colour-vision plates before anything begins. On this site the same task runs in your browser, and your trials only ever join the dataset if you reach the consent step and choose to — the section above this one is exactly that task.

The stimuli

Four words, nothing exotic: RED, GREEN, BLUE, YELLOW. Each appears in one of those same four ink colours, centred on a neutral background at roughly three degrees of visual angle. On a congruent trial the word and the ink agree. On an incongruent trial they disagree. There is no third, neutral case here — this design measures interference, and I am honest later about what leaving facilitation out costs the interpretation.

The procedure, one trial at a time

The instruction is six words long: name the ink, ignore the word. Responses are the keys R, G, B, Y, mapped to the four inks and shown on screen during practice. Every trial runs the same way: a 500 ms fixation cross, then the stimulus until you respond or 3000 ms pass, then a 500 ms blank. Ten practice trials with feedback come first; then the timed block, randomised per person, with no feedback so nothing teaches you mid-task.
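A minimal sketch of how a timed block like this could be assembled, assuming an equal number of congruent and incongruent trials. The block length, the `Trial` class, and the `build_block` helper are hypothetical names for illustration, not the site's actual code.

```python
import random
from dataclasses import dataclass

COLOURS = ["RED", "GREEN", "BLUE", "YELLOW"]
KEY_FOR_INK = {"RED": "R", "GREEN": "G", "BLUE": "B", "YELLOW": "Y"}


@dataclass(frozen=True)
class Trial:
    word: str                     # the colour word shown on screen
    ink: str                      # the ink colour it is printed in
    fixation_ms: int = 500        # fixation cross
    stimulus_max_ms: int = 3000   # stimulus stays until response or timeout
    blank_ms: int = 500           # blank inter-trial interval

    @property
    def condition(self) -> str:
        return "congruent" if self.word == self.ink else "incongruent"

    @property
    def correct_key(self) -> str:
        # The correct answer always follows the ink, never the word.
        return KEY_FOR_INK[self.ink]


def build_block(n_per_condition: int, seed: int) -> list[Trial]:
    """Return a shuffled block with n_per_condition trials of each type.

    The 50/50 split is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # one seed per participant gives a per-person order
    trials = []
    for _ in range(n_per_condition):
        ink = rng.choice(COLOURS)
        trials.append(Trial(word=ink, ink=ink))  # congruent
        ink = rng.choice(COLOURS)
        word = rng.choice([c for c in COLOURS if c != ink])
        trials.append(Trial(word=word, ink=ink))  # incongruent
    rng.shuffle(trials)
    return trials
```

Calling `build_block(20, seed=1)` would give a 40-trial block in which order carries no information about trial type.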

The rule that protects the data

I decided before collection which trials are not allowed to count, so the cleaning cannot be tuned to a result. Any response faster than 200 ms is treated as an anticipation — too quick to be a real decision, a guess that landed early. Anything slower than 3000 ms is a lapse — attention left the screen. Both are dropped. Reaction-time analysis then uses correct trials only; accuracy analysis keeps every surviving trial, right or wrong. That window, 200 to 3000 ms, is the one line of judgement in the whole pipeline, and it was drawn in advance.
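The rule above is short enough to write down as a sketch. The field names and the `split_for_analysis` helper are assumptions for illustration; the 200 and 3000 ms bounds are kept, since the text only drops responses faster or slower than them.

```python
from dataclasses import dataclass

RT_MIN_MS = 200    # anything faster is treated as an anticipation
RT_MAX_MS = 3000   # anything slower is treated as a lapse


@dataclass(frozen=True)
class Response:
    participant: str   # hypothetical field names
    condition: str     # "congruent" or "incongruent"
    rt_ms: float
    correct: bool


def split_for_analysis(responses):
    """Apply the window, then return (accuracy_trials, rt_trials).

    accuracy_trials: every trial that survives the window, right or wrong.
    rt_trials: the surviving trials that were answered correctly.
    """
    surviving = [r for r in responses if RT_MIN_MS <= r.rt_ms <= RT_MAX_MS]
    rt_trials = [r for r in surviving if r.correct]
    return surviving, rt_trials
```

Keeping the two outputs separate mirrors the rule: the window is applied once, and only the reaction-time analysis adds the correct-trials filter on top.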

Trial structure

500 ms: Fixation cross, centre screen
≤ 3000 ms: Stimulus, respond R / G / B / Y
500 ms: Blank inter-trial interval
repeat: Next trial, type randomised

One trial, four phases. The timed block randomises congruent and incongruent trials within each person so order cannot become a cue.

Design at a glance

Design: Within-subjects, one factor
Factor: Trial type, congruent / incongruent
Primary outcome: Reaction time (ms)
Secondary outcome: Accuracy (0 / 1)
Exclusion window: 200–3000 ms
RT analysis: Correct trials only

The data

What the responses look like

Every trial is one row: who, which condition, how fast, right or wrong. Two pictures tell most of the story before any statistics — the average gap, and the shape behind the average.

Showing the Example cohort — N=— responses collected so far.

Read this before the figures

Until enough people have taken the task, every figure on this page is drawn from a stochastic simulation calibrated to MacLeod's (1991) review of the Stroop literature. It is not human data. It exists so the pipeline, the figures, and the statistics are all visible and checkable from the first visit. The moment live responses from consenting visitors pass the threshold, these views recompute from real people and the amber labelling clears.

Figure 1 · Mean reaction time

Mean reaction time by condition, with the per-condition spread. Congruent mean —, incongruent mean —. The bar between them is the Stroop cost — about — in this cohort.

Figure 2 · Reaction-time distributions

The full reaction-time distribution for each condition. The incongruent curve does not just shift right — it spreads and grows a longer tail, which is the texture an average alone hides.

The structure is deliberately plain. After the 200–3000 ms window removes a small slice of anticipations and lapses (— of — trials in the example cohort), what remains is split cleanly into the two conditions. The congruent distribution sits to the left and tight. The incongruent one sits to the right, wider, with weight in the slow tail. You can see the effect with your eyes before a single test confirms it — which is exactly what a clean design should let you do.

The analysis

Four ways of asking the same question

A paired test, an effect size, a regression, and a mixed model. They are not four findings — they are one finding, checked four times, each time with the previous one's blind spot covered.

The paired test, where the design pays off

I take each person's mean reaction time on correct congruent trials and on correct incongruent trials, and I subtract. The whole sample becomes a single column of within-person differences. A two-tailed paired t-test asks whether that column is reliably away from zero. In the example cohort the mean cost is —, with t(—) = —, p = —. Accuracy moves in the predicted direction by —, also reliable. Both pre-registered predictions hold.
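As a sketch of the subtraction and the test statistic, assuming each participant has already been reduced to one mean per condition. The function name and the numbers in the usage note are illustrative only, not results from any cohort. The p-value needs a t distribution, which the standard library does not provide, so it is left out here.

```python
import math
import statistics


def paired_t(congruent_means, incongruent_means):
    """Return (mean_cost_ms, t_statistic, degrees_of_freedom).

    Both lists hold one mean RT per participant, in the same order.
    """
    if len(congruent_means) != len(incongruent_means):
        raise ValueError("each participant needs a mean in both conditions")
    # One within-person difference per participant: the Stroop cost.
    diffs = [inc - con for con, inc in zip(congruent_means, incongruent_means)]
    n = len(diffs)
    mean_cost = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)            # sample SD (n - 1 denominator)
    t_stat = mean_cost / (sd_diff / math.sqrt(n))
    return mean_cost, t_stat, n - 1
```

For four made-up participants with congruent means of 600, 650, 700, 620 ms and incongruent means of 720, 800, 790, 760 ms, this gives a mean cost of 125 ms and t(3) of about 9.45.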

How big, in standard units

Significance only says the effect is not zero. Size is the interesting part. Cohen's d for paired samples — the mean difference over the standard deviation of the differences — comes out at — for reaction time and — for accuracy. The reaction-time number is enormous, and I will not pretend otherwise or quietly bury it. It is large because every person contributes a mean over many trials, which crushes within-person noise, and because every single participant moved the same way. That makes it real and reliable; it does not make the per-person effect forty milliseconds bigger than it is. The honest summary of magnitude is the regression below.
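The paired-samples d described here is a one-line calculation. This sketch assumes the input is the column of within-person differences; the function name is hypothetical.

```python
import math
import statistics


def cohens_d_paired(diffs):
    """Cohen's d for paired samples: mean difference over SD of the differences."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Illustrative differences in ms, not data from any cohort.
example_diffs = [120, 150, 90, 140]
d = cohens_d_paired(example_diffs)

# The paired t statistic is the same quantity scaled by sqrt(n),
# so d is what is left once sample size is taken out of t.
t_from_d = d * math.sqrt(len(example_diffs))
```

Because the denominator is the spread of the differences rather than the spread of raw reaction times, anything that makes participants agree with one another pushes this d upward.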

The same effect as one regression coefficient

Drop the per-person averaging and model every surviving trial: reaction time on trial type, congruent as the reference. The incongruent coefficient is the Stroop cost in milliseconds, —, with a 95% confidence interval of [—, —] — it converges on the subject-level estimate, which is the point of running it. Trial type alone explains — of the trial-by-trial variance. That R² is the more sober, ecologically honest number to keep in your head: a lot of reaction time is just person, fatigue, and the moment, and one binary manipulation still accounts for a real slice of it.
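With a single binary predictor, the regression can be written out by hand. A minimal sketch, assuming one reaction time and one incongruent flag per surviving trial; the function name is hypothetical and the confidence interval is not computed.

```python
import statistics


def ols_binary(rts_ms, is_incongruent):
    """Fit rt = intercept + slope * incongruent; return (intercept, slope, r_squared)."""
    x = [1.0 if flag else 0.0 for flag in is_incongruent]
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(rts_ms)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("both trial types must be present")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, rts_ms))
    slope = sxy / sxx                    # Stroop cost in ms
    intercept = y_bar - slope * x_bar    # mean congruent RT (the reference level)
    ss_tot = sum((yi - y_bar) ** 2 for yi in rts_ms)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, rts_ms))
    return intercept, slope, 1.0 - ss_res / ss_tot
```

With congruent as the reference, the intercept equals the mean congruent reaction time and the slope equals the difference between the two condition means across all trials.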

The model that takes people seriously

The regression has one flaw I planned for: it treats every trial as independent when trials from the same person are not. A linear mixed-effects model fixes that — the same fixed effect of trial type, plus a random intercept per participant so each person gets their own baseline speed. The population-level cost holds at — (SE —). The intraclass correlation is —, meaning roughly that share of the variance is stable between-person difference rather than trial-to-trial churn. This is the model I trust most, because it is the only one whose assumptions match how the data were actually generated.
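Written out, a random-intercept model of this kind takes the following standard form. The symbols are my notation for this sketch: i indexes trials, j indexes participants, and x is 1 on incongruent trials and 0 on congruent ones.

```latex
\[
  \mathrm{RT}_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
  \qquad
  u_j \sim \mathcal{N}(0, \sigma_u^2),
  \quad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
\]
\[
  \mathrm{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}
\]
```

Here \beta_1 is the population-level Stroop cost, u_j is participant j's own baseline offset, and the ICC is the share of the remaining variance that sits between people rather than between trials.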

The same caveat, again, next to the figures

The numbers above and the two figures below are computed live from whichever cohort is selected. While that is the example cohort, they describe a MacLeod-1991-calibrated simulation, not human participants — the unusually large d is partly an artefact of how cleanly a simulator behaves. Treat the structure as real and the exact decimals as illustrative until live data replace them.

Figure 3 · Every participant's effect

Each line is one person, congruent on the left, incongruent on the right. The point of the picture is that no line goes down — the effect is not driven by a loud subgroup, it is in everyone.

Figure 4 · Distribution of the Stroop cost

One number per person: their own incongruent-minus-congruent cost. The whole histogram sits to the right of zero, which is the individual-level version of the group result.

The rigour

Decisions made before the data

A result is only as trustworthy as the choices made before it existed. Two of those choices are public here: what this sample could detect, and what I committed to in writing in advance.

What the sample can and cannot see

Before collecting anything I asked the honest question: with this many people and this test, how small an effect could I still catch? For a two-tailed paired t-test at α = .05 and 80% power, the smallest standardised effect this design can reliably detect is a Cohen's d of about —. That is the floor. Anything weaker than that, this study would likely miss, and I would have no business claiming it found nothing.
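A rough version of that calculation fits in a few lines. This sketch uses the normal approximation rather than the exact noncentral t, so for small samples it slightly understates the true floor. The sample size in the usage note is hypothetical, since the target N is not stated here.

```python
import math
from statistics import NormalDist


def min_detectable_d(n, alpha=0.05, power=0.80):
    """Approximate smallest paired-samples d detectable with n participants.

    Two-tailed test, normal approximation:
    d_min = (z_{1 - alpha/2} + z_{power}) / sqrt(n).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) / math.sqrt(n)
```

For a hypothetical n of 30, `min_detectable_d(30)` is about 0.51, and quadrupling the sample halves it.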

The Stroop effect is not a borderline phenomenon. A century of replication puts it far above that floor — typically several times the minimum this design can resolve. So the sample is not merely adequate; it has room to spare for the effect it is built to find. I would rather state the detectable floor plainly than imply the study can see things it cannot.

The pre-registration, and why it counts as evidence

Method is not just what you did — it is what you said you would do before you could see how it would turn out. Everything below was fixed in advance. None of it was chosen after looking at a result:

The hypotheses are directional. Incongruent slower than congruent; incongruent less accurate than congruent. An effect in the opposite direction is reported, not reinterpreted as support.
The stop rule is fixed. Recruitment ends at the target N. No peeking at the data and stopping when it looks good, no interim tests that quietly inflate the false-positive rate.
The cleaning is pre-specified. The 200–3000 ms window and the subject-level exclusions — overall accuracy under 70%, fewer than 30 valid trials in a condition, mean RT beyond 3 SD of the group — were written down before collection, so they cannot be tuned toward a result.
The analysis is the code. The paired tests, the OLS, and the mixed model are implemented exactly as planned. The mixed model is named in advance as the primary inferential model because it is the one whose assumptions fit the data structure.
Falsification is defined. Each hypothesis is disconfirmed if its paired test fails to reach α = .05 in the predicted direction, or if the effect points the wrong way. The result had a way to lose before it was run.

Any deviation that came up during collection or analysis would be reported as a deviation, with confirmatory and exploratory analyses kept visibly separate. That is the difference between a finding and a story told after the fact.
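The subject-level exclusions listed above can be sketched as a single function that reports why each participant was dropped. The class, the field names, and the choice to compute the group mean and SD over all participants before excluding anyone are assumptions for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ParticipantSummary:
    participant: str        # hypothetical field names
    accuracy: float         # proportion correct over surviving trials
    valid_congruent: int    # valid trials in the congruent condition
    valid_incongruent: int  # valid trials in the incongruent condition
    mean_rt_ms: float


def exclusion_reasons(summaries):
    """Map each excluded participant to the list of rules they failed."""
    rts = [s.mean_rt_ms for s in summaries]
    group_mean = statistics.fmean(rts)
    group_sd = statistics.stdev(rts)  # needs at least two participants
    reasons = {}
    for s in summaries:
        failed = []
        if s.accuracy < 0.70:
            failed.append("overall accuracy under 70%")
        if min(s.valid_congruent, s.valid_incongruent) < 30:
            failed.append("fewer than 30 valid trials in a condition")
        if group_sd > 0 and abs(s.mean_rt_ms - group_mean) > 3 * group_sd:
            failed.append("mean RT beyond 3 SD of the group")
        if failed:
            reasons[s.participant] = failed
    return reasons
```

Returning the reasons rather than a filtered list keeps the exclusions auditable: the count dropped under each rule can be reported next to the results.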

Figure 5 · Power curve

Statistical power as a function of effect size for this design. The marked point is the minimum detectable effect at 80% power, d ≈ —; the Stroop effect sits well to the right of it.

Pre-registered, in one column

Hypotheses: Two, directional, fixed
Sampling: Stop at target N, no peeking
α: .05, two-tailed
Power target: .80
Minimum detectable d: —
Primary model: Linear mixed-effects
Falsification: Defined before collection

What it means

Why naming a colour is hard work

The cost is not a quirk of the keyboard. It is the price of overriding a habit you cannot switch off, and the receipt is roughly a seventh of a second.

Here is what I think the number is telling us. In a literate adult, reading is not a skill you deploy — it is a reflex you have lost the ability to suppress. Show the word RED and your brain has already read it before you decided whether you wanted to. Colour naming has no comparable head start; it runs on a slower, shallower path (Cohen, Dunbar, & McClelland, 1990). On an incongruent trial those two paths arrive at the same decision point carrying different answers, and one of them has to lose. The time it takes for the right answer to win is the Stroop cost. That is the whole mechanism, and the data fit it exactly.

I read this through the conflict-monitoring account, and I think it is the right lens (Botvinick, Braver, Barch, Carter, & Cohen, 2001). The anterior cingulate cortex behaves like a conflict detector — when two responses compete, it registers the clash and flags that control is needed. Dorsolateral prefrontal cortex is what answers the flag, biasing attention toward the ink and away from the word until the task-relevant answer dominates. None of that is free. The ~143 ms in the example cohort is, on this reading, the visible shadow of that ACC–dlPFC loop doing its job on every conflict trial, and Egner and Hirsch (2005) showed the coupling tightening trial by trial in exactly the way the account predicts.

The detail that convinces me is not the average — it is that everyone shows it. Look back at the per-participant figure: not one line runs the other way. If the effect were a strategy, or an artefact of some subgroup who misread the instruction, you would see exceptions. There are none. That points at something built in rather than chosen — control as standard equipment in the literate brain, not an optional tactic some people happen to use.

This is also why the Stroop task refuses to retire. The size of a person's interference is not noise; it tracks working-memory capacity and executive function more broadly (Kane & Engle, 2003; Miyake et al., 2000). A three-minute colour-naming task is, in effect, a cheap window onto how well a given brain holds a goal in the face of a louder habit. That is a strong claim, and I am making it on purpose: the millisecond gap you produced on this page is a real behavioural index of cognitive control, not a curiosity.

Limits

What this study cannot tell you

A result is more believable when its author names the places it is weak before anyone else does. Here are the three that matter, and what I am doing about them.

The starting dataset is simulated

The example figures are not human data. They come from a generative model whose parameters are anchored in MacLeod's (1991) review. I made that choice deliberately so the whole pipeline — the pre-registration, the cleaning, every test — is transparent and reproducible from the first visit instead of waiting behind a recruitment process. But a simulation calibrated to the literature cannot, by construction, surprise you. The fix is the point of this whole site: the task above collects real responses, and the figures recompute from consenting visitors the moment there are enough of them. Replication in genuinely independent human data is not a footnote here; it is the next thing that happens.

The effect size is inflated by the design

A Cohen's d near 6.67 in the example cohort is not the size of the Stroop effect in a person. It is what you get when each participant's score is a mean over many trials — that averaging crushes within-person noise — and when a simulator produces almost no contrary cases. The within-person effect is real and universal; its standardised magnitude is not as superhuman as that d makes it look. The trial-level R² near .27 is the number I would actually quote for how much trial type matters.
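The inflation can be made explicit with a short derivation under simplifying assumptions that are mine, not the site's: each person's true cost has mean \Delta and standard deviation \tau across people, trial-level noise has standard deviation \sigma in both conditions, trials are independent, and each person contributes k correct trials per condition.

```latex
\[
  D_j = \delta_j
      + \bigl(\bar{\varepsilon}_{j,\mathrm{inc}} - \bar{\varepsilon}_{j,\mathrm{con}}\bigr),
  \qquad
  \mathrm{Var}(D_j) = \tau^2 + \frac{2\sigma^2}{k}
\]
\[
  d = \frac{\Delta}{\sqrt{\tau^2 + 2\sigma^2 / k}}
\]
```

As k grows the noise term shrinks toward zero and d approaches \Delta / \tau, so a simulator with little person-to-person variation in the cost yields a very large d from an ordinary millisecond gap. With the example cohort's figures, a cost near 143 ms and a d near 6.67 imply a standard deviation of the differences of about 21 ms.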

It measures interference but not facilitation

There are only two conditions here. Without a neutral case — coloured letter strings that carry no competing word — I can measure how much a conflicting word slows you down, but not how much a helpful word speeds you up. The Stroop effect has both components and this design sees only one of them. A three-condition version would decompose it, and the data schema already accepts a third trial type without a single code change, so the extension is a design decision, not an engineering one.
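The decomposition a neutral condition would allow is simple arithmetic on three condition means. A sketch, with a hypothetical function name and made-up values in the usage note:

```python
def decompose_stroop(congruent_ms, neutral_ms, incongruent_ms):
    """Split the total Stroop effect into interference and facilitation (ms)."""
    return {
        "interference": incongruent_ms - neutral_ms,  # cost of a conflicting word
        "facilitation": neutral_ms - congruent_ms,    # benefit of a matching word
        "total": incongruent_ms - congruent_ms,       # what a two-condition design sees
    }
```

With illustrative means of 620, 650 and 780 ms, the total of 160 ms would split into 130 ms of interference and 30 ms of facilitation. A two-condition design reports only the total.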

What is next

The most useful next step is the obvious one: more people, through the live task, until the figures on this page are human and the amber labelling is gone for good. After that, three extensions are worth doing — a neutral condition to split interference from facilitation; a drift-diffusion fit to turn the reaction-time distributions into interpretable parameters like evidence rate and response caution; and, eventually, pairing the paradigm with EEG or fMRI to tie the behavioural cost directly to the ACC–dlPFC circuit it is standing in for. The pipeline was built so each of those is an addition, not a rewrite.

Colophon

How this was made, and what you should know

The honest paragraph, the recap, who built it, and the terms — all in one place, because that is where I would look for it.

The method, in four sentences

A within-subjects colour-naming Stroop task with two conditions, congruent and incongruent, run on — participants in the reference design. Reaction times outside 200–3000 ms are removed before anything else; reaction-time tests use correct trials only, accuracy tests use every surviving trial. The effect is estimated four ways — a paired t-test, Cohen's d, an OLS regression, and a linear mixed-effects model with a random intercept per person, which is the primary inferential model. Every analysis decision was pre-registered before any data existed.

The honest disclosure

This is the part I will not soften. Live results come only from visitors who reach the consent step and choose to contribute. Until enough of them have, every figure and statistic on this site is drawn from a stochastic simulation calibrated to MacLeod (1991) — it is a model of the literature, not human data, and it is labelled in amber wherever it appears so the distinction is never hidden. When live data cross the threshold the views recompute from real people automatically and the labelling clears.

The data carry no personal information. No names, no email, no IP address is stored. No cookies are set. A participant is an anonymous client-generated session identifier and nothing else, and the analysis pipeline never sees anything that could identify a person. Contributing is voluntary; you can run the entire task and keep your result to yourself, and nothing is recorded unless you explicitly consent.

Data and code

The analysis pipeline, the example dataset, and this site are all released openly. New responses drop into the same one-row-per-trial CSV contract without any code change, so the pipeline that produced the figures here is the exact one that would run on real participants. Anonymisation happens before data ever reach the analysis layer.
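A one-row-per-trial contract of this kind could be read as follows. The column names (`participant`, `trial_type`, `rt_ms`, `correct`) and the 0/1 coding of `correct` are assumptions for this sketch; the released schema may differ.

```python
import csv
import io
from dataclasses import dataclass

# "neutral" is allowed so a third trial type can be added without a code change.
ALLOWED_TYPES = {"congruent", "incongruent", "neutral"}


@dataclass(frozen=True)
class TrialRow:
    participant: str   # anonymous session identifier
    trial_type: str
    rt_ms: float
    correct: bool


def read_trials(handle):
    """Parse one-row-per-trial CSV text from an open file-like object."""
    rows = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, rec in enumerate(csv.DictReader(handle), start=2):
        if rec["trial_type"] not in ALLOWED_TYPES:
            raise ValueError(f"line {line_no}: unknown trial_type {rec['trial_type']!r}")
        rows.append(
            TrialRow(
                participant=rec["participant"],
                trial_type=rec["trial_type"],
                rt_ms=float(rec["rt_ms"]),
                correct=rec["correct"] == "1",
            )
        )
    return rows


sample = io.StringIO(
    "participant,trial_type,rt_ms,correct\n"
    "s01,congruent,612.4,1\n"
    "s01,incongruent,788.0,0\n"
)
trials = read_trials(sample)
```

Validating `trial_type` at read time means a malformed row fails loudly at the door instead of silently shifting a condition mean.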

Contact

Questions, corrections, or a request to use the task in your own work are welcome at contact@acarkaan.com.

In one line

What you are looking at is a MacLeod-1991-calibrated simulation until enough consenting visitors have taken the task — then it becomes their data, anonymously, with no personal information of any kind.

About the author

I am İrem Nur Yiğit, an independent researcher in cognitive psychology. I built this site to run a live colour–word interference experiment on the open web and report the results honestly as they arrive: pre-registration, pipeline, and disclosure all in public.

Licence & build

Text and figures: CC BY 4.0 — reuse with attribution. Code and analysis pipeline: MIT.

Set in Fraunces, Newsreader, Hanken Grotesk and Spline Sans Mono.