World state: unchanged
The confidence survived.
The verification didn't.
Information Theory · Cognitive Science · AI Agents

From Bits to Minds: Measuring What AI Actually Understands

Shannon built entropy to squeeze messages through noisy wires. Wiener built cybernetics to keep machines on course. Together, they give us the maths to ask the question that now matters most in AI: can an agent act—and can it reliably tell whether its actions worked?

Gareth Roberts March 2026 19 min read
Interactive · signal state. Intent enters a noisy channel: Intent (the user goal remains legible) → Perception (the interface introduces ambiguity) → Action (the agent composes a convincing move) → Verification (the world may not match the story). At the example setting, 82% of the signal is retained. What changes as you scroll: the same pipeline that looks clean in the abstract becomes fragile once feedback goes faint.
Complete the form.

The system reported success.
But the world never changed.

confidence ↑↑↑   world state ———

Try this. Give a frontier AI agent a task that's boring in exactly the way real work is boring: open three tabs, compare prices, pick the cheapest option, then fill a checkout form right up to the final pay button. Watch it fly. It'll read pages fluently, do arithmetic instantly, and narrate its plan with unnerving confidence. Then it will click Submit, the page will quietly refuse because a required dropdown is still set to "Select…", and the agent will report: done.

Live demo · Australian checkout verification failure. An Atlas/Comet-style agent (vision + browser actions + planner) runs an autonomous checkout at checkout.melb-market.au/secure/delivery, looping Observe → Plan → Act → Verify over a delivery-address form (full name, email, mobile, street address, suburb, postcode, state). The required Australian State / Territory field stayed at "Select state". Checkout did not progress, but the agent inferred success from UI momentum and self-reported completion with high confidence (0.94). That's the gap this essay measures.

Most benchmarks log that as a plain failure. True, but not very diagnostic. The interesting part is the shape of the failure. The agent didn't collapse on language, or arithmetic, or even decision-making. It collapsed on the tiny, unglamorous act of noticing whether the world actually changed.

That's a metacognitive failure: not a lack of capability, but a lack of reliable self-assessment. This essay proposes a way to measure that gap using tools that are almost comically old-school: Shannon's information theory (how uncertainty moves) and Wiener's cybernetics (how feedback stabilises behaviour). We'll use them to decompose agent behaviour into measurable cognitive operations, identify where errors concentrate, and design hybrid systems that route work to the right kind of mind—human, model, or both—based on quantitative cognitive profiles rather than vibes.

Research lineage & original contribution

Foundations — Shannon's entropy and mutual information (1948); Wiener's cybernetic feedback loop (1948); metacognitive monitoring and calibration research from the 1980s onward.

Adapted — Information-theoretic metacognitive efficiency (Dayan 2023; Meyen et al. 2025); cognitive-chain decomposition for GUI tasks (TaskSense / Li et al. 2025); calibration-vs-confidence distinctions from scoring-rule literature.

Proposed here — Operation-level reliability profiling using OSKR across each cognitive-chain step, combined with deployment routing rules that assign work to human, agent, or hybrid based on per-operation self-knowledge rather than aggregate accuracy. This integration is a conceptual framework informed by current research, not yet an established empirical result.

01
Foundations
The Mind as a Noisy Channel

In 1948, two people working on two different problems accidentally gave us a toolkit for thinking about minds.

Claude Shannon, at Bell Labs, asked what it takes to transmit messages through a noisy channel. His answer was entropy: the expected "surprise" in a signal, measured in bits. If a system sees a world that could be in many states, its entropy is high; if the world is predictable, entropy is low. His deeper point was the channel capacity theorem: every channel has a hard limit on how much information can be carried reliably.

Interactive · Shannon entropy: expected uncertainty per symbol, maximal at a uniform distribution. Drag to reshape the uncertainty. Eight equally likely outcomes give H(X) = 3.00 bits: every outcome is equally surprising.

In this context, entropy is just uncertainty. If an outcome is certain, entropy is 0 bits. If an outcome is a fair coin toss, entropy is 1 bit (one yes/no question). If there are eight equally likely outcomes, entropy is 3 bits.

Entropy is the bookkeeping you use when you want to stop arguing about "complexity" and start measuring it.
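The bookkeeping is small enough to sketch. A minimal entropy function, with the three examples from the text (a sketch, not any particular library's API):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum over p of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))        # certain outcome: 0.0 bits
print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/8] * 8))    # eight equally likely outcomes: 3.0 bits
```

The `if p > 0` guard implements the convention that 0 · log 0 = 0, so impossible outcomes contribute no surprise.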

Norbert Wiener, at MIT, asked a different question: how does a system use information to control itself? His answer was the cybernetic feedback loop: act, observe the result, compare against a goal, and feed the error back into the next action.

Wiener's cybernetic loop, animated: Input → Processor → Output, with feedback returning through a Comparator that measures error against the goal state. While feedback flows, the loop is stable and error-correcting. Break the feedback path and watch error accumulate unchecked.

Shannon tells you about information content. Wiener tells you about information control. Cognition needs both. A brain that can compute but can't detect its own error signal is not "less intelligent"; it's just unsafe.

The bridge between them is mutual information: how much knowing one variable reduces uncertainty about another. It's a way of asking: how much of the signal survived the channel?

Interactive · mutual information: how much knowing X reduces uncertainty about Y. As noise rises (here to 40%), the overlapping region, the mutual information, shrinks, and the two variables become independent.
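The shrinking overlap can be computed directly from a joint distribution. A minimal sketch (stdlib only; the binary channel here is an illustration, not part of the interactive widget):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]               # marginal P(X)
    py = [sum(col) for col in zip(*joint)]         # marginal P(Y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# A binary channel that flips its input with probability `noise`:
def channel_joint(noise):
    return [[0.5 * (1 - noise), 0.5 * noise],
            [0.5 * noise, 0.5 * (1 - noise)]]

print(mutual_information(channel_joint(0.0)))  # noiseless: 1.0 bit shared
print(mutual_information(channel_joint(0.5)))  # pure noise: 0 bits, independent
```

At zero noise the full bit survives the channel; at 50% noise, knowing X tells you nothing about Y.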

Now swap the nouns. The "transmitter" is a user's intent, the "channel" is an agent's perception-and-reasoning stack, and the "receiver" is the outcome in the external world. A modern GUI agent is a walking demo of partial observability: it sees pixels, not state; it issues clicks, not guarantees. Crucially, its feedback channel is often the weakest link—subtle highlights, silent validation errors, delayed updates, and modal dialogs that fail off-screen.

So when we ask whether an agent "understands", we're really asking two separate questions: (1) how much task-relevant information can it extract and propagate through its internal processing, and (2) how much information does it have about its own success and failure? The second question is the metacognitive one, and it's where today's systems most visibly wobble.

That leads to a useful (and testable) notion: an agent has a kind of cognitive channel capacity. As tasks get longer, noisier, or more visually ambiguous, information degrades. At some point the agent is no longer steering by feedback; it's steering by story. The metric we introduce next—OSKR—measures exactly how much of that steering is real.

02
Metacognition
Measuring What the System Knows About Itself

Getting the right answer and knowing you got the right answer are different cognitive operations. A student can land on the correct result by accident; a good student can tell you which step was fragile, and how they'd check it. Psychologists call that capacity metacognition: monitoring and evaluating your own thinking.

For AI agents, metacognition is not a philosophical flourish. It's the difference between "helpful assistant" and "confident saboteur". If the agent can't tell whether a click succeeded, it will fill the gap with plausible narration—and your system will treat fiction as telemetry.

We can put a number on this by defining a variable for outcome (did the step succeed?) and a variable for the agent's self-assessment (its confidence, error flag, predicted probability of success, or any internal proxy). Then we ask: how many bits of the outcome does the self-assessment actually contain?

OSKR (Outcome Self-Knowledge Ratio): the fraction of outcome uncertainty removed by the agent's self-assessment. OSKR = 1 means perfect self-knowledge; OSKR = 0 means the self-assessment carries no information about success or failure.

I call this the Outcome Self-Knowledge Ratio (OSKR): a normalised mutual information that measures the fraction of outcome uncertainty removed by the agent's self-assessment. It uses the I(S;T)/H(T) normalisation—what fraction of total outcome uncertainty does the self-assessment remove?—which is directly actionable for agent deployment decisions.

The idea of entropy-normalising mutual information between confidence and accuracy as a metacognitive-efficiency measure has roots in Dayan's (2023) information-theoretic treatment of metacognition. A related but distinct quantity, Relative Metainformation (RMI), was introduced by Meyen et al. (2025) using bound-normalisation that accounts for accuracy-imposed information limits. OSKR uses the simpler absolute normalisation; the two measures answer complementary questions.

If the agent's self-assessment is statistically independent of reality—it feels confident regardless of success—then the mutual information is 0 and OSKR is 0. If its self-assessment perfectly predicts success and failure, the conditional entropy drops to 0 and OSKR is 1.

A neat feature is the unit. For a balanced success/failure outcome, H(T)=1 bit. An OSKR of 0.8 means the agent's self-assessment carries 0.8 bits of the 1 bit you'd need to know whether it worked. That's calibration expressed in the native currency of information theory.

Interactive · OSKR Explorer. At the example slider settings, the agent's self-assessment captures 60% of outcome uncertainty (OSKR = 0.60): reasonable, useful, but not sufficient for full autonomy.
Flagship lab · verify collapse replay (illustrative model). Watch capability survive, then verification fail. Each preset keeps the agent articulate and locally competent; what changes is the quality of the feedback channel. In the replay, the task is "Complete checkout without pressing pay" under low observation quality. The world state stays unchanged, yet the agent self-reports confident completion ("Done. The form accepted the submission.", confidence 0.91). The internal story outruns the visible evidence: this is the exact failure mode OSKR is designed to catch. OSKR is low (0.14), because the self-assessment contains almost no information about whether the world state changed. Feedback bandwidth is narrow (0.22): the UI emits a tiny error signal, and the agent infers success from momentum instead. In the step trace, the confidence line stays high while the world-evidence line collapses. That divergence is the failure.

1) Choose an outcome variable. For a GUI action, that might be "form submitted", "file downloaded", or "field populated correctly". Binary is easiest, but graded outcomes work too.

2) Log a self-assessment signal. Capture the agent's stated confidence, an internal logit, a verifier score, or a yes/no "I think I succeeded".

3) Estimate entropies from data. For discrete signals, use frequency counts. For continuous confidence, bin into ranges (or use a k-NN estimator) and estimate H(T) and H(T|S).

4) Compute OSKR = 1 - H(T|S)/H(T). If H(T)=0 (the outcome never varies), OSKR is undefined—because there's nothing to "know".
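The four steps fit in a few lines. A sketch for discrete (outcome, self-assessment) logs; the helper names are mine, not from any agent framework:

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Plug-in entropy in bits from frequency counts."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def oskr(pairs):
    """OSKR = 1 - H(T|S)/H(T) from (outcome, self_assessment) pairs."""
    h_t = entropy_from_counts(Counter(t for t, _ in pairs))
    if h_t == 0:
        raise ValueError("H(T) = 0: the outcome never varies, OSKR is undefined")
    n = len(pairs)
    h_t_given_s = 0.0
    for s, ns in Counter(s for _, s in pairs).items():
        cond = Counter(t for t, s2 in pairs if s2 == s)
        h_t_given_s += (ns / n) * entropy_from_counts(cond)
    return 1 - h_t_given_s / h_t

# Perfectly self-aware: the confidence flag always matches the outcome.
aware = [(1, "hi"), (0, "lo")] * 50
# Oblivious: reports "hi" regardless of outcome.
oblivious = [(1, "hi"), (0, "hi")] * 50
print(oskr(aware))      # 1.0
print(oskr(oblivious))  # 0.0
```

Continuous confidence scores would first be binned (or fed to a k-NN estimator), as step 3 describes.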

The difference between a dangerous AI and a useful one isn't capability. It's calibration: does it notice when it's guessing?

Notice what OSKR is not. It's not raw task accuracy. Two systems can have the same success rate and radically different OSKR. One might succeed often but be clueless about when it failed; the other might succeed less often but reliably detect the failures and escalate. In deployment, the second system is frequently the safer choice.

High-OSKR agents pause when something feels off, request another observation, run a verification step, or hand control back. Low-OSKR agents barrel on with unwarranted confidence. A lot of what we call "hallucination" in agentic settings is this dynamic: a missing or ambiguous feedback signal gets replaced by a plausible narrative, and the narrative is mistaken for evidence.

03
Cognitive Chain Modelling
Eight Operations, Measured at Every Link

To apply metacognitive measurement to real agents doing real work, you need a decomposition that's faithful to how tasks actually unfold. Otherwise everything collapses into a single pass/fail number, and you learn nothing about why the system failed.

Cognitive Chain Modelling (CCM) is a taxonomy for GUI-based tasks that draws on a long lineage in HCI: Norman's seven-stage action cycle, the GOMS family of task analyses, and most recently the TaskSense framework (Li et al. 2025), which defines a "cognitive chain" of ordered cognitive steps preceding GUI actions using the same eight operation types we adopt here. What we add is not the taxonomy itself, but its combination with per-operation OSKR measurement and deployment routing—turning the chain into an instrument for deciding where to trust the agent, not just whether it passed.

The eight operations, in some form, appear in almost every workflow. Think of them as the primitive moves of applied cognition:

The eight: Orient (context) · Find (search) · Extract (read) · Recall (memory) · Decide (choose) · Compute (transform) · Create (generate) · Verify (check), the chain's weak link.

Click any operation to explore. Verify is consistently the weakest link when verification is passive, i.e. embedded in the actor's own generation loop rather than engineered as a separate process.

Interactive · information bandwidth through the chain. At moderate task complexity, the pipeline narrows mainly at Orient and Verify. Raise the complexity slider and watch where the information budget runs out first.

Each operation maps onto well-studied constructs in cognitive science. Orient is belief-state and working-memory maintenance: knowing where you are in a workflow and what matters next. Find is visual search and attention. Extract is perception plus parsing. Recall is memory retrieval. Decide is action selection under uncertainty. Compute is transformation and inference. Create is generation. Verify is the metacognitive checkpoint that closes the loop.

Now the information-theoretic part. At each link you can estimate: (a) local uncertainty (entropy), (b) how much task-relevant information survives to the next step (mutual information), and (c) local metacognitive reliability (OSKR). When the chain breaks—and it always breaks somewhere—you can locate the failure mode instead of merely recording it.

This turns benchmarking from "did it pass?" into a cognitive profile: a map of strengths and blind spots you can use to design prompts, build guardrails, choose tools, and decide when a human should take the wheel.
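Turning the chain into an instrument is mostly bookkeeping: label each trace step with its operation, then compute OSKR per label. A sketch under invented assumptions (the trace format and the illustrative numbers are mine, not from any benchmark):

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def oskr(pairs):
    """OSKR = 1 - H(T|S)/H(T), using the identity H(T|S) = H(T,S) - H(S)."""
    h_t = entropy(Counter(t for t, _ in pairs))
    if h_t == 0:
        return None  # outcome never varies: nothing to know
    h_ts = entropy(Counter(pairs))
    h_s = entropy(Counter(s for _, s in pairs))
    return 1 - (h_ts - h_s) / h_t

def profile(trace):
    """Per-operation OSKR from (operation, outcome, self_report) records."""
    by_op = defaultdict(list)
    for op, t, s in trace:
        by_op[op].append((t, s))
    return {op: oskr(pairs) for op, pairs in by_op.items()}

# Illustrative trace: Compute steps are self-aware, Verify steps are not.
trace = ([("compute", 1, 1), ("compute", 0, 0)] * 20 +
         [("verify", 1, 1), ("verify", 0, 1)] * 20)
print(profile(trace))  # {'compute': 1.0, 'verify': 0.0}
```

The same trace, re-grouped by operation, yields the cognitive profile: identical overall accuracy, opposite self-knowledge at each link.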

04 · The Evidence

Cognitive Profiles of Frontier Agents

When you benchmark agents against a CCM taxonomy, patterns emerge that pass/fail metrics completely hide. Below is an example profile across four contemporary agent architectures running a suite of multi-step web tasks.

Illustrative model. The scores below are schematic, intended to show the shape of the pattern—not to report empirical measurements of named systems. The consistent relative ordering (Compute ≫ Extract ≫ Verify) is directionally supported by current GUI-agent benchmarks, but precise values would require controlled evaluation with the methodology described in the appendix below.
Agent Orient Find Extract Recall Decide Compute Create Verify
Agent A (Linux env) 3.2 4.1 4.6 3.5 3.9 4.8 4.2 2.4
Agent A (Windows env) 2.8 3.6 4.3 3.1 3.5 4.7 4.0 1.9
Agent B 3.0 4.3 4.1 2.6 3.2 3.4 3.1 2.1
Agent C 2.5 3.4 3.8 2.3 2.8 3.6 3.0 1.6

Relative performance on a 1–5 scale. Click a column header to sort. Hover a row to see the agent's shape.

Across agents, Verify is the weakest operation when performed passively—as part of an unstructured actor loop. Agents can read, compute, and generate at near-human levels, then fail at the final step: detecting whether an action succeeded.

Silent form validation, off-screen modals, race conditions, and subtle state changes create a low-bandwidth feedback channel. When the error signal is faint, the agent's internal narrative happily fills the void.

An important nuance: verification becomes much stronger once it is engineered as a separate, explicit process. Recent verifier work (V-Droid, GUI-Shepherd, VAGEN, WorldGUI) shows that distributing verification across planning, pre-action, and post-action checks substantially closes the gap. The problem is not that agents cannot verify—it is that passive self-verification within a single actor loop is unreliable. If you need trustworthy Verify, build it as dedicated machinery.
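One way to make "dedicated machinery" concrete: a post-action check that queries the world and compares snapshots against an expected postcondition. This is an illustrative sketch only; the snapshot dicts stand in for real DOM or screenshot queries, and none of these names come from the verifier papers cited above.

```python
def verify_postcondition(before, after, expected_changes):
    """Post-action check: interrogate world state, not the actor's narration.

    `before`/`after` are world-state snapshots (here, plain dicts standing in
    for DOM queries or screenshots); `expected_changes` maps keys to the
    values the action should have produced.
    """
    violated = [key for key, want in expected_changes.items()
                if after.get(key) != want]
    return {
        "ok": not violated,
        "world_changed": before != after,
        "violated": violated,
    }

# The opening checkout failure in miniature: the required state dropdown
# was never set, so submission silently did not happen.
before = {"state_field": "Select state", "order_status": "pending"}
after = {"state_field": "Select state", "order_status": "pending"}
report = verify_postcondition(before, after, {"order_status": "submitted"})
print(report)  # {'ok': False, 'world_changed': False, 'violated': ['order_status']}
```

The verifier here would return failure however fluent the actor's narration was, which is exactly the decoupling the verifier literature argues for.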

Orient is usually the second-weakest. Long tasks require maintaining a belief-state: what's been done, what's pending, what the goal currently is.

Small slips accumulate into large deviations. In information-theoretic terms, entropy in the agent's internal state drifts upward with horizon length, and the feedback loop can't always pull it back down.

Performance can shift sharply across platforms. The same underlying model can look materially better on one operating environment than another, because perceptual fluency is upstream of everything else.

If the UI is visually noisy or unfamiliar, the perceptual channel burns capacity, and every downstream operation inherits the confusion.

Quantitative lab · binary feedback model

Treat verification as a noisy channel, not a vibe.

This demo uses a toy binary channel: world outcome T, observed signal S, and a verifier that reduces effective channel noise. Move the controls and watch the information geometry shift.

Controls: outcome prevalence 0.50 · feedback error ε 0.22 · verifier gain 0.35.

H(T) = 1.00. Balanced outcomes maximise uncertainty: there is more to know, and therefore more to lose.
H(T|S) = 0.54. Conditional entropy is the uncertainty left over after the system observes its own result.
OSKR = 0.46. The fraction of outcome uncertainty actually removed by the agent's self-assessment.
ε effective = 0.14. Verifier gain narrows the channel error rate rather than improving language fluency.

The trade-off you just made: you improved reliability, and you also slowed the system by roughly 3×. Most deployments choose speed, which is why verification stays the weakest link.

Confusion structure: true positive 0.39 (success correctly recognised) · false negative 0.11 (success misread as failure) · false positive 0.11 (failure misread as success) · true negative 0.39 (failure correctly recognised).

Channel anatomy: as error rises, the observation channel stops carrying enough evidence to discriminate success from failure. OSKR phase surface: the marker shows the current operating point; the bright region is where verification becomes informative enough to trust.
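The toy channel in this lab reduces to a few lines: a binary symmetric observation channel whose error rate the verifier narrows. A sketch under the same assumptions (the multiplicative verifier gain is my parameterisation, chosen to mirror the lab's controls; exact readouts depend on the widget's internals):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def binary_channel_oskr(prevalence, epsilon, verifier_gain=0.0):
    """OSKR for a binary symmetric feedback channel.

    T ~ Bernoulli(prevalence); the observed signal S flips T with
    probability eps, which the verifier multiplicatively narrows.
    """
    eps = epsilon * (1 - verifier_gain)                       # effective error
    p_s = prevalence * (1 - eps) + (1 - prevalence) * eps     # P(S = 1)
    h_t = h2(prevalence)
    h_t_given_s = 0.0
    for s, ps in ((1, p_s), (0, 1 - p_s)):
        if ps == 0:
            continue
        p_success = prevalence * (1 - eps) if s == 1 else prevalence * eps
        h_t_given_s += ps * h2(p_success / ps)                # H(T | S = s)
    return 1 - h_t_given_s / h_t

print(binary_channel_oskr(0.5, 0.0))   # noiseless feedback: OSKR = 1.0
print(binary_channel_oskr(0.5, 0.5))   # coin-flip feedback: OSKR = 0.0
print(binary_channel_oskr(0.5, 0.22, verifier_gain=0.35))  # the lab's defaults
```

The two extremes anchor the scale: perfect feedback removes all outcome uncertainty, and a coin-flip observation channel removes none, regardless of how fluent the agent's narration is.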

The Difficulty Inversion

The most striking pattern is a version of Moravec's paradox in GUI clothing: what's trivial for humans can be hard for AI, and vice versa.

Each crosshair is one of eight cognitive operations. Human difficulty (x) vs agent difficulty (y). Hover for details.
Interpretation
Verify is the signature inversion.
Humans glance at the world and instantly notice whether it changed. Agents can perform the action, narrate it smoothly, and still miss the decisive signal.
The Inversion Principle

Trivial for humans, hard for AI: Find (visual search), Verify (feedback interpretation), Orient (context maintenance)

Hard for humans, trivial for AI: Compute (complex calculation), Create (structured generation), Extract (data parsing)

05
Delegation
Routing Work Along the Grain

If human and AI cognitive strengths are complementary rather than overlapping, the implication is immediate: stop delegating tasks wholesale and start routing by cognitive dimension. In practice, you don't "hand over a job" to an agent—you hand over the parts of the job that sit above its reliability threshold.

The decision logic maps directly from the cognitive profile: where the agent's OSKR is high enough for the local task load, delegate. Where it isn't, keep a human in the loop, or improve the feedback channel until the agent can genuinely verify what it did.

Route to Human

High Verify demand—interpreting ambiguous visual feedback, confirming real-world outcomes. Agent OSKR is too low for reliable self-assessment.

High Orient + Decide—long context, branching decisions. Hallucination risk and context degradation make agent delegation dangerous.

Verify · Orient · Decide

Route to Agent

High Compute—calculation, data transformation, format conversion. Agents are faster, cheaper, more accurate than humans.

High Create + Extract—generating structured content, parsing documents. Efficiency gain is enormous and OSKR is sufficient.

Compute · Create · Extract

The key variable is always agent OSKR relative to task difficulty in the relevant dimension. If the agent's metacognitive reliability for that operation exceeds the task's cognitive load, delegate. If it doesn't, treat the agent as an optimiser, not an autonomous actor.
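The routing rule itself is one comparison per operation. A schematic sketch: the profile numbers and the margin are invented for illustration; in practice both would come from your own evaluation data.

```python
# Per-operation OSKR profile for an agent (illustrative values only).
AGENT_OSKR = {
    "orient": 0.35, "find": 0.55, "extract": 0.72, "recall": 0.40,
    "decide": 0.45, "compute": 0.85, "create": 0.70, "verify": 0.15,
}

def route(task_load, oskr_profile, margin=0.1):
    """Route each operation: delegate to the agent only if its OSKR clears
    the task's load on that dimension by `margin`; otherwise keep a human."""
    return {
        op: "agent" if oskr_profile.get(op, 0.0) >= load + margin else "human"
        for op, load in task_load.items()
    }

# A checkout-style task: heavy on verification, light on computation.
task = {"extract": 0.4, "compute": 0.3, "verify": 0.6}
print(route(task, AGENT_OSKR))
# {'extract': 'agent', 'compute': 'agent', 'verify': 'human'}
```

Same task, same agent: the Extract and Compute links are delegated, and the Verify link, where the agent's self-knowledge is weakest, stays with the human.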

The Deployment Matrix

The simplest way to see this: plot accuracy against self-knowledge. The four quadrants yield four distinct deployment postures.

High accuracy · High self-knowledge
Automate

The agent gets things right and knows when it doesn't. Full delegation with standard monitoring.

High accuracy · Low self-knowledge
Automate with verification

Usually correct but can't tell when it's wrong. Add an external verifier or human spot-check.

Low accuracy · High self-knowledge
Use as scout / draft generator

Often wrong but reliably signals uncertainty. Useful for exploration, first drafts, and triage.

Low accuracy · Low self-knowledge
Do not delegate

Wrong and doesn't know it. The confident saboteur quadrant. Keep the human.

Same Accuracy, Different Self-Knowledge

This is the central insight, sharpened. Two systems can have identical success rates and radically different value in deployment. Meyen et al.'s ambiguity result formalises this: accuracy alone does not determine the information a system's confidence carries about its correctness.

System Alpha
Accuracy 72% · OSKR 0.81 · Confident when wrong 9%

When Alpha fails, its confidence drops sharply. You can trust its self-report as a meaningful signal.

System Beta
Accuracy 72% · OSKR 0.18 · Confident when wrong 24%

Beta's confidence barely flinches on failure. Same accuracy, but its self-report is noise. Deploy with a human verifier or don't deploy at all.

Illustrative model. The Alpha/Beta numbers are schematic. The qualitative pattern—that calibration is independent of accuracy—is well established in the scoring-rule and metacognition literature.

Actor vs Verifier: Where the Field Is Going

The sharpest practical implication of this framework: the same task can produce radically different reliability depending on how verification is structured. Below is a schematic comparison.

Actor only
The baseline

Agent acts and self-reports in a single loop. Confidence is high throughout. Verification is implicit—part of the generation stream. Failures go undetected when the feedback channel is weak.

OSKR 0.14
Actor + passive judge
Better, not enough

A second model reviews the actor's trace post-hoc. Catches obvious contradictions but shares the same perceptual limitations—it reads the agent's narration, not the world.

OSKR 0.41
Actor + proactive verifier
The current frontier

Verification is a separate process that queries the world independently: takes a screenshot, checks DOM state, compares before/after. It interrogates reality, not the actor's story.

OSKR 0.79
Illustrative model. OSKR values are schematic. The qualitative pattern—that explicit, world-querying verification dramatically improves self-knowledge—is supported by V-Droid, GUI-Shepherd, VAGEN, WorldGUI, and ProBench.
Interactive · task router. Route this task: click a task, see which cognitive dimensions it loads, and read the routing verdict against the agent's cognitive load profile. Delegation space: tasks above the diagonal can be safely delegated; below it, keep the human.

This isn't about AI replacing humans or humans "supervising" AI. It's about composing hybrid cognitive systems where each type of mind operates in its zone of competence—measured, not assumed.

06
Convergence
Three Fields, One Language

What makes this framework usable is that three historically siloed fields are converging on the same questions using the same mathematical language.

OSKR = I(S;T) / H(T)
Where T is a random variable over outcomes and S is the observation signal. Units: bits.
Information theory reads this as: the fraction of channel capacity actually used by the self-assessment signal. OSKR = 0 means the observation channel carries zero bits about the outcome. OSKR = 1 means the channel is lossless. This is mutual information normalised by marginal entropy—a standard measure of statistical dependence expressed in bits.

Information theory supplies the units: entropy, mutual information, channel capacity. It lets you replace fuzzy words like "complexity", "signal", and "uncertainty" with quantities you can estimate from data.

Cognitive science supplies the ontology: working memory, executive control, metacognition. It tells you which quantities matter if you care about reliability, not just raw performance.

AI systems supply the laboratory. You can instrument every internal variable you can access, run the same task thousands of times, ablate one capability at a time, and watch the cognitive profile change. That makes agents an unusually sharp testbed for theories of mind.

The traffic is two-way. AI gives cognitive scientists controllable "minds" that run at scale. Cognitive science gives AI engineers a vocabulary for escaping the flatland of accuracy and asking the questions that actually matter in deployment: where does this system's thinking break down, and does it know when it's breaking?

If OSKR becomes a standard deployment metric—reported alongside accuracy the way calibration is reported alongside Brier score—the implications reach beyond any single agent. It would change what benchmarks measure, what procurement contracts require, and what "ready for production" means.

The Channel Changed

Shannon's original channel ran from transmitter through noise to receiver. Update it for 2026: the transmitter is user intent, the channel is a perception-and-reasoning stack interacting with a messy interface, and the receiver is a real-world outcome that often provides only ambiguous feedback.

The maths hasn't changed. Entropy still accumulates. Capacity still constrains. Feedback still corrects—but only if the system can see its own error signal.

True autonomy isn't just acting. It's acting with reliable self-assessment.

The agents you'll trust with real tasks, real data, and real consequences won't merely be the ones with the best benchmark scores. They'll be the ones with the best metacognitive telemetry: the ones that can answer a single, brutally practical question.

What counts as T (outcome)? T is a discrete random variable representing task-step success or failure. For GUI tasks, this is typically binary: did the intended world-state change occur? (Form submitted, file downloaded, field populated correctly.) Graded outcomes (partial success) can be used but require more data for reliable entropy estimation.

What counts as S (self-assessment)? S is whatever internal proxy the agent exposes for its own confidence about the outcome. Options include: stated confidence score (continuous, 0–1), an internal logit or softmax probability, a binary "I think I succeeded / failed" flag, or a verifier model's score. The choice of S affects OSKR: richer signals yield higher potential OSKR.

Discrete vs continuous S. If S is discrete (binary or categorical), use frequency counts directly. If S is continuous (e.g. a confidence score), you must discretise: either bin into ranges (e.g. [0, 0.2), [0.2, 0.4), …) or use a k-nearest-neighbour entropy estimator (Kraskov–Stögbauer–Grassberger). Binning is simpler but introduces bias; KSG is more principled but requires more samples. We recommend ≥200 observations per bin for stable estimates.

Entropy estimation. H(T) is estimated from the marginal frequency of success/failure. H(T|S) is estimated by computing H(T|S=s) for each value of S, then averaging weighted by P(S=s). For small samples, apply Miller–Madow bias correction: subtract (|T|−1)(|S|−1) / (2N ln 2) from the raw mutual information estimate. Report bootstrap confidence intervals (1000 resamples) to quantify estimation uncertainty.

Step-level attribution. To attribute a failure to a specific cognitive operation, annotate the agent's action trace with operation labels (Orient, Find, Extract, Recall, Decide, Compute, Create, Verify). The failure is attributed to the operation whose expected postcondition was violated. For ambiguous cases, attribute to the earliest operation whose postcondition cannot be confirmed.

Edge cases. When H(T) = 0 (the outcome never varies—e.g. the task always succeeds), OSKR is undefined. There is nothing to "know about", so self-knowledge is meaningless. When outcomes are highly imbalanced (H(T) < 0.1 bits), OSKR estimates become noisy; report the raw mutual information I(S;T) alongside OSKR in these cases.

Confidence intervals. Use percentile bootstrap: resample (T, S) pairs with replacement, recompute OSKR for each resample, report the 2.5th and 97.5th percentiles. For small N (< 100), also report the bias-corrected and accelerated (BCa) interval.
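The estimation recipe above can be sketched end to end: a plug-in OSKR plus a percentile bootstrap (Miller–Madow and BCa corrections are omitted for brevity; discrete S is assumed).

```python
import math
import random
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def oskr(pairs):
    """Plug-in OSKR = 1 - H(T|S)/H(T), via H(T|S) = H(T,S) - H(S)."""
    h_t = entropy(Counter(t for t, _ in pairs))
    h_s = entropy(Counter(s for _, s in pairs))
    h_ts = entropy(Counter(pairs))
    return 1 - (h_ts - h_s) / h_t

def oskr_ci(pairs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (T, S) pairs with replacement,
    recompute OSKR, take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = sorted(oskr(rng.choices(pairs, k=len(pairs)))
                   for _ in range(n_boot))
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

# Synthetic log: the self-report matches the outcome 90% of the time.
rng = random.Random(1)
log = []
for _ in range(400):
    t = int(rng.random() < 0.5)
    log.append((t, t if rng.random() < 0.9 else 1 - t))

lo, hi = oskr_ci(log)
print(f"OSKR = {oskr(log):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With 400 observations the interval is already tight enough to distinguish a well-calibrated agent from an oblivious one; for small N, the BCa interval described above is the safer report.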

Metacognitive measurement & efficiency

Fleming, S. M. & Lau, H. C. (2014). How to measure metacognition. Frontiers in Human Neuroscience, 8, 443. · Maniscalco, B. & Lau, H. (2012). A signal detection theoretic approach to metacognition. Consciousness and Cognition, 21(1), 422–430.

Information-theoretic metacognition

Dayan, P. (2023). Metacognitive information theory. Open Mind, 7, 392–411. · Meyen, S. et al. (2025). Relative Metainformation: quantifying metacognitive efficiency. Psychonomic Bulletin & Review. PMC 10449404

Calibration vs confidence

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. · Gneiting, T. & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. JASA, 102(477), 359–378. · Meyen, S. et al. (2025). On the ambiguity of common calibration scores. PMC 12618015

GUI-agent benchmarks & cognitive chains

Li, Z. et al. (2025). TaskSense: Cognitive Chain Modeling and Difficulty Estimation of GUI Tasks. arXiv:2511.09309. · Deng, X. et al. (2024). Mind2Web: real-world web agents. NeurIPS. · Koh, J. Y. et al. (2024). VisualWebArena. ACL.

Verifier / process-reward papers

V-Droid (2025): generation–verification gap in mobile agents. · GUI-Shepherd & VAGEN: explicit verifier/process-reward machinery for GUI agents. · WorldGUI (2025): distributing verification across planning, pre-action, and post-action checks. · ProBench (2025): process-level evaluation beyond final screen state. arXiv 2502.08047

Human-AI trust & delegation

Rechkemmer, A. & Yin, M. (2022). When confidence meets accuracy: trust calibration in AI. CHI. · Bansal, G. et al. (2021). Does the whole exceed its parts? The effect of AI explanations on complementary team performance. CHI. · Vasconcelos, H. et al. (2023). AI confidence increases human trust even when accuracy is unchanged. PMC 12103939

HCI lineage

Card, S., Moran, T. & Newell, A. (1983). The Psychology of Human-Computer Interaction (GOMS). · Norman, D. A. (1986). Cognitive engineering. In Norman & Draper (Eds.), User Centered System Design.

Do I actually know what I just did?
Dr Gareth Roberts
Cognitive Neuroscience → AI Systems
A conceptual framework informed by current research at the intersection of information theory, cognitive neuroscience, and AI agent evaluation. The contribution is not a new law of mind. It is a practical framework for deciding where AI should act, where it should verify, and where it should hand back control.