Kathryn T. Farrell, Kirsten Ziman, Michael S. A. Graziano · 2025
Self-Model and Reportability
Adding a learned attention schema (self-model) made agents better at judging others' attention and made their own attention states easier for others to categorize.
"Experiment 1 showed that the presence of an attention schema in a network (the trained ability to predict its own attention states) confers two specific advantages, at least in one kind of network with a specific implementation of attention and of an attention schema. The first advantage is that when a network with an attention schema is pretrained on a task, that network is then better able to learn to make judgements about the attention states of other networks. The second advantage is that when a network contains an attention schema, its attention states become easier for other networks to interpret and categorize."
Discussion for Experiment 1, p. 6
This explicitly links a self-model both to improved reading of others' attention and to making one's own states more reportable and interpretable, matching the 'Self-Model & Reportability' criterion in AI systems.
Figures
Figure 2 (p. 6): Higher discrimination accuracy when agents had an attention schema indicates improved mutual interpretability, consistent with self-model and reportability in AI.
Limitations: Findings are from artificial agents using transformer attention and a simplified task; human relevance is hypothesized but not directly tested.
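The attention schema described here is, at its core, a component trained to predict the network's own attention state. The idea can be illustrated with a minimal numpy sketch in which a linear "schema" head learns to reproduce a toy softmax attention distribution; the dimensions, training loop, and loss are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy "attention": given a query, produce attention weights over n keys.
d, n = 8, 5
keys = rng.normal(size=(n, d))

def attention_weights(query):
    return softmax(keys @ query)

# "Attention schema": a linear head trained to predict the network's
# own attention weights from the same query (a learned self-model).
W = np.zeros((n, d))

def schema_predict(query):
    return softmax(W @ query)

# Train the schema head by SGD on the cross-entropy between its
# prediction and the true attention weights.
lr = 0.5
losses = []
for step in range(300):
    q = rng.normal(size=d)
    target = attention_weights(q)
    pred = schema_predict(q)
    losses.append(-(target * np.log(pred + 1e-9)).sum())
    # Gradient of softmax cross-entropy w.r.t. logits is (pred - target),
    # so the weight gradient is its outer product with the query.
    W -= lr * np.outer(pred - target, q)

print(f"first-10 mean loss {np.mean(losses[:10]):.3f}, "
      f"last-10 mean loss {np.mean(losses[-10:]):.3f}")
```

The schema head converges because the self-prediction target is realizable (W can approach `keys`); in the paper the schema is likewise trained on the network's own attention states rather than on external labels.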
Emergent Dynamics
Teams of agents with attention schemas earned higher joint rewards and selected fewer overlapping pixels, indicating emergent coordination dynamics.
"In line with our predictions, we found that schema-schema teams achieved the highest rewards on the cooperative coloring task (blue line in Figure 4A, averaged across epochs, mean=2.04). Mixed teams received the second highest rewards (green line in Figure 4A, averaged across epochs, mean=1.91). Control-control teams obtained the smallest rewards (orange line in Figure 4A, averaged across epochs, mean=1.76). We found that schema-schema teams and mixed teams exhibited better coordination, selecting fewer overlapping pixels on average (blue line for schema-schema teams, mean across epochs=54.51; green line for mixed teams, mean across epochs=54.89), relative to the control-control teams (orange line for control-control teams, mean across epochs=57.79)."
Results for Experiment 3, p. 8
Improved collective reward and decreased overlap emerge from the interaction when agents include self-models, evidencing emergent coordination dynamics relevant to consciousness-like higher-order organization in information-processing systems.
"The results of Experiment 3 show that having an attention schema confers a benefit on the joint coloring task... The data suggest that the stronger effect is that when a network has an attention schema, it becomes more predictable to its partner."
Discussion for Experiment 3, p. 9
The discussion highlights an emergent group-level property (mutual predictability) arising from self-modeling, a hallmark of emergent dynamics in complex systems.
Figures
Figure 4 (p. 8): Visualizes the emergent improvement in cooperation and coordination for attention-schema agents across training.
Limitations: The emergent effects are demonstrated in a simplified, stylized social environment; generalization to richer multi-agent settings remains to be established.
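The coordination measure reported above (fewer overlapping pixels going with higher joint reward) can be made concrete with a toy stand-in for the coloring game. The grid size, pick counts, and coverage-based reward rule below are assumptions chosen for illustration, not the paper's task; the "predictable" pair simply stands in for agents whose behavior a partner can anticipate:

```python
import random

random.seed(0)

GRID = 100   # pixels in the toy image (hypothetical size)
PICKS = 60   # pixels each agent colors per turn (hypothetical count)

def play_turn(policy_a, policy_b):
    """One turn of a toy cooperative coloring game: the joint reward
    grows with total coverage, so overlapping picks are wasted effort."""
    a, b = set(policy_a()), set(policy_b())
    overlap = len(a & b)
    reward = len(a | b) / GRID   # hypothetical joint reward: coverage
    return overlap, reward

# An "unpredictable" agent picks uniformly at random each turn.
uniform = lambda: random.sample(range(GRID), PICKS)

# A "predictable" pair (standing in for mutually interpretable,
# schema-bearing agents) biases picks toward complementary regions.
low_half  = lambda: random.sample(range(0, 70), PICKS)
high_half = lambda: random.sample(range(30, 100), PICKS)

def average(pair, turns=500):
    results = [play_turn(*pair) for _ in range(turns)]
    overlaps, rewards = zip(*results)
    return sum(overlaps) / turns, sum(rewards) / turns

ov_ctrl, rw_ctrl = average((uniform, uniform))
ov_coord, rw_coord = average((low_half, high_half))
print(f"control pair:     mean overlap {ov_ctrl:.1f}, mean reward {rw_ctrl:.3f}")
print(f"coordinated pair: mean overlap {ov_coord:.1f}, mean reward {rw_coord:.3f}")
```

As in the paper's Figure 4, lower overlap and higher joint reward move together, because overlap is exactly the redundancy that coverage-style rewards penalize.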
Causal Control
Architectural intervention (presence/absence of an attention schema) systematically changed agents’ cooperative performance.
"Each network in the interacting pair could have an attention schema or could lack one, resulting in three types of image-coloring teams: schema-schema teams, mixed teams (where one network had an attention schema and one did not), and control-control teams (where neither network had an attention schema). We predicted that schema-schema teams would show the best joint performance, achieving the highest rewards, and control-control teams would show the poorest performance, obtaining the lowest rewards. For each type of pairing, we trained the networks for 50 epochs, 6,000 game turns per epoch."
Methods for Experiment 3, p. 7
This specifies a targeted intervention on model architecture, enabling causal tests of how a self-model affects computation and behavior in a controlled setting.
"In line with our predictions, we found that schema-schema teams achieved the highest rewards on the cooperative coloring task (blue line in Figure 4A, averaged across epochs, mean=2.04). Mixed teams received the second highest rewards (green line in Figure 4A, averaged across epochs, mean=1.91). Control-control teams obtained the smallest rewards (orange line in Figure 4A, averaged across epochs, mean=1.76)."
Results for Experiment 3, p. 8
The observed performance shifts follow directly from the architectural manipulation, demonstrating causal control of behavior via the self-model intervention.
Figures
Figure 4 (p. 8): The experimental manipulation (adding an attention schema) causally improves cooperative performance across epochs.
Limitations: Causal claims are at the level of an architectural add-on; finer-grained causal mapping (e.g., component ablations, activation patching) was not performed.
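The experimental design reduces to toggling one architectural flag per agent and comparing team-level outcomes across the three pairings. The harness below sketches only that comparison structure; the reward distribution is a hypothetical stand-in with the effect baked in (seeded from the paper's reported control-control mean of 1.76), so it illustrates the intervention logic, not the learning that produces the effect:

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in: per-turn team reward is drawn from a
# distribution whose mean rises with the number of schema-bearing
# agents, mirroring the paper's three team types.
def toy_team_reward(schema_a, schema_b):
    base = 1.76                           # control-control mean (paper)
    bonus = 0.14 * (schema_a + schema_b)  # hypothetical per-schema bonus
    return random.gauss(base + bonus, 0.2)

def run_condition(schema_a, schema_b, turns=6000):
    # 6,000 game turns per epoch, as in the paper's training setup.
    return statistics.mean(toy_team_reward(schema_a, schema_b)
                           for _ in range(turns))

conditions = {
    "schema-schema":   (True, True),
    "mixed":           (True, False),
    "control-control": (False, False),
}
means = {name: run_condition(*flags) for name, flags in conditions.items()}
for name, m in means.items():
    print(f"{name:16s} mean reward {m:.2f}")
```

Treating the schema flag as the sole independent variable is what licenses the causal reading: any systematic difference between conditions must trace back to the architectural add-on.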