Sreejan Kumar, Theodore R. Sumers, Samuel A. Nastase · 2024
Information Integration
AI
Self-attention computes weighted combinations of value vectors from context tokens; these per-head outputs ("transformations") are fused back into the token embeddings, integrating information distributed across the context.
"The ith token “attends” to tokens based on the inner product of its query vector Qi with the key vectors for all tokens, K. When the query vector matches a given key, the inner product will be large; the softmax ensures the resulting “attention weights” sum to one. These attention weights are then used to generate a weighted sum of the value vectors, V, which is the final output of the self-attention operation (Eq. 1). We refer to the attention head’s output as the “transformation” produced by that head. Each attention head produces a separate transformation for each input token."
Transformer self-attention mechanism, p. 11
This describes how attention heads integrate distributed token information into per-head 'transformations' that are later fused, providing a mechanistic basis for information integration in Transformer LMs that can be compared with brain-wide integration phenomena.
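As a reading aid, here is a minimal NumPy sketch of the per-head operation the quote describes (query-key inner products, softmax attention weights, and a weighted sum of value vectors, as in Eq. 1); all function names and shapes are illustrative, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def headwise_transformations(X, W_q, W_k, W_v):
    """Per-head self-attention: maps the token sequence X (n_tokens x d_model)
    to a 'transformation' for every token, separately for each head.

    W_q, W_k, W_v: lists of per-head projection matrices (d_model x d_head).
    Returns an array of shape (n_heads, n_tokens, d_head).
    """
    outputs = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens into this head's space
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # query-key inner products
        A = softmax(scores, axis=-1)               # attention weights sum to one per token
        outputs.append(A @ V)                      # weighted sum of value vectors
    return np.stack(outputs)                       # headwise "transformations"
```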
Figures
Fig. 1 (p. 3): Clarifies that head-specific transformations are the conduit for integrating contextual information into embeddings, aligning with the information-integration phenomenon.
Limitations: Mechanistic description is within-model and does not by itself establish a direct causal mapping to neural integration processes.
Selective Routing
AI
Headwise functional specialization implies content-sensitive gating/routing via specific attention heads.
"BERT’s training regime has been shown to yield an emergent headwise functional specialization for particular linguistic operations... We split the transformations at each layer into their functionally specialized components—the constituent transformations implemented by each attention head."
Interpreting transformations via headwise analysis, p. 5
Functional specialization at the head level suggests selective routing of information through distinct gates (heads), a key aspect of control over information flow relevant to consciousness-related gating hypotheses.
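A minimal sketch of the "split the transformations into their headwise components" step, assuming the layer output concatenates head outputs in a fixed order; the function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def split_headwise(layer_transformation, n_heads):
    """Split a layer's concatenated transformation vectors into per-head
    components (the functionally specialized constituents).

    layer_transformation: array of shape (n_tokens, n_heads * d_head),
    assumed to be the head outputs concatenated in head order.
    Returns an array of shape (n_heads, n_tokens, d_head).
    """
    n_tokens, d_layer = layer_transformation.shape
    d_head = d_layer // n_heads
    per_head = layer_transformation.reshape(n_tokens, n_heads, d_head)
    return per_head.transpose(1, 0, 2)
```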
Limitations: Head specialization is inferred from correlations with linguistic features and brain mapping; direct interventions on routing (e.g., head ablation) are not reported here in the main text.
Representational Structure
AI
Embeddings and transformations occupy distinct feature spaces: they are essentially uncorrelated TR-by-TR, show different representational geometries, and exhibit different patterns of layerwise specificity.
"despite yielding similar encoding performance, the embeddings and transformations are fundamentally different; for example, the average TR-by-TR correlation between embeddings and transformations across both stimuli is effectively zero (−0.004 ± 0.009 SD), and the embeddings and transformations are not correlated across layers (Fig. S5). The embeddings and transformations also yield visibly different TR-by-TR representational geometries (Fig. S6), and the transformations have considerably higher temporal autocorrelation than the embeddings (Fig. S7)."
Transformer-based features outperform other linguistic features, p. 4
The near-zero correlation and divergent geometries indicate distinct representational subspaces for embeddings vs. per-head transformations, mapping to how content is encoded and organized in modern LMs.
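A sketch of the TR-by-TR correlation analysis quoted above, assuming the concatenated headwise transformations match the embedding dimensionality at each TR; this is an illustration of the comparison, not the paper's pipeline.

```python
import numpy as np

def average_tr_correlation(embeddings, transformations):
    """Correlate the embedding and transformation feature vectors at each TR,
    then summarize across TRs.

    embeddings, transformations: arrays of shape (n_TRs, n_features),
    assumed to have matching dimensionality (e.g., concatenated head outputs).
    Returns the mean and SD of the per-TR Pearson correlations.
    """
    r_per_tr = [np.corrcoef(e, t)[0, 1] for e, t in zip(embeddings, transformations)]
    return np.mean(r_per_tr), np.std(r_per_tr)
```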
"We found that transformation-based predictions capture more unique variance at earlier layers than embedding-based predictions; embeddings, on the other hand, accumulate information over time and capture the most unique variance at later layers."
Layerwise performance of embeddings and transformations, p. 5
Layer-specific uniqueness vs. accumulation across layers reveals a structured progression in how different components contribute to representations, aligning with representational hierarchy analyses.
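A hedged sketch of the variance-partitioning logic behind "unique variance" (R² of the joint model minus R² of the other feature space alone), using ridge regression; the paper's cross-validation and regularization details are omitted, and the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score

def unique_variance(X_a, X_b, y, alphas=(1.0, 10.0, 100.0)):
    """Unique variance of feature space A in predicting response y:
    R^2 of the joint [A, B] model minus R^2 of B alone.
    In-sample sketch only; encoding analyses would cross-validate this.
    """
    def fit_r2(X):
        model = RidgeCV(alphas=alphas).fit(X, y)
        return r2_score(y, model.predict(X))

    r2_joint = fit_r2(np.hstack([X_a, X_b]))
    r2_b = fit_r2(X_b)
    return r2_joint - r2_b  # variance attributable uniquely to A
```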
Limitations: Findings are correlational with respect to brain data and depend on fMRI encoding quality and chosen feature extraction; not a direct mechanistic mapping.
Temporal Coordination
BIO
Headwise 'look-back' distances align with cortical regions’ temporal receptive windows, indicating timescale coordination in language processing.
"left-lateralized anterior temporal and anterior prefrontal cortices were associated with longer look-back attention distances (positive values along PC2; Fig. 4), suggesting that these regions may have longer temporal receptive windows and compute longer-range contextual dependencies, including event- or narrative-level relations."
Interpreting transformations via headwise analysis, p. 9
The mapping between attention 'look-back' distance and cortical timescales provides evidence for temporal coordination mechanisms linking model dynamics to brain processing windows.
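One illustrative way to compute a headwise attention-weighted "look-back" distance from an attention tensor; the paper's exact metric, and its handling of BERT's bidirectional attention, may differ from this assumption.

```python
import numpy as np

def headwise_lookback_distance(attention):
    """Attention-weighted look-back distance for each head.

    attention: array of shape (n_heads, n_tokens, n_tokens) with rows summing
    to one; entry [h, i, j] is how much token i attends to token j.
    Distance is counted in tokens back from the query (future/self counted as 0).
    Returns one mean look-back distance per head.
    """
    n_heads, n_tokens, _ = attention.shape
    idx = np.arange(n_tokens)
    back = np.clip(idx[:, None] - idx[None, :], 0, None)   # tokens back from query i to key j
    return (attention * back[None]).sum(axis=(1, 2)) / n_tokens
```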
Limitations: Timescale inferences are indirect (via headwise PCA and fMRI) and depend on model-derived metrics of look-back distance.
Information Integration
BIO
Per-head transformations explain substantial variance across the cortical language network during narrative comprehension.
"Overall, these findings suggest that the transformations capture a considerable proportion of variance of neural activity across the cortical language network and motivate more detailed treatment of their functional properties."
Transformer-based features outperform other linguistic features, p. 4
That transformations (the model’s integration channel) predict widespread brain activity supports a link between model-based information integration and distributed cortical representations during comprehension.
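A simplified sketch of an encoding model that predicts voxelwise BOLD time series from transformation features and scores cross-validated prediction performance; HRF convolution, feature downsampling, and the paper's specific regression setup are not shown, and the names here are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_performance(features, bold, alphas=(10.0, 100.0, 1000.0)):
    """Cross-validated encoding model.

    features: transformation features aligned to TRs (n_TRs x n_features).
    bold: voxel responses (n_TRs x n_voxels).
    Returns the mean test-fold Pearson correlation per voxel.
    """
    n_voxels = bold.shape[1]
    scores = np.zeros(n_voxels)
    kf = KFold(n_splits=5)
    for train, test in kf.split(features):
        model = RidgeCV(alphas=alphas).fit(features[train], bold[train])
        pred = model.predict(features[test])
        for v in range(n_voxels):
            scores[v] += np.corrcoef(pred[:, v], bold[test, v])[0, 1] / kf.n_splits
    return scores
```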
Limitations: Encoding models capture shared variance but do not establish causal integration mechanisms in cortex; fMRI’s temporal resolution limits inferences about fast binding.