TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
Stéphane d’Ascoli, Jérémy Rapin, Yohann Benchetrit, Hubert Banville, Jean-Rémi King · 2025
Evidence (5)
Information Integration
BIO
Multimodal (text+audio+video) features integrated by an encoder yield higher brain-encoding performance than any unimodal input, especially in associative cortices.
"When combining any two modalities together, the encoding scores rise significantly compared to the unimodal models: the best bimodal model, obtained by combining text and video, achieves an encoding score of 0.30. Finally, combining the three modalities together yields an additional boost, bringing the value to 0.31. This hints to the fact that each modality plays a complementary role."
3.3 The benefit of multimodality, p. 6
This quantitative improvement from unimodal to bimodal to trimodal inputs indicates integration of distributed information streams into unified representations that better match whole-brain activity, consistent with Information Integration as a core phenomenon in consciousness science (see also the cortical maps emphasizing associative areas).
"These observations provide new insights on how multisensory integration may occurs in the human cortex."
3.4 The interplay between modalities, p. 7
Explicitly linking the results to multisensory integration supports the view that distributed modality-specific features are combined into unified cortical representations, aligning with Information Integration.
Figures
Figure 4 (p. 6)
: Direct evidence that combining modalities improves brain-encoding performance, indicating integration across features and brain networks relevant to conscious content.
Figure 5 (p. 7)
: Bimodal and trimodal patterns across cortex visualize where integration occurs, especially in associative regions implicated in global access and conscious processing.
Limitations: Encoding gains are correlational; the fMRI TR of 1.49 s limits temporal precision, so the method cannot resolve the neural mechanisms of binding per se, and the 1,000-parcel spatial parcellation may obscure finer-grained integration effects.
Selective Routing
AI
Modality dropout masks inputs during training, encouraging the model to flexibly route and rely on available modalities rather than over-weighting any single channel.
"Modality dropout One desirable property of a multimodal encoding model is its ability to provide meaningful predictions in the absence of one or several modalities, for example for a silent movie or a podcast. To encourage this behaviour, while at the same time avoiding excessive reliance on one modality, we introduce modality dropout: during training, we randomly mask off each modality by zeroing out the corresponding input tensor with a probability p, ensuring that at least one modality is left unmasked."
2.4 Training details, p. 4
Masking/gating entire modalities implements selective routing pressure in the model, paralleling attentional gating and thalamocortical control motifs studied in biological consciousness research.
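To make the quoted mechanism concrete, here is a minimal PyTorch sketch of modality dropout, assuming dict-of-tensor inputs; the function name, the probability value, and the single per-call masking decision (rather than per-sample masking) are illustrative assumptions, not the paper's released code.

```python
import torch

def modality_dropout(features: dict[str, torch.Tensor], p: float = 0.3) -> dict[str, torch.Tensor]:
    """Zero out whole modalities with probability p, keeping at least one.

    `features` maps modality names ("text", "audio", "video") to tensors
    of shape [batch, time, dim]. The probability value and the per-call
    (not per-sample) masking are illustrative assumptions.
    """
    names = list(features.keys())
    # Independently decide whether to drop each modality.
    drop = [torch.rand(1).item() < p for _ in names]
    # Ensure at least one modality is left unmasked, as in the paper.
    if all(drop):
        drop[torch.randint(len(names), (1,)).item()] = False
    return {
        name: torch.zeros_like(x) if d else x
        for (name, x), d in zip(features.items(), drop)
    }
```

Zeroing the whole input tensor, rather than individual features, prevents the encoder from over-relying on any single modality and lets it produce meaningful predictions when a modality is genuinely absent (e.g., a silent movie or a podcast).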
Tables
Table 3 (p. 13)
: Hyperparameters used in our model
Limitations: While masking encourages flexible routing, it is an engineered training heuristic rather than a direct mechanistic analogue of biological gating circuits; the paper does not report internal attention maps or routing diagnostics that would provide more direct evidence of selective pathways.
Causal Control
AI
Ablations that remove the transformer or the multi-subject training scheme reduce encoding performance, demonstrating that specific architectural components and training choices control prediction quality.
"In fig. 6a, we demonstrate the importance of using a transformer and a multi-subject training scheme: the encoding score drops from 0.31 to 0.29 when training separately for each subject, and down to 0.23 when removing the transformer."
3.5 Ablations and scaling laws, p. 7
These controlled interventions (ablating the transformer; changing the training regime) alter performance, providing causal evidence that specific modules and training choices govern computational competence, an AI-side parallel to lesion- and TMS-like causal tests in consciousness studies.
Figures
Figure 6 (p. 8)
: Performance drops under ablations indicate causal control of computation by specific architectural components, analogous to interventionist methods in neuroscience.
Limitations: Ablations show outcome-level effects but do not pinpoint mechanism-level pathways (e.g., which attention heads or submodules drive changes), limiting the granularity of causal interpretation.
Representational Structure
AI
Layer grouping and concatenation produce 3 × 1024-dimensional multimodal embeddings that the transformer then organizes, reflecting a designed representational geometry spanning modalities and feature levels.
"To compress these embeddings while retaining both low-level and high-level information, for each modality, we split the layers into L groups, then average the tensor per group along the layer dimension, compressing to a shape [L,Dm]. We then concatenate the layers and feed the resulting vector through a linear layer with a shared output dimension D = 1024 followed by layer normalization. Finally, we concatenate the three modalities, leading to a time series of multimodal embeddings of shape 3× 1024. This will be the input to our transformer encoder."
2.3 Model, p. 4
The described embedding pipeline specifies a representational structure that organizes multimodal content in a shared subspace, aligning with the Representational Structure phenomenon and enabling downstream access/control by attention mechanisms.
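The quoted pipeline can be sketched as follows, assuming per-modality hidden states of shape [time, layers, dim]; the backbone dimensions, the number of layer groups L, and the choice to stack (rather than flatten) the three projected modalities are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Compress layer-wise features of one modality into a shared 1024-d space.

    Layers are averaged within L groups, the groups are concatenated, and the
    result is linearly projected to D = 1024 and layer-normalized, following
    the description in Section 2.3. Dimensions are illustrative placeholders.
    """

    def __init__(self, dim_m: int, num_groups: int = 4, d_out: int = 1024):
        super().__init__()
        self.num_groups = num_groups
        self.proj = nn.Linear(num_groups * dim_m, d_out)
        self.norm = nn.LayerNorm(d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t, n_layers, dim_m = x.shape  # [time, layers, dim_m]
        # Average layers within each of the L groups -> [time, L, dim_m].
        grouped = x.reshape(t, self.num_groups, n_layers // self.num_groups, dim_m).mean(dim=2)
        # Concatenate groups and project to the shared dimension D, then normalize.
        return self.norm(self.proj(grouped.reshape(t, self.num_groups * dim_m)))

# Hypothetical backbone layer counts and dimensions; the fused sequence is [time, 3, 1024].
projectors = {"text": ModalityProjector(4096), "audio": ModalityProjector(1024), "video": ModalityProjector(1280)}
feats = {"text": torch.randn(100, 32, 4096), "audio": torch.randn(100, 24, 1024), "video": torch.randn(100, 24, 1280)}
multimodal = torch.stack([projectors[m](feats[m]) for m in feats], dim=1)  # input to the transformer
```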
Figures
Figure 2 (p. 2)
: Schematic depicts how text, audio, and video embeddings are combined and processed, illustrating the model's representational layout across modalities and layers.
Limitations: The paper does not provide probing or sparse autoencoder analyses of internal latents; thus, while the representational geometry is described architecturally, fine-grained interpretability (e.g., disentangled features) remains untested.
Temporal Coordination
AI
Windows aligned to fMRI TRs receive positional embeddings, and the transformer exchanges information between timesteps, coordinating content over time.
"We extract windows of duration T = N × TR from these embedding time series, add learnable positional embeddings and a learnable subject embedding, then feed the result through a Transformer encoder with 8 layers and 8 attention heads. This enables information to be exchanged between timesteps."
2.3 Model, p. 4
Using positional embeddings and attention to exchange information across timesteps implements temporal coordination, an AI analogue of the timing and phase mechanisms proposed to segment and bind conscious content in the brain.
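A minimal sketch of this temporal stage, assuming the 3 × 1024 multimodal embeddings are flattened to a single vector per TR-aligned timestep; the window length, number of subjects, and the additive injection of the subject embedding are assumptions, while the 8 layers and 8 attention heads follow the quoted description.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Transformer over a window of TR-aligned multimodal embeddings.

    Learnable positional embeddings and a learnable per-subject embedding
    are added to the window before an 8-layer, 8-head Transformer encoder,
    which lets information be exchanged between timesteps.
    """

    def __init__(self, n_timesteps: int = 100, d_model: int = 3 * 1024,
                 n_subjects: int = 4, n_layers: int = 8, n_heads: int = 8):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(1, n_timesteps, d_model))
        self.subject_emb = nn.Embedding(n_subjects, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x: torch.Tensor, subject: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_timesteps, d_model]; subject: [batch] integer ids.
        x = x + self.pos_emb + self.subject_emb(subject)[:, None, :]
        return self.encoder(x)  # timesteps attend to one another here

# Example: two windows of 100 timesteps for two hypothetical subjects.
model = TemporalEncoder()
out = model(torch.randn(2, 100, 3 * 1024), torch.tensor([0, 1]))  # [2, 100, 3072]
```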
Figures
Figure 2 (p. 2)
: Pipeline illustration shows temporally binned, aligned embeddings feeding a transformer, consistent with coordinated temporal processing.
Limitations: Temporal alignment is constrained by the fMRI TR and the 2 Hz embedding grid, limiting inferences about faster timing phenomena (e.g., oscillations, phase-locking) that are central to biological temporal coordination.