Information Integration
AI
Aligned EEG and audio latents with a CLIP loss to form a unified cross-modal representation usable for zero-shot classification and speech reconstruction.
"Using CLIP we integrate the different modalities of EEG and speech and acquire representations that can be easily used for various downstream tasks, such as zero-shot classification of phrases, and speech reconstruction."
Introduction, p. 2
This explicitly states that the method integrates EEG and audio into a single representational space, exemplifying Information Integration by enabling system-wide access across modalities in the model’s latent space.
"Encoders were applied to the simultaneously recorded speech and EEG signals, respectively, and the CLIP loss between EEG and audio latent representations was optimized."
3.2.2 Decoder training, p. 4
Optimizing a cross-modal CLIP loss directly couples distributed EEG features with audio features into a shared latent geometry, a concrete implementation of Information Integration in an AI system informed by neural data.
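A minimal sketch of the symmetric CLIP-style contrastive objective described in 3.2.2, assuming PyTorch, a fixed temperature, and EEG/audio segments paired by batch index; function and variable names are illustrative, not the authors' code.

```python
# Sketch of a CLIP-style contrastive loss between EEG and audio latents
# (assumes PyTorch; temperature handling and names are illustrative).
import torch
import torch.nn.functional as F

def clip_loss(eeg_latents: torch.Tensor, audio_latents: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products become cosine similarities.
    eeg = F.normalize(eeg_latents, dim=-1)      # (batch, dim)
    audio = F.normalize(audio_latents, dim=-1)  # (batch, dim)

    # Pairwise similarity matrix; entry (i, j) compares EEG i with audio j.
    logits = eeg @ audio.T / temperature        # (batch, batch)

    # Matched EEG/audio segments share an index, so the diagonal is the target.
    targets = torch.arange(eeg.size(0), device=eeg.device)

    # Symmetric cross-entropy: EEG-to-audio and audio-to-EEG retrieval.
    loss_e2a = F.cross_entropy(logits, targets)
    loss_a2e = F.cross_entropy(logits.T, targets)
    return (loss_e2a + loss_a2e) / 2
```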
Figures
Figure 1 (p. 2): Shows the pipeline where EEG and audio are embedded into a shared space using CLIP, visually illustrating cross-modal information integration.
Limitations: Evidence for integration is model-internal and cross-modal rather than a direct neural signature of global broadcasting; title/authors not available in the provided file.
Temporal Coordination
BIO
Temporal variance of EEG latent features in 100 ms windows enables detection of syllable onsets and segmentation of speech periods without explicit training.
"The EEG latent and wav2vec2.0 latent representation were divided into 50 bins (=100ms) respectively, and the SD values were calculated along the time axis within each bin. These SD values were averaged across the features and binarized using a threshold to detect the onset of each syllable."
4 Results, p. 6
Using 100 ms windows to detect syllable onsets and segment speech demonstrates Temporal Coordination mechanisms that bind and segment content in neural-derived latents, akin to timing-based binding in biological systems.
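A minimal sketch of the binned-variance heuristic quoted above: split the latent sequence into 100 ms bins, take the standard deviation over time within each bin, average across feature dimensions, and threshold. The frames-per-bin value, threshold, and function name are illustrative assumptions, not the paper's settings.

```python
# Sketch of speech-onset/interval detection from latent variance
# (numpy; bin size and threshold are assumptions for illustration).
import numpy as np

def detect_speech_bins(latents: np.ndarray, frames_per_bin: int,
                       threshold: float) -> np.ndarray:
    """latents: (time, features) array of EEG or wav2vec2.0 latents."""
    n_bins = latents.shape[0] // frames_per_bin
    trimmed = latents[: n_bins * frames_per_bin]
    # Reshape to (bins, frames_per_bin, features); each bin spans ~100 ms.
    binned = trimmed.reshape(n_bins, frames_per_bin, -1)
    # SD along the time axis within each bin, then average across features.
    sd_per_bin = binned.std(axis=1).mean(axis=1)
    # Bins above the threshold are treated as speaking periods, below as silence.
    return sd_per_bin > threshold
```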
"Figure 4: Voice activity detected from EEG latent representations without explicit training. (a) Process of speech interval detection. ... intervals above the threshold (orange line) ... were detected as speaking periods, and intervals below the threshold were detected as silent periods."
4 Results, p. 6
The model’s EEG latent dynamics track syllable-scale timing to segment speech vs. silence, indicating a temporal organizer that supports content-level segmentation relevant to consciousness timing theories.
Figures
Figure 4 (p. 6): Demonstrates that temporal variance in EEG latents can detect speech onsets and offsets, evidencing temporal coordination.
Tables
Table 3 (p. 6): Comparison of voice activity detection performance between different latents and raw EEG.
Limitations: Temporal segmentation is inferred from latent variance rather than directly from oscillatory phase-locking or cross-frequency coupling; single-participant dataset.
Selective Routing
AI
Training with EMG-mixed augmentation suppresses reliance on EMG, reducing EMG-only decoding to chance while preserving EEG-based decoding.
"The top-10 decoding performance from EMG signals dropped to around 3% for the models trained to disregard EMG signals using the mixed EMG and EEG signals ... Additionally, the decoding accuracy for EEG input remained at approximately 70% ... demonstrating the ability to decode speech without depending on muscle-related artifacts."
4.4 Influence of EMG contamination on decoding performance, p. 8
By gating out EMG contributions during training, the model selectively routes relevant EEG information while suppressing irrelevant muscle artifacts, a clear instance of Selective Routing in an AI system processing neural data.
"In training phase, the EEG encoder receives input X , defined as in Equation4. X = (1− α)× EEG+ α× EMG (4)"
D EMG analysis, p. 13
Explicit mixing of EMG into EEG during training functions as a control signal that teaches the model to ignore EMG, operationalizing selective gating of inputs during learning.
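A minimal sketch of the Equation 4 mixing augmentation, assuming PyTorch tensors in which EEG and EMG share a channel/time layout and assuming alpha is drawn per example; these details are illustrative and not stated in the quoted text.

```python
# Sketch of EMG-mixing augmentation (Equation 4): the encoder is fed a convex
# combination of EEG and EMG while the training target stays tied to the
# speech content, so relying on the EMG component is never rewarded.
# Per-example uniform sampling of alpha is an assumption for illustration.
import torch

def mix_emg(eeg: torch.Tensor, emg: torch.Tensor,
            alpha_max: float = 1.0) -> torch.Tensor:
    """eeg, emg: (batch, channels, time) tensors with matching shapes."""
    # One mixing coefficient per example in [0, alpha_max].
    alpha = torch.rand(eeg.size(0), 1, 1, device=eeg.device) * alpha_max
    # Equation 4: X = (1 - alpha) * EEG + alpha * EMG.
    return (1 - alpha) * eeg + alpha * emg
```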
Figures
Figure 6 (p. 13): Shows the experimental logic for testing and suppressing EMG influence, consistent with selective routing of neural vs. non-neural inputs.
Tables
Table 4 (p. 7): Low susceptibility to EMG artifacts. The susceptibility to EMG artifacts of the EEG encoder was evaluated by the top-10 decoding accuracy from EMG.
Limitations: Selective routing is implemented as a training procedure rather than a directly measured neural gating mechanism; residual EMG influence cannot be completely ruled out according to the authors.
Representational Structure
AI
Cosine-similarity top-k evaluation and an HTNet+Conformer encoder indicate an embedding geometry where EEG and audio latents align in a structured space.
"The cosine similarity matrix was computed between the EEG latents and audio latents of the 512 samples in the test set. For each EEG latent, the indices of the top-k most similar audio latents based on cosine similarity were extracted."
3.3 Evaluation, p. 5
The use of a cosine-similarity matrix and top-k retrieval defines a shared representational structure where EEG and audio embeddings occupy a comparable space with measurable neighborhood relations, aligning with Representational Structure as a core phenomenon.
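A minimal sketch of the cosine-similarity top-k retrieval metric described in 3.3, assuming row-aligned EEG and audio latent arrays; array and function names are illustrative.

```python
# Sketch of top-k retrieval accuracy from a cosine similarity matrix between
# EEG and audio latents (numpy; names are illustrative, not the authors' code).
import numpy as np

def top_k_accuracy(eeg_latents: np.ndarray, audio_latents: np.ndarray,
                   k: int = 10) -> float:
    """eeg_latents, audio_latents: (n_samples, dim), paired by row index."""
    eeg = eeg_latents / np.linalg.norm(eeg_latents, axis=1, keepdims=True)
    audio = audio_latents / np.linalg.norm(audio_latents, axis=1, keepdims=True)
    sims = eeg @ audio.T                      # (n_samples, n_samples) cosine matrix
    # Indices of the k most similar audio latents for each EEG latent.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    matches = np.arange(len(eeg))[:, None]    # ground-truth pairing by index
    return float(np.any(top_k == matches, axis=1).mean())
```

Applied to the 512-sample test set described in Section 3.3 with k = 10, this corresponds to the reported top-10 retrieval metric.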
"For the EEG encoder (Figure 2), we developed a model combining HTNet [44, 45] and Conformer [46]."
3.2.1 EEG preprocessing and encoder, p. 3
An HTNet+Conformer architecture imposes specific inductive biases and hierarchical organization on the EEG latent space, shaping the geometry of representations used for alignment with audio latents.
"Among the three types of speech encoders, wav2vec2.0 achieved the highest accuracy, with 48.5% for top-1 accuracy and 76.0% for top-10 accuracy."
4.1 Zero-shot speech segment classification, p. 5
Performance differences across audio encoders reflect how representational compatibility shapes alignment quality, evidencing that the structure of the embedding space constrains retrieval and decoding behavior.
Figures
Figure 2 (p. 3): Depicts the encoder design that shapes the EEG latent representation used for cross-modal alignment.
Figure 3 (p. 5): Scaling trends indicate how representational structure improves with more data, affecting retrieval quality.
Tables
Table 2 (p. 5): Zero-shot segment classification accuracy across different audio encoders.
Limitations: Representational claims are inferred from retrieval metrics and architecture diagrams rather than direct analysis of latent geometry (e.g., probing subspaces or disentanglement); single-subject training set.