CortexFM: A Lightweight Multimodal Foundation Model for Spike + EMG BCI

A 5.04 M-parameter multimodal Transformer foundation model that jointly learns spike trains and surface EMG envelopes from public DANDI motor-cortex data, evaluated on the FALCON M1 benchmark.


Model description

CortexFM is a small, public, and fully reproducible foundation model for invasive brain–computer interface (BCI) decoding. It targets the regime where private million-hour pretraining data and 45 M – 350 M parameter backbones are not available, and asks how far we can push neural-decoding quality with ~3.85 hours of public data and a ~5 M-parameter model trained in about six minutes on a single consumer GPU.

CortexFM rests on three design principles: (i) preserving per-unit and per-muscle identity, (ii) reproducibility using only public data and public benchmarks, and (iii) alignment with the FALCON standard. The backbone is a 10-layer × 6-head × d = 192 PreNorm Transformer (4.45 M params, FLASH SDPA). The heads have three branches: spike reconstruction under a Poisson NLL, EMG reconstruction under MSE, and cross-modal InfoNCE contrastive learning.

Architecture summary

| Component | Configuration |
|---|---|
| Backbone | PreNorm Transformer, 10 layers, 6 heads, d_model = 192, FFN = 768, GELU |
| Attention | SDPA with FLASH / EFFICIENT backends (PyTorch 2.10) |
| Backbone params | 4,449,024 (≈ 4.45 M) |
| Spike tokenizer | Per-unit learned embedding ⊕ log(1 + α · count) + temporal positional embedding |
| EMG tokenizer | Per-muscle learned embedding ⊕ scalar-to-vector MLP + temporal positional embedding |
| Heads | Spike recon (Poisson NLL), EMG recon (MSE), contrastive projector (d_p = 128) |
| Total params | 5,044,994 (≈ 5.04 M) |
| Bin size | 20 ms (FALCON official) |
| Context length T | 64 bins (1.28 s) → 1,088 tokens |
| Mixed precision | BF16 (InfoNCE softmax promoted to FP32) |
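The backbone count in the table can be reproduced by hand as a cross-check. The sketch below assumes each layer has the same parameter layout as PyTorch's `nn.TransformerEncoderLayer`: a fused QKV projection with biases, an output projection, a two-layer FFN and two LayerNorms. It also assumes the stack ends in one extra LayerNorm. Under those assumptions the total is exactly 4,449,024. The head count does not appear in the formula, because 6 heads only split d_model = 192 into 32-dimensional slices.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameters of one PreNorm encoder layer with a fused QKV projection and biases."""
    attn = 3 * d_model * d_model + 3 * d_model   # fused Q/K/V projection + bias
    attn += d_model * d_model + d_model          # attention output projection + bias
    ffn = d_model * d_ff + d_ff                  # FFN up-projection + bias
    ffn += d_ff * d_model + d_model              # FFN down-projection + bias
    norms = 2 * (2 * d_model)                    # two LayerNorms (weight + bias each)
    return attn + ffn + norms


def backbone_params(n_layers: int, d_model: int, d_ff: int, final_norm: bool = True) -> int:
    """Total parameters of a stack of encoder layers, optionally followed by one LayerNorm."""
    total = n_layers * encoder_layer_params(d_model, d_ff)
    if final_norm:
        total += 2 * d_model  # assumed final LayerNorm (weight + bias)
    return total


print(backbone_params(n_layers=10, d_model=192, d_ff=768))  # 4449024
```

The remaining 5,044,994 − 4,449,024 = 595,970 parameters (≈ 0.6 M) belong to the tokenizers and heads, which this sketch does not model.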

Intended uses

  • Research-grade neural decoding of primate motor cortex (M1) spike trains into 16-channel surface/intramuscular EMG envelopes.
  • Backbone for downstream BCI probes: as a frozen feature extractor with a thin (~3 K-param) per-session output-space affine adapter, CortexFM enables session-1 adaptation to held-out recording days.
  • Cross-modal pretraining baseline for studies that compare per-unit tokenization against patch-tokenized BCI foundation models (e.g., NDT-3) at a 1/9 – 1/69 parameter ratio.
  • Educational reference for compact (~5 M-param) foundation-model training from public data on a single consumer GPU.

Out-of-scope uses

  • Clinical or assistive deployment. This is a research checkpoint trained on a single non-human primate (MonkeyL, DANDI 000941). It is not intended for human BCI control or medical decision-making.
  • Cross-subject generalization. The pretraining set is one subject; cross-subject transfer (e.g., MonkeyN, MC_Maze, human cortex) has not been validated.
  • Direct kinematic decoding. The model outputs EMG envelopes; downstream kinematic readouts require an additional decoding stage.
  • Real-time control without calibration. Held-out sessions require a brief (≥ ~8 s) per-session affine calibration to enter the positive-R² regime.

Training data

| Dataset | DOI | Subject | Modality | Sessions |
|---|---|---|---|---|
| DANDI:000941 (Rouse & Schieber 2018) | 10.48324/dandi.000941/0.211015.0907 | MonkeyL (1 NHP) | M1 spikes (64 units) + intramuscular EMG (16 muscles) | 11 sessions total |

Pretraining uses the four held-in calibration sessions of DANDI 000941 (sessions 20120924, 20120926, 20120927, 20120928), totaling 3 h 38 min of paired spike + EMG recordings. The remaining 7 sessions (4 minival + 3 held-out calibration) are reserved for FALCON M1 evaluation and OOD session-1 adaptation.

License: CC-BY-4.0 (DANDI public release).

Preprocessing pipeline

  • EMG: 60/180/200/300/400 Hz notch → 4th-order Butterworth high-pass at 65 Hz → rectify → 99 % clip → 95 % normalize → polyphase resample (1 kHz → 50 Hz) → re-rectify → 10 Hz low-pass envelope.
  • Spike: 20 ms bin counts per unit on the same time grid.
  • Output: Zarr store with /emg/envelope, /spike/counts, /eval_mask, and trial markers. Spike/EMG share a common 20 ms bin axis (FALCON official invariant).
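Of the three steps, spike binning is the easiest to reproduce without a signal-processing stack. The numpy sketch below assumes spike times are given in seconds for each unit and that the shared grid starts at a known `t_start`. `bin_spikes` is a hypothetical helper, not part of the CortexFM code.

```python
import numpy as np

BIN_S = 0.020  # 20 ms bins, the FALCON M1 grid


def bin_spikes(spike_times_per_unit, t_start, t_stop, bin_s=BIN_S):
    """Count spikes per unit on a shared 20 ms grid.

    spike_times_per_unit: list of 1-D arrays of spike times in seconds, one per unit.
    Returns (counts, edges) where counts has shape (n_bins, n_units).
    """
    n_bins = int(round((t_stop - t_start) / bin_s))
    edges = t_start + bin_s * np.arange(n_bins + 1)
    counts = np.zeros((n_bins, len(spike_times_per_unit)), dtype=np.int64)
    for u, times in enumerate(spike_times_per_unit):
        # np.histogram returns (hist, edges); keep only the per-bin counts
        counts[:, u], _ = np.histogram(times, bins=edges)
    return counts, edges


counts, edges = bin_spikes([np.array([0.005, 0.015, 0.025]), np.array([0.039])], 0.0, 0.04)
```

Because the EMG envelope is resampled to 50 Hz, each envelope sample lines up with one of these 20 ms bins. This alignment is what lets both modalities share one time axis.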

Training procedure

Objectives

Joint loss with three components:

$$\mathcal{L}_{\text{total}} = w_{\text{spike}} \cdot \mathcal{L}_{\text{spike}} + w_{\text{emg}} \cdot \mathcal{L}_{\text{emg}} + w_{\text{cont}} \cdot \mathcal{L}_{\text{cont}}$$

  • $\mathcal{L}_{\text{spike}}$: Poisson NLL over per-unit spike counts.
  • $\mathcal{L}_{\text{emg}}$: MSE on per-muscle EMG envelopes.
  • $\mathcal{L}_{\text{cont}}$: Symmetric InfoNCE on pooled cross-modal embeddings (FP32-promoted), temperature τ = 0.1.

Loss weights $(w_{\text{spike}}, w_{\text{emg}}, w_{\text{cont}}) = (1.0, 1.0, 0.5)$.
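The three terms can be sketched in numpy as follows. This is an illustration under stated assumptions, not the CortexFM implementation. The spike term uses the log-rate form of the Poisson NLL, which is the same form as PyTorch's `poisson_nll_loss` with `log_input=True`. The pooled embeddings are assumed to be L2-normalized before the similarity matrix is formed. An explicit cast stands in for the card's FP32 promotion.

```python
import numpy as np


def poisson_nll(log_rate, counts):
    """Poisson NLL with log-rate input, dropping the constant log(k!) term."""
    return float(np.mean(np.exp(log_rate) - counts * log_rate))


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_spike, z_emg, tau=0.1):
    """Symmetric InfoNCE over pooled (B, d) embeddings; matched pairs share a row index."""
    z_spike = z_spike.astype(np.float32)  # mirrors the FP32 promotion
    z_emg = z_emg.astype(np.float32)
    z_spike = z_spike / np.linalg.norm(z_spike, axis=1, keepdims=True)  # assumed L2 norm
    z_emg = z_emg / np.linalg.norm(z_emg, axis=1, keepdims=True)
    logits = z_spike @ z_emg.T / tau  # (B, B) cosine similarities / temperature
    idx = np.arange(len(logits))
    loss_s2e = -log_softmax(logits, axis=1)[idx, idx].mean()  # spike -> EMG direction
    loss_e2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # EMG -> spike direction
    return float(0.5 * (loss_s2e + loss_e2s))


def total_loss(l_spike, l_emg, l_cont, w=(1.0, 1.0, 0.5)):
    """Weighted sum with the card's weights (w_spike, w_emg, w_cont)."""
    return w[0] * l_spike + w[1] * l_emg + w[2] * l_cont
```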

Masking

Spike and EMG tokens are masked independently, each at a 50 % rate per bin. Each modality must therefore be reconstructed from its own unmasked bins together with the (independently masked) other modality.
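One way to realize this masking scheme is sketched below. It assumes that "per bin" means a masked bin hides every token of that modality in that bin; the card does not state any finer granularity. `independent_bin_masks` is a hypothetical helper.

```python
import numpy as np


def independent_bin_masks(batch, n_bins, n_units, n_muscles, p=0.5, rng=None):
    """Draw independent per-bin masks for the two modalities (True = masked).

    Returns boolean arrays of shape (batch, n_bins, n_units) and (batch, n_bins, n_muscles).
    """
    rng = np.random.default_rng() if rng is None else rng
    spike_bins = rng.random((batch, n_bins)) < p  # one draw per (sample, bin) for spikes
    emg_bins = rng.random((batch, n_bins)) < p    # independent draw for EMG
    # Broadcast each bin decision to every token of that modality in the bin
    spike_mask = np.repeat(spike_bins[:, :, None], n_units, axis=2)
    emg_mask = np.repeat(emg_bins[:, :, None], n_muscles, axis=2)
    return spike_mask, emg_mask


spike_mask, emg_mask = independent_bin_masks(8, 64, 64, 16, rng=np.random.default_rng(0))
```

Masked inputs would then be replaced, for example by zeros or a learned mask token, and the reconstruction losses evaluated on the masked positions. The card does not specify either choice.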

Optimization

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3 × 10⁻⁴ |
| Weight decay | 0.01 |
| LR schedule | Linear warmup (500 steps) → cosine decay |
| Batch size | 8 |
| Context length T | 64 bins |
| Mixed precision | BF16 (InfoNCE softmax in FP32) |
| Gradient clip | 1.0 |
| Max epochs | 50 (best checkpoint at epoch 28) |
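The schedule can be written as a plain function of the step index. This sketch makes three assumptions that the card does not confirm: the warmup ramps linearly from near zero, the cosine decays to `min_lr = 0`, and `total_steps` is known in advance. The card gives neither the final learning rate nor the step count.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=500, min_lr=0.0):
    """Linear warmup to base_lr over warmup_steps, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # reaches base_lr at the last warmup step
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Dividing the result by `base_lr` gives a multiplier that can be passed to `torch.optim.lr_scheduler.LambdaLR`.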

Training environment

| Item | Value |
|---|---|
| GPU | NVIDIA RTX 5080 (16 GB GDDR7, sm_120), single consumer card |
| OS / runtime | WSL2 Ubuntu 24.04 |
| Framework | PyTorch 2.10.0 + cu128, PyTorch Lightning |
| Wall-clock training time | ≈ 6 minutes for 30 epochs |
| Best checkpoint | epoch28-0.2599.ckpt (60.7 MB, val_loss = 0.2599) |
| External cloud GPU | None (fully on-device) |

Train/val gap stayed below 0.03 throughout, so no early stopping was applied and the lowest-validation-loss checkpoint was kept verbatim.


Evaluation

FALCON M1 (held-in calibration sessions, variance-weighted R² over 16 muscles)

| Setting | Params (used) | Per-session R² (mean ± std) | Pooled R² | NL |
|---|---|---|---|---|
| POYO-1 zero-EMG floor | 15.47 M (0 used) | −1.273 ± 0.299 | n/a | 2.4e-5 |
| POYO-1 frozen + per-session affine | 15.47 M + ~2 K/sess | +0.451 ± 0.112 | +0.498 | n/a |
| CortexFM zero-shot | 5.04 M | −1.035 ± 0.234 | n/a | 0.131 |
| CortexFM frozen + Ridge linear probe | 5.04 M + ~3 K | −0.258 ± 0.327 | +0.125 | n/a |
| CortexFM + EMG-head FT, 200 steps (ZS) | 5.04 M (FT 37 K = 0.75 %) | −0.038 ± 0.063 | n/a | n/a |
| CortexFM frozen + per-session affine | 5.04 M + ~3 K/sess | +0.484 ± 0.102 | +0.529 | n/a |

NL = FALCON normalized latency (inference time / data duration).
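Weighting each muscle's R² by its target variance reduces to one minus the total squared error divided by the total variance, as in the sketch below. It should agree with scikit-learn's `r2_score(..., multioutput="variance_weighted")` whenever no channel is constant. Because the baseline is each channel's own mean, any prediction worse than that mean scores below zero. A constant-zero output on envelopes with a non-zero mean is one example.

```python
import numpy as np


def variance_weighted_r2(y_true, y_pred):
    """Variance-weighted R^2 over output channels.

    y_true, y_pred: (n_samples, n_channels). Weighting each channel's R^2 by its
    target variance is equivalent to 1 - total SSE / total SST.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    sse = ((y_true - y_pred) ** 2).sum(axis=0)               # per-channel squared error
    sst = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)  # per-channel total variance
    return float(1.0 - sse.sum() / sst.sum())
```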

Auxiliary co-bps (CortexFM only)

Mean 0.756 ± 0.128 bits/spike above the per-unit mean-rate baseline on the four held-in calibration files.
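The bits-per-spike figure compares two Poisson log-likelihoods: one under the predicted rates and one under each unit's mean rate. The difference is then divided by the spike count and converted to bits. The sketch below follows that definition as used in the Neural Latents Benchmark. It assumes rates are given as expected spikes per 20 ms bin. The log(k!) term is dropped because it cancels in the difference.

```python
import numpy as np


def poisson_log_likelihood(counts, rates):
    """Poisson log-likelihood summed over all bins, omitting log(k!) (it cancels)."""
    rates = np.clip(rates, 1e-9, None)  # guard against log(0)
    return float(np.sum(counts * np.log(rates) - rates))


def bits_per_spike(counts, pred_rates):
    """Bits per spike of predicted rates relative to a per-unit mean-rate baseline.

    counts, pred_rates: (n_bins, n_units); rates in expected spikes per bin.
    """
    counts = np.asarray(counts, dtype=np.float64)
    null_rates = np.broadcast_to(counts.mean(axis=0, keepdims=True), counts.shape)
    ll_model = poisson_log_likelihood(counts, np.asarray(pred_rates, dtype=np.float64))
    ll_null = poisson_log_likelihood(counts, null_rates)
    return (ll_model - ll_null) / (counts.sum() * np.log(2))  # nats -> bits, per spike
```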

Held-out OOD calibration sessions (DANDI 000941, days +6 to +30, 3 sessions)

Variance-weighted R², calibration ≈ 640 bins (≈ 12.8 s at 20 ms per bin):

| Session | CortexFM + affine | POYO-1 + affine | Δ (POYO-1 − CortexFM) |
|---|---|---|---|
| 20121004 | +0.4443 | −0.0209 | −0.4652 |
| 20121017 | +0.2730 | −0.2326 | −0.5056 |
| 20121024 | +0.4046 | +0.1824 | −0.2222 |
| Per-session mean ± std | +0.374 ± 0.073 | −0.024 ± 0.169 | −0.398 |
| Pooled R² | +0.387 | −0.008 | −0.395 |
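The summary rows of the held-out table can be recomputed from the three per-session values. The reported spreads match only with the population standard deviation (`ddof=0`). The sample estimate (`ddof=1`) would be about 22 % larger. The pooled row cannot be recovered this way, because it weights sessions by their own EMG variance.

```python
import numpy as np

# Per-session variance-weighted R^2 values from the held-out table
cortexfm = np.array([0.4443, 0.2730, 0.4046])
poyo1 = np.array([-0.0209, -0.2326, 0.1824])
delta = poyo1 - cortexfm  # same sign convention as the table: POYO-1 minus CortexFM

for name, r2 in [("CortexFM + affine", cortexfm), ("POYO-1 + affine", poyo1), ("delta", delta)]:
    # ddof=0 (population std) reproduces the reported spreads
    print(f"{name}: {r2.mean():+.3f} ± {r2.std(ddof=0):.3f}")
```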

The decisive separation between CortexFM and POYO-1 emerges on the OOD held-out sessions. On the held-in sessions, the pooled-R² gap is small: 0.529 vs. 0.498, a 0.031 margin in CortexFM's favor. On the held-out sessions, it widens to 0.395 (+0.387 vs. −0.008). Both backbones use the identical per-session output-space affine, so this gap reflects backbone representation quality rather than the adapter recipe.

Why zero-shot RΒ² is negative

Three factors documented in the thesis (Chapter 6):

  1. Objective mismatch: pretraining minimizes joint masked-recon + InfoNCE, whereas FALCON M1 measures EMG-only regression.
  2. Inference-time input shift: EMG is the prediction target at evaluation time, so the EMG tokenizer is fed zeros β€” out of pretraining distribution.
  3. Absence of per-session linear correction: standard FALCON pipelines fit a shallow regressor per session; CortexFM zero-shot does not.

Each remedy targets a different subset of these factors:

  • The Ridge linear probe addresses factor (3), lifting pooled R² into the positive regime (+0.125).
  • EMG-head fine-tuning addresses factors (1) and (2), raising per-session R² to −0.038.
  • The per-session affine addresses all three jointly (pooled R² = +0.529).
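The Ridge probe can be fitted in closed form on frozen features. The sketch below assumes the probe reads d = 192 backbone features per bin and predicts the 16 EMG channels. That gives 192 · 16 + 16 = 3,088 parameters, consistent with the ~3 K in the table. The regularization strength `alpha` is a placeholder, not the value used in the thesis. `fit_ridge` and `predict` are hypothetical helpers.

```python
import numpy as np


def fit_ridge(features, targets, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    features: (n_bins, d) frozen backbone features; targets: (n_bins, m) EMG envelopes.
    Returns weights (d, m) and bias (m,).
    """
    x_mean = features.mean(axis=0)
    y_mean = targets.mean(axis=0)
    xc = features - x_mean  # centering keeps the intercept out of the penalty
    yc = targets - y_mean
    d = features.shape[1]
    weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(d), xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias


def predict(features, weights, bias):
    return features @ weights + bias
```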


Limitations

  1. Single-subject pretraining. Pretraining is restricted to MonkeyL (DANDI 000941). Cross-subject transfer to MonkeyN, MC_Maze, or human cortex is not validated.
  2. n = 3 OOD sessions. The held-out evaluation uses three sessions; effect sizes are large but formal Holm-corrected statistical power is limited.
  3. Calibration dependence on OOD sessions. With fewer than ~400 calibration bins (< 8 s at 20 ms per bin) on a held-out session, OOD R² becomes unstable. Real-time deployment therefore requires a brief per-session calibration cycle.
  4. EMG-only readout. The model decodes 16-channel EMG envelopes, not kinematics directly. A downstream kinematic stage is needed for end-effector control.
  5. No clinical validation. The model is research-grade. It has not been evaluated for safety, robustness, or efficacy in any clinical BCI setting and must not be used as such.
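On the 20 ms grid, 400 bins are 8 s and the 640 bins used for the OOD table are 12.8 s. The hypothetical helper below encodes that calibration budget. It assumes the calibration window is taken from the start of each held-out session; the card does not say which bins are used.

```python
BIN_MS = 20          # FALCON bin size
MIN_CAL_BINS = 400   # below this (~8 s), OOD R^2 is reported as unstable


def bins_to_seconds(n_bins, bin_ms=BIN_MS):
    return n_bins * bin_ms / 1000.0


def split_calibration(n_session_bins, n_cal_bins=640, min_cal_bins=MIN_CAL_BINS):
    """Hypothetical split: first n_cal_bins for calibration, the rest for evaluation.

    Returns (calibration slice, evaluation slice) over the session's bin axis.
    """
    if n_cal_bins < min_cal_bins:
        raise ValueError(
            f"{n_cal_bins} calibration bins ({bins_to_seconds(n_cal_bins):.1f} s) "
            f"is below the ~{bins_to_seconds(min_cal_bins):.0f} s budget"
        )
    if n_cal_bins >= n_session_bins:
        raise ValueError("calibration window must leave bins for evaluation")
    return slice(0, n_cal_bins), slice(n_cal_bins, n_session_bins)
```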

How to use

import torch
from cortex_fm.training import CortexFMPretrainModule

# Load checkpoint
module = CortexFMPretrainModule.load_from_checkpoint(
    "epoch28-0.2599.ckpt",
    map_location="cuda",
    strict=True,
)
module.eval()

# Inference: spike counts -> EMG envelope
# spike_counts:    (B, T=64, N=64) int64 spike counts per 20 ms bin
# emg_placeholder: (B, T=64, M=16) float, all zeros at inference (EMG is the target)
B, T, N, M = 1, 64, 64, 16
spike_counts = torch.zeros(B, T, N, dtype=torch.long, device="cuda")
emg_placeholder = torch.zeros(B, T, M, device="cuda")

with torch.no_grad():
    out = module(spike_counts, emg_placeholder)

# Reshape with the actual batch size instead of a hard-coded 1
emg_pred = out["emg_pred"].reshape(B, T, M)[:, -1, :]  # (B, M) envelope at the last bin
log_rate = out["log_rate"]  # (B, T, N) Poisson log-rates

For FALCON M1 evaluation, see benchmark_wrapper/ and the CortexFMFalconDecoder reference implementation.


Citation

@mastersthesis{shin2026cortexfm,
  author       = {Shin, Jaeguk},
  title        = {{CortexFM}: A Lightweight Multimodal Foundation Model for Spike--EMG Decoding on Public Brain--Computer Interface Data},
  school       = {Dong-eui University},
  type         = {{M.S.} thesis},
  year         = {2026},
  month        = jun,
  address      = {Busan, Republic of Korea},
}

If you also use the FALCON benchmark, please cite Karpowicz et al. 2024.


Ethical considerations

CortexFM is a research artifact. The following points apply:

  • Animal data. Pretraining data come from a single non-human primate recorded under the original Rouse & Schieber 2018 protocols (DANDI 000941, CC-BY-4.0). No additional animal experiments were conducted for this release.
  • No human data. The released checkpoint has not been trained or evaluated on human neural recordings.
  • Dual-use awareness. Invasive BCI decoding can in principle inform assistive devices or surveillance / commercial neuro-monitoring systems. The author releases this checkpoint to support open scientific reproduction and lightweight benchmarking; downstream users are responsible for ensuring their applications respect informed consent, neural privacy, and applicable medical-device regulation.
  • No clinical claims. CortexFM has not been evaluated against clinical-grade BCIs and must not be deployed in patient-facing systems without full regulatory validation.

Model details

  • Developed by: Jaeguk Shin (신재국), Dong-eui University, Department of Artificial Intelligence; M.S. thesis (June 2026), advised by faculty of the Dong-eui University AI Department.
  • Model type: Multimodal Transformer foundation model (spike + EMG).
  • Language: N/A (the inputs are neural signals; the model card is bilingual EN/KO).
  • License: MIT (see LICENSE).
  • Finetuned from: Trained from scratch.
  • Related links: Thesis full text and reproducibility scripts to be released at the GitHub companion repository.